Boltzmann machines

I alluded previously that information geometry had many interesting applications, among them machine learning and computational neuroscience more generally. A classic example is the original paper by Amari, Kurata, and Nagaoka, Information Geometry of Boltzmann Machines [1]. This paper has a couple hundred citations, so the following is by no means a review of the current state of the art; however, it does provide a concrete setting in which to apply the theoretical framework introduced in my three-part information geometry sequence, and is as good a place to start as any.

Boltzmann machines belong to the class of so-called “energy-based” models of neural networks, for reasons which will be elucidated below, which makes them particularly intuitive from a physics perspective (see, e.g., the spin-glass analogy of neural networks). Additionally, at least in the absence of hidden layers, they have a number of nice mathematical properties that make them ideal toy models for the geometric approach. Let us briefly introduce these models, following the notation in [1]. A Boltzmann machine is network of ${n}$ stochastic neurons with symmetric connection weights ${w_{ij}=w_{ji}}$ between the ${i^\mathrm{th}}$ and ${j^\mathrm{th}}$ neurons (with ${w_{ii}=0}$). A neuron is a binary variable which has a value of either 1 (excited) or 0 (quiescent). As the network is evolved in time, the state of the ${i^\mathrm{th}}$ neuron is updated according to the potential

$\displaystyle u_i=\sum_{j=1}^nw_{ij}x_j-h_i=\sum_{j=0}^nw_{ij}x_j~, \ \ \ \ \ (1)$

where ${h_i}$ is the threshold for the neuron (i.e., a free parameter that determines that neuron’s sensitivity to firing), and in the second equality we have absorbed this by defining the zeroth neuron with ${x_0=1}$ and ${w_{i0}=-h_i}$. The state of the neuron is updated to 1 with probability ${f(u_i)}$, and 0 with probability ${1\!-\!f(u_i)}$, where

$\displaystyle f(u)=\frac{1}{1+e^{-u}}~. \ \ \ \ \ (2)$

We can now see the reason for the name as follows: the network’s state is described by the vector ${\mathbf{x}=(x_1,\ldots,x_n)}$, and hence the transition between states forms a Markov chain with ${2^n}$ sites. In particular, if there are no disconnected nodes, then it forms an ergodic Markov chain with a unique stationary distribution

$\displaystyle p(\mathbf{x})=Z^{-1}e^{-E(\mathbf{x})}~, \ \ \ \ \ (3)$

i.e., the probability of finding the network in a given state will asymptotically approach the Boltzmann distribution. In this expression, the normalization is given by the partition function for the canonical ensemble,

$\displaystyle Z=\sum_\mathbf{x} e^{-E(\mathbf{x})}~, \ \ \ \ \ (4)$

where we sum over all possible microstates. We then recognize ${E(\mathbf{x})}$ as the energy of the state ${\mathbf{x}}$, which is given by

$\displaystyle E(\mathbf{x})=-\frac{1}{2}\sum_{i,j}w_{ij}x_ix_j=-\sum_{i>j}w_{ij}x_ix_j~, \ \ \ \ \ (5)$

where the second equality follows from the symmetry of the connection matrix. In other words, (5) is the hamiltonian for a system of nearest-neighbor interactions with coupling ${w_{ij}}$.

The relationship between the Boltzmann distribution and the update rule (2) is most easily seen in reverse: consider two states which differ only by a single neuron. Their energy difference is ${\Delta E=E(x_i\!=\!1)-E(x_i\!=\!0)}$, where ${E(x_i\!=\!1)}$ denotes the state in which the ${i^\mathrm{th}}$ neuron is excited, etc. By (3), we have

\displaystyle \begin{aligned} \Delta E&=-\ln p(x_i\!=\!1)+\ln p(x_i\!=\!0) =-\ln p(x_i\!=\!1)+\ln [1-p(x_i\!=\!1)]\\ &=\ln\left[\frac{1}{p(x_i\!=\!1)}-1\right] \;\implies\; p(x_i\!=\!1)=\frac{1}{1+e^{\Delta E}}~, \end{aligned} \ \ \ \ \ (6)

where the ${\ln Z}$ terms have canceled, and we have used the fact that probabilities sum to 1. Now observe that since we only changed a single neuron, ${p(x_i\!=\!1)=f(u)}$, whereupon (2) mandates the identification ${u=\Delta E}$. Thus, thermodynamically, ${f(u)}$ is a thermal spectrum, and ${u_i}$ is the momentum mode associated with the excitation ${x_i=0\!\rightarrow\!1}$ (said differently, ${u_i}$ is the energy gap associated with the transition). Given the hamiltonian (5) — that is, the fact that the network is stochastic and fully connected — the form of the probability update rule (2) therefore encodes the fact that the system is in thermal equilibrium: the set of all possible states satisfies a Boltzmann distribution (with temperature ${T=1}$ in our units).

Now, (machine) “learning”, in this context, consists of adjusting the distribution ${p(\mathbf{x})}$ to match, as closely as possible, a given input or environmental distribution ${q(\mathbf{x})}$. This is achieved by modifying the connection weights ${w_{ij}}$ (which in our compact notation includes the threshold values ${h_i}$) according to a specified learning rule. Accordingly, we shall impose that the average adjustment to ${w_{ij}}$ is given by

$\displaystyle \Delta w_{ij}=\epsilon(q_{ij}-p_{ij})~, \ \ \ \ \ (7)$

where ${\epsilon\!>\!0}$ is a small constant, and the expectation values with respect to the distributions ${q(\mathbf{x})}$, ${p(\mathbf{x})}$ are

$\displaystyle q_{ij}=E_q[x_ix_j]=\sum_xq(\mathbf{x})x_ix_j~, \qquad\quad p_{ij}=E_p[x_ix_j]=\sum_xp(\mathbf{x})x_ix_j~, \ \ \ \ \ (8)$

cf. eq. (9) in part 1 of my information geometry sequence. Intuitively, the rule (7) simply says that we adjust the weights according to the difference between our reference and target distributions (more carefully: we adjust the connection strength between two neurons in proportion to the difference between the expectation values of both neurons being excited under said distributions). The reason for this particular rule — aside from its intuitive appeal — is that it realizes the (stochastic) gradient descent algorithm for optimizing the Kullback-Leibler divergence between ${q(\mathbf{x})}$ and ${p(\mathbf{x})}$,

$\displaystyle I(q:p)=\sum_\mathbf{x} q(\mathbf{x})\ln\frac{q(\mathbf{x})}{p(\mathbf{x})}~, \ \ \ \ \ (9)$

i.e., (7) may be written

$\displaystyle \Delta w_{ij}=-\epsilon\frac{\partial I(q:p)}{\partial w_{ij}}~, \ \ \ \ \ (10)$

where the derivative is with respect to the weights of the initial distribution ${p(\mathbf{x};\omega)}$. We’ll see this explicitly below, after introducing some facilitating machinery, whereupon the fact that this realizes gradient descent on the divergence (9) will also be seen to emerge naturally from the geometric framework; the original proof, as far as I’m aware, was given in [2].

Incidentally, another interesting feature of the learning rule (7) is that it’s (ultra)local. Thus, in contrast to methods like backpropagation which require global information about the network, it may represent a more biologically plausible model insofar as it relies only on local synaptic data (in addition to being computationally simpler).

In order to apply the methods of information geometry to the Boltzmann machines introduced above, let us first review the relevant aspects in this context. Let ${S=\{q(\mathbf{x})\}}$ be the set of all probability distributions over the state space ${X}$. The latter consists of ${2^n}$ states, but the constraint

$\displaystyle \sum_\mathbf{x} q(\mathbf{x})=1 \ \ \ \ \ (11)$

implies that we have only ${2^n\!-\!1}$ degrees of freedom; thus we may treat ${S}$ as a ${(2^n\!-\!1)}$-dimensional Euclidean manifold. (Since we’re working with finite-dimensional distributions, we mean that ${S}$ is a manifold in the sense that it is topologically locally equivalent to Euclidean space; additionally, the constraint ${0 implies that we exclude the boundary ${\partial S}$. So technically, ${S}$ is an open simplex with this dimensionality, but the distinction will not be relevant for our purposes). As will become apparent below, ${S}$ forms a dually flat space, which underlies many of the nice properties alluded above.

By a suitable choice of coordinates, we can represent ${S}$ as an exponential family,

$\displaystyle q(\mathbf{x};\theta)=\exp\!\left[\theta^IX_I-\psi(\theta)\right]~, \ \ \ \ \ (12)$

cf. eq. (26) in part 1. Here, ${\theta=(\theta^I)}$ is a ${(2^n\!-\!1)}$-dimensional vector of canonical parameters, ${\mathbf{X}=(X_I)}$ with ${X_I=x_{i_1}\ldots x_{i_m}}$, and the index ${I}$ runs over all combinations of subsets ${(i_1,\ldots,i_m)}$ for ${1\leq m\leq n}$. The condition ${\sum_\mathbf{x} q(\mathbf{x};\theta)=1}$ then identifies the potential ${\psi(\theta)}$ as the cumulant generating function,

$\displaystyle \psi(\theta)=\ln\sum_\mathbf{x}\exp\!\left[\theta^IX_I(\mathbf{x})\right]~. \ \ \ \ \ (13)$

Now recall from part 2 that the dual coordinate ${\eta}$ is defined via the expectation with respect to ${q(\mathbf{x};\theta)}$,

$\displaystyle \eta_I\equiv E_\theta[X_I]=\mathrm{prob}\left\{x_{i_1}=\ldots=x_{i_m}=1\,|\,I=(i_1,\ldots,i_m)\right\}~, \ \ \ \ \ (14)$

where the second equality follows from the fact that the only non-zero term in the sum is that for which all neurons are excited, hence we just have the probability ${q(X_I)}$. As explained in part 3, this identification is tantamount to the statement that the coordinates ${\theta^I}$ and ${\eta_I}$ are related through the dual potential ${\phi(\eta)}$, which is defined via the Legendre transform. In the present case, we have via (13),

$\displaystyle \phi(\eta)=\theta^I\partial_I\psi(\theta)-\psi(\theta) =\sum_\mathbf{x} q(\mathbf{x};\theta(\eta))\ln q(\mathbf{x};\theta(\eta))=-H(\mathbf{x})~, \ \ \ \ \ (15)$

where ${H(\mathbf{x})}$ is the entropy of state. Note that we can also express this as

$\displaystyle \psi(\theta)+\phi(\eta)-\theta^I\eta_I=0~, \ \ \ \ \ (16)$

which provides a useful relationship between the (dual) coordinates and potentials that will come in handy later.

To see that ${S}$ is dually flat, recall that the canonical and dual parameters, ${(\theta^I)}$ and ${(\eta_I)}$, respectively induce 1-affine and ${(-1)}$-affine coordinate systems, with respect to which ${S}$ is ${(\pm 1)}$-flat; cf. eq. (30) and (34) in part 1. Thus ${S}$ admits a unique Riemannian metric — the Fisher information metric — which can be expressed in the natural basis of coordinates ${(\theta^I)}$ as

$\displaystyle g_{IJ}(\theta)=\partial_I\cdot \partial_J~, \ \ \ \ \ (17)$

i.e., for any two vectors ${X,Y\in T_\theta S}$,

$\displaystyle g_{IJ}(\theta)=\langle X,Y\rangle_\theta=E_\theta[X\ell_\theta(\mathbf{x}) Y\ell_\theta(\mathbf{x})] \ \ \ \ \ (18)$

where ${\ell_\theta(\mathbf{x})=\ell(\mathbf{x};\theta)\equiv\ln q(\mathbf{x};\theta)}$; cf. eqs. (12) and (14) in part 1. Taking ${X}$ and ${Y}$ to be basis vectors, we recover the general definition

$\displaystyle g_{IJ}(\theta)=E_\theta[\partial_I\ell_\theta(\mathbf{x}) \partial_J\ell_\theta(\mathbf{x})]~. \ \ \ \ \ (19)$

This expression is general insofar as it holds for arbitrary coordinate systems (cf. the inner product on random variables). Since ${(\theta^I)}$ is ${e}$-flat however, it reduces to

$\displaystyle g_{IJ}=\partial_I\partial_J\psi(\theta)~, \ \ \ \ \ (20)$

which one can verify explicitly via (13) (it is convenient to start with the form eq. (11) in part 1). Thus, dual flatness has the advantage that the metric can be directly obtained from the potential ${\psi(\theta)}$. In accordance with our discussion in part 2, we also have the inverse metric

$\displaystyle g^{IJ}=\langle\partial^I,\partial^J\rangle=\partial^I\partial^J\phi(\eta)~,\qquad\quad \sum g^{IJ}g_{JK}=\delta^I_K~, \ \ \ \ \ (21)$

which implies that ${\partial_I=\frac{\partial}{\partial\theta^I}}$ and ${\partial^I=\frac{\partial}{\partial\eta_I}}$ are mutually dual basis vectors at every point in ${S}$, i.e., ${\partial_I\cdot\partial^J=\delta_I^J}$.

Of course, the manifold ${S}$ is generically curved with respect to the Riemannian metric ${g_{IJ}}$ above. But the special feature of dually flat spaces is that we can introduce dual ${(\pm1)}$-connections with respect to which ${S}$ appears flat. Recall that any dual structure ${(g,\nabla,\nabla^*)}$ is naturally induced from a divergence. Specifically, let us take the canonical divergence

$\displaystyle D(p||q)\equiv\psi(p)+\phi(q)-\theta^I(p)\eta_I(q)~, \ \ \ \ \ (22)$

where by the argument of the potential, we mean the associated coordinate; e.g., ${\psi(p)=\psi(\theta_p)}$, where ${\theta_p=(\theta^I_p)}$ is the canonical coordinate of ${p\in S}$. One can readily see that this induces the metric above, in the sense that

$\displaystyle D\left((\partial_I\partial_J)_p||q\right)\equiv \partial_I\partial_J D(p||q)=\partial_I\partial_J\psi(p)=g_{IJ}(p)~, \ \ \ \ \ (23)$

where the last equality follows by (20). Additionally, we saw in part 2 that the Kullback-Leibler divergence corresponds to the ${\alpha}$-divergence for ${\alpha=\pm1}$, and that the latter is in turn a special case of the more general ${f}$-divergence. But for the dual structure ${(g,\nabla^{(1)},\nabla^{(-1)})}$, this is none other than the canonical divergence! That is, given the potentials and dual coordinate systems ${\theta^I}$ and ${\eta_I}$ above, (22) leads to (9), i.e.,

$\displaystyle D(p||q)=I(q:p) \quad\quad\forall\,p,q\in S~, \ \ \ \ \ (24)$

(note the change in ordering).

We may now regard the space of Boltzmann machines ${B}$ as a submanifold of ${S}$. One can see this by taking the log of the distribution (3),

$\displaystyle \ln p(\mathbf{x};w)=\sum_{i>j}w_{ij}x_ix_j-\psi(w)~, \ \ \ \ \ (25)$

where the potential ${\psi(w)=\ln Z}$ accounts for the normalization; note that this is constant with respect to ${\mathbf{x}}$. In terms of the canonical coordinates above, this corresponds to ${\theta^i_1=w_{i0}}$, ${\theta^{ij}_2=w_{ij}}$, and ${\theta^{i_1\ldots i_m}_m=0}$ for all ${m>2}$. The last of these constitutes a system of ${e}$-linear constraints, which implies that ${B}$ is in particular an ${e}$-linear submanifold of ${S}$, and therefore inherits the dually linear properties of ${S}$ itself. Furthermore, since ${B}$ is an exponential family, we shall henceforth denote its canonical coordinates by ${w^{ij}}$ (with upper indices) for consistency with the conventions above. We then define the dual coordinates (cf. eq. (14))

$\displaystyle \eta_{ij}=E_w[x_ix_j]=p_{ij}=\mathrm{prob}\{x_i=x_j=1\}~, \qquad\quad i

where the penultimate equality follows from (8). Note that unlike the canonical coordinates above, the remaining parts ${\eta_{i_1,\ldots,i_m}^m}$ are generally nonlinear functions of ${\eta_{ij}}$ for points in ${B}$. (This implies that ${B}$ is not an ${m}$-linear submanifold of ${S}$, but since it’s ${e}$-linear, we can find some other coordinates to reveal the dual ${m}$-linear structure that ${B}$ must possess). The (invariant!) divergence between two Boltzmann machines ${w_1^{ij}}$ and ${w_2^{ij}}$ is then given by

$\displaystyle D(w_1,w_2)=\sum_\mathbf{x} p(\mathbf{x};w_1)\ln\frac{p(\mathbf{x};w_1)}{p(\mathbf{x};w_2)} =\phi(w_1)+\psi(w_2)-\sum_{i>j}w_2^{ij}{p_1}_{ij}~. \ \ \ \ \ (27)$

(Note that this differs from eq. (34) in [1]; I believe the authors have mistakenly swapped the distribution labels ${1\leftrightarrow 2}$. Explicitly, (15) allows us to identify the first term as

$\displaystyle \sum_\mathbf{x} p(\mathbf{x};w_1)\ln p(\mathbf{x};w_1)=\phi(w_1)~. \ \ \ \ \ (28)$

Meanwhile, the second term is easily found to be

$\displaystyle \sum_\mathbf{x} p(\mathbf{x};w_1)\ln p(\mathbf{x};w_2)=w_2^{ij}\sum_\mathbf{x} x_ix_j p(\mathbf{x};w_1)-\psi(w_2)=w_2^{ij}{p_1}_{ij}-\psi(w_2)~, \ \ \ \ \ (29)$

where we used the fact that ${p_{ij}=\eta_{ij}=E_w[x_ix_j]}$. Subtracting this from the entropy term above yields (27) as written). From (20) and (16), the induced the metric is simply given by

$\displaystyle g_{IJ}=g_{ij,kl}=\frac{\partial^2}{\partial w^{ij}\partial w^{kl}}\psi(w)=\frac{\partial p_{ij}}{\partial w^{kl}}~. \ \ \ \ \ (30)$

This, of course, has a statistical interpretation as the Fisher information matrix; it may be alternatively expressed as [1]

$\displaystyle g_{ij,kl}=p_{ij,kl}-p_{ij}p_{kl}~, \ \ \ \ \ (31)$

where the first term is the probability that all four neurons are excited,

$\displaystyle p_{ij,kl}=\mathrm{prob}\{x_i=x_j=x_k=x_l=1\}~. \ \ \ \ \ (32)$

As mentioned above, the learning task is for the Boltzmann machine to realize an environmental or target probability distribution ${q(\mathbf{x})}$ as closely as possible, according to the divergence (27). We now wish to connect this notion of minimum divergence to the intuitive differential geometric picture of ${B}$ as a submanifold of ${S}$, whereby one expects that the best approximation is given by the point ${p(\mathbf{x})\in B\subset S}$ that minimizes the geodesic distance between ${p(\mathbf{x})}$ and ${q(\mathbf{x})}$. To do so, consider three points, ${p,r\in B}$ and ${q\in S}$, where ${r}$ is the current state of the Boltzmann machine, and ${p}$ is the best possible approximation of the environmental distribution ${q}$ (which generically lies beyond ${B}$). Clearly, the point ${p}$ is given by

$\displaystyle p=\mathrm{min}_{p'\in B}D(q,p')~, \ \ \ \ \ (33)$

where ${D(q,p)}$ is the divergence between the distributions ${q}$ and ${p}$ defined above. The right-hand side of (33) describes the geodesic or ${m}$-projection of ${q}$ onto ${B}$, so named because the tangent vector of the geodesic is orthogonal to ${T_pB}$ (the tangent space of ${B}$ at the point ${p}$). Furthermore, since ${B}$ is ${e}$-flat, this projection is unique: there are no local minima of the divergence aside from this global minimum.

To elaborate, let

$\displaystyle \theta^I=(w^{ij},w_3^{ijk},\ldots,w_n^{i\ldots n})~, \qquad\quad \eta_I=(p_{ij},p^3_{ijk},\ldots,p^n_{i\ldots n})~, \ \ \ \ \ (34)$

respectively denote the canonical and dual coordinates of ${q(\mathbf{x})}$, where as a shorthand we include both ${w_1^{0i}}$ and ${w_2^{ij}}$ in ${w^{ij}}$, and both ${p_{0i}^1}$ and ${p_{ij}^2}$ in ${p_{ij}}$. Then recall that since ${p(\mathbf{x})\in B}$, we must have ${w_m^{i_1\ldots i_m}=0}$ ${\forall m>2}$. Thus both ${p}$ and ${q}$ lie in the submanifold

$\displaystyle M=\{\eta\,|\,\eta_{ij}=p_{ij}\}\subset S~, \ \ \ \ \ (35)$

that is, the first part of the ${\eta}$-coordinates are constrained to be ${p_{ij}}$, while the remainder are left free. This defines ${M}$ in terms of a set of linear constraints in the ${\eta}$-coordinates, and hence ${M}$ is an ${m}$-flat manifold (analogous to how ${B}$ is an ${e}$-flat manifold). The geodesic projection defined by (33) is therefore an ${m}$-geodesic in ${M}$ (hence the name).

The construction of ${M}$ also implies that the tangent space of ${M}$ at ${P}$, ${T_pM}$, is spanned by the basis vectors ${\{\partial^J\}}$, where ${J}$ runs over the second (unspecified) part of the ${\eta}$-coordinates. Similarly, ${T_pB}$ is spanned by ${\{\partial_I\}}$, where ${I}$ runs over the first (unspecified) part of the ${\theta}$-coordinates (since ${B}$ is defined by fixing the second part to zero). Since ${\partial_I\cdot\partial^J=0}$ for ${I\neq J}$, we conclude that ${M}$ and ${B}$ are orthogonal at ${p}$, and hence that the geodesic projection defined by (33) — that is, the ${m}$-geodesic connecting ${q}$ and ${p}$ — is likewise orthogonal to ${B}$. This is useful because it enables us to use the generalized Pythagorean theorem, in the form

$\displaystyle D(q,r)=D(q,p)+D(p,r)~. \ \ \ \ \ (36)$

Here, the left-hand side plays the role of the hypotenuse: it is the divergence between the distribution ${q(\mathbf{x})}$ and the current state of the Boltzmann machine ${r(\mathbf{x})}$. This is given, on the right-hand side, by the aforementioned ${m}$-geodesic representing the projection of ${q}$ “straight down” to ${p}$, ${D(q,p)}$, and an ${e}$-geodesic (because it lies entirely within ${B}$) representing the divergence between the current machine ${r(\mathbf{x})}$ and the optimal machine ${p(\mathbf{x})}$, ${D(p,r)}$. Since we can’t leave the submanifold ${B}$, the part ${D(q,p)}$ is an unavoidable error: we can’t shrink the divergence between the current state of the machine ${r}$ and the target state ${q}$ to zero; note that this is independent of the current state of the machine, and arises simply from the fact that ${q\notin B}$. We thus see that to accomplish our learning task, we need to minimize ${D(p,r)}$ by bringing the current machine closer to the optimum; i.e., the learning trajectory is the e-geodesic connecting the state of the current machine ${r(\mathbf{x})}$ to that of the optimal machine ${p(\mathbf{x})}$.

To move along this geodesic, we simply follow the steepest descent algorithm from the point ${r}$. Denoting the connection weights thereof by ${w^{ij}}$ (with apologies for reusing notation), the gradient is

$\displaystyle \frac{\partial D(p,r)}{\partial w^{ij}} =\frac{\partial D(q,r)}{\partial w^{ij}} =q_{ij}-r_{ij}~, \ \ \ \ \ (37)$

where, in the first equality, we have used the aforementioned fact that the unavoidable error ${D(q,p)}$ is independent of ${r}$ (and hence of ${w^{ij}}$); as for the second, we have via (27)

$\displaystyle \frac{\partial}{\partial w_1^{ij}}D(w_1,w_2) =\frac{\partial}{\partial w_1^{ij}}\left[\phi(w_1)+\psi(w_2)-\sum_{i>j}w_1^{ij}{p_2}_{ij}\right] ={p_1}_{ij}-{p_2}_{ij}~, \ \ \ \ \ (38)$

where we have used the fact that, from (16), ${\eta_I=\tfrac{\partial\phi}{\partial\theta^I}}$; we then identified ${q_{ij}}$ and ${r_{ij}}$ according to (8). We thus come full-circle to (7), i.e.,

$\displaystyle \Delta w^{ij}=\epsilon\left(r_{ij}-q_{ij}\right)~. \ \ \ \ \ (39)$

(Note the change of order). Of course, this expression is only valid in the dually flat coordinates above, and does not have an invariant meaning. But we can uplift the learning rule for steepest descent to a general Riemannian manifold simply by inserting the (inverse) metric:

$\displaystyle \Delta w^{ij}=\epsilon\sum g^{ij,kl}\left(r_{kl}-q_{kl}\right)~. \ \ \ \ \ (40)$

Since the inverse metric merely interchanges upper and lower indices, the generalized rule (40) is still an ${m}$-geodesic equation on ${B}$, and is therefore linear in the ${\eta}$-coordinates. As noted in [1], this may be practically relevant since, in addition to being more tractable, the convergence to the optimum is likely much faster than in more complicated (i.e., non-linear) networks.

The paper [1] goes on to generalize their results to Boltzmann machines with hidden units; while these are more useful in practice (particularly in the form of restricted Boltzmann machines), I will not discuss them here. Instead, let me close by tying back into the physics-lingo at the beginning, where I mentioned that Boltzmann machines belong to the class of energy-based models. In particular, note that the energy difference between the target ${q(\mathbf{x})}$ and optimum ${p(\mathbf{x})}$ is

$\displaystyle \Delta E(p,q)=-\ln p+\ln q=\ln\frac{q}{p} \quad\implies\quad I(q:p)=\sum_\mathbf{x} q\,\Delta E(p,q)~. \ \ \ \ \ (41)$

Therefore, since ${D(p||q)=I(q:p)}$, we may think of the optimum Boltzmann machine as that which minimizes the energy difference between these two states, cf. (33). Thus the learning problem is intuitively the mandate that we flow to the minimum energy state, where here the energy is defined relative to the target distribution ${q(\mathbf{x})}$. And the solution — that is, the learning trajectory — is the minimum-length geodesic in the space of Boltzmann machines ${B}$. Neat!

References

1. S.-i. Amari, K. Kurata, and H. Nagaoka, “Information Geometry of Boltzmann Machines,” IEEE Transactions on Neural Networks 3 no. 2, (1992).
2. D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,” Cognitive Science 9 (1985).
This entry was posted in Minds & Machines. Bookmark the permalink.