I alluded previously that information geometry had many interesting applications, among them machine learning and computational neuroscience more generally. A classic example is the original paper by Amari, Kurata, and Nagaoka, Information Geometry of Boltzmann Machines [1]. This paper has a couple hundred citations, so the following is by no means a review of the current state of the art; however, it does provide a concrete setting in which to apply the theoretical framework introduced in my three-part information geometry sequence, and is as good a place to start as any.
Boltzmann machines belong to the class of so-called “energy-based” models of neural networks, for reasons which will be elucidated below, which makes them particularly intuitive from a physics perspective (see, e.g., the spin-glass analogy of neural networks). Additionally, at least in the absence of hidden layers, they have a number of nice mathematical properties that make them ideal toy models for the geometric approach. Let us briefly introduce these models, following the notation in [1]. A Boltzmann machine is network of stochastic neurons with symmetric connection weights
between the
and
neurons (with
). A neuron is a binary variable which has a value of either 1 (excited) or 0 (quiescent). As the network is evolved in time, the state of the
neuron is updated according to the potential
where is the threshold for the neuron (i.e., a free parameter that determines that neuron’s sensitivity to firing), and in the second equality we have absorbed this by defining the zeroth neuron with
and
. The state of the neuron is updated to 1 with probability
, and 0 with probability
, where
We can now see the reason for the name as follows: the network’s state is described by the vector , and hence the transition between states forms a Markov chain with
sites. In particular, if there are no disconnected nodes, then it forms an ergodic Markov chain with a unique stationary distribution
i.e., the probability of finding the network in a given state will asymptotically approach the Boltzmann distribution. In this expression, the normalization is given by the partition function for the canonical ensemble,
where we sum over all possible microstates. We then recognize as the energy of the state
, which is given by
where the second equality follows from the symmetry of the connection matrix. In other words, (5) is the hamiltonian for a system of nearest-neighbor interactions with coupling .
The relationship between the Boltzmann distribution and the update rule (2) is most easily seen in reverse: consider two states which differ only by a single neuron. Their energy difference is , where
denotes the state in which the
neuron is excited, etc. By (3), we have
where the terms have canceled, and we have used the fact that probabilities sum to 1. Now observe that since we only changed a single neuron,
, whereupon (2) mandates the identification
. Thus, thermodynamically,
is a thermal spectrum, and
is the momentum mode associated with the excitation
(said differently,
is the energy gap associated with the transition). Given the hamiltonian (5) — that is, the fact that the network is stochastic and fully connected — the form of the probability update rule (2) therefore encodes the fact that the system is in thermal equilibrium: the set of all possible states satisfies a Boltzmann distribution (with temperature
in our units).
Now, (machine) “learning”, in this context, consists of adjusting the distribution to match, as closely as possible, a given input or environmental distribution
. This is achieved by modifying the connection weights
(which in our compact notation includes the threshold values
) according to a specified learning rule. Accordingly, we shall impose that the average adjustment to
is given by
where is a small constant, and the expectation values with respect to the distributions
,
are
cf. eq. (9) in part 1 of my information geometry sequence. Intuitively, the rule (7) simply says that we adjust the weights according to the difference between our reference and target distributions (more carefully: we adjust the connection strength between two neurons in proportion to the difference between the expectation values of both neurons being excited under said distributions). The reason for this particular rule — aside from its intuitive appeal — is that it realizes the (stochastic) gradient descent algorithm for optimizing the Kullback-Leibler divergence between and
,
i.e., (7) may be written
where the derivative is with respect to the weights of the initial distribution . We’ll see this explicitly below, after introducing some facilitating machinery, whereupon the fact that this realizes gradient descent on the divergence (9) will also be seen to emerge naturally from the geometric framework; the original proof, as far as I’m aware, was given in [2].
Incidentally, another interesting feature of the learning rule (7) is that it’s (ultra)local. Thus, in contrast to methods like backpropagation which require global information about the network, it may represent a more biologically plausible model insofar as it relies only on local synaptic data (in addition to being computationally simpler).
In order to apply the methods of information geometry to the Boltzmann machines introduced above, let us first review the relevant aspects in this context. Let be the set of all probability distributions over the state space
. The latter consists of
states, but the constraint
implies that we have only degrees of freedom; thus we may treat
as a
-dimensional Euclidean manifold. (Since we’re working with finite-dimensional distributions, we mean that
is a manifold in the sense that it is topologically locally equivalent to Euclidean space; additionally, the constraint
implies that we exclude the boundary
. So technically,
is an open simplex with this dimensionality, but the distinction will not be relevant for our purposes). As will become apparent below,
forms a dually flat space, which underlies many of the nice properties alluded above.
By a suitable choice of coordinates, we can represent as an exponential family,
cf. eq. (26) in part 1. Here, is a
-dimensional vector of canonical parameters,
with
, and the index
runs over all combinations of subsets
for
. The condition
then identifies the potential
as the cumulant generating function,
Now recall from part 2 that the dual coordinate is defined via the expectation with respect to
,
where the second equality follows from the fact that the only non-zero term in the sum is that for which all neurons are excited, hence we just have the probability . As explained in part 3, this identification is tantamount to the statement that the coordinates
and
are related through the dual potential
, which is defined via the Legendre transform. In the present case, we have via (13),
where is the entropy of state. Note that we can also express this as
which provides a useful relationship between the (dual) coordinates and potentials that will come in handy later.
To see that is dually flat, recall that the canonical and dual parameters,
and
, respectively induce 1-affine and
-affine coordinate systems, with respect to which
is
-flat; cf. eq. (30) and (34) in part 1. Thus
admits a unique Riemannian metric — the Fisher information metric — which can be expressed in the natural basis of coordinates
as
i.e., for any two vectors ,
where ; cf. eqs. (12) and (14) in part 1. Taking
and
to be basis vectors, we recover the general definition
This expression is general insofar as it holds for arbitrary coordinate systems (cf. the inner product on random variables). Since is
-flat however, it reduces to
which one can verify explicitly via (13) (it is convenient to start with the form eq. (11) in part 1). Thus, dual flatness has the advantage that the metric can be directly obtained from the potential . In accordance with our discussion in part 2, we also have the inverse metric
which implies that and
are mutually dual basis vectors at every point in
, i.e.,
.
Of course, the manifold is generically curved with respect to the Riemannian metric
above. But the special feature of dually flat spaces is that we can introduce dual
-connections with respect to which
appears flat. Recall that any dual structure
is naturally induced from a divergence. Specifically, let us take the canonical divergence
where by the argument of the potential, we mean the associated coordinate; e.g., , where
is the canonical coordinate of
. One can readily see that this induces the metric above, in the sense that
where the last equality follows by (20). Additionally, we saw in part 2 that the Kullback-Leibler divergence corresponds to the -divergence for
, and that the latter is in turn a special case of the more general
-divergence. But for the dual structure
, this is none other than the canonical divergence! That is, given the potentials and dual coordinate systems
and
above, (22) leads to (9), i.e.,
(note the change in ordering).
We may now regard the space of Boltzmann machines as a submanifold of
. One can see this by taking the log of the distribution (3),
where the potential accounts for the normalization; note that this is constant with respect to
. In terms of the canonical coordinates above, this corresponds to
,
, and
for all
. The last of these constitutes a system of
-linear constraints, which implies that
is in particular an
-linear submanifold of
, and therefore inherits the dually linear properties of
itself. Furthermore, since
is an exponential family, we shall henceforth denote its canonical coordinates by
(with upper indices) for consistency with the conventions above. We then define the dual coordinates (cf. eq. (14))
where the penultimate equality follows from (8). Note that unlike the canonical coordinates above, the remaining parts are generally nonlinear functions of
for points in
. (This implies that
is not an
-linear submanifold of
, but since it’s
-linear, we can find some other coordinates to reveal the dual
-linear structure that
must possess). The (invariant!) divergence between two Boltzmann machines
and
is then given by
(Note that this differs from eq. (34) in [1]; I believe the authors have mistakenly swapped the distribution labels . Explicitly, (15) allows us to identify the first term as
Meanwhile, the second term is easily found to be
where we used the fact that . Subtracting this from the entropy term above yields (27) as written). From (20) and (16), the induced the metric is simply given by
This, of course, has a statistical interpretation as the Fisher information matrix; it may be alternatively expressed as [1]
where the first term is the probability that all four neurons are excited,
As mentioned above, the learning task is for the Boltzmann machine to realize an environmental or target probability distribution as closely as possible, according to the divergence (27). We now wish to connect this notion of minimum divergence to the intuitive differential geometric picture of
as a submanifold of
, whereby one expects that the best approximation is given by the point
that minimizes the geodesic distance between
and
. To do so, consider three points,
and
, where
is the current state of the Boltzmann machine, and
is the best possible approximation of the environmental distribution
(which generically lies beyond
). Clearly, the point
is given by
where is the divergence between the distributions
and
defined above. The right-hand side of (33) describes the geodesic or
-projection of
onto
, so named because the tangent vector of the geodesic is orthogonal to
(the tangent space of
at the point
). Furthermore, since
is
-flat, this projection is unique: there are no local minima of the divergence aside from this global minimum.
To elaborate, let
respectively denote the canonical and dual coordinates of , where as a shorthand we include both
and
in
, and both
and
in
. Then recall that since
, we must have
. Thus both
and
lie in the submanifold
that is, the first part of the -coordinates are constrained to be
, while the remainder are left free. This defines
in terms of a set of linear constraints in the
-coordinates, and hence
is an
-flat manifold (analogous to how
is an
-flat manifold). The geodesic projection defined by (33) is therefore an
-geodesic in
(hence the name).
The construction of also implies that the tangent space of
at
,
, is spanned by the basis vectors
, where
runs over the second (unspecified) part of the
-coordinates. Similarly,
is spanned by
, where
runs over the first (unspecified) part of the
-coordinates (since
is defined by fixing the second part to zero). Since
for
, we conclude that
and
are orthogonal at
, and hence that the geodesic projection defined by (33) — that is, the
-geodesic connecting
and
— is likewise orthogonal to
. This is useful because it enables us to use the generalized Pythagorean theorem, in the form
Here, the left-hand side plays the role of the hypotenuse: it is the divergence between the distribution and the current state of the Boltzmann machine
. This is given, on the right-hand side, by the aforementioned
-geodesic representing the projection of
“straight down” to
,
, and an
-geodesic (because it lies entirely within
) representing the divergence between the current machine
and the optimal machine
,
. Since we can’t leave the submanifold
, the part
is an unavoidable error: we can’t shrink the divergence between the current state of the machine
and the target state
to zero; note that this is independent of the current state of the machine, and arises simply from the fact that
. We thus see that to accomplish our learning task, we need to minimize
by bringing the current machine closer to the optimum; i.e., the learning trajectory is the e-geodesic connecting the state of the current machine
to that of the optimal machine
.
To move along this geodesic, we simply follow the steepest descent algorithm from the point . Denoting the connection weights thereof by
(with apologies for reusing notation), the gradient is
where, in the first equality, we have used the aforementioned fact that the unavoidable error is independent of
(and hence of
); as for the second, we have via (27)
where we have used the fact that, from (16), ; we then identified
and
according to (8). We thus come full-circle to (7), i.e.,
(Note the change of order). Of course, this expression is only valid in the dually flat coordinates above, and does not have an invariant meaning. But we can uplift the learning rule for steepest descent to a general Riemannian manifold simply by inserting the (inverse) metric:
Since the inverse metric merely interchanges upper and lower indices, the generalized rule (40) is still an -geodesic equation on
, and is therefore linear in the
-coordinates. As noted in [1], this may be practically relevant since, in addition to being more tractable, the convergence to the optimum is likely much faster than in more complicated (i.e., non-linear) networks.
The paper [1] goes on to generalize their results to Boltzmann machines with hidden units; while these are more useful in practice (particularly in the form of restricted Boltzmann machines), I will not discuss them here. Instead, let me close by tying back into the physics-lingo at the beginning, where I mentioned that Boltzmann machines belong to the class of energy-based models. In particular, note that the energy difference between the target and optimum
is
Therefore, since , we may think of the optimum Boltzmann machine as that which minimizes the energy difference between these two states, cf. (33). Thus the learning problem is intuitively the mandate that we flow to the minimum energy state, where here the energy is defined relative to the target distribution
. And the solution — that is, the learning trajectory — is the minimum-length geodesic in the space of Boltzmann machines
. Neat!
References
- S.-i. Amari, K. Kurata, and H. Nagaoka, “Information Geometry of Boltzmann Machines,” IEEE Transactions on Neural Networks 3 no. 2, (1992).
- D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,” Cognitive Science 9 (1985).