Information geometry is a rather interesting fusion of statistics and differential geometry, in which a statistical model is endowed with the structure of a Riemannian manifold. Each point on the manifold corresponds to a probability distribution function, and the metric is governed by the underlying properties thereof. It may have many interesting applications to (quantum) information theory, complexity, machine learning, and theoretical neuroscience, among other fields. The canonical reference is Methods of Information Geometry by Shun-ichi Amari and Hiroshi Nagaoka, originally published in Japanese in 1993, and translated into English in 2000. The English version also contains several new topics/sections, and thus should probably be considered as a “second edition” (incidentally, any page numbers given below refer to this version). This is a beautiful book, which develops the concepts in a concise yet pedagogical manner. The posts in this three-part series are merely the notes I made while studying it, and aren’t intended to provide more than a brief summary of what I deemed the salient aspects.
Chapter 1 provides a self-contained introduction to differential geometry which — as the authors amusingly allude — is far more accessible (dare I say practical) than the typical mathematician’s treatise on the subject. Since this is familiar material, I’ll skip ahead to chapter 2, and simply recall necessary formulas/notation as needed.
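Chapter 2 introduces the basic geometric structure of statistical models. First, some notation: a probability distribution is a function $p:\mathcal{X}\to\mathbb{R}$ which satisfies
$$p(x)\geq0\;\;\forall x\in\mathcal{X}\quad\mathrm{and}\quad\int\!\mathrm{d}x\,p(x)=1~, \tag{1}$$
where the integral is taken over $\mathcal{X}$. If $\mathcal{X}$ is a discrete set (which may have either finite or countably infinite cardinality), then the integral is instead a sum. Each $p$ may be parametrized by $n$ real-valued variables $\xi=[\xi^i]=[\xi^1,\ldots,\xi^n]$, such that the family of probability distributions on $\mathcal{X}$ is
$$S=\{p_\xi=p(x;\xi)\,|\,\xi\in\Xi\}~, \tag{2}$$
where $\Xi\subset\mathbb{R}^n$ and $\xi\mapsto p_\xi$ is injective (so that the reverse mapping, from probability distributions $p$ to the coordinates $\xi$, is a function). Such an $S$ is referred to as an $n$-dimensional statistical model, or simply model on $\mathcal{X}$, sometimes abbreviated $S=\{p_\xi\}$. Note that I referred to $\xi$ as the coordinates above: we will assume that the map $\xi\mapsto p(x;\xi)$ is $C^\infty$ for each $x$, so that we may take derivatives with respect to the parameters, e.g., $\partial_ip(x;\xi)$, where $\partial_i:=\partial/\partial\xi^i$. Additionally, we assume that the order of differentiation and integration may be freely exchanged; an important consequence of this is that
$$\int\!\mathrm{d}x\,\partial_ip(x;\xi)=\partial_i\int\!\mathrm{d}x\,p(x;\xi)=\partial_i1=0~, \tag{3}$$
which we will use below. Finally, we assume that the support of the distribution, $\mathrm{supp}(p):=\{x\,|\,p(x)>0\}$, is independent of (that is, does not vary with) $\xi$, and hence may redefine $\mathcal{X}=\mathrm{supp}(p)$ for simplicity. Thus, the model $S$ is a subset of
$$\mathcal{P}(\mathcal{X}):=\left\{p:\mathcal{X}\to\mathbb{R}\;\Big|\;p(x)>0\;\;\forall x\in\mathcal{X},\;\int\!\mathrm{d}x\,p(x)=1\right\}~. \tag{4}$$
Considering parametrizations which are $C^\infty$ diffeomorphic to one another to be equivalent then elevates $S$ to a statistical manifold. (In this context, we shall sometimes conflate the distribution $p_\xi$ with the coordinate $\xi$, and speak of the “point $\xi$”, etc).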
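As a concrete example, consider the normal distribution:
$$p(x;\xi)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]~. \tag{5}$$
In this case,
$$\mathcal{X}=\mathbb{R}~,\quad n=2~,\quad\xi=[\mu,\sigma]~,\quad\Xi=\{[\mu,\sigma]\,|\,-\infty<\mu<\infty,\;0<\sigma<\infty\}~. \tag{6}$$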
Other examples are given on page 27.
Fisher information metric
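Now, given a model $S=\{p_\xi\}$, the Fisher information matrix of $S$ at a point $\xi$ is the $n\times n$ matrix $G(\xi)=[g_{ij}(\xi)]$, with elements
$$g_{ij}(\xi):=E_\xi\left[\partial_i\ell_\xi\,\partial_j\ell_\xi\right]~,\qquad\ell_\xi(x):=\ln p(x;\xi)~, \tag{7}$$
and the expectation $E_\xi$ with respect to the distribution $p_\xi$ is defined as
$$E_\xi[f]:=\int\!\mathrm{d}x\,f(x)\,p(x;\xi)~. \tag{8}$$
Though the integral above diverges for some models, we shall assume that $g_{ij}(\xi)$ is finite $\forall\,\xi,i,j$, and furthermore that $g_{ij}:\Xi\to\mathbb{R}$ is $C^\infty$. Note that $g_{ij}=g_{ji}$ (i.e., $G$ is symmetric). Additionally, while $G$ is in general positive semidefinite, we shall assume positive definiteness, which is equivalent to requiring that $\{\partial_1p_\xi,\ldots,\partial_np_\xi\}$ are linearly independent functions on $\mathcal{X}$. Lastly, observe that eq. (3), $\int\!\mathrm{d}x\,\partial_ip(x;\xi)=0$, may be written
$$E_\xi\left[\partial_i\ell_\xi\right]=0~. \tag{9}$$
Differentiating this once more with respect to $\xi^j$, and using $\partial_jp=p\,\partial_j\ell_\xi$, allows us to express the matrix elements in a useful alternative form:
$$g_{ij}(\xi)=-E_\xi\left[\partial_i\partial_j\ell_\xi\right]~. \tag{10}$$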
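As a sanity check on the two equivalent forms of the Fisher matrix, $E_\xi[\partial_i\ell_\xi\,\partial_j\ell_\xi]$ and $-E_\xi[\partial_i\partial_j\ell_\xi]$, here is a minimal numerical sketch for the normal distribution in the coordinates $[\mu,\sigma]$. The function name `fisher_normal`, the grid width, and the use of a plain Riemann sum are my own choices for illustration and are not taken from the book; both forms should come out close to $\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$.

```python
import numpy as np

def fisher_normal(mu, sigma, half_width=12.0, num=20001):
    """Fisher matrix of N(mu, sigma^2) in coordinates [mu, sigma], by quadrature.

    Returns (G_score, G_hess): the forms E[d_i l d_j l] and -E[d_i d_j l].
    """
    # Uniform grid covering mu +/- half_width * sigma (an illustrative choice).
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, num)
    dx = x[1] - x[0]
    z = (x - mu) / sigma
    p = np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * sigma)

    # First derivatives of l = ln p with respect to mu and sigma.
    score = np.stack([z / sigma, (z**2 - 1.0) / sigma])

    # Second derivatives of l, arranged as a (2, 2, num) array.
    hess = np.array([
        [-np.ones_like(x) / sigma**2, -2.0 * z / sigma**2],
        [-2.0 * z / sigma**2, (1.0 - 3.0 * z**2) / sigma**2],
    ])

    # Expectations approximated by a Riemann sum over the grid.
    G_score = np.einsum("ix,jx,x->ij", score, score, p) * dx
    G_hess = -np.einsum("ijx,x->ij", hess, p) * dx
    return G_score, G_hess
```

Calling `fisher_normal(0.5, 2.0)` should return two nearly identical matrices, illustrating that the first-derivative and second-derivative forms agree for this model.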
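The reason for the particular definition of the Fisher matrix above is that it provides a Riemannian metric on $S$. Recall that a Riemannian metric $g$ is a $(0,2)$-tensor (that is, it assigns to each point $\xi\in S$ an inner product $\langle\cdot,\cdot\rangle_\xi$ on the tangent space $T_\xi S$) which is linear, symmetric, and positive definite (meaning $\langle X,X\rangle\geq0$, with equality iff $X=0$). In fact, in the natural basis of local coordinates $[\xi^i]$, the Riemannian metric is uniquely determined by
$$g_{ij}=\langle\partial_i,\partial_j\rangle~, \tag{11}$$
whence $g$ is called the Fisher metric. Since it’s positive definite, we may define the inverse metric $g^{ij}$ corresponding to $G^{-1}(\xi)$ such that $g^{ij}g_{jk}=\delta^i_{~k}$. Note that $g$ is invariant under coordinate transformations $\xi\to\tilde\xi$, since
$$\tilde g_{kl}=\frac{\partial\xi^i}{\partial\tilde\xi^k}\frac{\partial\xi^j}{\partial\tilde\xi^l}\,g_{ij}~, \tag{12}$$
and hence we may write
$$\langle X,Y\rangle_\xi=E_\xi\left[(X\ell_\xi)(Y\ell_\xi)\right]\quad\forall X,Y\in T_\xi S~. \tag{13}$$
Recalling that $\|X\|^2=\langle X,X\rangle$, the length of a curve $\gamma:[a,b]\to S$ with respect to $g$ is then
$$\|\gamma\|:=\int_a^b\!\mathrm{d}t\,\left\|\frac{\mathrm{d}\gamma}{\mathrm{d}t}\right\|=\int_a^b\!\mathrm{d}t\,\sqrt{g_{ij}\,\dot\gamma^i\dot\gamma^j}~. \tag{14}$$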
But before we can properly speak of curves and distances, we must define a connection, which provides a means of parallel transporting vectors along the curve.
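In particular, we will be concerned with comparing elements of the tangent bundle, and hence we require a relation between the tangent spaces at different points in $S$, i.e., an affine connection. To that end, let $S=\{p_\xi\}$ be an $n$-dimensional model, and define the $n^3$ functions $\Gamma^{(\alpha)}_{ij,k}$ which map each point $\xi$ to
$$\Gamma^{(\alpha)}_{ij,k}(\xi):=E_\xi\left[\left(\partial_i\partial_j\ell_\xi+\frac{1-\alpha}{2}\,\partial_i\ell_\xi\,\partial_j\ell_\xi\right)\partial_k\ell_\xi\right]~, \tag{15}$$
where $\alpha\in\mathbb{R}$. This defines an affine connection $\nabla^{(\alpha)}$ on $S$ via
$$\left\langle\nabla^{(\alpha)}_{\partial_i}\partial_j,\,\partial_k\right\rangle=\Gamma^{(\alpha)}_{ij,k}~, \tag{16}$$
where $g=\langle\cdot,\cdot\rangle$ is the Fisher metric introduced above. $\nabla^{(\alpha)}$ is called the $\alpha$-connection, and accordingly terms like $\alpha$-flat, $\alpha$-affine, $\alpha$-parallel, etc. denote the corresponding notions with respect to this connection.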
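Let us pause to review some of the terminology from differential geometry above. Recall that the covariant derivative $\nabla$ may be expressed in local coordinates as
$$\nabla_XY=X^i\left(\partial_iY^k+Y^j\,\Gamma^k_{ij}\right)\partial_k~, \tag{17}$$
where $X=X^i\partial_i$, $Y=Y^i\partial_i$ are vectors in the tangent space. If these are basis vectors such that $X=\partial_i$, $Y=\partial_j$, then $\nabla_{\partial_i}\partial_j=\Gamma^k_{ij}\,\partial_k$. The vector $Y$ is said to be parallel with respect to the connection $\nabla$ if $\nabla Y=0$, i.e., $\nabla_XY=0\;\;\forall X$; equivalently, in local coordinates,
$$\partial_iY^k+Y^j\,\Gamma^k_{ij}=0~. \tag{18}$$
If all basis vectors $\partial_i$ are parallel with respect to a coordinate system $[\xi^i]$, then the latter is an affine coordinate system for $\nabla$. Connections which admit such an affine parametrization are called flat (equivalently, one says that $S$ is flat with respect to $\nabla$).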
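Now, with respect to a Riemannian metric $g=\langle\cdot,\cdot\rangle$, one defines $\Gamma_{ij,k}$ as above, namely $\Gamma_{ij,k}:=\langle\nabla_{\partial_i}\partial_j,\partial_k\rangle=\Gamma^l_{ij}\,g_{lk}$. A connection is symmetric if $\Gamma_{ij,k}=\Gamma_{ji,k}$, which is the case for all connections considered here. If, in addition, $\nabla$ satisfies
$$Z\langle X,Y\rangle=\langle\nabla_ZX,Y\rangle+\langle X,\nabla_ZY\rangle\quad\forall X,Y,Z\in T(S)~, \tag{19}$$
then $\nabla$ is a metric connection with respect to $g$. (This is basically a Leibniz rule for the inner product.) In local coordinates, this implies that
$$\partial_kg_{ij}=\Gamma_{ki,j}+\Gamma_{kj,i}~. \tag{20}$$
In other words, under a metric connection, parallel transport of two vectors preserves the inner product, hence their significance in Riemannian geometry. A connection which is both metric and symmetric is called Riemannian; for a given metric it is unique (the Levi-Civita connection), though a manifold admits infinitely many metrics, and hence infinitely many Riemannian connections. However, the natural connections on statistical manifolds are generically non-metric! Indeed, since
$$\partial_kg_{ij}=\Gamma^{(0)}_{ki,j}+\Gamma^{(0)}_{kj,i}~, \tag{21}$$
only the special case $\alpha=0$ defines a Riemannian connection with respect to the Fisher metric (though observe that $\nabla^{(\alpha)}$ is symmetric for any value of $\alpha$). While this may seem strange from a physics perspective, where preserving the inner product is of prime importance, there’s nothing mathematically pathological about it. Indeed, the more relevant condition, which we’ll see below, is that every point on the manifold have an interpretation as a probability distribution.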
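For completeness, here is a short sketch of why only $\alpha=0$ is metric. It uses the definition $\Gamma^{(\alpha)}_{ij,k}=E_\xi\big[\big(\partial_i\partial_j\ell_\xi+\tfrac{1-\alpha}{2}\partial_i\ell_\xi\partial_j\ell_\xi\big)\partial_k\ell_\xi\big]$ together with $\partial_kp=p\,\partial_k\ell_\xi$; the shorthand $T_{ijk}:=E_\xi[\partial_i\ell_\xi\,\partial_j\ell_\xi\,\partial_k\ell_\xi]$ is introduced here purely for convenience. This is my own filling-in of the intermediate steps, so treat it as a check rather than the book's derivation:

$$\begin{aligned}
\partial_kg_{ij}&=\partial_k\int\!\mathrm{d}x\,p\,\partial_i\ell_\xi\,\partial_j\ell_\xi
=E_\xi\left[\partial_k\partial_i\ell_\xi\,\partial_j\ell_\xi\right]+E_\xi\left[\partial_i\ell_\xi\,\partial_k\partial_j\ell_\xi\right]+T_{ijk}~,\\
\Gamma^{(\alpha)}_{ki,j}+\Gamma^{(\alpha)}_{kj,i}&=E_\xi\left[\partial_k\partial_i\ell_\xi\,\partial_j\ell_\xi\right]+E_\xi\left[\partial_k\partial_j\ell_\xi\,\partial_i\ell_\xi\right]+(1-\alpha)\,T_{ijk}~,\\
\Longrightarrow\quad\partial_kg_{ij}&=\Gamma^{(\alpha)}_{ki,j}+\Gamma^{(\alpha)}_{kj,i}+\alpha\,T_{ijk}~.
\end{aligned}$$

The metric condition $\partial_kg_{ij}=\Gamma_{ki,j}+\Gamma_{kj,i}$ therefore holds for generic models (those with $T_{ijk}\neq0$) only when $\alpha=0$.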
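Two neat relationships between different $\alpha$-connections are worth noting. First, for any $\alpha,\beta\in\mathbb{R}$, we have
$$\Gamma^{(\beta)}_{ij,k}=\Gamma^{(\alpha)}_{ij,k}+\frac{\alpha-\beta}{2}\,T_{ijk}~, \tag{22}$$
where $T$ (not to be confused with the unrelated torsion tensor) is a covariant symmetric tensor which maps a point $\xi$ to
$$T_{ijk}(\xi):=E_\xi\left[\partial_i\ell_\xi\,\partial_j\ell_\xi\,\partial_k\ell_\xi\right]~. \tag{23}$$
Second, the $\alpha$-connection may be decomposed as
$$\nabla^{(\alpha)}=(1-\alpha)\,\nabla^{(0)}+\alpha\,\nabla^{(1)}=\frac{1+\alpha}{2}\,\nabla^{(1)}+\frac{1-\alpha}{2}\,\nabla^{(-1)}~. \tag{24}$$
Within this infinite class of connections, $\nabla^{(\pm1)}$ play a central role in information geometry, and are closely related to an interesting duality structure on the geometry of $\mathcal{P}(\mathcal{X})$. We shall give a low-level introduction to the relevant representations of $S$ here, and postpone a more elegant derivation, based on different embeddings of $\mathcal{P}(\mathcal{X})$ in $\mathbb{R}^{\mathcal{X}}$, to the next post. In particular, we’ll define the so-called exponential and mixture families, which are intimately related to the $(+1)$- and $(-1)$-connections, respectively.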
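Suppose that an $n$-dimensional model $S=\{p_\theta\,|\,\theta\in\Theta\}$ can be expressed in terms of functions $\{C,F_1,\ldots,F_n\}$ on $\mathcal{X}$ and a function $\psi$ on $\Theta$ as
$$p(x;\theta)=\exp\left[C(x)+\theta^iF_i(x)-\psi(\theta)\right]~, \tag{25}$$
where we’ve employed Einstein’s summation notation for the sum over $i$ from $1$ to $n$. Then $S$ is an exponential family, and $\theta^i$ are its natural or canonical parameters. The normalization condition $\int\!\mathrm{d}x\,p(x;\theta)=1$ implies that
$$\psi(\theta)=\ln\int\!\mathrm{d}x\,\exp\left[C(x)+\theta^iF_i(x)\right]~. \tag{26}$$
This provides a parametrization $\theta\mapsto p_\theta$, which is one-to-one if and only if the functions $\{F_1,\ldots,F_n,1\}$ are linearly independent (which we shall assume henceforth). Many important probabilistic models fall into this class, including all those referenced on page 27 above. The normal distribution (5), for instance, yields
$$C(x)=0~,\;\;F_1(x)=x~,\;\;F_2(x)=x^2~,\;\;\theta^1=\frac{\mu}{\sigma^2}~,\;\;\theta^2=-\frac{1}{2\sigma^2}~,\;\;\psi(\theta)=\frac{\mu^2}{2\sigma^2}+\ln\left(\sqrt{2\pi}\,\sigma\right)=-\frac{(\theta^1)^2}{4\theta^2}+\frac{1}{2}\ln\left(-\frac{\pi}{\theta^2}\right)~. \tag{27}$$
The canonical coordinates $[\theta^i]$ are natural insofar as they provide a $1$-affine coordinate system, with respect to which $S$ is $1$-flat. To see this, observe that
$$\partial_i\ell(x;\theta)=F_i(x)-\partial_i\psi(\theta)\quad\implies\quad\partial_i\partial_j\ell(x;\theta)=-\partial_i\partial_j\psi(\theta)~, \tag{28}$$
where keep in mind that $\partial_i$ denotes the derivative with respect to $\theta^i$, not $x$! Since $\partial_i\partial_j\psi$ does not depend on $x$, this implies that
$$\Gamma^{(1)}_{ij,k}=E_\theta\left[\partial_i\partial_j\ell_\theta\,\partial_k\ell_\theta\right]=-\partial_i\partial_j\psi(\theta)\,E_\theta\left[\partial_k\ell_\theta\right]=0~. \tag{29}$$
Therefore, exponential families admit a canonical parametrization in terms of a $1$-affine coordinate system $[\theta^i]$, with respect to which $S$ is $1$-flat. The associated affine connection is called the exponential connection, and is denoted $\nabla^{(e)}:=\nabla^{(1)}$.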
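To make the flatness statement concrete, here is a small numerical sketch for the normal distribution written as an exponential family, with $F_1(x)=x$, $F_2(x)=x^2$, $C=0$, and $\psi(\theta)=-(\theta^1)^2/(4\theta^2)+\tfrac12\ln(-\pi/\theta^2)$. It checks that the density is normalized, that the Fisher matrix in natural coordinates equals the Hessian of $\psi$, and that $\Gamma^{(1)}_{ij,k}=-\partial_i\partial_j\psi\,E_\theta[\partial_k\ell_\theta]$ vanishes. All function names and the quadrature grid are my own illustrative choices.

```python
import numpy as np

def psi(theta):
    """Log-partition function of the normal family in natural coordinates."""
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def grad_psi(theta):
    """Analytic gradient of psi: equals (E[x], E[x^2])."""
    t1, t2 = theta
    return np.array([-t1 / (2.0 * t2), t1**2 / (4.0 * t2**2) - 1.0 / (2.0 * t2)])

def hess_psi(theta):
    """Analytic Hessian of psi: equals the covariance matrix of (x, x^2)."""
    t1, t2 = theta
    off = t1 / (2.0 * t2**2)
    return np.array([
        [-1.0 / (2.0 * t2), off],
        [off, -t1**2 / (2.0 * t2**3) + 1.0 / (2.0 * t2**2)],
    ])

def check_exponential_family(theta, half_width=12.0, num=20001):
    """Return (normalization, Fisher matrix, Gamma^(1)_{ij,k}) by quadrature."""
    t1, t2 = theta
    mu, sigma = -t1 / (2.0 * t2), np.sqrt(-1.0 / (2.0 * t2))
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, num)
    dx = x[1] - x[0]

    F = np.stack([x, x**2])                 # sufficient statistics F_i(x)
    p = np.exp(theta @ F - psi(theta))      # p(x; theta) with C(x) = 0
    score = F - grad_psi(theta)[:, None]    # d_i l = F_i - d_i psi

    norm = p.sum() * dx
    g = np.einsum("ix,jx,x->ij", score, score, p) * dx
    mean_score = (score * p).sum(axis=1) * dx
    # d_i d_j l = -d_i d_j psi is constant in x, so it factors out of E[.].
    gamma1 = -hess_psi(theta)[:, :, None] * mean_score[None, None, :]
    return norm, g, gamma1
```

For example, `check_exponential_family(np.array([4.0, -2.0]))` corresponds to $\mu=1$, $\sigma=1/2$; the returned `gamma1` should vanish up to quadrature error.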
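Now consider the case in which $S$ can be expressed as
$$p(x;\theta)=C(x)+\theta^iF_i(x)~, \tag{30}$$
i.e., $S$ is an affine subspace of $\mathcal{P}(\mathcal{X})$. In this case $S$ is called a mixture family, with mixture parameters $\theta^i$. Note that $\mathcal{P}(\mathcal{X})$ itself is a mixture family if $\mathcal{X}$ is finite. The name arises from the fact that elements in this family admit a representative form as a mixture of $n+1$ probability distributions $\{p_0,p_1,\ldots,p_n\}$,
$$p(x;\theta)=\theta^ip_i(x)+\left(1-\sum_{i=1}^n\theta^i\right)p_0(x) \tag{31}$$
(i.e., $C=p_0$ and $F_i=p_i-p_0$), where $\theta^i>0$ and $\sum_i\theta^i<1$. For a mixture family, we have
$$\partial_i\ell(x;\theta)=\frac{F_i(x)}{p(x;\theta)}~,\qquad\partial_i\partial_j\ell(x;\theta)=-\frac{F_i(x)F_j(x)}{p(x;\theta)^2}~, \tag{32}$$
so that $\partial_i\partial_j\ell_\theta+\partial_i\ell_\theta\,\partial_j\ell_\theta=0$, which implies that
$$\Gamma^{(-1)}_{ij,k}=E_\theta\left[\left(\partial_i\partial_j\ell_\theta+\partial_i\ell_\theta\,\partial_j\ell_\theta\right)\partial_k\ell_\theta\right]=0~. \tag{33}$$
Therefore, a mixture family admits a parametrization in terms of a $(-1)$-affine coordinate system $[\theta^i]$, with respect to which $S$ is $(-1)$-flat. The associated affine connection is called the mixture connection, denoted $\nabla^{(m)}:=\nabla^{(-1)}$.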
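The same statement can be checked directly on a finite sample space. The sketch below builds a mixture $p(\theta)=p_0+\theta^i(p_i-p_0)$ of a few fixed distributions and evaluates $\Gamma^{(\alpha)}_{ij,k}=E_\theta\big[\big(\partial_i\partial_j\ell_\theta+\tfrac{1-\alpha}{2}\partial_i\ell_\theta\partial_j\ell_\theta\big)\partial_k\ell_\theta\big]$ for arbitrary $\alpha$. The function name `alpha_connection_mixture` and the example distributions are hypothetical, chosen only for illustration.

```python
import numpy as np

def alpha_connection_mixture(theta, components, alpha):
    """Gamma^(alpha)_{ij,k} for a mixture family on a finite set.

    components: array of shape (n + 1, m); row 0 is p_0, rows 1..n are p_i.
    theta:      array of shape (n,), the mixture parameters.
    """
    p0 = components[0]
    F = components[1:] - p0          # F_i = p_i - p_0, shape (n, m)
    p = p0 + theta @ F               # p(x; theta), shape (m,)

    d1 = F / p                                   # d_i l = F_i / p
    outer = d1[:, None, :] * d1[None, :, :]      # d_i l * d_j l
    d2 = -outer                                  # d_i d_j l = -F_i F_j / p^2

    bracket = d2 + 0.5 * (1.0 - alpha) * outer
    # Expectation over x of bracket_{ij} * d_k l.
    return np.einsum("x,ijx,kx->ijk", p, bracket, d1)

# Three illustrative distributions on a four-point set.
components = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
theta = np.array([0.2, 0.3])
gamma_m = alpha_connection_mixture(theta, components, alpha=-1.0)
gamma_e = alpha_connection_mixture(theta, components, alpha=+1.0)
```

Here `gamma_m` vanishes identically, while `gamma_e` generally does not: the mixture parameters are affine coordinates for the $(-1)$-connection but not for the $(+1)$-connection.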
In the next post, when we discuss the geometrical structure in more detail, we shall see that $\nabla^{(\pm1)}$ are dual connections, which has many interesting consequences.
Why is Fisher special?
As noted above, a given manifold admits infinitely many distinct Riemannian metrics and affine connections. However, a statistical manifold has the property that every point is a probability distribution, which singles out the Fisher metric and $\alpha$-connection as unique. To formalize this notion, we must first introduce the concept of a sufficient statistic.
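Let $F:\mathcal{X}\to\mathcal{Y}$ be a map which takes random variables $X$ to $Y=F(X)$. Given the distribution $p(x;\xi)$ of $X$, this results in the distribution $q(y;\xi)$ on $Y$. We then define
$$r(x;\xi):=\frac{p(x;\xi)}{q(F(x);\xi)}~,\qquad p(x|y;\xi):=r(x;\xi)\,\delta_{F(x)}(y)~,\qquad\mathrm{Pr}(A|y;\xi):=\int_A\!\mathrm{d}x\,p(x|y;\xi)~, \tag{34}$$
where $A\subset\mathcal{X}$, and $\delta_{F(x)}$ is the delta function at the point $F(x)$, such that $\forall B\subset\mathcal{Y}$,
$$\int_{A\times B}\!\mathrm{d}x\,\mathrm{d}y\;p(x|y;\xi)\,q(y;\xi)=\int_{A\cap F^{-1}(B)}\!\mathrm{d}x\;p(x;\xi)~. \tag{35}$$
In other words, the delta function picks out the values of $x$ such that $F(x)=y$. The above implies that $\mathrm{Pr}(A|y;\xi)$ is the conditional probability of the event $\{X\in A\}$, given $Y=y$ (cf. the familiar definition $P(A|B)=P(A\cap B)/P(B)$). If $r(x;\xi)$ is independent of $\xi$, then $F$ is called a sufficient statistic for $S$. In this case, we may write
$$p(x;\xi)=q(F(x);\xi)\,r(x)~, \tag{36}$$
i.e., the dependence of $p$ on $\xi$ is entirely encoded in the distribution $q$. Therefore, treating $p$ as the unknown distribution, whose parameter $\xi$ one wishes to estimate, it suffices to know only the value $Y=F(x)$, hence the name. Formally, one says that $F$ is a sufficient statistic if and only if there exist functions $s:\mathcal{Y}\times\Xi\to\mathbb{R}$ and $t:\mathcal{X}\to\mathbb{R}$ such that
$$p(x;\xi)=s(F(x);\xi)\,t(x)\quad\forall x,\xi~. \tag{37}$$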
The significance of this lies in the fact that the Fisher information metric satisfies a monotonicity relation under a generic map $F$. This is detailed in Theorem 2.1 of Amari & Nagaoka, which states that given $S=\{p(x;\xi)\}$ with Fisher metric $G(\xi)$, and induced model $S_F:=\{q(y;\xi)\}$ with matrix $G_F(\xi)$, the difference $\Delta G(\xi):=G(\xi)-G_F(\xi)$ is positive semidefinite, i.e., $G_F(\xi)\leq G(\xi)$, with equality if and only if $F$ is a sufficient statistic. Otherwise, for generic maps, the “information loss” that results from summarizing the data $x$ in $y=F(x)$ is given by
$$\Delta g_{ij}(\xi)=E_\xi\left[\partial_i\ln r(X;\xi)\,\partial_j\ln r(X;\xi)\right]~, \tag{38}$$
which can be expressed in terms of the covariance with respect to the conditional distribution $p(x|y;\xi)$. This theorem will be important later, when we discuss relative entropy.
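The monotonicity statement is easy to illustrate on a finite sample space. The sketch below uses a toy model of my own choosing (not from the book): $p(x;\xi)\propto\exp[\xi^ia_i(x)]$ on six points, with $F$ merging points pairwise into three bins. It computes the Fisher matrices $G$ and $G_F$ from $g_{ij}=\sum_x\partial_ip\,\partial_jp/p$ and returns $\Delta G=G-G_F$. When the features $a_i$ depend on $x$ only through $F(x)$, $F$ is sufficient and the loss vanishes. All names (`exp_model`, `coarse_grain`, `information_loss`) are hypothetical.

```python
import numpy as np

def exp_model(xi, features):
    """p(x; xi) proportional to exp(xi . a(x)) on a finite set; returns (p, d_i p)."""
    logits = xi @ features
    w = np.exp(logits - logits.max())
    p = w / w.sum()
    score = features - (features @ p)[:, None]   # d_i ln p(x; xi)
    return p, p * score

def fisher(p, dp):
    """Fisher matrix g_ij = sum_x d_i p d_j p / p."""
    return np.einsum("ix,jx->ij", dp / p, dp)

def coarse_grain(p, dp, labels):
    """Induced distribution q(y) = sum over x with F(x) = y, and its derivatives."""
    M = np.zeros((labels.max() + 1, p.size))
    M[labels, np.arange(p.size)] = 1.0           # indicator of F(x) = y
    return M @ p, dp @ M.T

def information_loss(xi, features, labels):
    """Delta G = G - G_F for the statistic F encoded by `labels`."""
    p, dp = exp_model(xi, features)
    q, dq = coarse_grain(p, dp, labels)
    return fisher(p, dp) - fisher(q, dq)

labels = np.array([0, 0, 1, 1, 2, 2])            # F(x): merge points pairwise
xi = np.array([0.3, -0.5])

generic = np.array([[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                    [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]])
sufficient = np.array([[0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
                       [0.0, 0.0, 1.0, 1.0, 4.0, 4.0]])

loss_generic = information_loss(xi, generic, labels)
loss_sufficient = information_loss(xi, sufficient, labels)
```

The eigenvalues of `loss_generic` should be non-negative with at least one strictly positive, while `loss_sufficient` should vanish up to rounding, consistent with the equality condition of the theorem.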
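Now, if $F$ is a sufficient statistic, then (37) implies that $\partial_i\ln p(x;\xi)=\partial_i\ln q(F(x);\xi)$. But this implies that $g_{ij}$, and by extension $\Gamma^{(\alpha)}_{ij,k}$, are the same for both $S$ and $S_F$. Therefore the Fisher metric and $\alpha$-connection are invariant with respect to the sufficient statistic $F$. In the language above, this implies that there is no information loss associated with describing the original distribution $p$ by $q$, i.e., that information is preserved under $F$. Formally, this invariance is codified by the following two equations:
$$\langle X,Y\rangle_p=\langle\lambda(X),\lambda(Y)\rangle'_{\lambda(p)}~,\qquad\lambda\left(\nabla^{(\alpha)}_XY\right)=\nabla'^{(\alpha)}_{\lambda(X)}\lambda(Y)\qquad\forall X,Y\in T(S)~, \tag{39}$$
where the prime denotes the object on $S_F$, $\lambda$ is the diffeomorphism from $S$ onto $S_F$ given by $\lambda(p_\xi)=q_\xi$, and the pushforward $\lambda:T(S)\to T(S_F)$ is defined by $\left(\lambda(X)\right)_{\lambda(p)}=(\mathrm{d}\lambda)_p(X_p)$.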
The salient feature of the Fisher metric and $\alpha$-connection is that they are uniquely characterized by this invariance! This is the thrust of Chentsov’s theorem (Theorem 2.6 in Amari & Nagaoka). Strictly speaking, the proof of this theorem relies on finiteness of $\mathcal{X}$, but (depending on the level of rigour one demands) it is possible to extend this to infinite models via a limiting procedure in which one considers increasingly fine-grained subsets of $\mathcal{X}$. A similar subtlety will arise in our more geometrical treatment of dual structures in the next post. I’m honestly unsure how serious this issue is, but it’s worth bearing in mind that the mathematical basis is less solid for infinite $\mathcal{X}$, and may require a more rigorous functional analytic approach.