Information geometry is a rather interesting fusion of statistics and differential geometry, in which a statistical model is endowed with the structure of a Riemannian manifold. Each point on the manifold corresponds to a probability distribution function, and the metric is governed by the underlying properties thereof. It may have many interesting applications to (quantum) information theory, complexity, machine learning, and theoretical neuroscience, among other fields. The canonical reference is Methods of Information Geometry by Shun-ichi Amari and Hiroshi Nagaoka, originally published in Japanese in 1993, and translated into English in 2000. The English version also contains several new topics/sections, and thus should probably be considered as a “second edition” (incidentally, any page numbers given below refer to this version). This is a beautiful book, which develops the concepts in a concise yet pedagogical manner. The posts in this three-part series are merely the notes I made while studying it, and aren’t intended to provide more than a brief summary of what I deemed the salient aspects.
Chapter 1 provides a self-contained introduction to differential geometry which — as the authors amusingly allude — is far more accessible (dare I say practical) than the typical mathematician’s treatise on the subject. Since this is familiar material, I’ll skip ahead to chapter 2, and simply recall necessary formulas/notation as needed.
Chapter 2 introduces the basic geometric structure of statistical models. First, some notation: a probability distribution is a function $p:\mathcal{X}\to\mathbb{R}$ which satisfies
$$p(x)\geq0\quad\forall x\in\mathcal{X},\qquad \int_{\mathcal{X}}p(x)\,\mathrm{d}x=1,$$
where $\mathcal{X}$ is the sample space. If $\mathcal{X}$ is a discrete set (which may have either finite or countably infinite cardinality), then the integral is instead a sum. Each $p$ may be parametrized by $n$ real-valued variables $\xi=(\xi^1,\ldots,\xi^n)$, such that the family $S$ of probability distributions on $\mathcal{X}$ is
$$S=\big\{\,p_\xi=p(x;\xi)\;\big|\;\xi=(\xi^1,\ldots,\xi^n)\in\Xi\,\big\},$$
where $\Xi\subseteq\mathbb{R}^n$ and the map $\xi\mapsto p_\xi$ is injective (so that the reverse mapping, from probability distributions $p_\xi$ to the coordinates $\xi$, is a function). Such an $S$ is referred to as an $n$-dimensional statistical model, or simply model on $\mathcal{X}$, sometimes abbreviated $S=\{p_\xi\}$. Note that I referred to $\xi$ as the coordinates above: we will assume that the map $\xi\mapsto p_\xi$ provided by $p(x;\xi)$ is $C^\infty$, so that we may take derivatives with respect to the parameters, e.g., $\partial_i p(x;\xi)$, where $\partial_i\equiv\partial/\partial\xi^i$. Additionally, we assume that the order of differentiation and integration may be freely exchanged; an important consequence of this is that
$$\int\partial_i\,p(x;\xi)\,\mathrm{d}x=\partial_i\!\int p(x;\xi)\,\mathrm{d}x=0,\tag{3}$$
which we will use below. Finally, we assume that the support of the distribution is independent of (that is, does not vary with) $\xi$, and hence may redefine $\mathcal{X}=\mathrm{supp}(p)$ for simplicity, so that $p(x;\xi)>0$ for all $x\in\mathcal{X}$. Thus, the model $S$ is a subset of
$$\mathcal{P}(\mathcal{X}):=\Big\{\,p:\mathcal{X}\to\mathbb{R}\;\Big|\;p(x)>0\;\;\forall x\in\mathcal{X},\;\int p(x)\,\mathrm{d}x=1\,\Big\}.$$
Considering parametrizations which are diffeomorphic to one another to be equivalent then elevates $S$ to a statistical manifold. (In this context, we shall sometimes conflate the distribution $p_\xi$ with the coordinate $\xi$, and speak of the “point $\xi$”, etc).
As a concrete example, consider the normal distribution:
$$p(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].\tag{5}$$
In this case, $n=2$, $\mathcal{X}=\mathbb{R}$, $\xi=(\mu,\sigma)$, and $\Xi=\{(\mu,\sigma)\,|\,\mu\in\mathbb{R},\,\sigma>0\}$ (the upper half-plane). Other examples are given on page 27.
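As a quick concrete sketch (mine, not from the book), the following Python snippet realizes this two-parameter model numerically and verifies the normalization condition at a few points of the manifold:

```python
# A concrete sketch (mine, not from the book) of the model above: the
# two-parameter normal family on X = R with coordinates xi = (mu, sigma).
# We check the normalization condition numerically at a few points p_xi in S.
import numpy as np

def p(x, xi):
    mu, sigma = xi
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-50, 50, 200001)
for xi in [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]:
    print(xi, np.trapz(p(x, xi), x))  # ~ 1.0 at every point of the manifold
```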
Fisher information metric
Now, given a model $S=\{p_\xi\}$, the Fisher information matrix of $S$ at a point $\xi$ is the $n\times n$ matrix $G(\xi)=[g_{ij}(\xi)]$, with elements
$$g_{ij}(\xi):=E_\xi\!\left[\partial_i\ell_\xi\,\partial_j\ell_\xi\right]=\int\partial_i\ell(x;\xi)\,\partial_j\ell(x;\xi)\,p(x;\xi)\,\mathrm{d}x,$$
where $\ell_\xi(x)=\ell(x;\xi):=\ln p(x;\xi)$, and the expectation with respect to the distribution $p_\xi$ is defined as
$$E_\xi[f]:=\int f(x)\,p(x;\xi)\,\mathrm{d}x.$$
Though the integral above diverges for some models, we shall assume that $g_{ij}(\xi)$ is finite for all $\xi\in\Xi$, and furthermore that it is $C^\infty$ in $\xi$. Note that $g_{ij}=g_{ji}$ (i.e., $G$ is symmetric). Additionally, while $G$ is in general positive semidefinite, we shall assume positive definiteness, which is equivalent to requiring that $\{\partial_1\ell_\xi,\ldots,\partial_n\ell_\xi\}$ are linearly independent functions on $\mathcal{X}$. Lastly, observe that eq. (3) may be written $E_\xi[\partial_i\ell_\xi]=0$ (since $\partial_i p=p\,\partial_i\ell$). Via integration by parts (differentiating this identity with respect to $\xi^j$ under the integral), this allows us to express the matrix elements in a useful alternative form:
$$g_{ij}(\xi)=-E_\xi\!\left[\partial_i\partial_j\ell_\xi\right].$$
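Both expressions are easy to check numerically. Here is a minimal Python sketch (my own, not from the book) which computes the Fisher matrix of the normal model in the coordinates $\xi=(\mu,\sigma)$ both ways, using finite differences in $\xi$ and quadrature in $x$, and compares against the known analytic result $G(\xi)=\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$:

```python
# A numerical sanity check (my own, not from the book): for the normal model
# p(x; mu, sigma) in the coordinates xi = (mu, sigma), the Fisher matrix is
# known to be diag(1/sigma^2, 2/sigma^2).  We compute it both ways, i.e.
# g_ij = E[d_i l d_j l] and g_ij = -E[d_i d_j l], via finite differences in
# xi and quadrature in x.
import numpy as np

def log_p(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)

def fisher_matrix(mu, sigma, h=1e-4):
    x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 40001)
    p = np.exp(log_p(x, mu, sigma))
    xi = np.array([mu, sigma])
    l = lambda v: log_p(x, v[0], v[1])   # l(x; xi) = ln p(x; xi)

    # first derivatives of l with respect to xi^i (central differences)
    grads = []
    for i in range(2):
        e = np.zeros(2); e[i] = h
        grads.append((l(xi + e) - l(xi - e)) / (2 * h))

    g_outer = np.empty((2, 2))   # E[d_i l d_j l]
    g_hess = np.empty((2, 2))    # -E[d_i d_j l]
    for i in range(2):
        for j in range(2):
            g_outer[i, j] = np.trapz(grads[i] * grads[j] * p, x)
            ei = np.zeros(2); ei[i] = h
            ej = np.zeros(2); ej[j] = h
            d2l = (l(xi + ei + ej) - l(xi + ei - ej)
                   - l(xi - ei + ej) + l(xi - ei - ej)) / (4 * h * h)
            g_hess[i, j] = -np.trapz(d2l * p, x)
    return g_outer, g_hess

mu, sigma = 0.3, 1.7
g1, g2 = fisher_matrix(mu, sigma)
print(g1)                                     # ~ diag(1/sigma^2, 2/sigma^2)
print(g2)                                     # same matrix, via -E[d_i d_j l]
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # analytic result
```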
The reason for the particular definition of the Fisher matrix above is that it provides a Riemannian metric on $S$. Recall that a Riemannian metric $g=\langle\cdot\,,\cdot\rangle$ is a $(0,2)$-tensor (that is, it assigns to each point of $S$ an inner product on the corresponding tangent space) which is linear, symmetric, and positive definite (meaning $\langle X,X\rangle\geq0$, with equality iff $X=0$). In fact, in the natural basis of local coordinates $[\xi^i]$, the Riemannian metric is uniquely determined by
$$g_{ij}(\xi)=\big\langle\partial_i,\partial_j\big\rangle_\xi,$$
whence $G(\xi)=[g_{ij}(\xi)]$ is called the Fisher metric. Since it’s positive definite, we may define the inverse metric $g^{ij}$ corresponding to $G(\xi)^{-1}$ such that $g^{ik}g_{kj}=\delta^i_{\;j}$. Note that $g$ is invariant under coordinate transformations, since under a change of coordinates $\xi\to\rho$,
$$g_{kl}(\rho)=\frac{\partial\xi^i}{\partial\rho^k}\frac{\partial\xi^j}{\partial\rho^l}\,g_{ij}(\xi),$$
and hence we may write the line element as
$$\mathrm{d}s^2=g_{ij}\,\mathrm{d}\xi^i\mathrm{d}\xi^j.$$
Recalling that $\|X\|=\sqrt{\langle X,X\rangle}$, the length of a curve $\gamma:[a,b]\to S$ with respect to $g$ is then
$$\|\gamma\|=\int_a^b\bigg\|\frac{\mathrm{d}\gamma}{\mathrm{d}t}\bigg\|\,\mathrm{d}t=\int_a^b\sqrt{g_{ij}\,\frac{\mathrm{d}\xi^i}{\mathrm{d}t}\frac{\mathrm{d}\xi^j}{\mathrm{d}t}}\;\mathrm{d}t.$$
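For instance, here is a small sketch (again mine) computing the Fisher length of the curve $\gamma(t)=(\mu,\sigma)=(0,1+t)$, $t\in[0,1]$, in the normal model, for which the analytic answer is $\sqrt{2}\ln2$:

```python
# Illustrative sketch (my own): the Fisher length of the curve
# gamma(t) = (mu, sigma) = (0, 1 + t), t in [0, 1], in the normal model,
# whose metric is g = diag(1/sigma^2, 2/sigma^2).  The analytic answer
# is sqrt(2) * ln(2).
import numpy as np

t = np.linspace(0.0, 1.0, 10001)
sigma = 1.0 + t                              # d sigma / dt = 1, d mu / dt = 0
speed = np.sqrt((2.0 / sigma**2) * 1.0**2)   # sqrt(g_ij dxi^i/dt dxi^j/dt)
print(np.trapz(speed, t))                    # ~ 0.9803
print(np.sqrt(2) * np.log(2))                # analytic length
```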
But before we can properly speak of curves and distances, we must define a connection, which provides a means of parallel transporting vectors along the curve.
$\alpha$-connections
In particular, we will be concerned with comparing elements of the tangent bundle, and hence we require a relation between the tangent spaces at different points in $S$, i.e., an affine connection. To that end, let $S=\{p_\xi\}$ be an $n$-dimensional model, and define the $n^3$ functions $\Gamma^{(\alpha)}_{ij,k}$ which map each point $\xi$ to
$$\Gamma^{(\alpha)}_{ij,k}(\xi):=E_\xi\!\left[\Big(\partial_i\partial_j\ell_\xi+\tfrac{1-\alpha}{2}\,\partial_i\ell_\xi\,\partial_j\ell_\xi\Big)\,\partial_k\ell_\xi\right],$$
where $\alpha\in\mathbb{R}$. This defines an affine connection $\nabla^{(\alpha)}$ on $S$ via
$$\big\langle\nabla^{(\alpha)}_{\partial_i}\partial_j,\,\partial_k\big\rangle=\Gamma^{(\alpha)}_{ij,k},$$
where $\langle\cdot\,,\cdot\rangle=g$ is the Fisher metric introduced above. $\nabla^{(\alpha)}$ is called the $\alpha$-connection, and accordingly terms like $\alpha$-flat, $\alpha$-affine, $\alpha$-parallel, etc. denote the corresponding notions with respect to this connection.
Let us pause to review some of the terminology from differential geometry above. Recall that the covariant derivative may be expressed in local coordinates as
$$\nabla_XY=X^i\big(\partial_iY^k+Y^j\Gamma_{ij}^{\;\;k}\big)\,\partial_k,$$
where $X=X^i\partial_i$, $Y=Y^j\partial_j$ are vectors in the tangent space. If these are basis vectors such that $X=\partial_i$, $Y=\partial_j$, then $\nabla_{\partial_i}\partial_j=\Gamma_{ij}^{\;\;k}\,\partial_k$. The vector $Y$ is said to be parallel with respect to the connection $\nabla$ if $\nabla_XY=0$ for all $X$, i.e., $\nabla_{\partial_i}Y=0$ for all $i$; equivalently, in local coordinates,
$$\partial_iY^k+Y^j\Gamma_{ij}^{\;\;k}=0.$$
If all basis vectors are parallel with respect to a coordinate system $[\xi^i]$ (that is, if $\Gamma_{ij}^{\;\;k}=0$ in these coordinates), then the latter is an affine coordinate system for $\nabla$. Connections which admit such an affine parametrization are called flat (equivalently, one says that $S$ is flat with respect to $\nabla$).
Now, with respect to a Riemannian metric $g$, one defines $\Gamma_{ij,k}$ as above, namely $\Gamma_{ij,k}=\big\langle\nabla_{\partial_i}\partial_j,\partial_k\big\rangle=\Gamma_{ij}^{\;\;l}\,g_{lk}$. Note that this defines a symmetric connection, i.e., one with $\Gamma_{ij,k}=\Gamma_{ji,k}$. If, in addition, $\nabla$ satisfies
$$\partial_k\,g_{ij}=\Gamma_{ki,j}+\Gamma_{kj,i},$$
then $\nabla$ is a metric connection with respect to $g$. (This is basically a statement about linearity, since affine transformations are linear). This implies that
$$\big\langle\Pi_\gamma X,\,\Pi_\gamma Y\big\rangle_q=\big\langle X,Y\big\rangle_p,$$
where $\Pi_\gamma$ denotes parallel transport along any curve $\gamma$ from $p$ to $q$. In other words, under a metric connection, parallel transport of two vectors preserves the inner product, hence their significance in Riemannian geometry. A connection which is both metric and symmetric is the Riemannian (Levi-Civita) connection; it is unique for a given metric, but since a manifold admits infinitely many distinct metrics, there are generically an infinite number of such connections. However, the natural connections on statistical manifolds are generically non-metric! Indeed, since
$$\partial_k\,g_{ij}=\Gamma^{(\alpha)}_{ki,j}+\Gamma^{(-\alpha)}_{kj,i},$$
only the special case $\alpha=0$ defines a Riemannian connection with respect to the Fisher metric (though observe that $\Gamma^{(\alpha)}_{ij,k}$ is symmetric in $i,j$ for any value of $\alpha$). While this may seem strange from a physics perspective, where preserving the inner product is of prime importance, there’s nothing mathematically pathological about it. Indeed, the more relevant condition — which we’ll see below — is that every point on the manifold have an interpretation as a probability distribution.
Two neat relationships between different $\alpha$-connections are worth noting. First, for any $\alpha$, we have
$$\Gamma^{(\alpha)}_{ij,k}=\Gamma^{(0)}_{ij,k}-\frac{\alpha}{2}\,T_{ijk},$$
where $T$ (not to be confused with the unrelated torsion tensor) is a covariant symmetric tensor which maps a point $\xi$ to
$$T_{ijk}(\xi):=E_\xi\!\left[\partial_i\ell_\xi\,\partial_j\ell_\xi\,\partial_k\ell_\xi\right].$$
Second, the $\alpha$-connection may be decomposed as
$$\Gamma^{(\alpha)}_{ij,k}=\frac{1+\alpha}{2}\,\Gamma^{(1)}_{ij,k}+\frac{1-\alpha}{2}\,\Gamma^{(-1)}_{ij,k}.$$
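Both relations, along with the metric-compatibility identity above, are straightforward to spot-check numerically; the following sketch (my own, with an arbitrarily chosen point and value of $\alpha$) does so for the normal model:

```python
# A numerical spot check (my own sketch) of the two relations above, and of
# the identity d_k g_ij = Gamma^(a)_{ki,j} + Gamma^(-a)_{kj,i}, using the
# normal model with coordinates xi = (mu, sigma) and an arbitrary point
# and alpha.
import numpy as np

h = 1e-3
x = np.linspace(-25, 25, 60001)

def l(xi):                      # l(x; xi) = ln p(x; xi)
    mu, sigma = xi
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)

def d(i, xi):                   # d_i l via central differences
    e = np.zeros(2); e[i] = h
    return (l(xi + e) - l(xi - e)) / (2 * h)

def dd(i, j, xi):               # d_i d_j l
    e = np.zeros(2); e[j] = h
    return (d(i, xi + e) - d(i, xi - e)) / (2 * h)

def E(f, xi):                   # expectation with respect to p(x; xi)
    return np.trapz(f * np.exp(l(xi)), x)

def Gamma(a, i, j, k, xi):      # alpha-connection coefficients
    return E((dd(i, j, xi) + 0.5 * (1 - a) * d(i, xi) * d(j, xi)) * d(k, xi), xi)

def T(i, j, k, xi):             # the symmetric tensor T_ijk
    return E(d(i, xi) * d(j, xi) * d(k, xi), xi)

def g(i, j, xi):                # Fisher metric
    return E(d(i, xi) * d(j, xi), xi)

xi, a = np.array([0.5, 2.0]), 0.7
for i in range(2):
    for j in range(2):
        for k in range(2):
            assert abs(Gamma(a, i, j, k, xi)
                       - (Gamma(0, i, j, k, xi) - 0.5 * a * T(i, j, k, xi))) < 1e-6
            e = np.zeros(2); e[k] = h
            dk_g = (g(i, j, xi + e) - g(i, j, xi - e)) / (2 * h)
            assert abs(dk_g - (Gamma(a, k, i, j, xi) + Gamma(-a, k, j, i, xi))) < 1e-4
print("both relations hold numerically")
```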
Within this infinite class of connections, $\alpha=\pm1$ play a central role in information geometry, and are closely related to an interesting duality structure on the geometry of $S$. We shall give a low-level introduction to the relevant representations of $S$ here, and postpone a more elegant derivation based on different embeddings of $S$ in $\mathbb{R}^{\mathcal{X}}$ to the next post. In particular, we’ll define the so-called exponential and mixture families, which are intimately related to the $(+1)$- and $(-1)$-connections, respectively.
Exponential families
Suppose that an $n$-dimensional model $S$ can be expressed in terms of $n+1$ functions $\{C,F_1,\ldots,F_n\}$ on $\mathcal{X}$ and a function $\psi$ of the parameters as
$$p(x;\theta)=\exp\!\big[\,C(x)+\theta^iF_i(x)-\psi(\theta)\big],$$
where we’ve employed Einstein’s summation notation for the sum over $i$ from $1$ to $n$. Then $S$ is an exponential family, and $\theta=(\theta^1,\ldots,\theta^n)$ are its natural or canonical parameters. The normalization condition $\int p(x;\theta)\,\mathrm{d}x=1$ implies that
$$\psi(\theta)=\ln\!\int\exp\!\big[\,C(x)+\theta^iF_i(x)\big]\,\mathrm{d}x.$$
This provides a parametrization $\theta\mapsto p_\theta$, which is one-to-one if and only if the functions $\{1,F_1,\ldots,F_n\}$ are linearly independent (which we shall assume henceforth). Many important probabilistic models fall into this class, including all those referenced on page 27 above. The normal distribution (5), for instance, yields
$$\theta^1=\frac{\mu}{\sigma^2},\qquad\theta^2=-\frac{1}{2\sigma^2},\qquad F_1(x)=x,\qquad F_2(x)=x^2,\qquad C(x)=0,$$
$$\psi(\theta)=\frac{\mu^2}{2\sigma^2}+\ln\!\big(\sqrt{2\pi}\,\sigma\big)=-\frac{(\theta^1)^2}{4\theta^2}+\frac{1}{2}\ln\!\Big(\!-\frac{\pi}{\theta^2}\Big).$$
The canonical coordinates are natural insofar as they provide a $1$-affine coordinate system, with respect to which $S$ is $1$-flat. To see this, observe that
$$\partial_i\ell_\theta(x)=F_i(x)-\partial_i\psi(\theta)\qquad\Longrightarrow\qquad\partial_i\partial_j\ell_\theta(x)=-\partial_i\partial_j\psi(\theta),$$
where keep in mind that $\partial_i$ denotes the derivative with respect to $\theta^i$, not $x$! This implies that
$$\Gamma^{(1)}_{ij,k}(\theta)=E_\theta\!\left[\partial_i\partial_j\ell_\theta\;\partial_k\ell_\theta\right]=-\partial_i\partial_j\psi(\theta)\,E_\theta\!\left[\partial_k\ell_\theta\right]=0.$$
Therefore, exponential families admit a canonical parametrization in terms of a $1$-affine coordinate system $[\theta^i]$, with respect to which $S$ is $1$-flat. The associated affine connection is called the exponential connection, and is denoted $\nabla^{(\mathrm{e})}=\nabla^{(1)}$.
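Here is a small numerical spot-check (mine, not from the book) of this $1$-flatness for the normal family, using the fact that $\partial_i\psi=E_\theta[F_i]$ for an exponential family:

```python
# Sketch (my own) spot-checking 1-flatness of the normal family in its
# natural coordinates at the point (mu, sigma) = (1, 2).  For an exponential
# family, d_i psi = E[F_i], so d_i l = F_i(x) - E[F_i] and
# d_i d_j l = -Cov(F_i, F_j) is independent of x; hence Gamma^(1)_{ij,k} = 0,
# while e.g. Gamma^(0) does not vanish.
import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 15 * sigma, mu + 15 * sigma, 80001)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
E = lambda f: np.trapz(f * p, x)

F = [x, x**2]                                        # F_1 = x, F_2 = x^2
dl = [F[i] - E(F[i]) for i in range(2)]              # d_i l in theta coordinates
ddl = [[-E(dl[i] * dl[j]) for j in range(2)] for i in range(2)]  # d_i d_j l

def Gamma(alpha, i, j, k):
    return E((ddl[i][j] + 0.5 * (1 - alpha) * dl[i] * dl[j]) * dl[k])

print(max(abs(Gamma(1, i, j, k))
          for i in range(2) for j in range(2) for k in range(2)))  # ~ 0
print(Gamma(0, 0, 0, 1))  # ~ 16: the theta coordinates are not 0-affine
```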
Mixture families
Now consider the case in which $p(x;\theta)$ can be expressed as
$$p(x;\theta)=C(x)+\theta^iF_i(x),$$
i.e., $S$ is an affine subspace of $\mathcal{P}(\mathcal{X})$. In this case $S$ is called a mixture family, with mixture parameters $\theta$. Note that $\mathcal{P}(\mathcal{X})$ itself is a mixture family when $\mathcal{X}$ is a finite set. The name arises from the fact that elements in this family admit a representation as a mixture of $n+1$ probability distributions $p_1,\ldots,p_{n+1}$,
$$p(x;\theta)=\sum_{i=1}^{n}\theta^i\,p_i(x)+\Big(1-\sum_{i=1}^{n}\theta^i\Big)\,p_{n+1}(x)$$
(i.e., each $p_i(x)>0$ and $\int p_i(x)\,\mathrm{d}x=1$), where $\theta^i>0$ and $\sum_{i=1}^n\theta^i<1$. For a mixture family, we have
$$\partial_i\partial_j\ell_\theta=-\,\partial_i\ell_\theta\,\partial_j\ell_\theta,$$
which implies (the factor $\tfrac{1-\alpha}{2}$ becomes $1$ at $\alpha=-1$) that
$$\Gamma^{(-1)}_{ij,k}(\theta)=E_\theta\!\left[\big(\partial_i\partial_j\ell_\theta+\partial_i\ell_\theta\,\partial_j\ell_\theta\big)\,\partial_k\ell_\theta\right]=0.$$
Therefore, a mixture family admits a parametrization in terms of a $(-1)$-affine coordinate system $[\theta^i]$, with respect to which $S$ is $(-1)$-flat. The associated affine connection is called the mixture connection, denoted $\nabla^{(\mathrm{m})}=\nabla^{(-1)}$.
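And a matching spot-check (again my own sketch) of $(-1)$-flatness for a toy one-parameter mixture of two fixed normals:

```python
# Sketch (my own) checking (-1)-flatness of a toy mixture family: the
# one-parameter model p(x; theta) = theta*p1(x) + (1 - theta)*p2(x) built
# from two fixed normals.  Here dl = (p1 - p2)/p and d^2 l = -(dl)^2, so the
# single coefficient Gamma^(-1)_{11,1} vanishes, while Gamma^(+1)_{11,1}
# generically does not.
import numpy as np

x = np.linspace(-20, 20, 80001)
normal = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
p1, p2 = normal(-2.0), normal(2.0)

theta = 0.3
p = theta * p1 + (1 - theta) * p2
dl = (p1 - p2) / p                  # d l / d theta
ddl = -dl ** 2                      # d^2 l / d theta^2 for a mixture family
E = lambda f: np.trapz(f * p, x)

Gamma = lambda alpha: E((ddl + 0.5 * (1 - alpha) * dl ** 2) * dl)
print(Gamma(-1.0))   # = 0: theta is a (-1)-affine coordinate
print(Gamma(+1.0))   # nonzero: the same coordinate is not 1-affine
```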
In the next post, when we discuss the geometrical structure in more detail, we shall see that $\nabla^{(\mathrm{e})}$ and $\nabla^{(\mathrm{m})}$ are dual connections, which has many interesting consequences.
Why is Fisher special?
As noted above, a given manifold admits infinitely many distinct Riemannian metrics and affine connections. However, a statistical manifold has the property that every point is a probability distribution, which singles out the Fisher metric and $\alpha$-connection as unique. To formalize this notion, we must first introduce the concept of a sufficient statistic.
Let $F:\mathcal{X}\to\mathcal{Y}$ be a map which takes random variables $x\in\mathcal{X}$ to $y=F(x)\in\mathcal{Y}$. Given the distribution $p(x;\xi)$ of $x$, this results in the distribution $q(y;\xi)$ on $\mathcal{Y}$. We then define
$$p(x|y;\xi):=\frac{p(x;\xi)}{q(y;\xi)}\,\delta\big(y-F(x)\big),$$
where $q(y;\xi)=\int\delta\big(y-F(x)\big)\,p(x;\xi)\,\mathrm{d}x$, and $\delta\big(y-F(x)\big)$ is the delta function at the point $y=F(x)$, such that $\delta\big(y-F(x)\big)=0$ for $y\neq F(x)$, $\int\delta\big(y-F(x)\big)\,\mathrm{d}y=1$. In other words, the delta function picks out the value of $y$ such that $y=F(x)$. The above implies that $p(x|y;\xi)$ is the conditional probability of the event $x$, given $y$ (cf. the familiar definition $P(A|B)=P(A\cap B)/P(B)$). If $p(x|y;\xi)$ is independent of $\xi$, then $F$ is called a sufficient statistic for $S$. In this case, we may write
$$p(x;\xi)=q\big(F(x);\xi\big)\,p\big(x|F(x)\big),$$
i.e., the dependence of $p(x;\xi)$ on $\xi$ is entirely encoded in the distribution $q(y;\xi)$. Therefore, treating $p(x;\xi)$ as the unknown distribution, whose parameter $\xi$ one wishes to estimate, it suffices to know only the value $y=F(x)$, hence the name. Formally, one says that $F$ is a sufficient statistic if and only if there exist functions $s(y;\xi)$ and $t(x)$ such that
$$p(x;\xi)=s\big(F(x);\xi\big)\,t(x).\tag{37}$$
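As a standard illustration (my example, not taken from the book): for $n$ independent samples $x=(x_1,\ldots,x_n)$ from $N(\mu,1)$, the sample mean $\bar{x}=\tfrac{1}{n}\sum_ix_i$ is a sufficient statistic for $\mu$, since the identity $\sum_i(x_i-\mu)^2=\sum_i(x_i-\bar{x})^2+n(\bar{x}-\mu)^2$ puts the joint density precisely in the factorized form above:
$$p(x;\mu)=\prod_{i=1}^n\frac{e^{-(x_i-\mu)^2/2}}{\sqrt{2\pi}}=\underbrace{e^{-n(\bar{x}-\mu)^2/2}}_{s(F(x);\,\mu)}\;\underbrace{(2\pi)^{-n/2}\,e^{-\frac{1}{2}\sum_i(x_i-\bar{x})^2}}_{t(x)},\qquad F(x)=\bar{x}.$$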
The significance of this lies in the fact that the Fisher information metric satisfies a monotonicity relation under a generic map $F$. This is detailed in Theorem 2.1 of Amari & Nagaoka, which states that given $S=\{p(x;\xi)\}$ with Fisher metric $G(\xi)$, and induced model $S_F=\{q(y;\xi)\}$ with matrix $G_F(\xi)$, the difference $\Delta G(\xi):=G(\xi)-G_F(\xi)$ is positive semidefinite, i.e., $G(\xi)\geq G_F(\xi)$, with equality if and only if $F$ is a sufficient statistic. Otherwise, for generic maps, the “information loss” $\Delta G(\xi)$ that results from summarizing the data $x$ in $y=F(x)$ is given by
$$\Delta g_{ij}(\xi)=E_\xi\!\left[\partial_i\ell(x|F(x);\xi)\,\partial_j\ell(x|F(x);\xi)\right],\qquad \ell(x|y;\xi):=\ln p(x|y;\xi),$$
which can be expressed in terms of the covariance with respect to the conditional distribution $p(x|y;\xi)$. This theorem will be important later, when we discuss relative entropy.
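As a minimal illustration (my own, with a hypothetical one-parameter setup rather than anything from the book), the following sketch compares the Fisher information of the induced models for a sufficient and a non-sufficient statistic of two i.i.d. normal samples:

```python
# Sketch (my own) of the monotonicity statement in a 1-parameter example:
# two i.i.d. samples from N(mu, 1), so the full Fisher information is 2.
# The statistic F(x1, x2) = x1 + x2 is sufficient (its induced model is
# N(2*mu, 2)), while F(x1, x2) = x1 discards half of the information.
import numpy as np

def fisher_1d(log_q, mu, h=1e-4):
    """Fisher information of a 1-parameter family of densities on R."""
    y = np.linspace(-40, 40, 160001)
    dl = (log_q(y, mu + h) - log_q(y, mu - h)) / (2 * h)
    return np.trapz(dl ** 2 * np.exp(log_q(y, mu)), y)

log_normal = lambda y, m, v: -0.5 * (y - m) ** 2 / v - 0.5 * np.log(2 * np.pi * v)

mu = 0.7
g_full = 2.0                                                    # two N(mu, 1) samples
g_sum = fisher_1d(lambda y, m: log_normal(y, 2 * m, 2.0), mu)   # induced by F = x1 + x2
g_one = fisher_1d(lambda y, m: log_normal(y, m, 1.0), mu)       # induced by F = x1
print(g_full, g_sum, g_one)  # ~ 2.0, 2.0, 1.0: equality only for the sufficient statistic
```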
Now, if $F$ is a sufficient statistic, then (37) implies that $\partial_i\ell(x;\xi)=\partial_i\ln s\big(F(x);\xi\big)$, which depends on $x$ only through $F(x)$ and in fact equals $\partial_i\ln q\big(F(x);\xi\big)$. But this implies that $g_{ij}(\xi)$, and by extension $\Gamma^{(\alpha)}_{ij,k}(\xi)$, are the same for both $S$ and $S_F$. Therefore the Fisher metric and $\alpha$-connection are invariant with respect to the sufficient statistic $F$. In the language above, this implies that there is no information loss associated with describing the original distribution $p(x;\xi)$ by $q(y;\xi)$, i.e., that information is preserved under $F$. Formally, this invariance is codified by the following two equations:
$$\big\langle X,Y\big\rangle_p=\big\langle\lambda_*X,\,\lambda_*Y\big\rangle'_{\lambda(p)},\qquad \lambda_*\big(\nabla^{(\alpha)}_XY\big)=\nabla'^{(\alpha)}_{\lambda_*X}\,\lambda_*Y,$$
where the prime denotes the object on $S_F$, $\lambda$ is the diffeomorphism from $S$ onto $S_F$ given by $\lambda(p_\xi)=q_\xi$, and the pushforward $\lambda_*$ is defined by $(\lambda_*X)f=X(f\circ\lambda)$.
The salient feature of the Fisher metric and $\alpha$-connection is that they are uniquely characterized by this invariance! This is the thrust of Chentsov’s theorem (Theorem 2.6 in Amari & Nagaoka). Strictly speaking, the proof of this theorem relies on finiteness of $\mathcal{X}$, but — depending on the level of rigour one demands — it is possible to extend this to infinite models via a limiting procedure in which one considers increasingly fine-grained subsets of $\mathcal{X}$. A similar subtlety will arise in our more geometrical treatment of dual structures in the next post. I’m honestly unsure how serious this issue is, but it’s worth bearing in mind that the mathematical basis is less solid for infinite $\mathcal{X}$, and may require a more rigorous functional analytic approach.