In the previous post, we introduced the $\alpha$-connection, and alluded to a dualistic structure between $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$. In particular, the cases $\alpha=\pm1$ are intimately related to two important families of statistical models, the exponential or e-family with affine connection $\nabla^{(e)}:=\nabla^{(1)}$, and the mixture or m-family with affine connection $\nabla^{(m)}:=\nabla^{(-1)}$. Hence before turning to general aspects of the dual structure on $S$, it is illuminating to see how these families/connections emerge naturally via an embedding formalism.
Two embeddings
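For concreteness (or rather, to err on the safe side of mathematical rigour), let $\mathcal{X}$ be a finite set. As before, an arbitrary model $S$ on $\mathcal{X}$ is a submanifold of $\mathcal{P}=\mathcal{P}(\mathcal{X})$, which in turn is a subset of the set of all $\mathbb{R}$-valued functions on $\mathcal{X}$, denoted
$$\mathbb{R}^{\mathcal{X}}=\{A\,|\,A:\mathcal{X}\to\mathbb{R}\}. \tag{1}$$
Note that since $\mathcal{X}$ is finite, the normalization constraint is essentially the imposition that $\sum_xp(x)=1$, which implies that $\mathcal{P}$ is specifically an open subset of the affine subspace
$$\mathcal{A}_1=\Big\{A\,\Big|\,\sum_xA(x)=1\Big\}, \tag{2}$$
which further implies that the tangent space $T_p(\mathcal{P})$ is naturally identified with the linear subspace
$$\mathcal{A}_0=\Big\{A\,\Big|\,\sum_xA(x)=0\Big\}. \tag{3}$$
(One can see this by differentiating the constraint along any curve $p_t$ in $\mathcal{P}$, which gives $\sum_x\dot p_t(x)=0$). Vectors $X\in T_p(\mathcal{P})$, considered as elements of $\mathcal{A}_0$, will be denoted $X^{(m)}$. These define the mixture or m-representation of $X$, i.e.,
$$T_p^{(m)}(\mathcal{P})=\{X^{(m)}\,|\,X\in T_p(\mathcal{P})\}=\mathcal{A}_0. \tag{4}$$
As before, the natural basis then defines the $(-1)$-affine coordinates $\xi^i$, with $(\partial_i)^{(m)}_\xi=\partial_ip_\xi$, with respect to which $\mathcal{P}$ is $(-1)$-flat. The natural connection induced from this affine structure, $\nabla^{(m)}$, is the $(-1)$-connection introduced before, which preserves vectors identically under parallel transport $\Pi^{(m)}_{p,q}:T_p(\mathcal{P})\to T_q(\mathcal{P})$, i.e.,
$$\Pi^{(m)}_{p,q}(X)=X'\quad\text{such that}\quad X'^{(m)}=X^{(m)}. \tag{5}$$
Thus we see that the natural embedding of $\mathcal{P}$ into $\mathbb{R}^{\mathcal{X}}$ makes the affine structure, and in particular the significance of the $m$-connection, manifest.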
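Now consider the alternative embedding $p\mapsto\log p$, and identify $\mathcal{P}$ with the subset $\{\log p\,|\,p\in\mathcal{P}\}\subset\mathbb{R}^{\mathcal{X}}$. Elements $X\in T_p(\mathcal{P})$ are then given by applying $X$ to the map $p\mapsto\log p$, whereupon we denote them $X^{(e)}$ and call this the exponential or e-representation. In local coordinates, we have $(\partial_i)^{(e)}_\xi=\partial_i\log p_\xi$. By the chain rule, $X^{(e)}$ is then related to $X^{(m)}$ as
$$X^{(e)}(x)=\frac{X^{(m)}(x)}{p(x)}, \tag{6}$$
and therefore the tangent space is obtained by modifying the constraint on $\mathcal{A}_0$ above to $\sum_xp(x)A(x)=E_p[A]=0$, i.e.,
$$T_p^{(e)}(\mathcal{P})=\{X^{(e)}\,|\,X\in T_p(\mathcal{P})\}=\{A\in\mathbb{R}^{\mathcal{X}}\,|\,E_p[A]=0\}. \tag{7}$$
One sees immediately that unlike $T_p^{(m)}(\mathcal{P})$, $T_p^{(e)}(\mathcal{P})$ depends on $p$, and therefore parallel transport does not preserve $X^{(e)}$. However, while an element $A\in T_p^{(e)}(\mathcal{P})$ does not generally belong to $T_q^{(e)}(\mathcal{P})$ for $q\neq p$, the constraint in (7) implies that the shifted vector
$$A'=A-E_q[A] \tag{8}$$
does belong to $T_q^{(e)}(\mathcal{P})$. We therefore have
$$\Pi^{(e)}_{p,q}(X)=X'\quad\text{such that}\quad X'^{(e)}=X^{(e)}-E_q[X^{(e)}], \tag{9}$$
where $\Pi^{(e)}_{p,q}$ denotes parallel transport with respect to the e-connection $\nabla^{(e)}$. (Properly showing that $X$ is e-parallel under (9) is of course slightly more involved, and is done on page 41). Thus the exponential (or, if one prefers, logarithmic) embedding of $\mathcal{P}$ into $\mathbb{R}^{\mathcal{X}}$ leads naturally to the e-affine coordinate system and associated connection introduced in the previous post.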
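To make the two representations concrete, here is a minimal numerical sketch for a three-point sample space. It assumes the relations $X^{(e)}=X^{(m)}/p$ and the shift $A\mapsto A-E_q[A]$ described above; the function names and the particular vectors are hypothetical choices for illustration only.

```python
import numpy as np

def m_rep(p_dot):
    """m-representation: the velocity dp_t/dt of a curve, viewed as a function on X."""
    return np.asarray(p_dot, dtype=float)

def e_rep(p, p_dot):
    """e-representation: d(log p_t)/dt = X^(m) / p, pointwise on X."""
    return np.asarray(p_dot, dtype=float) / np.asarray(p, dtype=float)

def e_transport(A, q):
    """Shift A by its q-expectation so that the result has zero mean under q."""
    return A - np.dot(q, A)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
p_dot = np.array([0.1, -0.3, 0.2])   # a tangent direction: components sum to zero

Xm = m_rep(p_dot)
Xe = e_rep(p, p_dot)
Xe_at_q = e_transport(Xe, q)

print("sum of X^(m):            ", Xm.sum())
print("E_p[X^(e)]:              ", np.dot(p, Xe))
print("E_q[X^(e)] (unshifted):  ", np.dot(q, Xe))
print("E_q[X^(e)] (after shift):", np.dot(q, Xe_at_q))
```

The m-representation satisfies the same constraint at every point, whereas the e-representation only has zero mean at the point where it was constructed, which is why the shift is needed.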
It turns out that the e-representation has a number of interesting properties that allow one to establish close connections between the Fisher metric and various statistical notions. Note that, due to the presence of the log, the Fisher metric can be neatly expressed in the e-representation as
$$\langle X,Y\rangle_p=E_p[X^{(e)}Y^{(e)}]. \tag{10}$$
Furthermore, with a bit more technical footwork involving cotangent spaces and the like, one can derive a number of fundamental relations. For example, the variance of a random variable $A$ is determined by the sensitivity of $E_p[A]$ to perturbations of $p$, and may be expressed as
$$V_p[A]=\|(dE[A])_p\|_p^2, \tag{11}$$
where $(dE[A])_p\in T^*_p(\mathcal{P})$ is the differential of $E[A]:p\mapsto E_p[A]$ at $p$. See section 2.5 for more details.
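As a sanity check on the variance relation, the following sketch works on a four-point sample space in the coordinates $\xi^i=p(x_i)$, $i=1,\ldots,n$ (an assumption made here for convenience). It builds the Fisher matrix from the e-representation of the basis vectors, and computes the squared norm of the differential of $E_p[A]$ as $\omega^TG^{-1}\omega$ with $\omega_i=\partial_iE_p[A]$. The helper names are hypothetical.

```python
import numpy as np

def fisher_metric(p):
    """G_ij = E_p[(d_i log p)(d_j log p)] in coordinates xi^i = p(x_i), i = 1..n."""
    n = len(p) - 1
    # m-representation of the coordinate basis: d_i p = e_i - e_0
    basis_m = np.hstack([-np.ones((n, 1)), np.eye(n)])
    basis_e = basis_m / p                 # e-representation: d_i log p
    return (basis_e * p) @ basis_e.T      # expectation of the pairwise products

def variance_from_differential(p, A):
    """Squared norm of dE[A] at p, using the inverse Fisher metric on covectors."""
    omega = A[1:] - A[0]                  # components d_i E_p[A]
    return float(omega @ np.linalg.solve(fisher_metric(p), omega))

p = np.array([0.1, 0.2, 0.3, 0.4])
A = np.array([1.0, -2.0, 0.5, 3.0])

direct = float(np.dot(p, A**2) - np.dot(p, A)**2)
geometric = variance_from_differential(p, A)
print(direct, geometric)
```

The two numbers should agree to rounding error, which is the finite-dimensional content of the statement above.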
As alluded to previously, the geometrical arguments are somewhat less rigorous for the case when $\mathcal{X}$ is infinite. Intuitively, the issue is that the identification of $T_p(\mathcal{P})$ with $\mathcal{A}_0$, and by extension the isometry between $T_p^{(m)}(\mathcal{P})$ and $T_p^{(e)}(\mathcal{P})$, requires that $T_p(\mathcal{P})$ have the same dimensionality as $\mathcal{A}_0$. For finite $\mathcal{X}$, the constraint $\sum_xA(x)=0$ is “loose enough” for this to be possible, but this constraint becomes increasingly strict as the cardinality of $\mathcal{X}$ increases (i.e., the portion of $\mathbb{R}^{\mathcal{X}}$ identified with $T_p(\mathcal{P})$ decreases). For infinite $\mathcal{X}$, one effectively has an infinite number of constraints, which makes this dimensional matching impossible. Nonetheless, I’m given to understand that much of the framework can still be made to go through; see the references at the end of section 2.5 and associated discussion.
Dual structures on $S$
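We’ve alluded to a duality between $\nabla^{(\pm1)}$, but in fact this extends to general $\alpha$-connections. Hence, rather than consider a particular connection in isolation, the fundamental structure in information geometry is the triple $(g,\nabla,\nabla^*)$, which Amari & Nagaoka call a dualistic structure on $S$ (we prefer the term “dual structure” instead, and will use this henceforth). Formally, suppose one has two affine connections $\nabla$ and $\nabla^*$ on a manifold with Riemannian metric $g=\langle\cdot,\cdot\rangle$. If these satisfy
$$Z\langle X,Y\rangle=\langle\nabla_ZX,Y\rangle+\langle X,\nabla^*_ZY\rangle \tag{12}$$
for all vector fields $X,Y,Z$, then $\nabla$ and $\nabla^*$ are duals with respect to $g$, and one is called the dual or conjugate connection of the other. In local coordinates, this condition reads
$$\partial_kg_{ij}=\Gamma_{ki,j}+\Gamma^*_{kj,i}, \tag{13}$$
which follows from the definition $\Gamma_{ij,k}=\langle\nabla_{\partial_i}\partial_j,\partial_k\rangle$, and similarly for $\Gamma^*_{ij,k}$, with basis vectors $X=\partial_i$ etc. Given a metric $g$ and connection $\nabla$ on $S$, the dual connection $\nabla^*$ is unique, and satisfies $(\nabla^*)^*=\nabla$. Additionally, the combination $(\nabla+\nabla^*)/2$ is metric. Conversely, one immediately sees that the condition for $\nabla$ to be metric is that $\nabla=\nabla^*$, and in this sense dual connections simply constitute a more general class thereof.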
In particular, $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dual with respect to the Fisher metric. This follows readily from the general expressions for the $\alpha$-connection developed in section 2.6, which we’ve skipped over here. Suffice to say the framework of $(\pm1)$-connections admits a straightforward extension to arbitrary $\alpha$.
The significance of dual connections is neatly illustrated by considering the parallel transport along a curve $\gamma$ from $p$ to $q$ with respect to $\nabla$ and $\nabla^*$, respectively denoted $\Pi_\gamma$ and $\Pi^*_\gamma$. As mentioned in the previous post, general $\alpha$-connections are not metric; but the dual structure allows one to generalize the notion of preservation of the inner product along a curve, namely:
$$\langle\Pi_\gamma(X),\Pi^*_\gamma(Y)\rangle_q=\langle X,Y\rangle_p. \tag{14}$$
Furthermore, the relationship between $\Pi_\gamma$ and $\Pi^*_\gamma$ is completely fixed by this condition.
Divergences
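The duality structure of statistical manifolds facilitates the introduction of a distance-like measure $D$, which enables us to compute something like a geometrical distance between distributions. In particular, we shall define a smooth function $D=D(\cdot\|\cdot):S\times S\to\mathbb{R}$ such that $D(p\|q)\geq0$, with equality iff $p=q$. While this provides a measure of the separation between $p$ and $q$, it is in general asymmetric and does not satisfy the triangle inequality, and hence fails the conditions for a distance function. However, if $D$ further satisfies
$$D[\partial_i\partial_j\|\cdot]:=\partial_i\partial_jD(p\|p')\big|_{p'=p}=g^{(D)}_{ij}(p), \tag{15}$$
where derivatives to the left (right) of the bar act on the first (second) argument and $[g^{(D)}_{ij}]$ is a positive definite matrix everywhere on $S$, then $D$ is a divergence or contrast function on $S$. Note that a divergence uniquely defines a Riemannian metric $g^{(D)}=\langle\cdot,\cdot\rangle^{(D)}$ via
$$\langle X,Y\rangle^{(D)}=-D[X\|Y], \tag{16}$$
i.e., $g^{(D)}_{ij}=-D[\partial_i\|\partial_j]$. We can also define an affine connection $\nabla^{(D)}$ associated with this divergence, via the coefficients
$$\Gamma^{(D)}_{ij,k}=-D[\partial_i\partial_j\|\partial_k]. \tag{17}$$
Similarly, we define the dual $D^*(p\|q):=D(q\|p)$, from which we have $g^{(D^*)}=g^{(D)}$ and $\Gamma^{(D^*)}_{ij,k}=-D[\partial_k\|\partial_i\partial_j]$. Thus we have that $\nabla^{(D)}$ and $\nabla^{(D^*)}$ are dual with respect to $g^{(D)}$. In fact, any dual structure $(g,\nabla,\nabla^*)$ is naturally induced from a divergence!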
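Of course, the above is quite general: there are infinitely many different divergences one could define on a manifold. Henceforth we shall specialize to a particularly important class for statistical models, known as the f-divergence:
$$D_f(p\|q)=\int\!\mathrm{d}x\,p(x)\,f\!\left(\frac{q(x)}{p(x)}\right), \tag{18}$$
where $f$ is an arbitrary convex function on $u>0$ with $f(1)=0$ (for finite $\mathcal{X}$, the integral is of course a sum). The f-divergence satisfies a number of important properties (see page 56), chief of which is monotonicity under arbitrary probability transition functions. That is, let $x\in\mathcal{X}$ be randomly transformed into $y\in\mathcal{Y}$ with probability $\kappa(y|x)$, with $\kappa(y|x)\geq0$ and $\int\!\mathrm{d}y\,\kappa(y|x)=1$. This maps distributions $p(x),q(x)$ to $p_\kappa(y)=\int\!\mathrm{d}x\,\kappa(y|x)p(x)$ and $q_\kappa(y)=\int\!\mathrm{d}x\,\kappa(y|x)q(x)$, respectively. Monotonicity of the f-divergence is then the statement that
$$D_f(p\|q)\geq D_f(p_\kappa\|q_\kappa). \tag{19}$$
Spoiler alert: this will surface again in the form of monotonicity of relative entropy below! Note that $\kappa$ is a generalization of the deterministic mapping induced by $F:\mathcal{X}\to\mathcal{Y}$ in the previous post, which corresponds to $\kappa(y|x)=\delta_{F(x)}(y)$. Consequently, for strictly convex $f$, the inequality is saturated iff $\kappa$ is induced from a sufficient statistic, for which the conditional distributions $p_\kappa(x|y):=\kappa(y|x)p(x)/p_\kappa(y)$ and $q_\kappa(x|y)$ coincide.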
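Monotonicity is easy to probe numerically on finite sets. The sketch below assumes the convention $D_f(p\|q)=\sum_xp(x)f(q(x)/p(x))$ and represents the transition probability as a column-stochastic matrix `kappa[y, x]`; the random distributions, the seed, and the choice $f(u)=-\log u$ are arbitrary illustrative assumptions.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p||q) = sum_x p(x) f(q(x)/p(x)), for strictly positive p and q."""
    return float(np.sum(p * f(q / p)))

def push_forward(p, kappa):
    """Apply a transition matrix kappa[y, x] (columns sum to one) to a distribution."""
    return kappa @ p

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
kappa = rng.dirichlet(np.ones(3), size=5).T   # shape (3, 5); each column is a distribution

f = lambda u: -np.log(u)                      # convex, with f(1) = 0

before = f_divergence(p, q, f)
after = f_divergence(push_forward(p, kappa), push_forward(q, kappa), f)

# A relabelling of outcomes is invertible, hence sufficient: it saturates the bound.
perm = np.eye(5)[::-1]
after_perm = f_divergence(push_forward(p, perm), push_forward(q, perm), f)

print(before, after, after_perm)
```

A generic noisy channel strictly decreases the divergence, while the permutation leaves it unchanged.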
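Within this still-broad class of f-divergences, one finds the $\alpha$-divergence $D^{(\alpha)}:=D_{f^{(\alpha)}}$, defined for all $\alpha\in\mathbb{R}$ via
$$f^{(\alpha)}(u)=\begin{cases}\frac{4}{1-\alpha^2}\left(1-u^{(1+\alpha)/2}\right)&\alpha\neq\pm1\\u\log u&\alpha=1\\-\log u&\alpha=-1\end{cases} \tag{20}$$
which yields, for $\alpha\neq\pm1$,
$$D^{(\alpha)}(p\|q)=\frac{4}{1-\alpha^2}\left(1-\int\!\mathrm{d}x\,p(x)^{\frac{1-\alpha}{2}}q(x)^{\frac{1+\alpha}{2}}\right), \tag{21}$$
and, for $\alpha=\pm1$,
$$D^{(-1)}(p\|q)=D^{(1)}(q\|p)=\int\!\mathrm{d}x\,p(x)\log\frac{p(x)}{q(x)}. \tag{22}$$
Of course, this last is none other than the relative entropy or Kullback-Leibler divergence!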
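The following sketch evaluates the $\alpha$-divergence on a finite set and checks two of its features numerically: the limit $\alpha\to-1$ reproduces the relative entropy, and swapping the arguments is equivalent to flipping the sign of $\alpha$. It assumes the standard closed form $\frac{4}{1-\alpha^2}\big(1-\sum_xp^{(1-\alpha)/2}q^{(1+\alpha)/2}\big)$ for $\alpha\neq\pm1$; the test distributions are arbitrary.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """alpha-divergence between strictly positive distributions on a finite set."""
    if np.isclose(alpha, -1.0):
        return float(np.sum(p * np.log(p / q)))   # relative entropy of p with respect to q
    if np.isclose(alpha, 1.0):
        return float(np.sum(q * np.log(q / p)))   # the same with arguments swapped
    overlap = np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2))
    return float(4.0 / (1.0 - alpha**2) * (1.0 - overlap))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.1, 0.25, 0.25])

kl = alpha_divergence(p, q, -1.0)
near_kl = alpha_divergence(p, q, -1.0 + 1e-4)     # approaches kl as alpha -> -1
hellinger_like = alpha_divergence(p, q, 0.0)      # symmetric in p and q

print(kl, near_kl, hellinger_like)
```

At $\alpha=0$ the divergence is symmetric (it is proportional to the squared Hellinger distance), which is the self-dual case.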
The $\alpha$-divergence $D^{(\alpha)}$ is so named because it naturally induces the dual structure $(g,\nabla^{(\alpha)},\nabla^{(-\alpha)})$, i.e., the Fisher metric and (dual) $(\pm\alpha)$-connections. Let us now explore this duality in more detail, which will lead us to a deeper appreciation for the divergence $D^{(\pm1)}$.
Dually flat spaces
Suppose that, in a dual structure $(g,\nabla,\nabla^*)$, both connections are symmetric, as is the case for $\alpha$-connections. This implies that $\nabla$-flatness and $\nabla^*$-flatness of $S$ are equivalent. In particular, we’ve seen that e-families are 1-flat, and m-families are $(-1)$-flat, but the above implies that in fact they are both $(\pm1)$-flat. In general, if both duals $\nabla$ and $\nabla^*$ are flat, we call $(S,g,\nabla,\nabla^*)$ a dually flat space. Such spaces have a number of properties that are closely related to various concepts in statistics.
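By definition, there exist $\nabla$-affine and $\nabla^*$-affine coordinate systems $[\theta^i]$ and $[\eta_j]$, respectively, in which the coordinate vector fields are denoted $\partial_i:=\partial/\partial\theta^i$ and $\partial^j:=\partial/\partial\eta_j$. Since $\partial_i$ is $\nabla$-parallel and $\partial^j$ is $\nabla^*$-parallel, the condition on parallel transport (14) implies that $\langle\partial_i,\partial^j\rangle$ is in fact constant on $S$. Furthermore, given $[\theta^i]$, affine transformations allow us to choose the dual coordinate system $[\eta_j]$ such that
$$\langle\partial_i,\partial^j\rangle=\delta_i^j. \tag{23}$$
Coordinate systems which satisfy this requirement are called mutually dual. Such coordinate systems are special to dually flat spaces, and do not exist for general Riemannian manifolds; conversely, if one finds such coordinates for a Riemannian manifold $(S,g)$, then the connections $\nabla$ and $\nabla^*$ for which they are affine are uniquely determined, and $(S,g,\nabla,\nabla^*)$ is a dually flat space.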
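A quick remark on notation: we shall denote the components of the metric with respect to $[\theta^i]$ and $[\eta_j]$ by
$$g_{ij}:=\langle\partial_i,\partial_j\rangle,\qquad g^{ij}:=\langle\partial^i,\partial^j\rangle. \tag{24}$$
Coordinate transformations between them are given by the usual expressions,
$$\partial^j=(\partial^j\theta^i)\,\partial_i,\qquad\partial_i=(\partial_i\eta_j)\,\partial^j. \tag{25}$$
In conjunction with (23), we therefore have
$$g_{ij}=\partial_i\eta_j,\qquad g^{ij}=\partial^i\theta^j,\qquad g_{ij}g^{jk}=\delta_i^k. \tag{26}$$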
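We may now introduce the potentials, which will prove useful below. At a mathematical level, these allow us to define the Legendre transformation that explicitly relates the dual coordinate systems $[\theta^i]$ and $[\eta_i]$. Consider a function $\psi:S\to\mathbb{R}$ that satisfies the following partial differential equation:
$$\partial_i\psi=\eta_i. \tag{27}$$
From the coordinate expressions above, $\partial_i\partial_j\psi=\partial_i\eta_j=g_{ij}$, and hence the second derivatives of $\psi$ comprise a positive definite matrix. Hence $\psi$ is a strictly convex function of $\theta$. Similarly, define $\varphi$ such that
$$\partial^i\varphi=\theta^i, \tag{28}$$
which is a strictly convex function of $\eta$, since $\partial^i\partial^j\varphi=g^{ij}$. These two functions are related via
$$\varphi=\theta^i\eta_i-\psi. \tag{29}$$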
To see this, simply take the differential $d\varphi=\theta^id\eta_i+\eta_id\theta^i-d\psi$, whereupon substituting in (27), i.e., $d\psi=\eta_id\theta^i$, one recovers (28). Now, the form of (29) (equivalently, the pair (27) and (28)) suggests that $\theta^i$, $\eta_i$ form a conjugate pair of coordinates for the functions $\psi$, $\varphi$ in the context of Legendre transforms. However, (29) does not quite suffice to define the Legendre transform, since we must remove the $\theta$ dependence on the right-hand side in order for it to be uniquely invertible. Note that (strict) convexity is important here, since this implies that the gradient map $\theta\mapsto\partial_i\psi(\theta)$ is 1:1, which in turn enables one, via the Legendre transform, to express all the information about a function in terms of its derivatives instead. By analogy with Fourier or Laplace transforms, this may lead to deeper mathematical/physical insight, or simple convenience, depending on the application. Hence, since we’re dealing with strictly convex functions, we define the Legendre transforms
$$\varphi(q)=\max_{p\in S}\left\{\theta^i(p)\eta_i(q)-\psi(p)\right\},\qquad\psi(p)=\max_{q\in S}\left\{\theta^i(p)\eta_i(q)-\varphi(q)\right\}. \tag{30}$$
Thus we see that the dual/conjugate coordinate systems $[\theta^i]$ and $[\eta_i]$ are related by the Legendre transform given in terms of the potentials $\psi$ and $\varphi$, which provides another explanation for the name.
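Incidentally, in addition to the compact expressions for the metric above, namely
$$g_{ij}=\partial_i\partial_j\psi,\qquad g^{ij}=\partial^i\partial^j\varphi, \tag{31}$$
the potentials also enable one to neatly express the connection coefficients as
$$\Gamma^*_{ij,k}:=\langle\nabla^*_{\partial_i}\partial_j,\partial_k\rangle=\partial_i\partial_j\partial_k\psi,\qquad\Gamma^{ij,k}:=\langle\nabla_{\partial^i}\partial^j,\partial^k\rangle=\partial^i\partial^j\partial^k\varphi. \tag{32}$$
Now, we mentioned earlier that any divergence induces a (torsion-free) dual structure, and vice-versa. But this map is not bijective; rather, a given dual structure admits infinitely many divergences. In the case of dually flat spaces however, the potentials defined above allow us to introduce a canonical divergence which is in some sense unique, namely
$$D(p\|q):=\psi(p)+\varphi(q)-\theta^i(p)\,\eta_i(q), \tag{33}$$
which of course satisfies $D(p\|q)\geq0$, with equality iff $p=q$. To see that $D$ is a divergence which induces the metric $g$, observe that
$$D\big((\partial_i\partial_j)_p\|q\big)=g_{ij}(p),\qquad D\big(p\|(\partial^i\partial^j)_q\big)=g^{ij}(q), \tag{34}$$
which further implies $\nabla^{(D)}=\nabla$ and $\nabla^{(D^*)}=\nabla^*$, with $\Gamma^{(D)}_{ij,k}=\Gamma^{(D^*)\,ij,k}=0$ due to the $\nabla$-, resp. $\nabla^*$-affinity of $[\theta^i]$, $[\eta_i]$, where the dual divergence is $D^*(p\|q)=D(q\|p)$.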
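In practice, once a strictly convex potential $\psi(\theta)$ is given, the canonical divergence can be evaluated without ever writing down $\varphi$ in closed form, since $\eta=\partial\psi$ and $\varphi=\theta^i\eta_i-\psi$. The sketch below assumes exactly these relations; the potential $\psi(\theta)=\sum_ie^{\theta^i}$ is a hypothetical choice made purely because it is strictly convex and has a simple gradient.

```python
import numpy as np

def canonical_divergence(psi, grad_psi, theta_p, theta_q):
    """D(p||q) = psi(p) + phi(q) - theta(p).eta(q), with phi = theta.eta - psi."""
    eta_q = grad_psi(theta_q)                    # dual coordinates of q
    phi_q = theta_q @ eta_q - psi(theta_q)       # dual potential evaluated at q
    return float(psi(theta_p) + phi_q - theta_p @ eta_q)

# Hypothetical strictly convex potential and its gradient (illustration only).
psi = lambda th: float(np.sum(np.exp(th)))
grad_psi = lambda th: np.exp(th)

theta_p = np.array([0.2, -1.0, 0.5])
theta_q = np.array([-0.3, 0.4, 0.1])

d_pq = canonical_divergence(psi, grad_psi, theta_p, theta_q)
d_qp = canonical_divergence(psi, grad_psi, theta_q, theta_p)
d_pp = canonical_divergence(psi, grad_psi, theta_p, theta_p)

print(d_pq, d_qp, d_pp)
```

Written out, this is $\psi(\theta_p)-\psi(\theta_q)-\eta_i(q)\,(\theta_p^i-\theta_q^i)$, i.e., the gap between $\psi$ and its tangent plane at $q$, which makes both the positivity and the asymmetry manifest.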
Generic properties of such canonical divergences are examined in section 3.4. Instead, I’m going to close this post by jumping ahead to 3.5, which connects this discussion back to the e- and m-families and associated $(\pm1)$-connections we spent so much time on above.
Dual structure of exponential families
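Recall that an exponential family consists of distributions of the form
$$p(x;\theta)=\exp\left[C(x)+\theta^iF_i(x)-\psi(\theta)\right], \tag{35}$$
where the canonical parameters $[\theta^i]$ constitute a 1-affine coordinate system. The notation $\psi$ here was not chosen arbitrarily, but is in fact precisely the potential introduced above. This is a rather elegant fact that ties together some fundamental notions from statistics with the existence of the dual coordinates $[\eta_i]$. Suppose we didn’t know that $\eta$ is the dual coordinate to $\theta$, and simply defined
$$\eta_i:=E_\theta[F_i]=\int\!\mathrm{d}x\,F_i(x)\,p(x;\theta). \tag{36}$$
Then, differentiating the normalization condition $\int\!\mathrm{d}x\,p(x;\theta)=1$ with respect to $\theta^i$, it follows that
$$\eta_i=\partial_i\psi. \tag{37}$$
Additionally, the expression of the metric in the form $g_{ij}=-E_\theta[\partial_i\partial_j\log p(x;\theta)]$ implies that for an exponential family, $g_{ij}=\partial_i\partial_j\psi$. Thus (36) suffices to identify $[\eta_i]$ as the conjugate parameter to $[\theta^i]$ with respect to the Legendre potential $\psi$. From the discussion above, this implies that $[\eta_i]$ is a $(-1)$-affine coordinate system dual to $[\theta^i]$. Given the form of (36), $[\eta_i]$ are sometimes also referred to as the expectation parameters.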
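It is a trivial exercise to work out $[\eta_i]$ and $\varphi$ for the example of the normal distribution in the previous post. A more interesting example is the case when $\mathcal{X}=\{x_0,x_1,\ldots,x_n\}$ is a finite set, whence $\mathcal{P}(\mathcal{X})$ is identified with the parameters:
$$p(x_i;\xi)=\begin{cases}\xi^i&1\leq i\leq n\\1-\sum_{j=1}^n\xi^j&i=0\end{cases} \tag{38}$$
(This is introduced as example 2.4 on page 27, and resurfaces as example 2.8 on page 35 and example 3.4 on page 66). Expressing this as an exponential family of the form (35) implies the parameter identifications
$$C(x)=0,\qquad F_i(x)=\delta_{x_i}(x),\qquad\theta^i=\log\frac{p(x_i)}{p(x_0)},\qquad\psi(\theta)=-\log p(x_0)=\log\Big(1+\sum_{i=1}^ne^{\theta^i}\Big). \tag{39}$$
Then the dual coordinates are simply given by the parameters themselves:
$$\eta_i=E_\theta[F_i]=p(x_i)=\xi^i. \tag{40}$$
Therefore, the dual potential (29) is
$$\varphi=\theta^i\eta_i-\psi=\sum_{x\in\mathcal{X}}p(x)\log p(x)=-H(p), \tag{41}$$
where $H(p)=-\sum_xp(x)\log p(x)$ is the entropy!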
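These identifications are straightforward to verify numerically. The sketch below assumes the standard exponential-family parametrization of a distribution on $\{x_0,\ldots,x_n\}$, namely $\theta^i=\log[p(x_i)/p(x_0)]$ and $\psi(\theta)=\log(1+\sum_ie^{\theta^i})$, and checks that $\eta_i=\partial_i\psi=p(x_i)$ and that $\theta^i\eta_i-\psi$ is minus the entropy. As a bonus, it evaluates $\psi(p)+\varphi(q)-\theta^i(p)\eta_i(q)$ and compares it with the relative entropy of $q$ with respect to $p$. All function names are hypothetical.

```python
import numpy as np

def theta_from_p(p):
    """Canonical parameters theta^i = log(p_i / p_0), i = 1..n."""
    return np.log(p[1:] / p[0])

def psi(theta):
    """Potential psi(theta) = log(1 + sum_i exp(theta^i)) = -log p_0."""
    return float(np.log1p(np.sum(np.exp(theta))))

def eta_from_theta(theta):
    """Expectation parameters eta_i = d psi / d theta^i."""
    w = np.exp(theta)
    return w / (1.0 + w.sum())

def phi(theta):
    """Dual potential theta . eta - psi, evaluated at the point labelled by theta."""
    return float(theta @ eta_from_theta(theta) - psi(theta))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.1, 0.25, 0.25])
theta_p, theta_q = theta_from_p(p), theta_from_p(q)

neg_entropy = float(np.sum(p * np.log(p)))
canonical = psi(theta_p) + phi(theta_q) - float(theta_p @ eta_from_theta(theta_q))
kl_q_p = float(np.sum(q * np.log(q / p)))

print(eta_from_theta(theta_p), p[1:])
print(phi(theta_p), neg_entropy)
print(canonical, kl_q_p)
```

Each printed pair should agree to rounding error.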
Finally, I promised you some statistical notions, which requires the introduction of one more concept, namely that of an estimator. Suppose that some data $x$ is generated according to some unknown probability distribution $p_\xi=p(x;\xi)$. We wish to consider the problem of estimating the unknown parameter $\xi$ by some function $\hat\xi(x)$ of the data $x$. The mapping $\hat\xi=[\hat\xi^i]:\mathcal{X}\to\mathbb{R}^n$ is called an estimator. Furthermore, $\hat\xi$ is an unbiased estimator if
$$E_\xi[\hat\xi(X)]=\xi\qquad\forall\,\xi, \tag{42}$$
which is the statement that, averaged over the data, $\hat\xi$ recovers the true parameter, whatever its value. The mean-squared error of such an estimator may be expressed via the variance-covariance matrix $V_\xi[\hat\xi]=[v^{ij}_\xi]$, whose elements are defined via
$$v^{ij}_\xi:=E_\xi\big[(\hat\xi^i(X)-\xi^i)(\hat\xi^j(X)-\xi^j)\big]. \tag{43}$$
Lastly, the Cramér-Rao inequality (Theorem 2.2) states that the variance-covariance matrix of an unbiased estimator satisfies
$$V_\xi[\hat\xi]\geq G(\xi)^{-1}, \tag{44}$$
i.e., the difference $V_\xi[\hat\xi]-G(\xi)^{-1}$ is a positive semidefinite matrix, where $G(\xi)$ is the Fisher information matrix. An unbiased estimator $\hat\xi$ which saturates this inequality for all $\xi$ is an efficient estimator, which means it has the minimum variance among all unbiased estimators (note however that the converse is not always true).
Efficient estimators do not exist for arbitrary coordinates $\xi$. Rather, a necessary and sufficient condition for a model $S$ with coordinates $\xi$ to have an efficient estimator is that $S$ be an exponential family, for which $\xi$ is m-affine (Theorem 3.12).
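To illustrate this, let us now regard $F=[F_i]$ as an estimator for the parameter $\eta$, which we shall henceforth denote $\hat\eta$ to make contact with the notation just introduced. Then the condition (36) implies that $\hat\eta$ is in fact an unbiased estimator for $\eta$. Furthermore, since the Fisher information matrix can be expressed in terms of $\theta$ and $\eta$ as
$$g_{ij}=E_\theta\big[\partial_i\log p\;\partial_j\log p\big]=E_\theta\big[(F_i-\eta_i)(F_j-\eta_j)\big], \tag{45}$$
it follows that $V_\eta[\hat\eta]=[g_{ij}]$, which is the inverse of the Fisher information matrix $[g^{ij}]$ with respect to $\eta$, and hence $\hat\eta$ is an efficient estimator for the (m-affine) coordinate system $[\eta_i]$.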
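For the finite sample space considered earlier, this saturation can be checked exactly rather than by simulation. The sketch below assumes the statistics are indicators, $F_i(x)=1$ if $x=x_i$ and $0$ otherwise, so that $\eta_i=p(x_i)$, and uses the closed form $\delta_{ij}/\eta_i+1/p(x_0)$ for the Fisher matrix in the $\eta$ coordinates (a standard result for the categorical distribution, stated here as an assumption). It then verifies that the covariance of $F$ is the inverse of that matrix.

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])       # probabilities of x_0, ..., x_n
eta = p[1:]                              # expectation parameters eta_i = p(x_i)
n = len(eta)

# F[i, x] = 1 if x is the (i+1)-th outcome, else 0: one row per statistic F_i.
F = np.hstack([np.zeros((n, 1)), np.eye(n)])

mean_F = F @ p                           # E[F_i]; should reproduce eta (unbiasedness)
centered = F - mean_F[:, None]
cov_F = (centered * p) @ centered.T      # exact variance-covariance matrix of F

# Fisher information in the eta coordinates: delta_ij / eta_i + 1 / p_0.
fisher_eta = np.diag(1.0 / eta) + 1.0 / p[0]

print(mean_F)
print(cov_F @ fisher_eta)                # should be the identity matrix
```

Since the product is the identity, the variance-covariance matrix equals the inverse Fisher matrix, i.e., the Cramér-Rao bound is saturated for this single-sample estimator.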
The above is of course only a superficial hint of the deeper connections between information geometry and statistics (which was, after all, the prime impetus for the former’s development), but already suggests interesting mathematical utility for such problems as maximum likelihood estimation (MLE), and machine learning in general. Alas, that’s a topic for another post.
As a final comment: everything I’ve said so far is for classical probability distributions, but much of this machinery can be extended to quantum mechanics as well (insofar as the latter can be viewed as an extension of probability theory). Quantum information geometry is briefly introduced in chapter 7, and I hope to return to it in part 3 of this series soon.
These three posts are a wonderful review of information geometry! Thanks so much!
A quick note, in equation (30), I think the RHS of the left expression should have a psi instead of a phi.
Thanks JG, I’m delighted you found them helpful!
Whoops, indeed that’s a typo; fixed now. Thanks for the catch!