In the previous post, we introduced the $\alpha$-connection, and alluded to a dualistic structure between $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$. In particular, the cases $\alpha=\pm1$ are intimately related to two important families of statistical models, the exponential or e-family with affine connection $\nabla^{(e)}:=\nabla^{(1)}$, and the mixture or m-family with affine connection $\nabla^{(m)}:=\nabla^{(-1)}$. Hence before turning to general aspects of the dual structure on $S$, it is illuminating to see how these families/connections emerge naturally via an embedding formalism.
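For concreteness (or rather, to err on the safe side of mathematical rigour), let $\mathcal{X}$ be a finite set. As before, an arbitrary model $S$ on $\mathcal{X}$ is a submanifold of

$$\mathcal{P}(\mathcal{X}) := \Big\{ p:\mathcal{X}\to\mathbb{R} \;\Big|\; p(x)>0\ \ \forall x\in\mathcal{X},\ \ \sum_x p(x)=1 \Big\} \tag{1}$$

which in turn is a subset of the set of all $\mathbb{R}$-valued functions on $\mathcal{X}$, denoted

$$\mathbb{R}^{\mathcal{X}} := \{ A \mid A:\mathcal{X}\to\mathbb{R} \} \tag{2}$$

Note that since $\mathcal{X}$ is finite, the normalization constraint is simply the imposition that $\sum_x p(x)=1$, which implies that $\mathcal{P}=\mathcal{P}(\mathcal{X})$ is specifically an open subset of the affine subspace

$$\mathcal{A}_1 := \Big\{ A\in\mathbb{R}^{\mathcal{X}} \;\Big|\; \sum_x A(x)=1 \Big\} \tag{3}$$

which further implies that the tangent space $T_p(\mathcal{P})$ is naturally identified with the linear subspace

$$\mathcal{A}_0 := \Big\{ A\in\mathbb{R}^{\mathcal{X}} \;\Big|\; \sum_x A(x)=0 \Big\} \tag{4}$$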
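(One can see this by differentiating the constraint along a curve $p_t$ in $\mathcal{P}$: since $\sum_x p_t(x)=1$ for all $t$, the velocity satisfies $\sum_x \dot p_t(x)=0$.) Vectors $X\in T_p(\mathcal{P})$, considered as elements of $\mathcal{A}_0$, will be denoted $X^{(m)}$. These define the mixture or m-representation of $X$, i.e.,

$$T_p^{(m)}(\mathcal{P}) := \big\{ X^{(m)} \mid X\in T_p(\mathcal{P}) \big\} = \mathcal{A}_0 \tag{5}$$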
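As before, the natural basis $\partial_i=\partial/\partial\xi^i$ then defines the m-affine coordinates $\xi=[\xi^i]$, with $(\partial_i)^{(m)}=\partial_i p_\xi$, with respect to which $\mathcal{P}$ is m-flat. The natural connection induced from this affine structure is the m-connection $\nabla^{(m)}$ introduced before, under which parallel transport $\Pi^{(m)}_{p,q}:T_p(\mathcal{P})\to T_q(\mathcal{P})$ leaves the m-representation of a vector unchanged, i.e.,

$$\big(\Pi^{(m)}_{p,q}X\big)^{(m)} = X^{(m)} \tag{6}$$

Thus we see that the natural embedding of $\mathcal{P}$ into $\mathbb{R}^{\mathcal{X}}$ makes the affine structure, and in particular the significance of the m-connection, manifest.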
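Now consider the alternative embedding $p\mapsto\log p$, and identify $\mathcal{P}$ with the subset $\{\log p \mid p\in\mathcal{P}\}\subset\mathbb{R}^{\mathcal{X}}$. Elements $X\in T_p(\mathcal{P})$ are then represented by applying $X$ to the map $p\mapsto\log p$, whereupon we denote them $X^{(e)}$ and call this the exponential or e-representation. In local coordinates, we have $(\partial_i)^{(e)}=\partial_i\log p_\xi$. By the chain rule, $X^{(e)}$ is then related to $X^{(m)}$ as

$$X^{(e)}(x) = \frac{X^{(m)}(x)}{p(x)} \tag{7}$$

and therefore the tangent space is obtained by modifying the constraint on $\mathcal{A}_0$ above to $\sum_x A(x)\,p(x)=0$, i.e.,

$$T_p^{(e)}(\mathcal{P}) := \big\{ X^{(e)} \mid X\in T_p(\mathcal{P}) \big\} = \big\{ A\in\mathbb{R}^{\mathcal{X}} \mid E_p[A]=0 \big\} \tag{8}$$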
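One sees immediately that unlike $T_p^{(m)}(\mathcal{P})$, $T_p^{(e)}(\mathcal{P})$ depends on $p$, and therefore parallel transport cannot simply preserve $X^{(e)}$. However, while an element $A\in T_p^{(e)}(\mathcal{P})$ does not generally belong to $T_q^{(e)}(\mathcal{P})$ for $q\neq p$, the constraint in (8) implies that the shifted vector $A-E_q[A]$ does belong to $T_q^{(e)}(\mathcal{P})$. We therefore have

$$\big(\Pi^{(e)}_{p,q}X\big)^{(e)} = X^{(e)} - E_q\big[X^{(e)}\big] \tag{9}$$

where $\Pi^{(e)}_{p,q}$ denotes parallel transport with respect to the e-connection $\nabla^{(e)}$. (Properly showing that $X$ is e-parallel under (9) is of course slightly more involved, and is done on page 41). Thus the exponential (or, if one prefers, logarithmic) embedding of $\mathcal{P}$ into $\mathbb{R}^{\mathcal{X}}$ leads naturally to the e-affine coordinate system and associated connection introduced in the previous post.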
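To make the two representations concrete, here is a minimal Python sketch on a hypothetical three-point sample space. The distributions `p` and `q`, the tangent vector `X_m`, and the helper `expect` are illustrative assumptions rather than anything taken from the book; the code only checks the constraints that define the m- and e-representations and the shift used in e-parallel transport.

```python
# Hypothetical 3-point sample space with two strictly positive distributions.
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]


def expect(dist, A):
    """Expectation E_dist[A] of a function A on the finite sample space."""
    return sum(d * a for d, a in zip(dist, A))


# A tangent vector at p in the m-representation: its components sum to zero.
X_m = [0.1, -0.3, 0.2]

# e-representation via the chain rule: X^(e)(x) = X^(m)(x) / p(x).
X_e = [xm / px for xm, px in zip(X_m, p)]

# e-parallel transport from p to q subtracts the expectation taken at q.
shift = expect(q, X_e)
X_e_at_q = [xe - shift for xe in X_e]

print(sum(X_m))             # ~0: X^(m) lies in A_0
print(expect(p, X_e))       # ~0: X^(e) lies in T_p^(e)
print(expect(q, X_e))       # nonzero here: X^(e) is not in T_q^(e)
print(expect(q, X_e_at_q))  # ~0: the shifted vector lies in T_q^(e)
```

The third printed value is what forces the shift: the zero-mean condition is taken with respect to the base point, so it has to be re-imposed at `q`.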
It turns out that the e-representation has a number of interesting properties that allow one to establish close connections between the Fisher metric and various statistical notions. Note that, due to the presence of the log, the Fisher metric can be neatly expressed in the e-representation as

$$\langle X,Y\rangle_p = E_p\big[X^{(e)}Y^{(e)}\big] \tag{10}$$

Furthermore, with a bit more technical footwork involving cotangent spaces and the like, one can derive a number of fundamental relations. For example, the variance of a random variable $A\in\mathbb{R}^{\mathcal{X}}$ is determined by the sensitivity of $E_p[A]$ to perturbations of $p$, and may be expressed as

$$V_p[A] = \big\|(dE[A])_p\big\|_p^2 \tag{11}$$

where $(dE[A])_p\in T_p^*(\mathcal{P})$ is the differential of $E[A]:p\mapsto E_p[A]$ at $p$, and the norm is the one induced on the cotangent space by the Fisher metric. See section 2.5 for more details.
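The variance relation can be checked numerically on a finite sample space. The sketch below is only an illustration under stated assumptions: it uses the coordinates $\xi^i=p(x_i)$ for $i=1,\dots,n$ (with $p(x_0)=1-\sum_i\xi^i$), for which the Fisher metric is $g_{ij}=\delta_{ij}/\xi^i+1/p(x_0)$ and its inverse is $g^{ij}=\xi^i\delta_{ij}-\xi^i\xi^j$. The distribution `p` and random variable `A` are made-up values.

```python
# Hypothetical distribution on {x_0, ..., x_3} and an arbitrary random variable A.
p = [0.1, 0.2, 0.3, 0.4]      # p[0] = p(x_0)
A = [1.0, -2.0, 0.5, 3.0]
n = len(p) - 1
xi = p[1:]                    # coordinates xi^i = p(x_i), i = 1..n

# Components of dE[A]: d/dxi^i of sum_x A(x) p(x) is A(x_i) - A(x_0).
omega = [A[i + 1] - A[0] for i in range(n)]

# Fisher metric in these coordinates and its claimed inverse.
g = [[(1.0 / xi[i] if i == j else 0.0) + 1.0 / p[0] for j in range(n)]
     for i in range(n)]
g_inv = [[(xi[i] if i == j else 0.0) - xi[i] * xi[j] for j in range(n)]
         for i in range(n)]

# Sanity check that g_inv really inverts g.
identity_error = max(
    abs(sum(g[i][k] * g_inv[k][j] for k in range(n)) - (1.0 if i == j else 0.0))
    for i in range(n) for j in range(n)
)

# Squared norm of the covector dE[A], compared with the variance of A.
norm_sq = sum(omega[i] * g_inv[i][j] * omega[j]
              for i in range(n) for j in range(n))
mean = sum(px * a for px, a in zip(p, A))
variance = sum(px * (a - mean) ** 2 for px, a in zip(p, A))

print(identity_error, norm_sq, variance)
```

The covector norm uses the inverse metric, which is why `g_inv` rather than `g` appears in `norm_sq`.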
As alluded to previously, the geometrical arguments are somewhat less rigorous for the case when $\mathcal{X}$ is infinite. Intuitively, the issue is that the identification of the tangent space $T_p(\mathcal{P})$ with a subspace of $\mathbb{R}^{\mathcal{X}}$, and by extension the isometry between its different representations, requires that the spaces being identified have the same dimensionality. For finite $\mathcal{X}$, the constraint is “loose enough” for this to be possible, but this constraint becomes increasingly strict as the cardinality of $\mathcal{X}$ increases (i.e., the portion of $\mathbb{R}^{\mathcal{X}}$ identified with $\mathcal{P}$ decreases). For infinite $\mathcal{X}$, one effectively has an infinite number of constraints, which makes this dimensional matching impossible. Nonetheless, I’m given to understand that much of the framework can still be made to go through; see the references at the end of section 2.5 and associated discussion.
Dual structures on $S$
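We’ve alluded to a duality between $\nabla^{(e)}$ and $\nabla^{(m)}$, but in fact this extends to general $\alpha$-connections. Hence, rather than consider a particular connection in isolation, the fundamental structure in information geometry is the triple $(g,\nabla,\nabla^*)$, which Amari & Nagaoka call a dualistic structure on $S$ (we prefer the term “dual structure” instead, and will use this henceforth). Formally, suppose one has two affine connections $\nabla$ and $\nabla^*$ on a manifold $S$ with Riemannian metric $g=\langle\cdot,\cdot\rangle$. If these satisfy

$$Z\langle X,Y\rangle = \langle\nabla_ZX,Y\rangle + \langle X,\nabla^*_ZY\rangle \tag{12}$$

for all vector fields $X,Y,Z$, then $\nabla$ and $\nabla^*$ are duals with respect to $g$, and one is called the dual or conjugate connection of the other. In local coordinates, this condition reads

$$\partial_kg_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i} \tag{13}$$

which follows from the definition $\Gamma_{ij,k}:=\langle\nabla_{\partial_i}\partial_j,\partial_k\rangle$, and similarly for $\Gamma^*_{ij,k}$, with basis vectors $\partial_i=\partial/\partial\xi^i$ etc. Given a metric $g$ and connection $\nabla$ on $S$, the dual connection $\nabla^*$ always exists and is unique, and satisfies $(\nabla^*)^*=\nabla$. Additionally, the combination $(\nabla+\nabla^*)/2$ is metric. Moreover, one immediately sees from (12) that the condition for $\nabla$ to be metric is that $\nabla=\nabla^*$, and in this sense dual connections simply constitute a generalization of metric connections.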
In particular, $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are dual with respect to the Fisher metric. This follows readily from the general expressions for the $\alpha$-connection developed in section 2.6, which we’ve skipped over here. Suffice to say the framework of $\pm1$-connections admits a straightforward extension to arbitrary $\alpha\in\mathbb{R}$.
The significance of dual connections is neatly illustrated by considering the parallel transport along a curve $\gamma$ from $p$ to $q$ with respect to $\nabla$ and $\nabla^*$, respectively denoted $\Pi_\gamma$ and $\Pi^*_\gamma$. As mentioned in the previous post, general connections are not metric; but the dual structure allows one to generalize the notion of preservation of the inner product along a curve, namely:

$$\big\langle\Pi_\gamma X,\Pi^*_\gamma Y\big\rangle_q = \langle X,Y\rangle_p \tag{14}$$

And furthermore, the relationship between $\Pi_\gamma$ and $\Pi^*_\gamma$ is completely fixed by this condition.
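As a sanity check on this generalized preservation of the inner product, the following Python sketch transports one vector with the e-connection and the other with the m-connection between two points of a finite probability simplex, and compares Fisher inner products. All numerical values and helper names (`fisher`, `expect`) are hypothetical; the transport rules are the ones described for the m- and e-representations earlier in this post.

```python
# Hypothetical distributions on a 3-point sample space.
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]


def expect(dist, A):
    """Expectation E_dist[A]."""
    return sum(d * a for d, a in zip(dist, A))


def fisher(point, X_m, Y_m):
    """Fisher inner product of two tangent vectors given in the m-representation."""
    return sum(x * y / r for x, y, r in zip(X_m, Y_m, point))


# Two tangent vectors at p (m-representation: components sum to zero).
X_m = [0.1, -0.3, 0.2]
Y_m = [-0.2, 0.05, 0.15]

# e-transport of X from p to q: shift the e-representation, then convert
# back to the m-representation at q by multiplying with q(x).
X_e = [xm / px for xm, px in zip(X_m, p)]
shift = expect(q, X_e)
X_m_at_q = [qx * (xe - shift) for qx, xe in zip(q, X_e)]

# m-transport of Y from p to q: the m-representation is unchanged.
Y_m_at_q = list(Y_m)

inner_at_p = fisher(p, X_m, Y_m)
inner_dual = fisher(q, X_m_at_q, Y_m_at_q)   # one vector per connection
inner_m_only = fisher(q, X_m, Y_m)           # both transported with nabla^(m)

print(inner_at_p, inner_dual, inner_m_only)
```

Transporting both vectors with the same connection (`inner_m_only`) does not reproduce the original inner product, which is the sense in which neither connection is metric on its own.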
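The dual structure of statistical manifolds facilitates the introduction of a distance-like measure $D$, which enables us to compute something like a geometrical distance between distributions. In particular, we shall define a smooth function $D=D(\cdot\|\cdot):S\times S\to\mathbb{R}$ such that $D(p\|q)\geq0$, with equality iff $p=q$. While this provides a measure of the separation between $p$ and $q$, it is in general asymmetric and need not satisfy the triangle inequality, and hence fails the conditions for a distance function. However, if $D$ further satisfies

$$D[\partial_i\partial_j\|\cdot] = D[\cdot\|\partial_i\partial_j] = -D[\partial_i\|\partial_j] =: g^{(D)}_{ij} \tag{15}$$

where $[g^{(D)}_{ij}]$ is a positive definite matrix everywhere on $S$, then $D$ is a divergence or contrast function on $S$. (Here $D[\partial_i\|\partial_j]$ is shorthand for $\partial_i\partial'_jD(p_\xi\|p_{\xi'})\big|_{\xi'=\xi}$, with $\partial_i$ acting on the first argument and $\partial'_j$ on the second, and similarly for the other brackets.) Note that a divergence uniquely defines a Riemannian metric via

$$\langle X,Y\rangle^{(D)} := -D[X\|Y] \tag{16}$$

i.e., $g^{(D)}_{ij}=-D[\partial_i\|\partial_j]$. We can also define an affine connection $\nabla^{(D)}$ associated with this divergence, via the coefficients

$$\Gamma^{(D)}_{ij,k} := -D[\partial_i\partial_j\|\partial_k] \tag{17}$$

Similarly, we define the dual $D^*(p\|q):=D(q\|p)$, from which we have $g^{(D^*)}=g^{(D)}$ and $\Gamma^{(D^*)}_{ij,k}=-D[\partial_k\|\partial_i\partial_j]$. Thus we have that $\nabla^{(D)}$ and $\nabla^{(D^*)}$ are dual with respect to $g^{(D)}$. In fact, any torsion-free dual structure is naturally induced from a divergence!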
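To illustrate how a divergence induces a metric, here is a small numerical sketch. It assumes a one-parameter Bernoulli model with success probability $\xi$ and takes the Kullback-Leibler divergence as the divergence $D$; the function names `kl` and `induced_metric` and the step size `h` are illustrative choices. The mixed second derivative on the diagonal is approximated by a central finite difference and compared with the known Fisher information $1/(\xi(1-\xi))$ of the Bernoulli model.

```python
import math


def kl(a, b):
    """KL divergence between Bernoulli distributions with success probabilities a, b."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def induced_metric(D, xi, h=1e-3):
    """g = -(d/dxi)(d/dxi') D(xi || xi') at xi' = xi, via a central mixed difference."""
    mixed = (D(xi + h, xi + h) - D(xi + h, xi - h)
             - D(xi - h, xi + h) + D(xi - h, xi - h)) / (4 * h * h)
    return -mixed


xi = 0.3
g_numeric = induced_metric(kl, xi)
g_fisher = 1.0 / (xi * (1 - xi))   # Fisher information of the Bernoulli model

print(g_numeric, g_fisher)
```

The two numbers agree up to the finite-difference error, which is the one-dimensional version of the statement that the metric induced by this divergence is the Fisher metric.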
Of course, the above is quite general: there are infinitely many different divergences one could define on a manifold. Henceforth we shall specialize to a particularly important class for statistical models, known as the f-divergence:

$$D_f(p\|q) := \sum_x p(x)\,f\!\left(\frac{q(x)}{p(x)}\right) \tag{18}$$

where $f$ is an arbitrary convex function on $(0,\infty)$ with $f(1)=0$. The f-divergence satisfies a number of important properties (see page 56), chief of which is monotonicity under arbitrary probability transition functions. That is, let $x$ be randomly transformed into $y$ with probability $\kappa(y|x)$, with $\kappa(y|x)\geq0$ and $\sum_y\kappa(y|x)=1$. This maps distributions $p,q$ to $p_\kappa,q_\kappa$, respectively, where $p_\kappa(y):=\sum_x\kappa(y|x)\,p(x)$. Monotonicity of the f-divergence is then the statement that

$$D_f(p\|q) \geq D_f(p_\kappa\|q_\kappa) \tag{19}$$

Spoiler alert: this will surface again in the form of monotonicity of relative entropy below! Note that $\kappa$ is a generalization of the deterministic mapping induced by $F:\mathcal{X}\to\mathcal{Y}$ in the previous post, which corresponds to $\kappa(y|x)=\delta_{F(x)}(y)$. Consequently, the equality is saturated iff $\kappa$ is induced from a sufficient statistic with respect to $\{p,q\}$.
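Monotonicity is easy to observe numerically. The sketch below is a minimal illustration with made-up inputs: `kappa[y][x]` is a hypothetical transition matrix whose columns sum to one, and two convex functions are tried, $f(u)=-\log u$ (which gives the relative entropy) and $f(u)=4(1-\sqrt{u})$. A permutation of the sample space is included as an example of an invertible deterministic map, for which the divergence is unchanged.

```python
import math


def f_divergence(f, p, q):
    """D_f(p || q) = sum_x p(x) f(q(x) / p(x)) on a finite sample space."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))


def push_forward(kappa, p):
    """Image of p under the transition kappa[y][x] = Prob(x -> y)."""
    return [sum(row[x] * p[x] for x in range(len(p))) for row in kappa]


def f_kl(u):
    return -math.log(u)


def f_sqrt(u):
    return 4.0 * (1.0 - math.sqrt(u))


p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

# Hypothetical noisy map from 3 outcomes to 2; each column sums to one.
kappa = [[0.9, 0.5, 0.1],
         [0.1, 0.5, 0.9]]

# Deterministic, invertible relabeling of the outcomes.
perm = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 0.0, 0.0]]

for f in (f_kl, f_sqrt):
    before = f_divergence(f, p, q)
    after = f_divergence(f, push_forward(kappa, p), push_forward(kappa, q))
    relabeled = f_divergence(f, push_forward(perm, p), push_forward(perm, q))
    print(before, after, relabeled)
```

In each printed row the second number is no larger than the first, while the third equals the first: coarse-graining loses distinguishability, relabeling does not.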
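Within this still-broad class of f-divergences, one finds the $\alpha$-divergence $D^{(\alpha)}:=D_{f^{(\alpha)}}$, defined for all $\alpha\in\mathbb{R}$ via

$$f^{(\alpha)}(u) := \begin{cases} \dfrac{4}{1-\alpha^2}\left(1-u^{(1+\alpha)/2}\right) & \alpha\neq\pm1 \\[2mm] u\log u & \alpha=1 \\[1mm] -\log u & \alpha=-1 \end{cases} \tag{20}$$

which yields, for $\alpha\neq\pm1$,

$$D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left(1-\sum_x p(x)^{\frac{1-\alpha}{2}}\,q(x)^{\frac{1+\alpha}{2}}\right) \tag{21}$$

and, for $\alpha=\pm1$,

$$D^{(-1)}(p\|q) = D^{(1)}(q\|p) = \sum_x p(x)\log\frac{p(x)}{q(x)} \tag{22}$$

Of course, this last is none other than the relative entropy or Kullback-Leibler divergence!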
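The way the relative entropy sits at the endpoints of the $\alpha$-family can be seen numerically. This is a hedged sketch rather than a reference implementation: the function `alpha_divergence` simply transcribes the standard Amari-Nagaoka definition of $D^{(\alpha)}$ on a finite sample space, and the distributions are arbitrary test values.

```python
import math


def kl(p, q):
    """Relative entropy sum_x p log(p / q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def alpha_divergence(p, q, alpha):
    """alpha-divergence D^(alpha)(p || q) on a finite sample space."""
    if alpha == -1:
        return kl(p, q)
    if alpha == 1:
        return kl(q, p)
    overlap = sum(pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
                  for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - overlap)


p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

near_minus_one = alpha_divergence(p, q, -1 + 1e-6)   # approaches KL(p || q)
near_plus_one = alpha_divergence(p, q, 1 - 1e-6)     # approaches KL(q || p)
symmetric = alpha_divergence(p, q, 0.0)              # alpha = 0 is symmetric

print(near_minus_one, kl(p, q))
print(near_plus_one, kl(q, p))
print(symmetric, alpha_divergence(q, p, 0.0))
```

The general pattern is $D^{(\alpha)}(p\|q)=D^{(-\alpha)}(q\|p)$: swapping the arguments flips the sign of $\alpha$, so only $\alpha=0$ gives a symmetric divergence.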
The $\alpha$-divergence is so named because it naturally induces the dual structure $(g,\nabla^{(\alpha)},\nabla^{(-\alpha)})$, i.e., the Fisher metric and (dual) $\pm\alpha$-connections. Let us now explore this duality in more detail, which will lead us to a deeper appreciation for the divergence $D^{(\alpha)}$.
Dually flat spaces
Suppose that, in a dual structure $(g,\nabla,\nabla^*)$, both connections are symmetric, as is the case for $\alpha$-connections. This implies that $\nabla$-flatness and $\nabla^*$-flatness of $S$ are equivalent. In particular, we’ve seen that e-families are 1-flat, and m-families are $(-1)$-flat, but the above implies that in fact they are both $(\pm1)$-flat. In general, if both duals $\nabla$ and $\nabla^*$ are flat, we call $(S,g,\nabla,\nabla^*)$ a dually flat space. Such spaces have a number of properties that are closely related to various concepts in statistics.
By definition, there exist $\nabla$-affine and $\nabla^*$-affine coordinate systems $[\theta^i]$ and $[\eta_j]$, respectively, whose coordinate basis vector fields are denoted $\partial_i:=\partial/\partial\theta^i$ and $\partial^j:=\partial/\partial\eta_j$. Since both $\nabla$ and $\nabla^*$ are flat, the condition on parallel transport (14) implies that $\langle\partial_i,\partial^j\rangle$ is in fact constant on $S$. Furthermore, given $[\theta^i]$, affine transformations allow us to choose the dual coordinate system $[\eta_j]$ such that

$$\langle\partial_i,\partial^j\rangle = \delta_i^j \tag{23}$$

Coordinate systems which satisfy this requirement are called mutually dual. Such coordinate systems are special to dually flat spaces, and do not exist for general Riemannian manifolds; conversely, if one finds such coordinates for a Riemannian manifold $(S,g)$, then the connections $\nabla$ and $\nabla^*$ for which they are affine are uniquely determined, and $(S,g,\nabla,\nabla^*)$ is a dually flat space.
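A quick remark on notation: we shall denote the components of the metric with respect to $[\theta^i]$ and $[\eta_j]$ by

$$g_{ij} := \langle\partial_i,\partial_j\rangle,\qquad g^{ij} := \langle\partial^i,\partial^j\rangle \tag{24}$$

Coordinate transformations between them are given by the usual expressions,

$$\partial^j = \frac{\partial\theta^i}{\partial\eta_j}\,\partial_i,\qquad \partial_i = \frac{\partial\eta_j}{\partial\theta^i}\,\partial^j \tag{25}$$

In conjunction with (23), we therefore have

$$\frac{\partial\eta_j}{\partial\theta^i} = g_{ij},\qquad \frac{\partial\theta^i}{\partial\eta_j} = g^{ij},\qquad g_{ij}\,g^{jk}=\delta_i^k \tag{26}$$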
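We may now introduce the potentials, which will prove useful below. At a mathematical level, these allow us to define the Legendre transformation that explicitly relates the dual coordinate systems $[\theta^i]$ and $[\eta_i]$. Consider a function $\psi:S\to\mathbb{R}$ that satisfies the following partial differential equation:

$$\partial_i\psi = \eta_i \tag{27}$$

From the coordinate expressions above, $\partial_i\partial_j\psi=\partial_i\eta_j=g_{ij}$, and hence the second derivatives of $\psi$ comprise a positive definite matrix. Hence $\psi$ is a strictly convex function of $\theta$. Similarly, define $\varphi:S\to\mathbb{R}$ such that

$$\partial^i\varphi = \theta^i \tag{28}$$

which is a strictly convex function of $\eta$, since $\partial^i\partial^j\varphi=g^{ij}$. These two functions are related via

$$\varphi = \theta^i\eta_i - \psi \tag{29}$$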
To see this, simply take the differential $d\varphi=\theta^i\,d\eta_i+\eta_i\,d\theta^i-d\psi$, whereupon substituting in (27) one recovers (28). Now, the form of (29) (equivalently, the pair (27) and (28)) suggests that $\theta$, $\eta$ form a conjugate pair of coordinates for the functions $\psi$, $\varphi$ in the context of Legendre transforms. However, (29) does not quite suffice to define the Legendre transform, since we must remove the dependence on the conjugate variable on the right-hand side in order for it to be uniquely invertible. Note that (strict) convexity is important here, since this implies a 1:1 map between a function and its first derivative, which in turn enables one, via the Legendre transform, to express all the information about a function in terms of its derivatives instead. By analogy with Fourier or Laplace transforms, this may lead to deeper mathematical/physical insight, or simple convenience, depending on the application. Hence, since we’re dealing with strictly convex functions, we define the Legendre transforms

$$\varphi(q) = \max_{p\in S}\big\{\theta^i(p)\,\eta_i(q)-\psi(p)\big\},\qquad \psi(p) = \max_{q\in S}\big\{\theta^i(p)\,\eta_i(q)-\varphi(q)\big\} \tag{30}$$

Thus we see that the dual/conjugate coordinate systems $[\theta^i]$ and $[\eta_i]$ are related by the Legendre transform given in terms of the potentials $\psi$ and $\varphi$, which provides another explanation for the name.
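A one-dimensional example may help fix ideas. The sketch below assumes the Bernoulli family written in exponential form, for which the potential is $\psi(\theta)=\log(1+e^\theta)$; this specific model, the grid-search bounds, and the function names are illustrative assumptions. It checks the three relations discussed above: $\eta=\partial\psi/\partial\theta$, $\varphi=\theta\eta-\psi$, and the maximization form of the Legendre transform.

```python
import math


def psi(theta):
    """Potential of the Bernoulli family in the natural parameter theta."""
    return math.log(1.0 + math.exp(theta))


def eta_of_theta(theta):
    """Dual coordinate eta = d(psi)/d(theta), here the logistic function."""
    return 1.0 / (1.0 + math.exp(-theta))


def phi(eta):
    """Dual potential, expected to equal theta * eta - psi(theta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)


theta = 0.7
eta = eta_of_theta(theta)

# First-order relation between the potentials.
phi_from_relation = theta * eta - psi(theta)

# Legendre transform as a maximization, done by brute-force grid search.
grid = [-10.0 + 0.001 * k for k in range(20001)]
phi_from_max = max(t * eta - psi(t) for t in grid)

# d(phi)/d(eta) should return theta (central finite difference).
h = 1e-6
theta_from_phi = (phi(eta + h) - phi(eta - h)) / (2 * h)

print(phi(eta), phi_from_relation, phi_from_max)
print(theta, theta_from_phi)
```

Strict convexity of `psi` is what makes the grid maximum unique and lets `theta` be recovered from the derivative of `phi`.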
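Incidentally, in addition to the compact expressions for the metric above, namely

$$g_{ij} = \partial_i\partial_j\psi,\qquad g^{ij} = \partial^i\partial^j\varphi \tag{31}$$

the potentials also enable one to neatly express the connection coefficients as

$$\Gamma^*_{ij,k} := \langle\nabla^*_{\partial_i}\partial_j,\partial_k\rangle = \partial_i\partial_j\partial_k\psi,\qquad \Gamma^{ij,k} := \langle\nabla_{\partial^i}\partial^j,\partial^k\rangle = \partial^i\partial^j\partial^k\varphi \tag{32}$$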
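Now, we mentioned earlier that any divergence induces a (torsion-free) dual structure, and vice-versa. But this map is not bijective; rather, a given dual structure admits infinitely many divergences. In the case of dually flat spaces however, the potentials defined above allow us to introduce a canonical divergence which is in some sense unique, namely

$$D(p\|q) := \psi(p)+\varphi(q)-\theta^i(p)\,\eta_i(q) \tag{33}$$

which of course satisfies $D(p\|q)\geq0$, with equality iff $p=q$. To see that $D$ is a divergence which induces the metric $g$, observe that

$$D\big((\partial_i\partial_j)_p\,\|\,q\big) = g_{ij}(p),\qquad D\big(p\,\|\,(\partial^i\partial^j)_q\big) = g^{ij}(q) \tag{34}$$

which further implies $\nabla^{(D)}=\nabla$ and $\nabla^{(D^*)}=\nabla^*$, with $\Gamma^{(D)}_{ij,k}=\Gamma^{(D^*)\,ij,k}=0$ due to the $\nabla$-, resp. $\nabla^*$-affinity of $[\theta^i]$, $[\eta_i]$, where the dual divergence is $D^*(p\|q)=D(q\|p)$.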
Generic properties of such canonical divergences are examined in section 3.4. Instead however, I’m going to close this post by jumping ahead to 3.5, which connects this discussion back to the e- and m-families and associated $\pm1$-connections we spent so much time on above.
Dual structure of exponential families
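Recall that an exponential family consists of distributions of the form

$$p(x;\theta) = \exp\big[C(x)+\theta^iF_i(x)-\psi(\theta)\big] \tag{35}$$

where the canonical parameters $[\theta^i]$ constitute a 1-affine coordinate system. The notation $\psi$ here was not chosen arbitrarily, but is in fact precisely the potential introduced above. This is a rather elegant fact that ties together some fundamental notions from statistics with the existence of the dual coordinates $[\eta_i]$. Suppose we didn’t know that $\eta$ is the dual coordinate to $\theta$, and simply defined

$$\eta_i := E_\theta[F_i] \tag{36}$$

Then, differentiating the normalization condition with respect to $\theta^i$ gives $E_\theta[F_i-\partial_i\psi]=0$, and it follows that

$$\eta_i = \partial_i\psi \tag{37}$$

Additionally, the expression of the metric in the form $g_{ij}=-E_\theta[\partial_i\partial_j\log p]$ implies that for an exponential family, $g_{ij}=\partial_i\partial_j\psi$. Thus (36) suffices to identify $\eta$ as the conjugate parameter to $\theta$ with respect to the Legendre potential $\psi$. From the discussion above, this implies that $[\eta_i]$ is a $(-1)$-affine coordinate system dual to $[\theta^i]$. Given the form of (36), $[\eta_i]$ are sometimes also referred to as the expectation parameters.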
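It is a trivial exercise to work out $\eta$ and $\varphi$ for the example of the normal distribution in the previous post. A more interesting example is the case when $\mathcal{X}=\{x_0,x_1,\ldots,x_n\}$ is a finite set, whence $\mathcal{P}(\mathcal{X})$ is identified with the parameters:

$$p(x_i;\xi) = \begin{cases} \xi^i & 1\leq i\leq n \\ 1-\sum_{j=1}^n\xi^j & i=0 \end{cases} \tag{38}$$

(This is introduced as example 2.4 on page 27, and resurfaces as example 2.8 on page 35 and example 3.4 on page 66). Expressing this as an exponential family of the form (35) implies the parameter identifications

$$C(x)=0,\qquad F_i(x)=\delta_{x_i}(x),\qquad \theta^i=\log\frac{p(x_i)}{p(x_0)}=\log\frac{\xi^i}{1-\sum_j\xi^j},\qquad \psi(\theta)=-\log p(x_0)=\log\Big(1+\sum_{i=1}^ne^{\theta^i}\Big) \tag{39}$$

Then the dual coordinates are simply given by the parameters themselves:

$$\eta_i = E_\theta[F_i] = p(x_i) = \xi^i \tag{40}$$

Therefore, the dual potential (29) is

$$\varphi = \theta^i\eta_i-\psi = \sum_{i=1}^np(x_i)\log\frac{p(x_i)}{p(x_0)}+\log p(x_0) = \sum_xp(x)\log p(x) = -H(p) \tag{41}$$

where $H$ is the entropy!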
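These identifications are straightforward to verify numerically. The following sketch uses a made-up distribution on four points and hypothetical helper names; it checks that $\psi=-\log p(x_0)$, that $\eta_i=\partial_i\psi$ (by finite differences), and that the dual potential equals the negative entropy. As a bonus, it evaluates the canonical divergence $\psi(p)+\varphi(q)-\theta^i(p)\eta_i(q)$ defined earlier in the post and compares it with the relative entropy of $q$ with respect to $p$, which is what that expression reduces to for this family.

```python
import math


def natural_params(p):
    """theta^i = log(p(x_i) / p(x_0)) for i = 1..n; p[0] plays the role of p(x_0)."""
    return [math.log(pi / p[0]) for pi in p[1:]]


def psi(theta):
    """Potential psi(theta) = log(1 + sum_i exp(theta^i))."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))


def phi(p):
    """Dual potential from the relation phi = theta^i eta_i - psi, with eta_i = p(x_i)."""
    theta = natural_params(p)
    return sum(t * e for t, e in zip(theta, p[1:])) - psi(theta)


def neg_entropy(p):
    return sum(pi * math.log(pi) for pi in p)


def kl(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


def canonical_divergence(p, q):
    """psi(p) + phi(q) - theta^i(p) eta_i(q)."""
    theta_p = natural_params(p)
    return psi(theta_p) + phi(q) - sum(t * e for t, e in zip(theta_p, q[1:]))


p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.3, 0.2]
theta = natural_params(p)

# eta_i = d(psi)/d(theta^i) by central finite differences.
h = 1e-5
eta_numeric = []
for i in range(len(theta)):
    up = theta[:i] + [theta[i] + h] + theta[i + 1:]
    down = theta[:i] + [theta[i] - h] + theta[i + 1:]
    eta_numeric.append((psi(up) - psi(down)) / (2 * h))

print(psi(theta), -math.log(p[0]))
print(eta_numeric, p[1:])
print(phi(p), neg_entropy(p))
print(canonical_divergence(p, q), kl(q, p))
```

Note the order of arguments in the last line: with $\theta$ taken as the affine coordinate of the first argument, the canonical divergence comes out as the relative entropy with the arguments swapped.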
Finally, I promised you some statistical notions, which requires the introduction of one more concept, namely that of an estimator. Suppose that some data $x$ is generated according to some unknown probability distribution $p(x;\xi)$ in a model $S$. We wish to consider the problem of estimating the unknown parameter $\xi$ by some function of the data, $\hat\xi(x)$. The mapping $\hat\xi=[\hat\xi^i]:\mathcal{X}\to\mathbb{R}^n$ is called an estimator. Furthermore, $\hat\xi$ is an unbiased estimator if

$$E_\xi\big[\hat\xi(X)\big] = \xi\qquad\forall\,\xi \tag{42}$$

which is the statement that, averaged over the data, the estimator returns the true value of the parameter, whatever that value happens to be. The mean-squared error of such an estimator may be expressed via the variance-covariance matrix $V_\xi[\hat\xi]=[v^{ij}_\xi]$, whose elements are defined via

$$v^{ij}_\xi := E_\xi\Big[\big(\hat\xi^i(X)-\xi^i\big)\big(\hat\xi^j(X)-\xi^j\big)\Big] \tag{43}$$

Lastly, the Cramér-Rao inequality (Theorem 2.2) states that the variance-covariance matrix of an unbiased estimator satisfies

$$V_\xi[\hat\xi] \geq G(\xi)^{-1} \tag{44}$$

i.e., the difference $V_\xi[\hat\xi]-G(\xi)^{-1}$ is a positive semidefinite matrix, where $G(\xi)$ is the Fisher information matrix. An unbiased estimator which saturates this inequality for all $\xi$ is an efficient estimator, which means it has the minimum variance among all unbiased estimators (note however that the converse is not always true).
Efficient estimators do not exist for arbitrary coordinates $\xi$. Rather, a necessary and sufficient condition for a model $S$ to have an efficient estimator is that $S$ be an exponential family, for which $\xi$ is m-affine (Theorem 3.12).
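To illustrate this, let us now regard $F=[F_i]$ as an estimator for the parameter $\eta$, which we shall henceforth denote $\hat\eta$ to make contact with the notation just introduced. Then the condition (36) implies that $\hat\eta$ is in fact an unbiased estimator for $\eta$. Furthermore, since the Fisher information matrix can be expressed in terms of $\theta$ and $\eta$ as

$$g_{ij}(\theta) = E_\theta\big[(F_i-\eta_i)(F_j-\eta_j)\big],\qquad G(\eta)=[g^{ij}]=[g_{ij}]^{-1} \tag{45}$$

it follows that $V_\eta[\hat\eta]=G(\eta)^{-1}$, and hence $\hat\eta$ is an efficient estimator for the (m-affine) coordinate system $\eta$.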
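For the finite sample space example, efficiency can be confirmed by direct enumeration. The sketch below is an illustration under stated assumptions: a single observation is drawn from a hypothetical distribution on $\{x_0,\ldots,x_3\}$, the estimator is the indicator $F_i(x)=\delta_{x_i}(x)$, and the Fisher information in the expectation coordinates is taken to be $G(\eta)_{ij}=\delta_{ij}/\eta_i+1/p(x_0)$, the standard expression for this model.

```python
# Hypothetical expectation parameters eta_i = p(x_i), i = 1..3.
eta = [0.2, 0.3, 0.4]
p = [1.0 - sum(eta)] + eta     # p[0] = p(x_0)
n = len(eta)


def F(i, x):
    """Estimator component: indicator that the single observation is x_{i+1}."""
    return 1.0 if x == i + 1 else 0.0


# Exact expectation and covariance of the estimator, by enumerating outcomes.
mean = [sum(p[x] * F(i, x) for x in range(n + 1)) for i in range(n)]
cov = [[sum(p[x] * (F(i, x) - eta[i]) * (F(j, x) - eta[j]) for x in range(n + 1))
        for j in range(n)] for i in range(n)]

# Fisher information matrix in the eta coordinates (assumed standard form).
G = [[(1.0 / eta[i] if i == j else 0.0) + 1.0 / p[0] for j in range(n)]
     for i in range(n)]

# The Cramer-Rao bound is saturated iff cov equals the inverse of G,
# i.e. iff G @ cov is the identity matrix.
identity_error = max(
    abs(sum(G[i][k] * cov[k][j] for k in range(n)) - (1.0 if i == j else 0.0))
    for i in range(n) for j in range(n)
)

print(mean)
print(identity_error)
```

With $N$ independent observations one would average the indicators; both the covariance and the inverse Fisher information then pick up the same factor of $1/N$, so the bound remains saturated.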
The above is of course only a superficial hint of the deeper connections between information geometry and statistics (which was, after all, the prime impetus for the former’s development), but already suggests interesting mathematical utility for such problems as maximum likelihood estimation (MLE), and machine learning in general. Alas, that’s a topic for another post.
As a final comment: everything I’ve said so far is for classical probability distributions, but much of this machinery can be extended to quantum mechanics as well (insofar as the latter can be viewed as an extension of probability theory). Quantum information geometry is briefly introduced in chapter 7, and I hope to return to it in part 3 of this series soon.