Information geometry is a rather interesting fusion of statistics and differential geometry, in which a statistical model is endowed with the structure of a Riemannian manifold. Each point on the manifold corresponds to a probability distribution function, and the metric is governed by the underlying properties thereof. It may have many interesting applications to (quantum) information theory, complexity, machine learning, and theoretical neuroscience, among other fields. The canonical reference is Methods of Information Geometry by Shun-ichi Amari and Hiroshi Nagaoka, originally published in Japanese in 1993, and translated into English in 2000. The English version also contains several new topics/sections, and thus should probably be considered as a “second edition” (incidentally, any page numbers given below refer to this version). This is a beautiful book, which develops the concepts in a concise yet pedagogical manner. The posts in this three-part series are merely the notes I made while studying it, and aren’t intended to provide more than a brief summary of what I deemed the salient aspects.
Chapter 1 provides a self-contained introduction to differential geometry which — as the authors amusingly allude — is far more accessible (dare I say practical) than the typical mathematician’s treatise on the subject. Since this is familiar material, I’ll skip ahead to chapter 2, and simply recall necessary formulas/notation as needed.
Chapter 2 introduces the basic geometric structure of statistical models. First, some notation: a probability distribution is a function $p:\mathcal{X}\to\mathbb{R}$ which satisfies
$$p(x)\geq0\quad\forall x\in\mathcal{X},\qquad \int_{\mathcal{X}}p(x)\,\mathrm{d}x=1,$$
where $\mathcal{X}$ is the sample space. If $\mathcal{X}$ is a discrete set (which may have either finite or countably infinite cardinality), then the integral is instead a sum. Each $p$ may be parametrized by $n$ real-valued variables $\xi=(\xi^1,\ldots,\xi^n)$, such that the family $S$ of probability distributions on $\mathcal{X}$ is
$$S=\big\{\,p_\xi=p(x;\xi)\;\big|\;\xi=(\xi^1,\ldots,\xi^n)\in\Xi\,\big\},$$
where $\Xi\subseteq\mathbb{R}^n$ and the map $\xi\mapsto p_\xi$ is injective (so that the reverse mapping, from probability distributions $p_\xi$ to the coordinates $\xi$, is a function). Such an $S$ is referred to as an $n$-dimensional statistical model, or simply model on $\mathcal{X}$, sometimes abbreviated $S=\{p_\xi\}$. Note that I referred to $\xi$ as the coordinates above: we will assume that the map $\xi\mapsto p_\xi$ provided by $p(x;\xi)$ is $C^\infty$, so that we may take derivatives with respect to the parameters, e.g., $\partial_i p(x;\xi)$, where $\partial_i\equiv\partial/\partial\xi^i$. Additionally, we assume that the order of differentiation and integration may be freely exchanged; an important consequence of this is that
$$\int\partial_i\,p(x;\xi)\,\mathrm{d}x=\partial_i\!\int p(x;\xi)\,\mathrm{d}x=0,\tag{3}$$
which we will use below. Finally, we assume that the support of the distribution is independent of (that is, does not vary with) $\xi$, and hence may redefine $\mathcal{X}=\mathrm{supp}(p)$ for simplicity, so that $p(x;\xi)>0$ for all $x\in\mathcal{X}$. Thus, the model $S$ is a subset of
$$\mathcal{P}(\mathcal{X}):=\Big\{\,p:\mathcal{X}\to\mathbb{R}\;\Big|\;p(x)>0\;\;\forall x\in\mathcal{X},\;\int p(x)\,\mathrm{d}x=1\,\Big\}.$$
Considering parametrizations which are diffeomorphic to one another to be equivalent then elevates $S$ to a statistical manifold. (In this context, we shall sometimes conflate the distribution $p_\xi$ with the coordinate $\xi$, and speak of the “point $\xi$”, etc).
As a concrete example, consider the normal distribution:
$$p(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].\tag{5}$$
In this case, $n=2$, $\mathcal{X}=\mathbb{R}$, $\xi=(\mu,\sigma)$, and $\Xi=\{(\mu,\sigma)\,|\,\mu\in\mathbb{R},\,\sigma>0\}$ (the upper half-plane). Other examples are given on page 27.
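As a quick concrete sketch (mine, not from the book), the following Python snippet realizes this two-parameter model numerically and verifies the normalization condition at a few points of the manifold:

```python
# A concrete sketch (mine, not from the book) of the model above: the
# two-parameter normal family on X = R with coordinates xi = (mu, sigma).
# We check the normalization condition numerically at a few points p_xi in S.
import numpy as np

def p(x, xi):
    mu, sigma = xi
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-50, 50, 200001)
for xi in [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]:
    print(xi, np.trapz(p(x, xi), x))  # ~ 1.0 at every point of the manifold
```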
Fisher information metric
Now, given a model $S=\{p_\xi\}$, the Fisher information matrix of $S$ at a point $\xi$ is the $n\times n$ matrix $G(\xi)=[g_{ij}(\xi)]$, with elements
$$g_{ij}(\xi):=E_\xi\!\left[\partial_i\ell_\xi\,\partial_j\ell_\xi\right]=\int\partial_i\ell(x;\xi)\,\partial_j\ell(x;\xi)\,p(x;\xi)\,\mathrm{d}x,$$
where $\ell_\xi(x)=\ell(x;\xi):=\ln p(x;\xi)$, and the expectation with respect to the distribution $p_\xi$ is defined as
$$E_\xi[f]:=\int f(x)\,p(x;\xi)\,\mathrm{d}x.$$
Though the integral above diverges for some models, we shall assume that $g_{ij}(\xi)$ is finite for all $\xi\in\Xi$, and furthermore that it is $C^\infty$ in $\xi$. Note that $g_{ij}=g_{ji}$ (i.e., $G$ is symmetric). Additionally, while $G$ is in general positive semidefinite, we shall assume positive definiteness, which is equivalent to requiring that $\{\partial_1\ell_\xi,\ldots,\partial_n\ell_\xi\}$ are linearly independent functions on $\mathcal{X}$. Lastly, observe that eq. (3) may be written $E_\xi[\partial_i\ell_\xi]=0$ (since $\partial_i p=p\,\partial_i\ell$). Via integration by parts (differentiating this identity with respect to $\xi^j$ under the integral), this allows us to express the matrix elements in a useful alternative form:
$$g_{ij}(\xi)=-E_\xi\!\left[\partial_i\partial_j\ell_\xi\right].$$
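Both expressions are easy to check numerically. Here is a minimal Python sketch (my own, not from the book) which computes the Fisher matrix of the normal model in the coordinates $\xi=(\mu,\sigma)$ both ways, using finite differences in $\xi$ and quadrature in $x$, and compares against the known analytic result $G(\xi)=\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$:

```python
# A numerical sanity check (my own, not from the book): for the normal model
# p(x; mu, sigma) in the coordinates xi = (mu, sigma), the Fisher matrix is
# known to be diag(1/sigma^2, 2/sigma^2).  We compute it both ways, i.e.
# g_ij = E[d_i l d_j l] and g_ij = -E[d_i d_j l], via finite differences in
# xi and quadrature in x.
import numpy as np

def log_p(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)

def fisher_matrix(mu, sigma, h=1e-4):
    x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 40001)
    p = np.exp(log_p(x, mu, sigma))
    xi = np.array([mu, sigma])
    l = lambda v: log_p(x, v[0], v[1])   # l(x; xi) = ln p(x; xi)

    # first derivatives of l with respect to xi^i (central differences)
    grads = []
    for i in range(2):
        e = np.zeros(2); e[i] = h
        grads.append((l(xi + e) - l(xi - e)) / (2 * h))

    g_outer = np.empty((2, 2))   # E[d_i l d_j l]
    g_hess = np.empty((2, 2))    # -E[d_i d_j l]
    for i in range(2):
        for j in range(2):
            g_outer[i, j] = np.trapz(grads[i] * grads[j] * p, x)
            ei = np.zeros(2); ei[i] = h
            ej = np.zeros(2); ej[j] = h
            d2l = (l(xi + ei + ej) - l(xi + ei - ej)
                   - l(xi - ei + ej) + l(xi - ei - ej)) / (4 * h * h)
            g_hess[i, j] = -np.trapz(d2l * p, x)
    return g_outer, g_hess

mu, sigma = 0.3, 1.7
g1, g2 = fisher_matrix(mu, sigma)
print(g1)                                     # ~ diag(1/sigma^2, 2/sigma^2)
print(g2)                                     # same matrix, via -E[d_i d_j l]
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # analytic result
```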
The reason for the particular definition of the Fisher matrix above is that it provides a Riemannian metric on $S$. Recall that a Riemannian metric $g=\langle\cdot\,,\cdot\rangle$ is a $(0,2)$-tensor (that is, it assigns to each point of $S$ an inner product on the corresponding tangent space) which is linear, symmetric, and positive definite (meaning $\langle X,X\rangle\geq0$, with equality iff $X=0$). In fact, in the natural basis of local coordinates $[\xi^i]$, the Riemannian metric is uniquely determined by
$$g_{ij}(\xi)=\big\langle\partial_i,\partial_j\big\rangle_\xi,$$
whence $G(\xi)=[g_{ij}(\xi)]$ is called the Fisher metric. Since it’s positive definite, we may define the inverse metric $g^{ij}$ corresponding to $G(\xi)^{-1}$ such that $g^{ik}g_{kj}=\delta^i_{\;j}$. Note that $g$ is invariant under coordinate transformations, since under a change of coordinates $\xi\to\rho$,
$$g_{kl}(\rho)=\frac{\partial\xi^i}{\partial\rho^k}\frac{\partial\xi^j}{\partial\rho^l}\,g_{ij}(\xi),$$
and hence we may write the line element as
$$\mathrm{d}s^2=g_{ij}\,\mathrm{d}\xi^i\mathrm{d}\xi^j.$$
Recalling that $\|X\|=\sqrt{\langle X,X\rangle}$, the length of a curve $\gamma:[a,b]\to S$ with respect to $g$ is then
$$\|\gamma\|=\int_a^b\bigg\|\frac{\mathrm{d}\gamma}{\mathrm{d}t}\bigg\|\,\mathrm{d}t=\int_a^b\sqrt{g_{ij}\,\frac{\mathrm{d}\xi^i}{\mathrm{d}t}\frac{\mathrm{d}\xi^j}{\mathrm{d}t}}\;\mathrm{d}t.$$
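For instance, here is a small sketch (again mine) computing the Fisher length of the curve $\gamma(t)=(\mu,\sigma)=(0,1+t)$, $t\in[0,1]$, in the normal model, for which the analytic answer is $\sqrt{2}\ln2$:

```python
# Illustrative sketch (my own): the Fisher length of the curve
# gamma(t) = (mu, sigma) = (0, 1 + t), t in [0, 1], in the normal model,
# whose metric is g = diag(1/sigma^2, 2/sigma^2).  The analytic answer
# is sqrt(2) * ln(2).
import numpy as np

t = np.linspace(0.0, 1.0, 10001)
sigma = 1.0 + t                              # d sigma / dt = 1, d mu / dt = 0
speed = np.sqrt((2.0 / sigma**2) * 1.0**2)   # sqrt(g_ij dxi^i/dt dxi^j/dt)
print(np.trapz(speed, t))                    # ~ 0.9803
print(np.sqrt(2) * np.log(2))                # analytic length
```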
But before we can properly speak of curves and distances, we must define a connection, which provides a means of parallel transporting vectors along the curve.
$\alpha$-connections
In particular, we will be concerned with comparing elements of the tangent bundle, and hence we require a relation between the tangent spaces at different points in $S$, i.e., an affine connection. To that end, let $S=\{p_\xi\}$ be an $n$-dimensional model, and define the $n^3$ functions $\Gamma^{(\alpha)}_{ij,k}$ which map each point $\xi$ to
$$\Gamma^{(\alpha)}_{ij,k}(\xi):=E_\xi\!\left[\Big(\partial_i\partial_j\ell_\xi+\tfrac{1-\alpha}{2}\,\partial_i\ell_\xi\,\partial_j\ell_\xi\Big)\,\partial_k\ell_\xi\right],$$
where $\alpha\in\mathbb{R}$. This defines an affine connection $\nabla^{(\alpha)}$ on $S$ via
$$\big\langle\nabla^{(\alpha)}_{\partial_i}\partial_j,\,\partial_k\big\rangle=\Gamma^{(\alpha)}_{ij,k},$$
where $\langle\cdot\,,\cdot\rangle=g$ is the Fisher metric introduced above. $\nabla^{(\alpha)}$ is called the $\alpha$-connection, and accordingly terms like $\alpha$-flat, $\alpha$-affine, $\alpha$-parallel, etc. denote the corresponding notions with respect to this connection.
Let us pause to review some of the terminology from differential geometry above. Recall that the covariant derivative may be expressed in local coordinates as
$$\nabla_XY=X^i\big(\partial_iY^k+Y^j\Gamma_{ij}^{\;\;k}\big)\,\partial_k,$$
where $X=X^i\partial_i$, $Y=Y^j\partial_j$ are vectors in the tangent space. If these are basis vectors such that $X=\partial_i$, $Y=\partial_j$, then $\nabla_{\partial_i}\partial_j=\Gamma_{ij}^{\;\;k}\,\partial_k$. The vector $Y$ is said to be parallel with respect to the connection $\nabla$ if $\nabla_XY=0$ for all $X$, i.e., $\nabla_{\partial_i}Y=0$ for all $i$; equivalently, in local coordinates,
$$\partial_iY^k+Y^j\Gamma_{ij}^{\;\;k}=0.$$
If all basis vectors are parallel with respect to a coordinate system $[\xi^i]$ (that is, if $\Gamma_{ij}^{\;\;k}=0$ in these coordinates), then the latter is an affine coordinate system for $\nabla$. Connections which admit such an affine parametrization are called flat (equivalently, one says that $S$ is flat with respect to $\nabla$).
Now, with respect to a Riemannian metric $g$, one defines $\Gamma_{ij,k}$ as above, namely $\Gamma_{ij,k}=\big\langle\nabla_{\partial_i}\partial_j,\partial_k\big\rangle=\Gamma_{ij}^{\;\;l}\,g_{lk}$. Note that this defines a symmetric connection, i.e., one with $\Gamma_{ij,k}=\Gamma_{ji,k}$. If, in addition, $\nabla$ satisfies
$$\partial_k\,g_{ij}=\Gamma_{ki,j}+\Gamma_{kj,i},$$
then $\nabla$ is a metric connection with respect to $g$. (This is basically a statement about linearity, since affine transformations are linear). This implies that
$$\big\langle\Pi_\gamma X,\,\Pi_\gamma Y\big\rangle_q=\big\langle X,Y\big\rangle_p,$$
where $\Pi_\gamma$ denotes parallel transport along any curve $\gamma$ from $p$ to $q$. In other words, under a metric connection, parallel transport of two vectors preserves the inner product, hence their significance in Riemannian geometry. A connection which is both metric and symmetric is the Riemannian (Levi-Civita) connection; it is unique for a given metric, but since a manifold admits infinitely many distinct metrics, there are generically an infinite number of such connections. However, the natural connections on statistical manifolds are generically non-metric! Indeed, since
$$\partial_k\,g_{ij}=\Gamma^{(\alpha)}_{ki,j}+\Gamma^{(-\alpha)}_{kj,i},$$
only the special case $\alpha=0$ defines a Riemannian connection with respect to the Fisher metric (though observe that $\Gamma^{(\alpha)}_{ij,k}$ is symmetric in $i,j$ for any value of $\alpha$). While this may seem strange from a physics perspective, where preserving the inner product is of prime importance, there’s nothing mathematically pathological about it. Indeed, the more relevant condition — which we’ll see below — is that every point on the manifold have an interpretation as a probability distribution.
Two neat relationships between different $\alpha$-connections are worth noting. First, for any $\alpha$, we have
$$\Gamma^{(\alpha)}_{ij,k}=\Gamma^{(0)}_{ij,k}-\frac{\alpha}{2}\,T_{ijk},$$
where $T$ (not to be confused with the unrelated torsion tensor) is a covariant symmetric tensor which maps a point $\xi$ to
$$T_{ijk}(\xi):=E_\xi\!\left[\partial_i\ell_\xi\,\partial_j\ell_\xi\,\partial_k\ell_\xi\right].$$
Second, the $\alpha$-connection may be decomposed as
$$\Gamma^{(\alpha)}_{ij,k}=\frac{1+\alpha}{2}\,\Gamma^{(1)}_{ij,k}+\frac{1-\alpha}{2}\,\Gamma^{(-1)}_{ij,k}.$$
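Both relations, along with the metric-compatibility identity above, are straightforward to spot-check numerically; the following sketch (my own, with an arbitrarily chosen point and value of $\alpha$) does so for the normal model:

```python
# A numerical spot check (my own sketch) of the two relations above, and of
# the identity d_k g_ij = Gamma^(a)_{ki,j} + Gamma^(-a)_{kj,i}, using the
# normal model with coordinates xi = (mu, sigma) and an arbitrary point
# and alpha.
import numpy as np

h = 1e-3
x = np.linspace(-25, 25, 60001)

def l(xi):                      # l(x; xi) = ln p(x; xi)
    mu, sigma = xi
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)

def d(i, xi):                   # d_i l via central differences
    e = np.zeros(2); e[i] = h
    return (l(xi + e) - l(xi - e)) / (2 * h)

def dd(i, j, xi):               # d_i d_j l
    e = np.zeros(2); e[j] = h
    return (d(i, xi + e) - d(i, xi - e)) / (2 * h)

def E(f, xi):                   # expectation with respect to p(x; xi)
    return np.trapz(f * np.exp(l(xi)), x)

def Gamma(a, i, j, k, xi):      # alpha-connection coefficients
    return E((dd(i, j, xi) + 0.5 * (1 - a) * d(i, xi) * d(j, xi)) * d(k, xi), xi)

def T(i, j, k, xi):             # the symmetric tensor T_ijk
    return E(d(i, xi) * d(j, xi) * d(k, xi), xi)

def g(i, j, xi):                # Fisher metric
    return E(d(i, xi) * d(j, xi), xi)

xi, a = np.array([0.5, 2.0]), 0.7
for i in range(2):
    for j in range(2):
        for k in range(2):
            assert abs(Gamma(a, i, j, k, xi)
                       - (Gamma(0, i, j, k, xi) - 0.5 * a * T(i, j, k, xi))) < 1e-6
            e = np.zeros(2); e[k] = h
            dk_g = (g(i, j, xi + e) - g(i, j, xi - e)) / (2 * h)
            assert abs(dk_g - (Gamma(a, k, i, j, xi) + Gamma(-a, k, j, i, xi))) < 1e-4
print("both relations hold numerically")
```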
Within this infinite class of connections, $\alpha=\pm1$ play a central role in information geometry, and are closely related to an interesting duality structure on the geometry of $S$. We shall give a low-level introduction to the relevant representations of $S$ here, and postpone a more elegant derivation based on different embeddings of $S$ in $\mathbb{R}^{\mathcal{X}}$ to the next post. In particular, we’ll define the so-called exponential and mixture families, which are intimately related to the $(+1)$- and $(-1)$-connections, respectively.
Exponential families
Suppose that an $n$-dimensional model $S$ can be expressed in terms of $n+1$ functions $\{C,F_1,\ldots,F_n\}$ on $\mathcal{X}$ and a function $\psi$ of the parameters as
$$p(x;\theta)=\exp\!\big[\,C(x)+\theta^iF_i(x)-\psi(\theta)\big],$$
where we’ve employed Einstein’s summation notation for the sum over $i$ from $1$ to $n$. Then $S$ is an exponential family, and $\theta=(\theta^1,\ldots,\theta^n)$ are its natural or canonical parameters. The normalization condition $\int p(x;\theta)\,\mathrm{d}x=1$ implies that
$$\psi(\theta)=\ln\!\int\exp\!\big[\,C(x)+\theta^iF_i(x)\big]\,\mathrm{d}x.$$
This provides a parametrization $\theta\mapsto p_\theta$, which is one-to-one if and only if the functions $\{1,F_1,\ldots,F_n\}$ are linearly independent (which we shall assume henceforth). Many important probabilistic models fall into this class, including all those referenced on page 27 above. The normal distribution (5), for instance, yields
$$\theta^1=\frac{\mu}{\sigma^2},\qquad\theta^2=-\frac{1}{2\sigma^2},\qquad F_1(x)=x,\qquad F_2(x)=x^2,\qquad C(x)=0,$$
$$\psi(\theta)=\frac{\mu^2}{2\sigma^2}+\ln\!\big(\sqrt{2\pi}\,\sigma\big)=-\frac{(\theta^1)^2}{4\theta^2}+\frac{1}{2}\ln\!\Big(\!-\frac{\pi}{\theta^2}\Big).$$
The canonical coordinates are natural insofar as they provide a $1$-affine coordinate system, with respect to which $S$ is $1$-flat. To see this, observe that
$$\partial_i\ell_\theta(x)=F_i(x)-\partial_i\psi(\theta)\qquad\Longrightarrow\qquad\partial_i\partial_j\ell_\theta(x)=-\partial_i\partial_j\psi(\theta),$$
where keep in mind that $\partial_i$ denotes the derivative with respect to $\theta^i$, not $x$! This implies that
$$\Gamma^{(1)}_{ij,k}(\theta)=E_\theta\!\left[\partial_i\partial_j\ell_\theta\;\partial_k\ell_\theta\right]=-\partial_i\partial_j\psi(\theta)\,E_\theta\!\left[\partial_k\ell_\theta\right]=0.$$
Therefore, exponential families admit a canonical parametrization in terms of a $1$-affine coordinate system $[\theta^i]$, with respect to which $S$ is $1$-flat. The associated affine connection is called the exponential connection, and is denoted $\nabla^{(\mathrm{e})}=\nabla^{(1)}$.
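Here is a small numerical spot-check (mine, not from the book) of this $1$-flatness for the normal family, using the fact that $\partial_i\psi=E_\theta[F_i]$ for an exponential family:

```python
# Sketch (my own) spot-checking 1-flatness of the normal family in its
# natural coordinates at the point (mu, sigma) = (1, 2).  For an exponential
# family, d_i psi = E[F_i], so d_i l = F_i(x) - E[F_i] and
# d_i d_j l = -Cov(F_i, F_j) is independent of x; hence Gamma^(1)_{ij,k} = 0,
# while e.g. Gamma^(0) does not vanish.
import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 15 * sigma, mu + 15 * sigma, 80001)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
E = lambda f: np.trapz(f * p, x)

F = [x, x**2]                                        # F_1 = x, F_2 = x^2
dl = [F[i] - E(F[i]) for i in range(2)]              # d_i l in theta coordinates
ddl = [[-E(dl[i] * dl[j]) for j in range(2)] for i in range(2)]  # d_i d_j l

def Gamma(alpha, i, j, k):
    return E((ddl[i][j] + 0.5 * (1 - alpha) * dl[i] * dl[j]) * dl[k])

print(max(abs(Gamma(1, i, j, k))
          for i in range(2) for j in range(2) for k in range(2)))  # ~ 0
print(Gamma(0, 0, 0, 1))  # ~ 16: the theta coordinates are not 0-affine
```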
Mixture families
Now consider the case in which $p(x;\theta)$ can be expressed as
$$p(x;\theta)=C(x)+\theta^iF_i(x),$$
i.e., $S$ is an affine subspace of $\mathcal{P}(\mathcal{X})$. In this case $S$ is called a mixture family, with mixture parameters $\theta$. Note that $\mathcal{P}(\mathcal{X})$ itself is a mixture family when $\mathcal{X}$ is a finite set. The name arises from the fact that elements in this family admit a representation as a mixture of $n+1$ probability distributions $p_1,\ldots,p_{n+1}$,
$$p(x;\theta)=\sum_{i=1}^{n}\theta^i\,p_i(x)+\Big(1-\sum_{i=1}^{n}\theta^i\Big)\,p_{n+1}(x)$$
(i.e., each $p_i(x)>0$ and $\int p_i(x)\,\mathrm{d}x=1$), where $\theta^i>0$ and $\sum_{i=1}^n\theta^i<1$. For a mixture family, we have
$$\partial_i\partial_j\ell_\theta=-\,\partial_i\ell_\theta\,\partial_j\ell_\theta,$$
which implies (the factor $\tfrac{1-\alpha}{2}$ becomes $1$ at $\alpha=-1$) that
$$\Gamma^{(-1)}_{ij,k}(\theta)=E_\theta\!\left[\big(\partial_i\partial_j\ell_\theta+\partial_i\ell_\theta\,\partial_j\ell_\theta\big)\,\partial_k\ell_\theta\right]=0.$$
Therefore, a mixture family admits a parametrization in terms of a $(-1)$-affine coordinate system $[\theta^i]$, with respect to which $S$ is $(-1)$-flat. The associated affine connection is called the mixture connection, denoted $\nabla^{(\mathrm{m})}=\nabla^{(-1)}$.
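And a matching spot-check (again my own sketch) of $(-1)$-flatness for a toy one-parameter mixture of two fixed normals:

```python
# Sketch (my own) checking (-1)-flatness of a toy mixture family: the
# one-parameter model p(x; theta) = theta*p1(x) + (1 - theta)*p2(x) built
# from two fixed normals.  Here dl = (p1 - p2)/p and d^2 l = -(dl)^2, so the
# single coefficient Gamma^(-1)_{11,1} vanishes, while Gamma^(+1)_{11,1}
# generically does not.
import numpy as np

x = np.linspace(-20, 20, 80001)
normal = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
p1, p2 = normal(-2.0), normal(2.0)

theta = 0.3
p = theta * p1 + (1 - theta) * p2
dl = (p1 - p2) / p                  # d l / d theta
ddl = -dl ** 2                      # d^2 l / d theta^2 for a mixture family
E = lambda f: np.trapz(f * p, x)

Gamma = lambda alpha: E((ddl + 0.5 * (1 - alpha) * dl ** 2) * dl)
print(Gamma(-1.0))   # = 0: theta is a (-1)-affine coordinate
print(Gamma(+1.0))   # nonzero: the same coordinate is not 1-affine
```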
In the next post, when we discuss the geometrical structure in more detail, we shall see that $\nabla^{(\mathrm{e})}$ and $\nabla^{(\mathrm{m})}$ are dual connections, which has many interesting consequences.
Why is Fisher special?
As noted above, a given manifold admits infinitely many distinct Riemannian metrics and affine connections. However, a statistical manifold has the property that every point is a probability distribution, which singles out the Fisher metric and $\alpha$-connection as unique. To formalize this notion, we must first introduce the concept of a sufficient statistic.
Let $F:\mathcal{X}\to\mathcal{Y}$ be a map which takes random variables $x\in\mathcal{X}$ to $y=F(x)\in\mathcal{Y}$. Given the distribution $p(x;\xi)$ of $x$, this results in the distribution $q(y;\xi)$ on $\mathcal{Y}$. We then define
$$p(x|y;\xi):=\frac{p(x;\xi)}{q(y;\xi)}\,\delta\big(y-F(x)\big),$$
where $q(y;\xi)=\int\delta\big(y-F(x)\big)\,p(x;\xi)\,\mathrm{d}x$, and $\delta\big(y-F(x)\big)$ is the delta function at the point $y=F(x)$, such that $\delta\big(y-F(x)\big)=0$ for $y\neq F(x)$, $\int\delta\big(y-F(x)\big)\,\mathrm{d}y=1$. In other words, the delta function picks out the value of $y$ such that $y=F(x)$. The above implies that $p(x|y;\xi)$ is the conditional probability of the event $x$, given $y$ (cf. the familiar definition $P(A|B)=P(A\cap B)/P(B)$). If $p(x|y;\xi)$ is independent of $\xi$, then $F$ is called a sufficient statistic for $S$. In this case, we may write
$$p(x;\xi)=q\big(F(x);\xi\big)\,p\big(x|F(x)\big),$$
i.e., the dependence of $p(x;\xi)$ on $\xi$ is entirely encoded in the distribution $q(y;\xi)$. Therefore, treating $p(x;\xi)$ as the unknown distribution, whose parameter $\xi$ one wishes to estimate, it suffices to know only the value $y=F(x)$, hence the name. Formally, one says that $F$ is a sufficient statistic if and only if there exist functions $s(y;\xi)$ and $t(x)$ such that
$$p(x;\xi)=s\big(F(x);\xi\big)\,t(x).\tag{37}$$
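As a standard illustration (my example, not taken from the book): for $n$ independent samples $x=(x_1,\ldots,x_n)$ from $N(\mu,1)$, the sample mean $\bar{x}=\tfrac{1}{n}\sum_ix_i$ is a sufficient statistic for $\mu$, since the identity $\sum_i(x_i-\mu)^2=\sum_i(x_i-\bar{x})^2+n(\bar{x}-\mu)^2$ puts the joint density precisely in the factorized form above:
$$p(x;\mu)=\prod_{i=1}^n\frac{e^{-(x_i-\mu)^2/2}}{\sqrt{2\pi}}=\underbrace{e^{-n(\bar{x}-\mu)^2/2}}_{s(F(x);\,\mu)}\;\underbrace{(2\pi)^{-n/2}\,e^{-\frac{1}{2}\sum_i(x_i-\bar{x})^2}}_{t(x)},\qquad F(x)=\bar{x}.$$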
The significance of this lies in the fact that the Fisher information metric satisfies a monotonicity relation under a generic map $F$. This is detailed in Theorem 2.1 of Amari & Nagaoka, which states that given $S=\{p(x;\xi)\}$ with Fisher metric $G(\xi)$, and induced model $S_F=\{q(y;\xi)\}$ with matrix $G_F(\xi)$, the difference $\Delta G(\xi):=G(\xi)-G_F(\xi)$ is positive semidefinite, i.e., $G(\xi)\geq G_F(\xi)$, with equality if and only if $F$ is a sufficient statistic. Otherwise, for generic maps, the “information loss” $\Delta G(\xi)$ that results from summarizing the data $x$ in $y=F(x)$ is given by
$$\Delta g_{ij}(\xi)=E_\xi\!\left[\partial_i\ell(x|F(x);\xi)\,\partial_j\ell(x|F(x);\xi)\right],\qquad \ell(x|y;\xi):=\ln p(x|y;\xi),$$
which can be expressed in terms of the covariance with respect to the conditional distribution $p(x|y;\xi)$. This theorem will be important later, when we discuss relative entropy.
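As a minimal illustration (my own, with a hypothetical one-parameter setup rather than anything from the book), the following sketch compares the Fisher information of the induced models for a sufficient and a non-sufficient statistic of two i.i.d. normal samples:

```python
# Sketch (my own) of the monotonicity statement in a 1-parameter example:
# two i.i.d. samples from N(mu, 1), so the full Fisher information is 2.
# The statistic F(x1, x2) = x1 + x2 is sufficient (its induced model is
# N(2*mu, 2)), while F(x1, x2) = x1 discards half of the information.
import numpy as np

def fisher_1d(log_q, mu, h=1e-4):
    """Fisher information of a 1-parameter family of densities on R."""
    y = np.linspace(-40, 40, 160001)
    dl = (log_q(y, mu + h) - log_q(y, mu - h)) / (2 * h)
    return np.trapz(dl ** 2 * np.exp(log_q(y, mu)), y)

log_normal = lambda y, m, v: -0.5 * (y - m) ** 2 / v - 0.5 * np.log(2 * np.pi * v)

mu = 0.7
g_full = 2.0                                                    # two N(mu, 1) samples
g_sum = fisher_1d(lambda y, m: log_normal(y, 2 * m, 2.0), mu)   # induced by F = x1 + x2
g_one = fisher_1d(lambda y, m: log_normal(y, m, 1.0), mu)       # induced by F = x1
print(g_full, g_sum, g_one)  # ~ 2.0, 2.0, 1.0: equality only for the sufficient statistic
```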
Now, if $F$ is a sufficient statistic, then (37) implies that $\partial_i\ell(x;\xi)=\partial_i\ln s\big(F(x);\xi\big)$, which depends on $x$ only through $F(x)$ and in fact equals $\partial_i\ln q\big(F(x);\xi\big)$. But this implies that $g_{ij}(\xi)$, and by extension $\Gamma^{(\alpha)}_{ij,k}(\xi)$, are the same for both $S$ and $S_F$. Therefore the Fisher metric and $\alpha$-connection are invariant with respect to the sufficient statistic $F$. In the language above, this implies that there is no information loss associated with describing the original distribution $p(x;\xi)$ by $q(y;\xi)$, i.e., that information is preserved under $F$. Formally, this invariance is codified by the following two equations:
$$\big\langle X,Y\big\rangle_p=\big\langle\lambda_*X,\,\lambda_*Y\big\rangle'_{\lambda(p)},\qquad \lambda_*\big(\nabla^{(\alpha)}_XY\big)=\nabla'^{(\alpha)}_{\lambda_*X}\,\lambda_*Y,$$
where the prime denotes the object on $S_F$, $\lambda$ is the diffeomorphism from $S$ onto $S_F$ given by $\lambda(p_\xi)=q_\xi$, and the pushforward $\lambda_*$ is defined by $(\lambda_*X)f=X(f\circ\lambda)$.
The salient feature of the Fisher metric and $\alpha$-connection is that they are uniquely characterized by this invariance! This is the thrust of Chentsov’s theorem (Theorem 2.6 in Amari & Nagaoka). Strictly speaking, the proof of this theorem relies on finiteness of $\mathcal{X}$, but — depending on the level of rigour one demands — it is possible to extend this to infinite models via a limiting procedure in which one considers increasingly fine-grained subsets of $\mathcal{X}$. A similar subtlety will arise in our more geometrical treatment of dual structures in the next post. I’m honestly unsure how serious this issue is, but it’s worth bearing in mind that the mathematical basis is less solid for infinite $\mathcal{X}$, and may require a more rigorous functional analytic approach.