Variational autoencoders

As part of one of my current research projects, I’ve been looking into variational autoencoders (VAEs) for the purpose of identifying and analyzing attractor solutions within higher-dimensional phase spaces. Of course, I couldn’t resist diving into the deeper mathematical theory underlying these generative models, beyond what was strictly necessary in order to implement one. As in the case of the restricted Boltzmann machines I’ve discussed before, there are fascinating relationships between physics, information theory, and machine learning at play here, in particular the intimate connection between (free) energy minimization and Bayesian inference. Insofar as I actually needed to learn how to build one of these networks however, I’ll start by introducing VAEs from a somewhat more implementation-oriented mindset, and discuss the deeper physics/information-theoretic aspects afterwards.

Mathematical formulation

An autoencoder is a type of neural network (NN) consisting of two feedforward networks: an encoder, which maps an input {X} onto a latent space {Z}, and a decoder, which maps the latent representation {Z} to the output {X'}. The idea is that {\mathrm{dim}(Z)<\mathrm{dim}(X)=\mathrm{dim}(X')}, so that information in the original data is compressed into a lower-dimensional “feature space”. For this reason, autoencoders are often used for dimensional reduction, though their applicability to real-world problems seems rather limited. Training consists of minimizing the difference between {X} and {X'} according to some suitable loss function. They are a form of unsupervised (or rather, self-supervised) learning, in which the NN seeks to learn highly compressed, discrete representation of the input.

VAEs inherit the network structure of autoencoders, but are fundamentally rather different in that they learn the parameters of a probability distribution that represents the data. This makes them much more powerful than their simpler precursors insofar as they are generative models (that is, they can generate new examples of the input type). Additionally, their statistical nature — in particular, learning a continuous probability distribution — makes them vastly superior in yielding meaningful results from new/test data that gets mapped to novel regions of the latent space. In a nutshell, the encoding {Z} is generated stochastically, using variational techniques—and we’ll have more to say on what precisely this means below.

Mathematically, a VAE is a latent-variable model {p_\theta(x,z)} with latent variables {z\in Z} and observed variables (i.e., data) {x\in X}, where {\theta} represents the parameters of the distribution. (For example, Gaussian distributions are uniquely characterized by their mean {\mu} and standard deviation {\sigma}, in which case {\theta\in\{\mu,\sigma\}}; more generally, {\theta} would parametrize the masses and couplings of whatever model we wish to construct. Note that we shall typically suppress the subscript {\theta} where doing so does not lead to ambiguity). This joint distribution can be written

\displaystyle p(x,z)=p(x|z)p(z)~. \ \ \ \ \ (1)

The first factor on the right-hand side is the decoder, i.e., the likelihood {p(x|z)} of observing {x} given {z}; this provides the map from {Z\rightarrow X'\simeq X}. This will typically be either a multivariate Gaussian or Bernoulli distribution, implemented by an RBM with as-yet unlearned weights and biases. The second factor is the prior distribution of latent variables {p(z)}, which will be related to observations {x} via the likelihood function (i.e., the decoder). This can be thought of as a statement about the variable {z} with the data {x} held fixed. In order to be computational tractable, we want to make the simplest possible choice for this distribution; accordingly, one typically chooses a multivariate Gaussian,

\displaystyle p(z)=\mathcal{N}(0,1)~. \ \ \ \ \ (2)

In the context of Bayesian inference, this is technically what’s known as an informative prior, since it assumes that any other parameters in the model are sufficiently small that Gaussian sampling from {Z} does not miss any strongly relevant features. This is in contrast to the somewhat misleadingly named uninformative prior, which endeavors to place no subjective constraints on the variable; for this reason, the latter class are sometimes called objective priors, insofar as they represent the minimally biased choice. In any case, the reason such a simple choice (2) suffices for {p(z)} is that any distribution can be generated by applying a sufficiently complicated function to the normal distribution.

Meanwhile, the encoder is represented by the posterior probability {p(z|x)}, i.e., the probability of {z} given {x}; this provides the map from {X\rightarrow Z}. In principle, this is given by Bayes’ rule:

\displaystyle p(z|x)=\frac{p(x|z)p(z)}{p(x)}~, \ \ \ \ \ (3)

but this is virtually impossible to compute analytically, since the denominator amounts to evaluating the partition function over all possible configurations of latent variables, i.e.,

\displaystyle p(x)=\int\!\mathrm{d}z\,p(x|z)p(z)~. \ \ \ \ \ (4)

One solution is to compute {p(x)} approximately via Monte Carlo sampling; but the impression I’ve gained from my admittedly superficial foray into the literature is that such models are computationally expensive, noisy, difficult to train, and generally inferior to the more elegant solution offered by VAEs. The key idea is that for most {z}, {p(x|z)\approx0}, so instead of sampling over all possible {z}, we construct a new distribution {q(z|x)} representing the values of {z} which are most likely to have produced {x}, and sample over this new, smaller set of {z} values [2]. In other words, we seek a more tractable approximation {q_\phi(z|x)\approx p_\theta(z|x)}, characterized by some other, variational parameters {\phi}—so-called because we will eventually vary these parameters in order to ensure that {q} is as close to {p} as possible. As usual, the discrepancy between these distributions is quantified by the familiar Kullback-Leibler (KL) divergence:

\displaystyle D_z\left(q(z|x)\,||\,p(z|x)\right)=\sum_z q(z|x)\ln\frac{q(z|x)}{p(z|x)}~, \ \ \ \ \ (5)

where the subscript on the left-hand side denotes the variable over which we marginalize.

This divergence plays a central role in the variational inference procedure we’re trying to implement, and underlies the connection to the information-theoretic relations alluded above. Observe that Bayes’ rule enables us to rewrite this expression as

\displaystyle D_z\left(q(z|x)\,||\,p(z|x)\right)= \langle \ln q(z|x)-\ln p(z)\rangle_q -\langle\ln p(x|z)\rangle_q +\ln p(x) \ \ \ \ \ (6)

where {\langle\ldots\rangle_q} denotes the expectation value with respect to {q(z|x)}, and we have used the fact that {\sum\nolimits_z q(z|x) \ln p(x)=\ln p(x)} (since probabilities are normalized to 1, and {p(x)} has no dependence on the latent variables {z}). Now observe that the first term on the right-hand side can be written as another KL divergence. Rearranging, we therefore have

\displaystyle \ln p(x)-D_z\left(q(z|x)\,||\,p(z|x)\right)=-F_q(x) \ \ \ \ \ (7)

where we have identified the (negative) variational free energy

\displaystyle -F_q(x)=\langle\ln p(x|z)\rangle_q-D_z\left(q(z|x)\,||\,p(z)\right)~. \ \ \ \ \ (8)

As the name suggests, this is closely related to the Helmholtz free energy from thermodynamics and statistical field theory; we’ll discuss this connection in more detail below, and in doing so provide a more intuitive definition: the form (8) is well-suited to the implementation-oriented interpretation we’re about to provide, but is a few manipulations removed from the underlying physical meaning.

The expressions (7) and (8) comprise the central equation of VAEs (and variational Bayesian methods more generally), and admit a particularly simple interpretation. First, observe that the left-hand side of (7) is the log-likelihood, minus an “error term” due to our use of an approximate distribution {q(z|x)}. Thus, it’s the left-hand side of (7) that we want our learning procedure to maximize. Here, the intuition underlying maximum likelihood estimation (MLE) is that we seek to maximize the probability of each {x\!\in\!X} under the generative process provided by the decoder {p(x|z)}. As we will see, the optimization process pulls {q(z|x)} towards {p(z|x)} via the KL term; ideally, this vanishes, whereupon we’re directly optimizing the log-likelihood {\ln p(x)}.

The variational free energy (8) consists of two terms: a reconstruction error given by the expectation value of {\ln p(x|z)} with respect to {q(z|x)}, and a so-called regulator given by the KL divergence. The reconstruction error arises from encoding {X} into {Z} using our approximate distribution {q(z|x)}, whereupon the log-likelihood of the original data given these inferred latent variables will be slightly off. The KL divergence, meanwhile, simply encourages the approximate posterior distribution {q(z|x)} to be close to {p(z)}, so that the encoding matches the latent distribution. Note that since the KL divergence is positive-definite, (7) implies that the negative variational free energy gives a lower-bound on the log-likelihood. For this reason, {-F_q(x)} is sometimes referred to as the Evidence Lower BOund (ELBO) by machine learners.

The appearance of the (variational) free energy (8) is not a mere mathematical coincidence, but stems from deeper physical aspects of inference learning in general. I’ll digress upon this below, as promised, but we’ve a bit more work to do first in order to be able to actually implement a VAE in code.

Computing the gradient of the cost function

Operationally, training a VAE consists of performing stochastic gradient descent (SGD) on (8) in order to minimize the variational free energy (equivalently, maximize the ELBO). In other words, this will provide the cost or loss function (9) for the model. Note that since {\ln p(x)} is constant with respect to {q(z|x)}, (7) implies that minimizing the variational energy indeed forces the approximate posterior towards the true posterior, as mentioned above.

In applying SGD to the cost function (8), we actually have two sets of parameters over which to optimize: the parameters {\theta} that define the VAE as a generative model {p_\theta(x,z)}, and the variational parameters {\phi} that define the approximate posterior {q_\phi(z|x)}. Accordingly, we shall write the cost function as

\displaystyle \mathcal{C}_{\theta,\phi}(X)=-\sum_{x\in X}F_q(x) =-\sum_{x\in X}\left[\langle\ln p_\theta(x|z)\rangle_q-D_z\left(q_\phi(z|x)\,||\,p(z)\right) \right]~, \ \ \ \ \ (9)

where, to avoid a preponderance of subscripts, we shall continue to denote {F_q\equiv F_{q_\phi(z|x)}}, and similarly {\langle\ldots\rangle_q=\langle\ldots\rangle_{q_\phi(z|x)}}. Taking the gradient with respect to {\theta} is easy, since only the first term on the right-hand side has any dependence thereon. Hence, for a given datapoint {x\in X},

\displaystyle \nabla_\theta\mathcal{C}_{\theta,\phi}(x) =-\langle\nabla_\theta\ln p_\theta(x|z)\rangle_q \approx-\nabla_\theta\ln p_\theta(x|z)~, \ \ \ \ \ (10)

where in the second step we have replaced the expectation value with a single sample drawn from the latent space {Z}. This is a common method in SGD, in which we take this particular value of {z} to be a reasonable approximation for the average {\langle\ldots\rangle_q}. (Yet-more connections to mean field theory (MFT) we must of temporal necessity forgo; see Mehta et al. [1] for some discussion in this context, or Doersch [2] for further intuition). The resulting gradient can then be computed via backpropagation through the NN.

The gradient with respect to {\phi}, on the other hand, is slightly problematic, since the variational parameters also appear in the distribution with respect to which we compute expectation values. And the sampling trick we just employed means that in the implementation of this layer of the NN, the evaluation of the expectation value is a discrete operation: it has no gradient, and hence we can’t backpropagate through it. Fortunately, there’s a clever method called the reparametrization trick that circumvents this stumbling block. The basic idea is to change variables so that {\phi} no longer appears in the distribution with respect to which we compute expectation values. To do so, we express the latent variable {z} (which is ostensibly drawn from {q_\phi(z|x)}) as a differentiable and invertible transformation of some other, independent random variable {\epsilon}, i.e., {z=g(\epsilon; \phi, x)} (where here “independent” means that the distribution of {\epsilon} does not depend on either {x} or {\phi}; typically, one simply takes {\epsilon\sim\mathcal{N}(0,1)}). We can then replace {\langle\ldots\rangle_{q_\phi}\rightarrow\langle\ldots\rangle_{p_\epsilon}}, whereupon we can move the gradient inside the expectation value as before, i.e.,

\displaystyle -\nabla_\phi\langle\ln p_\theta(x|z)\rangle_{q_\phi} =-\langle\nabla_\phi\ln p_\theta(x|z)\rangle_{p_\epsilon}~. \ \ \ \ \ (11)

Note that in principle, this results in an additional term due to the Jacobian of the transformation. Explicitly, this equivalence between expectation values may be written

\displaystyle \begin{aligned} \langle f(z)\rangle_{q_\phi}&=\int\!\mathrm{d}z\,q_\phi(z|x)f(z) =\int\!\mathrm{d}\epsilon\left|\frac{\partial z}{\partial\epsilon}\right|\,q_\phi(z(\epsilon)|x)\,f(z(\epsilon))\\ &\equiv\int\!\mathrm{d}\epsilon \,p(\epsilon)\,f(z(\epsilon)) =\langle f(z)\rangle_{p_\epsilon} \end{aligned} \ \ \ \ \ (12)

where the Jacobian has been absorbed into the definition of {p(\epsilon)}:

\displaystyle p(\epsilon)\equiv J_\phi(x)\,q_\phi(z|x)~, \quad\quad J_\phi(x)\equiv\left|\frac{\partial z}{\partial\epsilon}\right|~. \ \ \ \ \ (13)

Consequently, the Jacobian would contribute to the second term of the KL divergence via

\displaystyle \ln q_\phi(z|x)=\ln p(\epsilon)-\ln J_\phi(x)~. \ \ \ \ \ (14)

Operationally however, the reparametrization trick simply amounts to performing the requisite sampling on an additional input layer for {\epsilon} instead of on {Z}; this is nicely illustrated in both fig. 74 of Mehta et al. [1] and fig. 4 of Doersch [2]. In practice, this means that the analytical tractability of the Jacobian is a non-issue, since the change of variables is performed downstream of the KL divergence layer—see the implementation details below. The upshot is that while the above may seem complicated, it makes the calculation of the gradient tractable via standard backpropagation.


Having fleshed-out the mathematical framework underlying VAEs, how do we actually build one? Let’s summarize the necessary ingredients, layer-by-layer along the flow from observation space to latent space and back (that is, {X\rightarrow Z\rightarrow X'\!\simeq\!X}), with the Keras API in mind:

  • We need an input layer, representing the data {X}.
  • We connect this input layer to an encoder, {q_\phi(z|x)}, that maps data into the latent space {Z}. This will be a NN with an arbitrary number of layers, which outputs the parameters {\phi} of the distribution (e.g., the mean and standard deviation, {\phi\in\{\mu,\sigma\}} if {q_\phi} is Gaussian).
  • We need a special KL-divergence layer, to compute the second term in the cost function (8) and add this to the model’s loss function (e.g., the Keras loss). This takes as inputs the parameters {\phi} produced by the encoder, and our Gaussian ansatz (2) for the prior {p(z)}.
  • We need another input layer for the independent distribution {\epsilon}. This will be merged with the parameters {\phi} output by the encoder, and in this way automatically integrated into the model’s loss function.
  • Finally, we feed this merged layer into a decoder, {p_\theta(x|z)}, that maps the latent space back to {X}. This is generally another NN with as many layers as the encoder, which relies on the learned parameters {\theta} of the generative model.

At this stage of the aforementioned research project, it’s far too early to tell whether such a VAE will ultimately be useful for accomplishing our goal. If so, I’ll update this post with suitable links to paper(s), etc. But regardless, the variational inference procedure underling VAEs is interesting in its own right, and I’d like to close by discussing some of the physical connections to which I alluded above in greater detail.

Deeper connections

The following was largely inspired by the exposition in Mehta et al. [1], though we have endeavored to modify the notation for clarity/consistency. In particular, be warned that what these authors call the “free energy” is actually a dimensionless free energy, which introduces an extra factor of {\beta} (cf. eq. (158) therein); we shall instead stick to standard conventions, in which the mass dimension is {[F]=[E]=[\beta^{-1}]=1}. Of course, we’re eventually going to set {\beta=1} anyway, but it’s good to set things straight.

Consider a system of interacting degrees of freedom {s\in\{x,z\}}, with parameters {\theta} (e.g., {\theta\in\{\mu,\sigma\}} for Gaussians, or would parametrize the couplings {J_{ij}} between spins {s_i} in the Ising model). We may assign an energy {E(s;\theta)=E(x,z;\theta)} to each configuration, such that the probability {p(s;\theta)=p_\theta(x,z)} of finding the system in a given state at temperature {\beta^{-1}} is

\displaystyle p_\theta(x,z)=\frac{1}{Z[\theta]}e^{-\beta E(x,z;\theta)}~, \ \ \ \ \ (15)

where the partition function with respect to this ensemble is

\displaystyle Z[\theta]=\sum_se^{-\beta E(s;\theta)}~, \ \ \ \ \ (16)

where the sum runs over both {x} and {z}. As the notation suggests, we have in mind that {p_\theta(x,z)} will serve as our latent-variable model, in which {x,z} respectively take on the meanings of visible and latent degrees of freedom as above. Upon marginalizing over the latter, we recover the partition function (4) for {\mathrm{dim}(Z)} finite:

\displaystyle p_\theta(x)=\sum_z\,p_\theta(x,z)=\frac{1}{Z[\theta]}\sum_z e^{-\beta E(x,z;\theta)} \equiv\frac{1}{Z[\theta]}e^{-\beta E(x;\theta)}~, \ \ \ \ \ (17)

where in the last step, we have defined the marginalized energy function {E(x;\theta)} that encodes all interactions with the latent variables; cf. eq. (15) of our post on RBMs.

The above implies that the posterior probability {p(z|x)} of finding a particular value of {z\in Z}, given the observed value {x\in X} (i.e., the encoder) can be written as

\displaystyle p_\theta(z|x) =\frac{p_\theta(x,z)}{p_\theta(x)} =e^{-\beta E(x,z;\theta)+\beta E(x;\theta)} \equiv e^{-\beta E(z|x;\theta)} \ \ \ \ \ (18)


\displaystyle E(z|x;\theta) \equiv E(x,z;\theta)-E(x;\theta) \ \ \ \ \ (19)

is the hamiltonian that describes the interactions between {x} and {z}, in which the {z}-independent contributions have been subtracted off; cf. the difference between eq. (12) and (15) here. To elucidate the variational inference procedure however, it will be convenient to re-express the conditional distribution as

\displaystyle p_\theta(z|x)=\frac{1}{Z_p}e^{-\beta E_p} \ \ \ \ \ (20)

where we have defined {Z_p} and {E_p} such that

\displaystyle p_\theta(x)=Z_p~, \qquad\mathrm{and}\qquad p_\theta(x,z)=e^{-\beta E_p}~. \ \ \ \ \ (21)

where the subscript {p=p_\theta(z|x)} will henceforth be used to refer to the posterior distribution, as opposed to either the joint {p(x,z)} or prior {p(x)} (this to facilitate a more compact notation below). Here, {Z_p=p_\theta(x)} is precisely the partition function we encountered in (4), and is independent of the latent variable {z}. Statistically, this simply reflects the fact that in (20), we weight the joint probabilities {p(x,z)} by how likely the condition {x} is to occur. Meanwhile, one must be careful not to confuse {E_p} with {E(z|x;\theta)} above. Rather, comparing (21) with (15), we see that {E_p} represents a sort of renormalized energy, in which the partition function {Z[\theta]} has been absorbed.

Now, in thermodynamics, the Helmholtz free energy is defined as the difference between the energy and the entropy (with a factor of {\beta} for dimensionality) at constant temperature and volume, i.e., the work obtainable from the system. More fundamentally, it is the (negative) log of the partition function of the canonical ensemble. Hence for the encoder (18), we write

\displaystyle F_p[\theta]=-\beta^{-1}\ln Z_p[\theta]=\langle E_p\rangle_p-\beta^{-1} S_p~, \ \ \ \ \ (22)

where {\langle\ldots\rangle_p} is the expectation value with respect to {p_\theta(z|x)} and marginalization over {z} (think of these as internal degrees of freedom), and {S_p} is the corresponding entropy,

\displaystyle S_p=-\sum_zp_\theta(z|x)\ln p_\theta(z|x) =-\langle\ln p_\theta(z|x)\rangle_p~. \ \ \ \ \ (23)

Note that given the canonical form (18), the equivalence of these expressions for {F_p} — that is, the second equality in (22) — follows immediately from the definition of entropy:

\displaystyle S_p=\sum_z p_\theta(z|x)\left[\beta E_p+\ln Z_p\right] =\beta\langle E_p\rangle_p+\ln Z_p~, \ \ \ \ \ (24)

where, since {Z_p} has no explicit dependence on the latent variables, {\langle\ln Z_p\rangle_p=\langle1\rangle_p\ln Z_p=\ln Z_p}. As usual, this partition function is generally impossible to calculate. To circumvent this, we employ the strategy introduced above, namely we approximate the true distribution {p_\theta(z|x)} by a so-called variational distribution {q(z|x;\phi)=q_\phi(z|x)}, where {\phi} are the variational (e.g., coupling) parameters that define our ansatz. The idea is of course that {q} should be computationally tractable while still capturing the essential features. As alluded above, this is the reason these autoencoders are called “variational”: we’re eventually going to vary the parameters {\phi} in order to make {q} as close to {p} as possible.

To quantify this procedure, we define the variational free energy (not to be confused with the Helmholtz free energy (22)):

\displaystyle F_q[\theta,\phi]=\langle E_p\rangle_q-\beta^{-1} S_q~, \ \ \ \ \ (25)

where {\langle E_p\rangle_q} is the expectation value of the energy corresponding to the distribution {p_\theta(z|x)} with respect to {q_\phi(z|x)}. While the variational energy {F_q} has the same form as the thermodynamic definition of Helmholtz energy {F_p}, it still seems odd at first glance, since it no longer enjoys the statistical connection to a canonical partition function. To gain some intuition for this quantity, suppose we express our variational distribution in the canonical form, i.e.,

\displaystyle q_\phi(z|x)=\frac{1}{Z_q}e^{-\beta E_q}~, \quad\quad Z_q[\phi]=\sum_ze^{-\beta E_q(x,z;\phi)}~, \ \ \ \ \ (26)

where we have denoted the energy of configurations in this ensemble by {E_q}, to avoid confusion with {E_p}, cf. (18). Then {F_q} may be written

\displaystyle \begin{aligned} F_q[\theta,\phi]&=\sum_z q_\phi(x|z)E_p-\beta^{-1}\sum_z q_\phi(x|z)\left[\beta E_q+\ln Z_q\right]\\ &=\langle E_p(\theta)-E_q(\phi)\rangle_q-\beta^{-1}\ln Z_q[\phi]~. \end{aligned} \ \ \ \ \ (27)

Thus we see that the variational energy is indeed formally akin to the Helmholtz energy, except that it encodes the difference in energy between the true and approximate configurations. We can rephrase this in information-theoretic language by expressing these energies in terms of their associated ensembles; that is, we write {E_p=-\beta^{-1}\left(\ln p+\ln Z_p\right)}, and similarly for {q}, whereupon we have

\displaystyle F_q[\theta,\phi]=\beta^{-1}\sum_z q_\phi(z|x)\ln\frac{q_\phi(z|x)}{p_\theta(z|x)}-\beta^{-1}\ln Z_p[\theta]~, \ \ \ \ \ (28)

where the {\ln Z_q} terms have canceled. Recognizing (5) and (21) on the right-hand side, we therefore find that the difference between the variational and Helmholtz free energies is none other than the KL divergence,

\displaystyle F_q[\theta,\phi]-F_p[\theta]=\beta^{-1}D_z\left(q_\phi(z|x)\,||\,p_\theta(z|x)\right)\geq0~, \ \ \ \ \ (29)

which is precisely (7)! (It is perhaps worth stressing that this follows directly from (24), independently of whether {q(z|x)} takes canonical form).

As stated above, our goal in training the VAE is to make the variational distribution {q} as close to {p} as possible, i.e., minimizing the KL divergence between them. We now see that physically, this corresponds to a variational problem in which we seek to minimize {F_q} with respect to {\phi}. In the limit where we perfectly succeed in doing so, {F_q} has obtained its global minimum {F_p}, whereupon the two distributions are identical.

Finally, it remains to clarify our implementation-based definition of {F_q} given in (8) (where {\beta=1}). Applying Bayes’ rule, we have

\displaystyle \begin{aligned} F_q&=-\langle\ln p(x|z)\rangle_q+D_z\left(q(z|x)\,||\,p(z)\right) =-\left<\ln\frac{p(z|x)p(x)}{p(z)}\right>_q+\langle\ln q(z|x)-\ln p(z)\rangle_q\\ &=-\langle\ln p(z|x)p(x)\rangle_q+\langle\ln q(z|x)\rangle_q =-\langle\ln p(x,z)\rangle_q-S_q~, \end{aligned} \ \ \ \ \ (30)

which is another definition of {F_q} sometimes found in the literature, e.g., as eq. (172) of Mehta et al. [1]. By expressing {p(x,z)} in terms of {E_p} via (20), we see that this is precisely equivalent to our more thermodynamical definition (24). Alternatively, we could have regrouped the posteriors to yield

\displaystyle F_q=\langle\ln q(z|x)-\ln p(z|x)\rangle_q-\langle\ln p(x)\rangle_q =D\left(q(z|x)\,||\,p(z|x)\right)+F_p~, \ \ \ \ \ (31)

where the identification of {F_p} follows from (20). Of course, this is just (28) again, which is a nice check on internal consistency.


  1. The review by Mehta et al., A high-bias, low-variance introduction to Machine Learning for physicists is absolutely perfect for those with a physics background, and the accompanying Jupyter notebook on VAEs in Keras for the MNIST dataset was especially helpful for the implementation bits above. The latter is a more streamlined version of this blog post by Louis Tiao.
  2. Doersch has written a Tutorial on Autoencoders, which I found helpful for gaining some further intuition for the mapping between theory and practice.
Posted in Minds & Machines | 2 Comments

Black hole entropy: the heat kernel method

As part of my ongoing love affair with black holes, I’ve been digging more deeply into what it means for them to have entropy, which of course necessitates investigating how this is assigned in the first place. This is a notoriously confusing issue — indeed, one which lies at the very heart of the firewall paradox — which is further complicated by the fact that there are a priori three distinct physical entropies at play: thermodynamic, entanglement, and gravitational. (Incidentally, lest my previous post on entropy cause confusion, let me stress that said post dealt only with the relation between thermodynamic and information-theoretic a.k.a. Shannon entropy, at a purely classical level: neither entanglement nor gravity played any role there. I also didn’t include the Shannon entropy in the list above, because — as explained in the aforementioned post — this isn’t an objective/physical entropy in the sense of the other three; more on this below.)

My research led me to a review on black hole entanglement entropy by Solodukhin [1], which is primarily concerned with the use of the conical singularity method (read: replica trick) to isolate the divergences that arise whenever one attempts to compute entanglement entropy in quantum field theory. The structure of these divergences turns out to provide physical insight into the nature of this entropy, and sheds some light on the relation to thermodynamic/gravitational entropy as well, so these sorts of calculations are well-worth understanding in detail.

While I’ve written about the replica trick at a rather abstract level before, for present purposes we must be substantially more concrete. To that end, the main technical objective of this post is to elucidate one of the central tools employed by these computations, known as the heat kernel method. This is a rather powerful method with applications scattered throughout theoretical physics, notably the calculation of 1-loop divergences and the study of anomalies. Our exposition will mostly follow the excellent pedagogical review by Vassilevich [2]. Before diving into the details however, let’s first review the replica trick à la [1], in order to see how the heat kernel arises in the present context.

Consider some quantum field {\psi(X)} in {d}-dimensional Euclidean spacetime, where {X^\mu=\{\tau,x,z^i\}} with {i=1,\ldots,d\!-\!2}. The Euclidean time {\tau} is related to Minkowski time {t=-i\tau} via the usual Wick rotation, and we have singled-out one of the transverse coordinates {x} for reasons which will shortly become apparent. For simplicity, let us consider the wavefunction for the vacuum state, which we prepare by performing the path integral over the lower ({\tau\leq0}) half of Euclidean spacetime with the boundary condition {\psi(\tau=0,x,z)=\psi_0(x,z)}:

\displaystyle \Psi[\psi_0(x)]=\int_{\psi(0,x,z)=\psi_0(x,z)}\!\mathcal{D}\psi\,e^{-W[\psi]}~, \ \ \ \ \ (1)

where we have used {W} to denote the Euclidean action for the matter field, so as to reserve {S} for the entropy.

Now, since we’re interested in computing the entanglement entropy of a subregion, let us divide the {\tau=0} surface into two halves, {x\!<\!0} and {x\!>\!0}, by defining a codimension-2 surface {\Sigma} by the condition {x=0}, {\tau=0}. Correspondingly, let us denote the boundary data

\displaystyle \psi_0(x,z)\equiv \begin{cases} \psi_-(x,z) \;\; & x<0\\ \psi_+(x,z) \;\; & x>0~. \end{cases} \ \ \ \ \ (2)

The reduced density matrix that describes the {x\!>\!0} subregion of the vacuum state is then obtained by tracing over the complementary set of boundary fields {\psi_-}. In the Euclidean path integral, this corresponds to integrating out {\psi_-} over the entire spacetime, but with a cut from negative infinity to {\Sigma} along the {\tau\!=\!0} surface (i.e., along {x\!<\!0}). We must therefore impose boundary conditions for the remaining field {\psi_+} as this cut is approached from above ({\psi_+^1}) and below ({\psi_+^2}). Hence:

\displaystyle \rho(\psi_+^1,\psi_+^2)=\int\!\mathcal{D}\psi_-\Psi[\psi_+^1,\psi_-]\Psi[\psi_+^2,\psi_-]~. \ \ \ \ \ (3)

Formally, this object simply says that the transition elements {\langle\psi_+^2|\rho|\psi_+^1\rangle} are computed by performing the path integral with the specified boundary conditions along the cut.

Unfortunately, explicitly computing the von Neumann entropy {S=-\mathrm{tr}\rho\ln\rho} is an impossible task for all but the very simplest systems. Enter the replica trick. The basic idea is to consider the {n}-fold cover of the above geometry, introduce a conical deficit at the boundary of the cut {\Sigma}, and then differentiate with respect to the deficit angle, whereupon the von Neumann entropy is recovered in the appropriate limit. To see this in detail, it is convenient to represent the {(\tau,x)} subspace in polar coordinates {(r,\phi)}, where {\tau=r\sin\phi} and {x=r\cos\phi}, such that the cut corresponds to {\phi=2\pi k} for {k=1,\ldots,n}. In constructing the {n}-sheeted cover, we glue sheets along the cut such that the fields are smoothly continued from {\psi_+^{1,2}\big|_k} to {\psi_+^{1,2}\big|_{k+1}}. The resulting space is technically a cone, denoted {C_n}, with angular deficit {2\pi(1-n)} at {\Sigma}, on which the partition function for the fields is given by

\displaystyle Z[C_n]=\mathrm{tr}\rho^n~, \ \ \ \ \ (4)

where {\rho^n} is the {n^\mathrm{th}} power of the density matrix (3). At this point in our previous treatment of the replica trick, we introduced the {n^\mathrm{th}} Rényi entropy

\displaystyle S_n=\frac{1}{1-n}\ln\mathrm{tr}\rho^n~, \ \ \ \ \ (5)

which one can think of as the entropy carried by the {n} copies of the chosen subregion, and showed that the von Neumann entropy is recovered in the limit {n\rightarrow1}. Equivalently, we may express the von Neumann entropy directly as

\displaystyle S(\rho)=-(n\partial_n-1)\ln\mathrm{tr}\rho^n\big|_{n=1}~. \ \ \ \ \ (6)


\displaystyle -(n\partial_n-1)\ln\mathrm{tr}\,\rho^n\big|_{n=1} =-\left[\frac{n}{\mathrm{tr}\rho^n}\,\mathrm{tr}\!\left(\rho^n\ln\rho\right)-\ln\mathrm{tr}\rho^n\right]\bigg|_{n=1} =-\mathrm{tr}\left(\rho\ln\rho\right)~, \ \ \ \ \ (7)

where in the last step we took the path integral to be appropriately normalized such that {\mathrm{tr}\rho=1}, whereupon the second term vanishes. Et voilà! As I understand it, the aforementioned “conical singularity method” is essentially an abstraction of the replica trick to spacetimes with conical singularities. Hence, re-purposing our notation, consider the effective action {W[\alpha]=-\ln Z[C_\alpha]} for fields on a Euclidean spacetime with a conical singularity at {\Sigma}. The cone {C_\alpha} is defined, in polar coordinates, by making {\phi} periodic with period {2\pi\alpha}, and taking the limit in which the deficit {(1-\alpha)\ll1}. The entanglement entropy for fields on this background is then

\displaystyle S=(\alpha\partial_\alpha-1)W[\alpha]\big|_{\alpha=1}~. \ \ \ \ \ (8)

As a technical aside: in both these cases, there is of course the subtlety of analytically continuing the parameter {n} resp. {\alpha} to non-integer values. We’ve discussed this issue before in the context of holography, where one surmounts this by instead performing the continuation in the bulk. We shall not digress upon this here, except to note that the construction relies on an abelian (rotational) symmetry {\phi\rightarrow\phi+w}, where {w} is an arbitrary constant. This is actually an important constraint to bear in mind when attempting to infer general physical lessons from our results, but we’ll address this caveat later. Suffice to say that given this assumption, the analytical continuation can be uniquely performed without obstruction; see in particular section 2.7 of [1] for details.

We have thus obtained, in (8), an expression for the entanglement entropy in terms of the effective action in the presence of a conical singularity. And while this is all well and pretty, in order for this expression to be of any practical use, we require a means of explicitly computing {W}. It is at this point that the heat kernel enters the game. The idea is to represent the Green function — or rather, the (connected) two-point correlator — as an integral over an auxiliary “proper time” {s} of a kernel satisfying the heat equation. This enables one to express the effective action as

\displaystyle W=-\frac{1}{2}\int_0^\infty\!\frac{\mathrm{d} s}{s}K(s,D)~, \ \ \ \ \ (9)

where {K(s,D)} is the (trace of the) heat kernel for the Laplacian operator {D}. We’ll now spend some time unpacking this statement, which first requires that we review some basic facts about Green functions, propagators, and all that.

Consider a linear differential operator {L_x=L(x)} acting on distributions with support on {\mathbb{R}^n}. If {L_x} admits a right inverse, then the latter defines the Green function {G(x,x')} as the solution to the inhomogeneous differential equation

\displaystyle L_xG(x,x')=\delta^n(x-x')~, \ \ \ \ \ (10)

where {x,\,x'} represent vectors in {\mathbb{R}^n}. We may also define the kernel of {L_x} as the solution to the homogeneous differential equation

\displaystyle L_xK(x,x')=0~. \ \ \ \ \ (11)

The Green function is especially useful for solving linear differential equations of the form

\displaystyle L_xu(x)=f(x)~. \ \ \ \ \ (12)

To see this, simply multiply both sides of (10) by {f(x')} and integrate w.r.t {x'}; by virtue of the delta function (and the fact that {L_x} is independent of {x'}), one identifies

\displaystyle u(x)=\int\!\mathrm{d} x'G(x,x')f(x')~. \ \ \ \ \ (13)

The particular boundary conditions we impose on {u(x)} then determine the precise form of {G} (e.g., retarded vs. advanced Green functions in QFT).

As nicely explained in this Stack Exchange answer, the precise relation to propagators is ambiguous, because physicists use this term to mean either the Green function or the kernel, depending on context. For example, the Feynman propagator {\Delta_F(x-x')} for a scalar field {\phi(x)} is a Green function for the Klein-Gordon operator, since it satisfies the equation

\displaystyle (\square_x+m^2)\Delta_F(x-x')=\delta^n(x-x')~. \ \ \ \ \ (14)

In contrast, the corresponding Wightman functions

\displaystyle G^+(x,x')=\langle0|\phi(x)\phi(x')|0\rangle~,\quad\quad G^-(x,x')=\langle0|\phi(x')\phi(x)|0\rangle~, \ \ \ \ \ (15)

are kernels for this operator, since they satisfy

\displaystyle (\square_x+m^2)G^{\pm}(x,x')=0~. \ \ \ \ \ (16)

The reader is warmly referred to section 2.7 of Birrell & Davies classic textbook [3] for a more thorough explanation of Green functions in this context (see also part 1 of my QFT in curved space sequence).

In the present case, we shall be concerned with the heat kernel

\displaystyle K(s;x,y;D)=\langle x|e^{-sD}|y\rangle~, \ \ \ \ \ (17)

where {D} is some Laplacian operator, by which we mean that it admits a local expression of the form

\displaystyle D=-\left(g^{\mu\nu}\partial_\mu\partial_\nu+a^\mu\partial_\mu+b\right) =-\left(g^{\mu\nu}\nabla_\mu\nabla_\nu+E\right)~, \ \ \ \ \ (18)

for some matrix-valued functions {a^\mu~,b}. In the second equality, we’ve written the operator in the so-called canonical form for a Laplacian operator on a vector bundle, where {E} is an endomorphism on the bundle over the manifold, and the covariant derivative {\nabla} includes both the familiar Riemann part as well as the contribution from the gauge (bundle) part; the field strength for the latter will be denoted {\Omega_{\mu\nu}}. Fibre bundles won’t be terribly important for our purposes, but we’ll need some of this notation later; see section 2.1 of [2] for details.

The parameter {s} in (17) is some auxiliary Euclidean time variable (note that if we were to take {s=it}, the right-hand side of (17) would correspond to the transition amplitude {\langle x|U|y\rangle} for some unitary operator {U}). {K} is so-named because it satisfied the heat equation,

\displaystyle (\partial_s+D_x)K(s;x,y;D)=0~, \ \ \ \ \ (19)

with the initial condition

\displaystyle K(0;x,y;D)=\delta(x,y)~, \ \ \ \ \ (20)

where the subscript on {D_x} is meant to emphasize the fact that the operator {D} acts only on the transverse variables, not on the auxiliary time {s}. Our earlier claim that the Green function can be expressed in terms of an integral over the latter is then based on the observation that one can invert (17) to obtain the propagator

\displaystyle D^{-1}=\int_0^\infty\!\mathrm{d} s\,K(s;x,y;D)~. \ \ \ \ \ (21)

Note that here, “propagator” indeed refers to the Green function of the field in the path integral representation, insofar as the latter serves as the generating functional for the former. Bear with me a bit longer as we elucidate this last claim, as this will finally bring us full-circle to (9)

Denote the Euclidean path integral for the fields on some fixed background by

\displaystyle Z[J]=\int\!\mathcal{D}\psi\,e^{-W[\psi,J]}~, \ \ \ \ \ (22)

where {J} is the source for the matter field {\psi}. The heat kernel method applies to one-loop calculations, in which case it suffices to expand the action to quadratic order in fluctuations around the classical saddle-point {S_\mathrm{cl}}, whence the Gaussian integral may be written

\displaystyle Z[J]=e^{-S_\mathrm{cl}}\,\mathrm{det}^{-1/2}(D)\,\mathrm{exp}\left(\frac{1}{4}JD^{-1}J\right)~. \ \ \ \ \ (23)

(We’re glossing over some mathematical caveats/assumptions here, notably that {D} be self-adjoint w.r.t. to the scalar product of the fields; see [2] for details). Thus we see that taking two functional derivatives w.r.t. the source {J} brings down the operator {D^{-1}}, thereby identifying it with the two-point correlator for {\psi},

\displaystyle G(x,y)=\langle\psi(x)\psi(y)\rangle=\frac{1}{Z[0]}\left.\frac{\delta^2 Z[J]}{\delta J(x)\delta J(y)}\right|_{J=0}=D^{-1}~, \ \ \ \ \ (24)

which is trivially (by virtue of the far r.h.s.) a Green function of {D} in the sense of (10).

We now understand how to express the Green function for the operator {D} in terms of the heat kernel. But we’re after the connected two-point correlator (sometimes called the connected Green function), which encapsulates the one-loop contributions. Recall that the connected Feynman diagrams are generated by the effective action {W} introduced above. After properly normalizing, we have only the piece which depends purely on {D}:

\displaystyle W=\frac{1}{2}\ln\mathrm{det}(D)~. \ \ \ \ \ (25)

Vassilevich [2] then provides a nice heuristic argument that relates this to the heat kernel (as well as a more rigorous treatment from spectral theory, for the less cavalier among you), which relies on the identity

\displaystyle \ln\lambda=-\int_0^\infty\!\frac{\mathrm{d} s}{s}e^{-s\lambda}~, \ \ \ \ \ (26)

for {\lambda>0} (strictly speaking, this identity is only correct up to a constant, but we may normalize away this inconvenience anyway; did I mention I’d be somewhat cavalier?). We then apply this identity to every (positive) eigenvalue {\lambda} of {D}, whence

\displaystyle W=\frac{1}{2}\ln\mathrm{det}(D)=\frac{1}{2}\mathrm{tr}\ln D =-\frac{1}{2}\int\!\frac{\mathrm{d} s}{s}\mathrm{tr}\left(e^{-sD}\right) =-\frac{1}{2}\int\!\frac{\mathrm{d} s}{s}K(s,D)~, \ \ \ \ \ (27)


\displaystyle K(s,D)\equiv \mathrm{tr}\left(e^{-sD}\right) =\int\!\mathrm{d}^dx\sqrt{g}\,\langle x|e^{-sD}|x\rangle =\int\!\mathrm{d}^dx\sqrt{g}\,K(s;x,x;D)~. \ \ \ \ \ (28)

Let us pause to take stock: so far, we’ve merely elucidated eq. (9). And while this expression itself is valid in general (not just for manifolds with conical singularities) the physical motivation for this post was the investigation of divergences in the entanglement entropy (8). And indeed, the expression for the effective action (27) is divergent at both limits! In the course of regulating this behaviour, we shall see that the UV divergences in the entropy (8) are captured by the so-called heat kernel coefficients.

To proceed, we shall need the fact that on manifolds without boundaries (or else with suitable local boundary conditions on the fields), {K(s,D)} — really, the self-adjoint operator {D} — admits an asymptotic expansion of the form

\displaystyle K(s,D)=\mathrm{tr}\left(e^{-sD}\right)\simeq (4\pi s)^{-d/2}\sum_{k\geq0}a_k(D)s^{k}~, \ \ \ \ \ (29)

cf. eq. (67) of [1]. A couple technical remarks are in order. First, recall that in contrast to a convergent series — which gives finite results for arbitrary, fixed {s} in the limit {k\rightarrow\infty} — an asymptotic series gives finite results for fixed {k} in the limit {s^{-1}\rightarrow\infty}. Second, we are ignoring various subtleties regarding the rigorous definition of the trace, wherein both (29) and (28) are properly defined via the use of an auxiliary function; cf. eq. (2.21) of [2] (n.b., Vassilevich’s coefficients do not include the normalization {(4\pi)^{-d/2}} from the volume integrals; see below).

The most important property of the heat kernel coefficients {a_k} is that they can be expressed as integrals of local invariants—tensor quantities which remain invariant under local diffeomorphisms; e.g., the Riemann curvature tensor and covariant derivatives thereof. Thus the first step in the procedure for calculating the heat kernel coefficients is to write down the integral over all such local invariants; for example, the first three coefficients are

\displaystyle \begin{aligned} a_0(D)=\int_M\!\mathrm{d}^dx\sqrt{g}&\,\alpha_0~,\\ a_1(D)=\frac{1}{6}\int_M\!\mathrm{d}^dx\sqrt{g}\,&\left(\alpha_1E+\alpha_2R\right)~,\\ a_2(D)=\frac{1}{360}\int_M\!\mathrm{d}^dx\sqrt{g}\,&\left(\alpha_3\nabla E+\alpha_4RE+\alpha_5E^2+\alpha_6\nabla R\right.\\ &\left.+\alpha_7R^2+\alpha_8R_{ij}+\alpha_9R_{ijkl}^2+\alpha_{10}\Omega_{ij}^2\right)~,\\ \end{aligned} \ \ \ \ \ (30)

where {E} and {\Omega_{ij}} were introduced in (18). A word of warning, for those of you cross-referencing with Solodukhin [1] and Vassilevich [2]: these coefficients correspond to eqs. (4.13) – (4.15) in [2], except that we have already included the normalization {(4\pi)^{-d/2}} in our expansion coefficients (29), consistent with [1]. Additionally, note that Vassilevich’s coefficients are labeled with even integers, while ours/Solodukhin’s include both even and odd—cf. (2.21) in [2]. The reason for this discrepancy is that all odd coefficients in Vassilevich’s original expansion vanish, as a consequence of the fact that there are no odd-dimensional invariants on manifolds without boundary; Solodukhin has simply relabeled the summation index for cleanliness, and we have followed his convention in (30).

It now remains to calculate the constants {\alpha_i}. This is a rather involved technical procedure, but is explained in detail in section 4.1 of [2]. One finds

\displaystyle \begin{aligned} \alpha_1=6~,\;\; \alpha_2=1~,\;\; \alpha_3=60~,\;\; \alpha_4=60~,\;\; \alpha_5=180~,\\ \alpha_6=12~,\;\; \alpha_7=5~,\;\; \alpha_8=-2~,\;\; \alpha_9=2~,\;\; \alpha_{10}=30~. \end{aligned} \ \ \ \ \ (31)

Substituting these into (30), and doing a bit of rearranging, we have

\displaystyle \begin{aligned} a_0(D)&=\int_M1~,\\ a_1(D)&=\int_M\left(E+\tfrac{1}{6}R\right)~,\\ a_2(D)&=\int_M\left[\tfrac{1}{180}R_{ijkl}^2-\tfrac{1}{180}R_{ij}^2+\tfrac{1}{6}\nabla^2\left(E+\tfrac{1}{5}R\right)+\tfrac{1}{2}\left(E+\tfrac{1}{6}R\right)^2\right]~, \end{aligned} \ \ \ \ \ (32)

where we have suppressed the integration measure for compactness, i.e., {\int_M=\int_M\mathrm{d}^dx\sqrt{g}\,}; we have also set the gauge field strength {\Omega_{ij}=0} for simplicity, since we will consider only free scalar fields below. These correspond to what Solodukhin refers to as regular coefficients, cf. his eq. (69). If one is working on a background with conical singularities, then there are additional contributions from the singular surface {\Sigma} [1]:

\displaystyle \begin{aligned} a_0^\Sigma(D)=&0~,\\ a_1^\Sigma(D)=&\frac{\pi}{3}\frac{(1-\alpha)(1+\alpha)}{\alpha}\int_\Sigma1~,\\ a_2^\Sigma(D)=&\frac{\pi}{3}\frac{(1-\alpha)(1+\alpha)}{\alpha}\int_\Sigma\left(E+\tfrac{1}{6}R\right)\\ &-\frac{\pi}{180}\frac{(1-\alpha)(1+\alpha)(1+\alpha^2)}{\alpha^3}\int_\Sigma\left(R_{ii}+2R_{ijij}\right)~, \end{aligned} \ \ \ \ \ (33)

where in the last expression {R_{ii}=R_{\mu\nu}n^\mu_in^\nu_i} and {R_{ijij}=R_{\mu\nu\rho\sigma}n^\mu_in^\nu_jn^\rho_in^\sigma_j}, where {n^k=n^\mu_k\partial_\mu} are orthonormal vectors orthogonal to {\Sigma}. In this case, if the manifold {M} in (32) is the {n}-fold cover {C_\alpha} constructed above, then the Riemannian curvature invariants are actually computed on the regular points {C_\alpha/\Sigma}, and are related to their flat counterparts by eq. (55) of [1]. Of course, here and in (33), {\alpha} refers to the conical deficit {2\pi(1-\alpha)}, not to be confused with the constants {\alpha_i} in (31).

Finally, we are in position to consider some of the physical applications of this method discussed in the introduction to this post. As a warm-up to the entanglement entropy of black holes, let’s first take the simpler case of flat space. Despite the lack of conical singularities in this completely regular, seemingly boring spacetime, the heat kernel method above can still be used to calculate the leading UV divergences in the entanglement entropy. While in this very simple case, there are integral identities that make the expansion into heat kernel coefficients unnecessary (specifically, the Sommerfeld formula employed in section 2.9 of [1]), the conical deficit method is more universal, and will greatly facilitate our treatment of the black hole below.

Consider a free scalar field with {\mathcal{D}=-(\nabla^2+X)}, where {X} is some scalar function (e.g., for a massive non-interacting scalar, {X=-m^2)}, on some background spacetime {E_\alpha} in {d>2} dimensions, with a conical deficit at the codimension-2 surface {\Sigma}. The leading UV divergence (we don’t care about the regular piece) in the entanglement entropy across this {(d\!-\!2)}-dimensional surface may be calculated directly from the coefficient {a_1^\Sigma} above. To this order in the expansion, the relevant part of {W} is

\displaystyle \begin{aligned} W[\alpha]&\simeq-\frac{1}{2}\int_{\epsilon^2}^\infty\!\frac{\mathrm{d} s}{s}(4\pi s)^{-d/2}\left(a_0^\Sigma+a_1^\Sigma s\right) =-\frac{\pi}{6}\frac{(1-\alpha)(1+\alpha)}{\alpha}\int_{\epsilon^2}^{\infty}\!\frac{\mathrm{d} s}{(4\pi s)^{d/2}}\int_\Sigma1\\ &=\frac{-1}{12(d\!-\!2)(4\pi)^{(d-2)/2}}\frac{A(\Sigma)}{\epsilon^{d-2}}\frac{(1-\alpha)(1+\alpha)}{\alpha}~, \end{aligned} \ \ \ \ \ (34)

where we have introduced the UV-cutoff {\epsilon} (which appears as {\epsilon^2} in the lower limit of integration, to make the dimensions of the auxiliary time variable work-out; this is perhaps clearest by examining eq. (1.12) of [2]) and the area of the surface {A(\Sigma)=\int_\Sigma\,}. Substituting this expression for the effective action into eq. (8) for the entropy, we obtain

\displaystyle S_\mathrm{flat}=\frac{1}{6(d\!-\!2)(4\pi)^{(d-2)/2}}\frac{A(\Sigma)}{\epsilon^{d-2}}~. \ \ \ \ \ (35)

which is eq. (81) in [1]. As explained therein, the reason this matches the flat space result — that is, the case in which {\Sigma} is a flat plane — is because even in curved spacetime, any surface can be locally approximated by flat Minkowski space. In particular, we’ll see that this result remains the leading-order correction to the black hole entropy, because the near-horizon region is approximately Rindler (flat). In other words, this result is exact for flat space (hence the equality), but provides only the leading-order divergence for more general, curved geometries.

For concreteness, we’ll limit ourselves to four dimensions henceforth, in which the flat space result above is

\displaystyle S_\mathrm{flat}=\frac{A(\Sigma)}{48\pi\epsilon^2}~. \ \ \ \ \ (36)

In the presence of a black hole, there will be higher-order corrections to this expression. In particular, in {d\!=\!4} we also have a log divergence from the {a_2^\Sigma s^2} term:

\displaystyle \begin{aligned} W[\alpha]&=-\frac{1}{2}\int_{\epsilon^2}^\infty\!\frac{\mathrm{d} s}{s}(4\pi s)^{-d/2}\left(a_0^\Sigma+a_1^\Sigma s+a_2^\Sigma s^2+\ldots\right)\\ &=-\frac{1}{32\pi^2}\int_{\epsilon^2}^\infty\!\mathrm{d} s\left(\frac{a_1^\Sigma}{s^2}+\frac{a_2^\Sigma}{s}+O(s^0)\right) \simeq-\frac{1}{32\pi^2}\left(\frac{a_1^\Sigma}{\epsilon^2}-2a_2^\Sigma\ln\epsilon\right)~, \end{aligned} \ \ \ \ \ (37)

where in the last step, we’ve dropped higher-order terms as well as the log IR divergence, since here we’re only interested in the UV part. From (33), we then see that the {a_2^\Sigma} term results in an expression for the UV divergent part of the black hole entanglement entropy of the form

\displaystyle S_\mathrm{ent}=\frac{A(\sigma)}{48\pi\epsilon^2} -\frac{1}{144\pi}\int_\Sigma\left[6E+R-\frac{1}{5}\left(R_{ii}-2R_{ijij}\right)\right]\ln\epsilon~, \ \ \ \ \ (38)

cf. (82) [1]. Specifying to a particular black hole solution then requires working out the projections of the Ricci and Riemann tensors on the subspace orthogonal to {\Sigma} ({R_{ii}} and {R_{ijij}}, respectively). For the simplest case of a massless, minimally coupled scalar field ({X=0}) on a Schwarzschild black hole background, the above yields

\displaystyle S_\mathrm{ent}=\frac{A(\Sigma)}{48\pi\epsilon^2}+\frac{1}{45}\ln\frac{r_+}{2}~, \ \ \ \ \ (39)

where {r_+} is the horizon radius; see section 3.9.1 of [1] for more details. Note that since the Ricci scalar {R=0}, the logarithmic term represents a purely topological correction to the flat space entropy (36) (in contrast to flat space, the Euler number for a black hole geometry is non-zero). Curvature corrections can still show up in UV-finite terms, of course, but that’s not what we’re seeing here: in this sense the log term is universal.

Note that we’ve specifically labeled the entropy in (39) with the subscript “ent” to denote that this is the entanglement entropy due to quantum fields on the classical background. We now come to the confusion alluded in the opening paragraph of this post, namely, what is the relation between the entanglement entropy of the black hole and either the thermodynamic or gravitational entropies, if any?

Recall that in classical systems, the thermodynamic entropy and the information-theoretic entropy coincide: not merely formally, but ontologically as well. The reason is that the correct probability mass function will be as broadly distributed as possible subject to the physical constraints on the system (equivalently, in the case of statistical inference, whatever partial information we have available). If only the average energy is fixed, then this corresponds to the Boltzmann distribution. Note that this same logic extends to the entanglement entropy as well (modulo certain skeptical reservations to which I alluded before), which is the underlying physical reason why the Shannon and quantum mechanical von Neumann entropies take the same form as well. In simple quantum systems therefore, the entanglement entropy of the fields coincides with this thermodynamic (equivalently, information-theoretic/Shannon) entropy.

More generally however, “thermodynamic entropy” is a statement about the internal microstates of the system, and is quantified by the total change in the free energy {F=\beta^{-1}\ln Z[\beta,g]} w.r.t. temperature {\beta^{-1}}:

\displaystyle S_\mathrm{thermo}=\frac{\mathrm{d} F}{\mathrm{d} T}=-\beta^2\frac{\mathrm{d} F}{\mathrm{d} \beta} =\left(\beta\frac{\mathrm{d}}{\mathrm{d}\beta}-1\right)W_\mathrm{tot}[\beta,g]~, \ \ \ \ \ (40)

where the total effective action {W_\mathrm{tot}[\beta,g]=\ln Z[\beta,g]}. Crucially, observe that here, in contrast to our partition function (22) for fields on a fixed background, the total Euclidean path integral that prepares the state also includes an integration over the metrics:

\displaystyle Z[\beta,g]=\int\!\mathcal{D} g_{\mu\nu}\,\mathcal{D}\psi\,e^{-W_\mathrm{gr}[g]+W_\mathrm{mat}[\psi,g]}~, \ \ \ \ \ (41)

where {W_\mathrm{gr}} represents the gravitational part of the action (e.g., the Einstein-Hilbert term), and {W_\mathrm{mat}} the contribution from the matter fields. (For reference, we’re following the reasoning and notation in section 4.1 of [1] here).

In the information-theoretic context above, introducing a black hole amounts to imposing additional conditions on (i.e., more information about) the system. Specifically, it imposes constraints on the class of metrics in the Euclidean path integral: the existence of a fixed point {\Sigma} of the isometry generated by the Killing vector {\partial_\tau} in the highly symmetric case of Schwarzschild, and suitable asymptotic behaviour at large radius. Hence in computing this path integral, one first performs the integration over matter fields {\psi} on backgrounds with a conical singularity at {\Sigma}:

\displaystyle \int\!\mathcal{D}\psi\,e^{-W_\mathrm{mat}[\psi,g]}=e^{-W[\beta,g]}~, \ \ \ \ \ (42)

where on the r.h.s., {W[\beta,g]} represents the effective action for the fields described above; note that the contribution from entanglement entropy is entirely encoded in this portion. The path integral (41) now looks like this:

\displaystyle Z[\beta,g]=\int\!\mathcal{D} g\,e^{-W_\mathrm{gr}[\psi,g]-W[\beta,g]} \simeq e^{-W_\mathrm{tot}[\beta,\,g(\beta)]}~, \ \ \ \ \ (43)

where {W_\mathrm{tot}} on the far right is the semiclassical effective action obtained from the saddle-point approximation. That is, the metric {g_{\mu\nu}(\beta)} is the solution to

\displaystyle \frac{\delta W_\mathrm{tot}[\beta,g]}{\delta g}=0~, \qquad\mathrm{with}\qquad W_\mathrm{tot}=W_\mathrm{gr}+W \ \ \ \ \ (44)

at fixed {\beta}. Since the saddle-point returns an on-shell action, {g_{\mu\nu}(\beta)} is a regular metric (i.e., without conical singularities). One can think of this as the equilibrium geometry at fixed {\beta} around which the metric {g} fluctuates due to the quantum corrections represented by the second term {W}. Note that the latter represents an off-shell contribution, since it is computed on singular backgrounds which do not satisfy the equations of motion for the metric (44).

To compute the thermodynamic entropy of the black hole, we now plug this total effective action into (40). A priori, this expression involves a total derivative w.r.t. {\beta}, so that we write

\displaystyle S_\mathrm{thermo}= \beta\left(\partial_\beta W_\mathrm{tot}[\beta,g]+\frac{\delta g_{\mu\nu}(\beta)}{\delta\beta}\frac{\delta W_\mathrm{tot}[\beta,g]}{\delta g_{\mu\nu}(\beta)}\right) -W_\mathrm{tot}[\beta,g]~, \ \ \ \ \ (45)

except that due to the equilibrium condition (44), the second term in parentheses vanishes anyway, and the thermodynamic entropy is given by

\displaystyle S_\mathrm{thermo}=\left(\beta\partial_\beta-1\right)W_\mathrm{tot}[\beta,g] =\left(\beta\partial_\beta-1\right)\left(W_\mathrm{gr}+W\right) =S_\mathrm{gr}+S_\mathrm{ent}~. \ \ \ \ \ (46)

This, then, is the precise relationship between the thermodynamic, gravitational, and entanglement entropies (at least for the Schwarzschild black hole). The thermodynamic entropy {S_\mathrm{thermo}} is a statement about the possible internal microstates at equilibrium — meaning, states which satisfy the quantum-corrected Einstein equations (44) — which therefore includes the entropy from possible (regular) metric configurations {S_\mathrm{gr}} as well as the quantum corrections {S_\mathrm{ent}} given by (39).

Note that the famous Bekenstein-Hawking entropy {S_\mathrm{BH}} refers only to the gravitational entropy, i.e., {S_\mathrm{BH}=S_\mathrm{gr}}. This is sometimes referred to as the “classical” part, because it represents the tree-level contribution to the path integral result. That is, if we were to restore Planck’s constant, we’d find that {S_\mathrm{gr}} comes with a {1/\hbar} prefactor, while {S_\mathrm{ent}} is order {h^0}. Accordingly, {S_\mathrm{ent}} is often called the first/one-loop quantum correction to the Bekenstein-Hawking entropy. (Confusion warning: the heat kernel method allowed us to compute {S_\mathrm{ent}} itself to one-loop in the expansion of the matter action, i.e., quadratic order in the source {J}, but the entire contribution {S_\mathrm{ent}} appears at one-loop order in the {\hbar}-expansion.)

However, despite the long-standing effort to elucidate this entropy, it’s still woefully unclear what microscopic (read: quantum-gravitational) degrees of freedom are truly being counted. The path integral over metrics makes it tempting to interpret the gravitational entropy as accounting for all possible geometrical configurations that satisfy the prescribed boundary conditions. And indeed, as we have remarked before, this is how one would interpret the Bekenstein entropy in the absence of Hawking’s famous calculation. But insofar as entropy is a physical quantity, the interpretation of {S_\mathrm{gr}} as owing to the equivalence class of geometries is rather unsatisfactory, since in this case the entropy must be associated to a single black hole, whose geometry certainly does not appear to be in any sort of quantum superposition. The situation is even less clear once one takes normalization into account, whereby — in most situations at least — the gravitational and matter couplings are harmoniously renormalized in such a way that one cannot help but question the fundamental distinction between the two; see [1] for an overview of renormalization in this context.

Furthermore, this entire calculational method relies crucially on the presence of an abelian isometry w.r.t. the Killing vector {\partial_\tau} in Euclidean time, i.e., the rotational symmetry {\phi\rightarrow\phi+w} mentioned above. But the presence of such a Killing isometry is by no means a physical requisite for a given system/region to have a meaningful entropy; things are simply (fiendishly!) more difficult to calculate in systems without such high degrees of symmetry. However, this means that cases in which all three of these entropies can be so cleanly assigned to the black hole horizon may be quite rare. Evaporating black holes are perhaps the most topical example of a non-static geometry that causes headaches. As a specific example in this vein, Netta Engelhardt and collaborators have compellingly argued that it is the apparent horizon, rather than the event horizon, to which one can meaningfully associated a coarse-grained entropy à la Shannon [4,5]. Thus, while we’re making exciting progress, even the partial interpretation for which we’ve labored above should be taken with caution. There is more work to be done!


  1. S. N. Solodukhin, “Entanglement entropy of black holes,” arXiv:1104.3712.
  2. D. V. Vassilevich, “Heat kernel expansion: User’s manual,” arXiv:hep-th/0306138.
  3. N. D. Birrell and P. C. W. Davies, “Quantum Fields in Curved Space,” Cambridge Monographs on Mathematical Physics.
  4. N. Engelhardt, “Entropy of pure state black holes,” Talk given at the Quantum Gravity and Quantum Information workshop at CERN, March 2019.
  5. N. Engelhardt and A. C. Wall, “Decoding the Apparent Horizon: Coarse-Grained Holographic Entropy,” arXiv:1706.02038.
Posted in Physics | 4 Comments

Restricted Boltzmann machines

As a theoretical physicist making their first foray into machine learning, one is immediately captivated by the fascinating parallel between deep learning and the renormalization group. In essence, both are concerned with the extraction of relevant features via a process of coarse-graining, and preliminary research suggests that this analogy can be made rather precise. This is particularly interesting in the context of attempts to develop a theoretical framework for deep learning; insofar as the renormalization group is well-understood in theoretical physics, the strength of this mathematical analogy may pave the way towards a general theory of deep learning. I hope to return to this tantalizing correspondence soon; but first, we need to understand restricted Boltzmann machines (RBMs).

As the name suggests, an RBM is a special case of the Boltzmann machines we’ve covered before. The latter are useful for understanding the basic idea behind energy-based generative models, but it turns out that all-to-all connectivity leads to some practical problems when training such networks. In contrast, an RBM restricts certain connections (hence the name), which makes training them much more efficient. Specifically, the neurons in an RBM are divided into one of two layers, consisting of visible and hidden units, respectively, with intralayer connections prohibited. Visible units essentially correspond to outputs: these are the variables to which the observer has access. In contrast, hidden units interact only through their connection with the visible layer, but in such a way as to encode complex, higher-order interactions between the visible degrees of freedom. This ability to encode such latent variables is what makes RBMs so powerful, and underlies the close connection with the renormalization group (RG).

Latent or hidden variables correspond to the degrees of freedom one integrates out (“marginalizes over”, in machine learning (ML) parlance) when performing RG. When properly performed, this procedure ensures that the relevant physics is encoded in correlations between the remaining (visible) degrees of freedom, while allowing one to work with a much simpler model, analogous to an effective field theory. To understand how this works in detail, suppose we wish to apply Wilsonian RG to the Ising model (a pedagogical review by Dimitri Vvedensky can be found here). To do this, we must first transform the (discrete) Ising model into a (continuous) field theory, so that momentum-space RG techniques can be applied. This is achieved via a trick known as the Hubbard-Stratonovich transformation, which neatly illustrates how correlations between visible/relevant variables can be encoded via hidden/irrelevant degrees of freedom.

The key to the Hubbard-Stratonovich transformation is the observation that for any value of {s},

\displaystyle e^{\frac{1}{2}Ks^2}=\left(\frac{K}{2\pi}\right)^{1/2}\!\int_{-\infty}^\infty\!\mathrm{d}\phi\,\exp\left(-\frac{1}{2}K\phi^2+Ks\phi\right)~. \ \ \ \ \ (1)

Since we’ll be working with a discrete lattice of spins, it’s helpful to write this in matrix notation:

\displaystyle e^{\frac{1}{2}\sum_{ij}K_{ij}s_is_j} =\left[\frac{\mathrm{det}K}{(2\pi)^N}\right]^{1/2}\prod_{k=1}^N\int\!\mathrm{d}\phi_k\exp\left(-\frac{1}{2}\sum_{ij}K_{ij}\phi_i\phi_j+\sum_{ij}K_{ij}s_i\phi_j\right)~, \ \ \ \ \ (2)

where {N} is the total number of spins. Now consider the Ising model with nearest-neighbor interactions, whose hamiltonian is

\displaystyle H=-\frac{1}{2}\sum_{ij}J_{ij}s_is_j~, \ \ \ \ \ (3)

with spins {s_i=\pm1}, and symmetric coupling matrix {J_{ij}=J} if {s_i} and {s_j} are nearest-neighbors, and zero otherwise. The partition function is

\displaystyle Z=\sum_{s_i=\pm1}e^{\frac{1}{2}\sum_{ij}K_{ij}s_js_j}~, \ \ \ \ \ (4)

which is typically solved by the transfer-matrix method. To recast this as a field theory, we observe that the sum over the second term in (2) may be written

\displaystyle \begin{aligned} {}&\sum_{s_i=\pm1}\exp\left(\sum\nolimits_{ij}K_{ij}s_i\phi_j\right) =\sum_{s_i=\pm1}\prod_i\exp\left(\sum\nolimits_{j}K_{ij}s_i\phi_j\right)\\ &=\prod_i\left[\exp\left(\sum\nolimits_{j}K_{ij}\phi_j\right)+\exp\left(-\sum\nolimits_{j}K_{ij}\phi_j\right)\right]\\ &=2^N\prod_i\cosh\left(\sum\nolimits_{j}K_{ij}\phi_j\right) =2^N\exp\left[\sum\nolimits_i\ln\cosh\left(\sum\nolimits_jK_{ij}\phi_j\right)\right]~, \end{aligned} \ \ \ \ \ (5)

where the spin degrees of freedom have been summed over, and we’ve re-exponentiated the result in preparation for its insertion below. This enables us to express the partition function entirely in terms of the latent variable {\phi}:

\displaystyle Z=\left[\frac{\mathrm{det}K}{(\pi/2)^N}\right]^{1/2}\prod_{k=1}^N\int\!\mathrm{d}\phi_k\exp\left(-\frac{1}{2}\sum_{ij}K_{ij}\phi_i\phi_j+\sum\nolimits_i\ln\cosh\sum\nolimits_jK_{ij}\phi_j\right)~, \ \ \ \ \ (6)

where, since the first term in (2) is independent of {s}, the sum over spins simply introduces a prefactor of {2^N}. We have thus obtained an exact transformation from the original, discrete model of spins to an equivalent, continuum field theory. To proceed with renormalization of this model, we refer the interested reader to the aforementioned Vvedensky reference. The remarkable upshot, for our purposes, is that all of the physics of the original spin degrees of freedom are entirely encoded in the new field {\phi}. To connect with our ML language above, one can think of this as a latent variable that mediates the correlations between the visible spins, analogous to how UV degrees of freedom give rise to effective couplings at low-energies.

So what does this have to do with (restricted) Boltzmann machines? This is neatly explained in the amazing review by Pankaj Mehta et al., A high-bias, low-variance introduction to Machine Learning for physicists, which receives my highest recommendations. The idea is that by including latent variables in generative models, one can encode complex interactions between visible variables without sacrificing trainability (because the correlations between visible degrees of freedom are mediated via the UV degrees of freedom over which one marginalizes, rather than implemented directly as intralayer couplings). The following exposition draws heavily from section XVI of Mehta et al., and the reader is encouraged to turn there for more details.

Consider a Boltzmann distribution with energy

\displaystyle \begin{aligned} E(\mathbf{v})&=-\sum_ia_iv_i-\frac{1}{2}\sum_{ij}J_{ij}v_iv_j\\ &=-\sum_ia_iv_i-\frac{1}{2}\sum_{ij\mu}W_{i\mu}W_{\mu j}v_iv_j~, \end{aligned} \ \ \ \ \ (7)

where {\mathbf{v}} is a vector of visible neurons, and on the second line we’ve appealed to SVD to decompose the coupling matrix as {J_{ij}=\sum\nolimits_\mu W_{i\mu}W_{\mu j}}. In its current form, the visible neurons interact directly through the second, coupling term. We now wish to introduce latent or hidden variables {\mathbf{h}=\{h_\mu\}} that mediate this coupling instead. To do this, consider the Hubbard-Stratonovish transformation (1), with {K=1}, {\mathbf{s}=(W\mathbf{v})^T}, and {\phi=-\mathbf{h}}. Then the second term in (7) becomes

\displaystyle e^{\frac{1}{2}(W\mathbf{v})^T(W\mathbf{v})} =(2\pi)^{-\frac{1}{2}}\int\!\mathrm{d}\mathbf{h}\,\exp\!\left(-\frac{1}{2}\mathbf{h}^T\mathbf{h}-(W\mathbf{v})^T\mathbf{h}\right) \ \ \ \ \ (8)

Of course, this is simply a special case of the following multidimensional Gaussian integral: recall that if {A} is an {n}-dimensional, symmetric, positive-definite matrix, then

\displaystyle \begin{aligned} {}&\int\mathrm{d}^nx\,\exp\!\left(-\frac{1}{2}\sum_{ij}A_{ij}x_ix_j+\sum_iB_ix_i\right)\\ &=\int\mathrm{d}^nx\,\exp\!\left(-\frac{1}{2}\mathbf{x}^TA\mathbf{x}+B^T\mathbf{x}\right) =\sqrt{\frac{(2\pi)^n}{\mathrm{det}A}}e^{\frac{1}{2}B^TA^{-1}B}~. \end{aligned} \ \ \ \ \ (9)

Thus, reverting to matrix notation, and absorbing the normalization into the partition function {Z}, the Boltzmann distribution can be expressed as

\displaystyle \begin{aligned} p(\mathbf{v})&=Z^{-1}e^{\sum\nolimits_ia_iv_i}\prod_\mu\int\!\mathrm{d} h_\mu\,\exp\!\left(-\frac{1}{2}\sum_\mu h_\mu^2-\sum_iW_{i\mu}v_ih_\mu\right)\\ &=Z^{-1}\int\!\mathrm{d}\mathbf{h}\,e^{-E(\mathbf{v},\mathbf{h})}~, \end{aligned} \ \ \ \ \ (10)

where on the second line, we’ve introduced the joint energy function

\displaystyle E(\mathbf{v},\mathbf{h})=-\sum_ia_iv_i+\frac{1}{2}\sum_\mu h_\mu^2-\sum_{i\mu}W_{i\mu}v_ih_\mu~. \ \ \ \ \ (11)

The salient feature of this new hamiltonian is that it contains no interactions between the visible neurons: the original, intralayer coupling {J_{ij}v_iv_j} is now mediated via the latent degrees of freedom {h_\mu} instead. As we’ll see below, this basic mechanism can be readily generalized such that we can encode arbitrarily complex — that is, arbitrarily high-order — interactions between visible neurons by coupling to a second layer of hidden neurons.

A general RBM is described by the hamiltonian

\displaystyle E(\mathbf{v},\mathbf{h})=-\sum\nolimits_ia_i(v_i)-\sum\nolimits_\mu b_\mu(h_\mu)-\sum_{i\mu}W_{i\mu}v_ih_\mu~, \ \ \ \ \ (12)

where the functions {a_i(v_i)} and {b_\mu(h_\mu)} can be freely chosen. According to Mehta et al., the most common choices are

\displaystyle a_i(v_i)\equiv \begin{cases} a_iv_i~, & v_i\in\{0,1\} \\ \frac{v_i^2}{2\sigma_i^2}~, & v_i\in\mathbb{R} \end{cases}~, \qquad\quad b_\mu(h_\mu)\equiv \begin{cases} b_\mu h_\mu~, & h_\mu\in\{0,1\} \\ \frac{h_\mu^2}{2\sigma_\mu^2}~, & h_\mu\in\mathbb{R} \end{cases}~. \ \ \ \ \ (13)

The upper choice, with binary neurons, is referred to as a Bernoulli layer, while the lower choice, with continuous outputs, is called a Gaussian layer. Note that choosing the visible layer to be Bernoulli and the hidden Gaussian reduces to the example considered above, with standard deviations {\sigma_\mu=1}. Of course, if we marginalize over (i.e., integrate out) the latent variable {\mathbf{h}}, we recover the distribution of visible neurons, cf. (10) above, which we may write as

\displaystyle p(\mathbf{v})=Z^{-1}e^{-E(\mathbf{v})}~, \ \ \ \ \ (14)

where the marginalized energy of visible neurons is given by

\displaystyle \begin{aligned} E(\mathbf{v})&=-\ln\int\!\mathrm{d}\mathbf{h}\,e^{-E(\mathbf{v},\mathbf{h})}\\ &=-\sum_ia_i(v_i)-\sum_\mu\ln\int\!\mathrm{d} h_\mu\,\exp\!\left(\sum_\mu b_\mu(h_\mu)+\sum_iW_{i\mu}v_ih_\mu\right)~. \end{aligned} \ \ \ \ \ (15)

Now, to understand how higher-order correlations are encoded in {p(\mathbf{v})}, we follow Mehta et al. in introducing the distribution of hidden neurons, ignoring their interaction with the visible layer:

\displaystyle q_\mu(h_\mu)=Z^{-1}e^{b_\mu(h_\mu)}~. \ \ \ \ \ (16)

The associated cumulant generating function is

\displaystyle K_\mu(t)=\ln E_q\left[e^{th_\mu}\right] =\ln\int\!\mathrm{d} h_\mu\,q_\mu(h_\mu)e^{th_\mu} =\sum_n\kappa_\mu^{(n)}\frac{t^n}{n!}~, \ \ \ \ \ (17)

where {E_q[\cdot]} is the expectation value with respect to the distribution {q_\mu}, and the {n^\mathrm{th}} cumulant {\kappa_n} is obtained by differentiating the power series {n} times and evaluating the result at {t=0}, i.e., {\kappa_n=K_\mu^{(n)}(t)\big|_{t=0}}. The reason for introducing (16) and (17) is that the cumulant of {q_\mu(h_\mu)} are actually encoded in the marginal energy distribution (15)! To see this, observe that taking {t=\sum\nolimits_iW_{i\mu}v_i} in the cumulant generating function yields precisely the log term in (15). Therefore we can replace this term with the cumulant expansion, whereupon we obtain

\displaystyle \begin{aligned} E(\mathbf{v})&=-\sum_ia_i(v_i)-\sum_\mu K_\mu\!\left(\sum\nolimits_i W_{i\mu}v_i\right)\\ &=-\sum_ia_i(v_i)-\sum_{\mu n}\frac{\kappa_\mu^{(n)}}{n!}\left(\sum\nolimits_i W_{i\mu}v_i\right)^n\\ &=-\sum_ia_i(v_i)-\sum_i\!\left(\sum_\mu k_\mu^{(1)}W_{i\mu}\right)v_i -\frac{1}{2}\sum_{ij}\!\left(\sum_\mu k_\mu^{(2)}W_{i\mu}W_{j\mu}\right)v_iv_j+\ldots~. \end{aligned} \ \ \ \ \ (18)

In other words, after marginalizing over the latent variables, we obtain an effective hamiltonian with arbitrarily high-order interactions between the visible neurons, with the effective couplings weighted by the associated cumulant. As emphasized by Mehta et al., it is the ability to encode such complex interactions with only the simplest couplings between visible and hidden neurons that underlies the incredible power of RBMs. See my later post on the relationship between deep learning and RG for a much more in-depth look at this.

As final comment, there’s an interesting connection between cumulants and statistical physics, which stems from the fact that for a thermal system with partition function {Z(\beta)=\mathrm{tr}e^{-\beta E}}, the free energy {{F=-\beta^{-1}\ln Z}} likewise generates thermodynamic features of the system via its derivatives. Pursuing this here would take us too far afield, but it’s interesting to note yet another point where statistical thermodynamics and machine learning cross paths.

Posted in Minds & Machines | 2 Comments

What is entropy?

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

— John von Neumann to Claude Shannon, as to what the latter should call his uncertainty function.

I recently read the classic article [1] by Jaynes, which elucidates the relationship between subjective and objective notions of entropy—that is, between the information-theoretic notion of entropy as a reflection of our ignorance, and the thermodynamic notion of entropy as a measure of the statistical microstates of a system. These two notions are formally identical, and can be written

\displaystyle S=-\sum_ip_i\ln p_i~, \ \ \ \ \ (1)

where {p_i} is the probability of the {i^\mathrm{th}} outcome/microstate. However, as Jaynes rightfully stresses, the fact that these concepts share the same mathematical form does not necessarily imply any relation between them. Nonetheless, a physicist cannot resist the intuitive sense that there must be some deeper relationship lurking beneath the surface. And indeed, Jaynes paper [1] makes it beautifully clear that in fact these distinct notions are referencing the same underlying concept—though I may differ from Jaynes in explaining precisely how the subjective/objective divide is crossed.

(As an aside, readers without a theoretical physics background may be confused by the fact that the thermodynamic entropy is typically written with a prefactor {k_B=1.381\times 10^{-23}J/K}, known as Boltzmann’s constant. This is merely a compensatory factor, to account for humans’ rather arbitrary decision to measure temperatures relative to the freezing point of water at atmospheric pressure. Being civilized people, we shall instead work in natural units, in which {k_B=1}.)

As a prelude, let us first explain what entropy has to do with subjective uncertainty. That is, why is (1) a sensible/meaningful/well-defined measure of the information we have available? Enter Claude Shannon, whose landmark paper [2] proved, perhaps surprisingly, that (1) is in fact the unique, unambiguous quantification of the “amount of uncertainty” represented by a discrete probability distribution! Jaynes offers a more intuitive proof in his appendix A [1], which proceeds as follows. Suppose we have a variable {x}, which assumes any of the discrete values {(x_1,\ldots,x_n)}. These could represent, e.g., particular outcomes or measurement values. To each {x_i}, we associated the corresponding probability {p_i\in(p_1,\ldots,p_n)}, with {\sum\nolimits_ip_i=1}. The question is then whether one can find a function {H(p_1,\ldots,p_n)} which uniquely quantifies the amount of uncertainty represented by this probability distribution. Remarkably, Shannon showed [2] that this can indeed be done, provided {H} satisfies the following three properties:

  1. {H} is a continuous function of {p_i}.
  2. If all {p_i} are equal, then {p_i=1/n} (since probabilities sum to 1), and {A(n)\equiv H(1/n,\ldots,1/n)} is a monotonically increasing function of {n}.
  3. If we decompose an event into sub-events, then the original value of {H} must be the weighted sum of the individual values of {H} for each composite event.

The first requirement simply ensures that {H} is well-defined. The second encodes the fact that if all events are equally likely, then the uncertainty increases monotonically with the number of events. The third property is referred to as the composition law by Jaynes, and is easily explained with the help of figure 6 in Shannon’s paper [2], which I’ve reproduced here:entropy-12On the left, we have three probabilities {p_1\!=\!\tfrac{1}{2}}, {p_2\!=\!\tfrac{1}{3}}, and {p_3\!=\!\tfrac{1}{6}}, so the uncertainty for this distribution is given by {H\!\left(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}\right)}. On the right, we’ve coarse-grained the system such that at the highest level (that is, the left-most branch), we have two equally likely outcomes, and hence uncertainty {H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right)}. However, this represents a different state of knowledge than the original {H}, insofar as it disregards the fine-grained information in the lower branch. Obviously, our uncertainty function should not change simply because we’ve chosen to group certain clusters of events together. Thus in order for {H} to be invariant under this composition, we must have

\displaystyle H\!\left(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}\right)=H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right)+\tfrac{1}{2}H\!\left(\tfrac{2}{3},\tfrac{1}{3}\right)~, \ \ \ \ \ (2)

where the {\tfrac{1}{2}} prefactor is the probabilistic weight associated with this composition of events.

To generalize this, it is convenient to represent the probabilities as rational fractions,

\displaystyle p_i=\frac{n_i}{\sum n_i}~,\quad\quad n_i\in\mathbb{Z}_+~. \ \ \ \ \ (3)

(The fact that {H} is continuous enables us to make this assumption without loss of generality). This enables us to fix the form of {H} by applying the composition law to the case in which we coarse-grain {n} equally likely alternatives into clusters of size {n_i}, which results in the requirement

\displaystyle A\left(\sum\nolimits_in_i\right)=H(p_1,\ldots,p_n)+\sum\nolimits_i p_iA(n_i)~. \ \ \ \ \ (4)

To see this, consider the example given by Jaynes, in which we take {n=3}, and let {(n_1,n_2,n_3)=(3,4,2)}. If all {\sum_i n_i=9} outcomes are equally likely, then the uncertainty, given by the left-hand side of (4), is {A(9)}. As in the previous example, we want this uncertainty to be preserved under coarse-graining, where in this case we group the possible outcomes into {n=3} clusters of size {n_i}, as illustrated in the figure below:


Thus we see that at the highest level, the uncertainty — had we thrown away the fine-grained information in each cluster — is {H(p_1,p_2,p_3)=H\left(\tfrac{3}{9},\tfrac{4}{9},\tfrac{2}{9}\right)}. The second term on the right-hand side of (4) is then the weighted value of {H} for each cluster, as in the simpler example above.

The final trick is to consider the case where all {n_i=m}, in which case {\sum_in_i=nm}, and (4) reduces to {A(nm)=A(n)+A(m)}, the solution to which is {A(n)=K\ln n}, where {K} is some constant (note that by the second condition, {K\!>\!0}). Substituting this into (4), we obtain

\displaystyle K\ln\sum_in_i=H(p_1,\ldots,p_n)+K\sum_ip_i\ln n_i \implies H(p_1,\ldots,p_n)=-K\sum_ip_i\ln p_i~, \ \ \ \ \ (5)

where we’ve used the fact that probabilities sum to 1 to write {\ln\sum_in_i=\sum_jp_j\ln\sum_in_i}. QED.

Thus, on the subjective side, we see that entropy (1) is the least biased representation of our state of knowledge. And this fact underlies a form of statistical inference known as maximum entropy or max-ent estimation. The basic problem is how to make inferences — that is, assign probabilities — based on incomplete information. Obviously, we want our decisions to be as unbiased as possible, and therefore we must use the probability distribution which has the maximum entropy subject to known constraints—in other words, the distribution which is maximally non-committal with respect to missing information. To use any other distribution is tantamount to imposing additional, arbitrary constraints, thereby biasing our decisions. In this sense, max-ent estimation may be regarded as a mathematization of rationality, insofar as it represents the absolute best (i.e., most rational) guess we could possibly make given the information at hand.

To make contact with the objective notion of entropy in thermodynamics, let’s examine the max-ent procedure in more detail. Suppose, as in the examples above, that we have a variable {x} which assumes discrete values {x_i} with associated probabilities {p_i}. In a typical statistical inference problem, we don’t know what these probabilities are, but we know the expectation value of some function {f(x)},

\displaystyle \langle f(x)\rangle=\sum_{i=1}^np_if(x_i)~. \ \ \ \ \ (6)

Now the question is, based on this information, what is the expectation value of some other function {g(x)}? As it stands, the problem is insoluble: we need to know {n} values {p_i} to compute {\langle g(x)\rangle}, but the normalization

\displaystyle \sum_ip_i=1 \ \ \ \ \ (7)

and (6) collectively provide only 2 constraints. Max-ent is the additional principle which not only makes this problem tractable, but ensures that our distribution will be as broad as possible (i.e., that we don’t concentrate our probability mass any more narrowly than the given information allows).

As the name suggest, the idea is to maximize the entropy (1) subject to the constraints (6) and (7). We therefore introduce the Lagrange multipliers {\lambda} and {\mu}, and minimize the Lagrange function

\displaystyle \mathcal{L}(p_i;\lambda,\mu)=-\sum_ip_i\ln p_i +\lambda\left( \sum_ip_i-1\right) +\mu\left(\sum_ip_if(x_i)-\langle f(x)\rangle\right)~. \ \ \ \ \ (8)

Imposing {\nabla_{p_i,\lambda,\mu}\mathcal{L}=0} then returns the two constraint equations, along with

\displaystyle \begin{aligned} \frac{\partial\mathcal{L}}{\partial p_j} &=-\sum_i\delta_{ij}\ln p_i-\sum_ip_i\partial_j\ln p_i+\lambda\sum_i\delta_{ij}+\mu\sum_i\delta_{ij}f(x_i)=0\\ &\implies-\ln p_j-1+\lambda+\mu f(x_j)=0~. \end{aligned} \ \ \ \ \ (9)

We can then redefine {\lambda} to absorb the 1, whereupon we obtain the more elegant result

\displaystyle p_i=e^{-\lambda-\mu f(x_i)}~, \ \ \ \ \ (10)

To solve for the Lagrange multipliers, we simply substitute this into the contraints (7) and (6), respectively:

\displaystyle e^\lambda=\sum_ie^{-\mu f(x_i)}~, \quad\quad \langle f(x)\rangle=e^{-\lambda}\sum_ie^{-\mu f(x_i)}~. \ \ \ \ \ (11)

Now observe that the sum in these expressions is none other than the partition function for the canonical ensemble at inverse temperature {\mu}:

\displaystyle Z(\mu)=\sum_ie^{-\mu f(x_i)}~. \ \ \ \ \ (12)

We therefore have

\displaystyle \lambda=\ln Z(\mu)~, \quad\quad \langle f(x)\rangle =-\partial_\mu\ln Z(\mu)~, \ \ \ \ \ (13)

i.e., {\lambda} is the cumulant generating function (in QFT, this would be the generator of connected Feynman diagrams; see effective action), {\langle f(x)\rangle} corresponds to the thermodynamic energy, and (10) is none other than the Boltzmann distribution. Of course, this naturally extends to multiple functions {f_i(x_j, \alpha_k)}, which may generically depend on other parameters {\alpha_k}; see [1] for details—Jaynes paper is a gem, and well-worth reading.

This essentially completes the connection with the thermodynamic concept of entropy. If only the average energy of the system is fixed, then the microstates will be described by the Boltzmann distribution (10). This is an objective fact about the system, reflecting the lack of any additional physical constraints. Thus the subjective and objective notions of entropy both reference the same underlying concept, namely, that the correct probability mass function is dispersed as widely as possible given the constraints on the system (or one’s knowledge thereof). Indeed, a priori, one may go so far as to define the thermodynamic entropy as the quantity which is maximized at thermal equilibrium (cf. the fundamental assumption of statistical mechanics).

Having said all that, one must exercise greater care in extending this analogy to continuous variables, and I’m frankly dubious as to how much of this intuition survives in the quantum mechanical case (I should mention that Jaynes has a follow-up paper [3] in which he extends these ideas to density matrices, but I confess I found it less compelling). And that’s not to mention quantum field theory, where the von Neumann entropy is notoriously divergent. Nonetheless, I suspect a similar unity is at play. The connection between statistical mechanics and Euclidean field theory is certainly well-known, though I still find thermal Green’s functions rather remarkable. Relative entropy is another example, insofar as it provides a well-defined link between von Neumann entropy and the modular hamiltonian, and the latter appears to encode certain thermodynamic properties of the vacuum state. This is particularly interesting in the context of holography (as an admittedly biased reference, see [4]), and is a topic to which I hope to return soon.


  1. E. T. Jaynes, “Information Theory and Statistical Mechanics,” The Physical Review 106 no. 4, (1957).
  2. C. E. Shannon, “A Mathematical Theory of Communication,” The Bell System Technical Journal 27, (1948).
  3. E. T. Jaynes, “Information Theory and Statistical Mechanics II,” The Physical Review 108 no. 2, (1957).
  4. R. Jefferson, “Comments on black hole interiors and modular inclusions,” arXiv:1811.08900.
Posted in Physics | Leave a comment

Boltzmann machines

I alluded previously that information geometry had many interesting applications, among them machine learning and computational neuroscience more generally. A classic example is the original paper by Amari, Kurata, and Nagaoka, Information Geometry of Boltzmann Machines [1]. This paper has a couple hundred citations, so the following is by no means a review of the current state of the art; however, it does provide a concrete setting in which to apply the theoretical framework introduced in my three-part information geometry sequence, and is as good a place to start as any.

Boltzmann machines belong to the class of so-called “energy-based” models of neural networks, for reasons which will be elucidated below, which makes them particularly intuitive from a physics perspective (see, e.g., the spin-glass analogy of neural networks). Additionally, at least in the absence of hidden layers, they have a number of nice mathematical properties that make them ideal toy models for the geometric approach. Let us briefly introduce these models, following the notation in [1]. A Boltzmann machine is network of {n} stochastic neurons with symmetric connection weights {w_{ij}=w_{ji}} between the {i^\mathrm{th}} and {j^\mathrm{th}} neurons (with {w_{ii}=0}). A neuron is a binary variable which has a value of either 1 (excited) or 0 (quiescent). As the network is evolved in time, the state of the {i^\mathrm{th}} neuron is updated according to the potential

\displaystyle u_i=\sum_{j=1}^nw_{ij}x_j-h_i=\sum_{j=0}^nw_{ij}x_j~, \ \ \ \ \ (1)

where {h_i} is the threshold for the neuron (i.e., a free parameter that determines that neuron’s sensitivity to firing), and in the second equality we have absorbed this by defining the zeroth neuron with {x_0=1} and {w_{i0}=-h_i}. The state of the neuron is updated to 1 with probability {f(u_i)}, and 0 with probability {1\!-\!f(u_i)}, where

\displaystyle f(u)=\frac{1}{1+e^{-u}}~. \ \ \ \ \ (2)

We can now see the reason for the name as follows: the network’s state is described by the vector {\mathbf{x}=(x_1,\ldots,x_n)}, and hence the transition between states forms a Markov chain with {2^n} sites. In particular, if there are no disconnected nodes, then it forms an ergodic Markov chain with a unique stationary distribution

\displaystyle p(\mathbf{x})=Z^{-1}e^{-E(\mathbf{x})}~, \ \ \ \ \ (3)

i.e., the probability of finding the network in a given state will asymptotically approach the Boltzmann distribution. In this expression, the normalization is given by the partition function for the canonical ensemble,

\displaystyle Z=\sum_\mathbf{x} e^{-E(\mathbf{x})}~, \ \ \ \ \ (4)

where we sum over all possible microstates. We then recognize {E(\mathbf{x})} as the energy of the state {\mathbf{x}}, which is given by

\displaystyle E(\mathbf{x})=-\frac{1}{2}\sum_{i,j}w_{ij}x_ix_j=-\sum_{i>j}w_{ij}x_ix_j~, \ \ \ \ \ (5)

where the second equality follows from the symmetry of the connection matrix. In other words, (5) is the hamiltonian for a system of nearest-neighbor interactions with coupling {w_{ij}}.

The relationship between the Boltzmann distribution and the update rule (2) is most easily seen in reverse: consider two states which differ only by a single neuron. Their energy difference is {\Delta E=E(x_i\!=\!1)-E(x_i\!=\!0)}, where {E(x_i\!=\!1)} denotes the state in which the {i^\mathrm{th}} neuron is excited, etc. By (3), we have

\displaystyle \begin{aligned} \Delta E&=-\ln p(x_i\!=\!1)+\ln p(x_i\!=\!0) =-\ln p(x_i\!=\!1)+\ln [1-p(x_i\!=\!1)]\\ &=\ln\left[\frac{1}{p(x_i\!=\!1)}-1\right] \;\implies\; p(x_i\!=\!1)=\frac{1}{1+e^{\Delta E}}~, \end{aligned} \ \ \ \ \ (6)

where the {\ln Z} terms have canceled, and we have used the fact that probabilities sum to 1. Now observe that since we only changed a single neuron, {p(x_i\!=\!1)=f(u)}, whereupon (2) mandates the identification {u=\Delta E}. Thus, thermodynamically, {f(u)} is a thermal spectrum, and {u_i} is the momentum mode associated with the excitation {x_i=0\!\rightarrow\!1} (said differently, {u_i} is the energy gap associated with the transition). Given the hamiltonian (5) — that is, the fact that the network is stochastic and fully connected — the form of the probability update rule (2) therefore encodes the fact that the system is in thermal equilibrium: the set of all possible states satisfies a Boltzmann distribution (with temperature {T=1} in our units).

Now, (machine) “learning”, in this context, consists of adjusting the distribution {p(\mathbf{x})} to match, as closely as possible, a given input or environmental distribution {q(\mathbf{x})}. This is achieved by modifying the connection weights {w_{ij}} (which in our compact notation includes the threshold values {h_i}) according to a specified learning rule. Accordingly, we shall impose that the average adjustment to {w_{ij}} is given by

\displaystyle \Delta w_{ij}=\epsilon(q_{ij}-p_{ij})~, \ \ \ \ \ (7)

where {\epsilon\!>\!0} is a small constant, and the expectation values with respect to the distributions {q(\mathbf{x})}, {p(\mathbf{x})} are

\displaystyle q_{ij}=E_q[x_ix_j]=\sum_xq(\mathbf{x})x_ix_j~, \qquad\quad p_{ij}=E_p[x_ix_j]=\sum_xp(\mathbf{x})x_ix_j~, \ \ \ \ \ (8)

cf. eq. (9) in part 1 of my information geometry sequence. Intuitively, the rule (7) simply says that we adjust the weights according to the difference between our reference and target distributions (more carefully: we adjust the connection strength between two neurons in proportion to the difference between the expectation values of both neurons being excited under said distributions). The reason for this particular rule — aside from its intuitive appeal — is that it realizes the (stochastic) gradient descent algorithm for optimizing the Kullback-Leibler divergence between {q(\mathbf{x})} and {p(\mathbf{x})},

\displaystyle I(q:p)=\sum_\mathbf{x} q(\mathbf{x})\ln\frac{q(\mathbf{x})}{p(\mathbf{x})}~, \ \ \ \ \ (9)

i.e., (7) may be written

\displaystyle \Delta w_{ij}=-\epsilon\frac{\partial I(q:p)}{\partial w_{ij}}~, \ \ \ \ \ (10)

where the derivative is with respect to the weights of the initial distribution {p(\mathbf{x};\omega)}. We’ll see this explicitly below, after introducing some facilitating machinery, whereupon the fact that this realizes gradient descent on the divergence (9) will also be seen to emerge naturally from the geometric framework; the original proof, as far as I’m aware, was given in [2].

Incidentally, another interesting feature of the learning rule (7) is that it’s (ultra)local. Thus, in contrast to methods like backpropagation which require global information about the network, it may represent a more biologically plausible model insofar as it relies only on local synaptic data (in addition to being computationally simpler).

In order to apply the methods of information geometry to the Boltzmann machines introduced above, let us first review the relevant aspects in this context. Let {S=\{q(\mathbf{x})\}} be the set of all probability distributions over the state space {X}. The latter consists of {2^n} states, but the constraint

\displaystyle \sum_\mathbf{x} q(\mathbf{x})=1 \ \ \ \ \ (11)

implies that we have only {2^n\!-\!1} degrees of freedom; thus we may treat {S} as a {(2^n\!-\!1)}-dimensional Euclidean manifold. (Since we’re working with finite-dimensional distributions, we mean that {S} is a manifold in the sense that it is topologically locally equivalent to Euclidean space; additionally, the constraint {0<q(\mathbf{x})<1} implies that we exclude the boundary {\partial S}. So technically, {S} is an open simplex with this dimensionality, but the distinction will not be relevant for our purposes). As will become apparent below, {S} forms a dually flat space, which underlies many of the nice properties alluded above.

By a suitable choice of coordinates, we can represent {S} as an exponential family,

\displaystyle q(\mathbf{x};\theta)=\exp\!\left[\theta^IX_I-\psi(\theta)\right]~, \ \ \ \ \ (12)

cf. eq. (26) in part 1. Here, {\theta=(\theta^I)} is a {(2^n\!-\!1)}-dimensional vector of canonical parameters, {\mathbf{X}=(X_I)} with {X_I=x_{i_1}\ldots x_{i_m}}, and the index {I} runs over all combinations of subsets {(i_1,\ldots,i_m)} for {1\leq m\leq n}. The condition {\sum_\mathbf{x} q(\mathbf{x};\theta)=1} then identifies the potential {\psi(\theta)} as the cumulant generating function,

\displaystyle \psi(\theta)=\ln\sum_\mathbf{x}\exp\!\left[\theta^IX_I(\mathbf{x})\right]~. \ \ \ \ \ (13)

Now recall from part 2 that the dual coordinate {\eta} is defined via the expectation with respect to {q(\mathbf{x};\theta)},

\displaystyle \eta_I\equiv E_\theta[X_I]=\mathrm{prob}\left\{x_{i_1}=\ldots=x_{i_m}=1\,|\,I=(i_1,\ldots,i_m)\right\}~, \ \ \ \ \ (14)

where the second equality follows from the fact that the only non-zero term in the sum is that for which all neurons are excited, hence we just have the probability {q(X_I)}. As explained in part 3, this identification is tantamount to the statement that the coordinates {\theta^I} and {\eta_I} are related through the dual potential {\phi(\eta)}, which is defined via the Legendre transform. In the present case, we have via (13),

\displaystyle \phi(\eta)=\theta^I\partial_I\psi(\theta)-\psi(\theta) =\sum_\mathbf{x} q(\mathbf{x};\theta(\eta))\ln q(\mathbf{x};\theta(\eta))=-H(\mathbf{x})~, \ \ \ \ \ (15)

where {H(\mathbf{x})} is the entropy of state. Note that we can also express this as

\displaystyle \psi(\theta)+\phi(\eta)-\theta^I\eta_I=0~, \ \ \ \ \ (16)

which provides a useful relationship between the (dual) coordinates and potentials that will come in handy later.

To see that {S} is dually flat, recall that the canonical and dual parameters, {(\theta^I)} and {(\eta_I)}, respectively induce 1-affine and {(-1)}-affine coordinate systems, with respect to which {S} is {(\pm 1)}-flat; cf. eq. (30) and (34) in part 1. Thus {S} admits a unique Riemannian metric — the Fisher information metric — which can be expressed in the natural basis of coordinates {(\theta^I)} as

\displaystyle g_{IJ}(\theta)=\partial_I\cdot \partial_J~, \ \ \ \ \ (17)

i.e., for any two vectors {X,Y\in T_\theta S},

\displaystyle g_{IJ}(\theta)=\langle X,Y\rangle_\theta=E_\theta[X\ell_\theta(\mathbf{x}) Y\ell_\theta(\mathbf{x})] \ \ \ \ \ (18)

where {\ell_\theta(\mathbf{x})=\ell(\mathbf{x};\theta)\equiv\ln q(\mathbf{x};\theta)}; cf. eqs. (12) and (14) in part 1. Taking {X} and {Y} to be basis vectors, we recover the general definition

\displaystyle g_{IJ}(\theta)=E_\theta[\partial_I\ell_\theta(\mathbf{x}) \partial_J\ell_\theta(\mathbf{x})]~. \ \ \ \ \ (19)

This expression is general insofar as it holds for arbitrary coordinate systems (cf. the inner product on random variables). Since {(\theta^I)} is {e}-flat however, it reduces to

\displaystyle g_{IJ}=\partial_I\partial_J\psi(\theta)~, \ \ \ \ \ (20)

which one can verify explicitly via (13) (it is convenient to start with the form eq. (11) in part 1). Thus, dual flatness has the advantage that the metric can be directly obtained from the potential {\psi(\theta)}. In accordance with our discussion in part 2, we also have the inverse metric

\displaystyle g^{IJ}=\langle\partial^I,\partial^J\rangle=\partial^I\partial^J\phi(\eta)~,\qquad\quad \sum g^{IJ}g_{JK}=\delta^I_K~, \ \ \ \ \ (21)

which implies that {\partial_I=\frac{\partial}{\partial\theta^I}} and {\partial^I=\frac{\partial}{\partial\eta_I}} are mutually dual basis vectors at every point in {S}, i.e., {\partial_I\cdot\partial^J=\delta_I^J}.

Of course, the manifold {S} is generically curved with respect to the Riemannian metric {g_{IJ}} above. But the special feature of dually flat spaces is that we can introduce dual {(\pm1)}-connections with respect to which {S} appears flat. Recall that any dual structure {(g,\nabla,\nabla^*)} is naturally induced from a divergence. Specifically, let us take the canonical divergence 

\displaystyle D(p||q)\equiv\psi(p)+\phi(q)-\theta^I(p)\eta_I(q)~, \ \ \ \ \ (22)

where by the argument of the potential, we mean the associated coordinate; e.g., {\psi(p)=\psi(\theta_p)}, where {\theta_p=(\theta^I_p)} is the canonical coordinate of {p\in S}. One can readily see that this induces the metric above, in the sense that

\displaystyle D\left((\partial_I\partial_J)_p||q\right)\equiv \partial_I\partial_J D(p||q)=\partial_I\partial_J\psi(p)=g_{IJ}(p)~, \ \ \ \ \ (23)

where the last equality follows by (20). Additionally, we saw in part 2 that the Kullback-Leibler divergence corresponds to the {\alpha}-divergence for {\alpha=\pm1}, and that the latter is in turn a special case of the more general {f}-divergence. But for the dual structure {(g,\nabla^{(1)},\nabla^{(-1)})}, this is none other than the canonical divergence! That is, given the potentials and dual coordinate systems {\theta^I} and {\eta_I} above, (22) leads to (9), i.e.,

\displaystyle D(p||q)=I(q:p) \quad\quad\forall\,p,q\in S~, \ \ \ \ \ (24)

(note the change in ordering).

We may now regard the space of Boltzmann machines {B} as a submanifold of {S}. One can see this by taking the log of the distribution (3),

\displaystyle \ln p(\mathbf{x};w)=\sum_{i>j}w_{ij}x_ix_j-\psi(w)~, \ \ \ \ \ (25)

where the potential {\psi(w)=\ln Z} accounts for the normalization; note that this is constant with respect to {\mathbf{x}}. In terms of the canonical coordinates above, this corresponds to {\theta^i_1=w_{i0}}, {\theta^{ij}_2=w_{ij}}, and {\theta^{i_1\ldots i_m}_m=0} for all {m>2}. The last of these constitutes a system of {e}-linear constraints, which implies that {B} is in particular an {e}-linear submanifold of {S}, and therefore inherits the dually linear properties of {S} itself. Furthermore, since {B} is an exponential family, we shall henceforth denote its canonical coordinates by {w^{ij}} (with upper indices) for consistency with the conventions above. We then define the dual coordinates (cf. eq. (14))

\displaystyle \eta_{ij}=E_w[x_ix_j]=p_{ij}=\mathrm{prob}\{x_i=x_j=1\}~, \qquad\quad i<j~,\quad i=0,\ldots,n~, \ \ \ \ \ (26)

where the penultimate equality follows from (8). Note that unlike the canonical coordinates above, the remaining parts {\eta_{i_1,\ldots,i_m}^m} are generally nonlinear functions of {\eta_{ij}} for points in {B}. (This implies that {B} is not an {m}-linear submanifold of {S}, but since it’s {e}-linear, we can find some other coordinates to reveal the dual {m}-linear structure that {B} must possess). The (invariant!) divergence between two Boltzmann machines {w_1^{ij}} and {w_2^{ij}} is then given by

\displaystyle D(w_1,w_2)=\sum_\mathbf{x} p(\mathbf{x};w_1)\ln\frac{p(\mathbf{x};w_1)}{p(\mathbf{x};w_2)} =\phi(w_1)+\psi(w_2)-\sum_{i>j}w_2^{ij}{p_1}_{ij}~. \ \ \ \ \ (27)

(Note that this differs from eq. (34) in [1]; I believe the authors have mistakenly swapped the distribution labels {1\leftrightarrow 2}. Explicitly, (15) allows us to identify the first term as

\displaystyle \sum_\mathbf{x} p(\mathbf{x};w_1)\ln p(\mathbf{x};w_1)=\phi(w_1)~. \ \ \ \ \ (28)

Meanwhile, the second term is easily found to be

\displaystyle \sum_\mathbf{x} p(\mathbf{x};w_1)\ln p(\mathbf{x};w_2)=w_2^{ij}\sum_\mathbf{x} x_ix_j p(\mathbf{x};w_1)-\psi(w_2)=w_2^{ij}{p_1}_{ij}-\psi(w_2)~, \ \ \ \ \ (29)

where we used the fact that {p_{ij}=\eta_{ij}=E_w[x_ix_j]}. Subtracting this from the entropy term above yields (27) as written). From (20) and (16), the induced the metric is simply given by

\displaystyle g_{IJ}=g_{ij,kl}=\frac{\partial^2}{\partial w^{ij}\partial w^{kl}}\psi(w)=\frac{\partial p_{ij}}{\partial w^{kl}}~. \ \ \ \ \ (30)

This, of course, has a statistical interpretation as the Fisher information matrix; it may be alternatively expressed as [1]

\displaystyle g_{ij,kl}=p_{ij,kl}-p_{ij}p_{kl}~, \ \ \ \ \ (31)

where the first term is the probability that all four neurons are excited,

\displaystyle p_{ij,kl}=\mathrm{prob}\{x_i=x_j=x_k=x_l=1\}~. \ \ \ \ \ (32)

As mentioned above, the learning task is for the Boltzmann machine to realize an environmental or target probability distribution {q(\mathbf{x})} as closely as possible, according to the divergence (27). We now wish to connect this notion of minimum divergence to the intuitive differential geometric picture of {B} as a submanifold of {S}, whereby one expects that the best approximation is given by the point {p(\mathbf{x})\in B\subset S} that minimizes the geodesic distance between {p(\mathbf{x})} and {q(\mathbf{x})}. To do so, consider three points, {p,r\in B} and {q\in S}, where {r} is the current state of the Boltzmann machine, and {p} is the best possible approximation of the environmental distribution {q} (which generically lies beyond {B}). Clearly, the point {p} is given by

\displaystyle p=\mathrm{min}_{p'\in B}D(q,p')~, \ \ \ \ \ (33)

where {D(q,p)} is the divergence between the distributions {q} and {p} defined above. The right-hand side of (33) describes the geodesic or {m}-projection of {q} onto {B}, so named because the tangent vector of the geodesic is orthogonal to {T_pB} (the tangent space of {B} at the point {p}). Furthermore, since {B} is {e}-flat, this projection is unique: there are no local minima of the divergence aside from this global minimum.

To elaborate, let

\displaystyle \theta^I=(w^{ij},w_3^{ijk},\ldots,w_n^{i\ldots n})~, \qquad\quad \eta_I=(p_{ij},p^3_{ijk},\ldots,p^n_{i\ldots n})~, \ \ \ \ \ (34)

respectively denote the canonical and dual coordinates of {q(\mathbf{x})}, where as a shorthand we include both {w_1^{0i}} and {w_2^{ij}} in {w^{ij}}, and both {p_{0i}^1} and {p_{ij}^2} in {p_{ij}}. Then recall that since {p(\mathbf{x})\in B}, we must have {w_m^{i_1\ldots i_m}=0} {\forall m>2}. Thus both {p} and {q} lie in the submanifold

\displaystyle M=\{\eta\,|\,\eta_{ij}=p_{ij}\}\subset S~, \ \ \ \ \ (35)

that is, the first part of the {\eta}-coordinates are constrained to be {p_{ij}}, while the remainder are left free. This defines {M} in terms of a set of linear constraints in the {\eta}-coordinates, and hence {M} is an {m}-flat manifold (analogous to how {B} is an {e}-flat manifold). The geodesic projection defined by (33) is therefore an {m}-geodesic in {M} (hence the name).

The construction of {M} also implies that the tangent space of {M} at {P}, {T_pM}, is spanned by the basis vectors {\{\partial^J\}}, where {J} runs over the second (unspecified) part of the {\eta}-coordinates. Similarly, {T_pB} is spanned by {\{\partial_I\}}, where {I} runs over the first (unspecified) part of the {\theta}-coordinates (since {B} is defined by fixing the second part to zero). Since {\partial_I\cdot\partial^J=0} for {I\neq J}, we conclude that {M} and {B} are orthogonal at {p}, and hence that the geodesic projection defined by (33) — that is, the {m}-geodesic connecting {q} and {p} — is likewise orthogonal to {B}. This is useful because it enables us to use the generalized Pythagorean theorem, in the form

\displaystyle D(q,r)=D(q,p)+D(p,r)~. \ \ \ \ \ (36)

Here, the left-hand side plays the role of the hypotenuse: it is the divergence between the distribution {q(\mathbf{x})} and the current state of the Boltzmann machine {r(\mathbf{x})}. This is given, on the right-hand side, by the aforementioned {m}-geodesic representing the projection of {q} “straight down” to {p}, {D(q,p)}, and an {e}-geodesic (because it lies entirely within {B}) representing the divergence between the current machine {r(\mathbf{x})} and the optimal machine {p(\mathbf{x})}, {D(p,r)}. Since we can’t leave the submanifold {B}, the part {D(q,p)} is an unavoidable error: we can’t shrink the divergence between the current state of the machine {r} and the target state {q} to zero; note that this is independent of the current state of the machine, and arises simply from the fact that {q\notin B}. We thus see that to accomplish our learning task, we need to minimize {D(p,r)} by bringing the current machine closer to the optimum; i.e., the learning trajectory is the e-geodesic connecting the state of the current machine {r(\mathbf{x})} to that of the optimal machine {p(\mathbf{x})}.

To move along this geodesic, we simply follow the steepest descent algorithm from the point {r}. Denoting the connection weights thereof by {w^{ij}} (with apologies for reusing notation), the gradient is

\displaystyle \frac{\partial D(p,r)}{\partial w^{ij}} =\frac{\partial D(q,r)}{\partial w^{ij}} =q_{ij}-r_{ij}~, \ \ \ \ \ (37)

where, in the first equality, we have used the aforementioned fact that the unavoidable error {D(q,p)} is independent of {r} (and hence of {w^{ij}}); as for the second, we have via (27)

\displaystyle \frac{\partial}{\partial w_1^{ij}}D(w_1,w_2) =\frac{\partial}{\partial w_1^{ij}}\left[\phi(w_1)+\psi(w_2)-\sum_{i>j}w_1^{ij}{p_2}_{ij}\right] ={p_1}_{ij}-{p_2}_{ij}~, \ \ \ \ \ (38)

where we have used the fact that, from (16), {\eta_I=\tfrac{\partial\phi}{\partial\theta^I}}; we then identified {q_{ij}} and {r_{ij}} according to (8). We thus come full-circle to (7), i.e.,

\displaystyle \Delta w^{ij}=\epsilon\left(r_{ij}-q_{ij}\right)~. \ \ \ \ \ (39)

(Note the change of order). Of course, this expression is only valid in the dually flat coordinates above, and does not have an invariant meaning. But we can uplift the learning rule for steepest descent to a general Riemannian manifold simply by inserting the (inverse) metric:

\displaystyle \Delta w^{ij}=\epsilon\sum g^{ij,kl}\left(r_{kl}-q_{kl}\right)~. \ \ \ \ \ (40)

Since the inverse metric merely interchanges upper and lower indices, the generalized rule (40) is still an {m}-geodesic equation on {B}, and is therefore linear in the {\eta}-coordinates. As noted in [1], this may be practically relevant since, in addition to being more tractable, the convergence to the optimum is likely much faster than in more complicated (i.e., non-linear) networks.

The paper [1] goes on to generalize their results to Boltzmann machines with hidden units; while these are more useful in practice (particularly in the form of restricted Boltzmann machines), I will not discuss them here. Instead, let me close by tying back into the physics-lingo at the beginning, where I mentioned that Boltzmann machines belong to the class of energy-based models. In particular, note that the energy difference between the target {q(\mathbf{x})} and optimum {p(\mathbf{x})} is

\displaystyle \Delta E(p,q)=-\ln p+\ln q=\ln\frac{q}{p} \quad\implies\quad I(q:p)=\sum_\mathbf{x} q\,\Delta E(p,q)~. \ \ \ \ \ (41)

Therefore, since {D(p||q)=I(q:p)}, we may think of the optimum Boltzmann machine as that which minimizes the energy difference between these two states, cf. (33). Thus the learning problem is intuitively the mandate that we flow to the minimum energy state, where here the energy is defined relative to the target distribution {q(\mathbf{x})}. And the solution — that is, the learning trajectory — is the minimum-length geodesic in the space of Boltzmann machines {B}. Neat!


  1. S.-i. Amari, K. Kurata, and H. Nagaoka, “Information Geometry of Boltzmann Machines,” IEEE Transactions on Neural Networks 3 no. 2, (1992).
  2. D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,” Cognitive Science 9 (1985).
Posted in Minds & Machines | Leave a comment

Information geometry (part 3/3)

Insofar as quantum mechanics can be regarded as an extension of (classical) probability theory, most of the concepts developed in the previous two parts of this sequence can be extended to quantum information theory as well, thus giving rise to quantum information geometry.

Quantum preliminaries

Let’s first review a few basic quantum mechanical notions; we’ve covered this in more detail before, but it will be helpful to establish present notation/language.

Quantum mechanics is the study of linear operators acting on some complex Hilbert space {\mathcal{H}}. We shall take this to be finite-dimensional, and hence identify {\mathcal{H}=\mathbb{C}^k}, the space of {k\!\times\!k} complex matrices, where {k=\mathrm{dim}\,\mathcal{H}}. In the usual bra-ket notation, the adjoint {A^*} of an operator {A} is defined by {\langle x|Ay\rangle=\langle A^*x|y\rangle} for all vectors {x,y}. The density operator {\rho}, which describes a quantum state, is a postive semidefinite hermitian operator of unit trace. {\rho} is a pure state if it is of rank 1 (meaning it admits a representation of the form {\rho=|x\rangle\langle x|}), and mixed otherwise.

A measurement {\Pi=\{\pi_i\}=\{\pi_1,\ldots,\pi_n\}} is a set of operators on {\mathcal{H}} which satisfy

\displaystyle \pi_i=\pi_i^*\geq0\qquad\mathrm{and}\qquad\sum_i\pi_i=1~, \ \ \ \ \ (1)

whose action on the state {\rho} yields the outcome {e_i\in\{e_1,\ldots,e_n\}} with probability

\displaystyle P(e_i)=\mathrm{tr}{\rho\pi_i}~. \ \ \ \ \ (2)

Note that by virtue of (1), {P\{e_i\}} is a probability distribution:

\displaystyle P(e_i)\geq0\qquad\mathrm{and}\qquad\sum_iP(e_i)=1~. \ \ \ \ \ (3)

If we further impose that {\Pi} consist of orthogonal projections, i.e.,

\displaystyle \pi_i\pi_j=\pi_i\delta_{ij}~, \ \ \ \ \ (4)

then the above describes a “simple measurement”, or PVM (projection-valued measure). More generally of course, a measurement is described by a POVM (positive operator-valued measure). We covered this concept in an earlier post, but the distinction will not be relevant here.

An observable is formally defined as the pair {(\{\pi_i\},\{a_i\})}, where {a_i\in\mathbb{R}} is a real number associated with the measurement outcome {e_i}. Since {\{\pi_i\}} are orthogonal projectors, we may represent the observable as a (hermitian) operator {A} acting on {\mathcal{H}} whose associated eigenvalues are {\{a_i\}}, i.e.,

\displaystyle A=\sum_ia_i\pi_i~. \ \ \ \ \ (5)

The expectation and variance of {A} in the state {\rho} are then

\displaystyle E_\rho[A]=\sum_i\mathrm{tr}{\rho\pi_i}a_i=\mathrm{tr}{\rho A}~, \ \ \ \ \ (6)

\displaystyle V_\rho[A]=\sum_i\mathrm{tr}{\rho\pi_i}\left(a_i-E_\rho[A]\right)^2=\mathrm{tr}\!\left[\rho\left(A-E_\rho[A]\right)^2\right]~. \ \ \ \ \ (7)

We shall denote the set of all operators {A} on {\mathcal{H}} by

\displaystyle \mathcal{A}=\{A\,|\,A=A^*\}~, \ \ \ \ \ (8)

and define

\displaystyle \mathcal{A}_1\equiv\{A\in\mathcal{A}\,|\,\mathrm{tr}A=1\}\subset\mathcal{A}~. \ \ \ \ \ (9)

Then the set of all density operators {\rho} is a convex subset of {\mathcal{A}_1}:

\displaystyle \bar{\mathcal{S}}\equiv\{\rho\,|\,\rho=\rho^*\geq0\;\;\mathrm{and}\;\;\mathrm{tr}\rho=1\}~. \ \ \ \ \ (10)

Note the similarities with the embedding formalism introduced in the previous post. The set of density operators admits a partition based on the rank of the matrix,

\displaystyle \bar{\mathcal{S}}=\bigcup_{r=1}^k\mathcal{S}_r~,\qquad\mathrm{where}\qquad \mathcal{S}_r\equiv\{\rho\in\bar{\mathcal{S}}\,|\,\mathrm{rank}\,\rho=r\}~. \ \ \ \ \ (11)

Pure states are then elements of {\mathcal{S}_1}, which form the extreme points of the convex set {\bar{\mathcal{S}}} (i.e., they cannot be represented as convex combinations of other points, which is the statement that {\rho=|x\rangle\langle x|}). Geometrically, the pure states are the vertices of the convex hull, while the (strictly positive) mixed states {\rho\in\mathcal{S}_{r>1}} lie in the interior.

We need one more preliminary notion, namely the set {\mathcal{U}} of unitary operators on {\mathcal{H}}:

\displaystyle \mathcal{U}=\{U\,|\,U^{-1}=U^*\}~. \ \ \ \ \ (12)

This forms a Lie group, whose action on {\bar{\mathcal{S}}} is given by

\displaystyle \mathcal{U}\times\bar{\mathcal{S}}\rightarrow\bar{\mathcal{S}}\qquad\mathrm{where}\qquad (U,\rho)\mapsto U\rho U^{-1}\equiv\tilde\rho~. \ \ \ \ \ (13)

(The first expression says that the group multiplication by {U} sends elements of {\bar{\mathcal{S}}} to {\bar{\mathcal{S}}}, while the second specifies how this mapping acts on individual elements). Since the matrix rank is preserved, each {\mathcal{S}_r} is closed under the action of {\mathcal{U}}. It follows from (13) that {\mathcal{U}} maps measurements {\Pi} and observables {A} to

\displaystyle \tilde\pi_i=U\pi_i U^{-1}\qquad\mathrm{and}\qquad\tilde A=UAU^{-1}~, \ \ \ \ \ (14)

and that the probability distributions and expectation values are therefore invariant, i.e.,

\displaystyle \mathrm{tr}(\tilde\rho\tilde\pi_i)=\mathrm{tr}(\rho\pi_i)\qquad\mathrm{and}\qquad \mathrm{tr}(\tilde\rho\tilde A)=\mathrm{tr}(\rho A)~. \ \ \ \ \ (15)

Of course, this last is the familiar statement that the unitarity of quantum mechanics ensures that probabilities are preserved.

Geometric preliminaries

Now, on to (quantum information) geometry! Let us restrict our attention to {\mathcal{S}_1} for now; we shall consider mixed states below. Recall that pure states are actually rays, not vectors, in Hilbert space. Since {\mathcal{H}=\mathbb{C}^k}, we therefore identify {\mathcal{S}_1=\mathbb{C}P^{k-1}} (i.e., the pure states are in one-to-one correspondence with the rays in {\mathbb{C}^{k}}, where we’ve projected out by the physically irrelevant phase). We now wish to associate a metric to this space of states; in particular, we seek a Riemannian metric which is invariant under unitary transformations {\mathcal{U}}. It turns out that, up to an overall constant, this is uniquely provided by the Fubini-Study metric. Recall that when considering classical distributions, we found that the Fisher metric was the unique Riemannian metric that preserved the inner product. The Fubini-Study metric is thus the natural extension of the Fisher metric to the quantum mechanical case (for pure states only!). Unfortunately, Amari & Nagaoka neither define nor discuss this entity, but it’s reasonably well-known in the quantum information community, and has recently played a role in efforts to define holographic complexity in field theories. A particularly useful expression in this context is

\displaystyle \mathrm{d}s^2=\langle\psi|\partial_\sigma U^\dagger\partial_\sigma U|\psi\rangle -\big|\langle\psi|U^\dagger\partial_\sigma U|\psi\rangle\big|^2~, \ \ \ \ \ (16)

where {\sigma\in\mathbb{R}} parametrizes the unitary {U(\sigma)} that performs the transformation between states in {\mathcal{S}_1}.

In the remainder of this post, we shall follow Amari & Nagaoka (chapter 7) in considering only mixed states; accordingly, for convenience, we hence forth denote {\mathcal{S}_{r>1}} by simply {\mathcal{S}}, i.e.,

\displaystyle \mathcal{S}\equiv\{\rho\,|\,\rho=\rho^*>0\;\;\mathrm{and}\;\;\mathrm{tr}\rho=1\}~. \ \ \ \ \ (17)

This is an open subset of {\mathcal{A}_1}, and thus we may regard it as a real manifold of dimension {n\equiv\mathrm{dim}\mathcal{A}_1}. Since the dimension of {\mathcal{H}} is given by the number of linearly independent basis vectors, minus 1 for the trace constraint, {n=k^2\!-\!1}. The tangent space {T_\rho\mathcal{S}} at a point {\rho} is then identified with

\displaystyle \mathcal{A}_0\equiv\{A\in\mathcal{A}\,|\,\mathrm{tr}A=0\}~. \ \ \ \ \ (18)

Note that this precisely parallels the embedding formalism used in defining the m-representation. Hence we denote a tangent vector {X\in T_\rho\mathcal{S}} by {X^{(m)}}, and call it the m-representation of {X} in analogy with the classical case. (We’ll come to the quantum analogue of the e-representation later).

The PVM {\Pi} introduced above can also be represented in geometrical terms, via the submanifold

\displaystyle \mathcal{S}_\Pi=\mathcal{S}\,\cap\,\left\{\sum_{i=1}^la_i\pi_i\,\Big|\,(a_1,\ldots,a_l)\in\mathbb{R}^l\right\} =\left\{\sum_{i=1}^l\frac{p_i}{\mathrm{tr}\,\pi_i}\pi_i\,\Big|\,p=(p_1,\ldots,p_l)\in\mathcal{P}_l\right\}~. \ \ \ \ \ (19)

That is, {\mathcal{S}_\Pi} is the intersection of two sets. The first, {\mathcal{S}\subset\mathcal{A}_1}, is simply the set of all operators {A} with unit trace. From (5), we recognize the second as the spectral representation of operators in the eigenbasis of {\Pi}. Imposing the trace constraint then implies that the eigenvalues of {A} are weighted probabilities, where {p_i\in\{p_1,\ldots,p_l\}} and {\mathcal{P}_l=\mathcal{P}\{1,\ldots,l\}} is the totality of positive probability distributions {p\equiv P\{e_i\}} with {i\in[1,l]}. To see this, observe that

\displaystyle \mathrm{tr}\sum_ia_i\pi_i =\sum_i\mathrm{tr}(\pi_i)a_i=1\;\implies\; a_i=\frac{p_i}{\mathrm{tr}(\pi_i)}~, \ \ \ \ \ (20)

since {\sum p_i=1}; this underlies the second equality in the expression for {\mathcal{S}_\Pi} above.

Lastly, the action of a unitary {U\in\mathcal{U}} at a point {\rho} results in a new vector, given by the mapping (13), which lives in (a subspace of) the tangent space {T_\rho\mathcal{S}\simeq\mathcal{A}_0}. Such elements may be written

\displaystyle \frac{\mathrm{d}}{\mathrm{d}t}U_t\rho U_t^{-1}\Big|_{t=0}=[\Omega,\rho]~, \ \ \ \ \ (21)

where {U_t\in\mathcal{U}} is a curve in the space of unitaries with {U_0=1}, and {\Omega} is its derivative at {t=0} (note that {t} corresponds to {\sigma} in the expression for the Fubini-Study metric (16) above).

With the above notions in hand, we may now proceed to introduce a dual structure {(g,\nabla,\nabla^*)} on {\mathcal{S}}, and thereby the quantum analogues of the {\alpha}-connections, divergence, etc.

Quantum divergences

In part 2 of this sequence, we introduced the f-divergence, which satisfies a number of desireable properties such as monotonicity. It also serves as the super-class to which the {\alpha}-divergence belongs; the latter is particularly important, since it induces the dual structure consisting of the Fisher metric and {(\pm\alpha)}-connections, and reduces to the familiar Kullback-Leibler divergence (a.k.a. relative entropy) for {\alpha=\pm1}. We shall now introduce the quantum analogue of the f-divergence.

Denote by {\mathcal{B}} the set of all (not necessarily hermitian) operators on {\mathcal{H}}. Then given two strictly postive density operators {\rho,\sigma\in\mathcal{S}}, the relative modular operator {\Delta=\Delta_{\sigma,\rho}:\mathcal{B}\rightarrow\mathcal{B}} is defined by

\displaystyle \Delta A=\sigma A\rho^{-1}~,\quad\forall A\in\mathcal{B}~. \ \ \ \ \ (22)

Since {\Delta} may be viewed as a hermitian and positive-definite operator on the Hilbert space, we may promote an arbitrary function {f:\mathbb{R}^+\rightarrow\mathbb{R}} to a hermitian operator {f(\Delta)} on said space such that

\displaystyle \Delta A=\lambda A\;\implies\;f(\Delta)A=f(\lambda)A \quad\quad\forall\,\lambda\in\mathbb{R}^+,A\in\mathcal{B}~. \ \ \ \ \ (23)

We then define the quantum f-divergence as

\displaystyle D_f(\rho||\sigma)=\mathrm{tr}\left[\rho f(\Delta)\,1\right]~, \ \ \ \ \ (24)

To see how this is consistent with (that is, reduces to) the classical f-divergence defined previously, consider the spectral representations

\displaystyle \rho=\sum_ia_i\mu_i~,\quad\quad \sigma=\sum_ib_i\nu_i~, \ \ \ \ \ (25)

where {\{a_i\}}, {\{b_i\}} are the eigenvalues for the PVMs {\{\mu_i\}}, {\{\nu_i\}}. Then from the definition of the relative modular operator above,

\displaystyle \Delta(\nu_j\mu_j)=\frac{b_j}{a_i}\nu_j\mu_i \quad\implies\quad f(\nu_j\mu_j)=f\!\left(\frac{b_j}{a_i}\right)\nu_j\mu_i~, \ \ \ \ \ (26)

and hence for simple measurements,

\displaystyle D_f(\rho||\sigma)=\sum_{i,j}a_i\,f\!\left(\frac{b_j}{a_i}\right)\mathrm{tr}(\nu_j\mu_i)~. \ \ \ \ \ (27)

To compare with the classical expression given in part 2 (cf. eq. (18)), we must take into account the fact that {\mu} and {\nu} are only independently orthonormal (that is, it does not follow that {\mu_i\nu_j=\delta_{ij}\mu_i}). Accordingly, consider performing sequential measurement {\mu} followed by {\nu}, i.e., {G=\{G_{ij}\}} with {G_{ij}=\nu_j\mu_i}. This can be used to constructed the POVM {M=\{M_{ij}\}}, where

\displaystyle M_{ij}=G_{ij}^\dagger G_{ij}=\mu_i\nu_j\mu_i~, \ \ \ \ \ (28)

where we have used the orthonormality of {\nu}. Applying von Neumann’s projection hypothesis, this takes the initial state {\rho} to the final state {G_{ij}\rho G_{ij}^\dagger/\mathrm{tr}\left(\rho M_{ij}\right)} with probability

\displaystyle p_{ij}=\mathrm{tr}\left(\rho M_{ij}\right) =\mathrm{tr}\left(\sum_ka_k\mu_k\mu_i\nu_j\mu_i\right) =a_i\mathrm{tr}\left(\mu_i\nu_j\right)~, \ \ \ \ \ (29)

where in the last step we have again used the orthonormality condition (4), and the cyclic property of the trace. Similarly, by inverting this sequence of measurements, we may construct the POVM with elements {N_{ij}=\nu_j\mu_i\nu_j}. Let us denote the probabilities associated with the outcomes of acting on {\sigma} as {q_{ij}=\mathrm{tr}\left(\sigma N_{ij}\right)=b_j\mathrm{tr}\left(\mu_i\nu_j\right)}. Solving for the eigenvalues {a_i} and {b_j} in these expressions then enables one rewrite (27) in terms of the probability distributions {p=\{p_{ij}\}}, {q=\{q_{ij}\}}, whereupon we find

\displaystyle D_f(\rho||\sigma)=\sum_{i,j}p_{ij}\,f\!\left(\frac{q_{ij}}{p_{ij}}\right)~, \ \ \ \ \ (30)

which is the discrete analogue of the classical divergence referenced above.

Now, recall that every dual structure is naturally induced by a divergence. In particular, it follows from Chentsov’s theorem that the Fisher metric and {\alpha}-connections are induced from the f-divergence {D_f} for any smooth convex function {f:\mathbb{R}^+\rightarrow\mathbb{R}} which satisfies

\displaystyle f(1)=0~,\qquad f''(1)=1~,\qquad \alpha=3+2f'''(1)~. \ \ \ \ \ (31)

Thus, to find the formal quantum analogues of these objects, we restrict to the class of functions {f} which satisfy the above conditions, whereupon the quantum f-divergence {D_f} induces the dual structure {\left(g^{(f)},\nabla^{(f)},\nabla^{(f^*)}\right)=\left(g^{(D_f)},\nabla^{(D_f)},\nabla^{(D_f^*)}\right)}, where {D_f^*=D_{f^*}}. In fact, for a simple measurement which diagonalizes {\rho}, the restriction of this triple to {S_\Pi} coincides precisely with the Fisher metric and {\pm\alpha}-connections. Additionally, note that {D_f} is invariant under unitary transformations, in the sense that {D_f(U\rho U^*||U\sigma U^*)=D(\rho||\sigma)}; in other words, for arbitrary vector fields {X}, {Y}, we have

\displaystyle \langle UXU^*,UYU^*\rangle=\langle X,Y\rangle~, \ \ \ \ \ (32)

where {UXU^*} is formally understood as the vector field such that {(UXU^*)_{U\rho U^*}^{(m)}=UX_\rho^{(m)}U^*}. Thus {g=\langle\cdot,\cdot\rangle}, {\nabla}, and {\nabla^*} are indeed invariant under the action of {U}, which is the characteristic property of the classical analogues we wanted to preserve.

As in the classical case, we can further restrict the class of functions {f} to satisfy eq.~(20) in part 1, which defines the quantum {\alpha}-divergence {D^{(\alpha)}}. Then for {\alpha\neq\pm1}, we obtain

\displaystyle D^{(\alpha)}(\rho||\sigma)=\frac{4}{1-\alpha^2}\left[1-\mathrm{tr}\left(\rho^{\frac{1-\alpha}{2}}\sigma^{\frac{1+\alpha}{2}}\right)\right]~, \ \ \ \ \ (33)

while for {\alpha=\pm1}, we instead have

\displaystyle D^{(-1)}(\rho||\sigma)=D^{(1)}(\sigma||\rho)=\mathrm{tr}\left[\rho\left(\ln\rho-\ln\sigma\right)\right]~, \ \ \ \ \ (34)

cf. eq. (21) and (22) ibid. In obtaining these expressions, we used the fact that the relative modular operator satisfies

\displaystyle (\Delta_{\sigma,\rho})^rA=\sigma^rA\rho^{-r}\quad\quad\forall r\in\mathbb{R} \ \ \ \ \ (35)


\displaystyle \left(\ln\Delta_{\sigma,\rho}\right)X=\left(\ln\sigma\right)X-X\ln\rho~, \ \ \ \ \ (36)

where the {r^{\mathrm{th}}}-power and logarithm of an operator {\rho} with eigenvalues {\{p_i\}} are defined such that these become {\{p_i^r\}} and {\{\ln p_i\}}, respectively, with the same (original) orthonormal eigenvectors. Of course, the quantum analogue of the Kullback-Leibler divergence (34) is none other than the quantum relative entropy!

Many of the classical geometric relations can also be extended to the quantum case. In particular, {\mathcal{S}} is dually flat with respect to {\left(g^{(\pm1)},\nabla^{(1)},\nabla^{(-1)}\right)}, and the canonical divergence is the (quantum) relative entropy {D^{(\pm1)}}. And as in the discussion of (classical) exponential families, we can parameterize the elements of an arbitrary {\nabla^{(1)}}-autoparallel submanifold {M\subseteq\mathcal{S}} as

\displaystyle \rho_\theta=\mathrm{exp}\left[C+\theta^iF_i-\psi(\theta)\right]~, \ \ \ \ \ (37)

where now {F_1,\ldots,F_m} are hermitian operators, and {\psi(\theta)} is an {\mathbb{R}}-valued function on the canonical parameters {[\theta^i]}. As in the classical case, these form a 1-affine coordinate system, and hence the dual {(-1)}-affine coordinates are given by {\eta_i(\theta)\equiv\mathrm{tr}\left(\rho_\theta F_i\right)}, which naturally extends the classical definition in terms of the expectation value of {F_i}, cf. eq. (36) of part 2. Recall that {[\theta^i]} and {[\eta_i]} are related through the dual potential {\varphi}, defined via the Legendre transform

\displaystyle \varphi(\theta)=\mathrm{sup}_{\theta'}\left[\theta'^i\eta_i(\theta)-\psi(\theta')\right]~. \ \ \ \ \ (38)

Taking a derivative of the bracketed quantity, the maximum occurs when {\eta_i(\theta)=\partial_i\psi(\theta')}, where the derivative is with respect to {\theta'^i}. But this condition precisely corresponds to the definition of {\eta_i} above, cf. eq. (37) in part 2. Hence

\displaystyle \varphi(\theta)=\theta^i\eta_i(\theta)-\psi(\theta) =-H(\rho_\theta)-\mathrm{tr}(\rho_\theta C)~, \ \ \ \ \ (39)

where we have identified the von Neumann entropy {H(\rho)\equiv-\mathrm{tr}(\rho\ln\rho)}, and used the fact that {\mathrm{tr}\rho=1} (that is, {\mathrm{tr}\left(\rho\psi(\theta)\right)=\psi(\theta)}).

As one might expect from the appearance of the von Neumann entropy, several other important concepts in quantum information theory emerge naturally from this framework. For example, the monotonicity of the classical f-divergence can be extended to the quantum f-divergence as well, whence {D_f} satisfies the monotonicity relation

\displaystyle D_f(\Gamma\rho||\Gamma\sigma)\leq D_f(\rho||\sigma) \ \ \ \ \ (40)

for any completely positive trace-preserving map {\Gamma} (i.e., a quantum channel), and operator convex function {f}; see section 7.2 for more details. Since the {\alpha}-divergence is a special case of the f-divergence, this result implies the monotonicity of relative entropy (a detailed analysis can be found in [1]).

To close, let us comment briefly on some relations to Tomita-Takesaki modular theory. In this context, the relative modular operator is defined through the relative Tomita-Takesaki anti-linear operator {S_{\sigma,\rho}}, and provides a definition of relative entropy even for type-III factors. For finite-dimensional systems, one can show that this reduces precisely to the familiar definition of quantum relative entropy given above. For more details, I warmly recommend the recent review by Witten [2]. Additionally, the fact that {D_f} satisfies monotonicity (40) corresponds to the statement that the relative entropy is monotonic under inclusions, which can in turn be used to demonstrate positivity of the generator of (half-sided) modular inclusions; see [3] for an application of these ideas in the context of the eternal black hole.


  1. A. Mueller-Hermes and D. Reeb, “Monotonicity of the quantum relative entropy under positive maps,” arXiv:1512.06117 [quant-ph]
  2. E. Witten, “Notes on some entanglement properties of quantum field theory,” arXiv:1803.04993 [hep-th]
  3. R. Jefferson, “Comments on black hole interiors and modular inclusions,” arXiv:1811.08900 [hep-th]
Posted in Minds & Machines, Physics | 2 Comments

Boundary conditions in AdS/CFT

The issue of boundary conditions in AdS/CFT has confused me for years; not because it’s intrinsically complicated, but because most of the literature simply regurgitates a superficial explanation for the standard prescription which collapses at the first inquiry. Typically, the story is presented as follows. Since we’ll be interested in the near-boundary behaviour of bulk fields, it is convenient to work in the Poincaré patch, in which the metric for Euclidean AdS{_{d+1}} reads

\displaystyle \mathrm{d} s^2=\frac{\mathrm{d} z^2+\mathrm{d} x_i\mathrm{d} x^i}{z^2}~, \ \ \ \ \ (1)

where {i=1,\ldots,d} parametrizes the transverse directions. Now consider the action for a bulk scalar {\phi}, dual to some primary operator {\mathcal{O}} in the CFT:

\displaystyle S=\frac{1}{2}\int\!\mathrm{d} z\mathrm{d}^d x\sqrt{-g}\left(\partial_\mu\phi\partial^\mu\phi+m^2\phi^2\right)~, \ \ \ \ \ (2)

where {\mu=0,1,\ldots,d}. The on-shell equations of motion yield the familiar Klein-Gordon equation,

\displaystyle \left(\square-m^2\right)\phi=0~,\quad\mathrm{where}\quad \square\phi\equiv\frac{1}{\sqrt{-g}}\,\partial_\mu\!\left(\sqrt{-g}\,g^{\mu\nu}\partial_\nu\phi\right)~. \ \ \ \ \ (3)

This is a second-order differential equation with two independent solutions, whose leading behaviour as one approaches the boundary at {z=0} scale like {z^{\Delta_{\pm}}}, where {\Delta_{\pm}\in\mathbb{R}} are given by

\displaystyle m^2\ell^2=\Delta(\Delta-d)\quad\implies\quad \Delta_\pm=\frac{d}{2}\pm\sqrt{\frac{d^2}{4}+m^2\ell^2}~. \ \ \ \ \ (4)

For compactness, we shall henceforth set the AdS radius {\ell=1}. The general solution to the Klein-Gordon equation in the limit {z\rightarrow0} may therefore be expressed as a linear combination of these solutions, with coefficients {A(\mathbf{x})} and {B(\mathbf{x})}:

\displaystyle \phi(z,\mathbf{x})\rightarrow z^{\Delta_+}\left[A(\mathbf{x})+O(z^2)\right]+z^{\Delta_-}\left[B(\mathbf{x})+O(z^2)\right]~. \ \ \ \ \ (5)

The subleading {O(z^2)} behaviour plays no role in the following, so we’ll suppress it henceforth.

Note that the condition {\Delta_\pm\in\mathbb{R}} imposes a limit on the allowed mass of the bulk field, namely that {m^2\geq-d^2/4}. This is known as the Breitenlohner-Freedman (BF) bound. (Unlike in flat space, a negative mass squared doesn’t necessarily imply that the theory is unstable; intuitively, the negative curvature of AdS compensates for the backreaction, provided the excitations are sufficiently small). Furthermore, if one integrates the action (2) from {z\!=\!0} to the cutoff at {z\!=\!\epsilon}, the result will be finite for {\Delta\geq d/2}, i.e., {\Delta=\Delta_+}. Thus {\Delta_+} is a solution for all masses above the BF bound.

For {\Delta_-}, the situation is slightly more subtle. One can show that the boundary term obtained by integrating (2) by parts is non-zero only if {\Delta\leq d/2}; i.e., for {\Delta_-}, there exists a second, inequivalent action (read: theory). And in this case, the bulk integral up to the cutoff is finite only for {\Delta\geq d/2-1}, which corresponds to the restricted mass range

\displaystyle -\frac{d^2}{4}<m^2<-\frac{d^2}{4}+1~. \ \ \ \ \ (6)

The upper limit is called the unitarity bound, since the limit {\Delta=d/2-1} corresponds to the constraint, from unitarity, on the representation of conformal field theories. Hence for masses below the unitarity bound (but above the BF bound, of course), one may choose either {\Delta_+} or {\Delta_-}, which implies two different bulk theories for the same CFT. More on this below. (Slightly more technical details can be found in section 5.3 of [1]).

With the above in hand, the vast majority of the literature simply recites the “standard prescription”, in which one identifies {B(\mathbf{x})} as the source of the boundary operator {\mathcal{O}}, which is in turn identified with {A(\mathbf{x})}. This appears to be based on the observation that, since {\Delta_-<\Delta_+} by definition, the second term dominates in the {z\rightarrow0} limit, which requires that one set {B(\mathbf{x})=0} in order that the solution be normalizable—though as remarked above, this is clearly not the full story in the mass range (6). Sometimes, an attempt is made to justify this identification by the observation that one can multiply through by {z^{-\Delta_-}}, whereupon one sees that

\displaystyle B(\mathbf{x})=\lim_{z\rightarrow0} z^{-\Delta_-}\phi(z,\mathbf{x})~. \ \ \ \ \ (7)

In accordance with the extrapolate dictionary, this indeed suggests that {B(\mathbf{x})} sources the bulk field {\phi(z,\mathbf{x})}. But it does not explain why {A(\mathbf{x})} should be identified as the boundary dual of this bulk field; that is, it does not explain why {A} and {B} should be related in the CFT action as

\displaystyle \int\!\mathrm{d}^dx\,B(\mathbf{x})\mathcal{O}(\mathbf{x})~, \quad\mathrm{with}\quad A(\mathbf{x})\sim\langle\mathcal{O}\rangle~, \ \ \ \ \ (8)

since a priori these are independent coefficients appearing in the solution to the differential equation above. Furthermore, it does not follow from the fact that the bulk coefficient {B(\mathbf{x})} has the correct conformal dimension as the boundary source field (and that both will be set to zero on-shell) that the two must be identified: this may be aesthetically suggestive, but it is by no means logically necessary.

Some clarification is found in the classic paper by Klebanov and Witten [2]. They observe that the extrapolate dictionary — or what one would identify, in more modern language, as the HKLL prescription for bulk reconstruction — can be written

\displaystyle \phi(z,\mathbf{x})=\int\!\mathrm{d}^dx' K_\Delta(z,\mathbf{x},\mathbf{x}')\phi_0(\mathbf{x}')~, \ \ \ \ \ (9)

where {K_\Delta} is the bulk-boundary propagator with conformal dimension {\Delta},

\displaystyle K_\Delta(z,\mathbf{x},\mathbf{x}')=\pi^{-d/2}\frac{\Gamma(\Delta)}{\Gamma(\Delta-d/2)}\frac{z^\Delta}{\left[z^2+(\mathbf{x}-\mathbf{x}')^2\right]^\Delta}~, \ \ \ \ \ (10)

and {\phi_0(\mathbf{x})\equiv\phi(0,\mathbf{x})} is the boundary limit of the bulk field {\phi}, which serves as the source for the dual operator {\mathcal{O}}. Choosing {\Delta=\Delta_+} then implies that in order to be consistent with (5), we must identify {\phi_0(\mathbf{x})=B(\mathbf{x})}, and

\displaystyle A(\mathbf{x})=\pi^{-d/2}\frac{\Gamma(\Delta)}{\Gamma(\Delta-d/2)}\int\!\mathrm{d}^2x'\frac{B(\mathbf{x}')}{(\mathbf{x}-\mathbf{x}')^{2\Delta_+}}~, \ \ \ \ \ (11)

which one obtains by taking the {z\rightarrow0} limit of the integral above. We then recognize the position-space Green function {G(\mathbf{x},\mathbf{x}')\sim|\mathbf{x}-\mathbf{x}'|^{-2\Delta}}, whence

\displaystyle A(\mathbf{x})\sim\int\!\mathrm{d}^dx'B(\mathbf{x})G(\mathbf{x},\mathbf{x}')~, \ \ \ \ \ (12)

which is the statement that {B(\mathbf{x})} is the source for the one-point function {A(\mathbf{x})}. That is, recall that the generating function can be written in terms of the Feynman propagator as

\displaystyle Z[J]=\exp\left(\frac{i}{2}\int\!\mathrm{d}^dx\mathrm{d}^dx'J(\mathbf{x})G(\mathbf{x},\mathbf{x}')J(\mathbf{x}')\right)~, \ \ \ \ \ (13)

whereupon the {n^\mathrm{th}} functional derivative with respect to the source {J} yields the {n}-point function

\displaystyle \langle\mathcal{O}(\mathbf{x}_1)\ldots\mathcal{O}(\mathbf{x}_n)\rangle=\frac{1}{i^n}\frac{\delta^nZ[J]}{\delta J(\mathbf{x}_1)\ldots\delta J(\mathbf{x}_n)}\bigg|_{J=0}~. \ \ \ \ \ (14)

Hence, taking a single derivative, we have

\displaystyle \frac{1}{i}\frac{\delta Z[J]}{\delta J(\mathbf{x})}\bigg|_{J=0}=\int\!\mathrm{d}^dx'J(\mathbf{x})G(\mathbf{x},\mathbf{x}')Z[J]\bigg|_{J=0}=\langle\mathcal{O}(\mathbf{x})\rangle~, \ \ \ \ \ (15)

which is precisely the form of (12), with {A(\mathbf{x})\simeq\langle\mathcal{O}(\mathbf{x})\rangle} and {B(\mathbf{x})=J(\mathbf{x})}, thus explaining the relation (8). (Note that we’re ignoring a constant factor here; the exact relation between {A(\mathbf{x})} and the vev is fixed in [2] to {A(\mathbf{x})=(2\Delta-d)^{-1}\langle\mathcal{O}(\mathbf{x})\rangle}). Of course, setting {J=0} causes the one-point function to vanish on-shell. The fact that we set the sources to zero resolves the apparent tension between setting {B(\mathbf{x})=0} for normalizability, and simultaneously having it appear in (12): the expansion (5) was obtained by solving the equations of motion, so we only have {B=0} on-shell.

Thus we see that the bulk-boundary correspondence in the form (9) provides the extra constraint needed to relate the coefficients {A} and {B} in the expansion (5) as (vev of) operator and source, respectively, and that choosing the alternative boundary condition {\Delta=\Delta_-} simply interchanges these roles. It is an interesting and under-appreciated fact that any holographic CFT therefore admits two different bulk duals, i.e., two different quantum field theories are encoded in the same CFT Lagrangian!

The above essentially summarizes the basic story for the usual case, i.e., a CFT action of the form

\displaystyle S=\int\!\mathrm{d}^dx\left(\mathcal{L}+J\mathcal{O}\right)~. \ \ \ \ \ (16)

However, the question I encountered during my research that actually motivated this post is what happens to these boundary conditions in the presence of a double-trace deformation,

\displaystyle \delta S=\int\!\mathrm{d}^dx \,h\,\mathcal{O}^2~. \ \ \ \ \ (17)

To that end, suppose we start with a CFT at some UV fixed point. If the coupling {h} is irrelevant ({[h]<0}), then the deformation does not change the effective field theory, since it only plays a role in high-energy correlators. However, if the coupling is relevant ({[h]>0}), then the perturbation induces a flow into the IR. And it turns out that the boundary conditions change from {\Delta_-} (in the UV) to {\Delta_+} (in the IR) as a result.

First, let’s understand why we need to start with the “alternative boundary condition” {\Delta_-} in the UV. The paper [3] motivates this by examining the Green function in momentum space, which scales like {G(k)\sim k^{2\Delta_\pm-d}=k^{\pm 2\nu}}, where {\nu\equiv\sqrt{d^2/4+m^2}>0}. Thus in the UV, we choose {\Delta_-} so that the Green function converges. This explanation isn’t entirely satisfactory though, since it’s not clear why this should converge with {\Delta_+} in the IR.

In this sense, a better explanation for choosing {\Delta_-} in the UV is obtained simply by examining the mass dimensions in the coupling term (17): in order for {\delta S} to be dimensionless, we must have {[h]=d-2\Delta}. The condition that {h} be a relevant coupling then implies that {\Delta<d/2}. Since {\Delta_+} can only satisfy this condition for masses below the Breitenlohner-Freedman bound, we are constrained to choose {\Delta_-}. Note that from this perspective, the choice {\Delta_+} is equally valid, but it corresponds to an irrelevant perturbation instead, so nothing interesting happens.

Understanding why the theory flows to {\Delta_+} at the new IR fixed point is more complicated, and requires computing the beta functions for the couplings. This is done in [4], where they compute the evolution equation for the coupling (i.e., the beta function for {h}), which enables them to identify the fixed points which correspond to the alternative and standard quantizations in the UV and IR, respectively.

It is then natural to ask what happens at intermediate points along the flow, since the above applies only to the UV/IR fixed points, at which a single choice (either {\Delta_+} or {\Delta_-}) is determined. This leads to the notion of mixed boundary conditions, as introduced by Witten in [5]. In order to understand his proposal, let’s examine the relationship between {A} and {B} above from another angle. The term in the effective action {W} corresponding to the single-trace insertion is

\displaystyle W=\int\!\mathrm{d}^dx\,J\mathcal{O}~. \ \ \ \ \ (18)

Taking functional derivatives with respect to the source {J} then yields the operator expectation value,

\displaystyle \frac{\delta W}{\delta J}=\langle\mathcal{O}\rangle~, \ \ \ \ \ (19)

which of course simply restates the relationship (15). As explained above, the standard quantization identifies {A=\langle\mathcal{O}\rangle} and {B=J}, which amounts to a trivial relabeling of the variables in this expression. However, the alternative quantization, in which {B=\langle\mathcal{O}\rangle} and {A=J}, is slightly more subtle: a similar relabeling would work for the present case, but not for a multi-trace deformation. Instead, consider varying the effective action with respect to the operator {\mathcal{O}}:

\displaystyle \frac{\delta W}{\delta\mathcal{O}}=J~. \ \ \ \ \ (20)

Note that unlike (19), this is an operator statement; it may seem like an odd thing to write down, but in fact it’s formally the same expression that arises in the derivation of Ward identities. Specifically, if one considers an infinitesimal shift of the field variable {\delta\phi(x)}, then invariance of the generating functional {Z[J]} requires

\displaystyle 0=\delta Z[J]=i\int\!\mathcal{D}\phi\,e^{iS}\int\!\mathrm{d}^dx\left(\frac{\delta S}{\delta \phi(x)}+J(x)\right)\delta\phi(x)~, \ \ \ \ \ (21)

where we’ve taken the action (16) with {\phi(x)} in place of {\mathcal{O}}. The resemblance to Noether’s theorem is intriguing, but I’m not sure whether there’s a deeper relation to symmetries in QFT at work here. In any case, Witten’s proposal is that in the alternative quantization, one should identify

\displaystyle A=\frac{\delta W}{\delta\mathcal{O}}~, \ \ \ \ \ (22)

which simply states that {A=J}. Strictly speaking, we can’t identify {B} in this expression, since it doesn’t tell us the vev of the operator {\mathcal{O}}. Rather, to find the expectation value {\langle\mathcal{O}\rangle} — in particular, that it should be identified with {B} — requires us to solve the bulk equations of motion with this boundary condition, substitute the result into the bulk action, and then take variational derivatives with respect to the source as usual. In the end of course, one recovers the simple identification {B=\langle\mathcal{O}\rangle}, hence Witten actually writes (22) directly with {B} in place of {\mathcal{O}}, with the understanding that the expression is purely formal.

Now consider a double-trace deformation,

\displaystyle W=\int\!\mathrm{d}^dx\,\frac{f}{2}\,\mathcal{O}^2~, \ \ \ \ \ (23)

where we’ve rescaled the coupling {f=2h} for consistency with the cited literature (and the convenient factor of 2). According to (22), this leads to the mixed boundary condition

\displaystyle A(\mathbf{x})=fB(\mathbf{x})~. \ \ \ \ \ (24)

As observed in [3], this neatly interpolates between the standard and alternative prescriptions above. To see this, observe that since the deformation is explicitly chosen to be relevant, the dimensionless coupling grows as we move into the IR, i.e., {\tilde f\equiv f/E\rightarrow\infty} as the energy scale {E\rightarrow0}. Since {\tilde f^{-1}=0\implies B(\mathbf{x})=0}, (5) requires that we identify {A(\mathbf{x})=\langle\mathcal{O}(\mathbf{x})\rangle}, with the standard choice {\Delta_+}. Conversely, the coupling shrinks to zero in the UV, where {\tilde f=0\implies A(\mathbf{x})=0}, and hence we identify {B(\mathbf{x})} as the vev of the operator instead, with the alternative choice {\Delta_-}.

Witten’s prescription extends to higher-trace deformations as well, though I won’t discuss those here. The point, for the purposes of the present post, is that it provides a unified treatment of boundary conditions in holography that smoothly interpolates between the UV/IR, and hence allows one to naturally treat the bulk dual all along the flow.

I am grateful to my former officemate, Diptarka Das, for several enlightening discussions on this topic, and for graciously allowing me to pester him across several afternoons until I was satisfied.


  1. M. Ammon and J. Erdmenger, “Gauge/gravity duality,” Cambridge University Press, Cambridge, 2015.
  2. I. R. Klebanov and E. Witten, “AdS / CFT correspondence and symmetry breaking,” Nucl. Phys. B556 (1999) 89–114, arXiv:hep-th/9905104 [hep-th].
  3. T. Hartman and L. Rastelli, “Double-trace deformations, mixed boundary conditions and functional determinants in AdS/CFT,” JHEP 01 (2008) 019, arXiv:hep-th/0602106 [hep-th].
  4. I. Heemskerk and J. Polchinski, “Holographic and Wilsonian Renormalization Groups,” JHEP 06 (2011) 031, arXiv:1010.1264 [hep-th].
  5. E. Witten, “Multitrace operators, boundary conditions, and AdS / CFT correspondence,” arXiv:hep-th/0112258 [hep-th].
Posted in Physics | Leave a comment

Tomita-Takesaki in Rindler space

Rindler space provides a convenient example to elucidate some basic properties of AQFT — specifically Tomita-Takesaki theory — in what is arguably the case of greatest interest to high-energy theorists. We shall begin by introducing a few fundamental objects in the theory, and then apply them to Rindler space, loosely following the excellent review by Witten [1], as well as the work of Papadodimas [2] and Papadodimas & Raju [3].

We begin with the {C^*}-algebra {\mathcal{A}_\mathcal{U}} of observables in some spacetime region {\mathcal{U}} acting on the Hilbert space {\mathcal{H}} of some quantum field theory (accordingly, we will be concerned with Type III algebras throughout this post). In fact, we shall be slightly more restrictive and assume that {\mathcal{A}} is a von Neumann algebra. Recall that a {C^*}-algebra is the set of bounded operators which is closed both in the norm topology and under the adjoint operation. The weak closure of a {C^*}-algebra is a von Neumann algebra, which further implies {(\mathcal{A'})'=\mathcal{A}''=\mathcal{A}} by von Neumann’s double commutant theorem. Additionally, we take {\mathcal{A}} to be equipped with the cyclic and separating vector {\Omega}. Cyclic means that states {\mathcal{O}|\Omega\rangle}, {\mathcal{O}\in\mathcal{A}} are dense in {\mathcal{H}}; separating means that {\mathcal{O}|\Omega\rangle=0} iff {\mathcal{O}=0}. (Note that, glossing over various representational subtleties, the vector {\Omega} therefore serves as the vacuum state {|\Omega\rangle} of {\mathcal{H}}). For concreteness, as well as later convenience, we shall take {\mathcal{U}} to be the right Rindler wedge {R}, and denote {\mathcal{A}\equiv\mathcal{A}_R=\mathcal{A}_\mathcal{U}}. The commutant is then the algebra in the left Rindler wedge {L}, {\mathcal{A}'\equiv\mathcal{A}_L=\mathcal{A}'_\mathcal{U}=\mathcal{A}_{\mathcal{U}'}}. (The very last equality is sometimes called Haag duality, and applies when {\mathcal{U}} and {\mathcal{U}'} are causal complements).

Given a von Neumann algebra {\mathcal{A}}, Tomita-Takesaki theory admits a canonical construction of the commutant {\mathcal{A}'}, which we will use to derive some relationships between the left and right Rindler wedges. The starting point is the antilinear map {S_\Omega:\mathcal{H}\rightarrow\mathcal{H}}, defined by

\displaystyle S_\Omega \mathcal{O}|\Omega\rangle=\mathcal{O}^\dagger|\Omega\rangle~, \ \ \ \ \ (1)

where recall that an antilinear map {f:V\rightarrow W} between complex vector spaces {V,W} satisfies {f(aX\!+bY)=a^*f(X)+b^*f(Y)\;\;\forall a,b\!\in\!\mathbb{C}} and {X,Y\!\in\!V}. Note that {S_\Omega} is a state-dependent operator, in that it is defined in terms of its action on the cyclic and separating state {|\Omega\rangle}. To minimize clutter, we shall omit the subscript on {S} henceforth, but one should bear in mind that most of the analysis below does not hold for general (excited) states, which fail the cyclic/separating condition.

Case in point: the fact that {\Omega} is separating is necessary for {S} to be well-defined, otherwise there would exist an annihilation operator {a\in\mathcal{A}} such that {a|\Omega\rangle=0} but {Sa|\Omega\rangle=a^\dagger|\Omega\rangle\neq0}, which contradicts the definition (1). The fact that {\Omega} is cyclic then ensures that it maps to a dense set of states on {\mathcal{H}}. Additionally, it follows from the definition of the action of {S} on an arbitrary state {|\psi\rangle=\mathcal{O}|\Omega\rangle} that, as an operator, we must have {S^2=1}. Note that by the square of the operator, we do not mean {S^\dagger S}; rather, this implies that {S} is its own inverse, {S^{-1}=S}. It turns out that {S^\dagger=S'}, the corresponding operator on the commutant {\mathcal{A}'}; see [1] for an elementary proof. Lastly, simply taking {\mathcal{O}} to be the identity operator in (1) shows that {S} leaves the vacuum invariant, {S|\Omega\rangle=|\Omega\rangle}.

Since {S} is invertible, it admits a unique polar decomposition,

\displaystyle S=J\Delta^{1/2}~, \ \ \ \ \ (2)

where {J} is an antiunitary operator (meaning {\langle JA,JB\rangle=\langle A,B\rangle^\dagger\;\;\forall A,B\in\mathcal{H}}) called the modular conjugation, and {\Delta} is a self-adjoint operator called the modular operator. The decomposition (2), combined with the fact that {\Delta^\dagger=\Delta}, implies that

\displaystyle \Delta= S^\dagger S~, \ \ \ \ \ (3)

which is sometimes taken as its definition. Additionally, since {\Delta} is a positive Hermitian operator, we may define the modular Hamiltonian {K\equiv-\log(S^\dagger S)} such that

\displaystyle \Delta=e^{-K}~. \ \ \ \ \ (4)

(N.b., some authors — e.g., Witten [1] — define {K} with an additional factor of {2\pi} such that {\Delta=e^{-2\pi K}}. Our notation here is consistent with [2,3]). Since both {S} and {S^\dagger} leave the vacuum state invariant, this property extends to the modular operator as well, {\Delta|\Omega\rangle=|\Omega\rangle}. It it important to note that {\Delta} does not have support purely within the right Rindler wedge {R}, i.e., on the algebra {\mathcal{A}}, but acts on the full Hilbert space defined over both the left and right wedges. This is immediate from (4), and the fact that the domain of {S^\dagger\!=\!S'} is {\mathcal{A}'}. We shall return to this point below. (By extension, {K|\Omega\rangle=0}, but this does not contradict the fact that {\Omega} is separating since {K} is not localized to any subregion).

For the last of the preliminaries, we note the following useful identities:

\displaystyle J\Delta^{1/2}=\Delta^{-1/2}J\quad\mathrm{and}\quad J^2=1\,\iff\,J^{-1}=J~. \ \ \ \ \ (5)

These essentially follow from the fact that {S^2\!=\!1} in conjunction with the polar decomposition. Combining (5) with the aforementioned fact that {S'=S^\dagger} leads to {J'=J} and {\Delta'=\Delta^{-1}}. Finally, expressing {J} in terms of {S} and {\Delta} via (2) enables one to show that it likewise leaves the vacuum invariant; i.e., we have

\displaystyle S|\Omega\rangle=J|\Omega\rangle=\Delta|\Omega\rangle=|\Omega\rangle~. \ \ \ \ \ (6)

Now, the fundamental result of Tomita-Takesaki theory is comprised of the following two facts: first, the modular operator {\Delta} defines a 1-parameter family of modular automorphisms

\displaystyle \Delta^{it}\mathcal{A}\Delta^{-it}=\mathcal{A}~,\;\;\forall t\in\mathbb{R}~. \ \ \ \ \ (7)

In other words, the algebra {\mathcal{A}} is invariant under modular flow. Second, the modular conjugation induces an isomorphism between the algebra and its commutant {\mathcal{A}'},

\displaystyle J\mathcal{A}J=\mathcal{A}'~, \ \ \ \ \ (8)

i.e., {\forall\mathcal{O}\in\mathcal{A}}, {J} defines an element {\mathcal{O}'\equiv J\mathcal{O}J} such that {[\mathcal{O},\mathcal{O}']=0}. The isomorphism (8) is rather remarkable: it allows one to map operators between the left and right Rindler wedges, and features crucially in the works of Papadodimas and Raju on reconstructing black hole interiors [2,3].

To demonstrate how this works, let {\mathcal{O}\!\in\!\mathcal{A}} be a unitary operator with support in {R}. This creates a state {|\psi\rangle=\mathcal{O}|\Omega\rangle} which is indistinguishable from the vacuum state for all observers (i.e., all operators {\mathcal{O}'\in\mathcal{A}'}) in {L}:

\displaystyle \langle\psi|\mathcal{O}'|\psi\rangle =\langle\Omega|\mathcal{O}^\dagger\mathcal{O}'\mathcal{O}|\Omega\rangle =\langle\Omega|\mathcal{O}'|\Omega\rangle~, \ \ \ \ \ (9)

where in the last step, we used the fact that {[\mathcal{O},\mathcal{O}']=0} (note that this wouldn’t have worked if {\mathcal{O}} were non-unitary). Now consider the state

\displaystyle |\phi\rangle=\Delta^{1/2}\mathcal{O}|\Omega\rangle~. \ \ \ \ \ (10)

Perhaps surprisingly, this state is indistinguishable from the vacuum for an observer in {R}, i.e., it represents an excitation localized entirely outside {R}! (One has to be careful in saying that it’s localized in {L}, since under suitable time evolution it may evolve through the left Rindler horizon). This is straightforward to show, using the properties of the modular conjugation above:

\displaystyle |\phi\rangle=J^2\Delta^{1/2}\mathcal{O}|\Omega\rangle =JS\mathcal{O}|\Omega\rangle =J\mathcal{O}^\dagger|\Omega\rangle =J\mathcal{O}^\dagger J|\Omega\rangle =\mathcal{O}'|\Omega\rangle~,\ \ \ \ \ (11)

where {\mathcal{O}'\equiv J\mathcal{O}^\dagger J} is an operator in the commutant. Indistinguishably of {|\phi\rangle} from the vacuum for observers in {R} then follows by the same logic as (9). However, while it is true that the state {|\phi\rangle} is (initially) localized entirely in {L}, the same is not true for the operator {\Delta^{1/2}\mathcal{O}}! This is due to the fact, mentioned above, that {\Delta} has support in both {L} and {R}, and hence the operator {\Delta^{1/2}\mathcal{O}\notin\mathcal{A},\mathcal{A}'}. This example highlights the subtle yet crucial distinction between localized states vs. localized operators. Another way to emphasize this is that, as operators,

\displaystyle \mathcal{O}'\neq\Delta^{1/2}\mathcal{O} \ \ \ \ \ (12)

(since these exist in different algebras), even though as states,

\displaystyle \mathcal{O}'|\Omega\rangle=\Delta^{1/2}\mathcal{O}|\Omega\rangle~. \ \ \ \ \ (13)

as shown in (11). [I am grateful to Kyriakos Papadodimas for clarifying this distinction to me].

Note the similarity to the Reeh-Schlieder theorem at play here: it appears that we’ve created an excitation in {L} by acting with a unitary operator in the spacelike separated region {R}; but this isn’t quite true, since we relied on the modular operator {\Delta} to do so, which lives in both {\mathcal{A}} and {\mathcal{A}'}. To accomplish this feat purely with operators whose support lies entirely in {\mathcal{A}} would require these operators to be non-unitary, as explained in the eponymous post. This suggests a deep relationship between locality and unitarity embedded within this framework.

Now, let’s make the above formalism a bit more concrete. In the case of Rindler space specifically, it turns out that the modular Hamiltonian is proportional to the generator of Lorentz boosts. Witten [1] offers an intuitive path integral explanation of this fact. The derivation is non-rigorous, since it assumes a factorized Hilbert space {\mathcal{H}=\mathcal{H}_L\!\otimes\!\mathcal{H}_R}, but in the course of doing so it provides some insight into where and why this factorization fails, and highlights which operators do and do not exist—the reduced density matrix for either wedge being a prime example of the latter.

First, consider the Euclidean path integral that prepares the vacuum state {\Omega}. This is illustrated in figure (a) of the image below, which we’ve taken from [1]. The shaded region represents an integral over the {\tau<0} half-plane which produces the vacuum state {|\Omega\rangle} on the boundary. Similarly, integrating over the upper half-plane produces the bra {\langle\Omega|}. Stitching these together by integrating over all boundary values at {\tau=0} yields the density matrix {\rho=|\Omega\rangle\langle\Omega|}. As we shall shortly see, this is related to the modular operator {\Delta}.


Now pretend that we could factorize the Hilbert space as {\mathcal{H}=\mathcal{H}_L\!\otimes\!\mathcal{H}_R}, where the fields {\phi_L,\phi_R} at {\tau=0} are respectively supported in {x<0}, {x>0}. Strictly speaking, this defines {L} and {R} as open subregions of {\mathbb{R}^2}, but taking the weak closure to construct the associated von Neumann algebras entails including the bifurcation point {x=0}, as well as the lightlike Rindler horizons themselves. In some sense, this is the source of the problem, since we need to regulate these boundary regions when performing the path integral over either subspace. Specifically, consider the reduced density matrix for the right Rindler wedge,

\displaystyle \rho_R(\phi_R^+,\phi_R^-)=\int_{\phi_L^+=\phi_L^-}\!\mathcal{D}\phi_L|\Omega(\phi_L^-,\phi_R^-)\rangle\langle\Omega(\phi_L^+,\phi_R^+)|~, \ \ \ \ \ (14)

which is illustrated in figure (b). The boundary condition {\phi_L^+\!=\!\phi_L^-} indicates that we identify the fields in the upper and lower half-plane when computing the partial trace, while the cut along {x\!>\!0} indicates that {\phi_R^{\pm}} are kept free. The regularization issue mentioned above arises from the fact that local operators at the boundary point {x\!=\!0} are not well-defined, but must be treated as smeared distributions over some finite region. In typical calculations of entanglement entropy, one excises some small region around the endpoint of the cut, which acts as a UV cutoff that regulates the divergent result. [I am grateful to Jamie Sully for discussions on this issue]. But we can’t strictly do this while preserving the (von Neumann) algebraic structure above, which is one reason this procedure doesn’t rigorously hold.

Now consider figure (c), which represents the Euclidean wedge with opening angle {\theta}. Leaving both boundary conditions unfixed implies that this computes some operator, which acts as a rotation {R_\theta} in the Euclidean {\tau\!-\!x} plane:

\displaystyle R_\theta\binom{\tau}{x}=\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}\binom{\tau}{x} \;\implies\; R_\theta\binom{t}{x}=\begin{pmatrix} \cosh i\theta & -\sinh i\theta \\ -\sinh i\theta & \cosh i\theta \end{pmatrix}\binom{t}{x}~, \ \ \ \ \ (15)

where in the second expression we’ve gone to Lorentzian time {t=-i\tau}. But the latter is precisely a Lorentz boost of the real {t\!-\!x} plane with boost parameter {-i\theta}, and we know that the generator of such a boost can be expressed as

\displaystyle M=\int_{t=0}\!\mathrm{d} x\,x\,T_{00}~. \ \ \ \ \ (16)

Quite independent of the above analysis, {M} generates a Lorentz boost by a real parameter {\eta}, i.e., the operator {\exp(-i\eta M)}. Now if the factorization of the Hilbert space were legitimate, we could split {M=M_R-M_L} (with domains of integration restricted to the associated wedges), such that the path integral over the wedge in figure (c) yields the operator {\exp(-\theta M_R)}, where {\eta=-i\theta}. Such an operator propagates degrees of freedom from the upper boundary of the wedge at {x\!>\!0}, {t\!=\!0} to the lower boundary. By setting {\theta=2\pi}, one sees that we recover the path integral in figure (b), and therefore

\displaystyle \rho_R=\exp(-2\pi M_R)~. \ \ \ \ \ (17)

Of course, the same analysis holds for the left wedge, with {M_L} generating {\rho_L}. We then recover the modular operator {\Delta} by appealing to the tensor factor structure, which implies {[M_L,M_R]=0}, and hence

\displaystyle \Delta=\rho_R\otimes\rho_L^{-1}=\exp(-2\pi M_R)\exp(2\pi M_L)=\exp(-2\pi M)=e^{-K}~, \ \ \ \ \ (18)

where the promised proportionality between the modular Hamiltonian and the generator of Lorentz boosts is {K\!=\!2\pi M} (note that many authors absorb the factor of {2\pi} into the definition of {K}).

The formulation in terms of the wedge operator provides a cute illustration of how {\Delta^{1/2}} maps states in {R} to states in {L} and vice-versa, cf. (13). The interpretation is that {{\Delta^\alpha\!=\!\exp(-2\pi\alpha M)}} removes a wedge of angle {2\pi\alpha} from the {t\!=\!0} slice in {R} as depicted in figure (c), and adds an equal wedge in {L} (in the positive half-plane; i.e., it rotates the {t\!=\!0} slice). Since we started with vacuum in {L}, one can think of this as rotating excitations in {R} by an angle {\alpha}; setting {\alpha=1/2} thus moves an operator insertion from {R} to {L}, which is precisely the action of {\Delta^{1/2}} in equation (13) above.

As mentioned above, this path integral picture is not strictly valid since the Hilbert space of Type III factors does not admit a tensor product structure. One way in which this manifests is that {K\!\in\!\mathcal{H}} cannot be split into {K_{L,R}=2\pi M_{L,R}} in such a way as to yield operators whose support lies solely in {L,R}. This implies that the reduced density matrices {\rho_{L,R}} do not really exist! Given how much of our understanding of entanglement in quantum field theory is based on the reduced density matrix — and more recently, at least in certain holographic applications, the modular Hamiltonian {K_{L,R}} itself — this is in principle a rather serious problem. It remains to be seen how much of our ideas about entanglement must accordingly be revised, or whether this is analagous to Haag’s theorem about the non-existence of the interaction picture: theoretically damning, but practically irrelevant.

To see how one might get around this issue, consider two arbitrary, well-defined states {{|\psi_1\rangle,\,|\psi_2\rangle}}. The non-existence of the operator {K_R} manifests in a universal UV divergence in {K_R|\psi_1\rangle} at {x\!=\!0}, as explained above. Nonetheless, the matrix elements {\langle\psi_2|K_R|\psi_1\rangle} are still well-defined. An intuitive argument for this is that in the energy eigenbasis, correlation functions between well-defined (read: finite) states simply don’t see the problematic UV modes. [I am grateful to Ben Freivogel for this explanation]. Of course, one could certainly inquire as to the ontological status of matrix elements of an operator which does not technically exist. But, at least at an epistemic level, this offers a potential back-door to entanglement entropy. That is, taking the trace of the reduced density matrix {\rho_{L,R}} as is typically done is not a rigorous method for computing entanglement, since as discussed above {\rho_{L,R}} is not a valid operator in the theory. However, insofar as entanglement is essentially a measure of correlation functions, one could imagine performing the sum over all possible correlators instead. While the resulting sum may still be divergent, the individual terms are well-defined, and this may lend some support to the practical validity of many existing results despite the false foundation.


  1. E. Witten, “Notes on Some Entanglement Properties of Quantum Field Theory,”, arXiv:1803.04993
  2. K. Papadodimas, “A class of non-equilibrium states and the black hole interior,” arXiv:1708.06328
  3. K. Papadodimas and S. Raju, “State-Dependent Bulk-Boundary Maps and Black Hole Complementarity,” arXiv:1310.6335
Posted in Physics | Leave a comment

Information geometry (part 2/3)

In the previous post, we introduced the {\alpha}-connection, and alluded to a dualistic structure between {\nabla^{(\alpha)}} and {\nabla^{(-\alpha)}}. In particular, the cases {\alpha\!=\!\pm1} are intimately related to two important families of statistical models, the exponential or e-family with affine connection {\nabla^{(e)}\equiv\nabla^{(1)}}, and the mixed or m-family with affine connection {\nabla^{(m)}\equiv\nabla^{(-1)}}. Hence before turning to general aspects of the dual structure on {S}, it is illuminating to see how these families/connections emerge naturally via an embedding formalism.

Two embeddings

For concreteness (or rather, to err on the safe side of mathematical rigour), let {\mathcal{X}} be a finite set. As before, an arbitrary model {S} on {\mathcal{X}} is a submanifold of

\displaystyle \mathcal{P}(\mathcal{X})=\left\{p:\mathcal{X}\rightarrow\mathbb{R}\,\Big|\,p(x)>0\;\forall x\in\mathcal{X}~,\;\int\!\mathrm{d} x\,p(x)=1\right\}~, \ \ \ \ \ (1)

which in turn is a subset of the set of all {\mathbb{R}}-valued functions on {\mathcal{X}}, denoted

\displaystyle \mathbb{R}^\mathcal{X}\equiv\{A\,|\,A:\mathcal{X}\rightarrow\mathbb{R}\}~. \ \ \ \ \ (2)

Note that since {\mathcal{X}} is finite, the normalization constraint is essentially the imposition that {\sum_x A(x)=1}, which implies that {\mathcal{P}} is specifically an open subset of the affine subspace

\displaystyle \mathcal{A}_1\equiv\left\{A\,\Big|\,\sum\nolimits_xA(x)=1\right\}\subset\mathbb{R}^\mathcal{X}~, \ \ \ \ \ (3)

which further implies that the tangent space {T_p\mathcal{P}} is naturally identified with the linear subspace

\displaystyle \mathcal{A}_0\equiv\left\{A\,\Big|\,\sum\nolimits_xA(x)=0\right\}~. \ \ \ \ \ (4)

(One can see this by representing {A} as an algebraic curve). Vectors {X\in T_p\mathcal{P}}, considered as elements of {\mathcal{A}_0}, will be denoted {X^{(m)}}. These define the mixture or m-representation of {X}, i.e.,

\displaystyle T_p^{(m)}\mathcal{P}\equiv\left\{X^{(m)}\,\Big|\,X\in T_p\mathcal{P}\right\}=\mathcal{A}_0~. \ \ \ \ \ (5)

As before, the natural basis {\partial_i} then defines the {m}-affine coordinates {\xi=[\xi^i]}, with {(\partial_i)_\xi^{(m)}=\partial_ip_\xi}, with respect to which {\mathcal{P}} is {m}-flat. The natural connection induced from this affine structure {\mathcal{A}_1} is the {m}-connection introduced before, which preserves vectors identically under parallel transport {\Pi_{p,q}^{(m)}:T_p\mathcal{P}\rightarrow T_q\mathcal{P}}, i.e.,

\displaystyle \Pi_{p,q}^{(m)}X=X'\;\;\mathrm{such~that}\;\;{X'}^{(m)}=X^{(m)}~. \ \ \ \ \ (6)

Thus we see that the natural embedding of {\mathcal{P}} into {\mathcal{R}^\mathcal{X}} makes the affine structure, and in particular the significance of the {m}-connection, manifest.

Now consider the alternative embedding {p\mapsto\ln p}, and identify {\mathcal{P}} with the subset {\{\ln p\,|\,p\in\mathcal{P}\}\subset\mathbb{R}^\mathcal{X}}. Elements {X\in T_p\mathcal{P}} are then given by applying {X} to the map {p\mapsto\ln p}, whereupon we denote them {X^{(e)}} and call this the exponential or e-representation. In local coordinates, we have {(\partial_i)_\xi^{(e)}=\partial_i\ln p_\xi}. By the chain rule, {X^{(e)}} is then related to {X^{(m)}} as

\displaystyle X^{(e)}(x)=\frac{X^{(m)}}{p(x)}~, \ \ \ \ \ (7)

and therefore the tangent space {T_p^{(e)}\mathcal{P}} is obtained by modifying the constraint on {\mathcal{A}_0} above to {0=\sum\nolimits_xp(x)A(x)=E_p[A]}, i.e.,

\displaystyle T_p^{(e)}\mathcal{P}\equiv\left\{X^{(e)}\,\Big|\,X\in T_p\mathcal{P}\right\} =\{A\in\mathcal{R}^\mathcal{X}\,|\,E_p[A]=0\}~. \ \ \ \ \ (8)

One sees immediately that unlike {T_p^{(m)}\mathcal{P}}, {T_p^{(e)}\mathcal{P}} depends on {p}, and therefore parallel transport does not preserve {X^{(e)}}. However, while an element {A\in T_p^{(e)}\mathcal{P}} does not generally belong to {T_q^{(e)}\mathcal{P}} for {q\neq p}, the constraint in (7) implies that the shifted vector {A'\equiv A-E_q[A]} does belong to {T_p^{(e)}\mathcal{P}}. We therefore have

\displaystyle \Pi_{p,q}^{(e)}X=X'\;\;\mathrm{such~that}\;\;{X'}^{(e)}=X^{(e)}-E_q[X^{(e)}]~, \ \ \ \ \ (9)

where {\Pi_{p,q}^{(e)}} denotes parallel transport with respect to the e-connection {\nabla^{(e)}}. (Properly showing that {X} is e-parallel under (9) is of course slightly more involved, and is done on page 41). Thus the exponential (or, if one prefers, logarithmic) embedding of {\mathcal{P}} into {\mathbb{R}^\mathcal{X}} leads naturally to the e-affine coordinate system and associated connection introduced in the previous post.

It turns out that the e-representation has a number of interesting properties that allow one to establish close connections between the Fisher metric and various statistical notions. Note that, due to the presence of the log, the Fisher metric can be neatly expressed in the e-representation as

\displaystyle \langle X,Y\rangle_p=E_p\left[X^{(e)}Y^{(e)}\right]~. \ \ \ \ \ (10)

Furthermore, with a bit more technical footwork involving cotangent spaces and the like, one can derive a number of fundamental relations. For example, the variance of a random variable {A} is determined by the sensitivity of {E_p[A]} to perturbations of {p}, and may be expressed as

\displaystyle V_p[A]=E_p[(A-E_p[A])^2]=||(\mathrm{d} E[A])_p||_p^2~, \ \ \ \ \ (11)

where {\mathrm{d} E} is the differential of {E} at {p}. See section 2.5 for more details.

As alluded previously, the geometrical arguments are somewhat less rigorous for the case when {\mathcal{X}} is infinite. Intuitively, the issue is that the identification of {\mathcal{P}} with {\mathcal{A}_1}, and by extension the isometry between {T_p\mathcal{P}} and {\mathcal{A}_0}, requires that {\mathcal{P}} have the same dimensionality as {\mathcal{A}_1}. For finite {\mathcal{X}}, the constraint {p(x)\!>\!0\;\forall x\in\mathcal{X}} is “loose enough” for this to be possible, but this constraint becomes increasingly strict as the cardinality of {\mathcal{X}} increases (i.e., the portion of {\mathcal{A}_1} identified with {\mathcal{P}} decreases). For infinite {\mathcal{X}}, one effectively has an infinite number of constraints, which makes this dimensional matching impossible. Nonetheless, I’m given to understand that much of the framework can still be made to go through; see the references at the end of section 2.5 and associated discussion.

Dual structures on {S}

We’ve alluded to a duality between {\nabla^{(\pm1)}}, but in fact this extends to general {\alpha}-connections. Hence, rather than consider a particular connection in isolation, the fundamental structure in information geometry is the triple {(g,\nabla^{(\alpha)},\nabla^{(-\alpha)})}, which Amari & Nagaoka call a dualistic structure on {S} (we prefer the term “dual structure” instead, and will use this henceforth). Formally, if one has two affine connections {\nabla} and {\nabla^*} with respect to a Riemannian metric {g}, then if these satisfy

\displaystyle Z\langle X,Y\rangle=\langle\nabla_ZX,Y\rangle+\langle X,\nabla_Z^*Y\rangle\;\;\forall X,Y,Z\in TS~, \ \ \ \ \ (12)

then {\nabla} and {\nabla^*} are duals with respect to {g}, and one is called the dual or conjugate connection of the other. In local coordinates, this condition reads

\displaystyle \partial_kg_{ij}=\Gamma_{ij,k}+\Gamma^*_{kj,i}~, \ \ \ \ \ (13)

which follows from the definition {\langle\nabla_{\partial_i}\partial_j,\partial_k\rangle=\Gamma_{ij,k}}, and similarly for {\nabla^*}, with basis vectors {Z=\partial_i} etc. Given a metric {g} and connection {\nabla} on {S}, the dual connection {\nabla^*} is generally unique, and satisfies {(\nabla^*)^*=\nabla}. Additionally, the combination {(\nabla+\nabla^*)/2} is metric. Conversely, one immediately sees that the condition for {\nabla} to be metric is that {\nabla^*=\nabla}, and in this sense dual connections simply constitute a more general class thereof.

In particular, {\nabla^{(\alpha)}} and {\nabla^{(-\alpha)}} are dual with respect to the Fisher metric. This follows readily from the general expressions for the {\alpha}-connection developed in section 2.6, which we’ve skipped over here. Suffice to say the framework of {(\pm1)}-connection admits a straightforward extension to arbitrary {\alpha}.

The significance of dual connections is neatly illustrated by considering the parallel transport along a curve {\gamma} from {T_pS\rightarrow T_qS} with respect to {\nabla} and {\nabla^*}, respectively denoted {\Pi_\gamma} and {\Pi_\gamma^*}. As mentioned in the previous post, general {\alpha} connections are not metric; but the dual structure allows one to generalize the notion of preservation of the inner product along a curve, namely:

\displaystyle \langle\Pi_\gamma X,\Pi_\gamma^* Y\rangle_q=\langle X,Y\rangle_p~\quad\forall X,Y\in T_pS~. \ \ \ \ \ (14)

And furthermore, the relationship between {\Pi_\gamma} and {\Pi_\gamma^*} is completely fixed by this condition.


The duality structure of statistical manifolds facilitates the introduction of a distance-like measure {D}, which enables us to compute something like a geometrical distance between distributions. In particular, we shall define a smooth function {D(\cdot||\cdot):S\times S\rightarrow\mathbb{R}} such that {D(p||q)\geq0\;\;\forall p,q\in S}, with equality iff {p=q}. While this provides a measure of the separation between {p} and {q}, it is asymmetric and does not satisfy the triangle inequality, and hence fails the conditions for a distance function. However, if {D} further satisfies

\displaystyle \begin{aligned} D(\partial_i||\cdot)=D&(\cdot||\partial_i)=0~,\\ D(\partial_i\partial_j||\cdot)=D(\cdot||\partial_i\partial_j)&=-D(\partial_i||\partial_j)\equiv g_{ij}^{(D)}~, \end{aligned} \ \ \ \ \ (15)

where {[g_{ij}^{(D)}]} is a positive definite matrix everywhere on {S}, then {D} is a divergence or contrast function on {S}. Note that a divergence uniquely defines a Riemannian metric {g^{(D)}=\langle\cdot,\cdot\rangle^{(D)}} via

\displaystyle \langle X,Y\rangle^{(D)}=-D(X||Y)~, \ \ \ \ \ (16)

i.e., {\langle\partial_i,\partial_j\rangle^{(D)}=g_{ij}^{(D)}}. We can also define an affine connection {\nabla^{(D)}} associated with this divergence, via the coefficients {\Gamma_{ij,k}^{(D)}}

\displaystyle \Gamma_{ij,k}^{(D)}=-D(\partial_i\partial_j||\partial_k)\;\;\iff\;\; \langle\nabla_X^{(D)}Y,Z\rangle^{(D)}=-D(XY||Z)~. \ \ \ \ \ (17)

Similarly, we define the dual {D^*(p||q)\equiv D(q||p)}, from which we have {g^{(D^*)}=g^{(D)}} and {\Gamma_{ij,k}^{(D^*)}=-D(\partial_k||\partial_i\partial_j)}. Thus we have that {\nabla^{(D)}} and {\nabla^{(D^*)}} are dual with respect to {g^{(D)}}. In fact, any dual structure {(g,\nabla,\nabla^*)} is naturally induced from a divergence!

Of course, the above is quite general: there are infinitely many different divergences one could define on a manifold. Henceforth we shall specify to a particularly important class for statistical models, known as the f-divergence:

\displaystyle D_f(p||q)\equiv\int\!\mathrm{d} x\,p(x)f\!\left(\frac{q(x)}{p(x)}\right)~, \ \ \ \ \ (18)

where {f(u)} is an arbitrary convex function on {0<u\in\mathbb{R}} with {f(1)=0}. The f-divergence satisfies a number of important properties (see page 56), chief of which is monotonicity under arbitrary probability transition functions. That is, let {x\in\mathcal{X}} be randomly transformed into {y\in\mathcal{Y}} with probability {\kappa(y|x)\geq0}, with {\int\!\mathrm{d} y\,\kappa(y|x)=1\;\;\forall x}. This maps distributions {p(x),q(x)} to {p_\kappa(y),q_\kappa(y)}, respectively. Monotonicity of the f-divergence is then the statement that

\displaystyle D_f(p||q)\geq D_f(p_\kappa||q_\kappa)~. \ \ \ \ \ (19)

Spoiler alert: this will surface again in the form of monotonicity of relative entropy below! Note that {\kappa} is a generalization of the deterministic mapping induced by {F:\mathcal{X}\rightarrow\mathcal{Y}} in the previous post, which corresponds to {\kappa(y,x)=\delta_{F(x)}y}. Consequently, the equality is saturated iff {\kappa} is induced from a sufficient statistic, for which {p_\kappa(x|y)=q_\kappa(x|y)\;\;\forall x,y}.

Within this still-broad class of f-divergences, one finds the {\alpha}-divergence {D^{(\alpha)}=D_{f^{(\alpha)}}}, defined for all {\alpha\in\mathbb{R}} via

\displaystyle f^{(\alpha)}(u)=\begin{cases} \frac{4}{1-\alpha^2}\left(1-u^{(1+\alpha)/2}\right) \;\; & \alpha\neq\pm1\\ u\ln u & \alpha=1\\ -\ln u & \alpha=-1 \end{cases}~, \ \ \ \ \ (20)

which yields, for {\alpha\neq\pm1},

\displaystyle D^{(\alpha)}(p||q)=\frac{4}{1-\alpha^2}\left(1-\int\!\mathrm{d} x\,p(x)^{\frac{1-\alpha}{2}}q(x)^{\frac{1+\alpha}{2}}\right)~, \ \ \ \ \ (21)

and, for {\alpha=\pm1},

\displaystyle D^{(-1)}(p||q)=D^{(1)}(q||p)=\int\!\mathrm{d} x\,p(x)\ln\frac{p(x)}{q(x)}~. \ \ \ \ \ (22)

Of course, this last is none other than the relative entropy or Kullback-Leibler divergence!

The {\alpha}-divergence {D^{(\alpha)}} is so named because it naturally induces the dual structure {(g,\nabla^{(\alpha)},\nabla^{(-\alpha)})}, i.e., the Fisher metric and (dual) {\alpha}-connections. Let us now explore this duality in more detail, which will lead us to a deeper appreciation for the divergence {D}.

Dually flat spaces

Suppose that, in a dual structure {(g,\nabla,\nabla^*)}, both connections are symmetric, as is the case for {\alpha}-connections. This implies that {\alpha}-flatness and {(-\alpha)}-flatness of {S} are equivalent. In particular, we’ve seen that e-families are 1-flat, and m-families are {(-1)}-flat, but the above implies that in fact they are both {(\pm1)}-flat. In general, if both duals {\nabla} and {\nabla^*} are flat, we call {(S,g,\nabla,\nabla^*)} a dually flat space. Such spaces have a number of properties that are closely related to various concepts in statistics.

By definition, there exist {\nabla}-affine and {\nabla^*}-affine coordinate systems {[\theta^i]} and {[\eta_j]}, respectively, in which vector fields are denoted {\partial_i\equiv\frac{\partial}{\partial\theta^i}} and {\partial_j\equiv\frac{\partial}{\partial\eta_j}}. Since both {\partial_i} and {\partial^j} are flat, the condition on parallel transport (14) implies that {\langle\partial_i,\partial^j\rangle} is in fact constant on {S}. Furthermore, given {[\theta^i]}, affine transformations allow us to choose the dual coordinate system {[\eta^j]} such that

\displaystyle \langle\partial_i,\partial^j\rangle=\delta_i^{~j}~. \ \ \ \ \ (23)

Coordinate systems which satisfy this requirement are called mutually dual. Such coordinate systems are special to dually flat spaces, and do not exist for general Riemannian manifolds; conversely, if one finds such coordinates for a Riemannian manifold {(S,g)}, then the connections {\nabla} and {\nabla^*} for which they are affine are uniquely determined, and {(S,g,\nabla,\nabla^*)} is a dually flat space.

A quick remark on notation: we shall denote the components of the metric {g} with respect to {[\theta^i]} and {[\eta_j]} by

\displaystyle g_{ij}\equiv\langle\partial_i,\partial_j\rangle\;\;\mathrm{and}\;\; g^{ij}\equiv\langle\partial^i,\partial^j\rangle~. \ \ \ \ \ (24)

Coordinate transformations between them are given by the usual expressions,

\displaystyle \partial^j=(\partial^j\theta^i)\partial_i\;\;\mathrm{and}\;\; \partial_i=(\partial_i\eta_j)\partial^j~. \ \ \ \ \ (25)

In conjunction with (23), we therefore have

\displaystyle \frac{\partial\eta_j}{\partial\theta^i}=g_{ij}\;\;\;\mathrm{and}\;\;\; \frac{\partial\theta^i}{\partial\eta_j}=g^{ij}~,\;\;\;\mathrm{with}\;\;\; g_{ij}g^{jk}=\delta_i^{~k}~. \ \ \ \ \ (26)

We may now introduce the potentials, which will prove useful below. At a mathematical level, these allow us to define the Legendre transformation that explicitly relates the dual coordinate systems {[\theta^i]} and {[\eta_j]}. Consider a function {\psi:S\rightarrow\mathbb{R}} that satisfies the following partial differential equation:

\displaystyle \partial_i\psi=\eta_i\qquad\iff\qquad\mathrm{d}\psi=\eta_i\mathrm{d}\theta^i~. \ \ \ \ \ (27)

From the coordinate expressions above, {\partial_i\partial_j\psi=g_{ij}}, and hence the second derivatives of {\psi} comprise a positive definite matrix. Hence {\psi} is a strictly convex function of {[\theta^i]}. Similarly, define {\varphi} such that

\displaystyle \partial^i\varphi=\theta^i\qquad\iff\qquad\mathrm{d}\varphi=\theta^i\mathrm{d}\eta_i~, \ \ \ \ \ (28)

which is a strictly convex function of {[\eta_j]}, since {\partial^i\partial^j\varphi=g^{ij}}. These two functions are related via

\displaystyle \varphi=\theta^i\eta_i-\psi~. \ \ \ \ \ (29)

To see this, simply take the differential {\mathrm{d}\varphi=\mathrm{d}\theta^i\eta_i+\theta^i\mathrm{d}\eta_i-\mathrm{d}\psi}, whereupon substituting in (27) one recovers (28). Now, the form of (29) — equivalently, the pair (27) and (28) — suggests that {[\theta^i]}, {[\eta_j]} form a conjugate pair of coordinates for the functions {\psi}, {\varphi} in the context of Legendre transforms. However, (29) does not quite suffice to define the Legendre transform, since we must remove the {\theta} dependence on the right-hand side in order for it to be uniquely invertible. Note that (strict) convexity is important here, since this implies a 1:1 map between a function and its first derivative, which in turn enables one — via the Legendre transform — to express all the information about a function in terms of its derivatives instead. By analogy with Fourier or Laplace transforms, this may lead to deeper mathematical/physical insight, or simple convenience, depending on the application. Hence, since we’re dealing with strictly convex functions, we define the Legendre transforms

\displaystyle \varphi(q)=\mathrm{sup}_{p\in S}\left[\theta^i(p)\eta_i(q)-\psi(p)\right]\quad\mathrm{and}\quad \psi(p)=\mathrm{sup}_{q\in S}\left[\theta^i(p)\eta_i(q)-\varphi(q)\right]~. \ \ \ \ \ (30)

Thus we see that the dual/conjugate coordinate systems {[\theta^i]} and {[\eta_j]} are related by the Legendre transform given in terms of the potentials {\psi} and {\varphi}, which provides another explanation for the name.

Incidentally, in addition to the compact expressions for the metric above, namely

\displaystyle \partial_i\partial_j\psi=g_{ij}\quad\mathrm{and}\quad \partial^i\partial^j\varphi=g^{ij}~, \ \ \ \ \ (31)

the potentials also enable one to neatly express the connection coefficients as

\displaystyle \Gamma_{ij,k}^*\equiv\langle\nabla_{\partial_i}^*\partial_j,\partial_k\rangle=\partial_i\partial_j\partial_k\psi\quad\mathrm{and}\quad \Gamma^{ij,k}\equiv\langle\nabla_{\partial^i}\partial^j,\partial^k\rangle=\partial^i\partial^j\partial^k\varphi~. \ \ \ \ \ (32)

Now, we mentioned earlier that any divergence induces a (torsion-free) dual structure, and vice-versa. But this map is not bijective; rather, a given dual structure admits infinitely many divergences. In the case of dually flat spaces however, the potentials defined above allow us to introduce a canonical divergence which is in some sense unique, namely

\displaystyle D(p||q)\equiv\psi(p)+\varphi(q)-\theta^i(p)\eta_i(q)~, \ \ \ \ \ (33)

which of course satisfies {D(p||q)\geq0}, with equality iff {p=q}. To see that {D} is a divergence which induces the metric {g}, observe that

\displaystyle D\left((\partial_i\partial_j)p||q\right)=g_{ij}(p)\quad\mathrm{and}\quad D\left(p||(\partial^i\partial^j)q\right)=g^{ij}(q)~, \ \ \ \ \ (34)

which further implies {\nabla=\nabla^{(D)}} and {\nabla^*=\nabla^{(D^*)}} with {\Gamma_{ij,k}={\Gamma^*}^{ij,k}=0} due to the {\nabla}-, resp. {\nabla^*}-affinity of {[\theta^i]}, {[\eta_j]}, where the dual divergence is {D^*(p||q)=D(q||p)}.

Generic properties of such canonical divergences are examined in section 3.4. Instead however, I’m going to close this post by jumping ahead to 3.5, which connects this discussion back to the e- and m-families and associated {(\pm1)}-connections we spent so much time on above.

Dual structure of exponential families

Recall that an exponential family consists of distributions of the form

\displaystyle p(x;\theta)=\exp\left[C(x)+\theta^iF_i(x)-\psi(\theta)\right]~, \ \ \ \ \ (35)

where the canonical parameters {[\theta^i]} constitute a 1-affine coordinate system. The notation {\psi} here was not chosen arbitrarily, but is in fact precisely the potential introduced above. This is a rather elegant fact that ties together some fundamental notions from statistics with the existence of the dual coordinates {[\eta_i]}. Suppose we didn’t know that {\eta} is the dual coordinate to {\theta}, and simply defined

\displaystyle \eta_i=\eta_i(\theta)\equiv E_\theta[F_i]=\int\!\mathrm{d} x\,F_i(x)p(x;\theta)~. \ \ \ \ \ (36)

Then it follows that

\displaystyle 0=E_\theta[\partial_i\ell_\theta]=E_\theta[F_i(x)-\partial_i\psi(\theta)]\;\;\implies\;\; \eta_i=\partial_i\psi~. \ \ \ \ \ (37)

Additionally, the expression of the metric in the form {g_{ij}(\theta)=-E_\theta[\partial_i\partial_j\ell_\theta]} implies that for an exponential family, {\partial_i\partial_j\psi=g_{ij}}. Thus (0.36) suffices to identify {\eta} as the conjugate parameter to {\theta} with respect to the Legendre potential {\psi}. From the discussion above, this implies that {[\eta_i]} is a {(-1)}-affine coordinate system dual to {[\theta^i]}. Given the form of (0.36), {[\eta_i]} are sometimes also referred to as the expectation parameters.

It is a trivial exercise to work out {\eta_1} and {\eta_2} for the example of the normal distribution in the previous post. A more interesting example is the case when {\mathcal{X}} is a finite set, whence {\mathcal{P}(\mathcal{X})} is identified with the parameters:

\displaystyle \begin{aligned} \mathcal{X}=&\{x_0,\ldots,x_n\}~,\qquad \Xi=\left\{[\xi^i]\,\Big|\,\xi^i>0\;\;\forall i~,\;\;\sum_{i=1}^n\xi^i<1\right\}\\ &p(x_i;\xi)=\begin{cases} \xi^i & 1\leq i\leq n \\ 1-\sum_{i=1}^n\xi^i\quad & i=0 \end{cases}~. \end{aligned} \ \ \ \ \ (38)

(This is introduced as example 2.4 on page 27, and resurfaces as example 2.8 on page 35 and example 3.4 on page 66). Expressing this as an exponential family of the form (0.35) implies the parameter identifications

\displaystyle \begin{aligned} C(x)=&0~,\quad F_i(x)=\delta(x-x_i)~,\\ \theta^i=\ln\frac{p(x_i)}{p(x_0)}&=\ln\frac{\xi^i}{1-\sum\nolimits_{j=1}^n\xi^j}\qquad i=1,\ldots,n~,\\ \psi(\theta)=-\ln p(\theta)=&-\ln\left(1-\sum_{i=1}^n\xi^i\right)=\ln\left(1+\sum_{i=1}^n\exp\theta^i\right)~. \end{aligned} \ \ \ \ \ (39)

Then the dual coordinates {[\eta_i]} are simply given by the parameters themselves:

\displaystyle \eta_i=p(x_i)=\xi^i=\frac{\exp\theta^i}{1+\sum_{j=1}^n\exp\theta^j}~. \ \ \ \ \ (40)

Therefore, the dual potential (29) is

\displaystyle \varphi(\theta)=E_\theta[\ln p_\theta-C]=-H(p_\theta)-E_\theta[C]~, \ \ \ \ \ (41)

where {H(p)\equiv-\int\!\mathrm{d} x\,p(x)\ln p(x)} is the entropy!

Finally, I promised you some statistical notions, which requires the introduction of one more concept, namely that of an estimator. Suppose that some data {x} is generated according to some unknown probability distribution {p_\xi}. We wish to consider the problem of estimating the unknown parameter {\xi} by some function {\hat\xi(x)} of the data {x}. The mapping {[\hat\xi^i]:\mathcal{X}\rightarrow\mathbb{R}^n} is called an estimator. Furthermore, {\hat\xi} is an unbiased estimator if

\displaystyle E_\xi[\hat\xi(X)]=\xi\quad\forall\xi\in\Xi~, \ \ \ \ \ (42)

which is the statement that the expectation value of {\hat\xi} does not depend on the data itself. The mean-squared error of such an estimator may be expressed via the variance-covariance matrix {V_\xi[\hat\xi]=[v_\xi^{ij}]}, whose elements are defined via

\displaystyle v_\xi^{ij}\equiv E_\xi\left[(\hat\xi^i(X)-\xi^i)(\hat\xi^j(X)-\xi^j)\right]~. \ \ \ \ \ (43)

Lastly, the Cramér-Rao inequality (Theorem 2.2) states that the variance-covariance matrix of an unbiased estimator satisfies

\displaystyle V_\xi[\hat\xi]\geq G(\xi)^{-1}~, \ \ \ \ \ (44)

i.e., the difference {V_\xi[\hat\xi]-G(\xi)^{-1}} is a positive semidefinite matrix. An unbiased estimator {\hat\xi} which saturates this inequality for all {\xi} is an efficient estimator, which means it has the minimum variance among all unbiased estimators (note however that the converse is not always true).

Efficient estimators do not exist for arbitrary coordinates {\xi}. Rather, a necessary and sufficient condition for a model {S=\{p_\xi\}} to have an efficient estimator is that {S} be an exponential family, for which {\xi} is m-affine (Theorem 3.12).

To illustrate this, let us now regard {F_i} as an estimator for the parameter {\eta_i}, which we shall henceforth denote {\hat\eta_i(x)=F_i(x)} to make contact with the notation just introduced. Then the condition (0.36) implies that {\hat\eta} is in fact an unbiased estimator for {\eta}. Furthermore, since the Fisher information matrix can be expressed in terms of {F} and {\eta} as

\displaystyle g_{ij}(\theta)=E_\theta[(F_i-\eta_i)(F_j-\eta_j)]~, \ \ \ \ \ (45)

it follows that {V_\eta[\hat\eta]=G=[g_{ij}]}, and hence {\hat\eta} is an efficient estimator for the (m-affine) coordinate system {\eta}.

The above is of course only a superficial hint of the deeper connections between information geometry and statistics (which was, after all, the prime impetus for the former’s development), but already suggests interesting mathematical utility for such problems as maximum likelihood estimation (MLE), and machine learning in general. Alas, that’s a topic for another post.

As a final comment: everything I’ve said so far is for classical probability distributions, but much of this machinery can be extended to quantum mechanics as well (insofar as the latter can be viewed as an extension of probability theory). Quantum information geometry is briefly introduced in chapter 7, and I hope to return to it in part 3 of this series soon.

Posted in Minds & Machines, Physics | 2 Comments

Information geometry (part 1/3)

Information geometry is a rather interesting fusion of statistics and differential geometry, in which a statistical model is endowed with the structure of a Riemannian manifold. Each point on the manifold corresponds to a probability distribution function, and the metric is governed by the underlying properties thereof. It may have many interesting applications to (quantum) information theory, complexity, machine learning, and theoretical neuroscience, among other fields. The canonical reference is Methods of Information Geometry by Shun-ichi Amari and Hiroshi Nagaoka, originally published in Japanese in 1993, and translated into English in 2000. The English version also contains several new topics/sections, and thus should probably be considered as a “second edition” (incidentally, any page numbers given below refer to this version). This is a beautiful book, which develops the concepts in a concise yet pedagogical manner. The posts in this three-part series are merely the notes I made while studying it, and aren’t intended to provide more than a brief summary of what I deemed the salient aspects.

Chapter 1 provides a self-contained introduction to differential geometry which — as the authors amusingly allude — is far more accessible (dare I say practical) than the typical mathematician’s treatise on the subject. Since this is familiar material, I’ll skip ahead to chapter 2, and simply recall necessary formulas/notation as needed.

Chapter 2 introduces the basic geometric structure of statistical models. First, some notation: a probability distribution is a function {p:\mathcal{X}\rightarrow\mathbb{R}} which satisfies

\displaystyle p(x)\geq0\quad\forall x\in\mathcal{X}\quad\mathrm{and}\quad\int\!\mathrm{d} x\,p(x)=1~, \ \ \ \ \ (1)

where {\mathcal{X}=\mathbb{R}^n}. If {\mathcal{X}} is a discrete set (which may have either finite or countably infinite cardinality), then the integral is instead a sum. Each {p} may be parametrized by {n} real-valued variables {\xi=[\xi^i]=[\xi^1,\ldots,\xi^n]}, such that the family {S} of probability distributions on {\mathcal{X}} is

\displaystyle S=\{p_\xi=p(x;\xi)\,|\,\xi=[\xi^i]\in\Xi\}~, \ \ \ \ \ (2)

where {\Xi\subset\mathbb{R}^n} and {\xi\mapsto p_\xi} is injective (so that the reverse mapping, from probability distributions {p} to the coordinates {\xi} is a function). Such an {S} is referred to as an {n}-dimensional statistical model, or simply model on {\mathcal{X}}, sometimes abbreviated {S=\{p_\xi\}}. Note that I referred to {\xi} as the coordinates above: we will assume that the map {\Xi\rightarrow\mathbb{R}} provided by {\xi} is {C^\infty}, so that we may take derivatives with respect to the parameters, e.g., {\partial_ip(x;\xi)}, where {\partial_i\equiv\frac{\partial}{\partial\xi^i}}. Additionally, we assume that the order of differentiation and integration may be freely exchanged; an important consequence of this is that

\displaystyle \int\!\mathrm{d} x\,\partial_i p(x;\xi)=\partial_i\!\int\!\mathrm{d} x\,p(x;\xi)=\partial_i1=0~, \ \ \ \ \ (3)

which we will use below. Finally, we assume that the support of the distribution {\mathrm{supp}(p)\equiv\{x\,|\,p(x)>0\}} is independent of (that is, does not vary with) {\xi}, and hence may redefine {\mathcal{X}=\mathrm{supp}(p)} for simplicity. Thus, the model {S} is a subset of

\displaystyle \mathcal{P}(S)\equiv\left\{p:\mathcal{X}\rightarrow\mathbb{R}\,\bigg|\,p(x)>0\;\;\forall x\in\mathcal{X},\;\;\int\!\mathrm{d} x\,p(x)=1\right\}~. \ \ \ \ \ (4)

Considering parametrizations which are {C^\infty} diffeomorphic to one another to be equivalent then elevates {S} to a statistical manifold. (In this context, we shall sometimes conflate the distribution {p_\xi} with the coordinate {\xi}, and speak of the “point {\xi}”, etc).

As a concrete example, consider the normal distribution:

\displaystyle p(x;\xi)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{\left(x-\mu\right)^2}{2\sigma^2}\right]~. \ \ \ \ \ (5)

In this case,

\displaystyle \mathcal{X}=\mathbb{R}~,\;\;n=2,\;\;\xi=[\mu,\sigma],\;\;\Xi=\{[\mu,\sigma]\,|\,\mu\in\mathbb{R},\;\sigma\in\mathbb{R}^+\}~. \ \ \ \ \ (6)

Other examples are given on page 27.

Fisher information metric

Now, given a model {S}, the Fisher information matrix of {S} at a point {\xi} is the {n\times n} matrix {G(\xi)=[g_{ij}(\xi)]}, with elements

\displaystyle g_{ij}(\xi)\equiv E_\xi[\partial_i\ell_\xi\partial_j\ell_\xi] =\int\!\mathrm{d} x\,\partial_i\ell(x;\xi)\partial_j\ell(x;\xi)p(x;\xi)~, \ \ \ \ \ (7)


\displaystyle \ell_\xi(x)=\ell(x;\xi)\equiv\ln p(x;\xi)~, \ \ \ \ \ (8)

and the expectation {E_\xi} with respect to the distribution {p_\xi} is defined as

\displaystyle E_\xi[f]\equiv\int\!\mathrm{d} x\,f(x)p(x;\xi)~. \ \ \ \ \ (9)

Though the integral above diverges for some models, we shall assume that {g_{ij}(\xi)} is finite {\forall\xi,i,j}, and furthermore that {g_{ij}:\Xi\rightarrow\mathbb{R}} is {C^\infty}. Note that {g_{ij}=g_{ji}} (i.e., {G} is symmetric). Additionally, while {G} is in general positive semidefinite, we shall assume positive definiteness, which is equivalent to requiring that {\{\partial_1 p_\xi,\ldots,\partial_n p_\xi\}} are linearly independent functions on {\mathcal{X}}. Lastly, observe that eq. (3) may be written

\displaystyle E_\xi[\partial_i\ell_\xi]=0~. \ \ \ \ \ (10)

Via integration by parts, this allows us to express the matrix elements in a useful alternative form:

\displaystyle g_{ij}(\xi)=-E_\xi[\partial_i\partial_j\ell_\xi]~. \ \ \ \ \ (11)

The reason for the particular definition of the Fisher matrix above is that it provides a Riemannian metric on {S}. Recall that a Riemannian metric {g:p\mapsto\langle\cdot,\cdot\rangle_p} is a {(0,2)}-tensor (that is, it maps points in {S} to their inner product on the tangent space {T_pS}) which is linear, symmetric, and positive definite (meaning {\langle X,X\rangle\geq0}, with equality iff {X=0}). In fact, in the natural basis of local coordinates {[\xi^i]}, the Riemannian metric is uniquely determined by

\displaystyle g_{ij}=\langle\partial_i,\partial_j\rangle~, \ \ \ \ \ (12)

whence {g} is called the Fisher metric. Since it’s positive definite, we may define the inverse metric {g^{-1}} corresponding to {G^{-1}(\xi)} such that {g^{ij}g_{jk}=\delta^i_{~k}}. Note that {g_{ij}} is invariant under coordinate transformations, since

\displaystyle \xi\rightarrow\tilde\xi\;\implies\; g_{ij}\rightarrow\tilde g_{ij}=\frac{\partial\tilde\xi^k}{\partial\xi^i}\frac{\partial\tilde\xi^l}{\partial\xi^j}g_{kl}~, \ \ \ \ \ (13)

and hence we may write

\displaystyle \langle X,Y\rangle_\xi=E_\xi[(X\ell)(Y\ell)]\quad\forall X,Y\in T_\xi S~. \ \ \ \ \ (14)

Recalling that {||V||^2=\langle V,V\rangle_p=g_{ij}(p)V^iV^j}, the length of a curve {\gamma:[a,b]\rightarrow S} with respect to g is then

\displaystyle ||\gamma||=\int_a^b\!\mathrm{d} t\left|\frac{\mathrm{d}\gamma}{\mathrm{d} t}\right| =\int_a^b\!\mathrm{d} t\sqrt{g_{ij}\dot\gamma^i\dot\gamma^j}~. \ \ \ \ \ (15)

But before we can properly speak of curves and distances, we must define a connection, which provides a means of parallel transporting vectors along the curve.


In particular, we will be concerned with comparing elements of the tangent bundle, and hence we require a relation between the tangent spaces at different points in {S}, i.e., an affine connection. To that end, let {S=\{p_\xi\}} be an {n}-dimensional model, and define the {n^3} functions {\Gamma_{ij,k}^{(\alpha)}} which map each point {\xi} to

\displaystyle \left(\Gamma_{ij,k}^{(\alpha)}\right)_\xi\equiv E_\xi\left[\left(\partial_i\partial_j\ell_\xi+\frac{1-\alpha}{2}\partial_i\ell_\xi\partial_j\ell_\xi\right)\left(\partial_k\ell_\xi\right)\right]~, \ \ \ \ \ (16)

where {\alpha\in\mathbb{R}}. This defines an affine connection {\nabla^{(\alpha)}} on {S} via

\displaystyle \langle\nabla_{\partial_i}^{(\alpha)}\partial_j,\partial_k\rangle=\Gamma_{ij,k}^{(\alpha)}~, \ \ \ \ \ (17)

where {g=\langle\cdot,\cdot\rangle} is the Fisher metric introduced above. {\nabla^{(\alpha)}} is called the {\alpha}-connection, and accordingly terms like {\alpha}-flat, {\alpha}-affine, {\alpha}-parallel, etc. denote the corresponding notions with respect to this connection.

Let us pause to review some of the terminology from differential geometry above. Recall that the covariant derivative {\nabla} may be expressed in local coordinates as

\displaystyle \nabla_XY=X^i\left(\partial_iY^k+Y^j\Gamma_{ij}^{~~k}\right)\partial_k~, \ \ \ \ \ (18)

where {X=X^i\partial_i}, {Y=Y^i\partial_i} are vectors in the tangent space. If these are basis vectors such that {X^i=Y^i=1}, then {\nabla_{\partial_i}\partial_j=\Gamma_{ij}^{~~k}\partial_k}. The vector {Y} is said to be parallel with respect to the connection {\nabla} if {\nabla Y=0}, i.e., {\nabla_XY=0\;\;\forall X\in TS}; equivalently, in local coordinates,

\displaystyle \partial_iY^k+Y^j\Gamma_{ij}^{~~k}=0~. \ \ \ \ \ (19)

If all basis vectors are parallel with respect to a coordinate system {[\xi^i]}, then the latter is an affine coordinate system for {\nabla}. Connections which admit such an affine parametrization are called flat (equivalently, one says that {S} is flat with respect to {\nabla}).

Now, with respect to a Riemannian metric {g}, one defines {\Gamma_{ij,k}} as above, namely {\Gamma_{ij,k}=\langle\nabla_{\partial_i}\partial_j,\partial_k\rangle=\Gamma_{ij}^{~~l}g_{lk}}. Note that this defines a symmetric connection, i.e., {\Gamma_{ij,k}=\Gamma_{ji,k}}. If, in addition, {\nabla} satisfies

\displaystyle Z\langle X,Y\rangle=\langle\nabla_ZX,Y\rangle+\langle X,\nabla_ZY\rangle\;\; \forall X,Y,Z\in TS~, \ \ \ \ \ (20)

then {\nabla} is a metric connection with respect to {g}. (This is basically a statement about linearity, since affine transformations are linear). This implies that

\displaystyle \partial_kg_{ij}=\Gamma_{ki,j}+\Gamma_{kj,i}~. \ \ \ \ \ (21)

In other words, under a metric connection, parallel transport of two vectors preserves the inner product, hence their significance in Riemannian geometry. Any connection which is both metric and symmetric is Riemannian, of which there are generically an infinite number. However, the natural metrics on statistical manifolds are generically non-metric! Indeed, since

\displaystyle \begin{aligned} \partial_kg_{ij}&=\partial_kE_\xi[\partial_i\ell_\xi\partial_j\ell_\xi]\\ &=E_\xi[\left(\partial_k\partial_i\ell_\xi\right)\left(\partial_j\ell_\xi\right)]+E_\xi[\left(\partial_i\ell_\xi\right)\left(\partial_k\partial_j\ell_\xi\right)]+E_\xi[\left(\partial_i\ell_\xi\right)\left(\partial_j\ell_\xi\right)\left(\partial_k\ell_\xi\right)]\\ &=\Gamma_{ki,j}^{(0)}+\Gamma_{kj,i}^{(0)}~, \end{aligned} \ \ \ \ \ (22)

only the special case {\alpha=0} defines a Riemannian connection {\nabla^{(0)}} with respect to the Fisher metric (though observe that {\nabla^{(\alpha)}} is symmetric for any value of {\alpha}). While this may seem strange from a physics perspective, where preserving the inner product is of prime importance, there’s nothing mathematically pathological about it. Indeed, the more relevant condition — which we’ll see below — is that every point on the manifold have an interpretation as a probability distribution.

Two neat relationships between different {\alpha}-connections are worth noting. First, for any {\beta\neq\alpha}, we have

\displaystyle \Gamma_{ij,k}^{(\beta)}=\Gamma_{ij,k}^{(\alpha)}+\frac{\alpha-\beta}{2}T_{ijk}~, \ \ \ \ \ (23)

where {T_{ijk}} (not to be confused with the unrelated torsion tensor) is a covariant symmetric tensor which maps a point {\xi} to

\displaystyle \left(T_{ijk}\right)_\xi\equiv E_\xi[\partial_i\ell_\xi\partial_j\ell_\xi\partial_k\ell_\xi]~. \ \ \ \ \ (24)

Second, the {\alpha}-connection may be decomposed as

\displaystyle \nabla^{(\alpha)}=(1-\alpha)\nabla^{(0)}+\alpha\nabla^{(1)} =\frac{1+\alpha}{2}\nabla^{(1)}+\frac{1-\alpha}{2}\nabla^{(-1)}~. \ \ \ \ \ (25)

Within this infinite class of connections, {\nabla^{(\pm1)}} play a central role in information geometry, and are closely related to an interesting duality structure on the geometry of {\mathcal{P}\left(\mathcal{X}\right)}. We shall give a low-level introduction to the relevant representations of {S} here, and postpone a more elegant derivation based on different embeddings of {\mathcal{P}} in {\mathbb{R}^{\mathcal{X}}} in the next post. In particular, we’ll define the so-called exponential and mixed families, which are intimately related to the {1}– and {(-1)}-connections, respectively.

Exponential families

Suppose that an {n}-dimensional model {S=\{p_\theta\,|\,\theta\in\Theta\}} can be expressed in terms of {n\!+\!1} functions {\{C,F_1,\ldots,F_n\}} on {\mathcal{X}} and a function {\psi} on {\Theta} as

\displaystyle p(x;\theta)=\exp\left[C(x)+\theta^iF_i(x)-\psi(\theta)\right]~, \ \ \ \ \ (26)

where we’ve employed Einstein’s summation notation for the sum over {i} from {1} to {n}. Then {S} is an exponential family, and {[\theta^i]} are its natural or canonical parameters. The normalization condition {\int\mathrm{d} xp(x;\theta)=1} implies that

\displaystyle \psi(\theta)=\log\int\!\mathrm{d} x\,\exp\left[C(x)+\theta^iF_i(x)\right]~. \ \ \ \ \ (27)

This provides a parametrization {\theta\mapsto p_\theta}, which is {1:1} if and only if the functions {\{C,F_1,\ldots,F_n\}} are linearly independent (which we shall assume henceforth). Many important probabilistic models fall into this class, including all those referenced on page 27 above. The normal distribution (5), for instance, yields

\displaystyle \begin{aligned} C(x)=0~,\;\;F_1(x)=x~,\;\;F_2(x)&=x^2~,\;\;\theta^1=\frac{\mu}{\sigma^2}~,\;\;\theta^2=-\frac{1}{2\sigma^2}~,\\ \psi(\theta)=\frac{\mu^2}{2\sigma^2}+&\ln\left(\sqrt{2\pi}\sigma\right)~. \end{aligned} \ \ \ \ \ (28)

The canonical coordinates {[\theta^i]} are natural insofar as they provide a {1}-affine coordinate system, with respect to which {S} is {1}-flat. To see this, observe that

\displaystyle \partial_i\ell(x;\theta)=F_i(x)-\partial_i\psi(\theta)\;\;\implies\;\; \partial_i\partial_j\ell(x;\theta)=-\partial_i\partial_j\psi(\theta)~, \ \ \ \ \ (29)

where keep in mind that {\partial_i} denotes the derivative with respect to {\theta^i}, not {x}! This implies that

\displaystyle \left(\Gamma_{ij,k}^{(1)}\right)_\theta=E_\theta[\left(\partial_i\partial_j\ell_\theta\right)\left(\partial_k\ell_\theta\right)] =-\partial_i\partial_j\psi(\theta)E_\theta[\partial_k\ell_\theta]=0~. \ \ \ \ \ (30)

Therefore, exponential families admit a canonical parametrization in terms of a {1}-affine coordinate system {[\theta^i]}, with respect to which {S} is {1}-flat. The associated affine connection is called the exponential connection, and is denoted {\nabla^{(e)}\equiv\nabla^{(1)}}.

Mixed families

Now consider the case in which {S} can be expressed as

\displaystyle p(x;\theta)=C(x)+\theta^iF_i(x)~, \ \ \ \ \ (31)

i.e., {S} is an affine subspace of {\mathcal{P}(\mathcal{X})}. In this case {S} is called a {\emph{mixture family}}, with mixture parameters {[\theta^i]}. Note that {\mathcal{P}(\mathcal{X})} itself is a mixture family if {\mathcal{X}} is infinite. The name arises from the fact that elements in this family admit a representative form as a mixture of {n\!+\!1} probability distributions {\{p_0,p_1,\ldots,p_n\}},

\displaystyle p(x;\theta)=\theta^ip_i(x)+\left(1-\sum_{i=1}^n\theta^i\right)p_0(x) =p_0(x)+\sum_{i=1}^n\theta^i\left[p_i(x)-p_0(x)\right]~, \ \ \ \ \ (32)

(i.e., {C(x)=p_0(x)} and {F_i(x)=p_i(x)-p_0(x)}), where {\theta^i>0} and {\sum_i\theta^i<1}. For a mixture family, we have

\displaystyle \partial_i\ell(x;\theta)=\frac{F_i(x)}{p(x;\theta)}\;\;\implies\;\; \partial_i\partial_j\ell(x;\theta)=-\frac{F_i(x)F_j(x)}{p(x;\theta)^2}~, \ \ \ \ \ (33)

which implies that

\displaystyle \partial_i\partial_j\ell+\partial_i\ell\partial_j\ell=0\;\;\implies\;\; \Gamma_{ij,k}^{(-1)}=0~. \ \ \ \ \ (34)

Therefore, a mixture family admits a parametrization in terms of a {(-1)}-affine coordinate system {[\theta^i]}, with respect to which {S} is {(-1)}-flat. The associated affine connection is called the mixture connection, denoted {\nabla^{(m)}\equiv\nabla^{(-1)}}.

In the next post, when we discuss the geometrical structure in more detail, we shall see that {\nabla^{(\pm1)}} are dual connections, which has many interesting consequences.

Why is Fisher special?

As noted above, a given manifold admits infinitely many distinct Riemannian metrics and affine connections. However, a statistical manifold {S} has the property that every point is a probability distribution, which singles out the Fisher metric and {\alpha}-connection as unique. To formalize this notion, we must first introduce the concept of a sufficient statistic.

Let {F:\mathcal{X}\rightarrow\mathcal{Y}} be a map which takes random variables {X} to {Y=F(X)}. Given the distribution {p(x;\xi)} of {X}, this results in the distribution {q(y;\xi)} on {Y}. We then define

\displaystyle r(x;\xi)=\frac{p(x:\xi)}{q\left(F(x),\xi\right)}~,\quad p(x|y;\xi)=r(x;\xi)\delta_{F(x)}(y)~,\quad \mathrm{Pr}(A|y;\xi)=\int_A\!\mathrm{d} x\,p(x|y;\xi)~, \ \ \ \ \ (35)

where {A\subset\mathcal{X}}, and {\delta_{F(x)}(y)} is the delta function at the point {F(x)}, such that {\forall B\subset\mathcal{Y}},

\displaystyle \int_{A\cap F^{-1}(B)}\!\mathrm{d} x\,p(x;\xi) =\int_A\int_B\!\mathrm{d} x\mathrm{d} y\,r(x;\xi)q(y;\xi)\delta_{F(x)}(y) =\int_B\!\mathrm{d} y\,\mathrm{Pr}(A|y;\xi)q(y;\xi)~. \ \ \ \ \ (36)

In other words, the delta function picks out the value of {x} such that {F(x)=y}. The above implies that {\mathrm{Pr}(A|y;\xi)} is the conditional probability of the event {\{A\in X\}}, given {Y=y} (cf. the familiar definition {P(X|Y)=P(X\cup Y)/P(Y)}). If {F} is independent of {\xi}, then {F} is called a sufficient statistic for {S}. In this case, we may write

\displaystyle p(x;\xi)=q\left(F(x);\xi\right)r(x)~, \ \ \ \ \ (37)

i.e., the dependence of {p} on {\xi} is entirely encoded in the distribution {q}. Therefore, treating {p} as the unknown distribution, whose parameter {\xi} one wishes to estimate, it suffices to know only the value {Y=F(x)}, hence the name. Formally, one says that {F} is a sufficient statistic if and only if there exists functions {s:\mathcal{Y}\times\Xi\rightarrow\mathbb{R}} and {t:\mathcal{X}\rightarrow\mathbb{R}} such that

\displaystyle p(x;\xi)=s\left(F(x);\xi\right)t(x)\qquad\forall x,\xi~. \ \ \ \ \ (38)

The significance of this lies in the fact that the Fisher information metric satisfies a monotonicity relation under a generic map {F}. This is detailed in Theorem 2.1 of Amari & Nagaoka, which states that given {S=\{p(x;\xi)\}} with Fisher metric {G(\xi)}, and induced model {S_F\equiv\{q(y;\xi)\}} with matrix {G_F(\xi)}, the difference {\Delta G_\xi\equiv G(\xi)-G_F(\xi)} is positive semidefinite, i.e., {G_F(\xi)\leq G(\xi)}, with equality if and only if {F} is a sufficient statistic. Otherwise, for generic maps, the “information loss” {\Delta G(\xi)=[\Delta g_{ij}(\xi)]} that results from summarizing the data {x} in {y=F(x)} is given by

\displaystyle \Delta g_{ij}(\xi)=E_\xi[\partial_i\ln r(X;\xi)\partial_j\ln r(X;\xi)]~, \ \ \ \ \ (39)

which can be expressed in terms of the covariance with respect to the conditional distribution {p(x|y;\xi)}. This theorem will be important later, when we discuss relative entropy.

Now, if {F} is a sufficient statistic, then (37) implies that {\partial_i\ln p(x;\xi)=\partial_i\ln q\left(F(x);\xi\right)}. But this implies that {g_{ij}}, and by extension {\Gamma_{ij,k}^{(\alpha)}}, are the same for both {S} and {S_F}. Therefore the Fisher metric and {\alpha}-connection are invariant with respect to the sufficient statistic {F}. In the language above, this implies that there is no information loss associated with describing the original distribution {p} by {q}, i.e., that information is preserved under {F}. Formally, this invariance is codified by the following two equations:

\displaystyle \begin{aligned} \langle X,Y\rangle_p&=\langle\lambda_*(X),\lambda_*(Y)\rangle_{\lambda(p)}^{'}~,\\ \lambda_*\left(\nabla_X^{(\alpha)}\right)&={\nabla'}_{\lambda_*(X)}^{(\alpha)}\lambda_*(Y)~, \end{aligned} \ \ \ \ \ (40)

{\forall\;X,Y,Y\in TS}, where the prime denotes the object on {S_F}, {\lambda} is the diffeomorphism from {S} onto {S_F} given by {\lambda(p_\xi)=q_\xi}, and the pushforward {\lambda_*:TS\rightarrow TS_F} is defined by {\left(\lambda_*(X)\right)_{\lambda(p)}=(\mathrm{d}\lambda)_p(X_p)}.

The salient feature of the Fisher metric and {\alpha}-connection is that they are are uniquely characterized by this invariance! This is the thrust of Chentsov’s theorem (Theorem 2.6 in Amari & Nagaoka). Strictly speaking, the proof of this theorem relies on finiteness of {\mathcal{X}}, but — depending on the level of rigour one demands — it is possible to extend this to infinite models via a limiting procedure in which one considers increasingly fine-grained subsets of {\mathcal{X}}. A similar subtlety will arise in our more geometrical treatment of dual structures in the next post. I’m honestly unsure how serious this issue is, but it’s worth bearing in mind that the mathematical basis is less solid for infinite {\mathcal{X}}, and may require a more rigorous functional analytic approach.

Posted in Minds & Machines, Physics | Leave a comment