Lately, I’ve been spending a lot of time exploring the surprisingly rich mathematics at the intersection of physics, information theory, and machine learning. Among other things, this has led me to a new appreciation of cumulants. At face value, these are just an alternative to the moments that characterize a given probability distribution function, and aren’t particularly exciting. Except they show up all over statistical thermodynamics, quantum field theory, and the structure of deep neural networks, so of course I couldn’t resist trying to better understand the information-theoretic connections to which this seems to allude. In the first part of this two-post sequence, I’ll introduce them in the context of theoretical physics, and then turn to their appearance in deep learning in the next post, where I’ll dive into the parallel with the renormalization group.

The relation between these probabilistic notions and statistical physics is reasonably well-known, though the literature on this particular point unfortunately tends to be slightly sloppy. Loosely speaking, the partition function corresponds to the moment generating function, and the (Helmholtz) free energy corresponds to the cumulant generating function. By way of introduction, let’s make this identification precise.

The moment generating function for a random variable $X$ is

$$M_X(t) \equiv \langle e^{tX} \rangle, \tag{1}$$

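where $\langle\ldots\rangle$ denotes the expectation value for the corresponding distribution. (As a technical caveat: in some cases, the moments, and correspondingly $M_X$, may not exist, in which case one can resort to the characteristic function instead.) By series expanding the exponential, we have

$$M_X(t) = 1 + t\langle X\rangle + \frac{t^2}{2!}\langle X^2\rangle + \ldots = \sum_{n=0}^\infty \frac{t^n}{n!}\,m_n, \tag{2}$$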

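where $m_n \equiv \langle X^n\rangle$ is the $n^\mathrm{th}$ moment, which we can obtain by taking $n$ derivatives with respect to $t$ and setting $t=0$, i.e.,

$$m_n = \partial_t^n M_X(t)\big|_{t=0}. \tag{3}$$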

However, it is often more convenient to work with cumulants instead of moments (e.g., for independent random variables, the cumulant of the sum is the sum of the cumulants, thanks to the log). These are uniquely specified by the moments, and vice versa. This is unsurprising, since the cumulant generating function is just the log of the moment generating function:

$$K_X(t) \equiv \ln M_X(t) = \ln\langle e^{tX}\rangle = \sum_{n=1}^\infty \frac{t^n}{n!}\,\kappa_n, \tag{4}$$

where $\kappa_n$ is the $n^\mathrm{th}$ cumulant, which we again obtain by differentiating $n$ times and setting $t=0$:

$$\kappa_n = \partial_t^n K_X(t)\big|_{t=0}. \tag{5}$$

Note however that $\kappa_n$ is *not* simply the log of $m_n$!
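To make the moment/cumulant dictionary concrete, here is a minimal sketch that converts raw moments into cumulants using the standard recursion $\kappa_n = m_n - \sum_{k=1}^{n-1}\binom{n-1}{k-1}\kappa_k\,m_{n-k}$. The biased coin and the function names are illustrative assumptions, not anything taken from the discussion above.

```python
from math import comb, log

def raw_moments(values, probs, n_max):
    """Return [m_0, ..., m_{n_max}] for a discrete distribution."""
    return [sum(p * x**n for x, p in zip(values, probs)) for n in range(n_max + 1)]

def cumulants_from_moments(m):
    """Convert raw moments m[0..N] (with m[0] == 1) into cumulants kappa[1..N].

    Uses the recursion
        kappa_n = m_n - sum_{k=1}^{n-1} C(n-1, k-1) * kappa_k * m_{n-k}.
    Index 0 of the returned list is an unused placeholder.
    """
    kappa = [0.0] * len(m)
    for n in range(1, len(m)):
        correction = sum(comb(n - 1, k - 1) * kappa[k] * m[n - k] for k in range(1, n))
        kappa[n] = m[n] - correction
    return kappa

# Hypothetical example: a biased coin with P(X=1) = p.
p = 0.3
m = raw_moments([0, 1], [1 - p, p], 3)
kappa = cumulants_from_moments(m)

print("moments  :", m[1:])
print("cumulants:", kappa[1:])
# The second cumulant is the variance p(1-p), not the log of the second moment.
print("ln(m_2) =", log(m[2]), " vs  kappa_2 =", kappa[2])
```

For this coin every raw moment equals $p$, while the cumulants come out as the mean, the variance $p(1-p)$, and the third central moment $p(1-p)(1-2p)$.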

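Now, to make contact with thermodynamics, consider the case in which $X$ is the energy $E$ of the canonical ensemble. The probability of a given energy eigenstate $E_i$ at inverse temperature $\beta$ is

$$p_i = \frac{1}{Z[\beta]}\,e^{-\beta E_i}, \qquad Z[\beta] = \sum_i e^{-\beta E_i}. \tag{6}$$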

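The moment generating function for energy is then

$$M_E(t) = \langle e^{tE}\rangle = \sum_i p_i\,e^{tE_i} = \frac{1}{Z[\beta]}\sum_i e^{-(\beta-t)E_i} = \frac{Z[\beta-t]}{Z[\beta]}. \tag{7}$$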

Thus we see that the partition function $Z[\beta]$ is *not* the moment generating function, but there’s clearly a close relationship between the two. Rather, the precise statement is that the moment generating function $M_E(t)$ is the ratio of two partition functions at inverse temperatures $\beta-t$ and $\beta$, respectively. We can gain further insight by considering the moments themselves, which are, by definition (3), simply expectation values of powers of the energy:

$$m_n = \langle E^n\rangle = \frac{1}{Z[\beta]}\,\partial_t^n Z[\beta-t]\Big|_{t=0} = \frac{(-1)^n}{Z[\beta]}\,\partial_\beta^n Z[\beta]. \tag{8}$$

Note that derivatives of the partition function with respect to $t$ have, at $t=0$, become derivatives with respect to inverse temperature $\beta$ (obviously, this little sleight of hand doesn’t work for all functions; it relies on $Z$ depending on $t$ only through the combination $\beta-t$). Of course, this is simply a more formal expression for the usual thermodynamic expectation values. The first moment of energy, for example, is

$$m_1 = \langle E\rangle = -\frac{\partial\ln Z[\beta]}{\partial\beta}, \tag{9}$$

which is the ensemble average. At a more abstract level however, (8) expresses the fact that the average energy, appropriately normalized, is canonically conjugate to $\beta$. That is, recall that derivatives of the action are conjugate variables to those with respect to which we differentiate. In classical mechanics for example, energy is conjugate to time. Upon Wick rotating to Euclidean signature, the trajectories become thermal circles with period $\beta$. Accordingly, the energetic moments can be thought of as characterizing the dynamics of the ensemble in imaginary time.
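As a quick numerical sanity check of the two statements above (the moment generating function as a ratio of partition functions, and the mean energy as a $\beta$-derivative of $\ln Z$), here is a small sketch. The three-level spectrum, the parameter values, and the function names are arbitrary choices made for illustration.

```python
from math import exp, log

# Hypothetical toy spectrum (arbitrary units); any finite list of levels works.
ENERGIES = [0.0, 1.0, 2.5]

def partition_function(beta, energies=ENERGIES):
    """Z[beta] = sum_i exp(-beta * E_i)."""
    return sum(exp(-beta * e) for e in energies)

def mgf_energy(t, beta, energies=ENERGIES):
    """M_E(t) = <exp(t E)>, computed directly from the Boltzmann weights."""
    z = partition_function(beta, energies)
    return sum(exp(-beta * e) / z * exp(t * e) for e in energies)

def mean_energy(beta, energies=ENERGIES):
    """<E> as a direct ensemble average."""
    z = partition_function(beta, energies)
    return sum(e * exp(-beta * e) for e in energies) / z

beta, t, h = 1.2, 0.3, 1e-5

# Ratio form of the moment generating function: Z[beta - t] / Z[beta].
ratio = partition_function(beta - t) / partition_function(beta)

# First moment as -d(ln Z)/d(beta), using a central finite difference.
dlnz = (log(partition_function(beta + h)) - log(partition_function(beta - h))) / (2 * h)

print("M_E(t) direct vs ratio  :", mgf_energy(t, beta), ratio)
print("<E> direct vs -dlnZ/db  :", mean_energy(beta), -dlnz)
```

The two numbers on each printed line should agree, the second pair only up to the finite-difference error.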

Now, it follows from (7) that the cumulant generating function (4) is

$$K_E(t) = \ln\langle e^{tE}\rangle = \ln Z[\beta-t] - \ln Z[\beta]. \tag{10}$$

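While the $n^\mathrm{th}$ cumulant does not admit a nice post-derivative expression as in (8) (though I suppose one could write it in terms of Bell polynomials if we drop the adjective), it is simple enough to compute the first few and see that, as expected, the first cumulant is the mean, the second is the variance, and the third is the third central moment:

$$\begin{aligned}
\kappa_1 &= -\frac{Z'}{Z} = \langle E\rangle, \\
\kappa_2 &= \frac{Z''}{Z} - \left(\frac{Z'}{Z}\right)^2 = \langle E^2\rangle - \langle E\rangle^2, \\
\kappa_3 &= -\frac{Z'''}{Z} + 3\,\frac{Z''Z'}{Z^2} - 2\left(\frac{Z'}{Z}\right)^3 = \left\langle\left(E - \langle E\rangle\right)^3\right\rangle,
\end{aligned} \tag{11}$$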

where the prime denotes the derivative with respect to $\beta$. Note that since the second term in the generating function (10) is independent of $t$, the normalization drops out when computing the cumulants, so we would have obtained the same results had we worked directly with the partition function $Z[\beta]$ and taken derivatives with respect to $\beta$. That is, we could define

$$K_E(\beta) \equiv -\ln Z[\beta] \quad\implies\quad \kappa_n = (-1)^{n-1}\,\partial_\beta^n K_E(\beta), \tag{12}$$

where, in contrast to (5), we don’t need to set anything to zero after differentiating. This expression for the cumulant generating function will feature more prominently when we discuss correlation functions below.
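The claim that the cumulants follow directly from $\beta$-derivatives of $\ln Z$, with nothing set to zero afterwards, can be checked numerically. The sketch below uses $\kappa_n = (-1)^n\,\partial_\beta^n \ln Z[\beta]$ for $n = 1, 2$ with central finite differences. The toy spectrum, step size, and function names are assumptions made for illustration.

```python
from math import exp, log

ENERGIES = [0.0, 1.0, 2.5]  # hypothetical toy spectrum (arbitrary units)

def log_z(beta):
    """ln Z[beta] for the toy spectrum."""
    return log(sum(exp(-beta * e) for e in ENERGIES))

def energy_average(power, beta):
    """<E^power> as a direct Boltzmann average."""
    z = sum(exp(-beta * e) for e in ENERGIES)
    return sum(e**power * exp(-beta * e) for e in ENERGIES) / z

def kappa_from_log_z(n, beta, h=1e-3):
    """n-th cumulant (n = 1 or 2) as (-1)^n d^n(ln Z)/d(beta)^n, by central differences."""
    if n == 1:
        deriv = (log_z(beta + h) - log_z(beta - h)) / (2 * h)
    elif n == 2:
        deriv = (log_z(beta + h) - 2 * log_z(beta) + log_z(beta - h)) / h**2
    else:
        raise ValueError("only n = 1 and n = 2 are implemented in this sketch")
    return (-1) ** n * deriv

beta = 1.2
mean = energy_average(1, beta)
variance = energy_average(2, beta) - mean**2

print("kappa_1:", kappa_from_log_z(1, beta), " mean    :", mean)
print("kappa_2:", kappa_from_log_z(2, beta), " variance:", variance)
```

Higher cumulants could be obtained the same way, though finite differences become noisy quickly; automatic differentiation would be a more robust choice there.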

So, what does the cumulant generating function have to do with the (Helmholtz) free energy, $F = -\beta^{-1}\ln Z[\beta]$? Given the form (12), one sees that they’re essentially one and the same, up to a factor of $\beta$. And indeed the free energy is a sort of “generating function” in the sense that it allows one to compute any desired thermodynamic quantity of the system. The entropy, for example, is

$$S = -\frac{\partial F}{\partial T} = \beta^2\,\frac{\partial F}{\partial\beta} = \beta\langle E\rangle + \ln Z = -\langle\ln p\rangle, \tag{13}$$

where $p$ is the Boltzmann distribution (6). However, the factor of $\beta^{-1}$ in the definition of free energy technically prevents a direct identification with the cumulant generating function above. Thus it is really the log of the partition function itself (i.e., the *dimensionless* free energy $\beta F$) that serves as the cumulant generating function for the distribution. We’ll return to this idea momentarily, cf. (21) below.
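The chain of equalities for the entropy can also be verified numerically. The following sketch compares the Gibbs form $-\langle\ln p\rangle$, the thermodynamic form $\beta\langle E\rangle + \ln Z$, and the derivative form $\beta^2\,\partial F/\partial\beta$ (in units where $k_B = 1$) for a hypothetical three-level system. All names and values here are illustrative assumptions.

```python
from math import exp, log

ENERGIES = [0.0, 1.0, 2.5]  # hypothetical toy spectrum (arbitrary units)

def boltzmann(beta):
    """Return (probabilities, Z) for the toy spectrum."""
    weights = [exp(-beta * e) for e in ENERGIES]
    z = sum(weights)
    return [w / z for w in weights], z

def entropy_gibbs(beta):
    """S = -<ln p>."""
    probs, _ = boltzmann(beta)
    return -sum(p * log(p) for p in probs)

def entropy_thermo(beta):
    """S = beta <E> + ln Z."""
    probs, z = boltzmann(beta)
    mean_e = sum(p * e for p, e in zip(probs, ENERGIES))
    return beta * mean_e + log(z)

def free_energy(beta):
    """Helmholtz free energy F = -ln(Z) / beta."""
    return -log(boltzmann(beta)[1]) / beta

def entropy_from_free_energy(beta, h=1e-5):
    """S = beta^2 dF/dbeta, using a central finite difference."""
    return beta**2 * (free_energy(beta + h) - free_energy(beta - h)) / (2 * h)

beta = 1.2
print(entropy_gibbs(beta), entropy_thermo(beta), entropy_from_free_energy(beta))
```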

So much for definitions; what does it all mean? It turns out that in addition to encoding correlations, cumulants are intimately related to connectedness (in the sense of connected graphs), which underlies their appearance in QFT. Consider, for concreteness, a real scalar field $\phi(x)$ in 4 spacetime dimensions. As every student knows, the partition function

$$Z[J] = \mathcal{N}\int\!\mathcal{D}\phi\,\exp\left\{iS[\phi] + i\int\!\mathrm{d}^4x\,J(x)\phi(x)\right\} \tag{14}$$

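is the generating function for the $n$-point correlator or Green function $G^{(n)}(x_1,\ldots,x_n)$:

$$G^{(n)}(x_1,\ldots,x_n) = \frac{1}{i^n}\,\frac{\delta^n Z[J]}{\delta J(x_1)\ldots\delta J(x_n)}\bigg|_{J=0}, \tag{15}$$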

where the normalization $\mathcal{N}$ is fixed by demanding that in the absence of sources, we should recover the vacuum expectation value, i.e., $Z[0] = \langle 0|0\rangle = 1$. In the language of Feynman diagrams, the Green function $G^{(n)}$ contains all possible graphs, both connected and disconnected, that contribute to the corresponding transition amplitude. For example, the 4-point correlator of $\phi^4$ theory contains, at first order in the coupling, a disconnected graph consisting of two Feynman propagators, another disconnected graph consisting of a Feynman propagator and a 1-loop diagram, and an irreducible graph consisting of a single 4-point vertex. But only the last of these contributes to the scattering process, so it’s often more useful to work with the generating function for *connected* diagrams only,

$$W[J] = -i\ln Z[J], \tag{16}$$

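from which we obtain the connected Green function $G_c^{(n)}(x_1,\ldots,x_n)$:

$$G_c^{(n)}(x_1,\ldots,x_n) = \frac{1}{i^{n-1}}\,\frac{\delta^n W[J]}{\delta J(x_1)\ldots\delta J(x_n)}\bigg|_{J=0}. \tag{17}$$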

The fact that the generating functions for connected vs. disconnected diagrams are related by an exponential, that is, $Z[J] = e^{iW[J]}$, is not obvious at first glance, but it is a basic exercise in one’s first QFT course to show that the coefficients of various diagrams indeed work out correctly by simply Taylor expanding the exponential $e^X = \sum_n \frac{X^n}{n!}$. In the example of $\phi^4$ theory above, the only first-order diagram that contributes to the connected correlator is the 4-point vertex. More generally, one can decompose $G^{(n)}$ into $G_c^{(n)}$ plus products of $G_c^{(m)}$ with $m<n$. The factor of $-i$ in (16) goes away in Euclidean signature, whereupon we see that $Z[J]$ is analogous to $Z[\beta]$ (and hence plays the role of the moment generating function), while $W[J]$ is analogous to $\beta F$ (and hence plays the role of the cumulant generating function in the form (12)).
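To illustrate the decomposition into connected pieces, here is the 4-point case written out. This is a sketch under two assumptions that are not stated in the text: the 1-point function vanishes (e.g., because of a $\phi\to-\phi$ symmetry), and the connected functions are normalized so that no stray factors of $i$ appear. The second line is the same statement for a single zero-mean random variable.

```latex
\begin{aligned}
G^{(4)}(x_1,x_2,x_3,x_4) &= G_c^{(4)}(x_1,x_2,x_3,x_4)
  + G_c^{(2)}(x_1,x_2)\,G_c^{(2)}(x_3,x_4) \\
 &\quad + G_c^{(2)}(x_1,x_3)\,G_c^{(2)}(x_2,x_4)
  + G_c^{(2)}(x_1,x_4)\,G_c^{(2)}(x_2,x_3), \\
m_4 &= \kappa_4 + 3\,\kappa_2^2 \qquad (\kappa_1 = 0).
\end{aligned}
```

The three products of 2-point functions are exactly the disconnected contributions, and what remains after subtracting them is the connected 4-point function, i.e., the fourth cumulant.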

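Thus, the $n^\mathrm{th}$ cumulant of the field $\phi$ corresponds to the connected Green function $G_c^{(n)}$, i.e., the contribution from correlators of all $n$ fields only, excluding contributions from lower-order correlators among them. For example, we know from Wick’s theorem that Gaussian correlators factorize, so the corresponding $4$-point correlator becomes (writing $\phi_i \equiv \phi(x_i)$)

$$\langle\phi_1\phi_2\phi_3\phi_4\rangle = \langle\phi_1\phi_2\rangle\langle\phi_3\phi_4\rangle + \langle\phi_1\phi_3\rangle\langle\phi_2\phi_4\rangle + \langle\phi_1\phi_4\rangle\langle\phi_2\phi_3\rangle. \tag{18}$$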

What this means is that there are no interactions among all four fields that aren’t already explained by interactions among pairs thereof. The probabilistic version of this statement is that for the normal distribution, all cumulants with $n>2$ vanish (and for a zero-mean Gaussian, $\kappa_2$ is the only nonzero cumulant). (For a probabilist’s exposition on the relationship between cumulants and connectivity, see the first of three lectures by Novak and LaCroix [1], which takes a more graph-theoretic approach.)
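The factorization into pairs is easy to make explicit in code. The sketch below enumerates all perfect pairings of a list of indices and sums the corresponding products of covariances, which is Wick’s theorem for a zero-mean Gaussian. The covariance matrix and the function names are hypothetical, chosen only for illustration.

```python
def pairings(items):
    """Yield every way of splitting `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

def wick_moment(cov, indices):
    """<x_{i1} ... x_{in}> for a zero-mean Gaussian with covariance matrix `cov`."""
    if len(indices) % 2:
        return 0.0  # odd moments of a centered Gaussian vanish
    total = 0.0
    for pairing in pairings(list(indices)):
        term = 1.0
        for a, b in pairing:
            term *= cov[a][b]
        total += term
    return total

# Hypothetical 2x2 covariance matrix.
cov = [[2.0, 0.5],
       [0.5, 1.0]]

n_pairings_4 = len(list(pairings([0, 1, 2, 3])))   # number of terms in the 4-point function
m2 = wick_moment(cov, [0, 0])
m4 = wick_moment(cov, [0, 0, 0, 0])
kappa4 = m4 - 3 * m2**2                            # fourth cumulant of x_0 (zero mean)

print("pairings of four fields:", n_pairings_4)
print("<x0 x0 x1 x1> =", wick_moment(cov, [0, 0, 1, 1]))
print("kappa_4 =", kappa4)
```

The fourth cumulant vanishes identically here because the 4-point function is built entirely from pairs, which is the code-level version of the statement that a Gaussian has no connected correlators beyond the 2-point function.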

There’s one more important function that deserves mention here: the final member of the triumvirate of generating functions in QFT, namely the *effective action* $\Gamma[\phi]$, defined as the Legendre transform of $W[J]$:

$$\Gamma[\phi] = W[J] - \int\!\mathrm{d}^4x\,J(x)\phi(x). \tag{19}$$

The Legendre transform is typically first encountered in classical mechanics, where it relates the Hamiltonian and Lagrangian formulations. Geometrically, it translates between a function and its envelope of tangents. More abstractly, it provides a map between the configuration space (here, the sources $J$) and the dual vector space (here, the fields $\phi$). In other words, $\phi$ and $J$ are conjugate pairs in the sense that

$$\frac{\delta\Gamma}{\delta\phi} = -J \qquad\mathrm{and}\qquad \frac{\delta W}{\delta J} = \phi. \tag{20}$$

As an example that connects back to the thermodynamic quantities above: we already saw that $E$ and $\beta$ are conjugate variables by considering the partition function, but the Legendre transform reveals that the free energy and entropy are conjugate pairs as well. This is nicely explained in the lovely pedagogical treatment of the Legendre transform by Zia, Redish, and McKay [2], and also cleans up the disruptive factor of $\beta$ that prevented the identification with the cumulant generating function above. The basic idea is that since we’re working in natural units (i.e., $k_B=1$), the thermodynamic relation in the form (13) obscures the duality between the properly dimensionless quantities $\beta F$ and $S$. From this perspective, it is more natural to work with $\beta F$ instead of $F$, in which case we have both an elegant expression for the duality in terms of the Legendre transform, *and* a precise identification of the dimensionless free energy with the cumulant generating function (12):

$$\beta F(\beta) + S(E) = \beta E, \qquad \beta F(\beta) = -\ln Z[\beta] = K_E(\beta). \tag{21}$$

Now, back to QFT, in which $\Gamma[\phi]$ generates *one-particle irreducible* (1PI) diagrams. A proper treatment of this would take us too far afield, but can be found in any introductory QFT book, e.g., [3]. The basic idea is that in order to be able to cut a reducible diagram, we need to work at the level of vertices rather than sources (e.g., stripping off external legs, and identifying the bare propagator between irreducible parts). The Legendre transform (19) thus removes the dependence on the sources $J$, and serves as the generator for the vertex functions of $\phi$, i.e., the fundamental interaction terms. The reason this is called the effective action is that in perturbation theory, $\Gamma[\phi]$ contains the classical action $S[\phi]$ as the leading saddle-point, as well as quantum corrections from the higher-order interactions in the coupling expansion.

In information-theoretic terms, the Legendre transform of the cumulant generating function is known as the *rate function*. This is a core concept in large deviations theory, and I won’t go into details here. Loosely speaking, it quantifies the exponential decay that characterizes rare events. Concretely, let $X_i$ represent the outcome of some measurement or operation (e.g., a coin toss); then the mean after $N$ independent trials is

$$M_N = \frac{1}{N}\sum_{i=1}^N X_i. \tag{22}$$

The probability that this sample mean exceeds some specified value $x$ (greater than the expectation value) is, for large $N$,

$$P(M_N > x) \approx e^{-N I(x)}, \tag{23}$$

where $I(x)$ is the aforementioned rate function. The formal similarity with the partition function in terms of the effective action, $Z = e^{-\Gamma}$, is obvious, though the precise dictionary between the two is not. I suspect that a precise translation between the two languages (physics and information theory) can be made here as well, in which the increasing rarity of events as one moves along the tail of the distribution corresponds to increasingly high-order corrections to the quantum effective action, but I haven’t worked this out in detail.
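For the coin toss, the rate function can be computed explicitly, which makes the Legendre-transform statement concrete. The sketch below evaluates $I(x) = \sup_t\,[\,tx - K(t)\,]$ numerically for a Bernoulli variable with $K(t) = \ln(1 - p + p\,e^t)$, and compares it with the known closed form, the relative entropy between coins of bias $x$ and $p$. The ternary search, its bounds, and the function names are implementation choices assumed here for illustration.

```python
from math import exp, log

def cgf_bernoulli(t, p):
    """K(t) = ln <exp(t X)> for X in {0, 1} with P(X=1) = p."""
    return log(1.0 - p + p * exp(t))

def rate_function_numeric(x, p, lo=-40.0, hi=40.0, iters=200):
    """I(x) = sup_t [t*x - K(t)] for 0 < x < 1, by ternary search.

    The objective is concave in t, so ternary search converges to the maximum.
    """
    def objective(t):
        return t * x - cgf_bernoulli(t, p)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

def rate_function_exact(x, p):
    """Closed form for 0 < x < 1: x ln(x/p) + (1-x) ln((1-x)/(1-p))."""
    return x * log(x / p) + (1.0 - x) * log((1.0 - x) / (1.0 - p))

# Fair coin: how unlikely is it that the fraction of heads exceeds 70%?
p, x = 0.5, 0.7
print("numeric Legendre transform:", rate_function_numeric(x, p))
print("closed form               :", rate_function_exact(x, p))
```

The rate function vanishes at $x = p$ and grows as $x$ moves away from the mean, so deviations of the sample mean are exponentially suppressed in $N$.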

Of course, the above is far from the only place in physics where cumulants are lurking behind the scenes, much less the end of the parallel with information theory more generally. In the next post, I’ll discuss the analogy between deep learning and the renormalization group, and see how Bayesian terminology can provide an underlying language for both.

**References**

[1] J. Novak and M. LaCroix, “Three lectures on free probability,” arXiv:1205.2097.

[2] R. K. P. Zia, E. F. Redish, and S. R. McKay, “Making sense of the Legendre transform,” arXiv:0806.1147.

[3] L. H. Ryder, Quantum Field Theory. Cambridge University Press, 2nd ed., 1996.

Thank you for these great posts! The approach you mention here has been extended by Eric Smith here: https://arxiv.org/abs/1102.3938

He has subsequently analyzed the “stochastic effective action” using information geometry and applied it to evolutionary game theory. I’m currently working on applying these ideas to reinforcement learning. I’d love to discuss this more over email with you if you’re interested.

(I just finished my Physics PhD at UCSB and am applying to AI Residencies.)


Thanks for the link JG! I see you’ve also written about Smith’s paper on your blog. I would indeed be very interested in discussing the application of these ideas to reinforcement learning!
