In recent years, a number of works have pointed to similarities between deep learning (DL) and the renormalization group (RG) [1-7]. This connection was originally made in the context of certain lattice models, where decimation RG bears a superficial resemblance to the structure of deep networks in which one marginalizes over hidden degrees of freedom. However, the relation between DL and RG is more subtle than has been previously presented. The “exact mapping” put forth by [2], for example, is really just a formal analogy that holds for essentially any hierarchical model! That’s not to say there aren’t deeper connections between the two: in my earlier post on RBMs, for example, I touched on how the cumulants encoding UV interactions appear in the renormalized couplings after marginalizing out hidden degrees of freedom, and we’ll go into this in much more detail below. But it’s obvious that DL and RG are functionally distinct: in the latter, the couplings (i.e., the connection or weight matrix) are fixed by the relationship between the hamiltonians at different scales, while in the former, these connections are dynamically altered in the training process. There is, in other words, an important distinction between structure and dynamics that seems to have been overlooked. Understanding both these aspects is required to truly understand why deep learning “works”, but “learning” itself properly refers to the latter.
That said, structure is the first step to dynamics, so I wanted to see how far one could push the analogy. To that end, I started playing with simple Gaussian/Bernoulli RBMs, to see whether understanding the network structure — in particular, the appearance of hidden cumulants, hence the previous post in this two-part sequence — would shed light on, e.g., the hierarchical feature detection observed in certain image recognition tasks, the propagation of structured information more generally, or the relevance of criticality to both deep nets and biological brains. To really make the RG analogy precise, one would ideally like a beta function for the network, which requires a recursion relation for the couplings. So my initial hope was to derive an expression for this in terms of the cumulants of the marginalized neurons, and thereby gain some insight into how correlations behave in these sorts of hierarchical networks.
To start off, I wanted a simple model that would be analytically solvable while making the analogy with decimation RG completely transparent. So I began by considering a deep Boltzmann machine (DBM) with three layers: a visible layer of Bernoulli units $v$, and two hidden layers of Gaussian units, $h$ (intermediate) and $g$ (deepest). The total energy function consists solely of bilinear couplings between adjacent layers, with weight matrices $W$ (visible–intermediate) and $J$ (intermediate–deep), plus quadratic terms for the Gaussian units; throughout, I’ll use the more convenient vector notation in which the dot product between vectors is implicit, i.e., $v W h \equiv \sum_{i,j} v_i W_{ij} h_j$. Note that there are no intra-layer couplings, and that I’ve stacked the layers so that the visible layer $v$ is connected only to the intermediate hidden layer $h$, which in turn is connected only to the final hidden layer $g$. The connection to RG will be made by performing sequential marginalizations, first over $g$ and then over $h$, so that the flow from UV to IR is $g \to h \to v$. There’s an obvious Bayesian parallel here: we low-energy beings don’t have access to complete information about the UV, so the visible units are naturally identified with IR degrees of freedom, and indeed I’ll use these terms interchangeably throughout.
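To fix conventions for the numerical checks sprinkled below, here’s a minimal NumPy sketch of this three-layer machine. The explicit form of the energy (bilinear inter-layer couplings with weight matrices $W$ and $J$, unit-variance Gaussian terms for the hidden units, and no bias terms) is an assumption made purely for illustration, and the weights are scaled down so that the matrix inverses appearing later exist.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes (small, arbitrary values for illustration)
n_v, n_h, n_g = 4, 3, 2

# Inter-layer weight matrices: W couples v--h, J couples h--g.
# Kept small so that the matrix (1 - J J^T) appearing below is positive definite.
W = 0.1 * rng.standard_normal((n_v, n_h))
J = 0.1 * rng.standard_normal((n_h, n_g))

def energy(v, h, g):
    """Assumed energy: bilinear couplings between adjacent layers only,
    plus unit-variance Gaussian terms for the hidden units (no biases)."""
    return -v @ W @ h - h @ J @ g + 0.5 * h @ h + 0.5 * g @ g
```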
The joint distribution function describing the state of the machine is then the Boltzmann distribution $p(v,h,g) = Z^{-1} e^{-E(v,h,g)}$, where $Z$ is the partition function obtained by summing over the visible units and integrating over the hidden units; analogous normalizations will appear for the marginal distributions below. Let us now consider sequential marginalizations to obtain $p(v,h) = \int\!\mathrm{d}g\, p(v,h,g)$ and $p(v) = \int\!\mathrm{d}h\, p(v,h)$. In Bayesian terms, these distributions characterize our knowledge about the theory at intermediate- and low-energy scales, respectively. The first of these is simply the integral of the joint distribution over $g$.
In order to establish a relationship between the couplings at each energy scale, we then define the hamiltonian on the remaining, lower-energy degrees of freedom such that $p(v,h) \propto e^{-H(v,h)}$. The marginalization itself is a simple multidimensional Gaussian integral of the standard form $\int\!\mathrm{d}^n x\, e^{-\frac{1}{2} x A x + b x} = \sqrt{(2\pi)^n/\det A}\; e^{\frac{1}{2} b A^{-1} b}$, where in the present case the quadratic form $A$ comes from the Gaussian term for $g$, and the source $b$ is set by the coupling between $g$ and the intermediate layer $h$. We therefore obtain a renormalized hamiltonian for the remaining units, and the key point to note is that the interactions between the intermediate degrees of freedom have been renormalized by an amount proportional to the coupling with the UV variables $g$.
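Since everything in this toy model hinges on that formula, here’s a quick numerical sanity check of the multidimensional Gaussian integral in a hypothetical two-dimensional example (the matrix $A$ and source $b$ below are arbitrary choices):

```python
import numpy as np
from scipy import integrate

# A quick numerical check of the multidimensional Gaussian integral quoted above,
# for an arbitrary (hypothetical) two-dimensional example.
A = np.array([[2.0, 0.5],
              [0.5, 1.5]])     # symmetric, positive definite
b = np.array([0.3, -0.7])

def integrand(x0, x1):
    x = np.array([x0, x1])
    return np.exp(-0.5 * x @ A @ x + b @ x)

numeric, _ = integrate.nquad(integrand, [[-np.inf, np.inf], [-np.inf, np.inf]])
closed_form = np.sqrt((2 * np.pi) ** 2 / np.linalg.det(A)) \
              * np.exp(0.5 * b @ np.linalg.solve(A, b))
print(numeric, closed_form)    # should agree to high precision
```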
And indeed, in the context of deep neural nets, the advantage of hidden units is that they encode higher-order interactions through the cumulants of the associated prior. To make this connection explicit, consider the prior distribution of the UV variables $g$, i.e., the Gaussian distribution obtained from the quadratic part of the energy alone. The cumulant generating function for $g$ with respect to this distribution is then $K(t) = \ln\langle e^{t g}\rangle$, cf. eqn. (4) in the previous post. So by choosing the source $t$ to be the combination that couples $g$ to the intermediate layer, we may express the renormalized hamiltonian (7) as the un-marginalized terms plus a contribution from $K(t)$. From the cumulant expansion in the aforementioned eqn. (4), in which the $n^{\mathrm{th}}$ cumulant is the $n^{\mathrm{th}}$ derivative of $K$ at vanishing source, we then see that the effect of marginalizing out UV (i.e., hidden) degrees of freedom is to induce higher-order couplings between the IR (i.e., visible) units, with the coefficients of the interactions weighted by the associated cumulants. For the Gaussian prior (9), one immediately sees from (10) that all cumulants except for the second (the variance) vanish, whereupon the general expansion (13) reduces to the purely quadratic result (7) above.
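As a sanity check on the cumulant statement, here’s a small Monte Carlo sketch using the same assumed energy and a hypothetical weight matrix $J$ as above: the log of the expectation of $e^{t g}$ over the standard-normal prior should reduce to the second cumulant alone, which is precisely the quadratic renormalization of the intermediate-layer couplings just described.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_g = 3, 2
J = 0.1 * rng.standard_normal((n_h, n_g))   # same role as the h--g weights above

# Fix a configuration of the intermediate layer; the source entering the
# cumulant generating function is then t = J^T h (under the assumed energy).
h = rng.standard_normal(n_h)
t = J.T @ h

# Monte Carlo estimate of K(t) = ln < exp(t.g) > over the standard-normal prior on g.
g = rng.standard_normal((200_000, n_g))
K_mc = np.log(np.mean(np.exp(g @ t)))

# For a Gaussian prior only the second cumulant survives, so K(t) = |t|^2 / 2:
# exactly the quadratic renormalization of the intermediate-layer couplings.
K_exact = 0.5 * t @ t
print(K_mc, K_exact)   # should agree up to Monte Carlo error
```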
Now, let’s repeat this process to obtain the marginalized distribution of purely visible units, $p(v)$. In analogy with Wilsonian RG, this corresponds to further lowering the cutoff scale in order to obtain a description of the theory in terms of low-energy degrees of freedom that we can actually observe. Hence, tracing out the intermediate layer $h$, we have another integral of the general Gaussian form (6), but now the quadratic form includes the renormalization induced by the previous marginalization, and the source is set by the coupling between the visible and intermediate layers. Performing the integral and defining the hamiltonian of visible units via $p(v) \propto e^{-H(v)}$ as before, we again see that marginalizing out UV information induces new couplings between IR degrees of freedom; in particular, the hamiltonian contains a quadratic interaction term between the visible units. And we can again write this directly in terms of a cumulant generating function for hidden degrees of freedom, by defining a prior of the form (9) for $h$, but with the renormalized (no longer unit) covariance inherited from the previous step. This will be of the form (10), where in this case we need to choose the source to be the combination that couples $h$ to the visible units (and since the renormalized quadratic form is symmetric, its inverse, which plays the role of the covariance, is also invariant under the transpose; at this stage of exploration, I’m being quite cavalier about questions of existence). Thus the hamiltonian of visible units may be written as a cumulant expansion of the same form as (13), with the source and cumulants as above. Since the prior with which these cumulants are computed is again Gaussian, only the second cumulant survives in the expansion, and we indeed recover the purely quadratic result (17).
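Here’s a numerical check of the full (double) marginalization under the same assumed energy as in the earlier snippets. The closed-form expression with the inverse matrix $(\mathbb{1} - J J^T)^{-1}$ is what those assumed conventions give for the induced visible–visible coupling, so treat it as an illustration of the structure described above rather than a transcription of the original equations.

```python
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h, n_g = 4, 3, 2
W = 0.1 * rng.standard_normal((n_v, n_h))
J = 0.1 * rng.standard_normal((n_h, n_g))

def log_marginal_mc(v, n_samples=500_000):
    """Brute-force ln p(v), up to a v-independent constant: with the assumed
    energy, exp(-E) = exp(v.W.h + h.J.g) * N(h) * N(g), so we can simply average
    the first factor over standard-normal h and g."""
    h = rng.standard_normal((n_samples, n_h))
    g = rng.standard_normal((n_samples, n_g))
    return np.log(np.mean(np.exp(h @ (W.T @ v) + np.einsum('si,ij,sj->s', h, J, g))))

def log_marginal_analytic(v):
    """Result of the two sequential Gaussian marginalizations described above:
    H(v) = -1/2 (W^T v) . (1 - J J^T)^{-1} . (W^T v), up to constants."""
    A = np.eye(n_h) - J @ J.T            # renormalized quadratic form for h
    return 0.5 * (W.T @ v) @ np.linalg.solve(A, W.T @ v)

v1 = rng.integers(0, 2, n_v).astype(float)   # two Bernoulli visible configurations
v2 = rng.integers(0, 2, n_v).astype(float)

# Differences of log-marginals drop all v-independent constants; the two
# estimates should agree up to Monte Carlo error.
print(log_marginal_mc(v1) - log_marginal_mc(v2),
      log_marginal_analytic(v1) - log_marginal_analytic(v2))
```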
To summarize, the sequential flow from UV (hidden) to IR (visible) distributions is $p(v,h,g) \to p(v,h) \to p(v)$, where upon each marginalization, the new hamiltonian gains additional interaction terms/couplings governed by the cumulants of the UV prior (where “UV” is defined relative to the current cutoff scale, i.e., $g$ for the first marginalization and $h$ for the second), and the renormalization of the partition function is accounted for by the constant prefactors generated by the Gaussian integrals. As an aside, note that at each level, fixing the form of the defining relations (4), (16) is equivalent to imposing that the partition function remain unchanged. Ultimately, this is required in order to preserve low-energy correlation functions. The two-point correlator between visible (low-energy) degrees of freedom, for example, does not depend on which distribution we use to compute the expectation value, so long as the energy scale thereof is at or above the scale set by the inverse lattice spacing of the visible layer: $\langle v_i v_j \rangle_{p(v,h,g)} = \langle v_i v_j \rangle_{p(v,h)} = \langle v_i v_j \rangle_{p(v)}$. Failure to satisfy this requirement amounts to altering the theory at each energy scale, in which case there would be no consistent renormalization group relating them. In information-theoretic terms, this would represent an incorrect Bayesian inference procedure.
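To see the preservation of visible correlators concretely, the following sketch (again under the assumed energy above) computes the visible two-point function in two ways: by exact enumeration using the effective visible hamiltonian obtained after both marginalizations, and by block-Gibbs sampling of the full three-layer joint distribution, whose conditional distributions follow from the assumed energy. The two estimates should agree up to Monte Carlo error.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n_v, n_h, n_g = 4, 3, 2
W = 0.1 * rng.standard_normal((n_v, n_h))
J = 0.1 * rng.standard_normal((n_h, n_g))
A = np.eye(n_h) - J @ J.T        # renormalized quadratic form from integrating out g

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# (i) <v_i v_j> from the effective visible hamiltonian, by exact enumeration.
states = np.array(list(itertools.product([0.0, 1.0], repeat=n_v)))
source = states @ W                                   # rows are W^T v
logw = 0.5 * np.einsum('si,ij,sj->s', source, np.linalg.inv(A), source)
p = np.exp(logw - logw.max())
p /= p.sum()
C_visible = states.T @ (p[:, None] * states)

# (ii) <v_i v_j> from block-Gibbs sampling of the full joint p(v, h, g).
v = rng.integers(0, 2, n_v).astype(float)
h = np.zeros(n_h)
C_joint = np.zeros((n_v, n_v))
n_sweeps, burn = 60_000, 5_000
for sweep in range(n_sweeps):
    g = J.T @ h + rng.standard_normal(n_g)                # p(g | h)    = N(J^T h, 1)
    h = W.T @ v + J @ g + rng.standard_normal(n_h)        # p(h | v, g) = N(W^T v + J g, 1)
    v = (rng.random(n_v) < sigmoid(W @ h)).astype(float)  # p(v_i = 1 | h) = sigmoid((W h)_i)
    if sweep >= burn:
        C_joint += np.outer(v, v)
C_joint /= n_sweeps - burn

print(np.abs(C_visible - C_joint).max())   # should be small (Monte Carlo error)
```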
Despite (or perhaps, because of) its simplicity, this toy model makes manifest the fact that the RG prescription is reflected in the structure of the network, not the dynamics of learning per se. Indeed, Gaussian units aside, the above is essentially nothing more than real-space decimation RG on a 1d lattice, with a particular choice of couplings between “spins”. In this analogy, tracing out $g$ and then $h$ maps to a sequential marginalization over even spins in the 1d Ising model. “Dynamics” in this sense are determined by the hamiltonian $E(v,h,g)$, which is again one-dimensional. When one speaks of deep “learning” however, one views the network as two-dimensional, and “dynamics” refers to the changing values of the couplings as the network attempts to minimize the cost function. In short, the RG lies in the fact that the couplings at each level of the flow (21) encode the cumulants from hidden units in such a way as to ensure the preservation of visible correlations, whereas deep learning then determines their precise values in such a way as to reproduce a particular distribution. To say that deep learning itself is an RG is to conflate structure with function.
Nonetheless, there’s clearly an intimate parallel between RG and hierarchical Bayesian modeling at play here. As mentioned above, I’d originally hoped to derive something like a beta function for the cumulants, to see what insights theoretical physics and machine learning might yield to one another at this information-theoretic interface. Unfortunately, while one can see how the higher UV cumulants from the prior for $g$ are encoded in those from the (renormalized) prior for $h$, the appearance of the inverse matrix makes a recursion relation for the couplings in terms of the cumulants rather awkward, and the result would only hold for the simple Gaussian hidden units I’ve chosen for analytical tractability here.
Fortunately, after banging my head against this for a month, I learned of a recent paper [8] that derives exactly the sort of cumulant relation I was aiming for, at least in the case of generic lattice models. The key is not to assume a priori which degrees of freedom will be considered UV/hidden vs. IR/visible. That is, when I wrote down the joint distribution (2), I’d already distinguished which units would survive each marginalization. While this made the parallel with the familiar decimation RG immediate — and the form of the energy function (1) made the calculations simple to perform analytically — it’s actually a bit unnatural from both a field-theoretic and a Bayesian perspective: the degrees of freedom that characterize the theory in the UV may be very different from those that we observe in the IR (e.g., strings vs. quarks vs. hadrons), so we shouldn’t make the distinction at this level. Accordingly, [8] instead replace (2) with a generic Boltzmann distribution $p(x) \propto e^{-\mathcal{K}(x)}$ over lattice variables $x$, where $\mathcal{K}$ is the so-called reduced (i.e., dimensionless) hamiltonian, cf. the reduced/dimensionless free energy in the previous post. Note that $x$ runs over all the degrees of freedom in the theory, which are all UV/hidden variables at the present level.
Two words of notational warning ere we proceed: first, there is a sign error in eqn. (1) of [8] (in version 1; the negative in the exponent has been absorbed into the reduced hamiltonian already). More confusingly, however, their use of the terminology “visible” and “hidden” is backwards with respect to the RG analogy here. In particular, they coarse-grain a block of “visible” units into a single “hidden” unit. For reasons which should by now be obvious, I will instead stick to the natural Bayesian identifications above, in order to preserve the analogy with standard coarse-graining in RG.
Let us now repeat the above analysis in this more general framework. The real-space RG prescription consists of coarse-graining the UV degrees of freedom $x$ into a smaller set of IR degrees of freedom $x'$, and then writing the new distribution $p'(x')$ in the canonical form (24). In Bayesian terms, we need to marginalize over the information about the UV variables contained in the distribution $p(x)$, except that unlike in my simple example above, we don’t want to make any assumptions about the form of this distribution. So we instead express the integral — or rather, the discrete sum over lattice sites — in terms of the conditional distribution $p(x'|x)$ that defines the coarse-graining rule: $p'(x') = \sum_x p(x'|x)\, p(x)$. Denoting the new dimensionless hamiltonian by $\mathcal{K}'(x')$, with $p'(x') \propto e^{-\mathcal{K}'(x')}$, we therefore have $e^{-\mathcal{K}'(x')} \propto \sum_x p(x'|x)\, e^{-\mathcal{K}(x)}$.
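To make the abstract map concrete, here’s a small sketch for a periodic 1d Ising chain, with a deliberately simple (hypothetical) choice of conditional: a deterministic decimation rule in which each coarse variable just copies the first spin of its block. The coarse-graining rules considered in [8] are more general (and probabilistic), but the bookkeeping is the same.

```python
import itertools
import numpy as np

# A generic coarse-graining p'(x') = sum_x p(x'|x) p(x), illustrated on a tiny
# periodic 1d Ising chain with a simple decimation rule in which each coarse
# variable deterministically copies the first spin of its block.
N, K = 8, 0.4                      # number of spins, dimensionless coupling
block = 2                          # spins per block -> N // block coarse variables

spins = np.array(list(itertools.product([-1, 1], repeat=N)))
K_red = -K * np.sum(spins * np.roll(spins, -1, axis=1), axis=1)   # reduced hamiltonian
p = np.exp(-K_red)
p /= p.sum()

coarse_states = np.array(list(itertools.product([-1, 1], repeat=N // block)))
p_coarse = np.zeros(len(coarse_states))
for w, x in zip(p, spins):
    xp = x[::block]                               # the coarse-graining map x -> x'
    idx = np.flatnonzero((coarse_states == xp).all(axis=1))[0]
    p_coarse[idx] += w                            # p(x'|x) is a delta function here

# p_coarse now defines the renormalized theory; e.g., its nearest-neighbour
# correlator equals the distance-2 correlator of the original chain.
print(np.sum(p_coarse * coarse_states[:, 0] * coarse_states[:, 1]),
      np.sum(p * spins[:, 0] * spins[:, 2]))
```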
So far, so familiar, but now comes the trick: [8] split the hamiltonian into a piece containing only intra-block terms, $\mathcal{K}_0$ (that is, interactions solely among the set of hidden units which is to be coarse-grained into a single visible unit), and a piece containing the remaining, inter-block terms, $\mathcal{K}_1$ (that is, interactions between different aforementioned sets of hidden units).
Let us denote a block of hidden units by $\mathcal{B}_j$, such that together the blocks comprise all of the UV degrees of freedom $x$ (note that since there are fewer blocks than original lattice sites, this implies several UV degrees of freedom per block). To each block $\mathcal{B}_j$, we associate a visible unit $x'_j$, into which the constituent UV variables have been coarse-grained. (Note that, for the reasons explained above, we have swapped “visible” and “hidden” relative to [8]). Then translation invariance implies that the intra-block piece is a sum of identical terms, one per block, $\mathcal{K}_0 = \sum_j \mathcal{K}_0(\mathcal{B}_j)$, where $\mathcal{K}_0(\mathcal{B}_j)$ denotes a single intra-block term of the hamiltonian. With this notation in hand, (26) becomes a sum over UV configurations in which both the conditional distribution and the intra-block Boltzmann factor factorize over blocks, with the inter-block factor $e^{-\mathcal{K}_1}$ tying them together.
Now, getting from this to the first line of eqn. (13) in [8] is a bit of a notational hazard. We must suppose that for each block $\mathcal{B}_j$, we can define the block-distribution $p_0(\mathcal{B}_j) \propto e^{-\mathcal{K}_0(\mathcal{B}_j)}$, where the normalization is the corresponding single-block partition function. Given the underlying factorization of the total Hilbert space, we furthermore suppose that the distribution of all intra-block contributions can be written as the product $p_0(x) = \prod_j p_0(\mathcal{B}_j)$, so that the total intra-block normalization is likewise a product of identical single-block factors. This implies that the intra-block Boltzmann factor $e^{-\mathcal{K}_0(x)}$ is simply $p_0(x)$ times that total normalization. Thus we see that we can insert a factor of $p_0(x)$ into (28) at the cost of an overall normalization constant, from which the remaining manipulations are straightforward: we identify the weighted sum over each block as an expectation value, and pull out the block normalizations, which by translation invariance are given by the same function of the associated coarse-grained unit for every block, whereupon we have the new Boltzmann factor written as an overall constant, times a product of block normalization factors depending on the coarse-grained units, times an expectation value of the inter-block piece, where the expectation value is defined with respect to the conditional distribution of the block variables given the coarse-grained units. Finally, by taking the log, we obtain the new dimensionless hamiltonian $\mathcal{K}'(x')$, which one clearly recognizes as a generalization of (12): one term accounts for the normalization factors; another gives the contribution from the un-marginalized (i.e., coarse-grained) variables $x'$; and the log of the expectation value is the contribution from the UV cumulants, cf. eqn. (1) in the previous post. Note that this is not the cumulant generating function itself, but corresponds to evaluating it at a fixed value of the source: within the expectation values, the argument of the exponential becomes the dimensionless energy $\mathcal{K}_1$, so the $n^{\mathrm{th}}$ moment/cumulant picks up factors of the inverse temperature relative to the usual energetic moments in eqn. (11) of the previous post. Thus we may express (32) in terms of the cumulants of the dimensionless hamiltonian $\mathcal{K}_1$, where the expectation values in the generating functions are computed with respect to the conditional block distribution just described.
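As a cartoon of where the cumulants enter, the following sketch expands $\ln\langle e^{-\mathcal{K}_1}\rangle$ for a periodic 1d Ising chain split into two-site blocks, comparing the exact value against the first- and second-cumulant truncations. For brevity the expectation here is taken with respect to the factorized intra-block distribution alone, i.e., I’m omitting the coarse-graining conditional that appears in the full expression above.

```python
import itertools
import numpy as np

# Cumulant expansion of ln < exp(-K_1) >, illustrated on a periodic 1d Ising
# chain of N spins split into blocks of two: K_0 collects the bond inside each
# block, K_1 the bonds between neighbouring blocks.
N, K = 8, 0.3
spins = np.array(list(itertools.product([-1, 1], repeat=N)))

bonds = [(i, (i + 1) % N) for i in range(N)]
intra = [(i, j) for (i, j) in bonds if i // 2 == j // 2]       # bonds inside a block
inter = [b for b in bonds if b not in intra]                   # bonds between blocks

K0 = -K * sum(spins[:, i] * spins[:, j] for i, j in intra)
K1 = -K * sum(spins[:, i] * spins[:, j] for i, j in inter)

p0 = np.exp(-K0)
p0 /= p0.sum()                       # factorized intra-block distribution

exact = -np.log(np.sum(p0 * np.exp(-K1)))
kappa1 = np.sum(p0 * K1)                                  # first cumulant  <K_1>
kappa2 = np.sum(p0 * K1**2) - kappa1**2                   # second cumulant (variance)
print(exact, kappa1, kappa1 - 0.5 * kappa2)               # successive truncations
```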
This is great, but we’re not quite finished, since we’d still like to determine the renormalized couplings in terms of the cumulants, as I did in the simple Gaussian DBM above. This requires expressing the new hamiltonian in the same form as the old, which allows one to identify exactly which contributions from the UV degrees of freedom go where. (See for example chapter 13 of [9] for a pedagogical exposition of this decimation RG procedure for the 1d Ising model). For the class of lattice models considered in [8] — by which I mean, real-space decimation with the imposition of a buffer zone — one can write down a formal expression for the canonical form of the hamiltonian, but expressions for the renormalized couplings themselves remain model-specific.
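For concreteness, here’s the textbook instance of that matching procedure: zero-field decimation of the 1d Ising chain (the case covered in [9]), for which the renormalized coupling can be written in closed form. The snippet just verifies the identity behind the recursion.

```python
import numpy as np

# Decimating every other spin of the 1d Ising chain: summing s = +/-1 out of
# exp(K s (s_l + s_r)) gives 2 cosh(K (s_l + s_r)), which can be recast as
# A * exp(K' s_l s_r) with
#     K' = 0.5 * ln cosh(2K),   A = 2 * sqrt(cosh(2K)).
# Matching the new hamiltonian to the old form is what identifies the
# renormalized coupling; here we just check the identity numerically.
K = 0.7
K_new = 0.5 * np.log(np.cosh(2 * K))
A = 2.0 * np.sqrt(np.cosh(2 * K))

for s_l in (-1, 1):
    for s_r in (-1, 1):
        lhs = sum(np.exp(K * s * (s_l + s_r)) for s in (-1, 1))
        rhs = A * np.exp(K_new * s_l * s_r)
        print(s_l, s_r, lhs, rhs)    # the two columns should match exactly
```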
There’s more cool stuff in the paper [8] that I won’t go into here, concerning the question of “optimality” and the behaviour of mutual information in these sorts of networks. Suffice it to say that, as alluded to in the previous post, the intersection of physics, information theory, and machine learning is potentially rich yet relatively unexplored territory. While the act of learning itself is not an RG in a literal sense, the two share a hierarchical Bayesian language that may yield insights in both directions, and I hope to investigate this more deeply (pun intended) soon.
References
[1] C. Beny, “Deep learning and the renormalization group,” arXiv:1301.3124.
[2] P. Mehta and D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831.
[3] H. W. Lin, M. Tegmark, and D. Rolnick, “Why Does Deep and Cheap Learning Work So Well?,” arXiv:1608.08225.
[4] S. Iso, S. Shiba, and S. Yokoo, “Scale-invariant Feature Extraction of Neural Network and Renormalization Group Flow,” arXiv:1801.07172.
[5] M. Koch-Janusz and Z. Ringel, “Mutual information, neural networks and the renormalization group,” arXiv:1704.06279.
[6] S. S. Funai and D. Giataganas, “Thermodynamics and Feature Extraction by Machine Learning,” arXiv:1810.08179.
[7] E. Mello de Koch, R. Mello de Koch, and L. Cheng, “Is Deep Learning an RG Flow?,” arXiv:1906.05212.
[8] P. M. Lenggenhager, Z. Ringel, S. D. Huber, and M. Koch-Janusz, “Optimal Renormalization Group Transformation from Information Theory,” arXiv:1809.09632.
[9] R. K. Pathria, Statistical Mechanics, 2nd ed. Butterworth-Heinemann, 1996.