Mean field theory: from physics to deep neural nets

In a previous post, I alluded to the question of whether criticality played any role in deep neural networks. The question I originally had in mind was whether the fact that the correlation length diverges at a critical point implies an advantage in terms of information propagation in such systems. In particular, would a neural net operating at or near the critical point (e.g., by savvy initialization of weights and biases) exhibit advantages in training or performance? As it turns out, a few papers have actually addressed this issue using a prescription from physics known as mean field theory (MFT). In this two-part sequence, I’d like to first explain briefly what MFT is, and then in part 2 understand how machine learning researchers have applied it to obtain impressive real-world improvements in training performance.

In a nutshell, the idea behind MFT is that most partition functions (containing interactions) are too hard to evaluate explicitly, but can be made tractable by replacing each degree of freedom, together with its interactions, with an effective degree of freedom (the mean field) in which these interactions have been averaged over. Note that MFT is an approximation, for two reasons: it ignores higher-order fluctuations, and the averaging prescription necessarily washes out some fine-grained information. We’ll cover both these points in more detail below, when discussing the situations under which MFT fails. Perhaps ironically given the previous paragraph, MFT breaks down precisely at the critical point, so it’s important to understand the conditions under which the associated predictions are valid.

To make our discussion concrete, consider the ${d}$ -dimensional Ising hamiltonian with ${N}$ spins total (i.e., ${N^{1/d}}$ spins per direction):

$\displaystyle H=-\frac{1}{2}\hat J^{ij}\sigma_i\sigma_j-\hat h^i\sigma_i~, \ \ \ \ \ (1)$

where ${\sigma_i=\pm1}$ , and for compactness I’m employing Einstein’s summation convention with ${i=\{1,\ldots,N\}}$ and ${\hat h^i=\hat h\mathbf{1}}$ . Since all directions are spacelike, there’s no difference between raised and lowered indices (e.g., ${\sigma_i=\sigma^i}$ ), so I’ll denote the inverse matrix ${\hat J^{-1}}$ explicitly to avoid any possible confusion, i.e., ${\hat J\hat J^{-1}=\mathbf{1}}$ and ${{\hat J^{ij}\hat J_{ij}^{-1}=\delta^i_{~i}=N}}$ . In a 1d lattice, one would typically avoid boundary effects by joining the ends into an ${S^1}$ by setting ${\sigma_{N+1}=\sigma_1}$ , but this issue won’t be relevant for our purposes, as we’ll be interested in the thermodynamic limit ${N\rightarrow\infty}$ anyway.

One approach to constructing an MFT for this model is to observe that, for a given spin ${\sigma_i}$ , all the other spins act like an external magnetic field. That is, observe that we may write (1) as

$\displaystyle H=\sum_i\sigma_i\left(-\frac{1}{2}\sum_j\hat J_{ij}\sigma_j-\hat h\right)~. \ \ \ \ \ (2)$

We then replace ${\sigma_j}$ by the average value, ${\langle\sigma_j\rangle\equiv s}$ . We’ll give a more thorough Bayesian treatment of MFT below, but the idea here is that since no spin is special, the most likely value of ${\sigma_j}$ is the mean. This allows us to define an effective magnetic field at site ${i}$ :

$\displaystyle \hat h^\textrm{\tiny{eff}}_i\equiv\frac{s}{2}\sum_j\hat J_{ij}+\hat h~, \ \ \ \ \ (3)$

so that the hamiltonian becomes

$\displaystyle H\approx-\sum_i\sigma_i\hat h_i^\textrm{\tiny{eff}}~, \ \ \ \ \ (4)$

where the interactions have been absorbed into the effective magnetic field. Thus we’ve reduced an interacting many-body problem to a non-interacting one-body problem, which is much easier to solve!

As mentioned above however, this result is an approximation, which in the present case amounts to neglecting the correlation between spins at different sites. That is, when we replaced ${\sigma_j}$ by the mean, we secretly discarded terms of order ${\delta s^2}$ . This is illustrated explicitly in John McGreevy’s entertaining notes on the Renormalization Group, and runs as follows. As mentioned above, one can think of MFT as the replacement of the degrees of freedom by the average value plus fluctuations:

$\displaystyle \begin{aligned} \sigma_i\sigma_j&=\left[s+(\sigma_i-s)\right]\left[s+(\sigma_j-s)\right] =\left( s+\delta s_i\right)\left( s+\delta s_j\right)\\ &=s^2+s\left( \delta s_i+\delta s_j\right)+O(\delta s^2) =-s^2+s\left(\sigma_i+\sigma_j\right)+O(\delta s^2)~, \end{aligned} \ \ \ \ \ (5)$

where ${\delta s_i\equiv\sigma_i-s}$ . We then substitute this into the hamiltonian (1); taking ${J_{ij}=J_{ji}}$ for simplicity, we obtain

$\displaystyle H=\frac{s^2}{2}\sum_{ji}\hat J_{ij} +\sum_i\sigma_i\left(-\sum_j\hat J_{ij}s-\hat h\right)+O(\delta s^2)~. \ \ \ \ \ (6)$

Note that the first term is some constant factor of the lattice size, and hence doesn’t affect the dynamics (we simply absorb it into the normalization of the partition function). If we then define an effective magnetic field as in (3) (with a suitable factor of 2) and work to linear order in the fluctuations, we recover the effective one-body hamiltonian (4). Thus, in the course of our mean field approximation, we averaged over the fluctuations, but lost some information about the interactions between spins.
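As a quick numerical sanity check of the decomposition (5) and (6), here is a minimal sketch, assuming a 1d ring with nearest-neighbor coupling. The names `J_hat`, `h_hat`, and the particular values are illustrative choices, not taken from the text. It verifies that the exact energy equals the mean-field energy plus the discarded term ${-\tfrac{1}{2}\sum_{ij}\hat J_{ij}\delta s_i\delta s_j}$ :

```python
import numpy as np

rng = np.random.default_rng(0)
N, J_hat, h_hat = 12, 1.0, 0.3   # illustrative values, not from the text

# Symmetric nearest-neighbor coupling matrix on a ring (no self-interaction).
J = np.zeros((N, N))
for i in range(N):
    J[i, (i + 1) % N] = J[(i + 1) % N, i] = J_hat

sigma = rng.choice([-1.0, 1.0], size=N)   # a random spin configuration
s = sigma.mean()                          # stand-in for the mean <sigma>

# Exact hamiltonian (1).
H_exact = -0.5 * sigma @ J @ sigma - h_hat * sigma.sum()

# Mean-field hamiltonian (6), i.e. everything except the O(delta s^2) term.
H_mf = 0.5 * s**2 * J.sum() + sigma @ (-J.sum(axis=1) * s - h_hat)

# The discarded fluctuation-fluctuation term.
ds = sigma - s
H_fluct = -0.5 * ds @ J @ ds

print(H_exact, H_mf + H_fluct)
```

The identity holds for any value of `s`, so the quality of the approximation is governed entirely by how small `H_fluct` is relative to the rest.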

Despite its approximate nature, the hamiltonian (6) (equivalently, (4)) is quite useful insofar as it can be used to obtain an upper bound on the free energy. To understand this, let us introduce the Bayesian perspective promised above. In MFT, we’re ignoring some information about the system at hand, but we want to make inferences that are as accurate as possible subject to the available constraints. Recall from our discussion of entropy that if we do not know the underlying distribution with respect to which a particular expectation value is computed, the most rational choice is obtained by maximizing the von Neumann entropy. In particular, if we know the average energy, ${\langle H\rangle=E}$ , this procedure yields the Boltzmann distribution

$\displaystyle p_i\equiv p(x_i)=\frac{1}{Z[\beta]}e^{-\beta E_i}~, \qquad\qquad\sum_ip_i=1~, \ \ \ \ \ (7)$

where ${E_i\equiv\langle H(x_i)\rangle}$ , and we identify the inverse temperature ${\beta}$ as the Lagrange multiplier arising from the constraint on the energy.

Now suppose that instead of knowing the average energy, we know only the temperature (i.e., we consider the canonical rather than microcanonical ensemble). As explained in a previous post, this amounts to a constraint in the (canonically) dual space, so the appropriate extremization procedure is instead to minimize the free energy,

$\displaystyle F=E-\beta^{-1}S~. \ \ \ \ \ (8)$

(Note that for a given energy configuration, the free energy is minimized when the entropy is maximized). One finds again (7), with ${Z[\beta]=e^{\beta\lambda}}$ where ${\lambda}$ is the Lagrange multiplier for the normalization constraint, cf. the max-ent procedure here.

The upshot is that the max-ent distribution (7) has minimum free energy: had we used any other distribution, it would amount to the imposition of additional constraints on the system, thereby reducing the entropy and increasing ${F}$ in (8). This is essentially what happens in MFT, since we select a more tractable distribution with respect to which we can compute expectation values (i.e., a simpler hamiltonian, with a constraint on the fluctuations). In the present case, this implies that the normalized free energy obtained via the mean-field hamiltonian (6), denoted $f_\textrm{\tiny{MF}}$ , provides an upper bound on the true (equilibrium) free energy ${f}$ :

$\displaystyle f_\textrm{\tiny{MF}}\geq f~, \ \ \ \ \ (9)$

where ${f=F/N}$ (henceforth I’ll refer to this simply as the free energy, without the “normalized” qualifier). This statement, sometimes referred to as the Bogolyubov inequality, can be easily shown to follow from Gibbs’ inequality. (This apparently standard derivation is however not a sufficient explanation per se, since it simply assumes that expectation values are taken with respect to the mean-field distribution. Had we chosen to take them with respect to the equilibrium (max-ent) distribution, the bound would be reversed!)

Working to linear order in the fluctuations, the mean-field partition function for (6) is

$\displaystyle \begin{aligned} Z_\textrm{\tiny{MF}}&=\sum_{\{\sigma\}}e^{-\beta H_\textrm{\tiny{MF}}} =\prod_{k=1}^N\sum_{\sigma_k=\pm1}\exp\left[-NdJs^2+\left(2dJs+h\right)\sum_i\sigma_i\right]\\ &=e^{-NdJs^2}\prod_{k=1}^N\left[e^{-\left(2dJs+h\right)}+e^{\left(2dJs+h\right)}\right]\\ &=2^Ne^{-NdJs^2}\cosh^N\left(2dJs+h\right)~, \end{aligned} \ \ \ \ \ (10)$

where for simplicity we have restricted to homogeneous nearest-neighbor interactions (in ${d}$ spatial dimensions, each of the ${N}$ spins has ${2d}$ neighbors with coupling strength ${J\equiv\beta\hat J}$ and ${h\equiv\beta \hat h}$ ). The corresponding free energy is then

$\displaystyle f_\textrm{\tiny{MF}}(s)=-\frac{1}{N\beta}{\ln Z} =\beta^{-1}dJs^2-\beta^{-1}\ln\cosh\left(2dJs+h\right)~, \ \ \ \ \ (11)$

where we have dropped the ${\beta^{-1}\ln 2}$ term, since this doesn’t contribute to any of the observables for which ${F}$ serves as a generating function (that is, it’s just a normalization factor).
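The closed form (10) is easy to confirm by brute force for a small system. The following is a minimal sketch under the homogeneous nearest-neighbor assumption stated above; the values of `N`, `J`, `h`, and `s` are arbitrary illustrative choices:

```python
import itertools
import numpy as np

N, d, J, h, s = 6, 1, 0.4, 0.1, 0.3   # illustrative dimensionless values

# beta * H_MF from (6) for homogeneous nearest-neighbor couplings.
def beta_H_mf(sigma):
    return N * d * J * s**2 - (2 * d * J * s + h) * sum(sigma)

# Sum the Boltzmann weight over all 2^N spin configurations.
Z_brute = sum(np.exp(-beta_H_mf(c)) for c in itertools.product([-1, 1], repeat=N))
Z_closed = 2**N * np.exp(-N * d * J * s**2) * np.cosh(2 * d * J * s + h) ** N

beta_f_mf = -np.log(Z_closed) / N   # beta * f_MF, here including the ln 2 term
print(Z_brute, Z_closed, beta_f_mf)
```

Because the mean-field hamiltonian is a sum of single-site terms, the sum over configurations factorizes, which is why the two numbers agree.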

Now, as per our discussion above, (11) provides an upper bound on the true free energy. Thus we can obtain the tightest possible bound (given our ansatz (6)) by minimizing over ${s}$ :

$\displaystyle \frac{\partial f_{\textrm{\tiny{MF}}}}{\partial s}=0 \quad\implies\quad s=\tanh\left(2dJs+h\right)~. \ \ \ \ \ (12)$

This last is referred to as the self-consistency condition. The reason is that it’s precisely what we would have obtained had we computed the average spin via the single-site hamiltonian (6), or equivalently (4): since the linear term has been absorbed into the effective magnetic field ${\hat h^\textrm{\tiny{eff}}}$ , it looks as though ${J=0}$ , and therefore

$\displaystyle s=\langle\sigma\rangle=\sum_{\sigma=\pm1}\!\sigma\,p(\sigma) =\frac{1}{Z}\sum_{\sigma=\pm1}\sigma e^{ h^{\textrm{\tiny{eff}}}\sigma} =\frac{\sum_{\sigma=\pm1}\sigma e^{h^{\textrm{\tiny{eff}}}\sigma}}{\sum_{\sigma=\pm1} e^{h^{\textrm{\tiny{eff}}}\sigma}} =\tanh h^\textrm{\tiny{eff}}~, \ \ \ \ \ (13)$

where ${h^\textrm{\tiny{eff}}\equiv\beta\hat h^\textrm{\tiny{eff}}}$ . Substituting in (3) for the homogeneous nearest-neighbor case at hand (that is, without the factor of ${1/2}$ , cf. (6)) then gives precisely (12).
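The bound (9) can also be checked numerically on a system small enough to enumerate. The sketch below assumes a 1d ring of `N = 10` spins with illustrative parameter values, and it keeps the ${\beta^{-1}\ln2}$ term in (11), since the comparison with the exact free energy requires the full normalization:

```python
import itertools
import numpy as np

N, d, J_hat, h_hat, T = 10, 1, 1.0, 0.2, 1.5   # illustrative values
beta = 1.0 / T

def energy(sigma):
    """Hamiltonian (1) on a ring, with each bond counted once."""
    sigma = np.asarray(sigma)
    return -J_hat * np.sum(sigma * np.roll(sigma, 1)) - h_hat * sigma.sum()

# Exact free energy per spin by brute-force enumeration of all 2^N states.
energies = np.array([energy(c) for c in itertools.product([-1, 1], repeat=N)])
f_exact = -np.log(np.sum(np.exp(-beta * energies))) / (N * beta)

def f_mf(s):
    """Mean-field free energy per spin (11), keeping the ln 2 term."""
    J, h = beta * J_hat, beta * h_hat
    return (d * J * s**2 - np.log(2 * np.cosh(2 * d * J * s + h))) / beta

s_grid = np.linspace(-1, 1, 2001)
f_mf_min = f_mf(s_grid).min()
print(f_exact, f_mf_min)
```

For ferromagnetic coupling, every value of `s` on the grid gives a value above the exact result, and minimizing over `s` simply picks out the tightest such bound.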

The self-consistency equation can be solved graphically, i.e., the critical points are given by the intersections of the left- and right-hand sides; see [1] for a pedagogical treatment. In brief, for ${h\!\neq\!0}$ , the global minimum of ${f_\textrm{\tiny{MF}}}$ is given by the intersection whose sign matches that of ${h}$ (positive ${s}$ for ${h\!>\!0}$ ), regardless of temperature. For ${h\!=\!0}$ in contrast, there’s a single minimum at ${s\!=\!0}$ for high temperatures, and two degenerate minima at ${\pm s_0}$ at low temperatures (depending on whether ${\tanh(2d\beta\hat Js)}$ crosses ${s}$ for ${s\neq0}$ ; to see this, recall that for small ${x}$ , ${\tanh x\approx x-x^3/3+\ldots}$ , so a sufficiently small value of ${\beta}$ makes this an approximately straight line whose slope ${2d\beta\hat J}$ is less than 1, the slope of the line ${s}$ itself). The critical temperature that divides these two regimes is found by imposing that

$\displaystyle s=\tanh(2d\beta_c\hat Js)\overset{!}{=}\tanh(s) \quad\implies\quad T_c=2d\hat J~. \ \ \ \ \ (14)$

Note that this is completely wrong for low dimensions! The exact results are ${T_c\!=\!0}$ for ${d\!=\!1}$ and ${T_c\approx2.269\hat J}$ for ${d\!=\!2}$ , whereas (14) predicts ${2\hat J}$ and ${4\hat J}$ , respectively; we’ll have more to say about this failure below.
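Instead of solving the self-consistency condition (12) graphically, one can iterate it to a fixed point. The following is a minimal sketch; the function name `solve_magnetization`, the starting guess, and the iteration count are illustrative choices:

```python
import numpy as np

def solve_magnetization(T, d=2, J_hat=1.0, h_hat=0.0, s0=0.5, n_iter=2000):
    """Fixed-point iteration of s = tanh(2 d J s + h), with J = J_hat / T."""
    s = s0
    for _ in range(n_iter):
        s = np.tanh((2 * d * J_hat * s + h_hat) / T)
    return s

d, J_hat = 2, 1.0
T_c = 2 * d * J_hat                      # mean-field prediction (14)
for T in (0.5 * T_c, 0.9 * T_c, 1.5 * T_c):
    print(T / T_c, solve_magnetization(T, d, J_hat))
```

Starting from a positive guess, the iteration settles on the positive branch ${+s_0}$ below ${T_c}$ and decays to ${s=0}$ above it. Convergence becomes slow close to ${T_c}$ , where the slope of the map at the fixed point approaches 1.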

Let’s concentrate on the ${h\!=\!0}$ case henceforth: note that the critical point ${s_0}$ will always be small ( ${|s_0|\!<\!1}$ ) independent of ${T}$ (since ${\lim\nolimits_{x\rightarrow\pm\infty}\tanh x=\pm1}$ ), so we can expand

$\displaystyle \ln\cosh(2dJs_0)=\frac{1}{2}(2dJ)^2s_0^2-\frac{1}{12}(2dJ)^4s_0^4+O(s_0^6)~, \ \ \ \ \ (15)$

whence the free energy (11) near the critical point is approximately

$\displaystyle f_\textrm{\tiny{MF}}(s_0)\approx \frac{r}{2}s_0^2+\frac{g}{4!}s_0^4~, \ \ \ \ \ (16)$

where we have defined

$\displaystyle \begin{aligned} r\equiv 2d&J\beta^{-1}\left(1-2dJ\right) =\frac{T_c}{T}\left( T-T_c\right)~,\\ g&\equiv 32\beta^{-1}d^4J^4 =\frac{2T_c^4}{T^3}~, \end{aligned} \ \ \ \ \ (17)$

and dropped all higher-order terms. Observe that the sign of ${r}$ changes at the critical temperature ${T\!=\!T_c}$ , which determines whether the global minimum of ${f_\textrm{\tiny{MF}}}$ lies at ${s_0\!=\!0}$ ( ${T\!>\!T_c}$ ) or ${\pm s_0\!>\!0}$ ( ${T\!<\!T_c}$ ). The physical interpretation is that below the critical temperature, it is energetically favourable for the spins to align, resulting in a non-zero magnetization (which is what the average spin ${\langle\sigma\rangle=s}$ is). Above the critical temperature however, thermal fluctuations disrupt this ordering, so the net magnetization is zero. For this reason, the magnetization ${s}$ is an example of an order parameter, since it parametrizes which “order” — that is, which phase — we’re in on either side of the critical point.
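To see how good the quartic truncation (16) is, one can compare it directly against the full expression (11) at ${h\!=\!0}$ . This is a minimal sketch with illustrative values for `d`, `J_hat`, `T`, and `s`:

```python
import numpy as np

d, J_hat = 3, 1.0
T_c = 2 * d * J_hat

def f_mf(s, T):
    """Mean-field free energy (11) at h = 0 (ln 2 term dropped)."""
    J = J_hat / T
    return T * (d * J * s**2 - np.log(np.cosh(2 * d * J * s)))

def landau_coefficients(T):
    """Coefficients r and g from (17)."""
    return T_c * (T - T_c) / T, 2 * T_c**4 / T**3

T, s = 0.95 * T_c, 0.05
r, g = landau_coefficients(T)
print(f_mf(s, T), r / 2 * s**2 + g / 24 * s**4)

# Below T_c the quartic form is minimized at s_0^2 = -6 r / g.
s0 = np.sqrt(-6 * r / g)
print(s0)
```

The two printed free energies differ only at order ${s^6}$ , and `s0` shrinks continuously to zero as ${T\rightarrow T_c}$ from below, as expected for a second-order transition.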

As alluded to above however, there’s a problem with the MFT results for the critical point, namely that it’s precisely at the critical point where MFT breaks down! The reason is that at the critical point, fluctuations at all scales are important, whereas MFT includes only fluctuations to linear order (cf. (5)). The contribution from all scales is related to the statement we made in the introductory paragraph, namely that the correlation length diverges at the critical point. To properly understand this, we need to go beyond the MFT approach above. In particular, while the discrete lattice is a helpful starting point, we can gain further insight by considering a continuum field theory. We’ll see that MFT corresponds to the leading-order saddle point approximation, and that the first corrections to this expansion can qualitatively change these results.

To proceed, we’ll map our square-lattice Ising model to an equivalent theory of scalar fields. (If you like, you can just jump to the Ginzburg-Landau action (32) and take it as an ansatz, but I find the mapping both neat and instructive). Starting again from (1), the partition function is

$\displaystyle Z=\sum_{\{\sigma\}}e^{-\beta H} =\prod_{k=1}^N\sum_{\sigma_k=\pm1}\exp\left(\frac{1}{2} J^{ij}\sigma_i\sigma_j+h^i\sigma_i\right)~, \ \ \ \ \ (18)$

where as before we have absorbed the pesky factor of ${\beta}$ by defining ${J=\beta\hat J}$ , ${h=\beta\hat h}$ . The first step is to apply the Hubbard-Stratonovich transformation,

$\displaystyle e^{\frac{1}{2}K^{ij}s_is_j}=\left[\frac{\mathrm{det}K}{(2\pi)^N}\right]^{1/2}\!\int\!\mathrm{d}^N\phi\exp\left(-\frac{1}{2}K^{ij}\phi_i\phi_j+K^{ij}s_i\phi_j\right)~, \qquad\forall\,s_i\in\mathbb{R}~. \ \ \ \ \ (19)$
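The identity (19) is just a Gaussian integral, which is easy to confirm numerically in the single-variable case ${N\!=\!1}$ . This is a minimal sketch, assuming a scalar ${K>0}$ so that the integral converges; the grid and the values of `K` and `s` are illustrative:

```python
import numpy as np

K, s = 1.7, 0.8                       # illustrative values; K > 0 for convergence
phi = np.linspace(-20.0, 20.0, 400001)
dphi = phi[1] - phi[0]

integrand = np.exp(-0.5 * K * phi**2 + K * s * phi)
rhs = np.sqrt(K / (2 * np.pi)) * np.sum(integrand) * dphi   # simple Riemann sum
lhs = np.exp(0.5 * K * s**2)
print(lhs, rhs)
```

Completing the square in the exponent shows why this works: the integrand is a Gaussian centered at ${\phi=s}$ , multiplied by ${e^{Ks^2/2}}$ .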

(We used this before in our post on restricted Boltzmann machines; the difference here is that we want to allow ${h\neq0}$ ). Applying this transformation to the first term in the partition function, we have

$\displaystyle Z=\left[\frac{\mathrm{det}J}{(2\pi)^N}\right]^{1/2}\!\int\!\mathrm{d}^N\phi\sum_{\{\sigma\}}\exp\left[-\frac{1}{2}J^{ij}\phi_i\phi_j+\left( J^{ij}\phi_j+h^i\right)\sigma_i\right]~. \ \ \ \ \ (20)$

At a computational level, the immediate advantage of the Hubbard-Stratonovich transformation in the present case is that we can sum over the binary spins ${\sigma}$ , leaving us with an expression entirely in terms of the new field variables ${\phi}$ . Observe that for each spin,

$\displaystyle \sum_{\sigma_i=\pm1}\exp\left[\left( J^{ij}\phi_j+h^i\right)\sigma_i\right]=2\cosh\left( J^{ij}\phi_j+h^i\right)~, \ \ \ \ \ (21)$

and therefore by re-exponentiating this expression, the partition function becomes

$\displaystyle Z=\left[\left(\frac{2}{\pi}\right)^N\!\mathrm{det}J\right]^{1/2}\!\int\!\mathrm{d}^N\phi\exp\!\left[-\frac{1}{2}J^{ij}\phi_i\phi_j+\sum_i\ln\cosh\left( J^{ij}\phi_j+h^i\right)\right]~. \ \ \ \ \ (22)$

We now observe that ${J^{ij}\phi_j+h^i\equiv\mu^i}$ can be thought of as the mean field ${\langle\phi_i\rangle}$ at site ${i}$ , incorporating the interaction with all other sites as well as the external magnetic field. We can then express the partition function in terms of the mean field ${\mu^i}$ by inverting this identification:

$\displaystyle \begin{aligned} \phi^i=J^{-1}_{ij}\left(\mu^j-h^j\right) \quad\implies\quad J^{ij}\phi_i\phi_j&=J^{ij}J^{-1}_{in}J^{-1}_{jm}\left(\mu^n-h^n\right)\left(\mu^m-h^m\right)\\ &=J^{-1}_{ij}\left(\mu^i\mu^j-h^i\mu^j-h^j\mu^i+h^ih^j\right)~. \end{aligned} \ \ \ \ \ (23)$

As for the change in the measure, it follows from the anti-symmetry of the wedge product, together with the fact that ${J_{ii}=0}$ , that

$\displaystyle \mathrm{d}\phi_i=J^{-1}_{ij}\mathrm{d}\mu^j \quad\implies\quad \mathrm{d}^N\!\phi=\mathrm{det}J^{-1}\mathrm{d}^N\!\mu~. \ \ \ \ \ (24)$

Hence the partition function may be equivalently expressed as

$\displaystyle Z=\left[\left(\frac{2}{\pi}\right)^N\!\mathrm{det}J^{-1}\right]^{1/2}\!e^{-\frac{1}{2}J^{-1}_{ij}h^ih^j}\!\int\!\mathrm{d}^N\!\mu\exp\!\left[-\frac{1}{2}J^{-1}_{ij}\mu^i\mu^j+J^{-1}_{ij}h^i\mu^j+\sum_i\ln\cosh\mu_i\right]~, \ \ \ \ \ (25)$

where I’ve assumed ${J_{ij}=J_{ji}}$ . While this only becomes a proper (mean) field theory in the thermodynamic limit, it’s worth emphasizing that up to this point, the transformation from the original lattice model (18) to (25) is exact!

Now comes the approximation: to obtain a more tractable expression, let’s consider the case where the external magnetic field is very small as we did above. In this case, since the spin interactions don’t induce any preferred direction, we expect the mean field to be centered near zero, i.e., ${|\mu_i|\ll1}$ . We can then expand

$\displaystyle \ln\cosh\mu_i=\frac{1}{2}\mu_i^2-\frac{1}{12}\mu_i^4+O(\mu_i^6)~, \ \ \ \ \ (26)$

whereupon the partition function becomes

$\displaystyle Z\approx\left[\left(\frac{2}{\pi}\right)^N\!\mathrm{det}J^{-1}\right]^{1/2}\!e^{-\frac{1}{2}J^{-1}_{ij}h^ih^j}\!\int\!\mathrm{d}^N\mu\exp\!\left[-\frac{1}{2}J^{-1}_{ij}\mu^i\mu^j +J^{-1}_{ij}h^i\mu^j +\sum_i\left(\frac{1}{2}\mu_i^2-\frac{1}{12}\mu_i^4\right)\right]~. \ \ \ \ \ (27)$

Finally, we take the continuum limit, in which we label the field at each site by the ${d}$ -dimensional vector ${\mathbf{x}}$ (i.e., ${\mu_i\rightarrow\mu(\mathbf{x})}$ and ${\sum\nolimits_i\rightarrow\int\!\mathrm{d}^dx=\int\!\mathrm{d}\mathbf{x}}$ ), and obtain the path-integral measure

$\displaystyle \left[\left(\frac{2}{\pi}\right)^N\!\mathrm{det}J^{-1}\right]^{1/2}\!e^{-\frac{1}{2}J^{-1}_{ij}h^ih^j}\!\int\!\mathrm{d}^N\mu \;\longrightarrow\;\mathcal{N}\!\int\!\mathcal{D}\mu~. \ \ \ \ \ (28)$

Thus the continuum field theory for the Ising model is

$\displaystyle \begin{aligned} Z\approx\mathcal{N}\!\int\!\mathcal{D}\mu\exp\bigg\{\!&-\frac{1}{2}\int\!\mathrm{d}\mathbf{x}\mathrm{d}\mathbf{y}\,\mu(\mathbf{x})J^{-1}(\mathbf{x}-\mathbf{y})\big[\mu(\mathbf{y})-2h\big]\\ &+\frac{1}{2}\int\!\mathrm{d}\mathbf{x}\left[\mu(\mathbf{x})^2-\frac{1}{6}\mu(\mathbf{x})^4\right]\bigg\}~, \end{aligned} \ \ \ \ \ (29)$

where ${h(\mathbf{x})=h}$ , since the external magnetic field is the same for all lattice sites, cf. the definition of ${\hat h^i}$ below (1).

In obtaining (29), I haven’t made any assumptions about the form of the coupling matrix ${J}$ , except that it be a symmetric invertible matrix, with no self-interactions ( ${J_{ii}=0}$ ). Typically however, we’re interested in the case in which the hamiltonian (1) includes only nearest-neighbor interactions (as we eventually specified in our lattice model, cf. (10)), and we’d like to preserve this notion of locality in the field theory. To do this, we take ${|\mathbf{y}-\mathbf{x}|\ll 1}$ and Taylor expand the field ${\mu(\mathbf{y})}$ around ${\mathbf{x}}$ :

$\displaystyle \mu(\mathbf{y})=\mu(\mathbf{x})+(y^i-x^i)\partial_i\mu(\mathbf{x})+\frac{1}{2}(y^i-x^i)(y^j-x^j)\partial_i\partial_j\mu(\mathbf{x})+O\left((\mathbf{y}-\mathbf{x})^3\right)~. \ \ \ \ \ (30)$

That is, we view ${J^{-1}(\mathbf{y}-\mathbf{x})}$ as mediating interactions between fields at infinitesimally separated points in space, with increasingly non-local (i.e., higher-derivative) terms suppressed by powers of the separation. Upon substituting this expansion into (29), and working to second-order in this local expansion, one obtains a partition function of the form

$\displaystyle Z\approx\mathcal{N}\!\int\!\mathcal{D}\mu\,e^{-S[\mu]}~, \ \ \ \ \ (31)$

with

$\displaystyle S[\mu]=\int\!\mathrm{d}^d\mathbf{x}\left[\frac{1}{2}\kappa(\nabla\mu)^2-h\mu+\frac{1}{2}\tilde r\mu^2+\frac{\tilde g}{4!}\mu^4\right]~, \ \ \ \ \ (32)$

where the coefficients ${\kappa}$ , ${\tilde r}$ , and ${\tilde g}$ are some (analytic) functions of the physical parameters, and can be expressed in terms of the zero modes of the inverse coupling matrix. I’m not going to go through the details of that computation here, since a great exposition is already available in the answer to this post on Stack Exchange (note however that they do not keep the linear ${h\mu}$ term).

The main lesson of this field-theoretic exercise is that MFT is nothing more than the leading saddle point of (31). Denoting the minimum ${\mu_0}$ , and expanding the action to second order in the fluctuations ${(\mu-\mu_0)}$ , we have

$\displaystyle Z\approx\mathcal{N} e^{-S[\mu_0]}\int\!\mathcal{D}\mu\,e^{-\frac{1}{2}(\mu-\mu_0)^2S''[\mu_0]}~, \ \ \ \ \ (33)$

where the prime denotes variation with respect to ${\mu}$ , and the linear term has vanished by definition, i.e., ${\mu_0}$ is given by

$\displaystyle \frac{\delta S}{\delta\mu}=0 \quad\implies\quad \kappa\nabla^2\mu_0=-h+\tilde r\mu_0+\frac{\tilde g}{6}\mu_0^3~, \ \ \ \ \ (34)$

where we have applied integration by parts and assumed that the field vanishes at infinity. If we then keep only the leading-order saddle point, the partition function is given entirely by the prefactor

$\displaystyle Z\approx\mathcal{N} e^{-S[\mu_0]} \quad\mathrm{with}\quad S[\mu_0]=\int\!\mathrm{d}^d\mathbf{x}\left(\tilde r\mu_0^2+\frac{\tilde g}{8}\mu_0^4-\frac{3}{2}h\mu_0\right)~, \ \ \ \ \ (35)$

so that the free energy is

$\displaystyle f_\textrm{\tiny{sp}}=\frac{1}{\beta}\left(\tilde r\mu_0^2+\frac{\tilde g}{2}\mu_0^4-\frac{3}{4}h\mu_0\right) =\frac{\hat r}{2}\hat\mu_0^2+\frac{\hat g}{4!}\hat\mu_0^4-\hat h\hat\mu_0~, \ \ \ \ \ (36)$

where the subscript “sp” stands for “saddle point”, and we have dropped the non-dynamical ${\ln\mathcal{N}}$ term. In the second equality, we have simply extracted the factor of ${\beta}$ from ${\mu}$ by defining ${\mu=\beta\hat\mu}$ (cf. the absorption of ${\beta}$ into the coefficients ${J,h}$ in (18), and the definition of ${\mu}$ below (22)), and defined ${2\hat r\equiv\beta\tilde r}$ , ${\hat g\equiv12\tilde g\beta^3}$ , and ${4\hat h\equiv3h}$ . For ${h\!=\!0}$ , this is formally identical to ${f_\textrm{\tiny{MF}}}$ obtained above, cf. (16)! (By this point the two theories are technically different, though Kopietz et al. [1] do give an argument as to how one might match the coefficients; otherwise one can compute them explicitly via Fourier transform as mentioned above).
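For a homogeneous saddle ( ${\nabla\mu_0=0}$ ), the condition (34) reduces to a cubic equation, and the global minimum of the Landau form on the right-hand side of (36) can be found numerically. The following is a minimal sketch; the function name `landau_minimum` and the coefficient values are illustrative assumptions, with generic `r`, `g`, `h` standing in for the hatted quantities:

```python
import numpy as np

def landau_minimum(r, g, h=0.0):
    """Global minimum of V(mu) = r/2 mu^2 + g/4! mu^4 - h mu, with g > 0."""
    # Stationary points solve (g/6) mu^3 + r mu - h = 0.
    roots = np.roots([g / 6.0, 0.0, r, -h])
    real = roots[np.abs(roots.imag) < 1e-9].real
    V = r / 2 * real**2 + g / 24 * real**4 - h * real
    return real[np.argmin(V)]

print(landau_minimum(r=1.0, g=1.0))            # disordered phase: mu_0 = 0
print(landau_minimum(r=-1.0, g=1.0, h=0.01))   # ordered phase, small positive field
```

For ${r>0}$ the only real stationary point at ${h\!=\!0}$ is the origin, while for ${r<0}$ the minima sit at ${\pm\sqrt{-6r/g}}$ and a small field selects the one with matching sign.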

Now suppose we kept the leading correction to the MFT result, given by the quadratic term in the path integral (33). For ${S''[\mu_0]}$ , we have the operator

$\displaystyle K\equiv\frac{\delta^2S}{\delta\mu_x\delta\mu_y}=\left(-\kappa\nabla^2+\tilde r+\frac{\tilde g}{2}\mu_0^2\right)\delta^d(\mathbf{x}-\mathbf{y})~, \ \ \ \ \ (37)$

where ${\mu_x\equiv\mu(\mathbf{x})}$ . Substituting this into (33) and doing the Gaussian integral, one finds that the contribution from this term is given by the determinant of the operator ${K}$ , i.e., the product of its eigenvalues, which enters the free energy through ${\tfrac{1}{2}\ln\det K}$ . I’m not going to go through this in detail, since this post is getting long-winded, and McGreevy’s notes already do a great job. The result is an additional contribution to the free energy that shifts the location of the critical point. Whether or not these higher-order corrections are important depends not only on the size of the fluctuations, but also on the spatial dimension of the system. It turns out that for systems in the Ising universality class (that is, systems whose critical points are characterized by the same set of critical exponents), the MFT result is good enough in ${d\!\geq\!4}$ , but the fluctuations diverge in ${d\!<\!4}$ and hence render its conclusions invalid. We’ll give a better explanation for this dimension-dependent validity below.

There’s another way to see the breakdown of MFT at the critical point in a manner that makes more transparent the precise role of the higher-order terms in the expansion, via the renormalization group. Suppose we’d included all higher-order terms in (32)—that is, all terms consistent with the symmetries of the problem (rotation & translation invariance, and ${\mathbb{Z}_2}$ symmetry if ${h\!=\!0}$ ). The result is called the Ginzburg-Landau action, after the eponymous authors who first used it to study systems near critical points. Now observe that the field ${\mu}$ has mass dimension ${\tfrac{d-2}{2}}$ , so the squared mass ${\tilde r}$ has dimension 2, the quartic coupling ${\tilde g}$ has dimension ${4\!-\!d}$ , a sextic coupling would have dimension ${6\!-\!2d}$ , and so on. Recall that a coupling is relevant if its mass dimension ${\Delta\!>\!0}$ (since the dimensionless coupling carries a factor of ${E^{-\Delta}}$ , e.g., ${m^2=\tilde r/E^2}$ ), irrelevant if ${\Delta\!<\!0}$ (since it runs like ${E^\Delta}$ ), and marginal if ${\Delta\!=\!0}$ . Thus we see that the quadratic term is always relevant, and that higher-order corrections are increasingly suppressed under RG in a dimension-dependent manner.
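The power counting above is simple enough to automate. Here is a minimal sketch that computes the mass dimension of the coupling multiplying a ${\mu^n}$ term and classifies it; the function names are illustrative:

```python
from fractions import Fraction

def coupling_dimension(n, d):
    """Mass dimension of the coupling of a mu^n term in d dimensions."""
    field_dim = Fraction(d - 2, 2)          # [mu] = (d - 2) / 2
    return d - n * field_dim                # the action itself is dimensionless

def classify(n, d):
    delta = coupling_dimension(n, d)
    return "relevant" if delta > 0 else "irrelevant" if delta < 0 else "marginal"

for d in (2, 3, 4, 5):
    print(d, {n: classify(n, d) for n in (2, 4, 6)})
```

The output makes the dimension dependence explicit: the quadratic coupling is relevant in every dimension, the quartic coupling becomes marginal at ${d\!=\!4}$ , and the sextic coupling is already marginal at ${d\!=\!3}$ .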

So, a more sophisticated alternative to our particular MFT attempt above (where we kept the quartic ${\mu_0^4}$ term in the saddle point) is to compute the Gaussian path integral consisting of only the quadratic contribution, and treat the quartic and higher terms perturbatively. (As these originally arose from higher-order terms in the Taylor expansion, this is morally in line with simply taking ${\mu_0\ll1}$ in the MFT result ${f_\textrm{\tiny{sp}}}$ ). Treating the action as a Gaussian integral also allows us to obtain a simple expression for the two-point correlator that captures the limiting behaviour in which we’re primarily interested. That is, tying all this back to the information theory / neural network connections alluded to in the introduction, we’re ultimately interested in understanding the propagation of information near the critical point, so understanding how correlation functions behave in the leading-order / MFT / saddle point approximation, and how perturbative corrections from fluctuations might affect this, is of prime importance.

We thus consider the partition function (31), with the quadratic action

$\displaystyle S[\mu]=\int\!\mathrm{d}^d\mathbf{x}\left[\frac{1}{2}(\nabla\mu(\mathbf{x}))^2+\frac{m^2}{2}\mu(\mathbf{x})^2-h\mu(\mathbf{x})\right]~, \ \ \ \ \ (38)$

where we’ve set ${\kappa\!=\!1}$ and relabelled the quadratic coefficient ${m^2}$ . Evaluating partition functions of this type is a typical exercise in one’s first QFT course, since the action now resembles that of a free massive scalar field, where ${-h}$ plays the role of the source (normally denoted ${J}$ ). The basic prescription is to Fourier transform to momentum space, where the modes decouple, and then absorb the remaining source-independent term into the overall normalization. The only difference here is that we’re in Euclidean rather than Lorentzian signature, so there are no issues of convergence; see for example Tong’s statistical field theory notes for a pedagogical exposition. The result is

$\displaystyle Z\simeq\exp\frac{1}{2}\!\int\!\frac{\mathrm{d}^dk}{(2\pi)^d}\frac{\tilde h_{\mathbf{k}}\tilde h_{-\mathbf{k}}}{k^2+m^2} =\exp\frac{1}{2}\!\int\!\mathrm{d}^dx\mathrm{d}^dy\,h(\mathbf{x})G(\mathbf{x}-\mathbf{y})h(\mathbf{y})~, \ \ \ \ \ (39)$

where the position-dependence of the external field ${h(\mathbf{x})}$ merely serves as a mnemonic, and ${\tilde h_\mathbf{k}}$ is the same field in momentum space. In the second equality, we’ve simply Fourier transformed back to real space by identifying the propagator

$\displaystyle G(\mathbf{x}-\mathbf{y})=\int\!\frac{\mathrm{d}^dk}{(2\pi)^d}\frac{e^{-i\mathbf{k}\cdot(\mathbf{x}-\mathbf{y})}}{k^2+m^2}~, \ \ \ \ \ (40)$

which describes the correlation between the field at ${\mathbf{x}}$ and at ${\mathbf{y}}$ . To see that this is indeed a correlation function, recall that the variance is given by the second cumulant:

$\displaystyle \kappa_2=\frac{\partial^2\ln Z}{\partial h^2} \overset{\tiny{(38)}}{=} \langle\mu_x\mu_y\rangle-\langle\mu_x\rangle\langle\mu_y\rangle \overset{\tiny{(40)}}{=} G(\mathbf{x}-\mathbf{y})~, \ \ \ \ \ (41)$

and thus ${G}$ is indeed the connected 2-point correlator. (I should mention that in the present case, there’s a special name for this which seems to be preferred by condensed matter theorists: it’s the magnetic susceptibility, defined as the sensitivity of ${s}$ with respect to ${h}$ ,

$\displaystyle \chi=\partial_hs~, \ \ \ \ \ (42)$

where the connection arises by observing that the magnetization is none other than the mean (i.e., the first cumulant),

$\displaystyle s=\frac{\partial\ln Z}{\partial h}=\langle\mu\rangle~. \ \ \ \ \ (43)$

But I’ll continue to refer to it as the correlation function, or the connected Green function, since calling it the “susceptibility” obscures its deeper physical and information-theoretic significance. Actually seeing that this is a Green function does however require slightly more work.)
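To make the cumulant relations concrete in the simplest possible setting, here is a minimal sketch for a single spin with ${Z=2\cosh h}$ . This is a toy stand-in rather than the field theory above, chosen so that the source enters the Boltzmann weight as ${e^{+h\sigma}}$ . Finite differences of ${\ln Z}$ reproduce the mean and the variance, and the latter is exactly the susceptibility ${\partial_h\langle\sigma\rangle}$ :

```python
import numpy as np

def ln_Z(h):
    """Single spin sigma = +/-1 with Boltzmann weight exp(h * sigma)."""
    return np.log(2 * np.cosh(h))

h, eps = 0.4, 1e-4
first = (ln_Z(h + eps) - ln_Z(h - eps)) / (2 * eps)                # first cumulant
second = (ln_Z(h + eps) - 2 * ln_Z(h) + ln_Z(h - eps)) / eps**2    # second cumulant

mean = np.tanh(h)               # <sigma>
variance = 1 - np.tanh(h)**2    # <sigma^2> - <sigma>^2, since sigma^2 = 1
print(first, mean)
print(second, variance)
```

The same logic carries over to the field theory, with ordinary derivatives replaced by functional derivatives with respect to ${h(\mathbf{x})}$ .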

The evaluation of (40) is treated very nicely in the aforementioned notes by Tong. In brief, we proceed by defining a length scale ${\xi^2=m^{-2}}$ , and use the identity

$\displaystyle \int_0^\infty\!\mathrm{d} t\,e^{-t(k^2+\xi^{-2})}=\frac{1}{k^2+1/\xi^2} \ \ \ \ \ (44)$

to massage the integral into the following form:

$\displaystyle G(r)=\frac{1}{(4\pi)^{d/2}}\int_0^\infty\!\mathrm{d} t\,t^{-d/2}e^{-r^2/4t-t/\xi^2} \ \ \ \ \ (45)$

which is obtained by completing the square in the exponential and performing the integral over ${\mathrm{d}^dk}$ ; we’ve also used rotation invariance with ${r=|\mathbf{x}-\mathbf{y}|}$ (not to be confused with the old name for the quadratic coefficient).
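The one-dimensional integral (45) is straightforward to evaluate numerically. As a check, the sketch below compares it in ${d\!=\!3}$ against the standard closed form ${e^{-r/\xi}/(4\pi r)}$ for the massive propagator; that closed form is not derived in the text above and is quoted here as a known result. The logarithmic grid and its range are illustrative choices:

```python
import numpy as np

def G_numeric(r, xi, d, n=20001):
    """Evaluate (45) on a logarithmic grid, t = exp(u)."""
    u = np.linspace(-40.0, 40.0, n)
    t = np.exp(u)
    # dt = t du, so the integrand picks up one extra power of t.
    integrand = t ** (1 - d / 2) * np.exp(-r**2 / (4 * t) - t / xi**2)
    du = u[1] - u[0]
    return np.sum(integrand) * du / (4 * np.pi) ** (d / 2)

r, xi = 2.0, 1.5
print(G_numeric(r, xi, d=3), np.exp(-r / xi) / (4 * np.pi * r))
```

The substitution ${t=e^u}$ is used because the integrand varies over many orders of magnitude in ${t}$ , but is a smooth, rapidly decaying bump in ${u}$ .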

As will shortly become apparent, ${\xi}$ is the correlation length that determines the size of fluctuations, and hence the spatial structure of correlations. Since we’re primarily interested in the limiting cases where ${r\gg\xi}$ and ${r\ll\xi}$ , it is more illuminating to evaluate the integral via saddle point, rather than to preserve the exact form (which, as it turns out, can be expressed as a Bessel function). We thus exponentiate the ${t^{-d/2}}$ factor to write

$\displaystyle G(r)=\frac{1}{(4\pi)^{d/2}}\int_0^\infty\!\mathrm{d} t\,e^{-P(t)}~, \qquad P(t)\equiv\frac{r^2}{4t}+\frac{t}{\xi^2}+\frac{d}{2}\ln t~, \ \ \ \ \ (46)$

so that we have, to second order,

$\displaystyle G(r)\sim \sqrt{\frac{\pi}{2P''(t_*)}}\,e^{-P(t_*)}~, \ \ \ \ \ (47)$

where the saddle point ${t_*}$ is given by

$\displaystyle P'(t_*)=0\implies t_*=\frac{\xi^2}{2}\left(-\frac{d}{2}+\sqrt{\frac{d^2}{4}+\frac{r^2}{\xi^2}}\right) \approx\begin{cases} \frac{r^2}{2d}\; & r\ll\xi~,\\ \frac{r\xi}{2}\; & r\gg\xi~. \end{cases} \ \ \ \ \ (48)$

Substituting these case values into (47), we find the following limiting behaviour for the correlation function:

$\displaystyle G(r)\sim\begin{cases} \frac{1}{r^{d-2}}\; & r\ll\xi~,\\ \frac{e^{-r/\xi}}{r^{(d-1)/2}}\; & r\gg\xi~. \end{cases} \ \ \ \ \ (49)$
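The saddle point location and its two limiting forms in (48) can be checked numerically. This is a minimal sketch with illustrative values of `d`, `xi`, and `r`:

```python
import numpy as np

def t_star(r, xi, d):
    """Saddle point of P(t) from (48)."""
    return 0.5 * xi**2 * (-d / 2 + np.sqrt(d**2 / 4 + r**2 / xi**2))

def P_prime(t, r, xi, d):
    """Derivative of P(t) = r^2/(4t) + t/xi^2 + (d/2) ln t."""
    return -r**2 / (4 * t**2) + 1 / xi**2 + d / (2 * t)

d, xi = 3, 1.0

# The closed form is a stationary point of P.
print(P_prime(t_star(2.0, xi, d), 2.0, xi, d))

# Compare against the two limiting forms: r^2 / (2d) and r * xi / 2.
for r in (1e-3, 1e3):
    print(r, t_star(r, xi, d), r**2 / (2 * d), r * xi / 2)
```

For ${r\ll\xi}$ the saddle is insensitive to ${\xi}$ , which is why the short-distance correlator in (49) is a pure power law, while for ${r\gg\xi}$ the saddle depends on ${\xi}$ and produces the exponential suppression.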

Recalling that ${m^2\sim|T-T_c|}$ near the critical point, we see that the correlation length diverges as

$\displaystyle \xi\sim\frac{1}{|T-T_c|^{1/2}} \ \ \ \ \ (50)$

as the system approaches criticality. This means that at the critical point, we are always in the regime ${r\ll\xi}$ , and hence the correlator falls off as a power law rather than exponentially. Another way to say this is that there is no longer any length scale in the problem (since that role was played by ${\xi}$ , which has gone to infinity). This is why the correlator at criticality must be a power law: any other function would require a length scale on dimensional grounds.
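
The power-law regime can also be checked numerically, directly from the integral representation (45). The following is a rough sketch in plain Python, not taken from Tong's notes: `G_numeric` is a hypothetical helper that substitutes ${t=e^u}$ and applies the trapezoid rule, and the grid parameters are ad hoc choices that I have not tuned. In ${d\!=\!3}$ the integral can be done in closed form, ${G(r)=e^{-r/\xi}/4\pi r}$ , which provides a convenient cross-check.

```python
import math

def G_numeric(r, xi, d, n=6000, u_min=-40.0, u_max=40.0):
    """Trapezoid-rule estimate of (45) after substituting t = exp(u)."""
    h = (u_max - u_min) / n
    total = 0.0
    for i in range(n + 1):
        u = u_min + i * h
        t = math.exp(u)
        weight = 0.5 if i in (0, n) else 1.0
        # dt = t du, so the integrand picks up one extra power of t
        total += weight * t ** (1.0 - d / 2.0) * math.exp(-r**2 / (4.0 * t) - t / xi**2)
    return h * total / (4.0 * math.pi) ** (d / 2.0)

xi = 1.0  # illustrative value only

# d = 3: compare against the closed form exp(-r/xi) / (4 pi r)
r = 2.0
exact = math.exp(-r / xi) / (4.0 * math.pi * r)
print("d=3:", G_numeric(r, xi, 3), "vs closed form", exact)

# d = 5 with r << xi: the local log-slope should be close to -(d - 2) = -3
r1, r2 = 1e-3, 2e-3
slope = math.log(G_numeric(r2, xi, 5) / G_numeric(r1, xi, 5)) / math.log(r2 / r1)
print("d=5 log-slope for r << xi:", slope)
```

The log-slope is a scale-free diagnostic: in the regime ${r\ll\xi}$ it should not depend on which two radii are compared, which is the numerical counterpart of the statement that no length scale remains.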

In the previous MFT treatment, we mentioned that fluctuations can change the results. From the RG perspective, this is because the quadratic coupling (which determines the location of the critical point) may be adjusted in the renormalization process as we integrate out UV modes. (In fact, we saw an explicit example of this in our post on deep learning and the renormalization group.) The lower the dimension, the more relevant operators we need to take into account; in particular, all operators are relevant in ${d\!=\!2}$ , so the saddle point approximation is exceedingly poor. In contrast, as the dimension increases, more and more operators are suppressed under RG flow. In the lattice picture, we can understand the fact that MFT gets more accurate in higher dimensions by noting that more dimensions means more neighbors, and hence approximating the degrees of freedom by the mean field is more likely to be accurate.

Finally, let us return to the comment we made at the beginning of this post, namely that the correlation length diverges at a critical point. This is another way of understanding the breakdown of MFT, since a divergent correlation length implies that fluctuations on all scales are important (and hence we neglect them at our peril). Explicitly, MFT (broadly interpreted) is valid when the fluctuations are much smaller than the mean or background field around which they’re fluctuating, i.e., ${\langle\mu^2\rangle\ll\langle\mu\rangle^2}$ . Tong offers a clean way to see the dimensional dependence explicitly: simply integrate these expectation values over a ball of radius ${\xi}$ and compare the ratio

$\displaystyle R\equiv\frac{\int_0^\xi\!\mathrm{d}^dx\langle\mu(\mathbf{x})\mu(0)\rangle}{\int_0^\xi\!\mathrm{d}^dx\langle\mu\rangle^2} \simeq \frac{1}{\mu_0^2\xi^d}\int_0^\xi\!\mathrm{d} r\frac{r^{d-1}}{r^{d-2}} =\frac{\xi^{2-d}}{\mu_0^2} \sim|T-T_c|^{(d-4)/2}~, \ \ \ \ \ (51)$

where ${\mu_0=\langle\mu\rangle}$ is the mean field from above, and in the last step we have used the scaling behaviours ${\xi\sim|T-T_c|^{-1/2}}$ and ${\mu_0\sim|T-T_c|^{1/2}}$ (the latter can be obtained by minimizing the quartic result for ${f_\textrm{\tiny{MF}}}$ (16)). This ratio blows up as ${T\rightarrow T_c}$ whenever the exponent is negative, so upon demanding that it be much less than unity, we see that the MFT results (for the Ising universality class) are only trustworthy in ${d\geq4}$ dimensions. (The case ${d\!=\!4}$ actually requires a more careful RG treatment due to the logarithmic divergence; see Tong’s notes for more details.)
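
The dimensional dependence of this criterion is easy to tabulate. The snippet below is only a sketch of the scaling form of ${R}$ : all ${O(1)}$ prefactors are dropped, `t` stands for the reduced temperature ${|T-T_c|}$ in arbitrary units, and the name `ginzburg_ratio` is my own label rather than anything from Tong's notes.

```python
def ginzburg_ratio(t, d):
    """Scaling form of R in (51), with all O(1) prefactors dropped."""
    xi = abs(t) ** -0.5   # correlation length, xi ~ |T - T_c|^(-1/2)
    mu0_sq = abs(t)       # mean field squared, mu_0^2 ~ |T - T_c|
    return xi ** (2 - d) / mu0_sq

# Approach the critical point (t -> 0) in several dimensions
for d in (2, 3, 4, 5, 6):
    ratios = [ginzburg_ratio(t, d) for t in (1e-2, 1e-4, 1e-6)]
    print(f"d={d}:", ", ".join(f"{R:.3g}" for R in ratios))
```

Below four dimensions the ratio grows without bound as ${t\rightarrow0}$ , above four it shrinks, and in ${d\!=\!4}$ this crude scaling form is constant, which is the marginal case that needs the more careful treatment.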

To summarize: MFT is a useful approximation method that averages over interactions and enables one to obtain closed-form expressions for otherwise intractable partition functions. It is tantamount to the saddle point approximation and, in the context of RG, may be qualitatively altered by any relevant higher-order terms. While these corrections can potentially shift the location of the critical point, the basic fact that the correlation length diverges at criticality remains unchanged. As we’ll see in part 2, it is this feature that makes phase transitions interesting from a computational perspective, since it means that the propagation of information at this point is especially stable.

References

1. P. Kopietz, L. Bartosch, and F. Schütz, “Mean-field Theory and the Gaussian Approximation,” Lect. Notes Phys. 798 (2010).
