Restricted Boltzmann machines

As a theoretical physicist making their first foray into machine learning, one is immediately captivated by the fascinating parallel between deep learning and the renormalization group. In essence, both are concerned with the extraction of relevant features via a process of coarse-graining, and preliminary research suggests that this analogy can be made rather precise. This is particularly interesting in the context of attempts to develop a theoretical framework for deep learning; insofar as the renormalization group is well-understood in theoretical physics, the strength of this mathematical analogy may pave the way towards a general theory of deep learning. I hope to return to this tantalizing correspondence soon; but first, we need to understand restricted Boltzmann machines (RBMs).

As the name suggests, an RBM is a special case of the Boltzmann machines we’ve covered before. The latter are useful for understanding the basic idea behind energy-based generative models, but it turns out that all-to-all connectivity leads to some practical problems when training such networks. In contrast, an RBM restricts certain connections (hence the name), which makes training them much more efficient. Specifically, the neurons in an RBM are divided into one of two layers, consisting of visible and hidden units, respectively, with intralayer connections prohibited. Visible units essentially correspond to outputs: these are the variables to which the observer has access. In contrast, hidden units interact only through their connection with the visible layer, but in such a way as to encode complex, higher-order interactions between the visible degrees of freedom. This ability to encode such latent variables is what makes RBMs so powerful, and underlies the close connection with the renormalization group (RG).

Latent or hidden variables correspond to the degrees of freedom one integrates out (“marginalizes over”, in machine learning (ML) parlance) when performing RG. When properly performed, this procedure ensures that the relevant physics is encoded in correlations between the remaining (visible) degrees of freedom, while allowing one to work with a much simpler model, analogous to an effective field theory. To understand how this works in detail, suppose we wish to apply Wilsonian RG to the Ising model (a pedagogical review by Dimitri Vvedensky can be found here). To do this, we must first transform the (discrete) Ising model into a (continuous) field theory, so that momentum-space RG techniques can be applied. This is achieved via a trick known as the Hubbard-Stratonovich transformation, which neatly illustrates how correlations between visible/relevant variables can be encoded via hidden/irrelevant degrees of freedom.

The key to the Hubbard-Stratonovich transformation is the observation that for any value of ${s}$ ,

$\displaystyle e^{\frac{1}{2}Ks^2}=\left(\frac{K}{2\pi}\right)^{1/2}\!\int_{-\infty}^\infty\!\mathrm{d}\phi\,\exp\left(-\frac{1}{2}K\phi^2+Ks\phi\right)~. \ \ \ \ \ (1)$

Since we’ll be working with a discrete lattice of spins, it’s helpful to write this in matrix notation:

$\displaystyle e^{\frac{1}{2}\sum_{ij}K_{ij}s_is_j} =\left[\frac{\mathrm{det}K}{(2\pi)^N}\right]^{1/2}\prod_{k=1}^N\int\!\mathrm{d}\phi_k\exp\left(-\frac{1}{2}\sum_{ij}K_{ij}\phi_i\phi_j+\sum_{ij}K_{ij}s_i\phi_j\right)~, \ \ \ \ \ (2)$

where ${N}$ is the total number of spins. Now consider the Ising model with nearest-neighbor interactions, whose hamiltonian is

$\displaystyle H=-\frac{1}{2}\sum_{ij}J_{ij}s_is_j~, \ \ \ \ \ (3)$

with spins ${s_i=\pm1}$ , and symmetric coupling matrix ${J_{ij}=J}$ if ${s_i}$ and ${s_j}$ are nearest-neighbors, and zero otherwise. The partition function is

$\displaystyle Z=\sum_{s_i=\pm1}e^{\frac{1}{2}\sum_{ij}K_{ij}s_js_j}~, \ \ \ \ \ (4)$

which is typically solved by the transfer-matrix method. To recast this as a field theory, we observe that the sum over the second term in (2) may be written

$\displaystyle \begin{aligned} {}&\sum_{s_i=\pm1}\exp\left(\sum\nolimits_{ij}K_{ij}s_i\phi_j\right) =\sum_{s_i=\pm1}\prod_i\exp\left(\sum\nolimits_{j}K_{ij}s_i\phi_j\right)\\ &=\prod_i\left[\exp\left(\sum\nolimits_{j}K_{ij}\phi_j\right)+\exp\left(-\sum\nolimits_{j}K_{ij}\phi_j\right)\right]\\ &=2^N\prod_i\cosh\left(\sum\nolimits_{j}K_{ij}\phi_j\right) =2^N\exp\left[\sum\nolimits_i\ln\cosh\left(\sum\nolimits_jK_{ij}\phi_j\right)\right]~, \end{aligned} \ \ \ \ \ (5)$

where the spin degrees of freedom have been summed over, and we’ve re-exponentiated the result in preparation for its insertion below. This enables us to express the partition function entirely in terms of the latent variable ${\phi}$ :

$\displaystyle Z=\left[\frac{\mathrm{det}K}{(\pi/2)^N}\right]^{1/2}\prod_{k=1}^N\int\!\mathrm{d}\phi_k\exp\left(-\frac{1}{2}\sum_{ij}K_{ij}\phi_i\phi_j+\sum\nolimits_i\ln\cosh\sum\nolimits_jK_{ij}\phi_j\right)~, \ \ \ \ \ (6)$

where, since the first term in (2) is independent of ${s}$ , the sum over spins simply introduces a prefactor of ${2^N}$ . We have thus obtained an exact transformation from the original, discrete model of spins to an equivalent, continuum field theory. To proceed with renormalization of this model, we refer the interested reader to the aforementioned Vvedensky reference. The remarkable upshot, for our purposes, is that all of the physics of the original spin degrees of freedom are entirely encoded in the new field ${\phi}$ . To connect with our ML language above, one can think of this as a latent variable that mediates the correlations between the visible spins, analogous to how UV degrees of freedom give rise to effective couplings at low-energies.

So what does this have to do with (restricted) Boltzmann machines? This is neatly explained in the amazing review by Pankaj Mehta et al., A high-bias, low-variance introduction to Machine Learning for physicists, which receives my highest recommendations. The idea is that by including latent variables in generative models, one can encode complex interactions between visible variables without sacrificing trainability (because the correlations between visible degrees of freedom are mediated via the UV degrees of freedom over which one marginalizes, rather than implemented directly as intralayer couplings). The following exposition draws heavily from section XVI of Mehta et al., and the reader is encouraged to turn there for more details.

Consider a Boltzmann distribution with energy

$\displaystyle \begin{aligned} E(\mathbf{v})&=-\sum_ia_iv_i-\frac{1}{2}\sum_{ij}J_{ij}v_iv_j\\ &=-\sum_ia_iv_i-\frac{1}{2}\sum_{ij\mu}W_{i\mu}W_{\mu j}v_iv_j~, \end{aligned} \ \ \ \ \ (7)$

where ${\mathbf{v}}$ is a vector of visible neurons, and on the second line we’ve appealed to SVD to decompose the coupling matrix as ${J_{ij}=\sum\nolimits_\mu W_{i\mu}W_{\mu j}}$ . In its current form, the visible neurons interact directly through the second, coupling term. We now wish to introduce latent or hidden variables ${\mathbf{h}=\{h_\mu\}}$ that mediate this coupling instead. To do this, consider the Hubbard-Stratonovish transformation (1), with ${K=1}$ , ${\mathbf{s}=(W\mathbf{v})^T}$ , and ${\phi=-\mathbf{h}}$ . Then the second term in (7) becomes

$\displaystyle e^{\frac{1}{2}(W\mathbf{v})^T(W\mathbf{v})} =(2\pi)^{-\frac{1}{2}}\int\!\mathrm{d}\mathbf{h}\,\exp\!\left(-\frac{1}{2}\mathbf{h}^T\mathbf{h}-(W\mathbf{v})^T\mathbf{h}\right) \ \ \ \ \ (8)$

Of course, this is simply a special case of the following multidimensional Gaussian integral: recall that if ${A}$ is an ${n}$ -dimensional, symmetric, positive-definite matrix, then

$\displaystyle \begin{aligned} {}&\int\mathrm{d}^nx\,\exp\!\left(-\frac{1}{2}\sum_{ij}A_{ij}x_ix_j+\sum_iB_ix_i\right)\\ &=\int\mathrm{d}^nx\,\exp\!\left(-\frac{1}{2}\mathbf{x}^TA\mathbf{x}+B^T\mathbf{x}\right) =\sqrt{\frac{(2\pi)^n}{\mathrm{det}A}}e^{\frac{1}{2}B^TA^{-1}B}~. \end{aligned} \ \ \ \ \ (9)$

Thus, reverting to matrix notation, and absorbing the normalization into the partition function ${Z}$ , the Boltzmann distribution can be expressed as

$\displaystyle \begin{aligned} p(\mathbf{v})&=Z^{-1}e^{\sum\nolimits_ia_iv_i}\prod_\mu\int\!\mathrm{d} h_\mu\,\exp\!\left(-\frac{1}{2}\sum_\mu h_\mu^2-\sum_iW_{i\mu}v_ih_\mu\right)\\ &=Z^{-1}\int\!\mathrm{d}\mathbf{h}\,e^{-E(\mathbf{v},\mathbf{h})}~, \end{aligned} \ \ \ \ \ (10)$

where on the second line, we’ve introduced the joint energy function

$\displaystyle E(\mathbf{v},\mathbf{h})=-\sum_ia_iv_i+\frac{1}{2}\sum_\mu h_\mu^2-\sum_{i\mu}W_{i\mu}v_ih_\mu~. \ \ \ \ \ (11)$

The salient feature of this new hamiltonian is that it contains no interactions between the visible neurons: the original, intralayer coupling ${J_{ij}v_iv_j}$ is now mediated via the latent degrees of freedom ${h_\mu}$ instead. As we’ll see below, this basic mechanism can be readily generalized such that we can encode arbitrarily complex — that is, arbitrarily high-order — interactions between visible neurons by coupling to a second layer of hidden neurons.

A general RBM is described by the hamiltonian

$\displaystyle E(\mathbf{v},\mathbf{h})=-\sum\nolimits_ia_i(v_i)-\sum\nolimits_\mu b_\mu(h_\mu)-\sum_{i\mu}W_{i\mu}v_ih_\mu~, \ \ \ \ \ (12)$

where the functions ${a_i(v_i)}$ and ${b_\mu(h_\mu)}$ can be freely chosen. According to Mehta et al., the most common choices are

$\displaystyle a_i(v_i)\equiv \begin{cases} a_iv_i~, & v_i\in\{0,1\} \\ \frac{v_i^2}{2\sigma_i^2}~, & v_i\in\mathbb{R} \end{cases}~, \qquad\quad b_\mu(h_\mu)\equiv \begin{cases} b_\mu h_\mu~, & h_\mu\in\{0,1\} \\ \frac{h_\mu^2}{2\sigma_\mu^2}~, & h_\mu\in\mathbb{R} \end{cases}~. \ \ \ \ \ (13)$

The upper choice, with binary neurons, is referred to as a Bernoulli layer, while the lower choice, with continuous outputs, is called a Gaussian layer. Note that choosing the visible layer to be Bernoulli and the hidden Gaussian reduces to the example considered above, with standard deviations ${\sigma_\mu=1}$ . Of course, if we marginalize over (i.e., integrate out) the latent variable ${\mathbf{h}}$ , we recover the distribution of visible neurons, cf. (10) above, which we may write as

$\displaystyle p(\mathbf{v})=Z^{-1}e^{-E(\mathbf{v})}~, \ \ \ \ \ (14)$

where the marginalized energy of visible neurons is given by

$\displaystyle \begin{aligned} E(\mathbf{v})&=-\ln\int\!\mathrm{d}\mathbf{h}\,e^{-E(\mathbf{v},\mathbf{h})}\\ &=-\sum_ia_i(v_i)-\sum_\mu\ln\int\!\mathrm{d} h_\mu\,\exp\!\left(\sum_\mu b_\mu(h_\mu)+\sum_iW_{i\mu}v_ih_\mu\right)~. \end{aligned} \ \ \ \ \ (15)$

Now, to understand how higher-order correlations are encoded in ${p(\mathbf{v})}$ , we follow Mehta et al. in introducing the distribution of hidden neurons, ignoring their interaction with the visible layer:

$\displaystyle q_\mu(h_\mu)=Z^{-1}e^{b_\mu(h_\mu)}~. \ \ \ \ \ (16)$

The associated cumulant generating function is

$\displaystyle K_\mu(t)=\ln E_q\left[e^{th_\mu}\right] =\ln\int\!\mathrm{d} h_\mu\,q_\mu(h_\mu)e^{th_\mu} =\sum_n\kappa_\mu^{(n)}\frac{t^n}{n!}~, \ \ \ \ \ (17)$

where ${E_q[\cdot]}$ is the expectation value with respect to the distribution ${q_\mu}$ , and the ${n^\mathrm{th}}$ cumulant ${\kappa_n}$ is obtained by differentiating the power series ${n}$ times and evaluating the result at ${t=0}$ , i.e., ${\kappa_n=K_\mu^{(n)}(t)\big|_{t=0}}$ . The reason for introducing (16) and (17) is that the cumulant of ${q_\mu(h_\mu)}$ are actually encoded in the marginal energy distribution (15)! To see this, observe that taking ${t=\sum\nolimits_iW_{i\mu}v_i}$ in the cumulant generating function yields precisely the log term in (15). Therefore we can replace this term with the cumulant expansion, whereupon we obtain

$\displaystyle \begin{aligned} E(\mathbf{v})&=-\sum_ia_i(v_i)-\sum_\mu K_\mu\!\left(\sum\nolimits_i W_{i\mu}v_i\right)\\ &=-\sum_ia_i(v_i)-\sum_{\mu n}\frac{\kappa_\mu^{(n)}}{n!}\left(\sum\nolimits_i W_{i\mu}v_i\right)^n\\ &=-\sum_ia_i(v_i)-\sum_i\!\left(\sum_\mu k_\mu^{(1)}W_{i\mu}\right)v_i -\frac{1}{2}\sum_{ij}\!\left(\sum_\mu k_\mu^{(2)}W_{i\mu}W_{j\mu}\right)v_iv_j+\ldots~. \end{aligned} \ \ \ \ \ (18)$

In other words, after marginalizing over the latent variables, we obtain an effective hamiltonian with arbitrarily high-order interactions between the visible neurons, with the effective couplings weighted by the associated cumulant. As emphasized by Mehta et al., it is the ability to encode such complex interactions with only the simplest couplings between visible and hidden neurons that underlies the incredible power of RBMs. See my later post on the relationship between deep learning and RG for a much more in-depth look at this.

As final comment, there’s an interesting connection between cumulants and statistical physics, which stems from the fact that for a thermal system with partition function ${Z(\beta)=\mathrm{tr}e^{-\beta E}}$ , the free energy ${{F=-\beta^{-1}\ln Z}}$ likewise generates thermodynamic features of the system via its derivatives. Pursuing this here would take us too far afield, but it’s interesting to note yet another point where statistical thermodynamics and machine learning cross paths.

3 Responses to Restricted Boltzmann machines

Francesco says:

February 14, 2021 at 6:07 pm

Hi Ro, thanks a lot for this nice review. It seems the link to Vvedensky‘s review is not working. Do you have the file stored anywhere? That would be very useful. Thanks!

LikeLike

- rojefferson says:
  
  March 3, 2021 at 9:19 pm
  
  Thanks Francesco, I’m glad you liked it! And thanks for pointing out the broken link; sadly I cannot find the notes online anymore, but Dimitri was kind enough to send me the pdfs, which I’ll be happy to email you.
  
  LikeLike
  
Preeti Bhandari says:

February 4, 2024 at 12:45 pm

Hi Ro. Thank you so much for your wonderful explanation.

LikeLike