Restricted Boltzmann Machines - Theory I

Before actually building a Restricted Boltzmann machine (RBM), let us first discuss the math behind it.

First, a quick recap of our goal: we try to map the pixels of an image (input data) to the content of this image (label). Specifically, we try to recognize images of numbers. We tackle this problem with a two-step approach: first building a generative model of the input data and then using the output of this model for classification.

Like most machine learning algorithms, RBMs are closely tied to basic concepts of probability theory. Specifically, RBMs are used to model an a-priori unknown probability distribution of your input data. If we denote a sample of your input data by the vector v, RBMs try to learn the probability P(v) of that vector.

Let us consider a visible vector with two binary elements. For this example there are four possible values v can take, (0,0), (1,0), (0,1), (1,1), and the probability distribution P(v) consists of four entries. Generally, if the visible vector has d elements, the probability distribution has 2^d entries. Even for quite small inputs, such as an 8x8 pixel image, this is an astronomically large number with 20 digits.
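
For the 8x8 pixel example, d = 64 binary pixels, so the number of entries works out to

$$2^{64} = 18\,446\,744\,073\,709\,551\,616 \approx 1.8 \cdot 10^{19}.$$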

As we cannot reason directly about P(v) with its enormous number of entries, we have to find a suitable parametrization, where the number of parameters is much smaller. Like many probabilistic models, RBMs are based on the Gibbs distribution, which assigns each visible vector v a free energy F(v). The free energy then determines the probability by

$$P(v) = Z^{-1} \exp[-F(v)]$$

Here Z is the so-called partition function, which ensures that the probability sums to one if we sum over all possible visible vectors. So for our example of a two-element binary vector the partition sum is equal to

$$Z = \exp[-F((0,0))] + \exp[-F((1,0))] + \exp[-F((0,1))] + \exp[-F((1,1))]$$
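
To make the normalization concrete, here is a minimal Python sketch for the two-element example. The toy free energy and its coefficients below are made up purely for illustration; they are not something an RBM would learn.

```python
import numpy as np
from itertools import product

def partition_sum(free_energy, d):
    """Z = sum of exp[-F(v)] over all 2^d binary visible vectors."""
    return sum(np.exp(-free_energy(np.array(v))) for v in product([0, 1], repeat=d))

# Toy free energy, chosen arbitrarily for illustration (not learned from data).
a = np.array([0.5, -1.0])
free_energy = lambda v: -(a @ v)

Z = partition_sum(free_energy, d=2)
for v in product([0, 1], repeat=2):
    print(v, np.exp(-free_energy(np.array(v))) / Z)  # P(v) = exp[-F(v)] / Z
```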

An RBM adds additional unknowns, so-called hidden units h, which are coupled to the visible state. This coupling is again described by an energy function

$$E(v,h) = -v^T W h -v^T b - h^T c$$

Here, W is the weight matrix, b the bias of the visible units, and c the bias of the hidden units. The Gibbs distribution for E(v,h) determines the joint probability of a visible vector v and a hidden vector h:

$$P(v, h) = Z^{-1} \exp[-E(v, h)]$$
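
As a sketch, the energy function translates directly into a few lines of NumPy. The dimensions and random parameters below are arbitrary, just to show the shapes involved.

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - v^T b - h^T c."""
    return -(v @ W @ h) - (v @ b) - (h @ c)

# Arbitrary toy parameters: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # weight matrix
b = rng.normal(size=3)        # visible bias
c = rng.normal(size=2)        # hidden bias

v = np.array([1, 0, 1])
h = np.array([0, 1])
print(energy(v, h, W, b, c))
```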

So now that we have a probability P(v,h) assigned to every value of the visible and hidden vectors, how do we get the distribution of visible vectors? We can compute the marginal distribution by summing over all possible values of the hidden vector:

$$P(v) = Z^{-1} \sum_{ \{ h \} } \exp [ -E(v, h) ]$$

As an example, consider the case that h is a two-dimensional binary vector; then the set {h} covers the values {(0, 0), (0, 1), (1, 0), (1, 1)}, and we sum up the contributions by setting the hidden vector to each one of these in turn.
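
For such small models we can carry out this marginalization by brute force. The following sketch builds on the energy function above, with everything sized arbitrarily; for realistic layer sizes this enumeration is of course intractable.

```python
import numpy as np
from itertools import product

def binary_vectors(n):
    """All 2^n binary vectors of length n."""
    return [np.array(x) for x in product([0, 1], repeat=n)]

def unnormalized_p(v, W, b, c):
    """Sum of exp[-E(v, h)] over all binary hidden vectors h."""
    return sum(np.exp(v @ W @ h + v @ b + h @ c) for h in binary_vectors(W.shape[1]))

def marginal_distribution(W, b, c):
    """P(v) for every binary visible vector, normalized by the partition sum Z."""
    vs = binary_vectors(W.shape[0])
    weights = np.array([unnormalized_p(v, W, b, c) for v in vs])
    return vs, weights / weights.sum()  # division by Z

# Tiny example: 2 visible units, 2 hidden units, arbitrary parameters.
rng = np.random.default_rng(1)
W, b, c = rng.normal(size=(2, 2)), rng.normal(size=2), rng.normal(size=2)
for v, p in zip(*marginal_distribution(W, b, c)):
    print(v, p)
```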

Note that we never specified the value of the partition function Z. The reason is that in general we are not able to, as it involves a sum over all possible values of the visible and hidden vectors. However, as it turns out, for many applications there are good approximations available, so we do not need to determine Z explicitly.

One particularity of the restricted Boltzmann machine is the fact that the conditional distributions can be determined explicitly. The conditional probability P(v|h) is the probability to find the visible vector v if the hidden vector is equal to h. The conditional distribution P(h|v) has an equivalent meaning.

The reason we can determine these probability distributions is that they factorize in their arguments, i.e., we can write P(v|h) as

$$P(v|h) = \prod_i P(v_i|h).$$

The probability that the i-th visible state is equal to one is given by

$$P(v_i = 1 | h) = \frac{1}{1 + \exp[ - \sum_j W_{ij} h_j - b_i]}$$
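
In code, both factorized conditionals reduce to a logistic sigmoid applied elementwise. Here is a minimal sketch, with parameter shapes as in the energy example above; the formula for the hidden units follows by the same argument applied to h.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_v_given_h(h, W, b):
    """P(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + b_i), computed for all i at once."""
    return sigmoid(W @ h + b)

def p_h_given_v(v, W, c):
    """The analogous factorized conditional for the hidden units, P(h_j = 1 | v)."""
    return sigmoid(W.T @ v + c)

# Example with arbitrary toy parameters (3 visible units, 2 hidden units):
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
print(p_v_given_h(np.array([0, 1]), W, b))  # one probability per visible unit
```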

So all that remains now is to optimize the parameters, i.e., the matrix W and the vectors b and c, to maximize the probability of the input data. How this is done will be the topic of the next post.