## Restricted Boltzmann Machines - Theory II

In the last post we have parametrized the model and are now left with the task of learning it. Learning in the context of machine-learning always implies maximizing a predefined objective. Generative models aim to maximize the probability they assign to the input data. We will present the model a number images where the pixels are stored in the vectors v(i). We search for the parameters of the model (here: visible biases b, hidden biases c and weight matrix W) such that the probability the model assigns to the input images is maximized. This probability, also called likelihood, is given as the product

$$P(\{v\}) = \prod_i P(v^{(i)}).$$

To maximize this probability we will use gradient ascent. However when we try to differentiate this product we end up with a sum of products, which we will not be able to compute. Luckily there is an easy way around this problem: by taking the logarithm of this product we turn the likelihood into the log-likelihood and the products into the sum

$$\log P(\{v\}) = \sum_i \log P(v^{(i)}).$$

As the logarithm is a monotonically increasing function, maximizing the logarithm also maximizes its argument.

To simplify the next steps let us denote the log-likelihood by J. We can maximize it by taking small steps along its gradients - a procedure called gradient ascent. So we can increase J by updating the weights by

$$W_{ij} \rightarrow W_{ij} + \epsilon \frac{\partial J}{\partial W_{ij}}.$$

Gradient ascent can be easily understood if we consider how J changes if we perturb the weights by a small step ε ΔWij. If ε is small enough we can Taylor-expand and find that J changes as

$$J(W + \epsilon \Delta W_{ij}) \approx J(W) + \epsilon \sum_{ij} \frac{\partial J}{\partial W_{ij}} \Delta W_{ij}$$

By choosing ε ΔWij equal to ∂J / ∂W this increase is always positive. However this strictly only holds for infinitesimal small ε.

Note that you can treat the biases by including units that are always one, so there is no need to include them in the discussion explicitly.

Inserting the joint probability of the RBM into the log-likelihood we find its derivative as the difference of the derivative of the energy function averaged over the data and averaged over the model expectations

$$\frac{\partial J}{\partial W_{ij}} \propto -\langle \frac{\partial E}{\partial W_{ij}} \rangle_\mathrm{data} +\langle \frac{\partial E}{\partial W_{ij}} \rangle_\mathrm{model}.$$

We can understand this as follows: By following the gradient, we decrease the energy of the data samples and increase the energy of the model expectations. As lower energy states are more probable, this increases the probability of the data and decreases the probability of what the model would come up with on its own.

We can derive the gradient of the log-likelihood J by the following steps:

\begin{aligned} \frac{\partial J}{\partial W_{ij}} &= \sum_k \frac{\partial}{\partial W_{ij}} \log P(v^{(k)}) \\ &= \sum_k \frac{\partial}{\partial W_{ij}} \left[ \log \sum_{h} \exp(-E(v^{(k)},h)) - \log Z \right] \end{aligned}

The first term will give the expectation over the data, while the second term will give the expectation over the model.

Concentrating on the first term:

\begin{aligned} &\sum_k \frac{\partial}{\partial W_{ij}} \log \sum_{ { h } } \exp(-E(v^{(k)},h)) \\ &= \sum_k \frac{1}{\sum_{ { h } } \exp(-E(v,h))} \sum_{ { h } } \exp(-E(v^{(k)},h)) (-\frac{\partial}{\partial W_{ij}} E(v^{(k)}, h)) \\ &= -N_v \frac{1}{N_v} \sum_k \sum_{ { h } } P(h | v^{(k)}) \frac{\partial}{\partial W_{ij}} E(v^{(k)}, h)). \end{aligned}

Here, Nv is the number of visible samples. Furthermore the sum over all data vectors can be considered the sum over all possible visible vectors multiplied by the probability of that vector being part of the input data. If we define the probability density of the input data as

$$P_\mathrm{data}(v, h) = \frac{1}{N_v} \sum_k \delta(v - v^{(k)}) P(h | v^{(k)}),$$

we can rewrite the first part of the derivative as

\begin{aligned} &\sum_k \frac{\partial}{\partial W_{ij}} \log \sum_{ { h } } \exp(-E(v^{(k)},h)) \\ &= -N_v \sum_{ {v, h } } P_\mathrm{data}(v, h) \frac{\partial}{\partial W_{ij}} E(v^{(k)}, h)) \\ &= -N_v \langle \frac{\partial E}{\partial W_{ij}} \rangle_\mathrm{data}. \end{aligned}

The derivative of the partition functions results in

\begin{aligned} &-\sum_k \frac{\partial}{\partial W_{ij}} \log Z \\ &= -N_v \frac{\partial}{\partial W_{ij}} \log \sum_{ { v, h } } exp[ -E(v, h) ] \\ &= N_v \log \frac{1}{\sum_{ { v, h } } exp[ -E(v, h) ]} \sum_{ { v, h } } exp[ -E(v, h) ] \frac{\partial}{\partial W_{ij}} E(v, h) \\ &= N_v \sum_{ { v, h } } P_\mathrm{model}(v, h) \frac{\partial}{\partial W_{ij}} E(v, h) \\ &= N_v \langle \frac{\partial E}{\partial W_{ij}} \rangle_\mathrm{model}. \end{aligned}

Combining both terms we find:

$$\sum_k \frac{\partial}{\partial W_{ij}} \log P(v^{(k)}) = -N_v \langle \frac{\partial E}{\partial W_{ij}} \rangle_\mathrm{data} + N_v \langle \frac{\partial E}{\partial W_{ij}} \rangle_\mathrm{model}.$$

All that remains is plugging in the gradient with respect to the weights of the energy defined in the previous post and we end up with

$$\frac{\partial J}{\partial W_{ij}} \propto -\langle v h^T \rangle_\mathrm{data} +\langle v h^T \rangle_\mathrm{model}.$$

Still we cannot calculate this gradient, as it involves expectations over the model, which we cannot perform due to the infeasibility of the partition function. Here we will try to approximate this expectation value by a finite number of samples and often resort to using merely a single sample. To generate the samples we will perform Gibbs sampling.

Remember that we can evaluate the conditional probability of the hidden vector given the visible vector and the converse. Gibbs sampling exploits this fact by alternating between sampling the hidden and the visible vector. If we repeat this process until infinity we will generate unbiased samples from the model. In practice, as popularized by Hinton, a good approximation is start at the data vector and perform a small number of Gibbs sampling steps. This learning rule is called contrastive divergence.

So to summarize the learning algorithm for the Restricted Boltzmann machine can be written (in python):

for visible in input_data:
hidden = sample_hidden_given_visible(visible)