Department of Computer Science, University of Toronto
6 King's College Rd, Toronto, M5S 3G4, Canada
http://learning.cs.toronto.edu    fax: +1 416 978 1455
Copyright © Geoffrey Hinton 2010.
August 2, 2010
UTML TR 2010-003
A Practical Guide to Training Restricted Boltzmann Machines
Version 1
Geoffrey Hinton
Department of Computer Science, University of Toronto
Contents
1 Introduction
2 An overview of Restricted Boltzmann Machines and Contrastive Divergence
3 How to collect statistics when using Contrastive Divergence
3.1 Updating the hidden states
3.2 Updating the visible states
3.3 Collecting the statistics needed for learning
3.4 A recipe for getting the learning signal for CD1
4 The size of a mini-batch
4.1 A recipe for dividing the training set into mini-batches
5 Monitoring the progress of learning
5.1 A recipe for using the reconstruction error
6 Monitoring the overfitting
6.1 A recipe for monitoring the overfitting
7 The learning rate
7.1 A recipe for setting the learning rates for weights and biases
8 The initial values of the weights and biases
8.1 A recipe for setting the initial values of the weights and biases
9 Momentum
9.1 A recipe for using momentum
10 Weight-decay
10.1 A recipe for using weight-decay
11 Encouraging sparse hidden activities
11.1 A recipe for sparsity
12 The number of hidden units
12.1 A recipe for choosing the number of hidden units
13 Different types of unit
13.1 Softmax and multinomial units
13.2 Gaussian visible units
13.3 Gaussian visible and hidden units
13.4 Binomial units
13.5 Rectified linear units
14 Varieties of contrastive divergence
15 Displaying what is happening during learning
16 Using RBMs for discrimination
16.1 Computing the free energy of a visible vector
17 Dealing with missing values
[1] If you make use of this technical report to train an RBM, please cite it in any resulting publication.
1 Introduction
Restricted Boltzmann machines (RBMs) have been used as generative models of many different
types of data including labeled or unlabeled images (Hinton et al., 2006a), windows of mel-cepstral
coefficients that represent speech (Mohamed et al., 2009), bags of words that represent documents
(Salakhutdinov and Hinton, 2009), and user ratings of movies (Salakhutdinov et al., 2007). In their
conditional form they can be used to model high-dimensional temporal sequences such as video or
motion capture data (Taylor et al., 2006) or speech (Mohamed and Hinton, 2010). Their most
important use is as learning modules that are composed to form deep belief nets (Hinton et al.,
2006a).
RBMs are usually trained using the contrastive divergence learning procedure (Hinton, 2002).
This requires a certain amount of practical experience to decide how to set the values of numerical
meta-parameters such as the learning rate, the momentum, the weight-cost, the sparsity target, the
initial values of the weights, the number of hidden units and the size of each mini-batch. There are also
decisions to be made about what types of units to use, whether to update their states stochastically
or deterministically, how many times to update the states of the hidden units for each training case,
and whether to start each sequence of state updates at a data-vector. In addition, it is useful to know
how to monitor the progress of learning and when to terminate the training.
For any particular application, the code that was used gives a complete specification of all of
these decisions, but it does not explain why the decisions were made or how minor changes will affect
performance. More significantly, it does not provide a novice user with any guidance about how to
make good decisions for a new application. This requires some sensible heuristics and the ability to
relate failures of the learning to the decisions that caused those failures.
Over the last few years, the machine learning group at the University of Toronto has acquired
considerable expertise at training RBMs and this guide is an attempt to share this expertise with
other machine learning researchers. We are still on a fairly steep part of the learning curve, so the
guide is a living document that will be updated from time to time and the version number should
always be used when referring to it.
2 An overview of Restricted Boltzmann Machines and Contrastive
Divergence
Skip this section if you already know about RBMs.
Consider a training set of binary vectors which we will assume are binary images for the purposes
of explanation. The training set can be modeled using a two-layer network called a “Restricted
Boltzmann Machine” (Smolensky, 1986; Freund and Haussler, 1992; Hinton, 2002) in which stochastic,
binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted
connections. The pixels correspond to “visible” units of the RBM because their states are observed;
the feature detectors correspond to “hidden” units. A joint configuration, (v,h) of the visible and
hidden units has an energy (Hopfield, 1982) given by:
E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij} \qquad (1)
where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j are their biases, and w_{ij} is the weight between them. The network assigns a probability to every possible pair of a visible and a
hidden vector via this energy function:
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})} \qquad (2)
where the “partition function”, Z, is given by summing over all possible pairs of visible and hidden
vectors:
Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \qquad (3)
The probability that the network assigns to a visible vector, v, is given by summing over all possible
hidden vectors:
p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \qquad (4)
The probability that the network assigns to a training image can be raised by adjusting the weights
and biases to lower the energy of that image and to raise the energy of other images, especially those
that have low energies and therefore make a big contribution to the partition function. The derivative
of the log probability of a training vector with respect to a weight is surprisingly simple.
\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \qquad (5)
where the angle brackets are used to denote expectations under the distribution specified by the
subscript that follows. This leads to a very simple learning rule for performing stochastic steepest
ascent in the log probability of the training data:
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right) \qquad (6)
where \epsilon is a learning rate.
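For intuition, equations (2)-(4) can be evaluated exactly on a toy model by brute-force enumeration. The sketch below (Python/NumPy, with arbitrary made-up parameters, not taken from this report) also makes explicit why computing Z is intractable at realistic sizes: it sums over every joint configuration.

```python
import itertools

import numpy as np

# Hypothetical tiny RBM: 3 visible and 2 hidden binary units.
W = np.array([[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]])  # weights w_ij
a = np.array([0.0, 0.1, -0.2])                        # visible biases a_i
b = np.array([0.3, -0.1])                             # hidden biases b_j

def energy(v, h):
    # Equation (1): E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -(a @ v) - (b @ h) - v @ W @ h

def all_states(n):
    # Every binary vector of length n (there are 2^n of them).
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

# Equation (3): the partition function sums over all 2^3 * 2^2 configurations.
Z = sum(np.exp(-energy(v, h)) for v in all_states(3) for h in all_states(2))

def p_v(v):
    # Equation (4): p(v) = (1/Z) sum_h exp(-E(v,h))
    return sum(np.exp(-energy(v, h)) for h in all_states(2)) / Z

total = sum(p_v(v) for v in all_states(3))  # sums to 1 over all visible vectors
```

With hundreds of units this enumeration is hopeless, which is why the sampling-based estimates of the expectations are needed.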
Because there are no direct connections between hidden units in an RBM, it is very easy to get
an unbiased sample of 〈vihj〉data. Given a randomly selected training image, v, the binary state, hj ,
of each hidden unit, j, is set to 1 with probability
p(h_j = 1 \mid \mathbf{v}) = \sigma\left( b_j + \sum_i v_i w_{ij} \right) \qquad (7)
where \sigma(x) is the logistic sigmoid function 1/(1 + \exp(-x)). The product v_i h_j is then an unbiased sample.
Because there are no direct connections between visible units in an RBM, it is also very easy to
get an unbiased sample of the state of a visible unit, given a hidden vector
p(v_i = 1 \mid \mathbf{h}) = \sigma\left( a_i + \sum_j h_j w_{ij} \right) \qquad (8)
Getting an unbiased sample of 〈vihj〉model, however, is much more difficult. It can be done by
starting at any random state of the visible units and performing alternating Gibbs sampling for a very
long time. One iteration of alternating Gibbs sampling consists of updating all of the hidden units
in parallel using equation 7 followed by updating all of the visible units in parallel using equation 8.
A much faster learning procedure was proposed in Hinton (2002). This starts by setting the
states of the visible units to a training vector. Then the binary states of the hidden units are all
computed in parallel using equation 7. Once binary states have been chosen for the hidden units,
a “reconstruction” is produced by setting each vi to 1 with a probability given by equation 8. The
change in a weight is then given by
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right) \qquad (9)
A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.
The learning works well even though it only crudely approximates the gradient of the log probability of the training data (Hinton, 2002). The learning rule more closely approximates the gradient of another objective function called the Contrastive Divergence (Hinton, 2002), which is the difference between two Kullback-Leibler divergences, but it ignores one tricky term in this objective
function so it is not even following that gradient. Indeed, Sutskever and Tieleman have shown that it
is not following the gradient of any function (Sutskever and Tieleman, 2010). Nevertheless, it works
well enough to achieve success in many significant applications.
RBMs typically learn better models if more steps of alternating Gibbs sampling are used before
collecting the statistics for the second term in the learning rule, which will be called the negative
statistics. CDn will be used to denote learning using n full steps of alternating Gibbs sampling.
3 How to collect statistics when using Contrastive Divergence
To begin with, we shall assume that all of the visible and hidden units are binary. Other types of
units will be discussed in section 13. We shall also assume that the purpose of the learning is to
create a good generative model of the set of training vectors. When using RBMs to learn Deep Belief
Nets (see the article on Deep Belief Networks at www.scholarpedia.org) that will subsequently be
fine-tuned using backpropagation, the generative model is not the ultimate objective and it may be
possible to save time by underfitting it, but we will ignore that here.
3.1 Updating the hidden states
Assuming that the hidden units are binary and that you are using CD1, the hidden units should have
stochastic binary states when they are being driven by a data-vector. The probability of turning on
a hidden unit, j, is computed by applying the logistic function σ(x) = 1/(1 + exp(−x)) to its “total
input”:
p(h_j = 1) = \sigma\left( b_j + \sum_i v_i w_{ij} \right) \qquad (10)
and the hidden unit turns on if this probability is greater than a random number uniformly distributed
between 0 and 1.
It is very important to make these hidden states binary, rather than using the probabilities
themselves. If the probabilities are used, each hidden unit can communicate a real value to the
visible units during the reconstruction. This seriously violates the information bottleneck created by
the fact that a hidden unit can convey at most one bit (on average). This information bottleneck
acts as a strong regularizer.
For the last update of the hidden units, it is silly to use stochastic binary states because nothing
depends on which state is chosen. So use the probability itself to avoid unnecessary sampling noise.
When using CDn, only the final update of the hidden units should use the probability.
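As a concrete sketch of the update above (Python/NumPy; the function name is made up), sampling can be made optional so that the final update can return the probability itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_update(v, W, b, sample=True):
    """Equation (10): p(h_j = 1) = sigmoid(b_j + sum_i v_i w_ij).

    A unit turns on when its probability exceeds a uniform random number.
    Pass sample=False for the final hidden update, where the probability
    itself should be used to avoid unnecessary sampling noise.
    """
    p = sigmoid(b + v @ W)  # total input per hidden unit
    if sample:
        return (p > rng.uniform(size=p.shape)).astype(float)
    return p
```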
3.2 Updating the visible states
Assuming that the visible units are binary, the correct way to update the visible states when generating
a reconstruction is to stochastically pick a 1 or 0 with a probability determined by the total top-down
input:
p_i = p(v_i = 1) = \sigma\left( a_i + \sum_j h_j w_{ij} \right) \qquad (11)
However, it is common to use the probability, p_i, instead of sampling a binary value. This is not nearly as problematic as using probabilities for the data-driven hidden states, and it reduces sampling noise, thus allowing faster learning. There is some evidence that it leads to slightly worse density
models (Tijmen Tieleman, personal communication, 2008). This probably does not matter when
using an RBM to pretrain a layer of hidden features for use in a deep belief net.
3.3 Collecting the statistics needed for learning
Assuming that the visible units are using real-valued probabilities instead of stochastic binary values,
there are two sensible ways to collect the positive statistics for the connection between visible unit i
and hidden unit j:
\langle p_i h_j \rangle_{\text{data}} \quad \text{or} \quad \langle p_i p_j \rangle_{\text{data}}
where p_j is a probability and h_j is a binary state that takes value 1 with probability p_j. Using h_j is closer to the mathematical model of an RBM, but using p_j usually has less sampling noise, which allows slightly faster learning.[2]
3.4 A recipe for getting the learning signal for CD1
When the hidden units are being driven by data, always use stochastic binary states. When they are
being driven by reconstructions, always use probabilities without sampling.
Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions.[3]
When collecting the pairwise statistics for learning weights or the individual statistics for learning
biases, use the probabilities, not the binary states, and make sure the weights have random initial
values to break symmetry.
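Putting the recipe together, one CD1 update on a mini-batch might look like the following sketch (Python/NumPy; `cd1_update` and its parameter names are hypothetical, and the learning rate here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, eps=0.1):
    """One CD1 step on a mini-batch, following the recipe above.

    v_data: (n_cases, n_visible) matrix of binary data vectors.
    """
    # Data-driven hidden units: stochastic binary states.
    ph_data = sigmoid(b + v_data @ W)
    h_data = (ph_data > rng.uniform(size=ph_data.shape)).astype(float)
    # Reconstruction: real-valued visible probabilities, no sampling.
    v_recon = sigmoid(a + h_data @ W.T)
    # Reconstruction-driven hidden units: probabilities, no sampling.
    ph_recon = sigmoid(b + v_recon @ W)
    n = v_data.shape[0]
    # Statistics use probabilities; gradients are averaged per case.
    dW = (v_data.T @ ph_data - v_recon.T @ ph_recon) / n
    da = (v_data - v_recon).mean(axis=0)
    db = (ph_data - ph_recon).mean(axis=0)
    return W + eps * dW, a + eps * da, b + eps * db
```

Before calling this repeatedly, the weights should be given small random initial values to break symmetry (see section 8).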
4 The size of a mini-batch
It is possible to update the weights after estimating the gradient on a single training case, but it is
often more efficient to divide the training set into small "mini-batches" of 10 to 100 cases.[4] This
allows matrix-matrix multiplies to be used which is very advantageous on GPU boards or in Matlab.
[2] Using h_j always creates more noise in the positive statistics than using p_j, but it can actually create less noise in the difference of the positive and negative statistics, because the negative statistics depend on the binary decision for the state of j that is used for creating the reconstruction. The probability of j when driven by the reconstruction is highly correlated with the binary decision that was made for j when it was driven by the data.
[3] So there is nothing random about the generation of the reconstructions given the binary states of the hidden units.
[4] The word "batch" is confusing and will be avoided because when it is used to contrast with "on-line" it usually means the entire training set.
To avoid having to change the learning rate when the size of a mini-batch is changed, it is helpful
to divide the total gradient computed on a mini-batch by the size of the mini-batch, so when talking
about learning rates we will assume that they multiply the average, per-case gradient computed on
a mini-batch, not the total gradient for the mini-batch.
It is a serious mistake to make the mini-batches too large when using stochastic gradient descent.
Increasing the mini-batch size by a factor of N leads to a more reliable gradient estimate but it does
not increase the maximum stable learning rate by a factor of N, so the net effect is that the weight
updates are smaller per gradient evaluation.[5]
4.1 A recipe for dividing the training set into mini-batches
For datasets that contain a small number of equiprobable classes, the ideal mini-batch size is often
equal to the number of classes and each mini-batch should contain one example of each class to reduce
the sampling error when estimating the gradient for the whole training set from a single mini-batch.
For other datasets, first randomize the order of the training examples, then use mini-batches of about 10 cases.
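A minimal sketch of this procedure (Python/NumPy; the helper name is made up):

```python
import numpy as np

def make_minibatches(X, batch_size=10, rng=None):
    """Randomize the order of the training cases, then slice into mini-batches."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    return [X[idx[i:i + batch_size]] for i in range(0, len(X), batch_size)]
```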
5 Monitoring the progress of learning
It is easy to compute the squared error between the data and the reconstructions, so this quantity
is often printed out during learning. The reconstruction error on the entire training set should fall
rapidly and consistently at the start of learning and then more slowly. Due to the noise in the
gradient estimates, the reconstruction error on the individual mini-batches will fluctuate gently after
the initial rapid descent. It may also oscillate gently with a period of a few mini-batches when using
high momentum (see section 9).
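The squared reconstruction error mentioned above can be computed with a sketch like this (Python/NumPy, hypothetical names; the hidden states are sampled as in CD1 and the visible reconstruction uses real-valued probabilities):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(v_data, W, a, b, rng=None):
    """Mean squared error between data vectors and their one-step reconstructions."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph = sigmoid(b + v_data @ W)                         # data-driven hidden probs
    h = (ph > rng.uniform(size=ph.shape)).astype(float)  # stochastic binary states
    v_recon = sigmoid(a + h @ W.T)                       # real-valued reconstruction
    return ((v_data - v_recon) ** 2).sum(axis=1).mean()
```

As this section warns, treat the resulting number as a rough progress indicator, not as the objective being optimized.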
Although it is convenient, the reconstruction error is actually a very poor measure of the progress of
learning. It is not the function that CDn learning is approximately optimizing, especially for n >> 1,
and it systematically confounds two different quantities that are changing during the learning. The
first is the difference between the empirical distribution of the training data and the equilibrium
distribution of the RBM. The second is the mixing rate of the alternating Gibbs Markov chain. If
the mixing rate is very low, the reconstruction error will be very small even when the distributions of
the data and the model are very different. As the weights increase the mixing rate falls, so decreases
in reconstruction error do not necessarily mean that the model is improving and, conversely, small
increases do not necessarily mean the model is getting worse. Large increases, however, are a bad sign
except when they are temporary and caused by changes in the learning rate, momentum, weight-cost
or sparsity meta-parameters.
5.1 A recipe for using the reconstruction error
Use it but don’t trust it. If you really want to know what is going on during the learning, use multiple
histograms and graphic displays as described in section 15. Also consider using Annealed Importance
[5] The easy way to parallelize the learning on a cluster is to divide each mini-batch into sub-mini-batches and to use
different cores to compute the gradients on each sub-mini-batch. The gradients computed by different cores must then
be combined. To minimize the ratio of communication to computation, it is tempting to make the sub-mini-batches
large. This usually makes the learning much less efficient, thus wiping out much of the gain achieved by using multiple
cores (Vinod Nair, personal communication, 2007).
Sampling (Salakhutdinov and Murray, 2008) to estimate the density on held out data. If you are
learning a joint density model of labelled data (see section 16), consider monitoring the discriminative
performance on the training data and on a held out validation set.
6 Monitoring the overfitting
When learning a generative model, the obvious quantity to monitor is the probability that the current
model assigns to a datapoint. When this probability starts to decrease for held out validation data, it
is time to stop learning. Unfortunately, for large RBMs, it is very difficult to compute this probability
because it requires knowledge of the partition function. Nevertheless, it is possible to directly monitor
the overfitting by comparing the free energies of training data and held out validation data. In this
comparison, the partition function cancels out. The free energy of a data vector can be computed in
a time that is linear in the number of hidden units (see section 16.1). If the model is not overfitting
at all, the average free energy should be about the same on training and validation data. As the
model starts to overfit the average free energy of the validation data will rise relative to the average
free energy of the