TIMAN deep learning reading group
Introduction to Restricted Boltzmann Machine
Outline
1. Boltzmann Machine
1.1 Learning the Boltzmann Machine
1.2 Practical limitation in learning the BM
2. Restricted Boltzmann Machine
2.1 Learning the Restricted Boltzmann Machine
3. Contrastive Divergence
4. Representational power of RBM
4.1 Better Model with Increasing Number of Units
4.2 A Huge Model Can Represent Any Distribution
1. Boltzmann Machine
One approach to unsupervised learning is the Boltzmann Machine (BM) neural
network. A BM consists of binary units (neurons) and symmetric connections
between them. There are no edges going from a neuron to itself. The BM contains a
set of visible units v \in \{0, 1\}^D and a set of hidden units h \in \{0, 1\}^P. The energy of the
state {v, h} is defined as:

E(v, h; \theta) = -v^\top W h - \tfrac{1}{2} v^\top L v - \tfrac{1}{2} h^\top J h,

where \theta = \{W, L, J\} are the model parameters: W, L and J represent the visible-to-hidden,
visible-to-visible and hidden-to-hidden symmetric interaction terms. The diagonal
elements of L and J are set to 0.
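As a quick illustration, the energy can be computed directly from this definition. The following NumPy sketch assumes v, h, W, L and J are arrays of compatible shapes; the function name and the omission of bias terms are my own choices:

import numpy as np

def bm_energy(v, h, W, L, J):
    # Energy of a Boltzmann machine state {v, h}: visible-to-hidden,
    # visible-to-visible and hidden-to-hidden terms. The 1/2 factors avoid
    # double-counting each symmetric pair; diagonals of L and J are zero.
    return -(v @ W @ h) - 0.5 * (v @ L @ v) - 0.5 * (h @ J @ h)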
Based on the energy, the probability that the model assigns to a visible vector v is:
p(v; \theta) = \frac{1}{Z(\theta)} \sum_{h} \exp\big(-E(v, h; \theta)\big),
\qquad
Z(\theta) = \sum_{v} \sum_{h} \exp\big(-E(v, h; \theta)\big),

where Z(\theta) is the partition function, which is used to normalize the probability
distribution.
The conditional distributions over hidden and visible units are given by:

p(h_j = 1 \mid v, h_{-j}) = \sigma\Big(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{jm} h_m\Big),

p(v_i = 1 \mid h, v_{-i}) = \sigma\Big(\sum_j W_{ij} h_j + \sum_{k \neq i} L_{ik} v_k\Big),

where \sigma(x) = 1 / (1 + \exp(-x)) is the logistic function.
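For concreteness, a single component-wise Gibbs update of one hidden unit can be sketched as follows; this is a minimal NumPy sketch based on the conditional above, and the function and variable names are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden_unit(j, v, h, W, J, rng):
    # One component-wise Gibbs update for hidden unit j of a full BM,
    # following p(h_j = 1 | v, h_{-j}) above. Since the diagonal of J is
    # zero, the current value of h_j does not influence its own update.
    p_on = sigmoid(v @ W[:, j] + h @ J[:, j])
    h[j] = 1.0 if rng.random() < p_on else 0.0
    return h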
1.1 Learning the Boltzmann Machine
We use standard maximum likelihood to learn the Boltzmann Machine. Given
a data set \{v^{(n)}\}_{n=1}^{N}, the log-likelihood of the parameters of the BM is

\ell(\theta) = \sum_{n=1}^{N} \log p(v^{(n)}; \theta)
= \sum_{n=1}^{N} \Big( \log \sum_{h} \exp\big(-E(v^{(n)}, h; \theta)\big) - \log Z(\theta) \Big).
The derivative of the log-likelihood of a single training pattern v^{(n)} w.r.t. a
parameter \theta becomes

\frac{\partial \log p(v^{(n)}; \theta)}{\partial \theta}
= \sum_{h} p(h \mid v^{(n)}) \frac{\partial \big(-E(v^{(n)}, h)\big)}{\partial \theta}
- \sum_{v, h} p(v, h) \frac{\partial \big(-E(v, h)\big)}{\partial \theta}
= \Big\langle \frac{\partial \big(-E(v^{(n)}, h)\big)}{\partial \theta} \Big\rangle_{p(h \mid v^{(n)})}
- \Big\langle \frac{\partial \big(-E(v, h)\big)}{\partial \theta} \Big\rangle_{p(v, h)}.
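For example, specializing \theta to a single weight W_{ij} in the energy defined above gives the familiar data-minus-model form

\frac{\partial \log p(v^{(n)})}{\partial W_{ij}}
= \big\langle v_i h_j \big\rangle_{p(h \mid v^{(n)})} - \big\langle v_i h_j \big\rangle_{p(v, h)},

where the first expectation is often written E_data and the second E_model.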
The learning algorithm can be seen as driving the BM so that the probability
distribution it represents becomes identical to the probability distribution
defined by the training data set. Although we do not know the analytical
form of the exact probability distribution of the training data set
(as in many machine learning problems), the BM tries to mimic it by
clamping the visible units to the training data and letting the hidden units freely
take their own states.
1.2 Practical limitation in learning the BM
Exact maximum likelihood learning in this model is intractable, because exact
computation of both the data-dependent expectation and the model expectation takes
time that is exponential in the number of hidden units.
For example, a BM designed to handle a 28*28 black-and-white image
with 500 hidden units has 2^{28*28+500} possible configurations. Even evaluating the
probability of a single configuration is infeasible, since we would need to compute
the normalizing constant Z(\theta).
Hinton and Sejnowski (Hinton and Sejnowski, 1983) proposed to use Gibbs sampling to
approximate both expectations. For each iteration of learning, a separate Markov
chain is run for every training data point to approximate E_data, and an additional chain is
run to approximate E_model. Gibbs sampling makes learning the BM no longer as
computationally infeasible as exact computation of the probability masses.
However, Gibbs sampling also has limitations. Due to the full connectivity
of the BM, each unit is influenced by all the other units. Even if the visible units are
clamped to the training data, we still need to condition on all the other hidden units
when sampling a single hidden unit. This makes the successive samples in the
chain highly correlated with each other, and this poor mixing affects the speed of
convergence, especially when estimating the model's expectations, since the Gibbs
chain may need to explore a highly multimodal energy landscape. This is typical
when modeling real-world distributions such as datasets of images, in which
almost all of the possible images have extremely low probability, but there are
many very different images that occur with quite similar probabilities.
To address these limitations, we can set both J = 0 and L = 0 to obtain the well-known
restricted Boltzmann machine (RBM). In contrast to the general BM, each hidden unit
is conditionally independent of the other hidden units given the visible units (and
vice versa). Gibbs sampling can then be done layer-wise
rather than component-wise. Therefore, the mixing rate is improved and the
sampling is more efficient.
2. Restricted Boltzmann Machine
The RBM is constructed by removing the edges within the visible layer and within the
hidden layer (i.e., setting L = 0 and J = 0) while keeping the connections between
visible and hidden units. The RBM is thus a bipartite graph corresponding to the
following energy function:

E(v, h) = -\sum_i \sum_j v_i W_{ij} h_j - \sum_i c_i v_i - \sum_j b_j h_j.

Here we have added bias terms for the visible units and hidden units, which are
parameterized by c and b respectively. We can also absorb them into W by simply
adding a visible unit and a hidden unit which are always set to 1.
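The bias-absorption trick can be written down directly. The following NumPy sketch, with illustrative names and assuming W has shape (n_visible, n_hidden), builds the augmented weight matrix:

import numpy as np

def absorb_biases(W, b, c):
    # Append one always-on visible unit and one always-on hidden unit, so
    # that the biases b (hidden) and c (visible) become ordinary weights.
    n_visible, n_hidden = W.shape
    W_aug = np.zeros((n_visible + 1, n_hidden + 1))
    W_aug[:n_visible, :n_hidden] = W
    W_aug[n_visible, :n_hidden] = b   # hidden biases via the extra visible unit
    W_aug[:n_visible, n_hidden] = c   # visible biases via the extra hidden unit
    return W_aug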
Because of the Markov property of undirected graphical models, the hidden units
are conditionally independent of each other given all the visible units, and the visible
units are conditionally independent of each other given all the hidden units.
The conditional distributions over hidden and visible units are thus given by:

p(h_j = 1 \mid v) = \sigma\Big(\sum_i W_{ij} v_i + b_j\Big),

p(v_i = 1 \mid h) = \sigma\Big(\sum_j W_{ij} h_j + c_i\Big).
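Because both conditionals factorize over units, they can be computed for a whole layer at once. A minimal NumPy sketch, assuming W has shape (n_visible, n_hidden) with hidden biases b and visible biases c:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # p(h_j = 1 | v) for every hidden unit in one shot (layer-wise).
    return sigmoid(v @ W + b)

def p_v_given_h(h, W, c):
    # p(v_i = 1 | h) for every visible unit in one shot.
    return sigmoid(h @ W.T + c)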
2.1 Learning the Restricted Boltzmann Machine
Using the conditional independence property of the RBM, the derivative of the log-
likelihood of a single training pattern v^{(n)} w.r.t. the weight w_{ij} becomes

\frac{\partial \log p(v^{(n)})}{\partial w_{ij}}
= \big\langle v_i h_j \big\rangle_{p(h \mid v^{(n)})} - \big\langle v_i h_j \big\rangle_{p(v, h)},

with

\big\langle v_i h_j \big\rangle_{p(h \mid v^{(n)})} = p(h_j = 1 \mid v^{(n)})\, v_i^{(n)},
\qquad
\big\langle v_i h_j \big\rangle_{p(v, h)} = \sum_{v} p(v)\, p(h_j = 1 \mid v)\, v_i.
Recall that in the BM, neither \langle v_i h_j \rangle_{p(h \mid v^{(n)})} nor
\langle v_i h_j \rangle_{p(v, h)} can be computed analytically. However,
\langle v_i h_j \rangle_{p(h \mid v^{(n)})} can be computed analytically in the RBM due to
the conditional independence.
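Concretely, the data-dependent term for a single training vector reduces to an outer product of the data with the hidden activation probabilities; a minimal NumPy sketch with illustrative names:

import numpy as np

def positive_phase(v, W, b):
    # Data-dependent term <v_i h_j>_{p(h|v)} for one binary training vector v.
    # Conditional independence makes the expectation factorize as
    # v_i * p(h_j = 1 | v), so no sampling is needed.
    ph = 1.0 / (1.0 + np.exp(-(v @ W + b)))   # p(h_j = 1 | v) for all j
    return np.outer(v, ph)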
To compute \langle v_i h_j \rangle_{p(v, h)}, Gibbs sampling is also more efficient in the RBM. We
can choose a random starting point v^0 and then sample h^0 ~ p(h | v^0). In the t-th
iteration, we sample v^t ~ p(v | h^{t-1}) and h^t ~ p(h | v^t). The Gibbs sampling stops when
p(v^t, h^t) converges to the underlying distribution represented by the graph structure
of the RBM (the structure depends on the parameter W). Note that this procedure of
Gibbs sampling is independent of the current training data v^{(n)}.
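A minimal NumPy sketch of this layer-wise (block) Gibbs chain, assuming W has shape (n_visible, n_hidden), hidden biases b, visible biases c, and n_steps >= 1:

import numpy as np

def gibbs_chain(W, b, c, n_steps, rng):
    # Alternate h^t ~ p(h | v^t) and v^{t+1} ~ p(v | h^t), starting from a
    # random visible vector v^0, as described above.
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = (rng.random(W.shape[0]) < 0.5).astype(float)   # random start v^0
    for _ in range(n_steps):
        ph = sigmoid(v @ W + b)                        # p(h | v)
        h = (rng.random(ph.shape) < ph).astype(float)  # sample hidden layer
        pv = sigmoid(h @ W.T + c)                      # p(v | h)
        v = (rng.random(pv.shape) < pv).astype(float)  # sample visible layer
    return v, h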
Gibbs sampling also has limitations. As the number of units in the RBM increases, it
still needs a great number of samples before reaching the stationary distribution.
Besides, Gibbs sampling often fails on multimodal distributions. For instance, suppose
we have a two-dimensional multimodal distribution p(x1, x2) where both x1 and x2
are binary variables. Assume that p(x1=0, x2=0) = 0.4999, p(x1=0, x2=1) = 0.0001,
p(x1=1, x2=0) = 0.0001, and p(x1=1, x2=1) = 0.4999. Both (x1=0, x2=0) and
(x1=1, x2=1) are modes. If we start sampling from x2=0, the Gibbs chain
will get stuck in the (x1=0, x2=0) mode; if we start sampling from x2=1, the Gibbs chain
will get stuck in the (x1=1, x2=1) mode (see the Wikipedia page for Gibbs sampling).
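The example above can be simulated in a few lines. The following NumPy sketch runs the two-variable Gibbs chain and counts how often each state is visited; the seed and loop length are illustrative:

import numpy as np

# Target distribution: two strong modes and two near-zero states.
p = {(0, 0): 0.4999, (0, 1): 0.0001, (1, 0): 0.0001, (1, 1): 0.4999}
rng = np.random.default_rng(0)
x1, x2 = 0, 0                           # start in the (0, 0) mode
visits = {state: 0 for state in p}
for _ in range(1000):
    # Resample x1 given x2, then x2 given x1 (one Gibbs sweep).
    x1 = int(rng.random() < p[(1, x2)] / (p[(0, x2)] + p[(1, x2)]))
    x2 = int(rng.random() < p[(x1, 1)] / (p[(x1, 0)] + p[(x1, 1)]))
    visits[(x1, x2)] += 1
# Each single-variable flip has probability ~2e-4, so short runs usually
# stay in the starting mode and successive samples are highly correlated.
print(visits)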
3. Contrastive Divergence
To overcome the limitations of Gibbs sampling, Hinton proposed contrastive
divergence (CD). Obtaining unbiased estimates of the gradient using MCMC methods
typically requires many sampling steps. However, it has recently been shown that
estimates obtained after running the chain for just a few steps can be sufficient for
model training (Hinton, 2002).
The idea of k-step contrastive divergence learning (CD-k) is quite simple: instead
of approximating the second term in the log-likelihood gradient by a sample from
the RBM distribution (which would require running a Markov chain until the
stationary distribution is reached), a Gibbs chain is run for only k steps (and
usually k = 1). The Gibbs chain is initialized with a training example v^{(0)} from the
training set and yields the sample v^{(k)} after k steps. Each step t consists of sampling
h^{(t)} from p(h | v^{(t)}) and subsequently sampling v^{(t+1)} from p(v | h^{(t)}).
The CD-k approximation of the gradient w.r.t. \theta for one training pattern is

\frac{\partial \log p(v^{(0)})}{\partial \theta}
\approx \sum_{h} p(h \mid v^{(0)}) \frac{\partial \big(-E(v^{(0)}, h)\big)}{\partial \theta}
- \sum_{h} p(h \mid v^{(k)}) \frac{\partial \big(-E(v^{(k)}, h)\big)}{\partial \theta}.

From the above equation, we can see that there is a bias between the CD-k gradient and
the gradient of the maximum likelihood estimate.
A batch version of CD-k can be seen in Algorithm 1.
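As a complement, here is a minimal NumPy sketch of one CD-1 parameter update for a mini-batch; it is not a reproduction of Algorithm 1, and the variable names, shapes and learning-rate handling are my own assumptions:

import numpy as np

def cd1_update(V, W, b, c, lr, rng):
    # V: mini-batch of binary rows, W: (n_visible, n_hidden),
    # b: hidden biases, c: visible biases, lr: learning rate.
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Positive phase: the chain starts at the data, v^(0) = V.
    ph0 = sigmoid(V @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: v^(1) ~ p(v | h^(0)), then p(h | v^(1)).
    pv1 = sigmoid(h0 @ W.T + c)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # CD-1 estimate: <v h>_data minus <v h>_reconstruction, batch-averaged.
    n = V.shape[0]
    W = W + lr * (V.T @ ph0 - v1.T @ ph1) / n
    b = b + lr * (ph0 - ph1).mean(axis=0)
    c = c + lr * (V - v1).mean(axis=0)
    return W, b, c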
The bias of CD-k vanishes as k goes to infinity. A good property of CD is that,
in case the data distribution is multimodal, running the chains starting from each
data sample guarantees that the samples approximating the negative phase have
representatives from different modes.
There are some other learning algorithms for the RBM: persistent contrastive
divergence (Tieleman, 2008), fast persistent contrastive divergence (Tieleman and
Hinton, 2009) and parallel tempering (Cho et al., 2010).
4. Representational Power of RBM
Le Roux and Bengio (Le Roux and Bengio, 2008) proved that any distribution can be approximated
arbitrarily well with an RBM with k+1 hidden units, where k is the number of input
vectors whose probability is not 0. We will go through this proof in detail in our
reading group.
4.1 Better Model with Increasing Number of Units
They first show that when the number of hidden units of an RBM is increased,
there exist weight values for the new units that guarantee an improvement in the
training log-likelihood or, equivalently, in the KL divergence between the data
distribution p0 and the model distribution p.
Lemma 1. Let Rp be the equivalence class containing the RBMs whose associated
marginal distribution over the visible units is p. The operation of adding a hidden
unit to an RBM of Rp preserves the equivalence class. Thus, the set of RBMs
composed of an RBM of Rp and an additional hidden unit is also an equivalence
class (meaning that all the RBMs of this set have the same marginal distribution
over visible units).
Lemma 2. Let p be the distribution over binary vectors v in {0, 1}^d obtained with
an RBM R_p, and let p_{w,c} be the distribution obtained when adding a hidden unit with
weights w and bias c to R_p. Then p_{w,c} can be written explicitly in terms of p, w and c
(the exact expression is given in Le Roux and Bengio, 2008).
Theorem 2.3. Let p0 be an arbitrary distribution over {0, 1}^n and let R_p be an RBM
with marginal distribution p over the visible units such that KL(p0 || p) > 0. Then
there exists an RBM R_{p_{w,c}} composed of R_p and an additional hidden unit with
parameters (w, c) whose marginal distribution p_{w,c} over the visible units achieves
KL(p0 || p_{w,c}) < KL(p0 || p).
4.2 A Huge Model Can Represent Any Distribution
Theorem 2.4. Any distribution over {0, 1}^n can be approximated arbitrarily well
(in the sense of the KL divergence) with an RBM with k + 1 hidden units, where k
is the number of input vectors whose probability is not 0.
Proof sketch (Universal approximator property). We constructively build an
RBM with as many hidden units as the number of input vectors whose probability
is strictly positive. Each hidden unit is assigned to one input vector. Namely,
when the visible vector is v_i, all hidden units have probability 0 of being on
except the one corresponding to v_i, which is on with probability sigm(\lambda_i).
The value of \lambda_i is directly tied to p(v_i). On the other hand, when all hidden units
are off except the i-th one, p(v_i | h) = 1. With probability 1 - sigm(\lambda_i), all the hidden
units are turned off, which yields independent draws of the visible units. The proof
consists of finding the appropriate weights (and values \lambda_i) that yield this behavior.
References
Cho K H, Raiko T, Ilin A. Parallel tempering is efficient for learning restricted Boltzmann
machines[C]//Neural Networks (IJCNN), The 2010 International Joint Conference on. IEEE, 2010: 1-8.
Hinton G E. Training products of experts by minimizing contrastive divergence[J]. Neural computation,
2002, 14(8): 1771-1800.
Hinton G E, Sejnowski T J. Optimal perceptual inference[C]//Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 1983: 448-453.
Le Roux N, Bengio Y. Representational power of restricted Boltzmann machines and deep belief
networks[J]. Neural Computation, 2008, 20(6): 1631-1649.
Tieleman T. Training restricted Boltzmann machines using approximations to the likelihood
gradient[C]//Proceedings of the 25th international conference on Machine learning. ACM, 2008: 1064-
1071.
Tieleman T, Hinton G. Using fast weights to improve persistent contrastive divergence[C]//Proceedings of
the 26th Annual International Conference on Machine Learning. ACM, 2009: 1033-1040.
More Readings
Fischer A, Igel C. An Introduction to Restricted Boltzmann Machines[M]//Progress in Pattern Recognition,
Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, 2012: 14-36.
Hinton G. A practical guide to training restricted Boltzmann machines[J]. Momentum, 2010, 9(1).
Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets[J].
Neural computation, 2006, 18(7): 1527-1554.
Larochelle H, Bengio Y. Classification using discriminative restricted Boltzmann
machines[C]//Proceedings of the 25th international conference on Machine learning. ACM, 2008: 536-543.
Nair V, Hinton G E. Implicit mixtures of restricted Boltzmann machines[C]//Advances in neural information
processing systems. 2008: 1145-1152.
Salakhutdinov R. Learning and evaluating Boltzmann machines[R]. Technical Report UTML TR 2008-002,
Department of Computer Science, University of Toronto, 2008.
Salakhutdinov R, Hinton G. An efficient learning procedure for deep Boltzmann machines[J]. Neural
Computation, 2012, 24(8): 1967-2006.
Salakhutdinov R, Hinton G E. Deep boltzmann machines[C]//International Conference on Artificial
Intelligence and Statistics. 2009: 448-455
Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for collaborative
filtering[C]//Proceedings of the 24th international conference on Machine learning. ACM, 2007: 791-798.
Salakhutdinov R, Murray I. On the quantitative analysis of deep belief networks[C]//Proceedings of the
25th international conference on Machine learning. ACM, 2008: 872-879.
Schmah T, Hinton G E, Small S L, et al. Generative versus discriminative training of RBMs for
classification of fMRI images[C]//Advances in neural information processing systems. 2008: 1409-1416.
Sutskever I, Hinton G E, Taylor G W. The recurrent temporal restricted boltzmann machine[C]//Advances
in Neural Information Processing Systems. 2008: 1601-1608.
Yuille A. The convergence of contrastive divergences[J]. 2005.