
Introduction to Restricted Boltzmann Machine


TIMAN deep learning reading group

Outline
1. Boltzmann Machine
  1.1 Learning the Boltzmann Machine
  1.2 Practical limitations in learning the BM
2. Restricted Boltzmann Machine
  2.1 Learning the Restricted Boltzmann Machine
3. Contrastive Divergence
4. Representational Power of RBM
  4.1 Better Model with Increasing Number of Units
  4.2 A Huge Model Can Represent Any Distribution

1. Boltzmann Machine

One approach to unsupervised learning is the Boltzmann Machine (BM) neural network. A BM consists of binary units (neurons) and symmetric connections between them; there are no edges from a neuron to itself. A BM contains a set of visible units v and a set of hidden units h. The energy of the state {v, h} is defined as

E(v, h; \theta) = -\tfrac{1}{2} v^\top L v - \tfrac{1}{2} h^\top J h - v^\top W h,

where \theta = {W, L, J} are the model parameters: W, L, and J represent the visible-to-hidden, visible-to-visible, and hidden-to-hidden symmetric interaction terms, respectively. The diagonal elements of L and J are set to 0. Based on the energy, the probability that the model assigns to a visible vector v is

p(v; \theta) = \frac{1}{Z(\theta)} \sum_h \exp(-E(v, h; \theta)),   Z(\theta) = \sum_v \sum_h \exp(-E(v, h; \theta)),

where Z(\theta) is the partition function, which normalizes the probability distribution. The conditional distributions over hidden and visible units are given by

p(h_j = 1 | v, h_{-j}) = \sigma\Big(\sum_i W_{ij} v_i + \sum_{m \neq j} J_{jm} h_m\Big),
p(v_i = 1 | h, v_{-i}) = \sigma\Big(\sum_j W_{ij} h_j + \sum_{k \neq i} L_{ik} v_k\Big),

where \sigma(x) = 1 / (1 + e^{-x}) is the logistic function.

1.1 Learning the Boltzmann Machine

We use standard maximum likelihood to learn the Boltzmann Machine. Given a data set {v^{(n)}}_{n=1}^{N}, the log-likelihood of the parameters of the BM is

\ell(\theta) = \sum_{n=1}^{N} \log p(v^{(n)}; \theta) = \sum_{n=1}^{N} \Big( \log \sum_h \exp(-E(v^{(n)}, h; \theta)) - \log Z(\theta) \Big).

The derivative of the log-likelihood of a single training pattern v^{(n)} with respect to the weights W becomes

\frac{\partial \log p(v^{(n)}; \theta)}{\partial W} = \sum_h p(h | v^{(n)}) \, v^{(n)} h^\top - \sum_{v, h} p(v, h) \, v h^\top = \langle v h^\top \rangle_{\text{data}} - \langle v h^\top \rangle_{\text{model}},

and analogous expressions involving \langle v v^\top \rangle and \langle h h^\top \rangle hold for L and J. The learning algorithm can be seen as driving the BM so that the probability distribution represented by the BM becomes identical to the probability distribution defined by the training data set. Although we do not know an analytical formulation of the exact probability distribution of the training data (as in many machine learning problems), the BM tries to mimic it by clamping the visible units to the training data and letting the hidden units freely take their own states.

1.2 Practical limitations in learning the BM

Exact maximum likelihood learning in this model is intractable, because exact computation of both the data-dependent expectation and the model expectation takes time exponential in the number of units. For example, a BM designed to handle a 28*28 black-and-white image with 500 hidden units has 2^{28*28+500} possible configurations. Even evaluating the probability of a single configuration is infeasible, since it requires the normalizing constant Z(\theta). Hinton and Sejnowski (1983) proposed to use Gibbs sampling to approximate both expectations. For each iteration of learning, a separate Markov chain is run for every training example to approximate the data-dependent expectation, and an additional chain is run to approximate the model expectation. Gibbs sampling makes learning a BM no longer as computationally infeasible as the exact computation of the probability masses. However, Gibbs sampling also has limitations. Due to the full connectivity of the BM, each unit is influenced by all the other units. Even if the visible units are clamped to the training data, we still need to condition on all the other hidden units when sampling a single hidden unit, as the sketch below illustrates.
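To make this component-wise dependence concrete, here is a minimal NumPy sketch of one Gibbs sweep over the hidden units of a BM with the visible units clamped to a training example. The function and variable names (gibbs_sweep_hidden, W, J) are illustrative choices of mine, not notation from the original slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep_hidden(v, h, W, J, rng):
    """One component-wise Gibbs sweep over the hidden units of a BM.

    v : (n_visible,) binary vector, clamped to a training example
    h : (n_hidden,)  binary vector, current hidden state (updated in place)
    W : (n_visible, n_hidden) visible-to-hidden weights
    J : (n_hidden, n_hidden)  symmetric hidden-to-hidden weights, zero diagonal
    """
    for j in range(h.shape[0]):
        # p(h_j = 1 | v, h_{-j}) depends on ALL other hidden units through J,
        # so the hidden units must be resampled one at a time.
        activation = v @ W[:, j] + h @ J[:, j]   # J[j, j] == 0, so h[j] itself does not contribute
        h[j] = rng.random() < sigmoid(activation)
    return h
```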
This coupling makes successive samples in the chain highly correlated with each other, and the resulting poor mixing slows down convergence, especially when estimating the model's expectations, since the Gibbs chain may need to explore a highly multimodal energy landscape. This is typical when modeling real-world distributions such as datasets of images, in which almost all possible images have extremely low probability, yet there are many very different images that occur with quite similar probabilities. To address these limitations, we can set both J = 0 and L = 0 to obtain the well-known restricted Boltzmann machine (RBM). In contrast to the general BM, each hidden unit is then conditionally independent of the other hidden units given the visible units, so Gibbs sampling can be done layer-wise rather than component-wise. As a result, the mixing rate improves and sampling becomes more efficient.

2. Restricted Boltzmann Machine

The RBM is constructed by removing the edges within the visible layer and within the hidden layer (the visible-to-visible and hidden-to-hidden connections), so that only edges between visible and hidden units remain. The RBM is thus a bipartite graph, with the energy function

E(v, h) = -\sum_i \sum_j W_{ij} v_i h_j - \sum_i c_i v_i - \sum_j b_j h_j.

Here we have added bias terms for the visible units and the hidden units, parameterized by c and b respectively. We can also absorb them into W by simply adding a visible unit and a hidden unit that are always set to 1. Because of the Markov property of undirected graphical models, the hidden units are conditionally independent of each other given all the visible units, and the visible units are conditionally independent of each other given all the hidden units. The conditional distributions over hidden and visible units are thus given by

p(h_j = 1 | v) = \sigma\Big(b_j + \sum_i W_{ij} v_i\Big),
p(v_i = 1 | h) = \sigma\Big(c_i + \sum_j W_{ij} h_j\Big).

2.1 Learning the Restricted Boltzmann Machine

Using the conditional independence property of the RBM, the derivative of the log-likelihood of a single training pattern v^{(n)} with respect to the weight w_{ij} becomes

\frac{\partial \log p(v^{(n)})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}},
with \langle v_i h_j \rangle_{\text{data}} = p(h_j = 1 | v^{(n)}) \, v_i^{(n)} and \langle v_i h_j \rangle_{\text{model}} = \sum_{v, h} p(v, h) \, v_i h_j.

Recall that in the BM neither \langle v_i h_j \rangle_{\text{data}} nor \langle v_i h_j \rangle_{\text{model}} can be computed analytically. In the RBM, however, \langle v_i h_j \rangle_{\text{data}} can be computed analytically thanks to the conditional independence. To compute \langle v_i h_j \rangle_{\text{model}}, Gibbs sampling is also more efficient in the RBM: we choose a random starting point v^{(0)} and sample h^{(0)} ~ p(h | v^{(0)}); in iteration t we sample v^{(t)} ~ p(v | h^{(t-1)}) and h^{(t)} ~ p(h | v^{(t)}). The Gibbs chain is stopped when the distribution of (v^{(t)}, h^{(t)}) has converged to the distribution defined by the RBM (which depends on the parameters W). Note that this Gibbs sampling procedure is independent of the current training example v^{(n)}.

Gibbs sampling still has limitations. As the number of units in the RBM increases, it still needs to gather a great number of samples before reaching the stationary distribution. Moreover, Gibbs sampling can mix very poorly on multimodal distributions. For instance, suppose we have a two-dimensional multimodal distribution p(x1, x2) where both x1 and x2 are binary variables, with p(x1=0, x2=0) = 0.4999, p(x1=0, x2=1) = 0.0001, p(x1=1, x2=0) = 0.0001, and p(x1=1, x2=1) = 0.4999. Both (x1=0, x2=0) and (x1=1, x2=1) are modes. If we start sampling from x2=0, the Gibbs chain gets stuck near (x1=0, x2=0) for a very long time; if we start from x2=1, it gets stuck near (x1=1, x2=1) (see the Wikipedia page on Gibbs sampling).
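A small simulation of the two-variable example above makes the poor mixing visible. This is a minimal sketch using the stated probabilities; the function name run_gibbs and the 10,000-step chain length are arbitrary choices of mine.

```python
import numpy as np

# Joint distribution of the toy example, indexed as p[x1, x2].
p = np.array([[0.4999, 0.0001],
              [0.0001, 0.4999]])

def run_gibbs(x1, x2, n_steps, rng):
    """Component-wise Gibbs sampling on the 2x2 toy distribution; returns visit counts."""
    visits = np.zeros((2, 2), dtype=int)
    for _ in range(n_steps):
        # Sample x1 | x2, then x2 | x1, each from the exact conditional distribution.
        x1 = int(rng.random() < p[1, x2] / p[:, x2].sum())
        x2 = int(rng.random() < p[x1, 1] / p[x1, :].sum())
        visits[x1, x2] += 1
    return visits

rng = np.random.default_rng(0)
print(run_gibbs(0, 0, 10_000, rng))
# The chain spends long runs in whichever mode it starts near and crosses to the
# other mode only rarely, even though the two modes have almost equal probability.
```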
3. Contrastive Divergence

To overcome these limitations of Gibbs sampling, Hinton proposed contrastive divergence (CD). Obtaining unbiased estimates of the log-likelihood gradient using MCMC methods typically requires many sampling steps. However, it has been shown that estimates obtained after running the chain for just a few steps can be sufficient for model training (Hinton, 2002). The idea of k-step contrastive divergence learning (CD-k) is quite simple: instead of approximating the second term of the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for only k steps (usually k = 1). The Gibbs chain is initialized with a training example v^{(0)} from the training set and yields the sample v^{(k)} after k steps. Each step t consists of sampling h^{(t)} from p(h | v^{(t)}) and then sampling v^{(t+1)} from p(v | h^{(t)}). The CD-k approximation of the gradient with respect to the parameters \theta for one training pattern is

\frac{\partial \log p(v^{(0)}; \theta)}{\partial \theta} \approx -\sum_h p(h | v^{(0)}) \frac{\partial E(v^{(0)}, h; \theta)}{\partial \theta} + \sum_h p(h | v^{(k)}) \frac{\partial E(v^{(k)}, h; \theta)}{\partial \theta}.

From this expression we can see that the CD-k gradient is a biased estimate of the maximum likelihood gradient. A batch version of CD-k is given in Algorithm 1. The bias of CD-k vanishes as k goes to infinity. A good property of CD is that, when the data distribution is multimodal, starting a chain from each data sample guarantees that the samples approximating the negative phase have representatives from different modes.
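As a concrete illustration of the CD-k update described above, here is a minimal NumPy sketch (this is not the Algorithm 1 referred to in the slides; the function name cd_k_update, the mini-batch interface, and the learning rate are my own choices). It uses the RBM parameterization of Section 2 with hidden biases b and visible biases c.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, W, b, c, k=1, lr=0.1, rng=None):
    """One CD-k update on a mini-batch V of binary rows (shape: batch x n_visible).

    W : (n_visible, n_hidden) weights, b : (n_hidden,) hidden biases, c : (n_visible,) visible biases.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Positive phase: p(h_j = 1 | v^(0)) can be computed exactly for each data row.
    ph_data = sigmoid(V @ W + b)
    pos_grad = V.T @ ph_data

    # Negative phase: run the Gibbs chain for k steps, starting from the data.
    v, ph = V, ph_data
    for _ in range(k):
        h = (rng.random(ph.shape) < ph).astype(V.dtype)   # h^(t) ~ p(h | v^(t))
        pv = sigmoid(h @ W.T + c)
        v = (rng.random(pv.shape) < pv).astype(V.dtype)    # v^(t+1) ~ p(v | h^(t))
        ph = sigmoid(v @ W + b)                            # p(h | v^(t+1)), used for the statistics
    neg_grad = v.T @ ph

    # Gradient ascent on the CD-k approximation of the log-likelihood gradient.
    batch = V.shape[0]
    W += lr * (pos_grad - neg_grad) / batch
    b += lr * (ph_data - ph).mean(axis=0)
    c += lr * (V - v).mean(axis=0)
    return W, b, c
```

Using the probabilities p(h | v) rather than sampled hidden states for the sufficient statistics is a common variance-reduction choice (see Hinton's practical guide under More Readings); the sketch follows that convention, but it is not required by CD-k itself.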
There are some other learning algorithms for the RBM: persistent contrastive divergence (Tieleman, 2008), fast persistent contrastive divergence (Tieleman and Hinton, 2009), and parallel tempering (Cho et al., 2010).

4. Representational Power of RBM

Le Roux and Bengio (2008) prove that any distribution can be approximated arbitrarily well by an RBM with k + 1 hidden units, where k is the number of input vectors whose probability is not 0. We will prove this theorem in detail in our reading group.

4.1 Better Model with Increasing Number of Units

They first show that when the number of hidden units of an RBM is increased, there exist weight values for the new units that guarantee an improvement in the training log-likelihood or, equivalently, in the KL divergence between the data distribution p0 and the model distribution p.

Lemma 1. Let R_p be the equivalence class containing the RBMs whose associated marginal distribution over the visible units is p. The operation of adding a hidden unit to an RBM of R_p preserves the equivalence class. Thus, the set of RBMs composed of an RBM of R_p and an additional hidden unit is also an equivalence class (meaning that all the RBMs of this set have the same marginal distribution over the visible units).

Lemma 2. Let p be the distribution over binary vectors v in {0, 1}^d obtained with an RBM R_p, and let p_{w,c} be the distribution obtained when adding a hidden unit with weights w and bias c to R_p. Then

p_{w,c}(v) = \frac{p(v)\,\big(1 + e^{c + v \cdot w}\big)}{\sum_{\tilde v} p(\tilde v)\,\big(1 + e^{c + \tilde v \cdot w}\big)},

that is, adding the hidden unit multiplies the unnormalized marginal of every v by the factor 1 + e^{c + v \cdot w}.

Theorem 2.3. Let p0 be an arbitrary distribution over {0, 1}^n and let R_p be an RBM with marginal distribution p over the visible units such that KL(p0 || p) > 0. Then there exists an RBM composed of R_p and an additional hidden unit with parameters (w, c) whose marginal distribution p_{w,c} over the visible units achieves KL(p0 || p_{w,c}) < KL(p0 || p).
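Lemma 2 is easy to sanity-check numerically on a tiny RBM by brute-force enumeration. The sketch below assumes the energy parameterization of Section 2 (weights W, hidden biases b, visible biases c); the function rbm_marginal and all variable names are illustrative.

```python
import itertools
import numpy as np

def rbm_marginal(W, b, c):
    """Exact marginal p(v) of a small binary RBM, by enumerating every (v, h) pair.

    Energy: E(v, h) = -sum_ij W_ij v_i h_j - sum_j b_j h_j - sum_i c_i v_i.
    """
    n_v, n_h = W.shape
    vs = np.array(list(itertools.product([0, 1], repeat=n_v)))
    hs = np.array(list(itertools.product([0, 1], repeat=n_h)))
    unnorm = np.exp(vs @ W @ hs.T + (vs @ c)[:, None] + (hs @ b)[None, :])  # exp(-E) for all (v, h)
    p_v = unnorm.sum(axis=1)
    return vs, p_v / p_v.sum()

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W, b, c = rng.normal(size=(n_v, n_h)), rng.normal(size=n_h), rng.normal(size=n_v)

vs, p = rbm_marginal(W, b, c)                       # marginal of the original RBM

w_new, c_new = rng.normal(size=n_v), rng.normal()   # parameters of the added hidden unit
_, p_new = rbm_marginal(np.column_stack([W, w_new]), np.append(b, c_new), c)

# Lemma 2: the new marginal is the old one reweighted by (1 + exp(c_new + v . w_new)).
factor = 1.0 + np.exp(c_new + vs @ w_new)
p_lemma = p * factor / (p * factor).sum()
print(np.allclose(p_new, p_lemma))                  # True
```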
4.2 A Huge Model Can Represent Any Distribution

Theorem 2.4. Any distribution over {0, 1}^n can be approximated arbitrarily well (in the sense of the KL divergence) with an RBM with k + 1 hidden units, where k is the number of input vectors whose probability is not 0.

Proof sketch (universal approximator property). We constructively build an RBM with as many hidden units as the number of input vectors whose probability is strictly positive. Each hidden unit is assigned to one input vector. Namely, when v_i is the visible vector, all hidden units have probability 0 of being on except the one corresponding to v_i, which has probability sigm(\lambda_i) of being on. The value of \lambda_i is directly tied to p(v_i). On the other hand, when all hidden units are off except the i-th one, p(v_i | h) = 1. With probability 1 - sigm(\lambda_i), all the hidden units are turned off, which yields independent draws of the visible units. The proof consists in finding the appropriate weights (and values \lambda_i) to yield that behavior.

References

Cho K H, Raiko T, Ilin A. Parallel tempering is efficient for learning restricted Boltzmann machines[C]//Neural Networks (IJCNN), The 2010 International Joint Conference on. IEEE, 2010: 1-8.
Hinton G E. Training products of experts by minimizing contrastive divergence[J]. Neural Computation, 2002, 14(8): 1771-1800.
Hinton G E, Sejnowski T J. Optimal perceptual inference[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 1983: 448-453.
Le Roux N, Bengio Y. Representational power of restricted Boltzmann machines and deep belief networks[J]. Neural Computation, 2008, 20(6): 1631-1649.
Tieleman T. Training restricted Boltzmann machines using approximations to the likelihood gradient[C]//Proceedings of the 25th International Conference on Machine Learning. ACM, 2008: 1064-1071.
Tieleman T, Hinton G. Using fast weights to improve persistent contrastive divergence[C]//Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009: 1033-1040.

More Readings

Fischer A, Igel C. An introduction to restricted Boltzmann machines[M]//Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, 2012: 14-36.
Hinton G. A practical guide to training restricted Boltzmann machines[J]. Momentum, 2010, 9(1).
Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7): 1527-1554.
Larochelle H, Bengio Y. Classification using discriminative restricted Boltzmann machines[C]//Proceedings of the 25th International Conference on Machine Learning. ACM, 2008: 536-543.
Nair V, Hinton G E. Implicit mixtures of restricted Boltzmann machines[C]//Advances in Neural Information Processing Systems. 2008: 1145-1152.
Salakhutdinov R. Learning and evaluating Boltzmann machines[R]. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto, 2008.
Salakhutdinov R, Hinton G. An efficient learning procedure for deep Boltzmann machines[J]. Neural Computation, 2012, 24(8): 1967-2006.
Salakhutdinov R, Hinton G E. Deep Boltzmann machines[C]//International Conference on Artificial Intelligence and Statistics. 2009: 448-455.
Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for collaborative filtering[C]//Proceedings of the 24th International Conference on Machine Learning. ACM, 2007: 791-798.
Salakhutdinov R, Murray I. On the quantitative analysis of deep belief networks[C]//Proceedings of the 25th International Conference on Machine Learning. ACM, 2008: 872-879.
Schmah T, Hinton G E, Small S L, et al. Generative versus discriminative training of RBMs for classification of fMRI images[C]//Advances in Neural Information Processing Systems. 2008: 1409-1416.
Sutskever I, Hinton G E, Taylor G W. The recurrent temporal restricted Boltzmann machine[C]//Advances in Neural Information Processing Systems. 2008: 1601-1608.
Yuille A. The convergence of contrastive divergences[J]. 2005.