UTML TR 2010–003

A Practical Guide to Training Restricted Boltzmann Machines [1]
Version 1

Geoffrey Hinton
Department of Computer Science, University of Toronto
6 King's College Rd, Toronto, M5S 3G4, Canada
http://learning.cs.toronto.edu
fax: +1 416 978 1455

Copyright © Geoffrey Hinton 2010. August 2, 2010.

[1] If you make use of this technical report to train an RBM, please cite it in any resulting publication.

Contents

1 Introduction
2 An overview of Restricted Boltzmann Machines and Contrastive Divergence
3 How to collect statistics when using Contrastive Divergence
   3.1 Updating the hidden states
   3.2 Updating the visible states
   3.3 Collecting the statistics needed for learning
   3.4 A recipe for getting the learning signal for CD1
4 The size of a mini-batch
   4.1 A recipe for dividing the training set into mini-batches
5 Monitoring the progress of learning
   5.1 A recipe for using the reconstruction error
6 Monitoring the overfitting
   6.1 A recipe for monitoring the overfitting
7 The learning rate
   7.1 A recipe for setting the learning rates for weights and biases
8 The initial values of the weights and biases
   8.1 A recipe for setting the initial values of the weights and biases
9 Momentum
   9.1 A recipe for using momentum
10 Weight-decay
   10.1 A recipe for using weight-decay
11 Encouraging sparse hidden activities
   11.1 A recipe for sparsity
12 The number of hidden units
   12.1 A recipe for choosing the number of hidden units
13 Different types of unit
   13.1 Softmax and multinomial units
   13.2 Gaussian visible units
   13.3 Gaussian visible and hidden units
   13.4 Binomial units
   13.5 Rectified linear units
14 Varieties of contrastive divergence
15 Displaying what is happening during learning
16 Using RBMs for discrimination
   16.1 Computing the free energy of a visible vector
17 Dealing with missing values

1 Introduction

Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data including labeled or unlabeled images (Hinton et al., 2006a), windows of mel-cepstral coefficients that represent speech (Mohamed et al., 2009), bags of words that represent documents (Salakhutdinov and Hinton, 2009), and user ratings of movies (Salakhutdinov et al., 2007).
In their conditional form they can be used to model high-dimensional temporal sequences such as video or motion capture data (Taylor et al., 2006) or speech (Mohamed and Hinton, 2010). Their most important use is as learning modules that are composed to form deep belief nets (Hinton et al., 2006a).

RBMs are usually trained using the contrastive divergence learning procedure (Hinton, 2002). This requires a certain amount of practical experience to decide how to set the values of numerical meta-parameters such as the learning rate, the momentum, the weight-cost, the sparsity target, the initial values of the weights, the number of hidden units and the size of each mini-batch. There are also decisions to be made about what types of units to use, whether to update their states stochastically or deterministically, how many times to update the states of the hidden units for each training case, and whether to start each sequence of state updates at a data-vector. In addition, it is useful to know how to monitor the progress of learning and when to terminate the training.

For any particular application, the code that was used gives a complete specification of all of these decisions, but it does not explain why the decisions were made or how minor changes will affect performance. More significantly, it does not provide a novice user with any guidance about how to make good decisions for a new application. This requires some sensible heuristics and the ability to relate failures of the learning to the decisions that caused those failures.

Over the last few years, the machine learning group at the University of Toronto has acquired considerable expertise at training RBMs and this guide is an attempt to share this expertise with other machine learning researchers. We are still on a fairly steep part of the learning curve, so the guide is a living document that will be updated from time to time and the version number should always be used when referring to it.

2 An overview of Restricted Boltzmann Machines and Contrastive Divergence

Skip this section if you already know about RBMs.

Consider a training set of binary vectors which we will assume are binary images for the purposes of explanation. The training set can be modeled using a two-layer network called a "Restricted Boltzmann Machine" (Smolensky, 1986; Freund and Haussler, 1992; Hinton, 2002) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to "visible" units of the RBM because their states are observed; the feature detectors correspond to "hidden" units. A joint configuration, (v, h), of the visible and hidden units has an energy (Hopfield, 1982) given by:

    E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (1)

where v_i and h_j are the binary states of visible unit i and hidden unit j, a_i and b_j are their biases, and w_ij is the weight between them.
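As a concrete reading of equation 1, the following minimal Python/NumPy sketch (not part of the original report) evaluates the energy of one joint configuration; the array names and the (num_visible, num_hidden) layout of the weight matrix are assumptions made for illustration.

    import numpy as np

    def rbm_energy(v, h, a, b, W):
        # Equation 1: E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
        # v: binary visible vector (num_visible,)   h: binary hidden vector (num_hidden,)
        # a, b: visible and hidden biases           W: weights, shape (num_visible, num_hidden)
        return -np.dot(a, v) - np.dot(b, h) - v @ W @ h

Configurations with lower energy are assigned higher probability, which is made precise next.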
The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function:

    p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}    (2)

where the "partition function", Z, is given by summing over all possible pairs of visible and hidden vectors:

    Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}    (3)

The probability that the network assigns to a visible vector, v, is given by summing over all possible hidden vectors:

    p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}    (4)

The probability that the network assigns to a training image can be raised by adjusting the weights and biases to lower the energy of that image and to raise the energy of other images, especially those that have low energies and therefore make a big contribution to the partition function. The derivative of the log probability of a training vector with respect to a weight is surprisingly simple:

    \frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}    (5)

where the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. This leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

    \Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)    (6)

where \epsilon is a learning rate.

Because there are no direct connections between hidden units in an RBM, it is very easy to get an unbiased sample of ⟨v_i h_j⟩_data. Given a randomly selected training image, v, the binary state, h_j, of each hidden unit, j, is set to 1 with probability

    p(h_j = 1 \mid \mathbf{v}) = \sigma\left( b_j + \sum_i v_i w_{ij} \right)    (7)

where \sigma(x) is the logistic sigmoid function 1/(1 + exp(-x)). The product v_i h_j is then an unbiased sample. Because there are no direct connections between visible units in an RBM, it is also very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

    p(v_i = 1 \mid \mathbf{h}) = \sigma\left( a_i + \sum_j h_j w_{ij} \right)    (8)

Getting an unbiased sample of ⟨v_i h_j⟩_model, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. One iteration of alternating Gibbs sampling consists of updating all of the hidden units in parallel using equation 7 followed by updating all of the visible units in parallel using equation 8.

A much faster learning procedure was proposed in Hinton (2002). This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using equation 7. Once binary states have been chosen for the hidden units, a "reconstruction" is produced by setting each v_i to 1 with a probability given by equation 8. The change in a weight is then given by

    \Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)    (9)

A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.

The learning works well even though it only crudely approximates the gradient of the log probability of the training data (Hinton, 2002). The learning rule much more closely approximates the gradient of another objective function called the Contrastive Divergence (Hinton, 2002), which is the difference between two Kullback–Leibler divergences, but it ignores one tricky term in this objective function so it is not even following that gradient. Indeed, Sutskever and Tieleman have shown that it is not following the gradient of any function (Sutskever and Tieleman, 2010). Nevertheless, it works well enough to achieve success in many significant applications.
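To make the prolonged alternating Gibbs sampling mentioned above concrete, here is a minimal Python/NumPy sketch (not from the report) of a single-chain estimate of ⟨v_i h_j⟩_model using equations 7 and 8; the function and variable names, and the (num_visible, num_hidden) weight layout, are assumptions made for this illustration.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_bernoulli(p, rng):
        # Turn a vector of probabilities into stochastic binary states.
        return (rng.random(p.shape) < p).astype(float)

    def model_statistics(W, a, b, num_steps, rng):
        # Start at a random state of the visible units and perform alternating
        # Gibbs sampling for a long time (equation 7 for the hidden units,
        # equation 8 for the visible units).
        num_visible = W.shape[0]
        v = sample_bernoulli(np.full(num_visible, 0.5), rng)
        for _ in range(num_steps):
            h = sample_bernoulli(sigmoid(b + v @ W), rng)    # equation 7
            v = sample_bernoulli(sigmoid(a + h @ W.T), rng)  # equation 8
        h = sample_bernoulli(sigmoid(b + v @ W), rng)        # pair h with the final v
        return np.outer(v, h)  # one sample of v_i * h_j, approximately from the model

In practice many such samples would have to be averaged; the contrastive divergence procedure described above avoids this expensive chain by starting at a data vector and taking only one (or a few) steps.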
RBMs typically learn better models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, which will be called the negative statistics. CDn will be used to denote learning using n full steps of alternating Gibbs sampling.

3 How to collect statistics when using Contrastive Divergence

To begin with, we shall assume that all of the visible and hidden units are binary. Other types of units will be discussed in section 13. We shall also assume that the purpose of the learning is to create a good generative model of the set of training vectors. When using RBMs to learn Deep Belief Nets (see the article on Deep Belief Networks at www.scholarpedia.org) that will subsequently be fine-tuned using backpropagation, the generative model is not the ultimate objective and it may be possible to save time by underfitting it, but we will ignore that here.

3.1 Updating the hidden states

Assuming that the hidden units are binary and that you are using CD1, the hidden units should have stochastic binary states when they are being driven by a data-vector. The probability of turning on a hidden unit, j, is computed by applying the logistic function \sigma(x) = 1/(1 + exp(-x)) to its "total input":

    p(h_j = 1) = \sigma\left( b_j + \sum_i v_i w_{ij} \right)    (10)

and the hidden unit turns on if this probability is greater than a random number uniformly distributed between 0 and 1.

It is very important to make these hidden states binary, rather than using the probabilities themselves. If the probabilities are used, each hidden unit can communicate a real value to the visible units during the reconstruction. This seriously violates the information bottleneck created by the fact that a hidden unit can convey at most one bit (on average). This information bottleneck acts as a strong regularizer.

For the last update of the hidden units, it is silly to use stochastic binary states because nothing depends on which state is chosen. So use the probability itself to avoid unnecessary sampling noise. When using CDn, only the final update of the hidden units should use the probability.

3.2 Updating the visible states

Assuming that the visible units are binary, the correct way to update the visible states when generating a reconstruction is to stochastically pick a 1 or 0 with a probability determined by the total top-down input:

    p_i = p(v_i = 1) = \sigma\left( a_i + \sum_j h_j w_{ij} \right)    (11)

However, it is common to use the probability, p_i, instead of sampling a binary value. This is not nearly as problematic as using probabilities for the data-driven hidden states and it reduces sampling noise, thus allowing faster learning. There is some evidence that it leads to slightly worse density models (Tijmen Tieleman, personal communication, 2008). This probably does not matter when using an RBM to pretrain a layer of hidden features for use in a deep belief net.

3.3 Collecting the statistics needed for learning

Assuming that the visible units are using real-valued probabilities instead of stochastic binary values, there are two sensible ways to collect the positive statistics for the connection between visible unit i and hidden unit j:

    ⟨p_i h_j⟩_data  or  ⟨p_i p_j⟩_data

where p_j is a probability and h_j is a binary state that takes value 1 with probability p_j. Using h_j is closer to the mathematical model of an RBM, but using p_j usually has less sampling noise, which allows slightly faster learning [2].

[2] Using h_j always creates more noise in the positive statistics than using p_j, but it can actually create less noise in the difference of the positive and negative statistics, because the negative statistics depend on the binary decision for the state of j that is used for creating the reconstruction. The probability of j when driven by the reconstruction is highly correlated with the binary decision that was made for j when it was driven by the data.
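Putting sections 3.1–3.3 together (and anticipating the recipe given in section 3.4 below), the following Python/NumPy sketch, which is not part of the original report, collects the CD1 learning signal for one binary data vector. Unlike the purely binary chain sketched earlier, it follows the practical advice about when to use probabilities; all names and the weight-matrix layout are illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_statistics(v_data, W, a, b, rng):
        # 3.1: hidden units driven by data get stochastic binary states (equation 10).
        p_h_data = sigmoid(b + v_data @ W)
        h_data = (rng.random(p_h_data.shape) < p_h_data).astype(float)

        # 3.2: the reconstruction commonly uses the real-valued probabilities
        # rather than sampled binary visible states (equation 11).
        p_v_recon = sigmoid(a + h_data @ W.T)

        # 3.1: for the last update of the hidden units, use the probability itself
        # to avoid unnecessary sampling noise.
        p_h_recon = sigmoid(b + p_v_recon @ W)

        # 3.3: collect the pairwise statistics with probabilities to reduce noise.
        positive = np.outer(v_data, p_h_data)
        negative = np.outer(p_v_recon, p_h_recon)
        return positive - negative   # multiply by the learning rate to get delta W

A usage example under these assumptions: rng = np.random.default_rng(0); W += epsilon * cd1_statistics(v, W, a, b, rng).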
3.4 A recipe for getting the learning signal for CD1

When the hidden units are being driven by data, always use stochastic binary states. When they are being driven by reconstructions, always use probabilities without sampling. Assuming the visible units use the logistic function, use real-valued probabilities for both the data and the reconstructions [3]. When collecting the pairwise statistics for learning weights or the individual statistics for learning biases, use the probabilities, not the binary states, and make sure the weights have random initial values to break symmetry.

[3] So there is nothing random about the generation of the reconstructions given the binary states of the hidden units.

4 The size of a mini-batch

It is possible to update the weights after estimating the gradient on a single training case, but it is often more efficient to divide the training set into small "mini-batches" of 10 to 100 cases [4]. This allows matrix-matrix multiplies to be used, which is very advantageous on GPU boards or in Matlab.

To avoid having to change the learning rate when the size of a mini-batch is changed, it is helpful to divide the total gradient computed on a mini-batch by the size of the mini-batch, so when talking about learning rates we will assume that they multiply the average, per-case gradient computed on a mini-batch, not the total gradient for the mini-batch.

It is a serious mistake to make the mini-batches too large when using stochastic gradient descent. Increasing the mini-batch size by a factor of N leads to a more reliable gradient estimate, but it does not increase the maximum stable learning rate by a factor of N, so the net effect is that the weight updates are smaller per gradient evaluation [5].

[4] The word "batch" is confusing and will be avoided, because when it is used to contrast with "on-line" it usually means the entire training set.

[5] The easy way to parallelize the learning on a cluster is to divide each mini-batch into sub-mini-batches and to use different cores to compute the gradients on each sub-mini-batch. The gradients computed by different cores must then be combined. To minimize the ratio of communication to computation, it is tempting to make the sub-mini-batches large. This usually makes the learning much less efficient, thus wiping out much of the gain achieved by using multiple cores (Vinod Nair, personal communication, 2007).

4.1 A recipe for dividing the training set into mini-batches

For datasets that contain a small number of equiprobable classes, the ideal mini-batch size is often equal to the number of classes and each mini-batch should contain one example of each class to reduce the sampling error when estimating the gradient for the whole training set from a single mini-batch. For other datasets, first randomize the order of the training examples, then use mini-batches of size about 10.
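A minimal Python/NumPy sketch of this recipe (not from the report): the training array is shuffled and cut into mini-batches of about 10 cases, and the total gradient on each mini-batch is divided by the mini-batch size so that the learning rate multiplies a per-case average. The names are hypothetical, and per_case_gradient stands for any function that returns the CD1 weight gradient for one case (for example, the cd1_statistics sketch above).

    import numpy as np

    def minibatches(data, batch_size=10, rng=None):
        # data: array of training cases, shape (num_cases, num_visible).
        # Randomize the order of the training examples, then yield mini-batches.
        rng = rng or np.random.default_rng()
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            yield data[order[start:start + batch_size]]

    def train_one_epoch(data, W, a, b, epsilon, rng, per_case_gradient):
        # The total gradient on each mini-batch is divided by the mini-batch size,
        # so epsilon multiplies the average, per-case gradient.
        for batch in minibatches(data, rng=rng):
            grad = sum(per_case_gradient(v, W, a, b, rng) for v in batch) / len(batch)
            W += epsilon * grad
        return W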
5 Monitoring the progress of learning

It is easy to compute the squared error between the data and the reconstructions, so this quantity is often printed out during learning. The reconstruction error on the entire training set should fall rapidly and consistently at the start of learning and then more slowly. Due to the noise in the gradient estimates, the reconstruction error on the individual mini-batches will fluctuate gently after the initial rapid descent. It may also oscillate gently with a period of a few mini-batches when using high momentum (see section 9).

Although it is convenient, the reconstruction error is actually a very poor measure of the progress of learning. It is not the function that CDn learning is approximately optimizing, especially for n >> 1, and it systematically confounds two different quantities that are changing during the learning. The first is the difference between the empirical distribution of the training data and the equilibrium distribution of the RBM. The second is the mixing rate of the alternating Gibbs Markov chain. If the mixing rate is very low, the reconstruction error will be very small even when the distributions of the data and the model are very different. As the weights increase, the mixing rate falls, so decreases in reconstruction error do not necessarily mean that the model is improving and, conversely, small increases do not necessarily mean the model is getting worse. Large increases, however, are a bad sign except when they are temporary and caused by changes in the learning rate, momentum, weight-cost or sparsity meta-parameters.

5.1 A recipe for using the reconstruction error

Use it but don't trust it. If you really want to know what is going on during the learning, use multiple histograms and graphic displays as described in section 15. Also consider using Annealed Importance Sampling (Salakhutdinov and Murray, 2008) to estimate the density on held-out data. If you are learning a joint density model of labelled data (see section 16), consider monitoring the discriminative performance on the training data and on a held-out validation set.

6 Monitoring the overfitting

When learning a generative model, the obvious quantity to monitor is the probability that the current model assigns to a datapoint. When this probability starts to decrease for held-out validation data, it is time to stop learning. Unfortunately, for large RBMs, it is very difficult to compute this probability because it requires knowledge of the partition function. Nevertheless, it is possible to directly monitor the overfitting by comparing the free energies of training data and held-out validation data. In this comparison, the partition function cancels out. The free energy of a data vector can be computed in a time that is linear in the number of hidden units (see section 16.1). If the model is not overfitting at all, the average free energy should be about the same on training and validation data. As the model starts to overfit, the average free energy of the validation data will rise relative to the average free energy of the training data.
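A minimal Python/NumPy sketch of this comparison (not from the report), assuming binary hidden units and the standard closed-form free energy that section 16.1 of the guide describes; the function and array names are illustrative assumptions.

    import numpy as np

    def free_energy(v, W, a, b):
        # Standard free energy of a visible vector for an RBM with binary hidden
        # units: F(v) = -sum_i a_i v_i - sum_j log(1 + exp(b_j + sum_i v_i w_ij)).
        # This takes time linear in the number of hidden units.
        x = b + v @ W
        return -np.dot(a, v) - np.sum(np.logaddexp(0.0, x))

    def overfitting_gap(train_data, valid_data, W, a, b):
        # Compare average free energies; the partition function cancels out.
        # A validation average that rises relative to the training average is a
        # sign of overfitting.
        f_train = np.mean([free_energy(v, W, a, b) for v in train_data])
        f_valid = np.mean([free_energy(v, W, a, b) for v in valid_data])
        return f_valid - f_train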