3D Object Recognition with Deep Belief Nets

Vinod Nair and Geoffrey E. Hinton
Department of Computer Science, University of Toronto
10 King's College Road, Toronto, M5S 3G5 Canada
{vnair,hinton}@cs.toronto.edu

Abstract

We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error.

1 Introduction

Recent work on deep belief nets (DBNs) [10], [13] has shown that it is possible to learn multiple layers of non-linear features that are useful for object classification without requiring labeled data. The features are trained one layer at a time as a restricted Boltzmann machine (RBM) using contrastive divergence (CD) [4], or as some form of autoencoder [20], [16], and the feature activations learned by one module become the data for training the next module. After a pre-training phase that learns layers of features which are good at modeling the statistical structure in a set of unlabeled images, supervised backpropagation can be used to fine-tune the features for classification [7]. Alternatively, classification can be performed by learning a top layer of features that models the joint density of the class labels and the highest layer of unsupervised features [6]. These unsupervised features (plus the class labels) then become the penultimate layer of the deep belief net [6].

Early work on deep belief nets was evaluated using the MNIST dataset of handwritten digits [6], which has the advantage that a few million parameters are adequate for modeling most of the structure in the domain. For 3D object classification, however, many more parameters are probably required to allow a deep belief net with no prior knowledge of spatial structure to capture all of the variations caused by lighting and viewpoint. It is not yet clear how well deep belief nets perform at 3D object classification when compared with shallow techniques such as SVMs [19], [3] or deep discriminative techniques like convolutional neural networks [11].

In this paper, we describe a better type of top-level model for deep belief nets that is trained using a combination of generative and discriminative gradients [5], [8], [9]. We evaluate the model on NORB [12], which is a carefully designed object recognition task that requires generalization to novel object instances under varying lighting conditions and viewpoints. Our model significantly outperforms SVMs, and it also outperforms convolutional neural nets when given additional unlabeled data produced by small translations of the training images.

We use restricted Boltzmann machines trained with one-step contrastive divergence as our basic module for learning layers of features. These are fully described elsewhere [6], [1] and the reader is referred to those sources for details.
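For concreteness, the following is a minimal sketch of the one-step contrastive divergence (CD-1) update for an ordinary binary RBM, the basic module referred to above. It is an illustration of the standard CD-1 procedure in NumPy, not the authors' code; the function and variable names are my own, and biases are omitted as in the paper's notation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, rng, lr=0.1):
    """One-step contrastive divergence (CD-1) for a binary RBM.

    W  : (num_visible, num_hidden) weight matrix (biases omitted)
    v0 : (batch, num_visible) binary data vectors (a mini-batch)
    """
    # Positive phase: sample hidden units given the data.
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)

    # Negative phase: one reconstruction step, then hidden probabilities again.
    v1_prob = sigmoid(h0 @ W.T)
    h1_prob = sigmoid(v1_prob @ W)

    # CD-1 gradient estimate: positive minus negative correlations, averaged over the batch.
    grad = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
    return W + lr * grad

Stacking such modules (each trained on the feature activations of the previous one) is the greedy pre-training procedure described above.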
Figure 1: The Third-Order Restricted Boltzmann Machine. (a) Every clique in the model contains a visible unit, a hidden unit, and a label unit. (b) Our shorthand notation for representing the clique in (a). (c) A model with two of each unit type. There is one clique for every possible triplet of units created by selecting one of each type. The "restricted" architecture precludes cliques with multiple units of the same type. (d) Our shorthand notation for representing the model in (c). (e) The 3D tensor of parameters for the model in (c). The architecture is the same as that of an implicit mixture of RBMs [14], but the inference and learning algorithms have changed.

2 A Third-Order RBM as the Top-Level Model

Until now, the only top-level model that has been considered for a DBN is an RBM with two types of observed units (one for the label, another for the penultimate feature vector). We now consider an alternative model for the top-level joint distribution in which the class label multiplicatively interacts with both the penultimate layer units and the hidden units to determine the energy of a full configuration. It is a Boltzmann machine with three-way cliques [17], each containing a penultimate layer unit v_i, a hidden unit h_j, and a label unit l_k. See figure 1 for a summary of the architecture. Note that the parameters now form a 3D tensor, instead of a matrix as in the earlier, bipartite model.

Consider the case where the components of v and h are stochastic binary units, and l is a discrete variable with K states represented by 1-of-K encoding. The model can be defined in terms of its energy function

E(v, h, l) = − \sum_{i,j,k} W_{ijk} v_i h_j l_k,   (1)

where W_{ijk} is a learnable scalar parameter. (We omit bias terms from all expressions for clarity.) The probability of a full configuration {v, h, l} is then

P(v, h, l) = exp(−E(v, h, l)) / Z,   (2)

where Z = \sum_{v', h', l'} exp(−E(v', h', l')) is the partition function. Marginalizing over h gives the distribution over v and l alone.

The main difference between the new top-level model and the earlier one is that now the class label multiplicatively modulates how the visible and hidden units contribute to the energy of a full configuration. If the label's kth unit is 1 (and the rest are 0), then the kth slice of the tensor determines the energy function. In the case of soft activations (i.e. more than one label has non-zero probability), a weighted blend of the tensor's slices specifies the energy function. The earlier top-level (RBM) model limits the label's effect to changing the biases into the hidden units, which modifies only how the hidden units contribute to the energy of a full configuration. There is no direct interaction between the label and the visible units. Introducing direct interactions among all three sets of variables allows the model to learn features that are dedicated to each class. This is a useful property when the object classes have substantially different appearances that require very different features to describe. Unlike an RBM, the model structure is not bipartite, but it is still "restricted" in the sense that there are no direct connections between two units of the same type.
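As an illustration of equations 1 and 2, the sketch below computes the energy of a configuration and shows how clamping the label to its kth state reduces the 3D tensor to an ordinary RBM weight matrix. The tensor layout (Nv, Nh, Nl), the function names, and the example sizes are my own choices for illustration, not specifications from the paper.

import numpy as np

def energy(W, v, h, l):
    """E(v, h, l) = -sum_{i,j,k} W_ijk * v_i * h_j * l_k  (equation 1, biases omitted).

    W : (Nv, Nh, Nl) parameter tensor
    v : (Nv,) binary visible vector; h : (Nh,) binary hidden vector
    l : (Nl,) label vector in 1-of-K encoding
    """
    return -np.einsum('ijk,i,j,k->', W, v, h, l)

def slice_rbm_weights(W, k):
    """With l_k = 1 clamped, the model reduces to an RBM whose weights are
    the kth slice of the tensor along the label dimension."""
    return W[:, :, k]          # shape (Nv, Nh)

# Example: the energy under a clamped label equals the energy of the kth-slice RBM.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((8, 6, 5))   # Nv=8, Nh=6, Nl=5 (arbitrary sizes)
v = rng.integers(0, 2, 8)
h = rng.integers(0, 2, 6)
l = np.eye(5)[2]                            # label k=2 in 1-of-K encoding
assert np.isclose(energy(W, v, h, l), -v @ slice_rbm_weights(W, 2) @ h)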
2.1 Inference

The distributions that we would like to be able to infer are P(l|v) (to classify an input), and P(v, l|h) and P(h|v, l) (for CD learning). Fortunately, all three distributions are tractable to sample from exactly.

The simplest case is P(h|v, l). Once l is observed, the model reduces to an RBM whose parameters are the kth slice of the 3D parameter tensor. As a result, P(h|v, l) is a factorized distribution that can be sampled exactly.

For a restricted third-order model with N_v visible units, N_h hidden units and N_l class labels, the distribution P(l|v) can be exactly computed in O(N_v N_h N_l) time. This result follows from two observations: 1) setting l_k = 1 reduces the model to an RBM defined by the kth slice of the tensor, and 2) the negative log probability of v under this RBM is, up to an additive constant, the free energy

F_k(v) = − \sum_{j=1}^{N_h} log(1 + exp(\sum_{i=1}^{N_v} W_{ijk} v_i)).   (3)

The idea is to first compute F_k(v) for each setting of the label, and then convert them to a discrete distribution by taking the softmax of the negative free energies:

P(l_k = 1 | v) = exp(−F_k(v)) / \sum_{k'=1}^{N_l} exp(−F_{k'}(v)).   (4)

Equation 3 requires O(N_v N_h) computation, which is repeated N_l times for a total of O(N_v N_h N_l) computation.

We can use the same method to compute P(l|h). Simply switch the roles of v and h in equation 3 to compute the free energy of h under the kth RBM. (This is possible since the model is symmetric with respect to v and h.) Then convert the resulting N_l free energies to the probabilities P(l_k = 1|h) with the softmax function.

Now it becomes possible to exactly sample P(v, l|h) by first sampling \tilde{l} ∼ P(l|h). Suppose \tilde{l}_k = 1. Then the model reduces to its kth-slice RBM, from which \tilde{v} ∼ P(v | h, \tilde{l}_k = 1) can be easily sampled. The final result {\tilde{v}, \tilde{l}} is an unbiased sample from P(v, l|h).
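The free-energy computation in equations 3 and 4 translates directly into a few lines of array code. The sketch below classifies a batch of penultimate-layer vectors by computing one free energy per label slice and taking a softmax over the negated values; it assumes the same (Nv, Nh, Nl) tensor layout as the earlier sketch, uses names of my own choosing, and again omits biases.

import numpy as np

def label_posterior(W, v):
    """P(l | v) for the third-order model (equations 3 and 4), biases omitted.

    W : (Nv, Nh, Nl) parameter tensor
    v : (batch, Nv) visible (penultimate-layer) vectors
    Returns an array of shape (batch, Nl) of label probabilities.
    """
    # Total input to each hidden unit under each label slice: shape (batch, Nh, Nl).
    x = np.einsum('bi,ijk->bjk', v, W)
    # Free energy F_k(v) = -sum_j log(1 + exp(x_jk)), computed stably with logaddexp.
    F = -np.sum(np.logaddexp(0.0, x), axis=1)            # shape (batch, Nl)
    # Softmax of the negative free energies (equation 4), max-shifted for stability.
    neg_F = -F
    neg_F -= neg_F.max(axis=1, keepdims=True)
    p = np.exp(neg_F)
    return p / p.sum(axis=1, keepdims=True)

def classify(W, v):
    """Predicted class index: argmax_k P(l_k = 1 | v)."""
    return np.argmax(label_posterior(W, v), axis=1)

The O(N_v N_h N_l) cost quoted above shows up here as the single einsum contraction over the full parameter tensor.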
2.2 Learning

Given a set of N labeled training cases {(v^1, l^1), (v^2, l^2), ..., (v^N, l^N)}, we want to learn the 3D parameter tensor W for the restricted third-order model. When trained as the top-level model of a DBN, the visible vector v is a penultimate layer feature vector. We can also train the model directly on images as a shallow model, in which case v is an image (in row vector form). In both cases the label l represents the N_l object categories using 1-of-N_l encoding. For the same reasons as in the case of an RBM, maximum likelihood learning is intractable here as well, so we rely on contrastive divergence learning instead. CD was originally formulated in the context of the RBM and its bipartite architecture, but here we extend it to the non-bipartite architecture of the third-order model.

An unbiased estimate of the maximum likelihood gradient can be computed by running a Markov chain that alternately samples P(h|v, l) and P(v, l|h) until it reaches equilibrium. Contrastive divergence uses the parameter updates given by three half-steps of this chain, with the chain initialized from a training case (rather than a random state). As explained in section 2.1, both of these distributions are easy to sample from. The steps for computing the CD parameter updates are summarized below:

Contrastive divergence learning of P(v, l):
1. Given a labeled training pair {v^+, l^+_k = 1}, sample h^+ ∼ P(h | v^+, l^+_k = 1).
2. Compute the outer product D^+_k = v^+ h^{+T}.
3. Sample {v^-, l^-} ∼ P(v, l | h^+). Let m be the index of the component of l^- set to 1.
4. Sample h^- ∼ P(h | v^-, l^-_m = 1).
5. Compute the outer product D^-_m = v^- h^{-T}.

Let W_{·,·,k} denote the N_v × N_h matrix of parameters corresponding to the kth slice along the label dimension of the 3D tensor. Then the CD update for W_{·,·,k} is:

ΔW_{·,·,k} = D^+_k − D^-_k,   (5)
W_{·,·,k} ← W_{·,·,k} + η ΔW_{·,·,k},   (6)

where η is a learning rate parameter. Typically, the updates computed from a "mini-batch" of training cases (a small subset of the entire training set) are averaged together into one update and then applied to the parameters.
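A minimal sketch of one such CD update for a single labeled case is given below, again in NumPy with my own variable names and with biases omitted. Where the sampled label differs from the true label, I apply the negative statistics to the sampled label's slice; this is my reading of steps 3 to 5 above, and it reduces to equation 5 in the common case where the two labels coincide.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_update_joint(W, v_pos, k, rng, lr=0.01):
    """One contrastive-divergence update for the joint model P(v, l),
    following the five steps listed above. Returns an updated copy of W.

    W     : (Nv, Nh, Nl) parameter tensor (biases omitted)
    v_pos : (Nv,) binary visible vector of a labeled training case
    k     : index of the true label (l_k = 1)
    """
    # Step 1: h+ ~ P(h | v+, l_k = 1), i.e. the hidden units of the kth-slice RBM.
    h_prob = sigmoid(v_pos @ W[:, :, k])
    h_pos = (rng.random(h_prob.shape) < h_prob).astype(float)

    # Step 2: positive-phase statistics D+_k = v+ h+^T.
    D_pos = np.outer(v_pos, h_pos)

    # Step 3: sample {v-, l-} ~ P(v, l | h+): first the label from the softmax of
    # the negative free energies of h+ under each slice, then the visibles.
    x = np.einsum('ijk,j->ik', W, h_pos)                 # input to each visible unit, per slice
    F = -np.sum(np.logaddexp(0.0, x), axis=0)            # free energy of h+ per label
    p_label = np.exp(-(F - F.min()))
    p_label /= p_label.sum()
    m = rng.choice(W.shape[2], p=p_label)                # index of the sampled label l-
    v_neg = (rng.random(v_pos.shape) < sigmoid(W[:, :, m] @ h_pos)).astype(float)

    # Steps 4 and 5: h- ~ P(h | v-, l_m = 1) and D-_m = v- h-^T.
    h_neg_prob = sigmoid(v_neg @ W[:, :, m])
    D_neg = np.outer(v_neg, h_neg_prob)

    # Equations 5 and 6: add positive statistics to the true label's slice and
    # subtract negative statistics from the sampled label's slice (same slice when m == k).
    W = W.copy()
    W[:, :, k] += lr * D_pos
    W[:, :, m] -= lr * D_neg
    return W

In practice the per-case updates would be averaged over a mini-batch before being applied, as described above.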
3 Combining Gradients for Generative and Discriminative Models

In practice the Markov chain used in the learning of P(v, l) can suffer from slow mixing. In particular, the label l^- generated in step 3 above is unlikely to be different from the true label l^+ of the training case used in step 1. Empirically, the chain has a tendency to stay "stuck" on the same state for the label variable because in the positive phase the hidden activities are inferred with the label clamped to its true value. So the hidden activities contain information about the true label, which gives it an advantage over the other labels.

Consider the extreme case where we initialize the Markov chain with a training pair {v^+, l^+_k = 1} and the label variable never changes from its initial state during the chain's entire run. In effect, the model that ends up being learned is a class-conditional generative distribution P(v | l_k = 1), represented by the kth-slice RBM. The parameter updates are identical to those for training N_l independent RBMs, one per class, with only the training cases of each class being used to learn the RBM for that class. Note that this is very different from the model in section 2: here the energy functions implemented by the class-conditional RBMs are learned independently and their energy units are not commensurate with each other.

Alternatively, we can optimize the same set of parameters to represent yet another distribution, P(l|v). The advantage in this case is that the exact gradient needed for maximum likelihood learning, ∂ log P(l|v)/∂W, can be computed in O(N_v N_h N_l) time. The gradient expression can be derived with some straightforward differentiation of equation 4. The disadvantage is that it cannot make use of unlabeled data. Also, as the results show, learning a purely discriminative model at the top level of a DBN gives much worse performance.

However, now a new way of learning P(v, l) becomes apparent: we can optimize the parameters by using a weighted sum of the gradients for log P(v|l) and log P(l|v). As explained below, this approach 1) avoids the slow mixing of the CD learning for P(v, l), and 2) allows learning with both labeled and unlabeled data. It resembles pseudo-likelihood in how it optimizes the two conditional distributions in place of the joint distribution, except here one of the conditionals (P(v|l)) is still learned only approximately. In our experiments, a model trained with this hybrid learning algorithm has the highest classification accuracy, beating both a generative model trained using CD and a purely discriminative model. The main steps of the algorithm are listed below.

Hybrid learning algorithm for P(v, l):

Let {v^+, l^+_k = 1} be a labeled training case.

Generative update: CD learning of P(v|l)
1. Sample h^+ ∼ P(h | v^+, l^+_k = 1).
2. Compute the outer product D^+_k = v^+ h^{+T}.
3. Sample v^- ∼ P(v | h^+, l^+_k = 1).
4. Sample h^- ∼ P(h | v^-, l^+_k = 1).
5. Compute the outer product D^-_k = v^- h^{-T}.
6. Compute the update ΔW^g_{·,·,k} = D^+_k − D^-_k.

Discriminative update: ML learning of P(l|v)
1. Compute log P(l_c = 1 | v^+) for c ∈ {1, ..., N_l}.
2. Using the result from step 1 and the true label l^+_k = 1, compute the update ΔW^d_{·,·,c} = ∂ log P(l|v)/∂W_{·,·,c} for c ∈ {1, ..., N_l}.

The two types of update for the cth slice of the tensor W_{·,·,c} are then combined by a weighted sum:

W_{·,·,c} ← W_{·,·,c} + η (ΔW^g_{·,·,c} + λ ΔW^d_{·,·,c}),   (7)

where λ is a parameter that sets the relative weighting of the generative and discriminative updates, and η is the learning rate. As before, the updates from a mini-batch of training cases can be averaged together and applied as a single update to the parameters. In experiments, we set λ by trying different values and evaluating classification accuracy on a validation set.

Note that the generative part of the above algorithm is simply CD learning of the RBM for the kth class. The earlier problem of slow mixing does not appear in the hybrid algorithm because the chain in the generative part does not involve sampling the label.

Semi-supervised learning: The hybrid learning algorithm can also make use of unlabeled training cases by treating their labels as missing inputs. The model first infers the missing label by sampling P(l | v_u) for an unlabeled training case v_u. The generative update is then computed by treating the inferred label as the true label. (The discriminative update will always be zero in this case.) Therefore the unlabeled training cases contribute an extra generative term to the parameter update.
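The sketch below combines the two updates as in equation 7 for a single labeled case (NumPy, biases omitted, names mine). The paper states only that the discriminative gradient follows from differentiating equation 4; the closed-form expression used here, (t_c − P(l_c = 1|v)) · v_i σ(x_{jc}) for slice c, is my own derivation from equations 3 and 4 rather than a formula quoted from the text.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_update(W, v_pos, k, rng, lr=0.01, lam=1.0):
    """One hybrid update (equation 7) for a single labeled case. Returns a new W.

    Generative part: CD learning of the kth class-conditional RBM P(v | l_k = 1).
    Discriminative part: exact gradient of log P(l_k = 1 | v).
    """
    Nv, Nh, Nl = W.shape

    # ----- Generative update: CD on the kth slice; the label is never resampled.
    Wk = W[:, :, k]
    h_pos = (rng.random(Nh) < sigmoid(v_pos @ Wk)).astype(float)
    v_neg = (rng.random(Nv) < sigmoid(Wk @ h_pos)).astype(float)
    h_neg_prob = sigmoid(v_neg @ Wk)
    dW_gen = np.outer(v_pos, h_pos) - np.outer(v_neg, h_neg_prob)

    # ----- Discriminative update: d log P(l_k = 1 | v) / dW.
    x = np.einsum('i,ijc->jc', v_pos, W)                 # hidden inputs per label slice
    F = -np.sum(np.logaddexp(0.0, x), axis=0)            # free energies F_c(v), equation 3
    p = np.exp(-(F - F.min()))
    p /= p.sum()                                         # P(l_c = 1 | v), equation 4
    t = np.zeros(Nl)
    t[k] = 1.0                                           # one-hot true label
    # Per slice c: (t_c - p_c) * v_i * sigmoid(x_jc)  (derived by differentiating eqs 3-4).
    dW_dis = np.einsum('i,jc,c->ijc', v_pos, sigmoid(x), t - p)

    # ----- Combined update (equation 7): the generative term only touches slice k.
    W = W.copy()
    W[:, :, k] += lr * dW_gen
    W += lr * lam * dW_dis
    return W

For an unlabeled case, the semi-supervised variant described above would first sample k from the label posterior (e.g. with a routine like label_posterior in the earlier sketch) and then apply only the generative part of this update.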
4 Sparsity

Discriminative performance is improved by using binary features that are only rarely active. Sparse activities are achieved by specifying a desired probability of being active, p << 1, and then adding an additional penalty term that encourages an exponentially decaying average, q, of the actual probability of being active to be close to p. The natural error measure to use is the cross entropy between the desired and actual distributions: p log q + (1 − p) log(1 − q). For logistic units this has a simple derivative of p − q with respect to the total input to a unit. This derivative is used to adjust both the bias and the incoming weights of each hidden unit. We tried various values for p and 0.1 worked well.

In addition to specifying p it is necessary to specify how fast the estimate of q decays. We used q_new = 0.9 * q_old + 0.1 * q_current, where q_current is the average probability of activation for the current mini-batch of 100 training cases. It is also necessary to specify how strong the penalty term should be, but this is easy to set empirically. We multiply the penalty gradient by a coefficient that is chosen to ensure that, on average, q is close to p but there is still significant variation among the q values for different hidden units. This prevents the penalty term from dominating the learning. One added advantage of this sparseness penalty is that it revives any hidden units whose average activities are much lower than p.

5 Evaluating DBNs on the NORB Object Recognition Task

5.1 NORB Database

For a detailed description see [12]. The five object classes in NORB are animals, humans, planes, trucks, and cars. The dataset comes in two different versions, normalized-uniform and jittered-cluttered. In this paper we use the normalized-uniform version, which has objects centred in the images with a uniform background. There are 10 instances of each object class, imaged under 6 illuminations and 162 viewpoints (18 azimuths × 9 elevations). The instances are split into two disjoint sets (pre-specified in the database) of five each to define the training and test sets, both containing 24,300 cases. So at test time a trained model has to recognize unseen instances of the same object classes.

Pre-processing: A single training (and test) case is a stereo-pair of grayscale images, each of size 96×96. To speed up experiments, we reduce dimensionality by using a "foveal" image representation. The central 64×64 portion of an image is kept at its original resolution. The remaining 16-pixel-wide ring around it is compressed by replacing non-overlapping square blocks of pixels with the average value of a block. We split the ring into four smaller ones: the outermost ring has 8×8 blocks, followed by a ring of 4×4 blocks, and finally two innermost rings of 2×2 blocks. The foveal representation reduces the dimensionality of a stereo-pair from 18432 to 8976. All our models treat the stereo-pair images as 8976-dimensional vectors.

5.2 Training Details

Model architecture: The two main decisions to make when training DBNs are the number of hidden layers to greedily pre-train and the number of hidden units to use in each layer. To simplify the experiments we constrain the number of hidden units to be the same at all layers (including the top-level model). We have tried hidden layer sizes of 2000, 4000, and 8000 units. We have also tried models with two, one, or no greedily pre-trained hidden layers. To avoid clutter, only the results for the best settings of these two parameters are given. The best classification results are given by the DBN with one greedily pre-trained sparse hidden layer of 4000 units (regardless of the type of top-level model).

A DBN trained on the pre-processed input with one greedily pre-trained layer of 4000 hidden units and a third-order model on top of it, also with 4000 hidden units, has roughly 116 million learnable parameters in total. This is roughly two orders of magnitude more parameters than some of the early DBNs trained on the MNIST images [6], [10]. Training such a model in Matlab on an Intel Xeon 3GHz machine takes almost two weeks. See a recent paper by Raina et al. [15] that uses GPUs to train a deep model with roughly the same number of parameters much more quickly. We put Gaussian units at the lowest