The chains model for detecting parts by their context
Leonid Karlinsky Michael Dinerstein Daniel Harari Shimon Ullman
{leonid.karlinsky,michael.dinerstein,danny.harari,shimon.ullman}@weizmann.ac.il
Weizmann Institute of Science, Rehovot 76100, Israel
Abstract
Detecting an object part relies on two sources of information: the appearance of the part itself, and the context supplied by surrounding parts. In this paper we consider problems in which a target part cannot be recognized reliably using its own appearance, such as detecting low-resolution hands, and must be recognized using the context of surrounding parts. We develop the 'chains model', which can locate parts of interest in a robust and precise manner, even when the surrounding context is highly variable and deformable. In the proposed model, the relation between context features and the target part is modeled in a non-parametric manner using an ensemble of feature chains leading from parts in the context to the detection target. The method uses the configuration of the features in the image directly, rather than through fitting an articulated 3-D model of the object. In addition, the chains are composable, meaning that new chains observed in the test image can be composed of sub-chains seen during training. Consequently, the model is capable of handling object poses which are infrequent, or even non-existent, during training. We test the approach in different settings, including object part detection as well as complete object detection. The results show the advantages of the chains model for detecting and localizing parts of complex deformable objects.
1. Introduction
Two main sources of information can be used for detecting a part of interest of an object: the appearance of the part itself, and the context supplied by surrounding parts. In the current paper we focus on detecting the location of a part of a deformable object using the context supplied by other parts of the object (denoted 'internal context' below). We also show that the same approach can be used to detect complete objects using the internal context of their parts. The detection of object parts and objects are interrelated tasks, since object parts can be used for detecting the objects, and the internal context of the object can be used to detect parts. Below, we briefly review existing recognition systems from
Figure 1. Overview of the proposed approach. (a) The goal is to detect object parts (e.g., hand, hoof) of deformable objects (e.g., person, horse) in still images, when the parts of interest are not recognizable on their own; (b) During testing, the inference graph of the chains model is constructed for each test image and used to compute the chains model posterior for the detected part. The edges are color-coded according to their weights (red = high). (c) The star marks the global maximum of the posterior.
the point of view of how they specify the relative locations of object parts, and their ability to detect a part of interest specified to the system, using the surrounding context. Existing approaches can be divided into several groups depending on the way they model geometric relationships between parts. The simplest methods do not explicitly model feature geometry; for example, the so-called bag-of-features approaches, such as [25, 3]. Another popular approach models feature geometry by a star model that allows each object part to have a region of tolerance relative to some object reference point (star center) [16, 8, 13]. The constellation model [9] allows more flexible joint Gaussian relationships between object parts. Finally, there are models that try to capture more complex feature geometry, such as feature hierarchies [6, 23], articulated spring models [7, 21, 10, 2, 14, 5] and level-set methods [22]. In terms of object and part localization, the bag-of-features methods provide only a rough estimate of object location as a cluster of relevant features. The star and the constellation models provide more specific object and part localization, but they are particularly suitable for semi-rigid objects and they do not focus on the detection of specific parts of interest. Feature hierarchies provide more flexible geometry, but are much more complex to learn, and in current approaches
[6, 23] are still restricted to semi-rigid objects. Finally, articulated models and level-set methods assume a known a-priori model of object deformation (with the exception of [21]). Such models may encounter problems when the deformations are complex and only partially covered by the training examples. Moreover, most existing approaches (except [3]) train a relatively small feature codebook (e.g., obtained through quantization of randomly sampled patch descriptors), and all geometric approaches train a fixed model for the spatial arrangement of these features. One problem with using a feature codebook and a fixed spatial model is that they are trained to fit well the majority of the training examples, while infrequent object poses in the training set are usually underrepresented. This is reflected both in the lack of features for these poses, as well as in the spatial model that does not fit well the infrequent poses. This problem becomes particularly significant when the task is the detection of parts of a highly deformable object, such as hand detection, since many of the possible object poses are infrequent in the training set.
The goal of the proposed approach is to obtain precise localization of parts of highly deformable objects. An example goal would be detecting the hand location in images of people captured from a distance or in a pose that does not allow detecting the hand reliably based on its own appearance. The approach is different from past approaches in several respects. First, the model focuses on detecting parts of interest, whereas many recognition models focus on the detection of the object itself, and ambiguous parts may be missed during recognition. Second, the way it captures geometric relations between features is neither through a semi-rigid model, such as a star or a constellation, nor through an articulated model fitted to the majority of the training set. Instead, we propose a new non-parametric probabilistic model (the chains model) and its associated inference algorithm, which can efficiently handle complex non-rigid feature configurations. This is obtained by using multiple feature chains leading from potential source features to the target part. Third, instead of using a relatively small feature codebook, all of the relevant features extracted from the training images are retained and efficiently used during testing. Consequently, the model is able to represent information from all training images, including images containing rare poses. Fourth, all relative pair-wise spatial configurations of the relevant features extracted from the training images are retained and used to build the geometric model for the features on the test image. As a result, even rare poses are not under-represented, both in terms of the features as well as in terms of their geometric model. An overview of the approach is given in figure 1.
The remainder of the paper is organized as follows. Section 2 describes the proposed approach and its implementation details. Section 3 describes the experimental validation. Summary and discussion are provided in section 4.
2. Method
This section describes the proposed 'chains model', the associated feature selection approach and the implementation details of the method. For clarity, the hand detection task will be used for illustrating the proposed method. In addition, we describe the use of the method for full object detection in section 2.3. Intuitively, the chains model describes an assembly of feature chains of arbitrary length that start at some known reference point on the object, such as an automatically detected face, and terminate at the detection target, such as the hand. The chains model is a generative model for the set of features extracted from the test image. It generates some of the features by creating a feature chain between the source (face) and the target (hand) parts. The rest of the features are generated independently from the 'world' distribution, i.e., the marginal distribution over all images. During inference, we marginalize over all possible feature chains between the face and the hand. This is shown to be equivalent to summing over simple paths in a graph connecting the features extracted from the test image, with edge weights that are related to the so-called suspicious coincidence between the neighboring features. We show how we can approximate the solution of the inference problem by a simple computation using the adjacency matrix of the constructed graph. In the algorithm, the method for estimating the different empirical probabilities of the features is similar to the one used by [3], and is based on Kernel Density Estimation (KDE) over neighbors computed using the Approximate Nearest Neighbor (ANN) technique. The details of this computation are described in section 2.5.
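To give a rough feel for the kind of density estimate involved, the sketch below computes a kernel density estimate over a feature's nearest neighbors among stored training descriptors. This is only an illustration under stated assumptions: brute-force search stands in for the ANN technique, and the neighbor count and bandwidth are illustrative values, not the paper's settings.

```python
import numpy as np

def kde_probability(query, train_descriptors, k=50, bandwidth=0.2):
    """Illustrative KDE-over-neighbors estimate (brute-force in place of ANN).

    query: descriptor vector; train_descriptors: (n, d) array of stored
    training descriptors. k and bandwidth are assumed values.
    """
    # Distances from the query descriptor to every stored training descriptor.
    d = np.linalg.norm(train_descriptors - query, axis=1)
    # Keep only the k nearest neighbors.
    nearest = np.sort(d)[:k]
    # Gaussian kernel density estimate over those neighbors.
    return np.mean(np.exp(-0.5 * (nearest / bandwidth) ** 2))
```

A real implementation would replace the brute-force search with an ANN index; the estimate itself is unchanged.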
2.1. Feature extraction
For each image I_n (training or test), the face location is detected using an off-the-shelf method (we used [13]). Denote the location of the detected face by ℓ^f_n = (x^f_n, y^f_n) and its horizontal size by s_n. All other distances, offsets and patch sizes are computed relative to s_n, eliminating the dependence on the scale of the person. We extract a dense set of features centered on all edge points of I_n (detected by [18]), sampled with a spacing of min(0.2 · s_n, 4) pixels. Denote the set of extracted features by F_n = { f^j_n = ⟨SIFT^j_n, X^j_n⟩ | j = 1, 2, ..., k_n }, where k_n is the total number of features extracted from image I_n, and SIFT^j_n is the SIFT descriptor (128-D row vector) [17] of feature j from image I_n, computed for an s_n × s_n sized patch centered at location X^j_n (a 2-D row vector). For training images only, we also denote the marked hand location in image I_n by ℓ^h_n = (x^h_n, y^h_n). For training and quantitative evaluation, ground truth hand locations were semi-automatically identified in some of the hand detection experiments. In these experiments a colored glove was used for fast and simple hand tracking, using a color blob tracker with manual initialization. Although color was used to obtain the ground truth, both training and testing of the chains model were performed on grayscale images in all of the experiments. In addition, experiments on external datasets (of [5] and [10]) verified the proposed approach on people without gloves.

Figure 2. (a) Chains model applied for part detection. The unobserved variable L^h is the location of the target part (e.g., hand). L^f = ℓ^f_N is the observed location of the reference part (e.g., face) in image I_N. The edges symbolize the chains 'feature graph' constructed over the set of observed features {F^j = f^j_N}. Unobserved variables T and M (affecting all the nodes) select a simple path T (red) of length M on the graph. Features not on this path are generated from their respective 'world' distributions. During inference we marginalize over T and M, summing over paths on the graph going from the reference part to each candidate location of the target part. (b) Chains model applied for full object detection as an extended star model. (c) Star model. Unobserved binary variables A^j (one per feature) control whether the feature is generated with respect to the star center or from its 'world' distribution.
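The dense sampling step above can be sketched as a greedy subsampling of the detected edge points. Only the min(0.2 · s_n, 4)-pixel spacing rule comes from the text; the greedy scheme and function names are illustrative assumptions standing in for whatever sampler the authors actually used.

```python
import numpy as np

def sample_feature_centers(edge_points, s_n):
    """Greedily subsample edge points so kept points are at least
    min(0.2 * s_n, 4) pixels apart (the paper's spacing rule).

    edge_points: (k, 2) array of detected edge locations;
    s_n: detected face width (the per-person reference scale).
    """
    spacing = min(0.2 * s_n, 4.0)
    kept = []
    for p in edge_points:
        # Keep a point only if it is at least `spacing` away from all kept ones.
        if all(np.linalg.norm(p - q) >= spacing for q in kept):
            kept.append(p)
    return np.array(kept)
```

Each kept point would then get a SIFT descriptor computed on an s_n × s_n patch around it.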
2.2. Feature selection for target (hand) detection
The training stage proceeds as follows. The algorithm receives a set of training frames in which a person performs various hand movements. In each frame, both the hand and the face locations are marked by a single point each, as described in section 2.1. No motion information is used. The feature selection procedure then retains the 100 features with the highest likelihood ratio from each training image. The ratio Λ^j_n for feature f^j_n in image I_n is computed as:

\Lambda^j_n = \frac{P(L^h = \ell^h_n, F^j = f^j_n, L^f = \ell^f_n)}{P(L^h \neq \ell^h_n, F^j = f^j_n, L^f = \ell^f_n)}

where the random variables L^h and L^f designate the hand and face locations respectively, and F^j is a random variable for the j-th observed feature. Throughout the paper we will use a shorthand form of variable assignments, e.g., P(ℓ^f_n) instead of P(L^f = ℓ^f_n). As we found in our experiments, this feature selection criterion favors features whose presence correlates with the target part location (e.g., many arm features were automatically selected for hand detection). The probabilities used to compute Λ^j_n are estimated as described in section 2.5.
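The selection step above reduces to ranking features by the ratio Λ and keeping the top 100. A minimal sketch, assuming the two joint probabilities have already been estimated per feature (e.g., by the KDE machinery of section 2.5); the eps guard is an implementation detail not specified in the paper:

```python
import numpy as np

def select_features(p_joint_hand, p_joint_not_hand, top_k=100):
    """Rank features by the likelihood ratio
    Lambda = P(hand, feature, face) / P(not-hand, feature, face)
    and keep the top_k. Both inputs are per-feature probability arrays.
    """
    eps = 1e-12  # guards against division by zero (assumed detail)
    ratios = p_joint_hand / (p_joint_not_hand + eps)
    # Indices of the top_k features, highest ratio first.
    order = np.argsort(-ratios)
    return order[:top_k], ratios[order[:top_k]]
```

In the paper this is applied per training image, retaining 100 features from each.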
2.3. The chains model
Model definition. The chains model is a generative model for the features extracted from the test image I_N. The observed variables of the model are the extracted features F_N = { F^j = f^j_N = ⟨SIFT^j_N, X^j_N⟩ } and the detected face (source part) location L^f = ℓ^f_N. The upper index j of the features is according to some arbitrary order of the feature extractor (the order is immaterial for the subsequent computation). The unobserved variables are the location of the hand (target part), denoted by L^h, an arbitrary length M, and an ordered list of feature indices denoted by T = ⟨T(1), ..., T(M)⟩, where 1 ≤ T(i) ≤ k_N and k_N is the number of features extracted from image I_N. The list T defines a simple chain of features (i.e., without repetitions). The joint distribution of all these variables is taken to be:
P_{CH}(M, T, L^h, \ell^f_N, \{f^j_N\}) = P(M,T) \cdot P(L^h \mid f^{T(M)}_N) \cdot \prod_{m=1}^{M-1} P(f^{T(m+1)}_N \mid f^{T(m)}_N) \cdot P(f^{T(1)}_N \mid \ell^f_N) \cdot P(\ell^f_N) \cdot \prod_{j \notin T} P_W(f^j_N)    (1)
In this expression, along the chain we use the conditional probability of each feature given its chain predecessor, and for features outside the chain we take the independent product of their 'world probabilities'. The distribution P(M,T) is taken to be uniform over simple chains of length up to M = 4 and such that the locations of neighboring chain features are neither too close nor too far apart (i.e., for any 1 ≤ m < M: (1/2)·s_N ≤ ‖X^{T(m+1)}_N − X^{T(m)}_N‖ ≤ (3/2)·s_N). P_W(f^j_N) is the 'world' probability for generating the feature f^j_N, i.e., the probability of observing f^j_N either on the object or in the background. P_W(f^j_N) is used to generate all of the features that lie outside the chain T. P(f^{T(1)}_N | ℓ^f_N) is the conditional probability of the transition from the face to the first feature on the chain, and P(L^h | f^{T(M)}_N) is the probability of the transition from the last feature on the chain to the hand. All of these probabilities will be explicitly defined in section 2.5. We next divide the joint P_{CH} by the term \prod_{f^j_N \in F_N} P_W(f^j_N), which is fixed for a given image I_N, thus rewriting the joint in an equivalent form as:
P_{CH}(M, T, L^h, \ldots) \propto P(M,T) \cdot \frac{P(f^{T(1)}_N, \ell^f_N)}{P_W(f^{T(1)}_N)} \cdot P(L^h \mid f^{T(M)}_N) \cdot \prod_{m=1}^{M-1} \frac{P(f^{T(m+1)}_N, f^{T(m)}_N)}{P_W(f^{T(m+1)}_N) \cdot P(f^{T(m)}_N)}    (2)
Here, P(f^{T(m)}_N) is the marginal of the pairwise probability P(f^{T(m+1)}_N, f^{T(m)}_N). The chains model is schematically illustrated in figure 2a.

Inference. The goal of the inference is to compute the posterior distribution over the hand locations:
P_{CH}(L^h \mid \ell^f_N, \{f^j_N\}) \propto \sum_{M,T} P_{CH}(M, T, L^h, \ell^f_N, \{f^j_N\})
To compute this posterior, we define a directed weighted 'feature graph' G_N. The nodes of G_N are {f^j_N}, and edges connect all pairs of features f^j_N and f^k_N for which (1/2)·s_N ≤ ‖X^k_N − X^j_N‖ ≤ (3/2)·s_N. The weight on the directed edge from node f^k_N to node f^j_N is w_{jk} = P(f^j_N, f^k_N) / (P_W(f^j_N) · P(f^k_N)) (a ratio similar to the 'suspicious coincidence'). The adjacency matrix A_N of G_N is a matrix in which entry jk is equal to w_{jk} if the directed edge from f^k_N to f^j_N exists in G_N, and is zero otherwise. Using this notation, the marginal over chains of length smaller than or equal to M can be computed as:
P_{CH}(L^h \mid \ell^f_N, \{f^j_N\}) \propto \sum_{M,T} P_{CH}(M, T, L^h, \ldots) = D \cdot C^* \cdot S \approx D \cdot C \cdot S, \quad \text{where } C = \sum_{m=0}^{M-1} (A_N)^m;

D = \left[ P(L^h \mid f^1_N), \cdots, P(L^h \mid f^{k_N}_N) \right];

S = \left[ \frac{P(f^1_N, \ell^f_N)}{P_W(f^1_N)}, \cdots, \frac{P(f^{k_N}_N, \ell^f_N)}{P_W(f^{k_N}_N)} \right]^T    (3)
where rectangular brackets denote row vectors. Here S is the source vector, containing the edge weights between the source part and its neighbors; C^* is the chains matrix, whose entry jk is the sum of products of weights along all simple paths going from node k to node j in the graph G_N; and D is the target vector, containing the edge weights between the target part and its neighbors. The matrix C is used as an easily computable approximation of C^*, obtained by summing over all paths of length ≤ M, not only over simple paths. The reason to use an approximation is that computing the k best simple paths is costly [12], and in our experiments we have found the simple approximation to be sufficient. Entry jk of the matrix power (A_N)^t is the sum of the weights of all directed paths of length t going from node k to node j, where the weight of a path is the product of the weights of the edges on it. Consequently, finding L^h that maximizes eq. 3 is equivalent to max-marginal inference over the posterior of the chains model. Thus the inference in the model is simple and straightforward, involving only a few matrix multiplications. Interestingly, it is also possible to compute in a similar manner the Maximum A-Posteriori (MAP) inference for the chains model, using the observation that \sqrt[r]{a^r + b^r} \to \max(a, b) as r \to \infty. Raising each entry of D, A_N and S to a sufficiently large power r, obtaining the respective D_r, A^r_N, S_r, and C_r = \sum_{m=0}^{M-1} (A^r_N)^m, and maximizing \sqrt[r]{D_r \cdot C_r \cdot S_r} becomes equivalent to MAP inference for the chains model. We have experimentally observed that using a combination of MAP and max-marginal inference attains the best performance. We maximize D \cdot \sqrt[r]{C_r \cdot S_r}, where the r-th root is taken entry-wise. This is equivalent to maximizing a sum over all of the features in terms of their votes for possible hand locations, where each feature's vote is determined by the single maximal chain reaching it from the source part. We use this method with r = 10 in all our experiments. Finally, it is also possible to use the same set of selected features for detecting multiple object parts by replacing the vector D with a matrix in which each row corresponds to a part being detected.

Star as a special case. Notably, setting C = I in eq. 3 reduces the chains model to a variant of the popular star model, illustrated in figure 2c. This variant allows each feature to independently choose to be generated either by the object or by the world distribution. The proof of this reduction and the detailed description of the star model variant are beyond the scope of the current paper and are provided in the supplementary material. In section 3 we quantitatively show the advantage of the chains model over the star model for highly deformable part detection.
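For a given candidate hand location, the inference of eq. 3 together with the mixed max-marginal/MAP score amounts to a few matrix operations. The sketch below assumes the pairwise 'suspicious coincidence' weights W, the source vector S, and the target vector D have already been estimated (in the paper they come from the KDE estimates of section 2.5); only the distance gating and the matrix recursion follow the text, the rest is illustrative.

```python
import numpy as np

def chains_scores(W, X, S, D, s_N, M=4, r=10):
    """Compute the max-marginal score D.C.S of eq. 3 and the mixed
    variant that roots the entry-wise powered chains sum.

    W[j, k]: suspicious-coincidence ratio P(f_j, f_k) / (P_W(f_j) * P(f_k));
    X: (k, 2) feature locations; S, D: source/target edge-weight vectors;
    s_N: face size; M: maximum chain length; r: power for the MAP-style root.
    """
    k = len(X)
    # Gate edges: connect features whose distance lies in [s_N/2, 3*s_N/2].
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = np.where((dist >= 0.5 * s_N) & (dist <= 1.5 * s_N), W, 0.0)

    # Max-marginal: C = sum_{m=0}^{M-1} A^m, score = D . C . S.
    C = np.zeros((k, k))
    Am = np.eye(k)
    for _ in range(M):
        C += Am
        Am = A @ Am
    marginal = D @ (C @ S)

    # Mixed variant: raise A and S entry-wise to power r, take the
    # entry-wise r-th root of C_r . S_r, then weight by D.
    Ar, Sr = A ** r, S ** r
    Cr = np.zeros((k, k))
    Arm = np.eye(k)
    for _ in range(M):
        Cr += Arm
        Arm = Ar @ Arm
    mixed = D @ ((Cr @ Sr) ** (1.0 / r))
    return marginal, mixed
```

Evaluating this over all candidate target locations (i.e., replacing D with one row per candidate) yields the posterior map whose maximum is the detected part.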
2.4. Full object detection
The chains model can also be applied, with some modifications, to the task of full object detection. Similar to part detection, the model can use the test image features to point at the object location, either directly or via intermediate feature chains (see figure 2b). In this section we describe how to apply the chains model to object detection. In the final discussion we consider the integration of object and object-part detection within the chains model. For detection at the object level, we eliminate the detection of the chain's source part, thereby allowing an arbitrary source location. As we no longer have a reference scale (obtained from the source size), the features' SIFT descriptors are computed for patches of a fixed predefined size (20×20 pixels). Instead, multi-scale support is achieved by testing each image at multiple scales. The training stage runs on a set of positive images (with objects roughly aligned in lo-