Online Learning of Robust Facial Feature Trackers


Tim Sheerman-Chase, Eng-Jon Ong and Richard Bowden
CVSSP, University of Surrey, Guildford, Surrey GU2 7XH, United Kingdom
t.sheerman-chase,e.ong,r.bowden@surrey.ac.uk

Abstract

This paper presents a head pose and facial feature estimation technique that works over a wide range of pose variations without a priori knowledge of the appearance of the face. Using simple LK trackers, head pose is estimated by Levenberg-Marquardt (LM) pose estimation using the feature tracking as constraints. Factored sampling and RANSAC are employed to both provide a robust pose estimate and identify tracker drift by constraining outliers in the estimation process. The system provides both a head pose estimate and the position of facial features and is capable of tracking over a wide range of head poses.

1. Introduction

This paper presents an approach for tracking facial features and estimating the pose of the head throughout a video sequence without an a priori model of appearance. The approach uses online learning to build a model of appearance on-the-fly, using a generic 3D shape model to remove the tracking drift inherent to online approaches.

For many applications involving the face, the accurate tracking of facial features is an important first step that precedes further processing, e.g. identity verification, expression or action recognition. However, robust tracking of facial features is very challenging due to changing facial expression and pose. Additionally, one needs to be able to cope with self occlusions and lighting changes. For these reasons, attempting to robustly track facial features without a priori knowledge of their appearance is a difficult task. Appearance can be learned on-the-fly; however, such approaches can fail drastically through the accumulation of errors during the learning process.
While the appearance of the face may vary between individuals and settings, faces are structurally the same, and this structure can be used to reduce errors during the online learning process.

Existing work tends to address the above problems in two ways: 2D/3D model-based approaches, or pose-specific piecewise tracking models. For model-based approaches, a popular method is to use a 2D model for the face [5, 13, 9, 14], but these models only approximate the shape of the face at near frontal head pose. Alternatively, various 3D head models can be used [7, 1, 12, 4, 2]. These more complex models require accurate initialization and are computationally expensive. 3D head models can range from simple cylindrical and ellipsoidal models to complex polygonal approximations, usually used for pose estimation. Many of these techniques use template update, incrementally modifying the expected appearance to correspond with the observed appearance in order to achieve pose invariant tracking. More accurate model-based facial feature tracking can be achieved using a realistic 3D face model of sufficient detail for rendering. This allows for a realistic reconstruction of the visual appearance of the face [11]. However, these models are complicated and deforming the model to fit the image is usually non-trivial.

The next class of approaches couples a set of pose-specific trackers with a switching mechanism (e.g. a pose estimator) to decide which of these trackers to use. An example is proposed by Kanaujia et al. 2006 [6], where multiple 2D pose-specific Active Shape Models (ASMs) are coupled with a switching mechanism using SIFT descriptors. Peyras et al. 2008 [10] used a pool of Active Appearance Models (AAMs) that were specialized at different poses and expressions but were not robust to changes in illumination. However, in order to train a reliable ASM or AAM model, a large amount of training data in the form of labeled shapes was needed.
This paper proposes a novel framework (Figure 1) that combines elements of both approaches described above. Specifically, sets of pose-specific facial feature trackers are integrated into a robust pose estimation system detailed in Section 2.5. The robust estimation of pose based on noisy tracking data is a novel extension of Random Sample Consensus (RANSAC) [3]. The tracking is initialised using alignment of a generic 3D model based on features that are easily identified using feature localisation. Following a detailed description of our approach in section 2, the results presented in section 3 demonstrate the accuracy of the approach. Finally, conclusions are drawn in section 4.

2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. 978-1-4244-4441-0/09/$25.00 ©2009 IEEE

Figure 1. Overview of online tracker system.

2. Online Facial Feature Tracking

An overview of the tracking process is given in Figure 1. Online tracking of the face is performed by initialisation on the first frame of the sequence based on a set of easily identifiable landmarks around the mouth, nose and eyes. From these initial position estimates, the pose of the head is determined using a generic face mesh and LM minimisation. A larger set of features is then back projected into the image to initialise a set of trackers over the face region. The 3D face mesh was obtained using a commercial 3dMD system.

Template trackers are initialised at each feature position. The template is a square image patch centred on the feature, and Lucas Kanade tracking is used to perform template alignment.

The first iteration of the sequence is described in section 2.1, in which each feature is represented by a single template. This removes the requirement for weights to be considered during the first iteration. However, as tracking progresses through the sequence, the new appearances of the tracked features are stored for later use.
Later iterations have multiple templates per facial feature point with calculated confidence weights (described in section 2.4). The set of templates represents the appearance of a feature at various poses, and the confidence weights give an indication of which templates match the current pose of the head.

As the feature may revert to a previously observed appearance, all retained templates are used to track at each frame. This is feasible as the templates are small, finite in number and relatively cheap to compute in an LK framework. The confidence weighting assigned to each template tracker is used with RANSAC and factored sampling to determine a robust head pose at each iteration. Weighting is determined by measuring which trackers are in good agreement with the back projected head model and, following pose estimation, the weights are used to control the degree of correction to feature positions. The feature positions are corrected/updated based on the weighted average of the tracker position and the back projected model. Once the tracker positions have been constrained by the model pose estimate, the templates are updated with new appearances from the current frame. The process then repeats this procedure on each consecutive frame of video.

Figure 2. Examples of poor initial head pose alignment. The tracker points on the background indicate the face mesh is not in agreement at the start of the sequence.

The approach enables tracking and adaptation to previously unseen pose angles in the video sequence, while the constraints applied by the 3D head model prevent tracking drift during online learning. We will now discuss each of these processes in turn.

2.1. Tracking Initialisation

The face appearance is represented by a set of $J$ features, each of which is a tuple $(M_j, p_j, z_j)$ where $1 \le j \le J$, $M_j$ is a set of $n_j$ templates, $p_j \in \mathbb{R}^3$ is a face mesh position and $z_j \in \mathbb{R}^2$ is an image position.
The set of templates $M_j$ is a tuple $(K_{i,j}, w_{i,j}, a_{i,j})$ comprising $n_j$ image patches $K_{i,j}$, each with a weight $w_{i,j}$ and an image position $a_{i,j}$, where $1 \le i \le n_j$ ($K_j \in \mathbb{R}^{n_j}$, $w_j \in \mathbb{R}^{n_j}$, $a_j \in \mathbb{R}^{2 \times n_j}$).

At initialisation, 19 key points on the mesh are identified corresponding to simple features that are easy to identify using detection, such as the corners of the mouth, eyes, nose and eyebrows, as depicted in Figure 3. From these known points, the 3D pose of the head is estimated using equation (1) and a further 32 evenly positioned features over the facial model are back projected into the image. For each of the 51 feature points ($J = 51$), the initial template within the image, $K_{i,j}$, is retained at each of the feature positions $a^{F=0}_{i,j}$ along with the estimated rotational pose of the head $P_{i,j}$ ($P_j \in \mathbb{R}^{3 \times n_j}$). As each feature initially corresponds to only one template ($n_j = 1$), initial weights are set to be equal ($w_{i,j} = 1$). Lucas Kanade tracking is then used to estimate the positions of all features on the second frame.

2.2. Head Pose Estimation by LM

Many existing pose estimation techniques use feature tracking/detection and iterative model fitting. LM minimization can be used to determine pose from a point cloud when point correspondence is known [8]. The pose is estimated by minimizing the cost function:

$$F(R, t) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \| a_{i,j} - \mathrm{proj}(R p_j + t) \|^2 \qquad (1)$$

where $\mathrm{proj}()$ is the projection function, $R$ is the current rotation matrix and $t$ is the head translation. The resulting $R$ and $t$ represent the estimated head pose (where $R$ is the rotation matrix corresponding to the Euler angles $\{R_{pitch}, R_{roll}, R_{yaw}\}$ and $t$ is the head translation $\{t_x, t_y, t_z\}$). Perspective geometry is used for the projection function.

2.3. Online Learning

As the appearance of facial features is dependent on the direction of view, periodically retraining a tracker can help the tracker adapt to new appearances.
However, care must be taken that online tracker updates are not used excessively, as this causes tracking drift. Drift can be minimized by performing a template update only when necessary to adapt to new appearances. The central idea of this system is to balance adaptation to the changing appearance of a feature without introducing tracker drift.

Each feature has one or more tracker templates $K_{i,j}$ with a corresponding head pose estimate $P_{i,j}$ at which the template was created. At each iteration, the current head pose estimate $R$ is compared to the poses at which templates were added to the model. If no template exists within an angle $c_j$ of the current frame's estimated head pose, an additional template is added to the model for that feature. To enable gradual adaptation to new appearance, each feature has an independent threshold, which prevents new templates from being collected for all features simultaneously. All previous templates are retained. Consequently, as the head moves in the image, the model gradually accumulates a set of templates representative of the appearance of features at all head orientations.

2.4. Tracking with Multiple Templates

Tracking proceeds in a similar manner to the first iteration for all subsequent frames, except that each landmark on the face has multiple templates that may represent its appearance. Lucas Kanade tracking is used to estimate each template position $a^{F}_{i,j}$ on frame $F$, with the initial tracker position taken from the corresponding landmark $z^{F-1}_{j}$ on frame $F - 1$.

For a specific landmark, some templates will be more suited for tracking than others because the landmark has various possible appearances. To prioritise these trackers, a weighting $w_{i,j}$ of each template is calculated based on its tracking agreement with the face model on frame $F - 1$. This enables trackers with good performance to be preferred over trackers with poor performance in the pose estimation step.
The weightings are assigned as follows:

$$w_{i,j} = e^{-u_{i,j} / s} \qquad (2)$$

where $u_{i,j}$ is the difference in pixels between the tracker prediction and the back projected mesh node $p_j$ ($u_j \in \mathbb{R}^{n_j}$) and $s$ is the agreement scaling factor.

2.5. Estimate pose using Weighted RANSAC LM

Although the use of a confidence weighting reduces the effect of poorly performing trackers on the pose estimate, occasionally a tracker of higher confidence undergoes drift. Conventional LM pose estimation is not robust to outliers resulting from tracker drift. A robust framework that incorporates tracker confidence weightings is described here. This framework extends the Random Sample Consensus (RANSAC) method to incorporate factored sampling.

To enable the preference of high confidence points over points with lower confidence, equation (1) can be modified to incorporate a weighting term:

$$F(R, t) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} w_{i,j} \cdot \| a_{i,j} - \mathrm{proj}(R p_j + t) \|^2 \qquad (3)$$

Figure 3. Position of features used to initialize and track the face. Multiple templates $M_j$ at various poses represent the appearance of each feature.

For a single RANSAC iteration, a random subset $\rho$ of $l$ points is sampled from the distribution $p(\mu_{i,j})$:

$$p(\mu_{i,j}) = \frac{w_{i,j}}{\sum w} \qquad (4)$$

Following this, an LM pose model fit is performed on the random subset $\rho$. The point confidence weights $w$ are used in this minimization. Points not in set $\rho$ are checked to determine if they are in agreement with the model. Support is calculated by summing the agreement with the model, where the agreement is determined by the distance $\| a_{i,j} - \mathrm{proj}(R p_j + t) \|$. If the proportion of the weight of inliers, compared to the weight of all points $\sum_{j=1}^{J} \sum_{i=1}^{n_j} w_{i,j}$, is less than a good agreement threshold $\beta$, the model from this iteration is discarded. The model fit error $E$ is calculated as follows:

$$E = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \| a_{i,j} - \mathrm{proj}(R p_j + t) \| \cdot w_{i,j} \qquad (5)$$

where $E$ is the model fit error in pixels. If the model fit error is lower than that of any previous model fit, the model is stored as the new best model fit.
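The sampling loop described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `confidence_weights`, `weighted_ransac`, `fit_fn`, `predict_fn` and `demo`, the inlier distance `inlier_px`, the iteration count and the default value of β are all hypothetical choices that do not come from the paper. To keep the example short and runnable with NumPy alone, the toy `demo` replaces the LM pose fit with a weighted 2D shift estimate; a real system would plug an LM minimisation of equation (3) into `fit_fn`.

```python
import numpy as np


def confidence_weights(u, s=10.0):
    """Eq. (2): w = exp(-u / s), where u is the pixel distance between a
    tracker prediction and its back-projected mesh node."""
    return np.exp(-np.asarray(u, dtype=float) / s)


def weighted_ransac(observed, weights, fit_fn, predict_fn, l=6, beta=0.5,
                    inlier_px=5.0, iterations=200, rng=None):
    """Weight-proportional sampling plus a weighted inlier test.

    fit_fn(subset_indices) -> model and predict_fn(model) -> (N, 2) array are
    hypothetical callbacks standing in for the LM pose fit and proj(R p + t).
    """
    rng = np.random.default_rng() if rng is None else rng
    observed = np.asarray(observed, dtype=float)
    weights = np.asarray(weights, dtype=float)
    p = weights / weights.sum()                       # Eq. (4)
    best_model, best_error = None, np.inf
    for _ in range(iterations):
        subset = rng.choice(len(weights), size=l, replace=False, p=p)
        model = fit_fn(subset)
        # Simplification: every point is tested, including the sampled ones.
        residual = np.linalg.norm(observed - predict_fn(model), axis=1)
        inliers = residual < inlier_px
        if weights[inliers].sum() / weights.sum() < beta:
            continue                                  # too little support
        error = float(np.sum(residual * weights))     # Eq. (5)
        if error < best_error:
            best_model, best_error = model, error
    return best_model, best_error


def demo(seed=0):
    """Toy run: recover a 2D shift while three trackers have drifted."""
    rng = np.random.default_rng(seed)
    mesh_2d = rng.uniform(0.0, 200.0, size=(20, 2))   # stand-in projections
    observed = mesh_2d + np.array([3.0, -2.0])
    observed[:3] += 40.0                              # three drifted trackers
    u = np.full(20, 1.0)
    u[:2] = 30.0              # two of them already disagreed on frame F-1
    weights = confidence_weights(u)                   # the third looks fine

    def fit_fn(idx):
        # Assumed stand-in for the LM fit: weighted mean offset of the subset.
        offsets = observed[idx] - mesh_2d[idx]
        return np.average(offsets, axis=0, weights=weights[idx])

    def predict_fn(shift):
        return mesh_2d + shift

    return weighted_ransac(observed, weights, fit_fn, predict_fn, rng=rng)
```

In this toy setup, `demo()` recovers the shift (3, -2) even though one of the drifted trackers still carries a high confidence weight, which is the situation the weighted inlier test is meant to handle.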
We overcome the problem of parameter selection for $\beta$ by iterating the process over decreasing $\beta$ values until a RANSAC solution is found.

2.6. Update tracker positions

As described above, each landmark has multiple templates that represent its appearance. To determine a final landmark position on frame $F$ and correct any trackers that drift, the highest weighted template position is combined with the corresponding 3D mesh point back projected onto the image. As LK is an iterative tracking scheme, under normal conditions tracking drift can result in total failure once the tracker is outside the basin of convergence of the LK optimisation. Correcting trackers enables points with a low confidence to be brought back within the tracker's basin of convergence. Tracking performance for features that are occluded or have undergone drift is naturally expected to be inaccurate, and these features will therefore have a low weight. A low weight means they play little or no part in the RANSAC estimation, and they will therefore be corrected using the position from the back projected mesh at this stage of the algorithm:

$$z_j = w_{i,j} \cdot a_{i,j} + (1 - w_{i,j}) \cdot \mathrm{proj}(R p_j + t), \quad i = \arg\max_{i} w_{i,j} \qquad (6)$$

where $z_j$ is the new tracker position and $a_{i,j}$ is the prediction of the template with the highest weight $w_{i,j}$. Processing of this frame concludes with the addition of templates to the model as described in section 2.3.

3. Experimental Results

The proposed system was tested in two experiments. The first determined the head pose estimation accuracy and the second measured the feature tracking accuracy. All tests were performed on the “uniform lighting” video sequences provided by Boston University [7]. The data set contains 45 video sequences of 5 subjects at a resolution of 320×240 pixels, all of which begin with a frontal view of the face. The data includes head pose ground truth recorded using an electromagnetic sensor with an angular accuracy of 0.5 degrees in ideal conditions.

Figure 4. Examples showing online tracking of feature positions at various pose angles of subjects JAM and JIM. Accurate tracking is possible from various view angles of the face.

3.1. Head Pose Estimation

For robust pose estimation, RANSAC was set to use a minimum number of model points $l = 6$. The LK template size was 25×25 pixels, the agreement scaling factor was $s = 10.0$ and the retraining thresholds were randomly selected in the range $c_j = 2 \pm 1$.

Testing the online LK system on 45 sequences, the average estimation errors for pitch, roll and yaw were 3.9°, 3.1° and 4.2° respectively. This is compared to the performance of other methods in Table 1. A comparison of estimated pose with ground truth for two sequences is shown in Figure 5. These graphs show the ground truth pose for roll, pitch and yaw as a dashed (red) line and the predicted pose as a solid (blue) line. As can be seen, the prediction largely follows the ground truth, although there is a noticeable time lag between the signals, which we suspect is due to smoothing of the ground truth, as it is the ground truth that lags behind the prediction. The estimated pose also has a tendency to underpredict at extremes of pose. The jitter in the prediction is a direct result of the RANSAC LM estimation, as no temporal constraints or smoothing were applied.

Average Error (Deg)
Method                            Pitch   Roll   Yaw
Proposed method                   3.9     3.1    4.2
Jang and Kanade 2008 [4]          3.7     4.6    2.1
Choi and Kim 2008 [2], Cylinder   4.4     5.2    2.5
Choi and Kim 2008 [2], Ellipsoid  3.9     4.0    2.8

Table 1. Comparison of performance with other head pose estimation methods.

Average Error (pixels) for Subject
JAM   JIM   LLM   SSM   VAM   Overall
3.1   3.8   5.0   3.2   3.5   3.7

Table 2. Results of 2D facial feature tracking over 45 sequences.

Figure 6. Examples showing online tracking of feature positions at various pose angles of subjects LLM and SSM.
The system initialized accurately on most individuals but was consistently inaccurate on one subject because the face shape was noticeably different from the generic face mesh (see Figure 2).

3.2. Facial Feature Tracking

In addition to the pose estimate, the feature point accuracy is also considered, as the tracking of individual feature points and their motion is important for other applications such as expression recognition. The tracking performance was measured by comparing predicted feature positions $z_j$ to the ground truth feature positions. The test sequences were again the 45 sequences in the “uniform lighting” Boston University data set. Only the 17 features on the face were included in the performance assessment (see Figure 3); the ear positions could not be reliably tracked due to frequent occlusions. Table 2 shows the average error in pixels for 5 subjects from the database (labelled JAM, JIM, LLM, SSM and VAM). For all 45 sequences the average error was 3.7 pixels. Ground truth positions were semi-manually annotated for the sequences. To find the noise level in this ground truth, two sequences were fully manually ground truthed and compared to the semi-automatic method. The average position error was 2.7 pixels. Examples of online tracking are shown in Figures 4, 6 and 7.

Figure 5. Examples of head pose estimation for five sequences. The variation for three rotation axes is shown for each sequence. Blue solid line is predicted pose. Red dashed line is ground truth.

Figure 7. Examples showing online tracking of feature positions at various pose angles of subject VAM.

4. Conclusions

This paper described a system that automatically estimates head pose and facial feature positions without a priori knowledge of the appearance of the face. The system uses robust LM pose estimation within a RANSAC framework to constrain features during on-line learning. The constraints are weighted based on the tracking agreement with a 3D face mesh.
The online learning rate is controlled by changes in head pose. Accurate estimation of head pose is achieved over a wide range of poses, resulting in performance comparable to other state-of-the-art approaches.

5. Future Work

For some subjects, the initial head pose estimate is poor due to differences in head shape, as the system uses a rigid mean model. This also has implications for radical changes in expression, which are not represented in the model. Future work will address robustness to
