Online Learning of Robust Facial Feature Trackers
Tim Sheerman-Chase, Eng-Jon Ong and Richard Bowden
CVSSP, University of Surrey, Guildford, Surrey GU2 7XH, United Kingdom
t.sheerman-chase,e.ong,r.bowden@surrey.ac.uk
Abstract
This paper presents a head pose and facial feature es-
timation technique that works over a wide range of pose
variations without a priori knowledge of the appearance
of the face. Using simple LK trackers, head pose is esti-
mated by Levenberg-Marquardt (LM) pose estimation us-
ing the feature tracking as constraints. Factored sampling
and RANSAC are employed to both provide a robust pose
estimate and identify tracker drift by constraining outliers
in the estimation process. The system provides both a head
pose estimate and the position of facial features and is ca-
pable of tracking over a wide range of head poses.
1. Introduction
This paper presents an approach for the tracking of facial
features and pose estimate of the head throughout a video
sequence without an a priori model of appearance. The ap-
proach uses online learning to build a model of appearance
on-the-fly using a generic 3D shape model to remove track-
ing drift inherent to online approaches.
For many applications involving the face, the accurate
tracking of facial features is an important first step that pre-
cedes further processing e.g. identity verification, expres-
sion or action recognition. However, robust tracking of fa-
cial features is very challenging due to changing facial ex-
pression and pose. Additionally, one needs to be able to
cope with self occlusions and lighting changes. For these
reasons, attempting to robustly track facial features with-
out a priori knowledge of their appearance is a difficult
task. Appearance can be learned on-the-fly; however, such
approaches can fail drastically through the accumulation of
errors during the learning process. While the appearance of
the face may vary between individuals and settings, struc-
turally they are the same and this structure can be used to
reduce errors during the online learning process.
Existing work tends to address the above problems
in two ways: 2D/3D model-based approaches; or pose-
specific piecewise tracking models. For model based ap-
proaches, a popular method is to use a 2D model for the
face [5, 13, 9, 14], but these models only approximate the
shape of the face at near frontal head pose. Alternatively,
various 3D head models can be used [7, 1, 12, 4, 2]. These
more complex models require accurate initialization and are
computationally expensive. 3D head models can range from
simple cylindrical and ellipsoidal models to complex polyg-
onal approximation, usually used for pose estimation. Many
of these techniques use template update, by incrementally
modifying the expected appearance to correspond with the
observed appearance and to achieve pose invariant track-
ing. More accurate model-based facial feature tracking can
be achieved using a realistic 3D face model of sufficient de-
tail for rendering. This allows for a realistic reconstruction
of the visual appearance of the face [11]. However, these
models are complicated and deforming the model to fit the
image is usually non-trivial.
The next class of approaches couples a set of pose-specific
trackers with a switching mechanism (e.g. a pose estimator)
that decides which of these trackers to use. An example is
proposed by Kanaujia et al. 2006 [6], where multiple 2D
pose-specific Active Shape Models (ASMs) are coupled with a
switching mechanism using SIFT descriptors. Peyras et al.
2008 [10] used a pool of Active Appearance Models (AAMs)
that were specialized at different poses and expressions but
were not robust to changes in illumination. However,
training a reliable ASM or AAM requires a large amount of
training data in the form of labeled shapes.
This paper proposes a novel framework (Figure 1) that
combines elements of both approaches described above.
Specifically, sets of pose-specific facial feature trackers are
integrated into a robust pose estimation system detailed in
Section 2.5. The robust estimation of pose based on noisy
tracking data is a novel extension of Random Sample Con-
sensus (RANSAC) [3]. The tracking is initialised using
alignment of a generic 3D model based on features that
are easily identified using feature localisation. Following
a detailed description of our approach in section 2, results
presented in section 3 demonstrate the accuracy of the
approach. Finally, conclusions are drawn in section 4.
2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops
978-1-4244-4441-0/09/$25.00 ©2009 IEEE
Figure 1. Overview of online tracker system.
2. Online Facial Feature Tracking
An overview of the tracking process is given in Figure 1.
Online tracking of the face is initialised on the first
frame of the sequence from a set of easily identifiable
landmarks around the mouth, nose and eyes. From these
initial position estimates, the pose of the head is
determined using a generic face mesh and LM minimisation. A
larger set of features is then back projected into the image
to initialise a set of trackers over the face region. The 3D
face mesh was obtained using a commercial 3dMD system.
Template trackers are initialised at each feature position.
The template is a square image patch centred on the fea-
ture and Lucas Kanade tracking is used to perform template
alignment.
The first iteration of the sequence is described in section
2.1, in which each feature is represented by a single
template. This removes the need to consider weights during
the first iteration. As tracking progresses through the
sequence, the new appearances of the tracked features are
stored for later use. Later iterations therefore have
multiple templates per facial feature point, each with a
calculated confidence weight (described in section 2.4). The
set of templates represents the appearance of a feature at
various poses, and the confidence weights indicate which
templates match the current pose of the head.
As a feature may revert to a previously observed appearance,
all retained templates are used for tracking at each frame.
This is feasible as the templates are small, finite in
number and relatively cheap to compute in an LK framework.
The confidence weighting assigned to each template tracker
is used with RANSAC and factored sampling to determine a
robust head pose at each iteration. The weighting is
determined by measuring how well each tracker agrees with
the back projected head model. Following pose estimation,
the weights also control the degree of correction applied to
the feature positions: each feature position is updated to
the weighted average of the tracker position and the back
projected model. Once the tracker positions have been
constrained by the model pose estimate, the templates are
updated with new appearances from the current frame. The
procedure is then repeated on each consecutive frame of
video.
The approach enables tracking and adaptation to previously
unseen pose angles in the video sequence, while the
constraints applied by the 3D head model prevent tracking
drift during online learning. We will now discuss each of
these processes in turn.
Figure 2. Examples of poor initial head pose alignment. The
tracker points on the background indicate the face mesh is not in
agreement at the start of the sequence.
2.1. Tracking Initialisation
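The face appearance is represented by a set of J features,
each of which is a tuple $(M_j, p_j, z_j)$ where
$1 \le j \le J$, $M_j$ is a set of $n_j$ templates,
$p_j \in \mathbb{R}^3$ is a face mesh position and
$z_j \in \mathbb{R}^2$ is an image position. Each template
in $M_j$ is a tuple $(K_{i,j}, w_{i,j}, a_{i,j})$ comprising
an image patch $K_{i,j}$, a weight $w_{i,j}$ and an image
position $a_{i,j}$, where $1 \le i \le n_j$. Collected over
the $n_j$ templates of feature j, the weights form a vector
$w_j \in \mathbb{R}^{n_j}$ and the image positions a matrix
$a_j \in \mathbb{R}^{2 \times n_j}$.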
At initialisation, 19 key points on the mesh are identified,
corresponding to simple features that are easy to locate
using detection, such as the corners of the mouth, eyes,
nose and eyebrows as depicted in Figure 3. From these known
points, the 3D pose of the head is estimated using equation
(1) and a further 32 evenly positioned features over the
facial model are back projected into the image. For each of
the 51 feature points (J = 51), the initial template
$K_{i,j}$ is extracted from the image at the feature
position $a^{F=0}_{i,j}$ and retained along with the
estimated rotational pose of the head $P_{i,j}$ (so that
$P_j \in \mathbb{R}^{3 \times n_j}$ over the templates of
feature j). As each feature initially corresponds to only
one template ($n_j = 1$), the initial weights are set to be
equal ($w_{i,j} = 1$).
Lucas Kanade tracking is then used to estimate the posi-
tions of all features on the second frame.
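The feature representation described in this section can be sketched as a small data structure. This is only an illustrative sketch: the class and function names (`Template`, `Feature`, `init_feature`) are hypothetical and not taken from the paper, and the image patch is stored as a plain nested list for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Template:
    """One template of a feature (hypothetical container, not from the paper)."""
    patch: List[List[float]]            # image patch K_{i,j}
    weight: float                       # confidence weight w_{i,j}
    position: Tuple[float, float]       # tracked image position a_{i,j}
    pose: Tuple[float, float, float]    # head pose P_{i,j} when the patch was taken


@dataclass
class Feature:
    """One facial feature: mesh point, image point and its template set M_j."""
    mesh_point: Tuple[float, float, float]                    # p_j
    image_point: Tuple[float, float]                          # z_j
    templates: List[Template] = field(default_factory=list)   # M_j


def init_feature(mesh_point, image_point, patch, pose):
    """Create a feature on the first frame with a single template of weight 1."""
    first = Template(patch=patch, weight=1.0, position=image_point, pose=pose)
    return Feature(mesh_point=mesh_point, image_point=image_point,
                   templates=[first])
```

For example, `init_feature((0.0, 0.0, 0.0), (10.0, 20.0), [[0.0]], (0.0, 0.0, 0.0))` yields a feature with exactly one template whose weight is 1, matching the initial condition $n_j = 1$ described above.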
2.2. Head Pose Estimation by LM
Many existing pose estimation techniques use feature
tracking/detection and iterative model fitting. LM mini-
mization can be used to determine pose from a point cloud
when point correspondence is known [8]. The pose is esti-
mated by minimizing the cost function:
$$F(R, t) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \left\| a_{i,j} - \mathrm{proj}(R p_j + t) \right\|^2 \qquad (1)$$
where proj() is the projection function, for which
perspective geometry is used, $p_j$ is the mesh position of
feature j, R is the rotation matrix corresponding to the
Euler angles {R_pitch, R_roll, R_yaw} and t is the head
translation {t_x, t_y, t_z}. The R and t that minimise F
represent the estimated head pose.
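To make the cost function concrete, the following sketch evaluates the sum of squared reprojection errors for a given pose. Several details are assumptions made only for illustration, since the paper does not state them: the Euler angle composition order, the pinhole projection with a focal length parameter `focal`, and all function names. The LM minimisation itself is not shown; only the quantity it would minimise.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def rotation_matrix(pitch, roll, yaw):
    """Rotation from Euler angles in radians.

    ASSUMED convention: pitch about x, yaw about y, roll about z,
    composed as Rz(roll) * Ry(yaw) * Rx(pitch).
    """
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))


def proj(point, focal=500.0):
    """Perspective projection of a 3D camera-space point (assumed pinhole model)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def transform(R, t, p):
    """Apply the rigid transform R p + t to a 3D point."""
    return tuple(sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3))


def pose_cost(R, t, features, focal=500.0):
    """Sum of squared reprojection errors.

    features: list of (mesh_point, [tracked_xy, ...]) pairs, one per feature.
    """
    total = 0.0
    for mesh_point, tracked in features:
        u, v = proj(transform(R, t, mesh_point), focal)
        for ax, ay in tracked:
            total += (ax - u) ** 2 + (ay - v) ** 2
    return total
```

With the identity rotation and a mesh point on the optical axis, a tracked position at (3, 4) pixels from the image centre gives a cost of 25, as expected for a squared distance.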
2.3. Online Learning
As the appearance of facial features is dependent on the
direction of view, periodically retraining a tracker can help
the tracker adapt to new appearances. However, care must
be taken that online tracker updates are not used excessively
as this causes tracking drift. Drift can be minimized by per-
forming a template update only when necessary to adapt
to new appearances. The central idea of this system is to
balance adaptation to the changing appearance of a feature
without introducing tracker drift.
Each feature has one or more tracker templates $K_{i,j}$,
each with a corresponding head pose estimate $P_{i,j}$ at
which the template was created. At each iteration, the
current head pose estimate R is compared to the poses at
which templates were added to the model. If no template
exists within an angle $c_i$ of the current frame's
estimated head pose, an additional template is added to the
model for that feature. To enable gradual adaptation to new
appearance, each feature has an independent threshold, which
prevents new templates from being collected for all features
simultaneously. All previous templates are retained.
Consequently, as the head moves in the image, the model
gradually accumulates a set of templates representative of
the appearance of features at all observed head
orientations.
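The template collection rule can be illustrated with a short sketch. The paper does not specify how the angle between two head poses is measured, so the Euclidean distance between Euler angle triples used here is an assumption, as are the function names.

```python
import math


def pose_distance(pose_a, pose_b):
    """ASSUMED angular distance: Euclidean norm of the Euler angle differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose_a, pose_b)))


def needs_new_template(template_poses, current_pose, threshold):
    """True if no stored template pose lies within `threshold` of the current pose."""
    return all(pose_distance(p, current_pose) > threshold
               for p in template_poses)


def maybe_add_template(template_poses, current_pose, threshold):
    """Append the current pose as a new template pose when none is close enough.

    Existing templates are never removed, mirroring the retention of all
    previous templates described in the text.
    """
    if needs_new_template(template_poses, current_pose, threshold):
        template_poses.append(current_pose)
        return True
    return False
```

Because each feature carries its own threshold, calling `maybe_add_template` per feature with different `threshold` values staggers the moments at which features collect new templates.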
2.4. Tracking with Multiple Templates
Tracking proceeds in a similar manner to the first iteration
for all subsequent frames, except that each landmark on the
face has multiple templates that may represent its
appearance. Lucas Kanade tracking is used to estimate each
template position $a^F_{i,j}$ on frame F, with the initial
tracker position taken from the corresponding landmark
position $z^{F-1}_j$ on frame F − 1.
Because a landmark has various possible appearances, some of
its templates will be better suited for tracking than
others. To prioritise these trackers, a weighting $w_{i,j}$
is calculated for each template based on its tracking
agreement with the face model on frame F − 1. This enables
trackers with good performance to be preferred over trackers
with poor performance in the pose estimation step. The
weightings are assigned as follows:
$$w_{i,j} = e^{-u_{i,j}/s} \qquad (2)$$
where $u_{i,j}$ is the distance in pixels between the
tracker prediction and the back projected mesh node $p_j$
($u_j \in \mathbb{R}^{n_j}$) and s is the agreement scaling
factor.
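As a small worked example of the weighting in equation (2), the sketch below computes the confidence weight from a tracker prediction and the back projected mesh node. The function name is hypothetical, and the default s = 10.0 is the value reported in the experimental section.

```python
import math


def template_weight(tracker_xy, mesh_xy, s=10.0):
    """Confidence weight exp(-u / s), where u is the pixel distance between
    the tracker prediction and the back projected mesh node."""
    u = math.hypot(tracker_xy[0] - mesh_xy[0], tracker_xy[1] - mesh_xy[1])
    return math.exp(-u / s)
```

A tracker in perfect agreement with the mesh receives a weight of 1, and the weight decays towards 0 as the disagreement grows; with s = 10.0, a disagreement of 10 pixels gives a weight of about 0.37.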
2.5. Estimate pose using Weighted RANSAC LM
Although the use of a confidence weighting reduces the
effect of poorly performing trackers on the pose estimate,
occasionally a tracker of higher confidence undergoes drift.
Conventional LM pose estimation is not robust to outliers
resulting from tracker drift. A robust framework that
incorporates tracker confidence weightings is therefore
described. This framework extends the Random Sample
Consensus (RANSAC) method to incorporate factored sampling.
To prefer high confidence points over points with lower
confidence, equation (1) is modified to incorporate a
weighting term:
$$F(R, t) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} w_{i,j} \cdot \left\| a_{i,j} - \mathrm{proj}(R p_j + t) \right\|^2 \qquad (3)$$
Figure 3. Position of features used to initialize and track the face. Multiple templates $M_j$ at various poses represent the appearance of each
feature.
For a single RANSAC iteration, a random subset ρ of l
points is sampled from the distribution $p(\mu_{i,j})$:
$$p(\mu_{i,j}) = \frac{w_{i,j}}{\sum w} \qquad (4)$$
where the denominator is the sum of the weights of all
points.
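The sampling step can be sketched as follows: each point is drawn with probability proportional to its confidence weight. Drawing the l points without replacement is an assumption made here so that the subset contains distinct points; the paper does not state this detail, and the function name is hypothetical.

```python
import random


def sample_subset(weights, l, rng):
    """Draw up to l distinct points, each draw with probability w / sum(w).

    weights: dict mapping a point key, e.g. (template i, feature j), to its
             confidence weight.
    rng:     a random.Random instance, passed in for reproducibility.
    Note: random.choices raises ValueError if all remaining weights are zero.
    """
    remaining = dict(weights)
    chosen = []
    for _ in range(min(l, len(remaining))):
        keys = list(remaining)
        pick = rng.choices(keys, weights=[remaining[k] for k in keys], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # ASSUMPTION: sampling without replacement
    return chosen
```

A point with zero weight is never selected, so trackers that disagreed strongly with the model on the previous frame rarely take part in the model fit.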
Following this, an LM pose model fit is performed on the
random subset ρ, using the point confidence weights w in the
minimization. Points not in ρ are then checked to determine
whether they agree with the model. Support is calculated by
summing the agreement with the model, where agreement is
determined by the distance
$\| a_{i,j} - \mathrm{proj}(R p_j + t) \|$. If the weight of
the inliers, as a proportion of the weight of all points
$\sum_{j=1}^{J} \sum_{i=1}^{n_j} w_{i,j}$, is less than a
good agreement threshold β, the model from this iteration is
discarded. The model fit error E is calculated as follows:
$$E = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \left\| a_{i,j} - \mathrm{proj}(R p_j + t) \right\| \cdot w_{i,j} \qquad (5)$$
where E is the model fit error in pixels.
If the model fit error is lower than that of any previous
model fit, the model is stored as the new best fit. We
overcome the problem of selecting β by iterating the process
over decreasing β values until a RANSAC solution is found.
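The overall loop can be sketched under simplifying assumptions. To keep the example short and self-contained, the LM pose fit is replaced by a stand-in model, a weighted 2D translation between model points and tracked points; this is not the paper's pose model. The inlier distance `inlier_px`, the iteration count and all names are also assumptions, and the inlier weight is summed over all points rather than only those outside the sample.

```python
import math
import random


def fit_translation(points, idx):
    """Stand-in for the LM pose fit: weighted mean offset over the sampled points.

    points: list of (model_xy, tracked_xy, weight) tuples.
    """
    wsum = sum(points[k][2] for k in idx)
    tx = sum(points[k][2] * (points[k][1][0] - points[k][0][0]) for k in idx) / wsum
    ty = sum(points[k][2] * (points[k][1][1] - points[k][0][1]) for k in idx) / wsum
    return (tx, ty)


def residual(point, t):
    """Pixel distance between the tracked point and the translated model point."""
    m, a, _ = point
    return math.hypot(a[0] - (m[0] + t[0]), a[1] - (m[1] + t[1]))


def weighted_sample(points, l, rng):
    """Draw l distinct indices with probability proportional to weight."""
    pool = list(range(len(points)))
    chosen = []
    for _ in range(min(l, len(pool))):
        pick = rng.choices(pool, weights=[points[k][2] for k in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


def weighted_ransac(points, l, betas, inlier_px, iterations, rng):
    """Return (model, beta) for the first beta in `betas` (given in decreasing
    order) at which some sampled model has enough weighted inlier support."""
    total_w = sum(p[2] for p in points)
    for beta in betas:
        best = None
        for _ in range(iterations):
            model = fit_translation(points, weighted_sample(points, l, rng))
            inlier_w = sum(p[2] for p in points if residual(p, model) <= inlier_px)
            if inlier_w / total_w < beta:
                continue  # insufficient support: discard this model
            error = sum(residual(p, model) * p[2] for p in points)  # fit error E
            if best is None or error < best[0]:
                best = (error, model)
        if best is not None:
            return best[1], beta
    return None, None
```

In a full system the call to `fit_translation` would be replaced by the weighted LM pose fit and `residual` by the reprojection distance, leaving the sampling, support test and β schedule unchanged.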
2.6. Update tracker positions
As described above, each landmark has multiple templates
that represent its appearance. To determine a final landmark
position on frame F and correct any trackers that drift, the
highest weighted template position is combined with the
corresponding 3D mesh point back projected onto the image.
As LK is an iterative tracking scheme, uncorrected tracking
drift can result in total failure once the tracker is
outside the basin of convergence of the LK optimisation.
Correcting the trackers brings points with a low confidence
back within the tracker's basin of convergence.
Tracking of features that are occluded or have undergone
drift is expected to be inaccurate, and such trackers will
therefore have a low weight. A low weight means they play
little or no part in the RANSAC estimation, and they are
corrected at this stage of the algorithm using the position
from the back projected mesh:
$$z_j = w_{i^*,j} \cdot a_{i^*,j} + (1 - w_{i^*,j}) \cdot \mathrm{proj}(R p_j + t), \qquad i^* = \arg\max_i w_{i,j} \qquad (6)$$
where $z_j$ is the new tracker position and $a_{i^*,j}$ is
the prediction of the highest weighted template, with weight
$w_{i^*,j}$.
Processing of the frame concludes with the addition of
templates to the model as described in section 2.3.
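The position update of equation (6) amounts to a weighted blend between the best template's prediction and the back projected mesh point. A minimal sketch, with a hypothetical function name and templates given as (weight, position) pairs:

```python
def update_position(templates, mesh_xy):
    """Blend the highest weighted template prediction with the back projected
    mesh point, using that template's weight as the blending factor.

    templates: list of (weight, (x, y)) pairs for one landmark.
    mesh_xy:   back projected mesh point (x, y) for the same landmark.
    """
    w, a = max(templates, key=lambda item: item[0])
    return (w * a[0] + (1.0 - w) * mesh_xy[0],
            w * a[1] + (1.0 - w) * mesh_xy[1])
```

When the best weight is close to 1 the landmark follows the tracker, and when it is close to 0 the landmark is pulled onto the model, which is how occluded or drifting trackers are returned to the basin of convergence.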
3. Experimental Results
The proposed system was tested in two experiments. The
first determined the head pose estimation accuracy and the
second measured the feature tracking accuracy. All tests
were performed on the “uniform lighting” video sequences
provided by Boston University [7]. The data set contains
45 video sequences of 5 subjects at a resolution
of 320×240 pixels, all of which begin with a frontal view of
the face. The data includes head pose ground truth recorded
using an electromagnetic sensor with an angular accuracy
of 0.5 degrees in ideal conditions.
Figure 4. Examples showing online tracking of feature positions at
various pose angles of subjects JAM and JIM. Accurate tracking
is possible from various view angles of the face.
3.1. Head Pose Estimation
For robust pose estimation, RANSAC was set to use a
minimum number of model points l = 6. The LK template
size was 25×25 pixels, the agreement scaling factor was
s = 10.0 and the retraining thresholds were randomly
selected in the range $c_i = 2 \pm 1$.
Testing the online LK system on 45 sequences, the average
estimation errors for pitch, roll and yaw were 3.9°, 3.1°
and 4.2° respectively. This is compared to the performance
of other methods in Table 1. A comparison of estimated pose
with ground truth for two sequences is shown in Figure 5.
These graphs show the ground truth pose for roll, pitch and
yaw as a dashed (red) line and the predicted pose as a solid
(blue) line. As can be seen, the prediction largely follows
the ground truth, although there is a noticeable time lag
between the signals. We suspect this is due to smoothing on
the ground truth, as it is the ground truth which lags
behind the prediction. The estimated pose also has a
tendency to underpredict at extremes of pose. The jitter in
the prediction is a direct result of the RANSAC LM
estimation, as no temporal constraints or smoothing were
applied.

                                     Average Error (Deg)
Method                               Pitch   Roll   Yaw
Proposed method                      3.9     3.1    4.2
Jang and Kanade 2008 [4]             3.7     4.6    2.1
Choi and Kim 2008 [2], Cylinder      4.4     5.2    2.5
Choi and Kim 2008 [2], Ellipsoid     3.9     4.0    2.8
Table 1. Comparison of performance with other head pose
estimation methods.

Average Error (pixels) for Subject
JAM   JIM   LLM   SSM   VAM   Overall
3.1   3.8   5.0   3.2   3.5   3.7
Table 2. Results of 2D facial feature tracking over 45
sequences.

Figure 6. Examples showing online tracking of feature positions at
various pose angles of subjects LLM and SSM.
The system initialized accurately on most individuals but
was consistently inaccurate on one subject due to the face
shape being noticeably different to the generic face mesh
(see Figure 2).
3.2. Facial Feature Tracking
In addition to the pose estimate, the feature point accuracy
is also considered, as the tracking of individual feature
points and their motion is important for other applications
such as expression recognition. The tracking performance was
measured by comparing predicted feature positions $z_j$ to
the ground truth feature positions. The test again used the
45 sequences in the “uniform lighting” Boston University
data set. Only the 17 features on the face were included in
the performance assessment (see Figure 3); the ear positions
could not be reliably tracked due to frequent occlusions.
Table 2 shows the average error in pixels for the 5 subjects
in the database (labelled JAM, JIM, LLM, SSM and VAM). Over
all 45 sequences the average error was 3.7 pixels. Ground
truth positions were annotated semi-automatically for the
sequences. To find the noise level in this ground truth, two
sequences were fully manually annotated and compared to the
semi-automatic annotation. The average position difference
was 2.7 pixels.
Examples of online tracking are shown in Figures 4, 6 and 7.
Figure 5. Examples of head pose estimation for five sequences. The variation for three rotation axes is shown for each sequence. Blue
solid line is predicted pose. Red dashed line is ground truth.
Figure 7. Examples showing online tracking of feature positions at
various pose angles of subject VAM.
4. Conclusions
This paper described a system that automatically esti-
mates head pose and facial feature positions without a pri-
ori knowledge of the appearance of the face. The system
uses robust LM pose estimation within a RANSAC frame-
work to constrain features during on-line learning. The con-
straints are weighted based on the tracking agreement with
a 3D face mesh. The online learning rate is controlled by
changes in head pose. Accurate estimation of head pose is
achieved over a wide range of poses resulting in comparable
performance to other state-of-the-art approaches.
5. Future Work
For some subjects, the initial head pose estimate is poor
due to differences in head shape as the system uses a rigid
mean model. This also has implications for radical changes
in expression, which are not represented in the model.
Future work will address robustness to