Face Pose Estimation and its Application in Video Shot Selection
Zhiguang YANG¹, Haizhou AI¹, Bo WU¹, Shihong LAO² and Lianhong CAI¹
¹ Computer Science and Technology Department, Tsinghua University, Beijing, 100084, China
² Sensing Technology Laboratory, Omron Corporation
E-mail: ahz@mail.tsinghua.edu.cn
Abstract
This paper introduces a face pose estimation method and
its application to video shot selection for face image
preprocessing. The pose estimator is learned by a
boosting regression algorithm called SquareLev.R [1],
which learns poses from simple Haar-type features. It
consists of two tree-structured subsystems, one for the
left-right angle and one for the up-down angle. As a
specific application in video-based face recognition,
the best shot selection problem is discussed, resulting
in a real-time system that automatically selects the
most frontal face from a video sequence.
1. Introduction
Face pose estimation (PE) is used to predict the 3D
orientation, that is, the rotation-in-plane (RIP) and
rotation-out-of-plane (ROP) angles, of the human head.
In this paper we discuss only a simplified version that
covers the left-right and up-down angles. Pose
estimation is important because face pose plays an
essential role in many real-life applications, such as
monitoring driver attentiveness [2] or automating
camera management [3]. In addition, many view-based
approaches to face image analysis, such as face
recognition, need to estimate the pose to some
extent [4].
Previous work on pose estimation includes PCA [5,6],
ANN [7], SVMs [8,9], and Independent Subspace Analysis
(ISA) [10]. In this paper, we propose a novel method
that learns a pose estimator with a boosting regression
algorithm called SquareLev.R [1], using simple
Haar-type features [11]. The estimator consists of two
tree-structured subsystems, one for the left-right
angle and one for the up-down angle. As a specific
application in video-based face recognition, the best
shot selection problem is discussed, resulting in a
real-time system that automatically selects the most
frontal face from a video sequence. Best shot selection
is valuable in live-video face processing tasks such as
face recognition and demographic classification [12].
The main contribution of our work is a novel pose
estimation method based on boosting regression that
proves to be very useful for practical applications
such as best shot selection.
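To make the idea of selecting the most frontal face
concrete, the following is a minimal sketch, assuming
that a pose estimator has already produced a
(left-right, up-down) angle pair in degrees for each
frame. The function name `most_frontal_index`, the
sample angles, and the squared-angle score are
illustrative assumptions; the selection criterion
actually used by the system is not stated in this
section.

```python
from typing import List, Tuple


def most_frontal_index(poses: List[Tuple[float, float]]) -> int:
    """Return the index of the frame whose pose is closest to frontal.

    Each pose is (left_right_deg, up_down_deg). The squared-angle score
    is an illustrative assumption, not the paper's stated criterion.
    """
    if not poses:
        raise ValueError("poses must not be empty")
    scores = [lr * lr + ud * ud for lr, ud in poses]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-frame estimates from a short video sequence.
frame_poses = [(30.0, -5.0), (12.0, 3.0), (-2.0, 1.5), (8.0, -10.0)]
best = most_frontal_index(frame_poses)
print(best)
```

Any score that grows with both angles would serve the
same purpose; the sum of squares simply penalizes large
deviations in either direction.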
The rest of this paper is organized as follows: in
Section 2, we discuss the problems involved in pose
estimation; in Section 3, we briefly introduce the
boosting regression algorithm, SquareLev.R; in
Section 4, we introduce the Haar feature based weak
learner for regression; in Section 5, we describe our
pose estimation trees; in Section 6, we give our
solution to the best shot selection problem and its
results; and finally in Section 7, we present our
conclusions.
Figure 1. Definition of pose estimation (PE)
[Flow chart: All Sub-Windows → PE (Level 1) → View-Based Face Detectors → PE (Level 2) → View-Based Models for Face Alignment → PE (Level 3) → Further Processing]
2. The Definition of Pose Estimation
As illustrated in Fig.1, PE has three variations
according to its position in the flow chart. PE before
face detection is a rough prediction used to divide each
sub-window into its corresponding subcategory for
view-based face detectors. Because there are usually
millions of patches to be processed for face detection,
PE at this level must be simple and fast. PE after face
detection serves as the multiplexer that guides the face
pattern to its view-based model. Its accuracy has direct
influence on the performance of further processing. PE
after face alignment is the last level. At this stage, there
are usually many facial landmarks available, so model-
based methods can be used. In this paper we focus on
the second level: we assume that no landmarks are
available, and the goal is to estimate the pose from
the detected face regions in an image.

Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04)
1051-4651/04 $ 20.00 IEEE
3. Boosting Regression Algorithm
The learning algorithm SquareLev.R [1] is a boost-
or leverage-style regression algorithm that aims at
reducing the variance of residuals. Given a sample set
$S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ and a regressor $F$, the variance of
residuals is

$$P_{\mathrm{Var}} = \|\mathbf{r} - \bar{\mathbf{r}}\|_2^2, \qquad (1)$$

where $\mathbf{r}$ is the $m$-vector of residuals defined by
$r_i = y_i - F(\mathbf{x}_i)$, and $\bar{\mathbf{r}}$ is the $m$-vector with all
components equal to $\bar{r} = \frac{1}{m}\sum_{i=1}^{m} r_i$. Fig.2 gives the
details of SquareLev.R. It has been proved that in each
iteration of SquareLev.R, $P_{\mathrm{Var}}$ will decrease by a factor
of $(1 - \varepsilon_t^2)$ [1]. That means if $\varepsilon_t$ has a positive lower
bound $\varepsilon_{\min}$, then for any positive number $\rho$ this
algorithm will generate a master regressor whose
sample error is at most $\rho$.
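The $(1 - \varepsilon_t^2)$ decrease can be checked numerically on a toy
example. The sketch below is not taken from the text
above: the edge $\varepsilon$ is assumed to be the correlation
between the centered residuals and the centered weak
predictions, and the step size $\alpha$ is assumed to be the
least-squares coefficient, a choice made here because it
makes the stated decrease hold exactly. The function
names and the sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def centered(v: Sequence[float]) -> List[float]:
    """Subtract the mean from every component."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def variance_potential(residuals: Sequence[float]) -> float:
    """P_Var = ||r - r_bar||_2^2 as in Eq. (1)."""
    return sum(c * c for c in centered(residuals))


def leverage_step(residuals: Sequence[float],
                  predictions: Sequence[float]) -> Tuple[float, float]:
    """Return (alpha, eps) for one weak hypothesis.

    Assumption: eps is the correlation of centered residuals and centered
    predictions; alpha = eps * ||r - r_bar|| / ||f - f_bar||.
    """
    rc, fc = centered(residuals), centered(predictions)
    norm_r = math.sqrt(sum(a * a for a in rc))
    norm_f = math.sqrt(sum(b * b for b in fc))
    if norm_r == 0.0 or norm_f == 0.0:
        return 0.0, 0.0  # nothing to fit, or a constant hypothesis
    eps = sum(a * b for a, b in zip(rc, fc)) / (norm_r * norm_f)
    return eps * norm_r / norm_f, eps


# Toy data: F is the zero function, so the residuals equal the targets.
residuals = [1.0, 2.0, 4.0, 7.0]
weak_predictions = [0.0, 1.0, 1.0, 3.0]  # hypothetical weak regressor output

before = variance_potential(residuals)
alpha, eps = leverage_step(residuals, weak_predictions)
new_residuals = [r - alpha * f for r, f in zip(residuals, weak_predictions)]
after = variance_potential(new_residuals)

print(before, after, (1.0 - eps * eps) * before)
```

On this data the potential drops from 21 to 2, which
equals $(1 - \varepsilon^2)$ times the starting value up to
floating-point rounding.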
4. Haar Feature Based Weak Learner
In each boosting round, SquareLev.R calls the weak
learner to obtain a hypothesis, or weak regressor.
Unlike classification, where the hypothesis is a
threshold function, the hypothesis for regression
should be a continuous function of the feature value. A
very simple yet effective choice is the Look-Up-Table
(LUT). Following Viola and Jones [11], we use Haar
features. For a Haar feature $h$ whose range has been
normalized to [0,1], our LUT has 64 bins, and the
$i$-th bin corresponds to the sub-domain $[(i-1)/64,\ i/64]$,
$i = 1, \ldots, 64$. The hypothesis on $\mathrm{bin}_i$ is calculated as

$$E[\, y \mid h(\mathbf{x}) \in \mathrm{bin}_i \,]. \qquad (2)$$

Define the characteristic function

$$B_i(u) = \begin{cases} 1, & u \in \mathrm{bin}_i \\ 0, & u \notin \mathrm{bin}_i \end{cases}$$

then the hypothesis based on Haar feature $h$ can be
formalized as

$$f(\mathbf{x}) = \sum_{i=1}^{64} B_i(h(\mathbf{x}))\, E[\, y \mid h(\mathbf{x}) \in \mathrm{bin}_i \,]. \qquad (3)$$

We construct a hypothesis pool from all possible Haar
features.
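The LUT hypothesis of Eqs. (2) and (3) can be sketched
in a few lines, assuming the Haar feature values have
already been computed and normalized to [0,1]. The
sketch uses 0-based bin indices, assigns a value on a
bin boundary to the upper bin (except 1.0, which goes
to the last bin), and lets empty bins predict 0.0. All
three choices, like the function names and sample
values, are assumptions that the text does not specify.

```python
from typing import List, Sequence

NUM_BINS = 64


def bin_index(h_value: float) -> int:
    """Map a normalized feature value in [0, 1] to a 0-based bin index."""
    if not 0.0 <= h_value <= 1.0:
        raise ValueError("feature value must be normalized to [0, 1]")
    return min(int(h_value * NUM_BINS), NUM_BINS - 1)


def fit_lut(feature_values: Sequence[float],
            targets: Sequence[float]) -> List[float]:
    """Estimate E[y | h(x) in bin_i] as the mean target in each bin."""
    sums = [0.0] * NUM_BINS
    counts = [0] * NUM_BINS
    for h, y in zip(feature_values, targets):
        b = bin_index(h)
        sums[b] += y
        counts[b] += 1
    # Assumption: bins with no training samples predict 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def predict(lut: Sequence[float], h_value: float) -> float:
    """Evaluate f(x): only the bin that contains h(x) contributes."""
    return lut[bin_index(h_value)]


# Hypothetical normalized feature values and pose angles in degrees.
features = [0.01, 0.012, 0.5, 0.99]
angles = [-30.0, -20.0, 0.0, 45.0]
lut = fit_lut(features, angles)
print(predict(lut, 0.005), predict(lut, 0.5), predict(lut, 1.0))
```

Because exactly one $B_i$ is nonzero for any feature
value, the sum in Eq. (3) reduces to a single table
look-up at prediction time.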
Figure 3. Multi-view face samples
5. Pose Estimation Tree
The pose data for training consist of faces at ±45˚,
±30˚, ±15˚ and 0˚ left-right ROP and ±30˚, ±15˚ and 0˚
up-down ROP, giving 35 view categories in total, each
containing 300 faces of different people. Because our
target is PE after face detection, we do not apply any
shape alignment to the face samples; that is, the face
block obtained by the face detection module is used
for training directly. All samples are resized to
24×24-pixel patches, see Fig.3.
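The category count follows from combining the seven
left-right angles with the five up-down angles. A short
sketch of that arithmetic is given below; the variable
names are illustrative, and the total number of
training faces is derived from the figures above rather
than stated in the text.

```python
from itertools import product

# Angle grids in degrees, as listed in the text.
LEFT_RIGHT_DEG = [-45, -30, -15, 0, 15, 30, 45]
UP_DOWN_DEG = [-30, -15, 0, 15, 30]
FACES_PER_CATEGORY = 300

# One view category per (left-right, up-down) angle pair.
view_categories = list(product(LEFT_RIGHT_DEG, UP_DOWN_DEG))
total_faces = len(view_categories) * FACES_PER_CATEGORY

print(len(view_categories))  # 35
print(total_faces)           # 10500
```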
• Given a sample set $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$, a base
learning algorithm and parameters $\rho$, $T_{\max}$
• Initialize the master regressor $F$ to the zero
function
• For $t = 1$ to $T_{\max}$ do
    For $i = 1$ to $m$ do
        $r_i = y_i - F(\mathbf{x}_i)$
    end do
    If $\|\mathbf{r} - \bar{\mathbf{r}}\|_2^2 < m\rho$ …