Automated Facial Expression Recognition System
Andrew Ryan
Naval Criminal Investigative Services, NCIS
Washington, DC, United States
andrew.h.ryan@navy.mil

Jeffery F. Cohn, Simon Lucey, Jason Saragih, Patrick Lucey, & Fernando De la Torre
Carnegie Mellon University
Pittsburgh, Pennsylvania, United States
jeffcohn@cs.cmu.edu, slucey@cs.cmu.edu, jsaragih@andrew.cmu.edu, ftorre@cs.cmu.edu

Adam Rossi
Platinum Solutions, Inc.
Reston, Virginia, United States
adam.rossi@platinumsolutions.com

Abstract—Heightened concerns about the treatment of individuals during interviews and interrogations have stimulated efforts to develop "non-intrusive" technologies for rapidly assessing the credibility of statements by individuals in a variety of sensitive environments. Methods or processes that have the potential to precisely focus investigative resources will advance operational excellence and improve investigative capabilities. Facial expressions have the ability to communicate emotion and regulate interpersonal behavior. Over the past 30 years, scientists have developed human-observer based methods that can be used to classify and correlate facial expressions with human emotion. However, these methods have proven to be labor intensive, qualitative, and difficult to standardize. The Facial Action Coding System (FACS) developed by Paul Ekman and Wallace V. Friesen is the most widely used and validated method for measuring and describing facial behaviors. The Automated Facial Expression Recognition System (AFERS) automates the manual practice of FACS, leveraging the research and technology behind the CMU/PITT Automated Facial Image Analysis (AFA) system developed by Dr. Jeffery Cohn and his colleagues at the Robotics Institute of Carnegie Mellon University. This portable, near real-time system will detect the seven universal expressions of emotion (Figure 1), providing investigators with indicators of the presence of deception during the interview process. In addition, the system will include features such as full video support, snapshot generation, and case management utilities, enabling users to re-evaluate interviews in detail at a later date.

Keywords—automated facial expression recognition system; biometric systems; utilizing facial features; shape and appearance modeling; expression recognition; facial action processing; constrained local models; facial expression recognition; support vector machines; spontaneous facial behavior

I. INTRODUCTION

Interrogations are a critical practice in the information gathering process, but the information collected can be severely compromised if the interviewee attempts to mislead the interviewer through the use of deception. Being able to quantitatively assess an interview subject's emotional state, and changes in that state, would be a tremendous advantage in guiding an interview and assessing the truthfulness of the interviewee.

Figure 1. The seven universal expressions of emotion. Each of these expressions is racially and culturally independent.

Recent advances in facial image processing technology have facilitated the introduction of advanced applications that extend beyond facial recognition techniques. This paper introduces the Automated Facial Expression Recognition System (AFERS): a near real-time, next-generation interrogation tool that automates the Facial Action Coding System (FACS) process for the purposes of expression recognition.
The AFERS system will analyze and report on a subject's facial behavior, classifying facial expressions as one of the seven universal expressions of emotion [1].

II. FACIAL ACTION CODING SYSTEM

In behavioral psychology, research into systems and processes that can recognize and classify a subject's facial expressions has allowed scientists to more accurately assess and diagnose underlying emotional state. This practice has since opened the door to new breakthroughs in areas such as pain analysis and depression treatment.

In 1978, Paul Ekman and Wallace V. Friesen published the Facial Action Coding System (FACS), which, 30 years later, is still the most widely used method available. Through observational and electromyographic study of facial behavior, they determined how the contraction of each facial muscle, both singly and in unison with other muscles, changes the appearance of the face. Rather than using the names of the active muscles, FACS measures these changes in appearance using units called Action Units (AUs). Figure 2 illustrates some of these Action Units and the appearance changes they describe.

The benefits of using AUs are two-fold. First, individually and in combination they provide a way to unambiguously describe nearly all possible facial actions. Second, combinations of AUs correspond to emotion-specified facial expressions. Happiness, for instance, is distinguished by the combination of AU 6 and AU 12. AU 6, orbicularis oculi contraction, raises the cheeks and causes wrinkling lateral to the eyes. AU 12, zygomatic major contraction, pulls the lip corners obliquely into a smile. Seven expressions appear universally in Western and non-Western, literate and pre-literate cultures [1].

While FACS is an efficient, objective method to describe facial expressions, it is not without its drawbacks. Coding a subject's video is a time- and labor-intensive process that must be performed frame by frame. A trained, certified FACS coder takes on average 2 hours to code 2 minutes of video. In situations where real-time feedback is desired and necessary, manual FACS coding is not a viable option.

Figure 2. Sample AUs and the appearance changes they describe [2].

III. AFERS SYSTEM OVERVIEW

AFERS is designed to operate in a platform-independent manner, allowing it to be hosted on various hardware platforms and to be compatible with most standard video cameras. AFERS employs shape and appearance modeling using constrained local models for facial registration, feature extraction, and representation, and support vector machines for expression classification. AFERS provides both pre- and post-analysis capabilities and includes features such as video playback, snapshot generation, and case management. In addition to the AFERS processing algorithms, the implementation features a plug-in architecture that is capable of accommodating future algorithmic enhancements as well as additional inputs for behavior analysis.

AFERS is built upon both Java and C++ technologies. The user interface, video processing, and analytics engine are built using Java, and the expression recognition engine is built using C++. The two technologies are bridged via the Java Native Interface (JNI). Figure 3 depicts a high-level overview of each of these components and their interaction with each other during the expression recognition process.

Figure 3. High-level overview of the AFERS components.
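To illustrate the hand-off between the components in Figure 3, the following Python sketch mirrors the capture, modeling, and recognition stages as a single loop. The class names, the placeholder logic, and the single-language implementation are illustrative assumptions only; the actual system divides this work between Java components and a C++ recognition engine bridged via JNI, and the emotion labels listed are the commonly cited universal expressions rather than a list taken from Figure 1.

```python
# Minimal sketch of the component hand-off depicted in Figure 3.
# Class names and the single-language implementation are illustrative
# assumptions; AFERS itself splits this work between Java components
# and a C++ expression recognition engine bridged via JNI.
from queue import Queue

class VideoProcessor:
    """Video processing component: splits input video into individual frames."""
    def __init__(self, video_frames):
        self.video_frames = video_frames

    def frames(self):
        yield from self.video_frames          # placeholder for real decoding

class ShapeAppearanceModel:
    """Shape and appearance modeling: registers the face and extracts features."""
    def extract(self, frame):
        return frame                          # placeholder feature vector

class ExpressionRecognizer:
    """Expression recognition engine: maps features to one of seven emotions."""
    LABELS = ("anger", "contempt", "disgust", "fear",
              "happiness", "sadness", "surprise")  # commonly cited universal set

    def classify(self, features):
        return self.LABELS[4]                 # placeholder decision ("happiness")

def run(video_frames):
    frame_queue, results = Queue(), []
    processor = VideoProcessor(video_frames)
    model, recognizer = ShapeAppearanceModel(), ExpressionRecognizer()
    for frame in processor.frames():          # video processing -> frame queue
        frame_queue.put(frame)
    while not frame_queue.empty():
        features = model.extract(frame_queue.get())        # shape & appearance modeling
        results.append(recognizer.classify(features))      # expression recognition
    return results
```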
IV. VIDEO PROCESSING

The video processing component of the AFERS application is responsible for sequencing the input video into individual frames at a rate of 25 frames per second. Once the video is sequenced, the video processing component places the frames onto a queue for input into the shape and appearance modeling component.

V. SHAPE & APPEARANCE MODELING

The successful automatic registration and tracking of non-rigidly varying geometric landmarks on the face is a key ingredient in the analysis of human spontaneous behavior. Until recently, popular approaches for accurate non-rigid facial registration and tracking have centered upon inverting a synthesis model (in machine learning terms, a generative model) of how faces can vary in terms of shape and appearance. As a result, the ability of such approaches to register an unseen face image is intrinsically linked to how well the synthesis model can reconstruct the face image. Perhaps the most well-known application of inverting a synthesis model for non-rigid face registration can be found in the active appearance model (AAM) work first proposed by [3]. Other closely related methods can be found in the morphable models work of Blanz and Vetter [4]. AAMs have become the de facto standard in non-rigid face alignment and tracking [5].

An example of a shape and appearance model using an AAM can be seen in Figure 4. Shape is represented by the x,y coordinates of facial features and appearance by the texture within that shape. Any face can be represented by a mean shape and appearance and their modes of variation. Because the models are generative, new faces and expressions can be generated from the model.

Figure 4. Row 1 shows the mean face shape (left) and the first three shape modes; row 2 shows the mean appearance (left) and the first three appearance modes for an AAM. The AAM is invertible and can be used to synthesize new face images. Four example faces generated with an AAM are shown in row 3.

AAMs are learned from training data; that is, they are person-specific. In our experience, about 5% of face images must be hand labeled for model training. For many biometric and forensic applications, generic models that can be used with previously unknown persons are needed. AAMs have inherent problems when attempting to fit generically to face images. This problem can be directly attributed to the balance that shape and appearance models require in their representational power (i.e., the model's ability to synthesize face images). If the representational power is too constrained, the method can do a good job on a small population of faces but cannot synthesize faces outside that population. On the other hand, if the representational power is too unconstrained, the model can easily synthesize all faces but can also synthesize non-face objects. Finding a suitable balance between these two extremes in a computationally tractable manner has not been easily attained through an invertible synthesis paradigm [6]. Hence, pre-training on face images for person-specific models has been needed. Person-specific models, while capable of precise tracking, are ill-suited for use with unknown persons, and AFERS is intended for just such use with previously unknown subjects.
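To make concrete the earlier statement that any face can be represented by a mean shape plus modes of variation, the following sketch builds a PCA-style linear shape model from aligned landmark vectors and synthesizes a new shape from the mean and the leading modes. It is a minimal illustration under assumed names and data shapes (NumPy, 68 landmarks, random stand-in data), not the training procedure used for AFERS.

```python
# Minimal PCA shape model sketch: mean shape plus modes of variation.
# Illustrative only; landmark data, dimensions, and names are assumed,
# not taken from the AFERS implementation.
import numpy as np

def build_shape_model(shapes, n_modes=3):
    """shapes: (n_samples, 2 * n_landmarks) array of aligned x,y coordinates."""
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # Singular value decomposition of the centered data gives the modes of variation.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_modes]                           # principal directions
    stddev = s[:n_modes] / np.sqrt(len(shapes))    # spread along each mode
    return mean_shape, modes, stddev

def synthesize(mean_shape, modes, coeffs):
    """Generate a new shape as the mean plus a weighted sum of the modes."""
    return mean_shape + np.asarray(coeffs) @ modes

# Usage with random stand-in data (68 landmarks -> 136-dimensional shape vectors):
rng = np.random.default_rng(0)
shapes = rng.normal(size=(200, 136))
mean_shape, modes, stddev = build_shape_model(shapes)
new_shape = synthesize(mean_shape, modes, 1.5 * stddev)  # 1.5 sd along each mode
```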
A. Constrained Local Models (CLM)

Accurate and consistent tracking of non-rigid object motion, such as facial motion and expressions, is important in many computer vision applications and has been studied intensively in the last two decades. This problem is particularly difficult when tracking subjects with previously unseen appearance variations. To address this problem, a number of registration and tracking methods have been developed based on local region descriptors and a non-rigid shape prior. We refer to this family of methods collectively as constrained local models (CLMs). Our definition of CLMs is much broader than that given by Cristinacce and Cootes [7], who employ the same name for their approach; their method can be thought of as a specific subset of the CLM family of models. Probably the best-known example of a CLM can be found in the seminal active shape model (ASM) work of Cootes and Taylor.

Instantiations of CLMs differ in the literature primarily with regard to: (i) whether the local experts employ a 1D or 2D local search; (ii) how the local experts are learnt; (iii) how the source image is normalized geometrically and photometrically before the application of the local experts; and (iv) how one fits the local experts' responses to conform to the global non-rigid shape prior. Disregarding these differences, however, all instantiations of CLMs can be considered to be pursuing the same two goals: (i) perform an exhaustive local search for each landmark around its current estimate using some kind of patch expert (i.e., feature detector), and (ii) optimize the global non-rigid shape parameters so that they best agree with the local responses across all landmarks.

A major advantage of CLMs over conventional methods for non-rigid registration, such as AAMs, lies in their ability to: (i) be discriminative and generalize well to unseen appearance variation; (ii) offer greater invariance to global illumination variation and occlusion; (iii) model the non-rigid object as an ensemble of low-dimensional independent patch experts; and (iv) avoid complicated piece-wise affine texture warp operations that might introduce unwanted noise [8].

For the AFERS project, we are extending the CLM framework in several ways. We are simplifying the optimization in a way that allows the optimization algorithm to be parallelized for faster performance. We are replacing the original non-linear patch experts suggested in [7] with linear support vector machines (SVMs), which further increases performance and improves the accuracy of model fitting. And we are using a composite warp in place of an additive warp, which increases robustness to changes in scale. These extensions of the CLM framework will enable sufficiently fast model fitting to support the demands of real-time expression recognition.

Figure 5. Examples of alignment performance on a single subject's face. Rows 1, 2, and 3 illustrate the alignment for the initial warp perturbation, the simultaneous (AAM) algorithm, and our constrained local model (using the exhaustive local search algorithm), respectively. Columns 1, 2, and 3 illustrate the alignment for initial warp perturbations of 10, 7.5, and 5 pixels RMS-PE, respectively.

In initial work, we evaluated our approach to CLM by comparing it with one of the leading approaches to AAM [9] on face images from the Multi-PIE database [10]. Multi-PIE consists of face images of 337 participants of Asian, Caucasian, and African-American background that were recorded under multiple pose, illumination, and expression conditions on as many as four occasions over several months. The database samples some of the variability that AFERS is intended to manage.
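Before turning to the fitting results, the sketch below illustrates the two CLM goals described above: an exhaustive local search over a small window around each landmark using a patch expert, followed by projection of the updated landmarks onto a PCA shape prior. The window size, data shapes, and function names are assumptions for illustration; this is not the parallelized, SVM-based fitting developed for AFERS.

```python
# Simplified CLM fitting step: exhaustive local search per landmark,
# then constraining the result with a PCA shape prior.
# Shapes, window size, and the patch experts are illustrative assumptions.
import numpy as np

def local_search(image, landmark, patch_expert, half_window=5):
    """Score every position in a small neighborhood and return the best one."""
    best_score, best_pos = -np.inf, landmark
    for dy in range(-half_window, half_window + 1):
        for dx in range(-half_window, half_window + 1):
            candidate = landmark + np.array([dx, dy], dtype=float)
            score = patch_expert(image, candidate)   # e.g., a linear SVM response
            if score > best_score:
                best_score, best_pos = score, candidate
    return best_pos

def constrain_to_shape_prior(shape, mean_shape, modes):
    """Project a raw shape onto the span of the shape modes around the mean."""
    coeffs = modes @ (shape - mean_shape)
    return mean_shape + modes.T @ coeffs

def clm_fit_step(image, shape, mean_shape, modes, patch_experts):
    """One iteration: local search for each landmark, then apply the shape prior."""
    points = shape.reshape(-1, 2)
    updated = np.array([local_search(image, p, e)
                        for p, e in zip(points, patch_experts)])
    return constrain_to_shape_prior(updated.reshape(-1), mean_shape, modes)

# Tiny usage with stand-in data: a dummy patch expert that favors the image center.
rng = np.random.default_rng(0)
image = rng.random((120, 120))
mean_shape = rng.normal(size=136)
modes = np.linalg.qr(rng.normal(size=(136, 3)))[0].T       # 3 orthonormal shape modes
experts = [lambda img, p: -np.sum((p - 60.0) ** 2)] * 68
fitted = clm_fit_step(image, mean_shape.copy(), mean_shape, modes, experts)
```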
The exhaustive local search (ELS) algorithm for the CLM was compared against two well-known AAM fitting approaches, namely the "simultaneous" (SIM) and "project-out" algorithms. The ELS algorithm obtained real-time fitting speeds of over 35 fps, compared to the SIM algorithm's speed of 2–3 fps. In addition, the ELS algorithm achieved superior alignment performance to the SIM algorithm in nearly all comparisons. For an example, see Figure 5; for further explanation and results, see [11].

VI. REPRESENTATION OF FACIAL FEATURES

Once the CLM has estimated the shape and appearance parameters, we can use this information to derive features from the face for expression recognition. Following the initial work conducted in [12], we extract the following features.

PTS: The similarity-normalized shape, sn, refers to the vertex points for the x- and y-coordinates of the face shape, resulting in a raw 136-dimensional feature vector. These points are the vertex locations after all the rigid geometric variation (translation, rotation, and scale), relative to the base shape, has been removed. The similarity-normalized shape sn can be obtained by synthesizing a shape instance of s that ignores the similarity parameters p. An example of the normalized shape features, PTS, is given in Figure 6.

Figure 6. Examples of AAM-derived representations. (a) Top row: input shape; bottom row: input image. (b) Top row: similarity-normalized shape (sn); bottom row: similarity-normalized appearance (an). (c) Top row: base shape (s0); bottom row: shape-normalized appearance (a0).

APP: The canonical normalized appearance, a0, refers to the appearance in which all the non-rigid shape variation has been normalized with respect to the base shape s0. This is accomplished by warping each triangle patch appearance in the source image so that it aligns with the base face shape. If we can remove all shape variation from an appearance, we obtain a representation referred to as the shape-normalized appearance, a0. This canonical normalized appearance a0 differs from the similarity-normalized appearance an in that it removes the non-rigid shape variation rather than the rigid shape variation. The resulting features yield an approximately 27,000-dimensional raw feature vector. A mask is applied to each image so that the same number of pixels is used for each. To reduce the dimensionality of the features, we use a 2D discrete cosine transform (DCT). Lucey et al. [11] found that using M = 500 coefficients gave the best results. Examples of images reconstructed with M = 500 are shown in Figure 7. Note that regardless of the head pose and orientation, the appearance features are projected back onto the normalized base shape, making these features more robust to such variability.

PTS+APP: The combination of shape and appearance features, sn + a0, refers to the shape features concatenated with the appearance features.

Figure 7. The top row shows the first three frames of an image sequence. The following rows show reconstructed images using 100, 200, and 500 DCT coefficients, respectively. Note that regardless of the head pose and orientation, the appearance features are projected back onto the normalized base shape, making these features more robust to such variability.
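The DCT-based reduction described above can be sketched as a 2D transform of the shape-normalized appearance followed by keeping only a small set of low-frequency coefficients (M = 500 in the text). The use of SciPy, the image size, and the square low-frequency block used to select coefficients are assumptions for illustration; the paper does not specify these details.

```python
# Sketch of reducing the shape-normalized appearance with a 2D DCT,
# keeping roughly M low-frequency coefficients. The square low-frequency
# block used here is an assumption; the text only states that M = 500
# coefficients worked best.
import numpy as np
from scipy.fft import dctn, idctn

def dct_features(appearance, m=500):
    """Return approximately m low-frequency 2D-DCT coefficients as a feature vector."""
    coeffs = dctn(appearance, norm="ortho")
    k = int(np.ceil(np.sqrt(m)))             # side of the retained low-frequency block
    return coeffs[:k, :k].ravel()[:m]

def reconstruct(features, image_shape, m=500):
    """Inverse transform for visual inspection, as with the reconstructed examples."""
    k = int(np.ceil(np.sqrt(m)))
    block = np.zeros(k * k)
    block[:features.size] = features
    coeffs = np.zeros(image_shape)
    coeffs[:k, :k] = block.reshape(k, k)
    return idctn(coeffs, norm="ortho")

# Usage with a stand-in 160x170 (~27,000 pixel) shape-normalized appearance:
appearance = np.random.default_rng(0).random((160, 170))
feats = dct_features(appearance)             # ~500-dimensional feature vector
approx = reconstruct(feats, appearance.shape)
```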
VII. EXPRESSION RECOGNITION

A leading approach to pattern recognition is that of support vector machines (SVMs) [13]. SVMs have proven useful in a number of pattern recognition tasks, including face and facial action recognition. SVMs attempt to find the hyperplane that maximizes the margin between positive and negative observations for a specified class. A linear SVM classification decision is made for an unlabelled test observation $\mathbf{x}_*$ by

$\hat{y} = \operatorname{sign}(\mathbf{w}^{\top}\mathbf{x}_* + b)$,

where $\mathbf{w}$ is the vector normal to the separating hyperplane and $b$ is the bias. Both $\mathbf{w}$ and $b$ are estimated so that they minimize the structural risk of a training set, thus avoiding the possibility of over-fitting to the training data. Typically, $\mathbf{w}$ is not defined explicitly, but through a linear sum of support vectors. As a result, SVMs offer additional appeal, as they allow for the employment of non-linear combination functions through the use of kernel functions such as the radial basis function (RBF), polynomial, and sigmoid kernels. For AFERS, we will use a linear kernel due to its ability to generalize well to unseen data in many pattern recognition tasks and its efficiency.

Figure 8 gives an example of AFERS processing of an image sequence. Input video is processed using the CLM. Shape and appearance parameters are estimated for each video frame and then input to an SVM for expression recognition. AFERS will be tested on two publicly available datasets, the Cohn-Kanade AU-Coded Facial Expression Database [2] and MMI [15], and on GEMEP [14].

Figure 8. Automatic Facial Expression Recognition System. Similarity-normalized shape and canonical appearance are estimated for each video frame. Parameters are then input to SVMs to recognize emotion expression on a frame-by-frame basis.

VIII. ANALYTICS ENGINE

During runtime, the AFERS application provides operators with several real-time outputs of the expression recognition process, along with snapshot generation and interrogation reporting.

A. Current FACS Emotion Response Indicator

AFERS displays the current expression response demonstrated by the subject, as determined by the automated FACS coding. Each time the subject's expression changes, even if only for a fraction of a second (as with microexpressions), the results are updated within the user interface in real time.

B. Trend Analysis

AFERS also provides a polyg