Video-based event recognition: activity representation and probabilistic recognition methods

Somboon Hongeng*, Ram Nevatia, Francois Bremond¹

Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA 90089, USA

Received 15 March 2002; accepted 2 February 2004. Available online 13 August 2004.
Computer Vision and Image Understanding 96 (2004) 129–162. doi:10.1016/j.cviu.2004.02.005

Abstract

We present a new representation and recognition method for human activities. An activity is considered to be composed of action threads, each thread being executed by a single actor. A single-thread action is represented by a stochastic finite automaton of event states, which are recognized from the characteristics of the trajectory and shape of the moving blob of the actor using Bayesian methods. A multi-agent event is composed of several action threads related by temporal constraints. Multi-agent events are recognized by propagating the constraints and likelihood of event threads in a temporal logic network. We present results on real-world data and performance characterization on perturbed data. © 2004 Elsevier Inc. All rights reserved.

Keywords: Video-based event detection; Event mining; Activity recognition

This research was supported in part by the Advanced Research and Development Activity of the U.S. Government under Contract No. MDA-908-00-C-0036. * Correspondence to: KOGS, FB Informatik, University of Hamburg, Vogt-Koelln-Str. 30, D-22527 Hamburg, Germany. E-mail addresses: hongeng@iris.usc.edu (S. Hongeng), nevatia@iris.usc.edu (R. Nevatia). ¹ Present address: INRIA Project ORION, 2004 Route des Lucioles BP 93, 06902 Sophia Antipolis Cedex, France.

1. Introduction

Automatic event detection in video streams is gaining attention in the computer vision research community due to the needs of many applications such as surveillance for security, video content understanding, and human–computer interaction. The type of events to be recognized can vary from a small-scale action such as facial expressions, hand gestures, and human poses to a large-scale activity that may involve a physical interaction among locomotory objects moving around in the scene for a long period of time. There may also be interactions between moving objects and other objects in the scene, requiring static scene understanding. Addressing all the issues in event detection is thus enormously challenging and a major undertaking. In this paper, we focus on the detection of large-scale activities where some knowledge of the scene (e.g., the characteristics of the objects in the environment) is known. One characteristic of the activities of interest is that they exhibit specific patterns of whole-body motion. For example, consider a group of people stealing luggage left unattended by its owners. One particular pattern of the "stealing" event may be: two persons approach the owners and obstruct the view of the luggage, while another person takes the luggage. In the following, the words "event" and "activity" are used to refer to a large-scale activity.

The task of activity recognition is to bridge the gap between numerical pixel-level data and a high-level abstract activity description. A common approach involves first detecting and tracking moving object features from image sequences.
The goal of this step is to transform pixel-level data into low-level features that are more appropriate for activity analysis. From the tracked features, the type of moving objects and their spatio-temporal interaction are then analyzed [1–4]. Several challenges need to be addressed to achieve this task:

- Motion detection and object tracking from real video data are often unstable due to poor video quality, shadows, occlusion, and so on. A single-view constraint common to many applications further complicates these problems.
- The interpretation of low-level features (e.g., the appearance of objects) may depend on the viewpoint.
- There is spatio-temporal variation in the execution style of the same activity by different actors, leading to a variety of temporal durations.
- Repeated performance by the same individual can vary in appearance.
- Similar motion patterns may be caused by different activities.

There is therefore a need for a generic activity representation as well as a robust recognition mechanism that handles both data and event variations. The representation must be able to describe a simple action as well as a complicated, cooperative task by several actors. For a pragmatic system, the representation should also be easily modified and extended by a user. Recognition methods must handle the probabilities accurately at all processing levels.

A large number of activity detection systems have been developed in recent decades. Details of some of the current approaches are given in Section 2. One deficiency of most approaches is that they are developed for events that suit the goal in a particular domain and lack genericity. Many event representations (e.g., an image-pixel-based representation) cannot be extended easily. Most event detection algorithms are only for simple events (e.g., "walking" or "running") performed by a single actor, or for specific movements such as periodic motion. Some of them rely on the accuracy of motion sensors and do not provide a measure of confidence in the results, which is crucial for discriminating similar events in a noisy environment.

In this paper, we present a system that overcomes some of these deficiencies. We model scenario events from shape and trajectory features using a hierarchical activity representation extended from [5], in which events are organized into several layers of abstraction, providing flexibility and modularity in the modeling scheme. The event recognition methods described in [5] are based on a heuristic method and could not handle multiple-actor events. In this paper, an event is considered to be composed of action threads, each thread being executed by a single actor. A single-thread action is represented by a stochastic finite automaton of event states, which are recognized from the characteristics of the trajectory and shape of the moving blob of the actor based on a rigorous Bayesian analysis. A multi-agent event is represented by an event graph composed of several action threads related by logical and temporal constraints. Multi-agent events are recognized by propagating the constraints and the likelihood of event threads in the event graph. Our earlier papers [4,6] have described these components to some extent; this paper integrates all the material and provides more details and a performance evaluation.
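To make the single-thread part of this representation concrete, the following is a minimal sketch, not the authors' implementation: the event-state names, the toy per-frame likelihoods, and the max-product recursion are our own assumptions standing in for the Bayesian analysis developed later in the paper.

```python
# Minimal sketch of a single-thread action as a stochastic finite automaton of
# event states (illustrative only; not the system described in this paper).
# Each frame supplies P(observation | event state); here those likelihoods are
# hand-written toy numbers rather than outputs of a Bayesian analysis.

def recognize_single_thread(states, frame_likelihoods):
    """Max-product pass over a left-to-right automaton: at each frame the
    automaton may stay in its current event state or advance to the next one.
    Returns the best likelihood of ending in the final event state."""
    best = [1.0] + [0.0] * (len(states) - 1)   # start in the first event state
    for obs in frame_likelihoods:              # obs maps state -> likelihood
        best = [max(best[i], best[i - 1] if i else 0.0) * obs.get(s, 0.0)
                for i, s in enumerate(states)]
    return best[-1]

# Toy run for a two-state event ("approach", then "stop"):
frames = [{"approach": 0.9, "stop": 0.1},
          {"approach": 0.6, "stop": 0.5},
          {"approach": 0.2, "stop": 0.9}]
print(recognize_single_thread(["approach", "stop"], frames))  # -> 0.486
```

Transition weights are left implicit and unnormalized here; the point is only the shape of the computation: ordered event states scored against per-frame evidence.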
The organization of the paper is as follows: related work is discussed in Section 2. An overview of our event detection system is given in Section 3. Our tracking approach based on ground-plane locations is in Section 4. The extended hierarchical representation is in Section 5. Our event recognition algorithms are described in detail, including experimental results, in Sections 6 and 7. Performance characterization of the algorithms is in Section 8.

2. Related work

During the last decade, there has been a significant amount of event understanding research in various application domains [7,8]. A review of the current approaches in motion analysis can be found in [9]. Most current approaches to activity recognition consist of defining models for specific activity types that suit the goal in a particular domain and developing procedural recognition methods. In [10], simple periodic events (e.g., "walking") are recognized by constructing dynamic models of the periodic pattern of human movements; these methods are highly dependent on the robustness of the tracking.

Bayesian networks have been used to recognize static postures (e.g., "standing close to a car") or simple events (e.g., "sitting") from the visual evidence gathered during one video frame [1,3,11]. The use of Bayesian networks in these approaches differs in the way they are applied (e.g., what data are used as evidential input and how these data are computed, the structures of the networks, etc.). One limitation of Bayesian networks is that they are not suitable for encoding the dynamics of long-term activities.

Inspired by applications in speech recognition, the Hidden Markov Model (HMM) formalism has been extensively applied to activity recognition [12–16]. In one of the earlier attempts [12], discrete HMMs are used as representations of tennis strokes. A feature vector of a snapshot of a tennis stroke is defined directly from the pixel values of a subsampled image. A tennis stroke is recognized by computing the probability that the HMM model produces the sequence of feature vectors observed during the action. Parameterized HMMs [14] and coupled HMMs [16] were introduced later to recognize more complex events such as an interaction of two mobile objects. In [2], a stochastic context-free grammar parsing algorithm is used to compute the probability of a temporally consistent sequence of primitive actions recognized by HMMs. Even though HMMs are robust against variation in the temporal segmentation of events, their structures and probability distributions are not transparent and need to be learned accurately using an iterative method. For complex events, the parameter space may become prohibitively large.
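As a generic illustration of the HMM scoring step just described (not code from any of the cited systems), the sketch below computes the probability that a small discrete HMM produces an observed symbol sequence using the standard forward algorithm; the transition, emission, and initial parameters are toy values.

```python
# Forward algorithm for a two-state discrete HMM (generic textbook recursion,
# not tied to this paper): sums P(observations, state path) over all paths.

import numpy as np

A = np.array([[0.7, 0.3],      # state transition probabilities
              [0.0, 1.0]])
B = np.array([[0.8, 0.2],      # emission probabilities: state x symbol
              [0.3, 0.7]])
pi = np.array([1.0, 0.0])      # initial state distribution

def forward_likelihood(observations):
    """P(observation sequence | model), marginalizing over state sequences."""
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward_likelihood([0, 0, 1, 1]))  # likelihood of a short toy sequence
```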
There is only a limited amount of research on multi-agent events [3,17], as tracking multiple objects in a natural scene is difficult, as is maintaining the parameters of the fine temporal granularity of event models such as HMMs. In [3], a complicated Bayesian network is defined, together with specific functions to evaluate temporal relationships among events (e.g., before and around), to recognize actions involving multiple agents tracked manually in a football match. Generalizing this system to tasks other than those of a football match may require substantial development.

In recent years, there has also been a significant amount of work toward the fusion of multimodal information (e.g., color, motion, acoustics, speech, and text) for event and action recognition. Most approaches [18–20] rely on contextual knowledge and are limited to specific domains (e.g., offices, classrooms, and TV programs).

Our approach is closely related to the work by Ivanov and Bobick [2] in the sense that external knowledge about the problem domain is incorporated into the expected structure of the activity model. In [5], we introduced a hierarchical activity representation that allows the recognition of a series of actions performed by a single mobile object. Image features are linked explicitly to a symbolic notion of activity through several layers of more abstract activity descriptions. Rule-based methods are used to approximate the belief in the occurrence of activities. A set of rules is defined at each recognition step to verify whether the properties of mobile objects match their expected distributions (represented by a mean and a variance) for a particular action or event. This method often involves careful hand-tuning of parameters such as threshold values. In this paper, we extend the representation described in [5] and present a recognition algorithm that computes the probabilities of activities in a more rigorous way using Bayesian and logical methods.
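The contrast between the two styles can be made concrete with a toy sketch (our own construction, with invented numbers): a hand-tuned rule returns a hard yes/no for a property such as speed, while a probabilistic evaluation scores the observed value against its expected distribution and leaves the decision to downstream combination.

```python
# Illustrative contrast between a hand-tuned rule and a distribution-based
# score for the property "speed" in an event state such as "stopped".
# Threshold, mean, and standard deviation are invented toy values.

import math

def rule_based(speed, threshold=0.2):
    # Hard threshold: "the object is stopped" iff speed is below a tuned cutoff.
    return speed < threshold

def gaussian_likelihood(speed, mean=0.0, std=0.15):
    # Gaussian density of the observation under the expected distribution;
    # a graded score that can feed into a Bayesian combination step.
    return math.exp(-0.5 * ((speed - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

print(rule_based(0.25), gaussian_likelihood(0.25))  # False vs. a graded score
```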
3. Overview of the system

Fig. 1 shows schematically our approach to recognizing the behavior of moving objects from an image sequence and available context. Context consists of associated information, other than the sensed data, that is useful for activity recognition, such as a spatial map and a prior activity expectation (task context). Our system is composed of two modules: (1) Motion Detection and Tracking (shown in a light shade) and (2) Event Analysis (shown in a dark shade).

Fig. 1. Overview of the system.

Our tracking system is augmented from the graph-based moving-blob tracking system described in [5]. A stationary single-view camera is used in our experiments. Background pixels are learned statistically in real time from the input video streams. Moving regions are segmented from the background by detecting changes in intensity. Knowledge of the ground plane, acquired as spatial context, is used to filter moving regions and track objects robustly. Shape and trajectory features of moving objects are then computed by low-level image processing routines and used to infer the probability of potential events defined in a library of scenario event models.

Events in the scenario library are modeled using a hierarchical event representation, in which a hierarchy of entities is defined to bridge the gap between a high-level event description and the pixel-level information. Fig. 2 shows an example of this representation for the event "converse," which is described as "a person approaches a reference person from a distance and then stops at the reference person when he arrives." Image features are defined at the lowest layer of the event representation. Several layers of more abstract mobile object properties and scenarios are then constructed explicitly by users to describe the more complex and abstract activity shown at the highest layer.

Fig. 2. A representation of the complex event "converse."

Mobile object properties are general properties of a mobile object that are computed over a few frames. Some properties are elementary, such as width, height, color histogram, or texture, while others are complex (e.g., a graph description of the shape of an object). Properties can also be defined with regard to the context (e.g., "located in the security area"). In Fig. 2, mobile object properties are defined based on the characteristics of the shapes and trajectories (motion) of the moving blobs. The links between a mobile object property at a higher layer and a set of properties at lower layers represent some relation between them (e.g., taking the ratio of the width and height properties to compute the aspect ratio of the shape of an object). Typically, approximately three layers of properties are defined and used across a broad spectrum of applications. A filtering function and a mean function are applied to property values collected over time to minimize the errors caused by environmental and sensor noise.

Scenarios correspond to long-term activities described by the classes of moving objects (e.g., human, car, or suitcase) and the event in which they are involved. Both the object class and the event have a confidence value (or a probability distribution) attached to them based on statistical analysis. Scenarios are defined from a set of properties or sub-scenarios; the structure of a scenario is thus hierarchical. We classify scenario events into single-thread and multiple-thread events. In a single-thread event, relevant actions occur along a linear time scale. Single-thread events are further categorized into simple and complex events. Simple events are defined as a short coherent unit of movement (e.g., "approaching a reference person") and can be verified from a set of sub-events ("getting closer to the reference person," "heading toward," etc.) and mobile object properties. Complex events are a linearly ordered time sequence of simple events (or other complex events), requiring a long-term evaluation of sub-events. In Fig. 2, the complex event "converse" is a sequential occurrence of "approach a reference person" and "stop at the reference person." A multiple-thread event is composed of several action threads, possibly performed by several actors; these action threads are related by logical and time relations. In a typical application, there are about two to four layers of single-thread events and another two to four layers of multiple-thread events. Since our event representation at the scenario level maps closely to how a human would describe events, little expertise is expected from the users; they only need a basic understanding of our event taxonomy.

The recognition process begins with the evaluation of the evidence (image features) and the computation of the probabilities of simple events at each time frame based on Bayesian analysis. The probabilities of simple events are then combined over the long term to recognize and segment complex events. Multiple-thread events are recognized by combining the probabilities of complex event threads whose temporal segmentations satisfy the logical and time constraints. We describe each component of our system in detail in the following.
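A toy sketch of this final combination step may help fix ideas. It is our own construction, not the paper's algorithm: the thread names, intervals, likelihoods, and the independence assumption in the product are all invented for illustration.

```python
# Sketch of combining complex-event threads into a multiple-thread event:
# threads whose temporal segmentations violate a declared constraint
# contribute zero; otherwise thread likelihoods are combined (assumed
# independent here). Intervals are (start_frame, end_frame) pairs.

def before(a, b):
    """Allen-style 'before': thread a ends no later than thread b starts."""
    return a[1] <= b[0]

def multi_thread_likelihood(threads, constraints):
    """threads: name -> (interval, likelihood) from the single-thread stage.
    constraints: list of (name1, relation, name2) to check pairwise."""
    for n1, rel, n2 in constraints:
        if not rel(threads[n1][0], threads[n2][0]):
            return 0.0                        # temporal constraint violated
    p = 1.0
    for interval, likelihood in threads.values():
        p *= likelihood                       # naive independent combination
    return p

# Toy multi-thread event: "block view" must happen before "take luggage".
threads = {"block_view":   ((10, 40), 0.8),
           "take_luggage": ((45, 60), 0.7)}
print(multi_thread_likelihood(threads, [("block_view", before, "take_luggage")]))
# -> 0.56; swapping the intervals would violate the constraint and yield 0.0
```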
4. Detection and tracking

Activity recognition by computer involves the analysis of the spatio-temporal interaction among the trajectories of moving objects [1,3,6]. Robust detection and tracking of moving objects from an image sequence is therefore an important key to reliable activity recognition. In the case of a static camera, the detection of moving regions is relatively easy to perform, often based on background modeling and foreground segmentation. However, noise, shadows, and reflections often arise in real sequences, causing detection to be unstable. For example, moving regions belonging to the same object may not connect, or may merge with some unrelated regions. Tracking moving objects involves making hypotheses about the shapes of the objects from such unstable moving regions and tracking them correctly in the presence of partial or total occlusions. If some knowledge about the objects being tracked or about the scene is available, tracking can be simplified [21]. Otherwise, correspondence between regions must be established based on pixel-level information such as shape and texture [5]. Such auxiliary knowledge other than the sensed data is called context. In many applications, a large amount of context is available. In this paper, we demonstrate the use of ground-plane information as a constraint to achieve robust tracking.

Robust tracking often requires an object model and a sophisticated optimization process [22]. In the case that a model is not available or the size of the image of an object is too small, tracking must rely on the spatial and temporal correspondence between low-level image features of moving regions. One difficulty is that the same moving regions at different times may split into several parts or merge with other objects nearby due to noise, occlusion, and low contrast. Fig. 3 illustrates one of these problems, where the moving region $R_i^t$ at time $t$ (a human shape) splits into two smaller regions $R_j^{t+1}$ and $R_k^{t+1}$ and noise detected at time $t + 1$. The image correlation between the moving region $R_i^t$ and $R_j^{t+1}$ or $R_k^{t+1}$ by itself is often low and creates an incorrect trajectory. Filtering moving regions is therefore an important step toward a reliable trajectory computation.

Fig. 3. Splitting of moving regions and noise. (A) Frame 74. (B) Frame 75.

4.1. Ground plane assumption for filtering

Let us assume that objects move along a known ground plane. An estimate of