Vision, Perception and Multimedia Understanding
1. Members
2. Overall Objectives
2.1. Presentation
3.3. Semantic Activity Recognition
3.4. Software Engineering for Activity Recognition
4. Application Domains
4.1. Introduction
4.2. Video Analytics
4.3. Healthcare Monitoring
5. Software
5.1. SUP
5.2. ViSEval
5.3. Clem
6. New Results
6.1. Introduction
6.2. Image Compression and Modelization
6.3. Background Subtraction
6.22.1. Run Time Adaptation Architecture
6.22.2. Metrics on Feature Models to Optimize Configuration Adaptation at Run Time
6.23.1. Scenario Analysis Module (SAM)
6.23.2. The clem Workflow
6.23.3. Multiple Services for Device Adaptive Platform for Scenario Recognition
7. Partnerships and Cooperations
7.3.1.1. VANAHEIM
7.3.1.2. SUPPORT
7.3.2. Collaborations in European Programs, except FP7
8.1.1. Conference Organization
8.1.3. Conferences
8.1.4. Invited Talk
8.1.5. Advisory Board
8.1.6. Expertise
9. Bibliography
Keywords: Perception, Semantics, Machine Learning, Software Engineering, Cognitive Vision
Creation of the Team: January 01, 2012. Updated into Project-Team: January 01, 2013.
François Brémond [Team Leader, DR2 Inria, HdR]
Guillaume Charpiat [CR1 Inria]
Sabine Moisan [CR1 Inria, HdR]
Annie Ressouche [CR1 Inria]
Monique Thonnat [DR1 Inria, HdR]

Etienne Corvée [Research Engineer at Link Care Services]
Daniel Gaffé [Assistant Professor, Faculty Member, Nice University and CNRS-LEAT member, on secondment since September 2012]
Aurelie Gouze [Research Engineer at CSTB, since December 2012]
Veronique Joumier [Research Engineer, CHU Nice University, up to November 2012]
Jean-Paul Rigault [Professor, Faculty Member, Nice Sophia-Antipolis University]
Philippe Robert [Professor, CHU Nice University]
Jean-Yves Tigli [Assistant Professor, Faculty Member, Nice Sophia-Antipolis University]

Slawomir Bak [Development Engineer, VideoID, since August 2012]
Vasanth Bathrinarayanan [Development Engineer, VICOMO Project]
Bernard Boulay [Development Engineer, COFRIEND and QUASPER Projects, up to October 2012]
Duc Phu Chau [Development Engineer, VANAHEIM Project, since March 2012]
Hervé Falciani [Development Engineer, EIT ICT Labs, up to August 2012]
Baptiste Fosty [Development Engineer, since February 2012]
Julien Gueytat [Development Engineer, SWEET HOME Project]
Jihed Joobeur [Development Engineer, PAL AEN, up to September 2012]
Srinidhi Mukanahallipatna [Development Engineer, PAL AEN]
Anh-Tuan Nghiem [Development Engineer, since September 2012]
Jose-Luis Patino Vilchis [Development Engineer, COFRIEND and VANAHEIM Projects, up to June 2012]
Guido-Tomas Pusiol [Development Engineer, since June 2012]
Leonardo Rocha [Development Engineer, CIU Santé, SWEET HOME and VICOMO Projects, up to October 2012]
Silviu-Tudor Serban [Development Engineer, QUASPER Project, up to December 2012]
Sofia Zaidenberg [Development Engineer, VANAHEIM Project]
Salma Zouaoui-Elloumi [Development Engineer, VANAHEIM Project, since December 2012]

Julien Badie [Nice Sophia-Antipolis University, SWEET HOME Grant]
Slawomir Bak [Nice Sophia-Antipolis University, VideoID Grant, up to August 2012]
Piotr Bilinski [Nice Sophia-Antipolis University, Paca Grant]
Duc Phu Chau [Nice Sophia-Antipolis University, Paca Grant, up to March 2012]
Carolina Garate [Nice Sophia-Antipolis University, VANAHEIM Grant]
Ratnesh Kumar [Nice Sophia-Antipolis University, VANAHEIM Grant]
Guido-Tomas Pusiol [Nice Sophia-Antipolis University, CORDIs, up to June 2012]
Rim Romdhame [Nice Sophia-Antipolis University, CIU Santé Project]
Malik Souded [Nice Sophia-Antipolis University, Keeneo CIFRE Grant]

Carlos-Fernando Crispim Junior [PAL AEN]

Christine Claux [AI Inria, up to May 2012]
Sonia Rousseau [since June 2012 up to end of July 2012]
Jane Desplanques [since September 2012]

Pierre Aittahar [since April 2012 up to June 2012]
Guillaume Barbe [since April 2012 up to June 2012]
Sorana-Maria Capalnean [EGIDE, since July 2012 up to October 2012]
Cintia Corti [EGIDE, since May 2012 up to November 2012]
Eben Freeman [EGIDE, since June 2012 up to September 2012]
Vaibhav Katiyar [ACET, since July 2012 up to December 2012]
Vannara Loch [since April 2012 up to June 2012]
Qiao Ma [China, EGIDE, since July 2012 up to October 2012]
Firat Ozemir [since June 2012 up to September 2012]
Luis-Emiliano Sanchez [EGIDE, since September 2012 up to end of December 2012]
Bertrand Simon [ENS Lyon, since June 2012 up to mid-July 2012]
Abhineshwar Tomar [ACET, since November 2012]
Swaminathan Sankaranarayanan [EGIDE, up to June 2012]
2. Overall Objectives
2.1.1. Research Themes
STARS (Spatio-Temporal Activity Recognition Systems) is focused on the design of cognitive systems for activity recognition. We aim at endowing cognitive systems with perceptual capabilities to reason about an observed environment and to provide a variety of services to people living in this environment while preserving their privacy. In today's world, a huge number of new sensors and hardware devices are available, potentially addressing new needs of modern society. However, the lack of automated processes (with no human interaction) able to extract meaningful and accurate information (i.e. a correct understanding of the situation) has often generated frustration in society, especially among older people. Therefore, the Stars objective is to propose novel autonomous systems for the real-time semantic interpretation of dynamic scenes observed by sensors. We study long-term spatio-temporal activities performed by several interacting agents, such as human beings, animals and vehicles, in the physical world. Such systems also raise fundamental software engineering problems: how to specify them and how to adapt them at run time.
We propose new techniques at the frontier between computer vision, knowledge engineering, machine learning and software engineering. The major challenge in semantic interpretation of dynamic scenes is to bridge the gap between the task-dependent interpretation of data and the flood of measures provided by sensors. The problems we address range from physical object detection and activity understanding to activity learning, vision system design and evaluation. The two principal classes of human activities we focus on are assistance to older adults and video analytics.
A typical example of a complex activity is shown in Figure 1 and Figure 2 for a homecare application. In this example, the monitoring of an older person's apartment could last several months. The activities involve interactions between the observed person and several pieces of equipment. The application goal is to recognize the everyday activities at home through formal activity models (as shown in Figure 3) and data captured by a network of sensors embedded in the apartment. Here typical services include an objective assessment of the frailty level of the observed person, in order to provide more personalized care and to monitor the effectiveness of a prescribed therapy. The assessment of the frailty level is performed by an activity recognition system which transmits a textual report (containing only meta-data) to the general practitioner who follows the older person. Thanks to the recognized activities, the quality of life of the observed people can thus be improved and their personal information preserved.
Figure 1. Homecare monitoring: the set of sensors embedded in an apartment
Figure 2. Homecare monitoring: the different views of the apartment captured by 4 video cameras
The ultimate goal is for cognitive systems to perceive and understand their environment so as to provide appropriate services to a potential user. An important step is to propose a computational representation of people's activities in order to adapt these services to them. Up to now, the most effective sensors have been video cameras, due to the rich information they can provide on the observed environment. However, these sensors are currently perceived as intrusive. A key issue is to capture the pertinent raw data for adapting the services to the people while preserving their privacy. We plan to study different solutions, including of course the local processing of the data without transmission of images, and the use of new compact sensors developed for interaction (also called RGB-Depth sensors, an example being the Kinect) or networks of small non-visual sensors.

Activity(PrepareMeal,
  PhysicalObjects((p : Person), (z : Zone), (eq : Equipment))
  Components((s_inside : InsideKitchen(p, z))
             (s_close : CloseToCountertop(p, eq))
             (s_stand : PersonStandingInKitchen(p, z)))
  Constraints((z->Name = Kitchen)
              (eq->Name = Countertop)
              (s_close->Duration >= 100)
              (s_stand->Duration >= 100))
  Annotation(AText("prepare meal")
             AType("not urgent")))

Figure 3. Homecare monitoring: example of an activity model describing a scenario related to the preparation of a meal with a high-level language
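To make the semantics of such a declarative model concrete, the following is a minimal, hypothetical sketch of how the PrepareMeal constraints could be evaluated over recognized sub-events; the function and attribute names are illustrative only and do not reproduce the actual SUP event language.

```python
# Illustrative evaluation of the PrepareMeal activity model above.
# Durations are in frames; names mirror the model but are simplified.

def recognize_prepare_meal(state):
    """state maps sub-event names to attribute dicts; returns the
    annotation when all components hold and constraints are met."""
    inside = state.get("s_inside")
    close = state.get("s_close")
    stand = state.get("s_stand")
    if not (inside and close and stand):
        return None  # a component sub-event is missing
    # Constraints from the model: both sub-events last >= 100 frames.
    if close["duration"] >= 100 and stand["duration"] >= 100:
        return {"text": "prepare meal", "type": "not urgent"}
    return None

state = {
    "s_inside": {"duration": 250},
    "s_close": {"duration": 120},
    "s_stand": {"duration": 180},
}
print(recognize_prepare_meal(state))  # {'text': 'prepare meal', 'type': 'not urgent'}
```

In the real system such constraints are interpreted generically from the model description rather than hard-coded per activity.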
2.1.2. International and Industrial Cooperation
Our work has been applied in the context of more than 10 European projects such as COFRIEND, ADVISOR, SERKET, CARETAKER, VANAHEIM, SUPPORT, DEM@CARE, VICOMO. We had or have industrial collaborations in several domains: transportation (CCI Airport Toulouse Blagnac, SNCF, Inrets, Alstom, Ratp, Turin GTT (Italy)), banking (Crédit Agricole Bank Corporation, Eurotelis and Ciel), security (Thales R&T FR, Thales Security Syst, EADS, Sagem, Bertin, Alcatel, Keeneo), multimedia (Multitel (Belgium), Thales Communications, Idiap (Switzerland)), civil engineering (Centre Scientifique et Technique du Bâtiment (CSTB)), computer industry (BULL), software industry (AKKA), hardware industry (ST-Microelectronics) and health industry (Philips, Link Care Services, Vistek).
We have international cooperations with research centers such as Reading University (UK), ENSI Tunis (Tunisia), National Cheng Kung University, National Taiwan University (Taiwan), MICA (Vietnam), IPAL, I2R (Singapore), University of Southern California, University of South Florida, University of Maryland (USA).
2.2. Highlights of the Year
Stars designs cognitive vision systems for activity recognition based on sound software engineering paradigms. This year, we have designed several novel algorithms for activity recognition systems. In particular, we have extended an efficient algorithm for detecting people in a static image based on a cascade of classifiers. We have also proposed a new algorithm for re-identification of people through a camera network; this algorithm outperforms state-of-the-art approaches on several benchmarking datasets (e.g. iLIDS). We have developed a new algorithm for the recognition of short actions and also validated its performance on several benchmarking databases (e.g. ADL). We have improved a generic event recognition algorithm by handling event uncertainty at several processing levels. We have extended an original work on learning techniques, such as data mining in large multimedia databases based on offline trajectory clustering. We have designed a generic controller algorithm which is able to automatically tune the parameters of tracking algorithms. We have also continued a large clinical trial with Nice Hospital to characterize the behaviour profile of Alzheimer patients compared to healthy older people. Finally, we have organized a summer school, held at Inria in October 2012 and entitled "Human Activity and Vision Summer School", with many prestigious researchers (e.g. M. Shah).
3. Scientiﬁc Foundations
Stars follows three main research directions: perception for activity recognition, semantic activity recognition, and software engineering for activity recognition. These three research directions are interleaved: the software architecture direction provides new methodologies for building safe activity recognition systems and the perception and the semantic activity recognition directions provide new activity recognition techniques which are designed and validated for concrete video analytics and healthcare applications. Conversely, these concrete systems raise new software issues that enrich the software engineering research direction.
Transversally, we consider a new research axis in machine learning, combining a priori knowledge and learning techniques, to set up the various models of an activity recognition system. A major objective is to automate model building or model enrichment at the perception level and at the understanding level.
3.2. Perception for Activity Recognition
Participants: Guillaume Charpiat, François Brémond, Sabine Moisan, Monique Thonnat.
Computer Vision; Cognitive Systems; Learning; Activity Recognition.
Our main goal in perception is to develop vision algorithms able to address the large variety of conditions characterizing real-world scenes in terms of sensor conditions, hardware requirements, lighting conditions, physical objects, and application objectives. We also address several issues combining machine learning and perception techniques: learning people's appearance, learning parameters for system control, and learning shape statistics.
3.2.2. Appearance models and people tracking
An important issue is to detect physical objects in real time from perceptual features and predefined 3D models. It requires finding a good balance between efficient methods and precise spatio-temporal models. Many improvements and analyses need to be performed in order to tackle the large range of people detection scenarios.
Appearance models. In particular, we study the temporal variation of the features characterizing the appearance of a human. This task could be achieved by clustering potential candidates depending on their position and their reliability. It can provide any people tracking algorithm with reliable features, allowing for instance (1) to better track people or their body parts during occlusions, or (2) to model people's appearance for re-identification purposes in mono- and multi-camera networks, which is still an open issue. The underlying challenge of the person re-identification problem arises from significant differences in illumination, pose and camera parameters. Re-identification approaches have two aspects: (1) establishing correspondences between body parts and (2) generating signatures that are invariant to different color responses. As we already have several descriptors which are color invariant, we now focus more on aligning two people detections and on finding their corresponding body parts. Having detected body parts, the approach can handle pose variations. Further, different body parts might have different influence on finding the correct match within a whole gallery dataset. Thus, re-identification approaches have to search for matching strategies. As the results of re-identification are always given as a ranking list, re-identification focuses on learning to rank. "Learning to rank" is a type of machine learning problem in which the goal is to automatically construct a ranking model from training data.
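The ranking step described above can be sketched as follows, under strong simplifying assumptions: each detection is reduced to a toy feature vector standing in for a visual signature, and the gallery is ranked by distance to the probe. Real signatures, metrics and learned ranking models are far more elaborate.

```python
# Hedged sketch of the ranking list in person re-identification.
import math

def signature_distance(a, b):
    # Euclidean distance between two feature vectors (signatures).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(probe, gallery):
    """Return gallery identities sorted from best to worst match."""
    return sorted(gallery, key=lambda pid: signature_distance(probe, gallery[pid]))

# Toy gallery of three identities with made-up 3-bin signatures.
gallery = {"person_a": [0.9, 0.1, 0.0],
           "person_b": [0.2, 0.7, 0.1],
           "person_c": [0.1, 0.1, 0.8]}
probe = [0.8, 0.2, 0.0]
print(rank_gallery(probe, gallery))  # ['person_a', 'person_b', 'person_c']
```

A learning-to-rank method would replace the fixed Euclidean metric with one trained so that correct matches are pushed to the top of such lists.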
Therefore, we work on information fusion to handle perceptual features coming from various sensors (several cameras covering a large scale area or heterogeneous sensors capturing more or less precise and rich information). New 3D sensors (e.g. Kinect) are also investigated, to help in getting an accurate segmentation for speciﬁc scene conditions.
Long term tracking. For activity recognition we need robust and coherent object tracking over long periods of time (often several hours in video surveillance and several days in healthcare). To guarantee the long-term coherence of tracked objects, spatio-temporal reasoning is required. Modelling and managing the uncertainty of these processes is also an open issue. In Stars we propose to add a reasoning layer to a classical Bayesian framework modelling the uncertainty of the tracked objects. This reasoning layer can take into account the a priori knowledge of the scene for outlier elimination and long-term coherency checking.
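As a rough illustration of such a reasoning layer (not the actual Stars implementation), a priori scene knowledge can be encoded as simple coherency rules applied on top of the tracker's output; the zone geometry and speed bound below are invented for the example.

```python
# Illustrative reasoning layer: reject outlier detections using a priori
# scene knowledge before they corrupt a long-term track.

SCENE = {"xmax": 10.0, "ymax": 8.0}   # assumed ground-plane extent (metres)
MAX_SPEED = 2.5                        # assumed plausible displacement per step

def coherent(prev, cand):
    """Accept a candidate position only if it lies inside the known
    scene and is reachable from the previous position."""
    x, y = cand
    if not (0 <= x <= SCENE["xmax"] and 0 <= y <= SCENE["ymax"]):
        return False  # outside the known scene: outlier
    px, py = prev
    return ((x - px) ** 2 + (y - py) ** 2) ** 0.5 <= MAX_SPEED

def filter_track(detections):
    track = [detections[0]]
    for cand in detections[1:]:
        if coherent(track[-1], cand):
            track.append(cand)  # otherwise drop the incoherent detection
    return track

# The jump to (9.9, 7.9) is physically implausible and gets eliminated.
print(filter_track([(1, 1), (1.5, 1.2), (9.9, 7.9), (2.0, 1.5)]))
```

In the actual framework these checks would adjust probabilities in a Bayesian model rather than hard-reject detections.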
Controlling system parameters. Another research direction is to manage a library of video processing programs. We are building a perception library by selecting robust algorithms for feature extraction, by ensuring they work efficiently under real-time constraints, and by formalizing their conditions of use within a program supervision model. In the case of video cameras, at least two problems are still open: robust image segmentation and meaningful feature extraction. For these issues, we are developing new learning techniques.
3.2.3. Learning shape and motion
Another approach, to improve jointly segmentation and tracking, is to consider videos as 3D volumetric data and to search for trajectories of points that are statistically coherent both spatially and temporally. This point of view enables new kinds of statistical segmentation criteria and ways to learn them.
We are also using the shape statistics developed in  for the segmentation of images or videos with shape prior, by learning local segmentation criteria that are suitable for parts of shapes. This uniﬁes patchbased detection methods and active-contour-based segmentation methods in a single framework. These shape statistics can be used also for a ﬁne classiﬁcation of postures and gestures, in order to extract more precise information from videos for further activity recognition. In particular, the notion of shape dynamics has to be studied.
More generally, to improve segmentation quality and speed, different optimization tools such as graph-cuts can be used, extended or improved.
3.3. Semantic Activity Recognition
Participants: Guillaume Charpiat, François Brémond, Sabine Moisan, Monique Thonnat.
Activity Recognition, Scene Understanding, Computer Vision
Semantic activity recognition is a complex process where information is abstracted through four levels: signal (e.g. pixel, sound), perceptual features, physical objects and activities. The signal and feature levels are characterized by strong noise and by ambiguous, corrupted and missing data. The whole process of scene understanding consists in analysing this information to bring forth pertinent insight into the scene and its dynamics while handling the low-level noise. Moreover, to obtain a semantic abstraction, building activity models is a crucial point. A still open issue consists in determining whether these models should be given a priori or learned. Another challenge consists in organizing this knowledge in order to capitalize on experience, share it with others and update it along with experimentation. To face this challenge, tools from knowledge engineering such as machine learning or ontologies are needed.
Thus we work along the two following research axes: high level understanding (to recognize the activities of physical objects based on high level activity models) and learning (how to learn the models needed for activity recognition).
3.3.2. High Level Understanding
A challenging research axis is to recognize subjective activities of physical objects (i.e. human beings, animals, vehicles) based on a priori models and objective perceptual measures (e.g. robust and coherent object tracks).
To reach this goal, we have deﬁned original activity recognition algorithms and activity models. Activity recognition algorithms include the computation of spatio-temporal relationships between physical objects. All the possible relationships may correspond to activities of interest and all have to be explored in an efﬁcient way. The variety of these activities, generally called video events, is huge and depends on their spatial and temporal granularity, on the number of physical objects involved in the events, and on the event complexity (number of components constituting the event).
Concerning the modelling of activities, we are working towards two directions: the uncertainty management for representing probability distributions and knowledge acquisition facilities based on ontological engineering techniques. For the ﬁrst direction, we are investigating classical statistical techniques and logical approaches. We have also built a language for video event modelling and a visual concept ontology (including color, texture and spatial concepts) to be extended with temporal concepts (motion, trajectories, events ...) and other perceptual concepts (physiological sensor concepts ...).
3.3.3. Learning for Activity Recognition
Given the difﬁculty of building an activity recognition system with a priori knowledge for a new application, we study how machine learning techniques can automate building or completing models at the perception level and at the understanding level.
At the understanding level, we are learning primitive event detectors. This can be done for example by learning visual concept detectors using SVMs (Support Vector Machines) with perceptual feature samples. An open question is how far can we go in weakly supervised learning for each type of perceptual concept
(i.e. leveraging the human annotation task). A second direction is to learn typical composite event models for frequent activities using trajectory clustering or data mining techniques. We name composite event a particular combination of several primitive events.
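The trajectory-clustering idea for mining composite events can be sketched very roughly as follows: each trajectory is abstracted to its (start zone, end zone) pair on a coarse grid, and frequently recurring pairs become candidate activity patterns. The grid cell size and support threshold are arbitrary choices for the example, not values from the actual system.

```python
# Rough sketch of mining candidate composite-event patterns from
# trajectories by clustering them on a coarse spatial grid.
from collections import Counter

def zone(point, cell=5.0):
    """Discretize a ground-plane point into a grid cell."""
    return (int(point[0] // cell), int(point[1] // cell))

def frequent_patterns(trajectories, min_support=2):
    """Count (start zone, end zone) pairs and keep the frequent ones."""
    counts = Counter((zone(t[0]), zone(t[-1])) for t in trajectories)
    return {p: n for p, n in counts.items() if n >= min_support}

trajs = [
    [(1, 1), (3, 2), (8, 7)],    # zone (0,0) -> zone (1,1)
    [(0, 2), (4, 4), (9, 6)],    # zone (0,0) -> zone (1,1)
    [(12, 1), (6, 3), (1, 8)],   # zone (2,0) -> zone (0,1)
]
print(frequent_patterns(trajs))  # {((0, 0), (1, 1)): 2}
```

A frequent pattern such as "from the entrance zone to the kitchen zone" would then be proposed to an expert as a candidate primitive of a composite event model.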
3.3.4. Activity Recognition and Discrete Event Systems
The previous research axes are essential to cope with semantic interpretation. However, they tend to leave aside the purely event-driven aspects of scenario recognition. These aspects have been studied for a long time at a theoretical level and have led to methods and tools that may bring extra value to activity recognition, the most important being the possibility of formal analysis, verification and validation.
We have thus started to specify a formal model to deﬁne, analyze, simulate, and prove scenarios. This model deals with both absolute time (to be realistic and efﬁcient in the analysis phase) and logical time (to beneﬁt from well-known mathematical models providing re-usability, easy extension, and veriﬁcation). Our purpose is to offer a generic tool to express and recognize activities associated with a concrete language to specify activities in the form of a set of scenarios with temporal constraints. The theoretical foundations and the tools being shared with Software Engineering aspects, they will be detailed in section 3.4.
The results of the research performed in perception and semantic activity recognition (ﬁrst and second research directions) produce new techniques for scene understanding and contribute to specify the needs for new software architectures (third research direction).
3.4. Software Engineering for Activity Recognition
Participants: Sabine Moisan, Annie Ressouche, Jean-Paul Rigault, François Brémond.
Software Engineering, Generic Components, Knowledge-based Systems, Software Component Platform, Object-oriented Frameworks, Software Reuse, Model-driven Engineering

The aim of this research axis is to build general solutions and tools to develop systems dedicated to activity recognition. For this, we rely on state-of-the-art Software Engineering practices to ensure both sound design and easy use, providing genericity, modularity, adaptability, reusability, extensibility, dependability, and maintainability.
This research requires theoretical studies combined with validation based on concrete experiments conducted in Stars. We work on the following three research axes: models (adapted to the activity recognition domain), platform architecture (to cope with deployment constraints and run time adaptation), and system veriﬁcation (to generate dependable systems). For all these tasks we follow state of the art Software Engineering practices and, if needed, we attempt to set up new ones.
3.4.1. Platform Architecture for Activity Recognition
Figure 4. Global Architecture of an Activity Recognition Platform. The grey areas contain software engineering support modules, whereas the other modules correspond to software components (at the Task and Component levels) or to generated systems (at the Application level).
In the former project-teams Orion and Pulsar, we developed two platforms: VSIP, a library of real-time video understanding modules, and LAMA, a software platform enabling the design not only of knowledge bases, but also of inference engines and additional tools. LAMA offers toolkits to build and to adapt all the software elements that compose a knowledge-based system or a cognitive system.
Figure 4 presents our conceptual vision for the architecture of an activity recognition platform. It consists of three levels:
• The Component Level, the lowest one, offers software components providing elementary operations and data for perception, understanding, and learning.
Machines (SVM), Case-based Learning (CBL), clustering, etc. An activity recognition system is likely to pick components from these three packages. Hence, tools must be provided to configure (select, assemble), simulate, and verify the resulting component combination. Other support tools may help to generate task- or application-dedicated languages or graphic interfaces.
The philosophy of this architecture is to offer, at each level, a balance between the widest possible genericity and the maximum effective reusability, in particular at the code level. To cope with real application requirements, we shall also investigate distributed architectures, real-time implementation, and user interfaces.
Concerning implementation issues, we shall use when possible existing open standard tools such as NuSMV for model-checking, Eclipse for graphic interfaces or model engineering support, Alloy for constraint representation and SAT solving, etc. Note that, in Figure 4, some of the boxes can be naturally adapted from SUP existing elements (many perception and understanding components, program supervision, scenario recognition...) whereas others are to be developed, completely or partially (learning components, most support and conﬁguration tools).
3.4.2. Discrete Event Models of Activities
As mentioned in the previous section (3.3) we have started to specify a formal model of scenario dealing with both absolute time and logical time. Our scenario and time models as well as the platform veriﬁcation tools rely on a formal basis, namely the synchronous paradigm. To recognize scenarios, we consider activity descriptions as synchronous reactive systems and we apply general modelling methods to express scenario behaviour.
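The synchronous reactive view of scenario recognition can be illustrated by a deliberately simplified sketch: a scenario is a deterministic automaton that, at each logical instant, reads the set of events present and reacts. The state and event names below are invented for the example and do not come from the actual scenario language.

```python
# Simplified synchronous view of scenario recognition: at each logical
# instant the automaton consumes the set of present events and moves
# deterministically; reaching "recognized" means the scenario
# "enter, then sit_down" has been observed.

TRANSITIONS = {
    ("idle", "enter"): "entered",
    ("entered", "sit_down"): "recognized",
}

def run_scenario(instants):
    """instants: list of event sets, one set per synchronous reaction."""
    state = "idle"
    for events in instants:
        for ev in sorted(events):  # deterministic order within an instant
            state = TRANSITIONS.get((state, ev), state)
    return state == "recognized"

print(run_scenario([{"enter"}, set(), {"sit_down"}]))  # True
print(run_scenario([{"sit_down"}, {"enter"}]))         # False
```

Because the behaviour is a finite deterministic automaton, it is amenable to the formal analysis, simulation and proof mentioned above, which is precisely the appeal of the synchronous paradigm.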
Activity recognition systems usually raise many safety issues. From the software engineering point of view, we only consider software security. Our previous work on verification and validation has to be pursued; in particular, we need to test its scalability and to develop associated tools. Model-checking is an appealing technique since it can be automated and helps to produce code that has been formally proved. Our verification method follows a compositional approach, a well-known way to cope with scalability problems in model-checking.
Moreover, recognizing real scenarios is not a purely deterministic process. Sensor performance, precision of image analysis, and scenario descriptions may induce various kinds of uncertainty. While taking this uncertainty into account, we should still keep our model of time deterministic, modular, and formally verifiable. To formally describe probabilistic timed systems, the most popular approach involves probabilistic extensions of timed automata. New model-checking techniques can be used as verification means, but relying on model checking alone is not sufficient: model checking is a powerful tool to prove decidable properties, but introducing uncertainty may lead to infinite-state systems or even undecidable properties. Thus model-checking validation has to be complemented with non-exhaustive methods such as abstract interpretation.
3.4.3. Model-Driven Engineering for Configuration and Control of Video Surveillance Systems
Model-driven engineering techniques can support the configuration and dynamic adaptation of video surveillance systems designed with our SUP activity recognition platform. The challenge is to cope with the many (functional as well as nonfunctional) causes of variability, both in the video application specification and in the concrete SUP implementation. We have used feature models to define two models: a generic model of video surveillance applications and a model of configuration for SUP components and chains. Both of them express variability factors. Ultimately, we wish to automatically generate a SUP component assembly from an application specification, using model-to-model transformations. Our models are enriched with intra- and inter-model constraints; inter-model constraints relate elements of the application specification to component variants. Feature models are appropriate to describe variants: they are simple enough for video surveillance experts to express their requirements, yet powerful enough to be liable to static analysis. In particular, the constraints can be analysed as a SAT problem.
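The reduction of feature-model analysis to SAT can be illustrated with a toy example; here validity is checked by brute-force enumeration rather than a real SAT solver, and the features and constraints are invented for the illustration.

```python
# Toy illustration of analysing feature-model constraints as a
# satisfiability problem (a real tool would hand the clauses to a SAT
# solver instead of enumerating).
from itertools import product

FEATURES = ["camera", "night_mode", "infrared", "recording"]
CONSTRAINTS = [
    lambda c: c["camera"],                              # root feature is mandatory
    lambda c: (not c["night_mode"]) or c["infrared"],   # night_mode requires infrared
    lambda c: (not c["infrared"]) or c["camera"],       # infrared is a child of camera
]

def valid_configurations():
    """Enumerate all feature selections satisfying every constraint."""
    configs = []
    for bits in product([False, True], repeat=len(FEATURES)):
        c = dict(zip(FEATURES, bits))
        if all(rule(c) for rule in CONSTRAINTS):
            configs.append(c)
    return configs

print(len(valid_configurations()))  # 6 valid products out of 16 selections
```

Static analysis of this kind can detect dead features or an inconsistent model (zero valid configurations) before any system is generated.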
An additional challenge is to manage the possible run-time changes of implementation due to context variations (e.g., lighting conditions, changes in the reference scene). Video surveillance systems have to dynamically adapt to a changing environment. The use of models at run time is a solution. We are defining adaptation rules corresponding to the dependency constraints between specification elements in one model and software variants in the other [80].
4. Application Domains
While in our research the focus is to develop techniques, models and platforms that are generic and reusable, we also make effort in the development of real applications. The motivation is twofold. The ﬁrst is to validate the new ideas and approaches we introduce. The second is to demonstrate how to build working systems for real applications of various domains based on the techniques and tools developed. Indeed, Stars focuses on two main domains: video analytics and healthcare monitoring.
4.2. Video Analytics
Our experience in video analytics [1] (also referred to as visual surveillance) is a strong basis, ensuring both a precise view of the research topics to develop and a network of industrial partners (ranging from end-users and integrators to software editors) to provide data, objectives, evaluation and funding.
For instance, the Keeneo start-up was created in July 2005 for the industrialization and exploitation of Orion and Pulsar results in video analytics (VSIP library, which was a previous version of SUP). Keeneo has been bought by Digital Barriers in August 2011 and is now independent from Inria. However, Stars continues to maintain a close cooperation with Keeneo for impact analysis of VSIP and for exploitation of new results.
Moreover new challenges are arising from the visual surveillance community. For instance, people detection and tracking in a crowded environment are still open issues despite the high competition on these topics. Also detecting abnormal activities may require to discover rare events from very large video data bases often characterized by noise or incomplete data.
4.3. Healthcare Monitoring
We have initiated a new strategic partnership (called CobTek) with Nice hospital [81] (CHU Nice, Prof. P. Robert) to start ambitious research activities dedicated to healthcare monitoring and to assistive technologies. These new studies address the analysis of more complex spatio-temporal activities (e.g. complex interactions, long-term activities).
To achieve this objective, several topics need to be tackled, which can be summarized in two points: finer activity description and longer analysis. Finer activity description is needed, for instance, to discriminate the activities (e.g. sitting, walking, eating) of Alzheimer patients from those of healthy older people; it is essential for pre-diagnosing dementia and providing better, more specialised care. Longer analysis is required when monitoring aims at measuring the evolution of patient behavioural disorders. Setting up such long experiments with people with dementia has never been tried before, but it is necessary for real-world validation. This is one of the challenges of the European FP7 project Dem@Care, where several patient homes should be monitored over several months.
For this domain, a goal for Stars is to allow people with dementia to continue living self-sufficiently in their own homes or residential centers, away from a hospital, and to allow clinicians and caregivers to remotely provide effective care and management. For this to become possible, comprehensive monitoring of the daily life of the person with dementia is deemed necessary, since caregivers and clinicians need a comprehensive view of the person's daily activities, behavioural patterns and lifestyle, as well as changes in them indicating the progression of their condition.
The development and ultimate use of novel assistive technologies by a vulnerable user group such as individuals with dementia, and the assessment methodologies planned by Stars, are not free of ethical or even legal concerns, even though many studies have shown how Information and Communication Technologies (ICT) can be useful and well accepted by older people with or without impairments. Thus one goal of the Stars team is to design the right technologies, which provide the appropriate information to medical carers while preserving people's privacy. Moreover, Stars will pay particular attention to ethical, acceptability, legal and privacy concerns that may arise, addressing them professionally and following the established EU and national laws and regulations, especially when working outside France.
As presented in 3.1, Stars aims at designing cognitive vision systems with perceptual capabilities to efficiently monitor people's activities. As a matter of fact, vision sensors can be seen as intrusive, even if no images are acquired or transmitted (only meta-data describing activities need to be collected). Therefore new communication paradigms and other sensors (e.g. accelerometers, RFID, and new sensors to come in the future) are also envisaged to provide the most appropriate services to the observed people, while preserving their privacy. To better understand ethical issues, Stars members are already involved in several ethical organizations. For instance, F. Bremond was a member of the ODEGAM "Commission Ethique et Droit" (a local association in the Nice area for ethical issues related to older people) from 2010 to 2011, and a member of the French scientific council for the national seminar on "La maladie d'Alzheimer et les nouvelles technologies - Enjeux éthiques et questions de société" in 2011. This council has in particular proposed a chart and guidelines for conducting research with dementia patients.
To address acceptability issues, focus groups and HMI (Human Machine Interaction) experts will be consulted on the most adequate mechanisms to interact with and display information to older people.
Figure 5. Tasks of the Scene Understanding Platform (SUP).
SUP is a Scene Understanding Software Platform written in C and C++ (see Figure 5). SUP is the continuation of the VSIP platform. SUP splits the video-processing workflow into several modules, such as acquisition, segmentation, etc., up to activity recognition, to achieve the tasks (detection, classification, etc.) the platform supplies. Each module has a specific interface, and different plugins implementing these interfaces can be used at each step of the video processing. This generic architecture is designed to facilitate:
Currently, 15 plugins are available, covering the whole processing chain. Several plugins are using the Genius platform, an industrial platform based on VSIP and exploited by Keeneo. Goals of SUP are twofold:
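The interface/plugin organization described above can be sketched as follows. This is a hedged illustration only: all class, method and parameter names (`Segmentation`, `ThresholdSegmentation`, `Pipeline`, `threshold`) are hypothetical and do not reflect the actual SUP C++ API.

```python
from abc import ABC, abstractmethod

# Each processing step is declared as an interface; interchangeable plugins
# implement it, so a pipeline can be assembled from any combination of them.

class Segmentation(ABC):
    @abstractmethod
    def segment(self, frame):
        """Return a binary foreground mask for the frame."""

class ThresholdSegmentation(Segmentation):
    """A trivial plugin: pixels brighter than a threshold are foreground."""
    def __init__(self, threshold=30):
        self.threshold = threshold

    def segment(self, frame):
        return [[1 if px > self.threshold else 0 for px in row] for row in frame]

class Pipeline:
    """Holds one plugin per step; here only the segmentation step is shown."""
    def __init__(self, segmentation: Segmentation):
        self.segmentation = segmentation

    def process(self, frame):
        return self.segmentation.segment(frame)

pipeline = Pipeline(ThresholdSegmentation(threshold=100))
mask = pipeline.process([[50, 150], [200, 10]])
```

Swapping `ThresholdSegmentation` for another `Segmentation` subclass changes the algorithm without touching the pipeline, which is the point of the interface/plugin design.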
ViSEvAl is a software tool dedicated to the evaluation and visualization of video processing algorithm outputs. The evaluation of video processing algorithm results is an important step in video analysis research. In video processing, we identify four different tasks to evaluate: detection, classification and tracking of physical objects of interest, and event recognition.
The proposed evaluation tool (ViSEvAl, visualization and evaluation) respects three important properties:
• For users to easily modify or add new metrics.
The ViSEvAl tool is composed of two parts: a GUI to visualize the results of the video processing algorithms and the metric results, and an evaluation program to automatically evaluate algorithm outputs on large amounts of data. An XML format is defined for the different input files (detected objects from one or several cameras, ground truth and events). XSD files and associated classes are used to automatically check, read and write the different XML files. The design of the software is based on a system of interfaces and plugins. This architecture allows users to develop specific treatments according to their application (e.g. metrics). There are 6 interfaces:
Figure 6. GUI of the ViSEvAl software
The GUI is composed of 3 different parts:
1. The windows dedicated to result visualization (see Figure 6):
– Window 1: the video window displays the current image and information about the detected and ground-truth objects (bounding-boxes, identiﬁer, type,...).
Figure 7. The object window enables users to choose the object to display
Figure 8. The multi-view window
The evaluation program saves, in a text file, the evaluation results of all the metrics for each frame (whenever appropriate), globally for all video sequences, or for each object of the ground truth. The ViSEvAl software was tested and validated in the context of the Cofriend project through its partners (Akka, ...). The tool is also used by IMRA, Nice hospital, the Institute for Infocomm Research (Singapore), and others. Version 1.0 of the software was delivered to APP (French Program Protection Agency) in August 2010. ViSEvAl has been under the GNU Affero General Public License AGPL (http://www.gnu.org/licenses/) since July 2011. The tool is available on the web page: http://www-sop.inria.fr/teams/pulsar/EvaluationTool/ViSEvAl_Description.html
The Clem Toolkit (see Figure 9) is a set of tools devoted to designing, simulating, verifying and generating code for LE [77] programs. LE is a synchronous language supporting modular compilation. It also supports automata, possibly designed with a dedicated graphical editor.
Each LE program is first compiled into lec and lea files. Then, to generate code for different back ends, depending on their nature, we can either expand the lec code of programs in order to resolve all abstracted variables and obtain a single lec file, or keep the set of lec files where all the variables of the main program are defined. The finalization step then simplifies the final equations, and code is generated for simulation, safety proofs, hardware description or software. Hardware description (Vhdl) and software code (C) are supplied for LE programs, as well as simulation. Moreover, we also generate files to feed the NuSMV model checker in order to validate program behaviours.
6. New Results
This year Stars has proposed new algorithms related to its three main research axes: perception for activity recognition, semantic activity recognition, and software engineering for activity recognition.
6.1.1. Perception for Activity Recognition
Participants: Julien Badie, Slawomir Bak, Vasanth Bathrinarayanan, Piotr Bilinski, Bernard Boulay, François Brémond, Sorana Capalnean, Guillaume Charpiat, Duc Phu Chau, Etienne Corvée, Eben Freeman, Carolina Garate, Jihed Joober, Vaibhav Katiyar, Ratnesh Kumar, Srinidhi Mukanahallipatna, Sabine Moisan, Silviu Serban, Malik Souded, Anh Tuan Nghiem, Monique Thonnat, Soﬁa Zaidenberg.
Figure 9. The Clem Toolkit
This year Stars has extended an efficient algorithm for detecting people. We have also proposed a new algorithm for the re-identification of people through a camera network. We have realized a new algorithm for the recognition of short actions and validated its performance on several benchmark databases (e.g. ADL). We have improved a generic event recognition algorithm by handling event uncertainty at several processing levels. More precisely, the new results for perception for activity recognition concern:
6.1.2. Semantic Activity Recognition
Participants: Sorana Capalnean, Guillaume Charpiat, Cintia Corti, Carlos -Fernando Crispim Junior, Hervé Falciani, Baptiste Fosty, Qioa Ma, Firat Ozemir, Jose-Luis Patino Vilchis, Guido-Tomas Pusiol, Rim Romdhame, Bertrand Simon, Abhineshwar Tomar.
Concerning semantic activity recognition, the contributions are:
6.1.3. Software Engineering for Activity Recognition
Participants: François Brémond, Daniel Gaffé, Julien Gueytat, Baptiste Fosty, Sabine Moisan, Anh tuan Nghiem, Annie Ressouche, Jean-Paul Rigault, Leonardo Rocha, Luis-Emiliano Sanchez, Swaminathan Sankaranarayanan.
This year Stars has continued the development of the SUP platform, which is the backbone of the team's experiments for implementing the new algorithms. We have continued to improve our meta-modelling approach to support the development of video surveillance applications based on SUP; this year we have focused on an architecture for run-time adaptation and on metrics to drive dynamic architecture changes. We have also continued the development of a scenario analysis module (SAM) relying on formal methods to support activity recognition in the SUP platform. We have improved the theoretical foundations of the CLEM toolkit and rely on it to build SAM. Finally, we are improving the way we perform adaptation in the definition of a multiple-services, device-adaptive platform for scenario recognition.
The contributions for this research axis are:
6.2. Image Compression and Modelization
Participants: Guillaume Charpiat, Eben Freeman.
Recent results in statistical learning have established the best strategy to combine advice from different experts for the problem of sequential prediction of time series. The notions of prediction and compression are tightly linked, in that a good predictor can be turned into a good compressor via entropy coding (such as Huffman coding or arithmetic coding), based on the predicted probabilities of the events to come: the more predictable an event E is, the easier it is to compress, with a coding cost of − log(p(E)) with such techniques.
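The link between predictability and coding cost can be made concrete with a short sketch (the probabilities below are purely illustrative):

```python
import math

# An event predicted with probability p costs -log2(p) bits under an ideal
# entropy coder (arithmetic coding approaches this bound).

def code_length_bits(probabilities):
    """Total ideal code length, in bits, of a sequence of predicted events."""
    return sum(-math.log2(p) for p in probabilities)

# A highly predictable sequence: the model assigned probability 0.9 to each event.
predictable = [0.9] * 10
# An unpredictable one: probability 0.5 each, i.e. exactly 1 bit per event.
unpredictable = [0.5] * 10
```

Here the predictable sequence costs about 1.5 bits in total versus 10 bits for the unpredictable one, which is why a good pixel-colour predictor directly yields a good image compressor.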
The initial idea here, by Yann Ollivier (TAO team), within a collaboration with G. Charpiat and Jamal Atif (TAO team), was to adapt these results to the case of image compression, where time series are replaced with 2D series of pixel colors, and where experts are predictors of the color of a pixel given the colors of neighbors. The main difference is that there is no canonical physically-relevant 1D ordering of the pixels in an image, so that a sequential order (of the pixels to predict their colors) had to be deﬁned ﬁrst. Preliminary results with a hierarchical ordering scheme already competed with standard techniques in lossless compression (png, lossless jpeg2000).
During his internship in the Stars team, Eben Freeman developed this approach by building relevant experts able to predict a variety of image features (regions of homogeneous color, edges, noise, ...). We also considered random orderings of pixels, using kernels to express probabilities in a spatially-coherent manner. Using such expert-based models of images, we were also able to generate new images that are typical of these models and show more structure than those associated with standard compression schemes (i.e. typical images under high compression).
6.3. Background Subtraction
Participants: Vasanth Bathrinarayanan, Anh-Tuan Nghiem, Duc-Phu Chau, François Brémond.
Keywords: Gaussian Mixture Model, Shadow removal, Parameter controller, Codebook model, Context based information
6.3.1. Statistical Background Subtraction for Video Surveillance Platform
Anh-Tuan Nghiem's work on background subtraction is an extended version of Gaussian Mixture Models. The algorithm compares each pixel of the current frame to a background representation built from pixel information of previous frames. It includes shadow and highlight removal to give better results. A selective background-updating method based on feedback from object detection helps to better model the background and to remove noise and ghosts.
Figure 10 shows a sample output of the background subtraction, where blue marks foreground pixels, red marks shadow or illumination-change pixels, and a green bounding box is a foreground blob. We have also compared our algorithm with a few others, such as OpenCV's and IDIAP's background subtraction (not tuned, used with default parameters); the results are shown in Figure 11, where a green background indicates the best performance in the comparison. This evaluation is done on the PETS 2009 dataset by comparing our obtained foreground blobs to the manually annotated bounding boxes of people.
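A much-simplified, single-Gaussian variant of this family of per-pixel background models can be sketched as follows. This is a hedged illustration: the actual method maintains a mixture of Gaussians per pixel and removes shadows, and all parameter values here (`alpha`, `k`, the initial variance) are illustrative.

```python
import numpy as np

class RunningGaussianBackground:
    """One running Gaussian per pixel; a pixel far from its mean is foreground."""
    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(float)
        self.var = np.full(first_frame.shape, 15.0 ** 2)
        self.alpha = alpha  # learning rate
        self.k = k          # match threshold, in standard deviations

    def apply(self, frame):
        frame = frame.astype(float)
        d2 = (frame - self.mean) ** 2
        foreground = d2 > (self.k ** 2) * self.var
        # Selective updating: adapt the model only where the pixel matched the
        # background, so moving objects are not absorbed into it.
        bg = ~foreground
        self.mean[bg] += self.alpha * (frame - self.mean)[bg]
        self.var[bg] += self.alpha * (d2 - self.var)[bg]
        return foreground

# Learn a static background, then present a frame with a bright "object".
frames = [np.full((4, 4), 100.0) for _ in range(10)]
model = RunningGaussianBackground(frames[0])
for f in frames:
    model.apply(f)
test_frame = np.full((4, 4), 100.0)
test_frame[1:3, 1:3] = 250.0
mask = model.apply(test_frame)
```

The bright central patch is flagged as foreground while the unchanged pixels stay background, mirroring (in miniature) what Figure 10 displays.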
6.3.2. Parameter controller using Contextual features
The above method has some parameters that have to be tuned for each video, which is time consuming. The work of Chau et al. learns contextual information from the video and controls the parameters of the object tracking algorithm at run time. This approach is at a preliminary stage for the background subtraction algorithm, to automatically adapt its parameters. The parameters are learned, as described in the offline learning process block diagram (Figure 12), over several ground-truth videos and clustered into a database. The contextual features presently used include object density, occlusion, contrast, 2D area, contrast variance and 2D area variance. Figure 13 shows a sample of video chunks grouped by contextual feature similarity for a video from the Caviar dataset.
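The offline-learning idea can be sketched as follows. All feature values, cluster centers and parameter values below are hypothetical: contextual feature vectors from training videos are clustered, each cluster is associated with parameters that worked well on it, and at run time the closest cluster's parameters are reused.

```python
import numpy as np

def nearest_cluster(context, centers):
    """Index of the learned context cluster closest to the observed context."""
    dists = np.linalg.norm(centers - context, axis=1)
    return int(np.argmin(dists))

# Two learned context clusters over (object density, occlusion level, contrast);
# the values are illustrative stand-ins for the clustered training contexts.
centers = np.array([[0.2, 0.1, 0.8],   # sparse, high-contrast scenes
                    [0.9, 0.7, 0.3]])  # crowded, low-contrast scenes
params_per_cluster = [{"threshold": 30},  # parameters tuned offline per cluster
                      {"threshold": 12}]

# Contextual features extracted from the current video chunk:
observed = np.array([0.85, 0.6, 0.35])
params = params_per_cluster[nearest_cluster(observed, centers)]
```

The observed crowded, low-contrast context falls nearest the second cluster, so its tuned parameters are selected without any manual adjustment.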
The controller's preliminary results are promising, and we are experimenting and evaluating with different features to learn the parameters. The results will be published in upcoming computer vision conferences.
6.4. Fiber Based Video Segmentation
Participants: Ratnesh Kumar, Guillaume Charpiat, Monique Thonnat.
Keywords: Video Volume, Fibers, Trajectory
The aim of this work is to segment objects in videos by considering videos as 3D volumetric data (2D×time).
Figure 14 shows an input video and its corresponding partition into fibers at a particular hierarchy level. In particular, it shows 2D slices of a video volume: the bottom right corner of each figure shows the current temporal depth in the volume, while the top right shows the X-time slice and the bottom left the Y-time slice. In this 3D representation of videos, points of the static background form straight lines of homogeneous intensity over time, while points of moving objects form curved lines. By analogy with the fibers in MRI images of human brains, we call these straight and curved lines of homogeneous intensity fibers. To segment the whole video volume, we are thus interested in a dense estimation of fibers involving all pixels.
Initial fibers are built using correspondence-computing algorithms such as optical flow and descriptor matching. As these algorithms are reliable near corners and edges, we build fibers at these locations. Our subsequent goal is to partition the video in terms of the fibers built, by extending them (both spatially and temporally) to the rest of the video. To extend fibers, we compute geodesics from pixels (not belonging to the initially built fibers) to fibers. For a reliable extension, the cost of moving along a geodesic reflects the similarity between the trajectory of a pixel and that of a fiber: the cost function quantifies the colour homogeneity of a pixel trajectory along with its colour similarity w.r.t. a fiber. A pixel is then associated to the fiber for which this cost is minimal. With the above steps we obtain a partition of the video in terms of fibers, in which a trajectory is associated with each pixel. This hierarchical partition provides a mid-level representation of a video, which can be seen as a facilitator or pre-processing step towards higher-level video understanding systems, e.g. activity recognition.
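The geodesic-based extension can be sketched as a multi-source shortest-path computation over the pixel grid. This is a hedged simplification: the scalar per-pixel step costs below stand in for the trajectory/colour-similarity cost described above, and the grid, seeds and values are purely illustrative.

```python
import heapq

def extend_fibers(cost, seeds):
    """Assign each pixel the label of the fiber reached at minimum geodesic cost.

    cost:  2D grid of per-pixel step costs (stand-in for trajectory dissimilarity)
    seeds: {(row, col): fiber_id} for pixels on the initially built fibers
    """
    h, w = len(cost), len(cost[0])
    best = {p: 0.0 for p in seeds}
    label = dict(seeds)
    heap = [(0.0, p) for p in seeds]
    heapq.heapify(heap)
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > best.get((r, c), float("inf")):
            continue  # stale entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + cost[nr][nc]
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    label[(nr, nc)] = label[(r, c)]
                    heapq.heappush(heap, (nd, (nr, nc)))
    return label

# A 1x5 strip: cheap steps near fiber "A" on the left, expensive near "B".
cost = [[1, 1, 2, 4, 4]]
labels = extend_fibers(cost, {(0, 0): "A", (0, 4): "B"})
```

Pixels are absorbed by whichever fiber is cheapest to reach, giving a dense labelling from sparse initial fibers.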
6.5. Enforcement of Monotonous Shape Growth/Shrinkage in Video Segmentation
Participant: Guillaume Charpiat.
Keywords: graph cuts, video segmentation, shape growth
The segmentation of noisy videos or time series is a difficult problem, not to say an impossible or ill-posed task when the noise level is very high. While individual frames can be analysed independently, time coherence in image sequences provides a lot of information not available in a single image. Most state-of-the-art works have explored short-term temporal continuity for object segmentation in image sequences, i.e., each frame is segmented using information from one or several previous frames. It is, however, more advantageous to simultaneously segment all frames of the data set, so that the segmentation of the entire image set supports each individual segmentation.
In this work, we focus on segmenting shapes in image sequences which only grow or shrink in time, and on making use of this knowledge as a constraint to help the segmentation process. Examples of growing shapes are forest ﬁres in satellite images and organ development in medical imaging. We propose a segmentation framework based on graph cuts for the joint segmentation of a multi-dimensional image set. By minimizing an energy computed on the resulting spatio-temporal graph of the image sequence, the proposed method yields a globally optimal solution, and runs in practice in linear complexity in the total number of pixels.
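A minimal sketch of the growth constraint itself (not of the graph-cut solver): for a shape that only grows, the foreground masks of consecutive frames must be nested, which can be written as a cumulative union. In the joint graph-cut formulation this nestedness is what infinite-cost temporal edges enforce; the masks and values below are illustrative.

```python
import numpy as np

def enforce_growth(masks):
    """Make a sequence of binary masks nested in time: once foreground, always foreground."""
    out = []
    acc = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        acc = acc | m          # cumulative union over time
        out.append(acc.copy())
    return out

# Noisy per-frame segmentations of a growing shape; frame 1 "loses" pixel (0,0).
noisy = [np.array([[1, 0], [0, 0]], dtype=bool),
         np.array([[0, 1], [0, 0]], dtype=bool),
         np.array([[1, 1], [1, 0]], dtype=bool)]
smoothed = enforce_growth(noisy)
```

The flickering pixel is restored in frame 1 because the growth prior forbids a foreground pixel from reverting to background, which is exactly the temporal regularization exploited by the joint segmentation.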
Two applications are performed. First, with Yuliya Tarabalka (Ayin team), we segment multiyear sea ice ﬂoes in a set of satellite images acquired through different satellite sensors, after rigid alignment (see Figure 15). The method returns accurate melting proﬁles of sea ice, which is important for building climate models. The second application, with Bjoern Menze (ETH Zurich, also MIT and collaborator of Asclepios team), deals with the segmentation of brain tumors from longitudinal sets of multimodal MRI volumes. In this task we impose an additional inter-modal inclusion constraint for joint segmentation of different image sequences, ﬁnally also returning highly sensitive time-volume plots of tumor growth.
Figure 15. (a) Aligned satellite images captured every four days, superposed with segmentation contours computed by our approach. (b) Segmentation contours for the images in (a) obtained by applying graph cut segmentation to each image at a single time moment. Note that the segmentations in (a) are pixelwise precise, and that the white regions sometimes surrounding the boundaries are other ice blocks, agglomerating only temporarily, and thus correctly labelled. Hence the importance of enforcing time coherence.
6.6. Dynamic and Robust Object Tracking in a Single Camera View
Participants: Duc-Phu Chau, Julien Badie, François Brémond, Monique Thonnat.
Keywords: Object tracking, online parameter tuning, controller, self-adaptation and machine learning
Object tracking quality usually depends on video scene conditions (e.g. illumination, density of objects, object occlusion level). In order to overcome this limitation, we present a new control approach to adapt the object tracking process to scene condition variations. The proposed approach is composed of two tasks. The objective of the first task is to select a convenient tracker for each mobile object, between a Kanade-Lucas-Tomasi-based (KLT) tracker and a discriminative appearance-based tracker. The KLT feature tracker is used to decide whether an object is correctly detected; for badly detected objects, KLT feature tracking is performed to correct the object detection. A decision task is then performed using a Dynamic Bayesian Network (DBN) to select the best tracker among the discriminative appearance and KLT trackers.
The objective of the second task is to tune the tracker parameters online to cope with tracking context variations. The tracking context, or context, of a video sequence is defined as a set of six features: the density of mobile objects, their occlusion level, their contrast with regard to the surrounding background, their contrast variance, their 2D area and their 2D area variance. Each contextual feature is represented by a code-book model. In an offline phase, training video sequences are classified by clustering their contextual features, and each context cluster is associated with satisfactory tracking parameters. In the online control phase, once a context change is detected, the tracking parameters are tuned using the learned values. This work has been published in [35].
We have tested the proposed approach on several public datasets such as Caviar and PETS. Figure 16 illustrates the results of the object detection correction using the KLT feature tracker.
Figure 17 illustrates the tracking output for a Caviar video (on the left image) and for a PETS video (on the right image). The experimental results show that our method gets the best performance compared to some recent state of the art trackers.
Table 1 presents the tracking results for 20 videos from the Caviar dataset. The proposed approach obtains the best MT value (i.e. mostly tracked trajectories) compared to some recent state-of-the-art trackers.
Table 1. Tracking results on the Caviar dataset. MT: Mostly tracked trajectories, higher is better. PT: Partially tracked trajectories. ML: Most lost trajectories, lower is better. The best values are printed bold.
| Method | MT (%) | PT (%) | ML (%) |
| Zhang et al., CVPR 2008 | 85.7 | 10.7 | 3.6 |
| Li et al., CVPR 2009 | 84.6 | 14.0 | 1.4 |
| Kuo et al., CVPR 2010 | 84.6 | 14.7 | 0.7 |
Table 2 presents the tracking results of the proposed approach and of three recent approaches [82] for a PETS video. With the proposed approach, we obtain the best values for both the MOTA (multi-object tracking accuracy) and MOTP (multi-object tracking precision) metrics. The authors in [82] do not report tracking results with the MT, PT and ML metrics.
Table 2. Tracking results on the PETS sequence S2.L1, camera view 1, sequence time 12.34. MOTA: Multi-object tracking accuracy, higher is better. MOTP: Multi-object tracking precision, higher is better. The best values are printed bold.
| Method | MOTA | MOTP | MT (%) | PT (%) | ML (%) |
| Berclaz et al., PAMI 2011 | 0.80 | 0.58 | - | - | - |
| Shitrit et al., ICCV 2011 | 0.81 | 0.58 | - | - | - |
| Henriques et al., ICCV 2011 | 0.85 | 0.69 | - | - | - |
6.7. Optimized Cascade of Classifiers for People Detection Using Covariance Features
Participants: Malik Souded, François Brémond.
Keywords: People detection, Covariance descriptor, LogitBoost
We propose a new method to optimize a state-of-the-art approach for people detection, based on classification on Riemannian manifolds using covariance matrices in a boosting scheme. Our approach makes training and detection faster while maintaining equivalent or better performance. This optimisation is achieved by clustering negative samples before training, yielding a smaller number of cascade levels and fewer weak classifiers per level in comparison with the original approach.
Our approach is based on the work of Tuzel et al., which was improved by Yao et al. [87]. We keep the same scheme to build our people detector: train a cascade of classifiers based on covariance descriptors, using a LogitBoost training algorithm modified by Tuzel et al. to deal with Riemannian manifold metrics, and using the operators presented there. Indeed, covariance matrices do not belong to a vector space but to the Riemannian manifold of (d × d) symmetric positive definite matrices. After training, the cascade of classifiers is applied for detection.
We propose an additional step to speed up the training and detection processes: a clustering step applied to the negative training dataset before training the classifiers. This clustering is performed both in the Riemannian manifold and in the vector space of mapped covariance matrices, using the operators and metrics cited above.
The idea consists in regrouping all similar negative samples, with regard to their covariance information, into clusters of decreasing size. Each classifier of the cascade is trained on one cluster, specializing it for a given kind of covariance information; this speeds up the training step and produces shorter classifiers, which accelerates their response when applied to an image. At the same time, the specialization of each cascade classifier shortens the cascade itself, speeding up the detection (see Figure 18 and Figure 19).
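The region covariance descriptor underlying this cascade can be sketched as follows. This is a hedged illustration in the spirit of Tuzel et al.: the feature set chosen here (pixel coordinates, intensity, gradient magnitudes) is one common choice, not necessarily the exact feature vector used in this work.

```python
import numpy as np

def region_covariance(patch):
    """Summarize an image region by the covariance of per-pixel feature vectors.

    Each pixel contributes (x, y, intensity, |dI/dx|, |dI/dy|); the region is
    then described by the 5x5 covariance matrix of these vectors.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(patch.astype(float))  # gradients along rows, columns
    feats = np.stack([xs.ravel(), ys.ravel(), patch.ravel(),
                      np.abs(gx).ravel(), np.abs(gy).ravel()])
    return np.cov(feats)  # symmetric positive semi-definite

patch = np.arange(64, dtype=float).reshape(8, 8)
C = region_covariance(patch)
```

The descriptor's size depends only on the number of features, not on the region size, which is what makes it usable inside a boosted cascade over windows of varying scale.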
6.8. Learning to Match Appearances by Correlations in a Covariance Metric Space
Participants: Sławomir Bąk, Guillaume Charpiat, Etienne Corvée, François Brémond, Monique Thonnat.
Keywords: covariance matrix, re-identification, appearance matching
This work addresses the problem of appearance matching across disjoint camera views. Significant appearance changes, caused by variations in view angle, illumination and object pose, make the problem challenging. We propose to formulate appearance matching as the task of learning a model that selects the most descriptive features for a specific class of objects. Our main idea is that different regions of the object's appearance ought to be matched using different strategies to obtain a distinctive representation. Extracting region-dependent features allows us to characterize the appearance of a given object class (e.g. the class of humans) in a more efficient and informative way. Using different kinds of features to characterize different regions of an object is fundamental to our appearance matching method.
We propose to model the object appearance using a covariance descriptor, yielding rotation and illumination invariance. The covariance descriptor has already been used successfully in the literature for appearance matching. In contrast to state-of-the-art approaches, we do not define an a priori feature vector for extracting covariances; instead, we learn which features are the most descriptive and distinctive depending on their localization in the object appearance (see Figure 20). Learning is performed in a covariance metric space using an entropy-driven criterion. When characterizing a specific class of objects, we select only the essential features for this class, removing irrelevant redundancy from covariance feature vectors and ensuring low computational cost.
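Distances in a covariance metric space are typically computed with the affine-invariant (Förstner) metric on symmetric positive definite matrices; a minimal sketch, assuming this standard metric is the one intended (the matrices below are illustrative):

```python
import numpy as np

def spd_distance(A, B):
    """Geodesic distance between SPD matrices A and B on the Riemannian manifold:
    sqrt of the sum of squared logarithms of the generalized eigenvalues of (A, B).
    """
    w, V = np.linalg.eigh(B)
    B_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T       # B^{-1/2} via eigendecomposition
    lam = np.linalg.eigvalsh(B_inv_sqrt @ A @ B_inv_sqrt)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

A = np.array([[2.0, 0.0], [0.0, 2.0]])  # toy covariance descriptors
B = np.eye(2)
d = spd_distance(A, B)                  # equals sqrt(2) * ln(2) here
```

Unlike the Euclidean distance between matrix entries, this metric respects the curved geometry of covariance matrices, which is why learning and matching are performed in this space rather than in a flat vector space.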
The proposed technique has been successfully applied to the person re-identification problem, in which a human appearance has to be matched across non-overlapping cameras. We demonstrated that: (1) by using different kinds of covariance features w.r.t. the region of an object, we obtain a clear improvement in appearance matching performance; (2) our method outperforms state-of-the-art methods in the context of pedestrian recognition on publicly available datasets (i-LIDS-119, i-LIDS-MA and i-LIDS-AA); (3) using 4 × 4 covariance matrices, we significantly speed up the processing time, offering an efficient and distinctive representation of the object appearance.
6.9. Recovering Tracking Errors with Human Re-identiﬁcation
Participants: Julien Badie, Slawomir Bak, Duc-Phu Chau, François Brémond, Monique Thonnat.
Keywords: tracking error correction, re-identification
This work addresses the problem of tracking people over long time ranges, even if the target people are lost several times by the tracking algorithm. We have identified two main reasons for tracking interruption. The first concerns interruptions that can be recovered quickly, including short mis-detections and occlusions by other persons or static obstacles. The second occurs when a person is occluded or mis-detected for a long time, or when the person leaves the scene and comes back later. Our main objective is to design a framework that can track people even if their trajectory is heavily segmented and/or associated with different IDs. We call this problem the global tracking challenge (see Figure 21).
Figure 21. The global tracking challenge: correcting errors due to occlusions (ID 142 on the first frame becomes 147 on the last frame) and tracking people who leave the scene and re-enter (ID 133 on the first frame becomes 151 on the last frame).
In order to describe a person's tracklet (a segment of trajectory), we use a visual signature called the Mean Riemannian Covariance Grid and a discriminative method that emphasizes the main differences between tracklets. This step improves the reliability and accuracy of the results. By computing the distance between the visual signatures, we are able to link tracklets belonging to the same person into a tracklet cluster. Only tuples of tracklets that do not overlap each other are used as initial candidates; then, we use Mean Shift to create the clusters. We evaluated this method on several datasets (i-LIDS, Caviar, PETS 2009). We have shown that our approach performs as well as other state-of-the-art methods on Caviar and performs better on i-LIDS. On the PETS 2009 dataset, our approach performs better than a standard tracker but cannot be compared with the best state-of-the-art methods due to unadapted metrics. This approach is described in detail in two articles: one published at ICIP 2012, which focuses on computing the covariance signature and the way to discriminate it, and the other published at the PETS 2012 workshop (part of the AVSS 2012 conference), which focuses on the method to link the tracklets. This work will be integrated into a more general tracking controller that should be able to detect several kinds of detection and tracking errors and try to correct them.
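The Mean Shift grouping step can be sketched in one dimension. This is a hedged simplification: the actual method clusters multi-dimensional covariance signatures, and the positions below are illustrative stand-ins for signature distances.

```python
def mean_shift(points, bandwidth=1.0, iters=50):
    """Shift each point to the mean of its neighbours until modes stabilize."""
    modes = list(points)
    for _ in range(iters):
        modes = [sum(n) / len(n)
                 for n in ([p for p in points if abs(p - x) <= bandwidth]
                           for x in modes)]
    return modes

# Three tracklet signatures lie near one identity, two near another.
signature_positions = [0.0, 0.1, 0.2, 5.0, 5.1]
modes = mean_shift(signature_positions)
clusters = sorted(set(round(m, 3) for m in modes))
```

All points converge to one of two modes, so tracklets sharing a mode are linked into the same identity cluster.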
6.10. Human Action Recognition in Videos
Participants: Piotr Bilinski, François Brémond.
Keywords: Action Recognition, Contextual Features, Pairwise Features, Relative Tracklets, Spatio-Temporal Interest Points, Tracklets, Head Estimation
The goal of this work is to automatically recognize human actions and activities in diverse and realistic videos.
Over the last few years, the bag-of-words approach has become a popular method to represent video actions. However, it only represents a global distribution of features and thus might not be discriminative enough. In particular, the bag-of-words model does not use information about the local density of features, pairwise relations among features, relative positions of features, or the space-time order of features. Therefore, we propose three new, higher-level feature representations based on commonly extracted features (e.g. spatio-temporal interest points for the first two representations, and tracklets for the last one). Our representations are designed to capture information not taken into account by the model, and thus to overcome its limitations.
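For reference, the bag-of-words baseline discussed above can be sketched as follows (the codebook and descriptors are toy values): local descriptors are quantized against a codebook of visual words, and the video is represented by the normalized histogram of word counts, discarding all spatial and temporal relations.

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Quantize each local descriptor to its nearest visual word and histogram them."""
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)             # hard assignment to visual words
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                     # normalized word-frequency histogram

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])  # 2 toy visual words
descriptors = np.array([[0.5, 0.2], [9.5, 10.1],
                        [0.1, 0.3], [10.2, 9.8]])
h = bag_of_words(descriptors, codebook)
```

Note that any permutation of the descriptors yields the same histogram, which is exactly the loss of spatio-temporal structure that the three proposed representations aim to recover.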
In the first method, we propose new, richer contextual features that encode the spatio-temporal distribution of commonly extracted features. Our feature representation captures not only global statistics of features but also the local density of features, pairwise relations among features and the space-time order of local features. Using two benchmark datasets for human action recognition, we demonstrate that our representation enhances the discriminative power of commonly extracted features and improves action recognition performance, achieving a 96.16% recognition rate on the popular KTH action dataset and 93.33% on the challenging ADL dataset. This work has been published in .
In the second approach, we design a new representation of features encoding statistics of pairwise co-occurring local spatio-temporal features. This representation focuses on pairwise relations among the features. In particular, we introduce geometric information into the model and associate geometric relations among the features with appearance relations. Although the local density of features and the space-time order of local features are not captured, we achieve similar results on the KTH dataset (96.30% recognition rate) and an 82.05% recognition rate on the UCF-ARG dataset. An additional advantage of this method is that it reduces the time for training the model from one week on a PC cluster to one day. This work has been published in .
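The contrast between a plain bag-of-words histogram and a pairwise co-occurrence representation can be sketched as follows. The word labels, feature points and radius are invented for the example; the real method quantizes spatio-temporal interest point descriptors.

```python
import numpy as np

def bow_histogram(words, n_words):
    """Standard bag-of-words: global counts of quantized features,
    with no geometric information at all."""
    return np.bincount(words, minlength=n_words)

def pairwise_histogram(points, words, n_words, radius):
    """Counts of co-occurring word pairs whose spatio-temporal positions
    are within `radius`, adding the geometric relations the plain
    bag-of-words model discards."""
    hist = np.zeros((n_words, n_words))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if np.linalg.norm(points[i] - points[j]) < radius:
                a, b = sorted((words[i], words[j]))
                hist[a, b] += 1
    return hist
```

Two nearby features contribute one pair count, while a distant feature contributes nothing to the pairwise histogram even though it still appears in the global one.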
In the third approach, we propose a new feature representation based on point tracklets and a new head estimation algorithm. Our representation captures a global distribution of tracklets and the relative positions of tracklet points with respect to the estimated head position. Our approach has been evaluated on three datasets: KTH, ADL, and our locally collected Hospital dataset. This new dataset has been created in cooperation with the CHU Nice Hospital. It contains people performing daily living activities such as standing up, sitting down, walking, reading a magazine, etc. Sample frames with extracted tracklets from video sequences of the ADL and Hospital datasets are illustrated in Figure 22. Consistently, experiments show that our representation enhances the discriminative power of tracklet features and improves action recognition performance. This work has been accepted for publication in .
Figure 22. Sample frames with extracted tracklets from video sequences of the ADL (left column) and Hospital (right column) datasets.
6.11. Group Interaction and Group Tracking for Video-surveillance in Underground Railway Stations
Participants: Sofia Zaidenberg, Bernard Boulay, Carolina Garate, Duc-Phu Chau, Etienne Corvée, François Brémond.
Keywords: event detection, behaviour recognition, automatic video understanding, tracking
One goal of the European project VANAHEIM is the tracking of groups of people. Based on frame-to-frame mobile object tracking, we try to detect which mobile objects form a group and to follow the group throughout its lifetime. We define a group of people as two or more people being close to each other and having similar trajectories (speed and direction). The dynamics of a group can be more or less erratic: people may join or split from the group, and one or more can disappear temporarily (occlusion or disappearance from the field of view) but reappear and still be part of the group. The motion detector which detects and labels mobile objects may also fail (misdetections or wrong labels). Analysing trajectories over a temporal window allows handling this instability more robustly. We use the event-description language described in  to define events, expressed using basic group properties such as size, type of trajectory, or number and density of people, and to recognize events and behaviours such as violence or vandalism (alarming events) or a queue at the vending machine (non-alarming events).
The group tracking approach uses Mean-Shift clustering of trajectories to create groups. Two or more individuals are associated in a group if their trajectories have been clustered together by the Mean-Shift algorithm. The trajectories are given by the long-term tracker described in . Each trajectory is composed of a person's positions (x, y) on the ground plane (in 3D) over the time window, and of their speed at each frame in the time window. Positions and speeds are normalized using the minimum and maximum possible values (0 and 10 m/s for the speed, and the field of view of the camera for the position). The Mean-Shift algorithm requires a tolerance parameter, which is set to 0.1, meaning that trajectories need to be distant by less than 10% of the maximum to be grouped.
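The clustering step above can be sketched with a minimal flat-kernel mean shift. The trajectory layout (position plus speed per frame), the field-of-view bound and the toy data are invented; only the normalization scheme and the 0.1 tolerance follow the text.

```python
import numpy as np

def normalize_trajectories(trajs, fov_max, v_max=10.0):
    """trajs: array (n_people, T, 4) of (x, y, vx, vy) per frame.
    Positions are scaled by the camera field of view and speeds by
    10 m/s, as described in the text."""
    flat = trajs.reshape(len(trajs), -1).astype(float)
    scale = np.tile([fov_max, fov_max, v_max, v_max], trajs.shape[1])
    return flat / scale

def mean_shift(X, bandwidth, n_iter=50):
    """Minimal flat-kernel mean shift; returns one cluster label per sample."""
    modes = X.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            near = X[np.linalg.norm(X - modes[i], axis=1) < bandwidth]
            modes[i] = near.mean(axis=0)
    labels, centers = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return np.array(labels)
```

Two people walking close together with the same speed collapse onto one mode (one group), while a person far away keeps a mode of their own.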
Figure 23. Example of a group composed of non-similar individual trajectories.
As shown in Figure 23, people in a group might not always have similar trajectories. For this reason, a group is also created when people are very close. A group is described by its coherence, a value calculated from the average distances of group members, their speed similarity and direction similarity. The update phase of the group uses the coherence value. A member will be kept in a group as long as the group coherence is above a threshold. This way, a member can temporarily move apart (for instance to buy a ticket at the vending machine) without being separated from the group.
This work has been applied to the benchmark CAVIAR dataset for testing, using the provided ground truth for evaluation. This dataset is composed of two parts: acted scenes in the Inria hall (9 sequences of 665 frames on average) and non-acted recordings from a shopping mall corridor (7 sequences processed, of 1722 frames on average). The following scenarios have been defined using the event-description language of : fighting, split up, joining, shop enter, shop exit, browsing. These scenarios have been recognized in the videos with a high success rate (94%). The results of this evaluation and the above-described method have been published in .
The group tracking algorithm is integrated at both the Torino and Paris testing sites and runs in real time on live video streams. The global VANAHEIM system has been presented as a demonstration at the ECCV 2012 conference. A demonstration video has been compiled from the results of the group tracking on 60 sequences from the Paris subway, showing groups of interest with various behaviours (waiting, walking, lost, with children, lively).
6.12. Crowd Event Monitoring Using Texture and Motion Analysis
Participants: Vaibhav Katiyar, Jihed Joober, François Brémond.
Keywords: Crowd Event, Texture Analysis, GLCM, Optical Flow
The aim of this work is to monitor crowd events using crowd density and changes in the speed and orientation of groups of people. To reduce complexity, we use crowd density rather than individual human detection and tracking. In this study, crowd density is quantified into three classes: (1) Empty, (2) Sparse, (3) Dense. These are approximated by calculating Haralick features from the Grey Level Co-occurrence Matrix (GLCM).
We use optical flow to obtain motion information, namely the current speed and orientation of selected FAST feature points. Subsequently, we use this information to classify crowd behaviour into normal or abnormal categories, looking for sudden changes in speed or orientation heterogeneity as signs of abnormal behaviour.
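The texture side of the pipeline can be sketched as follows. The two Haralick features and the thresholds mapping them to Empty/Sparse/Dense are invented for illustration; the text does not specify which features or thresholds the system uses.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Grey Level Co-occurrence Matrix for one pixel offset, normalized
    so entries are joint probabilities of neighbouring grey levels."""
    g = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            g[img[y, x], img[y + dy, x + dx]] += 1
    return g / g.sum()

def haralick(p):
    """Two classic Haralick features: contrast (local variation) and
    energy (textural uniformity)."""
    i, j = np.indices(p.shape)
    contrast = np.sum((i - j) ** 2 * p)
    energy = np.sum(p ** 2)
    return contrast, energy

def density_class(contrast, energy, t_empty=0.5, t_dense=2.0):
    """Hypothetical rule: very uniform texture -> Empty scene; strong
    local variation -> Dense crowd; otherwise Sparse."""
    if energy > t_empty:
        return "Empty"
    return "Dense" if contrast > t_dense else "Sparse"
```

A flat image is classified Empty, a regular low-contrast pattern Sparse, and a highly varied texture Dense.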
In future work, this abnormal behaviour may be further classified into different events such as running, collecting, dispersing, and stopping/blocking.
6.13. Detecting Falling People
Participants: Etienne Corvee, Francois Bremond.
Keywords: fall, tracking, event
We have developed a fall detection algorithm based on our object detection and tracking algorithms  and on our ontology-based event detector . These algorithms extract moving object trajectories from videos and trigger alarms whenever people's activity fits the event models. Most surveillance systems use a multi-Gaussian technique  to model background scene pixels. This technique is very efficient at detecting moving objects in real time in scenes captured by a static camera, with a low level of shadows, few persons interacting in the scene, and as few illumination changes as possible. It does not analyse the content of the moving pixels but simply assigns them to foreground or background.
Many state-of-the-art algorithms exist that can recognize objects such as a human shape, a head, a face or a couch. However, these algorithms are quite time consuming, or the database used for training the system is not well adapted to our application domain. For example, people detection algorithms use databases containing thousands of image instances of standing or walking persons, taken by a camera at a certain distance from and facing the persons. In our indoor monitoring application, cameras are located on the ceiling with a high tilt angle so that most of the scene (e.g. rooms) is viewed. With such a camera configuration, the image of a person on the screen rarely corresponds to the person images in the training database. In addition, people are often occluded by the image border (the image of the full body is not available), image distortion needs to be corrected, and people often have poses that are not present in the database (e.g. a person bending or sitting).
Using our multi-Gaussian technique , after having calibrated the camera scene, a detected object is associated with a 3D width and height in two positions: standing and lying. This 3D information is checked against a 3D human model and the object is then labelled as either a standing person, a lying person, or unknown. Many 3D filtering thresholds are used; for example, the object speed should not be greater than a human's possible running speed. Second, we use an ontology-based event detector to build a hierarchy of event models of increasing complexity. We detect that a person has fallen on the floor if the object has been detected as a person on the floor, outside the bed and couch, for at least several consecutive seconds. An example of a fallen person is shown in Figure 24.
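The 3D filtering and the hierarchical fall event can be sketched as follows. The dimension thresholds, the minimum duration and the 8 fps rate are illustrative values, not the system's actual parameters.

```python
def classify_object(width3d, height3d, speed):
    """Label a detected blob from its 3D dimensions (metres) and speed (m/s).
    All thresholds here are invented for illustration."""
    MAX_RUN_SPEED = 8.0  # faster than any plausible human run -> noise
    if speed > MAX_RUN_SPEED:
        return "unknown"
    if 1.2 <= height3d <= 2.1 and 0.3 <= width3d <= 0.9:
        return "standing person"
    if 1.2 <= width3d <= 2.1 and 0.2 <= height3d <= 0.9:
        return "lying person"
    return "unknown"

def fallen(history, min_seconds=3, fps=8, in_bed=False):
    """Fall alarm: the object is labelled 'lying person' outside bed and
    couch for several consecutive seconds, mirroring the hierarchical
    event model described above."""
    needed = min_seconds * fps
    run = 0
    for label in history:
        run = run + 1 if (label == "lying person" and not in_bed) else 0
        if run >= needed:
            return True
    return False
```

A person lying on the floor for three seconds triggers the alarm; the same posture in bed, or a shorter episode, does not.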
6.14. People Detection Framework
Participants: Srinidhi Mukanahallipatna, Silviu-Tudor Serban, François Brémond.
Keywords: LBP, Adaboost, Cascades
We present a new framework for object detection called COFROD (Comprehensive Optimization Framework for Real-time Object Detection), which focuses on improving state-of-the-art accuracy while maintaining real-time detection speed. The general idea behind our work is to create an efficient environment for developing and analyzing novel or optimized approaches in terms of classification, features, usage of prior knowledge, and custom strategies for training and detection. In our approach we opt for a standard linear classifier such as Adaboost. Inspired by the integral channel feature approach, we compute variants of LBP and Haar-like features on multiple channels of the input image. Thus, we obtain a large number of computationally inexpensive features that capture substantial information. We use an extensive training technique in order to obtain an optimal classifier.
We propose a comprehensive framework for object detection with an intuitive modular design and a high emphasis on performance and flexibility. Its components are organized into parent modules, child modules and auxiliary modules. Parent modules contain several child modules and focus on a general task such as training or detection. Child modules solve more specific tasks, such as feature extraction, training or testing, and in most cases require auxiliary modules. The latter have precise intents, for instance computing a color channel transformation or a feature response.
We present two detection configurations: one relies on a single intensively trained detector and the other on a set of specialist detectors. Our baseline detector uses cascades in order to speed up the classifier. By removing most false positives at the first stages, computation time is significantly reduced. The classifier for each cascade stage is generated using the training approach.
Our contribution is a hierarchical design of specialized detectors. At the first level we use a version of the baseline detector to remove irrelevant candidates. At the second level, specialist detectors are defined. These detectors can be either independent or can use third-level detectors and cumulate their outputs. A specialist detector can solve an exact classification problem, such as detecting a sitting pose; in that case it is trained only with data relevant to that task. In some applications, a specialist detector can be trained to perform exceptionally well in a specific situation; in this case the training samples are adapted to the particularities of the testing conditions, and possibly parts of the testing sets are used for training.
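The early-rejection cascade and the two-level specialist hierarchy can be sketched as follows; the stage scores, thresholds and window attributes are invented for the example.

```python
def cascade_predict(stages, window):
    """Attentional cascade: `stages` is a list of (score_fn, threshold)
    pairs. A window is rejected as soon as the accumulated score drops
    below a stage threshold, so most negatives exit after the first
    cheap stages, which is where the speed-up comes from."""
    total = 0.0
    for score_fn, threshold in stages:
        total += score_fn(window)
        if total < threshold:
            return False  # early rejection saves computation
    return True

def hierarchical_detect(baseline, specialists, window):
    """Level 1: a cheap baseline prunes irrelevant candidates.
    Level 2: specialist detectors (e.g. a sitting-pose detector)
    vote on the surviving windows."""
    if not baseline(window):
        return False
    return any(specialist(window) for specialist in specialists)
```

A window rejected by the baseline never reaches the specialists, while a surviving window is accepted if any specialist recognizes it.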
This is a versatile system for object detection that performs well in both accuracy and speed. We present a valuable strategy for training and a hierarchy of specialized people detectors for dealing with difficult scenarios. We also propose an interesting feature channel and a method that sacrifices less detection speed. In our approach we build upon the ideas of scaling features instead of resizing images and of transferring most computations from detection to training, thus achieving real-time performance at VGA resolution.
Figure 25. Detection Results
6.15. A Model-based Framework for Activity Recognition of Older People using Multiple Sensors
Participants: Carlos-Fernando Crispim Junior, Qiao Ma, Baptiste Fosty, Cintia Corti, Véronique Joumier, Philippe Robert, Alexandra Konig, François Brémond, Monique Thonnat.
Keywords: Activity Recognition, Multi-sensor Analysis, Surveillance System, Older People, Frailty Assessment
We have been investigating a model-based activity recognition framework for the automatic detection of physical activity tests and instrumental activities of daily living (IADL, e.g., preparing coffee, making a phone call) of older people. The activities are modelled using a constraint-based approach (using spatial, temporal, and a priori information of the scene) and a generic ontology based on natural terms, which allows medical experts to easily modify the defined activity models. Activity models are organized in a hierarchical structure according to their complexity (Primitive State, Composite State, Primitive Event, and Composite Event). The framework has been tested as a system on the clinical protocol developed by the Memory Center of the Nice hospital. This clinical protocol aims at studying how ICTs (Information and Communication Technologies) can provide objective evidence of early symptoms of Alzheimer's disease (AD) and related conditions (like Mild Cognitive Impairment, MCI). The clinical protocol participants are recorded using an RGB video camera (8 fps), an RGB-D camera (Kinect, Microsoft), and an inertial sensor (MotionPod), which allows a multi-sensor evaluation of the activities of the participants in an observation room equipped with home appliances. A study of the use of multi-sensor monitoring for patient diagnosis using events annotated by experts has been performed in partnership with CHU-Nice and the SMILE team in Taiwan; it has shown the feasibility of using these sensors for patient performance evaluation and for differentiating clinical protocol groups (Alzheimer's disease and healthy participants)  and [40]. The multi-sensor evaluation has used the proposed surveillance system prototype and has been able to detect the full set of physical activities of scenario 1 of the clinical protocol (e.g., guided activities: balance test, repeated transfer test), with a true positive rate of 96.9% to 100% for a set of 38 patients (MCI: 19, Alzheimer: 9) using data of an ambient camera. An extension of the developed framework has been investigated to handle multiple sensor data in the event modeling. In this new scenario, information from the ambient camera and the inertial sensor worn on the participant's chest is used (see Figure 28). The prototype using the extended framework has been tested on the automatic detection of IADLs, and preliminary results point to an average sensitivity of 91% and an average precision of 83.5%. This evaluation has been performed on the videos of 9 participants (15 min each; healthy: 4, MCI: 5). See  for more details.
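The hierarchical, constraint-based modelling can be sketched as follows. The zone predicates and the temporal constraint are invented for the example; the real models are written in the team's ontology language, not in Python.

```python
from dataclasses import dataclass

@dataclass
class PrimitiveState:
    """Lowest level of the hierarchy: a condition on one frame's observations."""
    name: str
    predicate: callable

def detect_states(state, frames):
    """Frame intervals [start, end) where a primitive state holds."""
    intervals, start = [], None
    for t, obs in enumerate(frames):
        if state.predicate(obs) and start is None:
            start = t
        elif not state.predicate(obs) and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(frames)))
    return intervals

def before(a_intervals, b_intervals, max_gap):
    """Composite event: an interval of state A followed by an interval of
    state B within `max_gap` frames (a simple temporal constraint)."""
    return [(a, b) for a in a_intervals for b in b_intervals
            if 0 <= b[0] - a[1] <= max_gap]
```

A composite event such as "in the kitchen, then at the phone shortly after" is recognized by combining the intervals of the two primitive states.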
Future work will focus on a learning mechanism to automatically fuse events detected by a set of heterogeneous sensors, and on supporting clinicians in the task of studying differences between the activity profiles of healthy participants and of early-to-moderate-stage Alzheimer's patients.
Figure 28 (C): Trajectory information of patient activity during the experimentation.
6.16. Activity Recognition for Older People using Kinect
Participants: Baptiste Fosty, Carlos-Fernando Crispim Junior, Véronique Joumier, Philippe Robert, Alexandra Konig, François Brémond, Monique Thonnat.
Keywords: Activity Recognition, RGB-D Camera Analysis, Surveillance System, Older People, Frailty Assessment
Within the context of the Dem@Care project, we have studied the potential of the RGB-D camera (Red Green Blue + Depth) from Microsoft (Kinect) for an activity recognition system developed to automatically and objectively extract evidence of early symptoms of Alzheimer's disease (AD) and related conditions (like Mild Cognitive Impairment, MCI) for older people. This system is designed on a model-based activity recognition framework. Using a constraint-based approach with contextual and spatio-temporal information of the scene, we have developed activity models related to the physical activity part of the protocol (scenario 1, guided activities: balance test, walking test, repeated transfers between sitting and standing postures). These models are organized in a hierarchical structure according to their complexity (Primitive State, Composite State, Primitive Event, and Composite Event). This work is an adaptation of the work performed for multi-sensor analysis .
Several steps are needed to adapt the processing. For example, we had to generate new ground truth and to design new 3D zones of interest according to the Kinect point of view and referential (which differ from those of the 2D camera). Moreover, in order to improve the reliability of the results, we had to solve several issues in the processing chain. For instance, the Kinect and the detection algorithm provided by OpenNI and Nestk (free libraries) have several limitations which lead to wrong human detections. In these cases we proposed several solutions, such as filtering wrong object detections by size (see Figure 29 C) or recomputing the height of older people based on their head position when they wear black pants (infrared absorption) (see Figure 29 D).
For the experimentation, we have processed the data recorded for 30 patients. The results are shown in Figure 30. With a true positive rate of almost 97% and a precision of 94.2%, our system is able to extract most of the activities performed by patients. Relevant and objective information can then be delivered to clinicians to assess the patient's frailty. For further insight into the performance of the detection process, we also generate the results frame by frame, shown in Figure 31. The frame-level true positive rate of event detection is almost as good as the event-level rate (94.5%). Nevertheless, the frame-level precision is lower, which means that we still need to improve the detection accuracy of the beginning and the end of an event.
Future work will focus on using the human skeleton to extract finer information on the patient's activity and on processing more scenarios (semi-guided and free).
6.17. Descriptors of Depth-Camera Videos for Alzheimer Symptom Detection
Participants: Guillaume Charpiat, Sorana Capalnean, Bertrand Simon, Baptiste Fosty, Véronique Joumier.
Keywords: Kinect, action description, video analysis
In a collaboration with the CHU hospital of Nice, a dataset of videos was recorded in which elderly people are asked by doctors to perform a number of predefined exercises (such as walking, standing up and sitting down, or an equilibrium test), recorded with an RGB-D camera (Kinect). Our task is to analyze the videos and automatically detect early Alzheimer symptoms through statistical learning. Here we focus on the 3D depth sensor (no use of the RGB image) and aim at providing action descriptors that are accurate enough to be informative.
During her internship in the Stars team, Sorana Capalnean proposed descriptors relying directly on the 3D points of the scene. First, based on trajectory analysis, she proposed a way to recognize the different physical exercises. Then she proposed, for each exercise, specific descriptors aiming at providing the information asked for by doctors, such as step length, frequency and asymmetry for the walking exercise, or sitting speed and acceleration for the second exercise. Problems to deal with included the high level of noise in the 3D cloud of points given by the Kinect, as well as accurately localizing the floor.
During his internship, Bertrand Simon proposed other kinds of descriptors, based on the articulations of the human skeleton given by OpenNI. These articulations are, however, very noisy too, so a temporal pre-filtering step of the data had to be performed. Various coordinate systems were studied to reach the highest robustness. The work focused not only on descriptors but also on metrics suitable for comparing gestures (in the phase space as well as in the space of trajectories). See Figure 32 for an example.
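Such a temporal pre-filtering step can be sketched as a simple moving average over the joint positions. The window size, data layout and edge handling are illustrative choices, not those of the internship work.

```python
import numpy as np

def smooth_joints(joints, window=5):
    """Temporal moving-average filter over noisy skeleton joint positions.
    joints: array (T, n_joints, 3); edges are padded by repetition so the
    output keeps the same length."""
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(joints, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.empty_like(joints, dtype=float)
    for j in range(joints.shape[1]):
        for c in range(3):
            out[:, j, c] = np.convolve(padded[:, j, c], kernel, mode="valid")
    return out
```

A constant pose passes through unchanged, while the variance of sensor noise around it is reduced by roughly the window size.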
These descriptors are designed to be robust to camera noise and to extract the relevant information from the videos; however, their statistical analysis remains to be done in order to recognize Alzheimer symptoms during the different exercises.
6.18. Online Activity Learning from Subway Surveillance Videos
Participants: Jose-Luis Patino Vilchis, Abhineshwar Tomar, François Brémond, Monique Thonnat.
Keywords: Activity learning, clustering, trajectory analysis, subway surveillance
This work provides a new method for learning activities from subway surveillance videos. This is achieved by learning the main activity zones of the observed scene, taking as input the trajectories of detected mobile objects. This provides information on the occupancy of the different areas of the scene. In a second step, these learned zones are employed to extract people's activities by relating mobile trajectories to the learned zones; in this way, the activity of a person can be summarised as the series of zones that the person has visited. If the person remains in a single zone, the activity is classified as standing. For the analysis of the trajectory, a multiresolution analysis is set up such that a trajectory is segmented into a series of tracklets based on speed-change points, thus extracting the information of when people stop to interact with elements of the scene or with other people. Starting and ending tracklet points are fed to an incremental clustering algorithm to create an initial partition of the scene. Similarity relations between the resulting clusters are modelled employing fuzzy relations. A clustering algorithm based on the transitive closure of the fuzzy relations then builds the final structure of the scene. To allow for incremental learning and updating of activity zones (and thus of people's activities), the fuzzy relations are defined with online learning terms. The approach is tested on the extraction of activities from video recorded at an entrance hall of the Torino (Italy) underground system. Figure 33 presents the learned zones corresponding to the analyzed video.
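The transitive-closure step can be sketched with the standard max-min composition of a fuzzy relation; the similarity values and the alpha-cut are invented for the example.

```python
import numpy as np

def transitive_closure(R, max_iter=100):
    """Max-min transitive closure of a fuzzy similarity relation R:
    repeatedly take max(R, R o R) until it stops changing, where
    (R o R)[i, j] = max_k min(R[i, k], R[k, j])."""
    for _ in range(max_iter):
        comp = np.minimum(R[:, :, None], R[None, :, :]).max(axis=1)
        R2 = np.maximum(R, comp)
        if np.array_equal(R2, R):
            return R2
        R = R2
    return R

def alpha_cut_clusters(R, alpha):
    """Zones i and j end up in the same cluster when the closure of R
    gives them a similarity of at least alpha; the closure guarantees
    this cut is an equivalence relation."""
    C = transitive_closure(R) >= alpha
    labels, next_label = [-1] * len(R), 0
    for i in range(len(R)):
        if labels[i] == -1:
            for j in range(len(R)):
                if C[i, j]:
                    labels[j] = next_label
            next_label += 1
    return labels
```

Two zones that are only indirectly similar (through a third zone) are merged at a low cut level but kept apart at a stricter one.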
To test the validity of the activity extraction, a one-hour video was annotated with activities (corresponding to each trajectory) according to user-defined ground-truth zones. The comparison gave the following results: TP: 26, FP: 3, FN: 1, Precision: 0.89, Sensitivity: 0.96. This work is published in .
6.19. Automatic Activity Detection Modeling and Recognition: ADMR
Participants: Guido-Tomas Pusiol, François Brémond.
This year a new Ph.D. thesis has been defended . The main objective of the thesis is to propose a complete framework for automatic activity discovery, modeling and recognition using video information. The framework uses perceptual information (e.g. trajectories) as input and goes up to activities (semantics). The framework is divided into five main parts.
The work has also been evaluated on other types of applications, such as sleep monitoring. For example, Figure 34 displays the results of the activity discovery method over 6 hours (left to right), applied to the 3D centre of mass of a tracked sleeping person. The colored segments represent the hierarchically discovered activities (bottom-up, from finer to coarser), which match sleeping postural movements. Segments have similar colors when the postural movements are similar; for example, segment (j) is the only time the person sleeps upside down. Health professionals analysed the results and confirmed that the segments correspond to a normal sleeping cycle, where little motion is noticed at the beginning of the sleep and more motion appears when the person's sleep becomes lighter as they start waking up.
6.20. SUP Software Platform
Participants: Julien Gueytat, Baptiste Fosty, Anh tuan Nghiem, Leonardo Rocha, François Brémond.
Our team develops the Scene Understanding Platform (SUP) (see Section 5.1). This platform has been designed for analyzing video content. SUP is able to recognize simple events such as a person 'falling' or 'walking'. New analysis systems can easily be built from a set of algorithms, also called plugins; the order of these plugins and their parameters can be changed at run time and the results visualized. The platform has many other advantages, such as easy serialization to save and replay a scene, and portability to Mac, Windows or Linux. These advantages are available because we work together with the DREAM software development team. Many Inria teams are pushing together to improve a common Inria development toolkit, DTK; our SUP framework is one of the DTK-like frameworks developed at Inria. Currently, we have fully integrated the OpenCV library with SUP, and the next step is to integrate OpenNI to get the depth-map processing algorithms from PrimeSense running in SUP. Updates and presentations of our framework can be found on our team website http://team.inria.fr/stars. Detailed tips for users are given on our wiki http://wiki.inria.fr/stars and sources are hosted thanks to the new source control management tool.
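The plugin idea can be sketched as a minimal chain whose order and parameters are plain data; this is a toy illustration, not the SUP API, and the two plugins stand in for real vision algorithms.

```python
class Pipeline:
    """Toy plugin chain: each plugin is (name, function, parameters).
    The list can be reordered or re-parameterized between runs, echoing
    SUP's run-time reconfiguration of its analysis plugins."""
    def __init__(self, plugins):
        self.plugins = plugins

    def process(self, frame):
        data = {"frame": frame}
        for name, fn, params in self.plugins:
            data = fn(data, **params)
        return data

def threshold(data, level):
    """Mark 'hot' pixels above an intensity level (toy segmentation)."""
    data["mask"] = [v > level for v in data["frame"]]
    return data

def count_runs(data, min_run):
    """Count runs of consecutive hot pixels of length >= min_run
    (toy object counting)."""
    count, run = 0, 0
    for hot in list(data["mask"]) + [False]:
        if hot:
            run += 1
        else:
            if run >= min_run:
                count += 1
            run = 0
    data["count"] = count
    return data
```

Changing a plugin parameter (here, `min_run`) changes the analysis result without touching the pipeline code.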
6.21. Qualitative Evaluation of Detection and Tracking Performance
Participants: Swaminathan Sankaranarayanan, François Brémond.
We study an evaluation approach for detection and tracking systems. Given an algorithm that detects people and simultaneously tracks them, we evaluate its output by considering the complexity of the input scene. Some videos used for the evaluation are recorded using the Kinect sensor, which provides an automated ground-truth acquisition system. To analyse the algorithm performance, a number of reasons for which an algorithm might fail are investigated and quantified over the entire video sequence. A set of features called Scene Complexity measures are obtained for each input frame. The variability in the algorithm performance is modelled by these complexity measures using a polynomial regression model. From the regression statistics, we show that we can compare the performance of two different algorithms and also quantify the relative influence of the scene complexity measures on a given algorithm. This work has been published in .
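The regression idea can be sketched with a plain polynomial fit; the synthetic data, the degree and the comparison point are invented, not taken from the published evaluation.

```python
import numpy as np

def fit_performance(complexity, performance, degree=2):
    """Fit a polynomial model of per-frame performance as a function of
    one scene-complexity measure; returns coefficients and R^2."""
    coeffs = np.polyfit(complexity, performance, degree)
    pred = np.polyval(coeffs, complexity)
    ss_res = np.sum((performance - pred) ** 2)
    ss_tot = np.sum((performance - np.mean(performance)) ** 2)
    return coeffs, 1.0 - ss_res / ss_tot

def compare(algo_a, algo_b, complexity, at):
    """Compare two algorithms by their predicted performance at a given
    complexity level, as the regression statistics allow."""
    ca, _ = fit_performance(complexity, algo_a)
    cb, _ = fit_performance(complexity, algo_b)
    return np.polyval(ca, at) - np.polyval(cb, at)
```

An algorithm whose performance degrades slowly with complexity scores higher than a fragile one at high complexity, even if both start at the same level.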
6.22. Model-Driven Engineering and Video-surveillance
Participants: Sabine Moisan, Jean-Paul Rigault, Luis-Emiliano Sanchez.
Keywords: Feature Model Optimization, Software Metrics, Requirement Specification, Component-based System, Dynamic Adaptive Systems, Model-Driven Engineering, Heuristic Search, Constraint Satisfaction Problems
The domain of video surveillance (VS) offers an ideal training ground for Software Engineering studies, because of the huge variability in both the surveillance tasks and the video analysis algorithms . The various VS tasks (counting, intrusion detection, tracking, scenario recognition) have different requirements. Observation conditions, objects of interest, device configuration, etc. may vary from one application to another. On the implementation side, selecting the components themselves, assembling them, and tuning their parameters to comply with the context may lead to a multitude of variants. Moreover, the context is not fixed: it evolves dynamically and requires run-time adaptation of the component assembly.
Our work relies on Feature Models, a well-known formalism to represent variability in software systems. This year we have focused on an architecture for run time adaptation and on metrics to drive dynamic architecture changes.
6.22.1. Run Time Adaptation Architecture
The architecture of the run-time system (also used for initialization at deployment time) is based on three collaborating modules, as shown in Figure 35. A Run Time Component Manager (RTCM) cooperates with the low levels (to manage the software components and capture events) and applies configuration changes. A Configuration Adapter (CA) receives events from the RTCM and propagates them as features into the models to obtain a new configuration. The Model Manager (MM) embeds a specialized scripting language for feature models (FAMILIAR [53]1) to manage the representation of the two specialized feature models and applies constraints and model transformations on them. The Model Manager produces new component configurations (a model specialization) that it sends to the CA. In turn, the CA selects one single configuration (possibly using heuristics) and converts it into component operations to be applied by the RTCM.
This year we first finalized the interface between the Model Manager and the Configuration Adapter. On the one hand, we transform the feature models obtained from FAMILIAR into C++ representations enriched with software component information. On the other hand, we dynamically transform context change events into requests to FAMILIAR.
Second, we searched for a suitable technology for handling components in the Run Time Component Manager. OSGi is an adequate de facto standard but it is mainly available in the Java world; however, we could find a C++ implementation complete enough for our needs (SOF, Service Oriented Framework ). SOF still has to be extended to fit the needs of our end users, who are the video system developers. Thus, we are currently building a multi-threaded service layer on top of SOF, easy to use and hiding most of the nitty-gritty technical details of thread programming and SOF component manipulation. This layer provides end users with a set of simple patterns and allows them to concentrate only on the code of video services (such as acquisition, segmentation, tracking...).
As a feasibility study, we are building an experimental self-adaptive video system based on the aforementioned architecture. Software components are implemented with the OpenCV library. In the final system, feature models and software components continuously interact in real time, modifying the whole system in response to changes in its environment.
6.22.2. Metrics on Feature Models to Optimize Conﬁguration Adaptation at Run Time
As shown in Figure 35, the Configuration Adapter has to set up a suitable component configuration of the run-time system. For this, each time the context changes, it receives a set of valid configurations (a feature sub-model) from the Model Manager. In most cases, this set contains more than one configuration. Of course, only one configuration can be applied at a given time and the problem is to select the “best” one. Here, “best” is a trade-off between several non-functional aspects: performance, quality of service, time cost for replacing the current configuration, etc.
It is thus necessary to rank the configurations. Our approach is to define metrics suitable for comparing configurations. Then the problem comes down to the widely studied problem of feature model optimization . This problem is known to be an intractable combinatorial optimization problem in general.
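A minimal sketch of such a ranking follows, with an invented linear trade-off between quality of service, performance and switching cost (the actual work also handles non-linear objectives and searches the configuration space heuristically rather than exhaustively).

```python
def rank_configurations(configs, current, weights):
    """Score each valid configuration as a weighted trade-off between its
    quality of service, its performance, and the cost of switching from
    the current configuration (here: number of features to change)."""
    def switch_cost(cfg):
        return len(set(cfg["features"]) ^ set(current["features"]))

    def score(cfg):
        return (weights["qos"] * cfg["qos"]
                + weights["perf"] * cfg["perf"]
                - weights["switch"] * switch_cost(cfg))

    return max(configs, key=score)
```

With a small switching weight the better configuration wins despite requiring changes; with a large one, the system prefers staying close to the current configuration.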
We started with a study of the state of the art: metrics for general graphs as well as metrics specific to feature models, optimization and requirement specification on feature models, etc. We obtained a structured catalog of quality and feature model metrics. Then we selected solutions based on heuristic search algorithms using quality and feature model metrics. We thus propose several strategies and heuristics offering different properties regarding optimality of results and execution efficiency.
These strategies and heuristics have been implemented, tested, and analyzed using randomly generated feature models. We obtained empirical measures of their properties, such as completeness, optimality, time and memory efficiency, and scalability. This allows us to compare the performance of the different algorithms and heuristics, and to combine them to achieve a good trade-off between optimality and efficiency. Finally, the proposed algorithms have been integrated into the Configuration Adapter module.
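The ranking step can be illustrated with a minimal sketch (hypothetical feature names, weights, and objective; not the actual Configuration Adapter code). Each valid configuration is scored with a non-linear objective mixing the quality of service of its features and a quadratic penalty on the cost of switching away from the current configuration, and the best-scoring one is selected:

```python
def reconfiguration_cost(current, candidate):
    """Hypothetical metric: cost grows with the number of features that
    must be added or removed to reach the candidate configuration."""
    return len(current.symmetric_difference(candidate))

def score(current, candidate, quality):
    # Non-linear objective: sum of per-feature quality-of-service scores,
    # minus a quadratic penalty discouraging large reconfigurations.
    qos = sum(quality.get(f, 0.0) for f in candidate)
    return qos - 0.1 * reconfiguration_cost(current, candidate) ** 2

def select_configuration(current, valid_configs, quality):
    """Rank the valid configurations (the feature sub-model received
    from the Model Manager) and return the best one."""
    return max(valid_configs, key=lambda c: score(current, c, quality))
```

A heuristic search would explore only part of the configuration set with the same objective; exhaustive scoring, as here, is only viable for small sub-models.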
Footnote 1: FAMILIAR has been developed at the I3S laboratory by the Modalis team.
This work is original in several respects. First, we did not find any study using heuristic search algorithms to solve the feature model optimization problem: most studies apply Artificial Intelligence techniques such as CSP solvers, planning agents, or genetic algorithms. Second, we do not restrict ourselves to the optimization of linear objective functions; we also address non-linear ones, which allows us to take a broader set of criteria into account. Among the possible criteria we consider the quality of service of components, their performance, their set-up delay, the cost of their replacement, etc. Finally, we apply our metrics at run time, whereas most studies consider metrics only for static analysis of feature models.
Currently, we are still working on new variants of the search algorithms and on new heuristics relying on techniques from the domains of heuristic search and constraint satisfaction problems.
6.23. Synchronous Modelling and Activity Recognition
Participants: Annie Ressouche, Sabine Moisan, Jean-Paul Rigault, Daniel Gaffé.
6.23.1. Scenario Analysis Module (SAM)
To generate activity recognition systems, we supply a scenario analysis module (SAM) to express and recognize complex events from primitive events generated by SUP or other sensors. Within this framework, this year we focused on improving the recognition algorithm so that it can cope with large numbers of scenario instances.
The purpose of this research axis is to offer a generic tool to express and recognize activities. Genericity means that the tool should accommodate any kind of activity and be easily specialized for a particular framework. In practice, we propose a concrete language to specify activities as a set of scenarios with temporal constraints between them. This language allows domain experts to describe their own scenario models. To recognize instances of these models, we consider the activity descriptions as synchronous reactive systems and we adapt the usual techniques of the synchronous modelling approach to express scenario behaviours. This approach facilitates scenario validation and allows us to generate a recognizer for each scenario model.
In addition, we have extended SAM to address the life cycle of scenario instances. For a given scenario model there may exist several (possibly many) instances at different evolution states. These instances are created and deleted dynamically, according to the input event flow. The challenge is to manage the creation and destruction of this large set of scenario instances efficiently (in time and space), to dispatch events to the instances expecting them, and to make them evolve independently. To face this challenge, we introduced the computation of the expected events of the next step into the generation of the recognition engine. This avoids running the engine on events that are not relevant for the recognition process. Indeed, we relied on the Lustre synchronous language to express the automata semantics of scenario models as Boolean equation systems. This approach was successful and showed that a synchronous framework can be used to generate validated scenario recognition engines. This year, in order to improve efficiency (and to tackle the real-time recognition problem), we began to rely on the CLEM toolkit (see section 6.23.2) to generate such recognition engines. The reasons are threefold: (1) CLEM is becoming a mature synchronous programming environment; (2) we can use the CLEM compiler to build our own compiler; (3) CLEM offers access to the NuSMV model checker, which is more powerful than the Lustre model checker. Moreover, thanks to the CLEM compilation into Boolean equation systems, we can compute the expected events of the next instant on the fly, by propagating information related to the current instant.
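The expected-events mechanism can be sketched as follows (a deliberately simplified, hypothetical model: real scenario recognizers are compiled from Lustre/CLEM equation systems, not hand-written classes). Each instance exposes the events it awaits at the next step, so the dispatcher only wakes the instances actually concerned by an incoming event:

```python
class ScenarioInstance:
    """Hypothetical recognizer: a scenario model reduced to an ordered
    sequence of expected event names. `expected()` exposes the events
    awaited at the next step, letting the dispatcher skip this instance
    entirely for irrelevant events."""

    def __init__(self, steps):
        self.steps = steps   # ordered list of expected event names
        self.pos = 0

    def expected(self):
        return {self.steps[self.pos]} if self.pos < len(self.steps) else set()

    def step(self, event):
        if event in self.expected():
            self.pos += 1
        return self.recognized()

    def recognized(self):
        return self.pos == len(self.steps)

def dispatch(instances, event):
    """Run only the instances that actually expect this event."""
    fired = [i for i in instances if event in i.expected()]
    for inst in fired:
        inst.step(event)
    return fired
```

With many live instances, this filtering is what keeps the cost of each input event proportional to the number of instances expecting it, not to the total instance population.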
6.23.2. The clem Workﬂow
This research axis concerns the theoretical study of a synchronous language LE with modular compilation, and the development of a toolkit around the language (see Figure 9) to design, simulate, verify and generate code for programs. The novelty of the approach is the ability to manage both modularity and causality. This year, we mainly worked on theoretical aspects of CLEM.
First, synchronous language semantics usually characterizes the status (present or absent) of each output and local signal according to the status of the input signals. To reach our goal, we defined a semantics that translates LE programs into equation systems. This semantics accumulates knowledge about signals and is never in contradiction with previous deductions (a property called constructiveness). In such an approach, causality turns out to be a scheduling evaluation problem: we need to determine all the partial orders of equation systems. To compute them, we consider a 4-valued algebra to characterize the knowledge of signal status (unknown, present, absent, overknown). Previously, we relied on a 4-valued Boolean algebra [20], which defines the negation of unknown as overknown. The advantage of this choice is that the laws of Boolean algebras can be used to compute equation system solutions. The drawback concerns signal status evaluation, which does not correspond to the usual interpretation (one would expect not unknown = unknown and not overknown = overknown). To avoid this drawback, we studied other kinds of algebras suited to defining synchronous language semantics. We chose an algebra which is a bilattice and showed that it is well suited to our problem, a new application of general bilattice theory. This algebra is no longer a Boolean algebra, but we proved that the main laws of Boolean algebras still hold: distributivity, associativity, idempotence, etc. After compilation, signals have to be projected onto Boolean values; bilattice theory offers an isomorphism between the 4-valued statuses and pairs of Booleans.
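The pair-of-Booleans isomorphism makes the bilattice easy to illustrate (a sketch of the algebra only, not the CLEM implementation). Encoding each 4-valued status as a pair (evidence-for-presence, evidence-for-absence), bilattice negation simply swaps the two components, so not(unknown) = unknown and not(overknown) = overknown, while laws such as distributivity still hold:

```python
# 4-valued statuses encoded as pairs of Booleans (the isomorphism
# mentioned above): (evidence-for-presence, evidence-for-absence).
UNKNOWN, TRUE, FALSE, OVERKNOWN = (0, 0), (1, 0), (0, 1), (1, 1)

def neg(v):
    # Bilattice negation swaps the two evidences: it exchanges
    # true/false but leaves unknown and overknown fixed.
    p, n = v
    return (n, p)

def conj(a, b):
    # Conjunction in the truth order.
    return (a[0] & b[0], a[1] | b[1])

def disj(a, b):
    # Disjunction in the truth order.
    return (a[0] | b[0], a[1] & b[1])

def kjoin(a, b):
    # Knowledge join: accumulate evidence from both sources, never
    # contradicting a previous deduction (constructiveness).
    return (a[0] | b[0], a[1] | b[1])
```

Contradictory evidence (joining true with false) lands on overknown, the error element, which is exactly how a non-constructive program reveals itself.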
Second, the algorithm that computes partial orders relies on the computation of two dependency graphs: the upstream (resp. downstream) dependency graph captures the dependencies of each variable of the system starting from the input (resp. output) variables. Inputs (resp. outputs) have date 0, and the algorithm recursively increases the dates of the nodes in the upstream (resp. downstream) dependency graph. Hence, the algorithm determines an earliest date and a latest date for each equation system variable. Moreover, we can compute the dates of the variables of a global equation system starting from dates already computed for variables that were inputs and outputs of a sub equation system corresponding to a sub-program of the global program [2]. This way of compiling is the cornerstone of our approach. We defined two ways to compute all the valid partial orders of equation systems: either applying the critical path scheduling technique (CPM) [3], or applying fixed point theory, where the vector of earliest (resp. latest) dates is computed as the least fixed point of a monotonic increasing function. This year we proved that we can compute dates either starting from a global equation system, or considering an equation system where some variables are abstracted (i.e. they have no definition) and whose dates have already been computed. To carry out the proof, we rely on an algebraic characterization of dates; thanks to the uniqueness of least fixed points, we can deduce that the result is the same for a global equation system as for its abstraction. We are in the process of publishing this result. From an implementation point of view, we use the CPM approach for our scheduling algorithm since it is more efficient than the fixed point computation; of course both ways yield the same result, and the fixed point approach remains useful for theoretical concerns.
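The earliest-date computation on the upstream dependency graph can be sketched as follows (an illustrative CPM-style recursion assuming an acyclic, causal equation system; the variable names are invented). Inputs get date 0 and every other variable gets one more than the maximum date of the variables its equation reads:

```python
def earliest_dates(deps, inputs):
    """CPM-style scheduling on the upstream dependency graph.
    deps maps each defined variable to the variables its equation reads;
    inputs is the set of input variables (date 0). Assumes the system
    is causal, i.e. the dependency graph is acyclic."""
    dates = {}

    def date(v):
        if v in dates:
            return dates[v]
        if v in inputs:
            dates[v] = 0
        else:
            # A variable can be evaluated one step after all the
            # variables it depends on are known.
            dates[v] = 1 + max(date(d) for d in deps[v])
        return dates[v]

    for v in deps:
        date(v)
    return dates
```

The latest dates are obtained symmetrically on the downstream graph, starting from the outputs; any schedule evaluating each variable between its earliest and latest date is a valid partial order.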
6.23.3. Multiple Services for Device Adaptive Platform for Scenario Recognition
The aim of this research axis is to federate the inherent constraints of an activity recognition platform like SUP (see section 5.1) with a service-oriented middleware approach dealing with dynamic evolutions of the system infrastructure. The Rainbow team (Nice-Sophia Antipolis University) proposes a component-based adaptive middleware (WComp [84]) to dynamically adapt and recompose assemblies of components. These operations must obey the "usage contract" of components. Existing approaches do not really ensure that this usage contract is not violated during application design; only a formal analysis of the component behaviour models, together with a sound modelling of the composition operation, can guarantee that the usage contract is respected.
The approach we adopted introduces, in a main assembly, a synchronous component for each sub-assembly connected to a critical component. This additional component implements a behavioural model of the critical component, and model-checking techniques are applied to verify safety properties concerning this critical component. Thus, we consider the critical component validated.
Footnote 2: these variables are local in the global equation system. Footnote 3: http://pmbook.ce.cmu.edu/10_Fundamental_Scheduling_Procedures.html
To define such a synchronous component, the user can specify one synchronous component per sub-assembly corresponding to a concern, and compose the synchronous components connected to the same critical component in order to obtain a single synchronous component. Thus, we supply a composition under constraints of synchronous components, and we proved that this operation preserves properties already verified separately on the synchronous components [78].
The main challenge of this approach is dealing with the possibly very large number of constraints a user must specify. Indeed, each synchronous monitor has to specify how it combines with the others, so the number of constraints grows combinatorially with the number of synchronous monitors and inputs of the critical component. To tackle this problem, we replace the explicit description of constraints with a generic specification of them in the critical component, together with a way to express these generic constraints. Each synchronous component then has a synchronous controller, which is the projection of the generic constraints onto its output set. The global synchronous component is the synchronous parallel composition of all the basic components and their synchronous controllers. Moreover, thanks to the features of synchronous parallel composition, our property preservation result still holds.
7. Partnerships and Cooperations
7.1. Regional Initiatives
7.2. National Initiatives
Program: ANR Sécurité
Project acronym: VIDEO-ID
Project title: Video Surveillance and Biometrics
Duration: February 2008 - February 2012
Coordinator: Thales Security Systems and Solutions S.A.S
Other partners: Inria; EURECOM; TELECOM and Management Sud Paris; CREDOF; RATP
See also: http://www-sop.inria.fr/pulsar/projects/videoid/
Abstract: Using video surveillance, the VIDEO-ID project aims at achieving real-time human activity detection, including the prediction of suspect or abnormal activities. This project also aims at performing identification using face and iris recognition. Thanks to such identification, a detected person can be tracked throughout a network of distant cameras, allowing the person's route and destination to be traced. Without being systematic, a logical set of identification procedures is established: event and abnormal behaviour detection, and people face recognition.
Program: ANR Tecsan
Project acronym: SWEET-HOME
Project title: Monitoring Alzheimer Patients at Nice Hospital
Duration: November 2009 - November 2012
Coordinator: CHU Nice Hospital (FR)
Other partners: Inria (FR); LCS (FR); CNRS unit UMI 2954, MICA Center in Hanoi (VN); SMILE Lab, National Cheng Kung University (TW); National Cheng Kung University Hospital (TW).
Abstract: The SWEET-HOME project aims at building an innovative framework for modeling activities of daily living (ADLs) at home. These activities can help in assessing the evolution of elderly diseases (e.g. Alzheimer's, depression, apathy) or in detecting precursors such as unbalanced walking, reduced speed and walked distance, psychomotor slowness, frequent sighing and frowning, and social withdrawal resulting in increased indoor hours.
Program: FUI
Project acronym: QUASPER
Project title: QUAlification et certification des Systèmes de PERception
Duration: June 2010 - May 2012
Coordinator: THALES ThereSIS
Other partners: AFNOR; AKKA; DURAN; INRETS; Sagem Sécurité; ST Microelectronics; Thales RT; Valeo Vision SAS; CEA; CITILOG; Institut d'Optique; CIVITEC; SOPEMEA; ERTE; HGH.
See also: http://www.systematic-paris-region.org/fr/projets/quasper-rd
Abstract: The QUASPER project pursues three objectives serving companies and laboratories: (1) to encourage R&D and the design of new perception systems; (2) to develop and support the definition of European standards to evaluate the functional results of perception systems; (3) to support the qualification and certification of sensors, software and integrated perception systems. Target domains are Security, Transportation and Automotive.
7.2.3. Investments for the Future
Program: DGCIS
Project acronym: Az@GAME
Project title: un outil d'aide au diagnostic médical sur l'évolution de la maladie d'Alzheimer et les pathologies assimilées (a medical decision-support tool for tracking the evolution of Alzheimer's disease and related pathologies)
Duration: January 2012 - December 2015
Coordinator: Groupe Genious
Other partners: IDATE, Inria (Stars), CMRR (CHU Nice) and the CobTek team.
See also: http://www.azagame.fr/
Abstract: This French project aims at providing evidence of the value of serious games in designing non-pharmacological approaches to prevent behavioural disturbances in dementia patients, most particularly for the stimulation of apathy.
7.2.4. Large Scale Inria Initiative
Program: Inria
Project acronym: PAL
Project title: Personally Assisted Living
Duration: 2010 - 2014
Coordinator: COPRIN team
Other partners: the AROBAS, DEMAR, E-MOTION, PULSAR, PRIMA, MAIA, TRIO, and LAGADIC Inria teams
See also: http://www-sop.inria.fr/coprin/aen/
Abstract: The objective of this project is to create a research infrastructure that will enable experiments with technologies for improving the quality of life of persons who have suffered a loss of autonomy through age, illness or accident. In particular, the project seeks to enable the development of technologies that can provide services for elderly and fragile persons, as well as their immediate family, caregivers and social groups.
7.3. European Initiatives
7.3.1. FP7 Projects
Title: PANORAMA
Duration: April 2012 - March 2015
Coordinator: Philips Healthcare (Netherlands)
Other partners: Medisys (France), Grass Valley (Netherlands), Bosch Security Systems (Netherlands), STMicroelectronics (France), Thales Angenieux (France), CapnaDST (UK), CMOSIS (Belgium), CycloMedia (Netherlands), Q-Free (Netherlands), TU Eindhoven (Netherlands), University of Leeds (UK), University of Catania (Italy), Inria (France), ARMINES (France), IBBT (Belgium).
See also: http://www.panorama-project.eu/
Abstract: PANORAMA aims to research, develop and demonstrate generic breakthrough technologies and hardware architectures for a broad range of imaging applications. For example, object segmentation is a basic building block of many intermediate- and low-level image analysis methods. In broadcast applications, segmentation can find people's faces and optimize exposure, noise reduction and color processing for those faces; even more importantly, in a multi-camera set-up these imaging parameters can then be optimized to provide a consistent display of faces (e.g., matching colors) or other regions of interest. PANORAMA will deliver solutions for applications in medical imaging, broadcasting systems, and security and surveillance, all of which face similarly challenging issues in the real-time handling and processing of large volumes of image data. These solutions require the development of imaging sensors with higher resolutions and new pixel architectures. Furthermore, integrated high-performance computing hardware will be needed to allow for real-time image processing and system control. The related ENIAC work program domains and Grand Challenges are Health and Ageing Society - Hospital Healthcare, Communication & Digital Lifestyles - Evolution to a digital lifestyle, and Safety & Security - GC Consumers and Citizens security.
Title: Autonomous Monitoring of Underground Transportation Environment (VANAHEIM)
Type: COOPERATION (ICT)
Défi: Cognitive Systems and Robotics
Instrument: Integrated Project (IP)
Duration: February 2010 - July 2013
Coordinator: Multitel (Belgium)
Other partners: Inria Sophia-Antipolis (FR); Thales Communications (FR); IDIAP (CH); Torino GTT (Italy); Régie Autonome des Transports Parisiens RATP (France); Ludwig Boltzmann Institute for Urban Ethology (Austria); Thales Communications (Italy).
See also: http://www.vanaheim-project.eu/
Abstract: The aim of this project is to study innovative surveillance components for the autonomous monitoring of multi-sensory and networked infrastructures such as underground transportation environments.
Title: Security UPgrade for PORTs (SUPPORT)
Type: COOPERATION (SECURITY)
Instrument: IP
Duration: July 2010 - June 2014
Coordinator: BMT Group (UK)
Other partners: Inria Sophia-Antipolis (FR); Swedish Defence Research Agency (SE); Securitas (SE); Technical Research Centre of Finland (FI); MARLO (NO); INLECOM Systems (UK).
Abstract: SUPPORT addresses potential threats to passenger life and the potential for crippling economic damage arising from intentional unlawful attacks on port facilities, by engaging representative stakeholders to guide the development of next-generation solutions for upgraded preventive and remedial security capabilities in European ports. The overall benefit will be the secure and efficient operation of European ports, enabling uninterrupted flows of cargo and passengers while suppressing attacks on high-value port facilities, illegal immigration, and trafficking of drugs, weapons and illicit substances, all in line with the efforts of FRONTEX and EU member states.
Title: Dementia Ambient Care: Multi-Sensing Monitoring for Intelligent Remote Management and Decision Support (Dem@Care)
Type: COOPERATION (ICT)
Défi: Cognitive Systems and Robotics
Instrument: Collaborative Project (CP)
Duration: November 2011 - November 2015
Coordinator: Centre for Research and Technology Hellas (GR)
Other partners: Inria Sophia-Antipolis (FR); University of Bordeaux 1 (FR); Cassidian (FR); Nice Hospital (FR); LinkCareServices (FR); Lulea Tekniska Universitet (SE); Dublin City University (IE); IBM Israel (IL); Philips (NL); Vistek ISRA Vision (TR).
Abstract: The objective of Dem@Care is the development of a complete system providing personal health services to persons with dementia, as well as to medical professionals, by using a multitude of sensors for context-aware, multi-parametric monitoring of lifestyle, ambient environment, and health parameters. Multi-sensor data analysis, combined with intelligent decision-making mechanisms, will allow an accurate representation of the person's current status and will provide the appropriate feedback, both to the person and to the associated medical professionals. Multi-parametric monitoring of daily activities, lifestyle and behaviour, in combination with medical data, can provide clinicians with a comprehensive image of the person's condition and its progression without their being physically present, allowing remote care.
7.3.2. Collaborations in European Programs, except FP7
Program: ITEA 2
Project acronym: ViCoMo
Project title: Visual Context Modeling
Duration: October 2009 - October 2012
Coordinator: International Consortium (Philips, Acciona, Thales, CycloMedia, VDG Security)
Other partners: TU Eindhoven; University of Catalonia; Free University of Brussels; Inria; CEA List.
Abstract: The ViCoMo project focuses on the construction of realistic context models to improve the decision making of complex vision systems and to produce faithful and meaningful behavior. The ViCoMo goal is to find the context of events that are captured by the cameras or image sensors, and to model this context such that reliable reasoning about an event can be performed.
7.4. International Initiatives
7.4.1. Inria International Partners
7.4.1.1. Collaborations with Asia
Stars has been cooperating with the MICA Multimedia Research Center in Hanoi on semantics extraction from multimedia data. Stars also collaborates with the National Cheng Kung University in Taiwan and with I2R in Singapore.
7.4.1.2. Collaboration with the U.S.
Stars collaborates with the University of Southern California.
7.4.1.3. Collaboration with Europe
Stars collaborates with Multitel in Belgium and with Kingston University (Kingston upon Thames, UK).
7.4.2. Participation In International Programs
7.4.2.1. EIT ICT Labs
EIT ICT Labs is one of the first three Knowledge and Innovation Communities (KICs) selected by the European Institute of Innovation & Technology (EIT) to accelerate innovation in Europe. EIT is a new independent community body set up to address Europe's innovation gap. It aims to rapidly emerge as a key driver of the EU's sustainable growth and competitiveness through the stimulation of world-leading innovation. Among the partners are strong technical universities (TU Berlin, 3TU/NIRICT, Aalto University, UPMC - Université Pierre et Marie Curie, Université Paris-Sud 11, Institut Telecom, The Royal Institute of Technology), excellent research centres (DFKI, Inria, Novay, VTT, SICS) and leading companies (Deutsche Telekom Laboratories, SAP, Siemens, Philips, Nokia, Alcatel-Lucent, France Telecom, Ericsson). This project is described in detail at http://eit.ictlabs.eu.
Stars is involved in the EIT ICT Labs -Health and Wellbeing .
7.5. International Research Visitors
7.5.1. Visits of International Scientists
This year Stars has hosted 12 internships:
8.1. Scientiﬁc Animation
8.1.1. Conference Organization
• In the framework of the VANAHEIM project, Stars organized a summer school entitled "Human Activity and Vision Summer School", held at Inria in October 2012. This summer school addressed human activity and behaviour recognition, focusing mainly on the video and audio modalities. In this context, the topics addressed ranged from low-level feature extraction (background subtraction, space-time interest points, tracklets) to active learning, as well as object detection (human, body), tracking (multi-object, multi-camera, audio-visual), behaviour cue extraction (body or head pose), crowd monitoring, and supervised behaviour recognition (statistical and symbolic approaches). The summer school counted 26 outside participants, 19 Inria participants and 21 invited speakers. Most of the participants were PhD students, but master students and postdoctoral researchers were also registered.
8.1.4. Invited Talk
8.1.5. Advisory Board
• M. Thonnat participated in the evaluation of ANR Tecsan proposals, and was a member of the award committee for the best scientific project at École Polytechnique, Paris.
8.2. Teaching -Supervision -Juries
Master: François Brémond, Video Understanding Techniques, Human Activity and Vision Summer School, Sophia-Antipolis, 3h, Oct 2012, FR;
Master: Annie Ressouche, Critical Systems and Verification: Application to the WComp Platform, 10h, M2, Polytechnic School of Nice Sophia Antipolis University, FR;
Jean-Paul Rigault is Full Professor of Computer Science at Polytech'Nice (University of Nice): courses on C++ (beginners and advanced), C, System Programming, and Software Modeling.
PhD & HdR
PhD: Slawomir Bak, People Detection in Temporal Video Sequences by Defining a Generic Visual Signature of Individuals, Nice Sophia Antipolis University, 5th July 2012, François Brémond;
PhD: Duc Phu Chau, Object Tracking for Activity Recognition, Nice Sophia Antipolis University;
PhD: Guido-Tomas Pusiol, Learning Techniques for Video Understanding, Nice Sophia Antipolis University, 31st May 2012, François Brémond;
PhD in progress: Julien Badie, People tracking and video understanding, October 2011, François Brémond;
PhD in progress: Piotr Bilinski, Gesture Recognition in Videos, March 2010, François Brémond;
PhD in progress: Carolina Garate, Video Understanding for Group Behaviour Analysis, August 2011, François Brémond;
PhD in progress: Ratnesh Kumar, Fiber-based segmentation of videos for activity recognition, January 2011, Guillaume Charpiat and Monique Thonnat;
PhD in progress: Rim Romdhane, Event Recognition in Video Scenes with Uncertain Knowledge, March 2009, François Brémond and Monique Thonnat;
PhD in progress: Malik Souded, Suivi d'Individu à travers un Réseau de Caméras Vidéo (people tracking across a network of video cameras), February 2010, François Brémond;
• G. Charpiat takes part in MASTIC, a local scientific animation committee (Médiation et Animation Scientifique dans les MAthématiques et dans les Sciences et Techniques Informatiques et des Communications), and attended a media training session.
8.3.1. Press Release
9. Bibliography
Major publications by the team in recent years
 A. AVANZI, F. BRÉMOND, C. TORNIERI, M. THONNAT. Design and Assessment of an Intelligent Activity Monitoring Platform, in "EURASIP Journal on Applied Signal Processing, Special Issue on “Advances in Intelligent Vision Systems: Methods and Applications”", August 2005, vol. 2005:14, p. 2359-2374.
 H. BENHADDA, J. PATINO, E. CORVEE, F. BREMOND, M. THONNAT. Data Mining on Large Video Recordings, in "5eme Colloque Veille Stratégique Scientifique et Technologique VSST 2007", Marrakech, Morocco, 21st-25th October 2007.
 B. BOULAY, F. BREMOND, M. THONNAT. Applying 3D Human Model in a Posture Recognition System, in "Pattern Recognition Letters", 2006, vol. 27, no 15, p. 1785-1796.
 F. BRÉMOND, M. THONNAT. Issues of Representing Context Illustrated by Video-surveillance Applications, in "International Journal of Human-Computer Studies, Special Issue on Context", 1998, vol. 48, p. 375-391.
 G. CHARPIAT. Learning Shape Metrics based on Deformations and Transport, in "Proceedings of ICCV 2009 and its Workshops, Second Workshop on Non-Rigid Shape Analysis and Deformable Image Alignment (NORDIA)", Kyoto, Japan, September 2009.
 G. CHARPIAT, P. MAUREL, J.-P. PONS, R. KERIVEN, O. FAUGERAS. Generalized Gradients: Priors on Minimization Flows, in "International Journal of Computer Vision", 2007.
 N. CHLEQ, F. BRÉMOND, M. THONNAT. Advanced Video-based Surveillance Systems, Kluwer Academic Publishers, Hingham, MA, USA, November 1998, p. 108-118.
 F. CUPILLARD, F. BRÉMOND, M. THONNAT. Tracking Group of People for Video Surveillance, Video-Based Surveillance Systems, Kluwer Academic Publishers, 2002, vol. The Kluwer International Series in Computer Vision and Distributed Processing, p. 89-100.
 F. FUSIER, V. VALENTIN, F. BREMOND, M. THONNAT, M. BORG, D. THIRDE, J. FERRYMAN. Video Understanding for Complex Activity Recognition, in "Machine Vision and Applications Journal", 2007, vol. 18, p. 167-188.
 B. GEORIS, F. BREMOND, M. THONNAT. Real-Time Control of Video Surveillance Systems with Program Supervision Techniques, in "Machine Vision and Applications Journal", 2007, vol. 18, p. 189-205.
 C. LIU, P. CHUNG, Y. CHUNG, M. THONNAT. Understanding of Human Behaviors from Videos in Nursing Care Monitoring Systems, in "Journal of High Speed Networks", 2007, vol. 16, p. 91-103.
 N. MAILLOT, M. THONNAT, A. BOUCHER. Towards Ontology Based Cognitive Vision, in "Machine Vision and Applications (MVA)", December 2004, vol. 16, no 1, p. 33-40.
 V. MARTIN, J.-M. TRAVERE, F. BREMOND, V. MONCADA, G. DUNAND. Thermal Event Recognition Applied to Protection of Tokamak Plasma-Facing Components, in "IEEE Transactions on Instrumentation and Measurement", Apr 2010, vol. 59, no 5, p. 1182-1191, http://hal.inria.fr/inria-00499599.
 S. MOISAN. Knowledge Representation for Program Reuse, in "European Conference on Artiﬁcial Intelligence (ECAI)", Lyon, France, July 2002, p. 240-244.
 S. MOISAN. Une plate-forme pour une programmation par composants de systèmes à base de connaissances, Université de Nice-Sophia Antipolis, April 1998, Habilitation à diriger les recherches.
 S. MOISAN, A. RESSOUCHE, J.-P. RIGAULT. Blocks, a Component Framework with Checking Facilities for Knowledge-Based Systems, in "Informatica, Special Issue on Component Based Software Development", November 2001, vol. 25, no 4, p. 501-507.
 J. PATINO, H. BENHADDA, E. CORVEE, F. BREMOND, M. THONNAT. Video-Data Modelling and Discovery, in "4th IET International Conference on Visual Information Engineering VIE 2007", London, UK, 25th -27th July 2007.
 J. PATINO, E. CORVEE, F. BREMOND, M. THONNAT. Management of Large Video Recordings, in "2nd International Conference on Ambient Intelligence Developments AmI.d 2007", Sophia Antipolis, France, 17th -19th September 2007.
 A. RESSOUCHE, D. GAFFÉ, V. ROY. Modular Compilation of a Synchronous Language, in "Software Engineering Research, Management and Applications", R. LEE (editor), Studies in Computational Intelligence, Springer, 2008, vol. 150, p. 157-171, selected as one of the 17 best papers of SERA’08 conference.
 A. RESSOUCHE, D. GAFFÉ. Compilation Modulaire d'un Langage Synchrone, in "Revue des sciences et technologies de l'information, série Théorie et Science Informatique", June 2011, vol. 4, no 30, p. 441-471, http://hal.inria.fr/inria-00524499/en.
 M. THONNAT, S. MOISAN. What Can Program Supervision Do for Software Re-use?, in "IEE Proceedings Software Special Issue on Knowledge Modelling for Software Components Reuse", 2000, vol. 147, no 5.
 M. THONNAT. Vers une vision cognitive: mise en oeuvre de connaissances et de raisonnements pour l’analyse et l’interprétation d’images., Université de Nice-Sophia Antipolis, October 2003, Habilitation à diriger les recherches.
 M. THONNAT. Special issue on Intelligent Vision Systems, in "Computer Vision and Image Understanding", May 2010, vol. 114, no 5, p. 501-502, http://hal.inria.fr/inria-00502843.
 A. TOSHEV, F. BRÉMOND, M. THONNAT. An A priori-based Method for Frequent Composite Event Discovery in Videos, in "Proceedings of 2006 IEEE International Conference on Computer Vision Systems", New York USA, January 2006.
 V. VU, F. BRÉMOND, M. THONNAT. Temporal Constraints for Video Interpretation, in "Proc of the 15th European Conference on Artiﬁcial Intelligence", Lyon, France, 2002.
 V. VU, F. BRÉMOND, M. THONNAT. Automatic Video Interpretation: A Novel Algorithm for Temporal Scenario Recognition, in "The Eighteenth International Joint Conference on Artificial Intelligence (IJCAI'03)", 2003.
 N. ZOUBA, F. BREMOND, A. ANFOSSO, M. THONNAT, E. PASCUAL, O. GUERIN. Monitoring elderly activities at home, in "Gerontechnology", May 2010, vol. 9, no 2, http://hal.inria.fr/inria-00504703.
Publications of the year
Doctoral Dissertations and Habilitation Theses
 S. BAK. Ré-identiﬁcation de personne dans un réseau de cameras vidéo, Université de Nice Sophia-Antipolis, July 2012, http://tel.archives-ouvertes.fr/tel-00763443.
 D. P. CHAU. Suivi dynamique et robuste d’objets pour la reconnaissance d’activités, Institut National de Recherche en Informatique et en Automatique (Inria), March 2012, http://hal.inria.fr/tel-00695567.
 G.-T. PUSIOL. Discovery of human activities in video, Institut National de Recherche en Informatique et en Automatique (Inria), May 2012.
Articles in International Peer-Reviewed Journals
 C. F. CRISPIM-JUNIOR, V. JOUMIER, Y.-L. HSU, M.-C. PAI, P.-C. CHUNG, A. DECHAMPS, P. ROBERT, F. BREMOND. Alzheimer's patient activity assessment using different sensors, in "Gerontechnology", 2012, vol. 11, no 2, p. 266-267, http://hal.inria.fr/hal-00721549.
 M.-B. KAÂNICHE, F. BREMOND. Recognizing Gestures by Learning Local Motion Signatures of HOG Descriptors, in "IEEE Transactions on Pattern Analysis and Machine Intelligence", 2012, http://hal.inria.fr/hal-00696371.
International Conferences with Proceedings
 J. BADIE, S. BAK, S.-T. SERBAN, F. BREMOND. Recovering people tracking errors using enhanced covariance-based signatures, in "Fourteenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance - 2012", Beijing, China, July 2012, p. 487-493 [DOI : 10.1109/AVSS.2012.90], http://hal.inria.fr/hal-00761322.
 S. BAK, G. CHARPIAT, E. CORVEE, F. BREMOND, M. THONNAT. Learning to Match Appearances by Correlations in a Covariance Metric Space, in "12th European Conference on Computer Vision", Florence, Italy, A. FITZGIBBON, S. LAZEBNIK, P. PERONA, Y. SATO, C. SCHMID (editors), Lecture Notes in Computer Science - LNCS, Springer, October 2012, vol. 7574, p. 806-820 [DOI : 10.1007/978-3-642-33712-3_58], http://hal.inria.fr/hal-00731792.
 S. BAK, D. P. CHAU, J. BADIE, E. CORVEE, F. BREMOND, M. THONNAT. Multi-target Tracking by Discriminative Analysis on Riemann Manifold, in "ICIP - International Conference on Image Processing - 2012", Orlando, United States, IEEE Computer Society, June 2012, vol. 1, p. 1-4, http://hal.inria.fr/hal-00703633.
 P. BILINSKI, F. BREMOND. Contextual Statistics of Space-Time Ordered Features for Human Action Recognition, in "9th IEEE International Conference on Advanced Video and Signal-Based Surveillance", Beijing, China, September 2012, http://hal.inria.fr/hal-00718293.
 P. BILINSKI, F. BREMOND. Statistics of Pairwise Co-occurring Local Spatio-Temporal Features for Human Action Recognition, in "4th International Workshop on Video Event Categorization, Tagging and Retrieval (VECTaR), in conjunction with 12th European Conference on Computer Vision (ECCV)", Florence, Italy, October 2012, http://hal.inria.fr/hal-00760963.
 P. BILINSKI, E. CORVEE, S. BAK, F. BREMOND. Relative Dense Tracklets for Human Action Recognition, in "10th IEEE International Conference on Automatic Face and Gesture Recognition (FG)", Shanghai, China, 2012, to appear in April 2013.
 C. F. CRISPIM-JUNIOR, F. BREMOND, V. JOUMIER. A Multi-Sensor Approach for Activity Recognition in Older Patients, in "The Second International Conference on Ambient Computing, Applications, Services and Technologies - AMBIENT 2012", Barcelona, Spain, XPS/ThinkMind Digital Library, September 2012, in press, http://hal.inria.fr/hal-00726184.
 C. F. CRISPIM-JUNIOR, V. JOUMIER, Y.-L. HSU, P.-C. CHUNG, A. DECHAMPS, M.-C. PAI, P. ROBERT, F. BREMOND. Alzheimer's patient activity assessment using different sensors, in "ISG*ISARC 2012: 8th World Conference of the International Society for Gerontechnology in cooperation with the ISARC, International Symposium of Automation and Robotics in Construction", Eindhoven, Netherlands, J. VAN BRONSWIJK (editor), Gerontechnology 2012:11(2):63, ISG, IAARC and TU/e (Eindhoven University of Technology), 2012, p. 266-267, Best Paper Award of ISG*ISARC2012 [DOI : 10.4017/GT.2012.11.02.597.00], http://hal.inria.fr/hal-00721575.
 S. MOISAN, M. ACHER, J.-P. RIGAULT. A Feature-based Approach to System Deployment and Adaptation, in "ICSE, MISE Workshop -34th International Conference on Software Engineering, Workshop on Modelling in Software Engineering", Zurich, Switzerland, June 2012, http://hal.inria.fr/hal-00708745.
 S. MOISAN. Intelligent Monitoring of Software Components, in "ICSE RAISE Workshop -34th International Conference on Software Engineering, Workshop on Realizing AI Synergies in Software Engineering -2012", Zurich, Switzerland, June 2012, http://hal.inria.fr/hal-00708737.
 L. PATINO, F. BREMOND, M. THONNAT. On-line learning of activities from video, in "AVSS - IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance 2012", Beijing, China, 2012, p. 234-239, http://hal.inria.fr/hal-00761461.
 S. SANKARANARAYANAN, F. BREMOND, D. TAX. Qualitative Evaluation of Detection and Tracking Performance, in "9th IEEE International Conference On Advanced Video and Signal Based Surveillance (AVSS 12)", Beijing, China, September 2012, http://hal.inria.fr/hal-00763587.
 S. ZAIDENBERG, B. BOULAY, F. BREMOND. A generic framework for video understanding applied to group behavior recognition, in "Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on", IEEE, September 2012, p. 136-142 [DOI : 10.1109/AVSS.2012.1], http://hal.inria.fr/hal-00702179.
Scientific Books (or Scientific Book chapters)
 F. BRÉMOND, G. SACCO. Technologies de l'information, limiter les effets de la maladie d'Alzheimer, in "Alzheimer, éthique et société", F. GZIL, E. HIRSCH (editors), September 2012, p. 518-526.
 A. MARA, L. S. MASTELLA, M. PERRIN, M. THONNAT. Ontologies and their use in geological knowledge formalization, in "Shared Earth Modeling: Knowledge based solutions for building and managing subsurface structural models", M. PERRIN, J. RAINAUD (editors), Technip, Paris, 2012, http://hal.inria.fr/hal-00761496.
 P. VERNEY, M. THONNAT, J.-F. RAINAUD. Knowledge based approach of a data intensive problem: seismic interpretation, in "Shared Earth Modeling: Knowledge based solutions for building and managing subsurface structural models", M. PERRIN, J. RAINAUD (editors), Technip, 2012, http://hal.inria.fr/hal-00761476.
 D. GAFFÉ, A. RESSOUCHE. Algebras and Synchronous Language Semantics, Inria, November 2012, no RR-8138, 107 p., http://hal.inria.fr/hal-00752976.
 M. SOUDED, F. BRÉMOND. Optimized Cascade of Classifiers for People Detection Using Covariance Features, 2012, to appear in Proceedings of the International Conference on Computer Vision Theory and Applications, 2013.
References in notes
 M. ACHER, P. COLLET, F. FLEUREY, P. LAHIRE, S. MOISAN, J.-P. RIGAULT. Modeling Context and Dynamic Adaptations with Feature Models, in "Models@run.time Workshop", Denver, CO, USA, October 2009, http://hal.inria.fr/hal-00419990/en.
 M. ACHER, P. COLLET, P. LAHIRE, R. FRANCE. Managing Feature Models with FAMILIAR: a Demonstration of the Language and its Tool Support, in "Fifth International Workshop on Variability Modelling of Software-intensive Systems(VaMoS’11)", Namur, Belgium, VaMoS, ACM, January 2011.
 M. ACHER, P. COLLET, P. LAHIRE, S. MOISAN, J.-P. RIGAULT. Modeling Variability from Requirements to Runtime, in "16th International Conference on Engineering of Complex Computer Systems (ICECCS’11)", Las Vegas, IEEE, April 2011.
 M. ACHER, P. LAHIRE, S. MOISAN, J.-P. RIGAULT. Tackling High Variability in Video Surveillance Systems through a Model Transformation Approach, in "ICSE’2009 -MISE Workshop", Vancouver, Canada, May 2009, http://hal.inria.fr/hal-00415770/en.
 D. BENAVIDES, S. SEGURA, A. RUIZ-CORTES. Automated Analysis of Feature Models 20 Years Later: A Literature Review, in "Information Systems", September 2010, vol. 35, p. 615–636.
 J. BERCLAZ, F. FLEURET, E. TURETKEN, P. FUA. Multiple object tracking using k-shortest paths optimization, in "PAMI", 2011, vol. 33, no 9, p. 1806–1819.
 F. BREMOND, N. MAILLOT, M. THONNAT, V. VU. Ontologies For Video Events, in "Inria Research Report RR-5189", 2004.
 F. BREMOND, M. THONNAT. Tracking multiple non-rigid objects in video sequences, in "IEEE Transactions on Automatic Control", 1998, vol. 8, no 5.
 D. P. CHAU, F. BREMOND, M. THONNAT. A multi-feature tracking algorithm enabling adaptation to context variations, in "The International Conference on Imaging for Crime Detection and Prevention (ICDP)", London, Royaume-Uni, November 2011, http://hal.inria.fr/inria-00632245/en/.
 D. P. CHAU, F. BREMOND, M. THONNAT, E. CORVEE. Robust Mobile Object Tracking Based on Multiple Feature Similarity and Trajectory Filtering, in "The International Conference on Computer Vision Theory and Applications (VISAPP)", Algarve, Portugal, March 2011, This work is supported by the PACA region, The General Council of Alpes Maritimes province, France as well as The ViCoMo, Vanaheim, Video-Id, Cofriend and Support projects., http://hal.inria.fr/inria-00599734/en/.
 A. CIMATTI, E. CLARKE, E. GIUNCHIGLIA, F. GIUNCHIGLIA, M. PISTORE, M. ROVERI, R. SEBASTIANI, A. TACCHELLA. NuSMV 2: an OpenSource Tool for Symbolic Model Checking, in "Proceedings of CAV", Copenhagen, Denmark, E. BRINKSMA, K. G. LARSEN (editors), LNCS, Springer-Verlag, July 2002, no 2404.
 R. DAVID, E. MULIN, P. MALLEA, P. ROBERT. Measurement of Neuropsychiatric Symptoms in Clinical Trials Targeting Alzheimer's Disease and Related Disorders, in "Pharmaceuticals", 2010, vol. 3, p. 2387-2397.
 D. GAFFÉ, A. RESSOUCHE. The Clem Toolkit, in "Proceedings of 23rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2008)", L’Aquila, Italy, September 2008.
 M. GINSBERG. Multivalued Logics: A Uniform Approach to Inference in Artificial Intelligence, in "Computational Intelligence", 1988, vol. 4, p. 265-316.
 M. GROSAM. SOF: Service Oriented Framework, Web site, http://sof.tiddlyspot.com/.
 N. HALBWACHS. Synchronous Programming of Reactive Systems, Kluwer Academic, 1993.
 J. F. HENRIQUES, R. CASEIRO, J. BATISTA. Globally optimal solution to multi-object tracking with merged measurements, in "ICCV", 2011.
 V. HOURDIN, J.-Y. TIGLI, S. LAVIROTTE, M. RIVEILL. Context-Sensitive Authorization for Asynchronous Communications, in "4th International Conference for Internet Technology and Secured Transactions (ICITST)", London, UK, November 2009.
 C. KUO, C. HUANG, R. NEVATIA. Multi-target tracking by online learned discriminative appearance models, in "CVPR", 2010.
 C. KÄSTNER, S. APEL, S. TRUJILLO, M. KUHLEMANN, D. BATORY. Guaranteeing Syntactic Correctness for All Product Line Variants: A Language-Independent Approach, in "TOOLS (47)", 2009, p. 175-194.
 Y. LI, C. HUANG, R. NEVATIA. Learning to Associate: HybridBoosted Multi-Target Tracker for Crowded Scene, in "The International Conference on Computer Vision and Pattern Recognition (CVPR)", 2009.
 S. MOISAN, J.-P. RIGAULT, M. ACHER, P. COLLET, P. LAHIRE. Run Time Adaptation of Video-Surveillance Systems: A Software Modeling Approach, in "ICVS, 8th International Conference on Computer Vision Systems", Sophia Antipolis, France, September 2011, http://hal.inria.fr/inria-00617279/en.
 A.-T. NGHIEM, F. BRÉMOND, M. THONNAT. Controlling Background Subtraction Algorithms for Robust Object Detection, in "The 3rd International Conference on Imaging for Crime Detection and Prevention", London, United Kingdom, 3 December 2009.
 A.-T. NGHIEM. Algorithmes Adaptatifs d'Estimation du Fond pour la Détection des Objets Mobiles dans les Séquences Vidéos, Nice Sophia-Antipolis University, June 2010, http://hal.inria.fr/tel-00505881.
 X. PENNEC, P. FILLARD, N. AYACHE. A Riemannian Framework for Tensor Computing, in "International Journal of Computer Vision", 2006, vol. 66, no 1, p. 41-66.
 A. PNUELI, D. HAREL. On the Development of Reactive Systems, in "Nato Asi Series F: Computer and Systems Sciences", K. APT (editor), Springer-Verlag Berlin Heidelberg, 1985, vol. 13, p. 477-498.
 A. RESSOUCHE, D. GAFFÉ, V. ROY. Modular Compilation of a Synchronous Language, Inria, January 2008, no 6424, http://hal.inria.fr/inria-00213472.
 A. RESSOUCHE, J.-Y. TIGLI, O. CARILLO. Composition and Formal Validation in Reactive Adaptive Middleware, Inria, February 2011, no RR-7541, http://hal.inria.fr/inria-00565860/en.
 A. RESSOUCHE, J.-Y. TIGLI, O. CARRILLO. Toward Validated Composition in Component-Based Adaptive Middleware, in "SC 2011", Zurich, Switzerland, S. APEL, E. JACKSON (editors), LNCS, Springer, July 2011, vol. 6708, p. 165-180, http://hal.inria.fr/inria-00605915/en/.
 L. M. ROCHA, S. MOISAN, J.-P. RIGAULT, S. SAGAR. Girgit: A Dynamically Adaptive Vision System for Scene Understanding, in "ICVS", Sophia Antipolis, France, September 2011, http://hal.inria.fr/inria-00616642/en.
 R. ROMDHANE, E. MULIN, A. DERREUMEAUX, N. ZOUBA, J. PIANO, L. LEE, I. LEROI, P. MALLEA, R. DAVID, M. THONNAT, F. BREMOND, P. ROBERT. Automatic Video Monitoring system for assessment of Alzheimer's Disease symptoms, in "The Journal of Nutrition, Health and Aging (JNHA)", 2011, vol. JNHA-D-11-00004R1, http://hal.inria.fr/inria-00616747/en.
 H. B. SHITRIT, J. BERCLAZ, F. FLEURET, P. FUA. Tracking multiple people under global appearance constraints, in "ICCV", 2011.
 C. STAUFFER, W. E. L. GRIMSON. Adaptive background mixture models for real-time tracking, in "IEEE Computer Vision and Pattern Recognition", 1999, vol. 2, p. 246-252.
 J.-Y. TIGLI, S. LAVIROTTE, G. REY, V. HOURDIN, D. CHEUNG, E. CALLEGARI, M. RIVEILL. WComp middleware for ubiquitous computing: Aspects and composite event-based Web services, in "Annals of Telecommunications", 2009, vol. 64, no 3-4, ISSN 0003-4347 (Print) ISSN 1958-9395 (Online).
 J.-Y. TIGLI, S. LAVIROTTE, G. REY, V. HOURDIN, M. RIVEILL. Lightweight Service Oriented Architecture for Pervasive Computing, in "IJCSI International Journal of Computer Science Issues", 2009, vol. 4, no 1, ISSN (Online): 1694-0784 ISSN (Print): 1694-0814.
 O. TUZEL, F. PORIKLI, P. MEER. Human Detection via Classification on Riemannian Manifolds, in "IEEE Conference on Computer Vision and Pattern Recognition (CVPR)", 2007.
 J. YAO, J. ODOBEZ. Fast Human Detection from Videos Using Covariance Feature, in "ECCV 2008 Visual Surveillance Workshop", 2008.
 S. ZAIDENBERG, B. BOULAY, C. GARATE, D. P. CHAU, E. CORVEE, F. BREMOND. Group interaction and group tracking for video-surveillance in underground railway stations, in "International Workshop on Behaviour Analysis and Video Understanding (ICVS 2011)", Sophia Antipolis, France, September 2011, 10 p., http://hal.inria.fr/inria-00624356/en/.
 L. ZHANG, Y. LI, R. NEVATIA. Global data association for multi-object tracking using network flows, in "CVPR", 2008.