Interpretation

A State Model

The objective of this state model is to provide a set of generic states based on a formalism that enables both extension and parametrization. A state of the scene is defined by an n-ary tree which represents the way this state is computed. Four types of nodes are distinguished: object nodes, descriptor nodes, operator nodes and classifier nodes (see below for their definitions). The root node is a classifier node. The leaves of the tree are object nodes. The parent nodes of the leaves are descriptor nodes. All other intermediate nodes are operator nodes.

The minimal tree structure is reduced to 3 nodes: a classifier root node, a descriptor intermediate node and an object terminal node. The number of branches of the tree and the length of the branches are unconstrained.

$\bullet$
The objects are the objects of the scene at time t, i.e. elements of O, the set of objects $o_{i,j}$ where i is the class of the object and j its label. For instance, the object $o_{person,1}$ is a mobile object which has been recognized as a person and whose label is 1, and $o_{equipment,door}$ is an object belonging to the class equipment labeled as a door.

$\bullet$
The descriptors are functions defined from O to $R^p$ which give access to a measure of an object. For instance, the size, the position, the shape, the trajectory, the orientation or the volume are possible descriptors. This notion ensures the anchoring of the model in the numerical results of the perceptual module.

$\bullet$
The operators are functions defined from $({R}^{p_1} \times \dots \times {R}^{p_n})$ to $R^q$ which operate on the measures. Examples of operators are the distance, the norm, and the classical arithmetic or logic operators.
$\bullet$
The classifiers are functions defined from $R^p$ to S, the set of symbols: large, small, fast, slow, close, far, etc. These operators ensure the passage from numbers to symbols by associating to each symbolic value a domain of definition. This domain of definition represents the parameters of the corresponding state.

Given this model, for each image a set of generic predefined states is instantiated with the objects detected in the scene at that time. The resulting set of instantiated states provides a description of the scene at that time. Event recognition is performed by comparing this new set with those obtained at the preceding times. The states whose symbolic value has changed create new events, as illustrated by the sketch below.
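To make the formalism concrete, here is a minimal sketch in Python of how such a state tree can be evaluated bottom-up, with an object leaf, a descriptor, an operator and a classifier root. The names and data structures are hypothetical and the classifier thresholds are illustrative, not values taken from the paper.

import math

def evaluate(node):
    # Evaluate a state tree bottom-up; the classifier root returns a symbol.
    if node["type"] == "object":          # leaf: the scene object itself
        return node["object"]
    children = [evaluate(c) for c in node["children"]]
    return node["fn"](*children)          # descriptor, operator or classifier

def speed(obj):                           # descriptor: O -> R^2
    return (obj["vx_3d"], obj["vy_3d"])

def norm(v):                              # operator: R^2 -> R
    return math.hypot(v[0], v[1])

def velocity(x):                          # classifier: R -> S (illustrative thresholds)
    if x < 10.0:
        return "stopped"
    if x < 200.0:
        return "walking"
    return "running"

person_1 = {"vx_3d": 50.0, "vy_3d": 30.0}     # o_{person,1} at time t
velocity_state = {"type": "classifier", "fn": velocity, "children": [
    {"type": "operator", "fn": norm, "children": [
        {"type": "descriptor", "fn": speed, "children": [
            {"type": "object", "object": person_1}]}]}]}

print(evaluate(velocity_state))               # -> "walking"

Re-evaluating the same trees on the next image and comparing the returned symbols with the previous ones is what produces the events described further on.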

Use of the State Model


  
Figure 5: Six instances of the state model. Objects are shown in yellow, descriptors in light green, operators in dark green and classifiers in blue.
\includegraphics[width=6.5cm]{Figures/instanceEvent1.eps}


  
Figure 6: Two instances of the state model. Objects are shown in yellow, descriptors in light green, operators in dark green and classifiers in blue.
\includegraphics[width=5cm]{Figures/instanceEvent2.eps}

We have used this state model to define a first set of states (see figures 5 and 6). To do so, we have defined three classes of objects, four descriptors, four operators and eight classifiers.

The three classes of objects are person, area, and equipment. The persons are the mobile objects of the scene which have been recognized as human. The previous steps provide a vector $(px_{3D},\ py_{3D})$ representing the location of the person on the ground, a vector $(vx_{3D},\ vy_{3D})$ representing the speed vector of the person and the size h of that person. An area is a static object representing a subpart of the ground of the scene described by a polygon $\{(px_i, py_i)~\vert~i~=~1 \dots k\}$. An equipment object represents any volumetric object of the environment for which we know the polygonal basis $\{(px_i, py_i)~\vert~i~=~1 \dots k\}$ and the height h.
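A possible encoding of these three classes is sketched below with assumed attribute names; it keeps exactly the quantities listed above (ground position, speed vector and height for persons; a ground polygon for areas; a polygonal basis and a height for equipment).

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Person:                 # mobile object recognized as human
    label: int
    px_3d: float              # location on the ground
    py_3d: float
    vx_3d: float              # speed vector
    vy_3d: float
    h: float                  # size (height)

@dataclass
class Area:                   # static sub-part of the ground
    label: str
    polygon: List[Tuple[float, float]]

@dataclass
class Equipment:              # volumetric object of the environment
    label: str
    polygon: List[Tuple[float, float]]   # polygonal basis
    h: float                             # height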

We have defined 4 nodes of the descriptor type: position, size, speed and shape. $position(o_{i,j}),\ i \in \{person\}$, applied to an object of the class person, gives access to $(px_{3D},\ py_{3D})$, the location of the person. $size(o_{i,j}),\ i \in \{person,\ equipment\}$, applied to an object of the class person or equipment, returns the size h of the object. $speed(o_{i,j}),\ i \in \{person\}$, applied to an object of the class person, returns the speed vector $(vx_{3D},\ vy_{3D})$. $shape(o_{i,j}),\ i \in \{area,\ equipment\}$, applied to an object of the class area or equipment, returns the polygon $\{(px_i, py_i)~\vert~i~=~1 \dots k\}$ associated with this object.
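Under the attribute names assumed in the sketch above, the four descriptors reduce to simple accessor functions from objects to real vectors:

def position(o):   # defined for persons: returns (px_3d, py_3d)
    return (o.px_3d, o.py_3d)

def size(o):       # defined for persons and equipment: returns h
    return o.h

def speed(o):      # defined for persons: returns (vx_3d, vy_3d)
    return (o.vx_3d, o.vy_3d)

def shape(o):      # defined for areas and equipment: returns the polygon
    return o.polygon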

We have defined 4 nodes of the operator type: distance, norm, angle and constr. distance, $(R^2 \times R^2) \rightarrow R$, is a binary operator computing the Euclidean distance between two points. norm, $R^2 \rightarrow R$, is an operator computing the norm of a vector. angle, $(R^2 \times R^2) \rightarrow [0,\ 360)$, is an operator computing the angle between two vectors in degrees. constr, $(R \times R) \rightarrow R^2$, is an operator which constructs a 2D vector from its scalar components.
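These four operators can be sketched as pure functions on real vectors (the angle is normalized here to the range [0, 360) degrees):

import math

def distance(p, q):      # (R^2 x R^2) -> R, Euclidean distance between two points
    return math.hypot(p[0] - q[0], p[1] - q[1])

def norm(v):             # R^2 -> R
    return math.hypot(v[0], v[1])

def angle(u, v):         # (R^2 x R^2) -> [0, 360), in degrees
    a = math.degrees(math.atan2(v[1], v[0]) - math.atan2(u[1], u[0]))
    return a % 360.0

def constr(x, y):        # (R x R) -> R^2
    return (x, y)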

We have defined 8 nodes of the classifier type:

$\bullet$
$posture:\ R \rightarrow \{lying,\ crouching,\ standing\}$
$\bullet$
$direction:\ R \rightarrow \{towards~the~right,\ towards~the~left,\ leaving,\ arriving\}$
$\bullet$
$velocity:\ R \rightarrow \{stopped,\ walking,\ running\}$
$\bullet$
$location:\ R \rightarrow \{inside,\ outside\}$
$\bullet$
$proximity:\ R \rightarrow \{close,\ far\}$
$\bullet$
$relative~location:\ R \rightarrow \{close,\ far\}$
$\bullet$
$relative~posture:\ R^2 \rightarrow \{seated,\ any\}$
$\bullet$
$relative~walk:\ R^2 \rightarrow \{coupled,\ any\}$
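Each classifier associates a domain of definition to every symbolic value; these domains are the parameters of the corresponding state. Two of the eight classifiers are sketched below with purely illustrative thresholds and units (the paper only gives explicit values for the relative walk state discussed further on):

def posture(h):          # R -> {lying, crouching, standing}; h in cm (assumed unit)
    if h < 50.0:
        return "lying"
    if h < 120.0:
        return "crouching"
    return "standing"

def proximity(d):        # R -> {close, far}; d in cm (assumed unit and threshold)
    return "close" if d < 200.0 else "far"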

Based on these classifiers, operators, descriptors and objects we have defined 8 states: posture, direction, velocity, location, proximity, relative location, relative posture and relative walk.

 

$\bullet$
$posture(o_{person,i}) \in \{lying,\ crouching,\ standing\}$
$\bullet$
$direction(o_{person,i}) \in \{towards~the~right,\ towards~the~left,\ leaving,\ arriving\}$
$\bullet$
$velocity(o_{person,i}) \in \{stopped,\ walking,\ running\}$
$\bullet$
$location(o_{person,i},\ o_{area,j}) \in \{inside,\ outside\}$
$\bullet$
$proximity(o_{person,i},\ o_{equipment,j}) \in \{close,\ far\}$
$\bullet$
$relative~location(o_{person,i},\ o_{person,j}) \in \{close,\ far\}$ ($i \neq j$)
$\bullet$
$relative~posture(o_{person,i},\ o_{equipment,j}) \in \{seated,\ any\}$
$\bullet$
$relative~walk(o_{person,i},\ o_{person,j}) \in \{coupled,\ any\}$ ($i \neq j$)

For instance, we have defined the state relative walk$(o_{person,i},\ o_{person,j})$ (see figure 6) by measuring the angle between the speed vectors of $o_{person,i}$ and $o_{person,j}$ and the distance between $o_{person,i}$ and $o_{person,j}$. If the speed vectors have a similar orientation (an angle below 45 degrees or greater than 315 degrees) and if the distance is small (below 200 cm), then these persons are considered as having a coupled relative walk.
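Reusing the descriptor and operator sketches above, this state can be written directly from those two measures; the 45/315 degree and 200 cm thresholds are the ones quoted in the text, while the function name is hypothetical.

def relative_walk(p_i, p_j):
    a = angle(speed(p_i), speed(p_j))            # orientation difference in degrees
    d = distance(position(p_i), position(p_j))   # ground distance in cm
    similar_heading = a < 45.0 or a > 315.0
    return "coupled" if similar_heading and d < 200.0 else "any"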

Event Recognition

These states enable us to define 18 events.

Posture($o_{person,i}$) changes create the events $o_{person,i}$ falls down, $o_{person,i}$ crouches down and $o_{person,i}$ stands up.

Direction($o_{person,i}$) changes create the events $o_{person,i}$ goes right side, $o_{person,i}$ goes left side, $o_{person,i}$ goes away and $o_{person,i}$ arrives.

Velocity($o_{person,i}$) changes create the events $o_{person,i}$ stops, $o_{person,i}$ walks and $o_{person,i}$ starts running.

Location($o_{person,i}$, $o_{area,j}$) changes create the events $o_{person,i}$ leaves $o_{area,j}$ and $o_{person,i}$ enters $o_{area,j}$.

Proximity($o_{person,i}$, $o_{equipment,j}$) changes create the events $o_{person,i}$ moves close to $o_{equipment,j}$ and $o_{person,i}$ moves away from $o_{equipment,j}$.

Relative location($o_{person,i}$, $o_{person,j}$) changes create the events $o_{person,i}$ moves close to $o_{person,j}$ and $o_{person,i}$ moves away from $o_{person,j}$.

Relative posture($o_{person,i}$, $o_{equipment,j}$) changes create the event $o_{person,i}$ sits on $o_{equipment,j}$.

Relative walk($o_{person,i}$, $o_{person,j}$) changes create the event $o_{person,i}$ and $o_{person,j}$ walk together.
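As a sketch of the mechanism for one of these state families (the event names follow the list above; the data layout and function name are assumed), a state whose symbolic value differs from the value computed at the previous image creates the corresponding event:

POSTURE_EVENTS = {"lying": "falls down",
                  "crouching": "crouches down",
                  "standing": "stands up"}

def posture_events(previous, current):
    # previous/current: dict mapping a person label to its posture symbol
    events = []
    for label, value in current.items():
        if previous.get(label) is not None and previous[label] != value:
            events.append("o_person,%s %s" % (label, POSTURE_EVENTS[value]))
    return events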

   
Scenario Recognition

The final problem is to incrementally recognize predefined scenarios representing behaviors. A scenario is an interdependent set of events.

Recognizing a scenario implies recognizing all the events which compose it and verifying the constraints of the dependencies. The constraints can be temporal, spatial, logical or algebraic. A scenario can be:

$\bullet$
totally recognized, when all the events are recognized and all the constraints are verified.
$\bullet$
partially recognized, when a subset S of the events is recognized and the constraints involving only events of S are verified.
$\bullet$
not recognized, when no event is recognized. Such scenarios, as they are defined in the knowledge base, are called blank scenarios.

The principle of scenario recognition consists of two steps: as previously described, we generate, image after image, the interesting events which happened in the scene; then, with those events, we instantiate predefined scenario models. This means that scenario recognition corresponds to updating a set of partially recognized scenarios.

We will now give details of the scenario model we use. A scenario $s_{i,t}$, where i is the scenario identifier and t the current time of recognition, is composed of four parts: events, constraints, conditions, and success.
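A minimal sketch of this scenario model and of the incremental update is given below. Only the events and constraints parts are sketched (the conditions and success parts are not detailed here); the names, the constraint representation and the status labels are assumptions for illustration.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Scenario:
    identifier: str
    events: List[str]                          # event types expected by the scenario
    constraints: List[Callable[[Dict], bool]]  # temporal, spatial, logical or algebraic checks
    recognized: Dict[str, dict] = field(default_factory=dict)

    def update(self, new_events):
        # Instantiate expected events with those recognized at the current time;
        # constraints are written so that they hold vacuously when an event they
        # involve has not been recognized yet.
        for e in new_events:
            if e["type"] in self.events and e["type"] not in self.recognized:
                candidate = {**self.recognized, e["type"]: e}
                if all(c(candidate) for c in self.constraints):
                    self.recognized = candidate

    def status(self):
        if not self.recognized:
            return "blank"                     # not recognized
        if len(self.recognized) == len(self.events):
            return "totally recognized"
        return "partially recognized"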

