Interpretation

A State Model

The objective of this state model is to provide a set of generic states based on a formalism that enables both extension and parametrization. A state of the scene is defined by an n-ary tree which represents the way this state is computed. Four types of nodes are distinguished: object nodes, descriptor nodes, operator nodes and classifier nodes (see below for their definitions). The root node is a classifier node. The leaves of the tree are object nodes. The parent nodes of the leaves are descriptor nodes. All other intermediate nodes are operator nodes.

The minimal tree structure is reduced to 3 nodes: a classifier root node, a descriptor intermediate node and an object terminal node. The number of branches of the tree and the length of the branches are unconstrained.

$\bullet$
The objects are the objects of the scene at time t, i.e. elements of O, the set of objects $o_{i,j}$ where i is the class of the object and j its label. For instance, the object $o_{person,1}$ is a mobile object which has been recognized as a person and whose label is 1, and $o_{equipment,door}$ is an object belonging to the class equipment labeled as a door.

$\bullet$
The descriptors are functions defined from O to $R^p$ which give access to a measure of an object. For instance, the size, the position, the shape, the trajectory, the orientation or the volume are possible descriptors. This notion ensures the anchoring of the model in the numerical results of the perceptual module.

$\bullet$
The operators are functions defined from $({R}^{p_1} \times \dots \times {R}^{p_n})$ to $R^q$ which operate on the measures. Examples of operators are the distance, the norm, and the classical arithmetic or logic operators.
$\bullet$
The classifiers are functions defined from $R^p$ to S, the set of symbols: large, small, fast, slow, close, far, etc. These operators ensure the passage from numbers to symbols by associating to each symbolic value a domain of definition. This domain of definition represents the parameters of the corresponding state.

Given this model, for each image a set of generic predefined states is instantiated with the objects detected in the scene at that time. The resulting set of instantiated states provides a description of the scene at that time. Event recognition is performed by comparing this new set with those obtained at the preceding times. The states whose symbolic value has changed create new events, as illustrated by the sketch below.
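To make the formalism concrete, here is a minimal sketch in Python of how such a state tree can be evaluated bottom-up, with an object leaf, a descriptor, an operator and a classifier root. The names and data structures are hypothetical and the classifier thresholds are illustrative, not values taken from the paper.

import math

def evaluate(node):
    # Evaluate a state tree bottom-up; the classifier root returns a symbol.
    if node["type"] == "object":          # leaf: the scene object itself
        return node["object"]
    children = [evaluate(c) for c in node["children"]]
    return node["fn"](*children)          # descriptor, operator or classifier

def speed(obj):                           # descriptor: O -> R^2
    return (obj["vx_3d"], obj["vy_3d"])

def norm(v):                              # operator: R^2 -> R
    return math.hypot(v[0], v[1])

def velocity(x):                          # classifier: R -> S (illustrative thresholds)
    if x < 10.0:
        return "stopped"
    if x < 200.0:
        return "walking"
    return "running"

person_1 = {"vx_3d": 50.0, "vy_3d": 30.0}     # o_{person,1} at time t
velocity_state = {"type": "classifier", "fn": velocity, "children": [
    {"type": "operator", "fn": norm, "children": [
        {"type": "descriptor", "fn": speed, "children": [
            {"type": "object", "object": person_1}]}]}]}

print(evaluate(velocity_state))               # -> "walking"

Re-evaluating the same trees on the next image and comparing the returned symbols with the previous ones is what produces the events described further on.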

Use of the State Model


  
Figure 5: Six instances of the state model. Objects are shown in yellow, descriptors in light green, operators in dark green and classifiers in blue.
\includegraphics[width=6.5cm]{Figures/instanceEvent1.eps}


  
Figure 6: Two instances of the state model. Objects are shown in yellow, descriptors in light green, operators in dark green and classifiers in blue.
\includegraphics[width=5cm]{Figures/instanceEvent2.eps}

We have used this state model to define a first set of states (see figures 5 and 6). To do so, we have defined three classes of objects, four descriptors, four operators and eight classifiers.

The three classes of objects are person, area, and equipment. The persons are the mobile objects of the scene which have been recognized as human. The previous steps provide a vector $(px_{3D},\ py_{3D})$ representing the location of the person on the ground, a vector $(vx_{3D},\ vy_{3D})$ representing the speed vector of the person and the size h of that person. An area is a static object representing a subpart of the ground of the scene described by a polygon $\{(px_i, py_i)~\vert~i~=~1 \dots k\}$. An equipment object represents any volumetric object of the environment for which we know the polygonal basis $\{(px_i, py_i)~\vert~i~=~1 \dots k\}$ and the height h.
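A possible encoding of these three classes is sketched below with assumed attribute names; it keeps exactly the quantities listed above (ground position, speed vector and height for persons; a ground polygon for areas; a polygonal basis and a height for equipment).

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Person:                 # mobile object recognized as human
    label: int
    px_3d: float              # location on the ground
    py_3d: float
    vx_3d: float              # speed vector
    vy_3d: float
    h: float                  # size (height)

@dataclass
class Area:                   # static sub-part of the ground
    label: str
    polygon: List[Tuple[float, float]]

@dataclass
class Equipment:              # volumetric object of the environment
    label: str
    polygon: List[Tuple[float, float]]   # polygonal basis
    h: float                             # height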

We have defined 4 nodes of the descriptor type: position, size, speed and shape. $position(o_{i,j}),\ i \in \{person\}$, applied to an object of the class person, gives access to $(px_{3D},\ py_{3D})$, the location of the person. $size(o_{i,j}),\ i \in \{person,\ equipment\}$, applied to an object of the class person or equipment, returns the size h of the object. $speed(o_{i,j}),\ i \in \{person\}$, applied to an object of the class person, returns the speed vector $(vx_{3D},\ vy_{3D})$. $shape(o_{i,j}),\ i \in \{area,\ equipment\}$, applied to an object of the class area or equipment, returns the polygon $\{(px_i, py_i)~\vert~i~=~1 \dots k\}$ associated with this object.
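Under the attribute names assumed in the sketch above, the four descriptors reduce to simple accessor functions from objects to real vectors:

def position(o):   # defined for persons: returns (px_3d, py_3d)
    return (o.px_3d, o.py_3d)

def size(o):       # defined for persons and equipment: returns h
    return o.h

def speed(o):      # defined for persons: returns (vx_3d, vy_3d)
    return (o.vx_3d, o.vy_3d)

def shape(o):      # defined for areas and equipment: returns the polygon
    return o.polygon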

We have defined 4 nodes of the operator type: distance, norm, angle and constr. distance, $(R^2 \times R^2) \rightarrow R$, is a binary operator computing the Euclidean distance between two points. norm, $R^2 \rightarrow R$, is an operator computing the norm of a vector. angle, $(R^2 \times R^2) \rightarrow [0,\ 360)$, is an operator computing the angle between two vectors in degrees. constr, $(R \times R) \rightarrow R^2$, is an operator which constructs a 2D vector from its scalar components.
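These four operators can be sketched as pure functions on real vectors (the angle is normalized here to the range [0, 360) degrees):

import math

def distance(p, q):      # (R^2 x R^2) -> R, Euclidean distance between two points
    return math.hypot(p[0] - q[0], p[1] - q[1])

def norm(v):             # R^2 -> R
    return math.hypot(v[0], v[1])

def angle(u, v):         # (R^2 x R^2) -> [0, 360), in degrees
    a = math.degrees(math.atan2(v[1], v[0]) - math.atan2(u[1], u[0]))
    return a % 360.0

def constr(x, y):        # (R x R) -> R^2
    return (x, y)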

We have defined 8 nodes of the classifier type:

$\bullet$
$posture:\ R \rightarrow \{lying,\ crouching,\ standing\}$
$\bullet$
$direction:\ R \rightarrow \{towards~the~right,\ towards~the~left,\ leaving,\ arriving\}$
$\bullet$
$velocity:\ R \rightarrow \{stopped,\ walking,\ running\}$
$\bullet$
$location:\ R \rightarrow \{inside,\ outside\}$
$\bullet$
$proximity:\ R \rightarrow \{close,\ far\}$
$\bullet$
$relative~location:\ R \rightarrow \{close,\ far\}$
$\bullet$
$relative~posture:\ R^2 \rightarrow \{seated,\ any\}$
$\bullet$
$relative~walk:\ R^2 \rightarrow \{coupled,\ any\}$
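Each classifier associates a domain of definition to every symbolic value; these domains are the parameters of the corresponding state. Two of the eight classifiers are sketched below with purely illustrative thresholds and units (the paper only gives explicit values for the relative walk state discussed further on):

def posture(h):          # R -> {lying, crouching, standing}; h in cm (assumed unit)
    if h < 50.0:
        return "lying"
    if h < 120.0:
        return "crouching"
    return "standing"

def proximity(d):        # R -> {close, far}; d in cm (assumed unit and threshold)
    return "close" if d < 200.0 else "far"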

Based on these classifiers, operators, descriptors and objects we have defined 8 states: posture, direction, velocity, location, proximity, relative location, relative posture and relative walk.

 

$\bullet$
$posture(o_{person,i}) \in \{lying,\ crouching,\ standing\}$
$\bullet$
$direction(o_{person,i}) \in \{towards~the~right,\ towards~the~left,\ leaving,\ arriving\}$
$\bullet$
$velocity(o_{person,i}) \in \{stopped,\ walking,\ running\}$
$\bullet$
$location(o_{person,i},\ o_{area,j}) \in \{inside,\ outside\}$
$\bullet$
$proximity(o_{person,i},\ o_{equipment,j}) \in \{close,\ far\}$
$\bullet$
$relative~location(o_{person,i},\ o_{person,j}) \in \{close,\ far\}$ ($i \neq j$)
$\bullet$
$relative~posture(o_{person,i},\ o_{equipment,j}) \in \{seated,\ any\}$
$\bullet$
$relative~walk(o_{person,i},\ o_{person,j}) \in \{coupled,\ any\}$ ($i \neq j$)

For instance, we have defined the state relative walk$(o_{person,i},\ o_{person,j})$ (see figure 6) by measuring the angle between the speed vectors of $o_{person,i}$ and $o_{person,j}$ and the distance between $o_{person,i}$ and $o_{person,j}$. If the speed vectors have a similar orientation (an angle below 45 degrees or greater than 315 degrees) and if the distance is small (below 200 cm), then these persons are considered as having a coupled relative walk.
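Reusing the descriptor and operator sketches above, this state can be written directly from those two measures; the 45/315 degree and 200 cm thresholds are the ones quoted in the text, while the function name is hypothetical.

def relative_walk(p_i, p_j):
    a = angle(speed(p_i), speed(p_j))            # orientation difference in degrees
    d = distance(position(p_i), position(p_j))   # ground distance in cm
    similar_heading = a < 45.0 or a > 315.0
    return "coupled" if similar_heading and d < 200.0 else "any"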

Event Recognition

These states enable us to define 18 events.

Posture($o_{person,i}$) changes create the events $o_{person,i}$ falls down, $o_{person,i}$ crouches down and $o_{person,i}$ stands up.

Direction($o_{person,i}$) changes create the events $o_{person,i}$ goes right side, $o_{person,i}$ goes left side, $o_{person,i}$ goes away and $o_{person,i}$ arrives.

Velocity($o_{person,i}$) changes create the events $o_{person,i}$ stops, $o_{person,i}$ walks and $o_{person,i}$ starts running.

Location($o_{person,i}$, $o_{area,j}$) changes create the events $o_{person,i}$ leaves $o_{area,j}$ and $o_{person,i}$ enters $o_{area,j}$.

Proximity($o_{person,i}$, $o_{equipment,j}$) changes create the events $o_{person,i}$ moves close to $o_{equipment,j}$ and $o_{person,i}$ moves away from $o_{equipment,j}$.

Relative location($o_{person,i}$, $o_{person,j}$) changes create the events $o_{person,i}$ moves close to $o_{person,j}$ and $o_{person,i}$ moves away from $o_{person,j}$.

Relative posture($o_{person,i}$, $o_{equipment,j}$) changes create the event $o_{person,i}$ sits on $o_{equipment,j}$.

Relative walk($o_{person,i}$, $o_{person,j}$) changes create the event $o_{person,i}$ and $o_{person,j}$ walk together.
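As a sketch of the mechanism for one of these state families (the event names follow the list above; the data layout and function name are assumed), a state whose symbolic value differs from the value computed at the previous image creates the corresponding event:

POSTURE_EVENTS = {"lying": "falls down",
                  "crouching": "crouches down",
                  "standing": "stands up"}

def posture_events(previous, current):
    # previous/current: dict mapping a person label to its posture symbol
    events = []
    for label, value in current.items():
        if previous.get(label) is not None and previous[label] != value:
            events.append("o_person,%s %s" % (label, POSTURE_EVENTS[value]))
    return events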

   
Scenario Recognition

The final problem is to incrementally recognize predefined scenarios representing behaviors. A scenario is an interdependent set of events.

Recognizing a scenario implies recognizing all the events which compose it and verifying the constraints of the dependencies. The constraints can be temporal, spatial, logical or algebraic. A scenario can be:

$\bullet$
totally recognized, when all the events are recognized and all the constraints are verified.
$\bullet$
partially recognized, when a subset S of the events is recognized and the constraints involving only events of S are verified.
$\bullet$
not recognized, when no event is recognized. Such scenarios, as they are defined in the knowledge base, are called blank scenarios.

The principle of scenario recognition consists of two steps: as previously described, we generate, image after image, the interesting events which happened in the scene; then, with those events, we instantiate predefined scenario models. This means that scenario recognition corresponds to updating a set of partially recognized scenarios.

We will now give details of the scenario model we use. A scenario $s_{i,t}$, where i is the scenario identifier and t the current time of recognition, is composed of four parts: events, constraints, conditions, and success.
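A minimal sketch of this scenario model and of the incremental update is given below. Only the events and constraints parts are sketched (the conditions and success parts are not detailed here); the names, the constraint representation and the status labels are assumptions for illustration.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Scenario:
    identifier: str
    events: List[str]                          # event types expected by the scenario
    constraints: List[Callable[[Dict], bool]]  # temporal, spatial, logical or algebraic checks
    recognized: Dict[str, dict] = field(default_factory=dict)

    def update(self, new_events):
        # Instantiate expected events with those recognized at the current time;
        # constraints are written so that they hold vacuously when an event they
        # involve has not been recognized yet.
        for e in new_events:
            if e["type"] in self.events and e["type"] not in self.recognized:
                candidate = {**self.recognized, e["type"]: e}
                if all(c(candidate) for c in self.constraints):
                    self.recognized = candidate

    def status(self):
        if not self.recognized:
            return "blank"                     # not recognized
        if len(self.recognized) == len(self.events):
            return "totally recognized"
        return "partially recognized"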

