Perception

In this section we briefly present the perception component we have used. As shown in figure 1, this component is composed of four main sub-parts: motion detection, person detection, person tracking and smoothing. The goal of this module is to incrementally build a history of the persons detected in the scene. Each sub-part contains several alternative methods, one of which is selected in a configuration phase.


  
Figure 1: Abstract decomposition of the video understanding system. Data flows from bottom to top; abstraction levels decrease from left to right.
\includegraphics[width=7cm]{archi2.eps}

Motion Detection

The goal is to extract from each image a set of primitives that indicate the presence of motion. A function $f:(N \times N) \rightarrow \{ 0,\ 1\}$ is defined over the pixels of the image. The value 1 means that the pixel is mobile, 0 that it is static. Connected regions are obtained by grouping neighbouring pixels with a mobile label ($f = 1$). These regions are named blobs. Three alternative motion detection methods are defined:

$\bullet$ $I_{res}(x,\ y) = Thr(Abs(Diff(I_{bg}(x,\ y),\ I_{t}(x,\ y))))$
$\bullet$ $I_{res}(x,\ y) = Thr(Max(Diff(I_{t}(x,\ y),\ I_{bg}(x,\ y)),\ Diff(I_{t+1}(x,\ y),\ I_{t}(x,\ y))))$
$\bullet$ $I_{res}(x,\ y) = Thr(Abs(Diff(I_{t+1}(x,\ y),\ I_{t}(x,\ y))))$

where $I_{t}(x,\ y)$ is the value of the pixel $(x,\ y)$ of the image at time $t$, $I_{bg}(x,\ y)$ is the value of the point $(x,\ y)$ in an empty-scene image (a scene without mobile objects), and $Thr$, $Abs$, $Max$, $Diff$ are respectively the thresholding, absolute value, maximum and difference functions.
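For concreteness, the sketch below implements the three detectors and the blob grouping in Python with NumPy and SciPy. The threshold value, the 8-connectivity choice and all function names are our own assumptions, not the paper's implementation.

import numpy as np
from scipy import ndimage

THR = 25  # assumed intensity threshold used by Thr()

def detect_background(I_t, I_bg, thr=THR):
    # Method 1: Thr(Abs(Diff(I_bg, I_t))) -- background subtraction.
    return np.abs(I_t.astype(int) - I_bg.astype(int)) > thr

def detect_hybrid(I_t, I_t1, I_bg, thr=THR):
    # Method 2: Thr(Max(Diff(I_t, I_bg), Diff(I_t+1, I_t))).
    return np.maximum(I_t.astype(int) - I_bg.astype(int),
                      I_t1.astype(int) - I_t.astype(int)) > thr

def detect_frame_diff(I_t, I_t1, thr=THR):
    # Method 3: Thr(Abs(Diff(I_t+1, I_t))) -- temporal differencing.
    return np.abs(I_t1.astype(int) - I_t.astype(int)) > thr

def extract_blobs(mask):
    # Group neighbouring mobile pixels (f = 1) into connected regions
    # (blobs); 8-connectivity is our assumption.
    labels, n = ndimage.label(mask, structure=np.ones((3, 3)))
    return labels, n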

Person Detection

The goal is to detect which parts of the image correspond to a person. We use a model of a person with 8 parameters: the position of the center of gravity $(px_{img},\ py_{img})$, the height $h_{img}$ and the width $l_{img}$ in the 2D image, the 3D position $(px_{3D},\ py_{3D})$ on the ground plane of the scene, the 3D width $l_{3D}$ and the 3D height $h_{3D}$. The bounding box of a person is defined by the image parameters $(px_{img},\ py_{img})$, $h_{img}$ and $l_{img}$.
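A minimal container for this 8-parameter model might look as follows; the field names and the centre-plus-extents box convention are illustrative assumptions, not the authors' identifiers.

from dataclasses import dataclass

@dataclass
class Person:
    px_img: float  # 2D center of gravity, x (pixels)
    py_img: float  # 2D center of gravity, y (pixels)
    h_img: float   # height in the 2D image (pixels)
    l_img: float   # width in the 2D image (pixels)
    px_3d: float   # 3D position on the ground plane, x
    py_3d: float   # 3D position on the ground plane, y
    l_3d: float    # 3D width
    h_3d: float    # 3D height

    def bbox(self):
        # Bounding box (x0, y0, x1, y1) from the image parameters.
        return (self.px_img - self.l_img / 2, self.py_img - self.h_img / 2,
                self.px_img + self.l_img / 2, self.py_img + self.h_img / 2)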

The person detection algorithm splits the set of blobs into $n$ subsets, each a potential person in the scene. Both 2D image criteria and 3D scene criteria are used. The former are based on the 2D distance between blobs in the image, the goal being to merge the closest blobs. The latter are constraints on the 3D height and width. The 3D measures are obtained by linear projection from the image plane onto the ground plane of the scene, as sketched below.
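One way to realize these two criteria is sketched here; the greedy merge strategy, the merge distance, the person-size bounds and the calibration callback img_to_ground are all hypothetical, since the paper gives no concrete values.

import math

def merge_close_blobs(boxes, max_dist=30.0):
    # 2D criterion: repeatedly union bounding boxes (x0, y0, x1, y1)
    # whose centers are closer than max_dist pixels (assumed value).
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                cxi = (boxes[i][0] + boxes[i][2]) / 2
                cyi = (boxes[i][1] + boxes[i][3]) / 2
                cxj = (boxes[j][0] + boxes[j][2]) / 2
                cyj = (boxes[j][1] + boxes[j][3]) / 2
                if math.hypot(cxi - cxj, cyi - cyj) < max_dist:
                    boxes[i] = [min(boxes[i][0], boxes[j][0]),
                                min(boxes[i][1], boxes[j][1]),
                                max(boxes[i][2], boxes[j][2]),
                                max(boxes[i][3], boxes[j][3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes

def is_person(box, img_to_ground, h_range=(1.2, 2.2), l_range=(0.3, 1.2)):
    # 3D criterion: project the box onto the ground plane (camera
    # calibration assumed inside img_to_ground, a hypothetical helper)
    # and test the 3D height/width bounds (assumed values, in metres).
    h_3d, l_3d = img_to_ground(box)
    return h_range[0] <= h_3d <= h_range[1] and l_range[0] <= l_3d <= l_range[1]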

Person Tracking

The goal of person tracking is to update the set of previous trajectories. For that purpose, the persons detected in the current image must be matched with those detected in the previous ones. This matching can be defined as a function from the set $P_{t-1}$ of persons detected at time $t - 1$ into the set $P_t$ of persons detected at time $t$. We use three alternative methods: one based on the amount of overlap in the 2D image, one based on the proximity of the persons in the 3D scene, and a restrictive variant of the latter. The first method states that two persons detected at two consecutive times are the same real person if the percentage of overlap of their bounding boxes is greater than a threshold. The second method matches a person in $P_t$ with a person in $P_{t-1}$ if their 3D distance is below a threshold. The third method is similar to the second, but the matching function must be either an injection or a surjection.
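The first two matching rules might be expressed as follows, reusing the Person sketch above; the overlap and distance thresholds are invented for illustration.

import math

def overlap_ratio(a, b):
    # Overlap of two boxes (x0, y0, x1, y1), relative to the smaller box.
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller > 0 else 0.0

def match_2d(prev, cur, thr=0.5):
    # Method 1: same person if the 2D bounding boxes overlap enough.
    return [(p, c) for p in prev for c in cur
            if overlap_ratio(p.bbox(), c.bbox()) > thr]

def match_3d(prev, cur, max_dist=1.0):
    # Method 2: same person if the 3D ground-plane distance is small.
    return [(p, c) for p in prev for c in cur
            if math.hypot(p.px_3d - c.px_3d, p.py_3d - c.py_3d) < max_dist]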

Smoothing

The goal of the smoothing step is twofold. The first goal is to correct errors made in the previous perception steps on the 3D parameters of a person: the position $(px_{3D},\ py_{3D})$ on the ground plane, the height $h_{3D}$ and the width $l_{3D}$. The second goal is to estimate the instantaneous speed $(vx_{3D},\ vy_{3D})$ of each person. Three smoothing methods are used. The first method uses a standard Kalman filter [15]; the state vector is $(px_{3D},\ py_{3D},\ vx_{3D},\ vy_{3D})$ and the linear dynamic model is based on the hypothesis of a constant speed. The second and third methods are respectively median and mean filtering with window size 3, 5 or 7. $(vx_{3D},\ vy_{3D})$ is initialized by computing $v(t) = \frac{p(t) - p(t-1)}{\delta t}$; each of the four values $px_{3D}$, $py_{3D}$, $vx_{3D}$ and $vy_{3D}$ is then filtered.
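A minimal constant-speed Kalman filter over the state $(px_{3D},\ py_{3D},\ vx_{3D},\ vy_{3D})$ could be written as below, together with the median variant; the noise covariances and the time step are assumptions, as the paper does not report them.

import numpy as np

def kalman_smooth(positions, dt=0.1, q=1e-2, r=1e-1):
    # positions: list of observed (px_3d, py_3d) ground-plane points.
    F = np.array([[1, 0, dt, 0],      # constant-speed dynamic model
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    H = np.array([[1, 0, 0, 0],       # only the position is observed
                  [0, 1, 0, 0]], dtype=float)
    Q, R = q * np.eye(4), r * np.eye(2)   # assumed noise covariances
    x = np.array([positions[0][0], positions[0][1], 0.0, 0.0])
    P = np.eye(4)
    smoothed = []
    for z in positions[1:]:
        x, P = F @ x, F @ P @ F.T + Q                 # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # Kalman gain
        x = x + K @ (np.asarray(z, dtype=float) - H @ x)  # correct
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x.copy())     # (px, py, vx, vy) estimate
    return smoothed

def median_smooth(series, w=5):
    # Median filtering with window size 3, 5 or 7; mean filtering is
    # identical with np.mean in place of np.median.
    s = np.asarray(series, dtype=float)
    half = w // 2
    return np.array([np.median(s[max(0, i - half):i + half + 1])
                     for i in range(len(s))])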

The perception methods we have described are deliberately simple in order to comply with the real-time constraint. Their role is to provide enough information to the interpretation methods for video understanding.


Nathanael Rota
2000-11-06