Evaluation

Collaborators: Cián Shaffrey (University of Cambridge, UK), Nick Kingsbury (University of Cambridge, UK).

Key words: database retrieval, segmentation, indexing, psychovisual evaluation.

Summary: To advance a science, methodologically well-defined evaluation techniques are necessary. In the case of database retrieval systems, they have not always been used, and there seems to be some confusion about what such techniques should be. The goal of this work was to analyse the problem of evaluation abstractly and then apply the analysis to two completely different databases: one, kindly provided by the IGN (French National Geographic Institute), consisting of aerial images of the Ile-de-France region around Paris; the second, kindly provided by BAL (Bridgeman Art Library) in the UK, consisting of fine art images.

The first step was thus a methodological analysis of the problem of evaluation in scenarios with differing amounts of knowledge about the image semantics. The main conclusions were:

In situations in which the image semantics S is well defined, the human interpretation h is available, and image processing techniques are close to reproducing the human interpretation, retrieval is not the issue. It is easier, and better defined, to check that the image processing reproduces the correct semantics, i.e., that the diagram below nearly commutes. This is the IGN case, discussed further below.

In situations in which the image semantics S is not well defined, and consequently the human interpretation h is not available, the only choice is to use human subjects to compare the outputs of the image processing arrow for different methods. Note again that the emphasis is on the outputs "making sense", that is, on the semantics, and not on retrieval as such. This is the BAL case, discussed further below.

Query-by-example is ill-defined as a retrieval method, in the sense that the expected output cannot be known. In conjunction with the use of "relevance classes" for evaluation, things get even worse. The success of many of the evaluations in the literature says more about the databases used than about the retrieval methods themselves.

Semantics is inherently linguistic, and must be defined as such. Reproducing the human interpretation h means good retrieval. Not reproducing it means bad retrieval.

We are thus led to use two very different methods for the two databases. For the IGN database, the semantics is well defined, consisting of conjunctions of statements such as "Region R contains forest". The human interpretation exists, in the form of land use maps compiled from existing cartography and field studies, and kindly provided to us by IAURIF, the Urban Planning Institute for the Ile-de-France region. In addition, segmentation algorithms can get close to the correct results. As promised, we are thus in the first situation listed above. Work continues on this database with the lengthy task of registering the original data with the land use maps. Once this is done, relatively simple metrics can be used to measure how close different segmentation results are to the correct interpretation, thus measuring their usefulness for retrieval.
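As an illustration of what such "relatively simple metrics" might look like, the following is a minimal sketch, not the metric used in this work. It assumes that the segmentation output and the land use map have already been registered to the same pixel grid and share one label vocabulary; the function names and the label strings in the usage note are hypothetical.

```python
from collections import Counter


def _check_same_shape(predicted, reference):
    """Raise ValueError unless both label maps are non-empty and of equal shape."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label maps must be non-empty and have the same shape")
    for p_row, r_row in zip(predicted, reference):
        if len(p_row) != len(r_row):
            raise ValueError("label maps must have the same shape")


def pixel_agreement(predicted, reference):
    """Fraction of pixels whose predicted label equals the reference label."""
    _check_same_shape(predicted, reference)
    total = sum(len(row) for row in reference)
    if total == 0:
        raise ValueError("label maps contain no pixels")
    matches = sum(
        p == r
        for p_row, r_row in zip(predicted, reference)
        for p, r in zip(p_row, r_row)
    )
    return matches / total


def per_class_iou(predicted, reference):
    """Intersection over union for each label present in either map."""
    _check_same_shape(predicted, reference)
    intersection = Counter()
    union = Counter()
    for p_row, r_row in zip(predicted, reference):
        for p, r in zip(p_row, r_row):
            if p == r:
                # Pixel counts once towards both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # A disagreement enlarges the union of both labels involved.
                union[p] += 1
                union[r] += 1
    return {label: intersection[label] / union[label] for label in union}
```

Both functions take label maps as lists of rows, for example a reference map with rows ["forest", "forest", "urban"] and ["forest", "water", "urban"]. The overall agreement gives a single score per segmentation method, while the per-class figures show which land use classes a method handles poorly.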

For the BAL database, the story is different. The image semantics is extremely complicated, in fact practically unbounded, and thus impossible to define. In addition, human interpretation is very varied and difficult to characterize, and image processing algorithms have no hope of actually reproducing whatever sub-semantics can be defined. We are thus, as promised, in the second situation listed above. Consequently, we used human subjects and psychovisual experiments to evaluate various segmentation algorithms. The results of these experiments show that there is a degree of consensus among the subjects about which segmentations are better than others, and also about which segmentation methods are better than others. One indication is that even though the subjects ranked the segmentation methods pairwise, the results are consistent with a single total order on the schemes (the graph below has no cycles). The results indicate that there may be obtainable criteria for what constitutes a good segmentation from the point of view of human subjects.
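The consistency check mentioned above, that a set of pairwise preferences has no cycles, can be sketched as a topological sort. This is a minimal illustration and not the analysis used in the experiments. It assumes the subjects' judgements have already been aggregated into (better, worse) pairs, a step the text does not describe, and the function name and scheme names are hypothetical.

```python
from collections import defaultdict, deque


def total_order_from_pairs(preferences):
    """Return one ordering (best first) consistent with the pairwise
    preferences, or None if the preference graph contains a cycle.

    `preferences` is an iterable of (better, worse) pairs.
    """
    successors = defaultdict(set)
    indegree = {}
    for better, worse in preferences:
        indegree.setdefault(better, 0)
        indegree.setdefault(worse, 0)
        if worse not in successors[better]:
            successors[better].add(worse)
            indegree[worse] += 1

    # Kahn's algorithm: repeatedly remove schemes that nothing beats.
    ready = deque(sorted(s for s, d in indegree.items() if d == 0))
    order = []
    while ready:
        scheme = ready.popleft()
        order.append(scheme)
        for worse in sorted(successors[scheme]):
            indegree[worse] -= 1
            if indegree[worse] == 0:
                ready.append(worse)

    # Any scheme left unprocessed lies on a cycle (or downstream of one).
    return order if len(order) == len(indegree) else None
```

With the preferences ("A", "B"), ("B", "C") and ("A", "C") the function returns the schemes ordered from A to C, whereas adding a cycle such as ("C", "A") in place of ("A", "C") makes it return None. When every pair of schemes has been compared, the absence of cycles means the returned order is the only one consistent with the data.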

Figure: One of the BAL images.
Figure: Pairwise ordering of segmentation schemes.
Publications:

"Psychovisual Evaluation of Image Segmentation Algorithms", Cián W. Shaffrey, Ian H. Jermyn and Nick G. Kingsbury. To appear in Proceedings of Advanced Concepts for Intelligent Visual Systems (ACIVS), Ghent, Belgium, September 2002.

"Evaluation Methodologies for Image Retrieval Systems", Ian H. Jermyn, Cián W. Shaffrey and Nick G. Kingsbury. To appear in Proceedings of Advanced Concepts for Intelligent Visual Systems (ACIVS), Ghent, Belgium, September 2002.

Ariana (joint research group CNRS/INRIA/UNSA), INRIA Sophia Antipolis 2004 route des Lucioles, B.P. 93, 06902 Sophia Antipolis Cedex, France.
E: Ian.Jermyn@sophia.inria.fr	T: +33 (0)4 92 38 76 83	F: +33 (0)4 92 38 76 43