INRIA and University of Nice - Sophia Antipolis — Doctoral School of Sciences and Technologies of Information and Communication (S.T.I.C.)
Distributed Artificial Intelligence and Knowledge Management:
ontologies and multi-agent systems for a corporate semantic web *
by Fabien GANDON
* Intelligence artificielle distribuée et gestion des connaissances : ontologies et systèmes multi-agents pour un web sémantique organisationnel
§ Mme Rose Dieng-Kuntz (Research Director)
§ Mr Joost Breuker (Professor)
§ Mr Les Gasser (Professor)
§ Mr Jean-Paul Haton (Professor)
§ Mr Pierre Bernhard (Professor - President of the jury)
§ Mr Jacques Ferber (Professor)
§ Mr Hervé Karp (Consultant)
§ Mr Gilles Kassel (Professor)
§ Mr Agostino Poggi (Professor)
To my whole family without whom I would not be where I am, and even worse, I would not be at all.
To my parents Michel and Josette Gandon for their infallible support and advice to a perpetually absent son.
To my sister Adeline Gandon from a brother always on the way to some other place.
To my friends for not giving up their friendship to the ghost of myself.
To Rose Dieng-Kuntz, director of my PhD, for her advice, her support and her trust in me, and without whom this work would never have been done.
To Olivier Corby, for the extremely fruitful technical discussions we had and his constant meticulous work on CORESE.
To Alain Giboin, for the inspiring discussions we had on cognitive aspects and his supervision of the ergonomics and evaluation matters.
To the whole ACACIA team for the work we did in common, in particular to Laurent Berthelot, Alexandre Delteil, Catherine Faron-Zucker and Carolina Medina Ramirez.
To the assistants of the ACACIA team, Hortense Hammel and Sophie Honnorat, for their advice and help in the day-to-day work and their assistance in the logistics of the Ph.D.
To all the partners of the CoMMA IST European project, as well as all the trainees involved in it.
To the members of the jury and the reviewers of this Ph.D. for accepting the extra amount of work.
To Catherine Barry and Régine Loisel who introduced me to research during my M.Phil.
To the SEMIR department of INRIA, in charge of the computer facilities for its excellent work in maintaining a high quality working environment.
To the library department of INRIA, which did valuable work providing me with information resources and references vital to my work.
To INRIA and to the European Commission, which funded the CoMMA IST project.
Table of Contents
Guided tour of relevant literature 21
1 Organisational knowledge management 23
1.1 Needs for knowledge management 24
1.1.1 Needs due to the organisational activity 24
1.1.2 Needs due to external organisational factors 24
1.1.3 Needs due to internal organisational factors 25
1.1.4 Needs due to the nature of information and knowledge 26
1.2 Organisational memories 27
1.2.1 Nature of organisational knowledge 29
1.2.2 Knowledge management and organisational memory lifecycle 33
1.2.3 Typology or facets of organisational memories 36
1.3 Organisation modelling 40
1.3.1 Origins of organisational model 40
1.3.2 Ontology based approaches for organisation modelling 42
1.3.3 Concluding remarks 44
1.4 User-modelling and human-computer interaction 45
1.4.1 Nature of the model 45
1.4.2 Adaptation and customisation 46
1.4.3 Application domains and use in knowledge management 47
1.5 Information retrieval systems 49
1.5.1 Resources and query: format and surrogates 50
1.5.1.1 Initial structure 50
1.5.1.2 Surrogate generation 51
1.5.1.2.1 Surrogate for indexing and querying 51
1.5.1.2.2 Statistic approaches to surrogate generation 53
1.5.1.2.3 Semantic approaches to surrogate generation 54
1.5.2 Querying a collection 56
1.5.3 Query and results: the view from the user side 57
1.5.4 The future of information retrieval systems 60
2 Ontology and knowledge modelling 61
2.1 Ontology: the object 62
2.1.1 Ontology: an object of artificial intelligence and conceptual tool of Knowledge Modelling 62
2.1.2 Definitions adopted here 65
2.1.3 Overview of existing ontologies 70
2.1.3.1 A variety of application domains 70
2.1.3.2 Some ontologies for organisational memories 71
2.1.4 Lifecycle of an ontology: a living object with a maintenance cycle 72
2.2 Ontology: the engineering 74
2.2.1 Scope and Granularity: the use of scenarios for specification 74
2.2.2 Knowledge Acquisition 77
2.2.3 Linguistic study & Semantic commitment 81
2.2.4 Conceptualisation and Ontological commitment 83
2.2.5 Taxonomic skeleton 84
2.2.5.1 Differential semantics [Bachimont, 2000] 86
2.2.5.2 Semantic axis [Kassel et al., 2000] 87
2.2.5.3 Checking the taxonomy [Guarino and Welty, 2000] 88
2.2.5.4 Formal concept analysis for lattice generation 91
2.2.5.5 Relations have intensions too 92
2.2.5.6 Semantic commitment 93
2.2.6 Formalisation and operationalisation of an ontology 94
2.2.6.1 Logic of propositions 96
2.2.6.2 Predicate or first order logic 96
2.2.6.3 Logic Programming language 97
2.2.6.4 Conceptual graphs 97
2.2.6.5 Topic Maps 98
2.2.6.6 Frame and object-oriented formalisms 98
2.2.6.7 Description Logics 99
2.2.6.8 Conclusion on formalisms 100
2.2.7 Reuse 100
3 Structured and semantic Web 105
3.1 Beginnings of the semantic web 106
3.1.1 Inspiring projects 106
3.1.1.1 SHOE 106
3.1.1.2 Ontobroker 106
3.1.1.3 Ontoseek 107
3.1.1.4 ELEN 107
3.1.1.5 RELIEF 107
3.1.1.6 LogicWeb 107
3.1.1.7 Untangle 108
3.1.1.8 WebKB 108
3.1.1.9 CONCERTO 108
3.1.1.10 OSIRIX 109
3.1.2 Summary on ontology-based information systems 109
3.2 Notions and definitions 111
3.3 Toward a structured and semantic Web 113
3.3.1 XML: Metadata Approach 113
3.3.2 RDF(S): Annotation approach 117
3.3.2.1 Resource Description Framework: RDF datamodel 117
3.3.2.2 RDF Schema: RDFS meta-model 120
3.4 Summary and perspectives 125
3.4.1 Some implementations of the RDF(S) framework 126
3.4.1.1 ICS-FORTH RDFSuite 126
3.4.1.2 Jena 126
3.4.1.3 SiLRI and TRIPLE 127
3.4.1.4 CORESE 127
3.4.1.5 Karlsruhe Ontology (KAON) Tool Suite 127
3.4.1.6 Metalog 128
3.4.1.7 Sesame 128
3.4.1.8 Protégé-2000 128
3.4.1.9 WebODE 129
3.4.1.10 Mozilla RDF 129
3.4.2 Some extension initiatives 130
3.4.2.1 DAML-ONT 130
3.4.2.2 OIL and DAML+OIL 130
3.4.2.3 DRDF(S) 131
3.4.2.4 OWL 132
3.5 Remarks and conclusions 133
4 Distributed artificial intelligence 135
4.1 Agents and multi-agent systems 137
4.1.1 Notion of agent 137
4.1.2 Notion of multi-agent systems 143
4.1.3 Application fields and interest 144
4.1.4 Notion of information agents and multi-agent information systems 145
4.1.4.1 Tackling the heterogeneity of the form and the means to access information 145
4.1.4.2 Tackling growth and distribution of information 146
4.1.4.3 Transversal issues and characteristics 146
4.1.4.4 The case of organisational information systems 149
4.1.5 Chosen definitions 150
4.2 Design rationale for multi-agent systems 151
4.2.1 Overview of some methodologies or approaches 151
4.2.1.1 AALAADIN and the A.G.R. model 151
4.2.1.2 AOR 152
4.2.1.3 ARC 152
4.2.1.4 ARCHON 153
4.2.1.5 AUML 153
4.2.1.6 AWIC 154
4.2.1.7 Burmeister's Agent-Oriented Analysis and Design 154
4.2.1.8 CoMoMAS 155
4.2.1.9 Cassiopeia methodology 155
4.2.1.10 DESIRE 156
4.2.1.11 Dieng's methodology for specifying a co-operative system 156
4.2.1.12 Elammari and Lalonde's Agent-Oriented Methodology 157
4.2.1.13 GAIA 157
4.2.1.14 Jonker et al.'s Design of Collaborative Information Agents 158
4.2.1.15 Kendall et al.'s methodology for developing agent systems for enterprise integration 158
4.2.1.16 Kinny et al.'s methodology for systems of BDI agents 159
4.2.1.17 MAS-CommonKADS 160
4.2.1.18 MASB 161
4.2.1.19 MESSAGE 161
4.2.1.20 Role card model for agents 162
4.2.1.21 Verharen's methodology for co-operative information agents design 162
4.2.1.22 Vowels 163
4.2.1.23 Z specifications for agents 163
4.2.1.24 Remarks on the state of the art of methodologies 164
4.2.2 The organisational design axis and role turning-point 166
4.3 Agent communication 168
4.3.1 Communication language: the need for syntactic and semantic grounding 168
4.3.2 Communication protocols: the need for interaction rules 169
4.3.3 Communication and content languages 170
4.4 Survey of multi-agent information systems 174
4.4.1 Some existing co-operative or multi-agent information systems 174
4.4.1.1 Enterprise Integration Information Agents 174
4.4.1.2 SAIRE 175
4.4.1.3 NZDIS 175
4.4.1.4 UMDL 176
4.4.1.5 InfoSleuth™ 176
4.4.1.6 Summarising remarks on the field 178
4.4.2 Common roles for agents in multi-agent information systems 179
4.4.2.1 User agent role 179
4.4.2.2 Resource agent role 180
4.4.2.3 Middle agent role 182
4.4.2.4 Ontology agent role 183
4.4.2.5 Executor agent role 183
4.4.2.6 Mediator and facilitator agent role 184
4.4.2.7 Other less common specific roles 184
4.5 Conclusion of this survey of multi-agent systems 185
5 Research context and positioning 187
5.1 Requirements 188
5.1.1 Application scenarios 188
5.1.1.1 Knowledge management to improve the integration of a new employee 188
5.1.1.2 Knowledge management to support technology monitoring and survey 189
5.1.2 The set of needed functionalities common to both scenarios 190
5.2 Implementation choices 192
5.2.1 Organisational memory as a heterogeneous and distributed information landscape 192
5.2.2 Stakeholders of the memory as a heterogeneous and distributed population 193
5.2.3 Management of the memory as a heterogeneous and distributed set of tasks 193
5.2.4 Ontology: the cornerstone providing a semantic grounding 194
5.3 Positioning and envisioned system 195
5.3.1 Related works and positioning in the state of the art 195
5.3.2 Overall description of the CoMMA solution 198
Ontology & annotated corporate memory 201
6 Corporate semantic web 203
6.1 An annotated world for agents 204
6.2 A model-based memory 205
6.3 CORESE: COnceptual REsource Search Engine 208
6.4 On the pivotal role of an ontology 212
7 Engineering the O'CoMMA ontology 213
7.1 Scenario analysis and Data collection 214
7.1.1 Scenario-based analysis 214
7.1.2 Semi-structured interviews 218
7.1.3 Workspace observations 219
7.1.4 Document Analysis 220
7.1.5 Discussion on data collection 222
7.2 Reusing ontologies and other sources of expertise 223
7.3 Initial terminological stage 224
7.4 Perspectives in populating and structuring the ontology 225
7.5 From semi-informal to semi-formal 227
7.6 On a continuum between formal and informal 230
7.7 Axiomatisation and needed granularity 238
7.8 Conclusion and abstraction 242
8 The ontology O'CoMMA 245
8.1 Overall structure of O'CoMMA 246
8.2 Top of O'CoMMA 247
8.3 Ontology parts dedicated to organisational modelling 248
8.4 Ontology parts dedicated to user profiles 250
8.5 Ontology parts dedicated to documents 253
8.6 Ontology parts dedicated to the domain 254
8.7 The hierarchy of properties 255
8.8 End-Users' extensions 257
8.9 Discussion and abstraction 261
Multi-agent system for memory management 265
9 Design rationale of the multi-agent architecture 267
9.1 From macroscopic to microscopic 268
9.1.1 Architecture versus configuration 268
9.1.2 Organising sub-societies 269
9.1.2.1 Hierarchical society: a separated roles structure 269
9.1.2.2 Peer-to-peer society: an egalitarian roles structure 270
9.1.2.3 Clone society: a full replication structure 271
9.1.2.4 Choosing an organisation 271
9.1.3 Sub-society dedicated to ontology and organisational model 271
9.1.4 Annotation dedicated sub-society 272
9.1.5 Interconnection dedicated sub-society 273
9.1.6 User dedicated sub-society 273
9.1.6.1 Possible options 274
9.1.6.2 User profile storage 274
9.1.6.3 Other possible roles for interest group management 275
9.1.7 Overview of the sub-societies 276
9.2 Roles, Interactions and behaviours 277
9.2.1 Accepted roles 277
9.2.1.1 Ontology Archivist 277
9.2.1.2 Corporate Model Archivist 278
9.2.1.3 Annotation Archivist 278
9.2.1.4 Annotation Mediator 279
9.2.1.5 Directory Facilitator 279
9.2.1.6 Interface Controller 280
9.2.1.7 User Profile Manager 280
9.2.1.8 User Profile Archivist 281
9.2.2 Characterising and comparing roles 282
9.2.3 Social interactions 284
9.2.4 Behaviour and technical competencies 287
9.3 Agents types and deployment 288
9.3.1 Typology of implemented agents 288
9.3.1.1 Interface Controller Agent Class implementation 288
9.3.1.2 User Profile Manager Agent Class implementation 291
9.3.1.3 User Profile Archivist Agent Class implementation 291
9.3.1.4 Directory Facilitator Agent Class implementation 292
9.3.1.5 Ontology Archivist Agent Class implementation 292
9.3.1.6 Annotation Mediator Agent Class implementation 292
9.3.1.7 Annotation Archivist Agent Class implementation 292
9.3.2 Deployment configuration 293
9.4 Discussion and abstraction 294
10 Handling annotation distribution 297
10.1 Issues of distribution 298
10.2 Differentiating archivist agents 300
10.3 Annotation allocation 303
10.3.1 Allocation protocol 303
10.3.2 Pseudo semantic distance 305
10.3.2.1 Definition of constants 305
10.3.2.2 Distance between two literals 305
10.3.2.3 Pseudo-distance literal - literal interval 307
10.3.2.4 Distance between two ontological types 307
10.3.2.5 Distance between a concept type and a literal 309
10.3.2.6 Pseudo-distance annotation triple - property family 309
10.3.2.7 Pseudo-distance annotation triple - ABIS 309
10.3.2.8 Pseudo-distance annotation triple - CAP 310
10.3.2.9 Pseudo-distance annotation - ABIS 310
10.3.2.10 Pseudo-distance annotation - CAP 310
10.3.2.11 Pseudo-distance annotation - AA 310
10.3.3 Conclusion and discussion on the Annotation allocation 311
10.4 Distributed query-solving 312
10.4.1 Allocating tasks 312
10.4.2 Query solving protocol 314
10.4.3 Distributed query solving (first simple algorithm) 316
10.4.3.1 Query simplification 316
10.4.3.2 Constraint solving 317
10.4.3.3 Question answering 317
10.4.3.4 Filtering final results 317
10.4.3.5 Overall algorithm and implementation details 319
10.4.4 Example and discussion 321
10.4.5 Distributed query solving (second algorithm) 326
10.4.5.1 Discussion on the first algorithm 326
10.4.5.2 Improved algorithm 327
10.5 Annotation push mechanism 329
10.6 Conclusion and discussion 331
Lessons learned & Perspectives 333
11 Evaluation and return on experience 335
11.1 Implemented functionalities 336
11.2 Non-implemented functionalities 337
11.3 Official evaluation of CoMMA 338
11.3.1 Overall System and Approach evaluation 338
11.3.1.1 First trial evaluation 339
11.3.1.2 Final trial evaluation 340
11.3.2 Criteria used for specific aspects evaluation 342
11.3.3 The ontology aspect 344
11.3.3.1 Appropriateness and relevancy 344
11.3.3.2 Usability and exploitability 344
11.3.3.3 Adaptability, maintainability and flexibility 345
11.3.3.4 Accessibility and guidance 345
11.3.3.5 Explicability and documentation 346
11.3.3.6 Expressiveness, relevancy, completeness, consistency and reliability 346
11.3.3.7 Versatility, modularity and reusability 346
11.3.3.8 Extensibility 347
11.3.3.9 Interoperability and portability 347
11.3.3.10 Feasibility, scalability and cost 347
11.3.4 The semantic web and conceptual graphs 348
11.3.4.1 Usability, exploitability, accessibility and cost 348
11.3.4.2 Flexibility, adaptability, extensibility, modularity, reusability and versatility 349
11.3.4.3 Expressiveness 349
11.3.4.4 Appropriateness and maintainability 350
11.3.4.5 Portability, integrability and interoperability 350
11.3.4.6 Feasibility, scalability, reliability and time of response 351
11.3.4.7 Documentation 352
11.3.4.8 Relevancy, completeness and consistency 352
11.3.4.9 Security, explicability, pro-activity and guidance 352
11.3.5 The multi-agent system aspect 353
11.3.5.1 Appropriateness, modularity and exploitability 353
11.3.5.2 Usability, feasibility, explicability, guidance and documentation 353
11.3.5.3 Adaptability and pro-activity 354
11.3.5.4 Interoperability, expressiveness and portability 354
11.3.5.5 Scalability, reliability and time of response 355
11.3.5.6 Security 355
11.3.5.7 Cost 355
11.3.5.8 Integrability and maintainability 355
11.3.5.9 Flexibility, versatility, extensibility and reusability 356
11.3.5.10 Relevancy, completeness, consistency 356
11.4 Conclusion and discussion on open problems 357
12 Short-term improvements 359
12.1 Querying and annotating through ontologies: from conceptual concerns to users' concerns 360
12.2 Improving the pseudo-distance 370
12.2.1 Introducing other criteria 370
12.2.2 The non-semantic part problem 370
12.2.3 The subsumption link is not a unitary length 371
12.3 Improving distributed solving 376
12.3.1 Multi-instantiation and distributed solving 376
12.3.2 Constraint sorting 377
12.3.3 URI in ABIS 377
12.4 Ontology service improvements 378
12.4.1 Precise querying to retrieve the ontology 378
12.4.2 Propagating updates of the ontology 378
12.5 Annotation management improvements 379
12.5.1 Query-based push registration 379
12.5.2 Allowing edition of knowledge 379
12.5.3 Update scripting and propagation 380
12.6 Conclusion on short-term perspectives 381
13 Long-term perspectives 383
13.1 Ontology and multi-agent systems 384
13.1.1 Ontological lifecycle 384
13.1.1.1 Emergence and management of the ontological consensus 384
13.1.1.2 Maintenance and evolution 385
13.1.1.3 Ontologies as interfaces, ontologies and interfaces 386
13.1.1.4 Conclusion 388
13.1.2 Ontologies for MAS vs. MAS for ontologies 389
13.1.2.1 Use of ontologies for multi-agent systems 389
13.1.2.2 Use of multi-agent systems for ontologies 389
13.2 Organisational memories and multi-agent systems 390
13.2.1 Opening the system: no actor is an island 391
13.2.1.1 Extranets: coupling memories 391
13.2.1.2 Web-mining: wrapping external sources 392
13.2.1.3 Collaborative filtering: exploiting the organisational identity 394
13.2.2 Closing the system: security and system management 395
13.2.3 Distributed rule bases and rule engines 395
13.2.4 Workflows and information flows 396
13.2.5 New noises 396
13.3 Dreams 399
14 Résumé in French 411
14.1 Le projet CoMMA 413
14.2 Web sémantique d'entreprise 415
14.2.1 Du Web Sémantique à l'intraweb sémantique 415
14.2.2 Une mémoire basée sur un modèle 415
14.2.2.1 Description des profils utilisateur : "annoter les personnes" 416
14.2.2.2 Description de l'entreprise : "annoter l'organisation" 416
14.2.2.3 Architecture de la mémoire 416
14.3 CORESE: moteur de recherche sémantique 418
14.4 Conception De L'ontologie O'CoMMA 419
14.4.1 Position et Définitions 419
14.4.2 Analyse par scénarios et Recueil 420
14.4.3 Les scénarios comme guides de conception 420
14.4.4 Recueil spécifique à l’entreprise 420
14.4.5 Recueil non spécifique à l’entreprise 422
14.4.6 Phase terminologique 422
14.4.7 Structuration : du semi-informel au semi-formel 423
14.4.8 Formalisation de l'ontologie 426
14.4.9 Extensions nécessaires à la formalisation 428
14.5 Contenu de l'ontologie O'CoMMA 429
14.6 Les agents de la mémoire 431
14.6.1 Une architecture multi-agents 431
14.6.1.1 Système d'information multi-agents 431
14.6.1.2 Du niveau macroscopique du SMA au niveau microscopique des agents 432
14.6.1.3 Rôles, interactions et protocoles 433
14.7 Cas de la société dédiée aux annotations 436
14.8 Discussion et perspectives 437
15 O'CoMMA 439
15.1 Lexicon view of concepts 439
15.2 Lexicon view of relations 452
16 Tables of illustrations 455
16.1 List of figures 455
16.2 List of tables 460
17 References 463
The beginning is the most
important part of the work.
Human societies are structured and ruled by organisations. Organisations can be seen as abstract holonic living entities composed of individuals and other organisations. The raison d'être of these living entities is a set of core activities that answer needs of other organisations or individuals. These core activities are the result of the collective work of the members of the organisation. The global activity relies on organisational structures and infrastructures that are supervised by an organisational management. The management aims at effectively co-ordinating the work of the members in order to obtain a collective achievement of the core organisational activities.
The individual work, be it part of the organisational management or the core activities, requires knowledge. The whole knowledge mobilised by an organisation for its functioning forms an abstract set called organisational knowledge; a lack of organisational knowledge may result in organisational dysfunction.
As the speed of markets rises and their dimensions tend towards globalisation, reaction time is shortening and competitive pressure is growing; information loss may lead to a missed opportunity. Organisations must react quickly to changes in their domain and in the needs they answer or, even better, anticipate them. In this context, knowledge is an organisational asset for competitiveness and survival, the importance of which has grown fast in the last decade. Thus organisational management now explicitly includes the activity of knowledge management, which addresses problems of identification, acquisition, storage, access, diffusion, reuse and maintenance of both internal and external knowledge.
One approach to managing knowledge in an organisation is to set up an organisational memory management solution: the organisational memory aspect is in charge of ensuring the persistent storage and/or indexing of the organisational knowledge, and its management solution is in charge of capturing relevant pieces of knowledge and providing the concerned persons with them, both activities being carried out at the appropriate time, with the right level of detail and in an adequate format. An organisational memory relies on knowledge resources, i.e., documents, people, formalised knowledge (e.g. software, knowledge bases) and other artefacts in which knowledge has been embedded.
Such a memory and its management require methodologies and tools to be operationalised. Resources, such as documents, are information supports; therefore, their management can benefit from results in informatics and from the assistance of software solutions developed in the field. The work I am presenting here was carried out during my Ph.D. in Informatics, within ACACIA, a multidisciplinary research team of INRIA that aims at offering models, methods and tools for building a corporate memory. My research was applied to the realisation of CoMMA (Corporate Memory Management through Agents), a two-year European IST project.
In this thesis, I shall show that (1) semantic webs can provide distributed knowledge spaces for knowledge management; (2) ontologies are applicable and effective means for supporting distributed knowledge spaces; (3) multi-agent systems are applicable and effective architectures for managing distributed knowledge spaces. To this end, I divided this document into four parts which structure a total of thirteen chapters:
The first part is a guided tour of relevant literature. It consists of five chapters, the first four chapters analyse the literature relevant to the work and the fifth one positions CoMMA within this survey.
Chapter 1 - Organisational knowledge management. I shall describe the needs for knowledge management and the notion of corporate memory aimed at answering these needs. I shall also introduce three related domains that contribute to building organisational knowledge management solutions: corporate modelling, user modelling and information retrieval systems.
Chapter 2 - Ontology and knowledge modelling. I shall analyse the latest advances in knowledge modelling that may be used for knowledge management, especially focusing on the notion of ontology, that is, the part of the knowledge model that captures the semantics of the primitives used to make formal assertions about the application domain of the knowledge-based solution. The chapter is divided into two large sections respectively addressing the nature and the design of an ontology.
Chapter 3 - Structured and semantic web. More and more organisations rely on intranets and internal corporate webs to implement a corporate memory solution. In current Web technology, information resources are machine-readable, but not machine-understandable. Improving the exploitation and management mechanisms of a web requires the introduction of formal knowledge; in the semantic web vision of the W3C, this takes the form of semantic annotations about the information resources. After a survey of the pioneer projects that prefigured the semantic web, I shall detail the current foundations laid down by the W3C, and summarise the on-going and future trends.
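To make the idea of a machine-understandable annotation concrete, here is a small illustrative sketch in Python (not the W3C stack itself, and with invented resource names and a toy two-concept taxonomy): an annotation is pictured as a set of subject-property-object triples, and a query over it can exploit the ontology's subsumption links, which a purely keyword-based search could not.

```python
# Illustrative sketch of a semantic annotation as subject-property-object
# triples (all names below are invented for the example).
annotation = {
    ("doc42", "rdf:type", "Report"),
    ("doc42", "hasAuthor", "person7"),
    ("person7", "rdf:type", "Employee"),
}

# A toy fragment of an ontology: child concept -> parent concept.
parent = {"Report": "Document", "Employee": "Person"}

def is_a(concept, target):
    """True if concept equals target or specialises it in the taxonomy."""
    while concept is not None:
        if concept == target:
            return True
        concept = parent.get(concept)
    return False

def instances_of(triples, target):
    """Resources whose declared type equals or specialises target."""
    return {s for (s, p, o) in triples if p == "rdf:type" and is_a(o, target)}

# Asking for all Documents retrieves doc42 although it is only declared
# as a Report: the formal semantics make this inference possible.
print(instances_of(annotation, "Document"))  # → {'doc42'}
```

The point of the sketch is only the last line: the answer follows from the taxonomy, not from any keyword occurring in the resource itself.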
Chapter 4 - Distributed artificial intelligence. Artificial intelligence studies artificial entities, which I shall call agents, that reproduce individual intelligent behaviours. In application fields where the elements of the problem are scattered, distributed artificial intelligence tries to build artificial societies of agents to propose adequate solutions; a semantic web is a distributed landscape that naturally calls for this kind of paradigm. Therefore, I shall introduce agents and multi-agent systems and survey design methodologies. I shall focus, in particular, on multi-agent information systems.
Chapter 5 - Research context and positioning. Based on the previous surveys, this chapter will identify the requirements of the application scenarios of CoMMA and show that the requirements can be matched with solutions proposed in knowledge modelling, semantic web and distributed artificial intelligence. I shall describe and justify the overall solution envisioned and the implementation choices, and I shall also position CoMMA in the state of the art.
The second part concerns ontology & annotated corporate memory. It consists of three chapters presenting a vision of the corporate memory as a corporate semantic web as well as the approach followed in CoMMA to build the underlying ontology and the result of this approach i.e., the ontology O'CoMMA.
Chapter 6 - Corporate semantic web. It is a short introduction to the vision of a corporate memory as a corporate semantic web providing an annotated world for information agents. I shall present the motivations, the structure and the tools of this model-based documentary memory.
Chapter 7 - Engineering the O'CoMMA ontology. The ontology plays a seminal role in the vision of a corporate semantic web. This chapter explains step-by-step how the ontology O'CoMMA was built, following a scenario-based design rationale to go from the informal descriptions of requirements and specifications to a formal ontology. I shall discuss the tools designed and adopted as well as the influences and motivations that drove my choices.
Chapter 8 - The ontology O'CoMMA. The ontology O'CoMMA is the result produced by the methodology adopted in CoMMA and provides the semantic grounding of a solution. I shall describe the O'CoMMA ontology and different aspects of the vocabulary organised in three layers and containing an abstract top layer, a part dedicated to the organisational structure modelling, a part dedicated to the description of people, and a part dealing with domain topics. I shall also discuss some extensions and appropriation experiences with end-users.
The third part focuses on a multi-agent system for memory management. It consists of two chapters respectively presenting the design rationale of the whole CoMMA system and the specific design of the mechanisms handling annotation distribution.
Chapter 9 - Design rationale of the multi-agent architecture. I present the design rationale that was followed in CoMMA to obtain the multi-agent architecture supporting the envisaged corporate memory management scenarios. I shall explain every main stage of the organisational top-down analysis of functionalities and describe the characteristics and documentation of roles and protocols supporting each agent sub-society, finally coming down to their implementation into agent behaviours.
Chapter 10 - Handling annotation distribution. I detail how some aspects of the conceptual foundations of the semantic web can support a multi-agent system in allocating and retrieving semantic annotations in a distributed corporate memory. I shall show how our agents exploit the semantic structures and the ontology, when deciding where to allocate new annotations and when resolving queries over distributed bases of annotations.
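As a rough illustration of the kind of measure involved (a simplified sketch, not the actual pseudo-distance detailed in Chapter 10, and over an invented toy taxonomy), one can compare two concept types by the length of the path joining them through their closest common ancestor; an Annotation Archivist could then bid for a new annotation according to how close its types are to those it already stores.

```python
# Toy taxonomy (invented): child concept -> parent concept, up to a root.
parent = {
    "Report": "Document", "Memo": "Document",
    "Document": "Resource", "Person": "Resource",
}

def ancestors(concept):
    """Chain from a concept up to the root, the concept itself included."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def distance(a, b):
    """Edge count from a to b through their deepest common ancestor,
    or None when the two concepts share no ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = [c for c in up_a if c in up_b]
    if not common:
        return None
    lca = common[0]  # first shared concept met when climbing from a
    return up_a.index(lca) + up_b.index(lca)

print(distance("Report", "Memo"))    # → 2: sibling types under Document
print(distance("Report", "Person"))  # → 3: related only through Resource
```

Such a measure lets agents prefer the archivist whose existing annotation base is semantically closest to the incoming annotation, keeping related knowledge together.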
The fourth part gives the lessons learned and the perspectives. It consists of three chapters, the first one discussing the evaluation and return on experience of the CoMMA project, the second one describing short-term improvements and current work, and the last one proposing some long-term perspectives.
Chapter 11 - Evaluation and return on experience. This chapter gives the implementation status of CoMMA and describes the trial process used to get feedback from the end-users and carry out the official evaluation of CoMMA. I shall present the criteria used, the trial feedback, and the lessons learned.
Chapter 12 - Short-term improvements. This chapter gives the beginnings of answers to some of the problems raised by the evaluation. The ideas presented are extensions of the work done in CoMMA, some of which have already been implemented and tested, while others are shallow specifications of possible improvements.
Chapter 13 - Long-term perspectives. I provide considerations on extensions that could be imagined to move towards a complete real-world solution, and I give long-term perspectives that can be considered as my future research interests.
As shown in the schema above, the reading follows a classical U-shaped progression from generic concerns to specific contributions, back and forth. I included as much background material as I found necessary to give this document reasonable autonomy for a linear reading. I sincerely hope that the readers of these pages will find information here to build new knowledge.
— Fabien Gandon
Part - I
He who cannot draw on 3000 years of
culture is living from hand to mouth.
The objectives of knowledge management are usually structured by three key issues: capitalise (i.e., know where you are and where you come from, in order to know where you are going), share (i.e., switch from individual to collective intelligence), and create (i.e., anticipate and innovate to survive) [Ermine, 2000]. How to improve the identification, acquisition, storage, access, diffusion, reuse and maintenance of both internal and external knowledge in an organisation? This is the question at the heart of knowledge management. One approach for managing knowledge in an organisation is the set-up of an organisational memory. Information or knowledge management systems, and in particular a corporate memory, should provide the concerned persons with the relevant pieces of knowledge or information at the appropriate time, with the right level of detail and an adequate format of presentation. I shall discuss the notion of organisational memory in the second sub-section.
Then, I shall give an introduction to three domains that are now closely linked to organisational knowledge management:
- corporate modelling: the need for capturing an explicit view of the organisation that can support the knowledge management solution led to include organisational models as parts of the organisational memory.
- user modelling: the need to capture the profiles of the organisation's members as future users led the user modelling field to find a natural application in information and knowledge management for organisations.
- information retrieval systems: accessing the right information at the right moment may enable people to get the knowledge needed to make the right decision at the right time; retrieval is one of the main issues of information systems, and different techniques have been developed that can be used to support knowledge management.
Core business resource: in 1994, Drucker already noticed how knowledge had become a primary resource of society, and he believed that the implications of this shift would prove increasingly significant for organisations (commercial companies, public institutions, etc.) [Drucker, 1994]. The past decade, with its emergence of knowledge workers and knowledge-intensive organisations, proved him right. Knowledge is now considered a capital which has an economic value; it is a strategic resource for increasing productivity; it is a stability factor in an unstable and dynamic competitive environment; and it is a decisive competitive advantage [Ermine, 2000]. This also encompasses protecting the intellectual property of this capital of knowledge: as soon as a resource becomes vital to a business, security and protection concerns arise. Knowledge management is thus also concerned with the problem of securing new and existing knowledge and storing it to make it persistent over time. Also, as knowledge production becomes more critical, managers will need to do it more reflectively [Seely Brown and Duguid, 1998]. This means there is a requirement for methods and tools to assist the management processes.
Transversal resource: knowledge concerns R&D, management (strategy, quality, etc.), production (data management, documentation, workflows, know-how, etc.) and human resources (competencies, training, etc.) [Ermine, 1998]. Moreover, knowledge is often needed far from where it was created (in time and space): e.g. production data may be useless to the worker on the production line, but vital to the analyst of the management group who will mine and correlate them to build key indicators and take strategic decisions; likewise, the people of the workshop may be extremely interested in the knowledge discovered by the technology monitoring group or in lessons learned in other workshops. As summarised in [Nagendra Prasad and Plaza, 1996], timely availability of relevant information from resources accessible to an organisation can lead to more informed decisions on the part of individuals, thereby promoting the effectiveness and viability of decentralised decision making. Decision making can occur on the assembly line to solve a problem (e.g. using past project knowledge) or at the management board (e.g. managing large amounts of data and information and mining them to extract key indicators that will drive strategic decisions). Sharing an activity in a group also implies sharing the knowledge involved in that activity. To survive, the organisation needs to share knowledge among its members and to create and collect new knowledge to anticipate the future [Abecker et al., 1998].
Competitive advantage: "in an economy where the only certainty is uncertainty, the sure source of lasting competitive advantage is knowledge" [Nonaka and Takeuchi, 1995]. Globalisation is leading to strong international competition which is expressed in rapid market developments and short lifecycles for products and services [Weggeman, 1996]. Therefore, it is vital to maintain knowledge generation as a competitive advantage for knowledge intensive organisations: this is critical, for instance, in trying to reduce the time to market, costs, wastes, etc. An organisation is a group of people who work together in a structured way for a shared purpose. Their activity will mobilise knowledge about the subject of their work. This knowledge has been obtained by experience or study and will drive the behaviour of the people and therefore, determine the actions and reactions of the organisation as a whole. Through organisational activities, additional knowledge will be collected and generated; if it is not stored, indexed and reactivated when needed, then it is lost. In a global market where information flows at the speed of Internet, organisational memory loss is dangerous; an efficient memory is becoming a vital asset for an organisation especially if its competitors also acknowledge that fact.
Globalisation: globalisation was greatly accelerated by the evolution of information technology. As described by Fukuda in 1995, it started with the use of mainframe computers for business information processing in the 60s and 70s; it then spread through organisations with microcomputers, which even started to enter homes in the 80s; and from the mid-80s it definitively impacted the whole society with the introduction of networks and their world-wide webs of interconnected people and organisations. "Now that they can communicate interactively and share information, they come to know that they are affected or regulated by the information power" [Fukuda, 1995].
Turnover: as the story goes, if NASA were to send men to the moon again, it would have to start from scratch, having lost not the data, but the human expertise that made possible the event of the 20th of July 1969. An organisation's knowledge walks out of the door every night, and it might never come back. Employees, and the knowledge gained during their engagement, are often lost in the dynamic business environment [Dzbor et al., 2000]. The turnover in artefacts (e.g. software evolution, product evolution) and in members of the organisation (retirement, internal and external mobility, etc.) are major breaks in knowledge capitalisation.
Need for awareness: members of an organisation are often unaware of critical resources that remain hidden in vast repositories. Most knowledge is thus forgotten a relatively short time after it was created [Dzbor et al., 2000]. Competitive pressure requires quick and effective reactions to ever-changing market situations. The gap between the evolving, continuously changing collective information and knowledge resources of an organisation and the employees' awareness of the existence of such resources and of their changes can lead to a loss in productivity [Nagendra Prasad and Plaza, 1996]. Too often, one part of an organisation repeats the work of another part simply because it is impossible to keep track of, and make use of, knowledge in other parts. Organisations need to know what their corporate knowledge assets are and how to manage and make use of these assets to capitalise on them. They are realising how important it is to "know what they know" and to be able to make maximum use of this knowledge [Macintosh, 1994].
Project oriented management: this new way of management aims at promoting organisations as learning systems and avoiding repeating the same mistakes. Information about past projects - protocols, design specifications, documentation of experiences: both failures and successes, alternatives explored - can all serve as stimulants for learning, leading to "expertise transfer" and "cross-project fertilisations" within and across organisations. [Nagendra Prasad and Plaza, 1996]
Internationalisation and geographic dispersion: the other side of the coin of globalisation is the expansion of companies into global trusts. As companies expand internationally, geographic barriers can affect knowledge exchange and prevent easy access to information [O’Leary, 1998]. This amplifies the already existing problem of knowledge circulation and distribution between artefacts and humans dispersed across the world. Moreover, internationalisation raises the problem of making knowledge cross cultural and linguistic boundaries.
Information overload: with more and more data online (databases, internal webs, data warehouses, the Internet, etc.), “we are drowning in information, but starved of knowledge”. The questions raised are: what is relevant and what is not? What are the relevance criteria? How do they evolve? Can we predict or anticipate future needs? The amount of available information resources grows exponentially, overloading knowledge workers; too much information hampers decision making just as much as insufficient knowledge does.
Information heterogeneity: knowledge assets reside in many different places, such as databases, knowledge bases, filing cabinets and people's heads. They are distributed across the enterprise [Macintosh, 1994], and the intranet and Internet phenomena amplify that problem. The dispersed, fragmented and diverse sources augment the cognitive overload of knowledge workers, who need integrated, homogeneous views to interpret them into 'enactable' knowledge.
For all these reasons, one of the rare points of consensus in the knowledge management domain is that knowledge is now perceived as an organisational and production asset, a valuable patrimony to be managed, and that there is thus a need for tools and methods assisting this management.
As stated above, one approach for managing knowledge in an organisation is the set-up of an organisational memory: a system that should provide the concerned persons with the relevant pieces of knowledge or information at the appropriate time, with the right level of detail and an adequate format of presentation. I shall discuss this notion in the following sub-section.
I use 'corporate memory' and 'organisational memory' as synonyms; the different concepts that are sometimes denoted by these terms will be explored in the section on the typology of memories.
Organisations are societies of beings: if these beings have an individual memory, the organisation will naturally develop a collective memory that is, at the very least, the sum of these memories and often much more. From that perspective, the question here is: what is knowledge management with regard to corporate memories? For [Ackerman and Halverson, 1998] it is important to consider an organisational memory as both an object and a process: it holds its state and it is embedded in many organisational and individual processes. I stress here the difference between the static aspect (the knowledge captured) and the dynamic aspect of the memory (the ability to memorise and remember) because I believe both are unconditionally present in a complete solution.
A memory without an intelligence (singular or plural) in charge of it is destined for obsolescence, and an intelligence without a memory (internal or external) is destined for stagnation.
In fact, I see three aspects used to define what a corporate memory is: the memory content (what), i.e., the nature of the knowledge; the memory form (where), i.e., the storage support; and the memory working (how), i.e., the system managing the knowledge. Here are some elements of definition found in the literature for each of these facets:
- content: it contains the organisational experience acquired by employees and related to the work they carry out [Pomian, 1996]. It is a repository of knowledge and know-how from a set of individuals working in a particular firm [Euzenat, 1996]. It captures a company's accumulated know-how and other knowledge assets [Kuhn and Abecker, 1997]. It consists of the total sum of the information and knowledge resources within an organisation. Such resources are typically distributed and are characterised by multiplicity and diversity: company databases, machine-readable texts, documentation resources and reports, product requirements, design rationale etc. [Nagendra Prasad and Plaza, 1996] The corporate knowledge is the whole know-how of the company, i.e., its business processes, its procedures, its policies (mission, rules, norms) and its data [Gerbé, 2000]. It is an explicit, disembodied, persistent representation of the knowledge and information in an organisation [Van Heijst et al., 1996]. It preserves the reasoning and knowledge with their diversity and contradictions in order to reuse them later [Pomian, 1996]
- form: [Kuhn and Abecker, 1997] characterised the corporate memory as a comprehensive computer system. In other approaches, organisational memories take the form of an efficient librarian solution based on document management systems. Finally, a memory can rely on a human resources management policy treating humans as 'knowledge containers'.
- working: a corporate memory makes knowledge assets available to enhance the efficiency and effectiveness of knowledge-intensive work processes [Kuhn and Abecker, 1997]. It is a system that enables the integration of dispersed and unstructured organisational knowledge by enhancing its access, dissemination and reuse among an organisation’s members and information systems [von Krogh, 1998]. The main function of a corporate memory is that it should enhance the learning capacity of an organisation [Van Heijst et al., 1996]
My colleagues proposed their own definition of a corporate memory: "explicit, disembodied, persistent representation of knowledge and information in an organisation, in order to facilitate its access and reuse by members of the organisation, for their tasks" [Dieng et al., 2001]. While I find it a good digest of the different definitions, I do not think that knowledge must necessarily be disembodied and explicitly represented. I believe that a complete corporate memory solution (i.e., memory & management system) should allow for the indexing of external sources (which may be embodied in a person or held in another memory). This is because, contrary to its human counterpart, the organisational memory tends not to be centralised, localised and enclosed within a physical boundary, but distributed, diffuse and heterogeneous. Moreover, just like its human counterpart, it does not memorise everything it encounters: sometimes it is better to reference or index an external resource rather than duplicate it, so that we can still consult it but do not really have to memorise and maintain it. The reasons may be that indexing is more cost-effective, that duplication is not feasible (copyright, amount of data, difficult formalisation, etc.), that the resource is ever-changing and volatile, that its maintenance is outside our expertise, etc.
Therefore, I would suggest an extended definition:
An organisational memory is an explicit, disembodied, persistent representation and indexing of knowledge and information, or of their sources, in an organisation, in order to facilitate their access, sharing and reuse by members of the organisation, for their individual and collective tasks.
The desire for management and control over knowledge contrasts with its fluid, dispersed, intangible, subjective and sometimes tacit original nature. How can we manage something we even have trouble defining? There has therefore been an extensive effort to analyse this knowledge asset and characterise it. I shall give an overview of this discussion in the first of the following sub-parts.
Management consists of activities of control and supervision over other activities. In the case of knowledge management, what are these activities? What does the management of the collective knowledge of an organisation consist of? In the second sub-part, I shall identify and describe the different activities and processes involved.
Finally, I shall give a typology of the memories that have been envisaged so far. It could also be seen as a typology of the different facets of a complete solution trying to integrate the different types of knowledge that must flow and interact, for an organisation to live.
According to [Weggeman, 1996], knowledge and information are primitive terms, i.e., terms that are understood although they cannot be accurately defined, and whose meaning lies in the correct use of the concept, which can only be learnt by practising that use.
If we take the ladder of understanding (Figure 1) used in the librarian community, we can propose some definitions of data, information and knowledge. These definitions can be built bottom-up or top-down.
Starting from the bottom, I have merged my definitions with the disjunctive definitions of knowledge, information and data of [Weggeman, 1996] and [Fukuda, 1995]:
- Data is a perception, a signal, a sign or a quantum of interaction (e.g. '40' or 'T' are data). Data is a symbolic representation of numbers, facts and quantities; an item of data is what a natural or artificial sensor indicates about a variable [Weggeman, 1996]. Data are made up of symbols and figures which reflect a perception of the experiential world [Fukuda, 1995].
- Information is data structured according to a convention (e.g. T=40°). Information is the result of the comparison of data which are situationally structured in order to arrive at a message that is significant in a given context [Weggeman, 1996]. Information is obtained from data which have been given a significance and selected as useful [Fukuda, 1995].
- Knowledge is information with a context and value that make it usable (e.g. "the patient in room 313 of the hospital has a temperature T=40°C" is knowledge).
Wisdom can be defined as timeless knowledge, and [Fukuda, 1995] adds an intermediate step for theory, which he defines as generalised knowledge.
Interestingly, I found no trace of top-down definitions in the literature, even though, in the presenting and re-presenting cycles of individual and collective manipulation of knowledge, we go up the ladder to peruse and down the ladder to communicate. Going down the ladder, I would propose:
- Knowledge is understanding of a subject which has been obtained by experience or study.
- Information can be defined as knowledge expressed according to a convention, or knowledge in transit; the Latin root informare means "to give form to". In an interview, Nonaka explained that information is a flow of messages, while knowledge is a stock created by accumulating information; thus, information is a necessary medium or material for eliciting and constructing knowledge. The second difference he made was that information is something passive, while knowledge comes from belief and so is more proactive.
- Data can be defined as the basic element of information coding.
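The two directions of the ladder can be sketched in code. The following Python fragment is an illustrative sketch only (all names such as `Knowledge` and `is_fever` are hypothetical, not taken from any cited system): the same figure '40' is a bare datum, becomes information once structured by a convention (T=40°C), and becomes knowledge once a context of interpretation is attached that makes it usable for action.

```python
from dataclasses import dataclass

# Going up the ladder: the same quantum '40' gains meaning at each rung.
datum = "40"                             # data: a bare sign from a sensor
information = {"T": 40.0, "unit": "°C"}  # information: data structured by a convention (T=40°C)

@dataclass
class Knowledge:
    """Information plus the context and value that make it usable."""
    information: dict
    context: str  # situation of interpretation (who/what the figure refers to)

def is_fever(k: Knowledge, threshold: float = 38.0) -> bool:
    # Only at the knowledge level can the figure support an informed decision.
    return k.information["unit"] == "°C" and k.information["T"] > threshold

k = Knowledge(information=information,
              context="body temperature of the patient in room 313")
print(is_fever(k))  # the contextualised reading supports a decision: True
```

The design point is simply that each rung adds structure the previous one lacks: the datum alone cannot answer the fever question, and the information alone does not say whose temperature it is.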
I now look at the different categories and characteristics of knowledge that were identified and used in the literature:
- Formal knowledge vs. informal knowledge: in its natural state, the knowledge context includes a situation of interpretation and an actor interpreting the information. However, when we envisage an explicit, disembodied and persistent representation of knowledge, it means that it is an artificial knowledge (the counterpart of an artificial intelligence), more commonly called formalised knowledge, where information and context are captured in a symbolic system and its attached model, which respectively enable automatic processing and unambiguous interpretation of results and manipulations. This view, which was developed in artificial intelligence, is close to the wishful thinking of knowledge management according to which knowledge could be an artificial resource that only has value within an appropriate context, that can be devalued and revalued, and that is inexhaustible: used, but not consumed. Unfortunately, a lot of knowledge comes with a container (a human, a newsletter, etc.) that does not comply with this. The first axiom of [Euzenat, 1996] is that "knowledge must be stated as formally as possible (...) However, not everything can and must be formalised and even if it were, the formal systems could suffer from serious limitations (complexity or incompleteness)." It is followed by a second axiom stating that "it must be possible to wrap up a skeleton of formal knowledge with informal flesh made of text, pictures, animation, etc. Thus, knowledge which has not yet reached a formal state, comments about the production of knowledge or informal explanations can be tied to the formal corpora." [Euzenat, 1996]. These axioms acknowledge the fact that in a memory there will be different degrees of knowledge formalisation, and that the management system will have to deal with that aspect.
- Cognitivist perspective vs. constructionist perspective [von Krogh, 1998]: The cognitivist perspective suggests that knowledge consists of "representations of the world that consist of a number of objects or events". The constructionist perspective suggests that knowledge is "an act of construction or creation". Also, Polanyi [Polanyi, 1966] regards knowledge as both static "knowledge" and dynamic "knowing".
- Tacit knowledge vs. explicit knowledge [Nonaka, 1994], drawing on [Polanyi, 1966]: tacit knowledge is mental (mental schemata, beliefs, images, personal points of view and perspectives, concrete know-how, e.g. reflexes, personal experience, etc.). Tacit knowledge is personal, context-specific, subjective and experience-based, and therefore hard to formalise and communicate. It also includes cognitive skills such as intuition as well as technical skills such as craft and know-how. Explicit knowledge, on the other hand, is formalised, coded in a natural language (French, English, etc.) or an artificial one (UML, mathematics, etc.), and can be transmitted. It is objective and rational knowledge that can be expressed in words, sentences, numbers or formulas. It includes theoretical approaches, problem solving, manuals and databases. As explicit knowledge is visible, it was the first to be managed or, at least, to be archived. [Nonaka and Takeuchi, 1995] pointed out that tacit knowledge is also important and raises additional problems; this was illustrated by companies having to hire back their fired or retired seniors because they could not be replaced by newcomers having only the explicit knowledge of their educational background and none of the tacit knowledge that was vital to run the business. Tacit knowledge and explicit knowledge are not totally separate, but mutually complementary entities. Without experience, we cannot truly understand. But unless we try to convert tacit knowledge into explicit knowledge, we cannot reflect upon it and share it across the whole organisation (except through mentoring situations, i.e., master-apprentice co-working to ensure the transfer of know-how).
- Tacit knowledge vs. focal knowledge: with references to Polanyi, [Sveiby, 1997] discusses tacit vs. focal knowledge. In each activity, there are two dimensions of knowledge, which are mutually exclusive: the focal knowledge about the object or phenomenon that is in focus; the tacit knowledge that is used as a tool to handle or improve what is in focus. The focal and tacit dimensions are complementary. The tacit knowledge functions as a background knowledge which assists in accomplishing a task which is in focus. What is tacit varies from one situation to another.
- Hard knowledge vs. soft knowledge: [Kimble et al., 2001] propose hard and soft knowledge as being two parts of a duality. That is all knowledge is to some degree both hard and soft. Harder aspects of knowledge are those aspects that are more formalised and that can be structured, articulated and thus ‘captured’. Soft aspects of knowledge on the other hand are the more subtle, implicit and not so easily articulated.
- Competencies (know-how, responsible and validated) / theoretical knowledge "know-that" / procedural knowledge / procedural know-how / empirical know-how / social know-how [Le Boterf, 1994].
- Declarative knowledge (fact, results, generalities, etc.) vs. procedural knowledge (know-how, process, expertise, etc.)
- Know-how vs. skills: [Grundstein, 1995; Grundstein and Barthès, 1996] distinguish on the one hand, know-how (ability to design, build, sell and support products and services) and on the other hand, individual and collective skills (ability to act, adapt and evolve).
- Know-what vs. know-how [Seely Brown and Duguid, 1998]: the organisational knowledge that constitutes "core competency" requires know-what and know-how. The know-what is explicit knowledge which may be shared by several persons. The "know-how" is the particular ability to put know-what into practice. While both work together, they circulate separately. Know-what circulates with relative ease and is consequently hard to protect. Know-how is embedded in work practice and is sui generis and thus relatively easy to protect. Conversely, however, it can be hard to spread, co-ordinate, benchmark, or change. Know-how is a disposition, brought out in practice. Thus, know-how is critical in making knowledge actionable and operational.
- Company knowledge vs. corporate knowledge [Grundstein and Barthès, 1996]: company knowledge is the technical knowledge used inside the company, its business units, departments and subsidiaries (the knowledge needed every day by the company's employees); corporate knowledge is the strategic knowledge used by the management at corporate level (knowledge about the company).
- Distributed knowledge vs. centralised knowledge: [Seely Brown and Duguid, 1998] The distribution of knowledge in an organisation, or in society as a whole, reflects the social division of labour. As Adam Smith insightfully explained, the division of labour is a great source of dynamism and efficiency. Specialised groups are capable of producing highly specialised knowledge. The tasks undertaken by communities of practice develop particular, local, and highly specialised knowledge within the community. Hierarchical divisions of labour often distinguish thinkers from doers, mental from manual labour, strategy (the knowledge required at the top of a hierarchy) from tactics (the knowledge used at the bottom). Above all, a mental-manual division predisposes organisations to ignore a central asset, the value of the know-how created throughout all its parts.
- Descriptive knowledge vs. deductive knowledge vs. documentary knowledge [Pomian, 1996]: descriptive knowledge is about the history of the organisation, chronological events, actors, and the domain of activity. Deductive knowledge is about reasoning (diagnostic, planning, design, etc.) and its justification (considered aspects, characteristics, facts, results, etc.); it is equivalent to what some may call logic and rationale. Finally, documentary knowledge is knowledge about documents (types, use, access, context, etc.), their nature (report, news, etc.), their content (a primary document contains raw/original data; a secondary document contains identification/analysis of primary documents; a tertiary document contains a synthesis of primary or secondary documents) and whether they are dead (read-only documents, frozen once and for all) or living (changing content, etc.). J. Pomian rightly insists on the importance of the interactions between these types of knowledge. One can also find finer-grained descriptions: here the static knowledge was decomposed into descriptive and documentary knowledge, but the dynamic aspect could be decomposed into rationale (plan, diagnostic, etc.) vs. heuristics (rules of thumb, etc.) vs. theories vs. cases vs. (best) practices.
- Tangible knowledge vs. intangible knowledge [Grundstein and Barthès, 1996]: tangible assets are data, documents, etc., while intangible assets are abilities, talents, personal experience, etc. Intangible assets require elicitation to become tangible before they can participate in a materialised corporate memory.
- Technical knowledge vs. management knowledge [Grundstein and Barthès, 1996]: technical knowledge is used by the core business in its day-to-day work; strategic or management knowledge is used by managers to analyse the organisation's functioning and build management strategies.
And so on: content vs. context [Dzbor et al., 2000], explicable knowledge (with justifications) vs. not explicable (bare facts), granularity (fuzzy vs. precise, shallow vs. deep, etc.), individual vs. collective, volatile vs. perennial (from the source point of view), ephemeral vs. stable (from the content point of view), specialised vs. common knowledge, public vs. personal/private, etc.
A first remark is that the distinctions proposed here are not always mutually exclusive, nor always compatible. Secondly, knowledge in a domain is spread over a continuum between the extremes proposed here. However, most traditional company policies and controls focus on the tangible assets of the company and leave their important knowledge assets unmanaged [Macintosh, 1994].
An organisational memory may include (re)sources at different levels of the data-information-knowledge scale and of different nature. It implies that a management solution must be able to handle and integrate this heterogeneity.
The stake in building a corporate memory management system is the coherent integration of this dispersed knowledge in a corporation with the objective to "promote knowledge growth, promote knowledge communication and in general preserve knowledge within an organisation" [Steels, 1993]. This implies a number of activities to turn the memory into a living object.
In [Dieng et al., 2001] my colleagues discussed the lifecycle of a corporate memory. I added the overall activity of managing the different knowledge management processes, i.e., as described in [Zacklad and Grundstein, 2001a; 2001b], promoting, organising, planning and motivating the whole cycle. The result is depicted in Figure 2, and I shall comment on the different phases with references to the literature.
Here are some cycles proposed in the literature:
- [Grundstein and Barthès, 1996]: locate crucial knowledge, formalise/save, distribute and maintain, plus the manage activity added in [Zacklad and Grundstein, 2001a; 2001b].
- [Zacklad and Grundstein, 2001a; 2001b]: manage, identify, preserve, use, and maintain.
- [Abecker et al., 1998]: identification, acquisition, development, dissemination, use, and preservation.
- [Pomian, 1996]: identify, acquire/collect, make usable.
- [Jasper et al., 1999]: create tacit knowledge, discover tacit or explicit knowledge, capture tacit knowledge to make it explicit, organise, maintain, disseminate through push solutions, allow search through pull solutions, assist reformulation, internalise, apply
- [Dieng et al., 2001]: detection of needs, construction, diffusion, use, evaluation, maintenance and evolution.
I have tried to reconcile them in the schema in Figure 2, where the white boxes represent the initialisation phase and the grey ones the cycle strictly speaking. The central 'manage' activity oversees the others. I shall describe them with some references to the literature:
- Management: knowledge management is difficult and costly. It requires a careful assessment of what knowledge should be considered and how to conduct the process of capitalising such knowledge [Grundstein and Barthès, 1996]. All the following activities have to be planned and supervised; this is the role of a knowledge manager.
- Inspection: take stock of the current situation to identify knowledge that already exists and knowledge that is missing [Nonaka, 1991]; identify the strategic knowledge to be capitalised [Grundstein and Barthès, 1996]; make an inventory and a cartography of available knowledge: identify assets and their availability to plan their exploitation, and detect lacks and needs to offset weak points [Dieng et al., 2001].
- Construction: build the corporate memory and develop the necessary new knowledge [Nonaka, 1991]; memorise, link, index and integrate different and/or heterogeneous sources of knowledge to avoid loss of knowledge [Dieng et al., 2001].
- Diffusion: irrigate the organisation with knowledge and allocate new knowledge [Nonaka, 1991]. Knowledge must be actively distributed to those who can make use of it; the turn-around speed of knowledge is increasingly crucial for the competitiveness of companies [Van Heijst et al., 1996]. Make the knowledge flow and circulate to improve communication in the organisation: transfer the pieces of knowledge from where they were created, captured or stored to where they may be useful. This is called "activation of the memory", to prevent knowledge from falling into oblivion, buried and dormant in a long-forgotten report [Dieng et al., 2001]. It implies a facility for deciding who should be informed about a particular new piece of knowledge, and this point justifies the sections about user modelling and organisation modelling.
- Capitalisation: the process of reusing, in a relevant way, the knowledge of a given domain, previously stored and modelled, in order to perform new tasks [Simon, 1996]; to apply knowledge [Nonaka, 1991]; to build upon past experience so as to avoid reinventing the wheel; to generalise solutions, combine and/or adapt them to go further and invent new ones; to improve the training and integration of new members [Dieng et al., 2001]. Use is tightly linked to diffusion, since the way knowledge is made available conditions the way it may be exploited.
- Evaluation: it is close to inspection since it assesses the availability of and needs for knowledge. However, it also aims at evaluating the solution chosen for the memory and its adequacy, comparing the results to the requirements, the functionalities to the specifications, etc.
- Evolution: it is close to the construction phase since it deals with additions to the current memory. More generally, it is the process of updating changing knowledge and removing obsolete knowledge [Nonaka, 1991]; it is where the learning spiral takes place to enrich and update existing knowledge (improve it, augment it, make it more precise, re-evaluate it, etc.).
Memory management thus consists of two initial activities (inspection of the knowledge situation and construction of the initial memory) and four cyclic activities (diffusion, capitalisation, evaluation, evolution), all of them planned and supervised by the management.
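The lifecycle just summarised can be sketched as a small data structure separating the initialisation activities from the repeating cycle. This is only an illustration of the section's structure; the function and variable names are my own, not drawn from the cited works.

```python
from itertools import cycle, islice

# Activities of the memory lifecycle described above (the 'management'
# activity is assumed to supervise every step and is not listed).
INITIALISATION = ["inspection", "construction"]
CYCLE = ["diffusion", "capitalisation", "evaluation", "evolution"]

def lifecycle(cycle_steps):
    """Yield the initialisation activities once, then `cycle_steps`
    activities of the repeating diffusion/capitalisation/evaluation/
    evolution cycle."""
    yield from INITIALISATION
    yield from islice(cycle(CYCLE), cycle_steps)

print(list(lifecycle(4)))
# ['inspection', 'construction', 'diffusion', 'capitalisation',
#  'evaluation', 'evolution']
```

The point of the sketch is simply that inspection and construction happen once, while the other four activities recur as long as the memory lives.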
In his third axiom, [Euzenat, 1996] insisted on the collaborative dimension: people must be supported in discussing the knowledge introduced in the knowledge base. In this perspective, re-using, diffusing and maintaining knowledge should be participatory activities; this underlines the strong link between some problems addressed in knowledge management and results of research in the field of Computer Supported Collaborative Work. "Users will use the knowledge only if they understand it and they are assured that it is coherent. The point is to enforce discussion and consensus while the actors are still at hand rather than hurrying the storage of raw data and discovering far later that it is of no help." [Euzenat, 1996]
[Nonaka and Takeuchi, 1995] detail the social and individual processes at play during the learning spiral of an organisation and focus on the notions of tacit and explicit knowledge (Figure 3).
They believe that knowledge creation processes are more effective when they spiral through the following four activities to improve understanding:
- Socialisation (Tacit → Tacit): transfers tacit knowledge in one person to tacit knowledge in another person. It is an experiential and active process where knowledge is captured by walking around, through direct interaction, by observing the behaviour of others and by copying their behaviours and beliefs. This is close to the remark of [Kimble et al., 2001] that communities of practice are central to the maintenance of soft knowledge.
- Externalisation (Tacit → Explicit): getting tacit knowledge into an explicit form so that one can look at it, manipulate it and communicate it. It can be an individual articulation of one's own tacit knowledge, seeking awareness and expression of one's ideas, images, mental models, metaphors, analogies, etc. It can also consist in eliciting and expressing the tacit knowledge of others as explicit knowledge.
- Combination (Explicit → Explicit): take explicit knowledge, combine it with other explicit knowledge and develop new explicit knowledge. This is where information technology is most helpful, because explicit knowledge is conveyed in artefacts (e.g. documents) that can be collected, manipulated and disseminated, allowing knowledge transfer across organisations.
- Internalisation (Explicit → Tacit): understanding and absorbing collectively shared explicit knowledge into individual tacit knowledge actionable by its owner. Once one has deeply learned a process, it becomes completely internal and one can apply it without noticing, as a reflex or automatic natural activity. Internalisation is largely experiential, through actual doing, in a real situation or in a simulation. A symptom is that, generally, when one tries to pay attention to how one does these things, it impairs one's performance; this is a brake on elicitation processes.
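The four conversion modes above can be read as a mapping from (source form, target form) pairs to activities. The following encoding of Nonaka and Takeuchi's spiral is purely illustrative; the dictionary representation is my own, not part of the cited model.

```python
# The four knowledge-conversion modes of [Nonaka and Takeuchi, 1995],
# keyed by (form of the source knowledge, form of the target knowledge).
SECI = {
    ("tacit", "tacit"): "socialisation",
    ("tacit", "explicit"): "externalisation",
    ("explicit", "explicit"): "combination",
    ("explicit", "tacit"): "internalisation",
}

def conversion(source_form, target_form):
    """Name the activity converting knowledge between the two forms."""
    return SECI[(source_form, target_form)]

# The learning spiral visits the four modes in order:
spiral = [("tacit", "tacit"), ("tacit", "explicit"),
          ("explicit", "explicit"), ("explicit", "tacit")]
print([conversion(s, t) for s, t in spiral])
# ['socialisation', 'externalisation', 'combination', 'internalisation']
```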
Finally, concerning learning, [Van Heijst et al., 1996] differentiate two types:
- top-down learning, or strategic learning: at some management level a particular knowledge area is recognised as strategic, and deliberate action is planned and undertaken to acquire that knowledge;
- bottom-up learning: a worker learns something which might be useful, and this lesson learned is distributed through the organisation.
To the best of my knowledge, no one has so far tackled all the different types of knowledge and management activities in one project and come up with a complete solution. Every study and project has focused on selected types of knowledge and knowledge management activities. As a result, a typology of memories and management systems can be drawn.
From an external point of view, [Barthès, 1996] gives several facets of the corporate memory problem that can be considered: socio-organisational, economic, financial, technical, human and legal.
From an internal point of view, the first seminal typology is that of [Van Heijst et al., 1996]. It is based on the envisaged management processes, as shown in Table 1.
The knowledge attic is a corporate memory used as an archive which can be consulted when needed and updated when wanted. It is not intrusive, but it requires high discipline from the organisation members if it is not to become obsolete.
The knowledge sponge is a corporate memory actively fed to keep it more or less complete. Its use is left to the individual responsibility of organisation members.
The knowledge publisher is a corporate memory where contribution is left to the individual workers, while memory maintainers analyse the incoming knowledge, combine it with the stored knowledge and forward the relevant news to potentially interested members.
The knowledge pump is a corporate memory that ensures that the knowledge developed in the organisation is effectively captured from and used by members of the organisation.
Other researchers classify memories depending on the knowledge content:
- trade/profession/technical memory: composed of the referential, the documents, the tools and the methods used in a given profession [Tourtier, 1995]. It includes knowledge about a domain, investigations and research results [Pomian, 1996], and the knowledge used every day inside the organisation by its members to perform their daily activity [Grundstein and Barthès, 1996].
- managerial memory: related to organisation, activities, products and participants [Tourtier, 1995]; about the organisation itself, its structure, its arrangement, its management principles, its policy, its history [Pomian, 1996]. It captures the past and present organisational structures of the enterprise (human resources, management, etc.) [Grundstein and Barthès, 1996]. This memory is extremely close to the field of organisation modelling that I shall describe in a following section.
- individual memory: characterised by the status, competencies, know-how and activities of a given member of the enterprise [Tourtier, 1995].
- project memory: it is acquired in the context of a project, which must be saved with the knowledge to preserve its meaning [Pomian, 1996]. It comprises the project definition, activities, history and results [Tourtier, 1995]. It is used to capitalise lessons and experience from a given project: although the ephemeral nature of projects is seductive from an organisational point of view to augment flexibility, adaptability, etc., the memory of the project suffers from this volatility. The project memory preserves the technical memory and managerial memory mobilised for a project. It can be the memory of an on-going project or of a past project. In any case, it is important to capture the course/progress, rationale and context [Pomian, 1996].
Finally, one can classify the memories on their form and additional characteristics:
- Non computational memory: A non computational memory [Dieng et al. 1999] is made of physical artefacts (paper-based documents, video tapes, etc.) capturing knowledge that had never been elicited previously. [Dieng et al. 1999] distinguish two different aims to build such a memory: to elaborate synthesis documents on knowledge that is not explicit in reports or technical documentation, and is more related to the know-how of the experts of the enterprise; to improve enterprise production through experts' propositions on their tasks in a design process.
- Data warehouses and data marts: in many companies, one of the first KM tools is a data warehouse, i.e., a central storage area for an organisation's transaction data [O'Leary, 1998]. It usually replicates, or at least accesses, content from many of the organisation's databases. From this, it provides a wide variety of data and tries to present a coherent picture of business conditions at a single point in time. Automatic report generation systems based on queries and views of the databases, as well as more complex knowledge discovery and data mining techniques, may be used to enable knowledge workers (essentially managers) to collect information supporting management decision-making. Data marts are like data warehouses but usually smaller: they focus on a particular subject or department and may be subsets of larger data warehouses. They are usually linked to a community of knowledge workers.
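As a toy illustration of the warehouse/mart relation just described, a data mart can be seen as a subject-focused subset of the warehouse over which simple reports are computed. All record fields and values below are invented for the example.

```python
# An invented 'warehouse' of transaction records, each tagged with the
# department it belongs to.
warehouse = [
    {"dept": "sales",   "product": "A", "amount": 120},
    {"dept": "sales",   "product": "B", "amount": 80},
    {"dept": "support", "product": "A", "amount": 30},
]

def data_mart(records, dept):
    """A data mart as a subject-oriented subset of the warehouse."""
    return [r for r in records if r["dept"] == dept]

def report(records):
    """A simple automatic report: total amount per product."""
    totals = {}
    for r in records:
        totals[r["product"]] = totals.get(r["product"], 0) + r["amount"]
    return totals

print(report(data_mart(warehouse, "sales")))  # {'A': 120, 'B': 80}
```

Real warehouses are of course relational stores queried in SQL; the sketch only shows the filter-then-aggregate pattern that report generation relies on.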
- Internal vs. external memories: a corporate memory need not be restricted to the sole organisation [Rabarijaona et al., 1999]. An internal corporate memory relies on internal resources, while an external corporate memory rather includes information and knowledge stemming from the external world but useful for the organisation's activities. The retrieval and integration of information available on the Web may be interesting; I call such a system a corporate portal to the outside "world wild web", i.e., a system that uses the organisation's memory as a filter to tame the heterogeneity and information overload of the Web.
- Document-based memory or knowledge warehouses: it relies on existing documents to build a memory. The construction of such a memory begins with the collection of the different documents and requires an interface to manage them (addition of documents, retrieval of documents, etc.) [Dieng et al. 1999]. "A good documentation system is very likely the least expensive and the most feasible solution to knowledge management" [. Examples of such documents are:
- documents linked to projects: specifications of the product to be designed or manufactured, design documents, test documents, contractual technical reports,
- reference bibles in a given profession,
- visual documents such as photos, scanned plans, iconographic documents,
- technical reports, scientific or technical articles,
- books, theses, norms, archive documents, guides, dossiers of technological intelligence,
- on-line documentation, user manuals, reference manuals, business dossiers, etc.
- Knowledge-based corporate memory: such a memory is based on the elicitation and explicit modelling of knowledge from experts [Dieng et al. 1999]. It can be combined with the previous document-based memory by indexing documents through a formal representation of the knowledge underlying them. However, the goal of this approach is to provide assistance to users, supplying them with relevant corporate information but leaving them the responsibility of a contextual interpretation and evaluation of this information [Kuhn and Abecker, 1997].
- Case-based memories: organisations have a collection of past experiences (successes or failures) that can be represented explicitly in the same formalism in order to compare them [Dieng et al. 1999]; these formalised experiences are called cases, and their management can exploit case-based reasoning [Simon and Grandbastien, 1995; Simon, 1996]. [Dieng et al. 1999] distinguish two aims: avoid the scattering of expertise by concentrating the knowledge of all experts in dedicated cases; allow a continuous evolution of the memory thanks to the progressive addition of new cases. Case-based reasoning makes it possible to capitalise upon cases already encountered in order to solve new ones. The retrieval mechanism is built around a similarity measure to find past cases close enough to suggest a solution that could be reused or adapted to the new problem to be solved. This approach is very useful for project memories.
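The retrieval step just described can be sketched as a nearest-neighbour search over attribute overlap. The cases, attributes and similarity measure below are invented for illustration; real CBR systems use domain-specific measures and add revision and adaptation steps.

```python
def similarity(case, problem):
    """Toy similarity: the fraction of the problem's attributes that
    the stored case matches. Real measures are domain-specific."""
    shared = sum(1 for k, v in problem.items() if case.get(k) == v)
    return shared / len(problem)

def retrieve(case_base, problem):
    """Return the stored case most similar to the new problem."""
    return max(case_base, key=lambda c: similarity(c, problem))

# Invented example: past diagnoses indexed by observed symptoms.
case_base = [
    {"noise": "grinding", "speed": "low",  "solution": "replace bearing"},
    {"noise": "whistle",  "speed": "high", "solution": "tighten belt"},
]
new_problem = {"noise": "grinding", "speed": "medium"}
print(retrieve(case_base, new_problem)["solution"])  # replace bearing
```

The new problem matches no stored case exactly; retrieval returns the closest one, whose solution can then be reused or adapted.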
- Distributed Memory: a distributed memory is interesting for supporting collaboration and knowledge sharing between people or organisations dispersed geographically. It is essential for virtual enterprises made of distributed organisations and teams of people that meet and work together online [Dieng et al., 1999]. Distributed memory naturally relies on internet and Web technologies.
- People-based memory: individuals represent a prime location where the intellectual resources of an organisation reside [Dzbor et al., 2000]. Thus, another frequently used corporate application is a human resource knowledge base about employee capabilities and skills, education, specialities, previous experience, etc. [O'Leary, 1998] Although it is important to disembody and materialise knowledge to make it perennial and perpetuate it, it is also clear that, so far, not all knowledge can be captured in a symbolic system to be memorised. In that case it is important to capture the identity of the sources (e.g. an expert) and index these system-external sources to include them in the overall memory, and also to know which parts of the current memory are not disembodied and safely stored. [Drucker, 1994] states that if knowledge is the key concept in our future society, the 'person' will be central, because knowledge, although stored in books and databases, is actually embodied in a person. [Liao et al., 1999] proposed a competence knowledge base system for the organisational memory to facilitate the finding of an appropriate contact person. Another type of expert matchmaker system is proposed by [Sangüesa Sol and Pujol Serra, 1999]: it builds a representation of acquaintance networks by mining Web page references. Other systems use document co-authoring, recommender systems, etc.
Although this last categorisation is quite technical, one should not reduce knowledge management to a technical problem. Traditionally, enterprises have addressed knowledge management from either a management or a technological point of view [Abecker et al., 1998]. The set-up of an organisational memory requires an approach that is balanced and integrated in its 'multidisciplinarity' [Dieng et al., 2001].
According to [Ackerman and Halverson, 1998], there does not exist an organisational memory, but rather a supra-individual memory, using several people and many artefacts; this vision is close to the one adopted in distributed cognition. These authors believe that this view of a network of artefacts and people, of memory and of processing, bound by social arrangements, provides a deeper and ultimately more usable understanding of organisational life. Effective management of knowledge requires hybrid solutions that involve both people and technology [
The different types or aspects I compartmentalised here are not mutually exclusive. Hybrid systems are being studied and developed along different dimensions, and interesting problems of interaction between these dimensions arise; e.g. [Nagendra Prasad and Plaza, 1996] studied corporate memories as distributed case bases.
A memory is likely to be an integrated hybrid system both from the technical point of view (several technical approaches may be combined) and from the physical one (documents, software and other artefacts, people, communities, etc. are involved in a solution). It requires an approach that is balanced and integrated in its 'multidisciplinarity'.
I do not intend to make a complete state of the art of enterprise modelling, because this field is very large and most of the contributions are noticeably far from my concerns. However some interesting points have been selected and are reported here since they are related to the notion of Organisational Memories.
For [Rolstadås, 2000], a model is an abstract and limited representation of a piece of reality, expressed in terms of some formalism, that can be used to obtain information about that piece of reality. Therefore, an organisation model is used to give a limited representation of an organisation. [Rolstadås, 2000] quotes several definitions of the enterprise model: some tend to adopt more generic definitions than others; they vary in their focus and in their definition of an enterprise; some have a single view while others handle multiple views; and so on.
[Solberg, 2000] explains, with reference to [Vernadat, 1996], that organisation modelling is the set of activities, methods and tools related to developing models for various aspects of an organisation. He believes that such a model exists in any organisation, be it small or large, but that it is poorly formalised, scattered across organisation charts, documented operational procedures, regulation texts, databases, knowledge bases, data files, application programs and, to a large extent, the minds of the members of the organisation. "Methods and tools are required to capture, formalise, maintain, and use this knowledge for better operation and control of complex systems such as manufacturing enterprises." [Solberg, 2000]
For [Szegheo, 2000] an organisation model can be made of several sub-models including (but not limited to) process models, data models, resource models and structural models. The organisation can be viewed from different aspects and the author underlines that in practice all these aspects cannot be taken into account in one model since the result would be too complex to handle and work with. The objective of a model is "neither to fully describe all aspects of a manufacturing enterprise nor to model the entire enterprise. This would be useless, nearly impossible, and certainly endless as enterprises are tremendously complex systems in terms of number of entities involved, things to do, decision variables to be considered, and processes to be controlled." [Solberg, 2000]
Usually the model contains those aspects that are crucial for solving the problem that is being considered [Szegheo, 2000] i.e., the model depends on the task it is used for. "The degree of abstraction and simplification depends on the interest of the targeted audience" [Szegheo, 2000]. Thus the model depends on the stakeholders of the application scenario it was designed for.
To generalise, the degree of abstraction and simplification, as well as the points of view adopted, depends on the specifications of the (computerised or not) system exploiting the formal model, and therefore ultimately on the stakeholders' expectations. Hence [Solberg, 2000], with reference to [Vernadat, 1996], insists on the fact that an enterprise model must have a finality defined by the goal of the modeller. He gives examples of such finalities:
- to better represent and understand how the organisation or some part(s) of it works,
- to capitalise acquired knowledge and know-how for later reuse,
- to rationalise and secure information flows,
- to design or redesign and specify a part of the organisation,
- to analyse some aspects of the organisation,
- to simulate the behaviour of some part(s) of the organisation,
- to make better decisions about the organisation's operations and structure,
- to control, co-ordinate, or monitor some parts of the organisation.
Organisation modelling is currently extensively used for organisation design concerns. Examples of problems addressed by techniques using organisation modelling are:
- organisation development (e.g. [Alfines, 2000])
- organisation integration (e.g. [Røstad, 2000])
- organisation simulation (e.g. [Szegheo and Martinsen, 2000])
- performance measurement (e.g. [Deng, 2000])
- self-assessment (e.g. [Fagerhaug, 2000])
- business process improvement (e.g. [Andersen, 2000])
- setting up an extended organisation (e.g. [Szegheo and Petersen, 2000])
It is clear that any kind of organisation model serves a purpose. There are many different purposes, but fundamentally any model aims to "make people understand, communicate, develop, and cultivate solutions to business problems [the difference between different models] might lay in the purpose of the model, the content of the model, the quality of the formalism and manifestation, the level of abstraction, and the span of existence." [Szegheo, 2000]
Thus, so far, the enterprise modelling field has been mainly concerned with the simulation and optimisation of production system design against relevant criteria called performance indicators. Such modelling aims to improve industrial competitiveness. It provides benchmarks for business processes and is used for business process re-engineering.
As is largely acknowledged in contemporary literature, 'globalisation' and the 'information society' modified the market rules of the game and set new constraints on its stakeholders. [Rolstadås, 2000] notices that "there is an industrial change in direction of organising work in projects. This change from operations management to project management involves that enterprises to a larger extent will handle their business as projects and use project planning and control tools rather than the classic operations management tools". In fact, this introduces the necessity of being able to create, manage and dissolve ephemeral teams when necessary, to adapt to the dynamics of the market. One of the new stakes of this situation is to be able to capitalise and reuse the knowledge from past project experiences once their structure (the team) has dissolved into a new organisation.
[Rolstadås, 2000] also identifies "a trend toward organising work to use teams that are designed on an interdisciplinary basis. This enables things to be done more in parallel than earlier and thus reduces time to market. It also stimulates new innovation, often in the intersection between technology and social sciences". This new trend leads to the problem of managing and integrating multiple expertise points of view in the design rationale and then in the corporate memory to enable the history of an older project to be revisited and to take advantage of this experience in new projects. "The virtual nature of the agile organisation entails a greater degree of communication, co-ordination, and co-operation within and among enterprises; the agile organisation must be integrated from the structural, behavioural, and informational point of view. (...) The drive for more agile enterprises requires a degree of integration that is not possible with-out the use of a sophisticated information infrastructure. At the core of this infrastructure lies an enterprise model." [Fox and Gruninger, 1998]
Another trend is the lifecycle follow-up aspect. "This takes environment and sustainability into account. Products must be made for the entire lifecycle including scrapping, disassembly, or recycling." [Rolstadås, 2000]. The lifecycle aspect also implies the transfer of information or knowledge from one stage to another (e.g.: from assembly to disassembly) and therefore, it sets constraints on the documentation and more broadly the memory attached to one product.
"An enterprise model is a computational representation of the structure, activities, processes, information, resources, people, behaviour, goals, and constraints of a business, government, or other enterprise. It can be both descriptive and definitional - spanning what is and what should be. The role of an enterprise model is to achieve model-driven enterprise design, analysis, and operation." [Fox and Gruninger, 1998] Organisations are now aware that their operation is strongly influenced by knowledge management; therefore, the organisation model has a role to play in knowledge management solutions too. The new trends exposed by Rolstadås and the shift in market rules led enterprises to become aware of the value of their memory and of the fact that the enterprise model has a role to play in this application too. [Rolstadås, 2000] notices that enterprise models may well constitute a theoretical basis for the information system in an enterprise, and are regarded by many as a substantial opportunity to improve the global competitiveness of industry. Achieving organisational integration requires an "infrastructure that supports the communication of information and knowledge, the making of decisions, and the co-ordination of actions. At the heart of this infrastructure lies a model of the enterprise. (...) It would not be overly general to say that most information systems in use within an enterprise incorporate a model of some aspect of the enterprise's structure, operations, or knowledge." [Fox and Gruninger, 1998] However, this modelling needs to be systematic, explicit and integrated into the whole organisational information system.
As noticed by [Szegheo, 2000], the enterprise model, like any model, has to be expressed in terms of a language, which could be formal or informal. The richest languages are natural languages, so their use would seem logical; the problem is that they lack formality and their interpretation is not universal. A good modelling language is formalised, and its usage and meaning are unambiguous. As stressed later, ontologies are used to capture the intended and unambiguous meaning of the modelling primitives. An organisational ontology defines the concepts relevant for the description of an organisation, e.g. organisational structure, processes, strategies, resources, goals, constraints and environment. Such an ontology can be used to make explicit models for automating business engineering or supporting the exchange of information and knowledge in the enterprise [Fraser, 1994]. Moreover, by capturing the essential characteristics of the entities existing in an organisation and their relations, this representation should eliminate much of the programming required to answer "simple" common-sense questions about the enterprise [Fox and Gruninger, 1998].
An organisational model is an explicit representation of the structure, activities, processes, flows, resources, people, behaviour, goals, and constraints of an organisation. The corresponding ontology captures the essential characteristics of the modelled entities and forms of relations existing between them in an unambiguous consensual vocabulary.
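A minimal way to picture this definition is an ontology given as sets of concept and relation names, and a model given as assertions over them. The encoding below is an illustrative sketch of my own (the entity names are invented); it corresponds to no particular formalism among those surveyed next.

```python
# Sketch: an organisational ontology as vocabularies of concepts and
# relations, and an organisation model as (subject, relation, object)
# assertions using that vocabulary. All individual names are invented.
CONCEPTS = {"Organisation", "Process", "Resource", "Person", "Goal"}
RELATIONS = {"part-of", "performs", "uses", "pursues"}

model = [
    ("design-dept", "part-of", "acme"),
    ("design-dept", "performs", "product-design"),
    ("product-design", "uses", "cad-workstation"),
]

def query(assertions, relation):
    """Answer a simple 'common sense' question: which pairs are
    linked by the given relation?"""
    return [(s, o) for s, r, o in assertions if r == relation]

print(query(model, "performs"))  # [('design-dept', 'product-design')]
```

Even this toy shows the division of labour: the ontology fixes the unambiguous vocabulary, while the model holds the assertions about one particular organisation.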
Here is an overview of some approaches that use a more or less explicit ontology. It is essentially based on the excellent article by [Fox and Gruninger, 1998], where a more complete version can be found:
- CIMOSA (Computer-Integrated Manufacturing - Open System Architecture) [AMICE, 1993] [Bernus et al., 1996]: integration of enterprise operations by means of information exchange within the enterprise. It defines an integrated methodology to support all phases of a CIM system lifecycle, from requirements specification through system design, implementation, operation, and maintenance. This description is used to control the enterprise operation and to plan, design, and optimise updates of the real operation environment. It defines four different views of the enterprise: (1) the functional structure, the related control structure and the workflow; (2) the information required by each function; (3) the resources and their relationships to other structures; (4) the enterprise organisational structures and responsibilities.
- Enterprise Ontology [Uschold et al., 1998]: this ontology supports integrating methods and tools for capturing and analysing key aspects of an enterprise. The ontology is semi-formal; it provides a glossary of terms expressed in a restricted and structured form of natural language, supplemented with a few formal axioms. It comprises five parts: (1) meta-ontology: entity, relationship, role, actor, and state of affairs; (2) activities and processes: activity, resource, plan, and capability; (3) organisation: organisational unit, legal entity, management, and ownership; (4) strategy: purpose, strategy, help to achieve, and assumption; (5) marketing: sale, product, vendor, customer, and market.
- Enterprise-wide Data Modelling [Scheer, 1989]: this ontology undertakes to construct data structures for typical functional areas (departments), such as production, engineering, purchasing, human resources, sales and marketing, accountancy, and office administration, with the aim of supporting planning, analysis, and traditional accounting systems in general. It uses the entity-relationship model to systematically develop the data structures for the enterprise in terms of entities (something that can be identified in the users’ work environment), attributes (characteristics-properties of an entity), and relationships (the association of entities).
- GERAM (Generic Enterprise Reference Architecture And Methodology): this ontology is about the integrated enterprise [Bernus et al., 1996]. It covers products, enterprises, enterprise integration, and strategic enterprise management.
- IDEF Ontology: the ontologies developed at KBSI are intended to provide a rigorous foundation for the reuse and integration of enterprise models [Fillion et al., 1995]. The ontology is a first-order theory consisting of a set of foundational theories, along with a
- MILOS [Maurer and Dellen, 1998]: this approach relies on a process-oriented view of the organisation inspired by research on workflow management: for example, it offers a process modelling language for representing knowledge about work processes (e.g. "process, product and resource models, project plans and schedules, products developed within projects, project traces, background knowledge such as guidelines, business rules, studies").
- NIST Process Specification Language [Schlenoff et al., 1996]: it aims to facilitate the complete and correct exchange of process information among manufacturing applications (e.g. scheduling, process planning, simulation, project management, workflow, business-process re-engineering). The core provides the basis for representing the simplest processes (e.g. time, resource, activity). The outer core is for describing common processes (e.g. temporal constraints, resource grouping, alternative tasks). Extensions group representation primitives for particular sets of applications that provide added functions (e.g. goals, intentions, organisation constraints, products). Application-specific extensions correspond to a specific application.
- OLYMPIOS [Beauchène et al., 1996]: this approach models an enterprise organisation using a model stemming from quality management and focusing on "customer-supplier" relationships between the enterprise members.
- PERA (Purdue Reference Architecture): this approach is interested in enterprise modelling for computer-integrated manufacturing (CIM) [Bernus et al., 1996] [Williams, 1991]. The functional descriptions of the tasks and functions of the enterprise are divided into two major streams: (1) decision, control, and information, and (2) manufacturing and customer service.
- Process Interchange Format (PIF): it is an interchange format to support the automatic exchange of process descriptions among a wide variety of business-process modelling and support systems (e.g. workflow tools, process simulation systems, business-process re-engineering tools, process repositories). PIF serves as a common format to support interoperability. PIF is a formal ontology structured as a core plus a set of extensions. The top level defines the classes "activity, object, agent, and time point" as well as the relations "performs, uses, creates, modifies, before, and successor".
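The PIF top level quoted above can be pictured as a tiny schema check: a process description is a set of typed instances linked by the core relations. The process description below is invented, and the validation function is my own sketch, not part of the PIF specification.

```python
# PIF's top-level classes and relations, as quoted above.
CLASSES = {"activity", "object", "agent", "timepoint"}
RELATIONS = {"performs", "uses", "creates", "modifies", "before", "successor"}

# An invented process description: writing then reviewing a report.
instances = {"write": "activity", "review": "activity",
             "alice": "agent", "report": "object"}
statements = [("alice", "performs", "write"),
              ("write", "creates", "report"),
              ("write", "before", "review")]

def well_formed(instances, statements):
    """Check that every instance belongs to a top-level class and every
    statement uses a top-level relation."""
    return (all(c in CLASSES for c in instances.values())
            and all(r in RELATIONS for _, r, _ in statements))

print(well_formed(instances, statements))  # True
```

Tools exchanging such descriptions can then agree on the core while each adds its own extensions, which is the interoperability argument made above.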
- TOVE (Toronto Virtual Enterprise Ontology): the goal of TOVE [Fox et al., 1993] is to create an ontology for both commercial and public enterprises that every application can jointly understand and use, with the meaning of each term precise and unambiguous. The associated test bed provides an environment for analysing enterprise ontologies; it provides a model of an enterprise and tools for browsing, visualisation, simulation, and deductive queries. TOVE currently spans knowledge of activity, time and causality, resources, cost, quality, organisation structure, product, and agility. TOVE aims at creating a generic, reusable enterprise data model with the following characteristics: it provides a shared terminology for the enterprise that each agent can jointly understand and use; it defines the meaning of each term (i.e. its semantics) in as precise and unambiguous a manner as possible; it implements the semantics in a set of axioms that enable the automatic deduction of answers to many "common sense" questions about the enterprise; it defines a symbology for depicting a term, or the concept constructed thereof, in a graphical context. The TOVE reusable representation is a significant ontological engineering effort for industrial concepts.
My first remark is on the paradoxical aspect of modelling which arises as soon as a model has to be used by people that may not have been involved in its design or when the design is subject to a consensus. [Solberg, 2000], again with reference to [Vernadat, 1996], explained it perfectly:
"The opposite side of the coin is that users are often looking for oversimplified techniques, which do not go far enough in details and at the end have little value. The difficulty for tool builders is to develop sophisticated modelling and analysis environments which hide this complexity and have a user-friendly interface, good graphical model representations, and 'talk' the language of the user while at the same time offering powerful analysis and simulation capabilities.
Ultimately, the success of an enterprise model depends on if it works appropriately, and the best way to find this out is to test it. Such a test will uncover how the enterprise model works." [Solberg, 2000]
My second remark is that in this Ph.D., my goal was not to evaluate the model of an organisation or optimise it to support enterprise evolution. The model I envisaged aimed at supporting corporate memory activities involved in the application scenario (this is why the enterprise modelling state of the art is focused and limited). As [Papazoglou and Heuvel, 1999] stressed, it is necessary for information systems to have an understanding of the organisational environment, its goals and policies, so that the resulting systems will work effectively together with human agents to achieve a common objective.
User modelling started in the early 80's and human-computer interaction in the 60's. It would be a tremendous task, completely out of the scope of this Ph.D., to make a comprehensive state of the art of these two domains. However, in a computer-assisted solution for corporate memory management, the interaction with users has to be studied. The two fields are closely linked both historically and in their research objectives; in human-computer interaction, the human is quite often a user whose model must be taken into account to improve the behaviour of the system. In a knowledge management perspective, the user is part of the context, and the context is an important factor when knowledge is handled; therefore, user modelling has a role to play in knowledge management solutions. On the other side, the problem of modelling humans and their cognitive activities raises considerations that fall within the competence of knowledge representation and knowledge-based systems. Thus the two domains can complement each other.
The first aspect of the problem is the nature and content of the model. There are two types of models: individual models and group or stereotype models.
An individual user model consists of representations of assumptions about one or more types of user characteristics [Kobsa, 2001]. It includes assumptions about their knowledge, beliefs, goals, preferences, interests, misconceptions, plans, tasks, abilities, work context, etc. The forms that a user model may take are as varied as the purposes for which user models are formed. User models may seek to describe: the cognitive processes that underlie the user's actions; the differences between the user's skills and expert skills; the user's behavioural patterns or preferences; or the user's characteristics [Webb et al., 2001]. With reference to [Kobsa et al., 1999], Brusilovsky suggests distinguishing adaptation to user data, usage data, and environment data. User data comprise the adaptation target: various characteristics of the users. Usage data comprise data about user interaction with the system that cannot be resolved to user characteristics. Environment data comprise all aspects of the user environment that are not related to the users themselves (e.g. user location and user platform). [Brusilovsky, 2001] divides the user data into:
- User Characteristics (user's goals/tasks, knowledge, background, experience, and preferences),
- User interests (long-term interests such as a passion and short-term interests such as a search goal),
- User's individual traits (e.g. personality factors, cognitive factors, and learning styles).
User group and stereotype models are representations of relevant common characteristics of users pertaining to specific user subgroups of the application system [Kobsa, 2001].
A user model is the explicit representation of assumptions or facts about a real user or a stereotype. It includes the user's characteristics (assumptions about the knowledge, beliefs, goals, preferences, interests, plans, tasks, and abilities, work context, etc.), the past usage, and the environment. NB: A user model may be part of an organisational model.
Based on models, there exist two categories of adaptation of human-computer interaction:
- Manual customisation: offers the users the capability to select or switch between different alternative interaction characteristics, among the ones built into the system [Stephanidis, 2001].
- Automatic adaptation: the system is capable of identifying the circumstances that require adaptation and, accordingly, of selecting and effecting an appropriate course of action. This implies that the system possesses the capability to monitor user interaction and to use the monitoring data as the basis upon which it draws assumptions, continuously verifying, refining, revising, and, if necessary, withdrawing them [Stephanidis, 2001].
Of course, the latter is the most complex and the richest one, and it is on this point that the field focuses. It can be applied at both the individual and the collective level.
[Zukerman and Albrecht, 2001] differentiates between the content-based approach where the behaviour of users is predicted from their past behaviour, and the collaborative approach where the behaviour of users is predicted from the behaviour of other like-minded people:
- Content-based learning is used when users' past behaviour is a reliable indicator of their future behaviour so that a predictive model can be built. This approach is ideal for tailoring a system's behaviour to the specific requirements of a particular user, but it requires each user to provide relatively large amounts of data to enable the construction of the model. In addition, the selected features have a substantial effect on the usefulness of the resulting model. If they are too specific, the system is useful only for repetitive behaviours, while if they are too general, predictions are of debatable usefulness [Zukerman and Albrecht, 2001].
- Collaborative learning is used when a user behaves in a similar way to other users. A model is built using data from a group of users, and it is then used to make predictions about a particular individual user. This approach reduces the data collection burden for individual users, and can be implemented using the specific values of the data without obtaining features with the "right'' level of abstraction. However, it does not support tailoring a system to the requirements of a particular user and there is a risk to conflate all users into a model that represents an "average user" [Zukerman and Albrecht, 2001].
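To illustrate the collaborative approach, the following sketch predicts a user's unknown rating from the ratings of the most similar other users. This is a minimal nearest-neighbour illustration under my own assumptions; the data and all names are hypothetical and not taken from any system cited above:

```python
# Minimal sketch of collaborative prediction: a user's unknown rating is
# estimated as a similarity-weighted average of other users' ratings.
# All data and names are illustrative.
from math import sqrt

ratings = {
    "ann":  {"doc1": 5, "doc2": 3, "doc3": 4},
    "bob":  {"doc1": 4, "doc2": 2, "doc3": 5},
    "carl": {"doc1": 1, "doc2": 5, "doc3": 2},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (sqrt(sum(ratings[u][i] ** 2 for i in common)) *
           sqrt(sum(ratings[v][i] ** 2 for i in common)))
    return num / den

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    pairs = [(similarity(user, v), ratings[v][item])
             for v in ratings if v != user and item in ratings[v]]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None
```

The "average user" risk mentioned above is visible here: the prediction for "ann" is pulled towards whatever the neighbourhood of similar users did, regardless of her own specificities.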
[Fischer, 2001] also distinguishes between three techniques to find out what the user really knows and does: being told by the users (e.g. by questionnaires, setting preferences, or specification components); being able to infer it from the user's actions (e.g. by using critics) or usage data; and communicating information about external events to the system. However, as noticed by Zukerman and Albrecht, these approaches have complementary advantages and disadvantages that call for a solution combining both types of modelling approaches.
A number of goals can be pursued exploiting these models such as:
- user classification: classification of users as belonging to one or more subgroups, and the integration of the typical characteristics of these subgroups into the current individual user model [Kobsa, 2001].
- collaborative or clique-based processing by comparison of different users' selective actions: users' selective actions are matched with those of other users, and users' future selective actions are predicted based on those of the most similar other users [Kobsa, 2001].
- user prediction or simulation: formation of assumptions about the user, based on the interaction history in order to predict future action or simulate the user's behaviour.
Whatever the goal, a recurrent problem to address is that of detecting patterns in user models and behaviour. To that purpose, user-modelling techniques draw upon machine learning, probabilistic methods, and logic-based methods. Several different statistical models have been used in the framework of both the content-based and the collaborative approach. The main models are: linear models, TF*IDF-based models, Markov models, neural networks, classification and rule-induction methods, Dempster-Shafer theory and Bayesian networks [Zukerman and Albrecht, 2001]. Using machine learning, users' behaviour provides training examples that are then used to form a model designed to predict future actions.
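As an illustration of one of the statistical models listed above, a first-order Markov model can predict a user's next action from counted transitions in past sessions. This is a minimal sketch with hypothetical session data, not a reconstruction of any specific system from the literature:

```python
# First-order Markov model of user actions: count observed transitions,
# then predict the most likely next action. Illustrative sketch only.
from collections import Counter, defaultdict

def train(action_sequences):
    """Count transitions (action -> next action) over all observed sessions."""
    transitions = defaultdict(Counter)
    for seq in action_sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, action):
    """Return the most frequently observed successor of the given action."""
    followers = transitions.get(action)
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical logged sessions of user actions.
sessions = [
    ["open", "search", "read", "save"],
    ["open", "search", "read", "search"],
    ["open", "read", "save"],
]
model = train(sessions)
```

The concept-drift problem mentioned below appears immediately in such a model: the counts encode past habits and must be decayed or retrained as the user's behaviour changes.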
However several problems challenge the domain, namely: the need for large data sets; the need for labelled data; concept drift i.e., the changing nature of the knowledge learned; and the computational complexity [Webb et al., 2001]. Moreover, whereas much of the academic research in machine learning for user-modelling concentrates on modelling individual users, many of the emerging applications in electronic commerce relate to forming generic models of user communities [Webb et al., 2001].
Plan recognition is a special kind of pattern detection that tries to recognise the users' goals and their intended plan for achieving them; the motto here is to build "systems that do what you mean, not what you say". Most plan recognition systems start with a set of goals that a user might be expected to pursue in the domain and with an observed action by the user. The plan recognition system then infers the user's goal and determines how the observed action contributes to that goal. For this, it needs a set of actions that the user might execute in the domain and a set of recipes that encode how a user might go about performing these actions. Recipes constitute a plan library and include, for each action, the preconditions, the sub-goals, and the effects of executing the action. Classical reasoning uses chaining algorithms between preconditions and post-conditions, and a wide variety of mechanisms have been proposed for narrowing the space of viable hypotheses about a user's plan. The reasoning performed by the system uses domain-dependent knowledge, but it is itself largely domain-independent [Carberry, 2001].
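The recipe-based reasoning just described can be illustrated with a toy plan library: candidate goals are those whose recipes, expanded recursively through sub-goals, contain the observed action. This is a deliberately simplified sketch that ignores preconditions and effects; all goal and action names are hypothetical:

```python
# Toy plan library: each goal maps to a recipe (a list of sub-actions).
# Given one observed action, candidate goals are the top-level goals whose
# recipes, expanded recursively, contain that action. Purely illustrative.
recipes = {
    "write_report": ["gather_data", "draft_text", "format_document"],
    "gather_data":  ["run_query", "read_results"],
    "book_travel":  ["search_flights", "pay"],
}

def expands_to(goal, action):
    """True if the goal's recipe, expanded recursively, contains the action."""
    steps = recipes.get(goal, [])
    return any(step == action or expands_to(step, action) for step in steps)

def candidate_goals(action):
    """Goals whose plans could explain the observed action."""
    return [g for g in recipes if expands_to(g, action)]
```

A real plan recogniser would then use preconditions, effects and further observations to narrow this candidate set, as discussed above.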
Here again, hybrid solutions are being envisaged. In particular, to attain quick adaptation, some systems try to select among several modelling and customisation methods with different degrees of complexity. The choice depends on the amount and quality of data available.
The application domains of user modelling include:
- educational and tutorial systems: knowledge and skill diagnosis, student modelling, tailoring to learner proficiency, feedback and support for collaboration,
- adaptive information filtering, retrieval and browsing: information retrieval (hypermedia navigation, intermediaries and information filtering), information presentation (decision support, dialog management, natural language interpretation, processing and generation),
- multi-modal user interactions and use of the model to integrate interactions and to interpret the user,
- supporting inter-user collaboration,
- e-commerce applications (acquiring user preferences, services and product customisation, profile management, automatic suggestion, shopping and comparison systems, etc.),
- consumer guides,
- mobile systems,
- interface adaptation (tailoring to abilities, disabilities, and preferences; provision of help; high-level programming and control).
The beginning of a commercial boom in this domain in the late 1990s was essentially due to the value of web customisation [Kobsa, 2001].
From the knowledge management perspective, adaptive information filtering and interfaces, collaborative software and tutorial tools are extremely close applications. Knowledge engineering is looking for tools capable of synthesising representations that best communicate a concept to a targeted user. Users are not created equal, and there are as many possible customisations and adaptations as there are users. [Fischer, 2001] explains that the universal access issue that underlies a lot of the above points is to write software for millions of users (at design time) while making it work as if it had been designed for each individual user (at use time), to provide universal access for people with different (dis)abilities and to adapt to different profiles (e.g. systems easy to understand and use without prior experience for novices vs. complex but useful systems for professionals).
According to [Brusilovsky, 2001] adaptive hypermedia have two main parent areas (hypertext and user-modelling) and at least six kinds of application systems: educational hypermedia, on-line information systems, on-line help systems, information retrieval hypermedia, institutional hypermedia, and systems for managing personalised views in information spaces. The following extract illustrates very well my conviction of the strong link between knowledge management and user-modelling:
Saying the right thing, at the right time, in the right way:
"The challenge in an information-rich world (in which human attention is the most valuable and scarcest commodity) is not only to make information available to people at any time, at any place and in any form, but to reduce information overload by making information relevant to the task-at-hand and to the assumed background knowledge of the users. Techniques to say the right thing include: (1) support for differential descriptions that relate new information to information and concepts assumed to be known by a specific user; and (2) embedded critiquing systems that make information more relevant to the task-at-hand by generating interactions in real time, on demand, and suited to individual users as needed. They are able to do so by exploiting a richer context provided by the domain orientation of the environment, by the analysis of partially constructed artefacts and partially completed specifications. To say things at the right time requires to balance the costs of intrusive interruptions against the loss of context-sensitivity of deferred alerts. To say things in the right way (for example by using multimedia channel to exploit different sensory channels) is especially critical for users who may suffer from some disability." [Fischer, 2001]
As with organisational modelling, the model I envisaged aimed at supporting the corporate memory activities involved in the application scenario; I shall not discuss improvements in user modelling itself. My interest in user-modelling techniques here is their direct use to enable the system to gain insight into its environment (user profiles, communities of users, existing interests, ...), i.e. to give the system an awareness of the users it is interacting with.
Knowledge is of two kinds. We know a subject ourselves, or
we know where we can find information upon it.
— Samuel Johnson
Since information is tightly linked to knowledge, information systems are tightly linked to knowledge management. Accessing the right information at the right moment may enable people to get the knowledge to make the right decision at the right time. Retrieving the right information is the main issue of information retrieval systems, and I shall briefly survey the different techniques developed in that field, using the states of the art of [Greengrass, 2000] and [Korfhage, 1997]. I shall not enter into the details: there are whole books on just sub-fields of this subject (databases, information retrieval, data mining, information extraction, etc.), enough to fill many shelves in a library. Therefore, I shall only survey the existing lines of research to position my work and identify the different influences.
Ideally, an information system would somehow understand the content of documents and produce a single, concise, coherent and comprehensive answer relevant to the user’s request. In actual fact, such a strategy is far beyond the state of the art [Greengrass, 2000].
Most of the research has focused on textual documents since natural language remains the preferred form for knowledge communication. Thus information retrieval matches information to information needs by matching documents to a query; information filtering proposes techniques for the rapid selection of rich ores of documents [Korfhage, 1997]; data mining and information extraction try to extract conceptual structures from databases and from documents respectively.
Information retrieval often addresses the retrieval of documents from an organised and relatively static collection, also called an archive, corpus or digital library. However, it is not restricted to static collections and can be applied to streams of information resources (e.g. e-mail messages, newsletters). Following [Korfhage, 1997], I shall distinguish between:
- Retrospective search system, designed to search the entire corpus (set of textual documents) in response to an ad hoc query. Such a system has a relatively large and stable corpus and a small, rapidly changing set of submitted queries, provided anew each time by the user.
- Monitoring search system, designed to keep its users informed concerning the state of the art in their areas of interest as specified in their profiles. These interest descriptions are run as routing queries against new documents, and the results are disseminated to users for whom there is a match. Such a system has a flow of documents, a relatively unstable corpus, and a large static set of queries derived from the user profile.
This distinction between routing and information retrieval is an idealisation. In practice, the distinction is not necessarily so clear-cut. Routing and information retrieval may be viewed as opposite ends of a spectrum with many actual applications in between. [Greengrass, 2000]
The storage problem of information systems will not be addressed here since it is not directly relevant to my work; it is, however, a non-negligible problem in a complete solution.
Thus a typical information retrieval system seeks to classify the documents of a given collection to find the ones that concern a given topic identified by a query submitted by the user. Documents that satisfy the query with regard to the judgement of the user are said to be 'relevant'.
Information retrieval generally focuses on textual resources. Other kinds of information resources, such as images, sound, video, and multimedia documents combining several of these, could be relevant for a query. However, textual documents are the most common support of information and, moreover, mining unstructured multimedia documents is extremely complex, even more so than mining textual resources, which is already a challenge, as we shall see in the following sections.
An information resource, such as a document, is structured if it consists of named components organised according to some well-defined syntax, with a fixed meaning for every component of a given record type (e.g. a database schema). This does not mean that the associated semantics is unambiguously explicit to anyone looking at the structure. Information resources may be structured, unstructured, semi-structured, or a mixture of these types.
In a solution of fully structured information, such as databases or knowledge-based systems, there exist fixed schemata, conceptual structures with known fields and logical rules. As long as the specifications of the data structure correspond to the needs, exact or range matching operators can be designed for each data type (e.g. for integers: =, >, ≤, etc.) and used to propose powerful retrieval algorithms. A search engine applies these operators to find a given component in a given structure and retrieve its contents. Similarly, given a component and a value, the search engine can find records such that the given component contains the given value. These approaches encounter problems with imprecise or highly heterogeneous data and require formalising the whole information corpus in a logical format, which can be impractical.
In a collection of unstructured resources such as texts in natural language, there is no unique and fixed syntactic position: in a random collection of documents, there is no guarantee that they are about the same topic, there is no guarantee on the information they specify or where it is specified. Unstructured means that there is no (externally) well-defined syntax shared by the documents. Even if a syntax is known to exist, the semantics of the syntactic components is unknown [Greengrass, 2000].
As usual, there exist hybrids between these extremes; I shall distinguish two of them:
- Informally structured documents (usually called semi-structured): the structure does not rely on a formal language. A collection of textual documents may share a common structure not completely marked out or explicit, but in these semi-structured documents, at least some information is located at a fairly standard position in the text with clues such as domain-dependent keywords (e.g. "age:", "sex:", "population:" etc.) or structural information dependent on the nature of the document (e.g. strategic position and formatting for the title, keywords such as "abstract:", "index", etc.). This makes it possible to write algorithms, or at least heuristics, for extracting the data back into a structured format. A classical example of resources that can be mined this way is the content of the pages of the CIA Fact Book. Ironically, semi-structured documents are usually the output of the front-end of a database or another archiving system; XML could be an asset in removing this incongruity.
- Hybrid, mixed-structure or partly structured resources: they include structured parts and unstructured ones, sometimes intertwined; e.g. Web pages have a structured header and an unstructured body. The structured part typically contains metadata (i.e., data about the resource) rather than the information content of the resource itself, e.g. the author, title, recipient, abstract, creation date, keywords, ISBN and other id numbers, etc. With a language like XML, where text and tags can be mixed at will, a document can be anywhere between the two extremes.
The initial structure of the information resources varies on a continuum between unstructured resources, informally structured or semi-structured resources, hybrid partly-structured resources and strongly structured resources.
In any case, information retrieval has focused largely on searching the remaining unstructured part, with the strong constraint of producing automatic and scalable methods and tools.
One could envisage scanning the full collection of documents each time a query is issued. For instance, a finite automaton could be applied to search all the documents for a given string. In that case, no additional space beyond the original corpus is required and new document insertion is straightforward. This may be necessary if no restriction is placed on the form of queries that can be submitted; however, it is a very slow approach because the whole corpus has to be scanned every time.
The key problems are how to state an information need and how to describe an information resource so as to be able to match the need to the resource. To tackle these problems, there are two major categories of approaches: semantic and statistical. Semantic approaches attempt to implement syntactic and semantic analysis to capture in semantic structures some aspects of the understanding of the resource that a human would have. In statistical approaches, the resources selected for retrieval or clustering are those that match the query most closely in terms of some statistical measure. However it is clear that each approach tends to introduce techniques from the other; statistical measures need to be attached a minimum of meaning in order to be usable and semantic features are reintroduced in the statistical spaces while semantic approaches use statistical tools to filter the corpora before processing thus improving their scalability.
There are two spaces, often with the same features, in nearly every approach: the document space and the query space. The document space is where documents are organised in order to prepare their use. Without this pre-organisation of documents, query processing rapidly becomes prohibitively expensive; on the other hand, the pre-organisation must be designed so as to be adequate for the envisaged processing, and the richer this pre-organisation is, the more likely it is to require heavy pre-processing that raises problems of scalability and costs of modification, such as adding or removing documents or features; iterative construction methods are a must. The task of organising documents is called indexing and the resulting structure is the index. The first and simplest index type is the signature file: it consists of a sequence of document references and document signatures i.e., bit strings derived from the document using hashing on its words. Indexes are used to create inverted indexes i.e., indexes where the entry points are the features used for indexing and the output is a list of candidate documents. For instance, for the signature file, the inverted index would give for each word the documents that use it.
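The inverted index just described can be sketched in a few lines: each term points to the set of documents that use it, and a conjunctive query intersects these posting sets. This is a minimal illustration with hypothetical documents; real systems add compression, weighting and positional information:

```python
# Minimal inverted index: maps each term to the set of documents using it.
# Conjunctive queries intersect the posting sets. Illustrative only.
from collections import defaultdict

def build_index(docs):
    """Build term -> set of document ids from a dict of raw texts."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Documents containing all query terms (boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document corpus.
docs = {
    "d1": "corporate memory management",
    "d2": "memory of the organisation",
    "d3": "corporate knowledge and memory",
}
index = build_index(docs)
```

Note that adding a document only touches the postings of its own terms, which is why inverted indexes support the iterative construction methods mentioned above.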
Indexing consists of scanning the corpus to build a representative surrogate for each one of the information resources. This is the first facet of the use of surrogates in information systems (the second one will be presented in the section about user interaction): the surrogate (e.g. a vector) represents the resource in the processing carried out by the system (e.g. matching a query); it influences the whole system in its process (since the structures that feed the algorithms and the algorithms that exploit the structures are tightly linked), in its use (since if a surrogate eliminates some information, e.g. the order of terms, it will not be possible to use it for processing) and in its performance (since complex structures may provide finer results, but need much more computational power). Surrogates for indexing may be as simple as a collection of some words of the document or as complex as a natural language analysis of its content.
The choice of the surrogate depends on the information coding and structure. In fact, I compare the surrogate coding and structure to a sort of highly degrading compression preserving only the features relevant to information processing. Surrogates are not limited to the representation of the content, but can also include metadata (editor, author, ISBN, etc.). The whole set of these surrogates and its structure form the index of the corpus.
The first use of information resource surrogates is to provide a highly synthetic and representative structure that reduces the content of the resource to those features relevant for the intended processing.
Document granularity (a whole encyclopaedia, a volume, a chapter, a section, a page, a paragraph) and documents within documents raise the problem of choosing cut-points i.e., on which partition the surrogate calculation will be based. This choice influences the relevance evaluation process: the parts of a document that has been split may not be individually relevant to a query, but collectively they may be; conversely, a whole encyclopaedia most probably contains a lot of answers, but for each one of them, only a short extract is most probably sufficient and really relevant. Document format is also a problem (text, image, sound, schema, table, application, database file, etc.), as are other characteristics (e.g. ephemeral documents vs. flows of documents vs. static collections).
On the other side of the coin is the query space. The forms of a query range from strongly constrained formal structures, which are efficient for computational processing but may be difficult for humans to use, to natural language queries, which put a heavier computational load on the system but are much more easily stated by the user. The methods for matching a document to a query are closely related to the query form used. The classic range of forms is divided into the statistical approach and the semantic approach. The choice of the query space is not separable from the choice of the document space, since elements from the query space and the document space will have to be matched. A natural language query can be treated in a crude way, by removing stop-words and stemming the other words, thus creating a list of terms. At the other extreme, the query undergoes a lexical, syntactic, semantic and pragmatic analysis. However, a query is usually much shorter than a text: it is quicker to process, but there is much less context and there are fewer clues to interpret it. As we shall see in the interface section, one can use a dialogue system to refine the query elicitation, a bit like what would happen if you were asking a librarian for information.
An indexing and querying vocabulary has to be chosen (words, concepts, sentences, complex structures, etc.). A special constraint is the use of a controlled vocabulary (vs. an uncontrolled vocabulary) by surrogate authors and/or query authors. A controlled vocabulary obliges users to choose among a reduced, precise set of terms to express a document or a query; it is constraining, restricting, and sometimes frustrating, but it allows much more efficient processing. With an uncontrolled vocabulary, the user is free to use any term, but this introduces risks of mismatches and losses because different indexing terms were used for describing documents and queries; it may also lead to huge indexing structures; however, it allows for much more flexible solutions. As usual, the current trend is to mix both approaches, trying to allow a maximum of flexibility at the user end and implementing (semi-)automatic reduction algorithms to translate informal inputs into controlled (formal) and semantically richer structures. The translation usually rests on thesauri providing standard terms to index or query and cross-referencing relations such as "see also", "other", "broader term", "narrower term", "synonyms", "co-occurring term". Algorithms exist to automatically build a thesaurus by analysing co-occurrences of terms. We shall also see that one form a controlled vocabulary can take is called an ontology.
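The thesaurus-based translation mentioned above can be sketched as query expansion: each query term is augmented with its synonyms and broader terms before matching. This is a minimal illustration; the thesaurus entries and structure are hypothetical:

```python
# Query expansion through a tiny hypothetical thesaurus: each query term
# is replaced by itself plus its synonyms and broader terms.
thesaurus = {
    "car":   {"synonyms": ["automobile"], "broader": ["vehicle"]},
    "lorry": {"synonyms": ["truck"],      "broader": ["vehicle"]},
}

def expand(query_terms):
    """Return the query terms augmented with thesaurus cross-references."""
    expanded = set()
    for term in query_terms:
        expanded.add(term)
        entry = thesaurus.get(term, {})
        expanded.update(entry.get("synonyms", []))
        expanded.update(entry.get("broader", []))
    return expanded
```

The expanded set can then be matched against an index built over controlled terms, reducing the mismatch risk of an uncontrolled vocabulary at the cost of some precision.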
"Two characteristics of an indexing language are its exhaustivity and specificity. Exhaustivity refers to the breadth of coverage of the index terms - the extent to which all topics and concepts met in a document set are covered. Specificity refers to the depth of coverage - the extent to which specific topic ideas are indexed in details." [Korfhage, 1997]
The indexing may be manual or automatic:
- manual indexing relies on human interpretation and thus benefits from a better assessment of the content and topics of a document, but it is less systematic and therefore prone to bias and error. However, a controlled vocabulary combined with semi-automatic annotation assistance can reduce these problems.
- automatic indexing usually uses term frequency in a document as an indicator of importance in the document, and a term's average frequency of occurrence across the collection as an indicator of its power of differentiation; it also relies on stemming and stop-term removal as pre-processing.
In automatic indexing, the statistical approaches break documents and queries into terms that provide a population that is statistically counted and studied. Terms can be simple words, complete phrases recognised by frequency of co-occurrence or using dictionaries or, in the purest statistical approaches, n-grams i.e., strings of n consecutive characters obtained by moving a window of n characters in length through a document or query, one character at a time [Greengrass, 2000]. Using n-grams is language-independent, and can even be used to sort documents by language. Techniques based on n-grams appear to be relatively insensitive to degraded text such as spelling mistakes, typos, etc. However, the original structure of the natural language is completely lost and syntactic or semantic level techniques can hardly use the obtained index; the complete processing is purely statistical. Keyword indexing removes the structure and ordering which may be useful to introduce basic constraints in querying. An extension of keyword indexing, called full-text indexing, enables the system to preserve the notions of ordering and neighbourhood: each term is indexed with its positions in the documents where it appears, thus enabling more expressiveness in queries.
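The sliding-window extraction of character n-grams described above can be sketched as follows (the function name is illustrative):

```python
def char_ngrams(text, n=3):
    """Move a window of n characters through the text, one character at a time.
    A typo only perturbs the few n-grams overlapping it, which is why the
    technique tolerates degraded text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("Retrieval", 3)
```

A document surrogate is then simply the multiset of its n-grams, with no linguistic knowledge involved.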
The first and simplest surrogate is the boolean vector. A document is represented by a vector of 0 and 1 values indicating, for each term, its absence or presence. A document is thus represented by the set of its terms, phrases, or n-grams, which sacrifices all the syntactic information about the order and structure in which the terms occur in the document. The query is formulated as a boolean combination of terms using the classical operators AND, OR, and NOT. Absence or presence of a term is a poor way of representing the relevance of a document, and a first step to mitigate this limitation is the extended boolean form using weights or fuzzy logic approaches. Fuzzy sets and fuzzy queries especially try to address the problem of non-binary attributes.
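The boolean surrogate and query evaluation can be sketched as follows (document identifiers, terms and function names are illustrative):

```python
def boolean_vector(doc_terms, vocabulary):
    """0/1 vector marking, for each vocabulary term, its absence or presence."""
    return [1 if t in doc_terms else 0 for t in vocabulary]

# Documents represented as sets of terms (all ordering information is lost).
docs = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"information", "database", "system"},
}
vocabulary = sorted({t for terms in docs.values() for t in terms})

# Boolean query: information AND retrieval AND NOT database
def matches(terms):
    return "information" in terms and "retrieval" in terms and "database" not in terms

hits = [d for d, terms in docs.items() if matches(terms)]
```

Note how the 0/1 representation cannot express that a term is more or less important in a document, which motivates the weighted extensions discussed next.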
The most common weighted form is the vector space model, where the document surrogate is the term vector in the document space. The union of all the sets of terms obtained by scanning the corpus defines the document space as a space in which each distinct term represents one dimension. In this space, a term vector represents a document by assigning a numeric weight to each term of the document, trying to estimate the usefulness of the given term as a descriptor of the given document [Greengrass, 2000]. The weights assigned to the terms of a document are interpreted as the coordinates of the document in the document space (a point, or a vector from the origin). Conversely, the dual of the document space is the term space, where each document is a dimension; each point (or vector) is a term of the given collection, the coordinates of which are the weights assigned to the given term in each document in which it occurs [Greengrass, 2000]. In a document, weights measure how effective the given term is likely to be for distinguishing this document from another; in a query, they measure how much importance the term should be assigned when computing the match between documents and the given query.
Weights are assigned to terms following statistical methods, the most famous being "term frequency in a document * inverse document frequency in the collection", commonly abbreviated "tf*idf". High-idf terms tend to be better discriminators of relevance than low-idf terms. Many alternative weighting schemes have been proposed to offset some drawbacks, but there is no need to detail them here.
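A minimal sketch of the tf*idf scheme, using the common idf form log(N / document frequency) (function and variable names are my own, and many variants of the formula exist):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists; returns one {term: weight} dict per document,
    with weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per document
    weights = []
    for doc in docs:
        tf = Counter(doc)            # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

weights = tf_idf([["common", "rare", "rare"], ["common", "other"]])
```

A term present in every document gets idf = log(1) = 0 and thus no discriminating weight, while a term concentrated in few documents is boosted.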
Weights undergo normalisation, the most common forms being frequency and Euclidean normalisation. Normalisation of the term frequency makes it depend on the frequency of occurrence relative to other terms in the same document, not on its absolute frequency of occurrence; weighting a term by absolute frequency would favour longer documents over shorter ones. Euclidean normalisation divides each component by the Euclidean length of the vector. Here again, alternative normalising techniques have been proposed, aimed at factoring out the effects of document length.
The terms often undergo pre-processing, the most common steps being stemming and stop-word removal. A word is stemmed when the system extracts and keeps only its root. The goal is to eliminate the variation between different grammatical forms of the same word, e.g. "retrieve", "retrieved", "retrieves" and "retrieval". Stop-words are common words that have little power to discriminate relevant from non-relevant documents, e.g. "the", "a", "it", etc. To be able to operate the removal, surrogate generators are usually provided with a "stop list" i.e., a list of such stop-words. Note that both stemming and stop lists are language-dependent [Greengrass, 2000]. Stemming and stop-word removal are the most common types of normalisation in traditional information retrieval systems. They are also examples of the introduction of some natural language processing into statistical techniques, even if they are quite low-level techniques.
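The two pre-processing steps can be sketched as follows; the suffix stripping here is deliberately crude and purely illustrative (production systems use proper stemmers such as Porter's algorithm, and much larger stop lists):

```python
STOP_WORDS = {"the", "a", "it", "of", "and"}

def crude_stem(word):
    """Illustrative suffix stripping only; real systems use e.g. the Porter stemmer."""
    for suffix in ("ations", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Lower-case the text, drop stop-words, stem the remaining terms."""
    return [crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS]
```

Both the stop list and the suffix table are language-dependent, as noted above.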
Concerning the pre-processing, [Greengrass, 2000] gives examples of higher level problems due to such low-level treatment: "assassination" (singular) in a document proved a much more reliable indicator that the document describes an assassination event than the presence of the term “assassinations” (plural) that often refers to assassinations in general. Likewise "venture" by itself is not a good indicator that a document describes a joint venture, but the phrases "venture with" and "venture by" (with the stop words) proved to be very good descriptors.
Vector spaces cannot address the problem of expressing logical connectivity as in a boolean query; to overcome this problem, extensions based on weight manipulation have been proposed.
Another major limitation of the vector space approach is that it assumes that terms are independent enough to provide orthogonal dimensions for the document space i.e., relationships among the terms are ignored (co-occurrences, synonyms, etc.).
An alternative vector space approach, called Latent Semantic Indexing (LSI), attempts to capture these term-term relationships using statistics of co-occurrences. The LSI document space is a much lower-dimensional space whose dimensions are statistically independent artificial concepts; these artificial concepts are sets of terms that are related (by co-occurrence, synonymy, etc.). Thus documents and queries dealing with the same topic, which could be far apart in a traditional term-based document space because they use synonymous terms, may be close together in this space. This representation requires storage of a substantially larger set of values. Moreover, the technique is not a semantic method, as its name claims, but rather a statistical method for capturing term dependencies that are hoped to have semantic significance; co-occurrence techniques tend to build lexical fields rather than synonym sets, and the choice of the size of these sets influences the results and precision of the retrieval techniques.
Statistical methods are highly scalable and highly autonomous, but they encounter problems inherent to the lexical level, such as homographs (the converse problem of synonyms), that require syntactic, semantic and pragmatic analysis. They rely on the presence of a term in a document as if it meant the document is relevant to that term, while the term may have been merely mentioned in passing. As an example inspired by [Korfhage, 1997], if this PhD thesis includes the sentence "I shall not talk about planning algorithms here because they are not directly relevant to this work", it is clear that a term-driven retrieval system may select it while it is literally not relevant...
Unlike statistical approaches, non-statistical approaches are usually extremely interested in structure; they are usually grouped under the appellation natural language processing (NLP). They attempt to address the structure and meaning of textual documents directly, instead of merely using statistical measures as surrogates. Natural language processing may include lexical, syntactic, semantic and pragmatic analysis. The output is usually a rich, complex formal structure to be exploited in the retrieval process, such as conceptual graphs and their associated operators [Sowa, 1984].
[Greengrass, 2000] picked out several levels in natural language processing:
- phonological level: for speech recognition. It implies analysis of sounds the details of which are out of the scope of this work.
- morphological level: recognise the variant forms of a given word. It is used to build stemming systems; it is the first process of natural language processing tools, before tagging words.
- lexical level: recognise the structure and meaning at the word level. It is used to build stop-word lists or thesauri, but also to detect and tag words with their type such as proper noun, verb, etc. Proper nouns are excellent indicators of the relevance of a document, but their use may require common knowledge to interpret what a document mentions.
- syntactic level: analysis of the structure of sentences. It can be used, for instance, to map passive forms to active forms, participating in the normalisation of the form of the sentence before semantic interpretation.
- semantic level: interpret meaning of the lower level results. It is used for disambiguation and construction of conceptual structures. Disambiguation can rely for instance on analysis of the local context of the term occurrences matching it to a thesaurus including the different meanings and their usual context.
- discourse level: interpret the document structure to determine the organisation of clauses, sentences, and paragraphs that determines the rhetorical flow of a document and its meaning. It can be used for document summarisation or to type the knowledge extracted from a given part.
- pragmatic level: introduce external knowledge such as the user profile, the context, common knowledge, domain knowledge to improve interpretation.
Besides text analysis, surrogates can exploit other structural clues in the document that pertain to pragmatic knowledge of document structuring and domain knowledge. Examples of key indicators are:
- bibliographic citations: co-citation i.e., the co-occurrence of documents in the citations of a group of documents on a subject; bibliographic coupling i.e., two documents citing the same third document dealing with a given topic.
- hyperlinks: the anchors of the links enable us to locate a set of related documents, and the labels of the links enable us to improve the description of the topics of these documents.
- structural clues: trigger terms such as 'legend', 'conclusion', 'example', etc. to locate important information in the text, source of document (type of journal, recognised author, etc.), etc.
The problem is the evaluation of the topicality of the document [Korfhage, 1997] i.e., how well the topic of the document matches the topic of the query.
Whatever model is used, ultimately the system calculates a function or a measure to evaluate the relevance of a document to a query. If a query is considered as a document then a similarity approach will most probably be applied; otherwise, the mapping of a document to the query or the projection of the query on the documents will use a relevance measure that can give a binary result (relevant / not relevant) or a degree of relevance (a percentage).
If the document space and the query space are based on terms, then the matching algorithm has to be designed to retrieve documents that include the terms (some or all of them) of the query.
In a boolean space, the system has prepared a signature file for each document and evaluates the truth value of the boolean query, represented by an expression composed of terms as boolean variables and the boolean operators AND/OR/NOT, against each document signature, which fixes the value of each variable representing a term. For efficiency, the query may be recast in forms more convenient for processing (disjunctive normal form, conjunctive normal form); they have an equivalent truth table while enabling optimised and simplified processing algorithms. Possible refinements for boolean query solving include: focusing on a specified syntactic component of each document (the title, the abstract, etc.) or on regions (e.g. the beginning of the document); the addition of operators, the most common one being the proximity operator forcing two keywords to be within a close distance in the text.
In the extension of boolean space by fuzzy logic, extended boolean operators make use of the weights assigned to the terms to evaluate their arguments. The result is no longer boolean, but a value between 0 and 1 corresponding to the estimated degree to which the given logical expression matches the given document.
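One common choice of extended boolean operators uses minimum, maximum and complement over the term weights (other operator families exist; the weights below are illustrative):

```python
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

# Illustrative term weights in one document:
w = {"knowledge": 0.8, "management": 0.6, "sport": 0.1}

# Degree to which the document matches (knowledge AND management) AND NOT sport:
score = fuzzy_and(fuzzy_and(w["knowledge"], w["management"]), fuzzy_not(w["sport"]))
```

The result is no longer 0 or 1 but a degree in [0, 1], as described above.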
We have seen that statistical methods are used to provide the weights of probabilistic queries; the membership is considered to be a probability, thus between 0 and 1. The term probabilities are combined to calculate the probability that a document matches a query i.e., probabilistic approaches try to calculate the conditional probability P(D|A, B, C, ...) that the given document D is relevant, given the clues A, B, C, etc. Making some simplifying statistical assumptions, the conditional joint probability is replaced by a separate probability for each event. The usual assumption is not that these properties are independent, but that the same degree of dependence holds for both the relevant document set and the non-relevant document set, although we have seen that this is not exactly the case.
The vectorial representation leads to two techniques:
- metrics or dissimilarity measures: calculate a distance between the query point and a document point in the vector space, or between two document points for clustering. Multiple reference points can also be used by the system: query, profile, other documents, known authors, known journals, etc. They can be used to specify the area of interest and ask for documents similar to these points. This is close to case-based reasoning, and many metrics have been proposed here too.
- angular measures: calculate the angle between the vectors representing the query and a document, or two documents for clustering for instance. The most famous one is the cosine measure. One problem with cosine similarity is that it tends to produce relatively low similarity values for long documents, especially when the document is long because it deals with multiple topics. Thus, here again, alternative approaches and measures have been proposed.
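The cosine measure mentioned above can be sketched in a few lines (vector components stand for term weights; the function name is illustrative):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors:
    1.0 = same direction, 0.0 = orthogonal (no shared terms)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because it depends only on the angle, the measure is insensitive to vector length, which is exactly the document-length normalisation discussed earlier.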
Other probabilistic methods used are Bayesian probability and Bayesian networks, inference models, etc.; they will not be developed here.
Once the optimised query is executed to the user's satisfaction, it is typically thrown away. An idea is to reuse queries and their results when there is a reasonable assumption that user information needs will recur; such queries are called persistent queries. If a new query is equivalent to a persistent query, the result is reused; otherwise, the closest query is used as the starting point of the search process [Greengrass, 2000]. I believe the use of persistent queries as representing persistent needs of communities of interest can be exploited even further in managing knowledge needs and fostering exchanges for monitoring a state of the art.
Another approach is the clustering of documents, mainly based on statistical methods or graph theory. Clustering is the grouping of documents into distinct classes according to the properties captured in their surrogates. Clustering algorithms seek features that will separate the documents into groups that are ideally completely separate and as far apart as possible in feature space. Clustering uses a document-document similarity and usually a threshold, and most of the algorithms iteratively build a hierarchy of clusters. Of course, and once again, this poses the problem of finding a reliable similarity measure and of choosing an adequate threshold. Grouping similar documents in clusters has a first interest linked to the previous search methods: the acceleration of the search for relevant documents; for instance, the algorithm compares the query to the centroids of the clusters (a centroid being a virtual or real document surrogate representing the average profile of a cluster) to determine in which cluster it is likely to find the best answers. Clustering documents within a collection is a form of unsupervised classification. Just as it can help searching algorithms to focus on the right cluster, it can also be used to support browsing approaches. In browsing, the user starts searching the data without a clear-cut end goal in mind, without clear-cut knowledge of what data is available, and, very likely, without clear-cut knowledge of how the data is organised. Clustering can reveal the intrinsic structure of a collection and, when combined with a user-friendly display, can be an effective tool for browsing a large collection and "zeroing in" on documents relevant to some given topic or other criterion [Greengrass, 2000].
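A centroid-based clustering pass can be sketched as follows (a k-means-style sketch under illustrative assumptions: documents as weight vectors, squared Euclidean distance, initial centroids given; hierarchical algorithms mentioned above proceed differently):

```python
def centroid(vectors):
    """Component-wise mean: the average profile of a cluster."""
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]

def nearest(doc, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(doc, centroids[i])))

def cluster(docs, centroids, iterations=10):
    """Iteratively assign documents to centroids and recompute the centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for doc in docs:
            clusters[nearest(doc, centroids)].append(doc)
        centroids = [centroid(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

docs = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
centroids, clusters = cluster(docs, [[0.0, 0.0], [1.0, 1.0]])
```

The centroids produced this way are exactly the "average profiles" a search algorithm can compare a query against before descending into a cluster.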
Going further than information retrieval that retrieves documents relevant to a given query or topic, information extraction tries to directly return the information needed by the user in a response that may have been generated from multiple resources. This area heavily relies on natural language analysis to understand the query, understand the documents and build an answer; many problems remain to be solved in this area.
Most logic-based information retrieval systems are turning into web-based systems as part of the general trend to try to build a knowledge-based Web, or Semantic Web. In chapter 3, we shall see some examples of reasoning at the knowledge level to improve information retrieval. Indeed, if the surrogates are highly structured and exploit knowledge modelling languages, additional inference capabilities can be added to improve retrieval, reasoning on models of the domain, deducing implicit knowledge, etc.
"A major factor in user acceptance of any retrieval system is the interface through which the user interacts with the system" [Korfhage, 1997]. Clearly there is a need for the development of visual information retrieval interfaces. Search results usually take the form of a list of documents ranked according to their estimated relevance to the query or topic for which they were retrieved. To represent this list, the system uses a second facet of document surrogates i.e., the surrogate used to present and identify the document to the user. An identifier is always present in both types of surrogates, but it is not enough for the user since it is usually a system identifier such as a URI http://www.mycorp.com/reportV278.htm#C12 or a key in a database. This second facet requires information such as: title, summary (which may require automatic generation methods), authors' abstract, focused extract, preview, snapshot, keywords, review, age, ISBN, position in a classification/presentation structure (e.g. in pop music, new releases, on sale, French speaking, etc.), etc. The choice of the surrogate determines the ability of the system to propose views on the results that organise the selected document set so that the user gets the "big picture" quickly, zeroes in rapidly on the desired documents, and sees which documents are closely related [Greengrass, 2000]. The bad news is that usually the relevance and choice of the content of a surrogate is domain-dependent.
The second use of information resource surrogates is to provide a highly synthetic and representative structure that reduces the content of the resource to those features relevant for the user to identify the resource and its content.
User interaction must not be reduced to the submission of a query; users interact with the system in many ways. They formulate queries, review the results, provide feedback, refine their original requests, describe their profile, build training sets, set algorithm parameters, etc.
When evaluating results, two positions can be adopted:
- Quantitative assessment position: the two main measures of information retrieval are recall and precision. Precision is the ratio of relevant items retrieved to all items retrieved. Recall is the ratio of relevant items retrieved to all relevant items available in the collection. Measuring precision is (relatively) easy if a set of competent users agree on the relevance or non-relevance of each of the retrieved resources. On the contrary, measuring recall is much more difficult because it requires knowing the number of relevant documents in the entire collection, which means that all the documents in the entire collection must be assessed. If the collection is large, this is quite simply not feasible [Greengrass, 2000].
- Qualitative assessment position: it tries to take into account the user's preferences and feedback such as the acceptance or rejection, the order of consultation, the novelty of documents, the number of documents examined before the relevant ones were found, the known reference documents found, the number of documents needed to get the answers vs. the number of document retrieved, etc. [Korfhage, 1997] also rightly distinguishes between relevance to the user’s query, and pertinence to the user’s needs.
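The two quantitative measures defined above can be computed directly from the retrieved set and the set of relevant documents (document identifiers are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant items retrieved / all items retrieved;
    recall = relevant items retrieved / all relevant items in the collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

The hard part in practice is not this computation but obtaining the `relevant` set, which, as noted above, requires assessing the whole collection.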
A simple feature such as the size and grouping of the results can raise non-trivial questions. For instance, on one hand, there may be multiple subsets of relevant resources, any of which will satisfy the user's requirement; on the other hand, two relevant resources may present contradictory views, hence users would be seriously misled if they only saw one of the resources.
One approach for dealing with the subjectivity of evaluation is to provide or generate “user profiles,” i.e., knowledge about the user’s needs, preferences, etc. The objective is to give the user not just what he asked for, but what he “meant” by what he asked for. Or the profile may be generated automatically, based on statistics derived from documents the user has designated as relevant to his needs [Greengrass, 2000].
The user model is increasingly forming a significant component of complete solutions, introducing information about a user's preferences, background, educational level, familiarity with the area of inquiry, language capabilities, journal subscriptions, reading habits, etc. into the retrieval process [Korfhage, 1997]. There are several ways to enrich queries with profile information:
- profile used as post-filter: used after the query is processed to sort the results.
- profile used as pre-filter: modify the query before processing to focus the search.
- profile used as co-filter: documents are compared to both the profile and the query in a combining evaluation function.
Solutions including user profiles may use the multiple reference points mentioned before. Additional factors can also introduce additional retrieval tests such as "is the document too old?", "has it already been consulted by the user?", "is it too detailed/shallow?", "is it written for/by an expert/novice?".
Another approach exploiting user profiles is collaborative filtering. It compares user profiles to build recommendations: if two user profiles are close, then it is likely that the consultation of one user can be used to make suggestions to the other, and vice versa. This fosters the emergence of communities of usage and communities of interest.
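A minimal sketch of such profile-based collaborative filtering, assuming profiles are simply sets of consulted documents and using set overlap as the similarity (the Jaccard measure, threshold and names are illustrative choices, not from the cited sources):

```python
def jaccard(a, b):
    """Overlap of two sets of consulted documents."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target, profiles, threshold=0.3):
    """Suggest documents consulted by similar users but not yet by the target user."""
    suggestions = set()
    for user, items in profiles.items():
        if user != target and jaccard(profiles[target], items) >= threshold:
            suggestions |= items - profiles[target]
    return suggestions

profiles = {"ann": {"doc1", "doc2", "doc3"},
            "bob": {"doc1", "doc2", "doc4"},
            "eve": {"doc9"}}
```

Because it only compares usage profiles, the approach works for any type of resource, as noted above.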
[Greengrass, 2000] uses the term interactive directed searching for systems in which the user engages in an interactive process, either to formulate the original query or to refine the query on the basis of the initial results returned. A significant improvement in the performance of any solution is gained by involving the user in query processing and refinement [Korfhage, 1997].
The process of query expansion and re-weighting may be wholly automatic or may involve a combination of automatic processes and user interaction. Relevance feedback in query expansion and refinement is the classic method of improving a query interactively: based on the feedback, the system automatically resubmits or proposes expansions and refinements of user-generated queries.
In vector spaces, a classic refinement method is term re-weighting. Given relevance feedback on a result, the system increases the weights of terms that occur in relevant documents and reduces the weights of terms that occur in non-relevant documents. It can also modify the document vectors by adding terms drawn from the user's query or application domain to the indexes of documents judged relevant, thus keeping a memory of terms judged relevant for the construction of a surrogate. Query expansion usually means the addition of relevant terms, for instance drawn from the most relevant documents or from thesauri giving synonymous terms.
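A well-known instance of such term re-weighting is the Rocchio-style update, sketched below (the coefficient values and names are illustrative; many variants exist):

```python
def refine_query(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style re-weighting: boost terms occurring in relevant documents,
    demote terms occurring in non-relevant ones (alpha/beta/gamma illustrative)."""
    dims = len(query)

    def mean(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(xs) / len(vectors) for xs in zip(*vectors)]

    mr, mn = mean(relevant), mean(nonrelevant)
    # Negative weights are clamped to zero: a term cannot count "against" itself.
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, mr, mn)]

# Dimensions: t1 (query term), t2 (in a relevant doc), t3 (in a non-relevant doc)
refined = refine_query([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]])
```

The refined query has gained weight on the term from the relevant document (query expansion) and lost the would-be weight of the term from the non-relevant one.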
The future lies in hybrid solutions.
Hybrid documents: Most of the research has focused on textual documents, but as multimedia documents are integrated into information systems, the ability to process and retrieve images and sounds is becoming important. Multimedia querying is in its infancy and most existing systems rely on descriptions of the multimedia resources, extracting information from the documentary context, e.g. legend, surrounding text, etc. There exist experimental image retrieval systems using image recognition, classifiers, etc., exploiting image features (colours, shades, outlines, etc.). Even less developed, sound retrieval systems also extract features (rhythm, musical patterns, etc.) to develop similarity measures, matching systems, classifiers, etc.
Moreover, "many documents are in an intermediate or mixed form, largely unformatted, but including some formatted portions. This suggests that a two-stage retrieval process might be efficient - doing a rough retrieval based on the formatted portion of the data, then refining the ore generated by this process to locate the desired items" [Korfhage, 1997]. In hybrid systems where documents and structured information are used, the retrieval can concern both the knowledge contained in the documents and the indexing structure (e.g. "Find documents that contain the words w1,..., wn written by M. XYZ with the name of the editor"). Depending on the users, their profiles and contexts, hybrid systems will handle differently mixed documents and mixed data (blobs, large text fields, etc.). The available collections and their structure may themselves be semantically described to guide the choice of a searching algorithm for a given query and a given collection.
Hybrid systems: Information retrieval solutions span many research fields, such as linguistics, databases, logics, etc. Each of them has its advantages and drawbacks, so the trend is to mix them in hybrid solutions to get the best of each. There are several ways of coupling different systems. The first is to cascade methods of increasing complexity. For instance, some systems start with a coarse retrieval of candidate resources using statistical methods and shallow natural language extensions, then more sophisticated natural language processing tools are applied to the resulting list of resources retrieved by the first stage [Greengrass, 2000]. This technique allows quick filtering of information, refinement and focused passage retrieval, localising hot spots in a document where the interest is stronger. The other approach consists of merging results from parallel searches, as is done in meta-search engines. Fusion may be carried out between documents from different collections and/or using different search methods; in all cases different results must be compared to merge them into a single homogeneous result that can be presented to the user. Automatic adaptation of a method, or a collaboration of methods, may be achieved using, for instance, genetic algorithms based on feedback. The introduction of 'recommender' systems based on collaborative filtering has the advantage of working for any type of resource and of introducing a collaborative improvement of the search mechanisms. The problem of multiple collections and systems meets the problems of parallel and distributed systems, introducing new, hard complexity concerns.
Finally, as information systems become accepted and used in our society, they raise new problems of ethical nature, legal nature (copyright, privacy), security nature, etc.
"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."
— Lewis Carroll
Knowledge engineering is a broad research field whose overall issue is the acquisition and modelling of knowledge. Modelling knowledge consists in representing it in order to store it, communicate it or manipulate it externally. Automating this external manipulation leads to the design of knowledge-based systems i.e., systems whose behaviour relies on the symbolic manipulation of formal models of knowledge pieces in order to perform meaningful operations that simulate intelligent capabilities.
The representation step raises the problem of the form i.e., the choice of a representation formalism that allows one to capture the semantics at play in the considered pieces of knowledge. One approach that emerged in the late 1980s is based on the concept of ontologies. An ontology, as we shall see, is the part of the knowledge model that captures the semantics of the primitives used to make formal assertions about the application domain of the knowledge-based solution.
Thus, in this chapter, I shall focus on the branch of knowledge modelling that applies logic and develops ontologies to build computable models of some application domain. I shall divide my discussion into two large sections:
- the ontology object: focusing on the nature and the characteristics of the ontology object, its core notions and its lifecycle.
- the ontology engineering: focusing on the design rationale and the assisting tools.
"The word ontology comes from the Greek ontos for being and logos for word. It is a relatively new term in the long history of philosophy, introduced by the 19th century German philosophers to distinguish the study of being as such from the study of various kinds of beings in the natural sciences. The more traditional term is Aristotle's word category (kathgoria), which he used for classifying anything that can be said or predicated about anything." [Sowa, 2000b]
The word ontology can be used, and has been used, with very different meanings attached to it. Ironically, the ontology field has suffered a lot from ambiguity. The Knowledge Engineering community borrowed the term Ontology from the name of a branch of philosophy some 15 years ago and converted it into an object: an ontology. In the mid-90s, philosophers 'took it back' and began to clean up the definitions that had been adopted. In this part, I summarise some of the survey I wrote in [Gandon, 2002a]. I shall focus on the definitional and design aspects that were of direct interest to my PhD.
Ontology is a new object of Artificial Intelligence that has recently come to maturity and a powerful conceptual tool of Knowledge Modelling. It provides a coherent base to build on, and a shared reference to align with, in the form of a consensual conceptual vocabulary on which one can build descriptions and communication acts.
"People, organisations and software systems must communicate between and among themselves. However, due to different needs and background contexts, there can be widely varying viewpoints and assumptions regarding what is essentially the same subject matter. Each uses different jargon; each may have differing, overlapping and/or mismatched concepts, structures and methods" [Uschold and Gruninger, 1996]. Some of the consequences of a lack of a shared understanding are: poor communication, difficulties in identifying requirements and therefore, in specifying a system, limited inter-operability, limited potential of reusability and sharing, therefore, wasted efforts in re-inventing the wheel. There is a need to "reduce or eliminate conceptual and terminological confusion and come to a shared understanding. (...) the development and implementation of an explicit account of a shared understanding (i.e., an 'ontology') in a given subject area, can improve such communication, which in turn can give rise to greater reuse and sharing, inter-operability, and more reliable software" [Uschold and Gruninger, 1996]. An ontology is a unifying framework for different viewpoints and serves as the basis for enabling communication between people, between people and systems, between systems: this unifying conceptual framework is intended to function as a lingua-franca.
Ontologies are to semantics what grounding is to electronics: a common base to build on, and a shared reference to align with. Ontologies are considered a powerful tool to lift ambiguity: one of their main roles is to disambiguate by providing a semantic ground, a consensual conceptual vocabulary, on which one can build descriptions and communication acts. [Bachimont, 2001] explains that, on the one hand, ontologies provide notional resources to formulate and make knowledge explicit and, on the other hand, they constitute a shared framework that different actors can mobilise. Ontologies can represent the meaning of the different contents exchanged in information systems.
The more we develop intelligent information systems, the more the general knowledge about things and their categories appears to play a pivotal role in inferences. Therefore, this knowledge needs to be given to the machines if we want them to behave intelligently and intelligibly. To illustrate that point, I shall take an example of a problem where ontologies prove to be useful in the context of the work reported here: information retrieval. The general problem is to formulate a query over a mass of information and get an answer that is as precise and relevant as possible.
In her tutorial at ECAI 98, Asunción Gómez-Pérez asked the participants: "What is a pipe?". Extending her example, we can imagine four answers to this very same question, all given by a dictionary definition:
PIPE noun [C]. A short narrow tube with a small container at one end, used for smoking e.g. tobacco. || A long tube made of metal or plastic that is used to carry water or oil or gas. || A temporary section of computer memory that can link two different computer processes. || A simple musical instrument made of a short narrow tube which is played by blowing through it.
One term linked to four concepts is a case of ambiguity. The contrary, one concept denoted by several terms, is a case of synonymy.
These trivial cases pose a serious problem to computerised systems, which are unable to see these differences and equivalences unless they have been made explicit to them. Indeed, if we take the example of a typical user of the Altavista search engine looking for books by Ernest Hemingway, the commonly chosen keywords are "+book +hemingway". The search engine will encounter several types of problems:
- Noise: a problem of precision that will lead the search engine to collect a page with the sentence "The Old Book Pub, 3 Avenue Hemingway" while it is obvious to us that this is not relevant (unless one wants to drown one's sorrows of not having found the books).
- Missed answer: a problem of recall where the search engine misses a page containing a sentence such as "The novel 'The Old Man and The Sea' by Ernest Hemingway" because it does not know the basic categories of documents, and therefore, does not know that a novel is a book.
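The two failure modes above can be sketched with plain keyword matching in Python; the pages and the `keyword_search` helper are illustrative, not an actual search engine:

```python
# Hypothetical pages illustrating why plain keyword matching produces
# both noise and missed answers for the query "+book +hemingway".
pages = {
    "pub":   "The Old Book Pub, 3 Avenue Hemingway",
    "novel": "The novel 'The Old Man and The Sea' by Ernest Hemingway",
}

def keyword_search(query_terms, documents):
    """Return the ids of documents containing every query term (case-insensitive)."""
    hits = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        if all(term in words for term in query_terms):
            hits.append(doc_id)
    return hits

# "pub" is returned although irrelevant (noise), while "novel" is missed
# because the system does not know that a novel is a book.
results = keyword_search(["book", "hemingway"], pages)
```

Both problems stem from the same cause: the system manipulates terms, not the concepts behind them.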
If we look at the way a human answers a question, we may find interesting leads for solving the problem. Consider this short exchange between two persons:
"What is the last document you read ?" Rose asked.
"The article Gruber wrote on ontology in 1993." Olivier answered.
The answer Olivier gave is based on an organisation of concepts used for at least two purposes:
- Identification: the ability to recognise an object, an action, etc. as belonging to a category e.g. the ability to recognise an object as being a book and an action as being reading.
- Specialisation and generalisation: the ability to memorise abstractions of categories differentiated in hierarchies of specialisation/generalisation e.g.: "articles, books, and newspapers are documents", "novels are books", etc. These hierarchies are the basis of inferences at the heart of information retrieval and exchange e.g.: the syllogism "a novel is a book" and "a book is a document" therefore, "a novel is a document" so if I am looking for a document, a novel is a valid answer.
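The specialisation/generalisation inference behind this syllogism can be sketched in a few lines of Python; the hierarchy and the `is_a` helper are illustrative assumptions:

```python
# An assumed specialisation hierarchy: each category maps to its parent.
parent = {
    "novel": "book",
    "book": "document",
    "article": "document",
    "newspaper": "document",
}

def is_a(category, ancestor):
    """True if `category` is `ancestor` or transitively specialises it."""
    while category is not None:
        if category == ancestor:
            return True
        category = parent.get(category)
    return False

# The syllogism: a novel is a book, a book is a document,
# therefore a novel is a valid answer when a document is requested.
answer_is_valid = is_a("novel", "document")  # True
```

This transitive walk up the hierarchy is exactly the inference a keyword-based system lacks when it misses the Hemingway novel.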
This structure of categories is learnt through education and socio-cultural interactions. For instance, imagine the following naive story:
A family is on the road for holidays. The child sees a horse by the window, it is the first time he sees a horse.
"Look mum... it is a big dog !" The child says.
The mother looks and recognises a horse.
"No Tom, it is a horse... see it's much bigger !" The mother corrects.
The child adapts his categories and takes note of the differences he perceives or is told about, to differentiate these new categories from others. A few kilometres later the child sees a donkey for the first time.
"Look mum... another horse !" The child says.
The mother looks and recognises the donkey.
"No Tom, it is a donkey... see it's a little bit smaller, it is grey..." The mother patiently corrects.
And so on.
In these interactions, categories are learnt, exchanged and aligned. This will enable understanding in the future, when they are used for communication.
Thus this structure in hierarchical categories captures a consensus and is socially and culturally dependent. If there is a mismatch or a lack, an interaction takes place to align the two opinions or fill the gap as in the example of the child. The consensus is implicit: in the case of the interactions about the document, both speakers implicitly consider that they have a shared and consensual conceptualisation of the reality of documents. By answering with an article the second speaker considers that the first speaker knows that an article is a document.
This background knowledge is lacking in information systems relying only on terms and plain-text search. A possible approach is thus to make this knowledge explicit and capture it in logical structures that can be exploited by automated systems. This is exactly the purpose of an ontology: to capture the semantics and relations of the notions we use, make them explicit and eventually encode them in symbolic systems so that they can be manipulated and exchanged.
I shall give here some of the latest definitions proposed in the knowledge engineering community and adopted in this work. I added personal definitions and dictionary definitions of notions commonly used in the field.
something formed in the mind, a constituent of thought; it is used to structure knowledge and perceptions of the world. || an idea, a principle, which can be semantically valued and communicated.
notion usually expressed by a term (or more generally by a sign) || a concept represents a group of objects or beings sharing characteristics that enable us to recognise them as forming and belonging to this group.
notion of an association or a link between concepts usually expressed by a term or a graphical convention (or more generally by a sign)
extension / intension
distinction between ways in which a notion may be regarded: its extension is the collection of things to which the notion applies; its intension is the set of features those things are presumed to have in common. There exists a duality between intension and extension: to included intensions I1 ⊆ I2 correspond included extensions E1 ⊇ E2.
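This duality can be illustrated with a small Python sketch, where intensions are sets of features and extensions are computed over an assumed collection of objects:

```python
# Objects described by their features (an assumed mini-world).
objects = {
    "mx5":   {"vehicle", "engine", "four-wheels"},
    "truck": {"vehicle", "engine", "four-wheels", "cargo-bed"},
    "bike":  {"vehicle", "two-wheels"},
}

def extension(intension):
    """All objects whose features include every feature of the intension."""
    return {name for name, feats in objects.items() if intension <= feats}

i1 = {"vehicle"}            # I1
i2 = {"vehicle", "engine"}  # I2, with I1 included in I2
assert i1 <= i2
# The duality: to the included intensions correspond reversed extensions.
assert extension(i2) <= extension(i1)
```

Enlarging an intension (adding defining features) mechanically shrinks the corresponding extension.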
concept in intension / intension of a concept
set of attributes, characteristics or properties shared by the object or beings included in or to which the concept applies.
e.g. for the concept of a car the intension includes the characteristics of a road vehicle with an engine, usually four wheels and seating for between one and six people.
concept in extension / extension of a concept
set of objects or beings included in or to which the concept applies.
e.g. for the concept of a car the extension includes: the Mazda MX5 with the registration 2561 SH 45, the green car parked at the corner of the road in front of my office, etc.
relation in intension / intension of a relation
set of attributes, characteristics or properties that characterises every realisation of a relation.
e.g. for the relation parenthood the intension includes the characteristics of the raising of children and all the responsibilities and activities that are involved in it.
signature of a relation
set of concepts that can be linked by a relation; this constraint is a characteristic of the relation that participates in the definition of its intension.
e.g. for the relation parenthood the signature says it is a relation between two members of the same species.
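Such a signature constraint can be sketched as a simple admissibility check; the `species` table and the `check_signature` helper are hypothetical:

```python
# A hypothetical species table and a signature check for the
# parenthood relation (same-species constraint).
species = {"jina": "human", "jim": "human", "rex": "dog"}

def check_signature(relation, x, y):
    """Accept a parenthood assertion only between beings of the same species."""
    if relation == "parenthood":
        sx, sy = species.get(x), species.get(y)
        return sx is not None and sx == sy
    return True

ok  = check_signature("parenthood", "jina", "jim")  # True
bad = check_signature("parenthood", "jina", "rex")  # False
```

A knowledge base can run such checks to reject assertions that violate the intension of a relation.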
relation in extension / extension of a relation
set of effective realisations of a relation between objects or beings.
e.g. for the relation parenthood the extension includes: Jina and Toms are the Parents of Jim, Mr Michel Gandon is my Father, etc.
that branch of philosophy which deals with the nature and the organisation of reality [Guarino and Giaretta, 1995]. || a branch of metaphysics which investigates the nature and essential properties and relations of all beings as such.
the systematic, formal, axiomatic development of the logic of all forms and modes of being [Guarino and Giaretta, 1995].
an intensional semantic structure which encodes the implicit rules constraining the structure of a piece of reality [Guarino and Giaretta, 1995] || the action of building such a structure.
a logical theory which gives an explicit, partial account of a conceptualisation [Guarino and Giaretta, 1995] (based on [Gruber, 1993]); the aim of ontologies is to define which primitives, provided with their associated semantics, are necessary for knowledge representation in a given context. [Bachimont, 2000]
a partial semantic account of the intended conceptualisation of a logical theory [Guarino and Giaretta, 1995] || practically, an agreement to use a vocabulary (i.e., ask queries and make assertions) in a way that is consistent with respect to the theory that specifies the ontology. Software pieces are built so that they commit to ontologies and ontologies are designed so that they enable us to share knowledge with and among these software pieces. [Uschold and Gruninger, 1996]
a set of formulas intended to be always true according to a certain conceptualisation [Guarino and Giaretta, 1995].
the branch of knowledge engineering which exploits the principles of (formal) Ontology to build ontologies [Guarino and Giaretta, 1995]. || defining an ontology is a modelling task based on the linguistic expression of knowledge. [Bachimont, 2000]
a person who builds ontologies or whose job is connected with the science or engineering of ontologies.
state of affairs
the general state of things, the combination of circumstances at a given time. The ontology can provide the conceptual vocabulary to describe a state of affairs. Together, this description and the state of affairs form a model.
a classification based on similarities.
The study of part-whole relationships.
a classification based on the part-of relation.
In order to express and communicate an intension, we choose a symbolic representation, e.g. the different definitions associated with a term and given by a dictionary. Note that the exemplification and illustration used in dictionaries show that it is sometimes necessary to clarify a definition in natural language by producing a representative sample of the extension (i.e., examples) or by using other means of representation (e.g. a picture).
To exemplify these definitions, we can reuse the well-known situation of cubes on a table. Figure 4 shows a schema depicting the real scene of three cubes being arranged on a table. A conceptual vocabulary (toy ontology) is proposed to talk about some aspects of this reality (some of them are ignored, for instance there is no vocabulary to express the dimensions of the cubes). Finally, the state of affairs of the scene observed is described using the primitives of the ontology.
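Although Figure 4 is not reproduced here, such a description can be sketched as a set of triples over an assumed toy vocabulary; the cube names and the 'on' relation are illustrative:

```python
# Toy ontology vocabulary (illustrative names).
ontology_concepts = {"Cube", "Table"}
ontology_relations = {"on"}

# Types of the entities recognised in the scene.
types = {"cubeA": "Cube", "cubeB": "Cube", "cubeC": "Cube", "table1": "Table"}

# The state of affairs, described only with the primitives the ontology
# provides; dimensions are ignored because no vocabulary covers them.
state_of_affairs = [
    ("cubeA", "on", "cubeB"),
    ("cubeB", "on", "table1"),
    ("cubeC", "on", "table1"),
]

# Every assertion uses a declared relation between typed entities:
# together, this description and the scene form a model.
assert all(rel in ontology_relations and subj in types and obj in types
           for subj, rel, obj in state_of_affairs)
```

Whatever the toy ontology cannot name (the cubes' dimensions, material, colour) simply cannot appear in the description.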
This example illustrates the definition of an ontology. The ontology is...
- explicit: the column b makes explicit the concepts used here;
- partial: some aspects were overlooked, e.g. the dimensions of the cubes, their material, etc.;
- an account of a conceptualisation: in this reality we recognise some entities (cube, table, etc.) and some rules (spatial relations, geometry, labelling of cubes, etc.).
It is important to stress that nothing in the original definition of an ontology used in knowledge engineering (i.e., "an ontology is a specification of a conceptualisation" [Gruber, 1993]), nor in the definition we gave above, obliges the ontologist to use a formal language to make the ontology explicit. The representations of intensions can be organised, structured and constrained to express a logical theory accounting for the relations existing between concepts; an ontology is an object capturing this theory. The final representation of the intensions and the ontological structure can make use of more or less formal languages, depending on the intended use of the ontology. An automated exploitation of an ontology by an artificial system will most probably imply some formalisation of chosen aspects of the ontology to enable their formal manipulation. The formal expression of an intension provides a precise and unambiguous representation of the meaning of a concept; it allows software to manipulate and use it, together with the models whose representation is based on the primitives provided by the ontology. [Sowa, 2000b] distinguishes between a terminological ontology and a formal ontology. They are the two extremes of a continuum: as more axioms are added to a terminological ontology, it may evolve into a formal or axiomatised ontology.
Symbols are at the base of our natural language and therefore are the most commonly used means for communicating concepts. They are also used to build the artificial symbolic systems at the base of automation; the most advanced artificial symbolic systems are logical systems and other derived knowledge representation languages of all sorts. Thus ontologies are massive collections of Peirce's three kinds of signs: icons, which show the form of something; indices, which point to something; and symbols, which represent something according to some convention [Sowa, 2000b]. For instance, the concept of 'fire' can be represented by signs such as those in Figure 5:
However, bare symbolic systems are ontologically neutral. [Sowa, 2000b] explained that an uninterpreted logic "imposes no constraints on the subject matter or the way the subject is characterised. By itself, logic says nothing about anything, but the combination of logic with an ontology provides a language that can express relationships about the entities in the domain of interest."
Sowa gives the following example, represented here in logic and in conceptual graphs [Sowa, 1984]. The fact represented is "Tom the cat is chasing a mouse".
(∃x: Cat) (∃y: Chase) (∃z: Mouse) (Identifier(x, "Tom") ∧ Agent(y, x) ∧ Theme(y, z))
This formula and the associated conceptual graph introduce several ontological assumptions because they suppose that there exist entities of types Cat, Chase, and Mouse; that some entities have character strings as names; and that Chase can be linked to two concepts of other entities by relations of type Agent and Theme, etc. [Sowa, 2000b]
Now, to show that the logic does not capture the meaning of the primitives and that we interpret the meaning of the primitives because they are words of our natural language, let us consider the following formula and graph:
(∃x: Abc) (∃y: Jkl) (∃z: Pqr) (Stu(x, "def") ∧ Ghi(y, x) ∧ Mno(y, z))
It is logically equivalent to the formula for "Tom the cat is chasing a mouse", but no interpretation is possible because the primitives have lost the ontological meaning we were able to find through natural language. This very same formula could mean "The man Joe is writing a document":
(∃x: Man) (∃y: Write) (∃z: Document) (Identifier(x, "Joe") ∧ Agent(y, x) ∧ Theme(y, z))
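The structural equivalence of the three formulas can be sketched in Python by stripping the names away and comparing what remains; the encoding as predicate/argument pairs is an illustrative simplification, not conceptual graphs:

```python
# Each formula encoded as predicate/argument pairs (illustrative encoding).
cat_chase = [("Cat", ("x",)), ("Chase", ("y",)), ("Mouse", ("z",)),
             ("Identifier", ("x", "Tom")), ("Agent", ("y", "x")), ("Theme", ("y", "z"))]
neutral   = [("Abc", ("x",)), ("Jkl", ("y",)), ("Pqr", ("z",)),
             ("Stu", ("x", "def")), ("Ghi", ("y", "x")), ("Mno", ("y", "z"))]
man_write = [("Man", ("x",)), ("Write", ("y",)), ("Document", ("z",)),
             ("Identifier", ("x", "Joe")), ("Agent", ("y", "x")), ("Theme", ("y", "z"))]

def shape(formula):
    """The variable pattern of a formula once predicate names and constants
    are erased: all that logic by itself can see."""
    variables = {"x", "y", "z"}
    return [tuple(a if a in variables else "_" for a in args)
            for _, args in formula]

# The three formulas are indistinguishable once the names are removed.
assert shape(cat_chase) == shape(neutral) == shape(man_write)
```

The meaning lives entirely in the names; that is precisely what the ontology must capture and the bare logic cannot.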
By maintaining human-understandable representations of the notions, the ontology captures the mapping between the symbolic system used for artificial intelligence manipulation and the observations of the real world, viewed from the perspective of an adopted conceptualisation. Through this mapping, the ontologist intends to fix an advocated consensual interpretation for the users of the system.
Now, what is the place of ontology in knowledge? If we consider the sentence "children are young humans", the context is not clear, yet this is perfectly intelligible and perfectly usable; this is knowledge. In fact, the context here is general: it is knowledge about a category of things and not about a specific occurrence of a concept. Thus this knowledge is universally verified in human culture. This is typically an ontological piece of knowledge.
Taxonomy is one of the possible structures for an ontology; it is a form of logical theory. Its importance comes from the fact that it supports elementary inferences constantly at play in information searching and communication processes: identification and generalisation/specialisation. The previous example of ontological knowledge, "children are young humans", could be positioned in a taxonomy such as the one given in Figure 6.
But 'ontology' is not a synonym of 'taxonomy'. Other logical theories are useful to capture the definitional characteristics of the concepts we manipulate. For instance, the concepts of chemical compounds make extensive use of partonomies in their definitions: water (H2O) and the hydroxyl group (-OH) contain hydrogen (H) and oxygen (O); alcohols (methanol CH3-OH, ethanol C2H5-OH, etc.) contain a hydroxyl group, hydrogen and carbon; etc. An example of such a structure is given in Figure 7.
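A partonomy and the transitive containment inference it supports can be sketched as follows; the part-of table is a simplified illustration:

```python
# A simplified partonomy: each compound maps to its direct parts.
parts = {
    "water":    {"hydrogen", "oxygen"},
    "hydroxyl": {"hydrogen", "oxygen"},
    "methanol": {"hydroxyl", "carbon", "hydrogen"},
    "ethanol":  {"hydroxyl", "carbon", "hydrogen"},
}

def contains(whole, part):
    """True if `part` occurs anywhere in the transitive part-of closure of `whole`."""
    direct = parts.get(whole, set())
    return part in direct or any(contains(p, part) for p in direct)

has_oxygen = contains("ethanol", "oxygen")  # True, via the hydroxyl group
```

Like subsumption in a taxonomy, the part-of relation supports transitive queries; it is simply a different logical theory over the same concepts.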
Partonomies are not the unique alternative to taxonomies. Further logical formalisations can be done especially to enable further inference capabilities such as automatic classification of new categories or identification of objects e.g.:
director(x) := person(x) ∧ (∃y organisation(y) ∧ manage(x, y))
Or to capture causal models e.g.:
(salty things ⇒ thirst) ∧ (thirst ⇒ to drink) thus (salty things ⇒ to drink)
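Both kinds of inference can be sketched as simple forward chaining in Python; the facts about 'ann' and 'acacia' are illustrative names, not part of the source:

```python
# Illustrative facts (ann and acacia are made-up names).
facts = {("person", "ann"), ("organisation", "acacia"),
         ("manage", "ann", "acacia")}

def classify_directors(facts):
    """Apply: director(x) if person(x) and x manages some organisation y."""
    inferred = set()
    for fact in facts:
        if fact[0] == "manage":
            _, x, y = fact
            if ("person", x) in facts and ("organisation", y) in facts:
                inferred.add(("director", x))
    return inferred

# Causal links and their transitive closure.
causes = {("salty things", "thirst"), ("thirst", "to drink")}

def leads_to(a, b, links):
    """True if b is reachable from a through the causal links."""
    return (a, b) in links or any(leads_to(m, b, links)
                                  for (x, m) in links if x == a)

directors = classify_directors(facts)            # contains ("director", "ann")
chain = leads_to("salty things", "to drink", causes)  # True
```

The first function realises the automatic classification enabled by a concept definition; the second chains causal implications transitively.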
Sometimes instances are included in ontologies; they could be called universal instances. For instance, constants (e.g. c the speed of light, g the gravitational constant, ...) or global objects (e.g. the activity "research") are included to enable a unique reference to the object. But this is not the real purpose of ontologies: ontologies concentrate on universal concepts and reify them if the need arises to use these concepts as objects of the discourse.
The OntoWeb European network has built a web application in order to allow the community to register their ontologies, methodologies, tools and languages for building ontologies, as well as applications in areas like: the semantic web, e-commerce, knowledge management, natural language processing, etc.
To illustrate the large number of application domains of ontologies, I shall give in the first of the following sub-sections an overview of the variety of domains in which ontologies have been built; to do so I used the online report mentioned above. In a second sub-section, I shall focus on the ontologies that are close to my subject, i.e., organisations and knowledge management.
As we shall see, ontologies vary a lot in their form, exhaustivity, specificity and granularity. Thus some of the resources mentioned here are thesauri while others are formal logical theories.
- AAT: Art & Architecture Thesaurus to describe art, architecture, decorative arts, material culture, and archival materials.
- Airport Codes: simple ontology containing Airport codes around the world.
- ASBRU: provides an ontology of guideline-support tasks and problem-solving methods in order to represent and annotate clinical guidelines in a standardised form.
- Bibliographic Ontology: a sample bibliographic ontology, with data types taken from ISO standard.
- FIPA Agent Communication Language: contains an ontology describing speech acts for communication between artificial agents.
- CHEMICALS: Ontology containing knowledge within the domain of chemical elements and crystalline structures.
- CoreLex: an ontology for lexical semantic database and tagset for nouns, organised around systematic polysemy and underspecification.
- EngMath: mathematics engineering ontologies including ontologies for scalar quantities, vector quantities, and unary scalar functions.
- Gene Ontology: A dynamic controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. The three organising principles of this ontology are molecular function, biological process and cellular component.
- Gentology: genealogy ontology for data interchange between different applications.
- Knowledge Representation Ontology: from the book Knowledge Representation by John F. Sowa, this ontology proposes a top level for knowledge representation based on basic categories and distinctions that have been derived from a variety of sources in logic, linguistics, philosophy, and artificial intelligence.
- Open Cyc: an upper ontology for all of human consensus reality i.e., 6000 concepts of common knowledge.
- PLANET: Planet is a reusable ontology for representing plans that is designed to accommodate a diverse range of real-world plans, both manually and automatically created.
- ProPer: ontology to manage skills and competencies of people.
- SurveyOntology: ontology used to describe large questionnaires; it was developed for clients at the census bureau and department of labour.
- UMDL Ontology: ontology for describing digital library content.
- UMLS: the Unified Medical Language System provides a biomedical vocabulary drawn from disparate sources such as clinical terminologies, drug sources and vocabularies in different languages.
This shortened list of some existing ontologies clearly shows that the application of ontologies covers a broad range of domains and that ontology engineering has already produced a varied set of ontologies that is growing every day.
Some ontologies have been studied in this work and were partially reused or influenced our choices; three of them are very well known:
- The Dublin Core Element Set Ontology: an ontology of interoperable metadata standards and specialised metadata vocabularies for describing resources, enabling more intelligent information discovery systems.
- TOVE: the goal of the TOronto Virtual Enterprise project is to create a data model that provides a shared terminology for the enterprise, defines the meaning of each term in as precise and unambiguous a manner as possible, implements the semantics in a set of axioms, and defines a symbology for depicting the concepts.
In addition to the above ontologies, some existing ontology-based systems influenced my work; they will be presented in chapter 3.
When an ontology participates in knowledge modelling, as in the case of a corporate memory, it becomes a full component of the model. It is also subject to the model's lifecycle, since evolutions of the modelling needs may imply evolutions of the modelling primitives provided by the ontology. Therefore, the lifecycle of a corporate memory depicted in Figure 2 (page 33) also applies to an ontology included in that memory.
The design of an ontology is an iterative maturation process. Through iterative design and refinement, we augment the ontology, developing the formal counterparts of the semantic aspects relevant for the system in the application scenarios. This means the ontology will come to full development, becoming mature, by evolving through intermediate states to reach a desired state.
As soon as the ontology becomes large, the ontology engineering process has to be considered as a project, and therefore, project management methods must be applied. [Fernandez et al., 1997] recognised that planning and specification are important activities. The authors give the activities to be done during the ontology development process (Figure 8): planning, specifying, acquiring knowledge, conceptualising, formalising, integrating, implementing, evaluating, documenting, and maintaining.
[Fernandez et al., 1997] criticise the waterfall and incremental lifecycles and propose to use the evolving prototype lifecycle, which lets the ontologist modify, add, and remove definitions in the ontology at any time if some definition is missing or wrong. Knowledge acquisition, documentation and evaluation are support activities carried out throughout these stages. When applied to a changing domain, the ontology will have to evolve. At any time, someone can ask for notions of the ontology to be added or modified. Maintaining the ontology is thus an important activity to be carried out carefully.
Ontology design is a project, and should be treated as such, especially when it becomes large. Project Management and software engineering techniques and guidelines should be adapted and applied to ontology engineering. [Fernandez et al., 1997] stress that before building your ontology, you should plan the main tasks to be done, how they will be arranged, how much time you need to perform them and with which resources (people, software and hardware).
Merging the two lifecycle diagrams of Figure 2 and Figure 8, I propose the cycle depicted in Figure 9. The Design & Build and Evolution activities are very close; they both include the activities of specification, conceptualisation, formalisation and implementation, relying on collected data (knowledge acquisition or integration of all or part of existing ontologies). Evaluation and Evolution are respectively the activities triggering and carrying out maintenance. Evaluation uses techniques similar to those of needs detection.
Even if a full investigation of the complete lifecycle is out of the scope of this document, it must be noted that ontology maintenance has consequences beyond the ontology lifecycle. It impacts everything that was built using the ontology. Software in which the ontology is hardwired has to be versioned, the coherence of knowledge bases has to be maintained, etc. Therefore, although the problem of ontology evolution is already a complex one in itself, one should also consider the fact that ontologies provide building blocks for modelling and implementation. What happens to the elements that were built with these building blocks when a change occurs in the ontology (e.g. adding, deleting or modifying a concept, modifying the hierarchy, adding, deleting or modifying a relation, etc.)?
The activities described in the rest of this chapter belong to the design and build activities, because this is where the major part of the research work is situated.
The art of ranking things in genera and species is of no small importance
and very much assists our judgement as well as our memory.
This helps one not merely to retain things, but also to find them.
And those who have laid out all sorts of notions under certain headings
or categories have done something very useful
— Gottfried Wilhelm von Leibniz
[Mizoguchi et al., 1997] explained in one sentence the challenge the ontology engineering field must face: "Most of the conventional software is built with an implicit conceptualisation. The new generation of AI systems should be built based on a conceptualisation represented explicitly". In fact, by making at least some aspects of our conceptualisations explicit to the systems, we can improve their behaviour through inferences exploiting this explicit partial conceptualisation of our reality. "Ontology in philosophy contributes to understanding of the existence. While it is acceptable as science, its contribution to engineering is not enough, it is not for ontology engineering which has to demonstrate the practical utility of ontology. It is true that every software has an ontology in itself and every president of a company has his/her own ontology of enterprise. But, such an ontology is implicit. An explicit representation of ontology is critical to our purpose of making computers 'intelligent'. (...) the ultimate purpose of ontology engineering is: 'To provide a basis of building models of all things, in which information science is interested, in the world'." [Mizoguchi et al., 1997]
Since the scientific discipline of Ontology is evolving towards an engineering discipline, it does need principled methodologies [Guarino and Welty, 2000]. In the following part I shall investigate the activities involved in designing an ontology and participating in the ontology lifecycle. Several guidelines and methods have been proposed to design ontologies. I shall try to give an overview of the different options proposed so far and to reconcile the different contributions where possible.
To see a world in a grain of sand
And a heaven in a wild flower,
Hold infinity in the palm of your hand
And eternity in an hour.
— William Blake
One should not start the development of an ontology without knowing its purpose and scope [Fernandez et al., 1997]. In order to identify these goals and limits, one has to clearly state why the ontology is being built, what its intended uses are and who the stakeholders are [Uschold and Gruninger, 1996]. Then, one should use the answers to write a requirements specification document.
Specification will give the scope and granularity of the ontology: a notion has to be included or detailed if and only if doing so answers a specified need. Inspired by [Charlet et al., 2000], here is an example of variation in granularity around the concepts of man and woman:
- man < human / woman < human: we know there exist two different concepts, man and woman, that descend from the concept human.
- man := human - characteristic - male / woman := human - characteristic - female: the two different concepts man and woman still descend from the concept human, but they are now known to be different because of a characteristic of being male or female.
- man := human - attribute - sex - value - male / woman := human - attribute - sex - value - female: the two different concepts man and woman still descend from the concept human with a differentiating characteristic of being male or female, but we now know that this characteristic is the value of the sex attribute.
- and so on.
This example is set in the context of a medical application; depending on the requirements specification, a level of granularity may be insufficient (hampering the abilities of the system) or superfluous (costing resources for no reason).
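The three granularity levels above can be sketched as increasingly detailed machine-readable definitions; the dictionary encoding is an illustrative assumption:

```python
# Level 1: two distinct concepts under human, nothing differentiates them.
level_1 = {"man":   {"is_a": "human"},
           "woman": {"is_a": "human"}}

# Level 2: an opaque differentiating characteristic.
level_2 = {"man":   {"is_a": "human", "characteristic": "male"},
           "woman": {"is_a": "human", "characteristic": "female"}}

# Level 3: the characteristic is identified as the value of a sex attribute.
level_3 = {"man":   {"is_a": "human", "attributes": {"sex": "male"}},
           "woman": {"is_a": "human", "attributes": {"sex": "female"}}}

# At level 1 the two representations are identical; only from level 2
# onward can a system tell the concepts apart, and only at level 3 can it
# name the attribute that differentiates them.
assert level_1["man"] == level_1["woman"]
assert level_3["man"]["attributes"]["sex"] != level_3["woman"]["attributes"]["sex"]
```

Each added level costs modelling effort, which is only justified when a specified need requires it.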
Adapting the characteristics of indexes given in information retrieval [Korfhage, 1997] and the notion of granularity illustrated above, I retain three characteristics of the scope of an ontology:
- exhaustivity: breadth of coverage of the ontology, i.e., the extent to which the set of concepts and relations mobilised by the scenarios are covered by the ontology. Beware, a shallow ontology (e.g. one concept 'entity' and one relation 'in relation with') can be exhaustive.
- specificity: depth of coverage of the ontology i.e., the extent to which specific concept and relation types are precisely identified. The example given for exhaustivity had a very low specificity; an ontology containing exactly 'German shepherd', 'poodle' and 'labrador' may be very specific, but if the scenario concerns all dogs then its exhaustivity is very poor.
- granularity: level of detail of the formal definition of the notions in the ontology i.e., the extent to which concept and relation types are precisely defined with formal primitives. An ontology relying only on subsumption hierarchies has a very low granularity while an ontology in which the notions systematically have a detailed formal definition based on the other ontological primitives has a very high granularity (cf. previous example in the context of a medical application).
Three characteristics of the scope of an ontology are its exhaustivity, specificity and granularity. Exhaustivity is the breadth of coverage of the ontology i.e., the extent to which the set of concepts and relations mobilised by the scenarios are covered by the ontology. Specificity is the depth of coverage of the ontology i.e., the extent to which specific concept and relation types are precisely identified. Granularity is the level of detail of the formal definition of the notions in the ontology i.e., the extent to which concept and relation types are precisely defined with formal primitives.
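As a back-of-the-envelope illustration (my own, not a measure from the literature), the first two characteristics can be approximated as coverage ratios between the notions mobilised by the scenarios and those provided by the ontology. The function names and the crude proxy for specificity are assumptions made for the sketch.

```python
# Hypothetical sketch: exhaustivity and specificity as naive coverage ratios.
def exhaustivity(scenario_notions, ontology_notions):
    """Fraction of the notions mobilised by the scenarios that the ontology
    covers (breadth of coverage)."""
    needed = set(scenario_notions)
    return len(needed & set(ontology_notions)) / len(needed)

def specificity(scenario_notions, ontology_notions):
    """Crude proxy for depth of coverage: fraction of ontology notions that
    are precise enough to appear in the scenarios themselves."""
    provided = set(ontology_notions)
    return len(provided & set(scenario_notions)) / len(provided)

# A shallow ontology can be exhaustive without being specific:
scenario = {"dog", "cat"}
shallow = {"dog", "cat", "entity"}
print(exhaustivity(scenario, shallow))  # 1.0
print(specificity(scenario, shallow))   # ~0.67
```

This mirrors the remark above: the shallow ontology covers every scenario notion (full exhaustivity) even though part of it stays too generic.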
An interesting technique to capture the application requirements in context is that of scenario analysis, as presented for example in [Caroll, 1997] and used in software engineering. Scenarios are used as the entry point into the project; they are usually information-rich stories capturing problems and wishes. [Uschold and Gruninger, 1996] use the notion of motivating scenarios: "The development of ontologies is motivated by scenarios that arise in the applications (...). The motivating scenarios are story problems or examples which are not adequately addressed by existing ontologies. A motivating scenario also provides a set of intuitively possible solutions to the scenario problems. These solutions provide an informal intended semantics for the objects and relations that will later be included in the ontology. Any proposal for a new ontology or extension to an ontology should describe one or more motivating scenarios, and the set of intended solutions to the problems presented in the scenarios. (…) By providing a scenario, we can understand the motivation for the prior ontology in terms of its applications."
[Caroll, 1997] proposes to base the system design activity upon scenario descriptions. Scenarios are a relevant medium for representing, analysing and planning how a system might impact its stakeholders' activities and experiences.
"The defining property of a scenario is that it projects a concrete narrative description of activity that the user engages in when performing a specific task, a description sufficiently detailed so that design implications can be inferred and reasoned about. Using scenarios in system development helps keep the future use of the envisioned system in view as the system is designed and implemented; it makes use concrete (which makes it easier to discuss use and design use). (...) Scenarios seek to be concrete; they focus on describing particular instances of use, and on user's view of what happens, how it happens, and why. Scenarios are grounded in the work activities of prospective users; the work users do drives the development of the system intended to augment this work. Thus scenarios are often open-ended and fragmentary; they help developers and users pose new questions, question new answers, open up possibilities. It is not a problem if one scenario encompasses, extends, or depends upon another; such relations may reveal important aspects of use. They can be informal and rough, since users as well as developers may create and use them, they should be as colloquial, and as accessible as possible. They help developers and their users envision the outcomes of design -an integrated description of what the system will do and how it will do it- and thereby better manage and control these outcomes." [Caroll, 1997]
Scenarios have the advantage of enabling communication in natural language while capturing situations, contexts, stakeholders, problems and solutions with their associated vocabulary. [Uschold and Gruninger, 1996] exploit the scenarios to introduce informal competency questions. These are queries arising in the scenarios and placing expressiveness requirements on the envisioned ontology: the ontology must be able to represent the informal competency questions and characterise their answers. "The competency questions specify the requirements for an ontology and as such are the mechanism for characterising the ontology design search space. The questions serve as constraints on what the ontology can be, rather than determining a particular design with its corresponding ontological commitments. There is no single ontology associated with a set of competency questions." [Uschold and Gruninger, 1996]
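The idea of competency questions as constraints can be sketched as executable checks: each question mobilises a vocabulary, and the ontology must at least cover that vocabulary to be able to represent the question. The toy ontology encoding and every name below are hypothetical, introduced only for this sketch.

```python
# Hedged sketch: competency questions as design constraints on an ontology.
# The dictionary encoding and all concept/relation names are hypothetical.
ontology = {
    "concepts": {"document", "report", "person", "author"},
    "relations": {"writtenBy", "partOf"},
}

def can_represent(question_concepts, question_relations, onto):
    """True if every notion mobilised by the question exists in the ontology,
    i.e. the ontology can at least represent the question."""
    return (question_concepts <= onto["concepts"]
            and question_relations <= onto["relations"])

# "Who is the author of this report?" mobilises {report, author} and {writtenBy}.
print(can_represent({"report", "author"}, {"writtenBy"}, ontology))  # True
# "Which project funded this report?" fails: 'project' is not covered yet.
print(can_represent({"report", "project"}, {"fundedBy"}, ontology))  # False
```

As the quote stresses, such checks constrain the design space without determining a single ontology: many ontologies could pass the same set of questions.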
Scenario analysis, like many other activities in ontology engineering, is not a one-off activity but is pursued during the whole ontology design process and lifecycle. New scenarios will arise, and existing scenarios will be refined. [Fernandez et al., 1997] also note that inspecting glossary terms (lexicons), without looking into the details of the definitions, can help at that stage to define the scope of the ontology. We shall see that this is closely related to the middle-out perspective of ontology engineering.
Data collection or knowledge acquisition is a collection-analysis cycle where the result of a collection is analysed and this analysis triggers new collections. Several elicitation techniques exist for data collection and benefit from two decades of work in the knowledge acquisition community (see, for example, [Dieng, 1990], [Dieng, 1993] and [Dieng et al., 1998] as well as [Sebillotte,
The elicitation techniques are typically associated with bottom-up approaches. They may also be used in a top-down approach guided by models, such as those of CommonKADS [Breuker and Van de Velde, 1994], to elicit the data required by the models. In fact, as noted in [Fernandez et al., 1997], they influence the whole process of engineering and maintenance, and different techniques may be useful at different stages of the process. A given source can be exploited in one or more of the perspectives adopted for ontology engineering, depending on the nature of the source. For example, interviews lend themselves to bottom-up and middle-out perspectives, whereas integrating existing ontologies leads to top-down and middle-out perspectives.
[Uschold and Gruninger, 1996] used brainstorming sessions to produce all potentially relevant terms and phrases; at this stage the terms alone represent the concepts, thus concealing significant ambiguities and differences of opinion.
[Fernandez et al., 1997] used:
- Non-structured interviews: with experts, to build a preliminary draft of the requirements specification document.
- Informal text analysis: to study the main concepts given in books and handbooks. This study enables the designer to fill in the set of intermediate representations of the conceptualisation.
- Formal text analysis: The first thing to do is to identify the structures to be detected (definition, affirmation, etc.) and the kind of knowledge contributed by each one (concepts, attributes, values, and relationships).
- Structured interviews: with experts to get specific and detailed knowledge about concepts, their properties and their relationships, to evaluate the conceptual model once the conceptualisation activity has been finished, and to evaluate implementation.
It should be noted that among the documents studied during data collection, there can be existing terminologies or even ontologies collected from the domain. This then leads to the process of integrating other ontologies into the ontology being built.
Data collection is not only a source of raw material: for example, interviews with experts may help to build concept classification trees and to contrast them against figures given in books [Fernandez et al., 1997].
Knowledge acquisition and modelling is a dialog and a joint construction work with the stakeholders (users, sleeping partners, managers, providers, administrators, customers, etc.). They must be involved in the process of ontology engineering, and for these reasons semi-formal/natural language views (scenarios, tables, lists, informal figures) of the ontology must be available at any stage of the ontology lifecycle to enable interaction between and with the stakeholders.
Data collection is a goal-driven process. People in charge of the data collection always have an idea of what they are looking for and what they want to do with the collected data. It is essential to consider the end product that one desires right from the start (scenarios, models, ontologies…) and from that to derive what information should be identified and extracted during the data collection. The models will also provide facets, views or points of view that are useful to manage and modularise the collection products: organisational view, document view...
Interviews: Interviews can be individual or collective, they can involve a large or small number of people, and they can be one-off interviews or repeated interviews. Depending on these circumstances the techniques and the organisation of the interviews can vary a lot. In principle an interview can be extremely structured (interrogation) or completely free (spontaneous expression), but the most common form is the semi-structured interview, which can sit anywhere on the continuum between these two extremes. A classic plan for a semi-structured interview is:
1. Opening discussion: This first part is typically unstructured. To initiate the dialogue, the first questions have to be very general and very broad. A good subject to start with is the tasks and roles of the people interviewed. Interviewees are encouraged to speak until a long silence sets in. If need be, spontaneous expression can be kept running using the journalists' short questions (why? how? when? who? where? with what? anything else?).
2. Flashback and clarification: Once a terminal silence is reached ("-anything else? … -Hum no!"), a more structured part begins. This requires taking notes during the first part and building some informal personal representation, so as to be able to identify points and subjects to be clarified or detailed through flashback questions. It is also in this part that prepared questions, targeting information that is sought but was not spontaneously provided in the first part, can be asked (e.g. strategic aspects detected in the scenarios).
3. Self-synthesis: Finally, a good idea is to ask the interviewees to synthesise, summarise or analyse themselves what they said during the interview, and to conclude.
As the collection goes on, interviews tend to be more and more structured, since the collection tends to be more and more focused.
Observations: As with interviews, observation comes with a broad range of options. It can sit anywhere on the continuum between observing someone in their everyday activity without their knowledge (not very ethical, but an extremely pure and natural view) and asking someone to simulate their activity in front of you in a reconstruction fashion (much more acceptable, but it may skew or distort some real issues). The observation can concern people and the way they work (with or without making them comment), on a real task or on a simulated scenario, in real time, recorded, or based upon traces of the activity. It can also be focused on very specific aspects (documents manipulated, desk organisation, acquaintance network, etc.). Depending on the actor, the situations worth observing may differ greatly.
Document Analysis: [Aussenac-Gilles et al., 2000] explain that documents, or any natural language support (messages, text files, paper books, technical manuals, notes, protocol transcripts, etc.), are one of the possible forms knowledge may take. The authors promote a new approach to knowledge modelling based on knowledge elicitation from technical documents, which benefits from the increasing amount of available electronic texts and from the maturity of natural language processing tools. The acquisition of knowledge from texts applies natural language processing tools, grounded in results from linguistics, to semi-automate systematic text analysis and ease the modelling process.
[Aussenac-Gilles et al., 2000] distinguish and describe four major steps in this process:
1. Setting up the corpus: From the requirements that explain the objectives underlying the model development, the designer selects texts among the available technical documentation. He must be an expert on the texts of this domain in order to characterise their type and their content. The corpus has to cover the entire domain specified by the application. A glossary, if it exists, is useful to determine sub-domains and to verify that they are well covered. The corpus is then digitised if it was not already. Beginning the modelling may lead to reconsidering the corpus.
2. Linguistic study: This step consists in selecting adequate linguistic tools and techniques and in applying them to the text. Their results are sifted and a first linguistically based elicitation is made. The objective is to allow the selection of the terms and lexical relations that will be modelled. The results of this stage are quite raw and will be further refined.
3. Normalisation: This step includes two parts. The first part is still linguistic: it refines the previous lexical results. The second part concerns the semantic interpretation to structure concepts and semantic relations. The modelling goes from terminological analysis to conceptual analysis, that is, from terms to concepts and from lexical relations to semantic ones. During normalisation, the amount of data to be studied is gradually restricted.
4. Formalisation: The formalisation step includes building and validating the ontology. Some existing ontologies may help to build the highest levels and to structure it into large sub-domains. Then semantic concepts and relations are translated into formal concepts and roles and inserted into the ontology. This may require restructuring the ontology or defining additional concepts, so that the inheritance constraints on the subsumption links are correct. Inserting a new concept triggers a local verification to guarantee the syntactic validity of the added description. A global validation of the formal model is performed once the ontology reaches a fairly stable state, to verify its consistency.
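The linguistic study step (step 2) can be sketched very minimally. Real settings would use dedicated term extractors over large corpora; this toy version, entirely my own illustration, merely counts contiguous word pairs as candidate terms.

```python
# Minimal sketch of candidate-term extraction for the 'linguistic study'
# step: recurrent word bigrams are proposed as candidate domain terms.
# A real pipeline would use proper NLP tools; this is a toy illustration.
import re
from collections import Counter

def candidate_terms(corpus, top=3):
    """Return the `top` most frequent word bigrams across the corpus."""
    counts = Counter()
    for text in corpus:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(zip(words, words[1:]))  # contiguous word pairs
    return [" ".join(pair) for pair, _ in counts.most_common(top)]

corpus = ["The knee ligament connects the femur.",
          "A torn knee ligament requires surgery."]
print(candidate_terms(corpus, top=1))  # ['knee ligament']
```

The output of such a tool corresponds to the "quite raw" results mentioned for step 2: the candidate terms would still be sifted by the designer and refined during normalisation.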
"The modelling work must be carried out from documents attested in the practice of a domain and gathered in a corpus. (...) The selection [of the documents] is based on criteria pertaining to the analysis method used (corpus analysis, for instance) and to the problem to be solved (to keep only the documents relevant for the problem to be solved). Setting up a corpus is delicate: the choice of a corpus introduces bias, that we may not be able to evaluate." [Bachimont, 2000] Here the definition of the scope, and especially the scenarios, are of valuable help in choosing the corpus and the criteria for gathering and analysis.
NLP tools are extremely interesting for scaling up the process of data collection because they can assist and partially relieve the knowledge engineer when large corpora have to be analysed. However, the document analysis aspect of data collection is not limited to the linguistic content of documents. Graphical documents may be very rich too: organisation charts, maps, tables, figures, etc., as shown in Figure 11. For these documents few assisting tools exist, and large corpora of drawings, for instance, cannot be analysed by one person in charge of knowledge engineering or management.
[Figure 11: examples of information-rich graphical documents: a table of chemical elements and a schema of the DNA structure]
Other aspects of documents are interesting too, such as their effective use (e.g. comparing an empty form with a filled one) or their flows in an organisation (the typical pathways of a given type of document).
Questionnaire & Questioning: Questionnaires are a relatively inexpensive way of getting people to provide information. However, elaborating a questionnaire is a critical job, and like any artefact a good questionnaire has to be tested and reviewed several times to validate its usability. Note that in many situations other data collection techniques are superior.
From a general point of view it may seem trivial, but the questions must be formulated so that each respondent clearly understands the topic, the context and the perspective. The profile and role of the respondents must be considered when choosing the questions, so that people are not asked to give information they do not have. The order and the way questions are formulated may influence the answers, so one must be very careful not to skew the collection or miss something.
Brainstorming & Brainwriting: Brainstorming was originally a group problem-solving technique. More generally, it consists of a meeting session based on spontaneous and unrestrained contributions from all members of the group. The discussion is centred on a specific theme, a set of ideas or problems, with the purpose of generating new ideas or solving the problems.
If both brainstorming and interviews are to be used, it is better to run the interviews first (at least the first wave), so that some ideas have already been gathered to sustain the discussion, and so that the exchange of ideas during brainstorming does not bias the individual interviews.
In some cases, hierarchy or other background influences may hamper a brainstorming session (e.g. someone may not want to publicly go against the opinion of his chief, someone may be too shy to speak). It is therefore extremely important to choose the participants carefully. It is also a case where brainwriting can be used: people write down their ideas and put the papers in a basket, and the ideas are then anonymously read out and discussed. An alternative is to make people re-draw the papers from the basket, write their comments, put them back, and so on until everyone has written something on all the pieces of paper.
New technologies can enable other forms of discussions (e.g. news, mailing-lists…), but here again human contact is usually the best option.
During data collection, several drafts and intermediate versions of data collection reports, structures and traces (e.g. scenario reports, lexicons, interview transcriptions, pictures of observations...) are generated. Analysing each one of them provides guidance and focus for further data collection. There may be several analyses of one version; for instance, one may have to analyse a product again after discovering new areas of interest that had not yet been identified when the first analysis occurred. A product can also be analysed by several people. For these reasons, it is important that the product be kept intact, with the exact wording, phrasing and representations captured by data collection.
The analysis of a report resulting from a data collection session can be divided into several phases:
1. Recognition: First the report has to be reviewed entirely without trying to structure or link together the information it contains. One must concentrate on identifying the blocks (e.g. a definition or a schema), the elements (e.g. a term denoting a concept) and the connectors (e.g. logical and chronological connectors).
2. Then, the report is reviewed several times, not only to make sure every interesting bit has been identified, but also and mainly to structure and bring out the links, relationships, dependencies, groupings and cross-references. The review is not a linear reading.
3. Once it seems everything has been extracted, the analysis gives an informal, but structured, annotated and tidy report of what has been collected.
Electronic copies of the reports do facilitate manipulation and versioning.
Based on the analysis, the following wave of data collection will probably be more focused, trying to confirm or invalidate intermediate results and to gather further details. This can lead to more structured interviews, focused observation, discussions with people about the intermediate results, or one-off communication (e.g. phone or e-mail) to clarify a specific point. Once the informal models are stable enough, they represent the starting point for formalisation.
To use the same words is not a sufficient guarantee of
understanding; one must use the same words for
the same genus of inward experience; ultimately