Documentation of the "Term Search" tool


This tool is the first step toward a collaborative version of WebKB, our knowledge-based annotation/IR tool. Currently, WebKB exploits users' Web-accessible knowledge bases; it does not handle a centralized/shared knowledge base (KB). In the new version, anyone will be able to search this shared KB and also to update it (via protocols that ensure that the KB remains consistent and that no information is lost).

We have initialized the KB of our system (built on top of the main memory database system FastDB) with the WordNet lexical database. In accordance with the conventions we advocate, we have only included the WordNet categories and links related to nouns and adjectives, i.e. an ontology of about 84,000 formal terms organized by 9 kinds of links and representing the meanings of about 115,000 words. We have corrected a few structural problems (e.g. 4 categories that referred to themselves) and added a top-level ontology of about 150 concept types and 120 relation types in order to ease, guide and check the uses and refinements of this ontology.
In WordNet, the subtypeOf and instanceOf kinds of links are not distinguished (there is only one "hypernym" relation). We have not made that distinction yet but will do it in September.
Since the KB is big, some queries may return a lot of results. To avoid generating needlessly large result pages, we have limited the number of terms explored via the subtype links to 2000 (otherwise, a Web page of more than 12 Mb could be generated in the default format, 24 Mb in RDF). If you want to get the whole ontology, you may download a 4.2 Mb gzipped version of that ontology in RDF or in our format (described below).
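The effect of such a limit can be illustrated with a minimal Python sketch. It is not the tool's actual code: the breadth-first order, the function name explore_subtypes and the dictionary of subtype links are assumptions made for illustration; only the limit of 2000 comes from the text above.

```python
from collections import deque

MAX_TERMS = 2000  # limit stated in the documentation


def explore_subtypes(root, subtypes, limit=MAX_TERMS):
    """Walk subtype links breadth-first, stopping after `limit` terms.

    `subtypes` is a hypothetical dict mapping a term to its direct subtypes.
    Returns (terms in visiting order, True if the walk was cut short).
    """
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        if len(order) >= limit:
            return order, True  # unexplored terms remain in the queue
        term = queue.popleft()
        order.append(term)
        for child in subtypes.get(term, ()):
            if child not in seen:  # a term may be reachable via several supertypes
                seen.add(child)
                queue.append(child)
    return order, False
```

With a toy hierarchy such as {"a": ["b", "c"], "b": ["d"], "c": ["d"]}, a limit of 3 stops before "d" is explored and reports the truncation, whereas the default limit returns all four terms.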


Terms, identifiers, names and creators

Each (formal) term (concept type, relation type, individual) has one creator, may have at most one identifier (at present all terms in the KB have an identifier), and may have several names (at present only English nouns/adjectives or English nominal expressions coming from WordNet and our top-level ontology). Conversely, each name may be connected to many terms (since a word may have several meanings). Each link (relation) between terms (e.g. subtypeOf, instanceOf, exclusion, nounTypeOf) and each link between a term and a name also has a creator. Storing the creators of terms and links is an essential step toward supporting the update and exploitation of the knowledge base by multiple users.

In the section "Selection options" of the interface, an identifier must be prefixed by '#'. A name specification may include wildcards: '?' for one character, '*' for any number of characters. At present, the interface does not require (but permits) the specification of term creators because the terms in the KB come from only two creators.
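As an illustration of these wildcard rules, here is a minimal Python sketch that translates a name specification into a regular expression. The function names are hypothetical, and case-sensitive matching is an assumption (the text does not say how the tool treats case).

```python
import re


def is_identifier(spec):
    """An identifier is distinguished from a name by its '#' prefix."""
    return spec.startswith("#")


def name_pattern(spec):
    """Compile a name specification: '?' is one character, '*' is any number."""
    parts = []
    for ch in spec:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.DOTALL)


def matches(spec, name):
    """True if the whole name matches the specification."""
    return name_pattern(spec).fullmatch(name) is not None
```

For instance, matches("shar?", "share") holds but matches("shar?", "shares") does not, while "p*t" matches both "part" and "percent".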

Here is the general convention (it is already used in the user-friendly input/output formats): a string that denotes a term shows its creator, its identifier and one or all of its names. For instance: wn#wn9561632__share__portion__part__percentage denotes a term that comes from WordNet, has the identifier "wn9561632" and, according to WordNet, has the names "share", "portion", "part" and "percentage". Actually, for readability reasons, terms coming from WordNet are shown without "wn" before the '#'. Similarly, not all the names of a term have to be shown. Thus, for instance, #wn9561632 and #wn9561632__share denote the same term but give less information about its names. When a term is shown with a key but without a name, this means the key is also a name. For instance, pm#Situation denotes a term created by a user with identifier "pm" (which is an alias for "philippe.martin@gu.edu.au"), with key "Situation" and with name "situation". (Thus, "pm#Situation" may be used in the interface; "pm#*" may also be used to select all terms from the user with identifier "pm").
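The convention above can be sketched as a small Python parser. This is only an illustration, not the tool's parser (the actual grammars are those of the "comprehensive, parsable, still user-friendly" format): the names TermRef and parse_term are hypothetical, and the sketch assumes that "__" never occurs inside an identifier or a name.

```python
from typing import NamedTuple


class TermRef(NamedTuple):
    creator: str      # e.g. "wn" or "pm"
    identifier: str   # e.g. "wn9561632" or "Situation"
    names: tuple      # the names shown in the string, possibly none


def parse_term(text):
    """Split 'creator#identifier__name1__name2' into its parts."""
    creator, sep, rest = text.partition("#")
    if not sep:
        raise ValueError(f"missing '#' in term reference: {text!r}")
    identifier, *names = rest.split("__")
    # Terms from WordNet are shown without "wn" before the '#'.
    return TermRef(creator or "wn", identifier, tuple(names))
```

Under these assumptions, parse_term("#wn9561632__share") and parse_term("wn#wn9561632") yield the same creator and identifier and differ only in the names shown.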
The creator of a link, say from a term t1 to a term t2, is displayed only if that creator is different from the creator of t1. If so, the creator identifier is displayed within parentheses after t2. For instance, rdfs#Container > pm#Set(pm)   means that, according to the user "pm", pm#Set is a subtype of rdfs#Container. Alternatively, this can be written:   pm#Set < rdfs#Container.
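The display rule for link creators can be sketched in Python as follows. The function name format_link and its parameters are hypothetical; the sketch only reproduces the two notations given in the example above.

```python
def format_link(t1, symbol, t2, link_creator):
    """Render a link from t1 to t2, e.g. with symbol '>' or '<'.

    The link creator is appended in parentheses after t2 only when it
    differs from the creator of t1 (the part of t1 before the '#').
    """
    # Terms from WordNet are shown without "wn" before the '#'.
    t1_creator = t1.partition("#")[0] or "wn"
    suffix = f"({link_creator})" if link_creator != t1_creator else ""
    return f"{t1} {symbol} {t2}{suffix}"
```

Here format_link("rdfs#Container", ">", "pm#Set", "pm") gives "rdfs#Container > pm#Set(pm)", while format_link("pm#Set", "<", "rdfs#Container", "pm") gives "pm#Set < rdfs#Container" since the link and pm#Set share the same creator.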
Click here for the grammars (EBNF plus Lex&Yacc) of the "comprehensive, parsable, still user-friendly" format.

In the RDF format, the creators of the terms and of the links are not yet shown. When the need arises, this will be done, at least for the terms (this simply requires the generation of "creator" relations from the terms). The representation of the creators of the links is a bit more difficult and much less readable because the representation of contexts is quite cumbersome in RDF.

Whenever an option is selected or a term is entered, the interface shows how it is translated according to the GET protocol into parameters for the called CGI server. Thus, a developer can see how to call this CGI server directly from a program and exploit its results (the POST protocol may also be used). This also means that any term in the KB can be referred to via a URL and that knowledge related to this term can be fetched via this URL. Such URLs may for instance be used in RDF schemas.
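A program could build such a GET URL along the following lines. This Python sketch is hypothetical throughout: the base URL and the parameter names "term" and "format" are placeholders, and the real ones are those displayed by the interface. The one general point it illustrates is that the '#' of a term reference must be percent-encoded, since an unencoded '#' would start a URL fragment and never reach the server.

```python
from urllib.parse import urlencode

# Hypothetical address; use the one shown by the interface instead.
BASE_URL = "http://example.org/cgi-bin/termsearch"


def term_url(term, fmt="rdf", base_url=BASE_URL):
    """Build a GET URL for a term; parameter names are placeholders."""
    query = urlencode({"term": term, "format": fmt})  # '#' becomes '%23'
    return f"{base_url}?{query}"
```

The resulting string can be passed to any HTTP client, e.g. urllib.request.urlopen, once the real address and parameter names are substituted.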


Comparison with similar tools

The WordNet Web site offers on-line access to their database plus a browser to install. However, our interface offers more search options (e.g. via identifiers, wildcarded names and attached links) and more display formats (e.g. indented list in conjunction with formal notations). Furthermore, we reorganized the ill-structured top-level of WordNet inside a top-level ontology. Finally, we will ultimately permit Web users to complement -- and in some cases, correct -- the content of the knowledge base.

Dan Brickley also implemented a server that provides the supertypes of a given WordNet term in RDF format. However, this server does not distinguish between identifiers and names. The provided information is therefore incorrect (in Dan Brickley's words: "the current demo conflates 'word senses' with the words associated with those senses"). No form-based interface, search option, or format other than RDF is provided.



Philippe Martin
Last modified: Tue Aug 1 14:58:27 MEST 2000