EuropAID Project: Sanskrit
Open source and free software
In order to ensure a wide diffusion of the software developed by the project, it will be produced from open sources as free software. It will be usable under Microsoft Windows as well as under Linux. The software will be written in C or C++ under an open source licence such as the GNU licence. Some interfaces may be produced in Java.
LaTeX
Texts are transcribed in a LaTeX-like format. LaTeX is a free software package derived from TeX. LaTeX is a non-WYSIWYG text processor which has been developed for a wide variety of languages with various scripts, such as Cyrillic, Arabic, Hebrew, Devanagari, Chinese, etc. Most of these extensions can be viewed on any CTAN site, such as: www.ctan.org/ctan
A few principles about Asian scripts and IT tools
The aim of a critical edition is to display, on the same screen, different versions of an initial text, as they appear in different manuscripts. The displayed text is usually called the canonical text or master text; the different versions are called variants, and usually appear as footnotes. Each variant is referenced with the set of manuscripts that contain it.
Building a critical edition manually is a rather tedious task, which is why software packages are often called upon to help construct such editions. The software compares the different manuscripts one by one with the master text, building the critical edition by showing the differences between the master text and each manuscript.
Establishing a critical edition generally offers the possibility of constructing other data at the same time, such as the data necessary for further analysis. During such analysis, clusters of manuscripts, as well as phylogenetic trees relating them, can be constructed.
Some software already exists which can partially accomplish this goal, but it is not suitable for our purpose, because it is all based on the usual ASCII encoding as used for English and (more or less) for all European languages.
Unfortunately such an approach is not valid for most Asian scripts.
In particular it would not apply to the alphabetical scripts written according to an Indian alphabet more or less derived from Sanskrit, such as Pali, Thai, Khmer, Burmese, etc., nor to pictographic scripts such as Chinese, Korean, Japanese, etc.
For that reason, we must deploy different instruments.
For the sake of convenience, we will focus mainly on Indian texts and on other languages whose scripts derive from Sanskrit. Such scripts have two main consequences for the computer treatment:
· The Sanskrit alphabet has 46 letters. Such an alphabet is transliterated, i.e. some letters of the Sanskrit alphabet are represented by more than one ASCII character. For example, the Devanagari character set is mostly written according to the Velthuis encoding.
· As a consequence, the lexical order induced by the Sanskrit alphabet is far from the order induced by the sequence of ASCII characters used to represent the letters. For various reasons the lexical order is very important in the construction of a complete critical edition, and special tools are needed to handle it.
· Sanskrit scripts mostly ignore the blank as a word separator. For instance, in Devanagari script "the city is beautiful" would be written "thecityisbeautiful".
- Such a feature can induce a combinatorial explosion of computation time, making the production of a critical edition very long and difficult from the computational point of view.
Remarks:
· The introduction of Unicode will only slightly change the problems mentioned above.
· The problems seem rather similar for pictographic scripts (Chinese, Japanese, etc.)
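As an illustration of the transliteration and ordering issues above, here is a minimal sketch in Java. The class name is ours and the alphabet is a small hypothetical subset of the Velthuis letters, not the full 46-letter alphabet; the idea is simply to tokenize a transliterated word by longest match and sort in Sanskrit alphabetical order rather than ASCII order.

```java
import java.util.*;

public class SanskritSort {
    // Hypothetical subset of Velthuis transliterations, listed in Sanskrit
    // alphabetical order (the full alphabet has 46 letters).
    static final String[] ALPHABET = {
        "a", "aa", "i", "ii", "u", "uu",
        "k", "kh", "g", "gh", "c", "ch", "j", "jh",
        "t", "th", "d", "n", "p", "b", "m", "y", "r", "v"
    };

    // Split a transliterated word into Sanskrit letters, preferring the
    // longest match so that "kh" is read as one letter, not "k" + "h".
    static List<Integer> tokenize(String word) {
        List<Integer> letters = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int best = -1, bestLen = 0;
            for (int a = 0; a < ALPHABET.length; a++) {
                if (word.startsWith(ALPHABET[a], i) && ALPHABET[a].length() > bestLen) {
                    best = a;
                    bestLen = ALPHABET[a].length();
                }
            }
            if (best < 0) { i++; continue; } // skip characters outside the subset
            letters.add(best);
            i += bestLen;
        }
        return letters;
    }

    // Compare two words letter by letter in Sanskrit alphabetical order.
    static int compare(String x, String y) {
        List<Integer> a = tokenize(x), b = tokenize(y);
        for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
            int c = Integer.compare(a.get(i), b.get(i));
            if (c != 0) return c;
        }
        return Integer.compare(a.size(), b.size());
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("gaja", "kha", "ka", "aatma"));
        words.sort(SanskritSort::compare);
        System.out.println(words); // [aatma, ka, kha, gaja] -- not ASCII order
    }
}
```

Note how "kha" now sorts before "gaja": the aspirate kh precedes g in the Sanskrit alphabet, whereas plain ASCII comparison would place "gaja" first.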
We propose to build software which can handle Sanskrit and other Asian languages whose alphabet and scriptural tradition derive from Sanskrit, in order to produce a critical edition. The input will be, on the one hand, a set of manuscripts which are to be compared, and on the other hand a master text divided into words. The transliterated aspect will be taken into account, as well as the Sanskrit alphabetical order, which will be used to produce an index. The input will be a LaTeX-like set of texts, each manuscript corresponding to one file. Three kinds of output will be produced:
· The LaTeX text of the critical edition, dedicated to the paper version
· An XML version, dedicated to the production of the electronic version
· A set of information dedicated to further analysis of the manuscript set. Such information will consist of comments about the manuscripts (omissions, ink colour changes, etc.), as well as statistical information produced by the software, such as the number of occurrences of a given word.
We will begin by applying our software to Sanskrit, for which we have an important set of examples. Then we plan to make at least one adaptation to another Asian language whose script is derived from Sanskrit, such as Khmer. Finally, we plan to consider under which conditions our software can be applied to pictographic languages such as Chinese.
The production of a critical edition under such conditions is still an "open subject". The absence of blank characters between words makes it necessary to compare very long sequences of characters. This problem is very similar to the one encountered in biology when comparing DNA sequences. But the basic algorithm will be inspired mostly by the algorithms of Hunt and Szymanski (1977) and Myers, which are the foundation of the Unix DIFF command.
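As a rough sketch of this comparison step, the following Java program aligns a manuscript against the master text word by word and reports the differences in a diff-like form. It uses the plain dynamic-programming longest-common-subsequence algorithm rather than the Hunt and Szymanski or Myers optimizations, and the sample words are invented:

```java
import java.util.*;

public class WordDiff {
    // Classic LCS dynamic programme over words; Hunt and Szymanski (1977)
    // computes the same result faster when matching positions are sparse.
    static int[][] lcsTable(String[] a, String[] b) {
        int[][] L = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                L[i][j] = a[i - 1].equals(b[j - 1]) ? L[i - 1][j - 1] + 1
                                                    : Math.max(L[i - 1][j], L[i][j - 1]);
        return L;
    }

    // Report words of the master absent from the manuscript ("-") and
    // words the manuscript adds ("+"), as a variant apparatus needs them.
    static List<String> diff(String[] master, String[] ms) {
        int[][] L = lcsTable(master, ms);
        List<String> edits = new ArrayList<>();
        int i = master.length, j = ms.length;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && master[i - 1].equals(ms[j - 1])) { i--; j--; }
            else if (j > 0 && (i == 0 || L[i][j - 1] >= L[i - 1][j])) edits.add("+ " + ms[--j]);
            else edits.add("- " + master[--i]);
        }
        Collections.reverse(edits);
        return edits;
    }

    public static void main(String[] args) {
        String[] master = {"tatra", "eva", "bhavati"};
        String[] ms     = {"tatra", "bhavati", "iti"};
        System.out.println(diff(master, ms)); // [- eva, + iti]
    }
}
```

The same alignment works over raw character sequences when word boundaries are absent, which is where the computation time can explode and the sparse-match optimizations become essential.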
By using the information resulting from the "critical edition" software, new and unexpected data analyses of old texts will be made possible, in particular the building of a "genealogical tree of all copies", often called a phylogenetic tree of manuscripts. In other terms, when dealing with hundreds of copies of an original text, and after having sorted out which parts of these copies may be original or not, it is possible to identify sub-families of copies deriving from a common ancient copy.
A few words about text comparison – classification and dating
The realization of a "critical edition" of a text broadly disseminated for centuries requires the processing of a great number of sources. Sometimes more than a hundred manuscripts are to be compared. Manual processing of such a number of manuscripts is almost insurmountable. The purpose of the present project is to allow an automation of this work.
As a base for experimentation, we will use a Sanskrit grammatical commentary of the 7th century, the "Benares Glose". To date, more than a hundred manuscripts of the "Benares Glose" have been listed in various libraries in India and Europe.
An electronic version of the complete text of the "Benares Glose" was carried out on the basis of an Indian edition. This edition is apparently based on a single source. It thus does not take into account the various versions of the text as they are presented in the tree of existing manuscripts.
It cannot be used directly. Nevertheless, it can be used as a starting point which will facilitate the processing of the other manuscripts. We will use it as the master text, at least at the beginning of our work.
A "critical" edition in the strict sense of the term must however take into account all the sources available, in order to process the greatest volume of information available and allow the history of the text and of its transmission to be rebuilt.
Comparison of manuscripts
The master text will be the electronic version of the "Benares Glose" non-critical edition.
The master text is to be modified manually and progressively, according to the manuscripts examined, in order to work out the best possible version. This version is intended to become the final version of the "Benares Glose".
In addition, in order to allow textual analysis, an analytical version of the master text is also envisaged, which will include in particular the decomposition of the phonetic connections (sandhi), the separation of the prefixes, and the analysis of compound words. This analytical version of the master text (padapatha) will also be employed during the automatic comparison.
Each manuscript will be collected in a single file. All alternatives (corrections, omissions, additions, etc.) as well as metadata will be recorded there as well.
Each file thus constitutes a genuine electronic reproduction of a given manuscript (see below, M1, M2, etc.). These files constitute the "raw material" and will be compared two by two with each other.
The elements which differ from this version (alternatives, gaps, additions, etc.), determined by the software, will appear in footnotes.
An index will be generated and used as a reference by the software to carry out the different comparisons.
The computer-assisted generation of a critical apparatus aims at producing a critical edition in the most automatic way possible.
The processing is guided by the examination of the master text (padapatha), which contains the separations into paragraphs and the separations of words (not carried out in general in the manuscripts before re-transcription).
All manuscripts are compared with the master text, two by two, paragraph by paragraph, so as to generate part of the critical apparatus. The other part is made up directly, by collecting the "footnotes" which accompany each manuscript.
One of the objectives of the project is to see which parts of the process could be automated.
Four outputs are expected
The transformation from LaTeX to XML (and vice versa) could be carried out outside the program, by specialized software such as "Latex2Xml".
The analysis of the data collected during the manuscript comparison is an important task.
The data necessary for the various classifications will be partly generated by the critical edition generation program.
The first level consists in carrying out a partition of the manuscripts into various clusters.
The clusters obtained will indicate which manuscripts are most similar and which are most different, depending on a set of pre-decided criteria.
The set of clusters will provide a serious scientific basis for determining which sets of manuscripts come from the same "father" manuscript.
It will be possible for the user:
A second level of data analysis will consist in the creation of a phylogenetic tree of the manuscripts.
The user will be able to choose the manuscripts to be included in the construction of such a tree, as well as the type of metric.
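As a sketch of this first, cluster-building level, the following Java program performs single-linkage agglomerative clustering. The distance matrix is invented for illustration; real distances would come from the comparison stage (for instance, the number of differing words between two manuscripts):

```java
import java.util.*;

public class ManuscriptClusters {
    // Hypothetical pairwise distances (e.g. number of differing words)
    // between four manuscripts M1..M4.
    static final double[][] D = {
        {0, 2, 8, 9},
        {2, 0, 7, 8},
        {8, 7, 0, 3},
        {9, 8, 3, 0},
    };

    // Single linkage: the distance between two clusters is the minimum
    // pairwise distance between their members.
    static double linkage(Set<Integer> a, Set<Integer> b) {
        double min = Double.MAX_VALUE;
        for (int i : a) for (int j : b) min = Math.min(min, D[i][j]);
        return min;
    }

    public static void main(String[] args) {
        List<TreeSet<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < D.length; i++) clusters.add(new TreeSet<>(Set.of(i)));
        // Repeatedly merge the two closest clusters; the merge order is the
        // raw material for a dendrogram, i.e. a phylogenetic tree.
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            System.out.println("merge " + clusters.get(bi) + " + "
                    + clusters.get(bj) + " at distance " + best);
            clusters.get(bi).addAll(clusters.remove(bj));
        }
    }
}
```

With these invented numbers, manuscripts 0 and 1 merge first (distance 2), then 2 and 3 (distance 3), suggesting two sub-families descending from two different "father" manuscripts; the choice of linkage function is exactly the "type of metric" the user would select.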
Example of a phylogenetic tree:
Classification of the manuscripts of the "Hymns" of Homer
(Les Belles Lettres, Paris)
M, D, At, H, J, K, etc. are the only manuscripts which are now available.
All the others, (f), (s), (z), (x), (psi), (fi), are only assumptions; they no longer exist.
Epigraphists assume that H, J, K come from the same "father" (z). They also assume that (z) and (f) come from a common father (g). The whole set comes from (Omega), which is supposed to be the original document of the 11th century BC.
Collation Demo Guide
Text collation supports language-sensitive comparison of strings, allowing for text searching and alphabetical sorting. The collation classes provide a choice of ordering strength (for example, to ignore or not ignore case differences) and handle ignored, expanding, and contracting characters. Developers don't need to know anything about the collation rules for various languages. Any feature requiring collation can use the collation object associated with the current default locale, or with a specific locale (like France or Japan) if appropriate.
Collation Basics
Correctly sorting strings is tricky, even in English. The results of a sort must be consistent: any differences between strings must always be sorted the same way. The sort assigns relative priorities to different features of the text, based on the characters themselves and on the current ordering strength of the collation object.
Other special characters, including accented or grouped characters, add further complications. For example, the "-" hyphen character in the word "black-bird" is only significant if the other letters in the compared strings are identical.
Localizable Collation
Different collation objects associated with various locales handle the differences required when sorting text strings for different languages.
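The strength and locale behaviour described above can be seen with the standard java.text.Collator class. This is a small sketch; the sample words are arbitrary:

```java
import java.text.Collator;
import java.util.Locale;

public class CollationBasics {
    public static void main(String[] args) {
        // A locale-sensitive collator: the French collator knows how
        // accented letters relate to their base letters.
        Collator fr = Collator.getInstance(Locale.FRENCH);

        fr.setStrength(Collator.PRIMARY); // compare base letters only
        System.out.println(fr.compare("péché", "peche") == 0); // true: accents ignored

        fr.setStrength(Collator.TERTIARY); // accents and case now count
        System.out.println(fr.compare("péché", "peche") == 0); // false
    }
}
```

Raising or lowering the strength is how an application chooses whether case and accent differences matter for a given sort or search.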
To see this: you can modify an existing collation. Adding items at the end of a collation overrides earlier information. For example, you can make the letter P sort at the end of the alphabet.
Do this: enter the sample rules at the end of the Collation Rules field.
Sample rules:
< p , P
Making P sort at the end may not seem terribly useful, but the same mechanism is used to modify the sorting sequence for different languages.
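In code, entering that rule corresponds to appending it to the rules of a java.text.RuleBasedCollator (a sketch; the sample words are arbitrary):

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.*;

public class PSortsLast {
    public static void main(String[] args) throws ParseException {
        // Append "< p , P" to the default rules: rules added at the end
        // override earlier information, so p/P now sort after everything.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.US);
        RuleBasedCollator modified = new RuleBasedCollator(base.getRules() + " < p , P");

        List<String> words = new ArrayList<>(List.of("pear", "apple", "zebra"));
        words.sort(modified);
        System.out.println(words); // [apple, zebra, pear]
    }
}
```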
To see this: you can add new rules to an existing collation. For example, you can add CH as a single letter after C, as in traditional Spanish sorting.
Do this: enter the sample rules at the end of the Collation Rules field.
Sample rules:
& c < ch , cH, Ch, CH
Sample test cases:
Cat
As well as adding sequences of characters that act as a single character (this is known as contraction), you can also add characters that act like a sequence of characters (this is known as expansion).
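The traditional Spanish "ch" contraction above can be tried the same way with java.text.RuleBasedCollator (a sketch; the sample words are arbitrary):

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.*;

public class SpanishCH {
    public static void main(String[] args) throws ParseException {
        // "& c < ch , cH, Ch, CH" resets to c and inserts ch (in its four
        // case forms) as a single letter sorted after c: a contraction.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.US);
        RuleBasedCollator spanish =
                new RuleBasedCollator(base.getRules() + " & c < ch , cH, Ch, CH");

        List<String> words = new ArrayList<>(List.of("curioso", "chocolate", "cat"));
        words.sort(spanish);
        System.out.println(words); // [cat, curioso, chocolate]
    }
}
```

Because "ch" is now a letter that sorts after plain "c", "chocolate" moves after "curioso"; an expansion rule works the other way round, making one character compare like a sequence of letters.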
http://java.fh-wolfenbuettel.de/jdk1.1/demo/i18n/Collate/example1.html#basics
Working Package N° 3 – Testing software on a genuine example
Hence, the core issue of this project lies in the historical reconstitution of texts. Ultimately, it will be possible to use the software in various scholarly disciplines (philosophy, linguistics, mathematics, medicine, etc.) as well as in literary arts and poetics. But in order to test the software, we have decided to use a grammatical text: the Glose of Benares (Kashikavritti).
Why did we choose this particular text?
– Choice of the field: Grammar (vyâkarana) plays an important role in Indian thought, comparable to that of mathematics in Western philosophy. It is considered the principal auxiliary of the Veda. Hence, the choice of this discipline was quite an obvious one.
– Choice of a particular text: several criteria had to be met.
1) Relevance in the history of Sanskrit grammar. The Glose of Benares is the most ancient complete commentary on the main grammatical treatise (Pânini's Ashtâdhyâyî, 5th cent. BC), on which all Indian linguistic speculations are based.
The Glose of Benares is an invaluable tool for understanding Pânini's book. It is also fundamental from the point of view of the history of linguistic ideas in India, since all later works are based on or inspired by Pânini. The Glose of Benares has already been edited, but only on the basis of one or very few manuscripts. Therefore, nothing is known about its history, interpolated passages, omissions, or how the text has been transmitted in the Indian subcontinent, which would be quite unthinkable in the case of any Classical work in Greek or Latin.
2) Number of manuscripts. In order to test the software and avoid bugs, it was necessary to compare a great number of manuscripts of the same text on a wide scale, with various scripts (there are more than 10 different Indian scripts). There are over 150 listed manuscripts of the Glose of Benares in Indian or Western libraries.
3) Collection of manuscripts: collecting manuscripts, especially in India, is a time-consuming task (most Indian libraries do not send copies of their manuscripts, so they have to be collected on the spot, and curators are sometimes reluctant to let scholars take photographs or microfilms). More than two thirds of the manuscripts of the Glose of Benares were already collected by J. Bronkhorst (University of Lausanne) more than 10 years ago in order to produce a critical edition of the text. After having examined some of the manuscripts, he realised that it was impossible to produce the edition manually and decided to postpone this work until proper computer software was available. He has kindly made his copies of the manuscripts available to us, so that we do not waste time on the collection work.
A few quantifiable figures for WP3 package
Total number of pages of the Glose of Benares: 784 p. (electronic text in A4)
Total number of words: 244 039
Number of characters (without spaces): 1 430 297
For the present project, we have selected 3 chapters of about 20 p. each (i.e. about 44 000 characters). Each team will be responsible for a particular chapter (Pune, Kathmandu, Pondicherry), but for some of the scripts, collaborations will be organised (for instance, South Indian scholars may not be familiar with Bengali script, or Nepali scholars with Kerala script).
In the same way, for manuscripts that can only be examined on the spot, and for which copies cannot be made available, one member of the team will work simultaneously on all three chapters.
Last modified: Tue Jun 4 10:40:29 MEST 2002