EuropAID Project: Sanskrit
Open source and free software
In order to ensure a wide diffusion of the software developed by the project, it will be produced from open sources as free software. It will be usable under Microsoft Windows as well as under Linux. The software will be written in C or C++ under an open source licence such as the GNU licence. Some interfaces may be produced in Java.
LaTeX
Texts are transcribed in a LaTeX-like format. LaTeX is a free software package derived from TeX. LaTeX is a non-WYSIWYG text processor which has been developed for a wide variety of languages with various scripts, such as Cyrillic, Arabic, Hebrew, Devanagari, Chinese, etc. Most of these extensions can be viewed on any CTAN site, such as: www.ctan.org/ctan
A few principles about Asian scripts and IT tools
The aim of a critical edition is to display, on the same screen, different versions of an initial text, as they appear in different manuscripts. The displayed text is usually called the canonical text or master text; the different versions are called variants, and usually appear as footnotes. Each variant is referenced with the set of manuscripts that contain it.
Building a critical edition manually is a rather tedious task, which is why software packages are often called upon to help construct such editions. The software compares the different manuscripts one by one with the master text, building the critical edition by showing the differences between the master text and each manuscript.
Establishing a critical edition generally offers the possibility of constructing other data at the same time, such as the data necessary for further analysis. During such analysis, clusters of manuscripts, as well as phylogenetic trees relating them, can be constructed.
Some software already exists which can partially accomplish this goal, but it is not suitable for our purpose, because it is all based on the usual ASCII encoding as used for English and (more or less) for all European languages.
Unfortunately such an approach is not valid for most Asian scripts.
In particular it would not apply to the alphabetical scripts written according to an Indian alphabet more or less derived from Sanskrit, such as Pali, Thai, Khmer, Burmese, etc., nor to pictographic scripts such as Chinese, Korean, Japanese, etc.
For that reason, we must deploy different instruments.
For the sake of convenience, we will focus mainly on Indian texts and on other languages whose scripts derive from Sanskrit. Such scripts have two main consequences for the computer treatment:
· The Sanskrit alphabet has 46 letters. Such an alphabet is transliterated, i.e. some letters of the Sanskrit alphabet are represented by more than one ASCII character. For example, the Devanagari character set is mostly written according to the Velthuis encoding.
· As a consequence, the lexical order induced by the Sanskrit alphabet is far from the order induced by the sequence of ASCII characters used to represent the letters. For various reasons the lexical order is very important in the construction of a complete critical edition, and special tools are needed to handle it.
· Sanskrit scripts mostly ignore the blank as a word separator. For instance, in Devanagari script "the city is beautiful" would be written "thecityisbeautiful".
- Such a feature can induce a combinatorial explosion of computation time, making the production of a critical edition very long and difficult from the computational point of view.
Remarks:
· The introduction of Unicode will only slightly change the problems mentioned above.
· The problems seem rather similar for pictographic scripts (Chinese, Japanese, etc.)
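As an illustration of the transliteration and ordering issues above, here is a minimal sketch in Java. The class name is ours and the alphabet is a small hypothetical subset of the Velthuis letters, not the full 46-letter alphabet; the idea is simply to tokenize a transliterated word by longest match and sort in Sanskrit alphabetical order rather than ASCII order.

```java
import java.util.*;

public class SanskritSort {
    // Hypothetical subset of Velthuis transliterations, listed in Sanskrit
    // alphabetical order (the full alphabet has 46 letters).
    static final String[] ALPHABET = {
        "a", "aa", "i", "ii", "u", "uu",
        "k", "kh", "g", "gh", "c", "ch", "j", "jh",
        "t", "th", "d", "n", "p", "b", "m", "y", "r", "v"
    };

    // Split a transliterated word into Sanskrit letters, preferring the
    // longest match so that "kh" is read as one letter, not "k" + "h".
    static List<Integer> tokenize(String word) {
        List<Integer> letters = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int best = -1, bestLen = 0;
            for (int a = 0; a < ALPHABET.length; a++) {
                if (word.startsWith(ALPHABET[a], i) && ALPHABET[a].length() > bestLen) {
                    best = a;
                    bestLen = ALPHABET[a].length();
                }
            }
            if (best < 0) { i++; continue; } // skip characters outside the subset
            letters.add(best);
            i += bestLen;
        }
        return letters;
    }

    // Compare two words letter by letter in Sanskrit alphabetical order.
    static int compare(String x, String y) {
        List<Integer> a = tokenize(x), b = tokenize(y);
        for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
            int c = Integer.compare(a.get(i), b.get(i));
            if (c != 0) return c;
        }
        return Integer.compare(a.size(), b.size());
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("gaja", "kha", "ka", "aatma"));
        words.sort(SanskritSort::compare);
        System.out.println(words); // [aatma, ka, kha, gaja] -- not ASCII order
    }
}
```

Note how "kha" now sorts before "gaja": the aspirate kh precedes g in the Sanskrit alphabet, whereas plain ASCII comparison would place "gaja" first.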
We propose to build software which can handle Sanskrit and other Asian languages whose alphabet and scriptural tradition derive from Sanskrit, in order to produce a critical edition. The input will be, on the one hand, a set of manuscripts which are to be compared, and on the other hand a master text divided into words. The transliterated aspect will be taken into account, as well as the Sanskrit alphabetical order, which will be used to produce an index. The input will be a LaTeX-like set of texts, each manuscript corresponding to one file. Three kinds of output will be produced:
· The LaTeX text of the critical edition, dedicated to the paper version
· An XML version, dedicated to the production of the electronic version
· A set of information dedicated to further analysis of the manuscript set. Such information will consist of comments about the manuscripts (omissions, ink colour changes, etc.), as well as statistical information produced by the software, such as the number of occurrences of a given word.
We will begin by applying our software to Sanskrit, for which we have an important set of examples. Then we plan to make at least one adaptation to another Asian language whose script is derived from Sanskrit, such as Khmer. Finally, we plan to consider under which conditions our software can be applied to pictographic languages such as Chinese.
The production of a critical edition under such conditions is still an "open subject". The absence of blank characters between words makes it necessary to compare very long sequences of characters. This problem is very similar to the one encountered in biology when comparing DNA sequences. But the basic algorithm will be inspired mostly by the algorithms of Hunt and Szymanski (1977) and Myers, which are the foundation of the Unix DIFF command.
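As a rough sketch of this comparison step, the following Java program aligns a manuscript against the master text word by word and reports the differences in a diff-like form. It uses the plain dynamic-programming longest-common-subsequence algorithm rather than the Hunt and Szymanski or Myers optimizations, and the sample words are invented:

```java
import java.util.*;

public class WordDiff {
    // Classic LCS dynamic programme over words; Hunt and Szymanski (1977)
    // computes the same result faster when matching positions are sparse.
    static int[][] lcsTable(String[] a, String[] b) {
        int[][] L = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                L[i][j] = a[i - 1].equals(b[j - 1]) ? L[i - 1][j - 1] + 1
                                                    : Math.max(L[i - 1][j], L[i][j - 1]);
        return L;
    }

    // Report words of the master absent from the manuscript ("-") and
    // words the manuscript adds ("+"), as a variant apparatus needs them.
    static List<String> diff(String[] master, String[] ms) {
        int[][] L = lcsTable(master, ms);
        List<String> edits = new ArrayList<>();
        int i = master.length, j = ms.length;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && master[i - 1].equals(ms[j - 1])) { i--; j--; }
            else if (j > 0 && (i == 0 || L[i][j - 1] >= L[i - 1][j])) edits.add("+ " + ms[--j]);
            else edits.add("- " + master[--i]);
        }
        Collections.reverse(edits);
        return edits;
    }

    public static void main(String[] args) {
        String[] master = {"tatra", "eva", "bhavati"};
        String[] ms     = {"tatra", "bhavati", "iti"};
        System.out.println(diff(master, ms)); // [- eva, + iti]
    }
}
```

The same alignment works over raw character sequences when word boundaries are absent, which is where the computation time can explode and the sparse-match optimizations become essential.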
By using the information resulting from the "critical edition" software, new and unexpected data analyses of old texts will be made possible, in particular the building of a "genealogical tree of all copies", often called a phylogenetic tree of manuscripts. In other terms, when dealing with hundreds of copies of an original text, and after having sorted out which parts of these copies may be original or not, it is possible to identify sub-families of copies deriving from a common ancient copy.
A few words about text comparison – classification and dating
The realization of a "critical edition" of a text broadly disseminated for centuries requires the processing of a great number of sources. Sometimes more than a hundred manuscripts are to be compared. Manual processing of such a number of manuscripts is almost insurmountable. The purpose of the present project is to allow an automation of this work.
As a base for experimentation, we will use a Sanskrit grammatical commentary of the 7th century, the "Benares Glose". To date, more than a hundred manuscripts of the "Benares Glose" have been listed in various libraries in India and Europe.
An electronic version of the complete text of the "Benares Glose" was carried out on the basis of an Indian edition. This edition is apparently based on a single source. It thus does not take into account the various versions of the text as they are presented in the tree of existing manuscripts.
It cannot be used directly. Nevertheless, it can be used as a starting point which will facilitate the processing of the other manuscripts. We will use it as the master text, at least at the beginning of our work.
A "critical" edition in the strict sense of the term must however take into account all the sources available, in order to process the greatest volume of information available and allow the history of the text and of its transmission to be rebuilt.
Comparison of manuscripts
The master text will be the electronic version of the "Benares Glose" non-critical edition.
The master text is to be modified manually and progressively, according to the manuscripts examined, in order to work out the best possible version. This version is intended to become the final version of the "Benares Glose".
In addition, in order to allow textual analysis, an analytical version of the master text is also envisaged, which will include in particular the decomposition of the phonetic connections (sandhi), the separation of the prefixes, and the analysis of compound words. This analytical version of the master text (padapatha) will also be employed during the automatic comparison.
Each manuscript will be collected in a single file. All alternatives (corrections, omissions, additions, etc.) as well as metadata will be recorded there as well.
Each file thus constitutes a genuine electronic reproduction of a given manuscript (see below, M1, M2, etc.). These files constitute the "raw material" and will be compared two by two with each other.
The elements which differ from this version (alternatives, gaps, additions, etc.), determined by the software, will appear in footnotes.
An index will be generated and used as a reference by the software to carry out the different comparisons.
The computer-assisted generation of a critical apparatus aims at producing a critical edition in the most automatic way possible.
The processing is guided by the examination of the master text (padapatha), which contains the separations into paragraphs and the separations of words (not carried out in general in the manuscripts before re-transcription).
All manuscripts are compared with the master text, two by two, paragraph by paragraph, so as to generate part of the critical apparatus. The other part is made up directly, by collecting the "footnotes" which accompany each manuscript.
One of the objectives of the project is to see which parts of the process could be automated.
Four outputs are expected
The transformation from LaTeX to XML (and vice versa) could be carried out outside the program, by specialized software such as "Latex2Xml".
The analysis of the data collected during the manuscript comparison is an important task.
The data necessary for the various classifications will be partly generated by the critical edition generation program.
The first level consists in carrying out a partition of the manuscripts into various clusters.
The clusters obtained will indicate which manuscripts are most similar and which are most different, depending on a set of pre-decided criteria.
The set of clusters will provide a serious scientific basis for determining which sets of manuscripts come from the same "father" manuscript.
It will be possible for the user:
A second level of data analysis will consist in the creation of a phylogenetic tree of the manuscripts.
The user will be able to choose the manuscripts to be included in the construction of such a tree, as well as the type of metric.
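As a sketch of this first, cluster-building level, the following Java program performs single-linkage agglomerative clustering. The distance matrix is invented for illustration; real distances would come from the comparison stage (for instance, the number of differing words between two manuscripts):

```java
import java.util.*;

public class ManuscriptClusters {
    // Hypothetical pairwise distances (e.g. number of differing words)
    // between four manuscripts M1..M4.
    static final double[][] D = {
        {0, 2, 8, 9},
        {2, 0, 7, 8},
        {8, 7, 0, 3},
        {9, 8, 3, 0},
    };

    // Single linkage: the distance between two clusters is the minimum
    // pairwise distance between their members.
    static double linkage(Set<Integer> a, Set<Integer> b) {
        double min = Double.MAX_VALUE;
        for (int i : a) for (int j : b) min = Math.min(min, D[i][j]);
        return min;
    }

    public static void main(String[] args) {
        List<TreeSet<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < D.length; i++) clusters.add(new TreeSet<>(Set.of(i)));
        // Repeatedly merge the two closest clusters; the merge order is the
        // raw material for a dendrogram, i.e. a phylogenetic tree.
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            System.out.println("merge " + clusters.get(bi) + " + "
                    + clusters.get(bj) + " at distance " + best);
            clusters.get(bi).addAll(clusters.remove(bj));
        }
    }
}
```

With these invented numbers, manuscripts 0 and 1 merge first (distance 2), then 2 and 3 (distance 3), suggesting two sub-families descending from two different "father" manuscripts; the choice of linkage function is exactly the "type of metric" the user would select.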
Example of a phylogenetic tree:
Classification of the manuscripts of the "Hymns" of Homer
(Les Belles Lettres, Paris)
M, D, At, H, J, K, etc. are the only manuscripts which are now available.
All the others, (f), (s), (z), (x), (psi), (fi), are only assumptions; they no longer exist.
Epigraphists assume that H, J, K come from the same "father" (z). They also assume that (z) and (f) come from a common father (g). The whole set comes from (Omega), which is supposed to be the original document of the 11th century BC.
Collation Demo Guide
Text collation supports language-sensitive comparison of strings, allowing for text searching and alphabetical sorting. The collation classes provide a choice of ordering strength (for example, to ignore or not ignore case differences) and handle ignored, expanding, and contracting characters. Developers don't need to know anything about the collation rules for various languages. Any feature requiring collation can use the collation object associated with the current default locale, or with a specific locale (like France or Japan) if appropriate.
Collation Basics
Correctly sorting strings is tricky, even in English. The results of a sort must be consistent: any differences between strings must always be sorted the same way. The sort assigns relative priorities to different features of the text, based on the characters themselves and on the current ordering strength of the collation object.
Other special characters, including accented or grouped characters, add further complications. For example, the "-" hyphen character in the word "black-bird" is only significant if the other letters in the compared strings are identical.
Localizable Collation
Different collation objects associated with various locales handle the differences required when sorting text strings for different languages.
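The strength and locale behaviour described above can be seen with the standard java.text.Collator class. This is a small sketch; the sample words are arbitrary:

```java
import java.text.Collator;
import java.util.Locale;

public class CollationBasics {
    public static void main(String[] args) {
        // A locale-sensitive collator: the French collator knows how
        // accented letters relate to their base letters.
        Collator fr = Collator.getInstance(Locale.FRENCH);

        fr.setStrength(Collator.PRIMARY); // compare base letters only
        System.out.println(fr.compare("péché", "peche") == 0); // true: accents ignored

        fr.setStrength(Collator.TERTIARY); // accents and case now count
        System.out.println(fr.compare("péché", "peche") == 0); // false
    }
}
```

Raising or lowering the strength is how an application chooses whether case and accent differences matter for a given sort or search.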
To see this: you can modify an existing collation. Adding items at the end of a collation overrides earlier information. For example, you can make the letter P sort at the end of the alphabet.
Do this: enter the sample rules at the end of the Collation Rules field.
Sample rules:
< p , P
Making P sort at the end may not seem terribly useful, but the same mechanism is used to modify the sorting sequence for different languages.
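In code, entering that rule corresponds to appending it to the rules of a java.text.RuleBasedCollator (a sketch; the sample words are arbitrary):

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.*;

public class PSortsLast {
    public static void main(String[] args) throws ParseException {
        // Append "< p , P" to the default rules: rules added at the end
        // override earlier information, so p/P now sort after everything.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.US);
        RuleBasedCollator modified = new RuleBasedCollator(base.getRules() + " < p , P");

        List<String> words = new ArrayList<>(List.of("pear", "apple", "zebra"));
        words.sort(modified);
        System.out.println(words); // [apple, zebra, pear]
    }
}
```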
To see this: you can add new rules to an existing collation. For example, you can add CH as a single letter after C, as in traditional Spanish sorting.
Do this: enter the sample rules at the end of the Collation Rules field.
Sample rules:
& c < ch , cH, Ch, CH
Sample test cases:
Cat
As well as adding sequences of characters that act as a single character (this is known as contraction), you can also add characters that act like a sequence of characters (this is known as expansion).
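The traditional Spanish "ch" contraction above can be tried the same way with java.text.RuleBasedCollator (a sketch; the sample words are arbitrary):

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.*;

public class SpanishCH {
    public static void main(String[] args) throws ParseException {
        // "& c < ch , cH, Ch, CH" resets to c and inserts ch (in its four
        // case forms) as a single letter sorted after c: a contraction.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.US);
        RuleBasedCollator spanish =
                new RuleBasedCollator(base.getRules() + " & c < ch , cH, Ch, CH");

        List<String> words = new ArrayList<>(List.of("curioso", "chocolate", "cat"));
        words.sort(spanish);
        System.out.println(words); // [cat, curioso, chocolate]
    }
}
```

Because "ch" is now a letter that sorts after plain "c", "chocolate" moves after "curioso"; an expansion rule works the other way round, making one character compare like a sequence of letters.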
http://java.fh-wolfenbuettel.de/jdk1.1/demo/i18n/Collate/example1.html#basics
Working Package N° 3 – Testing software on a genuine example
Hence, the core issue of this project lies in the historical reconstitution of texts. Ultimately, it will be possible to use the software in various scholarly disciplines (philosophy, linguistics, mathematics, medicine, etc.) as well as in literary arts and poetics. But in order to test the software, we have decided to use a grammatical text: the Glose of Benares (Kashikavritti).
Why did we choose this particular text?
– Choice of the field: Grammar (vyâkarana) plays an important role in Indian thought, comparable to that of mathematics in Western philosophy. It is considered the principal auxiliary of the Veda. Hence, the choice of this discipline was quite an obvious one.
– Choice of a particular text: several criteria had to be met.
1) Relevance in the history of Sanskrit grammar. The Glose of Benares is the most ancient complete commentary on the main grammatical treatise (Pânini's Ashtâdhyâyî, 5th cent. BC), on which all Indian linguistic speculations are based.
The Glose of Benares is an invaluable tool for understanding Pânini's book. It is also fundamental from the point of view of the history of linguistic ideas in India, since all later works are based on or inspired by Pânini. The Glose of Benares has already been edited, but only on the basis of one or very few manuscripts. Therefore, nothing is known about its history, interpolated passages, omissions, or how the text has been transmitted in the Indian subcontinent, which would be quite unthinkable in the case of any Classical work in Greek or Latin.
2) Number of manuscripts. In order to test the software and avoid bugs, it was necessary to compare a great number of manuscripts of the same text on a wide scale, with various scripts (there are more than 10 different Indian scripts). There are over 150 listed manuscripts of the Glose of Benares in Indian or Western libraries.
3) Collection of manuscripts: collecting manuscripts, especially in India, is a time-consuming task (most Indian libraries do not send copies of their manuscripts, so they have to be collected on the spot, and curators are sometimes reluctant to let scholars take photographs or microfilms). More than two thirds of the manuscripts of the Glose of Benares were already collected by J. Bronkhorst (University of Lausanne) more than 10 years ago in order to produce a critical edition of the text. After having examined some of the manuscripts, he realised that it was impossible to produce the edition manually and decided to postpone this work until proper computer software was available. He has kindly made his copies of the manuscripts available to us, so that we do not waste time on the collection work.
A few quantifiable figures for WP3 package
Total number of pages of the Glose of Benares: 784 p. (electronic text in A4)
Total number of words: 244 039
Number of characters (without spaces): 1 430 297
For the present project, we have selected 3 chapters of about 20 p. each (i.e. about 44 000 characters). Each team will be responsible for a particular chapter (Pune, Kathmandu, Pondicherry), but for some of the scripts, collaborations will be organised (for instance, South Indian scholars may not be familiar with Bengali script, or Nepali scholars with Kerala script).
In the same way, for manuscripts that can only be examined on the spot, and for which copies cannot be made available, one member of the team will work simultaneously on all three chapters.
Last modified: Tue Jun 4 10:40:29 MEST 2002