Tralics, a LaTeX to XML translator; Part II

2. Interpreting XML in TeX

We shall describe here the xmltex.tex file. This is a piece of code written by David Carlisle, as described in [2]; it is a follow-up to typehtml, a package for typesetting HTML. The idea was to write a TeX file that interprets some XML code and typesets it, using rules defined in some other files (the so-called .xmt files) that depend on the DTD or namespaces. Some of these files are described in the following chapters. This interpreter is used for the production of the Raweb (in figure 1 of the first part of this document, the arrows from `xmlfo´ to `PDF´ or `PS´ use this file). The XML file contains a lot of Unicode characters, which can be coded using the iso-latin1 or UTF-8 encoding. Interpreting them in TeX is a real challenge. Here is an example:

<m:math overflow="scroll">
  <mrow xmlns="http://www.w3.org/1998/Math/MathML">
   <msup><mi>L</mi> <mn>2</mn> </msup><mo>&#x2192;</mo>
    <msup><mi>L</mi> <mi>&#x221E;</mi> </msup></mrow></m:math>

In this example, only ASCII characters are used; complicated things are written in the form &#x221E;, which is the same as &#8734;. The <mrow> element has an xmlns attribute; it is hence the same as <m:mrow>. The action associated with this element is stored in some command, so one question is: what´s the name of this command? Every Unicode character is allowed, i.e., much more than the 256 internal characters of TeX. We have another problem: an element name could be entered as <José> (iso-latin1 encoding). A good encoding is UTF-8, since it allows encoding of all Unicode characters as sequences of 8-bit bytes. For instance, the representation of a character looks like this: \8:Ã©, and that of the element is \E:0:JosÃ©. Here between the two colons we have the value of the namespace, a sequence of digits, where 0 represents the empty namespace, 3 the MathML namespace, etc. (namespaces are defined in [11]). All commands that are dynamically created start with a prefix. This is `8:´ for a UTF-8 character, `A:´ for the global attribute list, `E:´ for the start of an element, `E/:´ for the end of an element, `Q:´ for a processing instruction like <?xml?>, `XML:´ for a namespace. An example of such a namespace command is \XML:http://www.w3.org/1998/Math/MathML. Usual LaTeX commands contain only letters, reserved names may contain @; using a prefix with a character other than these reduces the risk of conflict with existing commands. We must use \csname for producing these commands. The math formula above is $L^2 \rightarrow L^\infty$.

In order to make the code easier to understand, we have invented some commands that are inlined in the real code (for efficiency reasons). The command \XML:http.../MathML (full name shown above) contains the unique identifier for the MathML namespace; this is in fact the number 3. It can be constructed via \jg@NSuri. In the case of <m:math>, the value of the `m´ prefix is the same number, it will be obtained via \jg@namespace{m}. In fact, when we parse an element, the prefix is in a global variable, so that we can use the parameterless command \jg@this@namespace.

\def\jg@NSuri#1{\csname XML:#1\endcsname}
\def\jg@namespace#1{\csname XMLNS@#1\endcsname}
\def\jg@this@namespace{\jg@namespace{\XML@this@prefix}}
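As a minimal sketch of how these three helper commands resolve names, the file below fills in two table entries by hand. The entries (identifier 3 for MathML, prefix `m´ bound to 3, and the value of \XML@this@prefix) are assumptions made for the illustration; in xmltex they are set by the parser.

```latex
\documentclass{article}
\makeatletter
\def\jg@NSuri#1{\csname XML:#1\endcsname}
\def\jg@namespace#1{\csname XMLNS@#1\endcsname}
\def\jg@this@namespace{\jg@namespace{\XML@this@prefix}}
% Assumed table entries: MathML has identifier 3, prefix m is bound to it.
\expandafter\def\csname XML:http://www.w3.org/1998/Math/MathML\endcsname{3}
\expandafter\def\csname XMLNS@m\endcsname{3}
\def\XML@this@prefix{m}% assumed: normally set while parsing <m:math>
\typeout{uri: \jg@NSuri{http://www.w3.org/1998/Math/MathML},
  prefix: \jg@namespace{m}, current: \jg@this@namespace}
\makeatother
\begin{document}
\end{document}
```

With these assumed entries, the three lookups all yield the same number, written to the log as `uri: 3, prefix: 3, current: 3´.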

In some cases, we need a canonical version of a string. We shall use the \catxii command for this: if \val is a command that expands to `some/val´ then the expansion of `\meaning\val´ is `macro:->some/val´; this is a list of character tokens with category code 12 (except for spaces). The \strip@prefix command removes everything up to `>´ and yields `some/val´. We need an \expandafter for changing the order of expansion.

\def\catxii#1{\expandafter\strip@prefix\meaning#1}
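The effect of \catxii can be checked with a small test file. This is only a sketch; \canonical is a hypothetical name introduced for the test.

```latex
\documentclass{article}
\makeatletter
\def\catxii#1{\expandafter\strip@prefix\meaning#1}
\def\val{some/val}
% \canonical (hypothetical name) receives the category-12 copy of \val.
\edef\canonical{\catxii\val}
\typeout{\meaning\canonical}
% \ifx compares tokens including category codes: the letters differ.
\ifx\val\canonical \typeout{same tokens}\else \typeout{different tokens}\fi
\makeatother
\begin{document}
\end{document}
```

The first message is `macro:->some/val´; the second is `different tokens´, because the letters in \val have category code 11 while those in \canonical have category code 12.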

See the TeXbook [6] for details about \expandafter, category codes, the result of \meaning, what happens if \meaning produces no greater-than sign, etc. See the LaTeX source code for \strip@prefix. See the Unicode book [8], paragraph 2.5, for the definition of encodings like UTF-8, UTF-16 and UTF-32, and the whole book for the significance of characters such as U+221E.

UTF-8 encoding is defined as follows. A character X is represented using a variable number of bytes, say A, AB, ABC or ABCD. Let x be the integer value of X, and a, b, c and d the values of A, B, C, and D. The first byte indicates the length of the sequence: if a<128, the sequence is of length one, and x=a. Otherwise, a starts with k bits equal to 1 followed by a 0 bit, and the sequence is of length k; each of the k-1 bytes that follow starts with a 1 and a 0 (and has 6 significant bits). These are the relations we shall use:

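\[
\begin{array}{llll}
\mbox{If } 0\le x<2^{7}: & k=1, & x=x_1, & a=x\\
\mbox{If } 2^{7}\le x<2^{11}: & k=2, & x=x_1 2^{6}+x_2, & a=128+64+x_1,\ b=128+x_2\\
\mbox{If } 2^{11}\le x<2^{16}: & k=3, & x=x_1 2^{12}+x_2 2^{6}+x_3, & a=128+64+32+x_1,\ b=128+x_2,\ c=128+x_3\\
\mbox{If } 2^{16}\le x<2^{21}: & k=4, & x=x_1 2^{18}+x_2 2^{12}+x_3 2^{6}+x_4, & a=128+64+32+16+x_1,\ b=128+x_2,\ c=128+x_3,\ d=128+x_4
\end{array}
\]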

In all cases $0\le x_i<64$. The case $x\ge 2^{21}$ is not handled. As an example, the character with code 233 is coded as Ã©. Note: assume that X is an iso-latin1 character; if it fits on seven bits it is represented by itself. Otherwise, if $128\le x<128+64$, the first character is Â, the second is X, and if $x\ge 128+64$ the first character is Ã, the second has value x-64 (note that most useful latin1 characters are in the range 192-255).
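As a worked instance of the three-byte case, take the character U+221E used in the MathML example at the start of this chapter. The computation below only applies the relations for k=3 to that one value.

```latex
\[
\begin{array}{l}
x = \mathtt{221E}_{16} = 8734, \qquad 2^{11}\le x<2^{16} \ \Rightarrow\ k=3\\
x = 2\cdot 2^{12} + 8\cdot 2^{6} + 30, \qquad x_1=2,\ x_2=8,\ x_3=30\\
a = 128+64+32+2 = 226 = \mathtt{E2}_{16}\\
b = 128+8 = 136 = \mathtt{88}_{16}\\
c = 128+30 = 158 = \mathtt{9E}_{16}
\end{array}
\]
```

The byte sequence E2 88 9E is the UTF-8 representation of U+221E.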

The file we consider starts like this:

1 %% Copyright 2000 David Carlisle, NAG Ltd.
2 %% re-released by Sebastian Rahtz June 2002
3 %% This file is distributed under the LaTeX Project Public License
4 %% (LPPL) as found at http://www.latex-project.org/lppl.txt
5 %% Either version 1.0, or at your option, any later version.

Unless told otherwise, newline characters are ignored in the xmltex file (in particular, on line 6, the space after the opening brace). More generally, lots of characters have category codes that depend on the context. We have not shown all these category changes; since definitions are in local groups, they are generally global (hence the \gdef here).

2.1. Constructing characters

Let´s consider the following task: We have a character X, with code x, and x is in \count@. We want to find the bytes ABCD, with codes a, b, c and d. These quantities are obtained by writing x in base 64, with digits $x_i$, and we add 128 to everything. The first byte is a bit more complicated to compute. This piece of code uses two temporary registers \@tempcnta and \@tempcntb for the division. It replaces x by its quotient, and puts in the \uccode of `#1 the next byte (this assumes that the argument of the command is a character).

6 \gdef\XML@utfeight@a#1{
7      \@tempcnta\count@
8      \divide\count@64
9      \@tempcntb\count@
10      \multiply\count@64
11      \advance\@tempcnta-\count@
12      \advance\@tempcnta"80
13      \uccode`#1\@tempcnta
14      \count@\@tempcntb}

This is the caller of the preceding command:

15 \gdef\XML@charref#1#2;{
16   \begingroup
17   \uppercase{\count@\if x\noexpand#1"\else#1\fi#2}\relax
18   \ifnum\count@<"80\relax
19     \uccode`\~\count@
20     \uppercase{
21     \ifnum\catcode\count@=\active
22       \gdef\XML@tempa{\utfeightay~}
23     \else
24       \gdef\XML@tempa{~}
25     \fi}
26   \else\ifnum\count@<"800\relax
27      \XML@utfeight@a,
28      \XML@utfeight@b C\utfeightb.,
29   \else\ifnum\count@<"10000\relax
30      \XML@utfeight@a;
31      \XML@utfeight@a,
32      \XML@utfeight@b E\utfeightc.{,;}
33    \else
34      \XML@utfeight@a;
35      \XML@utfeight@a,
36      \XML@utfeight@a!
37      \XML@utfeight@b F\utfeightd.{!,;}
38     \fi
39     \fi
40   \fi
41   \endgroup}

There is a similar command, except that the test on lines 21-25 is assumed to be true, and code on line 22 is executed. It seems to be used only for reading auxiliary files in XML format; however, the .aux files contain currently no XML code.

42 \gdef\XML@charref@tex#1#2;{
43   ...}

We shall see in the sequel some instances of `black magic´. The result of `\uppercase{xe9}´ is `XE9´. However, if you say `\uppercase{\foo~}´ the result is `\foo W´, where W is the character found in the uc table of the tilde character, and the category code of this character is the same as that of the tilde character (in general active). The substitution is done before \foo is evaluated. In some cases, \foo is \endgroup. In our case, the group ends at line 41. The \uppercase on line 17 is not black magic. The idea is the following: imagine that we want to read something like `&#233;´ or `&#xe9;´, and that the ampersand and sharp characters have been read. Then \XML@charref reads all characters up to the semicolon. Arguments are 2, 33 in one case, x, e9 in the other case. A construction like \count@="E9 puts 233 into \count@; upper case letters are needed. What \uppercase produces in our example is `\count@\if X\noexpand X"\else X\fi E9´; there is a \relax in the code whose purpose is to mark the end of the number (we do not want the \ifnum to be expanded before the assignment is done); this \relax command could have been in the uppercase list. I don´t know if \noexpand is needed here. The effect of the conditional is just to replace the X by a " (you cannot do this using black magic, because the double quote has to be of category 12, so that the argument of the command must be of category code 12). What our code does is to put the number (say 233) into \count@. It chooses one of four alternatives on lines 18, 26, 29 and 33, according to the number of bytes used to represent the Unicode character in UTF-8 format. In any case, the result is a definition of \XML@tempa as a command that starts with \UTF8? (this is a shorthand for one of \UTF8ax, \UTF8ay, \UTF8az, \UTF8b, \UTF8c, or \UTF8d; the real name of the command is \utfeightax, etc.) followed by some characters (1, 1, 1, 2, 3, and 4 respectively).
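The \uccode and \uppercase trick can be observed in isolation. The following sketch is not taken from xmltex: \demo is a hypothetical name, and the letters A and B stand in for the computed byte values.

```latex
\documentclass{article}
\begingroup
  % Pretend the two computed byte values are the codes of A and B.
  \uccode`\.=`\A
  \uccode`\,=`\B
  % \uppercase rewrites the character tokens first; only then is
  % \endgroup executed, which restores the \uccode table.
  \uppercase{\endgroup\def\demo{\relax.,}}
\typeout{\meaning\demo}
\begin{document}
\end{document}
```

The message is `macro:->\relax AB´: the point and the comma have been replaced by characters with the requested codes, while the non-character token \relax is untouched.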

Let´s start with the case of one byte, lines 19-25. We have a special case here, because the UTF-8 character can be represented by a single TeX character; we use it, in the case where it is not expandable (i.e., is non active); the code on lines 42-43 does not use this simplification. As an example, if the number x is 65, then \XML@tempa will contain A; if it is 60, it will contain `\utfeightay<´ (we assume that the less-than sign is of category code 12 when the code is read, this is needed on line 18, and of category 13 when the code is executed).

In the case where more than one byte is used, the idea is the following. We have to compute some integers a, b, c and d (two, three or four values are required). These integers are in the range 1–255. If we store them in the uc-slot of A, B, C or D, then \uppercase{ABCD} will give a sequence of four characters, whose codes are the numbers a, b, c and d. Instead of these letters, point, exclamation point, comma and semicolon are used, in a random order. This is completely irrelevant since modifications are local (the group ends on line 41), and the \uppercase on line 47 sees only these character tokens, together with non-character tokens that are not affected. The code could be slightly optimized if, on one hand, we notice that a is always stored in `.´ (point) and, on the other hand, that b could always be stored in `!´ (exclamation point). On lines 26 to 37 we compute b, c and d, and call \XML@utfeight@b with four arguments. Argument #3 is the character that will hold a, argument #4 is the list of characters that are already set, argument #2 is the command name, one of the \UTF8? commands mentioned above. The first argument is a C, E, or F. Remember that $a=x_1+s$, where $x_1$ is in \count@ and s depends on the number of bytes: it is sixteen times 12, 14 or 15 (in base 16, it is C0, E0 or F0). What the next function does is then obvious:

44 \gdef\XML@utfeight@b#1#2#3#4{
45      \advance\count@"#10\relax
46      \uccode`#3\count@
47      \uppercase{\gdef\XML@tempa{#2#3#4}}}

Assume that our number is 233 (or E9, in base 16). We have $x_1=3$ and $x_2=41$. This gives b=128+41, stored in the \uccode of #4. This is the character ©. Here #1 is C, "#10 is 192. Thus we store 195 (this is the code of Ã) in the \uccode of #3. Thus, the effect of the uppercase is to define the command \XML@tempa (this is a temporary command name that any command may redefine); it takes no argument and expands to \utfeightbÃ©. The important point to remember: \XML@charref puts in \XML@tempa a list of tokens, this list is independent of the context, but the commands in the list have a meaning that depends on the context (redefined by the commands defined in the next paragraph).
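The arithmetic for 233 can be replayed with the same register operations as lines 7-14, followed by the addition done on line 45. This is a sketch: \splitlowbits is a hypothetical name, and the \uccode assignment of the original is replaced by a message.

```latex
\documentclass{article}
\makeatletter
% \splitlowbits (hypothetical name) mirrors lines 7-14: it peels off the
% low six bits of \count@ and leaves the quotient in \count@.
\def\splitlowbits{%
  \@tempcnta\count@
  \divide\count@ 64
  \@tempcntb\count@
  \multiply\count@ 64
  \advance\@tempcnta-\count@
  \advance\@tempcnta"80
  \typeout{continuation byte: \the\@tempcnta}%
  \count@\@tempcntb}
\count@=233
\splitlowbits
\advance\count@"C0
\typeout{first byte: \the\count@}
\makeatother
\begin{document}
\end{document}
```

The log shows `continuation byte: 169´ and `first byte: 195´, in agreement with the values of b and a computed above.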

2.2. Using UTF-8 characters

The piece of code that follows defines the six commands \UTF8? (there are other versions of the same commands). These definitions are useful in a context where we evaluate a piece of text.

48 \def\unprotect@utfeight{
49   \let<\XML@lt@markup
50   \let&\XML@amp@markup
51   \def\utfeightax##1{
52     \csname 8:\string##1\endcsname}
53   \let\utfeightay\utfeightax
54   \let\utfeightaz\utfeightax
55   \def\utfeightb##1##2{
56     \csname 8:##1\string##2\endcsname}
57   \def\utfeightc##1##2##3{
58     \csname 8:##1\string##2\string##3\endcsname}
59   \def\utfeightd##1##2##3##4{
60     \csname 8:##1\string##2\string##3\string##4\endcsname}}

For instance, \utfeightbÃ© expands to \csname8:Ã\string ©\endcsname. We shall see in a minute why all characters have to be protected, except the first one. If we expand this, we get the command with this strange name \8:Ã©. This command is assumed to typeset the Unicode character 233. Its definition could be, for instance, `\ifmmode \acute{e}\else \'{e}\fi´. Such a definition is valid only in a context where we typeset the object. Inside an \edef, the expansion of the conditional may give random results, inside a \csname, some tokens are illegal. Note that, in this command, less-than and ampersand are active, they scan something in the XML file; they should be input as `&lt;´ or `&amp;´ if you want a typeset < or &.

The next command looks funny:

61 \gdef\UnicodeCharacter#1#2{
62    \begingroup
63    \def\active{\catcode\count@}
64    \XML@charref#1;
65    \expandafter\expandafter\expandafter
66    \expandafter\expandafter\expandafter
67    \expandafter
68     \gdef\XML@tempa{#2}
69   \endgroup}
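The chain of seven \expandafter tokens on lines 65-67 performs three successive expansions of the token that follows \gdef. The sketch below reproduces this in isolation; \stepA to \stepD are hypothetical names, and \def is used instead of \gdef since no group is involved.

```latex
\documentclass{article}
% \stepA expands to \stepB, which expands to \stepC, then to \stepD.
\def\stepA{\stepB}
\def\stepB{\stepC}
\def\stepC{\stepD}
% Seven \expandafter expand the token after \def three times.
\expandafter\expandafter\expandafter
\expandafter\expandafter\expandafter
\expandafter
\def\stepA{reached after three expansions}
\typeout{\meaning\stepD}
\typeout{\meaning\stepA}
\begin{document}
\end{document}
```

The first message is `macro:->reached after three expansions´: the command that got defined is \stepD, while \stepA keeps its original meaning.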

There are seven \expandafter in a row. Write \E instead, in order to gain space. Assume that we have a command \A that expands to \B that expands to \C that expands to \D. The expansion of `\E\E\E\E\E\E\E\gdef\A´ is `\E\E\E\gdef\B´. This expands to `\E\gdef\C´. This expands to `\gdef\D´. Suppose that we say \UnicodeCharacter{233}{\'e}. In this case \XML@charref will define \XML@tempa as shown above. This is our \A. The expansion \B is \utfeightbÃ©. Its expansion \C is \csname..., its expansion \D is \8:Ã©. Hence, the code is \gdef\8:Ã©{\'e}. Thus, we know how to define every Unicode character. There is a little hack here (on line 63; do you see why?). Characters like A, B, C, typeset to themselves. But some other characters have to be defined. We say for instance

70 \UnicodeCharacter{94}{\textasciicircum}
71 \UnicodeCharacter{x5C}{\textbackslash}
72 \UnicodeCharacter{x5F}{\textunderscore}
73 \UnicodeCharacter{13}{ \ignorespaces}
74 \UnicodeCharacter{32}{ \ignorespaces}
75 \UnicodeCharacter{9}{ \ignorespaces}

These definitions come from the xmltex.tex file, and the Raweb redefines the character U+5C, so as to allow it in math mode also. The definition of characters 9, 13 and 32 (spaces) is a bit strange: the \ignorespaces command expands the next token, and removes it, if it is a space; hence spaces given in the form &#32; are not removed. Worse: `\parindent = 12 cm´ becomes illegal if what follows the equals sign comes from an XML file. The xmltex.tex file also has these definitions.

76 \expandafter\def\csname8:\string<\endcsname{\ifmmode\langle\else\textless\fi}
77 \expandafter\def\csname8:\string>\endcsname{\ifmmode\rangle\else\textgreater\fi}
78 \expandafter\def\csname8:\string{\endcsname{\{}
79 \expandafter\def\csname8:\string}\endcsname{\}}

What does the test on line 21 do? It compares the category code of the character whose code is in \count@ with \active; this is 13, and the test is false in the cases shown above (well, the backslash may be active while reading the XML file, it is surely not while processing line 71). Redefining \active has the side effect that it expands to `\catcode\count@´, so that the test compares \catcode\count@ with itself and is always true. As a consequence `\XML@tempa´ expands to `\utfeightay^´ that expands to \csname... that expands to \8:^. Hence, line 70 defines the command \8:^. This is what is desired. Note: when the XML file is read, all characters with code ≥128 are active, those with code ≤31 have category 12 (in fact, they are invalid in XML 1.0).

The xmltex.tex file starts like this (before category codes of usual characters have been changed).

80 \count@0
81 \catcode0=13
82 \gdef\XML@tempa{
83  \begingroup
84    \uccode0\count@
85   \uppercase{\endgroup
86     \edef^^@{
87       \ifnum\catcode\count@=11 %
88         \noexpand\utfeightay\else\noexpand\utfeightax\fi
89       \noexpand^^@}
90     \expandafter\edef\csname 8:\string^^@\endcsname{\string^^@}}
91  \ifnum\count@<127\advance\count@1 \expandafter\XML@tempa\fi}
92 \XML@tempa
93 \catcode0=9

Here we have real magic. There is a loop over all numbers x between 0 and 127. The number x is in \count@. For each x, the code on lines 83–90 is executed. The null character (number zero) is active, and its uc value is x. In lines 86–90, it will be replaced by the character x. Note that this character is input as ^^@. Assume for instance that x=65, so that it represents the letter A, or that x=61 (character `=´). The second \edef defines \8:A or \8:= to be A or = (note: the purpose of the \edef is to expand the \string in the body, so that the character in the body is a non-active character). Hence the effect is the same as \UnicodeCharacter{65}{A}. The purpose of the \edef on line 86 is the expansion of the conditional: we define A to be `\utfeightay A´, and = to be `\utfeightax=´. The character after the command is active. Consider this:

94 \def\use@utfeightay{...}
95 \use@utfeightay ^^M ^_~%$#{}

We have simplified the code a bit. The idea is that, for the characters listed here, \utfeightay is used instead of \utfeightax. We shall see later that \utfeightaz is used for ampersand and less than in a case like &amp; and &lt;.

The following piece of code defines the commands \UTF8? (version two).

96 \def\utfeight@protect@internal{
97   \let\utfeightax\noexpand
98   \let\utfeightay\noexpand
99   \def\utfeightaz{
100     \noexpand\utfeightaz\noexpand}
101   \let<\relax\let&\relax
102   \def\utfeightb##1##2{
103     \noexpand\utfeightb##1\string##2}
104   \def\utfeightc##1##2##3{
105     \noexpand\utfeightc##1\string##2\string##3}
106   \def\utfeightd##1##2##3##4{
107     \noexpand\utfeightd##1\string##2\string##3\string##4}}

What happens if a UTF8 character appears in an \edef? For instance, the character `é´, represented as `\utfeightb Ã©´ expands to the expansion of `\noexpand\utfeightb Ã\string©´, namely `\utfeightbÃ©´. The only thing that might have changed is the category code of ©. If it was active, it is now 12 (remember, the first character is never active). In the case \utfeightay A, the expansion is A, because \utfeightay is \noexpand. In the case of \utfeightaz W, the expansion is itself! Note that `<´ and `&´ are not modified.
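That a two-byte character survives an \edef unchanged in this mode can be checked as follows. This is a sketch in which the ASCII letters A and B stand in for the two bytes, and \once and \twice are hypothetical names.

```latex
\documentclass{article}
% Definition from lines 102-103, with A and B standing in for the bytes.
\def\utfeightb#1#2{\noexpand\utfeightb#1\string#2}
\edef\once{\utfeightb AB}
\edef\twice{\once}
\typeout{\meaning\once}
% The second \edef reproduces exactly the same token list.
\ifx\once\twice \typeout{a second \string\edef changes nothing}\fi
\begin{document}
\end{document}
```

The first message is `macro:->\utfeightb AB´, and the \ifx test succeeds: after the first expansion (which may change the category code of the second character), further \edefs leave the list as it is.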

This is version three:

108 \def\utfeight@protect@external{
109   \def\utfeightax{
110     \noexpand\noexpand\noexpand}
111   \let\utfeightay\utfeighta@ref
112   \let\utfeightaz\utfeighta@ref
113   \edef<{\string<}
114   \edef&{\string&}
115   \def\utfeightb##1##2{
116     ##1\string##2}
117   \def\utfeightc##1##2##3{
118     ##1\string##2\string##3}
119   \def\utfeightd##1##2##3##4{
120     ##1\string##2\string##3\string##4}}

In such a case, the expansion of `\utfeightb Ã©´ is `Ã©´ where both characters are of category code 12. This is very interesting in the case of \write that expands everything. The string Ã© is the UTF-8 representation of é, and can be read again without trouble. The expansion of `\utfeightax~´ is `\noexpand~´. It will become ~ after another expansion. In the case of `\utfeightay A´, the expansion is `&#65;´ because of the following lines:

121 \def\utfeighta@ref#1{
122   \string&\string##\number\expandafter`\string#1\string;}
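The output of this mode can be inspected with \typeout, which expands its argument like \write. The sketch below copies \utfeighta@ref and the two-byte command of lines 115-116; the ASCII letters x and y stand in for the two bytes.

```latex
\documentclass{article}
\makeatletter
\def\utfeighta@ref#1{%
  \string&\string##\number\expandafter`\string#1\string;}
% Two-byte case of the external mode; x and y stand in for the bytes.
\def\utfeightb#1#2{#1\string#2}
\typeout{\utfeighta@ref A \utfeighta@ref~ \utfeightb xy}
\makeatother
\begin{document}
\end{document}
```

The message is `&#65; &#126; xy´: one-byte characters handled by \utfeighta@ref come out as numeric character references, whereas multi-byte characters are written as plain bytes.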

Version four: this is the easy version: everything is converted into characters, of category code 12; in this case Unicode characters can be used inside a \csname.

123 \def\utfeight@protect@chars{
124   \let\utfeightax\string
125   \let\utfeightay\string
126   \let\utfeightaz\string
127   \def\utfeightb##1##2{
128     ##1\string##2}
129   \def\utfeightc##1##2##3{
130     ##1\string##2\string##3}
131   \def\utfeightd##1##2##3##4{
132     ##1\string##2\string##3\string##4}}

2.3. Warnings

This piece of code is used in cases where we want to print something. It is the last definition of the \UTF8? series.

133 \def\utfeight@protect@typeout{
134   \utfeight@protect@chars
135   \let<\relax
136   \let&\relax}

This is the piece of code that removes the traces.

137 \def\xmltraceoff{
138   \global\let\XML@trace@warn\@gobble
139   \global\let\XML@trace@warnNI\@gobble
140   \global\let\XML@trace@warnE\@gobble
141   \global\let\XML@attrib@trace\relax}

These are the commands that print a warning. We have simplified the code a bit by removing (here) the body of some commands, and (elsewhere) calls to trace.

142 \def\XML@warnNI#1{
143   {\let\protect\string\utfeight@protect@typeout\message{^^J#1}}}
144 \def\XML@warn#1{
145   {\let\protect\string\utfeight@protect@typeout\message{^^J\XML@w@#1}}}
146 \def\XML@attrib@trace{...}
147 \def\XML@doattribute@warn#1#2#3{...}
148 \let\XML@trace@warn\XML@warn
149 \let\XML@trace@warnNI\XML@warnNI
150 \let\XML@trace@warnE\message

2.4. Reading the text

The next lines of code define a command \nfss@catcodes, such that, when executed, all characters have standard category codes. The @ character is a letter, quotes, less-than greater-than and equals-to are of category other.

151  \def\nfss@catcodes{
152   \catcode`\\0
153   % Idem for {}%^@#"'<=>
154   }

This changes even more category codes. Dollar, ampersand, hat, underscore, space have standard category codes, others have category 12.

155 \def\XML@reset{
156   \nfss@catcodes
157   % reset $&^_ space
158   % reset :!=|
159   \catcode`\~\active\def~{\nobreakspace{}}
160   \let\XML@ns@a@\XML@ns@a@tex
161   \let\XML@ns\XML@ns@tex}

The next lines of code define a command \XML@catcodes, such that, when executed, all characters have category codes useful for reading an XML file.

162 \def\XML@catcodes{
163   \catcode`\ \active
164   % same for: ^^M ^^I <>:[]%&"'=
165   % same for: /!?-${}#_\~
166   \def~{\utfeightay~}
167   \let\XML@ns@a@\XML@ns@a@xml
168   \let\XML@ns\XML@ns@xml
169 }

The following two commands are inlined for efficiency reasons. We have introduced them in order to gain space.

170 \def\Normalspace{\catcode`\^^I=10 \catcode`\^^M=10 \catcode`\ =10 }
171 \def\Activespace{\catcode`\^^I=13 \catcode`\^^M=13 \catcode`\ =13 }

This piece of code does a loop, starting with \count@, up to \@tempcnta (excluded). The loop puts the current number in the uc-code of tilde, and uppercases the value of \XML@tempa, to be defined later in the form \def\XML@tempa{{...}}; double braces are needed because \uppercase wants a brace-delimited list of tokens.

172 \gdef\utfeightloop{
173   \uccode`\~\count@
174   \expandafter\uppercase\XML@tempa
175   \advance\count@\@ne
176   \ifnum\count@<\@tempcnta
177   \expandafter\utfeightloop
178   \fi}
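The loop can be tried with a harmless body. In this sketch the definition of \XML@tempa is an assumption made for the test (the real bodies are shown on lines 188-195); it only prints the character that replaces the tilde.

```latex
\documentclass{article}
\makeatletter
\def\utfeightloop{%
  \uccode`\~\count@
  \expandafter\uppercase\XML@tempa
  \advance\count@\@ne
  \ifnum\count@<\@tempcnta
    \expandafter\utfeightloop
  \fi}
\begingroup
  \count@=`\a
  \@tempcnta=`\d
  % Test body (assumption): show which character the tilde has become.
  \def\XML@tempa{{\typeout{[\string~]}}}
  \utfeightloop
\endgroup
\makeatother
\begin{document}
\end{document}
```

Three messages appear, `[a]´, `[b]´ and `[c]´; the upper bound d is excluded. The message text avoids letters on purpose, since \uppercase would convert them as well.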

We leave it as an exercise to the reader to define a command \XML@utfeight whose expansion is `utf-8´, all characters being of category code 12. This piece of code does nothing if the current encoding is `utf-8´, otherwise it sets the current encoding to `utf-8´, and does some action.

179 \gdef\XML@setutfeight{
180   \ifx\XML@utfeight\XML@thisencoding
181   \else
182     \let\XML@thisencoding\XML@utfeight
183     ...% see below
184   \fi}

This is the action: for every character that is the first in a sequence of 2, 3 or 4 characters, it defines the character (for instance Ã) to take 1, 2 or 3 arguments. For instance Ã is defined as \utfeightb Ã#1. The first argument to \utfeightb, \utfeightc, or \utfeightd is not active! This works, because \string~ is expanded to ~ of category code 12, where ~ is replaced by the \uppercase on line 174 by the character (for instance Ã); the \utfeightb command is not expanded since preceded by a \noexpand. The definition is in a double group (\begingroup on line 185, braces on lines 188, 192, 194). The definition is visible outside the group because it is global: we use \xdef. We could replace \gdef by \def here; whether the temporary is restored or not after the loop is irrelevant.

185   \begingroup
186   \count@"C2
187   \@tempcnta"E0
188   \gdef\XML@tempa{{\xdef~####1{\noexpand\utfeightb\string~####1}}}
189   \utfeightloop
190   \count@"E0
191   \@tempcnta"F0
192   \gdef\XML@tempa{{\xdef~####1####2{\noexpand\utfeightc\string~####1####2}}}
193   \utfeightloop
194   \@tempcnta"F4  \gdef\XML@tempa{{\xdef~####1####2####3{%
195                              \noexpand\utfeightd\string~####1####2####3}}}
196   \utfeightloop
197   \endgroup

This defines a command named \Q:xml. It calls \XML@xmldecl after having changed the category code of white space.

198 \expandafter\gdef \csname Q:xml\endcsname{
199   \Normalspace
200   \XML@xmldecl}

This resets some category codes and calls \XML@encoding. The argument is something strange. The idea is that we parse <?xml foo='bar'?>. We read everything up to the end of the element, and provide a default encoding attribute. A \relax marker is put at the end.

201 \gdef\XML@xmldecl#1?>{
202   \Activespace
203   \XML@encoding#1 e="utf-8"\relax}

The XML norm (see for instance [10], [9], [1]) says (rules 23, 24, 32, 80) that in <?xml?> only the encoding attribute can start with the letter `e´. This makes the loop easy. Note: other attributes are version, currently ignored (there are two versions of the XML standard, and the difference between them is tiny), and standalone (completely ignored).

204 \gdef\XML@encoding#1 #2{
205   \if\noexpand#2e
206     \expandafter\XML@encoding@aux
207   \else
208     \expandafter\XML@encoding
209   \fi}

We shall see later that \XML@quoted\foo reads 'bar' or "bar", and calls \foo with the value `bar´. The following piece of code grabs the attribute name, the equals sign, reads the attribute value and calls another command.

210 \gdef\XML@encoding@aux#1={
211   \XML@quoted\XML@setenc}

Here the \lowercase is no magic: the XML norm says (rule 80) that all characters should be ASCII characters (letters, digits, dot, underscore, dash), and case independent. In the case where the encoding is not UTF-8, some file is read; for instance iso-8859-1.xmt. On page 2, we have seen how to find the UTF-8 representation of a latin1 character. It depends on whether the character is smaller or larger than 192. Two easy loops suffice to define all characters like é as \utfeightbÃ©.

212 \def\XML@setenc#1#2\relax{
213   \lowercase{\gdef\XML@tempa{#1}}
214   \xdef\XML@tempa{\catxii\XML@tempa}
215   \ifx\XML@tempa\XML@thisencoding
216   \else
217     \ifx\XML@utfeight\XML@tempa
218       \XML@setutfeight
219     \else...% code not shown here
220     \fi
221   \fi}

2.5. Namespaces

You say \XML@ns@alloc{foo} in order to declare `foo´ as a namespace name; after that the value can be found by \jg@NSuri{foo}. If this command is already defined, there is nothing to do. Otherwise, we allocate a number using the counter \XML@ns@count, say 3, and put this in the command. We define two other commands: \jg@namespace{3} will be 3, and \A:3 will be empty (we shall see that this is the global attribute list of the namespace). Note: we use here the pseudo commands \jg@NSuri, so that a double indirection is needed; as a consequence \expandafter3 should be replaced by a sequence of three \expandafter tokens.

224 \def\XML@ns@alloc#1{
225   \expandafter3\ifx\jg@NSuri{#1}\relax
226     \global\advance\XML@ns@count\@ne
227     \expandafter3\xdef\jg@NSuri{#1}{\the\XML@ns@count}
228     \global\expandafter3\let\csname A:\the\XML@ns@count\endcsname\@empty
229     \expandafter3\xdef\jg@namespace{\the\XML@ns@count} {\the\XML@ns@count}
230   \fi}

The namespace stuff is initialized like this; number 0 corresponds to the empty namespace. Note that the recommendations say: The prefix `xml´ is by definition bound to the namespace name: http://www.w3.org/XML/1998/namespace.

231 \XML@ns@count-1
232 \XML@ns@alloc{}
233 \XML@ns@alloc{http://www.w3.org/1998/xml}
234 \def\XMLNS@xml{1}
235 \XML@ns@alloc{http://www.dcarlisle.demon.co.uk/xmltex}

The next piece of code is a standard trick to convert `foo:bar´ into {foo}{bar} and `foo´ into {}{foo}. The auxiliary command sees `bar´ or `\@´ as second argument; argument 3 is junk. This works only if \\ does not appear in the argument; moreover, it is recommended that at most one colon appears, and no \@, otherwise too many tokens are considered as junk.

236 \gdef\XML@ns@xml#1{\expandafter\XML@ns@a@xml#1:\@:\\}
237 \gdef\XML@ns@a@xml#1:#2:#3\\{
238   \ifx\@#2 \XML@ns@b{}{#1}
239   \else    \XML@ns@b{#1}{#2}
240   \fi}

The function above depends on the category code of the colon character. We define an alternative version of the command and its helper, and install \XML@ns@a@ to be the TeX variant, but it may be redefined (see lines 160 and 167).

241 \def\XML@ns@tex#1{...}
242 \def\XML@ns@a@tex#1:#2:#3\\{...}
243 \let\XML@ns@a@\XML@ns@a@tex
244 \let\XML@ns\XML@ns@tex

What this code does is just to expand everything (in order to get a canonical form). Thus \XML@ns, as well as all its variants, takes a sequence like `foo:bar´, puts it in a canonical form, and puts `foo´ in \XML@this@prefix, `bar´ in \XML@this@local.

245 \def\XML@ns@b#1#2{
246   \begingroup
247   \utfeight@protect@chars
248   \xdef\XML@tempa{#1}
249   \xdef\XML@tempb{#2}
250   \endgroup
251   \let\XML@this@prefix\XML@tempa
252   \let\XML@this@local\XML@tempb
253   }
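The splitting trick of lines 236-240 can be tried on its own. In this sketch, \split@qname and \split@qname@a are hypothetical names, and the call to \XML@ns@b is replaced by a message; the colon has its usual category code.

```latex
\documentclass{article}
\makeatletter
% Append the sentinel :\@:\\ and let delimited arguments do the split.
\def\split@qname#1{\split@qname@a#1:\@:\\}
\def\split@qname@a#1:#2:#3\\{%
  \ifx\@#2 \typeout{prefix=[], local=[#1]}%
  \else    \typeout{prefix=[#1], local=[#2]}%
  \fi}
\split@qname{m:math}
\split@qname{mrow}
\makeatother
\begin{document}
\end{document}
```

The two messages are `prefix=[m], local=[math]´ and `prefix=[], local=[mrow]´. When the name has no colon, the second argument is the sentinel \@ itself, which is what the \ifx test detects.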

2.6. Redefining \protect

In order to prevent premature expansion, you can insert \protect before a command; this makes it “robust”; the \protect command is defined in LaTeX, its value depends on the context. It may be \@unexpandable@protect, that is \noexpand\protect\noexpand. Hence \protect\foo expands to itself in an \edef. On line 96, we define a command so that \utfeightb Ã© (the internal representation of é) also expands to itself. In this section, we modify all context switch commands in order to make all UTF-8 characters naturally robust.

We start with a modified \xdef in which \protect and UTF-8 characters are left unchanged. This works well in a group because the end of the group restores the old value.

254 \def\unrestored@protected@xdef{
255    \utfeight@protect@internal
256    \let\protect\@unexpandable@protect
257    \xdef
258 }

Another extension to LaTeX: Here everything is done in a group, the definition is global, the modifications to \protect and \UTF8? are local; the group is terminated after the \xdef because of the \afterassignment.

259 \def\protected@xdef{
260    \begingroup
261    \utfeight@protect@internal
262    \let\protect\@unexpandable@protect
263    \afterassignment\endgroup
264    \xdef}
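The interplay of the group, \afterassignment and \xdef can be seen in a reduced form. In this sketch, \grouped@xdef, \demo@helper and \demo@result are hypothetical names; the local definition of \demo@helper plays the role of the local changes to \protect and the \UTF8? commands.

```latex
\documentclass{article}
\makeatletter
\def\grouped@xdef{%
  \begingroup
  \def\demo@helper{local value}% local setting, like \protect above
  \afterassignment\endgroup    % close the group once \xdef is done
  \xdef}
\grouped@xdef\demo@result{[\demo@helper]}
\typeout{\meaning\demo@result}
\typeout{helper is \ifdefined\demo@helper still defined\else undefined again\fi}
\makeatother
\begin{document}
\end{document}
```

The messages are `macro:->[local value]´ and `helper is undefined again´: the global definition survives the group, the local setting does not.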

Yet another one: No group is used here, and \afterassignment gets another token as argument. This is useful if we do not want an \xdef (for instance, \refstepcounter uses this to define the current label). The meaning of UTF-8 characters is not restored, but reset to XML mode.

265 \def\protected@edef{
266    \let\@@protect\protect
267    \let\protect\@unexpandable@protect
268    \utfeight@protect@internal
269    \afterassignment\restore@protect
270    \edef
271 }

We have to restore \protect and some other commands.

272 \def\restore@protect{\let\protect\@@protect
273    \unprotect@utfeight}

We have to redefine \protected@write. This command takes three arguments and writes the last one to the file designated by the first. Protection works as follows: an \edef expands all tokens except the protected ones and the current page reference (i.e., \thepage); the expansion takes into account the side effects of evaluating the second argument. For instance, in Chapter 4, line 2445, there is an example where \jgFOlabel is set to \relax; the \addtocontents command defines \label, \index and \glossary to gobble their arguments.

274 \long\def \protected@write#1#2#3{
275       \begingroup
276        \let\thepage\relax
277        #2
278        \utfeight@protect@external
279        \let\protect\@unexpandable@protect
280        \edef\reserved@a{\write#1{#3}}
281        \reserved@a
282       \endgroup
283       \if@nobreak\ifvmode\nobreak\fi\fi
284 }

We must also redefine this (it is used by \typeout).

285 \def\set@display@protect{
286   \let\protect\string
287   \utfeight@protect@typeout}

2.7. The catalogue

The catalogue is an association list, a sequence of the form \key{val1}{val2}. We have mentioned elsewhere that adding something at the end of a token list is not obvious. Here we proceed as follows. Consider

\edef\val{\noexpand\the\list\noexpand\key{\catxii\val}}

If we assume that \list is a command that cannot be expanded and \val expands to `some/val´, the code above puts \the\list\key{some/val} into \val. Assume now that \list is a reference to a token list, and that we say

\list\expandafter\expandafter\expandafter{\val{aux}}

Since \list is a reference to a token list, the code above is an assignment: a token list is expected after \list, and the first token is expanded to see whether it is a left brace. Because of the \expandafter tokens, the code is equivalent to

\list\expandafter{\the\list\key{some/val}{aux}}

Now, the token that follows \list can be expanded; hence the result is the same as

\list{<value of the list>\key{some/val}{aux}}

We can also say something like

\list\expandafter{\the\expandafter\list\expandafter\key\val{aux}}

Here the effect of \expandafter is to expand \the; this expands the token that follows, namely the \expandafter, so that \val is expanded. Thus the result is the same as

\list{<value of the list>\key some/val{aux}}
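Both ways of appending can be checked with a small plain TeX sketch. It is only an illustration: the names \mylist, \val, \tmp and \key are hypothetical, \catxii is left out, and the second method uses one more \expandafter than the text above, so that the value is inserted between braces.

```tex
% Plain TeX sketch; \mylist, \val, \tmp and \key are hypothetical names.
\newtoks\mylist
\def\val{some/val}
% First method: build \the\mylist\key{some/val} in \tmp, then expand it
% in front of the opening brace of the assignment.
\edef\tmp{\noexpand\the\mylist\noexpand\key{\val}}
\mylist\expandafter\expandafter\expandafter{\tmp{aux}}
% Second method: let \the trigger the chain of \expandafter tokens.
\def\val{other/val}
\mylist\expandafter{\the\expandafter\mylist\expandafter\key\expandafter{\val}{aux2}}
% Evaluate the list: each item calls \key with two arguments.
\def\key#1#2{\message{[#1 -> #2]}}
\the\mylist
\bye
```

After the two assignments the register holds two \key items, and evaluating it with \the calls \key once per item.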

The catalogue is a token list defined by a sequence of assignments like this:

288 \SYSTEM   {http://www.oucs.ox.ac.uk/dtds/tei-oucs.dtd} {tei.xmt}
289 \NAMESPACE{http://www.w3.org/1998/Math/MathML}         {mathml2.xmt}
290 \NAMESPACE{http://www.dcarlisle.demon.co.uk/sec}       {sec.xmt}
291 \NAME{langtest}                                        {langtest.xmt}
292 \NAME{TEI.2}                                           {tei.xmt}
293 \NAME{html}                                            {html.xmt}
294 \NAMESPACE{http://www.w3.org/1999/XSL/Format}          {fotex.xmt}

Here the last item on each line is the name of a TeX file to load in some cases. There are five different items in the catalogue, thus five commands that put things in it, and five other commands that extract something. The action of \FOO{A}{B} is essentially to add \XML@@FOO{A}{B} at the end of the token list.

Let´s start with the \PUBLIC command. It takes two arguments, a URI and a file name.

288 \def\PUBLIC#1#2{
289  \xdef\XML@tempa{#1}
290  \xdef\XML@tempa{\noexpand\the\XML@catalogue\noexpand\XML@@PUBLIC
291              {\catxii\XML@tempa}}
292  \global\XML@catalogue\expandafter\expandafter\expandafter{
293    \XML@tempa{#2}}}

Same idea here.

294 \def\SYSTEM#1#2{
295  \xdef\XML@tempa{#1}
296  \xdef\XML@tempa{\noexpand\the\XML@catalogue\noexpand\XML@@SYSTEM
297              {\catxii\XML@tempa}}
298  \global\XML@catalogue\expandafter\expandafter\expandafter{
299    \XML@tempa{#2}}}

In the case of a namespace, for instance MathML, we compute its namespace number, and the catalogue associates this number with the file in which everything is defined.

300 \def\NAMESPACE#1#2{
301   \utfeight@protect@chars
302   \XML@ns@alloc{#1}
303   \edef\@tempa{{\jg@NSuri{#1}}}
304   \global\XML@catalogue\expandafter{\the\expandafter\XML@catalogue
305      \expandafter\XML@@NAMESPACE\@tempa{#2}}
306   \unprotect@utfeight}

This is the easiest of all commands, since we do not have to do anything with the arguments.

307 \def\NAME#1#2{
308  \global\XML@catalogue\expandafter{\the\XML@catalogue\XML@@NAME{#1}{#2}}}

You run the catalogue by evaluating it. For instance, if you put `foo´ into \XML@PUBLIC, then the value associated to foo by the \PUBLIC command will be put in \XML@use.

309 \def\XML@@PUBLIC#1#2{
310   \gdef\XML@tempa{#1}
311   \ifx\XML@tempa\XML@PUBLIC \def\XML@use{#2}\fi}

Same action for SYSTEM. The temporary variable has a different name.

312 \def\XML@@SYSTEM#1#2{
313   \def\@tempa{#1}
314   \ifx\@tempa\XML@SYSTEM \def\XML@use{#2}\fi}

Same action for NAMESPACE.

315 \def\XML@@NAMESPACE#1#2{
316   \def\@tempa{#1}
317   \ifx\@tempa\XML@NAMESPACE  \def\XML@use{#2}\fi}

Same action for NAME.

318 \def\XML@@NAME#1#2{
319   \def\@tempa{#1}
320   \ifx\@tempa\XML@NAME  \def\XML@use{#2}\fi}
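Running the catalogue can be tried with the \NAME entries alone. The following plain TeX sketch copies the definitions of lines 307-308 and 318-320 (with % added at line ends, since \endlinechar is not -1 here); the two catalogue entries and the final \message are only there for the illustration.

```tex
% Plain TeX sketch of the catalogue, restricted to \NAME entries.
\catcode`\@=11
\newtoks\XML@catalogue
\def\NAME#1#2{%
  \global\XML@catalogue\expandafter{\the\XML@catalogue\XML@@NAME{#1}{#2}}}
\def\XML@@NAME#1#2{%
  \def\@tempa{#1}%
  \ifx\@tempa\XML@NAME \def\XML@use{#2}\fi}
% Fill the catalogue.
\NAME{TEI.2}{tei.xmt}
\NAME{html}{html.xmt}
% Look up the entry for `html'.
\def\XML@NAME{html}
\let\XML@use\empty
\the\XML@catalogue
\message{file to load: \XML@use}
\bye
```

Each item of the list compares its key with \XML@NAME, and only the matching item defines \XML@use.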

You say \XMLNS{html}{http://www.w3.org/1999/xhtml}. The effect is to associate to the name html the namespace value of the second argument, for instance 17.

321 \def\XMLNS#1#2{
322   \utfeight@protect@chars
323   \XML@ns@alloc{#2}
324   \edef\@tempa{{#1}{\jg@NSuri{#2}}}
325   \global\XML@catalogue\expandafter{\the\expandafter\XML@catalogue
326      \expandafter\XML@@XMLNS\@tempa}
327   \unprotect@utfeight}

This piece of code is a bit strange; it might produce unexpected results. The idea is the following. The command \XML@checkknown will run the catalogue in case of unknown elements. In the case of <TEI.2> or <html>, where no namespace prefix is given, the command \XML@NAME is set, and \XML@@NAME may define \XML@use. However, if \XMLNS has been defined as above, this piece of code is also executed: it defines a default namespace, in particular it could replace <0:html> by <17:html>. The last action is to define \XML@NAMESPACE, and we are ready to run the catalogue again.

328 \def\XML@@XMLNS#1#2{
329   \def\@tempa{#1}
330   \ifx\@tempa\XML@NAME
331     \edef\XMLNS@{#2}
332     \edef\XML@this@element{\XMLNS@\noexpand:\XML@this@local}
333     \let\XML@NAMESPACE\XMLNS@
334   \fi}

2.8. Reading elements

Let´s start slowly. This piece of code is executed whenever we see a less-than sign, that is the start of an element. We have to distinguish between </foo>, <?foo>, <!foo> and <foo>. The procedure reads one character. What makes everything interesting is that \fi tokens are missing. On the other hand, we have inserted a \@ marker, whose purpose is to skip easily over all useless tokens.

335 \def\XML@lt@markup#1{
336   \Normalspace
337   \ifx/#1\XML@getend
338   \else\ifx!#1\XML@getdecl
339   \else\ifx?#1\XML@getpi
340   \else\XML@getname#1\@}

The function that follows is defined in an environment where space, newline, and tabulation are active characters (remember that \endlinechar is -1, so that newline characters are produced only via ^^M). The code makes these characters active and defines them; this action is local to a group (the group ends on line 352). When we typeset some text, it is wise to activate these characters; on the other hand, spaces have their normal category code when scanning attributes. In any case, space characters disappear at the end of a line, which explains the need for the % signs here. This piece of code is called when the XML file contains <foo>; we have read the less-than sign and the letter that follows; the letter is in #1 (if you look at line 340, you see that #1 is nothing else than the first argument of \XML@lt@markup, because this cannot be \@. It could be Ã, i.e., the first byte of a Unicode character). We know that this is not a slash, not an exclamation point, and not a question mark, so we close these conditionals. The reader should take some time to understand how \XML@tempa is defined.

341 \gdef\XML@getname#1\@{
342 \fi\fi\fi
343 \begingroup
344 \Activespace
345 \def {\iffalse{\fi}\XML@getname@}
346 \let^^M %
347 \let^^I %
348 \def/{\iffalse{\fi}\XML@getname@/}
349 \def>{\iffalse{\fi}\XML@getname@>}
350 \unrestored@protected@xdef\XML@tempa{\iffalse}\fi#1}

The last line contains \unrestored@protected@xdef; this is a command that modifies the behavior of some UTF-8 characters; it assumes that it is called inside a group (hence the `unrestored´); it evaluates to \xdef (see line 257). After TeX has seen the opening brace, all tokens are expanded; as a result the `\iffalse}\fi´ is ignored. Note that the `\iffalse{\fi´ that appears in the definition of space, tabulation, newline, slash, and greater-than sign also disappears; the full expansion of these commands is a closing brace, \XML@getname@, and maybe one character. It is this closing brace that terminates the \xdef; in other words, \XML@tempa will contain everything up to these characters. In the case of <foo>, <foo/>, or <foo a='b'>, it will contain `foo´. In the case of <José>, in a document with latin1 encoding, it will contain `JosÃ©´, where the Ã is an active character.
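The brace trick can be isolated in a few lines of plain TeX. This is a minimal sketch, not the xmltex code: only the greater-than sign is made active, and \name and \afterName are hypothetical names standing for \XML@tempa and \XML@getname@.

```tex
% Plain TeX sketch; \name and \afterName are hypothetical names.
\catcode`\>=\active
% When expanded, > yields an unmatched closing brace followed by \afterName.
\def>{\iffalse{\fi}\afterName}
\def\afterName{\message{name is [\name]}}
% The body of \xdef is opened by a real brace; \iffalse}\fi expands to
% nothing, and the body is closed by the brace that comes from >.
\xdef\name{\iffalse}\fi foo>
\bye
```

The \xdef stops at the active character, so that \name contains the characters read so far, and \afterName is the next token to be executed.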

The \XML@getname@ command is defined below. It closes the group in which space and other characters have a funny definition. The \XML@begingroup is a hack that saves some stack space. It has the same features as \begingroup. The \XML@w@ command contains N spaces (where N is the current level). It is used for debugging, and argument grabbing. What the code does is: Put in \begintag the name of the element, in \XML@parent the current element (the parent of this one), initialize the current attribute list \XML@attribute@toks to the empty list, and parse the attributes.

351 \def\XML@getname@{
352   \endgroup
353   \XML@begingroup
354   \edef\XML@w@{ \XML@w@}
355   \let\begintag\XML@tempa
356   \let\XML@parent\XML@this@element
357   \XML@attribute@toks{}
358   \XML@getattrib}

The purpose of all the \expandafter tokens in the code is to pop the conditional stack (in other words, if the command takes an argument, the argument will be what follows the \fi, not the \fi itself). There are two cases to consider: there is an attribute, or there is none. In the case where there is no attribute, there are two subcases: the element can be empty or not. If you say <foo␣␣bar='gee'>, the first space was active, and read by the magic above; the second one has category code 10, and is discarded because the argument to this command is not a delimited argument.

359 \def\XML@getattrib#1{
360   \ifx#1/
361     \expandafter\XML@endempty
362   \else
363   \ifx#1>
364      \expandafter\expandafter\expandafter\XML@startelement
365   \else
366     \XML@getattrib@a#1
367   \fi
368   \fi}
369 \let\XML@@getattrib\XML@getattrib

In the case of <foo bar='1'/>, when the slash is seen, the greater-than sign is read, and </foo> is pushed back in the input stream. After that, we proceed as if there were no slash. This means that <foo/> is the same as <foo></foo>.

370 \def\XML@endempty#1>{
371   \expandafter\XML@startelement
372   \expandafter<\expandafter/\begintag>}

Here is a little trick: in the case where we are reading <foo bar='1'>, line 366 contains the command \XML@getattrib@a, followed by the letter b, followed by \fi\fi, followed by ar='1' (still unread). What we do is a trick to read an optional space before the equals sign (the space after the equals sign disappears because \XML@quoted uses an undelimited argument; a space in the attribute value will not disappear, because \XML@qq uses a delimited argument). We save the attribute name in a variable, and read the value.

373 \gdef\XML@getattrib@a#1\fi\fi#2={
374   \fi\fi
375    \XML@set@this@attribute#1#2 \@
376    \XML@quoted\XML@attribval}
377  
378 \def\XML@set@this@attribute#1 #2\@{
379   \def\XML@this@attribute{#1}}

You say \XML@quoted\foo'bar' or \XML@quoted\foo"bar". In both cases, \foo is called with bar as argument. In general, error handling is very poor. The purpose of \ERROR here is not to report an error in the case of wrong syntax. It will be used on line 550.

380 \def\XML@quoted#1#2{
381    \ifx#2"\expandafter\XML@qq
382    \else\ifx#2'\expandafter\expandafter\expandafter\XML@q
383    \else
384      \ERROR#2
385    \fi \fi #1}
386 \def\XML@qq#1#2"{#1{#2}}
387 \def\XML@q#1#2'{#1{#2}}
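The three definitions above can be exercised outside xmltex. In the following plain TeX sketch, \show@value is a hypothetical command that merely prints its argument; \ERROR is left undefined because it is never reached with well-formed input.

```tex
% Plain TeX sketch: lines 380-387, plus a hypothetical \show@value.
\catcode`\@=11
\def\XML@quoted#1#2{%
   \ifx#2"\expandafter\XML@qq
   \else\ifx#2'\expandafter\expandafter\expandafter\XML@q
   \else
     \ERROR#2%
   \fi \fi #1}
\def\XML@qq#1#2"{#1{#2}}
\def\XML@q#1#2'{#1{#2}}
\def\show@value#1{\message{[#1]}}
\XML@quoted\show@value'bar'
\XML@quoted\show@value"a b"
\bye
```

In both calls the quotes are removed and \show@value receives the value as a braced argument; the space in the second value is kept because the argument of \XML@qq is delimited.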

To make things easier to understand, write \Att instead of \XML@this@attribute (the attribute to analyze) and \AL instead of \XML@attribute@toks (the resulting list, to which tokens will be added). Before we forget it: this command terminates on line 407 with \XML@getattrib, hence continues parsing the attribute list. The normalized attribute name is compared to \XML@ns@decl, a command that contains `xmlns´ with category codes 12. If the attribute name is `xmlns´, this defines the default namespace; if the attribute is `xmlns:foo´, this defines the namespace prefix `foo´ for this element and its content. In both cases, we call \XML@ns@uri. Otherwise, if the attribute is foo:bar = `gee´, we add \XML@doattribute{foo}{bar}{gee} to the token list.

388 \def\XML@attribval#1{
389   \xdef\XML@tempa{\catxii\XML@this@attribute}
390   \ifx\XML@tempa\XML@ns@decl
391     \XML@ns@uri{}{#1}
392    \else
393      \XML@ns\XML@this@attribute
394      \edef\XML@this@prefix{\catxii\XML@this@prefix}
395      \ifx\XML@this@prefix\XML@ns@decl
396        \XML@ns@uri\XML@this@local{#1}
397      \else
398       \begingroup
399       \utfeight@protect@internal
400        \xdef\XML@tempa{
401           \the\XML@attribute@toks
402           \noexpand\XML@doattribute{\XML@this@prefix}{\XML@this@local}{#1}}
403        \endgroup
404        \XML@attribute@toks\expandafter{\XML@tempa}
405       \fi
406     \fi
407   \XML@getattrib}

Assume that we have xmlns:foo = `http://www.w3.org/1998/Math/MathML´. This piece of code allocates a number for the URI if not already done. Let´s assume that this number is 3. It then defines \XMLNS@foo to be 3. In the case xmlns='...' it defines \XMLNS@, the default namespace.

408 \def\XML@ns@uri#1#2{
409   \utfeight@protect@chars
410   \XML@ns@alloc{#2}
411   \expandafter\edef\jg@namespace{#1}{\jg@NSuri{#2}}
412   \unprotect@utfeight}

The next macro is called when we see the greater-than sign that closes the opening tag of an element. We first do something with default attributes; this will be explained later. After that, we split the element name, say `foo:bar´, into its prefix and its local part. If the namespace number of `foo´ is 4, we put the tokens `4:bar´ in \XML@this@element. We then execute a piece of code that may load a file in which the element´s behavior is defined, and finally execute the associated code. We shall see later that there is a command \xmlgrab that reads everything (including subelements) up to some end tag. This command redefines \XML@doelement, which explains why \XML@doelement is not inlined.

413 \gdef\XML@startelement{
414   \XML@default@attributes
415   \Activespace
416   \XML@ns\begintag
417   \edef\XML@this@element{
418     \jg@namespace{\XML@this@prefix\expandafter}\noexpand:\XML@this@local}
419   \XML@checkknown
420   \XML@doelement}

The action associated to <m:math> is just to call \E:3:math. We shall see later how this command can be defined.

421 \def\XML@doelement{
422   \csname E:\XML@this@element \endcsname}
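The dispatch through \csname can be sketched as follows in plain TeX. The two actions are hypothetical (they only print a message), and the namespace number 3 is assumed, as in the text; the end-tag command is called directly here instead of via \XML@doend.

```tex
% Plain TeX sketch: calling the commands associated to an element.
\catcode`\@=11
% Hypothetical actions for <m:math> and </m:math>, assuming namespace 3.
\expandafter\def\csname E:3:math\endcsname{\message{start of math}}
\expandafter\def\csname E/:3:math\endcsname{\message{end of math}}
\def\XML@this@element{3:math}
\def\XML@doelement{\csname E:\XML@this@element\endcsname}
\XML@doelement                          % calls \E:3:math
\csname E/:\XML@this@element\endcsname  % calls \E/:3:math
\bye
```

Since the command names contain colons and slashes, they can only be defined and called through \csname.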

Assume that we want to evaluate \E:3:math. This routine does nothing if the command exists. Otherwise, it “runs the catalogue” and then checks again: a warning is printed if the command still does not exist. Assume first that the prefix is empty (has number 0). In this case, the catalogue can have an entry for the name (defined via the \NAME command), defining a file to load. Otherwise, the catalogue should have an entry for the namespace (here 3), defined via \NAMESPACE. In any case, the catalogue should define \XML@use, the name of a file to be loaded. We simplified the code a bit by introducing the \jg@this@namespace command; note that we could use \XML@this@element. Important note: on line 353, there is a command \XML@begingroup, so that all definitions from the included file are local to this group. If the current element is not the root element, and if you want the definitions to apply to all elements (and not only the descendants of this one), the definitions had better be \global.

423 \def\XML@checkknown{
424   \expandafter\ifx
425     \csname E:\jg@this@namespace:\XML@this@local\endcsname
426     \relax
427    \let\XML@use\@empty
428    \ifnum0=\jg@this@namespace
429      \let\XML@NAME\XML@this@local
430      \the\XML@catalogue
431    \else
432      \edef\XML@NAMESPACE{\jg@this@namespace}
433      \fi
434    \let\XML@NAME\relax
435    \the\XML@catalogue
436    \inputonce\XML@use
437    \expandafter\ifx\csname E:\jg@this@namespace
438        :\XML@this@local\endcsname\relax
439      \XML@trace@warnE{Undefined}
440    \fi
441  \fi}

2.9. End of element

Look at lines 337-340. When we are reading </foo >, the \XML@getend command is called after the slash has been read. This little piece of code grabs all tokens, until the end of the command (argument #1, unused), and everything up to the greater-than sign (argument #2). It closes the conditional, and calls another command (the purpose of the call is to get rid of the final space). Why this changes the category code of space and not tabulation is beyond me.

442 \def\XML@getend#1\@#2>{
443   \fi
444   \catcode`\ \active
445   \XML@getend@a#2 \@}

We have now read </foo >, and \endtag contains `foo´. We extract the namespace part, and call a command (that may be redefined in case of grab).

446 \gdef\XML@getend@a#1 #2\@{
447   \Activespace
448   \def\endtag{#1}
449   \XML@ns\endtag
450   \XML@doend}

The action associated to </m:math> is to call the command named \E/:3:math. After that, we have to close a group (opened on line 353).

451 \gdef\XML@doend{
452   \csname E/:\jg@this@namespace:\XML@this@local  \endcsname
453   \XML@endgroup
454   \Activespace}

2.10. Using attributes

Consider <X:elt foo:bar='gee' color='red' xmlns:X='myX'/>. We have already seen that the effect of the xmlns:X attribute is to define X as a namespace for this element and its children. In the case of foo:bar, the definition of `foo´ could come later. For this reason, when we parse the attribute list, we construct a list of the form \do{a}{b}{c}; in our case it contains \do{foo}{bar}{gee} and \do{}{color}{red}. This list is in \XML@attribute@toks. There is another list that contains terms of the form \Att name\relax\cmd{val}action; it depends on the behavior of the element <X:elt>. Imagine that any element in the X namespace has an attribute some:background, with a default value of black, and <X:elt> has an attribute color, with some default value, and that some action is associated to it. This second list is the argument of the macro whose definition follows. It is constructed by \XMLelement; this construction can occur because of autoloading of some package, and this depends on the current namespace, i.e., myX. What follows \Att is the name with its namespace, for instance 25:background or 0:color; it is followed by \relax and the name of the command in which the element can get the value; it is followed by the default value (in braces) and an action (a sequence of commands, generally empty). Let´s assume that there is no action for the background, but \checkcolor for the color.

This piece of code evaluates both lists, with a double \relax at the end. We have to evaluate all namespaces, but there is no default namespace for attributes. For this reason we set \XMLNS@ to 0.

455 \def\XML@setattributes#1{
456   \let\XMLNS@@\XMLNS@
457    \def\XMLNS@{0}
458    \the\expandafter\XML@attribute@toks#1\relax\relax
459    \let\XMLNS@\XMLNS@@}

Let´s assume that our big list is the following \do {foo} {bar} {gee} \do {} {color} {red} \Att 25:background\relax \setbg {black} \Att 0:color\relax \setcol {blue} \checkcolor \relax \relax. We give here the definition of the command that evaluates the \do. It reads the three token lists that follow and defines a command that will be used twice. If this command is \T, then \T\foo applies \foo to some argument, namely \Att name\relax, where the character string between the two commands is the full name, for instance 17:bar or 0:color.

460 \def\XML@doattribute#1#2#3{
461   \xdef\XML@tempa##1{\noexpand##1{
462        \noexpand\XML@attrib\jg@namespace{#1}:#2\relax}}
463   \XML@tempa\XML@attrib@x{#3}
464   \XML@tempa\XML@attrib@y}

The command that follows is called twice in our example, with arguments `\Att17:bar\relax´ and `gee´, then `\Att0:color\relax´ and `red´. The action is to define a command \XML@tempb, whose action is to read some tokens, and define some command to be `gee´ or `red´. We shall see in a moment how this command is used. Putting a \def in a \def in a \def is unusual. There is a priori no reason why the second one should be global. The inner one has to be local.

465 \def\XML@attrib@x#1#2{
466     \gdef\XML@tempb##1#1##2##3##4\relax\relax{
467     \def##2{#2}
468     ##1##4\relax\relax}}

The command that follows is simple (its body has one line) but a bit subtle. Remember the long list of tokens shown above. It is of the form \do...\do...\Att...\Att...\relax\relax. We have read the \do..., and constructed a \Att..., which is in #1. Everything else is in #2. We apply \XML@tempb to the list, where the first \do... is removed and a new \Att... is added. Remember that \Att is followed by a name, then \relax, a command, a value, and maybe action. In #1 we have only the name and \relax; we provide here \XML@temp@l as command name, and 6 as value. We show the code, then the explanations.

469 \def\XML@attrib@y#1#2\relax\relax{
470   \XML@tempb#2#1\XML@temp@l{6}\relax\relax}

Consider first the case where #1 is `\Att17:bar\relax´. The command \XML@tempb takes four arguments, the first argument is delimited by #1, and the last by a double \relax. Our element <X:elt> knows nothing about `foo:bar´, so that the #1 is the one provided on line 470. Thus, the first argument is #2, the second argument is \XML@temp@l, the third argument is `6´, the last is empty. The effect of line 467 is to define \XML@temp@l, this is a dummy command, its definition is irrelevant. The effect of line 468 is to evaluate the long list again.

Consider now the case where #1 is `\Att0:color\relax´. This is found in the long list because the element accepts the color attribute. Hence the arguments of \XML@tempb are the following: The first argument is `\Att 25:background\relax \setbg {black}´ (in general, it starts with all the unhandled \do... commands), the second argument is `\setcol´, the third is `blue´, the last is `\checkcolor \Att 0:color\relax \XML@temp@l {6}´. The concatenation of arguments 1 and 4 is the long list, with the \Att... of color removed, and re-inserted at the end, with \XML@temp@l as command and 6 as value. The action is to define \setcol to `red´. The definition is local to the group started on line 353, ended on line 453. The action associated to <X:elt>, </X:elt> and descendants can see this value.

After all these \do... have been evaluated, our long list reduces to `\Att 25:background\relax \setbg {black} \checkcolor \Att 0:color\relax \XML@temp@l {6} \relax \relax´. This is the list constructed by \XMLelement, possibly re-ordered, where the command name associated to attributes that have a value is replaced by a dummy name. Note the placement of \checkcolor in this list: when it is evaluated, \setcol is defined, either to the value of the XML file or the default value. The definition of \Att is given here: it sets \setbg to black, and \XML@temp@l to 6. The only subtlety is that, if the default value is \inherit nothing happens. Note: the initial value should always be defined; however, the code checks this, replacing undefined by \relax.

471 \def\XML@attrib#1\relax#2#3{
472   \ifx\inherit#3\relax% #3 might be empty
473     \ifx#2\@undefined
474       \def#2{\relax}
475     \fi
476   \else
477     \def#2{#3}
478   \fi}
479 %
480 \let\inherit\XML@attrib % just some random name
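The whole mechanism can be followed on a reduced example. The plain TeX sketch below is a simplification under stated assumptions: the namespace number is passed directly to \XML@doattribute (no \jg@namespace lookup), \XML@attrib ignores the \inherit case, and the list contains a single attribute color='red' for an element whose default color is blue; \checkcolor is a hypothetical action that prints the value.

```tex
% Plain TeX sketch of lines 460-470, with a simplified \XML@attrib.
\catcode`\@=11
\def\XML@doattribute#1#2#3{%
  \xdef\XML@tempa##1{\noexpand##1{\noexpand\XML@attrib#1:#2\relax}}%
  \XML@tempa\XML@attrib@x{#3}%
  \XML@tempa\XML@attrib@y}
\def\XML@attrib@x#1#2{%
  \gdef\XML@tempb##1#1##2##3##4\relax\relax{%
    \def##2{#2}%
    ##1##4\relax\relax}}
\def\XML@attrib@y#1#2\relax\relax{%
  \XML@tempb#2#1\XML@temp@l{6}\relax\relax}
% Simplified: always install the default value (no \inherit test).
\def\XML@attrib#1\relax#2#3{\def#2{#3}}
\def\checkcolor{\message{color is \setcol}}
% The list: one attribute given in the file, then the element's defaults.
\XML@doattribute{0}{color}{red}%
\XML@attrib 0:color\relax\setcol{blue}\checkcolor\relax\relax
\bye
```

When \checkcolor is evaluated, \setcol holds the value given in the file (red), and the default value blue has been replaced by the dummy assignment to \XML@temp@l.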

2.11. Processing instructions

We consider here parsing <?xml?>. The layout here is awful, for the same reason as \XML@getname. We use however a different trick: we use \csname and define the delimiters (space, question mark), to evaluate to \endcsname (no namespace hacking needed here).

481 \gdef\XML@getpi#1\@{
482 \fi\fi\fi
483 \begingroup
484 \utfeight@protect@chars
485 \Activespace
486 \def?{\endcsname?}
487 \let \endcsname
488 \let^^M\endcsname
489 \let^^I\endcsname
490 \expandafter\XML@getpi@\csname
491 Q:}

Hence, in the case <?xml something?>, the \XML@getpi@ command is called, with as argument the token \Q:xml. What we do is close the current group, activate spaces, evaluate the command. If the command is not defined, we put a \XML@getpi@x before it. Note that, if a command is constructed by \csname, its value is \relax instead of undefined; this is a local assignment, after \endgroup, the undefined value is restored. Note that the action associated to <?xml?> is defined on line 198.

492 \def\XML@getpi@#1{
493   \endgroup
494   \Activespace
495   \ifx#1\@undefined
496     \expandafter\XML@getpi@x
497   \fi
498   #1}
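The remark about \relax and \endgroup can be verified with a short plain TeX sketch. The command \test is a hypothetical stand-in for \XML@getpi@, and \undefined plays the role of \@undefined.

```tex
% Plain TeX sketch; \test is a hypothetical name.
\def\test#1{%
  \endgroup
  % Inside the group, #1 was \relax; the group has now ended.
  \ifx#1\undefined
    \message{undefined after the group}%
  \else
    \message{defined}%
  \fi}
\begingroup
\expandafter\test\csname Q:foo\endcsname
\bye
```

The token \Q:foo is created by \csname inside the group with the meaning \relax; since this assignment is local, the test made after \endgroup sees an undefined command.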

In the case where <?foo something?> is seen, and the command \Q:foo is undefined, we read everything, and call \XML@dopi with innocent arguments.

499 \def\XML@getpi@x#1#2?>{
500   \XML@dopi{Undefined}{}}

In the case <?xmltex something?> we use an auxiliary command that grabs the content with the right category codes. This allows TeX commands, with TeX syntax.

501 \expandafter\def \csname Q:xmltex\endcsname{
502   \begingroup
503   \XML@reset
504   \catcode`\>\active
505   \XML@xmltexpi}

The code of the command is trivial. We call \XML@dopi.

506 \gdef\XML@xmltexpi#1?>{
507   \endgroup
508   \XML@dopi{xmltex}{#1}}

The default action is trivial also. Not inlined because of grabbing.

509 \def\XML@dopi#1#2{
510   #2}

2.12. Declarations

In this paragraph, we consider declarations, i.e., things that start with <!. The purpose of the \@ is to read into #1 all tokens up to the end of \XML@lt@markup. We read the first two characters after the <! and decide what to do. The @ at the end makes reading of arguments easy.

511 \def\XML@getdecl#1\@#2#3{
512 \fi\fi
513   \if-\noexpand#2\XML@comment     %   --
514   \else\if N\noexpand#3\XML@entity%   EN TITY
515   \else\if L\noexpand#3\XML@dec@e%    EL EMENT
516   \else\if A\noexpand#2\XML@dec@a%    AT TLIST
517   \else\if D\noexpand#2\XML@doctype%  DO CTYPE
518   \else\if C\noexpand#3\XML@cdata%    [C DATA
519   \else        \XML@dec@n%            NO TATION
520 @}

Easy part: Element declarations are ignored. In fact, elements can only be defined via an xmt file.

521 \def\XML@dec@e#1@#2>{
522   \fi\fi\fi
523   \XML@checkend@subset}

In the case of <!ATTLIST...> declarations, we will do something. We start with closing all conditionals. After that, we read the element name and save it somewhere. Then we parse the list.

524 \def\XML@dec@a#1 #2 {
525   \fi\fi\fi\fi
526   \protected@xdef\XML@tempa{#2}
527   \XML@dec@a@x}

The XML production number 52 says that we should have `AttDef*´, optional space and close tag; thus the code fails if no attribute is declared. Let´s hope that the list is not empty. Production 53 says that `AttDef´ is space, name, space, `AttType´, space, `DefaultDecl´. We read the name and store it in \XML@tempb. After that we look at the character that follows. It could be an open parenthesis, or something else. The type of the attribute is ignored.

528 \gdef\XML@dec@a@x#1 #2{
529   \protected@xdef\XML@tempb{#1}
530    \if(\noexpand#2
531      \begingroup
532      \catcode`\(\active
533      \expandafter\XML@dec@a@brack
534    \else
535       \expandafter\XML@dec@a@type
536    \fi}

Rule 59 says that the type can be a list enclosed by parentheses.

537 \gdef\XML@dec@a@brack#1){
538   \endgroup
539   \XML@dec@a@hash}

According to rules 54, 55, and 56, the type can be CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN or NMTOKENS. It could also be NOTATION followed by a list. Here we skip over the word, and continue parsing.

540 \def\XML@dec@a@type#1 {
541   \XML@dec@a@hash}

Rule 60 says that we should have #REQUIRED, #IMPLIED, or a default value, optionally preceded by #FIXED. We consider three cases. If we see a #, we read the keyword via \XML@dec@a@type, which in turn calls this function again. In other words, when we are here, we might have read #FIXED and be ready for the value, or we might have read #REQUIRED, in which case a > sign is OK, as well as another attribute declaration. For this reason, we redefine \ERROR: this command is called when the character that follows is neither a single quote nor a double quote.

542 \gdef\XML@dec@a@hash$1{
543   \if\noexpand$1#
544     \expandafter\XML@dec@a@type
545   \else
546     \ifx$1>
547       \let\ERROR\@undefined
548       \expandafter\expandafter\expandafter\XML@checkend@subset
549     \else
550       \let\ERROR\XML@dec@a@nodef
551       \XML@dec@a@def$1
552     \fi
553   \fi}

When we come here, we have finished our `AttDef´ and we are ready for the next one.

554 \gdef\XML@dec@a@nodef#1\fi\fi#2{
555   \fi\fi
556   \XML@dec@a@x#1}

When we come here, we have a default value for the attribute.

557 \def\XML@dec@a@def#1\fi\fi{
558  \fi\fi
559   \XML@quoted\XML@dec@a@default#1}

This code adds \XML@add@attrib{name}{att}{val} to a global list, where `name´ is the name of the element, `att´ the name of the attribute and `val´ the default value of the attribute. It continues parsing the declaration.

560 \def\XML@dec@a@default#1#2{
561   \ifx\XML@default@attributes\relax
562     \let\XML@default@attributes\@empty
563   \fi
564   \toks@\expandafter{\XML@default@attributes}
565   \protected@xdef\XML@default@attributes{
566     \the\toks@\noexpand\XML@add@attrib{\XML@tempa}{\XML@tempb}{#1}}
567   \XML@dec@a@hash#2}

Remember line 414: there was \XML@default@attributes. This list was constructed by the code above. It consists of a sequence of \XML@add@attrib ABC. What we do here is to evaluate in a context where \begintag is the element to be evaluated. If it matches, we call \XML@attribval. The effect is as if the user gave B = `C´; in the case B = `something´ is on the attribute list, this is evaluated first, and the B = `C´ is useless.

568 \def\XML@add@attrib#1#2#3{
569   \gdef\XML@tempa{#1}
570   \ifx\XML@tempa\begintag
571    \def\XML@this@attribute{#2}
572     \let\XML@getattrib\relax
573     \XML@attribval{#3}
574     \let\XML@getattrib\XML@@getattrib
575   \fi}

This reads a comment. The code is trivial. An intermediary command is needed for the case where we want to grab something.

576 \def\XML@comment#1@#2-->{
577   \fi
578   \Activespace
579   \XML@comment@}

This is the intermediary command.

580 \def\XML@comment@{\XML@checkend@subset}

2.13. Entities

We have to distinguish between <!ENTITY foo ...> and <!ENTITY % foo ...>. If a percent character is present, this is a parameter entity, and it can be used only in a DTD. Moreover, it is always a parsed entity (no NDATA allowed in the declaration). The command defined here takes as argument some junk, and what follows, the percent sign or a name. Here in the code, we have two versions of \XML@input. This command will be explained later. We continue parsing with \XML@p@ent or \XML@ent.

581 \gdef\XML@entity#1 #2 {
582   \fi\fi
583   \ifx%#2
584   \def\XML@input{
585     \ifx\XML@use\XML@SYSTEM\expandafter\@gobble\else
586       \noexpand\inputonce\fi}
587   \expandafter\XML@p@ent
588   \else
589   \def\XML@input{\noexpand\xmlinput}
590   {\utfeight@protect@chars\xdef\XML@ename{&#2}}
591   \expandafter\XML@ent
592    \fi}

We have to distinguish between <!ENTITY % foo1 "val">, <!ENTITY % foo2 SYSTEM "val">, and <!ENTITY % foo3 PUBLIC "file" "val">. Here we put in \XML@ename the entity name `%foo1´ and look at the first character of what follows.

593 \gdef\XML@p@ent#1 #2{
594   {\utfeight@protect@chars\xdef\XML@ename{%#1}}
595   \if\noexpand#2P\XML@E@public
596   \else\if\noexpand#2S\XML@E@system
597    \else\XML@E@internal#2}

We have to make the same distinctions in the case <!ENTITY foo1 ...>. Here we have put the entity name `&foo1´ in \XML@ename and look at the first character of what follows. This looks like above, but NDATA is allowed here.

598 \def\XML@ent#1{
599   \if\noexpand#1P\XML@E@public
600   \else\if\noexpand#1S\XML@E@system
601    \else\XML@E@internal#1}
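
The unusual layout of these conditionals (no \fi in the caller, one or two \fi at the start of the callee) can be reproduced in a small plain TeX sketch. The names \kind, \kindpublic, \kindsystem and \kindother are hypothetical; the sketch assumes, as in the code above, that the keyword is followed by a space.

```latex
% Plain TeX sketch; all four command names are hypothetical.
\def\kind#1{%
  \if\noexpand#1P\kindpublic
  \else\if\noexpand#1S\kindsystem
  \else\kindother#1}
% The first two branches swallow the pending \else... tokens together with
% the rest of the keyword (up to the next space), then close the open \if's.
\def\kindpublic#1 {\fi\message{public}}
\def\kindsystem#1 {\fi\fi\message{system}}
% The last branch receives the character itself and closes both \if's.
\def\kindother#1{\fi\fi\message{other}}
\kind PUBLIC \kind SYSTEM \kind [
\bye
```

This explains why the PUBLIC handler starts with a single \fi while the SYSTEM and internal handlers start with two: the number of \fi tokens is the number of conditionals still open when the handler is reached.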

This handles the case <!ENTITY % foo1 "val"> or <!ENTITY foo2 'val'>. In \XML@ename we have `%foo1´ or `&foo2´. We have read the opening quote. What we do on line 607 is to redefine it to be </, so that the parser will see val</>. We will read the argument via \xmlgrab. This command will be explained later; the important point is that it will read everything up to the end of the current element (i.e. up to the </>), and call the command associated to the current element, defined on line 608. The effect is to call \XML@E@internal@x after the assignment, which is the \gdef that defines \+%foo1 or \+&foo2 with as body all the grabbed text. The whole difficulty is that the declaration could be something like <!ENTITY ier "<hi rend='sup'>er</hi>">, so that the attribute list has to be parsed, but it is too early for namespace processing. For this reason, some commands have to be redefined.

602 \gdef\XML@E@internal#1{
603   \fi\fi
604   \begingroup
605   \let\XML@endgroup\endgroup  % use real groups instead of faked ones.
606   \let\XML@begingroup\begingroup
607   \def#1{</}
608   \expandafter\def\csname E\string/:\endcsname{
609     \afterassignment\XML@E@internal@x
610     \expandafter\gdef\csname+\XML@ename\endcsname}
611   \begingroup
612   \let\XML@ns@decl\relax% stop xmlns `attribute' being recognised
613   \let\XML@this@local\@empty
614   \def\XML@this@prefix{*} % set up special prefix to gobble colon
615   \let\XML@checkknown\relax % disable these
616   \def\XML@ns##1{% hobble namespace code to put all name in local part.
617     \protected@edef\XML@this@local{##1}
618     \def\XML@this@prefix{*}}
619   \xmlgrab}

This closes the group started on line 611. After that, we execute three tokens after the current group, namely \XML@trace@warn (for debug), \+&foo2 (argument of previous) and \fihack. The group ends because of the token at line 812 (the \XML@endgroup redefined above).

620 \def\XML@E@internal@x{
621    \endgroup
622     \aftergroup\XML@trace@warn
623     \expandafter\aftergroup\csname+\XML@ename\endcsname
624     \aftergroup\fihack
625 }

The \fi comes from line 813. This piece of code just ignores the conditional. It continues parsing of the element.

626 \def\fihack#1\fi{\expandafter\XML@checkend@subset}
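
A delimited argument is absorbed without expansion, so a conditional that sits inside it is never started. A minimal plain TeX sketch of this trick, with a hypothetical command \skipcond:

```latex
% Plain TeX sketch; \skipcond is a hypothetical name.
% Everything up to and including the next \fi is read as argument and dropped.
\def\skipcond#1\fi{\message{conditional skipped}}
\skipcond \ifnum 1=1 \message{never printed}\fi
\bye
```

Since the \ifnum is only a token of the argument, TeX's conditional stack is left untouched.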

When we see <!ENTITY foo PUBLIC 'pub-part' 'system-part'>, this piece of code finishes reading the PUBLIC token, then reads pub-part, and calls \XML@E@pubid.

627 \def\XML@E@public#1 {
628    \fi
629    \XML@quoted\XML@E@pubid}

After that, all characters in pub-part are converted to category 12, using a classical method, the result is stored in \XML@PUBLIC, and system-part is read.

630 \def\XML@E@pubid#1{
631   \def\XML@PUBLIC{#1}
632   \edef\XML@PUBLIC{\catxii\XML@PUBLIC}
633   \XML@quoted\XML@E@systemid}
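
One classical way of converting all characters of a macro body to category 12 goes through \meaning. The plain TeX sketch below shows this technique; that \catxii is implemented exactly like this is an assumption, and \stripprefix and \pubid are hypothetical names.

```latex
% Plain TeX sketch; \stripprefix and \pubid are hypothetical names.
% \meaning\pubid yields `macro:->...' as characters of category 12
% (spaces keep category 10); \stripprefix removes everything up to `>'.
\def\stripprefix#1>{}
\def\pubid{-//OASIS//DTD DocBook XML V4.1.2//EN}
\edef\pubid{\expandafter\stripprefix\meaning\pubid}
\message{\pubid}
\bye
```

After the \edef, letters in \pubid are no longer of category 11, which makes the string safe to compare with catalogue entries normalized in the same way.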

The case <!ENTITY foo SYSTEM 'system-part'> is similar, but there is no public part.

634 \def\XML@E@system#1 {
635    \fi\fi
636    \def\XML@PUBLIC{}
637    \XML@quoted\XML@E@systemid}

Now we run the catalogue. This sets \XML@use to either the value associated to the `pub-part´ in a PUBLIC item of the catalogue, or the value associated to the `system-part´ in a SYSTEM item, or \XML@SYSTEM if nothing is found. Assume that this is X; we call \XML@E@internal@ with \XML@input{X} as first argument, the second argument being the unread character, that should be a greater-than sign (but junk is silently ignored). An unparsed entity contains NDATA Y for some Y. In this case, the command is called with {Y}{X} instead, and \XML@E@ndata is used to read Y.

638 \def\XML@E@systemid#1#2{
639   \def\XML@SYSTEM{#1}
640   \let\XML@use\XML@SYSTEM
641   \the\XML@catalogue
642   \if\noexpand#2N
643    \expandafter\XML@E@ndata
644   \else
645     \afterfi
646     \XML@E@internal@{\XML@input{\XML@use}}#2
647   \fi}

In the case of NDATA, we hack a bit. All characters up to the greater-than sign are read, this gives three lists: the unread part of `NDATA´, the value that follows, optional junk.

648 \def\XML@E@ndata#1 #2>{\XML@ndata@#2 >}
649 \def\XML@ndata@#1 #2>{
650     \XML@E@internal@{{#1}{\XML@use}}>}

The command takes two arguments: some action, and everything that remains on the current element. Assume that we consider <!ENTITY % foo SYSTEM "bar">. In this case (parameter entity), we define \+%foo; its body will be the expansion of #1. This is \XML@input{\XML@use}, where the argument is what is found in the catalogue, and the command is defined on lines 584 or 589. In the case of foo, without percent sign (general entity), the command \XML@input is the same as \xmlinput. Otherwise, it is \inputonce (except: it ignores the argument if not found in the catalogue).

651 \def\XML@E@internal@#1#2>{
652    \expandafter\protected@xdef\csname+\XML@ename\endcsname{#1}
653   \XML@checkend@subset}}

This is done at the end of an <!ENTITY...> declaration.

654 \gdef\XML@checkend@subset{
655   \Normalspace
656   \XML@checkend@subset@}

This is a bit strange: why does the command take these four arguments? When the first of them is a closing bracket, the local DTD is finished: the percent character is redefined to be a normal Unicode character, and the DTD named in the declaration, if any, is loaded.

657 \gdef\XML@checkend@subset@#1#2#3#4{
658   \ifx]#1
659   \let\XML@w@\@empty
660   \gdef%{\utfeightay%}
661   \let\XML@checkend@subset\relax
662   \expandafter\XML@loaddoctype
663   \fi
664   #1#2#3#4}

In the case where the <!DOCTYPE> element specifies a DTD, we load the file.

665 \def\XML@loaddoctype#1#2{
666   \Activespace
667   \ifx\XML@D@dtd\relax\else
668     \inputonce\XML@D@dtd
669   \fi}

2.14. Interpreting the Doctype element

Consider now the case of <!DOCTYPE TEI.2 SYSTEM "teilite.dtd">. This piece of code finishes reading the DOCTYPE name, then the name of the document element, which it puts in \documentelement. It then checks if what follows is PUBLIC, SYSTEM, an internal subset, or the end of the element. The command closes all \fi that are open. There is an @ here that makes it easy to skip over the conditionals defined here.

670 \gdef\XML@doctype#1 #2 #3{
671  \fi\fi\fi\fi\fi
672   \def\documentelement{#2}
673   \let\XML@D@dtd\relax
674   \if\noexpand#3P\XML@D@public
675   \else\if\noexpand#3S\XML@D@system
676   \else\ifx#3[\XML@D@internal
677   \else%must be > the end
678     \XML@D@empty
679   @}

If nothing is given, we have nothing to do.

680 \gdef\XML@D@empty @{
681    \fi\fi\fi}

In the case of PUBLIC, we parse the public value, and do something with it.

682 \gdef\XML@D@public#1 {
683    \fi
684    \XML@quoted\XML@pubid}

In the case of <!DOCTYPE foo PUBLIC "aaa" "bbb">, we put the aaa part in \XML@PUBLIC, change all category codes to 12, and read the `bbb´ part.

685 \gdef\XML@pubid#1{
686   \def\XML@PUBLIC{#1}
687   \edef\XML@PUBLIC{\catxii\XML@PUBLIC}
688   \XML@quoted\XML@systemid}

In the case of <!DOCTYPE foo SYSTEM "bbb">, it is as above, but the public part is empty.

689 \gdef\XML@D@system#1 {
690    \fi\fi
691    \def\XML@PUBLIC{}
692    \XML@quoted\XML@systemid}

We put the bbb part in \XML@SYSTEM, change all category codes to 12, run the catalogue, and put the result in \XML@D@dtd for later use by \XML@loaddoctype.

693 \gdef\XML@systemid#1{
694   \protected@edef\XML@SYSTEM{#1}
695   \edef\XML@SYSTEM{\catxii\XML@SYSTEM}
696   \let\XML@use\@empty
697   \the\XML@catalogue
698   \let\XML@D@dtd\XML@use
699   \XML@D@internal@}

When we have no PUBLIC and no SYSTEM part, but only a local DTD, we call this: it pops the conditional stack, and pushes back the open bracket. The action is the same as if a PUBLIC or SYSTEM part had been given.

700 \gdef\XML@D@internal#1@{
701   \fi\fi\fi
702   \XML@D@internal@[}

We parse the internal DTD in the same fashion as everything else. However %foo; evaluates to something, so that the percent sign must be activated. The vertical bar is used for comments.

703 \gdef\XML@D@internal@#1{
704   \ifx[#1
705     \let%\XML@pcent
706     \edef\XML@w@{ \XML@w@}
707      \expandafter\XML@checkend@subset
708   \else
709       | it had better be the closing >
710    \fi}

Inside a local DTD, the parameter entity `%foo;´ evaluates to \+%foo.

711 \gdef\XML@pcent#1;{
712   \csname+%#1\endcsname
713   \XML@checkend@subset}

When you say &#foo; the \XML@charref command is called to parse the entity; the result is in \XML@tempa. It will in general be evaluated right now. On the other hand `&foo;´ evaluates to \+&foo, there is no intermediate command.

714 \let&\XML@amp@markup
715 \gdef\XML@amp@markup$1$2;{
716   \ifx#$1\@empty
717    \XML@charref$2;
718    \XML@tempa
719   \else
720    \begingroup\utfeight@protect@chars
721    \expandafter\aftergroup
722    \csname+\string&$1$2\expandafter\endcsname
723    \endgroup
724   \fi}
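
The lookup of `&foo;´ through a command whose name starts with `+&´ can be sketched in plain TeX as follows. This is a simplified illustration only: \defentity is a hypothetical name, character references are not handled, and no protection of UTF-8 characters is attempted.

```latex
% Plain TeX sketch; \defentity is a hypothetical name.
\catcode`\&=\active
% Store the replacement text in the command named `+&name'.
\def\defentity#1#2{\expandafter\def\csname +\string&#1\endcsname{#2}}
% An active ampersand reads the name up to the semicolon and calls that command.
\def&#1;{\csname +\string&#1\endcsname}
\defentity{copy}{(c)}
Copyright &copy; 2004
\bye
```

An undefined entity silently becomes \relax in this sketch, because \csname defines unknown names that way.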

In the case of <![CDATA[xxx]]>, this finishes reading the CDATA keyword, up to the open bracket that follows it.

725 \gdef\XML@cdata #1[{
726  \fi\fi\fi\fi\fi\fi
727   \Activespace
728   \XML@cdata@a}

And this reads everything up to the special end marker ]]>. Less-than sign and ampersand are not active.

729 \gdef\XML@cdata@a#1]]>{
730   \begingroup
731   \edef<{\noexpand\utfeightaz\string<}
732   \edef&{\noexpand\utfeightaz\string&}
733   \XML@docdata{#1}}

The action here is trivial. We need an intermediary command, in the case of grab.

734 \def\XML@docdata#1{#1\endgroup}

This handles a NOTATION declaration. The only thing done here is to skip over everything, until the end.

735 \def\XML@dec@n#1N #2 #3 {
736  \fi\fi\fi\fi\fi\fi
737   \XML@quoted\XML@notation
738   }
739  
740 \def\XML@notation#1#2{
741   \ifx>#2
742    \expandafter\XML@checkend@subset
743   \else
744     \afterfi
745     \XML@quoted\XML@notation#2
746   \fi}

2.15. Grabbing content

The normal behavior of <foo>text</foo> is like \begin{foo}text\end{foo}. This method is the most efficient concerning memory space. In some cases, we prefer the equivalent of \def\arg{text}, \foo{\arg}. The interesting point is that, assuming that <foo> takes two children, and that we do not care about cases with incorrect syntax, we can manage everything so that the user function sees these two children, for instance in the form \split\arg\first\second followed by \foo\first\second. In fact, instead of calling a single command, we call two commands, as \foofirst\first, \foosecond\second.

The following piece of code is the definition of the <msup> element in the MathML namespace. We shall explain the syntax of \XMLelement later. Line 10001 says that we do not care about attributes; line 10002 says that we want to grab the content of the element; line 10003 says that there are two children, and that we want to apply some commands to them.

10000    \XMLelement{m:msup}
10001      {}
10002      {\xmlgrab}
10003      {\xmltextwochildren\@firstofone\sp#1}

We shall define \xmlgrab below. It will read the content of the element, and use \@empty as element separator (see code line 819); remember that \@empty expands to nothing, hence is harmless. This marker allows easy splitting. In the case of the example of the start of the chapter, the tokens are

10004 \xmltextwochildren\@firstofone\sp
10005 +<3:mi^^I>L</3:mi>\@empty  <3:mn^^I>2</3:mn>\@empty  +

The second line is what TeX prints when we ask for the value of +#1+. We have inserted the plus signs, this being the easiest way to see that each \@empty is followed by a space (these spaces are in the input file); each opening tag contains a tabulation between the tag name and the attribute list (which is empty in this case). With the definition below, the arguments of \xmltextwochildren will be

10006 #1=\@firstofone
10007 #2=\sp
10008 #3=<3:mi>L</3:mi>
10009 #4=<3:mn>2</3:mn>

(we did not show the tabulations, nor the spaces). Tabulations are read again by the parser when looking for attributes, spaces are ignored, as usual, in math mode. The effect of the command is to apply the first argument to the third, the second to the fourth. If the arguments are, say, \A, \B, CC and DD, the result is \A{CC}\B{DD}. In our case, we want CC^{DD}, so that \A is just a command that removes useless braces, and \B is the TeX primitive for superscripts.

747 \def\xmltextwochildren#1#2#3\@empty#4\@empty{
748   #1{#3}#2{#4}}
749 \def\xmltexthreechildren#1#2#3#4\@empty#5\@empty#6\@empty{
750   #1{#4}#2{#5}#3{#6}}
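
The splitting can be tested outside xmltex. In the plain TeX sketch below, \empty plays the role of \@empty, and \twochildren and \firstofone are hypothetical stand-ins for \xmltextwochildren and \@firstofone; the children are plain characters rather than grabbed elements.

```latex
% Plain TeX sketch; \twochildren and \firstofone are hypothetical stand-ins.
\def\firstofone#1{#1}
% #3 and #4 are the two children, each terminated by the marker \empty.
\def\twochildren#1#2#3\empty#4\empty{#1{#3}#2{#4}}
% Expands to \firstofone{L}\sp{2}, that is L\sp{2}, a superscript.
$\twochildren\firstofone\sp L\empty 2\empty$
\bye
```

The marker costs nothing at run time when it is left in the token list, since it expands to nothing, but it gives the delimited parameters something to match.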

This is a small function that returns everything before the \@empty. You have to use \@ to mark the end of the child list (only first child is used here).

751 \def\xmltexfirstchild#1\@empty#2\@{
752   #1}

If you say \xmltexforall\cmd{list}, where the second argument is a list of tokens with \@empty between tokens, this applies \cmd to each item. Moreover, the quantity \xml@name contains the name of the element. The end of the loop relies on the fact that no element name starts with a space.

753 \def\xmltexforall#1#2{
754   \xmltexf@rall#1#2< >\@empty}
755  
756 \def\xmltexf@rall#1#2<#3 #4>#5\@empty{
757   \ifx\relax#3\relax
758   \else
759   \def\xml@name{#3}#1{<#3 #4>#5}
760   \expandafter\xmltexf@rall\expandafter#1
761   \fi}
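
The loop and its termination test can be reproduced in plain TeX. The sketch below uses hypothetical names (\foreachchild, \foreachloop, \childname, \reportchild), uses \empty as separator, and separates the element name from the attributes by a space.

```latex
% Plain TeX sketch; all command names are hypothetical.
% A sentinel `< >\empty' is appended: its element name is empty.
\def\foreachchild#1#2{\foreachloop#1#2< >\empty}
\def\foreachloop#1#2<#3 #4>#5\empty{%
  \ifx\relax#3\relax        % empty name: this is the sentinel, stop
  \else
    \def\childname{#3}#1{<#3 #4>#5}%
    \expandafter\foreachloop\expandafter#1%
  \fi}
\def\reportchild#1{\message{[\childname]}}
\foreachchild\reportchild{<a x>one</a>\empty<b y>two</b>\empty}
\bye
```

The two \expandafter tokens remove the \fi before the recursive call, so that the conditional is closed before the next item is read.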

The action associated to <foo> consists in two parts: first all attributes are scanned, and some commands are instantiated (see section 2.10), and then the start code is executed (in the example, line 10002). When we see </foo>, we execute the end code (line 10003, in the example). In the special case where the initial action is \xmlgrab, the command gets an argument. This argument is computed on line 811 as the value of the token list \XMLgrabtoks. Thus, the \xmlgrab command must read all tokens, up to the end tag; it must handle namespaces properly (as the example shows, all namespaces, even the default ones, are replaced by integers).

The idea is to redefine temporarily all commands \XML@do..., for the case <foo>, <?foo>, <!foo>, and </foo>, and ask them to put the result in the list. We store in \XML@next@level the value of \XML@w@ at the next level. It is thus possible to check, for a given element, if it is a child (and not merely a descendant) of the current element, so that we know where to insert the \@empty markers. The main routine here is \XMLgrab@.

762 \def\xmlgrab{
763   \begingroup
764   \global\XMLgrabtoks{}
765   \let\XML@this@level\XML@w@
766   \edef\XML@next@level{ \XML@w@}
767   \let\XML@doelement\XML@grabelement
768   \let\XML@doend\XML@grabend
769   \let\XML@docdata\XML@grabcdata
770   \let\XML@comment@\XML@grabcomment@
771   \let\XML@dopi\XML@grabpi
772   \XMLgrab@}

This uses the same magic as \XML@getname. The idea is to read everything until the next less-than sign, putting all tokens in the command \XML@tempa.

773 \def\XMLgrab@{
774   \utfeight@protect@internal
775   \def<{\iffalse{\fi}\XMLgrab@@}
776   \xdef\XML@tempa{\iffalse}\fi}

When \XMLgrab@ has read everything between tags, it puts the grabbed tokens in the token register \XMLgrabtoks, and then evaluates the less-than sign.

777 \def\XMLgrab@@{
778   \global\XMLgrabtoks\expandafter{\the\expandafter\XMLgrabtoks\XML@tempa}
779   \XML@lt@markup}
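
The trick with \iffalse and unbalanced braces can be studied on a reduced example. In this plain TeX sketch, \collect, \collected and \aftercollect are hypothetical names; the only active character is the less-than sign.

```latex
% Plain TeX sketch; \collect, \collected and \aftercollect are hypothetical names.
\catcode`\<=\active
\def\collect{%
  % The skipped `{' keeps this definition balanced; the `}' that survives
  % at expansion time closes the \xdef started below.
  \def<{\iffalse{\fi}\aftercollect}%
  % The skipped `}' keeps \collect balanced; the \xdef stays open and
  % goes on reading the input that follows \collect.
  \xdef\collected{\iffalse}\fi}
\def\aftercollect{\message{collected: \collected}}
\collect some text<
\bye
```

Here the \xdef absorbs the characters of `some text´, and the active less-than sign both terminates the definition and triggers the next action.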

This command is called when we grab the content of an element. Assume that we have seen <mi>L</mi>. When we are here, we have seen the first less-than sign. And we know that \XML@this@element is `3:mi´. We add to the token list <3:mi atts>. There is a tabulation after the element name, this is obtained by uppercasing the tilde. Attributes are added by a call to \the\XML@attribute@toks, with a temporary redefinition of \XML@doattribute, and the default namespace is neutralized.

780 \uppercase{
781 \gdef\XML@grabelement{
782    \Activespace
783   \global\XMLgrabtoks\expandafter{
784     \the\expandafter\XMLgrabtoks
785       \expandafter<\XML@this@element~}
786    \begingroup
787    \let\XML@doattribute\XML@grabattribute
788    \def\XMLNS@{0}
789    \expandafter\let\csname XMLNS@0\endcsname\XMLNS@
790    \the\XML@attribute@toks
791    \endgroup
792    \Activespace
793   \global\XMLgrabtoks\expandafter{
794     \the\XMLgrabtoks
795     >}
796    \XMLgrab@}
797 }

Assume that foo:bar = `gee´ is in the attribute list of the current element, and assume that foo has namespace number 4. We add 4:bar="gee" and a space to the token list.

798 \gdef\XML@grabattribute#1#2#3{
799   \protected@xdef\XML@tempa{\jg@namespace{#1}:#2}
800   \global\XMLgrabtoks\expandafter{
801     \the\expandafter\XMLgrabtoks
802     \XML@tempa="#3" }}

Let us assume that we are grabbing something and we see </foo>. There are two cases to consider. If this element is the one we are looking for, we close the group opened by \xmlgrab, and we execute the command \E/:3:msup (there are some hacks here; the \uppercase command replaces dot and star by slash and colon with the right category code). We pass the grabbed token list as argument. This is achieved by putting an \expandafter just before the \endcsname; it will expand \the, i.e., replace \XMLgrabtoks by its value (since we want braces around this token list, another \expandafter is needed). After execution of the command, we have to close our XML group and hack with category codes. On the other hand, in the case where the element does not end grabbing, we add </4:foo> to the end of the \XMLgrabtoks token list. If this is a direct child, we also add an \@empty marker. We continue grabbing via a call to \XMLgrab@.

803 \uppercase{
804 \gdef\XML@grabend{
805   \ifx\XML@this@level\XML@w@
806     \endgroup
807     \csname
808       E.*\jg@this@namespace
809         *\XML@this@local
810     \expandafter\endcsname\expandafter{
811       \the\XMLgrabtoks}
812     \XML@endgroup
813     \ifnum\catcode`\^^M=10  \Activespace \fi
814   \else
815     \xdef\XML@tempa{\noexpand<\noexpand/
816       \jg@this@namespace\noexpand: %%%% \expandafter omitted [jg]
817              \XML@this@local
818     \noexpand>
819     \ifx\XML@next@level\XML@w@\noexpand\@empty\fi}
820     \global\XMLgrabtoks\expandafter{
821       \the\expandafter\XMLgrabtoks
822       \XML@tempa}
823     \XML@endgroup
824   \expandafter
825     \XMLgrab@
826   \fi}}
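
The \uppercase hack relies on the fact that \uppercase changes the character code of a token but keeps its category code. A plain TeX sketch of the general technique follows; the \uccode settings actually used by xmltex are not shown in this chapter, so the mapping chosen here (letter a to slash) and the name \letterslash are assumptions made for illustration.

```latex
% Plain TeX sketch; the mapping and \letterslash are illustrative assumptions.
\begingroup
\uccode`a=`/
% The `a' (category 11) becomes a `/' that still has category 11;
% \endgroup restores the \uccode table before the \def is executed.
\uppercase{\endgroup\def\letterslash{a}}
% \ifcat compares category codes: the slash now counts as a letter.
\message{\ifcat\letterslash a letter\else other\fi}
\bye
```

The same method produces a tabulation or a colon with any desired category code, without changing the category code table while the macro is being defined.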

When we want to grab something and see <?PI etc?>, we add all these tokens to our list.

827 \gdef\XML@grabpi#1#2{
828   \global\XMLgrabtoks\expandafter{
829   \the\XMLgrabtoks<?#1^^I#2?>}
830   \XMLgrab@}

If you say \NDATAEntity\att\A\B, where the expansion of \att is something like foo and \+&foo expands to \bar and \gee, this piece of code applies \A to \bar and \B to \gee.

831 \gdef\NDATAEntity#1{
832   \expandafter\expandafter\expandafter
833   \XML@ndataentity\csname+&#1\endcsname}
834  
835 \gdef\XML@ndataentity#1#2#3#4{
836   #3{#1}#4{#2}}

Grabbing CDATA is easy: what we do is re-insert the content of the element, and continue grabbing. Ampersands and less-than signs are replaced by the equivalent of &amp; and &lt;, in other words, by inactive characters.

837 \def\XML@grabcdata#1{
838   \utfeight@protect@internal
839   \edef<{\noexpand\utfeightaz\string<}
840   \edef&{\noexpand\utfeightaz\string&}
841   \xdef\XML@tempa{#1}
842   \endgroup
843    \expandafter\XMLgrab@\XML@tempa}

When we grab a comment, the only thing we need to do is continue grabbing.

844 \def\XML@grabcomment@{
845   \XMLgrab@}

2.16. Defining actions

This is for use in a .xmt file, the file that defines actions for each element. After \XMLentity{foo}{bar}, the XML entity &foo; evaluates to bar.

846 \gdef\XMLentity#1#2{
847   \expandafter\gdef\csname+&#1\endcsname{#2}}

These are the predefined entities:

848 \XMLentity{amp}{\utfeightaz&}
849 \XMLentity{quot}{\utfeightax"}
850 \XMLentity{apos}{\utfeightax'}
851 \XMLentity{lt}{\utfeightaz<}
852 \XMLentity{gt}{\utfeightax>}

The \XMLelement command appears in a .xmt file. It takes four arguments: the first one is the name of an element. We have seen an example above. Assume that the name is m:msup. Assume that the `m´ prefix stands for MathML and this corresponds to the number 3. We define two commands \E:3:msup and \E/:3:msup. The bodies of these commands are arguments #3 and #4. In the case where #3 is \xmlgrab, the \E/:3:msup command takes an argument (see previous section). Argument #2 explains what to do with attributes. It should contain a sequence of \XMLattribute commands. Note: all attributes declared for the current namespace (found in \A:3) are added to the list. A call to evaluation of the attribute list is inserted in the body of \E:3:msup before the user code. On lines 860 and 861, the purpose of all these \expandafter is to expand the \the, so as to insert the token list (and not a reference to it) in the body of the definition.

853 \long\def\XMLelement#1#2#3#4{
854   \XML@ns{#1}
855   \xdef\XML@tempc{:\jg@this@namespace
856        :\XML@this@local}
857   \toks@\expandafter{\csname A:\jg@this@namespace
858     \endcsname}
859   #2
860   \expandafter\gdef\csname E\XML@tempc\expandafter\endcsname
861   \expandafter{\expandafter\XML@setattributes\expandafter{\the\toks@}#3}
862   \gdef\XML@tempa{#3}
863   \ifx\XML@tempa\XML@xmlgrab
864     \expandafter\gdef\csname E\string/\XML@tempc\endcsname##1
865     {#4}
866   \else
867     \expandafter\gdef\csname E\string/\XML@tempc\endcsname
868     {#4}
869   \fi}
870 \def\XML@xmlgrab{\xmlgrab}
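
Stripped of namespaces and attributes, the core of the mechanism is the definition of two commands whose names are built with \csname. A minimal plain TeX sketch, in which \defelement is a hypothetical simplification of \XMLelement:

```latex
% Plain TeX sketch; \defelement is a hypothetical, much simplified \XMLelement.
\def\defelement#1#2#3{%
  \expandafter\def\csname E:#1\endcsname{#2}%   start code
  \expandafter\def\csname E/:#1\endcsname{#3}}  % end code
\defelement{emph}{\begingroup\it}{\/\endgroup}
% What a parser would execute for <emph>important</emph>:
This is \csname E:emph\endcsname important\csname E/:emph\endcsname.
\bye
```

Characters such as the colon and the slash can appear in these names because \csname accepts any character token, which is what keeps the generated commands apart from ordinary LaTeX commands.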

When you say \XMLattribute {form} {\mycmd} {inline} this puts in \XML@tempa the quantity \XML@attrib 0:form\relax \mycmd {inline}. If `form´ is replaced by `m:form´ and the namespace value of `m´ is 3, then `3:form´ will be used instead of `0:form´. This works by redefining locally the default namespace to be the empty namespace. The value of \XML@tempa will be added to the end of \toks@, the token list used in \XMLelement or \XMLnamespaceattribute. For some strange reason the second argument is put in \XML@tempa, but not the last one (this implies that the third argument is not expanded; the single token of the second argument is not expanded, because of the \noexpand; if the second argument has more than one token, you lose).

871 \long\def\XMLattribute#1#2#3{
872   {\def\XMLNS@{0}
873   \XML@ns{#1}
874   \xdef\XML@tempa{\noexpand\XML@attrib
875       \jg@this@namespace
876         :\XML@this@local\relax\noexpand#2}}
877   \toks@\expandafter{\the\expandafter\toks@\XML@tempa{#3}}}
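
The \expandafter chain on line 877 appends the one-level expansion of a macro, followed by a braced argument, to a token register. The same idiom in a plain TeX sketch, with a hypothetical register \acc and a hypothetical macro \piece:

```latex
% Plain TeX sketch; \acc, \piece and \attrib are hypothetical names.
\newtoks\acc
\def\piece{\attrib 0:form\relax}
\acc{}
% The first \expandafter jumps over `{' and triggers \the; before \the
% reads \acc, the second \expandafter expands \piece once.
\acc\expandafter{\the\expandafter\acc\piece{inline}}
\acc\expandafter{\the\expandafter\acc\piece{display}}
\message{\the\acc}
\bye
```

After the two assignments the register holds both items in order; the tokens that came from \piece are stored without further expansion.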

Like above, with a little hack. Assume that the attribute has to be stored in \foo. Then \utfeight@chardef\foo is executed (some time after a value has been stored).

878 \long\def\XMLattributeX#1#2#3{
879   {\def\XMLNS@{0}
880   \XML@ns{#1}
881   \xdef\XML@tempa{\noexpand\XML@attrib
882       \jg@this@namespace
883         :\XML@this@local\relax\noexpand#2}}
884   \toks@\expandafter{\the\expandafter\toks@\XML@tempa{#3}\utfeight@chardef#2}}

This is the action associated to the special setting used above. The command is fully expanded, in a group where this is harmless, globally put in a temporary, and then, outside the group, the temporary is put again in the command. There seems to be a problem here: what if the argument contains ampersands and less-than signs?

885 \def\utfeight@chardef#1{
886 \begingroup
887 \utfeight@protect@chars
888 \xdef\x@temp{#1}
889 \endgroup
890 \let#1\x@temp}
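
The pattern (expand in a group where dangerous commands are made harmless, export globally, copy back) is shown in the following plain TeX sketch. The names \freeze, \fragile, \globaltemp and \value are hypothetical.

```latex
% Plain TeX sketch; all command names are hypothetical.
\def\freeze#1{%
  \begingroup
  \def\fragile{[protected]}% harmless meaning, local to the group
  \xdef\globaltemp{#1}%      full expansion, stored globally
  \endgroup
  \let#1\globaltemp}
\def\fragile{FRAGILE}
\def\value{a\fragile b}
\freeze\value
\message{\meaning\value}
\bye
```

After \freeze\value, the command \value no longer refers to \fragile: its body contains the characters produced inside the group.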

If you say \XMLnamespaceattribute{foo}{bar}{gee}{etc}, and the namespace number of foo is 4, this code defines the action associated to the attribute described by bar, gee, etc, and puts it in \A:4. This command should be used at toplevel, not inside \XMLelement.

891 \long\def\XMLnamespaceattribute#1#2#3#4{
892    \toks@\expandafter\expandafter\expandafter{\csname A:%
893     \jg@namespace{#1}\endcsname}
894   \XMLattribute{#2}{#3}{#4}
895   \expandafter\xdef\csname A:\jg@namespace{#1}\endcsname{\the\toks@}}

Idem, expanded.

896 \long\def\XMLnamespaceattributeX#1#2#3#4{
897    \toks@\expandafter\expandafter\expandafter{\csname A:%
898     \jg@namespace{#1}\endcsname}
899   \XMLattributeX{#2}{#3}{#4}
900   \expandafter\xdef\csname A:\jg@namespace{#1}\endcsname{\the\toks@}}

If you say \XMLname{foo:bar}{\gee} this puts in \gee something like 4:bar.

901 \long\gdef\XMLname#1#2{{
902   \XML@ns{#1}
903   \xdef#2{\jg@this@namespace\noexpand:\XML@this@local}}}

If you say \XMLstring\foo<>bar</> this piece of code reads the <>, then calls \xmlgrab, which reads everything until the </>, and executes the code associated to the end tag: this defines \foo, and closes the group.

904 \gdef\XMLstring#1#2<>{
905   \begingroup
906   \let\XML@endgroup\endgroup
907   \let\XML@this@local\@empty
908   \let\XML@this@prefix\@empty
909   \expandafter\def\csname E/:\XMLNS@:\endcsname{\gdef#1}
910   \XML@catcodes
911   \xmlgrab}

Same code, expanded.

912 \gdef\XMLstringX#1#2<>{
913   \begingroup
914   \let\XML@endgroup\endgroup
915   \let\XML@this@local\@empty
916   \let\XML@this@prefix\@empty
917   \expandafter\def\csname E/:\XMLNS@:\endcsname{\xdef#1}
918   \XML@catcodes
919   \utfeight@protect@chars
920   \xmlgrab}

2.17. Other commands

Public versions of \XML@ns@uri and \XML@setenc, followed by the allocation of some registers.

921 \let\DeclareNamespace\XML@ns@uri  %  version for xmt files
922 \def\FileEncoding#1{\XML@setenc{#1}\relax}
923 \newtoks\XML@attribute@toks
924 \newcount\XML@ns@count
925 \newtoks\XMLgrabtoks

The next function reads an XML file. The idea is to restore the current encoding after the file has been read. We have not shown the source of \XML@xmlinput.

926 \def\xmlinput#1{
927  \IfFileExists{#1}
928   {\expandafter\XML@xmlinput\expandafter
929     \XML@setenc\expandafter{\XML@thisencoding}\relax
930   }{\XML@warn{No file: #1}}}
931  
932 \def\XML@xmlinput{...}

This reads a TeX file only once.

933 \def\inputonce#1{
934   \expandafter\ifx\csname xmt:#1\endcsname\relax
935   \global\expandafter\let\csname xmt:#1\endcsname\@ne
936   \begingroup
937   \XML@reset
938   \def\XMLNS@{0}
939   \input{#1}
940   \endgroup
941   \fi}
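
The guard used here is the usual test of a \csname against \relax. A plain TeX sketch with a hypothetical command \once that executes its second argument only the first time a given key is seen:

```latex
% Plain TeX sketch; \once is a hypothetical name.
\def\once#1#2{%
  % An undefined \csname is equivalent to \relax: first time for this key.
  \expandafter\ifx\csname once:#1\endcsname\relax
    \expandafter\let\csname once:#1\endcsname\empty % mark the key as seen
    #2%
  \fi}
\once{fotex.xmt}{\message{loading}}
\once{fotex.xmt}{\message{loading again}}% ignored: key already seen
\bye
```

In \inputonce the mark is set globally, which matters because the command may be called inside a group.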

In case you want some characters to be active.

942 \gdef\ActivateASCII#1{
943   \uppercase{\count@"0\if x\noexpand#1\relax\else\count@#1\fi\relax}
944   \toks@\expandafter{\nfss@catcodes}
945        \xdef\nfss@catcodes{
946        \catcode\the\count@=\the\catcode\the\count@\relax\the\toks@}
947   \toks@\expandafter{\XML@catcodes}
948      \xdef\XML@catcodes{
949        \catcode\the\count@\active\the\toks@}
950   \expandafter\ifx\csname8:"\endcsname\relax
951     \expandafter\gdef\csname8:"\endcsname{"}
952   \fi}
953 \ActivateASCII{94}% ^ for tex ^^ notation in aux files
954 \ActivateASCII{x5C}% \
955 \ActivateASCII{x5F}%  underscore [jg]
956 \ActivateASCII{123}% {
957 \ActivateASCII{125}%  close brace [jg]

We redefine \obeyspaces and \obeylines. The idea is to redefine the action associated to space and newline character.

958 \expandafter\def\expandafter\obeylines\expandafter{
959 \expandafter\def\csname 8:\string^^M\endcsname{\leavevmode\hfil \break\null}}
960  
961 \expandafter\def\expandafter\obeyspaces\expandafter{
962 \expandafter\def\csname 8: \endcsname{\nobreakspace}}

This line is a bit strange:

963 \expandafter\def\csname 8:\string^^I\expandafter\endcsname
964        \expandafter{\csname 8: \endcsname}

The end of the xmltex.tex file is like this (we simplified a bit the code, by assuming that we do not want to dump a format, and that \xmlfile is defined). Essentially, we reset the end-of-line character to its normal meaning, we set the category codes, and we load the XML file.

965 \def\XML@tempa{\catcode`\-12\relax\input\xmlfile\relax}
966 \endlinechar`\^^M \expandafter\XML@catcodes\XML@tempa

2.18. Example

Assume that we have a file thesis.tex containing the following lines.

1001 \def\xmlfile{these.xml}
1002 \def\LastDeclaredEncoding{T1}
1003 \input{xmltex.tex}
1004 \end{document}

When TeX processes this file, it loads xmltex.tex, the file described in this chapter, because of line 1003. This defines a lot of commands; however the last line (line 966) contains some action, consisting essentially of setting some variables (end-of-line character, category codes) to values useful for typesetting. There are two hooks, not shown here. First, if the file xmltex.cfg is found, it will be loaded. The default file contains some Unicode character definitions, and the catalogue shown earlier. Second, if thesis.cfg is found, it will be loaded. After that, the XML file named on line 1001 is loaded. Let us assume that the root element of the XML file is <fo:root>, and that the namespace associated to `fo´ is declared in the catalogue, and loads fotex.xmt. This file is described in Chapter 4, and the action associated to </fo:root> is \end{document}, so that line 1004 is not really needed. Line 1002 is required in some cases (but it is not clear which ones).
