EuroTeX 92 - September 14-18, Prague, Czechoslovakia

ArabTeX - Typesetting Arabic with Vowels and Ligatures

Klaus Lagally
Universität Stuttgart
Institut für Informatik
Breitwiesenstraße 20-22
D-7000 Stuttgart 80, Germany
lagally@informatik.uni-stuttgart.de

Abstract

We present a TeX macro package for generating Arabic writing from a standardized ASCII input notation. It can handle partial or full vocalization, and automatically generates most of the common ligatures. There is limited support for Farsi, Urdu, and Pashto. ArabTeX is compatible with Plain TeX and also with most LaTeX environments; Arabic and other material can be mixed freely. For special purposes the standard transliteration can be generated as well. ArabTeX uses no preprocessor and thus should be compatible with any TeX implementation that allows dynamic loading of additional macro files and fonts.

Key words: Arabic, transliteration, vocalization, ligatures.

1 Introduction

This is a personal story. The author, interested in the Arabic language since he was a young boy, some time ago found out by accident about an evening course on Arabic at a local school, and decided to join in. The course was designed for people wishing to visit an Arab country with some knowledge of Arabic, and as the teacher would not recommend any suitable and easily affordable textbook for that purpose, he handed out his own handwritten notes.

This intrigued the author, and so he bought an Arabic grammar book from a renowned publisher [Fischer87]. Upon closer inspection the Arabic examples looked somewhat strange, and after contacting the author of the book it turned out that the latter had added the vowel signs to the Arabic examples on the printing plates by hand! This came as a great surprise, especially considering that the underlying printed Arabic text looked beautiful.
Apparently there remained some unsolved problems in the printers' business, and knowing the power of TeX [Knuth84], the author decided to try doing something about it. The result of that effort is now called ArabTeX, a system consisting of a large macro package and several fonts.

2 Design goals

The typical user of ArabTeX, as we imagine her/him,
- knows some Arabic,
- is interested in high quality writing,
- has little money to spare,
- cannot afford specialized equipment,
- is willing to learn some simple rules, but:
- is not, and is not willing to become, a TeX expert.

This description fits several linguists we know. Alas, not every one of them can even afford a simple PC.

From this projected user profile follow some requirements for the system:
- it should be inexpensive,
- it should not require specialized equipment,
- it should be easily portable,
- it should be sufficiently powerful to generate any reasonable Arabic text with high quality,
- it should, after some training, be usable by a person who is not a computer expert.

However,
- it need not be extremely efficient,
- it need not support everyday office use,
- it need not be interactive.

As it happens, our starting point was TeX (in fact, LaTeX [Lamport86]), and we noticed that there are two quite different populations of TeX users:
- the experts, in full control of all specialized features, constantly finding new applications, and
- the everyday users, getting their work done by filling in some forms designed by an expert, and letting TeX do the rest.

Our hypothetical user definitely belongs to the second category. For such a user a convenient user interface is extremely important. Devising such an interface turned out to be a major task.

3 Characteristics of the Arabic script

The Arabic script, like the scripts of most Semitic languages, runs from right to left.
This fact, while leading to some complications in connection with line-breaking whenever we want to mix Arabic and non-Arabic texts, turned out to be an absolutely minor problem in comparison with the fact that the Arabic script is a cursive style, extremely well adapted to handwriting. As far as we know, this has always been so [Endress82b], and contrary to common belief the script is very easy to write; even a motivated beginner can acquire a fair handwriting style within a few weeks. Calligraphic excellence, of course, is a different matter [Schimmel70].

In a cursive hand, we do not assemble character after character on a common baseline, but try to join adjacent letters into a softly flowing curve. This makes for ease of writing, and also for aesthetic beauty, but has the consequence that the script, although still arranging the individual words in a horizontal sequence, is essentially two-dimensional. Another consequence is that the form of a letter depends on the context, and if adjacent letters are combined into ligatures a surprising variety of different forms may emerge. Most of these are not mandatory, but their omission will lead to a serious loss of quality that can easily be noticed even by an outsider, and quality has always been considered very important.

A script with these characteristics is not very convenient to print, and indeed the Arabic script has resisted mechanization for a long time [Endress82a]. The first attempts to print Arabic with movable type were undertaken about 1500 A.D., surprisingly in central Europe, but the printing tradition of Arabic seriously started in 1727 when the "Ottoman printing agency" in Istanbul was founded. It had the types made in the Netherlands where the technology existed, and for several decades only official documents and scientific works were allowed to be printed.
Religious works like the Qur'an and its commentaries were still reproduced by handwriting, and later by lithography from handwritten originals; thus the risk of misprints in the Holy Scriptures was avoided. A second official printing agency was founded in 1821 in Cairo; others followed, and in 1906 a new typeface standard was adopted, with remarkably good results, that is still in use today.

Of the several different writing styles that exist, Naskhi was adopted for printing as it is very easily readable, and mostly adheres to the baseline. Still, even printing Naskhi is a formidable task; whereas a European printer's box contains fewer than 100 different letter forms including capitals, digits, and special characters, you need far more than 500 different forms for good quality Arabic printing.

The situation improved in the 1970s when photo-typesetting equipment became available and the first computer programs to typeset Arabic were developed [MacKay77]. Now other writing styles like Nasta`liq, as used mainly in Iran and the adjacent countries, could also be handled, and many new typefaces, e.g. for newspapers, were developed. But you can still find headlines which have obviously been reproduced from a handwritten original. The calligrapher's profession is still alive (see, e.g., [Hāšim80]).

Even if the technology for printing Arabic texts nowadays exists, some problems remain. In the Arabic language, as in all Semitic languages, the main information resides in the consonants and the long vowels, and usually only these are written explicitly. Short vowels, the doubling of a consonant, and the like are either not indicated at all or expressed by diacritical marks placed above or below the characters. A native speaker generally does not need this additional information as he can deduce it from the context; it is only required when introducing new words, for resolving ambiguities, and in religious texts where the exact pronunciation is considered important.
Considering the already very large number of different letter forms in a printer's box, storing all the possible combinations with diacritical marks as well would be prohibitively expensive, and thus manual corrections are necessary. This is awkward and expensive, so it is avoided whenever possible, and thus the religious texts we have seen have all been reproduced from manuscripts.

3.1 Transcription and Transliteration

If we want to generate the Arabic writing of a given text automatically, we have to denote the text in a way that can readily be processed by our computer. There exists no standard suitable for our purpose, so we have to invent one; and since linguists have always had related problems and are also among our prospective users, we try to imitate their solutions as closely as possible.

In this context there exist two concepts that are closely related (and therefore frequently confused): transcription and transliteration.

"Transcription" means: representing the sounds of the given language as closely as possible. This can even be done in the language itself, e.g., transcribing the sound of the English word "enough" as "enuff"; on the other hand there exists a language-independent standard, the International Phonetic Alphabet.

"Transliteration" on the other hand means: representing the writing of the given language by using a different set of characters. In theory, just a unique representation is needed; in practice it is also required that the transliteration be easily readable, and also give some indication of the sounds. Therefore some compromises are usually made, with the consequence that deducing the writing from the transliteration requires some knowledge of the language in question.

For Arabic and some other languages using the Arabic script, there exist two nearly identical international standards [DIN31635, ISO/R233] for transliteration in the given loose sense.
As there are more Arabic letters than letters in the Latin alphabet, these conventions make heavy use of diacritical marks, and so we cannot use them directly for our purpose.

3.2 Input notation

If we want to typeset Arabic texts with TeX, we have two possibilities:
- either have a preprocessor transform our input text into some intermediate notation that can be processed by TeX,
- or enhance the power of TeX by adding suitable macros so that it can process our input text directly.

The first possibility is extremely flexible as far as the possible input codings are concerned, and can be made very efficient. It has been used in some existing systems, e.g. ScholarTeX [Haralambous91]. However, every user now needs a version of the preprocessor tailored to her/his computer system and cooperating well with the local TeX implementation. Thus we may run into portability and maintenance problems, and possibly a complicated installation procedure.

The second possibility, which we adopted, is by itself as portable as TeX is; but writing the needed algorithms in the TeX macro language is no easy task, and the macros might not run as efficiently as a preprocessor system. Here, as everywhere, there is a tradeoff between generality and speed.

If, as we did, we choose the macro solution, then TeX must be able to read our input notation directly; therefore we had better use only the standard 7-bit ASCII characters (there are extensions to TeX using 8-bit characters, but these are in no way standardised, so we could run into severe compatibility problems). We want to keep the input notation easily readable, but we have the problem that we need about 30 different letters, and some of them sound very much alike. Even when also using the capital letters for coding (Arabic needs no capitals), we could not find a one-to-one correspondence between ASCII characters and Arabic sounds that is easy to read and remember.
The solution we finally found was to use both one-character and two-character encodings, and to adhere closely to the standard transliteration. The rules are simple:
- whenever the transliteration uses just a single letter, we also use that letter;
- whenever the transliteration uses a letter with a diacritical mark, we use the same letter and precede it with the punctuation mark most closely resembling the diacritic.

For instance, an h with a dot below is coded as .h, and an s with a caron as ^s (see Table 1). This is easily remembered, fairly readable, and works well because punctuation marks (except the hyphen) never occur within a word.

Using this coding scheme we get an additional bonus: if, for some reason, we also want to typeset the standard transliteration of an Arabic word, we have to code the diacritical marks used; and whereas this can be done in TeX using existing commands, these look awkward and are not easy to learn and remember. On the other hand it turned out not to be too difficult to derive the transliteration from our coding scheme, and so we can use it for both purposes, thereby avoiding the danger of constantly confusing two closely related, but different, notations.

In fact, the description we gave is somewhat oversimplified. There are some (fortunately rare) exceptions to the transliteration rules, and sometimes words written differently are transcribed identically, so in these cases we have to code additional information.

4 Processing Arabic Text

In the following we give a general overview of the tasks our system has to perform when typesetting Arabic. We discuss this in the context of a simplified model, viz., that a text as seen by TeX is a sequence of paragraphs, each of which is a sequence of words. TeX will transform each word into an internal representation and will arrange these word images into lines.
The sequence of lines thus generated will be broken up into pages which will be sent to a device-independent output file, later to be viewed or printed by a device-dependent driver program. There is indeed much more to it, but the details are not relevant to our exposition.

4.1 Overall structure: Quotations, Paragraphs

If we want to typeset a document containing Arabic text, we will distinguish two different cases:
- short Arabic quotations inside a line of text in some European language,
- longer Arabic passages consisting of one or several paragraphs.

An in-line quotation is handled as a whole. We process the Arabic words in reverse order, one word at a time, and insert the results into the normal output. This could lead to problems if a quotation were split across a line boundary, because in that case the two parts should be individually reversed. We ought to do the line-breaking first and the reversal afterwards, but we know of no easy way of doing that with TeX. To handle this problem, an extension of TeX, TeX-XeT, has been proposed [Knuth and MacKay87], but it is not generally available, and also not compatible with the standard printer driver programs. So we have to forbid line-breaking within a quotation, and for technical reasons quotations have to be very short anyway.

Longer Arabic passages are handled differently. Here we process the individual words in their natural order, arrange the results in reverse order, and do the line-breaking ourselves. Inside an Arabic paragraph we can again have insertions, e.g., short quotations (now of non-Arabic text), or even in-line mathematical formulas. For the same reasons as above, we again have to forbid line-breaks inside an insertion.

In both cases we have to take care of the fact that numbers in Arabic are written as in the European languages, i.e., the sequence of digits is not reversed.
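The two ordering rules above (words run from right to left, digit sequences do not) can be pictured with a small Python sketch. It is an illustration only and not part of ArabTeX: Latin letters stand in for Arabic characters, each word is modelled as a plain string rather than a typeset box, the function name visual_order is hypothetical, and a number is assumed to be any space-delimited token that begins with a digit.

```python
def visual_order(logical_text: str) -> str:
    """Return the left-to-right visual layout of a right-to-left quotation.

    Assumption of this sketch: a token that starts with a digit is a number.
    """
    laid_out = []
    for token in logical_text.split():
        if token[0].isdigit():
            laid_out.append(token)        # digits keep their natural order
        else:
            laid_out.append(token[::-1])  # letters of a word run right to left
    # The words themselves are placed from right to left on the line.
    return " ".join(reversed(laid_out))


print(visual_order("kitab 1992 qalam"))  # malaq 1992 batik
```

The first logical word ends up at the right margin, while the number in the middle still reads 1992.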
We could have put the responsibility for indicating what is to be considered a number on the user; however, we decided simply to define a number as a sequence of characters starting with a digit and ending with a space, and to typeset this sequence in its natural order.

4.2 Numbers, Words, Subwords

As we saw, every Arabic word or number is processed individually, and the result is a description of its graphical representation given in terms of symbols from a given font arranged in a two-dimensional pattern. There is no unique correspondence between these symbols and Arabic characters; a character image might be built up from several symbols, and it also sometimes happens that a symbol represents more than one character. The reason behind this is that the Arabic characters may be collected into several classes whose members are closely related and differ only in a few features that can be separated out. Fortunately the same is true for the ligatures, and we can also handle the vocalization by the same mechanism, so that a single font of fewer than 256 characters is sufficient for expressing a much larger set of graphical symbols and combinations.

When we want to typeset a number in the Arabic script, we just arrange the isolated graphical symbols corresponding to the digits from left to right and we are done.

Typesetting a word of text is more involved. Logically, a written word consists of a sequence of character images connected to each other as far as possible, and possibly changing their shape depending on the context. In addition, these character images may carry diacritical marks. Not all characters can be joined to their successors (probably because the writing would become ambiguous otherwise), and thus we can consider a word to be a sequence of subwords, each consisting of characters that are all connected. To each subword corresponds a graphical representation, and these are arranged side by side.
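The division of a word into subwords can be sketched in a few lines of Python. This is a minimal illustration, not ArabTeX code. The letter names follow the input coding of the paper, with A standing for alif; the set of letters that do not connect to their successor is an assumption drawn from ordinary Arabic orthography (alif, dal, dhal, ra, zay, zhe, waw), and the function name split_subwords is hypothetical.

```python
# Letters that cannot be joined to the following letter.
# Assumption of this sketch, based on standard Arabic orthography.
NON_JOINING = {"A", "d", "_d", "r", "z", "^z", "w"}


def split_subwords(letters):
    """Split a list of letter codes into connected subwords."""
    subwords, current = [], []
    for letter in letters:
        current.append(letter)
        if letter in NON_JOINING:
            # The connection breaks after this letter: close the subword.
            subwords.append(current)
            current = []
    if current:
        subwords.append(current)
    return subwords


# The letters of "madrasa" (school): m, d, r, s, ta marbuta.
print(split_subwords(["m", "d", "r", "s", "T"]))
# [['m', 'd'], ['r'], ['s', 'T']]
```

In the example the word falls into three subwords because neither d nor r connects to the letter that follows it.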
When the subwords are arranged, each may be displaced vertically so that its last (i.e. leftmost) character has its normal position on the baseline, and horizontally so that the spacing looks pleasant.

[Figure 1: Character assembly with components shown.]

4.3 Characters, ties, diacritics

Depending on its position in a subword, a character might take on one of several shapes: the isolated, initial, medial, and final shape. These forms might still be modified if the character enters into a ligature. Fortunately, as far as ligatures are concerned all characters of a class behave alike, thus the number of different cases, although large, remains manageable.

When we process a word, we perform the following steps:
- we sequentially process the input representation to break it up into a sequence of individual characters, each with accompanying diacritical information;
- we process this sequence in reverse order to determine the shape of each character depending on its position in the subword and on the surrounding characters;
- starting on the baseline, we position these character shapes so that they join smoothly, either directly or by means of connecting strokes. To each character, we add the appropriate diacritical marks (there may be none or even more than one per character). For an example, see Figure 1.
- Whenever the next character considered (this is the preceding one, when writing by hand!) cannot be joined to its logical successor, we have reached a subword boundary; we reposition this character so it will again sit on the baseline, and add suitable spacing.

The resulting graphical representation of the word is passed back to the caller to be inserted into the output.

5 User Interface

In the following we shall only describe the main features; for more details, see the ArabTeX documentation [Lagally92].

5.1 Activating ArabTeX

To use the ArabTeX package with a file to be processed by Plain TeX, load it via \input arabtex; with LaTeX, include arabtex as a document style option. In both cases, several additional files and the default font will be installed.

5.2 Mode control

As there are several language-dependent writing conventions, you have to select a language by one of the commands \setarab, \setfarsi, \seturdu, \setpashto, or \setverb (no special processing in this case).

There are three different modes of handling short vowels:
- \vocalize: short vowels written in the input will be indicated in the output by diacritical marks;
- \fullvocalize: the absence of a short vowel will also be indicated;
- \novocalize: short vowels will show up in the transliteration, but will be omitted in the Arabic writing. You can locally override this feature.

By \arabtrue, \arabfalse, \transtrue, \transfalse you can switch on and off the generation of the Arabic writing and/or the standard transliteration. By default, the Arabic writing is on, and the transliteration is off.

Bold-face can be selected by \setbold; \setnormal will revert to normal.

5.3 Arabic text

Short Arabic quotations in normal text are enclosed in angle brackets. These thus have a special significance (outside of mathematical mode) and can no longer be used for other purposes, e.g., for normal text or in local macros. This special behaviour is switched on by language selection, and can be switched off again by \setnormal.

An Arabic paragraph is started by the command \begin{arabtext} and ends with \end{arabtext}. This looks like, and nearly operates like, a LaTeX environment even when working with Plain TeX. However, neither displayed mathematical text nor other LaTeX environments may be nested in an Arabic paragraph.
Inside an Arabic paragraph we can have non-Arabic quotations delimited by angle brackets, and in-line mathematical formulas delimited by single dollar signs. These insertions must fit on one output line.

5.4 Input coding

Table 1: Coding of Arabic characters

Input   Arabic        Transliteration
a       ا             a
b       ب             b
p       پ             p
t       ت             t
_t      ث             ṯ
^g      ج             ǧ
.h      ح             ḥ
_h      خ             ḫ
c       [illegible]   c
^c      چ             č
,c      [illegible]   [illegible]
d       د             d
_d      ذ             ḏ
r       ر             r
z       ز             z
^z      ژ             ž
s       س             s
^s      ش             š
.s      ص             ṣ
.d      ض             ḍ
.t      ط             ṭ
.z      ظ             ẓ
`       ع             ʿ
.g      غ             ġ
f       ف             f
q       ق             q
v       ڤ             v
k       ك             k
g       گ             g
l       ل             l
m       م             m
n       ن             n
h       ه             h
w       و             w
y       ي             y
T       ة             t

The input notation, the Arabic writing in the isolated form, and the transliteration of the characters used for Arabic and Persian are given in Table 1. For Urdu, Pashto, and for special purposes there are some additional codings. Note also the following:
- <T> is tah marbouta, <N> is tanwin, <Y> is alif maqsoura.
- <A>, <I>, <U> denote the long vowels, <a>, <i>, <u> the short vowels if required.
- <'> (right quote) is hamza (glottal stop). After \setarab, its carrier will be determined by the context according to the full hamza rules, otherwise by a following short vowel.
- <'A> generates madda.
- Doubled consonants are written twice (shadda).
- <|> will break unwanted ligatures, <-> joins two words and will only show up in the transliteration, and <--> will elongate the connection between two adjacent letters (kashida).
- The definite article is always written <al-> (with hyphen), even if it precedes a (double) "sun letter".

5.5 Special features

For Farsi, Urdu, Pashto and some other languages using the Arabic script, the coding conventions are slightly different, and not described here. Furthermore, the language-specific processing may be locally overridden, and there is also a verbatim mode capable of representing unusual or archaic ways of writing. Mode-changing commands may also occur inside an Arabic paragraph, thus allowing local mode changes.

6 Implementation

The ArabTeX system consists of a large number of macros, and their interaction is surprisingly complex.
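To make the input coding of Section 5.4 concrete, the following Python sketch splits an input word into one-character and two-character codes and derives a transliteration from them. It is a simplified illustration, not the ArabTeX macros themselves: the function names scan and transliterate are hypothetical, the table covers only part of Table 1 plus the long vowels, and special cases such as hamza, the hyphen and the ligature breaker are ignored.

```python
# Punctuation marks that introduce a two-character code.
PREFIXES = "._^,"

# Partial transliteration table; codes not listed map to themselves.
TRANSLIT = {
    "_t": "ṯ", "^g": "ǧ", ".h": "ḥ", "_h": "ḫ", "^c": "č",
    "_d": "ḏ", "^z": "ž", "^s": "š", ".s": "ṣ", ".d": "ḍ",
    ".t": "ṭ", ".z": "ẓ", "`": "ʿ", ".g": "ġ",
    "A": "ā", "I": "ī", "U": "ū",
}


def scan(word):
    """Break an input word into one- and two-character codes."""
    tokens, i = [], 0
    while i < len(word):
        if word[i] in PREFIXES and i + 1 < len(word):
            tokens.append(word[i:i + 2])  # punctuation mark plus letter
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def transliterate(word):
    """Map every code to its transliteration character."""
    return "".join(TRANSLIT.get(token, token) for token in scan(word))


print(scan(".himAruhu"))
# ['.h', 'i', 'm', 'A', 'r', 'u', 'h', 'u']
print(transliterate(".himAruhu"))  # ḥimāruhu
```

Because a punctuation mark never occurs inside a word, a single left-to-right pass with one character of lookahead is enough to separate the codes.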
The macros are grouped into several packages, each devoted to a separate task. As ArabTeX can be considered a translator, we imitate the usual modularization of a compiler. In that view, ArabTeX consists of a Driver Module calling a number of auxiliary modules for specialized tasks, and finally passing the output back to the normal TeX paragraph mechanism. Thus Arabic text can also appear inside most LaTeX environments, including moving arguments. However, LaTeX is no prerequisite for running ArabTeX.

6.1 The Driver Module

The Driver Module, arabtex.sty, is loaded by LaTeX or, when using Plain TeX, by a small Loader Module, arabtex.tex. The latter module simulates the (few) LaTeX features used by ArabTeX.

The Driver Module, when executed, defines and initializes some common variables and loads the remaining files constituting ArabTeX. It also implements the mode-changing commands, and contains several local submodules:
- the Insertion Processor for Arabic quotations,
- the Paragraph Processor for Arabic paragraphs,
- the Output Processor,
- the Word Processor.

Both the Insertion Processor and the Paragraph Processor pass single Arabic words to the Word Processor to generate the graphical representation (and/or possibly the transliteration) and process the resulting output further.

The Insertion Processor breaks up short quotations into individual words and feeds both the resulting Arabic representation and the transliteration into the normal output stream.

The Paragraph Processor also breaks up the input into individual words; the output of the Word Processor, however, is now handled differently. The transliteration, if generated, is fed into the normal output stream; the Arabic representation is passed to the Output Processor.

The Output Processor lines up the Arabic representations from right to left in a local buffer. Whenever a line is completed, it is interleaved with the normal output, if any.
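The buffering performed by the Output Processor can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not ArabTeX code: the real package works with TeX boxes and glue, whereas here each word is a (name, width) pair, and the function name fill_lines, its parameters and the greedy filling strategy are all hypothetical.

```python
def fill_lines(word_widths, line_width, space=1):
    """Fill lines greedily; each line lists its words in visual
    (left-to-right) order, so the first logical word comes last."""
    lines, current, used = [], [], 0
    for word, width in word_widths:
        needed = width if not current else used + space + width
        if current and needed > line_width:
            lines.append(current[::-1])  # line complete: emit it reversed
            current, needed = [], width
        current.append(word)
        used = needed
    if current:
        lines.append(current[::-1])      # flush the buffer at the end
    return lines


words = [("w1", 4), ("w2", 3), ("w3", 5), ("w4", 2)]
print(fill_lines(words, line_width=9))
# [['w2', 'w1'], ['w4', 'w3']]
```

Within each emitted line the first word of the input sits at the right end, which is the effect of lining up the word images from right to left.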
At the end of an Arabic paragraph, the buffer is flushed, and the paragraph is finished by the normal TeX paragraphing mechanism. For an example, see Figure 4.

The Word Processor passes the input to the Scanner Module, ascan.sty, to generate a standardized internal representation independent of the external coding. This internal representation is then passed to the Transliteration Module, atrans.sty, if the transliteration is wanted. Otherwise, or additionally, it is passed to the Parser Module, aparse.sty, to isolate the individual graphical components. The output of the Parser Module is further processed by the Assembly Module, awrite.sty, to generate the Arabic representation.

6.2 The Scanner Module

The main task of the Scanner Module is to break up the input stream into tokens denoting individual Arabic characters; should the input notation be changed, then only the Scanner Module would have to be adapted accordingly. There is one case handled in a special way: for hamza the character preceding it is repeated after it to ease further processing.

6.3 The Transliteration Module

This module has to transform the sequence of tokens into the external representation of the standard transliteration. As the transliteration does not always follow the Arabic writing closely, some special cases have to be considered, e.g., in connection with endings and with the definite article whose spelling depends on the first consonant of the following word. Also sometimes an initial vowel has to be suppressed (wasla).

ǧuḥā wa-ḥimāruhu

ʾatā ṣadīqun ʾilā ǧuḥā yaṭlubu minhu ḥimārahu li-yarkabahu fī safratin qaṣīratin wa-qāla lahu: sawfa ʾuʿīduhu ʾilayka fī 'l-masāʾi, wa-ʾadfahu laka ʾuǧratan.
fa-qāla ǧuḥā: ʾanā ʾāsifun ǧiddan ʾannī lā ʾastaṭīʿu ʾan ʾuḥaqqiqa laka raġbataka, fa-'lḥimāru laysa hunā 'l-yawma.
wa-qabla ʾan yutimmu ǧuḥā kalāmahu badaʾa 'l-ḥimāru yanhaqu fī 'ṣṭablihi.
fa-qāla lahu ṣadīquhu: ʾinnī ʾasmaʿu ḥimāraka yā ǧuḥā yanhaqu.
fa-qāla lahu ǧuḥā: ġarībun ʾamruka yā ṣadīqī! ʾatuṣaddiqu 'l-ḥimāra wa-tukaḏḏibunī?

Figure 2: Arabic transliteration.

6.4 The Parser Module

The Parser Module has to break up the token sequence into a backward sequence of "writing syllables". A "writing syllable" is not to be confused with a syllable in the usual sense, but consists of a single consonant or long vowel with additional diacritical information denoting e.g., a short vowel, consonant doubling, tanwin and hamza.

Whereas the basic algorithm is straightforward, there is a surprisingly large number of special cases since the various languages supported by ArabTeX have different notational conventions, and there are also some options (not described here) to locally modify the writing. A typical example is the handling of hamza, the glottal stop. Whereas denoting a distinctive sound, it is not considered a letter, and thus a carrier for it has to be determined which depends on the context in a rather complicated way.

[Figure 3: Vocalized Arabic text.]

6.5 The Assembly Module

Finally, from the reversed sequence of "writing syllables" produced by the Parser Module, the graphical representation is determined. Every "writing syllable" consists of a basic character and diacritical information. Every character belongs to a character class, represented by a "skeleton", and is locally identified by a "modifier" (usually a pattern of dots).

The further processing of a "writing syllable" proceeds in several steps:
- The skeleton and the modifier are determined.
- Depending on context, the appropriate joining form of the skeleton (isolated, initial, medial, final) is determined.
- Also depending on the context, the skeleton may take part in a ligature and thus get a different shape. Generally, and with very few exceptions, ligature generation is optional; and since it is also complicated (though not difficult), it has been delegated to a separate Ligature Module, aligs.sty.
- After the definite form of the skeleton has been determined, it is positioned in the output. If it is an isolated or final shape, it is generally put on the baseline with suitable spacing to its left neighbour, if any. Otherwise it is joined to its left neighbour, either directly or by means of a connecting stroke whose form depends on the partners. As the connection point of its left neighbour need not be on the baseline, the skeleton possibly must be vertically adjusted, and a new connection point for its right neighbour, if that exists, will be determined.
- After positioning the skeleton, the modifier will be added to identify the character in question.
- Finally, the diacritical information is added.

6.6 The Ligature Module

This module is called by the Assembly Module for each character.
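The shapes handed to the Ligature Module come from the joining-form step of the Assembly Module (Section 6.5). A minimal Python sketch of that step is given here for orientation. It is not ArabTeX code: it looks at neighbours by index instead of walking the word in reverse, the function name joining_forms is hypothetical, and the set of letters that do not connect to their successor is an assumption taken from ordinary Arabic orthography (A stands for alif).

```python
# Letters that cannot be joined to the following letter (assumption).
NON_JOINING = {"A", "d", "_d", "r", "z", "^z", "w"}


def joining_forms(letters):
    """Return the joining form of every letter in a word."""
    forms = []
    for i, letter in enumerate(letters):
        # Joined to the previous letter if there is one and it connects on.
        joined_to_prev = i > 0 and letters[i - 1] not in NON_JOINING
        # Joins the next letter if there is one and this letter connects on.
        joins_next = i + 1 < len(letters) and letter not in NON_JOINING
        if joined_to_prev and joins_next:
            forms.append("medial")
        elif joined_to_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# The letters of "kitAb" (book): k, t, alif, b.
print(joining_forms(["k", "t", "A", "b"]))
# ['initial', 'medial', 'final', 'isolated']
```

The alif takes its final form and forces the following b into the isolated form, which is exactly the situation in which a subword boundary arises.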
The Ligature Module receives as input a description of a skeleton shape and the shape of its right neighbour, and returns a possibly changed skeleton shape, a possibly changed shape of the right neighbour, and frequently also a connecting stroke.

With the exception of very few, but important, cases where ligatures are mandatory, the Ligature Module might return its input information unchanged, and indeed there is an option to switch most ligatures off. However, the art of forming ligatures evolved gradually during many centuries of writing, and their inclusion will greatly improve the quality of the result; and whereas a good many cases are handled already, there is still room for improvement.

7 Experiences

One of the reasons for implementing ArabTeX this way was to test the power of TeX on a large example. We found that it could be done, but we drastically underestimated the amount of work involved.

The techniques used in the described modules are comparatively straightforward; even the full power of context-free language analysis is rarely needed. However, due to the great number of special cases the complexity is considerable, and the macro technique used is extremely vulnerable to trivial coding errors whose effects will propagate throughout the system very quickly, and frequently will lead to very puzzling results. Thus systematic structuring is a must, and a complete redesign after having a working prototype paid off very well and led to a considerable increase in stability. There are still some errors in the system, but they seem to be well hidden, and show up at a surprisingly low rate.

[Figure 4: Arabic text with transliteration.]

Further plans, besides correcting errors, are: designing a Nasta`liq font that looks better for Persian, and generally improving on the still very rudimentary support for non-Arabic languages using the same script.

Acknowledgments

The development of ArabTeX would not have been possible without the assistance of many people. Apart from my local team, helpful advice came among others from Ivan Derzhansky, Wolfdietrich Fischer, Ahmed El-Hadi, Abdelsalam Heddaya, Iqbal Khan, Tom Koornwinder, Eberhard Krueger, Asif Lakehsar, Jan Lodder, Richard Lorch, Eberhard Mattes, and Bernd Raichle.
I also have to thank the many users who sent bug reports and comments.

References

[DIN31635] DIN 31 635: Umschrift des Arabischen Alphabets, Deutsches Institut für Normung e.V., 1982.

[Endress82a] Gerhard Endress, Die Arabische Schrift, in [Fischer82], p. 165 ff.

[Endress82b] Gerhard Endress, Handschriftenkunde, in [Fischer82], p. 271 ff.

[Fischer82] Wolfdietrich Fischer (ed.), Grundriß der Arabischen Philologie, Band 1: Sprachwissenschaft, Dr. Ludwig Reichert Verlag, Wiesbaden 1982.

[Fischer87] Wolfdietrich Fischer, Grammatik des Klassischen Arabisch, 2. Auflage, Verlag Otto Harrassowitz, Wiesbaden 1987.

[Haralambous91] Yannis Haralambous, "TEX and Those Other Languages", TUGboat, Volume 12 (1991), pp. 539–548.

[Hāšim80] هاشم محمد الخطاط، قواعد الخط العربي (Hāšim Muḥammad al-Ḫaṭṭāṭ, Qawāʿid al-Ḫaṭṭi al-ʿArabī), Maktaba an-Nahḍa, Baghdad; Dār al-Qalam, Beirut, 1400/1980.

[ISO/R233] ISO/R 233 - 1961: International System for the Transliteration of Arabic Characters, International Organization for Standardization, 1961.

[Knuth84] Donald E. Knuth, The TEXbook, Volume A of Computers & Typesetting, Addison-Wesley, Reading, Mass., 1984.

[Knuth and MacKay87] Donald E. Knuth and Pierre A. MacKay, "Mixing right-to-left texts with left-to-right texts", TUGboat, Volume 8 (1987), pp. 14–25.

[Lagally92] Klaus Lagally, ArabTEX, a System for Typesetting Arabic, User manual. Report 6/92, Fakultät Informatik, Universität Stuttgart, 1992.

[Lamport86] Leslie Lamport, LaTEX, a Document Preparation System, Addison-Wesley, Reading, Mass., 1986.

[MacKay77] Pierre MacKay, The KATIB System, a revolutionary advancement in Arabic Script Typesetting by means of the Computer, in Scholarly Publishing 8,2 (Toronto 1977), pp. 142–150.

[Schimmel70] Annemarie Schimmel, Islamic Calligraphy, E.J. Brill, Leiden, Netherlands 1970.
Appendix

Installing ArabTEX

ArabTEX uses no preprocessor and thus should be compatible with any TEX implementation that allows dynamic loading of additional macro files and fonts. The ArabTEX distribution consists of the following components:

- TEX macro files with extensions .sty and .tex: these files are installed on the TEX input path for source files.
- Font metric files (extension .tfm) and compressed pixel files (extension .pk) for the fonts nash14 and nash14bf at several common magnification steps. Installation of these files is strongly system dependent; in case they cannot be used, the METAFONT sources (extension .mf) are also available to rebuild the fonts locally.
- Installation notes, user manual, answers to questions, demos, and the like: ASCII and/or TEX files for local printing.

The system is available from the author's institution (anonymous FTP from ifi.informatik.uni-stuttgart.de, directory pub/arabtex) and from many other common servers. At the time of this writing, version 2.02 is current. The old version 1 should no longer be used.

ArabTEX is copyrighted, but free use for scientific, experimental and other strictly private, noncommercial purposes is granted.

Space and time requirements are not negligible; however, ArabTEX has been used frequently and successfully even on a PC XT in standard configuration.