ArabTeX, a System for Typesetting Arabic

Klaus Lagally*

Abstract

TeX, the powerful computer typesetting system by D.E. Knuth, and LaTeX, its extension by L. Lamport for structured documents, have both been adapted to handle passages in Arabic script. Our new system, ArabTeX, accepts as input an ASCII encoding of the standard transliteration of Arabic and generates the Arabic writing with a full complement of vowel marks, automatically producing most of the usual ligatures and obeying the common writing rules. Alternatively, or additionally, the standard transliteration can be generated from the same input. The notation is easily readable and suitable for electronic transmission. For non-vocalized writing a reduced input notation is available, as are some extensions for other languages using the Arabic script and for reproducing erroneous or archaic documents. ArabTeX runs on a wide range of hardware platforms in conjunction with any TeX implementation that can be extended by loading additional macros and additional fonts. No special equipment is required.

1 Overview

We present ArabTeX, a system for preparing documents in a European language that contain passages in Arabic or in another language using the Arabic script.

ArabTeX is not a general-purpose word processing system for producing, e.g., Arabic newspapers, or for everyday office use in an Arabic environment. Good systems already exist for these fields of application; however, they are usually based on dedicated hardware, and they require special training for efficient use. ArabTeX, on the other hand, is mainly aimed at scholars who have little or no computer or typesetting experience, and who have no specialized equipment available besides a standard PC or workstation with a high-resolution printer.

1.1 On Mathematical Typesetting

In order to explain the operation of ArabTeX, let us first consider the related problem of mathematical typesetting.
Typesetting mathematical formulas is inherently difficult for several reasons:
- there is a very large number of different symbols, some of them occurring in various shapes and sizes;
- these symbols have to be arranged in a two-dimensional pattern according to the structure of the formula;
- frequently there are several possible ways of rendering a given formula, and choosing the best one requires knowledge of both typesetting and mathematics.

\[
Y_l^m = e^{im\varphi} \cdot \sqrt{\frac{2l+1}{4\pi}} \cdot \sqrt{\frac{(l-m)!}{(l+m)!}} \cdot \frac{(-1)^{l+m}}{2^l \cdot l!} \cdot \sin^m\theta \cdot \left(\frac{d}{d\cos\theta}\right)^{l+m} \sin^{2l}\theta
\]

Figure 1: Example of mathematical text

This is a rare combination of skills, and thus there are not many publishing houses capable of producing printed mathematical text of high quality. Specialized equipment is also needed, and communication between the author and the publisher, who view the text differently, may not be easy.

Thus, producing a mathematical textbook is a time-consuming and expensive task, and publications on a smaller scale are usually reproduced from a camera-ready original prepared by the author himself, either using a specialized typewriter or inserting the formulas by hand (as this author had to do when preparing his Ph.D. thesis).

This state of affairs led D.E. Knuth to develop TeX [Knuth84], a computer program that enables the author to do the typesetting himself. The basic idea is that the author enters his text concentrating on its content and logical structure, and the program does the typesetting automatically, drawing on a large body of specialized knowledge. The results look impressive, and indeed several textbooks have been produced using TeX.

* CV: born in 1937; Public School, Graduation 1956; University Studies in Mathematics and Physics; Ph.D. in Theoretical Physics 1967; Work on Operating Systems and Programming Languages; Professor of Computer Science 1976, Universität Stuttgart, Germany.

Even when using TeX, producing a fair-looking document is not an easy task, as there is still a very wide variety of possible layouts for the same text, and a casual user might easily go astray. Fortunately TeX is not only a typesetter but also an interpreter for its own (macro) programming language, and most documents are structured according to one of a few standard patterns. Thus, given a set of standard definitions, e.g. the macro package LaTeX by L. Lamport [Lamport86], the user's task can be reduced to "filling in the blanks", and the macros automatically take care of clerical tasks like section and subsection numbering, placement of figures and tables, building of itemized or enumerated lists, cross-referencing, indexing, etc.

1.2 Some Problems with Arabic

Typesetting Arabic text inside a document in a European language poses the same problems the user faces when including mathematical formulas, and a few more:
- Arabic runs from right to left. This means that the sophisticated line-breaking algorithm of TeX will not work for Arabic text.
- Arabic is written in a cursive style. We have to join the individual graphical symbols arising from the Arabic characters into a softly flowing curve, and the form of these symbols is strongly context-dependent.
  Figure 2: Character assembly with components shown. [Arabic glyphs not recoverable from the extracted text]
- There is a large number of ligatures, and some of these are mandatory. A mathematical formula is a highly structured entity, and its two-dimensional layout is basically determined by this structure. Whereas this is also true of Arabic writing, the user usually thinks of an Arabic word as a linear sequence of individual letters. We ought not to burden him with the task of specifying the exact graphical representation, which is governed by a number of well-known and time-honoured rules.
- There is a complex set of diacritical marks. Diacritical marks are usually redundant and therefore omitted in most cases, but under special circumstances they are essential, and the user should have full control over their use.
- Sometimes the standard transliteration is also required. Although this is not a problem of Arabic typesetting at all, the transliteration uses many symbols outside the standard character set which cannot be coded directly, and to avoid confusion we should be able to obtain it from an input notation not too different from the one used for the Arabic writing.

In addition we require that the user need not be a computer expert, and that the input notation for Arabic text be easily readable and suitable for electronic transmission.

It turns out that the technique of extending TeX, or LaTeX, by yet another macro package is sufficiently powerful to construct a system that fulfills these requirements. Technical details are outside the scope of this presentation; they can be found, e.g., in [Lagally92b].

2 Input Notation for Arabic

TeX is not an interactive system. It transforms a given input file into a device-independent output representation which afterwards, by means of a device-specific driver program, can be sent to a laser printer or a photo-typesetter, or viewed on a high-resolution computer screen. If, as is normally the case, the user has to correct some typos or wants to change the appearance of the document, he has to edit the input file with a text editor; the input representation must therefore be easily readable and correspond closely to the desired output.

For Arabic, a plausible candidate seems to be the standard transliteration [DIN31635], [ISO/R233]; however, it cannot be used directly, as it makes heavy use of diacritical marks, most of which are not readily available on common computer keyboards, and whose internal coding is unfortunately in no way standardized. There are also many national keyboard variants that handle these special symbols differently. For a truly portable system we should use only the symbols common to all keyboards, i.e. letters, figures, and punctuation marks.

2.1 The Basic Coding Scheme

    Code   Arabic   Transliteration
    a      ا        a
    b      ب        b
    p      پ        p
    t      ت        t
    _t     ث        ṯ
    ^g     ج        ǧ
    .h     ح        ḥ
    _h     خ        ḫ
    c      [glyph not recoverable]   c
    ^c     چ        č
    ,c     [glyph not recoverable]   c with a diacritic [not recoverable]
    d      د        d
    _d     ذ        ḏ
    r      ر        r
    z      ز        z
    ^z     ژ        ž
    s      س        s
    ^s     ش        š
    .s     ص        ṣ
    .d     ض        ḍ
    .t     ط        ṭ
    .z     ظ        ẓ
    `      ع        ʿ
    .g     غ        ġ
    f      ف        f
    q      ق        q
    v      ڤ        v
    k      ك        k
    g      گ        g
    l      ل        l
    m      م        m
    n      ن        n
    h      ه        h
    w      و        w
    y      ي        y
    T      ة        t

Figure 3: Coding of Arabic characters

The ArabTeX encoding is based on the standard transliteration, but uses one- and two-character encodings according to the following rules (see Figure 3):
- whenever the transliteration uses just a single letter, we also use that letter;
- whenever the transliteration uses a letter with a diacritical mark, we use the same letter and precede it with the punctuation mark most closely resembling the diacritic;
- <A>, <I>, <U> denote the long vowels, and <a>, <i>, <u> the short vowels if required;
- <'> (right quote) is hamza (glottal stop). If the Arabic writing mode has been selected, its carrier is determined by the context according to the full hamza rules, otherwise by a following short vowel;
- <'A> generates madda;
- <T> is ta' marbuta, <N> is tanwin, <_A> is alif maqsura;
- doubled consonants are written twice (shadda).

This is easily remembered, fairly readable, and works well because punctuation marks (except the hyphen) never occur within a word.

2.2 Variants and Extensions

The coding scheme as given contains the full information necessary for obtaining both the fully vocalized Arabic writing and the standard transliteration.
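
To make the rules of Section 2.1 concrete, the following fragment annotates the coding of a few words taken from the sample document in Figure 4. It is only a sketch: the header and the commands \setarab and \vocalize are copied from that sample, the second phrase is a hypothetical combination of two words from it, and the comments give a reading of the rules (with <_A>, <T> and <N> inferred from the sample text), not output of the system.

```latex
\documentstyle[11pt,dina4,arabtex,atrans]{article}
\setarab        % language-specific processing, as in Figure 4
\vocalize       % show short vowels as diacritics
\begin{document}

% ^g : g with hacek in the transliteration, so "^" precedes the letter
% .h : h with a dot below, so "." precedes the letter
% u, i, a : short vowels;  A : long vowel
% wa- : the hyphen is the one punctuation mark allowed inside a word
<^gu.hA wa-.himAruhu>

% '  : hamza, carrier chosen by the system from the context
% _A : alif maqsura;  T : ta' marbuta;  N : tanwin
% (hypothetical phrase assembled from two words of Figure 4)
<'il_A safraTiN>

\end{document}
```

Because every diacritic is spelled with a preceding punctuation mark, the source stays within the characters common to all keyboards, which is the portability requirement stated at the start of this section.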

For an experienced user, some simplifications are possible: if neither the vocalization nor the transliteration is required, the user need not denote the short vowels except where they influence the writing of hamza.

There are also many additional options:
- the vocalization can be controlled at three levels, and locally;
- there are language-specific modes for Farsi, Urdu, and Pashto with the necessary additional input codings;
- the generation of ligatures can be locally modified;
- all default settings can be locally overridden.

Full details are given in [Lagally92a].

3 Examples of Document Structure

ArabTeX follows the TeX paradigm: the user specifies the logical structure of the document and lets the computer worry about the typesetting details. TeX itself offers a large set of mechanisms for describing the desired appearance of the document, and ArabTeX just adds a few commands for indicating Arabic text, plus the internal routines responsible for the language-specific processing.

As the basic TeX mechanisms are comparatively low-level, an inexperienced user will prefer to start with LaTeX, which handles the standard cases in a convenient way. Users preferring to work with Plain TeX can of course do so.

A LaTeX document consists of a header, which specifies the document style and possibly modifies some parameters, and a document body, which contains the text of the document in free format, grouped according to its logical structure. For an example of the input format, see Figure 4, which should be fairly self-explanatory. Note that the percent mark introduces a comment for a human reader of the source text; it is not otherwise processed. The output for the same text is shown in Figure 5.

Figure 6 shows a more realistic example. Here both the Arabic writing and the transliteration have been switched on, and the transliteration output is interleaved automatically with the Arabic writing.

Still another example is this paper itself [Lagally92c]: it has been produced using LaTeX and ArabTeX without any manual cutting and pasting. (We had to use some technical tricks to produce Figure 2, which is non-standard.)

    \documentstyle[11pt,dina4,arabtex,atrans]{article}
    % choose a document style, the type size and paper format
    % load the ArabTeX macros and the transliteration module
    \setarab        % select language-specific processing, e.g. for
    \vocalize       % indicate short vowels by diacritics
    \begin{document}
    This is a short demonstration.
    We start with an arabic insertion <^gu.hA wa-.himAruhu>
    inside a line of English text. Please note the automatic
    formatting of this paragraph which ends with a blank line.

    By changing some switch settings,
    \arabfalse      % no arabic writing
    \transtrue      % transliterate
    we can also produce the transliteration from the same input:
    <^gu.hA wa-.himAruhu>,
    and we should not forget to switch back!
    \arabtrue       % arabic writing on again
    \transfalse     % no transliteration

    For longer arabic texts \ArabTeX\ has to do the line-breaking:
    \begin{arabtext}
    'at_A .sadIquN 'il_A ^gu.hA ya.tlubu minhu .himarahu li-yarkabahu
    fI safraTiN qa.sIraTiN wa-qAla lahu : sawfa 'u`Iduhu 'ilayka
    fI al-masA'i , wa-'adfa`u laka 'u^graTaN .
    \end{arabtext}

    As we see, including arabic text is not difficult.
    \end{document}

Figure 4: Input for a sample LaTeX document

    This is a short demonstration. We start with an arabic insertion
    [Arabic script for: ǧuḥā wa-ḥimāruhu] inside a line of English text.
    Please note the automatic formatting of this paragraph which ends
    with a blank line.

    By changing some switch settings, we can also produce the
    transliteration from the same input: ǧuḥā wa-ḥimāruhu, and we
    should not forget to switch back!

    For longer arabic texts ArabTeX has to do the line-breaking:

    [Arabic script paragraph, set from right to left; glyphs not
    recoverable from the extracted text]

    As we see, including arabic text is not difficult.

Figure 5: Output for the sample LaTeX document

    [In the original figure each transliterated passage is interleaved
    with the corresponding lines of Arabic script; the Arabic glyphs are
    not recoverable from the extracted text.]

    ǧuḥā wa-ḥimāruhu

    ʾatā ṣadīqun ʾilā ǧuḥā yaṭlubu minhu ḥimārahu li-yarkabahu fī
    safratin qaṣīratin wa-qāla lahu: sawfa ʾuʿīduhu ʾilayka fī
    'l-masāʾi, wa-ʾadfaʿu laka ʾuǧratan.

    fa-qāla ǧuḥā: ʾanā ʾāsifun ǧiddan ʾannī lā ʾastaṭīʿu ʾan ʾuḥaqqiqa
    laka raġbataka, fa-'l-ḥimāru laysa hunā 'l-yawma.

    wa-qabla ʾan yutimmu ǧuḥā kalāmahu badaʾa 'l-ḥimāru yanhaqu fī
    'ṣṭablihi. fa-qāla lahu ṣadīquhu: ʾinnī ʾasmaʿu ḥimāraka yā ǧuḥā
    yanhaqu.

    fa-qāla lahu ǧuḥā: ġarībun ʾamruka yā ṣadīqī! ʾatuṣaddiqu
    'l-ḥimāra wa-tukaḏḏibunī?

Figure 6: Arabic text with simultaneous transliteration.
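
The input that produced Figure 6 is not reproduced in this paper. As a minimal sketch, assuming that the two switches \arabtrue and \transtrue from Figure 4 may simply be enabled together, the first passage could be requested as follows. The header, the commands and the coded text are all copied from Figure 4; whether this yields exactly the interleaved layout of Figure 6 is an assumption and not something the listing in Figure 4 confirms.

```latex
\documentstyle[11pt,dina4,arabtex,atrans]{article}
\setarab        % language-specific processing, as in Figure 4
\vocalize       % indicate short vowels by diacritics
\begin{document}

\arabtrue       % arabic writing on
\transtrue      % transliteration on as well (assumed to combine)

\begin{arabtext}
'at_A .sadIquN 'il_A ^gu.hA ya.tlubu minhu .himarahu li-yarkabahu
fI safraTiN qa.sIraTiN wa-qAla lahu : sawfa 'u`Iduhu 'ilayka
fI al-masA'i , wa-'adfa`u laka 'u^graTaN .
\end{arabtext}

\end{document}
```

The point of the sketch is that the coded text itself does not change: the same input serves the Arabic writing, the transliteration, or both, and only the switch settings differ.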

4 Availability and First Experiences

ArabTeX is freely available for scientific and private use, without any guarantee of correctness, and without any explicit or implied warranty. It can be obtained via FTP from ifi.informatik.uni-stuttgart.de (129.69.211.1), directory pub/arabtex. Prospective users without Internet access should contact the author.

A first version of ArabTeX was distributed on the Internet in August 1991. Subsequently many error reports and suggestions for improvement reached us, which led to a second, much improved and expanded version in May 1992.

The system is still at an experimental stage; yet about 500 persons and institutions have downloaded it so far, and at present about a dozen errors are known and being corrected. We know of a few institutions using the system for production work; e.g., the American Arab Scientific Society is typesetting its quarterly newsletter using ArabTeX.

5 Acknowledgments

The development of ArabTeX would not have been possible without the assistance of many people. Apart from our local team, Udo Merkel and Heribert Schlebbe, helpful advice came, among others, from Ivan Derzhansky, Wolfdietrich Fischer, Ahmed El-Hadi, Abdelsalam Heddaya, Iqbal Khan, Tom Koornwinder, Eberhard Krüger, Asif Lakehsar, Jan Lodder, Richard Lorch, Eberhard Mattes, and Bernd Raichle. We also have to thank the many users who sent error reports, comments, and suggestions.

References

[DIN31635] DIN 31 635: Umschrift des Arabischen Alphabets, Deutsches Institut für Normung e.V., 1982.

[ISO/R233] ISO/R 233 - 1961: International System for the Transliteration of Arabic Characters, International Standards Institution, 1961.

[Knuth84] Donald E. Knuth, The TeXbook, Volume A of Computers & Typesetting, Addison-Wesley, Reading, Mass., 1984.

[Lagally92a] Klaus Lagally, ArabTeX, a System for Typesetting Arabic, User manual. Report 1992/06, Fakultät Informatik, Universität Stuttgart, 1992.

[Lagally92b] Klaus Lagally, ArabTeX - Typesetting Arabic with Vowels and Ligatures, in: EuroTeX '92, Proceedings of the 7th European TeX Conference, Prague, Czechoslovakia, September 14-18, 1992. Also available as: Report 1992/07, Fakultät Informatik, Universität Stuttgart, 1992.

[Lagally92c] Klaus Lagally, ArabTeX, a System for Typesetting Arabic, Paper to be presented at the 3rd International Conference and Exhibition on Multi-lingual Computing (Arabic and Roman Script), University of Durham, UK, December 10-12, 1992. Report 1992/11, Fakultät Informatik, Universität Stuttgart, 1992.

[Lamport86] Leslie Lamport, LaTeX, a Document Preparation System, Addison-Wesley, Reading, Mass., 1986.

Author's address:
Prof. Klaus Lagally
Institut für Informatik
Universität Stuttgart
Breitwiesenstraße 20-22
W-7000 Stuttgart 80
GERMANY
lagally@informatik.uni-stuttgart.de