%**start of header
\input tugbot.sty
\vol 0, 0.
\issdate ????????, 198x.
\issueseqno=00
\twocol
%**end of header
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% tb20mackay.tex

\title Turkish Hyphenations for \TeX
\\Pierre A. Mackay
\endx

\pagexref{mackay}

Turkish belongs to the class of agglutinative languages, which
means that it expresses syntactic relations between words through
discrete suffixes, each of which conveys a single idea such as
plurality or case in nouns, and plurality, person, tense, voice
or any of the other possibilities in verbs. Since each suffix is
a distinct syllable (occasionally more than one syllable), Turkish
sentences are likely to contain a high proportion of long multi-syllable
words, and to need an efficient system of hyphenation for typesetting.
Owing to the long association of almost every Turkic-language region
with Islam, certain conventions of the language have been deeply
influenced by Arabic orthographic habits, and among these is the
syllabification scheme on which a system of hyphenation is built.

According to the syllabification pattern of Arabic, a syllable
is assumed always to
consist of an initial consonant (even when that consonant is no longer
written) and to terminate in a vowel {\tt -cv-} or in the next unvowelled
consonant {\tt -cvc-}. This pattern is followed so absolutely that it is
permitted to break up native Turkish suffixes. The plural suffix
\hbox{\it -ler-} will be hyphenated as {\it -le-rine\/} in an environment
where the {\tt -cv-cv-cv} pattern predominates.
A syllabic division of
{\it\c cektirilebilecek\/} provides six places for
hyphenation {\it \c cek-ti-ri-le-bi-le-cek}, while a
morphological division of the word would produce only five
{\it \c cek-tir-il-e-bil-ecek}.\footnote{$^*$}{The word is
a future participle, and describes something as being
capable of being extracted at some time in the future\Dash like a tooth.}

There are almost no exceptions to this pattern. Words which
appear to begin with a vowel, like {\it et-mek}, can also be
described as beginning with the now suppressed half-consonant
{\it hamza}. Widely sanctioned orthographic irregularities like
{\it brak-mak\/} can be found in stricter orthography as {\it b\i-rak-mak}.
The only universally practiced violation of the rule is associated
with the word {\it T\"urk}, in which the {\it -rk-} combination is
inseparable, and contributes to several of
the very few three-consonant clusters
regularly used in the language: {\it T\"urk\c ce}, {\it T\"urkler}.
One other significant consonant cluster occurs in the suffix
{\it [i]m-trak}.

The Ottoman Texts Project at the University of Washington has
undertaken the development of a set of editing and typesetting
tools for the production of texts in modern Latin-letter Turkish,
using the full range of diacriticals needed for scholarly editions
of historic Arabic-script manuscripts. Because we wish to work
in cooperation with scholars in Turkey, who are most likely to
have access to unmodified versions of \TeX, we have chosen
a font-based adaptation of the \TeX\ environment, which will require
no alterations in the program. The work on fonts is largely complete,
and one of the last major efforts necessary is the creation of
a Turkish hyphenation table.

The obvious way to create such a table in the \TeX\ environment is to
run a list of correctly hyphenated words through {\tt Patgen}, but
it is not always easy to find such a list.
English and German dictionaries
quite commonly provide hyphenation patterns, but the dictionaries of
the Romance languages rarely do, and in Turkish, the hyphenation pattern
is so obvious that the production of such a list is viewed as an
unimaginable waste of time. Rather than try to scan a Turkish
word-list and supply hyphens, we have taken advantage of the strict formalism
of the patterns and generated the Turkish hyphenation file by
program.

Turkish orthography uses a very large number of accented characters.
The Latin-letter character set which has been in
use since the orthographic reform of 1928 is extended, even in Modern
Turkish, by means of a considerable number of diacriticals and accents. A
diligent search through the modern dictionary will produce several
five- and six-letter words in which every character is accented, and an
intensive search might come up with words as much as nine letters long
with every character accented. In critical editions of Ottoman texts,
the number of accents more than doubles. Modern Turkish knows only
the accented and unaccented pair of letters `{\bf s}' and `{\bf\c s}', but
Ottoman Turkish has `{\bf s}', `{\bf\c s}', `{\bf\d s}' and `{\bf\b s}', which
represent four completely distinct characters in the Arabic alphabet.
The letter `{\bf h}' shows almost as much variety, and so do several
others. Our Ottoman Turkish font has twenty-seven accent and letter
composites, in addition to the basic twenty-six simple Latin letters.
Moreover, all composites can exist in upper case forms as well as in
lower case. To accommodate these composite characters in the normal
{\ninerm ASCII} character set,
we use an input coding convention in which accented
letters are treated as a class of ligatures, and three characters from the
{\ninerm ASCII} symbol set are borrowed for use as postpositive
pseudo-letters, to trigger the selection of accented letters in
the Turkish fonts.
The three symbols are the exclamation
point `{\tt!}', the equals sign `{\tt=}', and the colon `{\tt:}'.

The choice of these symbols is
based on a proposal made more than ten years ago at the Orientalist
Congress held in Paris, in 1974. Owing to the extraordinary richness
of the Ottoman Turkish character set, it has been necessary to extend
the old proposal, but it still retains the original principles, which
are closely associated with the coding scheme used by the Onomasticon
Arabicum project, which is coordinated at the Centre National de la Recherche
Scientifique in Paris. (The Onomasticon Arabicum uses a post-positive dot
and a post-positive hyphen to indicate diacriticals, which is
acceptable in a data-base of names, but not in continuous prose text.)
The current set of conventions, using (|! = :|),
produces an input file which can, if
necessary, be edited on an ordinary terminal lacking any special
Turkish character features, and which a Turkish speaker can become
accustomed to without too much difficulty. When coupled with a
well-designed macro file and a rewritten hyphenation table, it
provides the possibility of naturalizing a \TeX\ environment into
Turkish without any large investment in special purpose hardware and
rewritten versions of non-standard (non-)\TeX.

The exclamation point is used for all the ``emphatic'' letters of the
Arabic alphabet (the alphabet in which Turkish was written until 1928).
These are the letters {\it \d Dad\/} (usually pronounced as `{\bf z}' in
Turkish, and hence paired with a non-Arabic letter known as {\it \.Zad\/}),
{\it \d Sad}, {\it \d Ha'}, {\it \d Ta'} and {\it \d Za'}.
The equals sign is used for all the
consonants which are represented in Latin-letter transcriptions by a
letter with a bar under, such as `{\bf\b d}' ({\it dhal\/}), more
commonly written in
Turkish as `{\bf\b z}', and also for vowels with a macron or, following the
Turkish convention, a `hat' accent. It is used as well for similar forms
such as the cupped `{\bf\u g}', chosen because the equals sign is visually
closer to them than the colon
is. (Moreover, the colon is needed for a different variety of the
letter `{\bf g}'.) The colon is a catch-all for everything else, but works
out rather well visually, as it happens. The three post-positives are
not accents, but regular characters, which use the \TeX\ convention of
ligatures to invoke accented characters from the font, just as the
second `{\bf f$\,$}' in the normal \TeX\ `{\bf ff$\,$}'
ligature pair does. If a standard
Latin-letter character does not have an associated ligature table in
the font, a following diacritical postpositive will be unaffected.
Thus, the letter `{\bf o}',
when followed by a colon, will produce `{\bf\"o}', but the letter `{\bf e}',
when followed by a colon, will produce `{\bf e:}'. The equals sign retains
its normal function in math mode because the math font {\tt TFM} files
do not call it into ligature pairings, and the colon and exclamation point
can be invoked by the command sequences {\tt\bs:} and {\tt\bs bang}
when the simple character will not work.

Since the hyphenation evaluation loop in \TeX\ dismantles all ligatures
before it looks for acceptable hyphenation positions, it will have
to accept the post-positive symbols (|! = :|) as part of the alphabet,
so each of these symbols receives its own value as an |\lccode|.
The full Turkish-\TeX\ alphabet is:

{\advance\baselineskip by 3pt
\obeylines
a \^a e \i\ i \^\i\ o \"o \^o u \"u
` ' b c d f g h j k l m n p r s t v y z
\d d \d h \d k \d s \d t \d z
\b d \u g \b h \~n \b s \b t \b z
\c c \.g \c s \.z
\par }
\smallskip
In the hyphenation loop of \TeX, these characters resolve into
the set:

{\advance\baselineskip by 3pt
\obeylines
|! = : @ # a b c d e f g h i|
|j k l m n o p r s t u v y z|
\par }
\smallskip
\noindent and it is this latter set only which will appear in the
hyphenation patterns. The dotted {\tt i} in the above list really
stands for the Turkish undotted `{\bf\i}'. The input code convention
for Turkish uses {\tt i:} for the Turkish dotted `{\bf i}'. The {\tt @}
sign stands for the Arabic letter {\it hamza} and the {\tt \#}
stands for {\it `\kern-1.5pt ayn}. To avoid conflict with
{\tt plain.tex} uses of these two characters, they appear explicitly
only in the hyphenation pattern file. Turkish text input uses
|\`| to generate |\char'43| ({\it `\kern-1.5pt ayn}) and |\'|
to generate |\char'100| ({\it hamza}).

We begin constructing the table by considering the pseudo-letters
(|! = :|). Since these are used exclusively in ligature pairs,
no hyphenation is ever permissible between them and the preceding
letter. In the hyphenation code, odd values permit hyphenation
and even values prohibit it,
so we give the highest possible even value (8) to the region
preceding each pseudo-letter. The pseudo-letters can follow
both vowels and consonants, so hyphenation will often, but not
always, be possible after them. We give that region the lowest
possible odd value (1) to show that hyphens are permitted here.
\smallskip
\centerline{\tt 8!1 8=1 8:1}
\smallskip

In strict orthography, a vowel cannot be separated from the
preceding consonant, and the few apparent instances of hyphenation
between two adjacent vowels (suppressed consonant) can be treated later.
In all normal instances a vowel cannot accept a hyphen in the
preceding region and will probably accept one in the following region,
so the vowels are set thus.
\smallskip
\centerline{\tt 2a1 2e1 2i1 2o1 2u1}
\smallskip

A consonant may begin a {\tt -cv-} sequence or end a {\tt -cvc-} sequence,
so we give it a 1 on either side:
\smallskip
\centerline{\tt 1b1 $\;\ldots\;$ 1z1}
\smallskip

This simple lot of patterns will provide for all normal {\tt -cv-}
instances such as
\smallskip
\centerline{\tt1h1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8=1\ \ }
\centerline{\tt\ \ \ \ 2a1}
\centerline{\tt1h8=2a1}
\smallskip
\noindent which will result in the sequence {\tt-h=a-}, with hyphens fore
and aft.

The next group of patterns controls hyphenation at the end
of words. \TeX\ will usually not break off two-letter fragments
in its hyphenation loop, but owing to the nature of the input coding
we have chosen, it may see a three- or four-letter sequence where
a two-letter result is intended. We do not want to find
{\it l\"u}, {\it\c c\"u\/} and {\it si\/} isolated at the beginning
of a line, nor do we really want the {\it cek\/} of {\it -ecek\/}
broken off if it is at the end of a word. To prevent hyphenations of
this sort, the program generates all possible patterns of the type:
\smallskip
\centerline{|2ba=.| $\;\ldots\;$ |2z:u:.|}
\smallskip
\noindent using the conventional |.| for end-of-word.
The resultant list includes sequences that are phonetically impossible in
Turkish, but these take up so little additional space in the file that
they can be left there.
The pattern |2e2cek.|\kern -1.5pt is added as a special case.

The break after {\tt -cvc-} syllables is almost taken care of:
\smallskip
\centerline{\tt1h1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8=1\ \ }
\centerline{\tt\ \ \ \ 1h1}
\centerline{\tt1h8=1h1}
\smallskip
\noindent but it makes the thoroughly undesirable {\tt -cv-ccv-}
sequence as acceptable as the correct {\tt -cvc-cv-} sequence.
To prevent this error, all possible Turkish
two-consonant sequences (e.g.\ {\tt h=h=}${}\rightarrow{}$`{\bf\b h\b h}')
are covered by patterns such as {\tt2h=h=}, in which the value 2
will override the 1 after the preceding vowel.

The few undesirable hyphenations at the beginning of
words which appear to start with a vowel are prevented
by generating the patterns
|.a=2| through |.u:2|.
Similarly, the few instances where an apparent {\tt -cv-v-} hyphenation
stands for {\tt -cv-[c]v-} can be allowed by adding the full range of patterns
|a3a2| through |u:3u:2|, which includes a large number of impossible
pairings.

The last patterns to be added are |m1t4ra4k| and |t2u8:2r4k1|. At the price of
slightly excessive strictness (the prohibition against the {\tt r-k}
division is only valid when the word begins with an upper-case {\tt T})
we can ensure that {\it T\"urk\/} always stays in one piece.
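
Gathered into a single file, the pattern groups described above might be
loaded along the following lines. This is only a sketch, and it rests on
some assumptions: the |\lccode| assignments are written in the ordinary
{\tt plain.tex} manner, only patterns quoted explicitly in this article
appear (for the consonants, just the two endpoints of the range),
and the generated ranges, as well as the patterns involving {\tt @}
and {\tt \#}, are left out.
\smallskip
{\advance\baselineskip by 3pt
\obeylines
|% sketch only, not the generated file|
|\lccode`\!=`\! \lccode`\==`\=|
|\lccode`\:=`\:|
|\patterns{|
|8!1 8=1 8:1|
|2a1 2e1 2i1 2o1 2u1|
|1b1 1z1|
|2e2cek.|
|m1t4ra4k t2u8:2r4k1|
|}|
\par }
\smallskip
\noindent A file of this kind would have to be read while the format is
being built, since |\patterns| is accepted only by {\tt INITEX}.
Afterwards, the |\showhyphens| macro of {\tt plain.tex} is one way to
inspect the break points found for a test word.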

Files of this sort, when generated by program, tend to be
larger than hand-worked files, but if it seems that
all the redundancies mentioned above might
be seriously wasteful of space, consider the following statistics:
\smallskip
\settabs 4 \columns
{\advance\baselineskip by 3pt
\+&\hfil{}Entries\hfil&\hfil{}Trie size\hfil&\hfil{}Ops\hfil\cr
\+\quad English&\hfil4460\hfil&\hfil5492\hfil&\hfil181\hfil\cr
\+\quad Turkish&\hfil1840\hfil&\hfil\hphantom{0}616\hfil&\hfil\hphantom{0}16\hfil\cr
}\smallskip

The format file that makes use of this set of patterns will no
longer serve very well for English language \TeX. The font-based
solution to foreign-language typesetting is definitely monolingual,
since only one {\tt hyphen.tex} file can be read in at a time.
A multilingual system, good for both English and Turkish, would
require modifications of the program code.
This simple solution, however, will be quite satisfactory in a
purely Turkish environment, and can be made even more successful
by taking the {\tt tex.pool} file and translating it all into
Turkish.

\endinput

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bye