%**start of header
\input tugbot.sty
\vol 0, 0.
\issdate ????????, 198x.
\issueseqno=00
\twocol
%**end of header
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%		tb20mackay.tex

\title Turkish Hyphenations for \TeX
\\Pierre A. Mackay
\endx

\pagexref{mackay}

Turkish belongs to the class of agglutinative languages, which
means that it expresses syntactic relations between words through
discrete suffixes, each of which conveys a single idea such as
plurality or case in nouns, and plurality, person, tense, voice
or any of the other possibilities in verbs.  Since each suffix is
a distinct syllable (occasionally more than one syllable), Turkish
sentences are likely to contain a high proportion of long multi-syllable
words, and to need an efficient system of hyphenation for typesetting.
Owing to the long association of almost every Turkic-language region
with Islam, certain conventions of the language have been deeply
influenced by Arabic orthographic habits, and among these is the
syllabification scheme on which a system of hyphenation is built.

According to the syllabification pattern of Arabic, a syllable
is assumed always to
consist of an initial consonant (even when that consonant is no longer
written) and to terminate in a vowel {\tt -cv-} or in the next unvowelled
consonant {\tt -cvc-}.  This pattern is followed so absolutely that it is
permitted to break up native Turkish suffixes.  The plural suffix
\hbox{\it -ler-} will be hyphenated as {\it -le-rine\/} in an environment
where the {\tt -cv-cv-cv} pattern predominates. A syllabic division of
{\it\c cektirilebilecek\/} provides six places for
hyphenation {\it \c cek-ti-ri-le-bi-le-cek}, while a
morphological division of the word would produce only five
{\it \c cek-tir-il-e-bil-ecek}.\footnote{$^*$}{The word is
a future participle, and describes something as being
capable of being extracted at some time in the future\Dash like a tooth.}

There are almost no exceptions to this pattern.  Words which
appear to begin with a vowel, like {\it et-mek}, can also be
described as beginning with the now suppressed half-consonant
{\it hamza}.  Widely sanctioned orthographic irregularities like
{\it brak-mak\/} can be found in stricter orthography as {\it b\i-rak-mak}.
The only universally practiced violation of the rule is associated
with the word {\it T\"urk}, in which the {\it -rk-} combination is
inseparable, and contributes to several of
the very few three-consonant clusters
regularly used in the language---{\it T\"urk\c ce}, {\it T\"urkler}.
One other significant consonant cluster occurs in the suffix
{\it [i]m-trak}.

The Ottoman Texts Project at the University of Washington has
undertaken the development of a set of editing and typesetting
tools for the production of texts in modern Latin-letter Turkish,
using the full range of diacriticals needed for scholarly editions
of historic Arabic-script manuscripts.  Because we wish to work
in cooperation with scholars in Turkey, who are most likely to
have access to unmodified versions of \TeX, we have chosen
a font-based adaptation of the \TeX\ environment, which will require
no alterations in the program.  The work on fonts is largely complete,
and one of the last major efforts necessary is the creation of
a Turkish hyphenation table.

The obvious way to create such a table in the \TeX\ environment is to
run a list of correctly hyphenated words through {\tt Patgen}, but
it is not always easy to find such a list.  English and German dictionaries
quite commonly provide hyphenation patterns, but the dictionaries of
the Romance languages rarely do, and in Turkish, the hyphenation pattern
is so obvious that the production of such a list is viewed as an
unimaginable waste of time.  Rather than try to scan a Turkish
word-list and supply hyphens, we have taken advantage of the strict formalism
of the patterns and generated the Turkish hyphenation file by
program.

Turkish orthography uses a very large number of accented characters.
The Latin-letter character set which has been in
use since the orthographic reform of 1928 is extended, even in Modern
Turkish, by means of a considerable number of diacriticals and accents.  A
diligent search through the modern dictionary will produce several
five- and six-letter words in which every character is accented, and an
intensive search might come up with words as much as nine letters long
with every character accented.  In critical editions of Ottoman texts,
the number of accents more than doubles.  Modern Turkish knows only
the accented and unaccented pair of letters `{\bf s}' and `{\bf\c s}', but
Ottoman Turkish has `{\bf s}', `{\bf\c s}', `{\bf\d s}' and `{\bf\b s}', which
represent four completely distinct characters in the Arabic alphabet.
The letter `{\bf h}' shows almost as much variety, and so do several
others.  Our Ottoman Turkish font has twenty-seven accent and letter
composites, in addition to the basic twenty-six simple Latin letters.
Moreover, all composites can exist in upper case forms as well as in
lower case.  To accommodate these composite characters in the normal
{\ninerm ASCII} character set,
we use an input coding convention in which accented
letters are treated as a class of ligatures, and three characters from the
{\ninerm ASCII} symbol set are borrowed for use as postpositive
pseudo-letters, to trigger the selection of accented letters in
the Turkish fonts.  The three symbols are the exclamation
point `{\tt!}', the equals sign `{\tt=}', and the colon `{\tt:}'.

The choice of these symbols is
based on a proposal made more than ten years ago at the Orientalist
Congress held in Paris, in 1974.  Owing to the extraordinary richness
of the Ottoman Turkish character set, it has been necessary to extend
the old proposal, but it still retains the original principles, which
are closely associated with the coding scheme used by the Onomasticon
Arabicum project, which is coordinated at the Centre National de la Recherche
Scientifique in Paris.  (The Onomasticon Arabicum uses a post-positive dot
and a post-positive hyphen to indicate diacriticals, which is
acceptable in a data-base of names, but not in continuous prose text.)
The current set of conventions, using (|! = :|),
produces an input file which can, if
necessary, be edited on an ordinary terminal lacking any special
Turkish character features, and which a Turkish speaker can become
accustomed to without too much difficulty.  When coupled with a
well-designed macro file and a rewritten hyphenation table, it
provides the possibility of naturalizing a \TeX\ environment into
Turkish without any large investment in special purpose hardware and
rewritten versions of non-standard (non-)\TeX.

The exclamation point is used for all the ``emphatic'' letters of the
Arabic alphabet (the alphabet in which Turkish was written until 1928).
These are the letters {\it \d Dad\/} (usually pronounced as `{\bf z}' in
Turkish, and hence paired with a non-Arabic letter known as {\it \.Zad\/}),
{\it \d Sad}, {\it \d Ha'}, {\it \d Ta'} and {\it \d Za'}.
The equals sign is used for all the
consonants which are represented in Latin-letter transcriptions by a
letter with a bar under, such as `{\bf\b d}' ({\it dhal\/}), more
commonly written in
Turkish as `{\bf\b z}', and also for vowels with a macron or, following the
Turkish convention, a `hat' accent.  It also serves for similar forms such
as the cupped `{\bf\u g}', where the equals sign is visually closer to the
accent than the colon is.  (Moreover, the colon is needed for a different
variety of the
letter `{\bf g}'.)  The colon is a catch-all for everything else, but works
out rather well visually, as it happens.  The three post-positives are
not accents, but regular characters, which use the \TeX\ convention of
ligatures to invoke accented characters from the font, just as the
second `{\bf f$\,$}' in the normal \TeX\ `{\bf ff$\,$}'
ligature pair does.  If a standard
Latin-letter character does not have an associated ligature table in
the font, a following diacritical postpositive will be unaffected.
Thus, the letter `{\bf o}',
when followed by a colon, will produce `{\bf\"o}', but the letter `{\bf e}',
when followed by a colon, will produce `{\bf e:}'.  The equals sign retains
its normal function in math mode because the math font {\tt TFM} files
do not call it into ligature pairings, and the colon and exclamation point
can be invoked by the command sequences {\tt\bs:} and {\tt\bs bang}
when the simple character will not work.

Since the hyphenation evaluation loop in \TeX\ dismantles all ligatures
before it looks for acceptable hyphenation positions, it will have
to accept the post-positive symbols (|! = :|) as part of the alphabet,
so each of these symbols receives its own value as an |\lccode|.
The full Turkish-\TeX\ alphabet is:

{\advance\baselineskip by 3pt
\obeylines
a \^a e \i\ i \^\i\ o \"o \^o u \"u
` ' b c d f g h j k l m n p r s t v y z
\d d \d h \d k \d s \d t \d z
\b d \u g \b h \~n \b s \b t \b z
\c c \.g \c s \.z
\par }
\smallskip
In the hyphenation loop of \TeX, these characters resolve into
the set:

{\advance\baselineskip by 3pt
\obeylines
|! = : @ # a b c d e f g h i|
|j k l m n o p r s t u v y z|
\par }
\smallskip
\noindent and it is this latter set only which will appear in the
hyphenation patterns.  The dotted {\tt i} in the above list really
stands for the Turkish undotted `{\bf\i}'.  The input code convention
for Turkish uses {\tt i:} for the Turkish `{\bf i}'.  The {\tt @}
sign stands for the Arabic letter {\it hamza} and the {\tt \#}
stands for {\it `\kern-1.5pt ayn}.  To avoid conflict with
{\tt plain.tex} uses of these two characters, they appear explicitly
only in the hyphenation pattern file.  Turkish text input uses
|\`| to generate |\char'43| ({\it `\kern-1.5pt ayn}) and |\'|
to generate |\char'100| ({\it hamza}).
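\smallskip
Giving each post-positive symbol its own value as an |\lccode| can be
sketched as follows.  This is an illustration only, not an extract from
the project's macro file, and it covers just the three symbols
(|! = :|); how {\tt @} and {\tt \#} are set up is left open here.

{\advance\baselineskip by 3pt
\obeylines
|\lccode`\! = `\!|
|\lccode`\= = `\=|
|\lccode`\: = `\:|
\par }
\smallskip
\noindent Because \TeX\ treats a character with a nonzero |\lccode|
as a letter when it looks for hyphenation positions, assignments of
this kind should be enough to bring the three symbols into the alphabet
seen by the hyphenation loop.
\smallskip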

We begin constructing the table by considering the pseudo-letters
(|! = :|).  Since these are used exclusively in ligature pairs,
no hyphenation is ever permissible between them and the preceding
letter.  Odd values in the hyphenation code permit hyphenation
and even values prohibit it,
so we give the highest possible even value (8) to the region
preceding each pseudo-letter.  The pseudo-letters can follow
both vowels and consonants, so hyphenation will often, but not
always, be possible after them.  We give that region the lowest
possible odd value (1) to show that hyphens are permitted here.
\smallskip
\centerline{\tt 8!1 8=1 8:1}
\smallskip

In strict orthography, a vowel cannot be separated from the
preceding consonant, and the few apparent instances of hyphenation
between two adjacent vowels (suppressed consonant) can be treated later.
In all normal instances a vowel cannot accept a hyphen in the
preceding region and will probably accept one in the following region,
so the vowels are set thus.
\smallskip
\centerline{\tt 2a1 2e1 2i1 2o1 2u1}
\smallskip

A consonant may begin a {\tt -cv-} sequence or end a {\tt -cvc-} sequence,
so we give it a 1 on either side:
\smallskip
\centerline{\tt 1b1 $\;\ldots\;$ 1z1}
\smallskip

This simple lot of patterns will provide for all normal {\tt -cv-}
instances such as
\smallskip
\centerline{\tt1h1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8=1\ \ }
\centerline{\tt\ \ \ \ 2a1}
\centerline{\tt1h8=2a1}
\smallskip
\noindent which will result in the sequence {\tt-h=a-}, with hyphens fore
and aft.
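\smallskip
Collected into a pattern file, the three groups introduced so far might
open as sketched below.  The layout is an assumption made for
illustration, the consonant entries are simply read off the resolved
alphabet listed earlier, and whether {\tt @} and {\tt \#} receive
entries of the same form is not specified.  (The |\patterns| primitive
is accepted only by {\tt INITEX}.)

{\advance\baselineskip by 3pt
\obeylines
|\patterns{|
|8!1 8=1 8:1|
|2a1 2e1 2i1 2o1 2u1|
|1b1 1c1 1d1 1f1 1g1 1h1|
|1j1 1k1 1l1 1m1 1n1 1p1|
|1r1 1s1 1t1 1v1 1y1 1z1|
|% further generated groups follow|
|}|
\par }
\smallskip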

The next group of patterns controls hyphenation at the end
of words.  \TeX\ will usually not break off two-letter fragments
in its hyphenation loop, but owing to the nature of the input coding
we have chosen, it may see a three- or four-letter sequence where
a two-letter result is intended.  We do not want to find
{\it l\"u}, {\it\c c\"u\/} and {\it si\/} isolated at the beginning
of a line, nor do we really want the {\it cek\/} of {\it -ecek\/}
broken off if it is at the end of a word. To prevent hyphenations of
this sort, the program generates all possible patterns of the type:
\smallskip
\centerline{|2ba=.| $\;\ldots\;$ |2z:u:.|}
\smallskip
\noindent using the conventional |.| for end-of-word.
The resultant list includes sequences that are phonetically impossible in
Turkish but these take up so little additional space in the file that
they can be left there.  The pattern |2e2cek.|\kern -1.5pt is added as a special case.
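\smallskip
A worked instance, set out in the manner of the earlier display, may
help.  Assuming that the generated list includes |2lu:.| for the
suffix {\it l\"u\/} (an entry of the type just described, though not
quoted from the actual file), the values combine at the end of a word
as follows:
\smallskip
\centerline{\tt1l1\ \ \ \ \ }
\centerline{\tt\ \ 2u1\ \ \ }
\centerline{\tt\ \ \ \ 8:1\ }
\centerline{\tt2l\ u\ :\ .}
\centerline{\tt2l2u8:1.}
\smallskip
\noindent The 2 supplied by |2lu:.| overrides the 1 that {\tt 1l1}
would otherwise place before the {\tt l}, so the three input characters
that stand for {\it l\"u\/} are not split off.  The trailing 1 falls
at the word boundary and has no effect.
\smallskip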

The break after {\tt -cvc-} syllables is almost taken care of:
\smallskip
\centerline{\tt1h1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8=1\ \ }
\centerline{\tt\ \ \ \ 1h1}
\centerline{\tt1h8=1h1}
\smallskip
\noindent but it makes the thoroughly undesirable {\tt -cv-ccv-}
sequence as acceptable as the correct {\tt -cvc-cv-} sequence.
To prevent this error, all possible Turkish
two-consonant sequences (e.g.\ {\tt h=h=}${}\rightarrow{}$`{\bf\b h\b h}')
are covered by patterns such as {\tt2h=h=}, in which the value 2
will override the 1 after the preceding vowel.
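\smallskip
The effect can be traced on the letters {\tt ekti} of
{\it\c cektirilebilecek}.  The pair pattern {\tt 2kt} used below is an
assumption: it is the unaccented analogue of the {\tt2h=h=} pattern
cited above, not an entry quoted from the generated file.
\smallskip
\centerline{\tt2e1\ \ \ \ \ \ }
\centerline{\tt\ \ 1k1\ \ \ \ }
\centerline{\tt\ \ 2k\ t\ \ \ }
\centerline{\tt\ \ \ \ 1t1\ \ }
\centerline{\tt\ \ \ \ \ \ 2i1}
\centerline{\tt2e2k1t2i1}
\smallskip
\noindent Between the two vowels the only odd value lies between
{\tt k} and {\tt t}, which gives the desired {\it -ek-ti-}.  Without
{\tt 2kt} the value between {\tt e} and {\tt k} would be 1 as well.
\smallskip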

The few undesirable hyphenations at the beginning of
words which appear to start with a vowel are prevented
by generating the patterns
|.a=2| through |.u:2|.  Similarly,
the few instances where an apparent {\tt -cv-v-} hyphenation
stands for {\tt -cv-[c]v-} can be allowed by adding the full range of patterns
|a3a2| through |u:3u:2|, which includes a large number of impossible
pairings.

The last patterns to be added are |m1t4ra4k| and |t2u8:2r4k1|.  At the price of
slightly excessive strictness (the prohibition against the {\tt r-k}
division is only valid when the word begins with an upper-case {\tt T})
we can ensure that {\it T\"urk\/} always stays in one piece.
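\smallskip
Traced in the same manner, and using only patterns named in the text
(the two-consonant patterns are left out for brevity, and are assumed
not to raise any value above those shown), the second of these
patterns acts on {\tt tu:rk} as follows:
\smallskip
\centerline{\tt1t1\ \ \ \ \ \ \ \ }
\centerline{\tt\ \ 2u1\ \ \ \ \ \ }
\centerline{\tt\ \ \ \ 8:1\ \ \ \ }
\centerline{\tt\ \ \ \ \ \ 1r1\ \ }
\centerline{\tt\ \ \ \ \ \ \ \ 1k1}
\centerline{\tt\ t2u8:2r4k1}
\centerline{\tt1t2u8:2r4k1}
\smallskip
\noindent Every value between the first and last letters is even, so
no hyphen can fall inside {\it T\"urk}, while the final 1 leaves a
break possible after the {\tt k}.
\smallskip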

Files of this sort, when generated by program, tend to be
larger than hand-worked files, but if it seems that
all the redundancies mentioned above might
be seriously wasteful of space, consider the following statistics:
\smallskip
\settabs 4 \columns
{\advance\baselineskip by 3pt
\+&\hfil{}Entries\hfil&\hfil{}Trie size\hfil&\hfil{}Ops\hfil\cr
\+\quad English&\hfil4460\hfil&\hfil5492\hfil&\hfil181\hfil\cr
\+\quad Turkish&\hfil1840\hfil&\hfil\hphantom{0}616\hfil&\hfil\hphantom{0}16\hfil\cr
}\smallskip

The format file that makes use of this set of patterns will no
longer serve very well for English-language \TeX.  The font-based
solution to foreign-language typesetting is definitely monolingual,
since only one {\tt hyphen.tex} file can be read in at a time.
A multilingual system, good for both English and Turkish, would
require modifications of the program code.
This simple solution, however, will be quite satisfactory in a
purely Turkish environment, and can be made even more successful
by taking the {\tt tex.pool} file and translating it all into
Turkish.

\endinput

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bye