
%**start of header
\input tugbot.sty
\vol 0, 0.
\issdate ????????, 198x.
\issueseqno=00
\twocol
%**end of header
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%		tb20mackay.tex

\title Turkish Hyphenations for \TeX
\\Pierre A. Mackay
\endx

\pagexref{mackay}

Turkish belongs to the class of agglutinative languages, which
means that it expresses syntactic relations between words through
discrete suffixes, each of which conveys a single idea such as
plurality or case in nouns, and plurality, person, tense, voice
or any of the other possibilities in verbs.  Since each suffix is
a distinct syllable (occasionally more than one syllable), Turkish
sentences are likely to contain a high proportion of long multi-syllable
words, and to need an efficient system of hyphenation for typesetting.
Owing to the long association of almost every Turkic-language region
with Islam, certain conventions of the language have been deeply
influenced by Arabic orthographic habits, and among these is the
syllabification scheme on which a system of hyphenation is built.

According to the syllabification pattern of Arabic, a syllable
is assumed always to
consist of an initial consonant (even when that consonant is no longer
written) and to terminate in a vowel {\tt -cv-} or in the next unvowelled
consonant {\tt -cvc-}.  This pattern is followed so absolutely that it is
permitted to break up native Turkish suffixes.  The plural suffix
\hbox{\it -ler-} will be hyphenated as {\it -le-rine\/} in an environment
where the {\tt -cv-cv-cv} pattern predominates. A syllabic division of
{\it\c cektirilebilecek\/} provides six places for
hyphenation {\it \c cek-ti-ri-le-bi-le-cek}, while a
morphological division of the word would produce only five
{\it \c cek-tir-il-e-bil-ecek}.\footnote{$^*$}{The word is
a future participle, and describes something as being
capable of being extracted at some time in the future\Dash like a tooth.}

There are almost no exceptions to this pattern.  Words which
appear to begin with a vowel, like {\it et-mek}, can also be
described as beginning with the now suppressed half-consonant
{\it hamza}.  Widely sanctioned orthographic irregularities like
{\it brak-mak\/} can be found in stricter orthography as {\it b\i-rak-mak}.
The only universally practiced violation of the rule is associated
with the word {\it T\"urk}, in which the {\it -rk-} combination is
inseparable, and contributes to several of
the very few three-consonant clusters
regularly used in the language---{\it T\"urk\c ce}, {\it T\"urkler}.
One other significant consonant cluster occurs in the suffix
{\it [i]m-trak}.

The Ottoman Texts Project at the University of Washington has
undertaken the development of a set of editing and typesetting
tools for the production of texts in modern Latin-letter Turkish,
using the full range of diacriticals needed for scholarly editions
of historic Arabic-script manuscripts.  Because we wish to work
in cooperation with scholars in Turkey, who are most likely to
have access to unmodified versions of \TeX, we have chosen
a font-based adaptation of the \TeX\ environment, which will require
no alterations in the program.  The work on fonts is largely complete,
and one of the last major efforts necessary is the creation of
a Turkish hyphenation table.

The obvious way to create such a table in the \TeX\ environment is to
run a list of correctly hyphenated words through {\tt Patgen}, but
it is not always easy to find such a list.  English and German dictionaries
quite commonly provide hyphenation patterns, but the dictionaries of
the Romance languages rarely do, and in Turkish, the hyphenation pattern
is so obvious that the production of such a list is viewed as an
unimaginable waste of time.  Rather than try to scan a Turkish
word-list and supply hyphens, we have taken advantage of the strict formalism
of the patterns and generated the Turkish hyphenation file by
program.

Turkish orthography uses a very large number of accented characters.
The Latin-letter character set which has been in
use since the orthographic reform of 1928 is extended, even in Modern
Turkish, by means of a considerable number of diacriticals and accents.  A
diligent search through the modern dictionary will produce several
five- and six-letter words in which every character is accented, and an
intensive search might come up with words as long as nine letters
with every character accented.  In critical editions of Ottoman texts,
the number of accents more than doubles.  Modern Turkish knows only
the accented and unaccented pair of letters `{\bf s}' and `{\bf\c s}', but
Ottoman Turkish has `{\bf s}', `{\bf\c s}', `{\bf\d s}' and `{\bf\b s}', which
represent four completely distinct characters in the Arabic alphabet.
The letter `{\bf h}' shows almost as much variety, and so do several
others.  Our Ottoman Turkish font has twenty-seven accent and letter
composites, in addition to the basic twenty-six simple Latin letters.
Moreover, all composites can exist in upper case forms as well as in
lower case.  To accommodate these composite characters in the normal
{\ninerm ASCII} character set,
we use an input coding convention in which accented
letters are treated as a class of ligatures, and three characters from the
{\ninerm ASCII} symbol set are borrowed for use as postpositive
pseudo-letters, to trigger the selection of accented letters in
the Turkish fonts.  The three symbols are the exclamation
point `{\tt!}', the equals sign `{\tt=}', and the colon `{\tt:}'.

The choice of these symbols is
based on a proposal made more than ten years ago at the Orientalist
Congress held in Paris, in 1974.  Owing to the extraordinary richness
of the Ottoman Turkish character set, it has been necessary to extend
the old proposal, but it still retains the original principles, which
are closely associated with the coding scheme used by the Onomasticon
Arabicum project, which is coordinated at the Centre National de la Recherche
Scientifique in Paris.  (The Onomasticon Arabicum uses a post-positive dot
and a post-positive hyphen to indicate diacriticals, which is
acceptable in a data-base of names, but not in continuous prose text.)
The current set of conventions, using (|! = :|),
produces an input file which can, if
necessary, be edited on an ordinary terminal lacking any special
Turkish character features, and which a Turkish speaker can become
accustomed to without too much difficulty.  When coupled with a
well-designed macro file and a rewritten hyphenation table, it
provides the possibility of naturalizing a \TeX\ environment into
Turkish without any large investment in special purpose hardware and
rewritten versions of non-standard (non-)\TeX.

The exclamation point is used for all the ``emphatic'' letters of the
Arabic alphabet (the alphabet in which Turkish was written until 1928).
These are the letters {\it \d Dad\/} (usually pronounced as `{\bf z}' in
Turkish, and hence paired with a non-Arabic letter known as {\it \.Zad\/}),
{\it \d Sad}, {\it \d Ha'}, {\it \d Ta'} and {\it \d Za'}.
The equals sign is used for all the
consonants which are represented in Latin-letter transcriptions by a
letter with a bar under, such as `{\bf\b d}' ({\it dhal\/}), more
commonly written in
Turkish as `{\bf\b z}', and also for vowels with a macron or, following the
Turkish convention, a `hat' accent.  It is used as well for similar forms
such as the cupped `{\bf\u g}', where the equals sign is visually closer
to the accent than the colon
is.  (Moreover, the colon is needed for a different variety of the
letter `{\bf g}'.)  The colon is a catch-all for everything else, but works
out rather well visually, as it happens.  The three post-positives are
not accents, but regular characters, which use the \TeX\ convention of
ligatures to invoke accented characters from the font, just as the
second `{\bf f$\,$}' in the normal \TeX\ `{\bf ff$\,$}'
ligature pair does.  If a standard
Latin-letter character does not have an associated ligature table in
the font, a following diacritical postpositive will be unaffected.
Thus, the letter `{\bf o}',
when followed by a colon will produce `{\bf\"o}', but the letter `{\bf e}' when
followed by a colon will produce `{\bf e:}'.  The equals sign retains
its normal function in math mode because the math font {\tt TFM} files
do not call it into ligature pairings, and the colon and exclamation point
can be invoked by the command sequences {\tt\bs:} and {\tt\bs bang}
when the simple character will not work.

Since the hyphenation evaluation loop in \TeX\ dismantles all ligatures
before it looks for acceptable hyphenation positions, it will have
to accept the post-positive symbols (|! = :|) as part of the alphabet,
so each of these symbols receives its own value as an |\lccode|.
The full Turkish-\TeX\ alphabet is:

{\advance\baselineskip by 3pt
\obeylines
a \^a e \i\ i \^\i\ o \"o \^o u \"u
` ' b c d f g h j k l m n p r s t v y z
\d d \d h \d k \d s \d t \d z
\b d \u g \b h \~n \b s \b t \b z
\c c \.g \c s \.z
\par }
\smallskip
In the hyphenation loop of \TeX, these characters resolve into
the set:

{\advance\baselineskip by 3pt
\obeylines
|! = : @ # a b c d e f g h i|
|j k l m n o p r s t u v y z|
\par }
\smallskip
\noindent and it is this latter set only which will appear in the
hyphenation patterns.  The dotted {\tt i} in the above list really
stands for the Turkish undotted `{\bf\i}'.  The input code convention
for Turkish uses {\tt i:} for the Turkish `{\bf i}'.  The {\tt @}
sign stands for the Arabic letter {\it hamza} and the {\tt \#}
stands for {\it `\kern-1.5pt ayn}.  To avoid conflict with
{\tt plain.tex} uses of these two characters, they appear explicitly
only in the hyphenation pattern file.  Turkish text input uses
|\`| to generate |\char'43| ({\it `\kern-1.5pt ayn}) and |\'|
to generate |\char'100| ({\it hamza}).
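
As a rough sketch of what this implies (the exact assignments in the
project files are not reproduced here, so the lines below are only an
assumed form), each symbol can be given itself as its |\lccode|, which
is enough for \TeX\ to treat it as a letter when it hyphenates:

{\advance\baselineskip by 3pt
\obeylines
|\lccode`\!=`\! \lccode`\==`\= \lccode`\:=`\:|
|\lccode'100='100 \lccode'43='43|
\par }
\smallskip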

We begin constructing the table by considering the pseudo-letters
(|! = :|).  Since these are used exclusively in ligature pairs,
no hyphenation is ever permissible between them and the preceding
letter.  In the hyphenation code, odd values permit hyphenation and
even values prohibit it,
so we give the highest possible even value (8) to the region
preceding each pseudo-letter.  The pseudo-letters can follow
both vowels and consonants, so hyphenation will often, but not
always, be possible after them.  We give that region the lowest
possible odd value (1) to show that hyphens are permitted here.
\smallskip
\centerline{\tt 8!1 8=1 8:1}
\smallskip

In strict orthography, a vowel cannot be separated from the
preceding consonant, and the few apparent instances of hyphenation
between two adjacent vowels (suppressed consonant) can be treated later.
In all normal instances a vowel cannot accept a hyphen in the
preceding region and will probably accept one in the following region,
so the vowels are set thus.
\smallskip
\centerline{\tt 2a1 2e1 2i1 2o1 2u1}
\smallskip

A consonant may begin a {\tt -cv-} sequence or end a {\tt -cvc-} sequence,
so we give it a 1 on either side:
\smallskip
\centerline{\tt 1b1 $\;\ldots\;$ 1z1}
\smallskip

This simple lot of patterns will provide for all normal {\tt -cv-}
instances such as
\smallskip
\centerline{\tt1h1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8=1\ \ }
\centerline{\tt\ \ \ \ 2a1}
\centerline{\tt1h8=2a1}
\smallskip
\noindent which will result in the sequence {\tt-h=a-}, with hyphens fore
and aft.

The next group of patterns controls hyphenation at the end
of words.  \TeX\ will usually not break off two-letter fragments
in its hyphenation loop, but owing to the nature of the input coding
we have chosen, it may see a three- or four-letter sequence where
a two-letter result is intended.  We do not want to find
{\it l\"u}, {\it\c c\"u\/} and {\it si\/} isolated at the beginning
of a line, nor do we really want the {\it cek\/} of {\it -ecek\/}
broken off if it is at the end of a word. To prevent hyphenations of
this sort, the program generates all possible patterns of the type:
\smallskip
\centerline{|2ba=.| $\;\ldots\;$ |2z:u:.|}
\smallskip
\noindent using the conventional |.| for end-of-word.
The resultant list includes sequences that are phonetically impossible in
Turkish, but these take up so little additional space in the file that
they can be left there.  The pattern |2e2cek.|\kern -1.5pt is added as a special case.
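
To see how a pattern of this type acts, take the final letters of
{\it s\"utl\"u}, coded {\tt su:tlu:}, and suppose, purely as an
illustration, that {\tt 2lu:.} is among the generated patterns.
At each position \TeX\ keeps the highest value offered by any
matching pattern:
\smallskip
\centerline{\tt1t1\ \ \ \ \ \ } %nine elements
\centerline{\tt\ \ 1l1\ \ \ \ }
\centerline{\tt\ \ \ \ 2u1\ \ }
\centerline{\tt\ \ \ \ \ \ 8:1}
\centerline{\tt\ \ 2l\ u\ :.}
\centerline{\tt1t2l2u8:1}
\smallskip
\noindent The 2 before {\tt l} overrides the 1 supplied by {\tt 1t1}
and {\tt 1l1}, so {\it l\"u\/} cannot be left alone at the beginning
of the next line.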

The break after {\tt -cvc-} syllables is almost taken care of:
\smallskip
\centerline{\tt1h1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8=1\ \ }
\centerline{\tt\ \ \ \ 1h1}
\centerline{\tt1h8=1h1}
\smallskip
\noindent but it makes the thoroughly undesirable {\tt -cv-ccv-}
sequence as acceptable as the correct {\tt -cvc-cv-} sequence.
To prevent this error, all possible Turkish
two-consonant sequences (e.g.\ {\tt h=h=}${}\rightarrow{}$`{\bf\b h\b h}')
are covered by patterns such as {\tt2h=h=}, in which the value 2
will override the 1 after the preceding vowel.
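
As a worked illustration, take the letters {\tt akma} from
{\tt birakmak} ({\it b\i-rak-mak}), and assume that {\tt 2km} is among
the generated two-consonant patterns, by analogy with {\tt 2h=h=}:
\smallskip
\centerline{\tt2a1\ \ \ \ \ \ } %nine elements
\centerline{\tt\ \ 1k1\ \ \ \ }
\centerline{\tt\ \ \ \ 1m1\ \ }
\centerline{\tt\ \ \ \ \ \ 2a1}
\centerline{\tt\ \ 2k\ m\ \ \ }
\centerline{\tt2a2k1m2a1}
\smallskip
\noindent Without the last pattern the result would be
{\tt 2a1k1m2a1}, which allows {\it b\i-ra-kmak\/} as readily as
{\it b\i-rak-mak}; with it, the only break left between these four
letters falls between {\tt k} and {\tt m}.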

The few undesirable hyphenations at the beginning of
words which appear to start with a vowel are prevented
by generating the patterns
|.a=2| through |.u:2| and
similarly, the few instances where an apparent {\tt -cv-v-} hyphenation
stands for {\tt -cv-[c]v-} can be allowed by adding the full range of patterns
|a3a2| through |u:3u:2| which includes a large number of impossible
pairings.
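
A small illustration of the first group, assuming that {\tt .o:2} is
one of the generated patterns: the word {\it \"ol\"um}, coded
{\tt o:lu:m}, begins with what \TeX\ sees as two letters, so its own
refusal to break off a single letter would not stop a hyphen after
the {\it \"o}.
\smallskip
\centerline{\tt2o1\ \ \ \ } %seven elements
\centerline{\tt\ \ 8:1\ \ }
\centerline{\tt\ \ \ \ 1l1}
\centerline{\tt.o\ :2\ \ }
\centerline{\tt2o8:2l1}
\smallskip
\noindent The 2 after the colon overrides the 1 from {\tt 8:1} and
{\tt 1l1}, and the isolated {\it \"o\/} is avoided.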


The last patterns to be added are |m1t4ra4k| and |t2u8:2r4k1|.  At the price of
slightly excessive strictness (the prohibition against the {\tt r-k}
division is only valid when the word begins with an upper-case {\tt T})
we can ensure that {\it T\"urk\/} always stays in one piece.
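
To check the second pattern on a concrete case, consider
{\it T\"urkler}, of which \TeX\ sees the letters {\tt tu:rkl} first.
Only the basic patterns and {\tt t2u8:2r4k1} are shown; two-consonant
patterns of the form described above, if they also match, could add
nothing higher than a 2 at positions that are already even:
\smallskip
\centerline{\tt1t1\ \ \ \ \ \ \ \ \ \ } %thirteen elements
\centerline{\tt\ \ 2u1\ \ \ \ \ \ \ \ }
\centerline{\tt\ \ \ \ 8:1\ \ \ \ \ \ }
\centerline{\tt\ \ \ \ \ \ 1r1\ \ \ \ }
\centerline{\tt\ \ \ \ \ \ \ \ 1k1\ \ }
\centerline{\tt\ \ \ \ \ \ \ \ \ \ 1l1}
\centerline{\tt\ t2u8:2r4k1\ \ }
\centerline{\tt1t2u8:2r4k1l1}
\smallskip
\noindent Between the letters shown, the only odd value falls between
{\tt k} and {\tt l}, which gives {\it T\"urk-ler\/} and leaves the
{\it -rk-} combination intact.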

Files of this sort, when generated by program, tend to be
larger than hand-worked files, but if it seems that
all the redundancies mentioned above might
be seriously wasteful of space, consider the following statistics:
\smallskip
\settabs 4 \columns
{\advance\baselineskip by 3pt
\+&\hfil{}Entries\hfil&\hfil{}Trie size\hfil&\hfil{}Ops\hfil\cr
\+\quad English&\hfil4460\hfil&\hfil5492\hfil&\hfil181\hfil\cr
\+\quad Turkish&\hfil1840\hfil&\hfil\hphantom{0}616\hfil&\hfil\hphantom{0}16\hfil\cr
}\smallskip

The format file that makes use of this set of patterns will no
longer serve very well for English language \TeX.  The font-based
solution to foreign-language typesetting is definitely monolingual,
since only one {\tt hyphen.tex} file can be read in at a time.
A multilingual system, good for both English and Turkish, would
require modifications of the program code.
This simple solution, however, will be quite satisfactory in a
purely Turkish environment, and can be made even more successful
by taking the {\tt tex.pool} file and translating it all into
Turkish.

\endinput

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bye