README
1Hyphen - hyphenation library to use converted TeX hyphenation patterns
2
3(C) 1998 Raph Levien
4(C) 2001 ALTLinux, Moscow
5(C) 2006, 2007, 2008 László Németh
6
7This was part of libHnj library by Raph Levien.
8
9Peter Novodvorsky from ALTLinux cut hyphenation part from libHnj
10to use it in OpenOffice.org.
11
12Compound word and non-standard hyphenation support by László Németh.
13
14License is the original LibHnj license:
15LibHnj is dual licensed under LGPL and MPL (see also README.libhnj).
16
17Because LGPL allows GPL relicensing, COPYING contains now
18LGPL/GPL/MPL tri-license for explicit Mozilla source compatibility.
19
20Original Libhnj source with OOo's patches are managed by Rene Engelhard
21and Chris Halls at Debian:
22
23http://packages.debian.org/stable/libdevel/libhnj-dev
24and http://packages.debian.org/unstable/source/libhnj
25
26
27OTHER FILES
28
29This distribution is the source of the en_US hyphenation patterns
30"hyph_en_US.dic", too. See README_hyph_en_US.txt.
31
32Source files of hyph_en_US.dic in the distribution:
33
34hyphen.tex (en_US hyphenation patterns from plain TeX)
35
36 Source: http://tug.ctan.org/text-archive/macros/plain/base/hyphen.tex
37
38tbhyphext.tex: hyphenation exception log from TugBoat archive
39
40 Source of the hyphenation exception list:
41 http://www.ctan.org/tex-archive/info/digests/tugboat/tb0hyf.tex
42
43 Generated with the hyphenex script
44 (http://www.ctan.org/tex-archive/info/digests/tugboat/hyphenex.sh)
45
46 sh hyphenex.sh <tb0hyf.tex >tbhyphext.tex
47
48
49INSTALLATION
50
51./configure
52make
53make install
54
55UNIT TESTS (WITH VALGRIND DEBUGGER)
56
57make check
58VALGRIND=memcheck make check
59
60USAGE
61
62./example hyph_en_US.dic mywords.txt
63
64or (under Linux)
65
66echo example | ./example hyph_en_US.dic /dev/stdin
67
68NOTE: In the case of Unicode encoded input, convert your words
69to lowercase before hyphenation (under UTF-8 console environment):
70
71cat mywords.txt | awk '{print tolower($0)}' >mywordslow.txt
72
73DEVELOPMENT
74
75See README.hyphen for hyphenation algorithm, README.nonstandard
76and doc/tb87nemeth.pdf for non-standard hyphenation,
77README.compound for compound word hyphenation, and tests/*.
78
79Description of the dictionary format:
80
81First line contains the character encoding (ISO8859-x, UTF-8).
82
83Possible options in the following lines:
84
85LEFTHYPHENMIN num minimal hyphenation distance from the left word end
86RIGHTHYPHENMIN num minimal hyphation distance from the right word end
87COMPOUNDLEFTHYPHENMIN num min. hyph. dist. from the left compound word boundary
88COMPOUNDRIGHTHYPHENMIN num min. hyph. dist. from the right comp. word boundary
89
90hyphenation patterns see README.* files
91
92NEXTWORD separate the two compound sets (see README.compound)
93
94Default values:
95Without explicite declarations, hyphenmin fields of dict struct
96are zeroes, but in this case the lefthyphenmin and righthyphenmin
97will be the default 2 under the hyphenation (for backward compatibility).
98
99Comments
100
101Use percent sign at the beginning of the lines to add comments to your
102hpyhenation patterns (after the character encoding in the first line):
103
104% comment
105
106*****************************************************************************
107* Warning! Correct working of Libhnj *needs* prepared hyphenation patterns. *
108
109For example, generating hyph_en_US.dic from "hyphen.us" TeX patterns:
110
111perl substrings.pl hyphen.us hyph_en_US.dic ISO8859-1
112
113or with default LEFTHYPHENMIN and RIGHTHYPHENMIN values:
114
115perl substrings.pl hyphen.us hyph_en_US.dic ISO8859-1 2 3
116perl substrings.pl hyphen.gb hyph_en_GB.dic ISO8859-1 3 3
117****************************************************************************
118
119OTHERS
120
121Java hyphenation: Peter B. West (Folio project) implements a hyphenator with
122non standard hyphenation facilities based on extended Libhnj. The HyFo module
123is released in binary form as jar files and in source form as zip files.
124See http://sourceforge.net/project/showfiles.php?group_id=119136
125
126László Németh
127<nemeth (at) openoffice (dot) org>
128
README.compound
1Compound word hyphenation
2
3Hyphen library supports better compound word hyphenation and special
4rules of compound word hyphenation of German languages and other
5languages with arbitrary number of compound words. The new options,
6COMPOUNDLEFTHYPHENMIN and COMPOUNDRIGHTHYPHENMIN help to set the right
7style for the hyphenation of compound words.
8
9Algorithm
10
11The algorithm is an extension of the original pattern based hyphenation
12algorithm. It uses two hyphenation pattern sets, defined in the same
13pattern file and separated by the NEXTLEVEL keyword. First pattern
14set is for hyphenation only at compound word boundaries, the second one
15is for hyphenation within words or word parts.
16
17Recursive compound level hyphenation
18
19The algorithm is recursive: every word parts of a successful
20first (compound) level hyphenation will be rehyphenated
21by the same (first) pattern set.
22
23Finally, when first level hyphenation is not possible, Hyphen uses
24the second level hyphenation for the word or the word parts.
25
26Word endings and word parts
27
28Patterns for word endings (patterns with ellipses) match the
29word parts, too.
30
31Options
32
33COMPOUNDLEFTHYPHENMIN: min. hyph. dist. from the left compound word boundary
34COMPOUNDRIGHTHYPHENMIN: min. hyph. dist. from the right comp. word boundary
35NEXTLEVEL: sign second level hyphenation patterns
36
37Default hyphenmin values
38
39Default values of COMPOUNDLEFTHYPHENMIN and COMPOUNDRIGHTHYPHENMIN are 0,
40and 0 under the hyphenation, too. ("0" values of
41LEFTHYPHENMIN and RIGHTHYPHENMIN mean the default "2" under the hyphenation.)
42
43Examples
44
45See tests/compound* test files.
46
47Preparation of hyphenation patterns
48
49It hasn't been special pattern generator tool for compound hyphenation
50patterns, yet. It is possible to use PATGEN to generate both of
51pattern sets, concatenate it manually and set the requested HYPHENMIN values.
52(But don't forget the preprocessing steps by substrings.pl before
53concatenation.) One of the disadvantage of this method, that PATGEN
54doesn't know recursive compound hyphenation of Hyphen.
55
56László Németh
57<nemeth (at) openoffice.org>
58
README.hyphen
1Brief explanation of the hyphenation algorithm herein.[1]
2
3Raph Levien <raph@acm.org>
44 Aug 1998
5
6 The hyphenation algorithm is basically the same as Knuth's TeX
7algorithm. However, the implementation is quite a bit faster.
8
9 The hyphenation files from TeX can almost be used directly. There
10is a preprocessing step, however. If you don't do the preprocessing
11step, you'll get bad hyphenations (i.e. a silent failure).
12
13 Start with a file such as hyphen.us. This is the TeX ushyph1.tex
14file, with the exception dictionary encoded using the same rules as
15the main portion of the file. Any line beginning with % is a comment.
16Each other line should contain exactly one rule.
17
18 Then, do the preprocessing - "perl substrings.pl hyphen.us". The
19resulting file is hyphen.mashed. It's in Perl, and it's fairly slow
20(it uses brute force algorithms; about 17 seconds on a P100), but it
21could probably be redone in C with clever algorithms. This would be
22valuable, for example, if it was handle user-supplied exception
23dictionaries by integrating them into the rule table.[2]
24
25 Once the rules are preprocessed, loading them is quite quick -
26about 200ms on a P100. It then hyphenates at about 40,000 words per
27second on a P100. I haven't benchmarked it against other
28implementations (both TeX and groff contain essentially the same
29algorithm), but expect that it runs quite a bit faster than any of
30them.
31
32Knuth's algorithm
33
34 This section contains a brief explanation of Knuth's algorithm, in
35case you missed it from the TeX books. We'll use the semi-word
36"example" as our running example.
37
38 Since the beginning and end of a word are special, the algorithm is
39actually run over the prepared word (prep_word in the source)
40".example.". Knuths algorithm basically just does pattern matches from
41the rule set, then applies the matches. The patterns in this case that
42match are "xa", "xam", "mp", and "pl". These are actually stored as
43"x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between
44the letters, they are added in. If two (or more) patterns have numbers
45in the same place, the highest number wins. Here's the example:
46
47 . e x a m p l e .
48 x1a
49 x a m3
50 4m1p
51 1p2l2
52 -----------------
53 . e x1a4m3p2l2e .
54
55 Finally, hyphens are placed wherever odd numbers appear. They are,
56however, suppressed after the first letter and before the last letter
57of the word (TeX actually suppresses them before the next-to-last, as
58well). So, it's "ex-am-ple", which is correct.
59
60 Knuth uses a trie to implement this. I.e. he stores each rule in a
61trie structure. For each position in the word, he searches the trie,
62searching for a match. Most patterns are short, so efficiency should
63be quite good.
64
65Theory of the algorithm
66
67 The algorithm works as a slightly modified finite state machine.
68There are two kinds of transitions: those that consume one letter of
69input (which work just like your regular finite state machine), and
70"fallback" transitions, which don't consume any input. If no
71transition matching the next letter is found, the fallback is used.
72One way of looking at this is a form of compression of the transition
73tables - i.e. it behaves the same as a completely vanilla state
74machine in which the actual transition table of a node is made up of
75the union of transition tables of the node itself, plus its fallbacks.
76
77 Each state is represented by a string. Thus, if the current state
78is "am" and the next letter is "p", then the next state is "amp".
79Fallback transitions go to states which chop off one or (sometimes)
80more letters from the beginning. For example, if none of the
81transitions from "amp" match the next letter, then it will fall back
82to "mp". Similarly, if none of the transitions from "mp" match the
83next letter, it will fall back to "m".
84
85 Each state is also associated with a (possibly null) "match"
86string. This represents the union of all patterns which are
87right-justified substrings of the match string. I.e. the pattern "mp"
88is a right-justified substring of the state "amp", so it's numbers get
89added in. The actual calculation of this union is done by the
90Perl preprocessing script, but could probably be done in C just about
91as easily.
92
93 Because each state transition either consumes one input character
94or shortens the state string by one character, the total number of
95state transitions is linear in the length of the word.
96
97[1] Documentations:
98
99Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
100Stanford University, 1983. http://www.tug.org/docs/liang.
101
102László Németh: Automatic non-standard hyphenation in OpenOffice.org,
103TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf
104
105[2] There is the C version of pattern converter "substrings.c"
106in the distribution written by Nanning Buitenhuis. Unfortunatelly,
107this version hasn't handled the non standard extension of the
108algorithm, yet.
109
README.nonstandard
1Non-standard hyphenation
2------------------------
3
4Some languages use non-standard hyphenation; `discretionary'
5character changes at hyphenation points. For example,
6Catalan: paral·lel -> paral-lel,
7Dutch: omaatje -> oma-tje,
8German (before the new orthography): Schiffahrt -> Schiff-fahrt,
9Hungarian: asszonnyal -> asz-szony-nyal (multiple occurance!)
10Swedish: tillata -> till-lata.
11
12Using this extended library, you can define
13non-standard hyphenation patterns. For example:
14
15l·1l/l=l
16a1atje./a=t,1,3
17.schif1fahrt/ff=f,5,2
18.as3szon/sz=sz,2,3
19n1nyal./ny=ny,1,3
20.til1lata./ll=l,3,2
21
22or with narrow boundaries:
23
24l·1l/l=,1,2
25a1atje./a=,1,1
26.schif1fahrt/ff=,5,1
27.as3szon/sz=,2,1
28n1nyal./ny=,1,1
29.til1lata./ll=,3,1
30
31Note: Libhnj uses modified patterns by preparing substrings.pl.
32Unfortunatelly, now the conversion step can generate bad non-standard
33patterns (non-standard -> standard pattern conversion), so using
34narrow boundaries may be better for recent Libhnj. For example,
35substrings.pl generates a few bad patterns for Hungarian hyphenation
36patterns resulting bad non-standard hyphenation in a few cases. Using narrow
37boundaries solves this problem. Java HyFo module can check this problem.
38
39Syntax of the non-standard hyphenation patterns
40------------------------------------------------
41
42pat1tern/change[,start,cut]
43
44If this pattern matches the word, and this pattern win (see README.hyphen)
45in the change region of the pattern, then pattern[start, start + cut - 1]
46substring will be replaced with the "change".
47
48For example, a German ff -> ff-f hyphenation:
49
50f1f/ff=f
51
52or with expansion
53
54f1f/ff=f,1,2
55
56will change every "ff" with "ff=f" at hyphenation.
57
58A more real example:
59
60% simple ff -> f-f hyphenation
61f1f
62% Schiffahrt -> Schiff-fahrt hyphenation
63%
64schif3fahrt/ff=f,5,2
65
66Specification
67
68- Pattern: matching patterns of the original Liang's algorithm
69 - patterns must contain only one hyphenation point at change region
70 signed with an one-digit odd number (1, 3, 5, 7 or 9).
71 These point may be at subregion boundaries: schif3fahrt/ff=,5,1
72 - only the greater value guarantees the win (don't mix non-standard and
73 non-standard patterns with the same value, for example
74 instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2)
75
76- Change: new characters.
77 Arbitrary character sequence. Equal sign (=) signs hyphenation points
78 for OpenOffice.org (like in the example). (In a possible German LaTeX
79 preprocessor, ff could be replaced with "ff, for a Hungarian one, ssz
80 with `ssz, according to the German and Hungarian Babel settings.)
81
82- Start: starting position of the change region.
83 - begins with 1 (not 0): schif3fahrt/ff=f,5,2
84 - start dot doesn't matter: .schif3fahrt/ff=f,5,2
85 - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2
86 - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3
87 ("össze" looks "össze" in an ISO 8859-1 8-bit editor).
88
89- Cut: length of the removed character sequence in the original word.
90 - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3
91 ("paral·lel" looks "paral·1lel" in an ISO 8859-1 8-bit editor).
92
93Dictionary developing
94---------------------
95
96There hasn't been extended PatGen pattern generator for non-standard
97hyphenation patterns, yet.
98
99Fortunatelly, non-standard hyphenation points are forbidden in the PatGen
100generated hyphenation patterns, so with a little patch can be develop
101non-standard hyphenation patterns also in this case.
102
103Warning: If you use UTF-8 Unicode encoding in your patterns, call
104substrings.pl with UTF-8 parameter to calculate right
105character positions for non-standard hyphenation:
106
107./substrings.pl input output UTF-8
108
109Programming
110-----------
111
112Use hyphenate2() or hyphenate3() to handle non-standard hyphenation.
113See hyphen.h for the documentation of the hyphenate*() functions.
114See example.c for processing the output of the hyphenate*() functions.
115
116Warning: change characters are lower cased in the source, so you may need
117case conversion of the change characters based on input word case detection.
118For example, see OpenOffice.org source
119(lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx).
120
121László Németh
122<nemeth (at) openoffice.org>
123