1Brief explanation of the hyphenation algorithm herein.[1] 2 3Raph Levien <raph@acm.org> 44 Aug 1998 5 6 The hyphenation algorithm is basically the same as Knuth's TeX 7algorithm. However, the implementation is quite a bit faster. 8 9 The hyphenation files from TeX can almost be used directly. There 10is a preprocessing step, however. If you don't do the preprocessing 11step, you'll get bad hyphenations (i.e. a silent failure). 12 13 Start with a file such as hyphen.us. This is the TeX ushyph1.tex 14file, with the exception dictionary encoded using the same rules as 15the main portion of the file. Any line beginning with % is a comment. 16Each other line should contain exactly one rule. 17 18 Then, do the preprocessing - "perl substrings.pl hyphen.us". The 19resulting file is hyphen.mashed. It's in Perl, and it's fairly slow 20(it uses brute force algorithms; about 17 seconds on a P100), but it 21could probably be redone in C with clever algorithms. This would be 22valuable, for example, if it was handle user-supplied exception 23dictionaries by integrating them into the rule table.[2] 24 25 Once the rules are preprocessed, loading them is quite quick - 26about 200ms on a P100. It then hyphenates at about 40,000 words per 27second on a P100. I haven't benchmarked it against other 28implementations (both TeX and groff contain essentially the same 29algorithm), but expect that it runs quite a bit faster than any of 30them. 31 32Knuth's algorithm 33 34 This section contains a brief explanation of Knuth's algorithm, in 35case you missed it from the TeX books. We'll use the semi-word 36"example" as our running example. 37 38 Since the beginning and end of a word are special, the algorithm is 39actually run over the prepared word (prep_word in the source) 40".example.". Knuths algorithm basically just does pattern matches from 41the rule set, then applies the matches. The patterns in this case that 42match are "xa", "xam", "mp", and "pl". These are actually stored as 43"x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between 44the letters, they are added in. If two (or more) patterns have numbers 45in the same place, the highest number wins. Here's the example: 46 47 . e x a m p l e . 48 x1a 49 x a m3 50 4m1p 51 1p2l2 52 ----------------- 53 . e x1a4m3p2l2e . 54 55 Finally, hyphens are placed wherever odd numbers appear. They are, 56however, suppressed after the first letter and before the last letter 57of the word (TeX actually suppresses them before the next-to-last, as 58well). So, it's "ex-am-ple", which is correct. 59 60 Knuth uses a trie to implement this. I.e. he stores each rule in a 61trie structure. For each position in the word, he searches the trie, 62searching for a match. Most patterns are short, so efficiency should 63be quite good. 64 65Theory of the algorithm 66 67 The algorithm works as a slightly modified finite state machine. 68There are two kinds of transitions: those that consume one letter of 69input (which work just like your regular finite state machine), and 70"fallback" transitions, which don't consume any input. If no 71transition matching the next letter is found, the fallback is used. 72One way of looking at this is a form of compression of the transition 73tables - i.e. it behaves the same as a completely vanilla state 74machine in which the actual transition table of a node is made up of 75the union of transition tables of the node itself, plus its fallbacks. 76 77 Each state is represented by a string. Thus, if the current state 78is "am" and the next letter is "p", then the next state is "amp". 79Fallback transitions go to states which chop off one or (sometimes) 80more letters from the beginning. For example, if none of the 81transitions from "amp" match the next letter, then it will fall back 82to "mp". Similarly, if none of the transitions from "mp" match the 83next letter, it will fall back to "m". 84 85 Each state is also associated with a (possibly null) "match" 86string. This represents the union of all patterns which are 87right-justified substrings of the match string. I.e. the pattern "mp" 88is a right-justified substring of the state "amp", so it's numbers get 89added in. The actual calculation of this union is done by the 90Perl preprocessing script, but could probably be done in C just about 91as easily. 92 93 Because each state transition either consumes one input character 94or shortens the state string by one character, the total number of 95state transitions is linear in the length of the word. 96 97[1] Documentations: 98 99Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er. 100Stanford University, 1983. http://www.tug.org/docs/liang. 101 102László Németh: Automatic non-standard hyphenation in OpenOffice.org, 103TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf 104 105[2] There is the C version of pattern converter "substrings.c" 106in the distribution written by Nanning Buitenhuis. Unfortunatelly, 107this version hasn't handled the non standard extension of the 108algorithm, yet. 109