---
layout: default
title: Break Rules
nav_order: 1
parent: Boundary Analysis
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Break Rules
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Introduction

ICU locates boundary positions within text by means of rules, which are a form
of regular expressions. The form of the rules is similar, but not identical,
to the boundary rules from the Unicode specifications
[[UAX-14](https://www.unicode.org/reports/tr14/),
[UAX-29](https://www.unicode.org/reports/tr29/)], and there is a reasonably close
correspondence between the two.

Taken as a set, the ICU rules describe how to move forward to the next boundary,
starting from a known boundary.
ICU includes rules for the standard boundary types (word, line, etc.).
Applications may also create customized break iterators from their own rules.

ICU's built-in rules are located at
[icu/icu4c/source/data/brkitr/rules/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
These can serve as examples when writing your own, and as a starting point for
customizations.

### Rule Tutorial

Rules most commonly describe a range of text that should remain together,
unbroken. For example, this rule

```
[\p{Letter}]+;
```

matches a run of one or more letters, and would cause them to remain unbroken.

The part within `[`brackets`]` follows normal ICU [UnicodeSet pattern syntax](../strings/unicodeset.md).

The qualifier, '`+`' in this case, can be one of

| Qualifier | Meaning                  |
| --------- | ------------------------ |
| empty     | Match exactly once       |
| `?`       | Match zero or one time   |
| `+`       | Match one or more times  |
| `*`       | Match zero or more times |

#### Variables

A variable names a set or rule sub-expression.
Variables are useful for documenting
what something represents, and for simplifying complex expressions by breaking
them up.

"Variable" is something of a misnomer; variables cannot be reassigned, and are
more like constant expressions.

They start with a '`$`', both in the definition and use.

```
# Variable Definition
$ASCIILetNum = [A-Za-z0-9];
# Variable Use
$ASCIILetNum+;
```

#### Comments and Semicolons

'`#`' begins a comment, which extends to the end of a line.

Comments may stand alone, or appear after another statement on a line.

All rule statements or expressions are terminated by semicolons.

#### Chained Matching

Most ICU rule sets use the concept of "chained matching". The idea is that a
complete match can be composed from multiple pieces, with each piece coming from
an individual rule of a rule set.

This idea is unique to ICU break rules; it is not a concept found in other
regular expression based matchers. Some of the Unicode standard break rules
would be difficult to implement without it.

Starting with an example,

```
!!chain;
$word_char = [\p{Letter}];
$word_joiner = [_-];
$word_char+;
$word_char $word_joiner $word_char;
```

These rules will match "`abc`", "`hello_world`", "`hi-there`",
"`a-bunch_of-joiners-here`".

They will not match "`-abc`", "`multiple__joiners`", "`tail-`".

A full match is composed of pieces or submatches, possibly from different rules,
with adjacent submatches linked by at least one overlapping character.

In the example below, matching "`hello_world`",

* '`1`' shows matches of the first rule, `$word_char+`

* '`2`' shows matches of the second rule, `$word_char $word_joiner $word_char`

```
hello_world
11111 11111
    222
```

There is an overlap of the matched regions, which causes the chaining mechanism
to join them into a single overall match.
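
The chaining behavior in the example above can be imitated with a few lines of
Python. This is only a simplified model written for illustration, not ICU's
state-machine implementation: it assumes ASCII letters stand in for
`\p{Letter}`, that each new piece overlaps the previous one by exactly its last
character, and that an unmatched character forms a one-character segment. The
names `RULES`, `next_boundary` and `matches_whole` are hypothetical.

```python
import re

# Simplified stand-ins for the two rules of the example (ASCII letters only,
# an assumption made for this sketch).
RULES = [
    re.compile(r"[A-Za-z]+"),               # $word_char+;
    re.compile(r"[A-Za-z][_-][A-Za-z]"),    # $word_char $word_joiner $word_char;
]

def next_boundary(text, start=0):
    """Return the end of the chained match that begins at `start`."""
    end = start
    # The first piece must begin exactly at the starting boundary.
    for rule in RULES:
        m = rule.match(text, start)
        if m and m.end() > end:
            end = m.end()
    if end == start:
        # No rule matched: this sketch just advances by one character.
        return min(start + 1, len(text))
    # Chaining: try to start another rule on the last matched character,
    # and keep extending while any rule reaches further.
    extended = True
    while extended:
        extended = False
        for rule in RULES:
            m = rule.match(text, end - 1)
            if m and m.end() > end:
                end = m.end()
                extended = True
    return end

def matches_whole(text):
    """True if a single chained match covers the entire text."""
    return next_boundary(text, 0) == len(text)

for sample in ["abc", "hello_world", "hi-there", "-abc", "multiple__joiners", "tail-"]:
    print(sample, matches_whole(sample))
```

Under these assumptions the first three samples are covered by one chained
match and the last three are not, which agrees with the lists given above.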

The mechanism is a good match to, for example, [Unicode's word break
rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), where rules
WB5 through WB13 combine to piece together longer words from multiple short
segments.

`!!chain;` enables chaining in a rule set. It is disabled by default for
backward compatibility: very old versions of ICU did not support it, and it was
originally introduced as an option.

#### Parentheses and Alternation

Rule expressions can contain parentheses and '`|`' operators, representing
alternation or "or" operations. This follows conventional regular expression
behavior.

For example, the following would match a simplified identifier:

```
$Letter ($Letter | $Digit)*;
```

#### String and Character Literals

As in common regular expressions, literal characters that do not have
other special meaning represent themselves. So the rule

```
Hello;
```

would match the literal input "`Hello`".

In practice, nearly all break rules are composed from `[`sets`]` based on Unicode
character properties; literal characters in rules are very rare.

To prevent random typos in rules from being treated as literals, use this
option:

```
!!quoted_literals_only;
```

With the option, the naked `Hello` becomes a rule syntax error while a quoted
`"hello"` still matches a literal hello.

`!!quoted_literals_only` is strongly recommended for all rule sets. The random
typo problem is very real, and surprisingly hard to recognize and debug.

#### Explicit Break Rules

A rule containing a slash (`/`) will force a boundary when it matches, even when
other rules or chaining would otherwise lead to a longer match. Also called Hard
Break Rules, these have the form

```
pre-context / post-context;
```

where the pre-context and post-context look like normal break rules.
Both the pre-context and
post-context are required, and must not allow a zero-length match. There should
be no overlap between characters that end a match of the pre-context and those
that begin a match of the post-context.

Chaining into a hard break rule operates normally. There is no chaining out of a
hard break rule; when the post-context matches, a break is forced immediately.

Note: future versions of ICU may loosen the restrictions on explicit break
rules. The behavior of rules with missing or overlapping contexts is subject to
change.

#### Chaining Control

Chaining into a rule can be disallowed by beginning that rule with a '`^`'. Rules
so marked can begin a match after a preceding boundary or at the start of text,
but cannot extend a match via chaining from another rule.

~~The !!LBCMNoChain; statement modifies chaining behavior by preventing chaining
from one rule to another from occurring on any character whose Line Break
property is Combining Mark. This option is subject to change or removal, and
should not be used in general. Within ICU, it is used only with the line break
rules. We hope to replace it with something more general.~~

> :point_right: **Note**: `!!LBCMNoChain` is deprecated, and will be removed
> completely from a future version of ICU.

## Rule Status Values

Break rules can be tagged with a number, which is called the *rule status*.
After a boundary has been located, the status number of the specific rule that
determined the boundary position is available to the application through the
function `getRuleStatus()`.

For the predefined word boundary rules, status values are available to
distinguish between boundaries associated with words, numbers, and those around
spaces or punctuation.
Similarly, for line break boundaries, status values
distinguish between mandatory line endings (new line characters) and break
opportunities that are appropriate points for line wrapping. Refer to the ICU
API documentation for the C header file `ubrk.h` or to Java class
`RuleBasedBreakIterator` for a complete list of the predefined boundary
classifications.

When creating custom sets of break rules, integer status values can be
associated with boundary rules in whatever way will be convenient for the
application. There is no need to remain restricted to the predefined values and
classifications from the standard rules.

It is possible for a set of break rules to contain more than a single rule that
produces some boundary in an input text. In this event, `getRuleStatus()` will
return the numerically largest status value from the matching rules, and the
alternate function `getRuleStatusVec()` will return a vector of the values from
all of the matching rules.

In the source form of the break rules, status numbers appear at the end of a
rule, and are enclosed in `{`braces`}`.

Hard break rules that also have a status value place the status at the end, for
example

```
pre-context / post-context {1234};
```

### Word Dictionaries

For some languages that don't normally use spaces between words, break iterators
are able to supplement the rules with dictionary-based breaking. Some languages,
such as Thai and Lao, use a dictionary for both word and line breaking.
Others, such as Japanese, use a dictionary for word breaking, but not for line
breaking.

To enable dictionary use,

1. The break rules must select, as unbroken chunks, ranges of text to be passed
   off to the word dictionary for further subdivision.
2. The break rules must define a character class named `$dictionary` that
   contains the characters (letters) to be handled by the dictionary.

The dictionary implementation, on receiving a range of text, will map it to a
specific dictionary based on script, and then delegate to that dictionary for
subdividing the range into words.

See, for example, this snippet from the [line break rules](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/brkitr/rules/line.txt):

```
# Dictionary character set, for triggering language-based break engines. Currently
# limited to LineBreak=Complex_Context (SA).
$dictionary = [$SA];
```

## Rule Options

| Option          | Description |
| --------------- | ----------- |
| `!!chain`       | Enable rule chaining. Default is no chaining. |
| `!!forward`     | The rules that follow are for forward iteration. Forward rules are now the only type of rules needed or used. |

### Deprecated Rule Options

| Deprecated Option | Description |
| --------------- | ----------- |
| ~~`!!reverse`~~ | ~~*[deprecated]* The rules that follow are for reverse iteration. No longer needed; any rules in a Reverse rule section are ignored.~~ |
| ~~`!!safe_forward`~~ | ~~*[deprecated]* The rules that follow are for safe forward iteration. No longer needed; any rules in such a section are ignored.~~ |
| ~~`!!safe_reverse`~~ | ~~*[deprecated]* The rules that follow are for safe reverse iteration. No longer needed; any rules in such a section are ignored.~~ |
| ~~`!!LBCMNoChain`~~ | ~~*[deprecated]* Disable chaining when the overlap character matches `\p{Line_Break=Combining_Mark}`~~ |

## Rule Syntax

Here is the syntax for the boundary rules. (The EBNF Syntax is given below.)

| Rule Name  | Rule Values | Notes |
| ---------- | ----------- | ----- |
| rules      | statement+  | |
| statement  | assignment \| rule \| control | |
| control    | (`!!forward` \| `!!reverse` \| `!!safe_forward` \| `!!safe_reverse` \| `!!chain`) `;` | |
| assignment | variable `=` expr `;` | 5 |
| rule       | `^`? expr (`{`number`}`)? `;` | 8,9 |
| number     | [0-9]+ | 1 |
| break-point | `/` | 10 |
| expr       | expr-q \| expr `\|` expr \| expr expr | 3 |
| expr-q     | term \| term `*` \| term `?` \| term `+` | |
| term       | rule-char \| unicode-set \| variable \| quoted-sequence \| `(` expr `)` \| break-point | |
| rule-special | *any printing ascii character except letters or numbers* \| white-space | |
| rule-char  | *any non-escaped character that is not rule-special* \| `.` \| *any escaped character except* `\p` *or* `\P` | |
| variable   | `$` name-start-char name-char* | 7 |
| name-start-char | `_` \| \p{L} | |
| name-char  | name-start-char \| \\p{N} | |
| quoted-sequence | `'` *(any char except single quote or line terminator or two adjacent single quotes)*+ `'` | |
| escaped-char | *See “Character Quoting and Escaping” in the [UnicodeSet](../strings/unicodeset.md) chapter* | |
| unicode-set | See [UnicodeSet](../strings/unicodeset.md) | 4 |
| comment    | unescaped `#` *(any char except new-line)** new-line | 2 |
| s          | unescaped \p{Z}, tab, LF, FF, CR, NEL | 6 |
| new-line   | LF, CR, NEL | 2 |

### Rule Syntax Notes

1. The number associated with a rule that actually determined a break position
   is available to the application after the break has been returned. These
   numbers are *not* Perl regular expression repeat counts.

2. Comments are recognized and removed separately from otherwise parsing the
   rules. They may appear wherever a space would be allowed (and ignored).

3. The implicit concatenation of adjacent terms has higher precedence than the
   `|` operation. "`ab|cd`" is interpreted as "`(ab)|(cd)`", not as "`a(b|c)d`" or
   "`(((ab)|c)d)`".

4. The syntax for [unicode-set](../strings/unicodeset.md) is defined (and parsed) by the `UnicodeSet` class.
   It is not repeated here.

5. For `$`variables that will be referenced from inside of a `UnicodeSet`, the
   definition must consist only of a Unicode Set. For example, when variable `$a`
   is used in a rule like `[$a$b$c]`, then this definition of `$a` is ok:
   “`$a=[:Lu:];`” while this one “`$a=abcd;`” would cause an error when `$a` was
   used.

6. Spaces are allowed nearly anywhere, and are not significant unless escaped.
   Exceptions to this are noted.

7. No spaces are allowed within a variable name. The variable name `$dictionary`
   is special. If defined, it must be a Unicode Set, the characters of which
   will trigger the use of word dictionary based boundaries.

8. A leading `^` on a rule prevents chaining into that rule. It can only match
   immediately after a preceding boundary, or at the start of text.

9. `{`nnn`}` appearing at the end of a rule is a Rule Status number, not a repeat
   count as it would be with conventional regular expression syntax.

10. A `/` in a rule specifies a hard break point. If the rule matches, a
    boundary will be forced at the position of the `/` within the match.

### EBNF Syntax used for the RBBI rules syntax description

| syntax | description               |
| -- | ------------------------- |
| a? | zero or one instance of a |
| a+ | one or more instances of a |
| a* | zero or more instances of a |
| a \| b | either a or b, but not both |
| `a` "`a`" | the literal string between the quotes or displayed as `monospace` |

## Planned Changes and Removed or Deprecated Rule Features

1. Reverse rules could formerly be indicated by beginning them with an
   exclamation `!`. This syntax is deprecated, and will be removed from a
   future version of ICU.

2. `!!LBCMNoChain` was a global option that specified that characters with the
   line break property of "Combining Character" would not participate in rule
   chaining.
   This option was always considered internal, is deprecated, and will
   be removed from a future version of ICU.

3. Naked rule characters. Plain text, in the context of a rule, is treated as
   literal text to be matched, much like normal regular expressions. This turns
   out to be very error-prone, has been the source of bugs in released versions
   of ICU, and is not useful in implementing normal text boundary rules. A
   future version will reject literal text that is not escaped.

4. Exact reverse rules and safe forward rules: planned changes to the break
   engine implementation will remove the need for exact reverse rules and safe
   forward rules.

5. `{bof}` and `{eof}`, appearing within `[`sets`]`, match the beginning or ending of
   the input text, respectively. This is an internal (not documented) feature
   that will probably be removed in a future version of ICU. They are currently
   used by the standard rules for word, line and sentence breaking. An
   alternative is probably needed. The existing implementation is incomplete.

## Additional Sample Code

**C/C++**
See [icu/source/samples/break/](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/break/)
in the ICU source distribution for code samples showing the use of ICU boundary analysis.

## Details about Dictionary-Based Break Iteration

> :point_right: **Note**: The section below dates from August 2012.
> It is probably out of date; for example, `brkfiles.mk` does not exist anymore.

Certain Unicode characters have a "dictionary" bit set in the break iteration
rules, and text made up of these characters cannot be handled by the rules-based
break iteration code for lines or words. Rather, they must be handled by a
dictionary-based approach. The ICU approach is as follows:

Once the Dictionary bit is detected, the set of characters with that bit is
handed off to "dictionary code."
This code then inspects the characters more
carefully, and splits them by script (Thai, Khmer, Chinese, Japanese, Korean).
If text in this script has not yet been handled, it loads the appropriate
dictionary from disk, and initializes a specialized "BreakEngine" class for that
script.

There are three such specialized classes: Thai, Khmer and CJK.

Thai and Khmer use very similar approaches. They look through a dictionary that
is not weighted by word frequency, and attempt to find the longest total "match"
that can be made in the text.

For Chinese and Japanese text, on the other hand, we have a unified dictionary
(because both use some of the same characters, it is difficult to
distinguish them) that contains information about word frequencies. The
algorithm to match text then uses dynamic programming to find the set of breaks
it considers "most likely" based on the frequency of the words created by the
breaks. This algorithm could also be used for Thai, Khmer, and Korean, but we do
not have sufficient data to do so.

Code of interest is in `source/common/dictbe.{h, cpp}`,
`source/common/brkeng.{h, cpp}`, `source/common/dictionarydata.{h, cpp}`. The
dictionaries use the `BytesTrie` and `UCharsTrie` as their data store. The
binary form of these dictionaries is produced by the `gendict` tool, which has
source in `source/tools/gendict`.

In order to add new dictionary implementations, a few changes have to be made.
First, you should create a new subclass of `DictionaryBreakEngine` or
`LanguageBreakEngine` in `dictbe.cpp` that implements your algorithm. Then, in
`brkeng.cpp`, you should add logic to create this dictionary break engine when
the appropriate script is encountered, which should only be 3 or so lines of
code at the most. Lastly, you should add the correct data file.
If your data is to be
represented as a `.dict` file (as is recommended, and in fact required if you
don't want to make substantial code changes to the engine loader), you
simply need to add a file in the correct format for `gendict` to the
`source/data/brkitr` directory, and add its name to the list of
`BRK_DICT_SOURCE` in `source/data/brkitr/brkfiles.mk`. This will cause your
dictionary (say, `foo.txt`) to be added as a `UCharsTrie` dictionary with the
name `foo.dict`. If you want your dictionary to be a `BytesTrie` dictionary, you
will need to specify a transform within the `Makefile`. To do so, find the parts
of `source/data/Makefile.in` and `source/data/makedata.mak` that deal with
`thaidict.dict` and `khmerdict.dict` and add a similar set of lines for your
script. Finally, in `source/data/brkitr/root.txt`, add a line to the
dictionaries `{}` section of the form:

```
shortscriptname:process(dependency){"dictionaryname.dict"}
```

For example, for Katakana:

```
Kata:process(dependency){"cjdict.dict"}
```

Make sure to add appropriate tests for the new implementation.
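
As an illustration of the frequency-based approach described earlier in this
section for Chinese and Japanese text, the following Python sketch uses dynamic
programming to pick the segmentation whose words are jointly most likely. It is
not the ICU implementation in `dictbe.cpp`. The word counts are invented toy
data, English letters stand in for CJK text, and the function name `segment` and
the penalty for unknown characters are assumptions made for this sketch.

```python
import math

# Hypothetical toy word counts, invented for this sketch (not ICU dictionary data).
WORD_COUNTS = {"the": 500, "cat": 100, "cats": 10, "sat": 80,
               "at": 50, "a": 300, "he": 100}

def segment(text, counts=WORD_COUNTS):
    """Split `text` into the word sequence with the lowest total cost."""
    total = sum(counts.values())
    max_len = max(len(word) for word in counts)
    # A word costs -log(relative frequency). Unknown single characters get a
    # large fixed penalty (an arbitrary choice) so the search never gets stuck.
    unknown_cost = -math.log(1.0 / (10 * total))
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    prev = [0] * (n + 1)         # prev[i]: start of the last word in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts:
                cost = -math.log(counts[word] / total)
            elif len(word) == 1:
                cost = unknown_cost
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                prev[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[prev[i]:i])
        i = prev[i]
    return words[::-1]

print(segment("thecatsat"))  # ['the', 'cat', 'sat']
```

With these toy counts, "the cat sat" is preferred over "the cats at" because
"cat" and "sat" are together more frequent than "cats" and "at". A longest-match
strategy, as described for Thai and Khmer, would not consult frequencies at all.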