---
layout: default
title: Break Rules
nav_order: 1
parent: Boundary Analysis
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Break Rules
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Introduction

ICU locates boundary positions within text by means of rules, which are a form
of regular expressions. The form of the rules is similar, but not identical,
to the boundary rules from the Unicode specifications
[[UAX-14](https://www.unicode.org/reports/tr14/),
[UAX-29](https://www.unicode.org/reports/tr29/)], and there is a reasonably close
correspondence between the two.

Taken as a set, the ICU rules describe how to move forward to the next boundary,
starting from a known boundary.
ICU includes rules for the standard boundary types (word, line, etc.).
Applications may also create customized break iterators from their own rules.

ICU's built-in rules are located at
[icu/icu4c/source/data/brkitr/rules/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/brkitr/rules).
These can serve as examples when writing your own, and as a starting point for
customizations.

### Rule Tutorial

Rules most commonly describe a range of text that should remain together,
unbroken. For example, this rule

```
    [\p{Letter}]+;
```

matches a run of one or more letters, and would cause them to remain unbroken.

The part within `[`brackets`]` follows normal ICU [UnicodeSet pattern syntax](../strings/unicodeset.md).

The qualifier, '`+`' in this case, can be one of

| Qualifier | Meaning                  |
| --------- | ------------------------ |
| empty     | Match exactly once       |
| `?`       | Match zero or one time   |
| `+`       | Match one or more times  |
| `*`       | Match zero or more times |

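As a small illustration of the qualifiers, the following sketch combines them in two rules. The sets and the sample inputs in the comments are chosen for this example only; they are not taken from ICU's built-in rules.

```
    # An optional sign followed by one or more digits: "42", "-7", "+100"
    [\+\-]? [\p{Nd}]+;

    # A letter followed by zero or more digits: "x", "r2", "v10"
    [\p{Letter}] [\p{Nd}]*;
```
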
#### Variables

A variable names a set or rule sub-expression. Variables are useful for documenting
what something represents, and for simplifying complex expressions by breaking
them up.

"Variable" is something of a misnomer; variables cannot be reassigned, and behave
more like constant expressions.

Variable names start with a '`$`', both in the definition and in use.

```
    # Variable Definition
    $ASCIILetNum = [A-Za-z0-9];
    # Variable Use
    $ASCIILetNum+;
```

#### Comments and Semicolons

'`#`' begins a comment, which extends to the end of a line.

Comments may stand alone, or appear after another statement on a line.

All rule statements or expressions are terminated by semicolons.

#### Chained Matching

Most ICU rule sets use the concept of "chained matching". The idea is that a
complete match can be composed from multiple pieces, with each piece coming from
an individual rule of a rule set.

This idea is unique to ICU break rules; it is not a concept found in other
regular expression based matchers. Some of the Unicode standard break rules
would be difficult to implement without it.

Starting with an example,

```
    !!chain;
    $word_char = [\p{Letter}];
    $word_joiner = [_-];
    $word_char+;
    $word_char $word_joiner $word_char;
```

These rules will match "`abc`", "`hello_world`", "`hi-there`",
"`a-bunch_of-joiners-here`".

They will not match "`-abc`", "`multiple__joiners`", "`tail-`".

A full match is composed of pieces or submatches, possibly from different rules,
with adjacent submatches linked by at least one overlapping character.

In the example below, matching "`hello_world`",

* '`1`' shows matches of the first rule, `$word_char+`

* '`2`' shows matches of the second rule, `$word_char $word_joiner $word_char`

```
      hello_world
      11111 11111
          222
```

There is an overlap of the matched regions, which causes the chaining mechanism
to join them into a single overall match.

The mechanism is well suited to, for example, [Unicode's word break
rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), where rules
WB5 through WB13 combine to piece together longer words from multiple short
segments.

`!!chain;` enables chaining in a rule set. It is disabled by default for backward
compatibility: very old versions of ICU did not support it, and it was
originally introduced as an option.

#### Parentheses and Alternation

Rule expressions can contain parentheses and '`|`' operators, representing
alternation or "or" operations. This follows conventional regular expression
behavior.

For example, the following would match a simplified identifier:

```
    $Letter ($Letter | $Digit)*;
```

#### String and Character Literals

As in common regular expressions, literal characters that do not have
other special meaning represent themselves. So the rule

```
    Hello;
```

would match the literal input "`Hello`".

In practice, nearly all break rules are composed from `[`sets`]` based on Unicode
character properties; literal characters in rules are very rare.

To prevent random typos in rules from being treated as literals, use this
option:

```
    !!quoted_literals_only;
```

With the option, the naked `Hello` becomes a rule syntax error, while a quoted
`'Hello'` still matches the literal text `Hello`.

`!!quoted_literals_only` is strongly recommended for all rule sets. The random
typo problem is very real, and surprisingly hard to recognize and debug.

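A minimal sketch of a rule set that uses the option is shown here. It relies on the single-quote form of `quoted-sequence` from the Rule Syntax table; the literal `Hello` is only a placeholder.

```
    !!quoted_literals_only;

    'Hello';      # quoted: matches the literal text Hello
    # Hello;      # unquoted: would be reported as a rule syntax error
```
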
#### Explicit Break Rules

A rule containing a slash (`/`) will force a boundary when it matches, even when
other rules or chaining would otherwise lead to a longer match. Also called Hard
Break Rules, these have the form

```
    pre-context / post-context;
```

where the pre-context and post-context look like normal break rules. Both the
pre-context and the post-context are required, and neither may allow a
zero-length match. There should be no overlap between characters that end a
match of the pre-context and those that begin a match of the post-context.

Chaining into a hard break rule operates normally. There is no chaining out of a
hard break rule; when the post-context matches, a break is forced immediately.

Note: future versions of ICU may loosen the restrictions on explicit break
rules. The behavior of rules with missing or overlapping contexts is subject to
change.

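As a sketch of how an explicit break rule might be combined with an ordinary rule, consider the following. The variable names `$Letter` and `$Digit` are illustrative and do not come from ICU's built-in rule files.

```
    $Letter = [\p{Letter}];
    $Digit  = [\p{Nd}];

    # Keep runs of letters and digits together ...
    ($Letter | $Digit)+;

    # ... but force a boundary between letters and a following digit.
    $Letter+ / $Digit;
```

Under these assumptions, the input `abc123` would be expected to break between `c` and `1`, even though the first rule on its own could match the whole string. The two contexts satisfy the restrictions above: neither can match an empty string, and no character is both a letter and a digit.
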
#### Chaining Control

Chaining into a rule can be disallowed by beginning that rule with a '`^`'. Rules
so marked can begin a match after a preceding boundary or at the start of text,
but cannot extend a match via chaining from another rule.

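The effect of a leading '`^`' can be sketched with a small rule set. The rules and variable names are made up for this illustration.

```
    !!chain;
    $Letter = [\p{Letter}];
    $Digit  = [\p{Nd}];

    $Letter+;
    $Letter $Digit;    # letters may chain on to one trailing digit
    ^$Digit+;          # digits start a match only at a boundary; no chaining in
```

With these rules, the input `a123` would be expected to split into `a1` and `23`. Without the '`^`', the third rule could chain from the overlapping `1`, and all of `a123` would remain unbroken.
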
~~The !!LBCMNoChain; statement modifies chaining behavior by preventing chaining
from one rule to another from occurring on any character whose Line Break
property is Combining Mark. This option is subject to change or removal, and
should not be used in general. Within ICU, it is used only with the line break
rules. We hope to replace it with something more general.~~

> :point_right: **Note**: `!!LBCMNoChain` is deprecated, and will be removed
> completely from a future version of ICU.

## Rule Status Values

Break rules can be tagged with a number, which is called the *rule status*.
After a boundary has been located, the status number of the specific rule that
determined the boundary position is available to the application through the
function `getRuleStatus()`.

For the predefined word boundary rules, status values are available to
distinguish between boundaries associated with words, numbers, and those around
spaces or punctuation. Similarly for line break boundaries, status values
distinguish between mandatory line endings (new line characters) and break
opportunities that are appropriate points for line wrapping. Refer to the ICU
API documentation for the C header file `ubrk.h` or to Java class
`RuleBasedBreakIterator` for a complete list of the predefined boundary
classifications.

When creating custom sets of break rules, integer status values can be
associated with boundary rules in whatever way will be convenient for the
application. There is no need to remain restricted to the predefined values and
classifications from the standard rules.

It is possible for more than one rule in a set of break rules to produce the
same boundary in an input text. In this event, `getRuleStatus()` will
return the numerically largest status value from the matching rules, and the
alternate function `getRuleStatusVec()` will return a vector of the values from
all of the matching rules.

In the source form of the break rules, status numbers appear at the end of a rule,
and are enclosed in `{`braces`}`.

Hard break rules that also have a status value place the status at the end, for
example

```
    pre-context / post-context {1234};
```

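A custom rule set with status tags might look like the following sketch. The status values 100 and 200 and the variable names are assumptions made for the example; an application is free to choose its own.

```
    !!chain;
    !!quoted_literals_only;

    $Letter = [\p{Letter}];
    $Digit  = [\p{Nd}];

    $Digit+  {100};    # boundary after a run of digits reports status 100
    $Letter+ {200};    # boundary after a run of letters reports status 200
```

For the input `abc123`, these rules would be expected to produce a boundary after `abc` for which `getRuleStatus()` reports 200, and a boundary after `123` for which it reports 100.
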
### Word Dictionaries

For some languages that don't normally use spaces between words, break iterators
are able to supplement the rules with dictionary based breaking. Some languages,
Thai or Lao, for example, use a dictionary for both word and line breaking.
Others, such as Japanese, use a dictionary for word breaking, but not for line
breaking.

To enable dictionary use,

1. The break rules must select, as unbroken chunks, ranges of text to be passed
   off to the word dictionary for further subdivision.
2. The break rules must define a character class named `$dictionary` that
   contains the characters (letters) to be handled by the dictionary.

The dictionary implementation, on receiving a range of text, will map it to a
specific dictionary based on script, and then delegate to that dictionary for
subdividing the range into words.

See, for example, this snippet from the [line break rules](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/brkitr/rules/line.txt):

```
    #  Dictionary character set, for triggering language-based break engines. Currently
    #  limited to LineBreak=Complex_Context (SA).
    $dictionary = [$SA];
```

## Rule Options

| Option                   | Description |
| ------------------------ | ----------- |
| `!!chain`                | Enable rule chaining. Default is no chaining. |
| `!!forward`              | The rules that follow are for forward iteration. Forward rules are now the only type of rules needed or used. |
| `!!quoted_literals_only` | Require literal text in rules to be quoted. An unquoted literal becomes a rule syntax error. |

### Deprecated Rule Options

| Deprecated Option | Description |
| ----------------- | ----------- |
| ~~`!!reverse`~~     | ~~*[deprecated]* The rules that follow are for reverse iteration. No longer needed; any rules in a Reverse rule section are ignored.~~ |
| ~~`!!safe_forward`~~ | ~~*[deprecated]* The rules that follow are for safe forward iteration. No longer needed; any rules in such a section are ignored.~~ |
| ~~`!!safe_reverse`~~ | ~~*[deprecated]* The rules that follow are for safe reverse iteration. No longer needed; any rules in such a section are ignored.~~ |
| ~~`!!LBCMNoChain`~~ | ~~*[deprecated]* Disable chaining when the overlap character matches `\p{Line_Break=Combining_Mark}`~~ |

## Rule Syntax

Here is the syntax for the boundary rules. (The EBNF Syntax is given below.)

| Rule Name | Rule Values | Notes |
| ---------- | ----------- | ----- |
| rules | statement+ | |
| statement | assignment \| rule \| control | |
| control | (`!!forward` \| `!!reverse` \| `!!safe_forward` \| `!!safe_reverse` \| `!!chain`) `;` | |
| assignment | variable `=` expr `;` | 5 |
| rule | `^`? expr (`{`number`}`)? `;` | 8,9 |
| number | [0-9]+ | 1 |
| break-point | `/` | 10 |
| expr | expr-q \| expr `\|` expr \| expr expr | 3 |
| expr-q | term \| term `*` \| term `?` \| term `+` | |
| term | rule-char \| unicode-set \| variable \| quoted-sequence \| `(` expr `)` \| break-point | |
| rule-special | *any printing ascii character except letters or numbers* \| white-space | |
| rule-char | *any non-escaped character that is not rule-special* \| `.` \| *any escaped character except* `\p` *or* `\P` | |
| variable | `$` name-start-char name-char* | 7 |
| name-start-char | `_` \| `\p{L}` | |
| name-char | name-start-char \| `\p{N}` | |
| quoted-sequence | `'` *(any char except single quote or line terminator or two adjacent single quotes)*+ `'` | |
| escaped-char | *See “Character Quoting and Escaping” in the [UnicodeSet](../strings/unicodeset.md) chapter* | |
| unicode-set | See [UnicodeSet](../strings/unicodeset.md) | 4 |
| comment | unescaped `#` *(any char except new-line)*`*` new-line | 2 |
| s | unescaped `\p{Z}`, tab, LF, FF, CR, NEL | 6 |
| new-line | LF, CR, NEL | 2 |

### Rule Syntax Notes

1. The number associated with a rule that actually determined a break position
   is available to the application after the break has been returned. These
   numbers are *not* Perl regular expression repeat counts.

2. Comments are recognized and removed separately from otherwise parsing the
   rules. They may appear wherever a space would be allowed (and ignored).

3. The implicit concatenation of adjacent terms has higher precedence than the
   `|` operation. "`ab|cd`" is interpreted as "`(ab)|(cd)`", not as "`a(b|c)d`" or
   "`(((ab)|c)d)`".

4. The syntax for [unicode-set](../strings/unicodeset.md) is defined (and parsed) by the `UnicodeSet` class.
   It is not repeated here.

5. For `$`variables that will be referenced from inside of a `UnicodeSet`, the
   definition must consist only of a Unicode Set. For example, when variable `$a`
   is used in a rule like `[$a$b$c]`, then this definition of `$a` is ok:
   “`$a=[:Lu:];`” while this one “`$a=abcd;`” would cause an error when `$a` was
   used.

6. Spaces are allowed nearly anywhere, and are not significant unless escaped.
   Exceptions to this are noted.

7. No spaces are allowed within a variable name. The variable name `$dictionary`
   is special. If defined, it must be a Unicode Set, the characters of which
   will trigger the use of word dictionary based boundaries.

8. A leading `^` on a rule prevents chaining into that rule. It can only match
   immediately after a preceding boundary, or at the start of text.

9. `{`nnn`}` appearing at the end of a rule is a Rule Status number, not a repeat
   count as it would be with conventional regular expression syntax.

10. A `/` in a rule specifies a hard break point. If the rule matches, a
    boundary will be forced at the position of the `/` within the match.

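Notes 3 and 5 can be illustrated with a short sketch. The variable names are hypothetical and chosen only for this example.

```
    $Upper = [:Lu:];
    $Lower = [:Ll:];

    # Note 5: $Upper and $Lower are both defined as sets, so they may be
    # referenced from inside another set.
    $Cased = [$Upper$Lower];

    # Note 3: concatenation binds tighter than '|', so this rule matches
    # either two upper case letters or two lower case letters.
    $Upper $Upper | $Lower $Lower;
```
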
### EBNF Syntax used for the RBBI rules syntax description

| syntax | description |
| -- | ------------------------- |
| a? | zero or one instance of a |
| a+ | one or more instances of a |
| a* | zero or more instances of a |
| a \| b | either a or b, but not both |
| `a` "`a`" | the literal string between the quotes or displayed as `monospace` |

## Planned Changes and Removed or Deprecated Rule Features

1. Reverse rules could formerly be indicated by beginning them with an
   exclamation mark `!`. This syntax is deprecated, and will be removed from a
   future version of ICU.

2. `!!LBCMNoChain` was a global option that specified that characters with the
   line break property of "Combining Mark" would not participate in rule
   chaining. This option was always considered internal, is deprecated, and will
   be removed from a future version of ICU.

3. Naked rule characters. Plain text, in the context of a rule, is treated as
   literal text to be matched, much like normal regular expressions. This turns
   out to be very error prone, has been the source of bugs in released versions
   of ICU, and is not useful in implementing normal text boundary rules. A
   future version will reject literal text that is not escaped.

4. Exact reverse rules and safe forward rules: planned changes to the break
   engine implementation will remove the need for exact reverse rules and safe
   forward rules.

5. `{bof}` and `{eof}`, appearing within `[`sets`]`, match the beginning or ending of
   the input text, respectively. This is an internal (not documented) feature
   that will probably be removed in a future version of ICU. They are currently
   used by the standard rules for word, line and sentence breaking. An
   alternative is probably needed. The existing implementation is incomplete.

## Additional Sample Code

**C/C++**
See [icu/source/samples/break/](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/break/)
in the ICU source distribution for code samples showing the use of ICU boundary analysis.

## Details about Dictionary-Based Break Iteration

> :point_right: **Note**: The section below dates from August 2012.
> It is probably out of date; for example, `brkfiles.mk` no longer exists.

Certain Unicode characters have a "dictionary" bit set in the break iteration
rules, and text made up of these characters cannot be handled by the rules-based
break iteration code for lines or words. Rather, they must be handled by a
dictionary-based approach. The ICU approach is as follows:

Once the Dictionary bit is detected, the set of characters with that bit is
handed off to "dictionary code." This code then inspects the characters more
carefully, and splits them by script (Thai, Khmer, Chinese, Japanese, Korean).
If text in this script has not yet been handled, it loads the appropriate
dictionary from disk, and initializes a specialized "BreakEngine" class for that
script.

There are three such specialized classes: Thai, Khmer and CJK.

Thai and Khmer use very similar approaches. They look through a dictionary that
is not weighted by word frequency, and attempt to find the longest total "match"
that can be made in the text.

For Chinese and Japanese text, on the other hand, we have a unified dictionary
(because both use some of the same characters, it is difficult to
distinguish them) that contains information about word frequencies. The
algorithm to match text then uses dynamic programming to find the set of breaks
it considers "most likely" based on the frequency of the words created by the
breaks. This algorithm could also be used for Thai, Khmer and Korean, but we do
not have sufficient data to do so.

Code of interest is in `source/common/dictbe.{h, cpp}`, `source/common/brkeng.{h,
cpp}`, `source/common/dictionarydata.{h, cpp}`. The dictionaries use the `BytesTrie`
and `UCharsTrie` as their data store. The binary form of these dictionaries is
produced by the `gendict` tool, which has source in `source/tools/gendict`.

In order to add new dictionary implementations, a few changes have to be made:

1. Create a new subclass of `DictionaryBreakEngine` or `LanguageBreakEngine` in
   `dictbe.cpp` that implements your algorithm.

2. In `brkeng.cpp`, add logic to create this dictionary break engine when the
   appropriate script is encountered. This should only be 3 or so lines of code
   at the most.

3. Add the correct data file. Representing your data as a `.dict` file is
   recommended, and is in fact required if you don't want to make substantial
   code changes to the engine loader. To do so, simply add a file in the correct
   format for `gendict` to the `source/data/brkitr` directory, and add its name
   to the list of `BRK_DICT_SOURCE` in `source/data/brkitr/brkfiles.mk`. This
   will cause your dictionary (say, `foo.txt`) to be added as a `UCharsTrie`
   dictionary with the name `foo.dict`.

4. If you want your dictionary to be a `BytesTrie` dictionary, specify a
   transform within the `Makefile`. To do so, find the parts of
   `source/data/Makefile.in` and `source/data/makedata.mak` that deal with
   `thaidict.dict` and `khmerdict.dict`, and add a similar set of lines for your
   script.

5. In `source/data/brkitr/root.txt`, add a line to the dictionaries `{}` section
   of the form:

```
    shortscriptname:process(dependency){"dictionaryname.dict"}
```

For example, for Katakana:

```
    Kata:process(dependency){"cjdict.dict"}
```

Make sure to add appropriate tests for the new implementation.
