break-rules.md - OpenGrok cross reference for /third_party/icu/docs/userguide/boundaryanalysis/break-rules.md

1 ---
3 title: Break Rules
6 ---
7 <!--
10 -->
12 # Break Rules
16 {: .no_toc .text-delta }
21 ---
25 ICU locates boundary positions within text by means of rules, which are a form
26 of regular expressions. The form of the rules is similar, but not identical,
27 to the boundary rules from the Unicode specifications
28 [[UAX-14](https://www.unicode.org/reports/tr14/), 
29 [UAX-29](https://www.unicode.org/reports/tr29/)], and there is a reasonably close
32 Taken as a set, the ICU rules describe how to move forward to the next boundary,
34 ICU includes rules for the standard boundary types (word, line, etc.).
35 Applications may also create customized break iterators from their own rules.
37 ICU's built-in rules are located at
38 [icu/icu4c/source/data/brkitr/rules/](https://github.com/unicode-org/icu/tree/main/icu4c/source/dat…
44 Rules most commonly describe a range of text that should remain together,
58 | --------- | ------------------------ |
66 A variable names a set or rule sub-expression. They are useful for documenting
77     $ASCIILetNum = [A-Za-z0-9];
96 This idea is unique to ICU break rules; it is not a concept found in other
97 regular expression based matchers. Some of the Unicode standard break rules
105     word_joiner = [_-];
110 These rules will match "`abc`", "`hello_world`", `"hi-there"`,
111 "`a-bunch_of-joiners-here`".
113 They will not match "`-abc`", "`multiple__joiners`", "`tail-`"
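
The match and non-match lists above can be sanity-checked with an ordinary regular expression. This is only a rough analogue, not ICU code: it assumes that "letter" means an ASCII letter and that a valid segment is one or more letter runs separated by single joiners from `[_-]`. The names `WORD` and `is_single_match` are hypothetical and exist only for this sketch.

```python
import re

# Assumption: a "letter" is an ASCII letter; the joiner set [_-] comes from the
# rule snippet above. This is a plain-regex analogue, not the ICU rule engine.
WORD = re.compile(r"[A-Za-z]+(?:[_-][A-Za-z]+)*")


def is_single_match(text: str) -> bool:
    """Return True when the whole string forms one unbroken match."""
    return WORD.fullmatch(text) is not None


matching = ["abc", "hello_world", "hi-there", "a-bunch_of-joiners-here"]
non_matching = ["-abc", "multiple__joiners", "tail-"]

for sample in matching + non_matching:
    print(sample, is_single_match(sample))
```

The failing cases fail for the reasons the examples suggest: a joiner must be preceded and followed by at least one letter, so a leading joiner, a trailing joiner, and two adjacent joiners all break the match.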
115 A full match is composed of pieces or submatches, possibly from different rules,
134 rules](http://www.unicode.org/reports/tr29/#Word_Boundary_Rules), where rules
165 In practice, nearly all break rules are composed from `[`sets`]` based on Unicode
166 character properties; literal characters in rules are very rare.
168 To prevent random typos in rules from being treated as literals, use this
181 #### Explicit Break Rules
184 other rules or chaining would otherwise lead to a longer match. Also called Hard
185 Break Rules, these have the form
188     pre-context / post-context;
191 where the pre- and post-context look like normal break rules. Both the pre- and
192 post-context are required, and must not allow a zero-length match. There should
193 be no overlap between characters that end a match of the pre-context and those
194 that begin a match of the post-context.
197 hard break rule; when the post-context matches, a break is forced immediately.
200 rules. The behavior of rules with missing or overlapping contexts is subject to
205 Chaining into a rule can be disallowed by beginning that rule with a '`^`'. Rules
213 rules. We hope to replace it with something more general.~~
220 Break rules can be tagged with a number, which is called the *rule status*.
225 For the predefined word boundary rules, status values are available to
234 When creating custom sets of break rules, integer status values can be
235 associated with boundary rules in whatever way will be convenient for the
236 application. There is no need to remain restricted to the predefined values and
237 classifications from the standard rules.
239 It is possible for a set of break rules to contain more than a single rule that
241 return the numerically largest status value from the matching rules, and the
243 all of the matching rules.
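
As a minimal sketch of the two behaviors described above, assume the status values of every rule that matched at a boundary are already available as a list of integers. The function names and the sample values are hypothetical and are not ICU API. Sorting and de-duplicating the full list is a choice made for this sketch only.

```python
from typing import List


def largest_status(matching_statuses: List[int]) -> int:
    """Pick the numerically largest status among the matching rules."""
    return max(matching_statuses)


def all_statuses(matching_statuses: List[int]) -> List[int]:
    """Report every distinct status among the matching rules (sorted here)."""
    return sorted(set(matching_statuses))


# Hypothetical status values attached to three rules that matched one boundary.
statuses = [7, 1234, 42]
print(largest_status(statuses))
print(all_statuses(statuses))
```

An application that tags its rules with its own numbering scheme can use the single largest value when the tags form a priority order, and the full list when the tags are independent labels.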
245 In the source form of the break rules, status numbers appear at the end of a rule,
248 Hard break rules that also have a status value place the status at the end, for
252     pre-context / post-context {1234};
258 are able to supplement the rules with dictionary-based breaking. Some languages,
265 1. The break rules must select, as unbroken chunks, ranges of text to be passed
267 2. The break rules must define a character class named `$dictionary` that
274 … this snippet from the [line break rules](https://github.com/unicode-org/icu/blob/main/icu4c/sourc…
277     #  Dictionary character set, for triggering language-based break engines. Currently
285 | --------------- | ----------- |
287 | `!!forward`     |  The rules that follow are for forward iteration. Forward rules are now the onl…
292 | --------------- | ----------- |
293 …!!reverse`~~     | ~~*[deprecated]* The rules that follow are for reverse iteration. No longer nee…
294 …afe_forward`~~ | ~~*[deprecated]* The rules that follow are for safe forward iteration. No longer …
295 …afe_reverse`~~ | ~~*[deprecated]* The rules that follow are for safe reverse iteration. No longer …
300 Here is the syntax for the boundary rules. (The EBNF Syntax is given below.)
303 | ---------- | ----------- | ----- |
304 | rules | statement+ | |
309 | number | [0-9]+ | 1 |
310 | break-point | `/` | 10 |
311 | expr | expr-q \| expr `\|` expr \| expr expr | 3 |
312 | expr-q | term \| term `*` \| term `?` \| term `+` |
313 | term | rule-char \| unicode-set \| variable \| quoted-sequence \| `(` expr `)` \| break-point |
314 | rule-special | *any printing ASCII character except letters or numbers* \| white-space |
315 | rule-char | *any non-escaped character that is not rule-special* \| `.` \| *any escaped character…
316 | variable | `$` name-start-char name-char* | 7 |
317 | name-start-char | `_` \| \\p{L} |
318 | name-char | name-start-char \| \\p{N} |
319 | quoted-sequence | `'` *(any char except single quote or line terminator or two adjacent single qu…
320 | escaped-char | *See “Character Quoting and Escaping” in the [UnicodeSet](../strings/unicodeset.md…
321 | unicode-set | See [UnicodeSet](../strings/unicodeset.md) | 4 |
322 | comment | unescaped `#` *(any char except new-line)** new-line | 2 |
324 | new-line | LF, CR, NEL | 2 |
333    rules. They may appear wherever a space would be allowed (and ignored).
339 4. The syntax for [unicode-set](../strings/unicodeset.md) is defined (and parsed) by the `UnicodeSe…
364 ### EBNF Syntax used for the RBBI rules syntax description
367 | -- | ------------------------- |
376 1. Reverse rules could formerly be indicated by beginning them with an
388    of ICU, and is not useful in implementing normal text boundary rules. A
391 4. Exact reverse rules and safe forward rules: planned changes to the break
392    engine implementation will remove the need for exact reverse rules and safe
393    forward rules.
398    used by the standard rules for word, line and sentence breaking. An
404 See [icu/source/samples/break/](https://github.com/unicode-org/icu/tree/main/icu4c/source/samples/b…
407 ## Details about Dictionary-Based Break Iteration
413 rules, and text made up of these characters cannot be handled by the rules-based
415 dictionary-based approach. The ICU approach is as follows:
448 strike the appropriate script - which should only be 3 or so lines of code at
450 represented as a `.dict` file - as is recommended, and in fact required if you
451 don't want to make substantial code changes to the engine loader - you need to