<!--
Copyright (C) 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html

Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved.
-->

This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey.
===========================================

The rules in this directory track the boundary rules from Unicode UAX #14 (line breaking) and UAX #29
(text segmentation). They are interpreted to provide an expected set of boundary positions to compare
with the results from ICU break iteration.

ICU4J also includes copies of the test reference rules, located in the directory
main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/
The copies should be kept synchronized; there should be no differences.

Each set of reference break rules lives in a separate file.
The list of rule files to run by default is hard-coded into the test code, in rbbimonkeytest.cpp.

Each test file includes
 - The type of ICU break iterator to create (word, line, sentence, etc.)
 - The locale to use
 - Character Class definitions
 - Rule definitions

To Do
 - Extend the syntax to support rule tailoring.

**character class definition**

    name = set_regular_expression;

*Caution:* when referenced, these definitions are textually substituted into the overall rule.
To avoid unexpected behavior, include [brackets] around the full definition.

    letter_number = [:Letter:][:Number:];

will compile, but will produce unexpected results.

    letter_number = [[:Letter:][:Number:]];

is safe. The issue is similar to the problems that can occur with the C preprocessor
and the use of parentheses around macro parameters.
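
The substitution hazard can be illustrated with a small sketch. It uses Python's `re` module as a stand-in for ICU regular expressions. The `expand` helper and the ASCII classes `[A-Za-z]` and `[0-9]` are assumptions made for illustration (Python's `re` has no `[:Letter:]` syntax and no nested sets, so the bracketed union is written as one flat class); none of this is taken from the test code.

```python
import re

def expand(rule, definitions):
    """Textually substitute each defined name into the rule (hypothetical helper)."""
    # Longest names first, so a short name never clobbers part of a longer one.
    for name in sorted(definitions, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        # A function replacement avoids backslash processing in the inserted text.
        rule = re.sub(pattern, lambda m, text=definitions[name]: text, rule)
    return rule

# Unbracketed: two adjacent sets, i.e. "one letter followed by one digit".
unsafe = {"letter_number": "[A-Za-z][0-9]"}
# Bracketed (flattened for Python): a single set, "one letter or digit".
safe = {"letter_number": "[A-Za-z0-9]"}

rule = "letter_number+"

unsafe_rule = expand(rule, unsafe)  # the '+' now binds only to the digit set
safe_rule = expand(rule, safe)      # the '+' applies to the whole class

print(unsafe_rule, bool(re.fullmatch(unsafe_rule, "ab1")))
print(safe_rule, bool(re.fullmatch(safe_rule, "ab1")))
```

After substitution the unbracketed form reads `[A-Za-z][0-9]+`, so the quantifier applies only to the second set, much as an unparenthesized macro argument changes operator precedence in C.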

**rule definition**

    rule_regular_expression;

**name**

    [A-Za-z_][A-Za-z0-9_]*

**set_regular_expression**

A set expression that is valid both as an ICU regular expression [set] expression and as a
UnicodeSet pattern (the two syntaxes are mostly the same). May include previously defined set
names, which are logically expanded in-place.

**rule_regular_expression**

    An ICU Regular Expression.
    May include set names, which are logically expanded in-place.
    May include a '÷', which defines a boundary position.
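
As a rough illustration of the two statement forms, the sketch below classifies a single `;`-terminated statement as either a character class definition or a rule. The `classify` function and its regular expressions are hypothetical and are not the parser in rbbimonkeytest.cpp. The sketch assumes that a statement beginning with `name =` is always a definition.

```python
import re

# name = set_regular_expression;
DEFINITION = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.+?)\s*;\s*$", re.DOTALL)
# rule_regular_expression;
RULE = re.compile(r"\s*(.+?)\s*;\s*$", re.DOTALL)

def classify(statement):
    """Return (kind, name, body) for one statement (hypothetical helper)."""
    m = DEFINITION.match(statement)
    if m:
        return ("class", m.group(1), m.group(2))
    m = RULE.match(statement)
    if m:
        return ("rule", None, m.group(1))
    raise ValueError("statement is not terminated by ';': %r" % statement)

print(classify("letter_number = [[:Letter:][:Number:]];"))
print(classify("letter_number ÷ letter_number;"))
```

The first call yields a `class` entry named `letter_number`; the second yields a `rule` whose body still contains the set names and the '÷' marker, ready for name expansion.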

Application of the rules:

Matching begins at the start of text, or after a previously identified boundary.
The pseudo-code below finds the next boundary.

    while position < end of text
        for each rule
            if the text at position matches this rule
                if the rule has a '÷'
                    Boundary is found.
                    return the position of the '÷' within the match.
                else
                    position = last character of the rule match.
                    break from the inner rule loop, continue the outer loop.

This differs from the Unicode UAX algorithm in that not every position in the text is
tested separately. Instead, when a rule match without a boundary is found, rule application restarts
at the last character of that match. ICU's break rules also operate this way.
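
The pseudo-code above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in rbbimonkeytest.cpp: Python's `re` stands in for ICU regular expressions, positions are code point indexes, the '÷' is turned into an empty named group (`brk`, a name chosen here) so its position can be read from the match, and the two toy rules are invented for the example. The pseudo-code does not say what happens when no rule matches or when a match makes no progress, so the sketch raises an error in both cases.

```python
import re

BREAK = "÷"

def compile_rule(rule_text):
    """Compile one rule; assumes at most one '÷' per rule (hypothetical helper)."""
    has_break = BREAK in rule_text
    # An empty named group marks where the boundary falls inside the match.
    pattern = rule_text.replace(BREAK, "(?P<brk>)")
    return re.compile(pattern, re.DOTALL), has_break

def next_boundary(text, start, rules):
    position = start
    while position < len(text):
        for regex, has_break in rules:
            m = regex.match(text, position)
            if m is None:
                continue
            if has_break:
                # Boundary found: report where the '÷' sits within the match.
                return m.start("brk")
            # No boundary: restart at the last character of this match.
            new_position = m.end() - 1
            if new_position <= position:
                raise ValueError("rule made no progress: %s" % regex.pattern)
            position = new_position
            break
        else:
            raise ValueError("no rule matched at position %d" % position)
    return len(text)

def all_boundaries(text, rules):
    found = [0]
    while found[-1] < len(text):
        found.append(next_boundary(text, found[-1], rules))
    return found

# Toy rules: never break between two lowercase letters; otherwise break after any character.
rules = [compile_rule(r) for r in [r"[a-z][a-z]", r".÷"]]
print(all_boundaries("ab cd", rules))
```

For the text `ab cd`, the first rule matches `ab` and moves the position to `b` without reporting a boundary; the second rule then matches `b` and reports the boundary after it.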

Expressing rules this way simplifies UAX rules that have leading or trailing context; it
is no longer necessary to write expressions that match the context starting from
any position within it.

This rule form differs from ICU rules in that the rules are applied sequentially, as they
are with the Unicode UAX rules. With the main ICU break rules, all rules are applied in parallel.
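
Because the rules are applied sequentially, the first rule that matches at a position wins, so rule order affects the result. The compact sketch below shows this with two invented toy rules. It again uses Python's `re` in place of ICU regular expressions, and the `boundaries` function and the `brk` group name are assumptions for illustration only.

```python
import re

def boundaries(text, rule_texts):
    """Apply rules in order; the first rule matching at a position wins (hypothetical helper)."""
    rules = [(re.compile(r.replace("÷", "(?P<brk>)"), re.DOTALL), "÷" in r)
             for r in rule_texts]
    found, position = [0], 0
    while position < len(text):
        for regex, has_break in rules:
            m = regex.match(text, position)
            if m is None:
                continue
            if has_break:
                # Record the boundary and continue matching from it.
                position = m.start("brk")
                found.append(position)
            else:
                # No boundary: continue from the last character of the match.
                position = m.end() - 1
            break
        else:
            raise ValueError("no rule matched at position %d" % position)
    return found

no_break_first = [r"[a-z][a-z]", r".÷"]
break_first = [r".÷", r"[a-z][a-z]"]

print(boundaries("ab cd", no_break_first))
print(boundaries("ab cd", break_first))
```

With the "no break between letters" rule listed first, the letters stay together. With the catch-all break rule listed first, it matches at every position and the second rule is never reached.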

**Word Dictionaries**

The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special,
as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are
included in the randomly generated test data.