<!--
Copyright (C) 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html#License
-->

Updating ICU's built-in Break Iterator rules
============================================

Here are instructions for updating ICU's built-in break iterator rules, for Grapheme, Word, Line and Sentence breaks.

The ICU rules implement the boundary behavior from Unicode [UAX-14](https://unicode.org/reports/tr14/) and [UAX-29](https://unicode.org/reports/tr29/), with tailorings from CLDR and some ICU-specific enhancements. ICU rules updates are needed in response to changes from Unicode or CLDR, or for bug fixes. Often ideas for CLDR or UAX updates are prototyped in ICU first, before becoming official.

This is not a cookbook process. Familiarity with ICU break iterator behavior and rules is needed. Sets of break rules often interact in subtle and difficult-to-understand ways. Expect some bumps.

### Have clear specifications for the change.

The changes will typically come from a proposed update to Unicode UAX 29 or UAX 14,
or from CLDR based tailorings to these specifications.

As an example, see [CLDR proposal for Extended Indic Grapheme Clusters](https://github.com/unicode-org/cldr/tree/master/common/properties/segments).

Often ICU will implement draft versions of proposed specification updates, to check that they are complete and consistent, and to identify any issues before they are released.

### Files that typically will need to be updated:

| File | Contents |
|-------------------------------------|--------------------------|
| icu/icu4c/source/... | |
| .../test/testdata/rbbitst.txt | Data driven test file. Typically the first to be updated. |
| .../data/brkitr/rules/*.txt | Main break rule files. |
| .../test/intltest/rbbitst.cpp | Monkey Test Rules (as code; also called the 'original' or 'old' monkey test). |
| .../test/testdata/break_rules/*.txt | Monkey Test Rules (as data; also called the 'new' or 'rule-based' monkey test). |
| .../test/testdata/*BreakTest.txt | Unicode Supplied Test Files. |
| | |
| icu/icu4j/... | |
| .../main/shared/data/icudata.jar | Data jar, includes break rules. Derived from ICU4C. |
| .../main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt | Test data, copied from ICU4C. |
| .../main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/* | Monkey test rules, copied from ICU4C. |
| .../main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBITestMonkey.java | Monkey test with rules as code. Ported from ICU4C. |


### ICU4C

The rule updates are done first for ICU4C, and then ported (code changes) or moved (data changes) to ICU4J. This order is easiest because the break rule source files are part of the ICU4C project, as is the rule builder.

1. **Add basic tests** to `icu4c/source/test/testdata/rbbitst.txt`

    This file contains data driven tests, which are basically text strings marked up with their expected break positions. The test syntax is documented in the file itself.

    Add tests to spot check the basics of the changes, to verify that some simple, straightforward cases work as expected. There is no need to thoroughly check corner cases; the goal at this step is a quick sanity check that will fail before the rule update and pass afterwards.

    The [Unicode Utilities](http://unicode.org/cldr/utility/) can be very helpful at this point, for showing what characters
    match a UnicodeSet expression, and for listing the properties of a particular character.

    Tests added for the above example:

        #
        # ICU-13637 and CLDR-10994 - Indic Grapheme Cluster Boundary changes to support aksaras
        #    New rule: LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant
        #    Sample Chars: LinkingConsonant: \u0915
        #                  Virama:           \u094d    [also Extend]
        #                  ExtCccZWJ:        \u0308
        #                  Extend but not ExtCCCZWJ   \u093A
        <char>
        <data>•\u0915\u094d\u0915•</data>
        <data>•\u0915\u0308\u0308\u094d\u0308\u0308\u0915•</data>
        <data>•\u0915\u0308\u0308\u094d\u0308\u0308•\u0041•</data>
        <data>•\u0915\u0308\u0308\u094d\u093A\u093A•\u0915•</data>

    Two copies of the test file exist in the ICU repository, one for C++ and one for Java. There are two because
    there is no common place that will always be present and that the two build systems can both access.

    The two should be identical. Verify this before starting to make changes. If they differ, it
    probably means that some earlier change to ICU4C was not fully ported to ICU4J, and this
    needs to be resolved before proceeding.

        diff icu4c/source/test/testdata/rbbitst.txt icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt

    This should show no difference.


2. **Run the ICU4C break iterator tests**, and verify that the newly added tests fail as expected.
    (The rules have not been updated yet.)

        cd icu/icu4c/source
        make -j6 check

    To run just the RBBI tests (you will be doing this a lot):

        cd test/intltest
        LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./intltest rbbi

    A snippet from the test output, showing one of many expected failures:

        === Handling test: rbbi: ===
        rbbi {
           TestExtended {
                code    alpha extend alphanum type word sent line name
           ------------------------------------------------ 0
                 915      1     0      1      Lo   LE     LE   AL   DEVANAGARI LETTER KA
                 94d      0     1      0      Mn   Extend EX   CM   DEVANAGARI SIGN VIRAMA
           ------------------------------------------------ 2
                 915      1     0      1      Lo   LE     LE   AL   DEVANAGARI LETTER KA
        Forward Iteration, break found, but not expected. Pos= 2 File line,col= 175, 21
        Reverse Itertion, break found, but not expected. Pos= 2 File line,col= 175, 21

3. **Update the main rule file for the break iterator type in question.**

    For this example, the rule file is `icu4c/source/data/brkitr/rules/char.txt`.
    (If the change is for word or line break, which have multiple rule files for tailorings, only update the root file at this time.)

    Start by looking at how existing similar rules are being handled, and also refer to the ICU user guide section on [Break Rules](http://userguide.icu-project.org/boundaryanalysis/break-rules) for an explanation of rule syntax and behavior.

    The transformation from UAX or CLDR style rules to ICU rules can be non-trivial. Sources of difficulties include:

    - All ICU rules run in parallel, while UAX/CLDR rules are applied sequentially, stopping after the first match. The ICU rules sometimes require extra logic to prevent a later rule from preempting an earlier rule. This can be quite tricky to express.

    - ICU rules match a run of text that does not have boundaries in its interior (unless the rule contains a "hard break", represented by a '/'). UAX and CLDR rules, on the other hand, tell whether a single text position is or is not a break, with the rule expressing pre and post context around that position. This transformation is generally not hard, and the ICU form of the rules is often simpler.


4. **Rebuild the ICU data with the updated rules.**

        cd icu4c/source/data
        make

5. **Rerun the data-driven test**, `rbbi/TestExtended`. With luck, it may pass. Failures fall into two classes:

    - The newly added test failed. Either something is wrong with the test cases, or something is wrong with the rule updates.

    - A previously existing test started failing. Examine the test case; it probably conflicts with the new rules. Either change the expected boundaries, change the test string to something that generates the previous boundaries, or remove the test, whichever seems most sensible.

    Fix any failures before proceeding. Don't try to run other tests yet; a large number of failures is very likely.

6. **Run any relevant Unicode test data.** ICU's copies of the test files are here:

        icu4c/source/test/testdata/GraphemeBreakTest.txt
        icu4c/source/test/testdata/LineBreakTest.txt
        icu4c/source/test/testdata/SentenceBreakTest.txt
        icu4c/source/test/testdata/WordBreakTest.txt

    If the update includes new versions of any of these files, copy them to the above locations.

    To run the test:

        cd icu4c/source/test/intltest
        LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./intltest rbbi/RBBITest/TestUnicodeFiles

    The test files are from the Unicode Consortium. The official, released versions are at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/ . The files are copied, unmodified, into the ICU source tree to make them accessible to the ICU tests.

    If the update is for a new Unicode version, or for a new CLDR tailoring of the root Unicode rules, it should include updated test data files. If they're missing, ask whoever is requesting or providing the updated rules for help. The test data is generated by CLDR tooling.

    Copy any new Unicode test data files to their location in ICU, and rerun the test.

    Historically, failures have had a roughly equal chance of being problems with the test data or problems with the ICU rules. In either case, track down and fix any problems before proceeding.

    *Note:* Known issues with the test data file are accounted for in the test code, in the function `RBBITest::testCaseIsKnownIssue()` in the file `rbbitst.cpp`. Test cases are skipped when ICU behavior has been patched or enhanced for some reason, relative to standard Unicode behavior.

7. **Other Break Iterator Tests, except for Monkey Tests**

        cd icu4c/source/test/intltest
        LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./intltest rbbi

    This runs all of the RBBI tests, including the Monkey tests. For this step, ignore Monkey failures, and track down and fix any others.

    These are a mish-mash of old tests that check assorted bits of hard-coded data.

8. **Monkey Tests**

    Monkey testing compares the breaking behavior of the main ICU RBBI implementation with that of a reference implementation, using random data.

    Monkey testing has proved to be by far the most effective way to check obscure edge and corner case behavior, to the point that it no longer seems worthwhile to hand-write more than fairly basic test cases.

    ICU has two independent RBBI monkey tests. The original one implements the break rules directly in code. The algorithm sticks pretty close to that of the Unicode UAX specifications. The original monkey test checks only the root break iterator behavior, not any tailorings.

    The newer, data-driven monkey test takes its reference rules from test data files, instead of hard coding them. It covers tailorings. Its algorithm differs somewhat from that of the UAX specifications, in that the rules match runs of text rather than testing pre or post context around a potential break. (This was intended as a first step in driving a revised algorithm back to the specifications, a task that hasn't yet happened.)

    The original plan was to retire the old monkey test in favor of the newer data-driven one, but each tends to uncover problems that the other misses, so they both remain.

9. **Original Monkey Test**

    To run the test:

        LD_LIBRARY_PATH=../../lib:../../stubdata:../../tools/ctestfw:$LD_LIBRARY_PATH ./intltest rbbi/RBBITest/TestMonkey@"type=char loop=-1"

    Test parameters (following the '@'):

        seed=nnnnn        Random number starting seed.
                          Setting the seed allows errors to be reproduced.
        loop=nnn          Looping count. Controls running time.
                          -1: run forever.
                          0 or greater: run length.

        type = char | word | line | sent | title

    Updating the test with new or revised rules requires changing the test source code, in `icu4c/source/test/intltest/rbbitst.cpp`. Look for the classes RBBICharMonkey, RBBIWordMonkey, RBBISentMonkey and RBBILineMonkey. The body of each class tracks the corresponding UAX-14 or UAX-29 specifications in defining the character classes and break rules.

    After making changes, as a final check, let the test run for an extended period of time, on the order of several hours.
    Run it from a terminal, and just interrupt it (Ctrl-C) when it has run long enough.

10. **New Monkey Test**

    To run the test:

        intltest rbbi/RBBIMonkeyTest/testMonkey@rules=grapheme.txt,loop=-1

    The @rules parameter is the test rules file to run; test rules files are located in the directory `icu4c/source/test/testdata/break_rules`.

    The test should initially fail, because ICU's library rules have been updated (steps 3 and 4), but the reference rules used
    by this test have not yet been updated.

    Make the updates to the test rules and re-run. The rule syntax is described in
    `icu4c/source/test/testdata/break_rules/README.md`.
    The test reference rules are in this same directory.

    Again, after everything appears to be working, let the test run for an extended length of time. Long runs are especially important with the more complex break rule sets, such as line break.

11. **Tailorings**

    If this is an update to word or line break root behavior, the rule changes must be propagated from the root rule files to the tailored files, for both the main rules (`source/data/brkitr/rules/*`) and the monkey test rules (`source/test/testdata/break_rules/*`).

    The easiest and safest way to do this is to create a patch file of the diffs to the root rule file, typically using `git diff`.
    Apply it to the various tailorings of the break type being updated.

    Merge conflicts when applying the patch indicate that the same rules modified by the new change were also modified in the tailoring. When this happens, you just have to dig in, understand what the intent of the rules and of the tailoring was, and figure out what makes sense. Fortunately, conflicts are not common.

    As with the main rules, after everything appears to be working, run the rule based monkey test for an extended period of time (with loop=-1).

### ICU4J

1. **Copy the Data Driven Test File to ICU4J**

    Copy the file `rbbitst.txt` from ICU4C to ICU4J, and run the Java test. It should fail until the rules are updated.

        cd <top level icu directory>
        cp icu4c/source/test/testdata/rbbitst.txt icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt

    Run the test from Eclipse.

    Navigate to `/icu4j-core-tests/src/com/ibm/icu/dev/test/rbbi/RBBITestExtended.java`.

    Select the `TestExtended()` function in the source code, right-click it and choose "Run As JUnit Test".

    Errors (expected because the break rules have not yet been updated) should show the failing line in `rbbitst.txt`. For example:

        java.lang.AssertionError: Forward Iteration, break found, but not expected. Pos=2 File line,col= 175, 21

2. **Refresh ICU4J data from ICU4C**.

    This will bring over the updated break rules, refreshing the file `main/shared/data/icudata.jar`. Follow the instructions from `icu4c/source/data/icu4j-readme.txt`.

    Rerun the ICU4J tests. `TestExtended` should now pass. Others may start failing.

3. **Port the code-based Monkey Test changes from ICU4C**

    ICU4C file to port from: `source/test/intltest/rbbitst.cpp`

    ICU4J file to port to: `main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBITestMonkey.java`

    To conveniently run the individual tests, look for the test functions `TestCharMonkey()`, `TestWordMonkey()`, etc. in `RBBITestMonkey.java`.

    Test parameters are passed via the Eclipse Run Configuration settings, arguments tab, VM parameters. For example,

        -ea -Dseed=554654 -Dloop=1

    When the test appears to be working, run for an extended time (with -Dloop=-1).

4. **New (rule driven) Monkey Test**

    Copy the updated monkey test rules from ICU4C and run the test.

    ICU4C directory, to copy from: `source/test/testdata/break_rules/`

    ICU4J directory, to copy to: `main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/`

    Then rerun the rule based monkey test, in the file `main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBIMonkeyTest.java`. Find the test function `TestMonkey()`; it includes comments describing how to run it with parameters from Eclipse.

    Run the test(s) for the changed rules for an extended amount of time (with -Dloop=-1).
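
Because ICU4J steps 1 and 4 rely on plain file copies, a quick comparison can confirm that nothing was missed. The following is a minimal sketch, not part of ICU: the function name `check_rbbi_sync` is hypothetical, and it assumes the directory layout given in this document and that the ICU4J `break_rules` directory is meant to mirror the ICU4C one file for file.

```shell
# Sketch only: check_rbbi_sync is a hypothetical helper, not an ICU tool.
# It compares the ICU4J copies of the break iterator test data with their
# ICU4C sources, using the paths listed earlier in this document.
#
# Usage: check_rbbi_sync [top-level-icu-directory]
# Prints any differences; returns 0 if everything matches, 1 otherwise.
check_rbbi_sync() {
    root="${1:-.}"
    c_data="$root/icu4c/source/test/testdata"
    j_data="$root/icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi"
    rc=0

    # Data driven test file (ICU4J step 1).
    diff "$c_data/rbbitst.txt" "$j_data/rbbitst.txt" || rc=1

    # Rule-based monkey test rules (ICU4J step 4).
    # Assumption: every file in the ICU4C directory is also copied to ICU4J.
    diff -r "$c_data/break_rules" "$j_data/break_rules" || rc=1

    return $rc
}
```

For example, run `check_rbbi_sync . && echo "in sync"` from the top level icu directory. If `diff -r` reports a file as present on only one side, decide whether that file is meant to be copied before treating the report as a failure.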