file: testdata/break_rules/readme.txt
Copyright (C) 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html#License

Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved.

This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey.
The rules in this directory track the boundary rules from Unicode UAX 14 and 29. They are interpreted
to provide an expected set of boundary positions to compare with the results from ICU break iteration.

ICU4J also includes copies of the test reference rules, located in the directory
main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/
The copies should be kept synchronized; there should be no differences.

Each set of reference break rules lives in a separate file.
The list of rule files to run by default is hard coded into the test code, in rbbimonkeytest.cpp.

Each test file includes
  - The type of ICU break iterator to create (word, line, sentence, etc.)
  - The locale to use
  - Character Class definitions
  - Rule definitions

To Do
  - Extend the syntax to support rule tailoring.


Character Class Definition:
    name = set_regular_expression;

Rule Definition:
    rule_regular_expression;

name:
    [A-Za-z_][A-Za-z0-9_]*

set_regular_expression:
    The intersection of an ICU regular expression [set] expression and a UnicodeSet pattern.
    (They are mostly the same)
    May include previously defined set names, which are logically expanded in-place.

rule_regular_expression:
    An ICU Regular Expression.
    May include set names, which are logically expanded in-place.
    May include a '÷', which defines a boundary position.

Application of the rules:
    Matching begins at the start of text, or after a previously identified boundary.
    The pseudo-code below finds the next boundary.

        while position < end of text
            for each rule
                if the text at position matches this rule
                    if the rule has a '÷'
                        Boundary is found.
                        return the position of the '÷' within the match.
                    else
                        position = last character of the rule match.
                        break from the inner rule loop, continue the outer loop.

    This differs from the Unicode UAX algorithm in that each position in the text is
    not tested separately. Instead, when a rule match is found, rule application restarts with the last
    character of the preceding rule match. ICU's break rules also operate this way.

    Expressing rules this way simplifies UAX rules that have leading or trailing context; it
    is no longer necessary to write expressions that match the context starting from
    any position within it.

    This rule form differs from ICU rules in that the rules are applied sequentially, as they
    are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel.

Word Dictionaries
    The monkey test does not test dictionary based breaking. The set named 'dictionary' is special,
    as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are
    included in the randomly-generated test data.
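Illustration of the rule application loop:
    The sequential matching described under "Application of the rules" can be sketched
    in Python roughly as follows. This is only an illustration, not the code in
    rbbimonkeytest.cpp. It assumes Python's re module in place of ICU regular
    expressions, does no set-name expansion, and uses hypothetical names throughout
    (compile_rule, next_boundary, all_boundaries, SAMPLE_RULES). The sample rules are a
    made-up, much reduced grapheme-like set, not the contents of any rule file in this
    directory. The sketch also assumes that every rule without a '÷' matches at least
    two characters and that the last rule matches any character, and it raises an error
    otherwise rather than loop forever.

```python
import re

BREAK_MARK = "÷"


def compile_rule(source):
    """Compile one rule; '÷' becomes an empty named group marking the boundary."""
    has_break = BREAK_MARK in source
    pattern = source.replace(BREAK_MARK, "(?P<brk>)", 1)
    return re.compile(pattern, re.DOTALL), has_break


def next_boundary(text, start, rules):
    """Return the first boundary after `start`, trying the rules in order."""
    position = start
    while position < len(text):
        for regex, has_break in rules:
            match = regex.match(text, position)
            if match is None:
                continue
            if has_break:
                boundary = match.start("brk")
                if boundary <= start:
                    raise ValueError("rule produced no forward progress")
                return boundary
            # No '÷': restart rule application at the last character of the match.
            last_char = match.end() - 1
            if last_char <= position:
                raise ValueError("non-breaking rule must span two or more characters")
            position = last_char
            break
        else:
            raise ValueError("no rule matched at position %d" % position)
    return len(text)


def all_boundaries(text, rule_sources):
    """Collect every boundary, beginning with position 0."""
    rules = [compile_rule(source) for source in rule_sources]
    boundaries = [0]
    while boundaries[-1] < len(text):
        boundaries.append(next_boundary(text, boundaries[-1], rules))
    return boundaries


# Hypothetical, heavily simplified rules (assumption: not taken from a real rule file).
SAMPLE_RULES = [
    r"\r\n",              # no break between CR and LF
    r"[\r\n]÷",           # break after CR or LF
    r".÷[\r\n]",          # break before CR or LF
    r".[\u0300-\u036f]",  # no break before a combining mark
    r".÷",                # otherwise, break after any character
]

print(all_boundaries("a\u0301b\r\nc", SAMPLE_RULES))  # [0, 2, 3, 5, 6]
```

    In the sample, the fourth rule has no '÷', so matching "a" plus the combining mark
    moves the position onto the mark itself, and the final rule then reports the
    boundary after it. This is the "restart with the last character of the preceding
    rule match" behavior described above.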