---
title: Specifying text break variants in locale IDs
---

# Specifying text break variants in locale IDs

| | |
|---|---|
| Author | Peter Edberg |
| Date | 2014-11-11, last update 2016-10-20 |
| Status | Proposal |
| Feedback to | pedberg (at) apple (dot) com |
| Bugs | See list below |

This proposal discusses options for extending Unicode locale identifiers to specify text break variants with a locale. It was prompted by CLDR and ICU bugs including the following, as well as by other requests:

- CLDR #[2142](http://unicode.org/cldr/trac/ticket/2142), Alternate Grapheme Clusters
- CLDR #[2161](http://unicode.org/cldr/trac/ticket/2161), Grapheme break iterator with legacy behavior
- CLDR #[2825](http://unicode.org/cldr/trac/ticket/2825), Add aksha grapheme break
- CLDR #[2975](http://unicode.org/cldr/trac/ticket/2975), Support legacy grapheme break
- CLDR #[4931](http://unicode.org/cldr/trac/ticket/4931), Provide mechanism for parameterizing linebreak, etc.
- CLDR #[7032](http://unicode.org/cldr/trac/ticket/7032), BCP47 for break exceptions
- CLDR #[8204](http://unicode.org/cldr/trac/ticket/8204), Other line break parameterization to support CSS word-break, etc.
- ICU #[9379](http://bugs.icu-project.org/trac/ticket/9379), Request to add Japanese linebreak tailoring selectable as variations
- ICU #[11248](http://bugs.icu-project.org/trac/ticket/11248), Improve C/J FilteredBreakIterator, move to draft
- ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530), More efficient representation for multiple line break rule sets
- ICU #[11531](http://bugs.icu-project.org/trac/ticket/11531), Update RBBI TestMonkey to test line break variants
- ICU #[11770](http://bugs.icu-project.org/trac/ticket/11770), BreakIterator should support new locale key "ss"
- ICU #[11771](http://bugs.icu-project.org/trac/ticket/11771), FilteredBreakIterator should move from i18n to common

## I. Options needed (as known so far)

### A. Grapheme cluster break

We need to choose one of the following (the current CLDR/ICU implementation uses extended grapheme clusters):

- Legacy UAX #29 grapheme clusters (also called spacing units).
- Extended UAX #29 grapheme clusters: legacy clusters, plus spacing combining marks in Indic scripts, Thai SARA AM, and Lao AM (but not other spacing vowels in SE Asian scripts).
- Aksaras (Indic & SE Asian consonant/vowel clusters or syllables): extended clusters, plus consonant-virama sequences and spacing vowels in SE Asian scripts.

### B. Word break

Word break currently uses dictionary-based break for sequences in CJK scripts (Han/Kana/Hangul) or SE Asian scripts (LineBreak property value SA/Complex\_Context: Thai, Lao, Khmer, Myanmar, etc.); we need a locale keyword that can turn this on or off (i.e. off to use basic UAX #29 word break), at least for CJK.

### C. Sentence break

We need a locale keyword to control use of ULI suppressions data (i.e. to determine whether we should wrap the UAX #29-based break iterator in a FilteredBreakIterator instance for the locale, and to determine which suppressions set to use).

### D. Line break (highest priority)

Currently ICU uses dictionary-based break for text in SE Asian scripts only. The two most important needs for line break control are:

- For Japanese text, control whether line breaks are allowed before small kana and before the prolonged sound mark 30FC; this corresponds to (most of) the distinction between CSS level 3 strict and normal line break (see below), and is implemented by treating LineBreak property value CJ as either NS (strict) or ID (normal).
- For Korean text, control whether the line break style is E. Asian style (breaks can occur in the middle of words) or “Western” style (breaks are space based), as described in UAX #14.

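
As a rough illustration of the Japanese case in the list above, the following Python sketch shows how a line break style could select the class that CJ characters resolve to. All names here are hypothetical and are not ICU or CLDR API. Only the CJ-to-NS/ID mapping comes from the text; the check itself models just the UAX #14 behavior that no break is allowed before NS while a break between two ID characters is allowed.

```python
# Hypothetical sketch; none of these names are ICU or CLDR API.

# Class that Line_Break=CJ characters (small kana, U+30FC) resolve to
# for each style named in the text above.
CJ_RESOLUTION = {
    "strict": "NS",  # nonstarter: no break before small kana or U+30FC
    "normal": "ID",  # ideographic: break allowed before them
}


def resolve_class(lb_class: str, style: str) -> str:
    """Resolve CJ for the given style; leave every other class unchanged."""
    if lb_class != "CJ":
        return lb_class
    try:
        return CJ_RESOLUTION[style]
    except KeyError:
        raise ValueError(f"unsupported line break style: {style!r}") from None


def break_before_small_kana(style: str) -> bool:
    """Whether a break is allowed between an ideograph (ID) and a following CJ character.

    Assumption: only the rule forbidding a break before NS is modelled;
    a real implementation applies the full UAX #14 rule set.
    """
    return resolve_class("CJ", style) != "NS"


if __name__ == "__main__":
    print(break_before_small_kana("strict"))  # False
    print(break_before_small_kana("normal"))  # True
```

A table keyed by style keeps the style names separate from the remapping logic, so a further style could be added without touching the functions.
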

Other desirable capabilities include:

- In a CJK-language context, control over whether breaks are allowed in the middle of words in alphabetic scripts that normally use a space-based approach (e.g. Latin, Greek, Cyrillic). Currently fullwidth Latin letters have LineBreak property value ID and do allow such breaks, but normal Latin letters are AL and do not.
- In a CJK-language context, explicit control over whether characters with LineBreak property value AI resolve to ID or AL (UAX #14 recommends using resolved East Asian Width to do this, but in the absence of that or any other higher-level mechanism they default to AL). This is somewhat related to the previous bullet. Note that characters with value AI include some symbols, punctuation, superscript digits, modifier letters, etc.
- Full control over CSS line break styles, see below (these can be used to control most of the above line break features).

## II. Notes on CSS level 3 line break

(from draft of Jun 2015, [http://dev.w3.org/csswg/css-text/#line-breaking](http://dev.w3.org/csswg/css-text/#line-breaking))

CSS has two independent properties for controlling line break behavior:

### A. The line-break property

This is mainly about break behavior for punctuation and symbols, though it does affect small kana. The rules are intended to specify behavior that may be language-specific, but explicit rules are provided for CJK. Besides the “auto” value, there are three specific values for this property.

- **strict:** The most restrictive rules, for longer lines and/or ragged margins. Prevents break before small kana and before prolonged sound mark 30FC (this is the set of characters with LineBreak property value CJ, which have general category Lo or Lm).
- **normal:** Allows break before small kana and before prolonged sound mark 30FC. If the content language is Chinese or Japanese, also allows breaks before hyphen-like characters: ‐ U+2010, – U+2013, 〜 U+301C, ゠ U+30A0 (LineBreak property value BA for the first two, NS for the second two; general category Pd for all four).
- **loose:** The least restrictive, used for short lines as in newspapers. In addition to breaks allowed for normal, allows breaks before iteration marks (々 U+3005, 〻 U+303B, ゝ U+309D, ゞ U+309E, ヽ U+30FD, ヾ U+30FE, all with LineBreak property value NS and general category Lm) and breaks between characters with LineBreak property value IN (inseparable). If the content language is Chinese or Japanese, also allows breaks before certain centered punctuation marks, before suffixes and after prefixes.

### B. The word-break property

This only controls break opportunities between letter-like characters (including ideographs), and has 3 possible values. Symbols that break in the same way as letters are affected in the same way by these options.

- **normal:** Words break according to their customary rules. For Korean this specifies E. Asian style break behavior.
- **break-all:** Allow breaks within words (between any two “typographic letter units” of general category L or N) unless forbidden by a line-break setting. This is mainly intended for a primarily-CJK context to allow breaks in the middle of normal Latin, Cyrillic, and Greek words.
- **keep-all:** Prohibit breaks between letters regardless of line-break options, except where opportunities exist due to dictionary-based break. For Korean this option specifies “western”-style line break. This is also useful when short CJK snippets are included in text that is primarily in a language using space-based breaking.

## III. Proposed -u extension keys

### A. For control of grapheme cluster break

For gb, current default is extended.

```
<key name="gb" description="Grapheme cluster break type key">
  <type name="legacy" description="Grapheme break using UAX #29 legacy grapheme clusters"/>
  <type name="extended" description="Grapheme break using UAX #29 extended grapheme clusters"/>
  <type name="aksara" description="Grapheme break adding aksaras to extended grapheme clusters"/>
</key>
```

### B. For control of word break

(Type key not needed yet and values undetermined, just reserve it)

```<key name="wb" description="Word break type key">```

We will also need a word break parameter key to control whether dictionary-based word break is used, probably with separate control for at least CJ, Korean, and SE Asian scripts; no key is proposed yet.

### C. For control of sentence break

(Type key not needed yet and values undetermined, just reserve it)

```<key name="sb" description="Sentence break type key">```

For ss, current default is none.

```
<key name="ss" description="Sentence break parameter key to control use of suppressions data">
  <type name="none" description="Don’t use segmentation suppressions data"/>
  <type name="standard" description="Use segmentation suppressions data of type standard"/>
</key>
```

### D. For control of line break

The current proposal is to use the *type* to specify the CSS line-break property; this can be used in older implementations as e.g. “@lb=strict”. One or more additional parameter keywords are provided to permit control of the CSS word-break property and to permit control of whether AI is treated as AL or ID.

D1. Supporting CSS line-break

For lb, the current default is normal for the "ja" locale but strict for others (it should probably be normal for all, since the distinction is mainly relevant for Japanese); the discussion below assumes that change.

```
<key name="lb" description="Line break type key">
  <type name="strict" description="CSS lev 3 line-break=strict, e.g. treat CJ as NS"/>
  <type name="normal" description="CSS lev 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh"/>
  <type name="loose" description="CSS lev 3 line-break=loose"/>
</key>
```

D2. Supporting other controls including CSS word-break (for line break), first idea (2014-11)

For the other controls, including support of the CSS word-break property, I think it is best to have separate control over how certain sets of characters are treated:

- Treat Hangul (characters with LineBreak property value H2, H3, JL, JV, JT) per UAX #14 (default, for E. Asian break style), or as AL (for space-based break style, part of CSS word-break=keep-all).
- Treat characters with LineBreak property value ID per UAX #14 (default, for E. Asian break style) or as AL (for space-based break style, part of CSS word-break=keep-all). Is this correct or is the real goal just to eliminate breaks between ID?
- Treat alphabetic and numeric characters (General Category L and N) per UAX #14 (default), or as ID (to get behavior like CSS word-break=break-all).
- Treat characters with LineBreak property value AI as AL (default per UAX #14) or as ID.

```
<key name="lc" description="Line break class remapping">
  <type name="LB_CLASS_MAP_CODE" description="One or more linebreak class mapping codes, see xxx"/>
</key>
```

where LB\_CLASS\_MAP\_CODE is a sequence of one or more of the following codes (separated by - or \_):

- hang2al (treat Hangul as AL)
- id2al (treat ID as AL)
- alnum2id (treat normal alphabetic/numeric as ID)
- ai2id (treat AI as ID)

Then for example CSS lev 3 word-break=keep-all could be indicated as “-u-lc-hang2al-id2al”.

D3. Supporting CSS word-break (for line break), second idea (2015-07)

I now think explicit remapping of certain classes is the wrong approach for supporting the CSS word-break options for line break control:

- These options are not defined in terms of UAX #14 LineBreak property values, but rather in terms of general categories L and N.
- The specific definition of the CSS word-break options (and line-break options) may change somewhat over time; we need locale tags that map to the current CSS definition.
- We may *also* want other kinds of line-break controls whose behavior does *not* change, and whose behavior *may* be defined in terms of LineBreak property values (as with the proposal in section D2 above), but that is a separate consideration.

Thus I propose the following.

```
<key name="lw" description="Line break key for CSS lev 3 word-break options">
  <type name="normal" description="CSS lev 3 word-break=normal"/>
  <type name="breakall" description="CSS lev 3 word-break=break-all, allow breaks in words unless forbidden by lb setting"/>
  <type name="keepall" description="CSS lev 3 word-break=keep-all, prohibit breaks in words except for dictionary breaks"/>
</key>
```

### E. Other ideas

For linebreak control:

- CLDR #4391 proposed using “-u-lb-strictja” to specify CSS line-break=strict.
  - Mark Davis suggested that the -lb- keyword could take multiple values including all of those proposed for the separate -lc- keyword, thus eliminating the need for the -lc- keyword; for example, “-u-lb-strict-hang2al-id2al”.

Overall: Another suggestion goes further than the second bullet above: have just a single keyword to specify all break variants; it would be followed by a list of attributes that would all share a single namespace, and whose names would need to identify which type of break they affected.
Examples might include gblegacy, gbextend, gbaksara (use one to specify grapheme break); ssnone, ssstd (use one to specify sentence break suppressions); etc. While this consumes less of the -u keyword namespace, it is less flexible at mapping to values specified in resource attributes, such as different types of sentence break suppression data, unless significant restrictions are placed on those attribute values.

### F. Current status

F1. keyword -lb-

In the CLDR meeting of 2014-Nov-19, it was agreed to add the -lb- keyword with at least the values "strict", "normal" and "loose" for support of the corresponding CSS level 3 behavior; for legacy-style Unicode locale IDs using '@', "lb=" should be used. The implementation details are not yet determined or specified, nor are the details of any locale-specific override behavior.

Current (2015-02-18) work under CLDR #[4931](http://unicode.org/cldr/trac/ticket/4931) and ICU #[9379](http://bugs.icu-project.org/trac/ticket/9379) includes the following, approved in CLDR and ICU meetings:

1. Add new CLDR file common/bcp47/segmentation.xml (name OK?) with the following:

```
<key name="lb" description="Line break type key" since="27">
  <type name="strict" description="CSS level 3 line-break=strict, e.g. treat CJ as NS"/>
  <type name="normal" description="CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh"/>
  <type name="loose" description="CSS lev 3 line-break=loose"/>
</key>
```

2. In CLDR file common/dtd/ldmlICU.dtd, add "alt" as an attribute for the \<icu:line ...> element (and allow multiple \<icu:line ...> elements).
3. In ICU icu/trunk/source/data/xml/brkitr/ files such as root.xml, fi.xml, and ja.xml, add lines mapping the line break types to corresponding rule files, e.g.
in root:

```
<icu:line alt="loose" icu:dependency="line_loose.brk"/>
<icu:line alt="normal" icu:dependency="line_normal.brk"/>
<icu:line alt="strict" icu:dependency="line.brk"/>
```

(Note that we need to add brkitr locales for zh and zh\_Hant since they have non-standard CSS line break types, like ja)

4. In CLDR, update tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to handle the alts (3 added lines).
5. In ICU, add 6 new line break rule files in source/data/brkitr/ (and delete line\_ja.txt):

```
line_loose.txt
line_loose_cj.txt
line_loose_fi.txt
line_normal.txt
line_normal_cj.txt
line_normal_fi.txt
```

These result in an increase of about 630K bytes (2.5%) in the data file. These can be tailored out in cases for which it is a problem, either by deleting lines from the ICUdata/xml/brkitr/ files if building from CLDR data, or by deleting corresponding lines in the data/brkitr/\<locale>.txt files and deleting the unused files from BRK\_SOURCE in data/brkitr/brkfiles.mk. ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530) is to investigate a more efficient way of representing the line break rule variants.

Note that the CLDR representation of the line break rules has not yet been updated to match (the CLDR rules are currently ignored when generating ICU data).

6. In ICU4C, update BreakIterator::makeInstance to map locale to the correct ruleset (about 10 lines, not yet committed)... similar change in ICU4J.
7. Update testdata/rbbitst.txt to test the variants. More extensive monkey tests for the variants are covered by #[11531](http://bugs.icu-project.org/trac/ticket/11531).

F2. keywords -ss-, -lw-

Proposal for CLDR & ICU meetings 2015-Jul-08:

1. In CLDR file common/bcp47/segmentation.xml add the following (approved in CLDR meeting 2015-Jul-08):

```
<key name="lw" description="Line break key for CSS lev 3 word-break options" since="28">
  <type name="normal" description="CSS lev 3 word-break=normal"/>
  <type name="breakall" description="CSS lev 3 word-break=break-all, allow breaks in words unless forbidden by lb setting"/>
  <type name="keepall" description="CSS lev 3 word-break=keep-all, prohibit breaks in words except for dictionary breaks"/>
</key>
```

Default value is "normal". English names for the values are:

- normal: "Normal line breaks for words"
- breakall: "Allow line breaks in all words"
- keepall: "Prevent line breaks in all words"

```
<key name="ss" description="Sentence break parameter key to control use of suppressions data" since="28">
  <type name="none" description="Don’t use segmentation suppressions data"/>
  <type name="standard" description="Use segmentation suppressions data of type standard"/>
</key>
```

Current default value is "none". In the future we hope to make the default "standard". English names for the values are:

- none: "Normal sentence breaks per Unicode specification"
- standard: "Prevent sentence breaks after standard abbreviations"

2. In ICU BreakIterator, initial support will be incomplete (details for ICU4C below, similar approach in ICU4J):

a) In ICU4C BreakIterator::makeInstance, for kind = UBRK\_SENTENCE, if locale has key "ss" with value "standard", then call FilteredBreakIteratorBuilder on the result of BreakIterator::buildInstance to produce a new BreakIterator\* which supports the sentence break exceptions. Notes:

- Currently FilteredBreakIteratorBuilder does not have a way to support different segmentation suppression sets; it only supports the "standard" set.
- A BreakIterator produced in this way currently supports the next() method but not the other BreakIterator methods for moving through text (see [class details](http://icu-project.org/apiref/icu4c/classicu_1_1FilteredBreakIteratorBuilder.html#details)). This should be fixed fairly soon.

b) In ICU4C RuleBasedBreakIterator::handleNext and handlePrevious, for now we can implement an approximation of support for the key "lw" values by altering the character classes as follows (similar to the behavior in section D2 above):

- For "keepall", if the class is Hangul (H2, H3, JL, JV, JT) or ID, remap to AL.
- For "breakall", if the class is AL, HL, AI, or NU, remap to ID.

More complete support is dependent on a mechanism for turning on and off certain rules, see ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530).

## IV. Implementation notes

What I had in mind was that the break type selection (gb, lb) would be implemented by selection of different break table resources, while the parameter keywords (ss, lc) would be implemented in code (changing line break classes, perhaps with an annotation in the tables along the lines suggested in http://unicode.org/cldr/trac/ticket/4931). However, it is not clear how to implement selection of different tables given the current resource structure in ICU (which does not exactly mirror the CLDR structure).

### A. CLDR XML structure

Currently in CLDR we can have locale-specific break iterator data in icu/trunk/source/data/xml/brkitr/xx.xml, structured as follows; except for the suppressions data, this is ignored for building ICU data (segmentation type is GraphemeClusterBreak, WordBreak, LineBreak, SentenceBreak):

```
<ldml>
  <segmentations>
    <segmentation type="LineBreak">
      <variables>
        ….
      </variables>
      <segmentRules>
        ….
      </segmentRules>
    </segmentation>
    <segmentation type="SentenceBreak">
      <suppressions type="standard">
        …
      </suppressions>
    </segmentation>
  </segmentations>
</ldml>
```

We could add an attribute "alt" for \<segmentation> to specify the specific variant (corresponds to the value for the -gb or -lb keyword, for example), though this would currently be ignored for LDML to ICU conversion:

```<segmentation type="LineBreak" alt="strict">```

Handling of default values and elements without "alt" is discussed in section E below.

### B. ICU XML source structure

In ICU we have XML source data and generated txt data. The XML source structure is specified by

http://www.unicode.org/repos/cldr/trunk/common/dtd/ldmlICU.dtd

and currently looks like this for root (any locale-specific data uses a subset of this):

```
<special xmlns:icu="http://www.icu-project.org/">
  <icu:breakIteratorData>
    <icu:boundaries>
      <icu:grapheme icu:dependency="char.brk"/>
      <icu:word icu:dependency="word.brk"/>
      <icu:line icu:dependency="line.brk"/> <!-- or e.g. "line_xx.brk" in locale-specific data -->
      …
    </icu:boundaries>
    <icu:dictionaries>
      <icu:dictionary type="Hani" icu:dependency="cjdict.dict"/>
      <icu:dictionary type="Hira" icu:dependency="cjdict.dict"/>
      …
      <icu:dictionary type="Thai" icu:dependency="thaidict.dict"/>
    </icu:dictionaries>
  </icu:breakIteratorData>
</special>
```

Note that the following attributes for the boundaries subelements (icu:word etc.)
are defined in CLDR’s ICU DTD but currently unused:

```icu:class NMTOKEN #IMPLIED```

```icu:append NMTOKEN #IMPLIED```

```icu:import NMTOKEN #IMPLIED```

We could define an additional attribute "alt" and then use that to match the CLDR \<segmentation> alt attribute:

```
<icu:boundaries>
  <icu:grapheme icu:dependency="char.brk"/>
  <icu:grapheme alt="extended" icu:dependency="char.brk"/>
  <icu:grapheme alt="legacy" icu:dependency="char_legacy.brk"/>
  …
  <icu:line icu:dependency="line.brk"/>
  <icu:line alt="normal" icu:dependency="line.brk"/>
  <icu:line alt="strict" icu:dependency="line_strict.brk"/>
  …
</icu:boundaries>
```

### C. ICU txt resource structure

The ICU xml files (and the CLDR xml files, for suppressions data) are processed by CLDR tools such as cldr/trunk/tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to generate the text resources, for example:

```
root{
    boundaries{
        grapheme:process(dependency){"char.brk"}
        line:process(dependency){"line.brk"}
        …
        word:process(dependency){"word.brk"}
    }
    dictionaries{
        Hani:process(dependency){"cjdict.dict"}
        Hira:process(dependency){"cjdict.dict"}
        ...
        Thai:process(dependency){"thaidict.dict"}
    }
}

xx{
    boundaries{
        line:process(dependency){"line_xx.brk"}
    }
    exceptions{
        SentenceBreak:array{
            "Mr.",
            "Etc.",
            …
        }
    }
}
```

These files are read by BreakIterator::buildInstance(...) in ICU4C, with a type parameter that maps directly to the key in the boundaries resource: "grapheme", "line", etc. Currently there is no way to add attributes for the boundaries subelements such as line or word.
However, we could map the icu:alt values proposed in section B to resource keys with extensions where appropriate:

```
boundaries{
    grapheme:process(dependency){"char.brk"}
    grapheme_extended:process(dependency){"char.brk"}
    grapheme_legacy:process(dependency){"char_legacy.brk"}
    …
    line:process(dependency){"line.brk"}
    line_normal:process(dependency){"line.brk"}
    line_strict:process(dependency){"line_strict.brk"}
    …
}
```

BreakIterator::buildInstance is called by BreakIterator::makeInstance, which provides the type keys "grapheme", "line", etc. It could use the locale to construct the resource keys with extensions.

### D. Current dictionary break implementation

(See also the [relevant section of the ICU User Guide](http://userguide.icu-project.org/boundaryanalysis#TOC-Details-about-Dictionary-Based-Break-Iteration))

The use of dictionary break depends on the existence in the rules of a variable "$dictionary" which defines the UnicodeSet of characters for which dictionary break should be used.

For line break, this is defined as “```$dictionary = [:LineBreak = Complex_Context:];```” where the Line\_Break property value Complex\_Context is equivalent to SA and applies to most letters, marks, and some other signs in Southeast Asian scripts: Thai, Lao, Myanmar, Khmer, Tai Le, New Tai Lue, Tai Tham, Tai Viet, etc. For word break, in addition to characters with Line\_Break property value SA, the $dictionary set includes characters with script Han, Hiragana, Katakana, as well as composed Hangul syllables in the range \uAC00-\uD7A3 (not sure why the latter are included, since we do not have dictionary support for them).

In both cases, the rules are defined to disallow breaks between characters in the $dictionary set.
When determining the next or previous break, the iterator first determines the break using the normal rules (which will not break between characters in the $dictionary set); in the process it marks which characters are handled by a dictionary break engine (for each script that has a break dictionary, the associated break engine defines a more specific set of characters to which it applies). If characters handled by a dictionary break engine were encountered, the break iterator then invokes the dictionary break engines to determine breaks within the $dictionary-set span.

### E. Multiple rule sets that depend on break type

It would be nice for a given locale to be able to specify, for each break type, which variant is the default for that locale. In root this can just be done by using the resource key without any extension. In other locales, we could do something like this in the CLDR XML:

```
<segmentations>
  <default type="LineBreak" alt="strict"/>
  <default type="GraphemeClusterBreak" alt="legacy"/>
</segmentations>
```

## V. Acknowledgments

Thanks to Koji Ishii and the CLDR team for feedback on this document.