• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1---
2title: Specifying text break variants in locale IDs
3---
4
5# Specifying text break variants in locale IDs
6
7|   |   |
8|---|---|
9| Author | Peter Edberg |
10| Date | 2014-11-11, last update 2016-10-20 |
11| Status | Proposal |
12| Feedback to | pedberg (at) apple (dot) com |
13| Bugs | See list below |
14
15This proposal discusses options for extending Unicode locale identifiers to specify text break variants with a locale. It was prompted by CLDR and ICU bugs including the following, as well as by other requests:
16
17- CLDR #[2142](http://unicode.org/cldr/trac/ticket/2142), Alternate Grapheme Clusters
18- CLDR #[2161](http://unicode.org/cldr/trac/ticket/2161), Grapheme break iterator with legacy behavior
19- CLDR #[2825](http://unicode.org/cldr/trac/ticket/2825), Add aksha grapheme break
20- CLDR #[2975](http://unicode.org/cldr/trac/ticket/2975), Support legacy grapheme break
21- CLDR #[4931](http://unicode.org/cldr/trac/ticket/4931), Provide mechanism for parameterizing linebreak, etc.
22- CLDR #[7032](http://unicode.org/cldr/trac/ticket/7032), BCP47 for break exceptions
23- CLDR #[8204](http://unicode.org/cldr/trac/ticket/8204), Other line break parameterization to support CSS word-break, etc.
24- ICU #[9379](http://bugs.icu-project.org/trac/ticket/9379), Request to add Japanese linebreak tailoring selectable as variations
25- ICU #[11248](http://bugs.icu-project.org/trac/ticket/11248), Improve C/J FilteredBreakIterator, move to draft
26- ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530), More efficient representation for multiple line break rule sets
27- ICU #[11531](http://bugs.icu-project.org/trac/ticket/11531), Update RBBI TestMonkey to test line break variants
28- ICU #[11770](http://bugs.icu-project.org/trac/ticket/11770), BreakIterator should support new locale key "ss"
29- ICU #[11771](http://bugs.icu-project.org/trac/ticket/11771), FilteredBreakIterator should move from i18n to common
30
31## I. Options needed (as known so far)
32
33### A. Grapheme cluster break
34
35Need to choose one of the following (current CLDR/ICU implementation uses extended grapheme clusters):
36
37- Legacy UAX #29 grapheme clusters (also called spacing units).
38- Extended UAX #29 grapheme clusters: legacy clusters plus also include spacing combining marks in Indic scripts, and Thai SARA AM and Lao AM (but not other spacing vowels in SE Asian scripts).
39- Aksaras (Indic & SE Asian consonant/vowel clusters or syllables): extended clusters plus also include consonant-virama sequences, and spacing vowels in SE Asian scripts.
40
41### B. Word break
42
43Currently uses dictionary-based break for sequences in CJK scripts (Han/Kana/Hangul) or SE Asian scripts (LineBreak property value SA/Complex\_Context: Thai, Lao, Khmer, Myanmar, etc.); we need a locale keyword that can turn this on or off (i.e. off to use basic UAX #29 word break), at least for CJK.
44
45### C. Sentence break
46
47We need a locale keyword to control use of ULI suppressions data (i.e. to determine whether we should wrap the UAX29-based break iterator in a FilteredBreakIterator instance for the locale, and to determine which suppressions set to use).
48
49### D. Line break (highest priority)
50
51Currently ICU uses dictionary-based break for text in SE Asian scripts only. The two most important needs for line break control are:
52
53- For Japanese text, control whether line breaks are allowed before small kana and before the prolonged sound mark 30FC; this corresponds to (most of) the distinction between CSS level 3 strict and normal line break (see below), and is implemented by treating LineBreak property value CJ as either NS (strict) or ID (normal).
54- For Korean text, control whether the line break style is E. Asian style (breaks can occur in the middle of words) or “Western” style (breaks are space based), as described in UAX 14.
55
56Other desirable capabilities include:
57
58- In a CJK-language context, control over whether breaks are allowed in the middle of words in alphabetic scripts that normally use a space-based approach (e.g. Latin, Greek, Cyrillic). Currently fullwidth Latin letters have LineBreak property value ID and do allow such breaks, but normal Latin letters are AL and do not.
59- In a CJK-language context, explicit control over whether characters with LineBreak property value AI resolve to ID or AL (UAX 14 recommends using resolved East Asian Width to do this, but in the absence of that or any other higher-level mechanism they default to AL). This is somewhat related to the previous bullet. Note that characters with value AI include some symbols, punctuation, superscript digits, modifier letters, etc.
60- Full control over CSS line break styles, see below (these can be used to control most of the above line break features)
61
62## II. Notes on CSS level 3 line break
63
64(from draft of Jun 2015, [http://dev.w3.org/csswg/css-text/#line-breaking](http://dev.w3.org/csswg/css-text/#line-breaking))
65
66CSS has two independent properties for controlling line break behavior:
67
68### A. The line-break property
69
70This is mainly about break behavior for punctuation and symbols, though it does affect small kana. The rules are intended to specify behavior that may be language-specific, but explicit rules are provided for CJK. Besides the “auto” value, there are three specific values for this property.
71
72- **strict:** The most restrictive rules, for longer lines and/or ragged margins. Prevents break before small kana and before prolonged sound mark 30FC (this is the set of characters with LineBreak property value CJ, which have general category Lo or Lm).
73- **normal:** Allows break before small kana and before prolonged sound mark 30FC. If the content language is Chinese or Japanese, also allows breaks before hyphen like characters: ‐ U+2010, – U+2013, ~ U+301C, ゠ U+30A0 (LineBreak property value BA for the first two, NS for the second two; general category Pd for all four).
74- **loose:** The least restrictive, used for short lines as in newspapers. In addition to breaks allowed for normal, allows breaks before iteration marks (々 U+3005, 〻 U+303B, ゝ U+309D, ゞ U+309E, ヽ U+30FD, ヾ U+30FE, all with LineBreak property value NS and general category Lm) and breaks between characters with LineBreak property value IN (inseparable). If the content language is Chinese or Japanese, also allows breaks before certain centered punctuation marks, before suffixes and after prefixes.
75
76### B. The word-break property
77
78This only controls break opportunities between letter-like characters (including ideographs), and has 3 possible values. Symbols that break in the same way as letters are affected in the same way by these options.
79
80- **normal:** Words break according to their customary rules. For Korean this specifies E. Asian style break behavior.
81- **break-all:** Allow breaks within words (between any two “typographic letter units” of general category L or N) unless forbidden by a line-break setting. This is mainly intended for a primarily-CJK context to allow breaks in the middle of normal Latin, Cyrillic, and Greek words.
82- **keep-all:** Prohibit breaks between letters regardless of line-break options, except where opportunities exists due to dictionary-based break. For Korean this option specifies “western”-style line break. This is also useful when short CJK snippets are included in text that is primarily in a language using space-based breaking.
83
84## III. Proposed -u extension keys
85
86### A. For control of grapheme cluster break
87
88For gb, current default is extended.
89
90```
91<key name="gb" description="Grapheme cluster break type key">
92
93	<type name="legacy" description=“Grapheme break using UAX #29 legacy grapheme clusters"/>
94
95	<type name="extended" description="Grapheme break using UAX #29 extended grapheme clusters"/>
96
97	<type name="aksara" description="Grapheme break adding aksaras to extended grapheme clusters”/>
98```
99
100### B. For control of word break
101
102(Type key not needed yet and values undetermined, just reserve it)
103
104```<key name="wb" description="Word break type key">```
105
106Will also need a word break parameter key to control whether dictionary-based work break is used, probably need separate control for at least for CJ, Korean, and SEAsian scripts; no key proposed yet.
107
108### C. For control of sentence break
109
110(Type key not needed yet and values undetermined, just reserve it)
111
112```<key name="sb" description="Sentence break type key">```
113
114For ss, current default is none.
115
116```
117<key name=“ss” description=“Sentence break parameter key to control use of suppressions data”>
118
119	<type name=“none” description="Don’t use segmentation suppressions data"/>
120
121	<type name=“standard” description="Use segmentation suppressions data of type standard"/>
122```
123
124### D. For control of line break
125
126The current proposal is to use the *type* to specify the CSS line-break property; this can be used in older implementations as e.g. “@lb=strict”. One or more additional parameter keywords are provided to permit control of the CSS word-break property and to permit control of whether AI is treated as AL or ID.
127
128D1. Supporting CSS line-break
129
130For lb, the current default is normal for the "ja" locale, but strict for others; it should probably be normal for all since the distinction is mainly relevant for Japanese), and the discussion below assumes that change.
131
132```
133<key name="lb" description="Line break type key">
134
135	<type name="strict" description="CSS lev 3 line-break=strict, e.g. treat CJ as NS"/>
136
137	<type name="normal" description="CSS lev 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh"/>
138
139	<type name="loose" description="CSS lev 3 line-break=loose"/>
140```
141
142D2. Supporting other controls including CSS word-break (for line break), first idea (2014-11)
143
144For the other controls, including support of the CSS word-break property, I think it is best to have separate control over how certain sets of characters are treated:
145
146- Treat Hangul (characters with LineBreak property value H2, H3, JL, JV, JT) per UAX 14 (default, for E. Asian break style), or as AL (for space-based break style, part of CSS word-break=keep-all).
147- Treat characters with LineBreak property value ID per UAX #14 (default, for E. Asian break style) or as AL (for space-based break style, part of CSS word-break=keep-all). Is this correct or is the real goal just to eliminate breaks between ID?
148- Treat alphabetic and numeric characters (General Category L and N) per UAX #14 (default), or as ID (to get behavior like CSS word-break=break-all).
149- Treat characters with LineBreak property value AI as AL (default per UAX #14) or as ID.
150
151```
152<key name="lc” description="Line break class remapping“>
153
154	<type name=“LB_CLASS_MAP_CODE” description=“One or more linebreak class mapping codes, see xxx”/>
155```
156
157where LB\_CLASS\_MAP\_CODE is a sequence of one or more of the following codes (separated by - or \_):
158
159- hang2al (treat Hangul as AL)
160- id2al (treat ID as AL)
161- alnum2id (treat normal alphabetic/numeric as ID)
162- ai2id (treat AI as ID)
163
164Then for example CSS lev 3 word-break=keep-all could be indicated as “-u-lc-hang2al-id2al”.
165
166D3. Supporting CSS word-break (for line break), second idea (2015-07)
167
168I now think explicit remapping of certain classes is the wrong approach for supporting the CSS word-break options for line break control:
169
170- These options are not defined in terms of UAX #14 LineBreak property values, but rather in terms of general categories L and N.
171- The specific definition of the CSS word-break options (and line-break options) may change somewhat over time; we need locale tags that map to the current CSS definition.
172- We may *also* want other kinds of line-break controls whose behavior does *not* change, and whose behavior *may* be defined in terms of LineBreak property values (as with the proposal in section D2 above), but that is a separate consideration.
173
174Thus I propose the following .
175
176```
177<key name="lw" description="Line break key for CSS lev 3 word-break options">
178
179	<type name="normal" description="CSS lev 3 word-break=normal"/>
180
181	<type name="breakall" description="CSS lev 3 word-break=break-all, allow breaks in words unless forbidden by lb setting"/>
182
183	<type name="keepall" description="CSS lev 3 word-break=keep-all, prohibit breaks in words except for dictionary breaks"/>
184```
185
186### E. Other ideas
187
188For linebreak control:
189
190- CLDR #4391 proposed using “-u-lb-strictja” to specify CSS line-break=strict.
191	- Mark Davis suggested that the -lb- keyword could take multiple values including all of those proposed for the separate -lc- keyword, thus eliminating the need for the -lc- keyword; for example, “-u-lb-strict-hang2al-id2al”.
192
193Overall: Another suggestion goes further than the second bullt above: Have just a single keyword to specify all break variants; it would be followed by a list of attributes that would all share a single namespace, and whose names would need to identify which type of break they affected. Examples might include gblegacy, gbextend, gbaksara (use one to specify grapheme break); ssnone, ssstd (use one to specify sentence break suppressions); etc. While this consumes less of the -u keyword namespace, it is less flexible at mapping to values specified in resource attributes, such as different types of sentence break suppression data, unless significant restrictions are placed on those attribute values.
194
195### F. Current status
196
197F1. keyword -lb-
198
199In the CLDR meeting of 2014-Nov-19, it was agreed to add the -lb- keyword with at least the values "strict", "normal" and "loose" for support of the corresponding CSS level 3 behavior; for legacy-style Unicode locale IDs using '@', "lb=" should be used. The implementation details are not yet determined or specified, nor are the details of any locale-specific override behavior.
200
201Current (2015-02-18) work under CLDR #[4931](http://unicode.org/cldr/trac/ticket/4931) and ICU #[9379](http://bugs.icu-project.org/trac/ticket/9379) includes the following, approved in CLDR and ICU meetings:
202
2031. Add new CLDR file common/bcp47/segmentation.xml (name OK?) with the following:
204
205```
206<key name="lb" description="Line break type key" since="27">
207
208	<type name="strict" description="CSS level 3 line-break=strict, e.g. treat CJ as NS"/>
209
210	<type name="normal" description="CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh"/>
211
212	<type name="loose" description="CSS lev 3 line-break=loose"/>
213
214</key>
215```
216
2172. In CLDR file common/dtd/ldmlICU.dtd, add "alt" as an attribute for the \<icu:line ...> element (and allow multiple \<icu:line ...> elements).
2183. In ICU icu/trunk/source/data/xml/brkitr/ files such as root.xml, fi.xml, and ja.xml, add lines mapping the line break types to corresponding rule files, e.g. in root:
219
220```
221<icu:line alt="loose" icu:dependency="line_loose.brk"/>
222
223<icu:line alt="normal" icu:dependency="line_normal.brk"/>
224
225<icu:line alt="strict" icu:dependency="line.brk"/>
226```
227
228(Note that we need to add brkitr locales for zh and zh\_Hant since they have non-standard CSS line break types, like ja)
229
2304. In CLDR, update tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to handle the alts (3 added lines).
2315. In ICU, add 6 new line break rule files in source/data/brkitr/ (and delete line\_ja.txt):
232
233```
234line_loose.txt
235
236line_loose_cj.txt
237
238line_loose_fi.txt
239
240line_normal.txt
241
242line_normal_cj.txt
243
244line_normal_fi.txt
245```
246
247These result in an increase of about 630K bytes (2.5%) in the data file. These can be tailored out in cases for which it is a problem, either by deleting lines from the ICUdata/xml/brkitr/ files if building from CLDR data, or by deleting corresponding lines in the data/brkitr/\<locale>.txt files and deleting the unused files from BRK\_SOURCE in data/brkitr/brkfiles.mk. #[11530](http://bugs.icu-project.org/trac/ticket/11530) is to investigate a more efficient way of representing the line break rule variants.
248
249Note that the CLDR representation of the line break rules have not yet been updated to match (they are currently ignored when generating ICU data).
250
2516. In ICU4C, update BreakIterator::makeInstance to map locale to the correct ruleset (about 10 lines, not yet committed)... similar change in ICU4J.
2527. Update testate/rbbitst.txt to test the variants. More extensive monkey tests for the variants are covered by #[11531](http://bugs.icu-project.org/trac/ticket/11531).
253
254F2. keywords -ss-, -lw-
255
256Proposal for CLDR & ICU meetings 2015-Jul-08:
257
2581. In CLDR file common/bcp47/segmentation.xml add the following (approved in CLDR meeting 2015-Jul-08):
259
260```
261<key name="lw" description="Line break key for CSS lev 3 word-break options" since="28">
262
263	<type name="normal" description="CSS lev 3 word-break=normal"/>
264
265	<type name="breakall" description="CSS lev 3 word-break=break-all, allow breaks in words unless forbidden by lb setting"/>
266
267	<type name="keepall" description="CSS lev 3 word-break=keep-all, prohibit breaks in words except for dictionary breaks"/>
268
269</key>
270```
271
272Default value is "normal". English names for the values are:
273
274• normal: "Normal line breaks for words"
275
276• breakall: "Allow line breaks in all words"
277
278• keepall: "Prevent line breaks in all words"
279
280```
281<key name="ss" description="Sentence break parameter key to control use of suppressions data" since="28">
282
283<type name="none" description="Don’t use segmentation suppressions data"/>
284
285<type name="standard" description="Use segmentation suppressions data of type standard"/>
286
287</key>
288```
289
290Current default value is "none". In the future we hope to make the default "standard". English names for the values are:
291
292- normal: "Normal sentence breaks per Unicode specification"
293- standard: "Prevent sentence breaks after standard abbreviations"
294
2952. In ICU BreakIterator, initial support will be incomplete (details for ICU4C below, similar approach in ICU4J):
296
297a) In ICU4C BreakIterator::makeInstance, for kind = UBRK\_SENTENCE, if locale has key "ss" with value "standard", then call FilteredBreakIteratorBuilder on the result of BreakIterator::buildInstance to produce a new BreakIterator\* which supports the sentence break exceptions. Notes:
298
299- Currently FilteredBreakIteratorBuilder does not have a way to support different segmentation suppression sets, it only supports the "standard" set.
300- A BreakIterator produced in this way currently supports the next() method but not the other BreakIterator methods for moving through text (see [class details](http://icu-project.org/apiref/icu4c/classicu_1_1FilteredBreakIteratorBuilder.html#details)). This should be fixed fairly soon.
301
302b) In ICU4C RuleBasedBreakIterator::handleNext and handlePrevious, for now we can implement an approximation of support for the key "lw" values by alteriing the character classes as follows (similar to the behavior in section D2 above):
303
304- For "keepall", if the class is Hangul (H2, H3, JL, JV, JT) or ID, remap to AL
305- For "breakall", if the class is AL, HL, AI, or NU, remap to ID.
306
307More complete support is dependent on a mechanism for turning on and off certain rules, see ICU #[11530](http://bugs.icu-project.org/trac/ticket/11530).
308
309## IV. Implementation notes
310
311What I had in mind was that the break type selection (gb, lb) would be implemented by selection of different break table resources, while the parameter keywords (ss, lc) would be implemented in code (changing line break classes, perhaps with an annotation in the tables along the lines suggested in http://unicode.org/cldr/trac/ticket/4931). However, it is not clear how to implement selection of different tables given the current resource structure in ICU (which does not exactly mirror the CLDR structure).
312
313### A. CLDR XML structure
314
315Currently in CLDR we can have a structure locale-specific break iterator data icu/trunk/source/data/xml/brkitr/xx.xml as follows; except for the suppressions data, this is otherwise ignored for building ICU data (segmentation type is GraphemeClusterBreak, WordBreak, LineBreak, SentenceBreak):
316
317```
318<ldml>
319
320	<segmentations>
321
322		<segmentation type="LineBreak">
323
324			<variables>
325
326				….
327
328			</variables>
329
330			<segmentRules>
331
332				….
333
334			</segmentRules>
335
336		<segmentation type="SentenceBreak">
337
338			<suppressions type="standard">
339
340341
342			</suppressions
343```
344
345We could an attribute "alt" for \<segmentation> to specify the specific variant (corresponds to the value for the -gb or -lb keyword, for example), though this would currently be ignored for LDML to ICU conversion:
346
347```<segmentation type="LineBreak" alt="strict">```
348
349Handling of default values and elements without "alt" is discussed in section E below.
350
351### B. ICU XML source structure
352
353In ICU we have XML source data and generated txt data. The XML source structure is specified by
354
355http://www.unicode.org/repos/cldr/trunk/common/dtd/ldmlICU.dtd
356
357and currently looks like this for root (any locale-specific data uses a subset of this):
358
359```
360<special xmlns:icu="http://www.icu-project.org/">
361
362	<icu:breakIteratorData>
363
364		<icu:boundaries>
365
366			<icu:grapheme icu:dependency="char.brk"/>
367
368			<icu:word icu:dependency="word.brk"/>
369
370			<icu:line icu:dependency="line.brk"/> or e.g. "line_xx.brk" in locale-specific data
371
372373
374		</icu:boundaries>
375
376		<icu:dictionaries>
377
378			<icu:dictionary type="Hani" icu:dependency="cjdict.dict"/>
379
380			<icu:dictionary type="Hira" icu:dependency="cjdict.dict"/>
381
382383
384			<icu:dictionary type="Thai" icu:dependency="thaidict.dict"/>
385
386		</icu:dictionaries>
387
388	</icu:breakIteratorData>
389
390</special>
391```
392
393Note that the following attributes for the boundaries subelements (icu:word etc.) are defined in CLDR’s ICU DTD but currently unused:
394
395```icu:class NMTOKEN #IMPLIED```
396
397```icu:append NMTOKEN #IMPLIED```
398
399```icu:import NMTOKEN #IMPLIED```
400
401We could define an additional attribute "alt" and then use that to match the CLDR \<segmentations> alt attribute:
402
403```
404<icu:boundaries>
405
406	<icu:grapheme icu:dependency="char.brk"/>
407
408	<icu:grapheme alt="extended" icu:dependency="char.brk"/>
409
410	<icu:grapheme alt="legacy" icu:dependency="char_legacy.brk"/>
411
412413
414	<icu:line icu:dependency="line.brk"/>
415
416	<icu:line alt="normal" icu:dependency="line.brk"/>
417
418	<icu:line alt="strict" icu:dependency="line_strict.brk"/>
419
420421
422</icu:boundaries>
423```
424
425### C. ICU txt resource structure
426
427The ICU xml files (and the CLDR xml files, for suppressions data) are processed by CLDR tools such as cldr/trunk/tools/java/org/unicode/cldr/icu/BreakIteratorMapper.java to generate the text resources, for example:
428
429```
430root{
431
432	boundaries{
433
434		grapheme:process(dependency){"char.brk"}
435
436		line:process(dependency){"line.brk"}
437
438439
440		word:process(dependency){"word.brk"}
441
442	}
443
444	dictionaries{
445
446		Hani:process(dependency){"cjdict.dict"}
447
448		Hira:process(dependency){"cjdict.dict"}
449
450		...
451
452		Thai:process(dependency){"thaidict.dict"}
453
454	}
455
456}
457
458xx{
459
460	boundaries{
461
462		line:process(dependency){"line_xx.brk"}
463
464	}
465
466	exceptions{
467
468		SentenceBreak:array{
469
470			"Mr.",
471
472			"Etc.",
473
474475
476		}
477
478	}
479
480}
481```
482
483These files are read by BreakIterator::buildInstance(...) in ICU4C, with a type parameter that maps directly to the key in the boundaries resource: "grapheme", "line", etc. Currently there is not a way to add attributes for the boundaries subelements such as line or word. However, we could map the icu:alt values proposed in section C to resource keys with extensions where appropriate:
484
485```
486boundaries{
487
488	grapheme:process(dependency){"char.brk"}
489
490	grapheme_extended:process(dependency){"char.brk"}
491
492	grapheme_legacy:process(dependency){"char_legacy.brk"}
493
494495
496	line:process(dependency){"line.brk"}
497
498	line_normal:process(dependency){"line.brk"}
499
500	line_strict:process(dependency){"line_strict.brk"}
501
502503
504}
505```
506
507BreakIterator::buildInstance is called by BreakIterator::makeInstance, which provides the type keys "grapheme", "line", etc. It could use the locale to construct the resource keys with extensions.
508
509### D. Current dictionary break implementation
510
511(See also the [relevant section of the ICU User Guide](http://userguide.icu-project.org/boundaryanalysis#TOC-Details-about-Dictionary-Based-Break-Iteration))
512
513The use of dictionary break depends on the existence in the rules of a variable "$dictionary" which defines the UnicodeSet of characters for which dictionary break should be used.
514
515For line break, this is defined as “```$dictionary = [:LineBreak = Complex_Context:];```” where the Line\_Break property value Complex\_Context is equivalent to SA and applies to most letters, marks, and some other signs in Southeast Asian scripts: Thai, Lao, Myanmar, Khmer, Tai Le, New Tai Lue, Tai Tham, Tai Viet, etc. For word break, in addition to characters with Line\_Break property value SA, the $dictionary set includes characters with script Han, Hiragana, Katakana, as well as composed Hangul syllables in the range \uAC00-\uD7A3 (not sure why the latter are included, since we do not have dictionary support for them).
516
517In both cases, the rules are defined to disallow breaks between characters in the $dictionary set. When determine the next or previous break, the iterator first determines the break using the normal rules (which will not break between characters in the $dictionary set); in the process it marks which characters are handled by a dictionary break engine (For each script that has a break dictionary, the associated break engine defines a more specific set of characters to which it applies). If characters handled by a dictionary break engine were encountered, the break iterator then invokes the dictionary break engines to determine breaks within the $dictionary-set span.
518
519### E. Multiple rule sets that depend on break type
520
521It would be nice for a given locale to be able to specify, for each break type, which variant is the default for that locale. In root this can just be done by using the resource key without any extension. In other locales, we could do something like this in the CLDR XML:
522
523```
524<segmentations>
525
526	<default type="LineBreak" alt="strict">
527
528	<default type="GraphemeClusterBreak" alt="legacy">
529```
530
531## V. Acknowledgments
532
533Thanks to Koji Ishii and the CLDR team for feedback on this document.
534
535
536