1---
2layout: default
3title: Customization
4nav_order: 3
5parent: Collation
6---
7<!--
8© 2020 and later: Unicode, Inc. and others.
9License & terms of use: http://www.unicode.org/copyright.html
10-->
11
12# Collation Customization
13{: .no_toc }
14
15## Contents
16{: .no_toc .text-delta }
17
181. TOC
19{:toc}
20
21---
22
23## Overview
24
25ICU uses the [CLDR root collation
26order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
27as a default starting point for ordering. (The CLDR root collation is based on
28the [UCA
29DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table).)
30Not all languages have sorting sequences that correspond with the root collation
31order because no single sort order can simultaneously encompass the specifics of
32all the languages. In particular, languages that share a script may sort the
33same letters differently.
34
35Therefore, ICU provides a data-driven, flexible, and run-time-customizable
36mechanism called "tailoring". Tailoring overrides the default order of code
37points and the values of the ICU Collation Service attributes.
38
39## Collation Rule
40
41A `RuleBasedCollator` is built from a rule string which changes the sort order of
42some characters and strings relative to the default order. An empty string (or
43one with only white space and comments) results in a collator that behaves like
44the root collator.
45
46A tailoring is specified via a string containing a set of rules. ICU implements
47the (CLDR) [LDML collation rule
48syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules). For more
49details see there.
50
Each rule contains a string of ordered characters that starts with an **anchor
point** or a **reset value**. For example, `"&a < g"` places "g"
after "a" and before "b"; the "a" itself does not change place. This rule has the
following sorting consequences:
55
56Without rule | With rule
57------------ | ---------
58Abernathy    | Abernathy
59apple        | apple
60bird         | green
61Boston       | bird
62Graham       | Boston
63green        | Graham
64
Note that only the word that starts with lowercase "g" has changed place. All the
words that sorted after the "a" and "A" words now sort after "green".
This includes "Graham"; "G" would have to be tailored separately, such as with
`"&a < g <<< G"`.
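
The effect of `&a < g` on the table above can be imitated with a toy model. The
following Python sketch does not call ICU. It assumes a hypothetical 26-letter
alphabet in which each letter has a primary weight and case is only a tertiary
difference, and it gives the tailored "g" a primary weight between "a" and "b".
The names `ALPHABET` and `make_key` are illustrative only.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def make_key(tailoring=None):
    """Return a sort-key function for a toy two-level collation.

    tailoring maps a single character to a (primary, tertiary) pair that
    overrides its default weights, loosely imitating a rule like "&a < g".
    """
    tailoring = tailoring or {}

    def weights(ch):
        if ch in tailoring:
            return tailoring[ch]
        # Default: primaries are spaced by 10 so tailored letters fit in between;
        # upper case differs from lower case only on the tertiary level.
        return (ALPHABET.index(ch.lower()) * 10, 1 if ch.isupper() else 0)

    def key(word):
        pairs = [weights(ch) for ch in word]
        # Compare all primaries first, then all tertiaries.
        return ([p for p, _ in pairs], [t for _, t in pairs])

    return key


WORDS = ["green", "Graham", "Boston", "bird", "apple", "Abernathy"]

root_order = sorted(WORDS, key=make_key())
# Toy equivalent of "&a < g": lowercase g gets a primary between a (0) and b (10).
tailored_order = sorted(WORDS, key=make_key({"g": (5, 0)}))

print(root_order)
print(tailored_order)
```

Because only lowercase "g" is remapped, "Graham" keeps its original position,
which mirrors the "With rule" column of the table.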
69
This is a simple example of a tailoring rule. A tailoring consists of
zero or more rules and zero or more options. There must be at least one rule or
at least one option. The rule syntax is discussed in more detail in the
following sections.
74
Note that the tailoring rules override the UCA ordering. In addition, if a
character is reordered, any other equivalent characters are automatically
reordered with it. For example, if the rule `&e<a` is used to reorder "a",
then "á" also sorts after "é".
79
80## Syntax
81
82The following table summarizes the basic syntax necessary for most usages:
83
84Symbol | Example&nbsp; | Description
85------ | ------------- | ----------------------------------
86`<`    | `a < b`       | Identifies a primary (base letter) difference between "a" and "b"
87`<<`   | `a << ä`      | Signifies a secondary (accent) difference between "a" and "ä"
88`<<<`  | `a<<<A`       | Identifies a tertiary difference between "a" and "A"
89`<<<<` | `か<<<<カ`     | Identifies a quaternary difference between "か" and "カ". (New in ICU 53.)
90`=`    | `x = y`       | Signifies no difference between "x" and "y".
91`&`    | `&Z`          | Instructs ICU to reset at this letter. These rules will be relative to this letter from here on, but will not affect the position of Z itself.
92
93> :point_right: **Note**: ICU permits up to three quaternary relations in a row
94> (except for intervening "=" identity relations).
95
96> :point_right: **Note**: In releases prior to 1.8,
97> ICU used the notations `;` to represent secondary relations and `,` to represent tertiary relations.
98> Starting in release 1.8, use `<<` symbols to represent secondary relations and
99> `<<<` symbols to represent tertiary relations.
100> Rules that use the `;` and `,` notations are still processed by ICU for compatibility;
101> also, some of the data used for tailoring to particular locales
102> has not yet been updated to the new syntax.
103> However, one should consider these symbols deprecated.
104
105> :point_right: **Note**: See the [LDML collation rule syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules)
106> and [Properties and ICU Rule Syntax](../../strings/properties.md) for
107> information regarding syntax characters.
108
109Repeated use of the same relation can be abbreviated, for example
110`&a <* bcd-gp-s` for `&a < b < c < d < e < f < g < p < q < r < s`.
111For details see the
112[LDML collation spec, section
113Orderings](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).
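
As a rough illustration of the abbreviated form, the following Python sketch
expands a `<*` style character list into the long form. It is not ICU code and
handles only the simple case shown above: literal characters plus `-` ranges in
code point order. The function names `expand_star_list` and `expand_rule` are
hypothetical.

```python
def expand_star_list(chars):
    """Expand a list such as "bcd-gp-s" into its individual characters."""
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == "-" and out and i + 1 < len(chars):
            # A range continues from the previous character up to the next one.
            end = chars[i + 1]
            out.extend(chr(cp) for cp in range(ord(out[-1]) + 1, ord(end) + 1))
            i += 2
        else:
            out.append(ch)
            i += 1
    return out


def expand_rule(anchor, relation, chars):
    """Write "&anchor relation* chars" in the unabbreviated form."""
    items = " ".join(relation + " " + ch for ch in expand_star_list(chars))
    return "&" + anchor + " " + items


print(expand_rule("a", "<", "bcd-gp-s"))
```

For the example above this prints `&a < b < c < d < e < f < g < p < q < r < s`.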
114
115### Escaping Rules
116
117Most of the characters can be used as parts of rules. However, whitespace
118characters will be skipped over, and all ASCII characters that are not digits or
119letters are considered to be part of syntax. In order to use these characters in
120rules, they need to be escaped. Escaping can be done in several ways:
121
122*   Single characters can be escaped using backslash **\\** (U+005C).
123
124*   Strings can be escaped by putting them between single quotes **'like
125    this'**.
126
127*   The single quote (ASCII apostrophe) can be quoted using two single quotes
128    **''**, both inside and outside single-quote-escaped strings.
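
When rule strings are assembled programmatically, the quoting convention above
can be applied with a small helper. The following Python sketch is an
assumption-laden illustration, not an ICU API: `escape_for_rule` is a
hypothetical name, and Python's `str.isspace()` is used as an approximation of
the white space that the rule parser skips.

```python
def escape_for_rule(text):
    """Quote text for literal use in a collation rule string, if needed.

    Text consisting only of letters, digits and non-ASCII characters is
    returned unchanged. Otherwise the text is wrapped in single quotes and
    every apostrophe inside it is doubled.
    """
    needs_quoting = any(
        ch.isspace() or (ord(ch) < 128 and not ch.isalnum()) for ch in text
    )
    if not needs_quoting:
        return text
    return "'" + text.replace("'", "''") + "'"


# Tailor the literal string "a-b" after "a"; the hyphen must be escaped.
rule = "&a < " + escape_for_rule("a-b")
print(rule)
```

The same helper leaves a string such as `č` untouched, since non-ASCII letters
are not syntax characters.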
129
130### Simple Tailoring Examples
131
132Serbian (Latin) or Croatian: `& C < č <<< Č < ć <<< Ć`
133
This rule is needed because the root collation order usually considers accents
to make only a secondary difference from the base character. This rule ensures
that 'č' and 'ć' are treated as base letters.
137
138UCA             | Tailoring: `& C < č <<< Č < ć <<< Ć`
139--------------- | --------------
140CUKIĆ RADOJICA  | CUKIĆ RADOJICA
141ČUKIĆ SLOBODAN  | CUKIĆ SVETOZAR
142CUKIĆ SVETOZAR  | CURIĆ MILOŠ
143ČUKIĆ ZORAN     | CVRKALJ ÐURO
144CURIĆ MILOŠ     | ČUKIĆ SLOBODAN
145ĆURIĆ MILOŠ     | ČUKIĆ ZORAN
146CVRKALJ ÐURO    | ĆURIĆ MILOŠ
147
148Serbian (Latin) or Croatian: `& Ð < dž <<< Dž <<< DŽ`
149
150This rule is an example of a contraction. "D" alone is sorted after "C" and "Ž"
151is sorted after "Z", but "DŽ", due to the tailoring rule, is treated as a single
152letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter
153after "D" in the UCA). Another thing to note in this example is capitalization
154of the letter "DŽ". There are three versions, since all three can legally appear
155in text. The fourth version "dŽ" is omitted since it does not occur.
156
157UCA      | Tailoring: `& Ð < dž <<< Dž <<< DŽ`
158-------- | ---------
159dan      | dan
160dubok    | dubok
161džabe    | đak
162džin     | džabe
163Džin     | džin
164DŽIN     | Džin
165đak      | DŽIN
166Evropa   | Evropa
167
168Danish: `&V <<< w <<< W`
169
170The letter 'W' is sorted after 'V', but is treated as a tertiary difference
171similar to the difference between 'v' and 'V'.
172
173UCA | `&V <<< w <<< W`
174--- | ----------------
175va  | va
176Va  | Va
177VA  | VA
178vb  | wa
179Vb  | Wa
180VB  | WA
181vz  | vb
182Vz  | Vb
183VZ  | VB
184wa  | wb
185Wa  | Wb
186WA  | WB
187wb  | vz
188Wb  | Vz
189WB  | VZ
190wz  | wz
191Wz  | Wz
192WZ  | WZ
193
194### Default Options
195
196ICU implements the [LDML collation
197options/settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options).
198For more information see there.
199
200The tailoring inherits all the attribute values from the root collator unless
201they are explicitly redefined in the tailoring. The following summarizes
202the option settings. Default options are **in emphasis**.
203
204#### alternate
205- **`[alternate non-ignorable]`**
206- `[alternate shifted]`
207
208Sets the default value of the UCOL_ALTERNATE_HANDLING attribute. If
209set to shifted, variable code points will be ignored on the primary level.
210For details see the [“Ignore Punctuation” Options](ignorepunct.md) page.
211
212#### maxVariable
213- **`[maxVariable punct]`**
214- `[maxVariable space]`
215
216Sets the variable top to the top of the specified
217reordering group. (New in ICU 53.) All code points with primary weights less
218than or equal to the variable top will be considered variable, and thus affected
219by the alternate handling.
220
221#### variable top
222(deprecated)
223- `& X < [variable top]`
224
Sets the default value for the variable top. All the code points with primary
weights less than or equal to the variable top will be considered variable.
*Changing the variable top via this rule syntax is deprecated since ICU 53.*
It has been replaced by the maxVariable option.
229
230#### normalization
231- **`[normalization off]`**
232- `[normalization on]`
233
Turns on or off the UCOL_NORMALIZATION_MODE attribute.
If set to on, a quick check and necessary normalization will be performed.
236
237#### strength
238- `[strength 1]`
239- `[strength 2]`
240- **`[strength 3]`**
241- `[strength 4]`
242- `[strength I]`
243
244Sets the default strength for the collator.
245
246#### backwards
247- `[backwards 2]`
248
Sets the default value of the UCOL_FRENCH_COLLATION attribute to on. When it is
on, weights on the secondary level are compared in reverse order, from the end
of the string towards the start.
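
The effect of backwards-secondary ordering can be sketched with a toy model in
Python. This is not ICU code: it assumes that each base letter is a primary
weight and that the combining marks attached to it (after NFD decomposition)
act as its secondary weight. The names `weights` and `sort_key` are
illustrative.

```python
import unicodedata


def weights(word):
    """Split a word into per-letter primaries and secondaries (toy model)."""
    primaries, secondaries = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondaries:
            # Attach the accent to the preceding base letter.
            secondaries[-1] = secondaries[-1] + (ord(ch),)
        else:
            primaries.append(ch.lower())
            secondaries.append(())
    return primaries, secondaries


def sort_key(word, backwards=False):
    primaries, secondaries = weights(word)
    if backwards:
        # Compare accents starting from the end of the string.
        secondaries = secondaries[::-1]
    return (primaries, secondaries)


WORDS = ["côté", "coté", "côte", "cote"]
forward = sorted(WORDS, key=sort_key)
backward = sorted(WORDS, key=lambda w: sort_key(w, backwards=True))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

With forward comparison the first accent in the string decides; with backwards
comparison the last accent decides, which yields the order cote, côte, coté,
côté.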
251
252#### caseLevel
253- **`[caseLevel off]`**
254- `[caseLevel on]`
255
Turns on or off the UCOL_CASE_LEVEL attribute. If set to on, a
level consisting only of case characteristics will be inserted in front of
the tertiary level. To ignore accents but take case into account, set strength to
primary and case level to on.
260
261#### caseFirst
262- **`[caseFirst off]`**
263- `[caseFirst upper]`
264- `[caseFirst lower]`
265
266Sets the value for the UCOL_CASE_FIRST attribute. If set to
267upper, causes upper case to sort before lower case. If set to lower, lower case
268will sort before upper case. Useful for locales that have an already supported
269ordering but require different order of cases. Affects case and tertiary levels.
270
271#### numericOrdering
272- **`[numericOrdering off]`**
273- `[numericOrdering on]`
274
275Turns on or off the UCOL_NUMERIC_COLLATION attribute. If
276set to on, then sequences of decimal digits (gc=Nd) sort by their numeric value.
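
The following Python sketch illustrates the idea behind numeric ordering
without using ICU. It assumes a simplified key that splits a string into
alternating text and digit runs and compares the digit runs by value. The name
`numeric_key` is hypothetical, and the real collator still applies its normal
weights to the non-digit parts.

```python
import re


def numeric_key(text):
    """Build a key in which runs of decimal digits compare by numeric value."""
    # re.split with a capturing group alternates: text, digits, text, ...
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


NAMES = ["file10", "file9", "file2"]

plain = sorted(NAMES)
numeric = sorted(NAMES, key=numeric_key)

print(plain)    # ['file10', 'file2', 'file9']
print(numeric)  # ['file2', 'file9', 'file10']
```

In Python, `\d` matches Unicode decimal digits, which corresponds to the
gc=Nd condition mentioned above.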
277
278#### hiraganaQ
279(deprecated)
280- **`[hiraganaQ off]`**
281- `[hiraganaQ on]`
282
Controls special treatment of Hiragana code points on the
quaternary level. If turned on, Hiragana code points will get lower values than
all the other non-variable code points. Strength must be quaternary or higher
for this attribute to take effect.
*hiraganaQ is deprecated since ICU 50.* It was an implementation detail of the
Japanese tailoring. In CLDR 25/ICU 53, the Japanese tailoring expresses the
differences between Hiragana and Katakana via explicit quaternary (`<<<<`)
relations.
291
292#### suppressContractions
293- `[suppressContractions [Љ-ґ]]`
294
295Removes context-sensitive mappings (contractions and prefix/context-before mappings)
296associated with each of the code points in the given UnicodeSet. It works on the
297current set of rules: It removes mappings from the root collation as well as
298from previous rules.
299
300This is the only way to *remove* mappings: The rule syntax otherwise only adds
301and overrides mappings. This special command is used in CLDR tailoring data to
302remove Cyrillic root collation contractions that are not necessary in several
303languages.
304
305#### optimize
306- `[optimize [Ά-ώ]]`
307
308Performance optimization for the code points in the UnicodeSet.
309In ICU, where tailoring data only contains the
310mappings that are different from the root collation (otherwise the data would be
311too large), falling back to root collation mappings for the rest of Unicode is
312slightly slower. The optimize command copies mappings for additional characters
313into the tailoring data.
314
315#### reorder
316followed by one or more reorder codes
317- `[reorder Grek Hani space]`
318
319Reorders scripts relative to each other and relative to a special set of
320non-script blocks (space, punctuation, symbol, currency, and digit). The default
321order is the same as in the DUCET and in the CLDR root collator.
322
323----
324
325A tailoring that consists only of options is also valid and has the same basic
326ordering as the root collation. For example, the Greek tailoring has option
327settings only: `[normalization on][reorder Grek]`
328
329(The examples in this chapter might refer to older versions of data for
330particular languages. Check CLDR or ICU for actual, current tailorings.)
331
332The following tailoring example reorders uppercase and lowercase and uses
333backwards-secondary ordering:
334
```
[caseFirst upper]
[backwards 2]
& C < č <<< Č
& G < ģ <<< Ģ
& I < y <<< Y
& K < ķ <<< Ķ
& L < ļ <<< Ļ
& N < ņ <<< Ņ
& S < š <<< Š
& Z < ž <<< Ž
```
347
348#### Values for Reorder Codes
349
350Reordering Group                         | Rule Value
351---------------------------------------- | ----------
352Unicode white space characters           | space
353Unicode punctuation                      | punct
354Unicode symbols except currency symbols  | symbol
355Unicode currency symbols                 | currency
356Unicode decimal digits                   | digit
Unicode scripts not mentioned ("others") | Zzzz (= Unknown script)
358
359In addition, ISO **4-letter script codes** can be used. Codes for scripts that
360do not have Unicode characters (according to the Unicode Script property values)
361are ignored.
362
Limitations of ICU 4.8-52 (`Kore` remains unusable in later versions as well,
because it refers to multiple scripts that do not sort primary-equal):
365
366*   For Chinese, use script code `Hani`, *not* `Hans` or `Hant`.
367*   For Japanese, use both `Kana` and `Hani` (*not* `Hira`).
368*   For Korean, use both `Hang` and `Hani` (*not* `Kore`).
369
370#### Semantics of a List of Reorder Codes
371
372This section is relevant for both the `[reorder ...]` rule syntax and the
373`Collator.setReorderCodes()` API.
374
375For an introduction and examples see the section “Script Reordering” in the
376[Collation Concepts chapter](../concepts.md).
377
On the API, the special groups are represented with `Collator.ReorderCodes`
(`UColReorderCode`) values rather than `UScript` (`UScriptCode`) values.
380
381In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU
382supported reordering of groups of scripts, each of which started with one of the
383[Recommended
384Scripts](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). A
385script that is not Recommended always moved together with the Recommended Script
386that precedes it in DUCET order. (Hiragana sorts together with Katakana, Coptic
387with Greek, etc.) ICU allowed any one script of a (Recommended Script +
388DUCET-following) group in the `[reorder]` list, moving the whole set of scripts
389together. However, it was strongly recommended that only Recommended Scripts be
390used.
391
392Beginning with ICU 55, scripts only reorder together if they are primary-equal,
393for example Hiragana and Katakana.
394
395Zyyy=Common and Zinh=Inherited cannot be reordered.
396
397The special code Zzzz (= Unknown script = `UScript.UNKNOWN` =
398`Collator.ReorderCodes.OTHERS` = "others") stands for any script that is not
399explicitly mentioned in the list of reordering codes. If Zzzz is mentioned in
400the list, then any groups and scripts mentioned later in the list will go at the
401very end of the reordering, in the order given. If Zzzz is not mentioned, then
402all scripts that are not explicitly listed follow at the end in DUCET order.
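
The placement of Zzzz can be sketched as a small list transformation. The
following Python function is a simplified model, not the ICU implementation.
It assumes, based on a reading of the LDML specification, that special groups
not mentioned in the list are kept at the front in their default order, and it
uses a short, made-up list of scripts in place of the full DUCET script order.
The names `effective_order`, `SPECIAL_GROUPS` and `SAMPLE_SCRIPTS` are
hypothetical.

```python
SPECIAL_GROUPS = ["space", "punct", "symbol", "currency", "digit"]

# Illustrative subset only; the real default order covers all scripts.
SAMPLE_SCRIPTS = ["Latn", "Grek", "Cyrl", "Hebr", "Hani"]


def effective_order(reorder_codes, default_scripts=SAMPLE_SCRIPTS):
    """Return the full group order implied by a list of reorder codes."""
    codes = list(reorder_codes)
    # Assumption: unmentioned special groups stay in front, in default order.
    codes = [g for g in SPECIAL_GROUPS if g not in codes] + codes
    # If Zzzz is absent, all unlisted scripts follow at the end.
    if "Zzzz" not in codes:
        codes.append("Zzzz")
    others = [s for s in default_scripts if s not in codes]
    result = []
    for code in codes:
        if code == "Zzzz":
            result.extend(others)  # unlisted scripts, in default order
        else:
            result.append(code)
    return result


print(effective_order(["Grek"]))
print(effective_order(["Grek", "Zzzz", "Latn"]))
```

In the second call, Latn is listed after Zzzz and therefore ends up at the very
end, after all scripts that are not mentioned.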
403
404The special reorder code `Collator.ReorderCodes.NONE` (= `UScript.UNKNOWN`), when
405used alone (same as `[reorder Zzzz]` or not specifying a `[reorder]` rule in a
406tailoring), will remove any reordering for this collator. The result of setting
407no reordering will be to use the DUCET/CLDR order.
408
On the API (not applicable to rule syntax), the special reorder code
`Collator.ReorderCodes.DEFAULT` (= `UScript.INHERITED`) will reset the reordering
for the collator to its default order. The default reordering may be the
DUCET/CLDR order or may be a reordering that was specified when this collator
was created from resource data or from rules. The DEFAULT code must be the sole
code supplied when it is used.
415
416For details see the [section “Collation Reordering” in the LDML collation
417spec](http://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering).
418
419### Advanced Syntactical Elements
420
421Several other syntactical elements are needed in more specific situations.
422
423#### Order before
424
425- Syntax: `[before 1|2|3]`
426- Example: `&[before 2]a<ā<á<ǎ<à`
427
Enables users to order characters **before** a given character. In UCA 3.0, the
example is equivalent to `& ㍡<ā<á<ǎ<à` (㍡ = \\u3361, ideographic telegraph symbol
for hour nine) and makes accented 'a' letters sort before 'a'. Accents are often
used to indicate the tones in Pinyin. In this case, the non-accented
letters sort after the accented letters.
433
434#### Expansion
435
436- Syntax: `/`
437- Example: `æ/e`
438
439Adds the collation element for 'e' to the collation element for æ.
440After a reset `&ae << æ` is equivalent to `&a << æ/e`. See the Expansion example
441below.
442
443#### Prefix processing
444
445- Syntax: `|`
446- Example: `a|b`
447
448If 'b' is encountered and it follows 'a',
449output the appropriate collation element. If 'b' follows any other letter,
450output the normal collation element for 'b'.
451The collation element for 'a' is not affected.
452
453This element is used to speed up sorting under JIS X 4061. See the
454Prefix example below.
455
456#### Reset to top
457
458- Syntax: `[top]`
459- Example: `&[top] < a < b < c …`
460
**Deprecated; use indirect positioning instead** (`&[last regular]`, see the
section below).

Reorders a set of characters 'above' the UCA. `[top]` is a virtual code point having the
biggest primary weight value that will ever be assigned in the UCA. Above top,
there is a large number of unassigned primary weights that can be used for a
'large' tailoring, such as the reordering of the CJK characters according to a
Far Eastern code page. The first difference after the top is always primary.
468
469### Indirect Positioning of Collation Elements
470
471Since ICU version 2.0, ICU allows for indirect positioning of collation elements
472(CE). Similar to the reset anchor `top`, these reset anchors allow for positioning of the
473tailoring relative to significant sections of the UCA table. You can use the
474`[before]` reset option to position before these sections.
475
476Name                      | Example CE value  | Note
477------------------------- | ----------------- | ------------
478first tertiary ignorable  | `[,,]`            | Start of the UCA table. This value will never change unless CEs are extended with higher level values.
479last tertiary ignorable   | `[,,]`            | This value will never change unless CEs are extended with higher level values.
480first secondary ignorable | `[,, 05]`         | Currently there are no secondary ignorables in the UCA table.
481last secondary ignorable  | `[,, 05]`         | Currently there are no secondary ignorables in the UCA table.
482first primary ignorable   | `[, 87, 05]`      | Mostly for non-spacing combining marks.
483last primary ignorable    | `[, E1 B1, 05]`   | Currently this value points to a non-existing code point, used to facilitate sorting of compatibility characters.
484first variable            | `[05 07, 05, 05]` | The lowest CE that is not primary-ignorable. (see below)
485last variable             | `[17 9B, 05, 05]` | End of variable section.
486first regular             | `[1A 20, 05, 05]` | This is the first regular CE (not primary ignorable and not variable). The majority of code points have regular CEs.
487last regular              | `[78 AA B2, 05, 05]` | Use `&[last regular]` instead of `&[top]`. (see below)
488first implicit            | `[E0 03 03, 05, 05]` | Section of implicitly generated collation elements. (see below)
489last implicit             | `[E3 DC 70 C0, 05, 05]` | End of implicit section. This is the CE of the last unassigned code point (U+10FFFD). (see below)
490first trailing            | `[E5, 05, 05]`    | Start of trailing section. (see below)
491last trailing             | `[FF FF, 05, 05]` | End of trailing collation elements section. This is the highest possible CE, and is the CE for U+FFFF. Not available for tailoring, see `[first trailing]`.
492
493"first variable": The current code point is TAB=U+0009. This is the start of the variable section. "Variable" characters will be ignored on primary/secondary/tertiary levels when the "shifted" option is on.
494
Tailoring after "last regular" will effectively position characters
between regular code points and "implicit" CEs (the next section).
This should be used only for tailoring Han characters,
since such tailorings tend to affect thousands of characters.
The script reordering implementation assumes that CEs in this section
are for "Hani" script characters.
501
502"Implicit" means that the UCA default ordering table (DUCET)
503does not explicitly specify CEs for CJK ideographs and unassigned code points;
504instead, their CEs are computed at runtime.
505
506Beginning with ICU 53, tailoring to any unassigned code point,
507including "last implicit", is not supported any more.
508
509"trailing": Tailoring characters after `[first trailing]`
510makes them sort after all other non-tailored code points except for U+FFFD and U+FFFF.
511
512The "trailing" section is reserved for future use, such as for non starting Jamos. See
513<http://www.unicode.org/reports/tr10/#Trailing_Weights>.
514CLDR 1.9/ICU 4.6 and later map U+FFFF to the very end of the trailing section.
515UCA 6.3/CLDR 24/ICU 52 and later map U+FFFD to just before U+FFFF.
516U+FFFD..U+FFFF are not tailorable, and nothing can tailor to them.
517<http://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights>
518
519Before ICU 4.6, U+FFFF mapped to a completely ignorable CE, and `[last trailing]`
520was the same as `[first trailing]`.
521
522Not all of the indirect-positioning anchors are useful. Most of the 'first'
523elements should be used with the `[before]` directive, in order to make sure
524that your tailoring will sort before an interesting section.
525
526### Complex Tailoring Examples
527
528The following are several fragments of real tailorings, illustrating some of the
529advanced syntactical elements:
530
531#### Expansion Example:
532
533**Swedish:**
```
&t<<<þ/h
&T<<<Þ/H
```
538
539The letter 'þ' (THORN) is normally treated by UCA/root collation as a separate
540letter that has primary-level sorting after 'z'. However, in Swedish and some
541other Scandinavian languages, 'þ' and 'Þ' should be treated as just a
542tertiary-level difference from the letters "th" and "TH" respectively. This is
543an example of an expansion.
544
545UCA | `&t<<<þ/h, &T<<<Þ/H`
546--- | --------------------
547az  | az
548Az  | Az
549tha | tha
550Tha | þa
551THa | Tha
552thz | THa
553za  | Þa
554Za  | thz
555zz  | þz
556þa  | za
557Þa  | Za
558þz  | zz
559
560#### Prefix Example:
561
Prefixes are used in Japanese tailorings to reduce the number of contractions. A
large number of contractions is a performance burden on the commonly used base
characters, as their processing is much more complicated than the processing of
regular elements.
566
567A prefix rule conditionally changes the CE of the character or string (e.g., ー)
568after the | symbol; unlike a contraction, it does not affect the CE of the
569preceding text (e.g., ァ). (By contrast, a contraction like ァー consumes both
570characters and can assign them a CE or expansion unrelated to ァ's CE.) A prefix
571rule is especially useful if the character or string (ー) after the | symbol
572occurs significantly less often than the first character of the prefix (ァ).
573
```
&[before 3]ァ <<< ァ|ー = ｧ|ー = ぁ|ー
```

This could have been written as a series of contractions followed by expansion:

```
&[before 3]ァー <<< ァー = ｧー = ぁー
```

However, in that case ァ, ｧ and ぁ would start contractions. Since the prolonged
sound mark (ー) occurs much less frequently than the other letters of Japanese
Katakana and Hiragana, it is much more prudent to put the extra processing on it
by using prefixes.
588
589#### Reset example:
590
591A "reset" always uses only the base character as the insertion point even if
592there is an expansion. So the following rule,
593
```
& J <<< K / B & K <<< M
```

is equivalent to

```
& J <<< K / B <<< M
```
603
This produces the following sort order: "JA", "MA", "KA", "KC", "JC", "MC".
617
618> :point_right: **Note**: Assuming the letters "J", "K" and "M" have equal primary weights, the second
619> letter contains the differences among these strings. However, the letter "K" is
620> treated as if it always has a letter "B" following it while the letters "J" and
621> "M" do not.
622
623The following is an example of collation elements for these strings resulting
624from the specified rules:
625
626Strings | Collation Elements | &nbsp;         | &nbsp;
627------- | ------------------ | -------------- | ------
628"JA"    | `[005C.00.01]`     | `[0052.00.01]` |
629"MA"    | `[005C.00.03]`     | `[0052.00.01]` |
630"KA"    | `[005C.00.02]`     | `[0053.00.01]` | `[0052.00.01]`
631"KC"    | `[005C.00.02]`     | `[0053.00.01]` | `[0054.00.01]`
632"JC"    | `[005C.00.01]`     | `[0054.00.01]` |
633"MC"    | `[005C.00.03]`     | `[0054.00.01]` |
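
The sort order can be checked against the collation elements in the table with
a short Python sketch. It is a simplified model rather than ICU's sort key
algorithm: the weights are copied from the table, compared level by level
(all primaries, then all secondaries, then all tertiaries), and zero weights
are kept as they are. The names `COLLATION_ELEMENTS` and `level_key` are
illustrative.

```python
# (primary, secondary, tertiary) weights copied from the table above.
COLLATION_ELEMENTS = {
    "JA": [(0x5C, 0x00, 0x01), (0x52, 0x00, 0x01)],
    "MA": [(0x5C, 0x00, 0x03), (0x52, 0x00, 0x01)],
    "KA": [(0x5C, 0x00, 0x02), (0x53, 0x00, 0x01), (0x52, 0x00, 0x01)],
    "KC": [(0x5C, 0x00, 0x02), (0x53, 0x00, 0x01), (0x54, 0x00, 0x01)],
    "JC": [(0x5C, 0x00, 0x01), (0x54, 0x00, 0x01)],
    "MC": [(0x5C, 0x00, 0x03), (0x54, 0x00, 0x01)],
}


def level_key(elements):
    """Group weights by level so that earlier levels dominate later ones."""
    return tuple(tuple(ce[level] for ce in elements) for level in range(3))


ordered = sorted(
    COLLATION_ELEMENTS, key=lambda text: level_key(COLLATION_ELEMENTS[text])
)
print(ordered)  # ['JA', 'MA', 'KA', 'KC', 'JC', 'MC']
```

"KA" and "KC" sort between the "A" and "C" strings because the expansion adds
the primary weight of "B" right after the first element.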
634
635## Tailoring Issues
636
637ICU uses canonical closure. This means that for each code point in Unicode, if
638the canonically composed form of a tailored string produces different collation
639elements than the canonically decomposed form, then the canonically composed
640form is effectively added to the ordering. If 'a' is tailored, for example, all
641of the accented 'a' characters are also tailored. Canonical closure allows
642collators to process Unicode strings in the FCD form as well as in NFD. (Note:
643Most but not all NFC strings are also in FCD. See
644<http://www.unicode.org/notes/tn5/#FCD>)
645
646However, *compatibility* equivalents are NOT automatically added. If the rule
647"&b < a" is in tailoring, and the order of **ⓐ (circled a)** is important, it
648needs to be tailored **explicitly**.
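
The distinction between canonical and compatibility equivalents can be
inspected with Python's standard `unicodedata` module. The sketch below only
illustrates the Unicode normalization forms involved and does not call ICU;
the helper names are hypothetical.

```python
import unicodedata


def canonically_equivalent(first, second):
    """True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", first) == unicodedata.normalize("NFD", second)


def has_only_compatibility_mapping(ch):
    """True if ch is unchanged by NFD but changed by NFKD."""
    return (
        unicodedata.normalize("NFD", ch) == ch
        and unicodedata.normalize("NFKD", ch) != ch
    )


A_ACUTE = "\u00e1"    # á, canonically equivalent to a + U+0301
CIRCLED_A = "\u24d0"  # ⓐ, only a compatibility equivalent of a

print(canonically_equivalent(A_ACUTE, "a\u0301"))     # True
print(has_only_compatibility_mapping(CIRCLED_A))      # True
print(unicodedata.normalize("NFKD", CIRCLED_A))       # a
```

Characters of the first kind are covered by canonical closure; characters of
the second kind are the ones that need explicit rules.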
649
650Redundant tailoring rules are removed, with later rules "winning". The strengths
651around the removed rules are also fixed.
652
653### Example:
654
655The following table summarizes effects of different redundant rules.
656
657&nbsp; | Original                                                  | Equivalent
658------ | --------------------------------------------------------- | ----------
6591      | `& a < b < c < d` `& r < c`                               | `& a < b < d` `& r < c`
6602      | `& a < b < c < d` `& c < m`                               | `& a < b < c < m < d`
6613      | `& a < b < c < d` `& a < m`                               | `& a < m < b < c < d`
6624      | `& a <<< b << c < d` `& a < m`                            | `& a <<< b << c < m < d`
6635      | `& a < b < c < d` `& [before 1] c < m`                    | `& a < b < m < c < d`
6646      | `& a < b <<< c << d <<< e` `& [before 3] e <<< x`         | `& a < b <<< c << d <<< x <<< e`
6657      | `& a < b <<< c << d <<< e` `& [before 2] e <<< x`         | `& a < b <<< c <<< x << d <<< e`
6668      | `& a < b <<< c << d <<< e` `& [before 1] e <<< x`         | `& a <<< x < b <<< c << d <<< e`
6679      | `& a < b <<< c << d <<< e <<< f < g` `& [before 1] g < x` | `& a < b <<< c << d <<< e <<< f < x < g`
668
669If two different reset lists tailor the same character, then it is removed from the first
670one (see 1 in the table above).
671If the second list resets to a character tailored in the first list, then the second
672list is inserted in the first (see 2).
673If both lists reset to the same character, then the same thing
674happens (see 3). Whenever such an insertion occurs, the second strength
675"postpones" the position (see 4).
676
677If there is a `[before N]` on the reset, then the reset character is
678effectively replaced by the item that would be before it, either in a previous
679tailoring (if the letter occurs in one - see 5) or in the UCA. The N determines
680the 'distance' before, based on the strength of the difference (see 6-8).
681However, this is subject to postponement (see 9), so be careful!
682
683### Reset semantics
684
The reset semantics in ICU 1.8 and above differ from those of previous ICU
releases. Prior to version 1.8, the reset relation modifier (the expansion
implied by a multi-character reset) was applicable only to the entry immediately
following the reset entry. Starting with version 1.8, the modifier applies to
all entries that occur until the next reset or primary relation.
689
690For example,
691
```
&xyz << e <<< f
```

was equivalent to

```
&x << e/yz <<< f
```

prior to ICU version 1.8.

Starting with ICU version 1.8, the rule is equivalent to

```
&x << e/yz <<< f/yz
```
709
710The new semantic produces more intuitive results, especially when the character
711after the reset is decomposable. Since all rules are converted to NFD before
712they are interpreted, this can result in contractions that the rule-writer might
713not be aware of. Expansion propagates only until the next reset or primary
714relation occurs.
715
716For example, the following rule:
717
```
&ab = c <<< d << e <<< f < g <<< h
```

was equivalent to the following prior to ICU 1.8 and in Java:

```
&a = c/b <<< d << e <<< f < g <<< h
```

Starting with 1.8, it is equivalent to

```
&a = c / b <<< d / b << e / b <<< f / b < g <<< h
```
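
The propagation described above can be written down as a small string
transformation. The following Python sketch is only a model of the ICU 1.8+
behavior for simple rules: it assumes a single reset whose first character is
the base and whose remaining characters form the expansion, and it ignores
contractions, quoting, and normalization. The name `propagate_expansion` is
hypothetical.

```python
import re

# Longest operators first so that "<<<" is not read as "<" followed by "<<".
RELATION = re.compile(r"\s*(<<<<|<<<|<<|<|=)\s*")


def propagate_expansion(rule):
    """Rewrite "&ab = c <<< d" in the explicit form "&a = c / b <<< d / b"."""
    if not rule.startswith("&"):
        raise ValueError("rule must start with a reset (&)")
    parts = RELATION.split(rule[1:].strip())
    reset, rest = parts[0], parts[1:]
    base, expansion = reset[0], reset[1:]
    out = ["&" + base]
    active = bool(expansion)
    # rest alternates: relation, item, relation, item, ...
    for relation, item in zip(rest[0::2], rest[1::2]):
        if relation == "<":
            # A primary relation ends the propagation.
            active = False
        if active:
            out.append(relation + " " + item + " / " + expansion)
        else:
            out.append(relation + " " + item)
    return " ".join(out)


print(propagate_expansion("&ab = c <<< d << e <<< f < g <<< h"))
```

For the rule above, the output is
`&a = c / b <<< d / b << e / b <<< f / b < g <<< h`.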
733
734## Known Limitations
735
The following are known limitations of the ICU collation implementation. They
are theoretical limitations, since there are no known languages for
which they are an issue. For completeness, however, they should be
fixed in a future version after 1.8.1. The examples given are designed for
simplicity in testing, and do not match any real languages.
741
742### Expansion
743
744The goal of expansion is to sort as if the expansion text were inserted right
745after the character. For example, with the rule
746
```
&a <<< c / e
```
750
751The text "...**c**..." should sort as if it were right after "...**ae**..." with
752a tertiary difference. There are a few cases where this is not currently true.
753
754#### Recursive Expansion
755
756Given the rules
757
```
&a <<< c / e
&g <<< e / I
```
762
763Expansion should sort the text "...**c**..." as if it were just after
764"...**ae**...", and that should also sort as if it were just after
765"...**agi**...". This requires that the compilation of expansions be recursive
766(and check for loops as well!). ICU currently does not do this.
767
768Rules         | Desired Order | Current Order
769------------- | ------------- | -------------
770`& a = b / c` | add           | b
771`& d = c / e` | b             | add
772&nbsp;        | adf           | adf
773
774#### Contractions Spanning Expansions
775
ICU currently always pre-compiles the expansion into an internal format (a list
of one or more collation elements) when the rule is compiled. If there is a
contraction that spans the end of the expanded text and the start of the
original text, however, that contraction will not match. A test case that
illustrates this is:
781
782Rules           | Desired Order | Current Order
783--------------- | ------------- | -------------
784`& a <<< c / e` | ad            | ad
785`& g <<< eh`    | c             | c
786&nbsp;          | af            | ch
787&nbsp;          | g             | af
788&nbsp;          | ch            | g
789&nbsp;          | h             | h
790
791Since the pre-compiled expansions are a huge performance gain, we will probably
792keep the implementation the way it is, but in the future allow additional syntax
793to indicate those few expansions that need to behave as if the text were
794inserted because of the existence of another contraction. Note that such
795expansions need to be recursively expanded (as in #1), but rather than at
796pre-compile time, these need to be done at runtime.
797
798While it is possible to automatically detect these cases, it would be better to
799allow explicit control in case spanning is not desired. An example of such
800syntax might be something like:
801
802```
803&a <<< c // e
804```

**Notes:** ICU does handle the case where there is a contraction that is
completely inside the expansion.

Suppose that someone had the rules:

```
&a = c / e
&x = ae
```

These do not cause **c** to sort as if it were **ae**, nor should they.

### Normalization

The Unicode Collation Algorithm specifies that all text sort as if it were first
normalized into NFD. For performance reasons, ICU collation data is
pre-processed so that there is no need to perform normalization on strings that
are in [FCD](http://www.unicode.org/notes/tn5/#FCD) and do not contain any composite
combining marks. The composite combining marks are { U+0344, U+0F73, U+0F75,
U+0F81 }
([`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5Elccc%3D0%3A%5D%26%5B%3AtoNFD%3D%2F..%2F%3A%5D&abb=on&g=)).
These characters must be decomposed for discontiguous contractions to work
properly, and their use is discouraged by the Unicode Standard. The vast
majority of strings are already in FCD and contain none of these marks.
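
To see why these four characters are singled out, one can inspect their canonical decompositions. The sketch below uses the JDK's `java.text.Normalizer` rather than ICU's own normalizer, an assumption made only to keep the example free of dependencies. The class and helper names are hypothetical.

```java
import java.text.Normalizer;

class CompositeMarksDemo {

    // Canonical decomposition (NFD) as implemented by the JDK.
    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Formats a string as a space-separated list of code points, e.g. "U+0308 U+0301".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The four composite combining marks listed in the text above.
        int[] marks = {0x0344, 0x0F73, 0x0F75, 0x0F81};
        for (int cp : marks) {
            String s = new String(Character.toChars(cp));
            System.out.println(hex(s) + " -> " + hex(toNfd(s)));
        }
    }
}
```

Each of these marks decomposes into two combining marks, which is why a string containing one of them cannot take the fast path that skips normalization.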

#### Nulls in Contractions

Nulls should not be used in contractions that could invoke normalization.

Rules                | Desired Order | Current Order
-------------------- | ------------- | -------------
`& a <<< '\u0000'^`  | a             | '\\u0000'^
&nbsp;               | '\\u0000'^    | a

#### Contractions Spanning Normalization

The following rule specifies that a grave accent followed by a **b** is a
contraction, and sorts as if it were an **e**.

```
& e <<< ` b
```

On this basis, "...àb..." should sort as if it were just after "...ae...".
Because of the preprocessing, however, the contraction will not match if this
text is represented with the pre-composed character à, but **will** match if
given the decomposed sequence **a + grave accent**. The same thing happens if
the contraction spans the start of a normalized sequence.

Rules        | Desired Order | Current Order
------------ | ------------- | -------------
& e <<< \` b | à             | à
&nbsp;       | ad            | àb
&nbsp;       | àb            | ad
&nbsp;       | af            | af
&nbsp;       | &nbsp;        |
`& g <<< ca` | f             | cà
&nbsp;       | ca            | f
&nbsp;       | cà            | ca
&nbsp;       | h             | h

### Variable Top

ICU lets you set the top of the variable range. This can be done, for example,
to allow you to ignore just SPACES, and not punctuation.

#### Variable Top Exclusion

There is currently a limitation that causes variable top to (perhaps) exclude
more characters than it should. This happens if you not only set variable top,
but also tailor a number of characters around it with primary differences. The
exact number that you can tailor depends on the internal "gaps" between the
characters in the pre-compiled UCA table. Normally there is a gap of one. There
are larger gaps between scripts (such as between Latin and Greek), and after
certain other special characters. For example, if variable top is set to be at
SPACE ('\\u0020'), then it works correctly with up to 70 characters also
tailored after space. However, if variable top is set to be equal to HYPHEN
('\\u2010'), only one other value can be accommodated.

In the following, the goal is for x to be ignored and z not to be ignored.

Rules              | Desired Order SHIFTED = ON | Current Order
------------------ | -------------------------- | -------------
`& \u2010`         | -                          | -
`< x`              | z                          | z
`< [variable top]` | zb                         | zb
`< z`              | a                          | xb
&nbsp;             | b                          | a
&nbsp;             | -b                         | b
&nbsp;             | xb                         | -b
&nbsp;             | c                          | c

> :point_right: **Note**: With ICU 1.8.1, the
> user is advised not to tailor the variable top to customize more than two
> primary relations (for example, `"& x < y < [variable top]"`). Starting in ICU
> 2.0, setVariableTop() allows the user to set the variable top programmatically
> to a legal single character or a valid contracting sequence. In addition, the
> string that variable top is set to should not be treated as either inclusive or
> exclusive in the rules.

### Case Level/First/Second

In ICU, it is possible to override the tertiary settings programmatically. This
is used to change the default case behavior to be all upper first or all lower
first. It can also be used for a separate case level, or to ignore all other
tertiary differences (such as between circled and non-circled letters, or
between half-width and full-width katakana). The case values are derived
directly from the Unicode character properties, and not set by the rules.

#### Mixed Case Contractions

There is currently a limitation that all contractions of multiple characters can
only have three special case values: upper, lower, and mixed. All mixed-case
contractions are grouped together, and are not affected by the upper first vs.
lower first flag.

Rules      | Desired Order UPPER_FIRST | Current Order
---------- | ------------------------- | -------------
`& c < ch` | C                         | c
`<<< cH`   | CH                        | CH
`<<< Ch`   | Ch                        | cH
`<<< CH`   | cH                        | Ch
&nbsp;     | ch                        | ch

## Building on Existing Locales

All of the collation rules are additive; that is, a later rule overrides what
any previous rule expressed. That means that you can build on existing rules for
given locales. Here is an example of this, which fetches the rules for a
particular locale (Danish), then overrides some part (sorting '%' after 'm').
The syntax is Java, but C/C++ has similar features.

```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// The statements below are meant to run inside a test method; assertEquals is
// the (message, expected, actual) assertion of the test framework in use.
ULocale myLocale = new ULocale("da");
try {
    // Fetch the Danish tailoring and append a rule that sorts '%' after 'm'.
    RuleBasedCollator col = (RuleBasedCollator) Collator.getInstance(myLocale);
    String rules = col.getRules();
    String myRules = "& m < '%'";
    RuleBasedCollator col2 = new RuleBasedCollator(rules + myRules);

    // Check the values: sort with the new collator and compare with the
    // expected order (in Danish, "aa" sorts after "z").
    List<String> expected = Arrays.asList("a;m;%;z;aa".split(";"));
    TreeSet<String> sorted = new TreeSet<String>(col2);
    sorted.addAll(expected);
    ArrayList<String> actual = new ArrayList<String>(sorted);
    assertEquals("Customized rules with %", expected, actual);

} catch (Exception e) {
    throw new IllegalArgumentException("Failed to create customized rules", e);
}
```

The root collator has an empty rules string (`getRules()` returns `""`). Any
collator's tailoring rules string defines how that collator *differs* from the
root collator, and the tailoring rules string was the input for building the
tailoring collator. By contrast, the root collator itself is built from a file
with explicit mappings (ICU4C source/data/unidata/FractionalUCA.txt)
from characters/contractions to collation elements. This file represents the
[DUCET](http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table)
as [modified by
CLDR](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation).

There are "extended" versions of `getRules()` which, when called with
`delta=UCOL_FULL_RULES` (C/C++) or `fullrules=true` (Java), return "full rules"
which are a concatenation of the "UCA rules" and the collator's tailoring. The
"UCA rules" are published as UCA_Rules.txt in every [UCA
release](http://www.unicode.org/Public/UCA/).

*   "UCA rules" is a historical misnomer. The UCA specifies an Algorithm which
    applies to all collators, and provides the DUCET as its Default table.
*   ICU's root collator implements the CLDR-modified collation element table.
    The "UCA rules" returned from ICU functions are equivalently modified rules
    compared with those for the DUCET.

The "UCA rules" are an *approximation* of the root collator's sort order, but
there are some differences because not all of the details of the root collator
mappings can be expressed in rule syntax. In particular, a collator built from
ICU4C source/data/unidata/UCARules.txt
has at least the following issues compared with the real root collator:

*   inefficient (long) collation element weights
*   CODAN (numeric collation) will not work (the 0 digit's primary weight is
    hardcoded, or specified in FractionalUCA.txt)
*   script reordering will not work
*   alternate=shifted will not work
*   the sort order has some differences from the regular root collator,
    including additional tertiary differences

The "full rules" are almost never used, or useful, at runtime. They are included
in ICU for historical reasons and for UCA consistency tests. They might be
usable for emulating the CLDR/ICU sort order with a collation implementation not
based on CLDR/ICU.

Collation rule strings in general are not commonly used but are a significant
portion of the data size in ICU collation resource bundles, especially for CJK
languages. The rule strings can be omitted from those resource bundles by adding
the `--omitCollationRules` option to the relevant `genrb` invocations
(for ICU 53..63, in icu4c/source/data/Makefile.in)
or, since ICU 64, with a [data filter config file](../../icu_data/buildtool.md).
(See for example the relevant
[ICU integration test instructions](http://site.icu-project.org/processes/release/tasks/integration#TOC-Verify-that-ICU4C-tests-pass-without-collation-rule-strings).)

If the tailoring rules are needed but the 150kB or so of "UCA rules" are not,
then the line

```
UCARules:process(uca_rules){"../unidata/UCARules.txt"}
```

in
[source/data/coll/root.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/coll/root.txt)
can be commented out or deleted.

## Cautions

The following are not known rule limitations, but rather cautions.

### Resets

Since resets always work on the existing state, the user is required to make
sure that the rule entries are in the proper order.

Rules     | Order | Comment
--------- | ----- | -------
`& a < b` | a     | The rules mean: put **b** after **a**, then put **c** after **a** (inserting **before** the **b**).
`& a < c` | c     |
&nbsp;    | b     |

### Postpone Insertion

When using a reset to insert a value X with a certain strength difference after
a value Y, it actually is inserted just before the next item of the same
strength or higher following Y. Thus, the following are equivalent:

```
... m < a = c <<< d << e <<< f < g <<< h & a << x
... m < a = c <<< d << x << e <<< f < g <<< h
```

> :point_right: **Note**: This is different from the Java semantics.
> In Java, the value is inserted immediately after the reset character.
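
The ICU insertion rule above can be modelled on a flat chain of items and relations. The following is a toy sketch under stated assumptions, not ICU's implementation: `PostponedInsertion`, `level`, and `insert` are hypothetical names, the chain is a whitespace-separated string that alternates items and relations, and the leading `...` of the example is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PostponedInsertion {

    // Maps a relation to a level; a smaller number is a stronger difference.
    static int level(String relation) {
        switch (relation) {
            case "<":   return 1; // primary
            case "<<":  return 2; // secondary
            case "<<<": return 3; // tertiary
            case "=":   return 4; // identical
            default:
                throw new IllegalArgumentException("unknown relation: " + relation);
        }
    }

    // Inserts "relation text" after the item `reset`, just before the next
    // item whose relation is of the same strength or stronger.
    static String insert(String chain, String reset, String relation, String text) {
        List<String> tokens = new ArrayList<>(Arrays.asList(chain.trim().split("\\s+")));
        int position = tokens.indexOf(reset);
        if (position < 0 || position % 2 != 0) {
            throw new IllegalArgumentException("reset not found: " + reset);
        }
        int i = position + 1; // index of the relation that follows the reset item
        while (i < tokens.size() && level(tokens.get(i)) > level(relation)) {
            i += 2; // skip a weaker relation together with its item
        }
        tokens.add(i, text);
        tokens.add(i, relation);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Prints: m < a = c <<< d << x << e <<< f < g <<< h
        System.out.println(insert("m < a = c <<< d << e <<< f < g <<< h", "a", "<<", "x"));
    }
}
```

Applying the same function twice, first `& a < b` and then `& a < c`, yields `a < c < b`, which is the order shown in the Resets table.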

### Jamo Tailoring

If Jamo characters are tailored, that causes the code to go through a slow path,
which will have a significant effect on performance.

### Compatibility Decompositions

When tailoring a letter, the customization affects all of its canonical
equivalents. That is, if a tailoring rule sorts an "**a**" after "**e**", for
example, then "**à**", "**á**", ... are also sorted after "**e**". This is not
true for compatibility equivalents. If the desired sorting order is for a
**superscript-a** ("ª") to be after "**e**", it is necessary to specify the rule
for that.
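
The difference between the two kinds of equivalence can be checked with a normalizer. This sketch uses the JDK's `java.text.Normalizer` as a stand-in for ICU's normalizer (an assumption made to avoid extra dependencies); `EquivalenceDemo` and its helpers are hypothetical names.

```java
import java.text.Normalizer;

class EquivalenceDemo {

    // Canonical decomposition.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compatibility decomposition.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String aGrave = "\u00E0";   // à
        String ordinalA = "\u00AA"; // ª

        // à is canonically equivalent to a + U+0300, so tailoring "a" affects it.
        System.out.println(nfd(aGrave).equals("a\u0300"));
        // ª has no canonical decomposition ...
        System.out.println(nfd(ordinalA).equals(ordinalA));
        // ... only a compatibility decomposition to "a", so it needs its own rule.
        System.out.println(nfkd(ordinalA).equals("a"));
    }
}
```

All three lines print `true`: only the canonical decomposition links a character to the tailored letter.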

### Case Differences

Similarly, when tailoring an "**a**" to be sorted after "**e**", a specific rule
is required if "**A**" is to be sorted after "**e**" as well.

### Automatic Expansions

ICU will automatically form expansions whenever a reset is to a multi-character
value that is not a contraction. For example, `& ab <<< c` is equivalent to
`& a <<< c / b`. The user may be unaware of this happening, since it may not be
obvious that the reset is to a multi-character value. For example, `& à <<< d`
is equivalent to ``& a <<< d / ` ``.
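
Whether a reset string is "multi-character" is easiest to see after canonical decomposition. The sketch below is a simplified illustration, not ICU's rule builder: `ResetSplitter` and `splitReset` are hypothetical names, normalization is done with the JDK's `java.text.Normalizer`, and the reset is assumed not to be a contraction.

```java
import java.text.Normalizer;

class ResetSplitter {

    // Returns {base, expansion} for a reset string after NFD normalization.
    // Assumption: the first code point is the base, the rest is the expansion.
    static String[] splitReset(String reset) {
        if (reset.isEmpty()) {
            throw new IllegalArgumentException("empty reset");
        }
        String nfd = Normalizer.normalize(reset, Normalizer.Form.NFD);
        int baseLength = Character.charCount(nfd.codePointAt(0));
        return new String[] {nfd.substring(0, baseLength), nfd.substring(baseLength)};
    }

    public static void main(String[] args) {
        // "ab" is visibly two characters: base "a", expansion "b".
        String[] plain = splitReset("ab");
        System.out.println(plain[0] + " / " + plain[1]);

        // "à" looks like one character but decomposes to "a" + U+0300.
        String[] accented = splitReset("\u00E0");
        System.out.println(accented[0] + " / U+"
                + String.format("%04X", accented[1].codePointAt(0)));
    }
}
```

In this model a reset to the single precomposed character à still yields a non-empty expansion, which is the case the text warns about.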
1077