---
title: Collation Guidelines
---

# Collation Guidelines

Collation sequences can be quite tricky to specify.

The locale\-based collation rules in Unicode CLDR specify customizations of the standard data for [UTS \#10: Unicode Collation Algorithm](http://www.unicode.org/reports/tr10/#Introduction) (UCA). Requests to change the collation order for a given locale, or to supply additional variants, need to follow the guidelines in this document.

## Filing a Request

Requests to change the collation order for a given locale, or to supply additional variants, should be filed as CLDR bug tickets. See [CLDR Change Requests](https://cldr.unicode.org/index/bug-reports).

### Rules

The request should present the precise change expressed as rules. The rules must be supplied in the syntax specified in [http://www.unicode.org/reports/tr35/tr35\-collation.html\#Rules](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules). (This used to be called the "basic syntax".) The rules must also be [Minimal Rules](https://cldr.unicode.org/index/cldr-spec/collation-guidelines) as described below: *only* differences from [http://unicode.org/charts/uca/](http://unicode.org/charts/uca/) should be specified. For example:

\& c \< cs

\& cs \<\<\< ccs / cs

Normally CLDR does not accept submissions that reorder *particular* digits, punctuation, or other symbols, following instead the UCA ordering for those characters. However, if punctuation, general symbols, currency symbols, or digits *as a class* all sort after letters, that change can be accommodated. Similarly, if the letters in a particular script sort ahead of others (such as Greek characters ahead of Latin), that can also be accommodated. Both of these are done with a reorder setting. Note: For a given language, CLDR normally sorts the language's native script before other scripts, via the reorder setting.
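The effect of reordering a script *as a class* can be illustrated with a toy sort key. The following Python sketch is not how CLDR or ICU implement the reorder setting: the helper names `script_of` and `reorder_key` are hypothetical, the script is guessed from the Unicode character name, and code points stand in for real UCA primary weights. It only shows that moving a whole script ahead of another changes the order between scripts while leaving the order within each script untouched.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Hypothetical helper: guess a script code from the character name.

    Real implementations use the Unicode Script property instead.
    """
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "Grek"
    if name.startswith("LATIN"):
        return "Latn"
    return "Zzzz"  # anything else counts as "unlisted" in this toy


def reorder_key(text: str, order=("Grek", "Latn")):
    """Toy sort key: scripts named in `order` sort first, in that order."""
    rank = {script: i for i, script in enumerate(order)}
    # Each character contributes (script rank, code point). Code points are
    # an assumption standing in for UCA primary weights.
    return [(rank.get(script_of(ch), len(order)), ord(ch)) for ch in text]


words = ["beta", "αλφα", "alpha", "βητα"]
greek_first = sorted(words, key=reorder_key)
latin_first = sorted(words, key=lambda w: reorder_key(w, ("Latn", "Grek")))
print(greek_first)
print(latin_first)
```

In an actual request, the same intent is expressed in the rule syntax with a reorder setting (such as `[reorder Grek]`), not with per-character rules.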
### Test Data

Please supply short test cases that illustrate the correct sorting behavior as a list of lines in sorted order. Try to include cases that show the boundary behavior by including suffixes, such as the following to illustrate that "cs" and "ccs" sort specially.

c

cy

cs

cscs

ccs

cscsy

ccsy

csy

d

### Justification

Provide justification for your change. Citations should be to authoritative pages on the web, in English.

### Testing Your Request

Please test any suggested rules before filing a bug.

1. Go to the [ICU Collation Demo](http://demo.icu-project.org/icu-bin/collation.html).
2. Pick the language for which you want to change the rules, or keep it on "und" (root) if you want to start from the Unicode/CLDR default sort order.
3. Put your rules into the "Append rules" box.
4. Put an interesting list of strings into the Input box.
5. Click "sort" and verify the sort order and levels of differences.

Or:

1. Go to the [ICU Locale Explorer](http://demo.icu-project.org/icu-bin/locexp).
2. Pick the appropriate locale.
3. Follow the instructions at the bottom to use your suggested rules on your suggested test data.
4. Verify that the proper order results.

## Determining the Order

The exact collation sequence for a given language may be difficult to determine. The base ordering of characters can be fairly straightforward, but there are quite a few other complications involved.

Most standards that specify collation, such as DIN or CSN, are not targeted at algorithmic sorting, and are not complete algorithmic specifications. For example, CSN 97 6030 requires transliteration of foreign scripts, but there are many choices as to how to transliterate, and the exact mechanism is not specified.
It also specifies that geometric shapes are sorted by the number of vertices and edges, which is, at a minimum, difficult to determine, and is subject to variation in glyphs.

The CLDR goals are to match the sorting of exemplar letters and common punctuation and leave everything else to the standard UCA ordering. For more information, see [UTS \#10: Unicode Collation Algorithm](http://www.unicode.org/reports/tr10/#Introduction) (UCA).

### Determining Level Differences

It is often tricky to determine the exact relationship between characters. In the UCA, case and similar variant differences are at a third (tertiary) level, while accent and similar differences are at a second (secondary) level, and base letter differences are at the first (primary) level. That results in an order like the following:

1. cina
2. Cina
3. çina
4. Çina
5. **d**ina

That is, the difference between c and C is weaker than the difference between c and ç, which in turn is weaker than the difference between c and d. For any two characters α and β, it may be very clear that α \< β, but not clear what the right level difference is. To establish this, see if you can find examples of two words of the following form.

Primary Test

1. ...α...Z
2. ...β...A

That is, the words are identical except for α, β, A, and Z, *and* you know that A and Z have a clear primary difference. If you find the above ordering in dictionaries and other sources, you know that the difference between α and β is a primary difference. If you find the opposite of the 1, 2 ordering above, then you only know that the difference between α and β is *not* a primary difference: it may be secondary or tertiary.

You now need to distinguish which of the non\-primary level differences you could have. So try again, this time seeing if you can find examples of two words of the following form, where you know that A and Á have a clear secondary difference in the script.

Secondary Test 1

1. ...α...Á
2. ...β...A

Now the ordering of these two strings tells you whether the difference between α and β is a secondary difference, or not. Alternatively, you can look for words of the form:

Secondary Test 2

1. ...B...α
2. ...b...β

where b \< B at a tertiary level. If you get the above ordering for Secondary Test 2, you also know that the difference between α and β is at a secondary level. The Test 2 form is often easier to find examples for.

If you have established that the characters have neither a primary nor a secondary difference, the following can be used in a similar fashion to test whether the difference is at a tertiary level or not.

Tertiary Test

1. ...α...B
2. ...β...b

If there is no tertiary difference, then the difference is not significant enough for CLDR to take it into account, so the characters will be treated as equal (unless someone sorts with a final, codepoint level).

### Contractions

Characters may behave differently in different contexts. For example, "ch" in Slovak sorts after H. A sequence of characters that behaves that way is called a contraction. Another common case of contractions occurs in syllabaries, where a sequence of characters forming a syllable collates as a unit.

Note that contractions are typically rather expensive in implementations: they take more storage, and are much slower to compare. So they should be avoided where possible. For example, suppose that we have the following sequence in a dictionary (where the uppercase characters represent characters in the target script):

KB

... // combinations of K with consonants

KZ

KA

KE

KI

KO

KU

LB

...

There are two ways to produce this ordering. One is to have KA, KE, KI, etc. be contractions. The other is to order all the vowels after all the consonants.
Where the latter is sufficient, it is strongly preferred.

## Minimal Rules

The goal is always to specify the ***minimal*** differences from the DUCET. For example, take the case of Slovak, where everything sorts as in DUCET except for certain characters. The following rules place the characters ä, č, đ, and the sequence "ch" (and their case variants) at the appropriate positions in the sorting sequence, and with the appropriate strengths:

**Minimal Rules**

\& A

\< ä \<\<\< Ä

\& C

\< č \<\<\< Č

\& D

\< đ \<\<\< Đ

\& H

\< ch \<\<\< cH \<\<\< Ch \<\<\< CH

...

It would be possible instead to have rules that list every letter used by Slovak \[a á ä b c č d ď e é f\-h {ch} i í j\-l ĺ ľ m n ň o ó ô p\-r ŕ s š t ť u ú v\-y ý z ž], looking something like the following.

**Maximal Rules**

\& A \<\< á \<\<\< Á

\< ä \<\<\< Ä

\< b \<\<\< B

\< c \<\<\< C

\< č \<\<\< Č

\< d

...

***The Maximal Rules format is not accepted in CLDR.*** The reasons are:

1. Every time a character is tailored, the data for that character takes up more room in typical implementations. That means that the data for collation is larger, downloads of collation libraries with that data are slower, sort keys are longer, and performance is slower; sometimes very much so.
2. Related characters in the same script end up in a peculiar order. For example, if the Slovak tailoring omits ƀ, then it would show up after z.

You can see what the UCA currently does with a given script by looking at the charts at [Unicode Collation Charts](http://www.unicode.org/charts/collation/), or at the [UCA in ICU\-style rules](http://unicode.org/cldr/data/diff/collation/UCA.txt).
For example, suppose that U\+0D89 SINHALA LETTER IYANNA and U\+0D8A SINHALA LETTER IIYANNA needed to come after U\+0D96 SINHALA LETTER AUYANNA, in primary order, and that otherwise DUCET was acceptable. Then you would give the following rules:

\& ඖ \# U\+0D96 SINHALA LETTER AUYANNA

\< ඉ \# U\+0D89 SINHALA LETTER IYANNA

\< ඊ \# U\+0D8A SINHALA LETTER IIYANNA

## Pitfalls

There are a number of pitfalls with collation, so be careful. In some cases, such as Hungarian or Japanese, the rules can be fairly complicated (of course, reflecting that the sorting sequence for those languages is complicated).

1. **Only tailor expected data.** We focus on the required collation sequence for a given language with normal data. So we don't include full\-width characters for a European collation sequence, such as
    - ... CSCS \<\<\< ＣＳＣＳ ...
    - ... CSCS \<\<\< \\uFF23\\uFF33\\uFF23\\uFF33 ... (equivalently)
2. **Tailor trailing contractions.** If a sequence of characters is treated as a unit for collation, it should be entered as a contraction.
    1. \& c \< ch
    2. One might think that a sequence like "dz" doesn't require that, since it would always come after "d" followed by any other letter; it is a "trailing contraction". But in unusual cases, that wouldn't be true; if "dz" is a unit sorted as if it were a distinct letter after "d", one should get the ordering "dα" \< "dz". The correct behavior will only happen if "dz" is a contraction, such as
    3. \& d \< dz
3. **Watch out for Expansions.** If you have a rule like \&cs \< d, and "cs" has not occurred in a previous rule as a contraction, then this is automatically considered to be the same as \&c \< d / s; that is, the d *expands* as if it were a "cs" (actually, primary greater than a "cs", since we wrote "\<"). This expansion takes effect until the next primary difference.
    1. So suppose that "ccs" is to behave as if it were "cscs", and take case differences into account.
You might try to do this with the rules on the left:

| Rules (Wrong) | Actual Effect |
|---|---|
| \& C \< cs \<\<\< Cs \<\<\< CS | \& C \< cs \<\<\< Cs \<\<\< CS |
| \& cscs \<\<\< ccs | **\& cs \<\<\< ccs / cs** |
| \<\<\< Cscs \<\<\< Ccs | **\<\<\< Cscs / cs \<\<\< Ccs / cs** |
| \<\<\< CSCS \<\<\< CCS | **\<\<\< CSCS / cs \<\<\< CCS / cs** |

But since "cscs" has not been made a contraction in previous rules, this produces an automatic expansion, one that continues through the entire sequence of non\-primary differences, as shown on the right. This is *not* what is wanted: each item acts like it expands compared to the previous item. So CCS, for example, will act like it expands to CSCScs!

What you actually want is the following:

| Rules (Right) | Actual Effect |
|---|---|
| \& C \< cs \<\<\< Cs \<\<\< CS | \& C \< cs \<\<\< Cs \<\<\< CS |
| \& cscs \<\<\< ccs | \& cs \<\<\< ccs / cs |
| \& Cscs \<\<\< Ccs | \& Cs \<\<\< Ccs / cs |
| \& CSCS \<\<\< CCS | \& CS \<\<\< CCS / CS |

In short, when you have expansions, it is always safer and clearer to express them with separate resets. There are only a few exceptions to this, notably when CJK characters are interleaved with Hangul Syllables.

4. **Minimal Rules.** Example: Maltese was sorting character sequences *before* a base character using the following style:
    1. \& B
    2. \< ċ
    3. \<\<\< Ċ
    4. \< c
    5. \<\<\< C
    6. The correct rules should be the minimal ones.
    7. \& \[before 1] c \< ċ \<\<\< Ċ
    8. This finds the highest primary (that's what the 1 is for) character less than c, and uses that as the reset point. For Maltese, the same technique needs to be used for ġ and ż.
5. **Blocking Contractions.** Contractions can be blocked with CGJ, as described in the Unicode Standard and in the [Characters and Combining Marks FAQ](http://www.unicode.org/faq/char_combmark.html).
6. **Case Combinations.** The lowercase, titlecase, and uppercase variants of contractions need to be supplied, with tertiary differences in that order (regardless of the caseFirst setting). That is, if *ch* is a contraction, then you would have the rules `... ch <<< Ch <<< CH`. Other case variants such as *cH* are excluded because they are unlikely to represent the contraction, for example in *McHugh*. (Therefore, *mchugh* and *McHugh* will be primary different if *ch* adds a primary difference.) \[[\#8248](http://unicode.org/cldr/trac/ticket/8248)]
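
The case-variant chain described under **Case Combinations** is mechanical enough to generate. The following Python sketch is only an illustration: `case_variant_rule` is a hypothetical helper, not part of any CLDR or ICU tool, and it assumes that uppercasing the first character is an adequate stand-in for titlecase, which may not hold for every script or for single-character digraphs.

```python
def case_variant_rule(contraction: str) -> str:
    """Build the chain lowercase <<< Titlecase <<< UPPERCASE for a contraction.

    Mixed variants such as "cH" are deliberately left out, matching the
    guideline that they are unlikely to represent the contraction.
    """
    if not contraction:
        raise ValueError("contraction must not be empty")
    lower = contraction.lower()
    # Assumption: first character uppercased approximates titlecase.
    title = lower[0].upper() + lower[1:]
    upper = lower.upper()
    # dict.fromkeys drops duplicates while keeping the required order
    # (relevant when a variant has no distinct case form).
    variants = list(dict.fromkeys([lower, title, upper]))
    return " <<< ".join(variants)


print(case_variant_rule("ch"))  # ch <<< Ch <<< CH
print(case_variant_rule("dz"))  # dz <<< Dz <<< DZ
```

The output is only the tertiary chain; the reset and the primary relation (for example `& c <`) still have to be written by hand for the language in question.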