---
layout: default
title: Concepts
nav_order: 1
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Concepts
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

The previous section demonstrated many of the requirements imposed on string
comparison routines that try to correctly collate strings according to
conventions of more than a hundred different languages, written in many
different scripts. This section describes the principles and architecture behind
the ICU Collation Service.

## Sortkeys vs Comparison

Sort keys are most useful in databases, where the overhead of calling a function
for each comparison is very large.

Generating a sort key from a Collator is many times more expensive than doing a
compare with the Collator (for common use cases). That's if the two functions
are called from Java or C. So for those languages, unless there is a very large
number of comparisons, it is better to call the compare function.

Here is an example, with a little back-of-the-envelope calculation. Let's
suppose that with a given language on a given platform, a compare call (cost CP)
is 100 times faster than generating a sort key (cost SP), so SP = 100 \* CP, and
that you are doing a binary search of a list with 1,000 elements. The cost of a
binary comparison of two sort keys is BP. We'd do about 10 comparisons, getting:

compare: 10 \* CP

sortkey: 1 \* SP + 10 \* BP

Even if BP is free, compare would be better. One has to get up to where log2(n)
= 100 before they break even.

But even this calculation is only a rough guide. First, the binary comparison is
not completely free. Secondly, the performance of the compare function varies
radically with the source data.
We optimized for maximizing performance of
collation in sorting and binary search, so comparing strings that are "close" is
optimized to be much faster than comparing strings that are "far away". That
optimization is important because normal sort/lookup operations compare close
strings far more often -- think of binary search, where the last few comparisons
are always with the closest strings. So even the above calculation is not very
accurate.

## Comparison Levels

In general, when comparing and sorting objects, some properties can take
precedence over others. For example, in geometry, you might consider first the
number of sides a shape has, followed by the number of sides of equal length.
This causes triangles to be sorted together, then rectangles, then pentagons,
etc. Within each category, the shapes would be ordered according to whether they
had 0, 2, 3 or more sides of the same length. However, this is not the only way
the shapes can be sorted. For example, it might be preferable to sort shapes by
color first, so that all red shapes are grouped together, then blue, etc.
Another approach would be to sort the shapes by the amount of area they enclose.

Similarly, character strings have properties, some of which can take precedence
over others. There is more than one way to prioritize the properties.

For example, a common approach is to distinguish characters first by their
unadorned base letter (for example, without accents, vowels or tone marks), then
by accents, and then by the case of the letter (upper vs. lower). Ideographic
characters might be sorted by their component radicals and then by the number of
strokes it takes to draw the character.
An alternative ordering would be to sort these characters by strokes first and
then by their radicals.

The ICU Collation Service supports many levels of comparison (named "Levels",
but also known as "Strengths").
Having these categories enables ICU to sort
strings precisely according to local conventions. In addition, because the
levels can be selectively employed, searching for a string in text can be
performed with various matching conditions.

Performance optimizations have been made for ICU collation with the default
level settings. Performance-specific impacts are discussed in the Performance
section below.

Following is a list of the names for each level and an example usage:

1. Primary Level: Typically, this is used to denote differences between base
   characters (for example, "a" < "b"). It is the strongest difference. For
   example, dictionaries are divided into different sections by base character.
   This is also called the level-1 strength.

2. Secondary Level: Accents in the characters are considered secondary
   differences (for example, "as" < "às" < "at"). Other differences between
   letters can also be considered secondary differences, depending on the
   language. A secondary difference is ignored when there is a primary
   difference anywhere in the strings. This is also called the level-2
   strength.
   Note: In some languages (such as Danish), certain accented letters are
   considered to be separate base characters. In most languages, however, an
   accented letter only has a secondary difference from the unaccented version
   of that letter.

3. Tertiary Level: Upper and lower case differences in characters are
   distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
   addition, a variant of a letter differs from the base form on the tertiary
   level (such as "A" and "Ⓐ"). Another example is the difference between large
   and small Kana. A tertiary difference is ignored when there is a primary or
   secondary difference anywhere in the strings. This is also called the
   level-3 strength.

4. Quaternary Level: When punctuation is ignored (see
   [Ignoring Punctuation](#ignoring-punctuation)) at levels 1-3, an additional
   level can be used to distinguish words with and without punctuation (for
   example, "ab" < "a-b" < "aB"). This difference is ignored when there is a
   primary, secondary or tertiary difference. This is also known as the level-4
   strength. The quaternary level should only be used if ignoring punctuation
   is required or when processing Japanese text (see
   [Hiragana Processing](#hiragana-processing)).

5. Identical Level: When all other levels are equal, the identical level is
   used as a tiebreaker. The Unicode code point values of the NFD form of each
   string are compared at this level, just in case there is no difference at
   levels 1-4. For example, Hebrew cantillation marks are only distinguished
   at this level. This level should be used sparingly, because it is extremely
   rare for two strings to differ only in their code point values.
   Using this level substantially decreases the performance for
   both incremental comparison and sort key generation (as well as increasing
   the sort key length). It is also known as the level-5 strength.

## Backward Secondary Sorting

Some languages require words to be ordered on the secondary level according to
the *last* accent difference, as opposed to the *first* accent difference. This
was previously the default for all French locales, based on some French
dictionary ordering traditions, but is currently only applicable to Canadian
French (locale **fr_CA**), for conformance with the [Canadian sorting
standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
ordering is only noticeable for a small number of pairs of real words. For more
information see [UCA: Contextual
Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).
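
The difference between the two modes can be illustrated with a toy model. The
sketch below is not ICU code and does not use ICU weights: it assumes that the
NFD base letter can stand in for the primary weight and that the code points of
the combining marks can stand in for secondary weights. The names `toy_levels`
and `toy_key` are made up for this illustration.

```python
import unicodedata


def toy_levels(word):
    """Split a word into toy primary (base letter) and secondary (accent) weights."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and primary:
            # Attach the combining mark to the preceding base letter.
            secondary[-1] += (ord(ch),)
        else:
            primary.append(ch)
            secondary.append(())
    return primary, secondary


def toy_key(word, backward_secondary=False):
    """Build a two-level key; optionally compare accents from the end of the word."""
    primary, secondary = toy_levels(word)
    if backward_secondary:
        secondary = secondary[::-1]
    return (primary, secondary)


words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=toy_key)
backward = sorted(words, key=lambda w: toy_key(w, backward_secondary=True))
print(forward)
print(backward)
```

Under these assumptions the two sorted lists differ only in the position of
"coté" and "côte", because the last accent difference decides the order in
backward mode.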

Example:

Forward secondary | Backward secondary
----------------- | ------------------
cote              | cote
coté              | côte
côte              | coté
côté              | côté

## Contractions

A contraction is a sequence of two or more letters that is treated as a single
letter in sorting.

For example, in the traditional Spanish sorting order, "ch" is considered a
single letter. All words that begin with "ch" sort after all other words
beginning with "c", but before words starting with "d".

Other examples of contractions are "ch" in Czech, which sorts after "h", and
"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
respectively.

Example:

Order without contraction | Order with contraction "lj" sorting after letter "l"
------------------------- | ----------------------------------------------------
la                        | la
li                        | li
lj                        | lk
lja                       | lz
ljz                       | lj
lk                        | lja
lz                        | ljz
ma                        | ma

Contracting sequences such as the above are not very common in most languages.

> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
> if a completely ignorable code point
> appears in text in the middle of a contraction, it will not break the contraction.
> For example, in Czech sorting, c U+0000 h will sort as if it were ch.

## Expansions

If a letter sorts as if it were a sequence of more than one letter, it is called
an expansion.

For example, in German phonebook sorting (de@collation=phonebook or BCP 47
de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae".
All words starting with "ä" will sort between words starting with "ad" and words
starting with "af".

In the case of Unicode encoding, characters can often be represented either as
pre-composed characters or in decomposed form. For example, the letter "à" can
be represented in its decomposed (a+\`) and pre-composed (à) form.
Most
applications do not want to distinguish text by the way it is encoded. A search
for "à" should find all instances of the letter, regardless of whether the
instance is in pre-composed or decomposed form. Therefore, either form of the
letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.

## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where the vowel with a prolonged sound mark is
treated as equivalent to the long vowel version:

カアー<<< カイー and\
キイー<<< キイー

> :point_right: **Note** Since ICU 2.0 Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contractions producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed forms. There are other types of
equivalences possible with Unicode, including Canonical and Compatibility. The
process of Normalization ensures that text is written in a predictable way so
that searches are not made unnecessarily complicated by having to match on
equivalences. Not all text is normalized, however, so it is useful to have a
collation service that can address text that is not normalized, but do so with
efficiency.

The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.

In practice, most data that is encountered is in normalized or semi-normalized
form already. The ICU Collation Service is designed so that it can process a
wide range of normalized or un-normalized text without a need for normalization
processing.
When a case is encountered that requires normalization, the ICU
Collation Service drops into code specific to this purpose. This maximizes
performance for the majority of text that does not require normalization.

In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service has the ability to turn Normalization Checking either
on or off. If Normalization Checking is turned off, it is the user's
responsibility to ensure that all text is already in the appropriate form. This
is true in a great majority of the world's languages, so normalization checking
is turned off by default for most locales.

If the text requires normalization processing, Normalization Checking should be
on. Any language that uses multiple combining characters such as Arabic, ancient
Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
to be on, or the text to go through a normalization process before collation.

For more information about Normalization related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15](http://www.unicode.org/reports/tr15/).

> :point_right: **Note** ICU supports two modes of normalization: on and off.
> The java.text.\* classes offer a compatibility decomposition mode, which is not supported in ICU.

## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.

These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except for the quaternary level.
If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are then distinguished at the fourth level based on their
punctuation (if any). If the comparison function ignores differences at the
fourth level, then strings that differ by punctuation only are compared as
equal.

The following table shows the results of sorting a list of terms in 3 different
ways. In the first column, punctuation characters (space " ", and hyphen "-")
are not ignored (" " < "-" < "b"). In the second column, punctuation characters
are ignored in the first 3 levels and compared only in the fourth level. In the
third column, punctuation characters are ignored in the first 3 levels and the
fourth level is not considered. In the last column, punctuated terms are
equivalent to the identical terms without punctuation.

For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.

Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird    | black bird                        | **black bird**
black Bird    | black-bird                        | **black-bird**
black birds   | blackbird                         | **blackbird**
black-bird    | black Bird                        | black Bird
black-Bird    | black-Bird                        | black-Bird
black-birds   | blackBird                         | blackBird
blackbird     | black birds                       | black birds
blackBird     | black-birds                       | black-birds
blackbirds    | blackbirds                        | blackbirds

> :point_right: **Note** The strings with the same font format in the last column are
> compared as equal by the ICU Collator.\
> Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
> follow shifted code points will be completely ignored. This means that an accent
> following a space will compare as if it were a space alone.
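
The second and third columns of the table can be reproduced with a small toy
model. The sketch below is not ICU code: it assumes that every non-alphanumeric
character is ignorable, uses code points in place of real collation weights,
and leaves out the secondary level because the sample terms have no accents.
`toy_shifted_key` and `HIGH` are hypothetical names.

```python
HIGH = 0xFFFF  # toy quaternary weight for characters that are not ignored


def toy_shifted_key(text, strength=4):
    """Toy key: punctuation is skipped on levels 1-3 and weighed on level 4."""
    letters = [c for c in text if c.isalnum()]
    primary = [c.lower() for c in letters]
    tertiary = [c.isupper() for c in letters]  # lowercase sorts before uppercase
    key = (primary, tertiary)
    if strength >= 4:
        # Ignored characters keep their own weight here; all others tie at HIGH.
        quaternary = [HIGH if c.isalnum() else ord(c) for c in text]
        key += (quaternary,)
    return key


terms = [
    "blackbirds", "black-birds", "black birds",
    "blackBird", "black-Bird", "black Bird",
    "blackbird", "black-bird", "black bird",
]
quaternary_order = sorted(terms, key=toy_shifted_key)
print(quaternary_order)

# With the fourth level switched off, punctuation-only differences vanish.
same_at_tertiary = (
    toy_shifted_key("black bird", strength=3)
    == toy_shifted_key("black-bird", strength=3)
    == toy_shifted_key("blackbird", strength=3)
)
print(same_at_tertiary)
```

The point of the model is the shape of the key: ignored characters contribute
nothing to the first three levels, so they can only break ties.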

## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large
Kana, and other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting
with the same case sort together. Some Japanese applications require the
difference between small and large Kana be emphasized over other tertiary
differences.

The UCA does not provide means to separate out either case or Kana differences
from the remaining tertiary differences. However, the ICU Collation Service has
two options that help customize case and/or Kana differences. Both options
are turned off by default.

### CaseFirst

The Case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The Case-first option can be set to
make either lowercase sort before
uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply a
reordering of the tertiary level.

ICU makes use of the following three case categories for sorting:

1. uppercase: "ABC"

2. mixed case: "Abc", "aBc"

3. normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the
"case-first" option is set.

### CaseLevel

The Case Level option makes a separate level for case differences. This is an
extra level positioned between secondary and tertiary. The case level is used in
Japanese to make the difference between small and large Kana more important than
the other tertiary differences. It also can be used to ignore other tertiary
differences, or even secondary differences. This is especially useful in
matching.
For example, if the strength is set to primary only (level-1) and the
case level is turned on, the comparison ignores accents and tertiary differences
except for case. The contents of the case level are affected by the case-first
option.

The case level is independent from the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on. This
provides for comparison that takes into account the case differences, while at
the same time ignoring accents and tertiary differences other than case. This
may be used in searching.

Example:

**Case-first off, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Lowercase-first, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit

**Uppercase-first, Case level off**

Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich

**Lowercase-first, Case level on**

apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Uppercase-first, Case level on**

Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich

## Script Reordering

Script reordering allows scripts and some other groups of characters to be moved
relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script blocks. These special groups of
characters are space, punctuation, symbol, currency, and digit. Script groups
can be intermingled with these special non-script groups if those special groups
are explicitly specified in the reordering.

The special code `others` stands for any script that is not explicitly mentioned
in the list. Anything that is after `others` will go at the very end of the list
in the order given. For example, `[Grek, others, Latn]` will result in an
ordering that puts all scripts other than Greek and Latin between them.

### Examples:

Note: All examples below use the string equivalents for the scripts and reorder
codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.

**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by: set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`

That is, setting a reordering always modifies
the DUCET/CLDR order, replacing whatever was previously set, rather than adding
on to it. In order to cumulatively modify an ordering, you have to retrieve the
existing ordering, modify it, and then set it.
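
The placement rules and the "replace, do not accumulate" behavior can be
sketched as a toy model. This is not the ICU API: `resolve_reordering` and
`SPECIAL_GROUPS` are hypothetical names, the function only mirrors the
placement rules described in this section, and it does not model error cases,
an empty list, or the `NONE` and `DEFAULT` codes.

```python
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]


def resolve_reordering(codes):
    """Toy model of the resulting order; it always starts from the default order."""
    if "others" in codes:
        split = codes.index("others")
        at_start, at_end = codes[:split], codes[split + 1:]
    else:
        at_start, at_end = list(codes), []
    # Special groups stay in front unless the caller mentions them explicitly.
    leading = [g for g in SPECIAL_GROUPS if g not in codes]
    return leading + at_start + ["others"] + at_end


first = resolve_reordering(["Grek"])
second = resolve_reordering(["Hani"])  # replaces the earlier setting; Grek is gone
print(first)
print(second)

# A cumulative change has to be assembled by the caller from the codes set so far.
current_codes = ["Grek"]
cumulative = resolve_reordering(current_codes + ["Hani"])
print(cumulative)
```

In this model each call computes its result from scratch, which is why the
second call does not keep `Grek` in front.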

**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`

**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`

**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR

**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.

ICU 4.8-54:

* Scripts were reordered in groups, each normally starting with a [Recommended
  Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
* Reorder codes moved as a group (were “equivalent”) if their scripts shared a
  primary-weight lead byte.
* For example, Hebr and Phnx were “equivalent” reordering codes and were
  reordered together. Their order relative to each other could not be changed.
* Only one code from any given group could be reordered, not several codes from
  the same group.

## Sorting of Japanese Text (JIS X 4061)

Japanese standard JIS X 4061 requires two changes to the collation procedures:
special processing of Hiragana characters and (for performance reasons) prefix
analysis of text.

### Hiragana Processing

The JIS X 4061 standard requires more levels than provided by the UCA. To offer
a conformant sorting order, ICU uses the quaternary level to distinguish between
Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
symbols on the quaternary level, thus causing Hiragana sequences to sort before
corresponding Katakana sequences.

### Prefix Analysis

Another characteristic of sorting according to JIS X 4061 is a large number
of contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana code points to be treated as
contractions, which reduces performance. The solution we adopted introduces the
prefix concept which allows us to improve the performance of Japanese sorting.
More about this can be found in the [customization
chapter](customization/index.md).

## Thai/Lao reordering

UCA requires that certain Thai and Lao prevowels be reordered with a code point
following them. This option is always on in the ICU implementation, as
prescribed by the UCA.

This rule takes effect when:

1. A Thai vowel in the range U+0E40-U+0E44 precedes a Thai consonant in the
   range U+0E01-U+0E2E, or

2. A Lao vowel in the range U+0EC0-U+0EC4 precedes a Lao consonant in the
   range U+0E81-U+0EAE.

In these cases the vowel is placed after the consonant for collation purposes.

> :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai
> reordering.
> The java.text.\* classes allow tailorings to turn off reordering by
> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
> prevowels.

## Space Padding

In many database products, fields are padded with null. To get correct results,
the input to a Collator should omit any superfluous trailing padding spaces. The
problem arises with contractions, expansions, or normalization. Suppose that
there are two fields, one containing "aed" and the other with "äd". German
phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
"aed". But if both fields are padded with spaces to a length of 4, then this
will reverse the order, since "äd" will then compare as if it were one character
longer. In other words, when you start with strings 1 and 2

1  | a  | e  | d         | \<space\>
-- | -- | -- | --------- | ---------
2  | ä  | d  | \<space\> | \<space\>

they end up being compared on a primary level as if they were 1' and 2'

1' | a  | e  | d  | \<space\> |
-- | -- | -- | -- | --------- | ---------
2' | a  | e  | d  | \<space\> | \<space\>

Since 2' has an extra character (the extra space), it counts as having a primary
difference when it shouldn't. The correct result occurs when the trailing
padding spaces are removed, as in 1" and 2"

1" | a  | e  | d
-- | -- | -- | --
2" | a  | e  | d

## Collator naming scheme

***Starting with ICU 54, the following naming scheme and its API functions are deprecated.***
Use `ucol_open()` with language tag collation keywords instead
(see [Collation API Details](api.md)). For example,
`ucol_open("de-u-co-phonebk-ka-shifted", &errorCode)` for German Phonebook order
with "ignore punctuation" mode.

When collating or matching text, a number of attributes can be used to affect
the desired result.
The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications. It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
can be used to specify that the desired options are: UCA version 4.0.0; ignore
spaces, punctuation and symbols; use Swedish linguistic conventions; compare
case-insensitively.

A number of attribute values are common across different attributes; these
include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
otherwise stated, the examples use the UCA alone with default settings.

> :point_right: **Note** In order to achieve uniqueness, a collator name always
> has the attribute abbreviations sorted.

### Main References

1. For a full list of supported locales in ICU, see [Locale
   Explorer](http://demo.icu-project.org/icu-bin/locexp), which also contains
   an on-line demo showing sorting for each locale. The demo allows you to try
   different attribute values, to see how they affect sorting.

2. To see tabular results for the UCA table itself, see the [Unicode Collation
   Charts](http://www.unicode.org/charts/collation/).

3. For the UCA specification, see [UTS #10: Unicode Collation
   Algorithm](http://www.unicode.org/reports/tr10/).

4. For more detail on the precise effects of these options, see [Collation
   Customization](customization/index.md).

#### Collator Naming Attributes

Attribute              | Abbreviation | Possible Values
---------------------- | ------------ | ---------------
Locale                 | L            | \<language\>
Script                 | Z            | \<script\>
Region                 | R            | \<region\>
Variant                | V            | \<variant\>
Keyword                | K            | \<keyword\>
                       |              |
Strength               | S            | 1, 2, 3, 4, I, D
Case_Level             | E            | X, O, D
Case_First             | C            | X, L, U, D
Alternate              | A            | N, S, D
Variable_Top           | T            | \<hex digits\>
Normalization Checking | N            | X, O, D
French                 | F            | X, O, D
Hiragana               | H            | X, O, D

#### Collator Naming Attribute Descriptions

The **Locale** attribute is typically the most
important attribute for correct sorting and matching, according to the user
expectations in different countries and regions. The default UCA ordering will
only sort a few languages such as Dutch and Portuguese correctly ("correctly"
meaning according to the normal expectations for users of the languages).
For other languages, a locale needs to be supplied so as to choose a
collator that is correctly **tailored** for that locale. The choice of a locale
will automatically preset the values for all of the attributes to something that
is reasonable for that locale. Thus most of the time the other attributes do not
need to be explicitly set. In some cases, the choice of locale will make a
difference in string comparison performance and/or sort key length.

In short attribute names,
`<language>_<script>_<region>_<variant>@collation=<keyword>` is
represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
all the elements are required. Valid values for the locale elements are the
generally valid values for RFC 3066 locale naming.

**Example:**\
**Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
**Locale="de" (German)** "Köpfe" < "Kypper"

The **Strength** attribute determines whether accents or
case are taken into account when collating or matching text. (In writing
systems without case or accents, it controls similarly important features.) The
default strength setting usually does not need to be changed for collating
(sorting), but often needs to be changed when **matching** (e.g. SELECT). The
possible values include Default (D), Primary (1), Secondary (2), Tertiary (3),
Quaternary (4), and Identical (I).

For example, people may choose to ignore accents or ignore accents and case when
searching for text.

Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to be
Shifted, then the Quaternary strength (4) can be used to break ties among
whitespace, punctuation, and symbols that would otherwise be ignored. If very
fine distinctions among characters are required, then the Identical strength (I)
can be used (for example, Identical Strength distinguishes between the
**Mathematical Bold Small A** and the **Mathematical Italic Small A.** For more
examples, look at the cells with white backgrounds in the collation charts).
However, using a level higher than Tertiary (in particular the Identical
strength) results in significantly longer sort keys, and slower string
comparison performance for equal strings.

**Example:**\
**S=1** role = Role = rôle\
**S=2** role = Role < rôle\
**S=3** role < Role < rôle

The **Case_Level** attribute is used when ignoring accents
**but not** case. In such a situation, set Strength to be Primary, and
Case_Level to be On. In most locales, this setting is Off by default. There is a
small string comparison performance and sort key impact if this attribute is set
to be On.
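
How Strength and Case_Level combine can be shown with a toy key builder. The
sketch below is not ICU code: it assumes that base letters, combining marks and
letter case can stand in for primary, secondary and tertiary weights, and the
name `toy_strength_key` is made up for this illustration.

```python
import unicodedata


def toy_strength_key(word, strength=3, case_level=False):
    """Toy key with primary, secondary, optional case, and tertiary levels."""
    primary, secondary, case = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and primary:
            secondary[-1] += (ord(ch),)  # accents act as secondary weights
        else:
            primary.append(ch.lower())
            secondary.append(())
            case.append(ch.isupper())    # lowercase sorts before uppercase
    key = [primary]
    if strength >= 2:
        key.append(secondary)
    if case_level:
        key.append(case)                 # sits between secondary and tertiary
    if strength >= 3:
        key.append(case)                 # case is the only tertiary difference modeled
    return key


words = ["rôle", "Role", "role"]
tertiary_order = sorted(words, key=toy_strength_key)
print(tertiary_order)

# Primary strength with the case level on: accents are ignored, case is not.
k = lambda w: toy_strength_key(w, strength=1, case_level=True)
print(k("role") == k("rôle"), k("rôle") < k("Role"))
```

Each option only adds or removes a level from the key; the comparison itself is
always an ordinary level-by-level comparison.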

**Example:**\
**S=1, E=X** role = Role = rôle\
**S=1, E=O** role = rôle < Role

The **Case_First** attribute is used to control whether
uppercase letters come before lowercase letters or vice versa, in the absence of
other differences in the strings. The possible values are Uppercase_First (U)
and Lowercase_First (L), plus the standard Default and Off. There is almost no
difference between the Off and Lowercase_First options in terms of results, so
typically users will not use Lowercase_First: only Off or Uppercase_First.
(People interested in the detailed differences between X and L should consult
the [Collation Customization](customization/index.md) chapter.)
Specifying either L or U won't affect string comparison performance, but will
affect the sort key length.

**Example:**\
**C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
**C=U** "China" < "china" < "Denmark" < "denmark"

The **Alternate** attribute is used to control the handling of
the so-called **variable** characters in the UCA: whitespace, punctuation and
symbols. If Alternate is set to Non-Ignorable (N), then differences among these
characters are of the same importance as differences among letters. If Alternate
is set to Shifted (S), then these characters are of only minor importance. The
Shifted value is often used in combination with Strength set to Quaternary. In
such a case, white-space, punctuation, and symbols are considered when comparing
strings, but only if all other aspects of the strings (base letters, accents,
and case) are identical. If Alternate is not set to Shifted, then there is no
difference between a Strength of 3 and a Strength of 4.

For more information and examples, see
[Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
the UCA.

The reason the Alternate values are not simply On and Off is that
additional Alternate values may be added in the future.

The UCA option
**Blanked** is expressed with Strength set to 3, and Alternate set to Shifted.

The default for most locales is Non-Ignorable. If Shifted is selected, comparison
may be slower if there are many strings that are the same except for punctuation;
sort key length will not be affected unless the strength level is also increased.

**Example:**\
**S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
**S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
**S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA

The **Variable_Top** attribute is only meaningful if the
Alternate attribute is not set to Non-Ignorable. In such a case, it controls
which characters count as ignorable. The \<hex\> value specifies the "highest"
character sequence weight (in UCA order) that is to be considered ignorable.

Thus, for example, if a user wanted whitespace to be ignorable, but not any
visible characters, then they would use the value Variable_Top=0020 (space). The
hex digits should specify only a single character. All characters of the same
primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the
same effect as Variable_Top=0020.

This setting (alone) has little impact on string comparison performance; setting
it lower or higher will make sort keys slightly shorter or longer respectively.

**Example:**\
**S=3, A=S** di Silva = diSilva < U.S.A. = USA\
**S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA

The **Normalization** setting determines whether
text is thoroughly normalized or not in comparison. Even if the setting is Off
(which is the default for many locales), text as represented in common usage
will compare correctly (for details, see [UTN
#5](http://www.unicode.org/notes/tn5/)).
Only if the accent marks are in
non-canonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string
comparison performance cost if this attribute is On, depending on the frequency
of sequences that require normalization. There is no significant effect on sort
key length. If the input text is known to be in NFD or NFKD normalization forms,
there is no need to enable this Normalization option.

**Example:**\
**N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\
**N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈

Some **French** dictionary ordering traditions sort strings with
different accents from the back of the string. This attribute is automatically
set to On for the Canadian French locale (fr_CA). Users normally would not need
to explicitly set this attribute. There is a string comparison performance cost
when it is set On, but sort key length is unaffected.

**Example:**\
**F=X** cote < coté < côte < côté\
**F=O** cote < côte < coté < côté

Compatibility with JIS X 4061 requires the introduction of an
additional level to distinguish **Hiragana** and Katakana characters. If
compatibility with that standard is required, then this attribute is set On, and
the strength should be set to at least Quaternary.

This attribute is an implementation detail of the CLDR Japanese tailoring. The
implementation might change to use a different mechanism to achieve the same
Japanese sort order. Since ICU 50, this attribute is not settable any more.

**Example:**\
**H=X, S=4** きゅう = キュウ < きゆう = キユウ\
**H=O, S=4** きゅう < キュウ < きゆう < キユウ

> :point_right: **Note** If attributes in a collator name are not overridden,
> it is assumed that they are the same as for the given locale.
> For example, a collator opened with an empty
> string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.

### Summary of Value Abbreviations

Value         | Abbreviation
------------- | ------------
Default       | D
On            | O
Off           | X
Primary       | 1
Secondary     | 2
Tertiary      | 3
Quaternary    | 4
Identical     | I
Shifted       | S
Non-Ignorable | N
Lower-First   | L
Upper-First   | U
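
As a minimal sketch of how the abbreviations combine, the Java helper below
joins attribute letters and value abbreviations with underscores to produce a
string shaped like the one in the note above. The class and method names are
hypothetical, and the helper only builds text; it does not call any ICU API or
validate the settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class ShortNameSketch {
    // Joins "<attribute letter><value abbreviation>" pairs with underscores,
    // in the insertion order of the map.
    static String build(Map<String, String> settings) {
        StringJoiner joiner = new StringJoiner("_");
        settings.forEach((attribute, value) -> joiner.add(attribute + value));
        return joiner.toString();
    }

    // The settings quoted in the note for a collator opened with an empty string.
    static Map<String, String> defaults() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("A", "N");    // Alternate = Non-Ignorable
        settings.put("C", "X");    // Case_First = Off
        settings.put("E", "X");    // Case_Level = Off
        settings.put("F", "X");    // French = Off
        settings.put("H", "X");    // Hiragana = Off
        settings.put("K", "X");    // K = Off, copied verbatim from the note
        settings.put("N", "X");    // Normalization = Off
        settings.put("S", "3");    // Strength = Tertiary
        settings.put("T", "0000"); // Variable_Top
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(build(defaults()));
    }
}
```

Changing a single entry, for example setting `S` to `1` and `E` to `O`, gives
the combination used in the Case_Level example.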