---
layout: default
title: Concepts
nav_order: 1
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation Concepts
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

The previous section demonstrated many of the requirements imposed on string
comparison routines that try to correctly collate strings according to
conventions of more than a hundred different languages, written in many
different scripts. This section describes the principles and architecture behind
the ICU Collation Service.

## Sortkeys vs Comparison

Sort keys are most useful in databases, where the overhead of calling a function
for each comparison is very large.

Generating a sort key from a Collator is many times more expensive than doing a
compare with the Collator (for common use cases). That's if the two functions
are called from Java or C. So for those languages, unless there is a very large
number of comparisons, it is better to call the compare function.

Here is an example, with a little back-of-the-envelope calculation. Let's
suppose that with a given language on a given platform, the cost of generating
a sort key (SP) is 100 times the cost of a compare (CP), and that you are doing
a binary search of a list with 1,000 elements. The cost of a binary comparison
of two sort keys is BP. We'd do about 10 comparisons, getting:

compare: 10 \* CP

sortkey: 1 \* SP + 10 \* BP

Even if BP is free, compare would be better. One has to get up to where log2(n)
= 100 before they break even.
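
The arithmetic above can be checked with a short script. This is only a sketch of the cost model used in this section: the function name `lookup_costs` and the unit costs are assumptions chosen for illustration, not measurements of ICU.

```python
import math


def lookup_costs(n, cp=1.0, sp=100.0, bp=0.0):
    """Return (compare_cost, sortkey_cost) for one binary search over n items.

    cp: cost of one Collator compare (assumed unit cost)
    sp: cost of generating one sort key (assumed to be 100 * cp)
    bp: cost of one binary comparison of two sort keys (assumed free here)
    """
    comparisons = math.ceil(math.log2(n))
    compare_cost = comparisons * cp
    sortkey_cost = 1 * sp + comparisons * bp
    return compare_cost, sortkey_cost


# A list of 1,000 elements needs about 10 comparisons.
print(lookup_costs(1000))      # (10.0, 100.0)
# The two approaches only break even when log2(n) reaches 100.
print(lookup_costs(2 ** 100))  # (100.0, 100.0)
```

With any non-zero `bp`, the break-even point moves even further out.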

But even this calculation is only a rough guide. First, the binary comparison is
not completely free. Secondly, the performance of the compare function varies
radically with the source data. We optimized for maximizing performance of
collation in sorting and binary search, so comparing strings that are "close" is
optimized to be much faster than comparing strings that are "far away". That
optimization is important because normal sort/lookup operations compare close
strings far more often -- think of binary search, where the last few comparisons
are always with the closest strings. So even the above calculation is not very
accurate.

## Comparison Levels

In general, when comparing and sorting objects, some properties can take
precedence over others. For example, in geometry, you might consider first the
number of sides a shape has, followed by the number of sides of equal length.
This causes triangles to be sorted together, then rectangles, then pentagons,
etc. Within each category, the shapes would be ordered according to whether they
had 0, 2, 3 or more sides of the same length. However, this is not the only way
the shapes can be sorted. For example, it might be preferable to sort shapes by
color first, so that all red shapes are grouped together, then blue, etc.
Another approach would be to sort the shapes by the amount of area they enclose.

Similarly, character strings have properties, some of which can take precedence
over others. There is more than one way to prioritize the properties.

For example, a common approach is to distinguish characters first by their
unadorned base letter (for example, without accents, vowels or tone marks), then
by accents, and then by the case of the letter (upper vs. lower). Ideographic
characters might be sorted by their component radicals and then by the number of
strokes it takes to draw the character.
An alternative ordering would be to sort these characters by strokes first and
then by their radicals.

The ICU Collation Service supports many levels of comparison (named "Levels",
but also known as "Strengths"). Having these categories enables ICU to sort
strings precisely according to local conventions. However, by allowing the
levels to be selectively employed, searching for a string in text can be
performed with various matching conditions.

Performance optimizations have been made for ICU collation with the default
level settings. Performance specific impacts are discussed in the Performance
section below.

Following is a list of the names for each level and an example usage:

1.  Primary Level: Typically, this is used to denote differences between base
    characters (for example, "a" < "b"). It is the strongest difference. For
    example, dictionaries are divided into different sections by base character.
    This is also called the level-1 strength.

2.  Secondary Level: Accents in the characters are considered secondary
    differences (for example, "as" < "às" < "at"). Other differences between
    letters can also be considered secondary differences, depending on the
    language. A secondary difference is ignored when there is a primary
    difference anywhere in the strings. This is also called the level-2
    strength.
    Note: In some languages (such as Danish), certain accented letters are
    considered to be separate base characters. In most languages, however, an
    accented letter only has a secondary difference from the unaccented version
    of that letter.

3.  Tertiary Level: Upper and lower case differences in characters are
    distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). In
    addition, a variant of a letter differs from the base form on the tertiary
    level (such as "A" and "Ⓐ"). Another example is the difference between large
    and small Kana. A tertiary difference is ignored when there is a primary or
    secondary difference anywhere in the strings. This is also called the
    level-3 strength.

4.  Quaternary Level: When punctuation is ignored (see
    [Ignoring Punctuation](#ignoring-punctuation)) at levels 1-3, an additional
    level can be used to distinguish words with and without punctuation (for
    example, "ab" < "a-b" < "aB"). This difference is ignored when there is a
    primary, secondary or tertiary difference. This is also known as the
    level-4 strength. The quaternary level should only be used if ignoring
    punctuation is required or when processing Japanese text (see
    [Hiragana Processing](#hiragana-processing)).

5.  Identical Level: When all other levels are equal, the identical level is
    used as a tiebreaker. The Unicode code point values of the NFD form of each
    string are compared at this level, just in case there is no difference at
    levels 1-4. For example, Hebrew cantillation marks are only distinguished
    at this level. This level should be used sparingly, because strings that
    differ only in their code point values are extremely rare.
    Using this level substantially decreases the performance for
    both incremental comparison and sort key generation (as well as increasing
    the sort key length). It is also known as the level-5 strength.
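
The first three levels can be illustrated with a toy sort key. The sketch below is not how ICU computes collation weights; `toy_sort_key` is a hypothetical helper that uses Python's `unicodedata` module to split each string into base letters (primary), combining accents (secondary), and case flags (tertiary), and then truncates the key to the requested strength.

```python
import unicodedata


def toy_sort_key(text, strength=3):
    """Build a (primary, secondary, tertiary) key, truncated to `strength` levels."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and primary:
            # Combining marks (accents) attach to the preceding base letter.
            secondary[-1] = secondary[-1] + (ch,)
        else:
            primary.append(ch.lower())     # base letter with case removed
            secondary.append(())           # no accents seen yet
            tertiary.append(ch.isupper())  # lowercase sorts before uppercase
    return (tuple(primary), tuple(secondary), tuple(tertiary))[:strength]


print(sorted(["rôle", "Role", "role"], key=toy_sort_key))
# ['role', 'Role', 'rôle']
print(toy_sort_key("role", 1) == toy_sort_key("rôle", 1))  # True: accents ignored
print(toy_sort_key("role", 2) == toy_sort_key("Role", 2))  # True: case ignored
```

Because the levels are compared one after another as whole tuples, a secondary difference only matters when the primary level is equal, which mirrors the precedence rule described in the list above.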

## Backward Secondary Sorting

Some languages require words to be ordered on the secondary level according to
the *last* accent difference, as opposed to the *first* accent difference. This
was previously the default for all French locales, based on some French
dictionary ordering traditions, but is currently only applicable to Canadian
French (locale **fr_CA**), for conformance with the [Canadian sorting
standard](http://www.unicode.org/reports/tr10/#CanStd). The difference in
ordering is only noticeable for a small number of pairs of real words. For more
information see [UCA: Contextual
Sensitivity](http://www.unicode.org/reports/tr10/#Contextual_Sensitivity).

Example:

Forward secondary | Backward secondary
----------------- | ------------------
cote              | cote
coté              | côte
côte              | coté
côté              | côté
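
The two columns can be reproduced with a toy model in which the accent level is simply compared from the end of the word instead of from the start. The helper `accent_key` is hypothetical and only illustrates the idea; it is not ICU's implementation.

```python
import unicodedata


def accent_key(word, backward=False):
    """Toy (primary, secondary) key; `backward` reverses the accent level."""
    letters, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and letters:
            accents[-1] = accents[-1] + (ch,)
        else:
            letters.append(ch)
            accents.append(())
    if backward:
        accents.reverse()  # compare accents starting from the end of the word
    return tuple(letters), tuple(accents)


words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=accent_key))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(words, key=lambda w: accent_key(w, backward=True)))
# ['cote', 'côte', 'coté', 'côté']
```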

## Contractions

A contraction is a sequence consisting of two or more letters. It is considered
a single letter in sorting.

For example, in the traditional Spanish sorting order, "ch" is considered a
single letter. All words that begin with "ch" sort after all other words
beginning with "c", but before words starting with "d".

Other examples of contractions are "ch" in Czech, which sorts after "h", and
"lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n"
respectively.

Example:

Order without contraction | Order with contraction "lj" sorting after letter "l"
------------------------- | ----------------------------------------------------
la                        | la
li                        | li
lj                        | lk
lja                       | lz
ljz                       | lj
lk                        | lja
lz                        | ljz
ma                        | ma
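
As a rough illustration of the table, the sketch below maps each word to a tuple of toy primary weights and treats "lj" as a single unit that sorts right after "l". The function `contraction_key` and its weight scheme are assumptions made for this example and only handle lowercase ASCII input.

```python
import string


def contraction_key(word, contractions=None):
    """Map a lowercase ASCII word to a tuple of toy primary weights.

    `contractions` maps a two-letter sequence to the letter it sorts after,
    for example {"lj": "l"}.
    """
    contractions = contractions or {}
    # Leave a gap between letters so a contraction can sort right after one.
    weights = {ch: 2 * i for i, ch in enumerate(string.ascii_lowercase)}
    for sequence, after in contractions.items():
        weights[sequence] = weights[after] + 1
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in contractions:  # the pair is treated as one letter
            key.append(weights[pair])
            i += 2
        else:
            key.append(weights[word[i]])
            i += 1
    return tuple(key)


words = ["lz", "lja", "ma", "lk", "lj", "li", "ljz", "la"]
print(sorted(words, key=contraction_key))
# ['la', 'li', 'lj', 'lja', 'ljz', 'lk', 'lz', 'ma']
print(sorted(words, key=lambda w: contraction_key(w, {"lj": "l"})))
# ['la', 'li', 'lk', 'lz', 'lj', 'lja', 'ljz', 'ma']
```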

Contracting sequences such as the above are not very common in most languages.

> :point_right: **Note** Since ICU 2.2, and as required by the UCA,
> if a completely ignorable code point
> appears in text in the middle of a contraction, it will not break the contraction.
> For example, in Czech sorting, cU+0000h will sort as if it were ch.

## Expansions

If a letter sorts as if it were a sequence of more than one letter, it is called
an expansion.

For example, in German phonebook sorting (de@collation=phonebook or BCP 47
de-u-co-phonebk), "ä" sorts as though it were equivalent to the sequence "ae."
All words starting with "ä" will sort between words starting with "ad" and words
starting with "af".
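
A minimal sketch of this expansion, assuming a toy table that contains only the "ä" to "ae" mapping named above (the names `EXPANSIONS` and `expanded_primary` are illustrative, not part of ICU):

```python
import unicodedata

# Assumed toy table: only the expansion named in the text above.
EXPANSIONS = {"\u00e4": "ae"}  # ä -> ae


def expanded_primary(word):
    """Toy primary key: lowercase, compose to NFC, then apply the expansions."""
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(EXPANSIONS.get(ch, ch) for ch in composed)


words = ["afrika", "\u00e4rger", "adler"]
print(sorted(words, key=expanded_primary))  # ['adler', 'ärger', 'afrika']
```

Sorting by the expanded form places the word starting with "ä" between the words starting with "ad" and "af", as described above.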

In the case of Unicode encoding, characters can often be represented either as
pre-composed characters or in decomposed form. For example, the letter "à" can
be represented in its decomposed (a+\`) and pre-composed (à) form. Most
applications do not want to distinguish text by the way it is encoded. A search
for "à" should find all instances of the letter, regardless of whether the
instance is in pre-composed or decomposed form. Therefore, either form of the
letter must result in the same sort ordering. The architecture of the ICU
Collation Service supports this.

## Contractions Producing Expansions

It is possible to have contractions that produce expansions.

One example occurs in Japanese, where the vowel with a prolonged sound mark is
treated as equivalent to the long vowel version:

カアー<<< カイー and\
キイー<<< キイー

> :point_right: **Note** Since ICU 2.0 Japanese tailoring uses
> [prefix analysis](http://www.unicode.org/reports/tr35/tr35-collation.html#Context_Sensitive_Mappings)
> instead of contraction producing expansions.

## Normalization

In the section on expansions, we discussed that text in Unicode can often be
represented in either pre-composed or decomposed forms. There are other types of
equivalences possible with Unicode, including Canonical and Compatibility. The
process of Normalization ensures that text is written in a predictable way so
that searches are not made unnecessarily complicated by having to match on
equivalences. Not all text is normalized, however, so it is useful to have a
collation service that can address text that is not normalized, but do so with
efficiency.

The ICU Collation Service handles un-normalized text properly, producing the
same results as if the text were normalized.
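
The equivalence between pre-composed and decomposed forms can be seen with Python's standard `unicodedata` module. This only demonstrates Unicode normalization itself; the helper `canonically_equal` is a hypothetical name and says nothing about how ICU performs its internal checks.

```python
import unicodedata

precomposed = "\u00e0"  # à as a single code point
decomposed = "a\u0300"  # a followed by COMBINING GRAVE ACCENT


def canonically_equal(left, right):
    """Compare two strings after bringing both into NFD."""
    return unicodedata.normalize("NFD", left) == unicodedata.normalize("NFD", right)


print(precomposed == decomposed)                   # False: code points differ
print(canonically_equal(precomposed, decomposed))  # True: same text after NFD
```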

In practice, most data that is encountered is in normalized or semi-normalized
form already. The ICU Collation Service is designed so that it can process a
wide range of normalized or un-normalized text without a need for normalization
processing. When a case is encountered that requires normalization, the ICU
Collation Service drops into code specific to this purpose. This maximizes
performance for the majority of text that does not require normalization.

In addition, if the text is known with certainty not to contain un-normalized
text, then even the overhead of checking for normalization can be eliminated.
The ICU Collation Service has the ability to turn Normalization Checking either
on or off. If Normalization Checking is turned off, it is the user's
responsibility to ensure that all text is already in the appropriate form. Text
is already in the appropriate form for the great majority of the world's
languages, so normalization checking is turned off by default for most locales.

If the text requires normalization processing, Normalization Checking should be
on. Any language that uses multiple combining characters such as Arabic, ancient
Greek, Hebrew, Hindi, Thai or Vietnamese either requires Normalization Checking
to be on, or the text to go through a normalization process before collation.

For more information about Normalization related reordering please see
[Unicode Technical Note #5](http://www.unicode.org/notes/tn5/) and
[UAX #15](http://www.unicode.org/reports/tr15/).

> :point_right: **Note** ICU supports two modes of normalization: on and off.
> Java.text.\* classes offer compatibility decomposition mode, which is not supported in ICU.

## Ignoring Punctuation

In some cases, punctuation can be ignored while searching or sorting data. For
example, this enables a search for "biweekly" to also return instances of
"bi-weekly". In other cases, it is desirable for punctuated text to be
distinguished from text without punctuation, but to have the text sort close
together.

These two behaviors can be accomplished if there is a way for a character to be
ignored on all levels except for the quaternary level. If this is the case, then
two strings which compare as identical on the first three levels (base letter,
accents, and case) are then distinguished at the fourth level based on their
punctuation (if any). If the comparison function ignores differences at the
fourth level, then strings that differ by punctuation only are compared as
equal.

The following table shows the results of sorting a list of terms in 3 different
ways. In the first column, punctuation characters (space " ", and hyphen "-")
are not ignored (" " < "-" < "b"). In the second column, punctuation characters
are ignored in the first 3 levels and compared only in the fourth level. In the
third column, punctuation characters are ignored in the first 3 levels and the
fourth level is not considered. In the last column, punctuated terms are
equivalent to the identical terms without punctuation.

For more options and details see the [“Ignore Punctuation”
Options](customization/ignorepunct.md) page.

Non-ignorable | Ignorable and Quaternary strength | Ignorable and Tertiary strength
------------- | --------------------------------- | -------------------------------
black bird    | black bird                        | **black bird**
black Bird    | black-bird                        | **black-bird**
black birds   | blackbird                         | **blackbird**
black-bird    | black Bird                        | black Bird
black-Bird    | black-Bird                        | black-Bird
black-birds   | blackBird                         | blackBird
blackbird     | black birds                       | black birds
blackBird     | black-birds                       | black-birds
blackbirds    | blackbirds                        | blackbirds
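
The second and third columns of the table can be approximated with a toy key in which space and hyphen are dropped from the first three levels and contribute only to a fourth level. Everything here is an assumption for illustration (`VARIABLE`, `HIGH`, and `shifted_key` are not ICU names), and accents are not modeled.

```python
VARIABLE = {" ", "-"}  # assumed toy set of ignorable punctuation
HIGH = 0x110000        # sorts after every real code point


def shifted_key(text, strength=4):
    """Toy key with punctuation ignored on levels 1-3 and kept on level 4."""
    letters = [ch for ch in text if ch not in VARIABLE]
    primary = tuple(ch.lower() for ch in letters)
    secondary = ()  # accents are not modeled in this ASCII-only sketch
    tertiary = tuple(ch.isupper() for ch in letters)
    quaternary = tuple(ord(ch) if ch in VARIABLE else HIGH for ch in text)
    return (primary, secondary, tertiary, quaternary)[:strength]


terms = ["blackbirds", "black-Bird", "blackbird", "black birds", "blackBird",
         "black bird", "black-birds", "black Bird", "black-bird"]
for term in sorted(terms, key=shifted_key):
    print(term)  # same order as the "Ignorable and Quaternary strength" column

# At tertiary strength the punctuated variants compare as equal.
print(shifted_key("black bird", 3) == shifted_key("blackbird", 3))  # True
```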

> :point_right: **Note** The strings with the same font format in the last column are
> compared as equal by ICU Collator.\
> Since ICU 2.2 and as prescribed by the UCA, primary ignorable code points that
> follow shifted code points will be completely ignored. This means that an accent
> following a space will compare as if it were a space alone.

## Case Ordering

The tertiary level is used to distinguish text by case, by small versus large
Kana, and other letter variants as noted above.

Some applications prefer to emphasize case differences so that words starting
with the same case sort together. Some Japanese applications require the
difference between small and large Kana be emphasized over other tertiary
differences.

The UCA does not provide means to separate out either case or Kana differences
from the remaining tertiary differences. However, the ICU Collation Service has
two options that help customize case and/or Kana differences. Both options
are turned off by default.

### CaseFirst

The Case-first option makes case the most significant part of the tertiary
level. Primary and secondary levels are unaffected. With this option, words
starting with the same case sort together. The Case-first option can be set to
make either lowercase sort before
uppercase or uppercase sort before lowercase.

Note: The case-first option does not constitute a separate level; it is simply a
reordering of the tertiary level.

ICU makes use of the following three case categories for sorting:

1.  uppercase: "ABC"

2.  mixed case: "Abc", "aBc"

3.  normal (lowercase or no case): "abc", "123"

Mixed case is always sorted between uppercase and normal case when the
"case-first" option is set.

### CaseLevel

The Case Level option makes a separate level for case differences. This is an
extra level positioned between secondary and tertiary. The case level is used in
Japanese to make the difference between small and large Kana more important than
the other tertiary differences. It also can be used to ignore other tertiary
differences, or even secondary differences. This is especially useful in
matching. For example, if the strength is set to primary only (level-1) and the
case level is turned on, the comparison ignores accents and tertiary differences
except for case. The contents of the case level are affected by the case-first
option.

The case level is independent from the strength of comparison. It is possible to
have a collator set to primary strength with the case level turned on. This
provides for comparison that takes into account the case differences, while at
the same time ignoring accents and tertiary differences other than case. This
may be used in searching.
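
The combination of primary strength with the case level turned on can be sketched as a two-part key: base letters first, then case, with accents dropped entirely. `primary_with_case_level` is a hypothetical helper for illustration, not an ICU function.

```python
import unicodedata


def primary_with_case_level(text):
    """Toy key: primary level plus a case level, with accents ignored."""
    base = [ch for ch in unicodedata.normalize("NFD", text)
            if not unicodedata.combining(ch)]
    primary = tuple(ch.lower() for ch in base)
    case_level = tuple(ch.isupper() for ch in base)
    return primary, case_level


print(primary_with_case_level("role") == primary_with_case_level("rôle"))  # True
print(primary_with_case_level("role") < primary_with_case_level("Role"))   # True
```

In this model "role" and "rôle" match while "Role" stays distinct, which is the behavior a case-sensitive, accent-insensitive search would want.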

Example:

**Case-first off, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Lowercase-first, Case level off**

apple\
ⓐⓟⓟⓛⓔ\
ähnlich\
Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit

**Uppercase-first, Case level off**

Abernathy\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
Ähnlichkeit\
apple\
ⓐⓟⓟⓛⓔ\
ähnlich

**Lowercase-first, Case level on**

apple\
Abernathy\
ⓐⓟⓟⓛⓔ\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ähnlich\
Ähnlichkeit

**Uppercase-first, Case level on**

Abernathy\
apple\
ⒶⒷⒺⓇⓃⒶⓉⒽⓎ\
ⓐⓟⓟⓛⓔ\
Ähnlichkeit\
ähnlich

## Script Reordering

Script reordering allows scripts and some other groups of characters to be moved
relative to each other. This reordering is done on top of the DUCET/CLDR
standard collation order. Reordering can specify groups to be placed at the
start and/or the end of the collation order.

By default, reordering codes specified for the start of the order are placed in
the order given after several special non-script blocks. These special groups of
characters are space, punctuation, symbol, currency, and digit. Script groups
can be intermingled with these special non-script groups if those special groups
are explicitly specified in the reordering.

The special code `others` stands for any script that is not explicitly mentioned
in the list. Anything that is after others will go at the very end of the list
in the order given. For example, `[Grek, others, Latn]` will result in an
ordering that puts all scripts other than Greek and Latin between them.

### Examples:

Note: All examples below use the string equivalents for the scripts and reorder
codes that would be used in collator rules. The script and reorder code
constants that would be used in API calls will be different.

**Example 1:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

**Example 2:**\
set reorder code - `[Grek]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others]`

followed by: set reorder code - `[Hani]`\
result - `[space, punctuation, symbol, currency, digit, Hani, others]`

That is, setting a reordering always modifies
the DUCET/CLDR order, replacing whatever was previously set, rather than adding
on to it. In order to cumulatively modify an ordering, you have to retrieve the
existing ordering, modify it, and then set it.

**Example 3:**\
set reorder code - `[others, digit]`\
result - `[space, punctuation, symbol, currency, others, digit]`

**Example 4:**\
set reorder code - `[space, Grek, punctuation]`\
result - `[symbol, currency, digit, space, Grek, punctuation, others]`

**Example 5:**\
set reorder code - `[Grek, others, Hani]`\
result - `[space, punctuation, symbol, currency, digit, Grek, others, Hani]`

**Example 6:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[NONE]`\
result - DUCET/CLDR

**Example 7:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[DEFAULT]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 8:**\
set reorder code - `[Grek, others, Hani, symbol, Tglg]`\
result - `[space, punctuation, currency, digit, Grek, others, Hani, symbol, Tglg]`

followed by:\
set reorder code - `[]`\
result - original reordering for the locale which may or may not be DUCET/CLDR

**Example 9:**\
set reorder code - `[Hebr, Phnx]`\
result - error
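
The results in Examples 1 and 3 through 6 follow a simple pattern, which the sketch below reproduces. It is a model of the documented outcomes only: `resulting_order` is a hypothetical function, it works on the rule-style string codes, and it does not model `NONE`, `DEFAULT`, the empty list, or the error case in Example 9.

```python
SPECIAL_GROUPS = ["space", "punctuation", "symbol", "currency", "digit"]


def resulting_order(reorder_codes):
    """Return the overall group order implied by a list of reorder codes."""
    if "others" in reorder_codes:
        split = reorder_codes.index("others")
        before, after = reorder_codes[:split], reorder_codes[split + 1:]
    else:
        before, after = list(reorder_codes), []
    # Special groups keep their default leading position unless named explicitly.
    leading = [group for group in SPECIAL_GROUPS if group not in reorder_codes]
    return leading + before + ["others"] + after


print(resulting_order(["Grek"]))
# ['space', 'punctuation', 'symbol', 'currency', 'digit', 'Grek', 'others']
print(resulting_order(["others", "digit"]))
# ['space', 'punctuation', 'symbol', 'currency', 'others', 'digit']
```

Because the function always starts from the default group order, calling it twice with different codes replaces the earlier result rather than adding to it, which matches the behavior shown in Example 2.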

Beginning with ICU 55, scripts only reorder together if they are primary-equal,
for example Hiragana and Katakana.

ICU 4.8-54:

*   Scripts were reordered in groups, each normally starting with a [Recommended
    Script](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts).
*   Reorder codes moved as a group (were “equivalent”) if their scripts shared a
    primary-weight lead byte.
*   For example, Hebr and Phnx were “equivalent” reordering codes and were
    reordered together. Their order relative to each other could not be changed.
*   Only one code from any group could be reordered, not several from the
    same group.

## Sorting of Japanese Text (JIS X 4061)

Japanese standard JIS X 4061 requires two changes to the collation procedures:
special processing of Hiragana characters and (for performance reasons) prefix
analysis of text.

### Hiragana Processing

The JIS X 4061 standard requires more levels than provided by the UCA. To offer
conformant sorting order, ICU uses the quaternary level to distinguish between
Hiragana and Katakana. Hiragana symbols are given smaller values than Katakana
symbols on the quaternary level, thus causing Hiragana sequences to sort before
corresponding Katakana sequences.

### Prefix Analysis

Another characteristic of sorting according to JIS X 4061 is a large number
of contractions followed by expansions (see
[Contractions Producing Expansions](#contractions-producing-expansions)).
This causes all the Hiragana and Katakana code points to be treated as
contractions, which reduces performance. The solution we adopted introduces the
prefix concept which allows us to improve the performance of Japanese sorting.
More about this can be found in the [customization
chapter](customization/index.md).

## Thai/Lao reordering

UCA requires that certain Thai and Lao prevowels be reordered with a code point
following them. This option is always on in the ICU implementation, as
prescribed by the UCA.

This rule takes effect when:

1.  A Thai vowel in the range U+0E40-U+0E44 precedes a Thai consonant in the
    range U+0E01-U+0E2E, or

2.  A Lao vowel in the range U+0EC0-U+0EC4 precedes a Lao consonant in the
    range U+0E81-U+0EAE.

In these cases the vowel is placed after the consonant for collation purposes.
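
The rule can be sketched as a single pass over the text that swaps each qualifying vowel with the consonant after it. The code point ranges are the ones listed above; the function names are hypothetical, and the sketch ignores everything else a real collator does.

```python
def is_prevowel_pair(vowel, consonant):
    """True if `vowel` is a Thai or Lao prevowel followed by a matching consonant."""
    thai = "\u0e40" <= vowel <= "\u0e44" and "\u0e01" <= consonant <= "\u0e2e"
    lao = "\u0ec0" <= vowel <= "\u0ec4" and "\u0e81" <= consonant <= "\u0eae"
    return thai or lao


def reorder_prevowels(text):
    """Swap each prevowel with the consonant that follows it."""
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        if is_prevowel_pair(chars[i], chars[i + 1]):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


# THAI CHARACTER SARA E followed by THAI CHARACTER KO KAI
print(reorder_prevowels("\u0e40\u0e01") == "\u0e01\u0e40")  # True
```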

> :point_right: **Note** There is a difference between java.text.\* classes and ICU in regard to Thai
> reordering. Java.text.\* classes allow tailorings to turn off reordering by
> using the '!' modifier. ICU ignores the '!' modifier and always reorders Thai
> prevowels.

## Space Padding

In many database products, fields are padded with spaces. To get correct results,
the input to a Collator should omit any superfluous trailing padding spaces. The
problem arises with contractions, expansions, or normalization. Suppose that
there are two fields, one containing "aed" and the other with "äd". German
phonebook sorting (de@collation=phonebook or BCP 47 de-u-co-phonebk) will
compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" <
"aed". But if both fields are padded with spaces to a length of 4, then this
will reverse the order, since "äd" will compare as if it were one character
longer. In other words, when you start with strings 1 and 2

1  | a  | e  | d         | \<space\>
-- | -- | -- | --------- | ---------
2  | ä  | d  | \<space\> | \<space\>

they end up being compared on a primary level as if they were 1' and 2'

1' | a  | e  | d  | \<space\> | &nbsp;
-- | -- | -- | -- | --------- | ---------
2' | a  | e  | d  | \<space\> | \<space\>

Since 2' has an extra character (the extra space), it counts as having a primary
difference when it shouldn't. The correct result occurs when the trailing
padding spaces are removed, as in 1" and 2"

1" | a  | e  | d
-- | -- | -- | --
2" | a  | e  | d
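
The effect can be reproduced with a toy primary key that applies the "ä" to "ae" expansion. The names `EXPANSIONS` and `primary_key` are assumptions for this sketch, and stripping with `rstrip(" ")` stands in for whatever trimming the database layer would do before calling a Collator.

```python
EXPANSIONS = {"\u00e4": "ae"}  # assumed toy table: ä -> ae


def primary_key(field):
    """Toy primary key: apply expansions character by character."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in field)


padded_1 = "aed "       # string 1 from the tables above
padded_2 = "\u00e4d  "  # string 2: "äd" followed by two spaces

# Padded fields differ on the primary level because of the extra space.
print(primary_key(padded_1) == primary_key(padded_2))  # False
# After removing the trailing padding they are primary-equal, as they should be.
print(primary_key(padded_1.rstrip(" ")) == primary_key(padded_2.rstrip(" ")))  # True
```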

## Collator naming scheme

***Starting with ICU 54, the following naming scheme and its API functions are deprecated.***
Use `ucol_open()` with language tag collation keywords instead
(see [Collation API Details](api.md)). For example,
`ucol_open("de-u-co-phonebk-ka-shifted", &errorCode)` for German Phonebook order
with "ignore punctuation" mode.

When collating or matching text, a number of attributes can be used to affect
the desired result. The following describes the attributes, their values, their
effects, their normal usage, and the string comparison performance and sort key
length implications. It also includes single-letter abbreviations for both the
attributes and their values. These abbreviations allow a 'short-form'
specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S", which
can be used to specify that the desired options are: UCA version 4.0.0; ignore
spaces, punctuation and symbols; use Swedish linguistic conventions; compare
case-insensitively.

A number of attribute values are common across different attributes; these
include **Default** (abbreviated as D), **On** (O), and **Off** (X). Unless
otherwise stated, the examples use the UCA alone with default settings.

> :point_right: **Note** In order to achieve uniqueness, a collator name always
> has the attribute abbreviations sorted.

### Main References

1.  For a full list of supported locales in ICU, see [Locale
    Explorer](http://demo.icu-project.org/icu-bin/locexp), which also contains
    an on-line demo showing sorting for each locale. The demo allows you to try
    different attribute values, to see how they affect sorting.

2.  To see tabular results for the UCA table itself, see the [Unicode Collation
    Charts](http://www.unicode.org/charts/collation/).

3.  For the UCA specification, see [UTS #10: Unicode Collation
    Algorithm](http://www.unicode.org/reports/tr10/).

4.  For more detail on the precise effects of these options, see [Collation
    Customization](customization/index.md).

#### Collator Naming Attributes

Attribute              | Abbreviation | Possible Values
---------------------- | ------------ | ---------------
Locale                 | L            | \<language\>
Script                 | Z            | \<script\>
Region                 | R            | \<region\>
Variant                | V            | \<variant\>
Keyword                | K            | \<keyword\>
&nbsp;                 | &nbsp;       | &nbsp;
Strength               | S            | 1, 2, 3, 4, I, D
Case_Level             | E            | X, O, D
Case_First             | C            | X, L, U, D
Alternate              | A            | N, S, D
Variable_Top           | T            | \<hex digits\>
Normalization Checking | N            | X, O, D
French                 | F            | X, O, D
Hiragana               | H            | X, O, D

#### Collator Naming Attribute Descriptions
647
648The **Locale** attribute is typically the most
649important attribute for correct sorting and matching, according to the user
650expectations in different countries and regions. The default UCA ordering will
651only sort a few languages such as Dutch and Portuguese correctly ("correctly"
652meaning according to the normal expectations for users of the languages).
653Otherwise, you need to supply the locale to UCA in order to properly collate
654text for a given language. Thus a locale needs to be supplied so as to choose a
655collator that is correctly **tailored** for that locale. The choice of a locale
656will automatically preset the values for all of the attributes to something that
657is reasonable for that locale. Thus most of the time the other attributes do not
658need to be explicitly set. In some cases, the choice of locale will make a
659difference in string comparison performance and/or sort key length.
660
661In short attribute names,
662`<language>_<script>_<region>_<variant>@collation=<keyword>` is
663represented by: `L<language>_Z<script>_R<region>_V<variant>_K<keyword>`. Not
664all the elements are required. Valid values for locale elements are general
665valid values for RFC 3066 locale naming.
666
667**Example:**\
668**Locale="sv" (Swedish)** "Kypper" < "Köpfe"\
669**Locale="de" (German)** "Köpfe" < "Kypper"
670
671The **Strength** attribute determines whether accents or
672case are taken into account when collating or matching text. ( (In writing
673systems without case or accents, it controls similarly important features). The
674default strength setting usually does not need to be changed for collating
675(sorting), but often needs to be changed when **matching** (e.g. SELECT). The
676possible values include Default (D), Primary (1), Secondary (2), Tertiary (3),
677Quaternary (4), and Identical (I).
678
679For example, people may choose to ignore accents or ignore accents and case when
680searching for text.
681
Almost all characters are distinguished by the first three levels, and in most
locales the default value is thus Tertiary. However, if Alternate is set to be
Shifted, then the Quaternary strength (4) can be used to break ties among
whitespace, punctuation, and symbols that would otherwise be ignored. If very
fine distinctions among characters are required, then the Identical strength (I)
can be used. For example, Identical strength distinguishes between the
**Mathematical Bold Small A** and the **Mathematical Italic Small A**; for more
examples, look at the cells with white backgrounds in the collation charts.
However, using a level higher than Tertiary, such as the Identical strength,
results in significantly longer sort keys and slower string comparison
performance for equal strings.

**Example:**\
**S=1** role = Role = rôle\
**S=2** role = Role < rôle\
**S=3** role < Role < rôle

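The level structure behind the example above can be imitated with a small toy
model. The Python sketch below is not how ICU computes weights: the function
`strength_key` is hypothetical, and it assumes that the base letter stands in
for the primary weight, the combining marks for the secondary weight, and
upper versus lower case for the tertiary weight.

```python
import unicodedata


def strength_key(text, strength=3):
    """Toy sort key: keep only the first `strength` comparison levels (1-3)."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accent: attach it to the preceding base letter (secondary level).
            if secondary:
                secondary[-1] += ch
            continue
        primary.append(ch.lower())      # base letter, case folded away
        secondary.append("")            # no accent seen yet
        tertiary.append(ch.isupper())   # False (lowercase) sorts first
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    return levels[:strength]


for s in (1, 2, 3):
    print(s, sorted(["r\u00f4le", "Role", "role"], key=lambda w: strength_key(w, s)))
```

Because the levels are compared in order, an accent difference outranks a case
difference, which is why `Role` sorts before `rôle` at Strength 3 in this model.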
The **Case_Level** attribute is used when ignoring accents
**but not** case. In such a situation, set Strength to be Primary, and
Case_Level to be On. In most locales, this setting is Off by default. Setting
this attribute to On has a small impact on string comparison performance and
sort key length.

**Example:**\
**S=1, E=X** role = Role = rôle\
**S=1, E=O** role = rôle < Role

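A toy model can show why Primary strength plus a case level ignores accents
while still separating case. The sketch below is only an approximation under
stated assumptions: `case_level_key` is a hypothetical name, base letters stand
in for primary weights, and accents are simply dropped because they belong to
the secondary level, which Strength=1 skips.

```python
import unicodedata


def case_level_key(text, case_level=False):
    """Toy key for Strength=Primary, optionally followed by a case level."""
    base, case = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue                  # accents are ignored at Strength=1
        base.append(ch.lower())
        case.append(ch.isupper())     # False (lowercase) sorts first
    if case_level:
        return (tuple(base), tuple(case))
    return (tuple(base),)


print(case_level_key("Role") == case_level_key("r\u00f4le"))             # E=X
print(case_level_key("r\u00f4le", True) < case_level_key("Role", True))  # E=O
```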
The **Case_First** attribute is used to control whether
uppercase letters come before lowercase letters or vice versa, in the absence of
other differences in the strings. The possible values are Uppercase_First (U)
and Lowercase_First (L), plus the standard Default and Off. There is almost no
difference between the Off and Lowercase_First options in terms of results, so
typically users will not use Lowercase_First: only Off or Uppercase_First.
(People interested in the detailed differences between X and L should consult
the [Collation Customization](customization/index.md).)
Specifying either L or U won't affect string comparison performance, but will
affect the sort key length.

**Example:**\
**C=X or C=L** "china" < "China" < "denmark" < "Denmark"\
**C=U** "China" < "china" < "Denmark" < "denmark"

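The effect of Case_First can be sketched with a two-level toy key in Python.
This is an illustration only: `case_first_key` is a hypothetical function, the
lowercased string stands in for the primary level, and accents are left out
because the example above does not need them.

```python
def case_first_key(text, upper_first=False):
    """Toy key: letters first, then case as a tie-breaker."""
    primary = text.lower()
    # Each flag is False for the case that should sort first.
    case = tuple(ch.isupper() != upper_first for ch in text)
    return (primary, case)


words = ["Denmark", "china", "denmark", "China"]
print(sorted(words, key=case_first_key))
print(sorted(words, key=lambda w: case_first_key(w, upper_first=True)))
```

Case only matters here when the lowercased strings are equal, matching the
phrase "in the absence of other differences in the strings".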
The **Alternate** attribute is used to control the handling of
the so-called **variable** characters in the UCA: whitespace, punctuation and
symbols. If Alternate is set to Non-Ignorable (N), then differences among these
characters are of the same importance as differences among letters. If Alternate
is set to Shifted (S), then these characters are of only minor importance. The
Shifted value is often used in combination with Strength set to Quaternary. In
such a case, white-space, punctuation, and symbols are considered when comparing
strings, but only if all other aspects of the strings (base letters, accents,
and case) are identical. If Alternate is not set to Shifted, then there is no
difference between a Strength of 3 and a Strength of 4.

For more information and examples, see
[Variable_Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting) in
the UCA.

The reason the Alternate values are not simply On and Off is that
additional Alternate values may be added in the future.

The UCA option
**Blanked** is expressed with Strength set to 3, and Alternate set to Shifted.

The default for most locales is Non-Ignorable. If Shifted is selected,
comparison may be slower if there are many strings that are the same except for
punctuation; sort key length will not be affected unless the strength level is
also increased.

**Example:**\
**S=3, A=N** di Silva < Di Silva < diSilva < U.S.A. < USA\
**S=3, A=S** di Silva = diSilva < Di Silva < U.S.A. = USA\
**S=4, A=S** di Silva < diSilva < Di Silva < U.S.A. < USA

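The three rows of the example above can be reproduced with a toy model of
variable weighting. The Python sketch below makes several simplifying
assumptions that are not ICU behavior: `alternate_key` and `is_variable` are
hypothetical names, "variable" means whitespace or any character in a `P*` or
`S*` general category, the secondary (accent) level is omitted, and every
variable character sorts before every letter.

```python
import unicodedata

HIGH = (1, "")  # quaternary weight given to every non-variable character


def is_variable(ch):
    """Toy definition of a variable character."""
    return ch.isspace() or unicodedata.category(ch)[0] in "PS"


def alternate_key(text, strength=3, shifted=False):
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if shifted and is_variable(ch):
            # Shifted: invisible on levels 1-3, visible only on level 4.
            quaternary.append((0, ch))
            continue
        # Non-ignorable variables get a primary weight below all letters.
        primary.append((0 if is_variable(ch) else 1, ch.lower()))
        tertiary.append(ch.isupper())
        quaternary.append(HIGH)
    key = (tuple(primary), tuple(tertiary))
    if strength >= 4:
        key += (tuple(quaternary),)
    return key


names = ["USA", "diSilva", "U.S.A.", "Di Silva", "di Silva"]
print(sorted(names, key=lambda n: alternate_key(n, 3, shifted=False)))
print(sorted(names, key=lambda n: alternate_key(n, 4, shifted=True)))
```

With `shifted=False` the quaternary level contains only `HIGH` entries, so
raising the strength from 3 to 4 changes nothing in this model either.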
The **Variable_Top** attribute is only meaningful if the
Alternate attribute is not set to Non-Ignorable. In such a case, it controls
which characters count as ignorable. The \<hex\> value specifies the character
sequence with the "highest" weight (in UCA order) that is to be considered
ignorable.

Thus, for example, if a user wanted white-space to be ignorable, but not any
visible characters, then they would use the value Variable_Top=0020 (space). The
hex digits should specify only a single character. All characters of the same
primary weight are equivalent, so Variable_Top=3000 (ideographic space) has the
same effect as Variable_Top=0020.

This setting (alone) has little impact on string comparison performance; setting
it lower or higher will make sort keys slightly shorter or longer respectively.

**Example:**\
**S=3, A=S** di Silva = diSilva < U.S.A. = USA\
**S=3, A=S, T=0020** di Silva = diSilva < U.S.A. < USA

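The cut-off behavior of Variable_Top can be approximated in a few lines of
Python. The sketch is deliberately coarse and rests on assumptions that are not
taken from ICU: `variable_group` and `shifted_primary_key` are hypothetical
names, UCA order among variable characters is reduced to three groups
(whitespace, then punctuation, then symbols), and only the primary level is
modeled.

```python
import unicodedata


def variable_group(ch):
    """Return 0 for whitespace, 1 for punctuation, 2 for symbols, else None."""
    if ch.isspace():
        return 0
    major = unicodedata.category(ch)[0]
    if major == "P":
        return 1
    if major == "S":
        return 2
    return None


def shifted_primary_key(text, variable_top=None):
    """Toy primary key with Alternate=Shifted and an optional Variable_Top."""
    top = 2 if variable_top is None else variable_group(variable_top)
    if top is None:
        raise ValueError("variable_top must be a variable character")
    key = []
    for ch in text:
        group = variable_group(ch)
        if group is not None and group <= top:
            continue  # at or below the top: ignorable
        # Variables above the top behave like ordinary low-weight characters.
        key.append((0, ch) if group is not None else (1, ch.lower()))
    return tuple(key)


print(shifted_primary_key("U.S.A.") == shifted_primary_key("USA"))
print(shifted_primary_key("U.S.A.", "\u0020") < shifted_primary_key("USA", "\u0020"))
```

Passing `"\u3000"` instead of `"\u0020"` gives the same keys here, because both
characters fall into the whitespace group.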
The **Normalization** setting determines whether
text is thoroughly normalized or not in comparison. Even if the setting is Off
(which is the default for many locales), text as represented in common usage
will compare correctly (for details, see
[UTN #5](http://www.unicode.org/notes/tn5/)). Only if the accent marks are in
non-canonical order will there be a problem. If the setting is On, then the best
results are guaranteed for all possible text input. There is a medium string
comparison performance cost if this attribute is On, depending on the frequency
of sequences that require normalization. There is no significant effect on sort
key length. If the input text is known to be in NFD or NFKD normalization forms,
there is no need to enable this Normalization option.

**Example:**\
**N=X** ä = a + ◌̈ < ä + ◌̣ < ạ + ◌̈\
**N=O** ä = a + ◌̈ < ä + ◌̣ = ạ + ◌̈

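The two problem strings in the example above differ only in the order of their
accent marks, which canonical normalization removes. The Python sketch below
uses the standard `unicodedata` module rather than ICU, so it shows the
normalization step only and not the collator's behavior; the variable names are
illustrative.

```python
import unicodedata

diaeresis_then_dot = "\u00e4\u0323"   # ä followed by combining dot below
dot_then_diaeresis = "\u1ea1\u0308"   # ạ followed by combining diaeresis

# As raw code point sequences the two strings are different.
print(diaeresis_then_dot == dot_then_diaeresis)

# NFD decomposes both and puts the marks into canonical order:
# dot below (combining class 220) before diaeresis (combining class 230).
nfd_a = unicodedata.normalize("NFD", diaeresis_then_dot)
nfd_b = unicodedata.normalize("NFD", dot_then_diaeresis)
print(nfd_a == nfd_b)
print([f"U+{ord(ch):04X}" for ch in nfd_a])
```

Text that is already normalized in this way needs no further work from the
collator, which is why the option can stay Off for input known to be in NFD.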
Some **French** dictionary ordering traditions sort strings with
different accents from the back of the string. This attribute is automatically
set to On for the Canadian French locale (fr_CA). Users normally would not need
to explicitly set this attribute. There is a string comparison performance cost
when it is set On, but sort key length is unaffected.

**Example:**\
**F=X** cote < coté < côte < côté\
**F=O** cote < côte < coté < côté

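Sorting accents "from the back" amounts to comparing the secondary level in
reverse. The Python sketch below is a toy model, not the ICU implementation:
`french_key` is a hypothetical function that assumes base letters act as
primary weights and the combining marks on each letter act as its secondary
weight.

```python
import unicodedata


def french_key(text, backwards=False):
    """Toy two-level key; `backwards` reverses the accent (secondary) level."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] += ch   # accent belongs to the previous letter
            continue
        primary.append(ch.lower())
        secondary.append("")
    if backwards:
        secondary.reverse()           # the last accent is compared first
    return (tuple(primary), tuple(secondary))


words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
print(sorted(words, key=french_key))
print(sorted(words, key=lambda w: french_key(w, backwards=True)))
```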
Compatibility with JIS X 4061 requires the introduction of an
additional level to distinguish **Hiragana** and Katakana characters. If
compatibility with that standard is required, then this attribute is set On, and
the strength should be set to at least Quaternary.

This attribute is an implementation detail of the CLDR Japanese tailoring. The
implementation might change to use a different mechanism to achieve the same
Japanese sort order. Since ICU 50, this attribute is not settable any more.

**Example:**\
**H=X, S=4** きゅう = キュウ < きゆう = キユウ\
**H=O, S=4** きゅう < キュウ < きゆう < キユウ

> :point_right: **Note** If attributes in a collator name are not overridden,
> it is assumed that they are the same as for the given locale.
> For example, a collator opened with an empty
> string has the same attribute settings as **AN_CX_EX_FX_HX_KX_NX_S3_T0000**.

### Summary of Value Abbreviations

Value         | Abbreviation
------------- | ------------
Default       | D
On            | O
Off           | X
Primary       | 1
Secondary     | 2
Tertiary      | 3
Quaternary    | 4
Identical     | I
Shifted       | S
Non-Ignorable | N
Lower-First   | L
Upper-First   | U
