• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1---
2layout: default
3title: Architecture
4nav_order: 2
5parent: Collation
6---
7<!--
8© 2020 and later: Unicode, Inc. and others.
9License & terms of use: http://www.unicode.org/copyright.html
10-->
11
12# Collation Service Architecture
13{: .no_toc }
14
15## Contents
16{: .no_toc .text-delta }
17
181. TOC
19{:toc}
20
21---
22
23## Overview
24
25This section describes the design principles, architecture and coding
26conventions of the ICU Collation Service.
27
28## Collator
29
30To use the Collation Service, a Collator must first be instantiated. An
31Collator is a data structure or object that maintains all of the property
32and state information necessary to define and support the specific collation
33behavior provided. Examples of properties described in the Collator are the
34locale, whether normalization is to be performed, and how many levels of
35collation are to be evaluated. Examples of the state information described in
36the Collator include the direction of a Collation Element Iterator (forward
37or backward) and the status of the last API executed.
38
39The Collator is instantiated either by referencing a locale or by defining a
40custom set of rules (a tailoring).
41
42The Collation Service uses the paradigm:
43
441.  Open a Collator,
45
462.  Use while necessary,
47
483.  Close the Collator.
49
50Collator instances cannot be shared among threads. You should open them
51instead, and use a different collator for each separate thread. The safe clone
52function is supported for cloning collators in a thread-safe fashion.
53
54The Collation Service follows the ICU conventions for locale designation
55when opening collators:
56
571.  NULL means the default locale.
58
592.  The empty locale name ("") means the root locale.
60    The Collation Service adheres to the ICU conventions described in the
61    "[ICU Architectural Design](../design.md) " section of the users guide.
62    In particular:
63
643.  The standard error code convention is usually followed. (Functions that do
65    not take an error code parameter do so for backward compatibility.)
66
674.  The string length convention is followed: when passing a `UChar *`, the
68    length is required in a separate argument. If -1 is passed for the length,
69    it is assumed that the string is zero terminated.
70
71### Collation locale and keyword handling
72
73When a collator is created from a locale, the collation service (like all ICU
74services) must map the requested locale to the localized collation data
75available to ICU at the time. It does so using the standard ICU locale fallback
76mechanism. See the fallback section of the [locale
77chapter](../locale/index.md) for more details.
78
79If you pass a regular locale in, like "en_US", the collation service first
80searches with fallback for "collations/default" key. The first such key it finds
81will have an associated string value; this is the keyword name for the collation
82that is default for this locale. If the search falls all the way back to the
83root locale, the collation service will us the "collations/default" key there,
84which has the value "standard".
85
86If there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the
87collation service searches with fallback for "collations/phonebook". If the
88search is successful, the collation service uses the string value it finds to
89instantiate a Collator. If the search fails because no such key is present in
90any of ICU's locale data (e.g., "de@collation=funky"), the service returns a
91collator implementing the default tailoring of the locale.
92If the fallback is all the way to the root locale, then
93the return `UErrorCode` is `U_USING_DEFAULT_WARNING`.
94
95## Input values for collation
96
97Collation deals with processing strings. ICU generally requires that all the
98strings should be in UTF-16 format, and that all the required conversion should
99done before ICU functions are used. In the case of collation, there are APIs
100that can also take instances of character iterators (`UCharIterator`)
101or UTF-8 directly.
102
103Theoretically, character iterators can iterate strings
104in any encoding. ICU currently provides character iterator implementations for
105UTF-8 and UTF-16BE (useful when processing data from a big endian platform on an
106little endian machine). It should be noted, however, that using iterators for
107collation APIs has a performance impact. It should be used in situations when it
108is not desirable to convert whole strings before the operation - such as when
109using a string compare function.
110
111## Collation Elements
112
113As discussed in the introduction, there are many possible orderings for sorted
114text, depending on language and other factors. Ideally, there is a way to
115describe each ordering as a set of rules for calculating numeric values for each
116string of text. The collation process then becomes one of simply comparing these
117numeric values.
118
119This essentially describes the way the Collation Service works. To implement
120a particular sort ordering, first the relationship between each character or
121character sequence is derived. For example, a Spanish ordering defines the
122letter sequence "CH" to be between the letters "C" and "D". As also discussed in
123the introduction, to order strings properly requires that comparison of base
124letters must be considered separately from comparison of accents. Letter case
125must also be considered separately from either base letters or accents. Any
126ordering specification language must provide a way to define the relationships
127between characters or character sequences on multiple levels. ICU supports this
128by using "<" to describe a relationship at the primary level, using "<<" to
129describe a relationship at the secondary level, and using "<<<" to describe a
130relationship at the tertiary level. Here are some example usages:
131
132Symbol | Example  | Description
133------ | -------- | -----------
134`<`    | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch"
135`<<`   | `a << ä` | Make a secondary (accent) difference between "a" and "ä"
136`<<<`  | `a<<<A`  | Make a tertiary difference between "a" and "A"
137
138A more complete description of the ordering specification symbols and their
139meanings is provided in the section on Collation Tailoring.
140
141Once a sort ordering is defined by specifying the desired relationships between
142characters and character sequences, ICU can convert these relationships to a
143series of numerical values (one for each level) that satisfy these same
144relationships.
145
146This series of numeric values, representing the relative weighting of a
147character or character sequence, is called a Collation Element (CE).
148One possible encoding of a Collation Element is a 32-bit value consisting of
149a 16-bit primary weight, a 8-bit secondary weight,
1502 case bits, and a 6-bit tertiary weight.
151
152The sort weight of a string is represented by the collation elements of its
153component characters and character sequences. For example, the sort weight of
154the string "apple" would consist of its component Collation Elements, as shown
155here:
156
157"Apple" | "Apple" Collation Elements
158------- | --------------------------
159a       | `[1900.05.05]`
160p       | `[3700.05.05]`
161p       | `[3700.05.05]`
162l       | `[2F00.05.05]`
163e       | `[2100.05.05]`
164
165In this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an
1668-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of
16705 (hex).
168
169String comparison is performed by comparing the collation elements of each
170string. Each of the primary weights are compared. If a difference is found, that
171difference determines the relationship between the two strings. If no
172differences are found, the secondary weights are compared and so forth.
173
174With ICU it is possible to specify how many levels should be compared. For some
175applications, it can be desirable to compare only primary levels or to compare
176only primary and secondary levels.
177
178## Sort Keys
179
180If a string is to be compared thousands or millions of times,
181it can be more efficient to use sort keys.
182Sort keys are useful in situations where a large amount of data is indexed
183and frequently searched. The sort key is generated once and used in subsequent
184comparisons, rather than repeatedly generating the string's Collation Elements.
185The comparison of sort keys is a very efficient and simple binary compare of strings of
186unsigned bytes.
187
188An important property of ICU sort keys is that you can obtain the same results
189by comparing 2 strings as you do by comparing the sort keys of the 2 strings
190(provided that the same ordering and related collation attributes are used).
191
192An ICU sort key is a pre-processed sequence of bytes generated from a Unicode
193string. The weights for each comparison level are concatenated, separated by a
194"0x01" byte between levels.
195The entire sequence is terminated with a 0x00 byte for convenience in C APIs.
196(This 0x00 terminator is counted in the sort key length —
197unlike regular strings where the NUL terminator is excluded from the string length.)
198
199ICU actually compresses the sort keys so that they take the
200minimum storage in memory and in databases.
201
202<!-- TODO: (diagram was missing in Google Sites already)
203    The diagram below represents an uncompressed sort key in ICU for ease of understanding.  -->
204
205### Sort key size
206
207One of the more important issues when considering using sort keys is the sort
208key size. Unfortunately, it is very hard to give a fast exact answer to the
209following question: "What is the maximum size for sort keys generated for
210strings of size X". This problem is twofold:
211
2121.  The maximum size of the sort key depends on the size of the collation
213    elements that are used to build it. Size of collation elements vary greatly
214    and depends both on the alphabet in question and on the locale used.
215
2162.  Compression is used in building sort keys. Most 'regular' sequences of
217    characters produce very compact sort keys.
218
219If one is to assume the worst case and use too-big buffers, a lot of space will
220be wasted. However, if you use too-small buffers, you will lose performance if
221generated sort keys are longer than supplied buffers too often
222(and you have to reallocate for each of those).
223A good strategy
224for this problem would be to manually manage a large buffer for storing sortkeys
225and keep a list of indices to sort keys in this buffer (see the "large buffers"
226[Collation Example](examples#using-large-buffers-to-manage-sort-keys)
227for more details).
228
229Here are some rules of a thumb, please do not rely on them. If you are looking
230at the East Asian locales, you probably want to go with 5 bytes per code point.
231For Thai, 3 bytes per code point should be sufficient. For all the other locales
232(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point.
233These values are based on average lengths of sort keys generated with tertiary
234strength. If you need quaternary and identical strength (you should not), add 3
235bytes per code point to each of these.
236
237### Partial sort keys
238
239In some cases, most notably when implementing [radix
240sorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only
241parts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing
242parts of sort keys (`ucol_nextSortKeyPart` API). These sort keys may or may not be
243compressed; that is, they may or may not be compatible with regular sort keys.
244
245### Merging sort keys
246
247Sometimes, it is useful to be able to merge sort keys. One example is having
248separate sort keys for first and last names. If you need to perform an operation
249that requires a sort key generated on the whole name, instead of concatenating
250strings and regenerating sort keys, you should merge the sort keys. The merging
251is done by merging the corresponding levels while inserting a terminator between
252merged parts. The reserved sort key byte value for the merge terminator is 0x02.
253For more details see [UCA section 1.6, Merging Sort
254Keys](http://www.unicode.org/reports/tr10/#Interleaved_Levels).
255
256*   C API: unicode/ucol.h `ucol_mergeSortkeys()`
257*   Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)`
258
259CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is
260intended to allow concatenating strings like firstName+\\uFFFE+lastName to yield
261the same results as merging their individual sort keys.
262This has been fully implemented in ICU since version 53.
263
264### Generating bounds for a sort key (prefix matching)
265
266Having sort keys for strings allows for easy creation of bounds - sort keys that
267are guaranteed to be smaller or larger than any sort key from a give range. For
268example, if bounds are produced for a sortkey of string "smith", strings between
269upper and lower bounds with one level would include "Smith", "SMITH", "sMiTh".
270Two kinds of upper bounds can be generated - the first one will match only
271strings of equal length, while the second one will match all the strings with
272the same initial prefix.
273
274CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum
275primary weight, so that for example the string "smith\\uFFFF" can be used as the
276upper bound rather than modifying the sort key for "smith".
277
278## Collation Element Iterator
279
280The collation element iterator is used for traversing Unicode string collation
281elements one at a time. It can be used to implement language-sensitive text
282search algorithms like Boyer-Moore.
283
284For most applications, the two API categories, compare and sort key, are
285sufficient. Most people do not need to manipulate collation elements directly.
286
287Example:
288
289Consider iterating over "apple" and "äpple". Here are sequences of collation
290elements:
291
292String 1 | String 1 Collation Elements
293-------- | ---------------------------
294a        | `[1900.05.05]`
295p        | `[3700.05.05]`
296p        | `[3700.05.05]`
297l        | `[2F00.05.05]`
298e        | `[2100.05.05]`
299
300String 2 | String 2 Collation Elements
301-------- | ---------------------------
302a        | `[1900.05.05]`
303\\u0308  | `[0000.9D.05]`
304p        | `[3700.05.05]`
305p        | `[3700.05.05]`
306l        | `[2F00.05.05]`
307e        | `[2100.05.05]`
308
309The resulting CEs are typically masked according to the desired strength, and
310zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength)
311produces the results of NULL secondary and tertiary differences. The collator then
312ignores the NULL differences and declares a match. For more details see the
313paper "Efficient text searching in Java™: Finding the right string in any
314language" by Laura Werner (
315<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>).
316
317## Collation Attributes
318
319The Collation Service has a number of attributes whose values can be changed
320during run time. These attributes affect both the functionality and the
321performance of the Collation Service. This section describes these
322attributes and, where possible, their performance impact. Performance
323indications are only approximate and timings may vary significantly depending on
324the CPU, compiler, etc.
325
326Although string comparison by ICU and comparison of each string's sort key give
327the same results, attribute settings can impact the execution time of each
328method differently. To be precise in the discussion of performance, this section
329refers to the API employed in the measurement. The `ucol_strcoll` function is the
330API for string comparison. The `ucol_getSortKey` function is used to create sort
331keys.
332
333> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`,
334> that can be used to set any attribute to its default value
335> (which is inherited from the UCA and the tailoring).
336
337### Attribute Types
338
339#### Strength level
340
341Collation strength, or the maximum collation level used for comparison, is set
342by using the `UCOL_STRENGTH` attribute. Valid values are:
343
3441.  `UCOL_PRIMARY`
345
3462.  `UCOL_SECONDARY`
347
3483.  `UCOL_TERTIARY` (default)
349
3504.  `UCOL_QUATERNARY`
351
3525.  `UCOL_IDENTICAL`
353
354#### French collation
355
356The `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary
357differences in reverse order. Valid values are:
358
3591.  `UCOL_OFF` (default): compares secondary differences in the order they appear
360    in the string.
361
3622.  `UCOL_ON`: causes secondary differences to be considered in reverse order, as
363    it is done in the French language.
364
365#### Normalization mode
366
367The `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`,
368controls whether text normalization is performed on the input strings. Valid
369values are:
370
3711.  `UCOL_OFF` (default): turns off normalization check
372
3732.  `UCOL_ON` : normalization is checked and the collator performs normalization
374    if it is needed.
375
376X                     | FCD | NFC | NFD
377--------------------- | --- | --- | ---
378A-ring                | Y   | Y   |
379Angstrom              | Y   |     |
380A + ring              | Y   |     | Y
381A + grave             | Y   | Y   |
382A-ring + grave        | Y   |     |
383A + cedilla + ring    | Y   |     | Y
384A + ring + cedilla    |     |     |
385A-ring + cedilla      |     | Y   |
386
387With normalization mode turned on, the `ucol_strcoll` function slows down by 10%.
388In addition, the time to generate a sort key also increases by about 25%.
389
390#### Alternate handling
391
392This attribute allows shifting of the variable characters (usually spaces and
393punctuation, in the UCA also most symbols) from the primary to the quaternary
394strength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For
395details see [UCA: Variable
396Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML:
397Collation
398Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings),
399and [“Ignore Punctuation” Options](customization/ignorepunct.md).
400
4011.  `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as
402    all the other characters
403
4042.  `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at
405    the primary, secondary and tertiary levels and their primary strengths will
406    be shifted to the quaternary level.
407
408#### Case Ordering
409
410Some conventions require uppercase letters to sort before lowercase ones, while
411others require the opposite. This attribute is controlled by the value of the
412`UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary
413weights along with other appearance characteristics (like circling of letters).
414The case-first attribute allows for emphasizing of the case property of the
415letters by reordering the tertiary weights with either upper-first, and/or
416lowercase-first. This difference gets the most significant bit in the weight.
417Valid values for this attribute are:
418
4191.  `UCOL_OFF` (default): leave tertiary weights unaffected
420
4212.  `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort
422    before uppercase
423
4243.  `UCOL_UPPER_FIRST` : causes uppercase letters to sort first
425
426The case-first attribute does not affect the performance substantially.
427
428#### Case level
429
430When this attribute is set, an additional level is formed between the secondary
431and tertiary levels, known as the Case Level. The case level is used to
432distinguish large and small Japanese Kana characters. Case level could also be
433used in other situations. for example to distinguish certain Pinyin characters.
434Case level is controlled by `UCOL_CASE_LEVEL` attribute. Valid values for this
435attribute are
436
4371.  `UCOL_OFF` (default): no additional case level
438
4392.  `UCOL_ON` : adds a case level
440
441#### Hiragana Quaternary
442
443*This setting is deprecated and ignored in recent versions of ICU.*
444
445Hiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points
446will sort before everything else on the quaternary level. If set to `UCOL_OFF`
447Hiragana letters are treated the same as all the other code points. This setting
448can be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute.
449You probably won't need to use it.
450
451#### Variable Top
452
453Variable Top is a boundary which decides whether the code points will be treated
454as variable (shifted to quaternary level in the **shifted** mode) or
455non-ignorable. Special APIs are used for setting of variable top. It can
456basically be set either to a codepoint or a primary strength value.
457
458## Performance
459
460ICU collation is designed to be fast, small and customizable. Several techniques
461are used to enhance the performance:
462
4631.  Providing optimized processing for Latin characters.
464
4652.  Comparing strings incrementally and stopping at the first significant
466    difference.
467
4683.  Tuning to eliminate unnecessary file access or memory allocation.
469
4704.  Providing efficient preflight functions that allows fast sort key size
471    generation.
472
4735.  Using a single, shared copy of UCA in memory for the read-only default sort
474    order. Only small tailoring tables are kept in memory for locale-specific
475    customization.
476
4776.  Compressing sort keys efficiently.
478
4797.  Making the sort order be data-driven.
480
481In general, the best performance from the Collation Service is expected by
482doing the following:
483
4841.  After opening a collator, keep and reuse it until done. Do not open new
485    collators for the same sort order. (Note the restriction on
486    multi-threading.)
487
4882.  Use `ucol_strcoll` etc. when comparing strings. If it is necessary to
489    compare strings thousands or millions of times,
490    create the sort keys first and compare the sort keys instead.
491    Generating the sort keys of two strings is about 5-10
492    times slower than just comparing them directly.
493
4943.  Follow the best practice guidelines for generating sort keys. Do not call
495    `ucol_getSortKey` twice to first size the key and then allocate the sort key
496    buffer and repeat the call to the function to fill in the buffer.
497
498### Performance and Storage Implications of Attributes
499
500Most people use the default attributes when comparing strings or when creating
501sort keys. When they do want to customize the ordering, the most common options
502are the following :
503
504`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\
505Used to ignore space and punctuation characters
506
507`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\
508Used to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences.
509
510`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\
511Used to change the ordering of upper vs. lower case letters (as
512well as small vs. large kana)
513
514`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\
515Used to ignore only the accent differences.
516
517`UCOL_NORMALIZATION_MODE == UCOL_ON`\
518Force to always check for normalization. This
519is used if the input text may not be in FCD form.
520
521`UCOL_FRENCH_COLLATION == UCOL_OFF`\
522This is only useful for languages like French and Catalan that may turn this attribute on.
523(It is the default only for Canadian French ("fr-CA").)
524
525In String Comparison, most of these options have little or no effect on
526performance. The only noticeable one is normalization, which can cost 10%-40% in
527performance.
528
529For Sort Keys, most of these options either leave the storage alone or reduce
530it. Shifting can reduce the storage by about 10%-20%; case level + primary-only
531can decrease it about 20% to 40%. Using no French accents can reduce the storage
532by about 38% , but only for languages like French and Catalan that turn it on by
533default. On the other hand, using Shifted + Quaternary can increase the storage by
53410%-15%. (The Identical Level also increases the length, but this option is not
535recommended).
536
537> :point_right: **Note** All of the above numbers are based on
538> tests run on a particular machine, with a particular set of data.
539> (The data for each language is a large number of names
540> in that language in the format <first_name>, <last name>.)
541> The performance and storage may vary, depending on the particular computer,
542> operating system, and data.
543
544## Versioning
545
546Sort keys are often stored on disk for later reuse. A common example is the use
547of keys to build indexes in databases. When comparing keys, it is important to
548know that both keys were generated by the same algorithms and weightings.
549Otherwise, identical strings with keys generated on two different dates, for
550example, might compare as unequal. Sort keys can be affected by new versions of
551ICU or its data tables, new sort key formats, or changes to the Collator.
552Starting with release 1.8.1, ICU provides a versioning mechanism to identify the
553version information of the following (but not limited to),
554
5551.  The run-time executable
556
5572.  The collation element content
558
5593.  The Unicode/UCA database
560
5614.  The tailoring table
562
563The version information of Collator is a 32-bit integer. If a new version of ICU
564has changes affecting the content of collation elements, the version information
565will be changed. In that case, to use the new version of ICU collator will
566require regenerating any saved or stored sort keys.
567
568However, it is possible to modify ICU code or data without changing relevant version numbers,
569so it is safer to regenerate sort keys any time after any part of ICU has been updated.
570
571Since ICU4C 1.8.1.
572it is possible to build your program so that it uses more than one version of
573ICU (only in C/C++, not in Java). Therefore, you could use the current version
574for the features you need and use the older version for collation.
575
576## Programming Examples
577
578See the [Collation Examples](examples.md) chapter for an example of how to
579compare and create sort keys with the default locale in C, C++ and Java.
580