1--- 2layout: default 3title: Architecture 4nav_order: 2 5parent: Collation 6--- 7<!-- 8© 2020 and later: Unicode, Inc. and others. 9License & terms of use: http://www.unicode.org/copyright.html 10--> 11 12# Collation Service Architecture 13{: .no_toc } 14 15## Contents 16{: .no_toc .text-delta } 17 181. TOC 19{:toc} 20 21--- 22 23## Overview 24 25This section describes the design principles, architecture and coding 26conventions of the ICU Collation Service. 27 28## Collator 29 30To use the Collation Service, a Collator must first be instantiated. An 31Collator is a data structure or object that maintains all of the property 32and state information necessary to define and support the specific collation 33behavior provided. Examples of properties described in the Collator are the 34locale, whether normalization is to be performed, and how many levels of 35collation are to be evaluated. Examples of the state information described in 36the Collator include the direction of a Collation Element Iterator (forward 37or backward) and the status of the last API executed. 38 39The Collator is instantiated either by referencing a locale or by defining a 40custom set of rules (a tailoring). 41 42The Collation Service uses the paradigm: 43 441. Open a Collator, 45 462. Use while necessary, 47 483. Close the Collator. 49 50Collator instances cannot be shared among threads. You should open them 51instead, and use a different collator for each separate thread. The safe clone 52function is supported for cloning collators in a thread-safe fashion. 53 54The Collation Service follows the ICU conventions for locale designation 55when opening collators: 56 571. NULL means the default locale. 58 592. The empty locale name ("") means the root locale. 60 The Collation Service adheres to the ICU conventions described in the 61 "[ICU Architectural Design](../design.md) " section of the users guide. 62 In particular: 63 643. The standard error code convention is usually followed. (Functions that do 65 not take an error code parameter do so for backward compatibility.) 66 674. The string length convention is followed: when passing a `UChar *`, the 68 length is required in a separate argument. If -1 is passed for the length, 69 it is assumed that the string is zero terminated. 70 71### Collation locale and keyword handling 72 73When a collator is created from a locale, the collation service (like all ICU 74services) must map the requested locale to the localized collation data 75available to ICU at the time. It does so using the standard ICU locale fallback 76mechanism. See the fallback section of the [locale 77chapter](../locale/index.md) for more details. 78 79If you pass a regular locale in, like "en_US", the collation service first 80searches with fallback for "collations/default" key. The first such key it finds 81will have an associated string value; this is the keyword name for the collation 82that is default for this locale. If the search falls all the way back to the 83root locale, the collation service will us the "collations/default" key there, 84which has the value "standard". 85 86If there is a locale with a keyword, like "de-u-co-phonebk" or "de@collation=phonebook", the 87collation service searches with fallback for "collations/phonebook". If the 88search is successful, the collation service uses the string value it finds to 89instantiate a Collator. If the search fails because no such key is present in 90any of ICU's locale data (e.g., "de@collation=funky"), the service returns a 91collator implementing the default tailoring of the locale. 92If the fallback is all the way to the root locale, then 93the return `UErrorCode` is `U_USING_DEFAULT_WARNING`. 94 95## Input values for collation 96 97Collation deals with processing strings. ICU generally requires that all the 98strings should be in UTF-16 format, and that all the required conversion should 99done before ICU functions are used. In the case of collation, there are APIs 100that can also take instances of character iterators (`UCharIterator`) 101or UTF-8 directly. 102 103Theoretically, character iterators can iterate strings 104in any encoding. ICU currently provides character iterator implementations for 105UTF-8 and UTF-16BE (useful when processing data from a big endian platform on an 106little endian machine). It should be noted, however, that using iterators for 107collation APIs has a performance impact. It should be used in situations when it 108is not desirable to convert whole strings before the operation - such as when 109using a string compare function. 110 111## Collation Elements 112 113As discussed in the introduction, there are many possible orderings for sorted 114text, depending on language and other factors. Ideally, there is a way to 115describe each ordering as a set of rules for calculating numeric values for each 116string of text. The collation process then becomes one of simply comparing these 117numeric values. 118 119This essentially describes the way the Collation Service works. To implement 120a particular sort ordering, first the relationship between each character or 121character sequence is derived. For example, a Spanish ordering defines the 122letter sequence "CH" to be between the letters "C" and "D". As also discussed in 123the introduction, to order strings properly requires that comparison of base 124letters must be considered separately from comparison of accents. Letter case 125must also be considered separately from either base letters or accents. Any 126ordering specification language must provide a way to define the relationships 127between characters or character sequences on multiple levels. ICU supports this 128by using "<" to describe a relationship at the primary level, using "<<" to 129describe a relationship at the secondary level, and using "<<<" to describe a 130relationship at the tertiary level. Here are some example usages: 131 132Symbol | Example | Description 133------ | -------- | ----------- 134`<` | `c < ch` | Make a primary (base letter) difference between "c" and the character sequence "ch" 135`<<` | `a << ä` | Make a secondary (accent) difference between "a" and "ä" 136`<<<` | `a<<<A` | Make a tertiary difference between "a" and "A" 137 138A more complete description of the ordering specification symbols and their 139meanings is provided in the section on Collation Tailoring. 140 141Once a sort ordering is defined by specifying the desired relationships between 142characters and character sequences, ICU can convert these relationships to a 143series of numerical values (one for each level) that satisfy these same 144relationships. 145 146This series of numeric values, representing the relative weighting of a 147character or character sequence, is called a Collation Element (CE). 148One possible encoding of a Collation Element is a 32-bit value consisting of 149a 16-bit primary weight, a 8-bit secondary weight, 1502 case bits, and a 6-bit tertiary weight. 151 152The sort weight of a string is represented by the collation elements of its 153component characters and character sequences. For example, the sort weight of 154the string "apple" would consist of its component Collation Elements, as shown 155here: 156 157"Apple" | "Apple" Collation Elements 158------- | -------------------------- 159a | `[1900.05.05]` 160p | `[3700.05.05]` 161p | `[3700.05.05]` 162l | `[2F00.05.05]` 163e | `[2100.05.05]` 164 165In this example, the letter "a" has a 16-bit primary weight of 1900 (hex), an 1668-bit secondary weight of 05 (hex), and a combined 8-bit case-tertiary weight of 16705 (hex). 168 169String comparison is performed by comparing the collation elements of each 170string. Each of the primary weights are compared. If a difference is found, that 171difference determines the relationship between the two strings. If no 172differences are found, the secondary weights are compared and so forth. 173 174With ICU it is possible to specify how many levels should be compared. For some 175applications, it can be desirable to compare only primary levels or to compare 176only primary and secondary levels. 177 178## Sort Keys 179 180If a string is to be compared thousands or millions of times, 181it can be more efficient to use sort keys. 182Sort keys are useful in situations where a large amount of data is indexed 183and frequently searched. The sort key is generated once and used in subsequent 184comparisons, rather than repeatedly generating the string's Collation Elements. 185The comparison of sort keys is a very efficient and simple binary compare of strings of 186unsigned bytes. 187 188An important property of ICU sort keys is that you can obtain the same results 189by comparing 2 strings as you do by comparing the sort keys of the 2 strings 190(provided that the same ordering and related collation attributes are used). 191 192An ICU sort key is a pre-processed sequence of bytes generated from a Unicode 193string. The weights for each comparison level are concatenated, separated by a 194"0x01" byte between levels. 195The entire sequence is terminated with a 0x00 byte for convenience in C APIs. 196(This 0x00 terminator is counted in the sort key length — 197unlike regular strings where the NUL terminator is excluded from the string length.) 198 199ICU actually compresses the sort keys so that they take the 200minimum storage in memory and in databases. 201 202<!-- TODO: (diagram was missing in Google Sites already) 203 The diagram below represents an uncompressed sort key in ICU for ease of understanding. --> 204 205### Sort key size 206 207One of the more important issues when considering using sort keys is the sort 208key size. Unfortunately, it is very hard to give a fast exact answer to the 209following question: "What is the maximum size for sort keys generated for 210strings of size X". This problem is twofold: 211 2121. The maximum size of the sort key depends on the size of the collation 213 elements that are used to build it. Size of collation elements vary greatly 214 and depends both on the alphabet in question and on the locale used. 215 2162. Compression is used in building sort keys. Most 'regular' sequences of 217 characters produce very compact sort keys. 218 219If one is to assume the worst case and use too-big buffers, a lot of space will 220be wasted. However, if you use too-small buffers, you will lose performance if 221generated sort keys are longer than supplied buffers too often 222(and you have to reallocate for each of those). 223A good strategy 224for this problem would be to manually manage a large buffer for storing sortkeys 225and keep a list of indices to sort keys in this buffer (see the "large buffers" 226[Collation Example](examples#using-large-buffers-to-manage-sort-keys) 227for more details). 228 229Here are some rules of a thumb, please do not rely on them. If you are looking 230at the East Asian locales, you probably want to go with 5 bytes per code point. 231For Thai, 3 bytes per code point should be sufficient. For all the other locales 232(mostly Latin and Cyrillic), you should be fine with 2 bytes per code point. 233These values are based on average lengths of sort keys generated with tertiary 234strength. If you need quaternary and identical strength (you should not), add 3 235bytes per code point to each of these. 236 237### Partial sort keys 238 239In some cases, most notably when implementing [radix 240sorting](http://en.wikipedia.org/wiki/Radix_sort), it is useful to produce only 241parts of sort keys at a time. ICU4C 2.6+ provides an API that allows producing 242parts of sort keys (`ucol_nextSortKeyPart` API). These sort keys may or may not be 243compressed; that is, they may or may not be compatible with regular sort keys. 244 245### Merging sort keys 246 247Sometimes, it is useful to be able to merge sort keys. One example is having 248separate sort keys for first and last names. If you need to perform an operation 249that requires a sort key generated on the whole name, instead of concatenating 250strings and regenerating sort keys, you should merge the sort keys. The merging 251is done by merging the corresponding levels while inserting a terminator between 252merged parts. The reserved sort key byte value for the merge terminator is 0x02. 253For more details see [UCA section 1.6, Merging Sort 254Keys](http://www.unicode.org/reports/tr10/#Interleaved_Levels). 255 256* C API: unicode/ucol.h `ucol_mergeSortkeys()` 257* Java API: `com.ibm.icu.text.CollationKey merge(CollationKey source)` 258 259CLDR 1.9/ICU 4.6 and later map U+FFFE to a special collation element that is 260intended to allow concatenating strings like firstName+\\uFFFE+lastName to yield 261the same results as merging their individual sort keys. 262This has been fully implemented in ICU since version 53. 263 264### Generating bounds for a sort key (prefix matching) 265 266Having sort keys for strings allows for easy creation of bounds - sort keys that 267are guaranteed to be smaller or larger than any sort key from a give range. For 268example, if bounds are produced for a sortkey of string "smith", strings between 269upper and lower bounds with one level would include "Smith", "SMITH", "sMiTh". 270Two kinds of upper bounds can be generated - the first one will match only 271strings of equal length, while the second one will match all the strings with 272the same initial prefix. 273 274CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum 275primary weight, so that for example the string "smith\\uFFFF" can be used as the 276upper bound rather than modifying the sort key for "smith". 277 278## Collation Element Iterator 279 280The collation element iterator is used for traversing Unicode string collation 281elements one at a time. It can be used to implement language-sensitive text 282search algorithms like Boyer-Moore. 283 284For most applications, the two API categories, compare and sort key, are 285sufficient. Most people do not need to manipulate collation elements directly. 286 287Example: 288 289Consider iterating over "apple" and "äpple". Here are sequences of collation 290elements: 291 292String 1 | String 1 Collation Elements 293-------- | --------------------------- 294a | `[1900.05.05]` 295p | `[3700.05.05]` 296p | `[3700.05.05]` 297l | `[2F00.05.05]` 298e | `[2100.05.05]` 299 300String 2 | String 2 Collation Elements 301-------- | --------------------------- 302a | `[1900.05.05]` 303\\u0308 | `[0000.9D.05]` 304p | `[3700.05.05]` 305p | `[3700.05.05]` 306l | `[2F00.05.05]` 307e | `[2100.05.05]` 308 309The resulting CEs are typically masked according to the desired strength, and 310zero CEs are discarded. In the above example, masking with 0xFFFF0000 (for primary strength) 311produces the results of NULL secondary and tertiary differences. The collator then 312ignores the NULL differences and declares a match. For more details see the 313paper "Efficient text searching in Java™: Finding the right string in any 314language" by Laura Werner ( 315<http://icu-project.org/docs/papers/efficient_text_searching_in_java.html>). 316 317## Collation Attributes 318 319The Collation Service has a number of attributes whose values can be changed 320during run time. These attributes affect both the functionality and the 321performance of the Collation Service. This section describes these 322attributes and, where possible, their performance impact. Performance 323indications are only approximate and timings may vary significantly depending on 324the CPU, compiler, etc. 325 326Although string comparison by ICU and comparison of each string's sort key give 327the same results, attribute settings can impact the execution time of each 328method differently. To be precise in the discussion of performance, this section 329refers to the API employed in the measurement. The `ucol_strcoll` function is the 330API for string comparison. The `ucol_getSortKey` function is used to create sort 331keys. 332 333> :point_right: **Note** There is a special attribute value, `UCOL_DEFAULT`, 334> that can be used to set any attribute to its default value 335> (which is inherited from the UCA and the tailoring). 336 337### Attribute Types 338 339#### Strength level 340 341Collation strength, or the maximum collation level used for comparison, is set 342by using the `UCOL_STRENGTH` attribute. Valid values are: 343 3441. `UCOL_PRIMARY` 345 3462. `UCOL_SECONDARY` 347 3483. `UCOL_TERTIARY` (default) 349 3504. `UCOL_QUATERNARY` 351 3525. `UCOL_IDENTICAL` 353 354#### French collation 355 356The `UCOL_FRENCH_COLLATION` attribute determines whether to sort the secondary 357differences in reverse order. Valid values are: 358 3591. `UCOL_OFF` (default): compares secondary differences in the order they appear 360 in the string. 361 3622. `UCOL_ON`: causes secondary differences to be considered in reverse order, as 363 it is done in the French language. 364 365#### Normalization mode 366 367The `UCOL_NORMALIZATION_MODE` attribute, or its alias `UCOL_DECOMPOSITION_MODE`, 368controls whether text normalization is performed on the input strings. Valid 369values are: 370 3711. `UCOL_OFF` (default): turns off normalization check 372 3732. `UCOL_ON` : normalization is checked and the collator performs normalization 374 if it is needed. 375 376X | FCD | NFC | NFD 377--------------------- | --- | --- | --- 378A-ring | Y | Y | 379Angstrom | Y | | 380A + ring | Y | | Y 381A + grave | Y | Y | 382A-ring + grave | Y | | 383A + cedilla + ring | Y | | Y 384A + ring + cedilla | | | 385A-ring + cedilla | | Y | 386 387With normalization mode turned on, the `ucol_strcoll` function slows down by 10%. 388In addition, the time to generate a sort key also increases by about 25%. 389 390#### Alternate handling 391 392This attribute allows shifting of the variable characters (usually spaces and 393punctuation, in the UCA also most symbols) from the primary to the quaternary 394strength level. This is set by using the `UCOL_ALTERNATE_HANDLING` attribute. For 395details see [UCA: Variable 396Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting), [LDML: 397Collation 398Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings), 399and [“Ignore Punctuation” Options](customization/ignorepunct.md). 400 4011. `UCOL_NON_IGNORABLE` (CLDR/ICU default): variable characters are treated as 402 all the other characters 403 4042. `UCOL_SHIFTED` (UCA default): all the variable characters will be ignored at 405 the primary, secondary and tertiary levels and their primary strengths will 406 be shifted to the quaternary level. 407 408#### Case Ordering 409 410Some conventions require uppercase letters to sort before lowercase ones, while 411others require the opposite. This attribute is controlled by the value of the 412`UCOL_CASE_FIRST`. The case difference in the UCA is contained in the tertiary 413weights along with other appearance characteristics (like circling of letters). 414The case-first attribute allows for emphasizing of the case property of the 415letters by reordering the tertiary weights with either upper-first, and/or 416lowercase-first. This difference gets the most significant bit in the weight. 417Valid values for this attribute are: 418 4191. `UCOL_OFF` (default): leave tertiary weights unaffected 420 4212. `UCOL_LOWER_FIRST`: causes lowercase letters and uncased characters to sort 422 before uppercase 423 4243. `UCOL_UPPER_FIRST` : causes uppercase letters to sort first 425 426The case-first attribute does not affect the performance substantially. 427 428#### Case level 429 430When this attribute is set, an additional level is formed between the secondary 431and tertiary levels, known as the Case Level. The case level is used to 432distinguish large and small Japanese Kana characters. Case level could also be 433used in other situations. for example to distinguish certain Pinyin characters. 434Case level is controlled by `UCOL_CASE_LEVEL` attribute. Valid values for this 435attribute are 436 4371. `UCOL_OFF` (default): no additional case level 438 4392. `UCOL_ON` : adds a case level 440 441#### Hiragana Quaternary 442 443*This setting is deprecated and ignored in recent versions of ICU.* 444 445Hiragana Quaternary can be set to `UCOL_ON`, in which case Hiragana code points 446will sort before everything else on the quaternary level. If set to `UCOL_OFF` 447Hiragana letters are treated the same as all the other code points. This setting 448can be changed on run-time using the `UCOL_HIRAGANA_QUATERNARY_MODE` attribute. 449You probably won't need to use it. 450 451#### Variable Top 452 453Variable Top is a boundary which decides whether the code points will be treated 454as variable (shifted to quaternary level in the **shifted** mode) or 455non-ignorable. Special APIs are used for setting of variable top. It can 456basically be set either to a codepoint or a primary strength value. 457 458## Performance 459 460ICU collation is designed to be fast, small and customizable. Several techniques 461are used to enhance the performance: 462 4631. Providing optimized processing for Latin characters. 464 4652. Comparing strings incrementally and stopping at the first significant 466 difference. 467 4683. Tuning to eliminate unnecessary file access or memory allocation. 469 4704. Providing efficient preflight functions that allows fast sort key size 471 generation. 472 4735. Using a single, shared copy of UCA in memory for the read-only default sort 474 order. Only small tailoring tables are kept in memory for locale-specific 475 customization. 476 4776. Compressing sort keys efficiently. 478 4797. Making the sort order be data-driven. 480 481In general, the best performance from the Collation Service is expected by 482doing the following: 483 4841. After opening a collator, keep and reuse it until done. Do not open new 485 collators for the same sort order. (Note the restriction on 486 multi-threading.) 487 4882. Use `ucol_strcoll` etc. when comparing strings. If it is necessary to 489 compare strings thousands or millions of times, 490 create the sort keys first and compare the sort keys instead. 491 Generating the sort keys of two strings is about 5-10 492 times slower than just comparing them directly. 493 4943. Follow the best practice guidelines for generating sort keys. Do not call 495 `ucol_getSortKey` twice to first size the key and then allocate the sort key 496 buffer and repeat the call to the function to fill in the buffer. 497 498### Performance and Storage Implications of Attributes 499 500Most people use the default attributes when comparing strings or when creating 501sort keys. When they do want to customize the ordering, the most common options 502are the following : 503 504`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED`\ 505Used to ignore space and punctuation characters 506 507`UCOL_ALTERNATE_HANDLING == UCOL_SHIFTED` **and** `UCOL_STRENGTH == UCOL_QUATERNARY`\ 508Used to ignore the space and punctuation characters except when there are no previous letter, accent, or case/variable differences. 509 510`UCOL_CASE_FIRST == UCOL_LOWER_FIRST` **or** `UCOL_CASE_FIRST == UCOL_UPPER_FIRST`\ 511Used to change the ordering of upper vs. lower case letters (as 512well as small vs. large kana) 513 514`UCOL_CASE_LEVEL == UCOL_ON` **and** `UCOL_STRENGTH == UCOL_PRIMARY`\ 515Used to ignore only the accent differences. 516 517`UCOL_NORMALIZATION_MODE == UCOL_ON`\ 518Force to always check for normalization. This 519is used if the input text may not be in FCD form. 520 521`UCOL_FRENCH_COLLATION == UCOL_OFF`\ 522This is only useful for languages like French and Catalan that may turn this attribute on. 523(It is the default only for Canadian French ("fr-CA").) 524 525In String Comparison, most of these options have little or no effect on 526performance. The only noticeable one is normalization, which can cost 10%-40% in 527performance. 528 529For Sort Keys, most of these options either leave the storage alone or reduce 530it. Shifting can reduce the storage by about 10%-20%; case level + primary-only 531can decrease it about 20% to 40%. Using no French accents can reduce the storage 532by about 38% , but only for languages like French and Catalan that turn it on by 533default. On the other hand, using Shifted + Quaternary can increase the storage by 53410%-15%. (The Identical Level also increases the length, but this option is not 535recommended). 536 537> :point_right: **Note** All of the above numbers are based on 538> tests run on a particular machine, with a particular set of data. 539> (The data for each language is a large number of names 540> in that language in the format <first_name>, <last name>.) 541> The performance and storage may vary, depending on the particular computer, 542> operating system, and data. 543 544## Versioning 545 546Sort keys are often stored on disk for later reuse. A common example is the use 547of keys to build indexes in databases. When comparing keys, it is important to 548know that both keys were generated by the same algorithms and weightings. 549Otherwise, identical strings with keys generated on two different dates, for 550example, might compare as unequal. Sort keys can be affected by new versions of 551ICU or its data tables, new sort key formats, or changes to the Collator. 552Starting with release 1.8.1, ICU provides a versioning mechanism to identify the 553version information of the following (but not limited to), 554 5551. The run-time executable 556 5572. The collation element content 558 5593. The Unicode/UCA database 560 5614. The tailoring table 562 563The version information of Collator is a 32-bit integer. If a new version of ICU 564has changes affecting the content of collation elements, the version information 565will be changed. In that case, to use the new version of ICU collator will 566require regenerating any saved or stored sort keys. 567 568However, it is possible to modify ICU code or data without changing relevant version numbers, 569so it is safer to regenerate sort keys any time after any part of ICU has been updated. 570 571Since ICU4C 1.8.1. 572it is possible to build your program so that it uses more than one version of 573ICU (only in C/C++, not in Java). Therefore, you could use the current version 574for the features you need and use the older version for collation. 575 576## Programming Examples 577 578See the [Collation Examples](examples.md) chapter for an example of how to 579compare and create sort keys with the default locale in C, C++ and Java. 580