---
layout: default
title: Collation
nav_order: 9
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Collation

## Overview

Information is displayed in sorted order to enable users to easily find the
items they are looking for. However, users of different languages might have
very different expectations of what a "sorted" list should look like. Not only
does the alphabetical order vary from one language to another, but it also can
vary from document to document within the same language. For example, phonebook
ordering might be different from dictionary ordering. String comparison is one
of the basic functions most applications require, and yet implementations often
do not match local conventions. The ICU Collation Service provides string
comparison capability with support for appropriate sort orderings for each of
the locales you need. In the event that you have a very unusual requirement, you
are also provided the facilities to customize orderings.

Starting in release 1.8, the ICU Collation Service is compliant with the Unicode
Collation Algorithm (UCA)
([Unicode Technical Standard #10](http://www.unicode.org/reports/tr10/)) and is
based on the Default Unicode Collation Element Table (DUCET), which defines the
same sort order as ISO 14651.

The ICU Collation Service also contains several enhancements that are not
available in UCA. These have been adopted into the
[CLDR Collation Algorithm](http://www.unicode.org/reports/tr35/tr35-collation.html#CLDR_Collation_Algorithm).
For example:

* Additional case handling (as specified by CLDR): ICU allows case differences
  to be ignored or flipped. Uppercase letters can be sorted before lowercase
  letters, or vice-versa.
* Easy customization (as specified by CLDR): Services can be easily tailored
  to address a wide range of collation requirements.
* The [default (root) sort
  order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
  has been tailored slightly for improved functionality and performance.

In other words, ICU implements the CLDR Collation Algorithm which is an
extension of the Unicode Collation Algorithm (UCA) which is an extension of ISO
14651.

There are several benefits to using the collation algorithms defined in these
standards, including:

* The algorithms have been designed and reviewed by experts in multilingual
  collation, and therefore are robust and comprehensive.

* Applications that share sorted data but do not agree on how the data should
  be ordered fail to perform correctly. By conforming to the CLDR/UCA/14651
  standards for collation and using CLDR language-specific collation data,
  independently developed applications sort data identically and perform
  properly.

In addition, Unicode contains a large set of characters. This can make it
difficult for collation to be a fast operation or require collation to use
significant memory or disk resources. The ICU collation implementation is
designed to be fast, have a small memory footprint and be highly customizable.

There are many challenges when accommodating the world's languages and writing
systems and the different orderings that are used. However, the ICU Collation
Service provides an excellent means for comparing strings in a locale-sensitive
fashion.

For example, here are some of the ways languages vary in ordering strings:

* The letters A-Z can be sorted in a different order than in English. For
  example, in Lithuanian, "y" is sorted between "i" and "k".

* Combinations of letters can be treated as if they were one letter. For
  example, in traditional Spanish "ch" is treated as a single letter, and
  sorted between "c" and "d".

* Accented letters can be treated as minor variants of the unaccented letter.
  For example, "é" can be treated as equivalent to "e".

* Accented letters can be treated as distinct letters. For example, "Å" in
  Danish is treated as a separate letter that sorts after "Z".

* Unaccented letters that are considered distinct in one language can be
  indistinct in another. For example, the letters "v" and "w" are two
  different letters according to English. However, "v" and "w" are
  traditionally considered variant forms of the same letter in Swedish.

* A letter can be treated as if it were two letters. For example, in German
  phonebook (or "lists of names") order "ä" is compared as if it were "ae".

* Thai requires that the order of certain letters be reversed.

* Some French dictionary ordering traditions sort accents in backwards order,
  from the end of the string. For example, the word "côte" sorts before "coté"
  because the acute accent on the final "e" is more significant than the
  circumflex on the "o".

* Sometimes lowercase letters sort before uppercase letters. The reverse is
  required in other situations. For example, lowercase letters are usually
  sorted before uppercase letters in English. In Danish, the opposite holds.

* Even in the same language, different applications might require different
  sorting orders. For example, in German dictionaries, "of" would come before
  "öf". In phone books, where "ö" is compared as if it were "oe", the
  situation is the exact opposite.

* Sorting orders can change over time due to government regulations or new
  characters/scripts in Unicode.

To accommodate the many languages and differing requirements, ICU collation
supports customizing sort orderings, also known as **tailoring**.
More details regarding tailoring are discussed in the
[Customization chapter](customization/index.md).

The basic ICU Collation Service is provided by two main categories of APIs:

* String comparison - most commonly used: APIs return the result of comparing
  two strings (greater than, equal to, or less than). This is used as a
  comparator when sorting lists, building tree maps, etc.

* Sort key generation - used when a very large set of strings is
  compared/sorted repeatedly: APIs return a zero-terminated array of bytes per
  string known as a sort key. The keys can be compared directly using strcmp
  or memcmp standard library functions, saving repeated lookup and computation
  of each string's collation properties. For example, database applications
  use index tables of sort keys to index strings quickly. Note, however, that
  this only improves performance for large numbers of strings because sorting
  via the comparison functions is very fast. For more information, see
  [Sortkeys vs Comparison](concepts#sortkeys-vs-comparison).

ICU provides an AlphabeticIndex API for generating language-appropriate
sorted-section labels like in dictionaries and phone books.

ICU also provides a higher-level [string search](string-search)
API which can be used, for example, for case-insensitive or accent-insensitive
search in an editor or in a web page. ICU string search is based on the
low-level [collation element iteration](architecture).

## Programming Examples

Here are some [API usage conventions](api.md) for the ICU Collation Service
APIs.
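
As a rough illustration of the two API categories described in the overview
(string comparison and sort key generation), the following sketch uses the
JDK's built-in `java.text.Collator` as a stand-in, so that it runs without any
extra dependencies. It is not ICU code. ICU4J's `com.ibm.icu.text.Collator`
offers similarly named methods (`getInstance`, `compare`, `getCollationKey`,
`setStrength`), but the orderings it produces may differ from the JDK's. The
class name `CollationSketch` and its helper methods are hypothetical names
chosen for this example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: uses the JDK collator, not ICU.
public class CollationSketch {

    // Category 1: string comparison. The collator is used as a comparator.
    static List<String> sortByComparison(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // java.text.Collator implements Comparator<Object>
        return sorted;
    }

    // Category 2: sort key generation. Each string is converted to a key
    // once, and the keys are then compared without consulting the collator.
    static List<String> sortBySortKeys(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }
        Collections.sort(keys); // CollationKey implements Comparable<CollationKey>
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    // Ignoring case differences by comparing at primary strength only.
    static boolean equalAtPrimaryStrength(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "Cherry", "apple");

        // Plain code point order puts uppercase "Cherry" first.
        List<String> binary = new ArrayList<>(words);
        Collections.sort(binary);
        System.out.println("binary:     " + binary);

        // Both collation approaches give the same locale-sensitive order.
        System.out.println("comparison: " + sortByComparison(words, Locale.ENGLISH));
        System.out.println("sort keys:  " + sortBySortKeys(words, Locale.ENGLISH));

        System.out.println("abc ~ ABC:  "
                + equalAtPrimaryStrength("abc", "ABC", Locale.ENGLISH));
    }
}
```

Both helpers are expected to return the same order for the same locale. The
sort key variant only pays off when the same strings are compared many times,
as noted in [Sortkeys vs Comparison](concepts#sortkeys-vs-comparison).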