---
layout: default
title: API Details
nav_order: 6
parent: Collation
---

# Collation API Details
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

This section describes some of the usage conventions for the ICU Collation Service API.

## Collator Instantiation

To use the Collation Service, you must instantiate a `Collator`. The Collator defines the properties and behavior of the sort ordering. The Collator can be referenced repeatedly until all collation activities have been performed. It can then be closed and removed.

### Instantiating the Predefined Collators

ICU comes with a large set of predefined collators that are suited for specific locales. Most of the ICU locales have a predefined collator. In the worst case, the CLDR default set of rules, which is mostly equivalent to the UCA default ordering (DUCET), is used. The default sort order itself is designed to work well for many languages. (For example, there are no tailorings for the standard sort orders for English, German, French, etc.)

To instantiate a predefined collator, use `ucol_open` in C, `Collator::createInstance` in C++, or `Collator.getInstance` in Java. The C API takes a locale ID (or language tag) string argument, C++ takes a `Locale` object, and Java takes a `Locale` or `ULocale`.

For some languages, multiple collation types are available; for example, "de-u-co-phonebk" / "de@collation=phonebook". They can be enumerated via `Collator::getKeywordValuesForLocale()`. See also the list of available collation tailorings in the online [ICU Collation Demo](http://demo.icu-project.org/icu-bin/collation.html).

Starting with ICU 54, collation attributes can be specified via locale keywords as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive.

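To make the language tag syntax concrete, the following sketch reads the keyword values back out of the two language tags quoted above. It does not open an ICU collator. It only uses the JDK's `java.util.Locale`, which parses the same `-u-` extension syntax and is used here as a stand-in, so ICU's own keyword handling may differ in detail. The class name `LocaleKeywordSketch` and the helper `keywordValue` are hypothetical names chosen for illustration.

```java
import java.util.Locale;

public class LocaleKeywordSketch {
    /**
     * Returns the value of a Unicode extension key (for example "co" or "kf")
     * in a language tag, or null if the tag does not define that key.
     * Assumption: JDK parsing is used as a stand-in for ICU's keyword handling.
     */
    static String keywordValue(String languageTag, String key) {
        return Locale.forLanguageTag(languageTag).getUnicodeLocaleType(key);
    }

    public static void main(String[] args) {
        // Collation type keyword from the "de-u-co-phonebk" example.
        System.out.println(keywordValue("de-u-co-phonebk", "co"));
        // Upper-case input: the JDK normalizes keywords and values to lower case.
        System.out.println(keywordValue("EL-U-KF-UPPER", "kf"));
        // A tag without the keyword yields null.
        System.out.println(keywordValue("de", "co"));
    }
}
```

In ICU4J the resulting locale would be passed to `Collator.getInstance`; that step is left out here so that the sketch runs with the JDK alone.
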
See the [LDML Collation spec, Collation Settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings), and the [data file](https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml) listing the valid collation keywords and their values. (The deprecated attributes kh/colHiraganaQuaternary and vt/variableTop are not supported.)

For the [old locale extension syntax](http://www.unicode.org/reports/tr35/tr35.html#Old_Locale_Extension_Syntax), the data file's alias names are used (first alias, if defined, otherwise the name): `"de@collation=phonebook;colCaseLevel=yes;kv=space"`

For the language tag syntax, the non-alias names are used, and "true" values can be omitted: `"de-u-co-phonebk-kc-kv-space"`

This example demonstrates the instantiation of a collator.

**C:**

```c
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
if(U_SUCCESS(status)) {
    /* close the collator */
    ucol_close(coll);
}
```

**C++:**

```c++
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
if(U_SUCCESS(status)) {
    // close the collator
    delete coll;
}
```

**Java:**

```java
Collator col = null;
try {
    col = Collator.getInstance(Locale.US);
} catch (Exception e) {
    System.err.println("English collation creation failed.");
    e.printStackTrace();
}
```

### Instantiating Collators Using Custom Rules

If the ICU predefined collators are not appropriate for your intended usage, you can define your own set of rules and instantiate a collator that uses them. For more details, please see [the section on collation customization](customization/index).

This example demonstrates the instantiation of a collator from custom rules.

**C:**

```c
UErrorCode status = U_ZERO_ERROR;
U_STRING_DECL(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
UCollator *coll;
U_STRING_INIT(rules, "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E", 52);
coll = ucol_openRules(rules, -1, UCOL_ON, UCOL_DEFAULT_STRENGTH, NULL, &status);
if(U_SUCCESS(status)) {
    /* close the collator */
    ucol_close(coll);
}
```

**C++:**

```c++
UErrorCode status = U_ZERO_ERROR;
UnicodeString rules(u"&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E");
Collator *coll = new RuleBasedCollator(rules, status);
if(U_SUCCESS(status)) {
    // close the collator
    delete coll;
}
```

**Java:**

```java
RuleBasedCollator coll = null;
String ruleset = "&9 < a, A < b, B < c, C; ch, cH, Ch, CH < d, D, e, E";
try {
    coll = new RuleBasedCollator(ruleset);
} catch (Exception e) {
    System.err.println("Customized collation creation failed.");
    e.printStackTrace();
}
```

## Compare

Two of the most frequently used functions in the ICU collation API, `ucol_strcoll` and `ucol_getSortKey`, have counterparts in both the Win32 and ANSI APIs:

ICU C             | ICU C++                     | ICU Java                   | ANSI/POSIX | WIN32
----------------- | --------------------------- | -------------------------- | ---------- | ---------------
`ucol_strcoll`    | `Collator::compare`         | `Collator.compare`         | `strcoll`  | `CompareString`
`ucol_getSortKey` | `Collator::getSortKey`      | `Collator.getCollationKey` | `strxfrm`  | `LCMapString`
&nbsp;            | `Collator::getCollationKey` | &nbsp;                     | &nbsp;     | &nbsp;

For more sophisticated usage, such as user-controlled language-sensitive text searching, an iterating interface to collation is provided. Please refer to the section below on `CollationElementIterator` for more details.

The `ucol_strcoll` function compares one pair of strings at a time. Comparing two strings is much faster than calculating sort keys for both of them. However, if comparisons should be done repeatedly on a very large number of strings, generating and storing sort keys can improve performance.
In all other cases (such as quick sort or bubble sort of a moderately-sized list of strings), comparing strings works very well.

The C API used for comparing two strings is `ucol_strcoll`. It requires two `UChar *` strings and their lengths as parameters, as well as a pointer to a valid `UCollator` instance. The result is a `UCollationResult` constant, which can be one of `UCOL_LESS`, `UCOL_EQUAL` or `UCOL_GREATER`.

The C++ API offers the method `Collator::compare` with several overloads. Acceptable input arguments are `UChar *` strings with their lengths, or `UnicodeString` instances. The result is a member of the `UCollationResult` or `EComparisonResult` enums.

The Java API provides the method `Collator.compare` with one overload. Acceptable input arguments are Strings or Objects. The result is an int value, which is less than zero if source is less than target, zero if source and target are equal, or greater than zero if source is greater than target.

There are also several convenience functions and methods returning a boolean value, such as `ucol_greater`, `ucol_greaterOrEqual`, `ucol_equal` (in C), `Collator::greater`, `Collator::greaterOrEqual`, `Collator::equal` (in C++), and `Collator.equals` (in Java).

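As a rough illustration of the Java conventions just described, the sketch below reduces the int returned by `compare` to its sign, calls the boolean convenience method `equals`, and sorts a short list by passing the collator as a comparator. To stay runnable without extra libraries it uses the JDK's `java.text.Collator` as a stand-in. The assumption is that ICU4J's `Collator` exposes the same `compare` and `equals(String, String)` methods, as the text above states. The class name `CompareSketch` and the helpers `sign` and `sorted` are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CompareSketch {
    /** Reduces the int result of Collator.compare to -1, 0, or +1. */
    static int sign(Collator coll, String source, String target) {
        return Integer.signum(coll.compare(source, target));
    }

    /** Returns a sorted copy of the input; a Collator can be used directly as a Comparator. */
    static String[] sorted(Collator coll, String... input) {
        String[] copy = input.clone();
        Arrays.sort(copy, coll);
        return copy;
    }

    public static void main(String[] args) {
        // Assumption: JDK collator used in place of the ICU4J class.
        Collator coll = Collator.getInstance(Locale.US);
        System.out.println(sign(coll, "fox", "Moving"));   // "fox" sorts before "Moving"
        System.out.println(coll.equals("fox", "fox"));      // boolean convenience method
        System.out.println(String.join(" ",
                sorted(coll, "Quick", "fox", "Moving", "trucks", "riddle")));
    }
}
```

Testing only the sign of the result, instead of comparing it with `-1` or `1`, matches the contract stated above, which promises "less than zero" and "greater than zero" but no specific values.
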
### Examples

**C:**

```c
UChar *s [] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if(U_SUCCESS(status)) {
    for(i=listSize-1; i>=1; i--) {
        for(j=0; j<i; j++) {
            /* swap neighbours whenever s[j] sorts before s[j+1] */
            if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_LESS) {
                UChar *tmp = s[j];
                s[j] = s[j+1];
                s[j+1] = tmp;
            }
        }
    }
    ucol_close(coll);
}
```

**C++:**

```c++
UnicodeString s [] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
uint32_t i, j;
if(U_SUCCESS(status)) {
    for(i=listSize-1; i>=1; i--) {
        for(j=0; j<i; j++) {
            // swap neighbours whenever s[j] sorts before s[j+1]
            if(coll->compare(s[j], s[j+1]) == UCOL_LESS) {
                std::swap(s[j], s[j+1]);   // from <utility>
            }
        }
    }
    delete coll;
}
```

**Java:**

```java
String s [] = { /* list of Unicode strings */ };
try {
    Collator coll = Collator.getInstance(Locale.US);
    for (int i = s.length - 1; i >= 1; i--) {
        for (int j = 0; j < i; j++) {
            // swap neighbours whenever s[j] sorts before s[j+1]
            if (coll.compare(s[j], s[j+1]) < 0) {
                String tmp = s[j];
                s[j] = s[j+1];
                s[j+1] = tmp;
            }
        }
    }
} catch (Exception e) {
    System.err.println("English collation creation failed.");
    e.printStackTrace();
}
```

```c
...
        if (expectedLen > bufferLen) {
            if (currBuffer == buffer) {
                currBuffer = (char*)malloc(expectedLen);
            } else {
                currBuffer = (char*)realloc(currBuffer, expectedLen);
            }
        }
        bufferLen = ucol_getSortKey(coll, source[i], -1, currBuffer, expectedLen);
    }
    processSortKey(i, currBuffer, bufferLen);
    if (currBuffer != buffer && currBuffer != NULL) {
        free(currBuffer);
    }
}
```

> :point_right: **Note** Although the API allows you to call
> `ucol_getSortKey` with `NULL` to see what the
> sort key length is, it is strongly recommended that you NOT determine the length
> first, then allocate and fill the sort key buffer. If you do, it requires twice
> the processing since computing the length has to do the same calculation as
> actually getting the sort key. Instead, the example shown above uses a stack buffer.

### Using Iterators for String Comparison

ICU4C's `ucol_strcollIter` API allows for comparing two strings that are supplied as character iterators (`UCharIterator`). This is useful when you need to compare differently encoded strings using `strcoll`. In that case, converting the strings first would probably be wasteful, since `strcoll` usually gives the result before whole strings are processed. This API is implemented only as a C function in ICU4C. There are no equivalent C++ or ICU4J functions.

```c
...
/* we are arriving with two char*: utf8Source and utf8Target, with their
 * lengths in utf8SourceLen and utf8TargetLen
 */
UCharIterator sIter, tIter;
uiter_setUTF8(&sIter, utf8Source, utf8SourceLen);
uiter_setUTF8(&tIter, utf8Target, utf8TargetLen);
compareResultUTF8 = ucol_strcollIter(myCollation, &sIter, &tIter, &status);
...
```

### Obtaining Partial Sort Keys

When using different sort algorithms, such as radix sort, sometimes it is useful to process strings only as much as needed to feed into the sorting algorithm. For that purpose, ICU provides the `ucol_nextSortKeyPart` API, which also takes character iterators. This API allows for iterating over subsequent pieces of an uncompressed sort key. Between calls to the API you need to save a 64-bit state. Following is an example of simulating a string compare function using the partial sort key API. Your usage model is bound to look much different.

```c
static UCollationResult compareUsingPartials(UCollator *coll,
                                             const UChar source[], int32_t sLen,
                                             const UChar target[], int32_t tLen,
                                             int32_t pieceSize, UErrorCode *status) {
    int32_t partialSKResult = 0;
    UCharIterator sIter, tIter;
    uint32_t sState[2], tState[2];
    int32_t sSize = pieceSize, tSize = pieceSize;
    int32_t i = 0;
    uint8_t sBuf[16384], tBuf[16384];
    if(pieceSize > 16384) {
        *status = U_BUFFER_OVERFLOW_ERROR;
        return UCOL_EQUAL;
    }
    *status = U_ZERO_ERROR;
    sState[0] = 0; sState[1] = 0;
    tState[0] = 0; tState[1] = 0;
    while(sSize == pieceSize && tSize == pieceSize && partialSKResult == 0) {
        uiter_setString(&sIter, source, sLen);
        uiter_setString(&tIter, target, tLen);
        sSize = ucol_nextSortKeyPart(coll, &sIter, sState, sBuf, pieceSize, status);
        tSize = ucol_nextSortKeyPart(coll, &tIter, tState, tBuf, pieceSize, status);
        partialSKResult = memcmp(sBuf, tBuf, pieceSize);
    }

    if(partialSKResult < 0) {
        return UCOL_LESS;
    } else if(partialSKResult > 0) {
        return UCOL_GREATER;
    } else {
        return UCOL_EQUAL;
    }
}
```

### Other Examples

A longer example is presented in the 'Examples'
section. Here is an illustration of the usage model.

**C:**

```c
#define MAX_KEY_SIZE 100
#define MAX_BUFFER_SIZE 10000
#define MAX_LIST_LENGTH 5

/* qsort comparator: sort keys are zero-terminated byte strings */
static int compareKeys(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

const char *text[] = { "Quick", "fox", "Moving", "trucks", "riddle" };
UChar s[MAX_LIST_LENGTH][20];
int i;
int32_t length, expectedLen;
uint8_t temp[MAX_BUFFER_SIZE];
uint8_t *temp2 = temp;
uint8_t keys[MAX_LIST_LENGTH][MAX_KEY_SIZE];
UErrorCode status = U_ZERO_ERROR;
UCollator *coll;

length = MAX_BUFFER_SIZE;
for(i = 0; i < MAX_LIST_LENGTH; i++) {
    u_uastrcpy(s[i], text[i]);
}
coll = ucol_open("en_US", &status);
if(U_SUCCESS(status)) {
    for(i = 0; i < MAX_LIST_LENGTH; i++) {
        expectedLen = ucol_getSortKey(coll, s[i], -1, temp2, length);
        if (expectedLen > length) {
            /* buffer too small: grow it and generate the key again */
            if (temp2 == temp) {
                temp2 = (uint8_t *)malloc(expectedLen);
            } else {
                temp2 = (uint8_t *)realloc(temp2, expectedLen);
            }
            length = ucol_getSortKey(coll, s[i], -1, temp2, expectedLen);
        }
        /* assumes each key fits in MAX_KEY_SIZE bytes */
        memcpy(keys[i], temp2, expectedLen);
    }
    qsort(keys, MAX_LIST_LENGTH, MAX_KEY_SIZE*sizeof(uint8_t), compareKeys);
    if (temp2 != temp) {
        free(temp2);
    }
    ucol_close(coll);
}
```

**C++:**

```c++
#define MAX_LIST_LENGTH 5
const UnicodeString s [] = { "Quick", "fox", "Moving", "trucks", "riddle" };
CollationKey keys[MAX_LIST_LENGTH];
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale("en", "US"), status);
uint32_t i;
if(U_SUCCESS(status)) {
    for(i=0; i<MAX_LIST_LENGTH; i++) {
        // getCollationKey fills the caller-supplied CollationKey object
        coll->getCollationKey(s[i], keys[i], status);
    }
    // CollationKey is not trivially copyable, so use std::sort (<algorithm>) instead of qsort
    std::sort(keys, keys + MAX_LIST_LENGTH,
              [&status](const CollationKey &a, const CollationKey &b) {
                  return a.compareTo(b, status) == UCOL_LESS;
              });
    delete coll;
}
```

**Java:**

```java
String s [] = { "Quick", "fox", "Moving", "trucks", "riddle" };
CollationKey keys[] = new CollationKey[s.length];
try {
    Collator coll = Collator.getInstance(Locale.US);
    for (int i = 0; i < s.length; i ++) {
        keys[i] = coll.getCollationKey(s[i]);
    }
    Arrays.sort(keys);
} catch (Exception e) {
    System.err.println("Error creating English collator");
    e.printStackTrace();
}
```

## CollationElementIterator

A collation element iterator can only be used in one direction. This is established at the time of the first call to retrieve a collation element.
Once `ucol_next` (C), `CollationElementIterator::next` (C++) or `CollationElementIterator.next` (Java) is invoked, `ucol_previous` (C), `CollationElementIterator::previous` (C++) or `CollationElementIterator.previous` (Java) should not be used (and vice versa). The direction can be changed immediately after `ucol_first`, `ucol_last`, `ucol_reset` (in C), `CollationElementIterator::first`, `CollationElementIterator::last`, `CollationElementIterator::reset` (in C++) or `CollationElementIterator.first`, `CollationElementIterator.last`, `CollationElementIterator.reset` (in Java) is called, or when the iterator reaches the end of the string while traversing it.

When `ucol_next` is called at the end of the string buffer, `UCOL_NULLORDER` is returned, and every subsequent call to `ucol_next` returns it again. The same applies to `ucol_previous`.

An example of how iterators are used is the Boyer-Moore search implementation, which can be found in the samples section.

### API Example

**C:**

```c
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
UChar text[20];
UCollationElements *collelemitr;
int32_t collelem;
u_uastrcpy(text, "text");
collelemitr = ucol_openElements(coll, text, -1, &status);
collelem = 0;
do {
    collelem = ucol_next(collelemitr, &status);
} while (collelem != UCOL_NULLORDER);
ucol_closeElements(collelemitr);
ucol_close(coll);
```

**C++:**

```c++
UErrorCode status = U_ZERO_ERROR;
Collator *coll = Collator::createInstance(Locale::getUS(), status);
UnicodeString text("text");
// createCollationElementIterator is declared on RuleBasedCollator
RuleBasedCollator *rbc = dynamic_cast<RuleBasedCollator *>(coll);
if (U_SUCCESS(status) && rbc != nullptr) {
    CollationElementIterator *collelemitr = rbc->createCollationElementIterator(text);
    int32_t collelem = 0;
    do {
        collelem = collelemitr->next(status);
    } while (collelem != CollationElementIterator::NULLORDER);
    delete collelemitr;
}
delete coll;
```

**Java:**

```java
try {
    RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance(Locale.US);
    String text = "text";
    CollationElementIterator collelemitr = coll.getCollationElementIterator(text);
    int collelem = 0;
    do {
        collelem = collelemitr.next();
    } while (collelem != CollationElementIterator.NULLORDER);
} catch (Exception e) {
    System.err.println("Error in collation iteration");
    e.printStackTrace();
}
```

## Setting and Getting Attributes

The general attribute setting APIs are `ucol_setAttribute` (in C) and `Collator::setAttribute` (in C++). These APIs take an attribute name and an attribute value. If the name and the value pass a syntax and range check, the property of the collator is changed. If the name and value do not pass a syntax and range check, however, the state is not changed and the error code variable is set to an error condition. The Java version does not provide general attribute setting APIs; instead, each attribute has its own setter API of the form `RuleBasedCollator.setATTRIBUTE_NAME(arguments)`.

The attribute getting APIs are `ucol_getAttribute` (C) and `Collator::getAttribute` (C++). Both APIs require an attribute name as an argument and return an attribute value if a valid attribute name was supplied. If a valid attribute name was not supplied, however, they return an undefined result and set the error code. As with the setter APIs, the Java version provides no generic getter API. Each attribute has its own getter API of the form `RuleBasedCollator.getATTRIBUTE_NAME()` in the Java version.

## References

1. Ken Whistler, Markus Scherer: "Unicode Technical Standard #10, Unicode Collation Algorithm"
2. ICU Design doc: "Collation v2"
3. Mark Davis: "ICU Collation Design Document"
4. The Unicode Standard, chapter 5, "Implementation guidelines"
5. Laura Werner: "Efficient text searching in Java: Finding the right string in any language"
6. Mark Davis, Martin Dürst: "Unicode Standard Annex #15: Unicode Normalization Forms"