---
layout: default
title: Normalization
nav_order: 3
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Normalization
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

Normalization is used to convert text to a unique, equivalent form. Software can
normalize equivalent strings to one particular sequence, such as normalizing
composite character sequences into pre-composed characters.

Normalization allows for easier sorting and searching of text. The ICU
normalization APIs support the standard normalization forms, which are described
in great detail in [Unicode Technical Report #15 (Unicode Normalization
Forms)](http://www.unicode.org/reports/tr15/) and the Normalization, Sorting and
Searching sections of chapter 5 of the [Unicode
Standard](http://www.unicode.org/versions/latest/). ICU also supports related,
additional operations. Some of them are described in [Unicode Technical Note #5
(Canonical Equivalence in Applications)](http://www.unicode.org/notes/tn5/).

## New API

ICU 4.4 adds the Normalizer2 API (in
[Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/Normalizer2.html),
[C++](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNormalizer2.html) and
[C](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html)), replacing almost all
of the old Normalizer API. There is a [design
doc](http://site.icu-project.org/design/normalization/custom) with many details.
All of the replaced old API is now implemented as a thin wrapper around the new
API.

Here is a summary of the differences:

*   Custom data: The new API uses non-static functions. A Normalizer2 instance
    can be created from standard Unicode normalization data, or from a custom
    (application-specific) data file with custom data processed by the new
    gennorm2 tool.
    *   Examples for possible custom data include UTS #46 IDNA mappings, Mac OS X
        file system normalization, and a combination of NFKC with case folding
        (see the Unicode FC_NFKC_Closure property).
    *   By using a single data file and a single processing step for
        combinations like NFKC + case folding, the performance for such
        operations is improved.
*   NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and
    removing ignorable characters, which was introduced with Unicode 5.2.
*   The old unorm.icu data file (used in Java; hardcoded in the common
    library in C/C++) has been replaced with two new files, nfc.nrm and
    nfkc.nrm. If only canonical or only compatibility mappings are needed, then
    the other data file can be removed. There is also a new nfkc_cf.nrm file for
    NFKC_Casefold.
*   FCD: The old API supports [FCD
    processing](http://www.unicode.org/notes/tn5/#FCD) only for NFC/NFD data.
    Normalizer2 supports it for any data file, including NFKC/NFKD and custom
    data.
*   FCC: Normalizer2 optionally supports [contiguous
    composition](http://www.unicode.org/notes/tn5/#FCC), which is almost the same
    as NFC/NFKC except that the normalized form also passes the FCD test. This
    is also supported for any standard or custom data file.
*   Quick check: There is a new `spanQuickCheckYes()` function for an optimized
    combination of quick check and normalization.
*   Filtered: The new FilteredNormalizer2 class combines a Normalizer2 instance
    with a UnicodeSet to limit normalization to certain characters. For example,
    the old API's UNICODE_3_2 option is implemented via a FilteredNormalizer2
    using a UnicodeSet with the pattern `[:age=3.2:]`. (In other words, Unicode
    3.2 normalization now requires the uprops.icu data.)
*   Ease of use: In general, the switch to a factory method, to otherwise
    non-static functions, and to multiple data files simplifies all of the
    function signatures.
*   Iteration: Support for iterative normalization is now provided by functions
    that test properties of code points, rather than requiring a particular type
    of ICU character iterator. The old implementation simply fetched the code
    points and used equivalent code point test functions anyway. The new API also
    provides a wider variety of such test functions.
*   String interfaces: In Java, input parameters are now CharSequence
    references, and output is to StringBuilder or Appendable.

The new API does not replace a few pieces of the old API:

*   The string comparison functions are still provided only by the old API,
    although reimplemented using the new code. They use multiple Normalizer2
    instances (FCD and NFD) and are therefore a poor fit for the new Normalizer2
    class. If necessary, a modernized replacement taking multiple Normalizer2
    instances as parameters is possible, but not planned.
*   The old QuickCheck return values are used by the new API as well.

## Data File Syntax

The gennorm2 tool accepts one or more .txt files and generates a .nrm binary
data file for `Normalizer2.getInstance()`. For gennorm2 command line options,
invoke `gennorm2 --help`.

gennorm2 starts with no data. If you want to include standard Unicode
Normalization data, use the files in
[{ICU4C}/source/data/unidata/norm2/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/unidata/norm2).
You can modify one of them, or provide it together with one or more additional
files that add or remove mappings.

Hangul/Jamo data (mappings and ccc=0) are predefined and cannot be modified.

Mappings in one text file can override mappings in previous files of the same
gennorm2 invocation.

Comments start with #. White space between tokens is ignored. Characters are
written as hexadecimal code points. Combining class values are written as
decimal numbers.

In each file, each character can have at most one mapping and at most one ccc
(canonical combining class) value. A ccc value must not be 0. (ccc=0 is the
default.)

Each line defines data for either a single code point (`00E1`) or a range of
code points (`0300..0314`).

A two-way mapping must map to a sequence of exactly two characters. Multi-code
point ranges cannot have two-way mappings.

A one-way mapping can map to zero, one, two or more characters. Mapping to zero
characters removes the original character in normalization.

The generator tool applies the mappings recursively to each other. Groups of
mappings that are forbidden by the Unicode Normalization algorithms are reported
as errors. For example, if a character has a two-way mapping, then neither of
its mapping characters can have a one-way mapping.

```
* Unicode 6.1          # Optional Unicode version (since ICU 49; default: uchar.h U_UNICODE_VERSION)
00E1=0061 0301         # Two-way mapping
00AA>0061              # One-way mapping
0300..0314:230         # ccc for a code point range
0315:232               # ccc for a single code point
0132..0133>0069 006A   # Range, each code point mapping to "ij"
E0000..E0FFF>          # Range, each code point mapping to the empty string
```

It is possible to override mappings from previous source files, including
removing a mapping:

```
00AA-
E0000..E0FFF-
```

## Data Generation Tool

Normally, data from one or more input files is combined as described above,
processed, and a binary data file is written for use by the ICU library (same
file for C++ and Java). The binary data file format changes occasionally in
order to support additional functionality.

```shell
bin/gennorm2 -v -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
```

For the complete set of options, invoke `gennorm2 --help`.

Instead of the binary data file, the processed data can be written into a C
file. This is closely tied to the needs of the ICU library. The format may
change from one ICU version to the next.

```shell
bin/gennorm2 -v -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
```

With the `--combined` option, gennorm2 writes the combined data of the input
files. The following example writes the combined NFKC_Casefold data. (New in ICU
60.)

```shell
bin/gennorm2 -o /tmp/nfkc_cf.txt -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt --combined
```

With the `minus` operator, gennorm2 writes the differences between the combined
data from two sets of input files. (New in ICU 60.)

For example, the nfkc_cf.txt file in ICU contains the Unicode NFKC_CF mappings,
extracted from the UCD file DerivedNormalizationProps.txt. It is not minimal.
The following command line generates the minimal differences of NFKC_Casefold
compared with NFKC.

```shell
bin/gennorm2 -o /tmp/nfkc_cf-minus-nfkc.txt -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt minus nfc.txt nfkc.txt
```

## Example

```cpp
#include "unicode/normalizer2.h"
#include "unicode/unistr.h"

using icu::Normalizer2;
using icu::UnicodeString;

class NormSample {
public:
    // ICU service objects should be cached and reused, as usual.
    NormSample(UErrorCode &errorCode)
            : nfkc(*Normalizer2::getNFKCInstance(errorCode)),
              fcd(*Normalizer2::getInstance(nullptr, "nfc", UNORM2_FCD, errorCode)) {}

    // Normalize a string.
    UnicodeString toNFKC(const UnicodeString &s, UErrorCode &errorCode) {
        return nfkc.normalize(s, errorCode);
    }

    // Ensure FCD before processing (like in sort key generation).
    // In practice, almost all strings pass the FCD test, so it might make sense to
    // test for it and only normalize when necessary, rather than always normalizing.
    void processText(const UnicodeString &s, UErrorCode &errorCode) {
        UnicodeString fcdString;
        const UnicodeString *ps;  // points to either s or fcdString
        int32_t spanQCYes=fcd.spanQuickCheckYes(s, errorCode);
        if(U_FAILURE(errorCode)) {
            return;  // report error
        }
        if(spanQCYes==s.length()) {
            ps=&s;  // s is already in FCD
        } else {
            // unnormalized suffix as a read-only alias (does not copy characters)
            UnicodeString unnormalized=s.tempSubString(spanQCYes);
            // set the fcdString to the FCD prefix as a read-only alias
            fcdString.setTo(false, s.getBuffer(), spanQCYes);
            // automatic copy-on-write, and append the FCD'ed suffix
            fcd.normalizeSecondAndAppend(fcdString, unnormalized, errorCode);
            ps=&fcdString;
            if(U_FAILURE(errorCode)) {
                return;  // report error
            }
        }
        // ... now process the string *ps which is in FCD ...
    }
private:
    const Normalizer2 &nfkc;
    const Normalizer2 &fcd;
};
```
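
The line syntax from the Data File Syntax section can also be illustrated with a small standalone parser. The following is only a sketch and is not how gennorm2 itself is implemented: the names `Entry`, `EntryKind`, `parseHex` and `parseLine` are hypothetical, it uses only the C++ standard library, and it checks just a few of the documented rules (two-way mappings need exactly two target characters and a single code point, and a ccc value must not be 0). Limiting ccc values to three decimal digits is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; not part of ICU or gennorm2.
enum class EntryKind { TwoWayMapping, OneWayMapping, CombiningClass, Removal };

struct Entry {
    std::uint32_t start = 0;             // first code point of the range
    std::uint32_t end = 0;               // last code point (same as start for a single one)
    EntryKind kind = EntryKind::Removal;
    std::vector<std::uint32_t> mapping;  // target code points for '=' and '>'
    int ccc = 0;                         // combining class for ':'
};

static std::string trim(const std::string &s) {
    const char *ws = " \t\r\n";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Parses 1 to 6 hexadecimal digits into a code point up to U+10FFFF.
static bool parseHex(const std::string &token, std::uint32_t &value) {
    if (token.empty() || token.size() > 6) return false;
    std::uint32_t v = 0;
    for (char c : token) {
        std::uint32_t digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else return false;
        v = v * 16 + digit;
    }
    if (v > 0x10FFFF) return false;
    value = v;
    return true;
}

// Returns true and fills `out` for a data line. Returns false for blank lines,
// comment-only lines, the "* Unicode ..." version line, and malformed lines.
bool parseLine(const std::string &line, Entry &out) {
    std::string s = trim(line.substr(0, line.find('#')));  // drop the comment
    if (s.empty() || s[0] == '*') return false;
    std::size_t op = s.find_first_of("=>:-");
    if (op == std::string::npos) return false;
    std::string range = trim(s.substr(0, op));
    std::string rest = trim(s.substr(op + 1));

    Entry e;
    std::size_t dots = range.find("..");
    if (!parseHex(trim(range.substr(0, dots)), e.start)) return false;
    if (dots == std::string::npos) {
        e.end = e.start;
    } else if (!parseHex(trim(range.substr(dots + 2)), e.end) || e.end < e.start) {
        return false;
    }

    switch (s[op]) {
    case '=':
    case '>': {
        e.kind = s[op] == '=' ? EntryKind::TwoWayMapping : EntryKind::OneWayMapping;
        std::istringstream tokens(rest);
        std::string token;
        while (tokens >> token) {
            std::uint32_t cp;
            if (!parseHex(token, cp)) return false;
            e.mapping.push_back(cp);
        }
        // Two-way: exactly two target characters, and no multi-code point range.
        if (s[op] == '=' && (e.mapping.size() != 2 || e.start != e.end)) return false;
        break;
    }
    case ':': {
        if (rest.empty() || rest.size() > 3) return false;
        for (char c : rest) {
            if (c < '0' || c > '9') return false;
            e.ccc = e.ccc * 10 + (c - '0');
        }
        if (e.ccc == 0) return false;  // a ccc value must not be 0
        e.kind = EntryKind::CombiningClass;
        break;
    }
    default:  // '-' removes a mapping from a previous file
        if (!rest.empty()) return false;
        e.kind = EntryKind::Removal;
        break;
    }
    out = e;
    return true;
}
```

Calling `parseLine("00E1=0061 0301", entry)` fills `entry` with a two-way mapping for U+00E1, while a line such as `0132..0133=0069 006A` is rejected because ranges cannot have two-way mappings. The sketch does not apply mappings recursively or detect the forbidden mapping combinations that gennorm2 reports as errors.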