---
layout: default
title: Normalization
nav_order: 3
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Normalization
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

Normalization is used to convert text to a unique, equivalent form. Software can
normalize equivalent strings to one particular sequence, such as normalizing
composite character sequences into pre-composed characters.

Normalization allows for easier sorting and searching of text. The ICU
normalization APIs support the standard normalization forms which are described
in great detail in [Unicode Technical Report #15 (Unicode Normalization
Forms)](http://www.unicode.org/reports/tr15/) and the Normalization, Sorting and
Searching sections of chapter 5 of the [Unicode
Standard](http://www.unicode.org/versions/latest/). ICU also supports related,
additional operations. Some of them are described in [Unicode Technical Note #5
(Canonical Equivalence in Applications)](http://www.unicode.org/notes/tn5/).

## New API

ICU 4.4 adds the Normalizer2 API (in
[Java](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/Normalizer2.html),
[C++](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNormalizer2.html) and
[C](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html)), replacing almost all
of the old Normalizer API. There is a [design
doc](http://site.icu-project.org/design/normalization/custom) with many details.
All of the replaced old API is now implemented as a thin wrapper around the new
API.

Here is a summary of the differences:

*   Custom data: The new API uses non-static functions. A Normalizer2 instance
    can be created from standard Unicode normalization data, or from a custom
    (application-specific) data file with custom data processed by the new
    gennorm2 tool.
    *   Examples for possible custom data include UTS #46 IDNA mappings, MacOS X
        file system normalization, and a combination of NFKC with case folding
        (see the Unicode FC_NFKC_Closure property).
    *   By using a single data file and a single processing step for
        combinations like NFKC + case folding, the performance for such
        operations is improved.
*   NFKC_Casefold: ICU 4.4 supports the combination of NFKC, case folding and
    removing ignorable characters which was introduced with Unicode 5.2.
*   The old unorm.icu data file (used in Java, was hardcoded in the common
    library in C/C++) has been replaced with two new files, nfc.nrm and
    nfkc.nrm. If only canonical or only compatibility mappings are needed, then
    the other data file can be removed. There is also a new nfkc_cf.nrm file for
    NFKC_Casefold.
*   FCD: The old API supports [FCD
    processing](http://www.unicode.org/notes/tn5/#FCD) only for NFC/NFD data.
    Normalizer2 supports it for any data file, including NFKC/NFKD and custom
    data.
*   FCC: Normalizer2 optionally supports [contiguous
    composition](http://www.unicode.org/notes/tn5/#FCC) which is almost the same
    as NFC/NFKC except that the normalized form also passes the FCD test. This
    is also supported for any standard or custom data file.
*   Quick check: There is a new `spanQuickCheckYes()` function for an optimized
    combination of quick check and normalization.
*   Filtered: The new FilteredNormalizer2 class combines a Normalizer2 instance
    with a UnicodeSet to limit normalization to certain characters. For example,
    the old API's UNICODE_3_2 option is implemented via a FilteredNormalizer2
    using a UnicodeSet with the pattern `[:age=3.2:]`. (In other words, Unicode
    3.2 normalization now requires the uprops.icu data.)
*   Ease of use: In general, the switch to a factory method, otherwise
    non-static functions, and multiple data files, simplifies all of the
    function signatures.
*   Iteration: Support for iterative normalization is now provided by functions
    that test properties of code points, rather than requiring a particular type
    of ICU character iterator. In any case, the old implementation simply
    fetched the code points and used equivalent code point test functions. The
    new API also provides a wider variety of such test functions.
*   String interfaces: In Java, input parameters are now CharSequence
    references, and output is to StringBuilder or Appendable.

The new API does not replace a few pieces of the old API:

*   The string comparison functions are still provided only on the old API,
    although reimplemented using the new code. They use multiple Normalizer2
    instances (FCD and NFD) and are therefore a poor fit for the new Normalizer2
    class. If necessary, a modernized replacement taking multiple Normalizer2
    instances as parameters is possible, but not planned.
*   The old QuickCheck return values are used by the new API as well.

## Data File Syntax

The gennorm2 tool accepts one or more .txt files and generates a .nrm binary
data file for `Normalizer2.getInstance()`. For gennorm2 command line options,
invoke `gennorm2 --help`.

gennorm2 starts with no data. If you want to include standard Unicode
Normalization data, use the files in
[{ICU4C}/source/data/unidata/norm2/](https://github.com/unicode-org/icu/tree/master/icu4c/source/data/unidata/norm2).
You can modify one of them, or provide it together with one or more additional
files that add or remove mappings.

Hangul/Jamo data (mappings and ccc=0) are predefined and cannot be modified.

Mappings in one text file can override mappings in previous files of the same
gennorm2 invocation.

Comments start with #. White space between tokens is ignored. Characters are
written as hexadecimal code points. Combining class values are written as
decimal numbers.

In each file, each character can have at most one mapping and at most one ccc
(canonical combining class) value. A ccc value must not be 0. (ccc=0 is the
default.)

Each line defines data for either a single code point (`00E1`) or a range of
code points (`0300..0314`).

A two-way mapping must map to a sequence of exactly two characters. Multi-code
point ranges cannot have two-way mappings.

A one-way mapping can map to zero, one, two or more characters. Mapping to zero
characters removes the original character in normalization.

The generator tool applies the mappings recursively to one another. Groups of
mappings that are forbidden by the Unicode Normalization algorithms are reported
as errors. For example, if a character has a two-way mapping, then neither of
its mapping characters can have a one-way mapping.

```
* Unicode 6.1         # Optional Unicode version (since ICU 49; default: uchar.h U_UNICODE_VERSION)
00E1=0061 0301        # Two-way mapping
00AA>0061             # One-way mapping
0300..0314:230        # ccc for a code point range
0315:232              # ccc for a single code point
0132..0133>0069 006A  # Range, each code point mapping to "ij"
E0000..E0FFF>         # Range, each code point mapping to the empty string
```

It is possible to override mappings from previous source files, including
removing a mapping:

```
    00AA-
    E0000..E0FFF-
```

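To make the line syntax and the override rule more concrete, here is a toy
reader for mapping lines, written in standard C++ without ICU. It is only a
sketch of the rules described above and not how gennorm2 is implemented. The
names `Mapping`, `MappingTable` and `applyLine` are invented for this example,
and ccc lines and the `* Unicode` version line are skipped rather than modeled.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

using CodePoint = std::uint32_t;

// Toy model of one mapping; not gennorm2's real data structure.
struct Mapping {
    bool twoWay;                    // true for '=', false for '>'
    std::vector<CodePoint> target;  // mapped-to code points (may be empty for '>')
};
using MappingTable = std::map<CodePoint, Mapping>;

static CodePoint parseHex(const std::string &s) {
    // std::stoul skips leading white space and stops at the first non-hex character.
    return static_cast<CodePoint>(std::stoul(s, nullptr, 16));
}

// Applies one input line to the table. Later lines override earlier ones,
// and "cp-" or "start..end-" removes mappings again.
void applyLine(const std::string &rawLine, MappingTable &table) {
    const std::string line = rawLine.substr(0, rawLine.find('#'));  // drop comment
    const std::size_t opPos = line.find_first_of("=>-:");
    if (opPos == std::string::npos) {
        return;  // blank line, or a line this sketch does not model
    }
    const char op = line[opPos];
    if (op == ':') {
        return;  // ccc values are not modeled here
    }
    const std::string left = line.substr(0, opPos);
    const std::size_t dots = left.find("..");
    const CodePoint start = parseHex(left.substr(0, dots));
    const CodePoint end =
        (dots == std::string::npos) ? start : parseHex(left.substr(dots + 2));
    if (op == '-') {
        for (CodePoint cp = start; cp <= end; ++cp) table.erase(cp);
        return;
    }
    std::vector<CodePoint> target;
    std::istringstream in(line.substr(opPos + 1));
    in >> std::hex;
    for (CodePoint cp; in >> cp;) target.push_back(cp);
    const bool twoWay = (op == '=');
    // A two-way mapping needs a single code point and exactly two target characters.
    if (twoWay && (start != end || target.size() != 2)) {
        throw std::invalid_argument("invalid two-way mapping");
    }
    for (CodePoint cp = start; cp <= end; ++cp) table[cp] = Mapping{twoWay, target};
}
```

Feeding the lines of several files through `applyLine()` in order mimics the
"later files override earlier files" behavior. The recursive application of
mappings and the consistency checks that gennorm2 performs are deliberately
left out.
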
## Data Generation Tool

Normally, data from one or more input files is combined as described above,
processed, and a binary data file is written for use by the ICU library (same
file for C++ and Java). The binary data file format changes occasionally in
order to support additional functionality.

```shell
    bin/gennorm2 -v -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
```

For the complete set of options, invoke `gennorm2 --help`.

Instead of the binary data file, the processed data can be written into a C
file. This is closely tied to the needs of the ICU library. The format may
change from one ICU version to the next.

```shell
    bin/gennorm2 -v -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
```

With the --combined option, gennorm2 writes the combined data of the input
files. The following example writes the combined NFKC_Casefold data. (New in ICU
60.)

```shell
    bin/gennorm2 -o /tmp/nfkc_cf.txt -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt --combined
```

With the "minus" operator, gennorm2 writes the diffs of the combined data from
two sets of input files. (New in ICU 60.)

For example, the nfkc_cf.txt file in ICU contains the Unicode NFKC_CF mappings,
extracted from the UCD file DerivedNormalizationProps.txt. It is not minimal.
The following command line generates the minimal differences of NFKC_Casefold
compared with NFKC.

```shell
    bin/gennorm2 -o /tmp/nfkc_cf-minus-nfkc.txt -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt minus nfc.txt nfkc.txt
```

## Example

```cpp
#include "unicode/normalizer2.h"
#include "unicode/unistr.h"

using icu::Normalizer2;
using icu::UnicodeString;

class NormSample {
public:
    // ICU service objects should be cached and reused, as usual.
    NormSample(UErrorCode &errorCode)
        : nfkc(*Normalizer2::getNFKCInstance(errorCode)),
          fcd(*Normalizer2::getInstance(nullptr, "nfc", UNORM2_FCD, errorCode)) {}

    // Normalize a string.
    UnicodeString toNFKC(const UnicodeString &s, UErrorCode &errorCode) {
        return nfkc.normalize(s, errorCode);
    }

    // Ensure FCD before processing (like in sort key generation).
    // In practice, almost all strings pass the FCD test, so it might make sense to
    // test for it and only normalize when necessary, rather than always normalizing.
    void processText(const UnicodeString &s, UErrorCode &errorCode) {
        UnicodeString fcdString;
        const UnicodeString *ps;  // points to either s or fcdString
        int32_t spanQCYes=fcd.spanQuickCheckYes(s, errorCode);
        if(U_FAILURE(errorCode)) {
            return;  // report error
        }
        if(spanQCYes==s.length()) {
            ps=&s;  // s is already in FCD
        } else {
            // unnormalized suffix as a read-only alias (does not copy characters)
            UnicodeString unnormalized=s.tempSubString(spanQCYes);
            // set the fcdString to the FCD prefix as a read-only alias
            fcdString.setTo(false, s.getBuffer(), spanQCYes);
            // automatic copy-on-write, and append the FCD'ed suffix
            fcd.normalizeSecondAndAppend(fcdString, unnormalized, errorCode);
            ps=&fcdString;
            if(U_FAILURE(errorCode)) {
                return;  // report error
            }
        }
        // ... now process the string *ps which is in FCD ...
    }
private:
    const Normalizer2 &nfkc;
    const Normalizer2 &fcd;
};
```
