---
layout: default
title: Case Mappings
nav_order: 1
parent: Transforms
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Case Mappings
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

Case mapping converts characters between their uppercase, lowercase, and title
case forms for a given language. Case is a normative property of characters
in specific alphabets (e.g. Latin, Greek, Cyrillic, Armenian, and Georgian)
whereby characters are considered to be variants of a single letter. ICU refers
to these variants, which may differ markedly in shape and size, as uppercase
letters (also known as capital or majuscule) and lowercase letters (also known
as small or minuscule). Alphabets with case differences are called bicameral and
alphabets without case differences are called unicameral.

Due to the inclusion of certain composite characters for compatibility, such as
the Latin capital letter 'DZ' (\\u01F1 'DZ'), there is a third case called title
case. Title case is used to capitalize the first character of a word, as in the
Latin capital letter 'D' with small letter 'z' (\\u01F2 'Dz'). The term "title
case" can also be used to refer to words whose first letter is an uppercase or
title case letter and the rest are lowercase letters. However, not all words in
the title of a document or first words in a sentence will be title case. The use
of title case words is language dependent. For example, in English, "Taming of
the Shrew" would be the appropriate capitalization and not "Taming Of The
Shrew".
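
The three forms of the DZ digraph can be inspected with a short sketch. It uses Python's built-in `str` methods and the `unicodedata` module rather than ICU, on the assumption that both draw on the same Unicode Character Database mappings. The helper name `case_forms` is made up for this illustration.

```python
import unicodedata


def case_forms(ch):
    """Return (general category, uppercase, title case, lowercase) for one character."""
    return (unicodedata.category(ch), ch.upper(), ch.title(), ch.lower())


# The DZ digraph exists in three case forms: U+01F1, U+01F2 and U+01F3.
for code_point in (0x01F1, 0x01F2, 0x01F3):
    ch = chr(code_point)
    category, upper, title, lower = case_forms(ch)
    print(f"U+{code_point:04X} {unicodedata.name(ch)}: category={category}, "
          f"upper=U+{ord(upper):04X}, title=U+{ord(title):04X}, "
          f"lower=U+{ord(lower):04X}")

# Mechanical word-by-word title casing ignores language conventions.
print("taming of the shrew".title())  # prints: Taming Of The Shrew
```

The last line shows why title casing a whole phrase word by word does not give the capitalization an English reader expects.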

> :point_right: **Note**: *As of Unicode 11, Georgian has Mkhedruli (lowercase) and Mtavruli
(uppercase) letters, which form case pairs but are not used in title case.*
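
The behavior in the note can be observed with a small sketch. It assumes a Python build whose bundled Unicode data is version 11 or later (Python 3.7 and up), and uses Python's built-in mappings as a stand-in for ICU.

```python
import unicodedata

# "Tbilisi" written in Mkhedruli letters (U+10D0..U+10FF block).
word = "\u10d7\u10d1\u10d8\u10da\u10d8\u10e1\u10d8"

upper = word.upper()  # Mtavruli letters (U+1C90..U+1CBF block)
title = word.title()  # unchanged: Mkhedruli letters map to themselves in title case

print("Unicode data version:", unicodedata.unidata_version)
print(" ".join(f"U+{ord(c):04X}" for c in upper))
print("title case leaves the word unchanged:", title == word)
```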

Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ustring/ustring.cpp).

Please refer to the following sections in [The Unicode Standard](http://www.unicode.org/versions/latest/)
for more information about case mapping:

*   3.13 Default Case Algorithms
*   4.2 Case
*   5.18 Case Mappings

## Simple (Single-Character) Case Mapping

The general case mapping in ICU is language-independent and maps each character
one-to-one to a single character.

A character is considered to have a lowercase, uppercase, or title case
equivalent if there is a respective "simple" case mapping specified for the
character in the [Unicode Character Database](http://www.unicode.org/ucd/) (UnicodeData.txt).
If a character has no mapping equivalent, the result is the character itself.
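
A minimal sketch of the one-to-one behavior follows. Python does not expose the simple mappings separately, so the sample characters were chosen (an assumption of this sketch) so that their simple and full mappings coincide. `simple_forms` is a hypothetical helper, not an ICU function.

```python
def simple_forms(ch):
    """Return (lowercase, uppercase, title case) for a single character."""
    return (ch.lower(), ch.upper(), ch.title())


# a, É, я have case mappings; the digit 7 and the ideograph 中 do not.
for ch in ("a", "\u00c9", "\u044f", "7", "\u4e2d"):
    lower, upper, title = simple_forms(ch)
    print(f"U+{ord(ch):04X}: lower={lower} upper={upper} title={title}")
```

Characters without a mapping, such as the digit and the ideograph, come back unchanged in all three positions.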

The APIs provided for the general case mapping, located in the `uchar.h` file, handle
only single characters of type `UChar32` and return only single characters. To
convert a string to a specific case without language-specific behavior, use the APIs in either
the `unistr.h` or `ustring.h` files with a `NULL` argument locale.

## Full (Language-Specific) Case Mapping

There are different case mappings for different locales. For instance, unlike
in English, the uppercase equivalent of Latin small letter 'i' in Turkish is Latin
capital letter 'I' with dot above (\\u0130 'İ').
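
The Turkish example can be sketched as follows. Python's built-in case conversion is not locale-aware, so the helpers `to_upper` and `to_lower` and their `language` parameter are hypothetical and hand-rolled for illustration. They handle only the precomposed dotted and dotless letters, not combining sequences, and are not a substitute for ICU's locale-aware APIs.

```python
TURKIC_LANGUAGES = {"tr", "az"}  # Turkish and Azerbaijani


def to_upper(text, language=""):
    """Uppercase text; Turkic languages map 'i' to U+0130 instead of 'I'."""
    if language in TURKIC_LANGUAGES:
        text = text.replace("i", "\u0130")
    return text.upper()


def to_lower(text, language=""):
    """Lowercase text; Turkic languages map 'I' to U+0131 and U+0130 to 'i'."""
    if language in TURKIC_LANGUAGES:
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


print(to_upper("istanbul"))        # ISTANBUL
print(to_upper("istanbul", "tr"))  # İSTANBUL
print(to_lower("ISPARTA", "tr"))   # ısparta
```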

Similar to the simple case mapping API, a character is considered to have a
lowercase, uppercase, or title case equivalent if there is a respective mapping
specified for the character in the Unicode Character Database (UnicodeData.txt,
supplemented by SpecialCasing.txt for multi-character and conditional mappings).
If a character has no mapping equivalent, the result is the character itself.

To convert a string to a specific case using language-specific rules, use the APIs in `ustring.h`
and `unistr.h` with the intended locale argument.

ICU implements full Unicode string case mappings.

**In general:**

*   **case mapping can change the number of code points and/or code units of a
    string,**
*   **is language-sensitive (results may differ depending on language), and**
*   **is context-sensitive (a character in the input string may map differently
    depending on surrounding characters).**
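
Two of these properties can be seen with Python's built-in full case mappings, used here as a stand-in for ICU. Language sensitivity is not shown, since Python's `str` methods take no locale. `utf16_length` is a helper introduced only for this sketch.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


# 1. Length can change: sharp s uppercases to two letters.
street = "stra\u00dfe"  # "straße"
shouted = street.upper()
print(shouted, len(street), len(shouted))                    # STRASSE 6 7
print(utf16_length(street), utf16_length(shouted))           # 6 7

# 2. Context sensitivity: capital sigma lowercases to the final form
#    (U+03C2) at the end of a word, and to U+03C3 elsewhere.
word = "\u039f\u0394\u039f\u03a3"  # "ΟΔΟΣ"
print(word.lower())                # ends with final sigma
print("\u03a3".lower())            # a lone sigma becomes U+03C3
```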

## Case Folding

Case folding maps strings to a canonical form where case differences are erased.
Using the case folding API, ICU supports fast matches without regard to case in
lookups, since only binary comparison is required once the strings are folded.

The CaseFolding.txt file in the Unicode Character Database is used for
performing locale-independent case folding. This text file is generated from the
case mappings in the Unicode Character Database, using both the single-character
and the multi-character mappings. The mappings in CaseFolding.txt transform all
characters having different case forms into a common form. To compare two
strings for case-insensitive matching, you can fold each string and then
use a binary comparison. There are also functions to compare two strings
case-insensitively using the same case folding data.
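
The fold-then-compare approach can be sketched with Python's `str.casefold()`, which applies Unicode full case folding; ICU itself is not used here. The names `caseless_equal`, `index`, and `lookup` are illustrative only.

```python
def caseless_equal(a, b):
    """Compare two strings after folding both to a common form."""
    return a.casefold() == b.casefold()


print(caseless_equal("Stra\u00dfe", "STRASSE"))    # True
print("Stra\u00dfe".lower() == "STRASSE".lower())  # False: lowercasing is not folding

# Folding keys once up front turns later lookups into plain binary comparisons.
index = {}
for name in ("Stra\u00dfe", "\u039f\u0394\u039f\u03a3"):  # "Straße", "ΟΔΟΣ"
    index[name.casefold()] = name


def lookup(query):
    """Return the stored spelling that matches query without regard to case."""
    return index.get(query.casefold())


print(lookup("strasse"))
print(lookup("\u03bf\u03b4\u03bf\u03c2"))  # lowercase query with final sigma
```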

Unicode case folding is not context-sensitive. It is also not
language-sensitive, although there is a flag for whether to apply special
mappings for use with Turkic (Turkish/Azerbaijani) text data.
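
The effect of such a flag can be sketched as follows. The function `fold_case` and its `turkic` parameter are hypothetical, not ICU's API. The sketch applies the two mappings marked with status `T` in CaseFolding.txt before delegating to Python's default folding.

```python
def fold_case(text, turkic=False):
    """Case-fold text; with turkic=True, apply the Turkic 'T' mappings first."""
    if turkic:
        # CaseFolding.txt status T: U+0049 -> U+0131 and U+0130 -> U+0069.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.casefold()


print(fold_case("I"))                    # i
print(fold_case("I", turkic=True))       # ı (dotless i)
print(len(fold_case("\u0130")))          # 2: 'i' followed by combining dot above
print(fold_case("\u0130", turkic=True))  # i

# Folding ignores context: both sigma forms fold to U+03C3 in any position.
print(fold_case("\u03a3") == fold_case("\u03c2"))  # True
```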

The case folding APIs are declared in:

1.  `uchar.h` for single character folding

2.  `ustring.h` and `unistr.h` for character string folding