---
layout: default
title: Properties
nav_order: 2
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Properties

## Overview

Text processing requires that a program treat text appropriately. If text is
exchanged between several systems, it is important for them to process the text
consistently. This is done by assigning each character, or a range of
characters, attributes or properties used for text processing, and by defining
standard algorithms for at least the basic text operations.

Traditionally, such attributes and algorithms have not been well-defined for
most character sets, and text processing had to rely on ad-hoc solutions. Over
time, standards were created for querying properties of the system codepage.
However, the set of these properties was limited. Their data was not coordinated
among implementations, and standard algorithms were not available.

It is one of the strengths of Unicode that it not only defines a very large
character set, but also assigns a comprehensive set of properties and usage
notes to all characters. It defines standard algorithms for critical text
processing, and the data is publicly provided and kept up-to-date. See
https://www.unicode.org/ and https://www.unicode.org/main.html for more information.

Sample code is available in the ICU source code library at
[icu4c/source/samples/props/props.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/props/props.cpp).
See also the source code for the [Unicode
browser](https://github.com/unicode-org/icu-demos/tree/main/ubrowse) demo
application, which can be used
[online](https://icu4c-demos.unicode.org/icu-bin/ubrowse) to browse Unicode
characters with their properties.

## Unicode Character Database properties in ICU APIs

The following table shows all Unicode Character Database properties (except for
purely "extracted" ones and Unihan properties) and the corresponding ICU APIs.
Most of the time, ICU4C provides functions in
icu4c/source/common/unicode/uchar.h and ICU4J provides parallel functions in the
com.ibm.icu.lang.UCharacter class. Properties of a single Unicode character are
accessed by its 21-bit code point value (type: UChar32=int32_t in C/C++, int in
Java).

[Surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
mostly have default property values, except for the General_Category (gc=Cs).

For integer values outside the Unicode code point range (negative or ≥
0x110000), most API functions return null values (false, 0, etc.). API functions
that map a code point to another (e.g., u_foldCase()/UCharacter.foldCase())
normally return out-of-range values unchanged (i.e., map them to themselves),
just as they do for unassigned code points and, more generally, for code points
that have no specific mappings. In particular, -1 (=U_SENTINEL in ICU4C) is
mapped to -1.

Most properties are also available via UnicodeSet APIs and patterns. See the
Lookup section below.

See [UAX #44, Unicode Character
Database](https://www.unicode.org/reports/tr44/#Properties) itself for
comparison. The UCD files
[PropertyAliases.txt](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
and
[PropertyValueAliases.txt](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
list all properties and their values by name and type.

UAX #44 also shows which UCD files have data for which properties,
and many other useful details.

Most properties that use binary, integer, or enumerated values are available via
the functions u_hasBinaryProperty and u_getIntPropertyValue, which take UProperty
enum constants to select the property. (ICU4J UCharacter member functions do not
have the "u_" prefix.) The constant names include the long property name
according to PropertyAliases.txt, e.g., UCHAR_LINE_BREAK. Corresponding property
value enum constant names often contain the short property name and the long
value name, e.g., U_LB_LINE_FEED. For enumeration/integer type properties, the
enumeration result type is also listed in the table.

Some UnicodeSet APIs use the same UProperty constants. Other UnicodeSet APIs and
UnicodeSet and regular expression patterns use the long or short property
aliases and property value aliases (see PropertyAliases.txt and
PropertyValueAliases.txt).

There is one pseudo-property, UCHAR_GENERAL_CATEGORY_MASK, for which the APIs do
not use a single value but a bit-set (a mask) of zero or more values, with each
bit corresponding to one UCHAR_GENERAL_CATEGORY value. This allows ICU to
represent property value aliases for multiple general categories, like "Letters"
(which stands for "Uppercase Letters", "Lowercase Letters", etc.). In other
words, there are two ICU properties for the same Unicode property, one
delivering single values (for per-code point lookup) and the other delivering
sets of values (for use with value aliases and UnicodeSet).
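
The relationship between a single general category value and a general category mask can be sketched as follows. This is only an illustration, not ICU code: it uses the JDK's java.lang.Character instead of ICU4J's UCharacter so that it runs without ICU on the classpath, and the JDK may follow a different Unicode version than ICU. The class name `GcMaskSketch`, the constant `LETTER_MASK`, and the helper method names are hypothetical.

```java
final class GcMaskSketch {
    // Bit-set of the five "Letter" general categories (Lu, Ll, Lt, Lm, Lo).
    // Hypothetical constant; it plays the role of a "Letters" value alias.
    static final int LETTER_MASK =
            (1 << Character.UPPERCASE_LETTER)
            | (1 << Character.LOWERCASE_LETTER)
            | (1 << Character.TITLECASE_LETTER)
            | (1 << Character.MODIFIER_LETTER)
            | (1 << Character.OTHER_LETTER);

    // Single-value lookup: exactly one general category per code point.
    static int generalCategory(int codePoint) {
        return Character.getType(codePoint);
    }

    // Mask lookup: a one-bit mask that stands for the same category.
    static int generalCategoryMask(int codePoint) {
        return 1 << Character.getType(codePoint);
    }

    // A multi-category test is a single AND against the mask.
    static boolean isLetter(int codePoint) {
        return (generalCategoryMask(codePoint) & LETTER_MASK) != 0;
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 'z', '5', 0x4E2D, 0xD800 };
        for (int cp : samples) {
            System.out.printf("U+%04X gc=%d letter=%b alphabetic=%b%n",
                    cp, generalCategory(cp), isLetter(cp),
                    Character.isAlphabetic(cp));
        }
    }
}
```

The per-code point function answers "which category is this?", while the mask form answers "is this in any of these categories?" with one bit test, which is the reason for having both forms.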

| UCD Name | Type | Note | ICU4C uchar.h / ICU4J UCharacter |
|--------------|--------|-----|------------------------------|
| Age | Unicode version | (U) | C: u_charAge fills in UVersionInfo<br>Java: getAge returns a VersionInfo reference |
| Alphabetic | binary | (U) | u_isUAlphabetic, UCHAR_ALPHABETIC |
| ASCII_Hex_Digit | binary | (U) | UCHAR_ASCII_HEX_DIGIT |
| Basic_Emoji* | binary | (U) | UCHAR_BASIC_EMOJI |
| Bidi_Class | enum | (U) | u_charDirection, UCHAR_BIDI_CLASS<br>returns enum UCharDirection |
| Bidi_Control | binary | (U) | UCHAR_BIDI_CONTROL |
| Bidi_Mirrored | binary | (U) | u_isMirrored, UCHAR_BIDI_MIRRORED |
| Bidi_Mirroring_Glyph | code point | | u_charMirror |
| Bidi_Paired_Bracket_Type | enum | (U) | UCHAR_BIDI_PAIRED_BRACKET_TYPE<br>returns enum UBidiPairedBracketType |
| Block | enum | (U) | ublock_getCode, UCHAR_BLOCK<br>returns enum UBlockCode |
| Canonical_Combining_Class | 0..255 | (U) | u_getCombiningClass, UCHAR_CANONICAL_COMBINING_CLASS |
| Case_Folding | Unicode string | | u_strFoldCase (ustring.h) |
| Case_Ignorable | binary | (U) | UCHAR_CASE_IGNORABLE |
| Cased | binary | (U) | UCHAR_CASED |
| Changes_When_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_CASEFOLDED |
| Changes_When_Casemapped | binary | (U) | UCHAR_CHANGES_WHEN_CASEMAPPED |
| Changes_When_NFKC_Casefolded | binary | (U) | UCHAR_CHANGES_WHEN_NFKC_CASEFOLDED |
| Changes_When_Lowercased | binary | (U) | UCHAR_CHANGES_WHEN_LOWERCASED |
| Changes_When_Titlecased | binary | (U) | UCHAR_CHANGES_WHEN_TITLECASED |
| Changes_When_Uppercased | binary | (U) | UCHAR_CHANGES_WHEN_UPPERCASED |
| Composition_Exclusion | binary | (c) | contributes to Full_Composition_Exclusion |
| Dash | binary | (U) | UCHAR_DASH |
| Decomposition_Mapping | Unicode string | | NFKC Normalizer2::getRawDecomposition() |
| Decomposition_Type | enum | (U) | UCHAR_DECOMPOSITION_TYPE<br>returns enum UDecompositionType |
| Default_Ignorable_Code_Point | binary | (U) | UCHAR_DEFAULT_IGNORABLE_CODE_POINT |
| Deprecated | binary | (U) | UCHAR_DEPRECATED |
| Diacritic | binary | (U) | UCHAR_DIACRITIC |
| East_Asian_Width | enum | (U) | UCHAR_EAST_ASIAN_WIDTH<br>returns enum UEastAsianWidth |
| Emoji | binary | (U) | UCHAR_EMOJI |
| Emoji_Component | binary | (U) | UCHAR_EMOJI_COMPONENT |
| Emoji_Keycap_Sequence* | binary | (U) | UCHAR_EMOJI_KEYCAP_SEQUENCE |
| Emoji_Modifier | binary | (U) | UCHAR_EMOJI_MODIFIER |
| Emoji_Modifier_Base | binary | (U) | UCHAR_EMOJI_MODIFIER_BASE |
| Emoji_Presentation | binary | (U) | UCHAR_EMOJI_PRESENTATION |
| Expands_On_NF* | binary | | available via normalization API (normalizer2.h) |
| Extended_Pictographic | binary | (U) | UCHAR_EXTENDED_PICTOGRAPHIC |
| Extender | binary | (U) | UCHAR_EXTENDER |
| FC_NFKC_Closure | Unicode string | | u_getFC_NFKC_Closure |
| Full_Composition_Exclusion | binary | (U) | UCHAR_FULL_COMPOSITION_EXCLUSION |
| General_Category | enum | (U) | u_charType, UCHAR_GENERAL_CATEGORY, UCHAR_GENERAL_CATEGORY_MASK<br>returns enum UCharCategory |
| Grapheme_Base | binary | (U) | UCHAR_GRAPHEME_BASE |
| Grapheme_Cluster_Break | enum | (U) | UCHAR_GRAPHEME_CLUSTER_BREAK<br>returns enum UGraphemeClusterBreak |
| Grapheme_Extend | binary | (U) | UCHAR_GRAPHEME_EXTEND |
| Grapheme_Link | binary | (U) | UCHAR_GRAPHEME_LINK |
| Hangul_Syllable_Type | enum | (U) | UCHAR_HANGUL_SYLLABLE_TYPE<br>returns enum UHangulSyllableType |
| Hex_Digit | binary | (U) | UCHAR_HEX_DIGIT |
| Hyphen | binary | (U) | UCHAR_HYPHEN |
| ID_Continue | binary | (U) | UCHAR_ID_CONTINUE |
| ID_Start | binary | (U) | UCHAR_ID_START |
| Ideographic | binary | (U) | UCHAR_IDEOGRAPHIC |
| IDS_Binary_Operator | binary | (U) | UCHAR_IDS_BINARY_OPERATOR |
| IDS_Trinary_Operator | binary | (U) | UCHAR_IDS_TRINARY_OPERATOR |
| Indic_Positional_Category | enum | (U) | UCHAR_INDIC_POSITIONAL_CATEGORY<br>returns enum UIndicPositionalCategory |
| Indic_Syllabic_Category | enum | (U) | UCHAR_INDIC_SYLLABIC_CATEGORY<br>returns enum UIndicSyllabicCategory |
| ISO_Comment | ASCII string | | u_getISOComment |
| Jamo_Short_Name | ASCII string | (c) | contributes to Name |
| Join_Control | binary | (U) | UCHAR_JOIN_CONTROL |
| Joining_Group | enum | (U) | UCHAR_JOINING_GROUP<br>returns enum UJoiningGroup |
| Joining_Type | enum | (U) | UCHAR_JOINING_TYPE<br>returns enum UJoiningType |
| Line_Break | enum | (U) | UCHAR_LINE_BREAK<br>returns enum ULineBreak |
| Logical_Order_Exception | binary | (U) | UCHAR_LOGICAL_ORDER_EXCEPTION |
| Lowercase | binary | (U) | u_isULowercase, UCHAR_LOWERCASE |
| Lowercase_Mapping | Unicode string | | available via u_strToLower (ustring.h) |
| Math | binary | (U) | UCHAR_MATH |
| Name | ASCII string | (U) | u_charName(U_UNICODE_CHAR_NAME or U_EXTENDED_CHAR_NAME) |
| Name_Alias | ASCII string | | u_charName(U_CHAR_NAME_ALIAS) |
| NF*_QuickCheck | enum | (U) | UCHAR_NF*_QUICK_CHECK and available via quickCheck (normalizer2.h)<br>returns UNormalizationCheckResult (no/maybe/yes) |
| NFKC_Casefold | Unicode string | | available via normalization API (normalizer2.h "nfkc_cf") |
| Noncharacter_Code_Point | binary | (U) | UCHAR_NONCHARACTER_CODE_POINT,<br>U_IS_UNICODE_NONCHAR (utf.h) |
| Numeric_Type | enum | (U) | UCHAR_NUMERIC_TYPE<br>returns enum UNumericType |
| Numeric_Value | double | (U) | u_getNumericValue<br>Java/UnicodeSet: only non-negative integers, no fractions |
| Other_Alphabetic | binary | (c) | contributes to Alphabetic |
| Other_Default_Ignorable_Code_Point | binary | (c) | contributes to Default_Ignorable_Code_Point |
| Other_Grapheme_Extend | binary | (c) | contributes to Grapheme_Extend |
| Other_Lowercase | binary | (c) | contributes to Lowercase |
| Other_Math | binary | (c) | contributes to Math |
| Other_Uppercase | binary | (c) | contributes to Uppercase |
| Pattern_Syntax | binary | (U) | UCHAR_PATTERN_SYNTAX |
| Pattern_White_Space | binary | (U) | UCHAR_PATTERN_WHITE_SPACE |
| Prepended_Concatenation_Mark | binary | (U) | UCHAR_PREPENDED_CONCATENATION_MARK |
| Quotation_Mark | binary | (U) | UCHAR_QUOTATION_MARK |
| Radical | binary | (U) | UCHAR_RADICAL |
| Regional_Indicator | binary | (U) | UCHAR_REGIONAL_INDICATOR |
| RGI_Emoji* | binary | (U) | UCHAR_RGI_EMOJI |
| RGI_Emoji_Flag_Sequence* | binary | (U) | UCHAR_RGI_EMOJI_FLAG_SEQUENCE |
| RGI_Emoji_Modifier_Sequence* | binary | (U) | UCHAR_RGI_EMOJI_MODIFIER_SEQUENCE |
| RGI_Emoji_Tag_Sequence* | binary | (U) | UCHAR_RGI_EMOJI_TAG_SEQUENCE |
| RGI_Emoji_ZWJ_Sequence* | binary | (U) | UCHAR_RGI_EMOJI_ZWJ_SEQUENCE |
| Script | enum | (U) | uscript_getCode (uscript.h), UCHAR_SCRIPT<br>returns enum UScriptCode |
| Script_Extensions | list | (U) | uscript_getScriptExtensions & uscript_hasScript (uscript.h), UCHAR_SCRIPT_EXTENSIONS<br>returns a list of enum UScriptCode values |
| Sentence_Break | enum | (U) | UCHAR_SENTENCE_BREAK<br>returns enum USentenceBreak |
| Simple_Case_Folding | code point | | u_foldCase |
| Simple_Lowercase_Mapping | code point | | u_tolower |
| Simple_Titlecase_Mapping | code point | | u_totitle |
| Simple_Uppercase_Mapping | code point | | u_toupper |
| Soft_Dotted | binary | (U) | UCHAR_SOFT_DOTTED |
| STerm | binary | (U) | UCHAR_S_TERM |
| Terminal_Punctuation | binary | (U) | UCHAR_TERMINAL_PUNCTUATION |
| Titlecase_Mapping | Unicode string | | u_strToTitle (ustring.h) |
| Unicode_1_Name | ASCII string | (U) | u_charName(U_UNICODE_10_CHAR_NAME or U_EXTENDED_CHAR_NAME) |
| Unified_Ideograph | binary | (U) | UCHAR_UNIFIED_IDEOGRAPH |
| Uppercase | binary | (U) | u_isUUppercase, UCHAR_UPPERCASE |
| Uppercase_Mapping | Unicode string | | u_strToUpper (ustring.h) |
| Vertical_Orientation | enum | (U) | UCHAR_VERTICAL_ORIENTATION<br>returns enum UVerticalOrientation |
| White_Space | binary | (U) | u_isUWhiteSpace, UCHAR_WHITE_SPACE |
| Word_Break | enum | (U) | UCHAR_WORD_BREAK<br>returns enum UWordBreakValues |
| XID_Continue | binary | (U) | UCHAR_XID_CONTINUE |
| XID_Start | binary | (U) | UCHAR_XID_START |

Notes:

1. (c) - This property (in most cases an "Other_..." property) only
   **contributes** to "real" properties, so there is no direct support for it
   in ICU.

2. (U) - This property is available via the UnicodeSet APIs and patterns. Any
   property available in UnicodeSet is also available in regular expressions.
   Properties which are not available in UnicodeSet are generally those that
   are not available through a UProperty selector.

3. When a property name is followed by a star (*), it is a property of strings;
   for example, Basic_Emoji and RGI_Emoji.
   See https://www.unicode.org/reports/tr51/#Emoji_Sets for details.
   Properties of strings are not yet supported in ICU regular expressions.
   (In Expands_On_NF* and NF*_QuickCheck, the star is instead a wildcard for
   the normalization form.)

4. UnicodeSet `[:scx=Arab:]` is a superset of `[:sc=Arab:]`;
   see https://www.unicode.org/reports/tr18/#Script_Property

5. Full case mapping properties (e.g., Lowercase_Mapping) are complex.
   The string case mapping functions that implement them handle language-specific
   and/or context-sensitive mappings.
   The output may have more code points or fewer code points than the input.

## Customization

ICU does not provide the means to modify properties at runtime. The properties
are provided exactly as specified by a recent version of the Unicode Standard
(as published in the [Character
Database](http://www.unicode.org/onlinedat/online.html)).

For custom sets and maps, it is easiest to make UnicodeSet or
UCPTrie/CodePointTrie objects with the desired values.
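
As a rough picture of what such a custom set provides, the sketch below keeps an application-defined set of code points in a java.util.BitSet. This is a JDK-only stand-in chosen so the example runs without ICU; it is not how UnicodeSet is implemented, and an ICU-based application would use a (frozen) UnicodeSet instead. The class name `CustomSetSketch`, its methods, and the chosen Private Use range are hypothetical.

```java
import java.util.BitSet;

final class CustomSetSketch {
    private static final int MAX_CODE_POINT = 0x10FFFF;

    // One bit per code point; hypothetical storage, not ICU's data structure.
    private final BitSet members = new BitSet(MAX_CODE_POINT + 1);

    // Adds the inclusive code point range start..end to the set.
    void addRange(int start, int end) {
        if (start < 0 || end > MAX_CODE_POINT || start > end) {
            throw new IllegalArgumentException("invalid code point range");
        }
        members.set(start, end + 1); // BitSet's upper bound is exclusive
    }

    // Out-of-range integers are simply "not in the set", mirroring the
    // convention that out-of-range values get a null result (false).
    boolean contains(int codePoint) {
        return codePoint >= 0 && codePoint <= MAX_CODE_POINT
                && members.get(codePoint);
    }

    public static void main(String[] args) {
        CustomSetSketch icons = new CustomSetSketch();
        icons.addRange(0xE000, 0xE0FF); // part of the BMP Private Use Area
        System.out.println(icons.contains(0xE010));
        System.out.println(icons.contains(0x0041));
    }
}
```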

However, if an application requires custom properties (for example, for [Private
Use](http://www.unicode.org/glossary/) characters), then it is possible to
change or add them at build-time. This is doable but not easy.

It is done by modifying the Character Database files copied into the ICU source
tree at
[icu4c/source/data/unidata](https://github.com/unicode-org/icu/tree/main/icu4c/source/data/unidata).
Since ICU 49, most of the properties have been combined into one file,
unidata/ppucd.txt (see the [Preparsed
UCD](https://icu.unicode.org/design/props/ppucd) design doc). Some of the
remaining UCD files are still inputs; others are only used for unit tests.

To add a character to such a file, a line must be inserted into the file with
the format used in that file (see the online documentation on the [Unicode
site](http://www.unicode.org/reports/tr44/) for more information). After
modifying one or more of these files, the ICU data needs to be rebuilt, and the
resulting files need to be checked into the ICU source tree. The files are
processed by special ICU tools outside of the normal ICU build. The
[unidata/changes.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/changes.txt)
file documents the process that has been used for the last several Unicode
version updates; skip the file preparation and API update steps.

Any available Unicode code point (0 to 10FFFF<sub>16</sub>) can be used.
Code point values
should be written with either 4, 5, or 6 hex digits. The minimum number of
digits possible should be used (but no fewer than 4). Note that the Unicode
Standard specifies that the 32 code points U+FDD0..U+FDEF and the 34 code points
U+...xFFFE and U+...xFFFF (where x=0, 1, 2, ..., F, 10) are not characters;
therefore they should not be added to any of the character database files.
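
The digit-count rule and the noncharacter ranges above reduce to simple arithmetic, sketched here in plain Java. The class `CodePointNotation` and its methods are hypothetical helpers for illustration; in ICU4C, the table above lists U_IS_UNICODE_NONCHAR (utf.h) for the noncharacter test.

```java
final class CodePointNotation {
    // Formats a code point with the minimum number of hex digits,
    // but no fewer than 4: "0041", "1F600", "10FFFF".
    static String toHex(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return String.format("%04X", codePoint);
    }

    // True for U+FDD0..U+FDEF (32 code points) and for the last two code
    // points of each of the 17 planes, U+xFFFE and U+xFFFF (34 code points).
    static boolean isNoncharacter(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            return false;
        }
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        int count = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        System.out.println(toHex(0x41) + " " + toHex(0x10FFFF) + " " + count);
    }
}
```

A build-time script that edits the data files could use a check like `isNoncharacter` to reject such code points before writing a line.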

## Lookup

For lookup by code point, iterate through the string, fetch code points, and
either call the unicode/uchar.h / UCharacter or similar functions, or use
dedicated sets and maps. For binary properties, and sets in general, there are
also more efficient methods for iterating over substrings.

### Binary property from code point

Call one of the binary-property functions. Alternatively, make a UnicodeSet for
the property (remember to freeze() it) or for a custom set of characters, and
call contains().

### Binary property over string

It is often useful to partition a string into substrings where every character
has the property, and substrings where every character does not have the
property. Typical uses are splitting a string at separator characters, removing
certain types of characters, and trimming white space. Use a UnicodeSet with its
span() and spanBack() methods (in C++, UTF-8 versions are available as well). In
Java, you can also use a UnicodeSetSpanner.

### Enumerated property from code point

Call one of the int-property functions. Alternatively, build a UCPTrie /
CodePointTrie (new in ICU 63) via its mutable version and build method, then use
that to get the int value for each code point.

### Enumerated property over string

The easiest approach is to iterate over the code points of the string and call
per-code point lookup methods (or use a code point trie).

The UCPTrie / CodePointTrie (new in ICU 63) also offers C macros and a Java
String iterator class where the iteration and data lookup are integrated to
avoid redundancies in validation and range checks.

The UTF-16 code point macros and the Java String iterator also provide the code
point as output, because it has to be fetched or assembled anyway.
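
The plain per-code point iteration described at the start of this section can be sketched as below. It uses only the JDK (Character.codePointAt and Character.getType) as a stand-in for ICU4J's UCharacter or a CodePointTrie lookup; the class `PerCodePointLookup` and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PerCodePointLookup {
    // Returns the general category value of each code point in s, in order.
    static List<Integer> categories(CharSequence s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            // Assembles a supplementary code point from a surrogate pair;
            // an unpaired surrogate is returned as is.
            int cp = Character.codePointAt(s, i);
            result.add(Character.getType(cp));
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+10400 (as a surrogate pair), a space, and an unpaired lead surrogate.
        String text = "a\uD801\uDC00 \uD800";
        System.out.println(categories(text));
    }
}
```

Here the surrogate check and the property lookup are separate steps for every character; the integrated iteration mentioned above exists to avoid repeating that validation work.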

The UTF-8 macros do not assemble the code point because that would be some
amount of extra work, and often only the lookup value is used and the code point
is not needed. When it is needed after all, it is possible to take advantage of
the macros having validated the byte sequence: If the sequence was ill-formed,
then the result is set to the trie's error value. Therefore, if a value other
than the trie error value was returned, then the sequence was well-formed, and
the code point can be fetched without revalidating the sequence (e.g., via
U8_NEXT_UNSAFE()). Since the length of the sequence (1..4 bytes) is also known
from the iteration (string index before/after next() call), an even simpler
piece of code can be used. (See for example the ICU-internal function
codePointFromValidUTF8() in normalizer2impl.cpp.)

### Code point trie most-optimized UTF-16 access

UTF-16 text processing can be further optimized by detecting surrogate pairs and
assembling supplementary code points only when there is non-trivial data
available.

At build time, iterate over all supplementary code points
(umutablecptrie_getRange() / MutableCodePointTrie.getRange() starting from
U+10000) to see if there is non-trivial data for any of the supplementary code
points associated with a lead surrogate. If so, then set a special
(application-specific) value for the lead surrogate.

At runtime, use UCPTRIE_FAST_BMP_GET() per code *unit*. If there is non-trivial
data and the code unit is a lead surrogate, then check if a trail surrogate
follows. If so, assemble the supplementary code point with
U16_GET_SUPPLEMENTARY() and look up its value with UCPTRIE_FAST_SUPP_GET();
otherwise deal with the unpaired surrogate in some way. (Java CodePointTrie.Fast
and java.lang.Character have equivalent methods.)

If there is only trivial data for lead and trail surrogates, then processing can
often skip them. (In this case, there will be two data lookups, one for the lead
surrogate and one for the trail surrogate, but they are fast, and this
optimization speeds up the more common BMP characters by not checking for
surrogates each time.)

For example, in normalization or case mapping, all characters that do not have
any mappings are simply copied as is.

## Properties in ICU Rule Syntax

ICU rule syntaxes should use the Unicode Pattern_White_Space set as syntactic
"spaces" to allow for the usage of white space characters outside of the normal
ASCII range while still maintaining backward compatibility. See
<https://www.unicode.org/reports/tr31/#Pattern_Syntax> for more information.
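
A rule parser that treats Pattern_White_Space as syntactic space might skip it as in the following sketch. The hard-coded list of 11 code points is my reading of the Pattern_White_Space set from UAX #31 and is an assumption of this example; code that links against ICU would more naturally query UCHAR_PATTERN_WHITE_SPACE. The class `PatternWhiteSpaceSketch` and its methods are hypothetical.

```java
final class PatternWhiteSpaceSketch {
    // Pattern_White_Space as assumed here: U+0009..U+000D, U+0020, U+0085,
    // U+200E, U+200F, U+2028, U+2029. All of these are BMP code points.
    static boolean isPatternWhiteSpace(int c) {
        return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085
                || c == 0x200E || c == 0x200F || c == 0x2028 || c == 0x2029;
    }

    // Returns the index of the first code unit at or after start that is not
    // Pattern_White_Space. Working per code unit is sufficient because the
    // set contains no supplementary code points.
    static int skipWhiteSpace(CharSequence rule, int start) {
        int i = start;
        while (i < rule.length() && isPatternWhiteSpace(rule.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String rule = " \t\u200Ea > b";
        int first = skipWhiteSpace(rule, 0);
        System.out.println(first + " " + rule.charAt(first));
    }
}
```

Note that this set is deliberately narrower than White_Space: for example, U+00A0 NO-BREAK SPACE is not part of it, so it would be treated as ordinary rule content.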