---
layout: default
title: Unicode Basics
nav_order: 3
parent: ICU
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Unicode Basics
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Introduction to Unicode

Unicode is a standard that precisely defines a character set as well as a small
number of encodings for it. It enables you to handle text in any language
efficiently. It allows a single application executable to work for a global
audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other
modern systems, provides Internationalization solutions based on Unicode.

This chapter is intended as an introduction to codepages in general and Unicode
in particular. For further information, see:

1. [The Web site of the Unicode consortium](http://www.unicode.org/)

2. [What is Unicode?](https://www.unicode.org/standard/WhatIsUnicode.html)

3. [IBM® Globalization](http://www.ibm.com/software/globalization/)

Go to the [online ICU demos](http://demo.icu-project.org/icu-bin/icudemos) to
see how a Unicode-based server application can handle text in many languages and
many encodings.

## Traditional Character Sets and Unicode

Representing text-format data in computers is a matter of defining a set of
characters and assigning each of them a number and a bit representation.
Underlying this basic idea are three related concepts:

1. A character set or repertoire is an unordered collection of characters that
   can be represented by numeric values.

2. A coded character set maps characters from a character set or repertoire to
   numeric values.

3. A character encoding scheme defines the representation of numeric values
   from one or more coded character sets in bits and bytes.
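
As an illustration of these three concepts, here is a minimal sketch that uses
Python's built-in `ord()` and `str.encode()`. The names `repertoire`,
`coded_character_set`, and `serialize` are made up for this example and are not
ICU APIs.

```python
# A repertoire: an unordered collection of abstract characters
# (LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE, EURO SIGN).
repertoire = {"A", "\u00e9", "\u20ac"}

# A coded character set assigns each character a number.
# Here the numbers are Unicode code points, obtained with ord().
coded_character_set = {ch: ord(ch) for ch in repertoire}


def serialize(ch, encoding):
    """Apply a character encoding scheme: turn the character into bytes."""
    return ch.encode(encoding)


for ch in sorted(repertoire):
    number = coded_character_set[ch]
    print(f"U+{number:04X}",
          "UTF-8:", serialize(ch, "utf-8").hex(),
          "UTF-16BE:", serialize(ch, "utf-16-be").hex())
```

The same number can be serialized to different byte sequences, which is why the
coded character set and the encoding scheme are treated as separate concepts.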

For simple encodings such as ASCII, the last two concepts are basically the
same: ASCII assigns 128 characters and control codes to consecutive numbers from
0 to 127. These characters and control codes are encoded as simple, unsigned,
binary integers. Therefore, ASCII is both a coded character set and a character
encoding scheme.

ASCII only encodes 128 characters, 33 of which are control codes rather than
graphic, displayable characters. It was designed to represent English-language
text for an American user base, and is therefore insufficient for representing
text in almost any language other than American English. In fact, most
traditional encodings were limited to one or a few languages and scripts.

ASCII offered a natural way to extend it: it was designed in the 1960s to work
in systems with 7-bit bytes, while most computers and Internet protocols since
the 1970s use 8-bit bytes, so the extra bit allowed another 128 byte values to
represent more characters. Various encodings were developed that supported
different languages. Some of these were based on ASCII, others were not.

Languages such as Japanese need to encode considerably more than 256 characters.
Various encoding schemes enable large character sets with thousands or tens of
thousands of characters to be represented. Most of those encodings are still
byte-based, which means that many characters require two or more bytes of
storage space. Processing such text requires logic that interprets some byte
values as parts of multi-byte sequences.

Various character sets and encoding schemes have been developed independently,
cover only one or a few languages each, and are incompatible. This makes it very
difficult for a single system to handle text in more than one language at a
time, and especially difficult to do so in a way that is interoperable across
different systems.
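
The incompatibility can be seen with Python's built-in codecs, which are used
here only for illustration (`decode_with` is a hypothetical helper). The same
byte value stands for a different character in each traditional encoding, and a
byte-based East Asian encoding needs more than one byte for a single character.

```python
def decode_with(byte_value, encodings):
    """Decode one byte under several single-byte encodings."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in encodings}


# Byte E9 (hex) is a Latin, a Greek, or a Cyrillic letter, depending on
# which codepage the receiver assumes.
results = decode_with(0xE9, ["latin-1", "iso8859-7", "koi8-r"])
for name, ch in results.items():
    print(f"{name}: U+{ord(ch):04X}")

# A byte-based Japanese encoding needs two bytes for the kanji U+65E5.
kanji_bytes = "\u65e5".encode("shift_jis")
print(len(kanji_bytes), "bytes in Shift-JIS")
```

Without a reliable encoding label, a receiver cannot tell which of these
interpretations the sender intended.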

Generally, the minimum requirement for the interoperable exchange of text data
is that the encoding (character set and encoding scheme) must be properly
specified in the document and in the protocol. For example, email/SMTP and
HTML/HTTP provide the means to specify the "charset", as it is called in
Internet standards. However, very often the encoding is not specified, is
specified incorrectly, or the sender and receiver disagree on its
implementation.

The ISO 2022 encoding scheme was created to store text in many different
languages. It allows other encodings to be embedded by first announcing them and
then switching between them. Full support for all features and possible
encodings with ISO 2022 requires complicated processing and support for many
encodings. For East Asian languages, subsets were developed that cover only
one language or a few at a time; these are much more manageable. ISO 2022 is
not well-suited for use in internal processing. It is designed for data
exchange.

## Glyphs versus Characters

Programmers often need to distinguish between characters and glyphs. A character
is the smallest semantic unit in a writing system. It is an abstract concept
such as the letter A or the exclamation point. A glyph is the visual
presentation of one or more characters, and is often dependent on adjacent
characters.

There is not always a one-to-one mapping between characters and glyphs. In many
languages (Arabic is a prime example), the way a character looks depends heavily
on the surrounding characters. Standard printed Arabic has as many as four
different printed representations (glyphs) for every letter of the alphabet. In
many languages, two or more letters may combine together into a single glyph
(called a ligature), or a single character might be displayed with more than one
glyph.
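
Unicode encodes characters rather than glyphs, but for compatibility with older
character sets it also contains some "presentation form" code points that
correspond to individual glyph shapes. As a rough illustration only, the
following sketch uses Python's standard `unicodedata` module to show that the
four positional forms of one Arabic letter, and a Latin ligature, fold back to
their underlying characters under compatibility normalization (NFKC). The helper
name `underlying_characters` is made up for this example.

```python
import unicodedata


def underlying_characters(glyph_like):
    """Fold a compatibility presentation form to the character(s) it shows."""
    return unicodedata.normalize("NFKC", glyph_like)


heh = unicodedata.lookup("ARABIC LETTER HEH")
forms = ["ISOLATED", "FINAL", "INITIAL", "MEDIAL"]
for form in forms:
    shape = unicodedata.lookup(f"ARABIC LETTER HEH {form} FORM")
    # Four different glyph shapes, one character identity.
    print(f"U+{ord(shape):04X} -> U+{ord(underlying_characters(shape)):04X}")

# One glyph (a ligature) that stands for two characters.
ligature = unicodedata.lookup("LATIN SMALL LIGATURE FI")
print(underlying_characters(ligature))
```

In normal text only the abstract letter is stored; choosing the glyph is left
to the rendering system.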

Despite the different visual variants of a particular letter, it still retains
its identity. For example, the Arabic letter heh has four different visual
representations in common use. Whichever one is used, it still keeps its
identity as the letter heh. It is this identity that Unicode encodes, not the
visual representation. This also cuts down on the number of independent
character values required.

## Overview of Unicode

Unicode was developed as a single coded character set that contains support for
all languages in the world. The first version of Unicode used 16-bit numbers,
which allowed for encoding 65,536 characters without complicated multibyte
schemes. With the inclusion of more characters, and following implementation
needs of many different platforms, Unicode was extended to allow more than one
million characters. Several other encoding schemes were added. This introduced
more complexity into the Unicode standard, but far less than managing a large
number of different encodings.

Starting with Unicode 2.0 (published in 1996), the Unicode standard began
assigning numbers from 0 to 10FFFF<sub>16</sub>, which requires 21 bits but does not use
them completely. This gives more than enough room for all written languages in
the world. The original repertoire covered all major languages commonly used in
computing. Unicode continues to grow and to include more scripts.

The design of Unicode differs in several ways from traditional character sets
and encoding schemes:

1. Its repertoire enables users to include text efficiently in almost all
   languages within a single document.

2. It can be encoded in a byte-based way with one or more bytes per character,
   but the default encoding scheme uses 16-bit units that allow much simpler
   processing for all common characters.

3.
   Many characters, such as letters with accents and umlauts, can be combined
   from the base character and accent or umlaut modifiers. This combining
   reduces the number of different characters that need to be encoded
   separately. "Precomposed" variants for characters that existed in common
   character sets at the time were included for compatibility.

4. Characters and their usage are well-defined and described. While traditional
   character sets typically only provide the name or a picture of a character
   and its number and byte encoding, Unicode has a comprehensive database of
   properties available for download. It also defines a number of processes and
   algorithms for dealing with many aspects of text processing to make it more
   interoperable.

The early inclusion of all characters of commonly used character sets makes
Unicode a useful "pivot" point for converting between traditional character
sets. It also makes it feasible to process non-Unicode text by first converting
it into Unicode, processing the text, and converting it back to the original
encoding without loss of data.

> :point_right: *The first 128 Unicode code point values are assigned to the same characters as
in US-ASCII. That is, the same number is assigned to the same character. The
same is true for the first 256 code point values of Unicode compared to ISO
8859-1 (Latin-1), which itself is a direct superset of US-ASCII. This makes it
easy to adapt many applications to Unicode because the numbers for many
syntactically important characters are the same.*

## Character Encoding Forms and Schemes for Unicode

Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room
to allow for unambiguous encoding of every character in common use. Such a
character number is called a "code point".

> :point_right: *Unicode code points are just non-negative integer numbers in a certain range.
They do not have an implicit binary representation or a width of 21 or 32 bits.
Binary representation and unit widths are defined for encoding forms.*

For internal processing, the standard defines three encoding forms, and for file
storage and protocols, some of these encoding forms have encoding schemes that
differ in their byte ordering. The difference between an encoding form and an
encoding scheme is that an encoding form maps the character set codes to values
that fit into internal data types (like a `short` in C), while an encoding scheme
maps to bits and bytes. For traditional encodings, they are the same since the
encoding forms already map to bytes.

The different Unicode encoding forms are optimized for a variety of different
uses:

1. UTF-16, the default encoding form, maps a character code point to either one
   or two 16-bit integers.

2. UTF-8 is a byte-based encoding that offers backwards compatibility with
   ASCII-based, byte-oriented APIs and protocols. A character is stored with 1,
   2, 3, or 4 bytes.

3. UTF-32 is the simplest, but most memory-intensive encoding form: it uses one
   32-bit integer per Unicode character.

4. SCSU is an encoding scheme that provides a simple compression of Unicode
   text. It is designed only for input and output, not for internal use.

ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters
(with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial
support for supplementary characters.

For input/output, character encoding schemes define a byte serialization of
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
byte-based.
For each of UTF-16 and UTF-32, there are two variants defined: one
that serializes the code units in big-endian byte order (most significant byte
first), and one that serializes the code units in little-endian byte order
(least significant byte first). The corresponding encoding schemes are called
UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.

> :point_right: *The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they refer
either to character encoding forms where 16/32-bit words are processed and are
naturally stored in the platform endianness, or they refer to the
IANA-registered charset names, i.e., to character encoding schemes or byte
serializations. In addition to simple byte serialization, the charsets with
these names also use optional Byte Order Marks (see [Serialized Formats](#serialized-formats) below).*

## Overview of UTF-16

The default encoding form of the Unicode Standard uses 16-bit code units. Code
point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and
are encoded with just one 16-bit unit of the same value. Code points from
10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called
"surrogates", and they are called a "surrogate pair" when, together, they
correctly encode one Unicode character. The first surrogate in a pair must be in
the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to
DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with
either one code unit that is not a surrogate or with a correct pair of
surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this
mechanism and will never, by themselves, be assigned any characters.
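
The surrogate ranges above imply a simple arithmetic mapping. The following is a
minimal sketch of that mapping in Python, not ICU code. The function names
`to_utf16_units` and `from_surrogate_pair` are hypothetical and chosen for this
example only.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points do not encode characters")
    if code_point <= 0xFFFF:
        return [code_point]                # one unit of the same value
    offset = code_point - 0x10000          # 20 significant bits remain
    lead = 0xD800 + (offset >> 10)         # upper 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return [lead, trail]


def from_surrogate_pair(lead, trail):
    """Combine a lead and a trail surrogate into one code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


for cp in (0x41, 0xFFFD, 0x10000, 0x1F600, 0x10FFFF):
    units = to_utf16_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

Because the lead and trail ranges do not overlap with each other or with
single-unit values, the role of any code unit can be determined from its value
alone.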

Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1
assigns more than 40,000 supplementary characters that make use of surrogate
pairs in UTF-16.

Note that comparing UTF-16 strings lexically based on their 16-bit code units
does not result in the same order as comparing the code points. This is not
usually an issue since only rarely-used characters are affected. Most processes
do not rely on the same results in such comparisons. Where necessary, a simple
modification to a string comparison can be performed that still allows efficient
code unit-based comparisons and makes them compatible with code point
comparisons. ICU has C and C++ API functions for this.

## Overview of UTF-8

To meet the requirements of byte-oriented, ASCII-based systems, the Unicode
Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that
preserves ASCII transparency.

UTF-8 maintains transparency for all the ASCII code values (0..127). These
values do not appear in any byte of a transformed result except as the direct
representation of the ASCII values. Thus, ASCII text is also UTF-8 text.

Characteristics of UTF-8 include:

1. Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the
   same value. Therefore, ASCII characters take up 50% less space with UTF-8
   encoding than with UTF-16.

2. All other code points are encoded with multibyte sequences, with the first
   byte (lead byte) indicating the number of bytes that follow (trail bytes).
   This results in very efficient parsing. For code points up to 10FFFF<sub>16</sub>, the
   lead bytes are in the range C2<sub>16</sub> to F4<sub>16</sub>, and the trail bytes are in the
   range 80<sub>16</sub> to BF<sub>16</sub>. The byte values C0<sub>16</sub>, C1<sub>16</sub>, and F5<sub>16</sub> to FF<sub>16</sub> are
   never used.

3.
   UTF-8 is relatively compact and resource conservative in its use of the
   bytes required for encoding text in European scripts, but uses 50% more
   space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two
   bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16),
   and all others four.

4. Binary comparisons of UTF-8 strings based on their bytes result in the same
   order as comparing code point values.

## Overview of UTF-32

The UTF-32 encoding form always uses one single 32-bit integer per Unicode code
point. This results in a very simple encoding.

The drawback is its memory consumption: Since code point values use only 21
bits, one-third of the memory is always unused, and since most commonly used
characters have code point values of up to FFFF<sub>16</sub>, they take up only one 16-bit
unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).

UTF-32 is mainly used in APIs that are defined with the same data type for both
code points and code units. Modern versions of the C standard library that
support Unicode use a 32-bit `wchar_t` with UTF-32 semantics.

## Overview of SCSU

SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of
Unicode text for both input and output. It is a simple compression that
transforms the text into a byte stream. It typically uses one byte per character
in small scripts, and two bytes per character in large, East Asian scripts.

It is usually shorter than any of the UTFs. However, SCSU is stateful, which
makes it unsuitable for internal processing. It also uses all possible byte
values, which might require additional processing for protocols such as SMTP
(email).

See also <https://www.unicode.org/reports/tr6/>.

## Other Unicode Encodings

Other Unicode encodings have been developed over time for various purposes.
Most of them are implemented in ICU; see
[source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt).

1. BOCU-1: Binary-Ordered Compression of Unicode
   An encoding of Unicode that is about as compact as SCSU but has a much
   smaller amount of state. Unlike SCSU, it preserves code point order and can
   be used in 8-bit emails without a transfer encoding. BOCU-1 does **not**
   preserve ASCII characters in ASCII-readable form. See [Unicode Technical
   Note #6](http://www.unicode.org/notes/tn6/).

2. UTF-7: Designed for 7-bit emails; simple and not very compact. Since email
   systems have been 8-bit safe for several years, UTF-7 is no longer necessary
   and is not recommended. Most ASCII characters are readable, others are
   base64-encoded. See [RFC 2152](http://www.ietf.org/rfc/rfc2152.txt).

3. IMAP-mailbox-name: A variant of UTF-7 that is suitable for expressing
   Unicode strings as ASCII characters for Unix filenames.
   **The name "IMAP-mailbox-name" is specific to ICU!**
   See [RFC 2060 INTERNET MESSAGE ACCESS PROTOCOL - VERSION
   4rev1](http://www.ietf.org/rfc/rfc2060.txt) section 5.1.3. Mailbox
   International Naming Convention.

4. UTF-EBCDIC: An EBCDIC-friendly encoding that is similar to UTF-8. See
   [Unicode Technical Report #16](http://www.unicode.org/reports/tr16/). **As
   of ICU 2.6, UTF-EBCDIC is not implemented in ICU.**

5. CESU-8: Compatibility Encoding Scheme for UTF-16: 8-Bit
   An incompatible variant of UTF-8 that preserves 16-bit-Unicode (UTF-16)
   string order instead of code point order. Not for open interchange. See
   [Unicode Technical Report #26](http://www.unicode.org/reports/tr26/).
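
To make the difference between UTF-8 and CESU-8 concrete, here is a small
sketch in Python. Python has no built-in CESU-8 codec, so `cesu8_encode` is a
hand-written, hypothetical helper that applies three-byte UTF-8 sequences to
each UTF-16 code unit. It is meant only to illustrate the ordering property and
is not a complete or validated converter.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: UTF-8 byte patterns applied to UTF-16 units."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")      # same bytes as UTF-8 for the BMP
        else:
            offset = cp - 0x10000
            for unit in (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)):
                # Each surrogate becomes its own three-byte sequence.
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_char = "\uff21"            # a BMP code point above the surrogate range
supplementary = "\U00010000"   # the lowest supplementary code point

print("UTF-8: ", supplementary.encode("utf-8").hex())
print("CESU-8:", cesu8_encode(supplementary).hex())

# UTF-8 bytes sort in code point order; CESU-8 bytes sort in UTF-16 order.
pair = [supplementary, bmp_char]
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_cesu8 = sorted(pair, key=cesu8_encode)
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))
print(by_utf8 == [bmp_char, supplementary], by_cesu8 == by_utf16)
```

The two encodings agree byte for byte on BMP text, which is why one can be
mistaken for the other.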

## Programming using UTFs

Programming using any of the UTFs is much more straightforward than with
traditional multi-byte character encodings, even though UTF-8 and UTF-16 are
also variable-width encodings.

Within each Unicode encoding form, the code unit values for singletons (code
units that alone encode characters), lead units, and trailing units are all
disjoint. This has crucial implications for implementations:

1. The number of units for one code point can be determined from the lead unit.
   This is especially important for UTF-8, where there can be up to 4 bytes per
   character.

2. Boundaries can be determined locally. When accessing text randomly, you can
   always determine the nearest code point boundaries with a small number of
   machine instructions.

3. There is no overlap. When searching for string A in string B, you never get
   a false match on code points, so you do not need to convert to code points
   for string searching. False matches never occur because the end of one
   sequence is never the same as the start of another sequence. Overlap is one
   of the biggest problems with common multi-byte encodings like Shift-JIS. All
   the UTFs avoid this problem.

4. Iteration is simple. Getting the next or previous code point is
   straightforward, and only takes a small number of machine instructions.

5. UTF-16 is fully symmetric. You can determine from any single code unit
   whether it is the first, last, or only one for a code point. Moving
   (iterating) in either direction through UTF-16 text is equally fast and
   efficient.

6. Indexing by code points is slow. This is a disadvantage of
   all variable-width encodings.
   Except in UTF-32, it is inefficient to find
   code unit boundaries corresponding to the nth code point or to find the code
   point offset containing the nth code unit. Both involve scanning from the
   start of the text or from a last known boundary. ICU, like most common APIs,
   always indexes by code units. It counts code units and not code points.

Conversion between different UTFs is very fast. Unlike converting to and from
legacy encodings like Latin-2, conversion between UTFs does not require table
look-ups.

ICU provides two basic data type definitions for Unicode. `UChar32` is a 32-bit
type for code points, and is used for single Unicode characters. It may be signed
or unsigned. It is the same as `wchar_t` if it is 32 bits wide. `UChar` is an
unsigned 16-bit integer for UTF-16 code units. It is the base type for strings
(`UChar *`), and it is the same as `wchar_t` if it is 16 bits wide.

Some higher-level APIs, used especially for formatting, use characters closer to
a representation for a glyph. Such "user characters" are also called "graphemes"
or "grapheme clusters" and require strings so that combining sequences can be
included.

## Serialized Formats

In files, input, output, and network protocols, text must be accompanied by the
specification of its character encoding scheme for a client to be able to
interpret it correctly. (This is called a "charset" in Internet protocols.)
However, an encoding scheme specification is not necessary if the text is only
used within a single platform, protocol, or application where it is otherwise
clear what the encoding is. (The language and text directionality should usually
be specified to enable spell checking, text-to-speech transformation, etc.)

*The discussion of encoding specifications in this section applies to standard
Internet protocols where charset name strings are used.
Other protocols may use
numeric encoding identifiers and assign different semantics to those identifiers
than Internet protocols.*

Typically, the encoding specification is done in a protocol- and document
format-dependent way. However, the Unicode standard offers a mechanism for
tagging text files with a "signature" for cases where protocols do not identify
character encoding schemes.

The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by
prepending it to a file or stream. The alternative function of U+FEFF as a
format control character has been copied to U+2060 WORD JOINER, and U+FEFF
should only be used for Unicode signatures.

The different character encoding schemes generate different, distinct byte
sequences for U+FEFF:

1. UTF-8: EF BB BF

2. UTF-16BE: FE FF

3. UTF-16LE: FF FE

4. UTF-32BE: 00 00 FE FF

5. UTF-32LE: FF FE 00 00

6. SCSU: 0E FE FF

7. BOCU-1: FB EE 28

8. UTF-7: 2B 2F 76 ( 38 | 39 | 2B | 2F )

9. UTF-EBCDIC: DD 73 66 73

ICU provides the function `ucnv_detectUnicodeSignature()` for Unicode signature
detection.

*There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and
CESU-8 encode U+FEFF and in fact all BMP code points with the same bytes. The
opportunity for misidentification of one as the other is one of the reasons why
CESU-8 should only be used in limited, closed, specific environments.*

In UTF-16 and UTF-32, where the signature also distinguishes between big-endian
and little-endian byte orders, it is also called a byte order mark (BOM). The
signature works for UTF-16 since the code point that has the byte-swapped
encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a
"non-character" code point.)
In Internet protocols, if an encoding specification
of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte
sequence (BOM) that identifies the byte ordering, which is not the case for the
encoding scheme/charset names with "BE" or "LE".

*If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not
begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE,
respectively.*

A signature is not part of the content, and must be stripped when processing.
For example, blindly concatenating two files will give an incorrect result.

If a signature was detected, then the signature "character" U+FEFF should be
removed from the Unicode stream **after** conversion. Removing the signature
bytes before conversion could cause the conversion to fail for stateful
encodings like BOCU-1 and UTF-7.

Whether a signature is to be recognized or not depends on the protocol or
application.

1. If a protocol specifies a charset name, then the byte stream must be
   interpreted according to how that name is defined. Only the "UTF-16" and
   "UTF-32" names include recognition of the byte order marks that are specific
   to them (and the ICU converters for these names do this automatically). None
   of the other Unicode charsets are defined to include any signature/BOM
   handling.

2. If no charset name is provided, for example for text files in most
   filesystems, then applications must usually rely on heuristics to determine
   the file encoding. Many document formats contain an embedded or implicit
   encoding declaration, but for plain text files it is reasonable to use
   Unicode signatures as simple and reliable heuristics. This is especially
   common on Windows systems. However, some tools for plain text file handling
   (e.g., many Unix command line tools) are not prepared for Unicode
   signatures.
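
The signature heuristic for plain text can be sketched as follows. This is an
illustrative Python example, not ICU's `ucnv_detectUnicodeSignature()`. The
helper names `detect_signature` and `decode_with_signature` are hypothetical,
only the UTF-8, UTF-16, and UTF-32 signatures are covered, and the fallback
encoding is an assumption that a real application would choose for itself.

```python
import codecs

# Longer signatures come first: the UTF-32LE signature FF FE 00 00 begins
# with the UTF-16LE signature FF FE. The names are Python codec names.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_signature(data):
    """Return (encoding name, signature length), or (None, 0) if none found."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def decode_with_signature(data, default="utf-8"):
    """Decode bytes, then drop the U+FEFF signature after conversion."""
    name, _ = detect_signature(data)
    text = data.decode(name or default)
    if name is not None and text.startswith("\ufeff"):
        text = text[1:]   # the signature is not part of the content
    return text


print(detect_signature(b"\xff\xfeh\x00i\x00"))
print(decode_with_signature(b"\xef\xbb\xbfhi"))
```

Note that the bytes FF FE 00 00 could also be a UTF-16LE signature followed by
U+0000, so this remains a heuristic and should yield to any charset that the
protocol specifies.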

## The Unicode Standard Is An Industry Standard

The Unicode standard is an industry standard and parallels ISO 10646-1. Around
1993, these two standards were effectively merged into the same character set
standard. Both standards have the same character repertoire and the same
encoding forms and schemes.

One difference used to be that the ISO standard defined code point values to be
from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add
an amendment to the standard. The amendment removes this difference by declaring
that no characters will ever be assigned code points above 10FFFF<sub>16</sub>. The main
reason for the ISO work group's decision is interoperability between the UTFs.
UTF-16 cannot encode any code points above this limit.

This means that the code point space for both Unicode and ISO 10646 is now the
same! **These changes to ISO 10646 have been made recently and should be
complete in the edition ISO 10646:2003, which also combines all parts of the
standard into one.**

The former, larger code space is the reason why the ISO definition of UTF-8
specifies sequences of five and six bytes to cover that whole range.

Another difference is that the ISO standard defines encoding forms "UCS-4" and
"UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of
7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee
has accepted that the characters above 10FFFF<sub>16</sub> will not be encoded, so there is
essentially no difference between the forms. The "4" stands for "four-byte
form".

UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF<sub>16</sub>,
excluding the surrogate code points. Thus, it cannot represent the characters
with code points above FFFF<sub>16</sub> (called supplementary characters).

*There is no conversion necessary between UCS-2 and UTF-16.
The difference is
only in the interpretation of surrogates.*

The standards differ in what kind of information they provide: The Unicode
standard provides more character properties and describes algorithms etc., while
the ISO standard defines collections, subsets and similar.

The standards are synchronized, and the respective committees work together to
add new characters and assign code point values.