---
layout: default
title: Unicode Basics
nav_order: 3
parent: ICU
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Unicode Basics
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Introduction to Unicode

Unicode is a standard that precisely defines a character set as well as a small
number of encodings for it. It enables you to handle text in any language
efficiently. It allows a single application executable to work for a global
audience. ICU, like Java™, Microsoft® Windows NT™, Windows™ 2000 and other
modern systems, provides Internationalization solutions based on Unicode.

This chapter is intended as an introduction to codepages in general and Unicode
in particular. For further information, see:

1.  [The Web site of the Unicode consortium](http://www.unicode.org/)

2.  [What is
    Unicode?](https://www.unicode.org/standard/WhatIsUnicode.html)

3.  [IBM® Globalization](http://www.ibm.com/software/globalization/)

Go to the [online ICU demos](http://demo.icu-project.org/icu-bin/icudemos) to
see how a Unicode-based server application can handle text in many languages and
many encodings.

## Traditional Character Sets and Unicode

Representing text-format data in computers is a matter of defining a set of
characters and assigning each of them a number and a bit representation.
Underlying this basic idea are three related concepts:

1.  A character set or repertoire is an unordered collection of characters that
    can be represented by numeric values.

2.  A coded character set maps characters from a character set or repertoire to
    numeric values.

3.  A character encoding scheme defines the representation of numeric values
    from one or more coded character sets in bits and bytes.

For simple encodings such as ASCII, the last two concepts are basically the
same: ASCII assigns 128 characters and control codes to consecutive numbers from
0 to 127. These characters and control codes are encoded as simple, unsigned,
binary integers. Therefore, ASCII is both a coded character set and a character
encoding scheme.

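The distinction between a coded character set and a character encoding scheme can be made concrete with a short sketch. The Python snippet below is only an illustration and does not use ICU; the choice of the letter é and of the Latin-1, code page 437, and UTF-8 encodings is an assumption made for the example.

```python
# Coded character set: each character is assigned a number.
code_point = ord("A")  # 65 in ASCII (and in Unicode)

# Character encoding scheme: each number is given a byte representation.
# For ASCII the byte value equals the number, so the two concepts coincide.
ascii_bytes = "A".encode("ascii")

# Outside ASCII, the same character gets different bytes in different encodings.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
encoded = {name: e_acute.encode(name) for name in ("latin-1", "cp437", "utf-8")}

print(code_point, ascii_bytes)
for name, data in encoded.items():
    print(name, data.hex())
```
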
ASCII only encodes 128 characters, 33 of which are control codes rather than
graphic, displayable characters. It was designed to represent English-language
text for an American user base, and is therefore insufficient for representing
text in almost any language other than American English. In fact, most
traditional encodings were limited to one or few languages and scripts.

ASCII offered a natural way to extend it: It was designed in the 1960s to work
in systems with 7-bit bytes, whereas most computers and Internet protocols since
the 1970s use 8-bit bytes. The extra bit allowed another 128 byte values to
represent more characters. Various encodings were developed that supported
different languages. Some of these were based on ASCII, others were not.

Languages such as Japanese need to encode considerably more than 256 characters.
Various encoding schemes enable large character sets with thousands or tens of
thousands of characters to be represented. Most of those encodings are still
byte-based, which means that many characters require two or more bytes of
storage space. Software that processes such text must therefore determine
whether a given byte value is a complete character or part of a multi-byte
sequence.

Various character sets and encoding schemes have been developed independently,
cover only one or few languages each, and are incompatible. This makes it very
difficult for a single system to handle text in more than one language at a
time, and especially difficult to do so in a way that is interoperable across
different systems.

Generally, the minimum requirement for the interoperable exchange of text data
is that the encoding (character set and encoding scheme) must be properly
specified in the document and in the protocol. For example, email/SMTP and
HTML/HTTP provide the means to specify the "charset", as it is called in
Internet standards. However, very often the encoding is not specified, specified
incorrectly, or the sender and receiver disagree on its implementation.

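What goes wrong when sender and receiver disagree on the charset can be shown in a few lines. This is a minimal Python sketch, not ICU code; the sample text and the pairing of UTF-8 with ISO 8859-1 are assumptions chosen for illustration.

```python
text = "na\u00efve caf\u00e9"

# The sender serializes the text as UTF-8 ...
wire_bytes = text.encode("utf-8")

# ... but the receiver assumes ISO 8859-1 (Latin-1), so each two-byte
# UTF-8 sequence is misread as two separate Latin-1 characters.
garbled = wire_bytes.decode("latin-1")

# With the correct charset, the text round-trips without loss.
correct = wire_bytes.decode("utf-8")

print(garbled)
print(correct)
```
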
The ISO 2022 encoding scheme was created to store text in many different
languages. It allows other encodings to be embedded by first announcing them and
then switching between them. Full support for all features and possible
encodings with ISO 2022 requires complicated processing and the need to support
many encodings. For East Asian languages, subsets were developed that cover only
one language or a few at a time and are therefore much more manageable. ISO 2022
is not well-suited for use in internal processing. It is designed for data
exchange.

## Glyphs versus Characters

Programmers often need to distinguish between characters and glyphs. A character
is the smallest semantic unit in a writing system. It is an abstract concept
such as the letter A or the exclamation point. A glyph is the visual
presentation of one or more characters, and is often dependent on adjacent
characters.

There is not always a one-to-one mapping between characters and glyphs. In many
languages (Arabic is a prime example), the way a character looks depends heavily
on the surrounding characters. Standard printed Arabic has as many as four
different printed representations (glyphs) for every letter of the alphabet. In
many languages, two or more letters may combine together into a single glyph
(called a ligature), or a single character might be displayed with more than one
glyph.

Despite the different visual variants of a particular letter, it still retains
its identity. For example, the Arabic letter heh has four different visual
representations in common use. Whichever one is used, it still keeps its
identity as the letter heh. It is this identity that Unicode encodes, not the
visual representation. This also cuts down on the number of independent
character values required.

## Overview of Unicode

Unicode was developed as a single coded character set that contains support for
all languages in the world. The first version of Unicode used 16-bit numbers,
which allowed for encoding 65,536 characters without complicated multibyte
schemes. With the inclusion of more characters, and following implementation
needs of many different platforms, Unicode was extended to allow more than one
million characters. Several other encoding schemes were added. This introduced
more complexity into the Unicode standard, but far less than managing a large
number of different encodings.

Starting with Unicode 2.0 (published in 1996), the Unicode standard began
assigning numbers from 0 to 10FFFF<sub>16</sub>, which requires 21 bits but does not use
them completely. This gives more than enough room for all written languages in
the world. The original repertoire covered all major languages commonly used in
computing. Unicode continues to grow, and it includes more scripts.

The design of Unicode differs in several ways from traditional character sets
and encoding schemes:

1.  Its repertoire enables users to include text efficiently in almost all
    languages within a single document.

2.  It can be encoded in a byte-based way with one or more bytes per character,
    but the default encoding scheme uses 16-bit units that allow much simpler
    processing for all common characters.

3.  Many characters, such as letters with accents and umlauts, can be combined
    from the base character and accent or umlaut modifiers. This combining
    reduces the number of different characters that need to be encoded
    separately. "Precomposed" variants for characters that existed in common
    character sets at the time were included for compatibility.

4.  Characters and their usage are well-defined and described. While traditional
    character sets typically only provide the name or a picture of a character
    and its number and byte encoding, Unicode has a comprehensive database of
    properties available for download. It also defines a number of processes and
    algorithms for dealing with many aspects of text processing to make it more
    interoperable.

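The relationship between a precomposed character and its combining sequence, as well as the property database mentioned above, can be explored with a short sketch. The example uses Python's standard `unicodedata` module (which exposes data from the Unicode Character Database) rather than ICU, and the letter é is an arbitrary choice for illustration.

```python
import unicodedata

precomposed = "\u00e9"  # one code point: the precomposed letter
combining = "e\u0301"   # two code points: base letter + combining accent

# Character names come from the Unicode Character Database.
name_pre = unicodedata.name(precomposed)
name_mark = unicodedata.name(combining[1])

# The database also records how the precomposed form decomposes.
decomposition = unicodedata.decomposition(precomposed)

print(name_pre, "|", name_mark, "|", decomposition)
```

Both strings usually render as the same glyph, although they contain a different number of code points.
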
The early inclusion of all characters of commonly used character sets makes
Unicode a useful "pivot" point for converting between traditional character
sets, and makes it feasible to process non-Unicode text by first converting it
into Unicode, processing the text, and converting it back to the original
encoding without loss of data.

> :point_right: *The first 128 Unicode code point values are assigned to the same characters as
in US-ASCII. That is, the same number is assigned to the same character. The
same is true for the first 256 code point values of Unicode compared to ISO
8859-1 (Latin-1) which itself is a direct superset of US-ASCII. This makes it
easy to adapt many applications to Unicode because the numbers for many
syntactically important characters are the same.*

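The compatibility of the first code points with US-ASCII and ISO 8859-1 can be checked mechanically. The following Python sketch is an illustration only; it relies on Python's built-in `ascii` and `latin-1` codecs rather than on ICU converters.

```python
# Every ASCII byte value decodes to the Unicode code point of the same number.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))

# The same holds for all 256 byte values of ISO 8859-1 (Latin-1).
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

print(ascii_matches, latin1_matches)
```
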
## Character Encoding Forms and Schemes for Unicode

Unicode assigns characters a number from 0 to 10FFFF<sub>16</sub>, giving enough elbow room
to allow for unambiguous encoding of every character in common use. Such a
character number is called a "code point".

> :point_right: *Unicode code points are just non-negative integer numbers in a certain range.
They do not have an implicit binary representation or a width of 21 or 32 bits.
Binary representation and unit widths are defined for encoding forms.*

For internal processing, the standard defines three encoding forms, and for file
storage and protocols, some of these encoding forms have encoding schemes that
differ in their byte ordering. The difference between an encoding form and an
encoding scheme is that an encoding form maps the character set codes to values
that fit into internal data types (like a short in C), while an encoding scheme
maps to bits and bytes. For traditional encodings, they are the same since the
encoding forms already map to bytes.

The different Unicode encoding forms are optimized for a variety of different
uses:

1.  UTF-16, the default encoding form, maps a character code point to either one
    or two 16-bit integers.

2.  UTF-8 is a byte-based encoding that offers backwards compatibility with
    ASCII-based, byte-oriented APIs and protocols. A character is stored with 1,
    2, 3, or 4 bytes.

3.  UTF-32 is the simplest, but most memory-intensive encoding form: It uses one
    32-bit integer per Unicode character.

4.  SCSU is an encoding scheme that provides a simple compression of Unicode
    text. It is designed only for input and output, not for internal use.

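To compare the three encoding forms, the following sketch counts the code units each one needs for a few sample characters. It is written in Python for brevity and does not use ICU; the helper name `code_unit_counts` and the sample characters are hypothetical choices for this illustration.

```python
def code_unit_counts(ch: str) -> dict:
    """Return the number of code units needed for one code point."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# U+0041, U+00E9, U+20AC (BMP) and U+1F600 (supplementary)
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```
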
ICU uses UTF-16 internally. ICU 2.0 fully supports supplementary characters
(with code points 10000<sub>16</sub>..10FFFF<sub>16</sub>). Older versions of ICU provided only partial
support for supplementary characters.

For input/output, character encoding schemes define a byte serialization of
text. UTF-8 is itself both an encoding form and an encoding scheme because it is
byte-based. For each of UTF-16 and UTF-32, there are two variants defined: one
that serializes the code units in big-endian byte order (most significant byte
first), and one that serializes the code units in little-endian byte order
(least significant byte first). The corresponding encoding schemes are called
UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.

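As an illustration of the byte serializations, the sketch below encodes a single code point, U+20AC EURO SIGN (an arbitrary choice), with each encoding scheme. It uses Python's built-in codecs, not ICU converters.

```python
euro = "\u20ac"  # U+20AC EURO SIGN

schemes = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

# Map each scheme name to the hexadecimal form of its byte serialization.
serialized = {name: euro.encode(name).hex() for name in schemes}

for name, hex_bytes in serialized.items():
    print(f"{name:10} {hex_bytes}")
```
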
> :point_right: *The names "UTF-16" and "UTF-32" are ambiguous. Depending on context, they refer
either to character encoding forms where 16/32-bit words are processed and are
naturally stored in the platform endianness, or they refer to the
IANA-registered charset names, i.e., to character encoding schemes or byte
serializations. In addition to simple byte serialization, the charsets with
these names also use optional Byte Order Marks (see [Serialized Formats](#serialized-formats) below).*

## Overview of UTF-16

The default encoding form of the Unicode Standard uses 16-bit code units. Code
point values for the most common characters are in the range of 0 to FFFF<sub>16</sub> and
are encoded with just one 16-bit unit of the same value. Code points from
10000<sub>16</sub> to 10FFFF<sub>16</sub> are encoded with two code units that are often called
"surrogates", and they are called a "surrogate pair" when, together, they
correctly encode one Unicode character. The first surrogate in a pair must be in
the range D800<sub>16</sub> to DBFF<sub>16</sub>, and the second one must be in the range DC00<sub>16</sub> to
DFFF<sub>16</sub>. Every Unicode code point has only one possible UTF-16 encoding with
either one code unit that is not a surrogate or with a correct pair of
surrogates. The code point values D800<sub>16</sub> to DFFF<sub>16</sub> are set aside just for this
mechanism and will never, by themselves, be assigned any characters.

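The surrogate mechanism described above amounts to a small amount of arithmetic. The following Python sketch shows one way to write it; the function names `to_utf16` and `from_surrogates` are hypothetical and are not ICU APIs.

```python
def to_utf16(cp: int) -> list:
    """Encode one code point as a list of one or two 16-bit code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0xFFFF:
        return [cp]                      # single unit of the same value
    v = cp - 0x10000                     # 20-bit offset into the supplementary range
    return [0xD800 + (v >> 10),          # lead surrogate: upper 10 bits
            0xDC00 + (v & 0x3FF)]        # trail surrogate: lower 10 bits


def from_surrogates(lead: int, trail: int) -> int:
    """Combine a correct surrogate pair into a supplementary code point."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


print([hex(u) for u in to_utf16(0x1F600)])
```
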
Most commonly used characters have code points below FFFF<sub>16</sub>, but Unicode 3.1
assigns more than 40,000 supplementary characters that make use of surrogate
pairs in UTF-16.

Note that comparing UTF-16 strings lexically based on their 16-bit code units
does not result in the same order as comparing the code points. This is not
usually an issue since only rarely-used characters are affected. Most processes
do not rely on the same results in such comparisons. Where necessary, a simple
modification to a string comparison can be performed that still allows efficient
code unit-based comparisons and makes them compatible with code point
comparisons. ICU has C and C++ API functions for this.

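The ordering difference, and one possible "fix-up" that restores code point order, can be sketched as follows. This is a Python illustration of a commonly described technique, not the implementation used by the ICU functions; the helper names are hypothetical, and the sketch assumes well-formed UTF-16 (no unpaired surrogates).

```python
def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of a string as integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit: int) -> int:
    """Remap a code unit so that surrogates sort above all other units."""
    if unit >= 0xE000:
        return unit - 0x800    # E000..FFFF move down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates move up to F800..FFFF
    return unit


def code_point_order_key(s: str) -> list:
    """Sort key: code unit comparison that agrees with code point order."""
    return [fixup(u) for u in utf16_units(s)]


bmp = "\uff5e"                 # U+FF5E, a single code unit
supplementary = "\U00010000"   # U+10000, the pair D800 DC00

# Raw code unit order puts U+10000 first because D800 < FF5E.
print(utf16_units(bmp) < utf16_units(supplementary))
# After the fix-up, the order matches the code point order.
print(code_point_order_key(bmp) < code_point_order_key(supplementary))

samples = ["a", "\ud7ff", "\ue000", "\uff5e", "\uffff", "\U00010000",
           "\U0001f600", "\U0010ffff", "a\U00010000", "a\uff5e"]
by_units = sorted(samples, key=utf16_units)
by_fixed_units = sorted(samples, key=code_point_order_key)
```
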
## Overview of UTF-8

To meet the requirements of byte-oriented, ASCII-based systems, the Unicode
Standard defines UTF-8. UTF-8 is a variable-length, byte-based encoding that
preserves ASCII transparency.

UTF-8 maintains transparency for all the ASCII code values (0..127). These
values do not appear in any byte of a transformed result except as the direct
representation of the ASCII values. Thus, ASCII text is also UTF-8 text.

Characteristics of UTF-8 include:

1.  Unicode code points 0 to 7F<sub>16</sub> are each encoded with a single byte of the
    same value. Therefore, ASCII characters take up 50% less space with UTF-8
    encoding than with UTF-16.

2.  All other code points are encoded with multibyte sequences, with the first
    byte (lead byte) indicating the number of bytes that follow (trail bytes).
    This results in very efficient parsing. In the Unicode definition of UTF-8,
    the lead bytes are in the range C2<sub>16</sub> to F4<sub>16</sub>, and the trail bytes are in
    the range 80<sub>16</sub> to BF<sub>16</sub>. The byte values C0<sub>16</sub>, C1<sub>16</sub>, and F5<sub>16</sub> to FF<sub>16</sub>
    are never used.

3.  UTF-8 is relatively compact and resource conservative in its use of the
    bytes required for encoding text in European scripts, but uses 50% more
    space than UTF-16 for East Asian text. Code points up to 7FF<sub>16</sub> take up two
    bytes, code points up to FFFF<sub>16</sub> take up three (50% more memory than UTF-16),
    and all others four.

4.  Binary comparisons of UTF-8 strings based on their bytes result in the same
    order as comparing code point values.

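The characteristics listed above follow directly from the bit layout of UTF-8. The sketch below encodes a single code point by hand and then checks the binary-order property on a few sample strings. It is a Python illustration, not ICU code; `encode_utf8` is a hypothetical helper name.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8 (1 to 4 bytes)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                       # ASCII: one byte of the same value
        return bytes([cp])
    if cp <= 0x7FF:                      # lead byte 110xxxxx + 1 trail byte
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # lead byte 1110xxxx + 2 trail bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # lead byte 11110xxx + 3 trail bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_utf8(0x20AC).hex())

# Byte-wise comparison of UTF-8 gives the same order as code point comparison.
strings = ["z", "\u00e9", "\uff5e", "\U00010000", "a\u20ac", "a\U0001f600"]
by_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
by_code_points = sorted(strings)
```
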
## Overview of UTF-32

The UTF-32 encoding form always uses one single 32-bit integer per Unicode code
point. This results in a very simple encoding.

The drawback is its memory consumption: Since code point values use only 21
bits, about one-third of the memory is always unused, and since most commonly
used characters have code point values of up to FFFF<sub>16</sub>, they take up only one
16-bit unit in UTF-16 (50% less) and up to three bytes in UTF-8 (25% less).

UTF-32 is mainly used in APIs that are defined with the same data type for both
code points and code units. Many modern implementations of the C standard
library that support Unicode use a 32-bit `wchar_t` with UTF-32 semantics.

## Overview of SCSU

SCSU (Standard Compression Scheme for Unicode) is designed to reduce the size of
Unicode text for both input and output. It is a simple compression that
transforms the text into a byte stream. It typically uses one byte per character
in small scripts, and two bytes per character in large, East Asian scripts.

It is usually shorter than any of the UTFs. However, SCSU is stateful, which
makes it unsuitable for internal processing. It also uses all possible byte
values, which might require additional processing for protocols such as SMTP
(email).

See also <https://www.unicode.org/reports/tr6/>.

## Other Unicode Encodings

Other Unicode encodings have been developed over time for various purposes. Most
of them are implemented in ICU; see
[source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt).

1.  BOCU-1: Binary-Ordered Compression of Unicode
    An encoding of Unicode that is about as compact as SCSU but has a much
    smaller amount of state. Unlike SCSU, it preserves code point order and can
    be used in 8-bit emails without a transfer encoding. BOCU-1 does **not**
    preserve ASCII characters in ASCII-readable form. See [Unicode Technical
    Note #6](http://www.unicode.org/notes/tn6/).

2.  UTF-7: Designed for 7-bit emails; simple and not very compact. Since email
    systems have been 8-bit safe for several years, UTF-7 is not necessary any
    more and not recommended. Most ASCII characters are readable, others are
    base64-encoded. See [RFC 2152](http://www.ietf.org/rfc/rfc2152.txt).

3.  IMAP-mailbox-name: A variant of UTF-7 that is suitable for expressing
    Unicode strings as ASCII characters for Unix filenames.
    **The name "IMAP-mailbox-name" is specific to ICU!**
    See [RFC 2060 INTERNET MESSAGE ACCESS PROTOCOL - VERSION
    4rev1](http://www.ietf.org/rfc/rfc2060.txt) section 5.1.3. Mailbox
    International Naming Convention.

4.  UTF-EBCDIC: An EBCDIC-friendly encoding that is similar to UTF-8. See
    [Unicode Technical Report #16](http://www.unicode.org/reports/tr16/). **As
    of ICU 2.6, UTF-EBCDIC is not implemented in ICU.**

5.  CESU-8: Compatibility Encoding Scheme for UTF-16: 8-Bit
    An incompatible variant of UTF-8 that preserves 16-bit-Unicode (UTF-16)
    string order instead of code point order. Not for open interchange. See
    [Unicode Technical Report #26](http://www.unicode.org/reports/tr26/).

## Programming using UTFs

Programming using any of the UTFs is much more straightforward than with
traditional multi-byte character encodings, even though UTF-8 and UTF-16 are
also variable-width encodings.

Within each Unicode encoding form, the code unit values for singletons (code
units that alone encode characters), lead units, and trail units are all
disjoint. This has crucial implications for implementations:

1.  The lead unit determines the number of units for one code point. This is
    especially important for UTF-8, where there can be up to 4 bytes per
    character.

2.  Boundaries are easy to determine. When accessing text at a random offset,
    you can always find the nearest code point boundaries with a small number
    of machine instructions.

3.  There is no overlap. When searching for string A in string B, you never get
    a false match on code points, so you do not need to convert to code points
    for string searching. False matches never occur because the end of one
    sequence is never the same as the start of another sequence. Overlap is one
    of the biggest problems with common multi-byte encodings like Shift-JIS. All
    the UTFs avoid this problem.

4.  Iteration is simple. Getting the next or previous code point is
    straightforward and takes only a small number of machine instructions.

5.  UTF-16 is fully symmetric. You can determine from any single code unit
    whether it is the first, last, or only one for a code point. Moving
    (iterating) in either direction through UTF-16 text is equally fast and
    efficient.

6.  Indexing by code points is slow. This is a disadvantage of all
    variable-width encodings. Except in UTF-32, it is inefficient to find the
    code unit boundaries corresponding to the nth code point or to find the code
    point offset containing the nth code unit. Both involve scanning from the
    start of the text or from a last known boundary. ICU, like most common APIs,
    always indexes by code units. It counts code units and not code points.

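Two of these properties, boundary detection and the absence of overlap, can be demonstrated with a brief sketch. The Python code below is an illustration only and does not reflect how ICU implements these operations; the helper names are hypothetical, and the Shift-JIS example character (U+30BD, whose second byte equals the ASCII backslash) is chosen for the demonstration.

```python
def utf8_boundary(data: bytes, i: int) -> int:
    """Move index i back to the first byte of the code point containing it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx marks a trail byte
        i -= 1
    return i


def utf16_boundary(units: list, i: int) -> int:
    """Move index i back by one if it points at the trail unit of a pair."""
    if i > 0 and 0xDC00 <= units[i] <= 0xDFFF and 0xD800 <= units[i - 1] <= 0xDBFF:
        return i - 1
    return i


utf8_text = "a\u20ac\U0001f600".encode("utf-8")   # 1 + 3 + 4 bytes
utf16_text = [0x0061, 0xD83D, 0xDE00]             # "a" followed by U+1F600

print(utf8_boundary(utf8_text, 6), utf16_boundary(utf16_text, 2))

# Overlap: in Shift-JIS the second byte of U+30BD is 0x5C, the ASCII
# backslash, so a byte search for a backslash reports a false match.
katakana_so = "\u30bd"
false_match_sjis = b"\\" in katakana_so.encode("shift_jis")
false_match_utf8 = b"\\" in katakana_so.encode("utf-8")
```
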
Conversion between different UTFs is very fast. Unlike converting to and from
legacy encodings like Latin-2, conversion between UTFs does not require table
look-ups.

ICU provides two basic data type definitions for Unicode. `UChar32` is a 32-bit
type for code points, and is used for single Unicode characters. It may be
signed or unsigned. It is the same as `wchar_t` if it is 32 bits wide. `UChar`
is an unsigned 16-bit integer for UTF-16 code units. It is the base type for
strings (`UChar *`), and it is the same as `wchar_t` if it is 16 bits wide.

Some higher-level APIs, used especially for formatting, use characters closer to
a representation for a glyph. Such "user characters" are also called "graphemes"
or "grapheme clusters" and require strings so that combining sequences can be
included.

## Serialized Formats

In files, input, output, and network protocols, text must be accompanied by the
specification of its character encoding scheme for a client to be able to
interpret it correctly. (This is called a "charset" in Internet protocols.)
However, an encoding scheme specification is not necessary if the text is only
used within a single platform, protocol, or application where it is otherwise
clear what the encoding is. (The language and text directionality should usually
be specified to enable spell checking, text-to-speech transformation, etc.)

*The discussion of encoding specifications in this section applies to standard
Internet protocols where charset name strings are used. Other protocols may use
numeric encoding identifiers and assign different semantics to those identifiers
than Internet protocols.*

Typically, the encoding specification is done in a protocol- and document
format-dependent way. However, the Unicode standard offers a mechanism for
tagging text files with a "signature" for cases where protocols do not identify
character encoding schemes.

The character ZERO WIDTH NO-BREAK SPACE (FEFF<sub>16</sub>) can be used as a signature by
prepending it to a file or stream. The alternative function of U+FEFF as a
format control character has been copied to U+2060 WORD JOINER, and U+FEFF
should only be used for Unicode signatures.

The different character encoding schemes generate different, distinct byte
sequences for U+FEFF:

1.  UTF-8: EF BB BF

2.  UTF-16BE: FE FF

3.  UTF-16LE: FF FE

4.  UTF-32BE: 00 00 FE FF

5.  UTF-32LE: FF FE 00 00

6.  SCSU: 0E FE FF

7.  BOCU-1: FB EE 28

8.  UTF-7: 2B 2F 76 ( 38 | 39 | 2B | 2F )

9.  UTF-EBCDIC: DD 73 66 73

ICU provides the function `ucnv_detectUnicodeSignature()` for Unicode signature
detection.

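The signature table above translates almost directly into a prefix lookup. The following Python sketch shows one way to write such a check; it is not the implementation of `ucnv_detectUnicodeSignature()`, and the name `detect_signature` and its return convention are assumptions made for this example. Longer signatures are tested first because the UTF-32LE signature begins with the UTF-16LE signature.

```python
# Fixed byte signatures from the list above, longest prefixes first.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def detect_signature(data: bytes):
    """Return (charset name, signature length), or (None, 0) if none is found."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    # UTF-7: 2B 2F 76 followed by one of 38, 39, 2B, 2F
    if data[:3] == b"+/v" and data[3:4] in (b"8", b"9", b"+", b"/"):
        return "UTF-7", 4
    return None, 0


print(detect_signature("\ufeffhello".encode("utf-16-le")))
print(detect_signature(b"plain ASCII text"))
```

A prefix check like this remains a heuristic: for example, the bytes FF FE 00 00 could also be UTF-16LE text that starts with U+FEFF followed by U+0000.
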
*There is no signature for CESU-8 separate from the one for UTF-8. UTF-8 and
CESU-8 encode U+FEFF and in fact all BMP code points with the same bytes. The
opportunity for misidentification of one as the other is one of the reasons why
CESU-8 should only be used in limited, closed, specific environments.*

In UTF-16 and UTF-32, where the signature also distinguishes between big-endian
and little-endian byte orders, it is also called a byte order mark (BOM). The
signature works for UTF-16 since the code point that has the byte-swapped
encoding, FFFE<sub>16</sub>, will never be a valid Unicode character. (It is a
"non-character" code point.) In Internet protocols, if an encoding specification
of "UTF-16" or "UTF-32" is used, it is expected that there is a signature byte
sequence (BOM) that identifies the byte ordering, which is not the case for the
encoding scheme/charset names with "BE" or "LE".

*If text is specified to be encoded in the UTF-16 or UTF-32 charset and does not
begin with a BOM, then it must be interpreted as UTF-16BE or UTF-32BE,
respectively.*

A signature is not part of the content, and must be stripped when processing.
For example, blindly concatenating two files will give an incorrect result.

If a signature was detected, then the signature "character" U+FEFF should be
removed from the Unicode stream **after** conversion. Removing the signature
bytes before conversion could cause the conversion to fail for stateful
encodings like BOCU-1 and UTF-7.

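Stripping the signature after conversion, and the effect of blindly concatenating two signed files, can be sketched as follows. This is a Python illustration using the built-in codecs rather than ICU converters; `decode_with_signature` is a hypothetical helper, and it assumes the caller already knows the encoding.

```python
def decode_with_signature(data: bytes, encoding: str = "utf-8") -> str:
    """Convert to Unicode first, then drop a leading U+FEFF signature."""
    text = data.decode(encoding)
    return text[1:] if text.startswith("\ufeff") else text


signed_a = "\ufeffalpha".encode("utf-8")
signed_b = "\ufeffbeta".encode("utf-8")

print(decode_with_signature(signed_a))

# Blind concatenation leaves the second signature inside the content.
joined = decode_with_signature(signed_a + signed_b)
print(repr(joined))
```
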
Whether a signature is to be recognized or not depends on the protocol or
application.

1.  If a protocol specifies a charset name, then the byte stream must be
    interpreted according to how that name is defined. Only the "UTF-16" and
    "UTF-32" names include recognition of the byte order marks that are specific
    to them (and the ICU converters for these names do this automatically). None
    of the other Unicode charsets are defined to include any signature/BOM
    handling.

2.  If no charset name is provided, for example for text files in most
    filesystems, then applications must usually rely on heuristics to determine
    the file encoding. Many document formats contain an embedded or implicit
    encoding declaration, but for plain text files it is reasonable to use
    Unicode signatures as simple and reliable heuristics. This is especially
    common on Windows systems. However, some tools for plain text file handling
    (e.g., many Unix command line tools) are not prepared for Unicode
    signatures.

## The Unicode Standard Is An Industry Standard

The Unicode standard is an industry standard and parallels ISO 10646-1. Around
1993, these two standards were effectively merged into the same character set
standard. Both standards have the same character repertoire and the same
encoding forms and schemes.

One difference used to be that the ISO standard defined code point values to be
from 0 to 7FFFFFFF<sub>16</sub>, not just up to 10FFFF<sub>16</sub>. The ISO work group decided to add
an amendment to the standard. The amendment removes this difference by declaring
that no characters will ever be assigned code points above 10FFFF<sub>16</sub>. The main
reason for the ISO work group's decision is interoperability between the UTFs.
UTF-16 cannot encode any code points above this limit.

This means that the code point space for both Unicode and ISO 10646 is now the
same! **These changes to ISO 10646 have been made recently and should be
complete in the edition ISO 10646:2003 which also combines all parts of the
standard into one.**

The former, larger code space is the reason why the ISO definition of UTF-8
specifies sequences of five and six bytes to cover that whole range.

Another difference is that the ISO standard defines encoding forms "UCS-4" and
"UCS-2". UCS-4 is essentially UTF-32 with a theoretical upper limit of
7FFFFFFF<sub>16</sub>, using 31 out of the 32 bits. However, in practice, the ISO committee
has accepted that the characters above 10FFFF<sub>16</sub> will not be encoded, so there is
essentially no difference between the forms. The "4" stands for "four-byte
form".

UCS-2 is a subset of UTF-16 that is limited to code points from 0 to FFFF<sub>16</sub>,
excluding the surrogate code points. Thus, it cannot represent the characters
with code points above FFFF<sub>16</sub> (called supplementary characters).

*There is no conversion necessary between UCS-2 and UTF-16. The difference is
only in the interpretation of surrogates.*

The standards differ in what kind of information they provide: The Unicode
standard provides more character properties and describes algorithms etc., while
the ISO standard defines collections, subsets and similar.

The standards are synchronized, and the respective committees work together to
add new characters and assign code point values.

541standard provides more character properties and describes algorithms etc., while
542the ISO standard defines collections, subsets and similar.
543
544The standards are synchronized, and the respective committees work together to
545add new characters and assign code point values.
546