---
layout: default
title: Conversion
nav_order: 700
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Conversion
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Conversion Overview

A converter is used to convert from one character encoding to another. In the
case of ICU, the conversion is always between Unicode and another encoding, or
vice-versa. A text encoding is a particular mapping from a given character set
definition to the actual bits used to represent the data.

Unicode provides a single character set that covers the major languages of the
world, and a small number of machine-friendly encoding forms and schemes to fit
the needs of existing applications and protocols. It is designed for best
interoperability with both ASCII and ISO-8859-1 (the most widely used character
sets) to make it easier for Unicode to be used in almost all applications and
protocols.

Hundreds of encodings have been developed over the years, each for small groups
of languages and for special purposes. As a result, the interpretation of text,
input, sorting, display, and storage depends on knowledge of all the different
types of character sets and their encodings. Programs have been written either
to handle a single encoding at a time and switch between encodings, or to
convert between external and internal encodings.

There is no single, authoritative source of precise definitions of many of the
encodings and their names. However,
[IANA](http://www.iana.org/assignments/character-sets) is the best source for
names, and our Character Set repository is a good source of encoding definitions
for each platform.

Transferring text from one machine to another often causes some loss of
information, because platforms can interpret the same text differently. For
example, Shift-JIS can be interpreted differently on Windows™ compared to UNIX®.
Windows maps byte value 0x5C to the backslash symbol, while some UNIX machines
map that byte value to the Yen symbol. Another problem arises when a character
in the codepage looks like both the Unicode Greek letter Mu and the Unicode
micro symbol. Some platforms map this codepage byte sequence to one Unicode
character, while other platforms map it to the other Unicode character.
Fallbacks can partially fix this problem by mapping both Unicode characters to
the same codepage byte sequence. Even though some character information is
lost, the text is still readable.
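
The fallback idea described above can be sketched with a toy table. This is only
an illustration under stated assumptions: the two-row table, the `ToyMapping`
type, and the `toy_from_unicode` and `toy_to_unicode` helpers are hypothetical
and are neither ICU data nor ICU API. Byte 0xB5 is used because it is the micro
sign in ISO-8859-1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a toy Unicode-to-codepage table (hypothetical, not ICU data). */
typedef struct {
    uint32_t code_point;
    uint8_t  byte;
    bool     is_fallback; /* true: one-way mapping, used from Unicode only */
} ToyMapping;

static const ToyMapping TOY_TABLE[] = {
    { 0x00B5, 0xB5, false }, /* U+00B5 MICRO SIGN: round-trip mapping */
    { 0x03BC, 0xB5, true  }, /* U+03BC GREEK SMALL LETTER MU: fallback */
};

static const size_t TOY_TABLE_LEN = sizeof TOY_TABLE / sizeof TOY_TABLE[0];

/* Unicode -> codepage. Returns the byte, or -1 if the code point is unmapped
   or is only reachable through a fallback while fallbacks are disabled. */
int toy_from_unicode(uint32_t code_point, bool allow_fallback)
{
    for (size_t i = 0; i < TOY_TABLE_LEN; i++) {
        const ToyMapping *m = &TOY_TABLE[i];
        if (m->code_point == code_point && (allow_fallback || !m->is_fallback)) {
            return m->byte;
        }
    }
    return -1;
}

/* Codepage -> Unicode. Only round-trip rows are consulted, so the result is
   unambiguous. Returns the code point, or -1 if the byte is unmapped. */
int32_t toy_to_unicode(uint8_t byte)
{
    for (size_t i = 0; i < TOY_TABLE_LEN; i++) {
        const ToyMapping *m = &TOY_TABLE[i];
        if (m->byte == byte && !m->is_fallback) {
            return (int32_t)m->code_point;
        }
    }
    return -1;
}
```

With fallbacks allowed, both U+00B5 and U+03BC encode to byte 0xB5, and decoding
0xB5 always yields U+00B5, so a Greek mu does not survive a round trip even
though the text stays readable. With fallbacks disabled,
`toy_from_unicode(0x03BC, false)` reports the character as unmapped instead.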

ICU's converter API has the following main features:

1.  Unicode surrogate support

2.  Support for all major encodings

3.  Consistent text conversion across all computer platforms

4.  Streaming (buffered) conversion of text data through the API

5.  Fast text conversion

6.  Support for fallbacks to the codepage

7.  Support for reverse fallbacks to Unicode

8.  Callbacks for handling and substituting invalid or unmapped byte
    sequences

9.  User-added support for otherwise unsupported encodings

This section deals with the processes of converting text to and from Unicode.

## Recommendations

1.  **Use Unicode encodings whenever possible.** Together with Unicode for
    internal processing, this makes completely globalized systems possible and
    avoids the many problems with non-algorithmic conversions. (For a discussion
    of such problems, see for example ["Character Conversions and Mapping
    Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
    on <http://icu-project.org/docs/> and the [XML Japanese
    Profile](http://www.w3.org/TR/japanese-xml/).)

    1.  Use UTF-8 and UTF-16.

    2.  Use UTF-16BE, SCSU and BOCU-1 as appropriate.

    3.  In special environments, other Unicode encodings may be used as well,
        such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and
        CESU-8. (For turning Unicode filenames into ASCII-only filename strings,
        the IMAP-mailbox-name encoding can be used.)

    4.  Do not exchange text with single/unpaired surrogates.

2.  **Use legacy charsets only when absolutely necessary.** For best data
    fidelity:

    1.  ISO-8859-1 is relatively unproblematic, provided that its limited
        character repertoire is sufficient, because it is converted trivially
        (1:1) to Unicode, avoiding conversion table problems for its small set
        of characters. (By contrast, proper conversion from US-ASCII requires a
        check for illegal byte values 0x80..0xff, which is an unnecessary
        complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly
        as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)

    2.  If you need to communicate with a certain platform, then use the same
        conversion tables as that platform itself, or at least ones that are
        very, very close.

    3.  ICU's conversion table repository contains hundreds of Unicode
        conversion tables from a number of common vendors and platforms, as
        well as comparisons between these conversion tables:
        <https://icu.unicode.org/charts/charset>.

    4.  Do not trust codepage documentation that is not machine-readable, such
        as nice-looking charts. Such documentation is usually incomplete and
        out of date.

    5.  ICU's default build includes about 200 conversion tables. See the [ICU
        Data](../icudata.md) chapter for how to add or remove conversion tables
        and other data.

    6.  In ICU, you can (and should) also use APIs that map a charset name
        together with a standard/platform name. This allows you to get different
        converters for the same ambiguous charset name (like "Shift-JIS"),
        depending on the standard or platform specified. See the
        [convrtrs.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt)
        alias table, the [Using Converters](converters.md) chapter and [API
        references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

    7.  For data exchange (rather than pure display), turn off fallback
        mappings: `ucnv_setFallback(cnv, false);`

    8.  For some text formats, especially XML and HTML, it is possible to set an
        "escape callback" function that turns unmappable Unicode code points
        into corresponding escape sequences, preventing data loss. See the API
        references and the [ucnv sample
        code](https://github.com/unicode-org/icu/tree/main/icu4c/source/samples/ucnv/).

    9.  **Never modify a conversion table.** Instead, use existing ones that
        match precisely those in systems with which you communicate. "Modifying"
        a conversion table in reality just creates a new one, which makes the
        whole situation even less manageable.

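
The checks behind recommendations 1.4 and 2.1 can be illustrated in a few lines
of plain C. This is a sketch under stated assumptions: the helper names
`utf16_has_unpaired_surrogate`, `latin1_to_utf16`, and `ascii_to_utf16` are
hypothetical and are not ICU functions. They only show what each check involves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the UTF-16 text contains a lead surrogate without a
   following trail surrogate, or a trail surrogate without a preceding lead. */
bool utf16_has_unpaired_surrogate(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {               /* lead surrogate */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return true;
            }
            i++;                                        /* skip the trail */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {        /* stray trail */
            return true;
        }
    }
    return false;
}

/* ISO-8859-1 to UTF-16: each byte value is its own code point, so no table
   and no error path are needed. dest must have room for len code units. */
void latin1_to_utf16(const uint8_t *src, size_t len, uint16_t *dest)
{
    for (size_t i = 0; i < len; i++) {
        dest[i] = src[i];
    }
}

/* US-ASCII to UTF-16: the same copy, plus the check for illegal byte values
   0x80..0xFF. Returns false at the first illegal byte, leaving dest
   partially written. */
bool ascii_to_utf16(const uint8_t *src, size_t len, uint16_t *dest)
{
    for (size_t i = 0; i < len; i++) {
        if (src[i] >= 0x80) {
            return false;
        }
        dest[i] = src[i];
    }
    return true;
}
```

A sender could call `utf16_has_unpaired_surrogate` before exchanging text and
refuse or repair the data when it returns true. The two conversion helpers show
why ISO-8859-1 needs no error handling, while US-ASCII needs the extra range
check.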