---
layout: default
title: Conversion
nav_order: 4
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Conversion
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Conversion Overview

A converter converts text from one character encoding to another. In ICU, the
conversion is always between Unicode and another encoding, in either direction.
A text encoding is a particular mapping from a given character set definition
to the actual bits used to represent the data.

Unicode provides a single character set that covers the major languages of the
world, and a small number of machine-friendly encoding forms and schemes to fit
the needs of existing applications and protocols. It is designed for best
interoperability with both ASCII and ISO-8859-1 (the most widely used character
sets) to make it easier for Unicode to be used in almost all applications and
protocols.

Hundreds of encodings have been developed over the years, each for small groups
of languages and for special purposes. As a result, the interpretation of text,
input, sorting, display, and storage depends on knowledge of all the different
types of character sets and their encodings. Programs have been written either
to handle a single encoding at a time and switch between them, or to convert
between external and internal encodings.

There is no single, authoritative source of precise definitions of many of the
encodings and their names. However,
[IANA](http://www.iana.org/assignments/character-sets) is the best source for
names, and our Character Set repository is a good source of encoding definitions
for each platform.

Transferring text from one machine to another often causes some loss of
information.
Some platforms interpret the same text differently than others. For example,
Shift-JIS can be interpreted differently on Windows™ compared to UNIX®. Windows
maps byte value 0x5C to the backslash symbol, while some UNIX machines map that
byte value to the Yen symbol. Another problem arises when a character in the
codepage looks like the Unicode Greek letter Mu or the Unicode micro symbol.
Some platforms map this codepage byte sequence to one Unicode character, while
other platforms map it to the other Unicode character. Fallbacks can partially
fix this problem by mapping both Unicode characters to the same codepage byte
sequence. Even though some character information is lost, the text is still
readable.

ICU's converter API has the following main features:

1. Unicode surrogate support

2. Support for all major encodings

3. Consistent text conversion across all computer platforms

4. Text data can be streamed (buffered) through the API

5. Fast text conversion

6. Supports fallbacks to the codepage

7. Supports reverse fallbacks to Unicode

8. Allows callbacks for handling and substituting invalid or unmapped byte
   sequences

9. Allows a user to add support for unsupported encodings

This section deals with the processes of converting encodings to and from
Unicode.

## Recommendations

1. **Use Unicode encodings whenever possible.** Together with Unicode for
   internal processing, it makes completely globalized systems possible and
   avoids the many problems with non-algorithmic conversions. (For a discussion
   of such problems, see for example ["Character Conversions and Mapping
   Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
   on <http://icu-project.org/docs/> and the [XML Japanese
   Profile](http://www.w3.org/TR/japanese-xml/)).

   1. Use UTF-8 and UTF-16.

   2. Use UTF-16BE, SCSU and BOCU-1 as appropriate.

   3. In special environments, other Unicode encodings may be used as well,
      such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and
      CESU-8. (For turning Unicode filenames into ASCII-only filename strings,
      the IMAP-mailbox-name encoding can be used.)

   4. Do not exchange text with single/unpaired surrogates.

2. **Use legacy charsets only when absolutely necessary**. For best data
   fidelity:

   1. ISO-8859-1 is relatively unproblematic, provided its limited character
      repertoire is sufficient, because it is converted trivially (1:1) to
      Unicode, avoiding conversion table problems for its small set of
      characters. (By contrast, proper conversion from US-ASCII requires a
      check for illegal byte values 0x80..0xff, which is an unnecessary
      complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly
      as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)

   2. If you need to communicate with a certain platform, then use the same
      conversion tables as that platform itself, or at least ones that are
      very, very close.

   3. ICU's conversion table repository contains hundreds of Unicode
      conversion tables from a number of common vendors and platforms as well
      as comparisons between these conversion tables:
      <http://icu-project.org/charts/charset/>.

   4. Do not trust codepage documentation that is not machine-readable, such
      as nice-looking charts. Such documentation is usually incomplete and out
      of date.

   5. ICU's default build includes about 200 conversion tables. See the [ICU
      Data](../icudata.md) chapter for how to add or remove conversion tables
      and other data.

   6. In ICU, you can (and should) also use APIs that map a charset name
      together with a standard/platform name. This allows you to get different
      converters for the same ambiguous charset name (like "Shift-JIS"),
      depending on the standard or platform specified.
      See the
      [convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt)
      alias table, the [Using Converters](converters.md) chapter and [API
      references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

   7. For data exchange (rather than pure display), turn off fallback
      mappings: `ucnv_setFallback(cnv, FALSE)`.

   8. For some text formats, especially XML and HTML, it is possible to set an
      "escape callback" function that turns unmappable Unicode code points
      into corresponding escape sequences, preventing data loss. See the API
      references and the [ucnv sample
      code](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/ucnv/).

   9. **Never modify a conversion table.** Instead, use existing ones that
      match precisely those in systems with which you communicate. "Modifying"
      a conversion table in reality just creates a new one, which makes the
      whole situation even less manageable.
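
The contrast drawn above between ISO-8859-1 and US-ASCII can be illustrated
with a small, library-independent sketch. It does not use the ICU converter
API; the function names `latin1_to_utf16` and `ascii_to_utf16` are hypothetical
and exist only for this illustration. The first function relies on every
ISO-8859-1 byte value being equal to its Unicode code point, so it cannot fail.
The second performs the same copy but must reject byte values 0x80..0xff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU.
 * ISO-8859-1 -> UTF-16: each byte value equals its Unicode code point,
 * so the conversion is a widening copy with no failure case.
 * dst must have room for len UTF-16 code units. */
void latin1_to_utf16(const uint8_t *src, size_t len, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; ++i) {
        dst[i] = src[i];
    }
}

/* Hypothetical helper, not part of ICU.
 * US-ASCII -> UTF-16: the same widening copy, except that byte values
 * 0x80..0xff are illegal and must be detected.
 * Returns true on success. On failure, returns false and stores the
 * offset of the first illegal byte in *bad_index. */
bool ascii_to_utf16(const uint8_t *src, size_t len, uint16_t *dst,
                    size_t *bad_index)
{
    assert(src != NULL && dst != NULL && bad_index != NULL);
    for (size_t i = 0; i < len; ++i) {
        if (src[i] >= 0x80) {
            *bad_index = i;
            return false;
        }
        dst[i] = src[i];
    }
    return true;
}
```

For the four input bytes `0x63 0x61 0x66 0xE9`, `latin1_to_utf16` stores
U+00E9 in the last output unit, while `ascii_to_utf16` returns `false` and
reports offset 3. In a real application the ICU converters should be used
instead of hand-written loops; the sketch only shows why the ISO-8859-1 case
needs no conversion table and no error path.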