---
layout: default
title: Conversion
nav_order: 4
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Conversion
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Conversion Overview

A converter converts text from one character encoding to another. In ICU,
conversion is always between Unicode and another encoding, in either direction.
A text encoding is a particular mapping from a given character set definition to
the actual bits used to represent the data.
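
To make "mapping to actual bits" concrete, here is a minimal sketch in plain C
that encodes one Unicode code point in two Unicode encoding forms. It does not
use ICU; the function names `encode_utf8` and `encode_utf16` are illustrative
assumptions, not ICU API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8.
 * Returns the number of bytes written (1..4), or 0 if cp is a surrogate
 * code point or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, uint8_t out[4]) {
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;  /* surrogates are not encodable on their own */
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encodes one code point as UTF-16.
 * Returns the number of 16-bit code units written (1 or 2), or 0 if cp is
 * a surrogate code point or lies above U+10FFFF. */
size_t encode_utf16(uint32_t cp, uint16_t out[2]) {
    assert(out != NULL);
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;  /* supplementary code point: split into a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

The same character, U+00E9, becomes the two bytes 0xC3 0xA9 in UTF-8 and the
single code unit 0x00E9 in UTF-16, while ISO-8859-1 represents it as the single
byte 0xE9.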

Unicode provides a single character set that covers the major languages of the
world, and a small number of machine-friendly encoding forms and schemes to fit
the needs of existing applications and protocols. It is designed for best
interoperability with both ASCII and ISO-8859-1 (the most widely used character
sets) to make it easier for Unicode to be used in almost all applications and
protocols.

Hundreds of encodings have been developed over the years, each for small groups
of languages and for special purposes. As a result, interpreting, entering,
sorting, displaying, and storing text all depend on knowledge of the many
different character sets and their encodings. Programs have been written either
to handle a single encoding at a time and switch between them, or to convert
between external and internal encodings.

There is no single, authoritative source of precise definitions of many of the
encodings and their names. However,
[IANA](http://www.iana.org/assignments/character-sets) is the best source for
names, and our Character Set repository is a good source of encoding definitions
for each platform.

Transferring text from one machine to another often causes some loss of
information, because platforms can interpret the same text differently. For
example, Shift-JIS is interpreted differently on Windows™ than on UNIX®: Windows
maps byte value 0x5C to the backslash symbol, while some UNIX machines map that
byte value to the Yen symbol. A similar problem arises when a character in the
codepage looks like both the Unicode Greek letter Mu and the Unicode micro
symbol. Some platforms map this codepage byte sequence to one Unicode character,
while other platforms map it to the other. Fallbacks can partially fix this
problem by mapping both Unicode characters to the same codepage byte sequence.
Even though some character information is lost, the text is still readable.
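
The fallback idea can be sketched with a tiny from-Unicode table in plain C.
This is not how ICU stores its tables; the `Mapping` type, the two-entry
`kTable`, and `from_unicode` are hypothetical names for illustration. Byte 0xB5
is where ISO-8859-1 places the micro sign; treating Greek Mu as a fallback to
that same byte is an assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One from-Unicode mapping entry in a hypothetical single-byte codepage. */
typedef struct {
    uint32_t code_point;
    uint8_t  byte;
    bool     is_fallback;  /* true: one-way mapping that does not round-trip */
} Mapping;

static const Mapping kTable[] = {
    { 0x00B5, 0xB5, false },  /* MICRO SIGN: regular round-trip mapping */
    { 0x03BC, 0xB5, true  },  /* GREEK SMALL LETTER MU: fallback (assumed) */
};

/* Converts one code point to a codepage byte.
 * Fallback entries are used only when use_fallback is true.
 * Returns true and stores the byte in *out if the code point is mappable. */
bool from_unicode(uint32_t cp, bool use_fallback, uint8_t *out) {
    assert(out != NULL);
    if (cp < 0x80) {          /* ASCII range maps to itself in this sketch */
        *out = (uint8_t)cp;
        return true;
    }
    for (size_t i = 0; i < sizeof kTable / sizeof kTable[0]; ++i) {
        if (kTable[i].code_point == cp &&
            (use_fallback || !kTable[i].is_fallback)) {
            *out = kTable[i].byte;
            return true;
        }
    }
    return false;             /* unmappable in this codepage */
}
```

With fallbacks enabled, both U+00B5 and U+03BC produce byte 0xB5, so the text
stays readable but the distinction between the two characters is lost. With
fallbacks disabled, U+03BC is reported as unmappable instead.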

ICU's converter API has the following main features:

1.  Unicode surrogate support

2.  Support for all major encodings

3.  Consistent text conversion across all computer platforms

4.  Text data can be streamed (buffered) through the API

5.  Fast text conversion

6.  Fallbacks to the codepage

7.  Reverse fallbacks to Unicode

8.  Callbacks for handling and substituting invalid or unmapped byte
    sequences

9.  User-added support for otherwise unsupported encodings

This section deals with the processes of converting encodings to and from
Unicode.

## Recommendations

1.  **Use Unicode encodings whenever possible.** Combined with Unicode for
    internal processing, this makes completely globalized systems possible and
    avoids the many problems with non-algorithmic conversions. (For a discussion
    of such problems, see for example ["Character Conversions and Mapping
    Tables"](http://icu-project.org/docs/papers/conversions_and_mappings_iuc19.ppt)
    on <http://icu-project.org/docs/> and the [XML Japanese
    Profile](http://www.w3.org/TR/japanese-xml/)).

    1.  Use UTF-8 and UTF-16.

    2.  Use UTF-16BE, SCSU and BOCU-1 as appropriate.

    3.  In special environments, other Unicode encodings may be used as well,
        such as UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, UTF-EBCDIC, and
        CESU-8. (For turning Unicode filenames into ASCII-only filename strings,
        the IMAP-mailbox-name encoding can be used.)

    4.  Do not exchange text with single/unpaired surrogates.

2.  **Use legacy charsets only when absolutely necessary.** For best data
    fidelity:

    1.  ISO-8859-1 is relatively unproblematic — if its limited character
        repertoire is sufficient — because it is converted trivially (1:1) to
        Unicode, avoiding conversion table problems for its small set of
        characters. (By contrast, proper conversion from US-ASCII requires a
        check for illegal byte values 0x80..0xff, which is an unnecessary
        complication for modern systems with 8-bit bytes. ISO-8859-1 is nearly
        as ubiquitous for modern systems as US-ASCII was for 7-bit systems.)

    2.  If you need to communicate with a certain platform, then use the same
        conversion tables as that platform itself, or at least ones that are
        very, very close.

    3.  ICU's conversion table repository contains hundreds of Unicode
        conversion tables from a number of common vendors and platforms as well
        as comparisons between these conversion tables:
        <http://icu-project.org/charts/charset/>.

    4.  Do not trust codepage documentation that is not machine-readable, such
        as nice-looking charts. Such documentation is usually incomplete and
        out of date.

    5.  ICU's default build includes about 200 conversion tables. See the [ICU
        Data](../icudata.md) chapter for how to add or remove conversion tables
        and other data.

    6.  In ICU, you can (and should) also use APIs that map a charset name
        together with a standard/platform name. This allows you to get different
        converters for the same ambiguous charset name (like "Shift-JIS"),
        depending on the standard or platform specified. See the
        [convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt)
        alias table, the [Using Converters](converters.md) chapter and [API
        references](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

    7.  For data exchange (rather than pure display), turn off fallback
        mappings: `ucnv_setFallback(cnv, FALSE);`

    8.  For some text formats, especially XML and HTML, it is possible to set an
        "escape callback" function that turns unmappable Unicode code points
        into corresponding escape sequences, preventing data loss. See the API
        references and the [ucnv sample
        code](https://github.com/unicode-org/icu/tree/master/icu4c/source/samples/ucnv/).

    9.  **Never modify a conversion table.** Instead, use existing ones that
        match precisely those in systems with which you communicate. "Modifying"
        a conversion table in reality just creates a new one, which makes the
        whole situation even less manageable.
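
Two of the recommendations above can be illustrated with a short sketch in plain
C: the trivial ISO-8859-1 conversion compared with the US-ASCII check, and
escaping unmappable code points as XML decimal character references. The code
does not call ICU, and the function names are hypothetical; it only shows the
behavior that a converter and an escape callback would provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ISO-8859-1 to UTF-16: each byte value is its own code point, so there is
 * no table and no error path. dst must have room for n code units. */
void latin1_to_utf16(const uint8_t *src, size_t n, uint16_t *dst) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

/* US-ASCII to UTF-16: bytes 0x80..0xFF are illegal and must be detected.
 * Returns the number of bytes converted; a result smaller than n means
 * src[result] is the first illegal byte. */
size_t ascii_to_utf16(const uint8_t *src, size_t n, uint16_t *dst) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (src[i] >= 0x80) {
            return i;
        }
        dst[i] = src[i];
    }
    return n;
}

/* Code points to ISO-8859-1, writing "&#NNNN;" for anything above U+00FF.
 * dst receives a NUL-terminated string. Returns false if cap is too small. */
bool to_latin1_xml_escaped(const uint32_t *src, size_t n,
                           char *dst, size_t cap) {
    assert(src != NULL && dst != NULL);
    if (cap == 0) {
        return false;
    }
    size_t used = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i] <= 0xFF) {
            if (used + 1 >= cap) {   /* need room for the byte and the NUL */
                return false;
            }
            dst[used++] = (char)(unsigned char)src[i];
        } else {
            int len = snprintf(dst + used, cap - used, "&#%lu;",
                               (unsigned long)src[i]);
            if (len < 0 || (size_t)len >= cap - used) {
                return false;        /* escape sequence did not fit */
            }
            used += (size_t)len;
        }
    }
    dst[used] = '\0';
    return true;
}
```

For the input U+0041, U+00E9, U+20AC, the escaping function writes the bytes
0x41 and 0xE9 followed by the text `&#8364;`, so an XML or HTML consumer can
still recover the euro sign even though ISO-8859-1 cannot represent it.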