---
layout: default
title: Converter
nav_order: 1
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Using Converters
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

When designing applications around Unicode characters, it is sometimes necessary
to convert between Unicode encodings or between Unicode and legacy text data.
The vast majority of modern operating systems support Unicode to some degree,
but sometimes the legacy text data from older systems needs to be converted to
and from Unicode. This conversion process can be done with an ICU converter.

## ICU converters

ICU provides comprehensive character set conversion services, mapping tables,
and implementations for many encodings. Since ICU uses Unicode (UTF-16)
internally, all converters convert between UTF-16 (with the endianness according
to the current platform) and another encoding. This includes Unicode encodings.
In other words, internal text is 16-bit Unicode, while "external text" used as
source or target for a conversion is always treated as a byte stream.

ICU converters are available for a wide range of encoding schemes. Most of them
are based on mapping table data that is handled by a few generic implementations.
Some encodings are implemented algorithmically in addition to (or instead of)
using mapping tables, especially Unicode encodings.

All ICU converters map only single Unicode character code points to and from
single codepage character code points. ICU converters **do not** deal directly
with combining characters, bidirectional reordering, or Arabic shaping, for
example. Such processes, if required, must be handled separately. For example,
while in Unicode, the ICU BiDi APIs can be used for bidirectional reordering
after a conversion to Unicode or before a conversion from Unicode.

ICU converters are not designed to perform any encoding autodetection. This
means that the converters do not autodetect "endianness", the 6 Unicode encoding
signatures, or Shift-JIS vs. EUC-JP, etc. There are two exceptions: The
UTF-16 and UTF-32 converters work according to Unicode's specification of their
Character Encoding Schemes, that is, they read the BOM to figure out the actual
"endianness".
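
As a rough illustration of the BOM check mentioned above, the following sketch
inspects the first two bytes of a UTF-16 byte stream. It is not ICU code: the
`Utf16Bom` type and the `detect_utf16_bom()` function are hypothetical names
introduced here, and the sketch deliberately ignores UTF-32, whose
little-endian signature begins with the same two bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not an ICU type. */
typedef enum { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE } Utf16Bom;

/* Inspect the first two bytes of a UTF-16 byte stream for a byte order mark. */
static Utf16Bom detect_utf16_bom(const uint8_t *bytes, size_t length) {
    if (length >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            return BOM_UTF16_BE;   /* U+FEFF stored big-endian */
        }
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            return BOM_UTF16_LE;   /* U+FEFF stored little-endian */
        }
    }
    return BOM_NONE;               /* too short, or no signature present */
}
```

When no BOM is present, the sketch reports `BOM_NONE` and leaves the choice of
byte order to the caller.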

The ICU mapping tables mostly come from an [IBM® codepage
repository](http://www.ibm.com/software/globalization/cdra). For non-IBM
codepages, there is typically an equivalent codepage registered with this
repository. However, the textual data format (.ucm files) is generic, and data
for other codepage mapping tables can also be added.

## Using the Default Codepage

ICU has code to determine the default codepage of the system or process. This
default codepage can be used to convert `char *` strings to and from Unicode.

Depending on system design, setup and APIs, it may not always be possible to
find a default codepage that fully works as expected. For example,

1.  On Windows there are three encodings in use at the same time. Unicode
    (UTF-16) is always used inside of Windows, while for `char *` encodings there
    are two classes, called "ANSI" and "OEM" codepages. ICU will use the ANSI
    codepage. Note that the OEM codepage is used by default for console window
    output.

2.  On some UNIX-type systems, non-standard names are used for encodings, or
    non-standard encodings are used altogether. Although ICU supports over 200
    encodings in its standard build and many more aliases for them, it will not
    be able to recognize such non-standard names.

3.  Some systems do not have a notion of a system or process codepage, and may
    not have APIs for that.

If you have means of detecting a default codepage name that are more appropriate
for your application, then you should set that name with `ucnv_setDefaultName()`
as the first ICU function call. This makes sure that the internally cached
default converter will be instantiated from your preferred name.

Starting in ICU 2.0, when a converter for the default codepage cannot be opened,
a fallback default codepage name and converter will be used. On most platforms,
this will be US-ASCII. For z/OS (OS/390), ibm-1047,swaplfnl is the default
fallback codepage. For AS/400 (iSeries), ibm-37 is the default fallback
codepage. This default fallback codepage is used when the operating system is
using a non-standard name for a default codepage, or the converter was not
packaged with ICU. The feature allows ICU to run in unusual computing
environments without completely failing.

## Usage Model

A "Converter" refers to the C structure "UConverter". Converters are cheap to
create. Any data that is shared between converters of the same kind (such as the
mappings, the name and the properties) is automatically cached and shared in
memory.

### Converter Names

Codepages with encoding schemes have been given many names by various vendors
and platforms over the years. Vendors have different ways to specify which
codepage and encoding are being used. IBM uses a CCSID (Coded Character Set
IDentifier). Windows uses a CPID (CodePage IDentifier). Macintosh has a
TextEncoding. Many Unix vendors use
[IANA](http://www.iana.org/assignments/character-sets) character set names. Many
of these names are aliases to converters within ICU.

In order to help identify which names are recognized by certain platforms, ICU
provides several converter alias functions. The complete description of these
functions can be found in the [ICU API Reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

| Function Names | Short Description |
| -------------- | ----------------- |
| `ucnv_countAvailable`, `ucnv_getAvailableName` | Get a list of available converter names that can be opened. |
| `ucnv_openAllNames` | Get a list of all known converter names. |
| `ucnv_getName` | Get the name of an open converter. |
| `ucnv_countAliases`, `ucnv_getAlias` | Get the list of aliases for the specified converter. |
| `ucnv_countStandards`, `ucnv_getStandard` | Get the list of known standards. |
| `ucnv_openStandardNames` | Get a filtered list of aliases for a converter that is known by the specified standard. |
| `ucnv_getStandardName` | Get the preferred alias name specified by a given standard. |
| `ucnv_getCanonicalName` | Get the converter name from the alias that is recognized by the specified standard. |
| `ucnv_getDefaultName` | Get the default converter name that is currently used by ICU and the operating system. |
| `ucnv_setDefaultName` | Use this function to override the default converter name. |

Even though IANA specifies a list of aliases, it usually does not specify the
mappings or the actual character set for the aliases. Sometimes vendors will map
similar glyph variants to different Unicode code points or sometimes they will
assign completely different glyphs for the same codepage code point. Because of
these ambiguities, you can sometimes get `U_AMBIGUOUS_ALIAS_WARNING` for the
returned `UErrorCode` when more than one converter uses the requested alias. This
is only a warning, and the results can still be used. This `UErrorCode` value is
just a reminder that you may not get what you expected. The above functions can
help you to determine which converter you actually wanted.

EBCDIC-based converters have the option to swap the newline and linefeed
character mappings. This can be useful when transferring EBCDIC documents
between z/OS (MVS, OS/390 and the rest of the zSeries family) and another EBCDIC
machine like OS/400 on iSeries. The ",swaplfnl" or `UCNV_SWAP_LFNL_OPTION_STRING`
from ucnv.h can be appended to a converter alias in order to achieve this
behavior. You can view other available options in ucnv.h.

You can always skip many of these aliasing and mapping problems by just using
Unicode.

### Creating a Converter

There are four ways to create a converter:

1.  **By name**: Converters can be created using different types of names. No
    distinction is made when the converter is created, as to which name is being
    employed. There are many types of aliases possible. Among these are
    [IANA](http://www.iana.org/assignments/character-sets) names ("shift_jis",
    "koi8-r", or "iso-8859-3") and host-specific names ("cp1252", which is the
    name for a Microsoft® Windows™ or a similar IBM® codepage). Finally, ICU's
    own internal canonical names for a converter can be used. These include
    "UTF-8" or "ISO-8859-1" for built-in conversion types, and names such as
    "ibm-949_P110-2000" (Shift-JIS with '\\' <-> '¥' mapping) or
    "ibm-949_P11A-2000" (Shift-JIS with '\\' <-> '\\' mapping) for data-file
    based conversions.

    ```c
    UConverter *conv = ucnv_open("shift_jis", &myError);
    ```

    As a convenience, converter names can be passed in as Unicode (for example,
    if a user passed in the string from a Unicode-based user interface).
    However, the actual names are restricted to an invariant ASCII/EBCDIC
    subset.

    ```c
    UChar *name = ...;
    UConverter *conv = ucnv_openU(name, &myError);
    ```

    Converter names are case-insensitive. In addition, beginning with ICU 3.6,
    leading zeroes are ignored in sequences of digits (if further digits
    follow), and all non-alphanumeric characters are ignored. Thus the strings
    "UTF-8", "utf_8", "u\*T@f08" and "Utf 8" are equivalent. (Before ICU 3.6,
    leading zeroes were not ignored, and only spaces, dashes and underscores
    were ignored.) The `ucnv_compareNames()` function provides such string
    comparisons.

    Unlike the names of resources or other types of ICU data, converter names
    can **not** be qualified with a path that indicates the directory or common
    data file containing the corresponding converter data. The requested
    converter's data must be present either in the main ICU data library or as a
    separate file located in the ICU data directory. However, you can always
    create a package of converters with pkgdata and open a converter from the
    package with `ucnv_openPackage()`.

    ```c
    UConverter *conv = ucnv_openPackage("./myPackage.dat", "customConverter", &myError);
    ```

2.  **By number**: ICU is designed to accommodate codepages provided by
    different vendors. For example, the IBM CDRA (Character Data Representation
    Architecture, an IBM architecture that defines a set of identifiers)
    has an ID type called the CCSID (Coded Character Set Identifier). The ICU
    API for opening a codepage by number must be given a vendor along with the
    number. Currently, only IBM (`UCNV_IBM`) is supported. For example, the US
    EBCDIC codepage (IBM #37) can be opened with the following code:

    ```c
    ucnv_openCCSID(37, UCNV_IBM, &myErr);
    ```

3.  **By iteration**: An application might not know ahead of time which codepage
    to use, and thus might need to query ICU to determine the entire list of
    installed converters. ICU returns a list of its canonical (internal)
    names. From each name, the standard IANA name can be determined, as well as
    a list of aliases which point to that name. For example, ICU
    might return among the canonical names "ibm-367". That name itself may or
    may not provide the application or its users with the information needed.
    (367 is actually the decimal form of a number that is calculated by
    appending certain hex digits together.) However, the IANA name can be
    requested from this canonical name, which should return something like
    "us-ascii". The alias list for ibm-367 can be iterated over as well, which
    returns additional names like "ascii", "646", "ansi_x3.4-1968" etc. If this
    is not sufficient information, once a converter is opened, it can be queried
    for its type, min and max char size, etc. This information is not available
    without actually opening the converter (a fairly lightweight process).

    ```c
    /* Returns the number of available names */
    int32_t count = ucnv_countAvailable();
    /* Get the canonical name of the available converter at index 36 */
    const char *convName1 = ucnv_getAvailableName(36);
    /* Get the alias at index 3 for a given codepage */
    const char *asciiAlias = ucnv_getAlias("ibm-367", 3, &myError);
    /* Get the preferred IANA name of the converter */
    const char *ascii = ucnv_getStandardName("ibm-367", "IANA", &myError);
    /* Enumerate the IANA aliases of the converter */
    UEnumeration *asciiEnum =
        ucnv_openStandardNames("ibm-367", "IANA", &myError);
    uenum_next(asciiEnum, NULL, &myError); /* skip preferred IANA alias */
    /* Get one of the non-preferred IANA aliases */
    const char *ascii2 = uenum_next(asciiEnum, NULL, &myError);
    uenum_close(asciiEnum);
    ```

4.  **By using the default converter**: The default converter can be opened by
    passing a NULL as the name of the converter.

    ```c
    ucnv_open(NULL, &myErr);
    ```

> :point_right: **Note**: ICU chooses this converter based on the best information available to it.
> The purpose of this converter is to interface with the OS using a codepage (i.e. `char *`).
> Do not use it as a way of determining the best overall converter to use.
> Usually any Unicode encoding form is the best way to store and send text data,
> so that important data does not get lost in the conversion.
> Also, if the OS supports Unicode-based APIs (such as Win32),
> it is better to use only those Unicode APIs.
> As an example, the new Windows 2000 locales (such as Hindi) do not
> define the default codepage to something that supports Hindi.
> The default converter is used in expressions such as: `UnicodeString text("abc");`
> to convert 'abc', and in the `u_uastrcpy()` C function.
> Code operating at the [OS level](../design.md) MAY choose to
> change the default converter with `ucnv_setDefaultName()`.
> However, be aware that this change has inconsistent results if it is done after
> ICU components are initialized.
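
The name-matching rules described under **By name** above (case-insensitivity,
ignoring non-alphanumeric characters, and ignoring leading zeroes that are
followed by further digits) can be sketched in plain C. This is an illustration
of the rules only, not ICU's implementation: `normalize_name()` and
`names_match()` are hypothetical helpers, names longer than 63 characters are
truncated by the fixed buffers, and real code should call `ucnv_compareNames()`.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Write the normalized form of `name` into `out` (capacity `cap`, including NUL). */
static void normalize_name(const char *name, char *out, size_t cap) {
    size_t n = 0;
    bool after_digit = false;  /* true once a digit of the current run was kept */
    if (cap == 0) {
        return;
    }
    for (; *name != '\0' && n + 1 < cap; ++name) {
        unsigned char c = (unsigned char)*name;
        if (!isalnum(c)) {
            after_digit = false;         /* separators are dropped */
            continue;
        }
        if (isdigit(c)) {
            /* Drop a leading zero only if another digit follows it. */
            if (c == '0' && !after_digit && isdigit((unsigned char)name[1])) {
                continue;
            }
            after_digit = true;
        } else {
            after_digit = false;
            c = (unsigned char)tolower(c);
        }
        out[n++] = (char)c;
    }
    out[n] = '\0';
}

/* Compare two converter names under the rules above. */
static bool names_match(const char *a, const char *b) {
    char na[64], nb[64];
    normalize_name(a, na, sizeof na);
    normalize_name(b, nb, sizeof nb);
    return strcmp(na, nb) == 0;
}
```

With this sketch, the four spellings listed above ("UTF-8", "utf_8", "u\*T@f08"
and "Utf 8") all normalize to `utf8`, while "UTF-16" normalizes to `utf16` and
therefore does not match.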

### Closing a Converter

Closing a converter frees memory occupied by that instance of the converter.
However, it does not release the larger shared data tables the converter might
use. OS-level code may call `ucnv_flushCache()` to explicitly free memory occupied
by [unused tables](../design.md).

```c
ucnv_close(conv);
```

### Converter Life Cycle

Note that a Converter is created with a certain type (for instance, ISO-8859-3)
which does not change over the life of that [object](../design.md). Converters
should be allocated one per thread. They are cheap to create, as the shared data
doesn't need to be reallocated.

This is the typical life cycle of a converter, as shown step-by-step:

1.  First, open up the converter with a specified name (or alias name).
    ```c
    UConverter *conv = ucnv_open("shift_jis", &status);
    ```

2.  Convert the text. Here, `target` is the `char` buffer to write into,
    `targetSize` is the size of that buffer, and `source` holds the UChars that
    are being converted.
    ```c
    int32_t len = ucnv_fromUChars(conv, target, targetSize, source, u_strlen(source), &status);
    ```

3.  Clean up the converter.
    ```c
    ucnv_close(conv);
    ```

### Sharing Converters Between Threads

A converter cannot be shared between threads at the same time. However, if it is
reset it can be used for unrelated chunks of data. For example, use the same
converter for converting data from Unicode to ISO-8859-3, and then reset it. Use
the same converter for converting data from ISO-8859-3 back into Unicode.

### Converting Large Quantities of Data

If it is necessary to convert a large quantity of data in smaller buffers, use
the same converter to convert each buffer. This will make sure any state is
preserved from one chunk to the next. Doing this conversion is known as
streaming or buffering, and is described in the
[Buffered or Streamed](#3-buffered-or-streamed) section later in this chapter.

### Cloning a Converter

Cloning a converter returns a clone of the converter object along with any
internal state that the converter might be storing. Cloning routines must be
used with extreme care when using converters for stateful or multibyte
encodings. If the converter object is carrying an internal state, and the
newly-created clone is used to convert a new chunk of text, the converter
produces incorrect results. Also note that the caller owns the cloned object and
has to call `ucnv_close()` to dispose of the object. Calling `ucnv_reset()` before
cloning will reset the converter to its original state.

```c
UConverter *newCnv = ucnv_safeClone(oldCnv, NULL, &bufferSize, &err);
```

## Converter Behavior

### Conversion

1.  The converters always consume the source buffer as far as possible, and
    advance the source pointer.

2.  The converters write to the target all converted output as far as possible,
    and then write any remaining output to the internal services buffer. When
    the conversion routines are called again, the internal buffer is flushed out
    and written to the target buffer before proceeding with any further
    conversion.

3.  In conversions to Unicode from Multi-byte encodings or conversions from
    Unicode involving surrogates, if (a) only a partial byte sequence is
    retrieved from the source buffer, (b) the "flush" parameter is set to "true"
    and (c) the end of source is reached, then the callback is called with
    `U_TRUNCATED_CHAR_FOUND`.

### Reset

Converters can be reset explicitly or implicitly. Explicit reset is done by
calling:

1.  `ucnv_reset()`: Resets the converter to initial state in both directions.

2.  `ucnv_resetToUnicode()`: Resets the converter to initial state to Unicode
    direction.

3.  `ucnv_resetFromUnicode()`: Resets the converter to initial state from Unicode
    direction.

The converters are reset implicitly when the conversion functions are called
with the "flush" parameter set to "true" and the source is consumed.

### Error

#### Conversion from Unicode

Not all characters can be converted from Unicode to other codepages. In most
cases, Unicode is a superset of the characters supported by any given codepage.

The default behavior of ICU in this case is to substitute the illegal or
unmappable sequence, with the appropriate substitution sequence for that
codepage. For example, ISO-8859-1, along with most ASCII-based codepages, has
the character 0x1A (Control-Z) as the substitution sequence. When converting
from Unicode to ISO-8859-1, any characters which cannot be converted would be
replaced by 0x1A's.
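
The effect of the default substitution can be sketched with a toy UTF-16 to
ISO-8859-1 conversion in plain C. This is an illustration only and does not use
or reproduce ICU's converter code: `utf16_to_latin1()` and `LATIN1_SUBSTITUTE`
are hypothetical names, and the sketch assumes that one substitute byte is
written per unmappable code point, so a surrogate pair yields a single 0x1A.

```c
#include <stddef.h>
#include <stdint.h>

#define LATIN1_SUBSTITUTE 0x1A  /* Control-Z, the substitution byte named above */

/*
 * Convert UTF-16 code units to ISO-8859-1 bytes, writing 0x1A once for each
 * code point that has no ISO-8859-1 mapping. Returns the number of bytes
 * written. The caller must provide room for at least `length` bytes in `out`.
 */
static size_t utf16_to_latin1(const uint16_t *in, size_t length, uint8_t *out) {
    size_t n = 0;
    for (size_t i = 0; i < length; ++i) {
        uint16_t unit = in[i];
        if (unit <= 0xFF) {
            out[n++] = (uint8_t)unit;      /* U+0000..U+00FF map one-to-one */
        } else {
            /* A surrogate pair is one code point: consume both units. */
            if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < length &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                ++i;
            }
            out[n++] = LATIN1_SUBSTITUTE;
        }
    }
    return n;
}
```

For the input U+0041, U+0398, U+00E9, the sketch writes the bytes 0x41, 0x1A,
0xE9: the Capital Theta has no ISO-8859-1 mapping and is replaced.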

SubChar1 is sometimes used as substitution character in MBCS conversions. For
more information on SubChar1 please see the [Conversion Data](data.md) chapter.

In stateful converters like ISO-2022-JP, if a substitution character has to be
written to the target, then an escape/shift sequence to change the state to
single byte mode followed by a substitution character is written to the target.

The substitution character can be changed by calling the `ucnv_setSubstChars()`
function with the desired codepage byte sequence. However, this has some
limitations: It only allows setting a single character (although the character
can consist of multiple bytes), and it may not work properly for some stateful
converters (like HZ or ISO 2022 variants) when setting a multi-byte substitution
character. (It will work for EBCDIC_STATEFUL ones.) Moreover, for setting a
particular character, the caller needs to know the correct byte sequence for
that character in the converter's codepage. (For example, a space (U+0020) is
encoded as 0x20 in ASCII-based codepages, 0x40 in EBCDIC-based ones, 0x00 0x20
or 0x20 0x00 in UTF-16 depending on the stream's endianness, etc.)

The `ucnv_setSubstString()` function (new in ICU 3.6) lifts these limitations. It
takes a Unicode string and verifies that it can be converted to the codepage
without error and that it is not too long (32 bytes as of ICU 3.6). The string
can contain zero, one or more characters. An empty string has the effect of
using the skip callback. See the Error Callbacks section below. Stateful
converters are fully supported. The same Unicode string will give equivalent
results with all converters that support its conversion.

Internally, `ucnv_setSubstString()` stores the byte sequence from the test
conversion if the converter is stateless, or the Unicode string itself if the
converter is stateful. If the Unicode string is stored, then it is converted on
the fly during substitution, handling all state transitions.

The function `ucnv_getSubstChars()` can be used to retrieve the substitution byte
sequence if it is the default one, set by `ucnv_setSubstChars()`, or if
`ucnv_setSubstString()` stored the byte sequence for a stateless converter. The
Unicode string set for a stateful converter cannot be retrieved.

#### Conversion to Unicode

In conversion to Unicode, errors are normally due to ill-formed byte sequences:
Unused byte values, or lead bytes not followed by trail bytes according to the
encoding scheme. Well-formed but unmappable sequences are unusual but possible.

The ICU default behavior is to emit a `U+FFFD REPLACEMENT CHARACTER` per
offending sequence.

If the conversion table .ucm file contains a `<subchar1>` entry (such as in the
ibm-943 table), a U+001A C0 control ("SUB") is emitted for single-byte
illegal/unmappable input rather than `U+FFFD REPLACEMENT CHARACTER`. For details
on this behavior look for "001A" in the [Conversion Data](data.md) chapter.

* This behavior originates from mainframes with dedicated single-byte-to-single-byte
  and double-to-double conversions.
* Emitting U+001A for single-byte errors can be avoided by (a) removing the
  `<subchar1>` mapping or (b) using a similar conversion table that does not
  have this mapping (e.g., windows-932 instead of ibm-943) or (c) writing a
  custom callback function.

### Error Codes

Here are some of the `UErrorCode`s which have significant meaning for conversion:

#### U_INDEX_OUTOFBOUNDS_ERROR

In `ucnv_getNextUChar()`, all source data has been consumed without producing a
Unicode character.

#### U_INVALID_CHAR_FOUND

No mapping was found from the source to the target encoding. For example, U+0398
(Capital Theta) has no mapping into ISO-8859-1, and so `U_INVALID_CHAR_FOUND`
will result.

#### U_TRUNCATED_CHAR_FOUND

All of the source data was read, and a character sequence was incomplete. For
example, only half of a double-byte sequence may have been encountered. When
converting FROM Unicode, this error would occur when a conversion ends with a
lead (high) surrogate such as U+D800 at the end of the source, with no following
trail (low) surrogate.

#### U_ILLEGAL_CHAR_FOUND

A character sequence was found in the source which is disallowed in the source
encoding scheme. For example, many MBCS encodings have only certain byte
sequences which are allowed as lead bytes. When converting from Unicode, if a
lead (high) surrogate is NOT followed immediately by a trail (low) surrogate, or
a trail surrogate appears without a preceding lead surrogate, an illegal
sequence results.
Note: Most, but not all, converters forbid surrogate code points or unpaired
surrogate code units. (Lead surrogate without trail, or trail without lead.)
Some converters permit surrogate code points/unpaired surrogates because their
charset specification permits it. For example, LMBCS, SCSU and
BOCU-1.
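
The difference between a truncated and an illegal UTF-16 sequence can be
sketched in plain C. The following is an illustration, not ICU code: the
`Utf16Check` type and the `check_utf16()` function are hypothetical, and the
sketch assumes a strict converter that rejects every unpaired surrogate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not an ICU error code. */
typedef enum { UTF16_OK, UTF16_ILLEGAL, UTF16_TRUNCATED } Utf16Check;

static int is_lead(uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
static int is_trail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Classify the first problem found in a UTF-16 buffer, if any. */
static Utf16Check check_utf16(const uint16_t *s, size_t length) {
    for (size_t i = 0; i < length; ++i) {
        if (is_trail(s[i])) {
            return UTF16_ILLEGAL;        /* trail surrogate without a lead */
        }
        if (is_lead(s[i])) {
            if (i + 1 == length) {
                return UTF16_TRUNCATED;  /* lead surrogate at end of source */
            }
            if (!is_trail(s[i + 1])) {
                return UTF16_ILLEGAL;    /* lead surrogate without a trail */
            }
            ++i;                         /* skip the trail of a valid pair */
        }
    }
    return UTF16_OK;
}
```

In this sketch, a buffer ending in U+D83D alone is reported as truncated,
because more input could still complete the pair, while U+D83D followed by
U+0041 is reported as illegal.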

#### U_INVALID_TABLE_FORMAT

An error occurred trying to read the backing data for the converter. The data
could be corrupt, or the wrong version.

#### U_BUFFER_OVERFLOW_ERROR

More output (target) characters were produced than fit in the target buffer. If
this happens in `ucnv_toUnicode()` or `ucnv_fromUnicode()`, then process the
target buffer and call the function again to retrieve the overflowed characters.

### Error Callbacks

What actually happens is that an "error callback function" is called at the
point where the conversion failure occurred. The function can deal with the
failed characters as it sees fit. Possible options at the callback's disposal
include ignoring the bad sequence, converting it to a different sequence, and
returning an error to the caller. The callback can also consume any data past
where the error occurred, whether or not that data would have caused an error.
Only one callback is installed at a time, per direction (to or from Unicode).

A number of canned functions are provided by ICU, and an application can write
new ones. The "callbacks" are either From Unicode (to codepage), or To Unicode
(from codepage). Here is a list of the canned callbacks in ICU:

1.  UCNV_**FROM_U**_CALLBACK_SUBSTITUTE: This callback is installed by default.
    It will write the codepage's substitute sequence or a user-set substitute
    sequence, or convert a user-set substitute UnicodeString to the codepage.
    See "Error / Conversion from Unicode" above.

2.  UCNV_**TO_U**_CALLBACK_SUBSTITUTE: This callback is installed by default. It
    will write U+FFFD or sometimes U+001A. See "Error / Conversion to Unicode"
    above.

3.  UCNV_FROM_U_CALLBACK_SKIP, UCNV_TO_U_CALLBACK_SKIP: Simply ignores any
    invalid characters in the input; no error is returned.

4.  UCNV_FROM_U_CALLBACK_STOP, UCNV_TO_U_CALLBACK_STOP: Stop at the error.
    Return the error to the caller. (When using the 'BUFFER' mode of conversion,
    the source and target pointers returned can be examined to determine where
    the error occurred. `ucnv_getInvalidUChars()` and `ucnv_getInvalidChars()`
    return the actual text which failed.)

5.  UCNV_FROM_U_CALLBACK_ESCAPE, UCNV_TO_U_CALLBACK_ESCAPE: This callback is
    especially useful for debugging. Missing codepage characters are replaced by
    strings such as '%U094D' with the Unicode value, and missing Unicode chars
    are replaced with text of the form '%X0A' where the codepage had the
    unconvertible byte hex 0A.

    When a callback is set, a "context" pointer is also provided. How this
    pointer is created depends on the specific callback. There is usually a
    `createContext()` function for that specific callback, where the caller can
    set certain options for the callback. Consult the documentation for the
    specific callback you are using. For ICU's canned callbacks, this pointer
    may be set to NULL. The functions for setting a different callback also
    return the old callback, and the old context pointer. These may be stored so
    that the old callback is re-installed when an operation is finished.

    Additionally the following options can be passed as the context parameter to
    UCNV_FROM_U_CALLBACK_ESCAPE callback function to produce different outputs.

    | Context option      | Output format |
    | ------------------- | ------------- |
    | UCNV_ESCAPE_ICU     | %U12345 |
    | UCNV_ESCAPE_JAVA    | \\u1234 |
    | UCNV_ESCAPE_C       | \\udbc9\\udd36 for Plane 1 and \\u1234 for Plane 0 codepoints |
    | UCNV_ESCAPE_XML_DEC | \&#4660; number expressed in Decimal |
    | UCNV_ESCAPE_XML_HEX | \&#x1234; number expressed in Hexadecimal |

Here are some examples of how to use callbacks.

```c
UConverter              *u;
const void              *oldContext, *newContext;
UConverterFromUCallback oldAction, newAction;
u = ucnv_open("shift_jis", &myError);

/* ... do some conversion with u from Unicode ... */

/* Install the new callback; the previous one is returned in oldAction/oldContext.
   newContext must already point to the context that MY_FROMU_CALLBACK expects. */
ucnv_setFromUCallBack(
    u, MY_FROMU_CALLBACK, newContext, &oldAction, &oldContext, &myError);

/* ... do some other conversion from Unicode ... */

/* Now, set the callback back */
ucnv_setFromUCallBack(
    u, oldAction, oldContext, &newAction, &newContext, &myError);

```
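
To make the escape styles in the table above more concrete, the following
plain C sketch formats a single BMP code point in four of them. It is an
illustration only, not ICU's callback code: the `EscapeStyle` enum and the
`format_escape()` function are hypothetical, and the digit case and padding are
assumptions modeled on the examples in the table. For reference, U+1234 is 4660
in decimal.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical style selector for this sketch; not an ICU type. */
typedef enum { ESC_ICU, ESC_JAVA, ESC_XML_DEC, ESC_XML_HEX } EscapeStyle;

/*
 * Format a BMP code point (U+0000..U+FFFF) into `out` in the given style.
 * Returns the value reported by snprintf(), that is, the length of the
 * complete escape sequence excluding the terminating NUL.
 */
static int format_escape(EscapeStyle style, uint16_t cp, char *out, size_t cap) {
    switch (style) {
    case ESC_ICU:     return snprintf(out, cap, "%%U%04X", (unsigned)cp);
    case ESC_JAVA:    return snprintf(out, cap, "\\u%04X", (unsigned)cp);
    case ESC_XML_DEC: return snprintf(out, cap, "&#%u;", (unsigned)cp);
    case ESC_XML_HEX: return snprintf(out, cap, "&#x%X;", (unsigned)cp);
    }
    return -1;  /* unknown style */
}
```

Supplementary code points are left out on purpose, because the styles differ in
how they represent them (for example as a pair of surrogate escapes).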
568
569### Custom Callbacks
570
571Writing a callback is somewhat involved, and will be covered more completely in
572a future version of this document. One might look at the source to the provided
573callbacks as a starting point, and address any further questions to the mailing
574list.
575
576Basically, callback, unlike other ICU functions which expect to be called with
577`U_ZERO_ERROR` as the input, is called in an exceptional error condition. The
578callback is a kind of 'last ditch effort' to rectify the error which occurred,
579before it is returned back to the caller. This is why the implementation of STOP
580is very simple:
581
582```c
583void UCNV_FROM_U_CALLBACK_STOP(...) { }
584```
585
586The error code such as `U_INVALID_CHAR_FOUND` is returned to the user. If the
587callback determines that no error should be returned to the user, then the
588callback must set the error code to `U_ZERO_ERROR`. Note that this is a departure
589from most ICU functions, which are supposed to check the error code and return
590immediately if it is set.
591
592> :point_right: **Note**: See the functions `ucnv_cb_write...()` for
593> functions which a callback may use to perform its task.

#### Ignore Default_Ignorable_Code_Point

Unicode has a number of characters that are not by themselves meaningful but
assist with line breaking (e.g., U+00AD Soft Hyphen and U+200B Zero Width Space),
bi-directional text layout (U+200E Left-To-Right Mark), collation and other
algorithms (U+034F Combining Grapheme Joiner), or indicate a preference for a
particular glyph variant (U+FE0F Variation Selector 16). These characters are
"invisible" by default, that is, they should normally not be shown with a glyph
of their own, except in special circumstances. Examples include showing a hyphen
when a Soft Hyphen is used for a line break, or modifying the glyph of a
character preceding a Variation Selector.

Unicode has a character property to identify such characters, as well as
currently-unassigned code points that are intended to be used for similar
purposes: Default_Ignorable_Code_Point, or "DI" for short:
http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:]

Most charsets contain few or none of these characters.

**ICU 54 and above by default skip default-ignorable code points if they are
unmappable**. (Ticket #[10551](https://unicode-org.atlassian.net/browse/ICU-10551))

**Older versions of ICU** replaced unmappable default-ignorable code points like
any other unmappable code points, by a question mark or whatever substitution
character is defined for the charset.

For best results, a custom from-Unicode callback can be used to ignore
Default_Ignorable_Code_Point characters that cannot be converted, so that they
are removed from the charset output rather than replaced by a visible character.

This is a code snippet for use in a custom from-Unicode callback:

```c
#include "unicode/uchar.h"
#include "unicode/ucnv.h"
// ...
// Inside a custom from-Unicode callback (a UConverterFromUCallback), where
// reason, codePoint and pErrorCode are the callback's parameters:
    switch(reason) {
    case UCNV_UNASSIGNED:
        if(u_hasBinaryProperty(codePoint, UCHAR_DEFAULT_IGNORABLE_CODE_POINT)) {
            // Ignore/drop default ignorable code points that cannot be converted,
            // rather than treating them like errors/writing a substitution character etc.
            // For example, U+200B Zero Width Space,
            // U+200E Left-To-Right Mark, U+FE0F Variation Selector 16.
            *pErrorCode = U_ZERO_ERROR;
            return;
        } else {
            // ...
```

## Modes of Conversion

When a converter is instantiated, it can be used to convert in both the Unicode
to codepage direction and the codepage to Unicode direction. There are
three ways to use the converters, as well as a convenience function which does
not require the instantiation of a converter.

1.  **Single-String**: Simplest type of conversion to or from Unicode. The data
    is entirely contained within a single string.

2.  **Character**: Convert from the codepage to Unicode, one code point at a
    time.

3.  **Buffer**: Convert data which may not fit entirely within a single buffer.
    Usually the most efficient and flexible.

4.  **Convenience**: Convert a single buffer from one codepage to another
    through Unicode, without requiring the instantiation of a converter.

### 1. Single-String

Data must be contained entirely within a single string or buffer.

```c
UErrorCode status = U_ZERO_ERROR;
UConverter *conv = ucnv_open("shift_jis", &status);
/* check U_FAILURE(status) before using conv */

/* Convert from Unicode to Shift JIS (source: UChar *, target: char *) */
len = ucnv_fromUChars(conv, target, targetCapacity, source, sourceLen, &status);
ucnv_close(conv);

status = U_ZERO_ERROR; /* reset before reusing the error code */
conv = ucnv_open("iso-8859-3", &status);
/* Convert from ISO-8859-3 to Unicode (source: char *, target: UChar *) */
len = ucnv_toUChars(conv, target, targetCapacity, source, sourceLen, &status);
ucnv_close(conv);
```

### 2. Character

In this mode, the input data is in the specified codepage. Each function call
converts only the next Unicode code point. This might be the most efficient way
to scan for a certain character, or to otherwise process a single character at
a time. Because converters are stateful, this works even for multibyte
charsets, and for stateful ones such as iso-2022-jp.

```c
UErrorCode status = U_ZERO_ERROR;
UConverter *conv = ucnv_open("Big-5", &status);
UChar32 target;
while(U_SUCCESS(status) && source < sourceLimit) {
    /* source is advanced past the bytes of the character just converted */
    target = ucnv_getNextUChar(conv, &source, sourceLimit, &status);
    if(U_FAILURE(status)) {
        break; /* handle the error */
    }
    processChar(target);
}
ucnv_close(conv);
```

### 3. Buffered or Streamed

This is used in situations where a large document may be read in from disk and
processed. Also, many codepages take multiple bytes to encode a character, or
have state. These factors make it impossible to convert arbitrary chunks of data
without maintaining state across chunks. Even conversion from Unicode may
encounter a leading surrogate at the end of one buffer, which needs to be paired
with the trailing surrogate in the next buffer.
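
To make the surrogate case concrete, the following plain C sketch (no ICU calls)
checks whether a UTF-16 chunk ends in a leading surrogate. The helper name
`endsWithLeadSurrogate` is hypothetical and `uint16_t` stands in for `UChar`.
The ICU converters keep this state internally, so application code does not
normally need such a check; it only illustrates why chunks cannot be converted
in isolation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU: returns true if the last UTF-16
 * code unit of the chunk is a leading (high) surrogate, U+D800..U+DBFF,
 * whose trailing surrogate can only arrive with the next chunk. */
static bool endsWithLeadSurrogate(const uint16_t *chunk, size_t length) {
    if (length == 0) {
        return false;
    }
    uint16_t last = chunk[length - 1];
    return last >= 0xD800 && last <= 0xDBFF;
}
```

For example, splitting U+1F600 (encoded as D83D DE00) right after its first
code unit produces a chunk for which this check is true.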

A basic API principle of the ICU to/from Unicode functions is that they will
ALWAYS attempt to consume all of the input (source) data, unless the output
buffer is full or some other error occurs. In other words, there is no need to
ever test whether all of the source data has been consumed.

The basic loop that is used with the ICU buffer conversion routines is the same
in the to-Unicode and from-Unicode directions. In the following pseudocode,
either 'source' (for fromUnicode) or 'target' (for toUnicode) is a buffer of
UTF-16 UChars.

```c
UErrorCode err = U_ZERO_ERROR;

while (... /*input data available*/ ) {
    ... /* read input data into buffer */

    source = ... /* beginning of read data */;
    sourceLimit = source + readLength; // end + 1

    UBool flush = (no further input data available); // (i.e. feof())

    /* loop until all source has been processed */
    do {
        /* set up target pointers */
        target = ... /* beginning of output buffer */;
        targetLimit = target + sizeOfOutput;

        err = U_ZERO_ERROR; /* so that the to/from does not fail */

        ucnv_to/fromUnicode(converter, &target, targetLimit,
                    &source, sourceLimit, NULL, flush, &err);

        ... /* write (target-beginningOfOutputBuffer) items
               starting at beginning of output buffer */
    } while (err == U_BUFFER_OVERFLOW_ERROR);
    if(U_FAILURE(err)) {
        ... /* process error */
        break; /* out of the 'while' loop that reads source data */
    }
}
/* end of the loop that reads input data */
if(U_FAILURE(err)) {
    ... /* process error further */
}
```

The above code optimizes for processing entire chunks of input data. An
efficient size for the output buffer can be calculated as follows (in bytes):

```c
/* toUnicode: the output is UChars */
ucnv_getMinCharSize(converter) * inputBufferSize * sizeof(UChar)
/* fromUnicode: the output is codepage bytes */
ucnv_getMaxCharSize(converter) * inputBufferSize
```
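
As a worked example of this arithmetic, the sketch below wraps the two
expressions in plain C functions. The function names are hypothetical, the
`minCharSize` and `maxCharSize` parameters stand in for the values returned by
`ucnv_getMinCharSize()` and `ucnv_getMaxCharSize()`, and `uint16_t` stands in
for `UChar`. Which expression belongs to which direction is an assumption
inferred from the `sizeof(UChar)` factor.

```c
#include <stddef.h>
#include <stdint.h>

/* toUnicode direction: size in bytes of the UChar output buffer.
 * minCharSize stands in for the result of ucnv_getMinCharSize(). */
static size_t toUnicodeOutputBytes(size_t minCharSize, size_t inputBufferSize) {
    return minCharSize * inputBufferSize * sizeof(uint16_t); /* sizeof(UChar) */
}

/* fromUnicode direction: size in bytes of the codepage output buffer.
 * maxCharSize stands in for the result of ucnv_getMaxCharSize(). */
static size_t fromUnicodeOutputBytes(size_t maxCharSize, size_t inputBufferSize) {
    return maxCharSize * inputBufferSize;
}
```

With an input buffer size of 1024 and a converter that reports a minimum
character size of 1, the first function yields 2048 bytes. With 1024 input
UChars and a maximum character size of 2, the second also yields 2048 bytes.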

There are two loops used, an outer and an inner. The outer loop fetches input
data to keep the source buffer full, and the inner loop 'writes' out data to
keep the output buffer empty.

Note that while this efficiently handles data on the input side, there are some
cases where the size of the output buffer is fixed. For instance, in network
applications it is sometimes desirable to fill every output packet completely
(not including the last packet in the sequence). The above loop does not ensure
that every output buffer is completely full. For example, if a 4-UChar input
buffer was used, and a 3-byte output buffer with `fromUnicode()`, the loop would
typically write 3 bytes, then 1, then 3, and so on. If, instead of efficient use
of the input data, the goal is filling output buffers, a slightly different loop
can be used.

In such a scenario, the inner write does not occur unless a buffer overflow
occurs OR 'flush' is true. So, the 'write' and the resetting of the target and
targetLimit pointers would only happen
`if (err == U_BUFFER_OVERFLOW_ERROR || flush == true)`
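
The control flow of this variant can be sketched in plain C as follows. This is
only an illustration under stated assumptions: `toyFromUnicode()` is a
hypothetical stand-in for `ucnv_fromUnicode()` that writes one byte per UTF-16
unit, the `TOY_*` constants stand in for `UErrorCode` values, and the chunk and
packet sizes match the 4-UChar and 3-byte example above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_OK = 0, TOY_BUFFER_OVERFLOW = 1 }; /* stand-ins for UErrorCode values */
enum { CHUNK_SIZE = 4, PACKET_SIZE = 3 };

/* Toy stand-in for ucnv_fromUnicode(): one output byte per UTF-16 unit.
 * Advances *target and *source, and reports an overflow when the target
 * is full while source data remains. */
static int toyFromUnicode(char **target, const char *targetLimit,
                          const uint16_t **source, const uint16_t *sourceLimit) {
    while (*source < sourceLimit) {
        if (*target == targetLimit) {
            return TOY_BUFFER_OVERFLOW;
        }
        *(*target)++ = (char)*(*source)++;
    }
    return TOY_OK;
}

/* Converts the input in chunks of CHUNK_SIZE and records the size of every
 * packet that would be written. packetSizes must have room for
 * length / PACKET_SIZE + 1 entries. Returns the number of packets. */
static size_t fillPackets(const uint16_t *input, size_t length, size_t *packetSizes) {
    char packet[PACKET_SIZE];
    char *target = packet; /* NOT reset for every conversion call */
    const char *targetLimit = packet + PACKET_SIZE;
    size_t packets = 0, offset = 0;

    if (length == 0) {
        return 0;
    }
    do {
        size_t chunk = (length - offset < CHUNK_SIZE) ? length - offset : CHUNK_SIZE;
        const uint16_t *source = input + offset;
        const uint16_t *sourceLimit = source + chunk;
        offset += chunk;
        bool flush = (offset == length); /* true only for the last chunk */
        int err;
        do {
            err = toyFromUnicode(&target, targetLimit, &source, sourceLimit);
            /* Write only when the packet is full or the input has ended. */
            if (err == TOY_BUFFER_OVERFLOW || flush) {
                packetSizes[packets++] = (size_t)(target - packet);
                target = packet; /* start a new packet */
            }
        } while (err == TOY_BUFFER_OVERFLOW);
    } while (offset < length);
    return packets;
}
```

For 8 input units this records packets of 3, 3, and 2 bytes, so only the last
packet is short. The loop shown earlier would instead write 3, 1, 3, and 1
bytes.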

The flush parameter on each conversion call should be set to false as long as
more input will follow. This is because the conversion is stateful. On the last
conversion call, when the source buffer holds the final chunk of input, the
flush parameter should be set to true. More details are mentioned in the API
reference in
[ucnv.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

### 4. Pre-flighting

Preflighting is the process of asking the conversion API for the size of target
buffer required. (For a more general discussion, see the Preflighting section
(§) in the [Strings](../strings/index.md) chapter.)

This is accomplished by calling the `ucnv_fromUChars` and `ucnv_toUChars`
functions with a `NULL` target and a capacity of zero. The call then sets
`U_BUFFER_OVERFLOW_ERROR` and returns the required length.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "unicode/ucnv.h"
// ... myConverter is an already-open UConverter *
UErrorCode err = U_ZERO_ERROR;
UChar *uchar2 = NULL;
const char input_char_buffer[] = "This is some text";
int32_t sourceLength = (int32_t)strlen(input_char_buffer);

/* Preflight: a NULL target with zero capacity only returns the required length. */
int32_t targetsize = ucnv_toUChars(myConverter, NULL, 0,
                                   input_char_buffer, sourceLength, &err);

if(err==U_BUFFER_OVERFLOW_ERROR) {
    err=U_ZERO_ERROR;
    /* one extra UChar leaves room for the terminating NUL */
    uchar2=(UChar*)malloc((targetsize + 1) * sizeof(UChar));
    targetsize = ucnv_toUChars(myConverter, uchar2, targetsize + 1,
                               input_char_buffer, sourceLength, &err);
    if(U_FAILURE(err)) {
        printf("ucnv_toUChars() FAILED %s\n", u_errorName(err));
    }
    else {
        printf("ucnv_toUChars() o.k.\n");
    }
    free(uchar2);
}
```

> :point_right: **Note**: *This is inefficient since the conversion is performed
> **twice**, once for finding the size of target and once for writing to the target*.

### 5. Convenience

ICU provides some convenience functions for conversions:

```cpp
/* C API: convert a whole string with an already-open converter */
ucnv_toUChars(myConverter, target_uchars, targetsize,
              input_char_buffer, sizeof(input_char_buffer), &err);
ucnv_fromUChars(cnv, cTarget, (int32_t)(cTargetLimit-cTarget),
                uSource, (int32_t)(uSourceLimit-uSource), &errorCode);

/* C++ API: UnicodeString converts by charset name */
char target[100];
UnicodeString str("ABCDEF", "iso-8859-1");
int32_t length = str.extract(0, str.length(), target, sizeof(target), "SJIS");
if(length < (int32_t)sizeof(target)) {
    target[length] = 0; /* NUL termination */
}
```

## Conversion Examples

See the [ICU Conversion Examples](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ucnv/convsamp.cpp) for more information.