---
layout: default
title: Converter
nav_order: 1
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Using Converters
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

When designing applications around Unicode characters, it is sometimes required
to convert between Unicode encodings or between Unicode and legacy text data.
The vast majority of modern operating systems support Unicode to some degree,
but sometimes the legacy text data from older systems needs to be converted to
and from Unicode. This conversion process can be done with an ICU converter.

## ICU converters

ICU provides comprehensive character set conversion services, mapping tables,
and implementations for many encodings. Since ICU uses Unicode (UTF-16)
internally, all converters convert between UTF-16 (with the endianness according
to the current platform) and another encoding. This includes Unicode encodings.
In other words, internal text is 16-bit Unicode, while "external text" used as
source or target for a conversion is always treated as a byte stream.

ICU converters are available for a wide range of encoding schemes. Most of them
are based on mapping table data that is handled by a few generic implementations.
Some encodings are implemented algorithmically in addition to (or instead of)
using mapping tables, especially Unicode encodings. The partly or entirely
table-based encoding schemes include: All ICU converters map only single Unicode
character code points to and from single codepage character code points. ICU
converters **do not** deal directly with combining characters, bidirectional
reordering, or Arabic shaping, for example. Such processes, if required, must be
handled separately. For example, while in Unicode, the ICU BiDi APIs can be used
for bidirectional reordering after a conversion to Unicode or before a
conversion from Unicode.

ICU converters are not designed to perform any encoding autodetection. This
means that the converters do not autodetect "endianness", the 6 Unicode encoding
signatures, or Shift-JIS vs. EUC-JP, etc. There are two exceptions: The
UTF-16 and UTF-32 converters work according to Unicode's specification of their
Character Encoding Schemes, that is, they read the BOM to figure out the actual
"endianness".

The ICU mapping tables mostly come from an [IBM® codepage
repository](http://www.ibm.com/software/globalization/cdra). For non-IBM
codepages, there is typically an equivalent codepage registered with this
repository. However, the textual data format (.ucm files) is generic, and data
for other codepage mapping tables can also be added.

## Using the Default Codepage

ICU has code to determine the default codepage of the system or process. This
default codepage can be used to convert `char *` strings to and from Unicode.

Depending on system design, setup and APIs, it may not always be possible to
find a default codepage that fully works as expected. For example,

1. On Windows there are three encodings in use at the same time. Unicode
   (UTF-16) is always used inside of Windows, while for `char *` encodings there
   are two classes, called "ANSI" and "OEM" codepages. ICU will use the ANSI
   codepage. Note that the OEM codepage is used by default for console window
   output.

2. On some UNIX-type systems, non-standard names are used for encodings, or
   non-standard encodings are used altogether. Although ICU supports over 200
   encodings in its standard build and many more aliases for them, it will not
   be able to recognize such non-standard names.

3. Some systems do not have a notion of a system or process codepage, and may
   not have APIs for that.

If you have means of detecting a default codepage name that are more appropriate
for your application, then you should set that name with `ucnv_setDefaultName()`
as the first ICU function call. This makes sure that the internally cached
default converter will be instantiated from your preferred name.

Starting in ICU 2.0, when a converter for the default codepage cannot be opened,
a fallback default codepage name and converter will be used. On most platforms,
this will be US-ASCII. For z/OS (OS/390), ibm-1047,swaplfnl is the default
fallback codepage. For AS/400 (iSeries), ibm-37 is the default fallback
codepage. This default fallback codepage is used when the operating system is
using a non-standard name for a default codepage, or the converter was not
packaged with ICU. The feature allows ICU to run in unusual computing
environments without completely failing.

## Usage Model

A "Converter" refers to the C structure "UConverter". Converters are cheap to
create. Any data that is shared between converters of the same kind (such as the
mappings, the name and the properties) is automatically cached and shared in
memory.

### Converter Names

Codepages with encoding schemes have been given many names by various vendors
and platforms over the years. Vendors have different ways to specify which
codepage and encoding are being used. IBM uses a CCSID (Coded Character Set
IDentifier). Windows uses a CPID (CodePage IDentifier). Macintosh has a
TextEncoding. Many Unix vendors use
[IANA](http://www.iana.org/assignments/character-sets) character set names. Many
of these names are aliases to converters within ICU.

In order to help identify which names are recognized by certain platforms, ICU
provides several converter alias functions. The complete description of these
functions can be found in the [ICU API Reference](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

| Function Names | Short Description |
| -------------- | ----------------- |
| `ucnv_countAvailable`, `ucnv_getAvailableName` | Get a list of available converter names that can be opened. |
| `ucnv_openAllNames` | Get a list of all known converter names. |
| `ucnv_getName` | Get the name of an open converter. |
| `ucnv_countAliases`, `ucnv_getAlias` | Get the list of aliases for the specified converter. |
| `ucnv_countStandards`, `ucnv_getStandard` | Get the list of known standards. |
| `ucnv_openStandardNames` | Get a filtered list of aliases for a converter that is known by the specified standard. |
| `ucnv_getStandardName` | Get the preferred alias name specified by a given standard. |
| `ucnv_getCanonicalName` | Get the converter name from the alias that is recognized by the specified standard. |
| `ucnv_getDefaultName` | Get the default converter name that is currently used by ICU and the operating system. |
| `ucnv_setDefaultName` | Use this function to override the default converter name. |

Even though IANA specifies a list of aliases, it usually does not specify the
mappings or the actual character set for the aliases. Sometimes vendors will map
similar glyph variants to different Unicode code points or sometimes they will
assign completely different glyphs for the same codepage code point. Because of
these ambiguities, you can sometimes get `U_AMBIGUOUS_ALIAS_WARNING` for the
returned `UErrorCode` when more than one converter uses the requested alias. This
is only a warning, and the results can still be used. This `UErrorCode` value is
just a reminder that you may not get what you expected. The above functions can
help you to determine which converter you actually wanted.

EBCDIC based converters have the option to swap the newline and linefeed
character mappings. This can be useful when transferring EBCDIC documents
between z/OS (MVS, OS/390 and the rest of the zSeries family) and another EBCDIC
machine like OS/400 on iSeries. The ",swaplfnl" or `UCNV_SWAP_LFNL_OPTION_STRING`
from ucnv.h can be appended to a converter alias in order to achieve this
behavior. You can view other available options in ucnv.h.

You can always skip many of these aliasing and mapping problems by just using
Unicode.

### Creating a Converter

There are four ways to create a converter:

1. **By name**: Converters can be created using different types of names. No
   distinction is made when the converter is created, as to which name is being
   employed. There are many types of aliases possible. Among these are
   [IANA](http://www.iana.org/assignments/character-sets) ("shift_jis",
   "koi8-r", or "iso-8859-3"), host specific names ("cp1252" which is the name
   for a Microsoft® Windows™ or a similar IBM® codepage). Finally, ICU's own
   internal canonical names for a converter can be used. These include "UTF-8"
   or "ISO-8859-1" for built-in conversion types, and names such as
   "ibm-949_P110-2000" (Shift-JIS with '\\' <-> '¥' mapping) or
   "ibm-949_P11A-2000" (Shift-JIS with '\\' <-> '\\' mapping) for data-file
   based conversions.

    ```c
    UConverter *conv = ucnv_open("shift_jis", &myError);
    ```

   As a convenience, converter names can be passed in as Unicode (for example,
   if a user passed in the string from a Unicode-based user interface).
   However, the actual names are restricted to an invariant ASCII/EBCDIC
   subset.

    ```c
    UChar *name = ...;
    UConverter *conv = ucnv_openU(name, &myError);
    ```

   Converter names are case-insensitive. In addition, beginning with ICU 3.6,
   leading zeroes are ignored in sequences of digits (if further digits
   follow), and all non-alphanumeric characters are ignored. Thus the strings
   "UTF-8", "utf_8", "u\*T@f08" and "Utf 8" are equivalent. (Before ICU 3.6,
   leading zeroes were not ignored, and only spaces, dashes and underscores
   were ignored.) The `ucnv_compareNames()` function provides such string
   comparisons.

   Unlike the names of resources or other types of ICU data, converter names
   can **not** be qualified with a path that indicates the directory or common
   data file containing the corresponding converter data. The requested
   converter's data must be present either in the main ICU data library or as a
   separate file located in the ICU data directory. However, you can always
   create a package of converters with pkgdata and open a converter from the
   package with `ucnv_openPackage()`.

    ```c
    UConverter *conv = ucnv_openPackage("./myPackage.dat", "customConverter", &myError);
    ```

2. **By number**: The design of the ICU is to accommodate codepages provided by
   different vendors. For example, the IBM CDRA (Character Data Representation
   Architecture which is an IBM architecture that defines a set of identifiers)
   has an ID type called the CCSID (Coded Character Set Identifier). The ICU
   API for opening a codepage by number must be given a vendor along with the
   number. Currently, only IBM (`UCNV_IBM`) is supported. For example, the US
   EBCDIC codepage (IBM #37) can be opened with the following code:

    ```c
    ucnv_openCCSID(37, UCNV_IBM, &myErr);
    ```

3. **By iteration**: An application might not know ahead of time which codepage
   to use, and thus might need to query ICU to determine the entire list of
   installed converters. The ICU returns a list of its canonical (internal)
   names. From each name, the standard IANA name can be determined, and also a
   list of aliases which point to that name can be determined. For example, ICU
   might return among the canonical names "ibm-367". That name itself may or
   may not provide the application or its users with the information needed.
   (367 is actually the decimal form of a number that is calculated by
   appending certain hex digits together.) However, the IANA name can be
   requested from this canonical name, which should return something like
   "us-ascii". The alias list for ibm-367 can be iterated over as well, which
   returns additional names like "ascii", "646", "ansi_x3.4-1968" etc. If this
   is not sufficient information, once a converter is opened, it can be queried
   for its type, min and max char size, etc. This information is not available
   without actually opening the converter (a fairly lightweight process).

    ```c
    /* Returns count of the number of available names */
    int count = ucnv_countAvailable();
    /* get the canonical name of the 36th available converter */
    const char *convName1 = ucnv_getAvailableName(36);
    /* get the 3rd alias for a given codepage. */
    const char *asciiAlias = ucnv_getAlias("ibm-367", 3, &myError);
    /* Get the IANA name of the converter */
    const char *ascii = ucnv_getStandardName("ibm-367", "IANA", &myError);
    /* Enumerate the IANA names of the converter. */
    UEnumeration *asciiEnum =
        ucnv_openStandardNames("ibm-367", "IANA", &myError);
    uenum_next(asciiEnum, NULL, &myError);  /* skip preferred IANA alias */
    /* get one of the non-preferred IANA aliases */
    const char *ascii2 = uenum_next(asciiEnum, NULL, &myError);
    uenum_close(asciiEnum);
    ```

4. **By using the default converter**: The default converter can be opened by
   passing a NULL as the name of the converter.

    ```c
    ucnv_open(NULL, &myErr);
    ```

> :point_right: **Note**: ICU chooses this converter based on the best information available to it.
> The purpose of this converter is to interface with the OS using a codepage (i.e. `char *`).
> Do not use it as a way of determining the best overall converter to use.
> Usually any Unicode encoding form is the best way to store and send text data,
> so that important data does not get lost in the conversion.
> Also, if the OS supports Unicode-based APIs (such as Win32),
> it is better to use only those Unicode APIs.
> As an example, the new Windows 2000 locales (such as Hindi) do not
> define the default codepage to something that supports Hindi.
> The default converter is used in expressions such as: `UnicodeString text("abc");`
> to convert 'abc', and in the `u_uastrcpy()` C functions.
> Code operating at the [OS level](../design.md) MAY choose to
> change the default converter with `ucnv_setDefaultName()`.
> However, be aware that this change has inconsistent results if it is done after
> ICU components are initialized.

### Closing a Converter

Closing a converter frees memory occupied by that instance of the converter.
However it does not release the larger shared data tables the converter might
use. OS-level code may call `ucnv_flushCache()` to explicitly free memory occupied
by [unused tables](../design.md).

```c
ucnv_close(conv);
```

### Converter Life Cycle

Note that a Converter is created with a certain type (for instance, ISO-8859-3)
which does not change over the life of that [object](../design.md). Converters
should be allocated one per thread. They are cheap to create, as the shared data
doesn't need to be reallocated.

This is the typical life cycle of a converter, as shown step-by-step:

1. First, open up the converter with a specified name (or alias name).

    ```c
    UConverter *conv = ucnv_open("shift_jis", &status);
    ```

2. Target here is the `char s[]` to write into, and targetSize is how big the
   target buffer is. Source is the UChars that are being converted.

    ```c
    int32_t len = ucnv_fromUChars(conv, target, targetSize, source, u_strlen(source), &status);
    ```

3. Clean up the converter.

    ```c
    ucnv_close(conv);
    ```

### Sharing Converters Between Threads

A converter cannot be shared between threads at the same time. However, if it is
reset it can be used for unrelated chunks of data. For example, use the same
converter for converting data from Unicode to ISO-8859-3, and then reset it. Use
the same converter for converting data from ISO-8859-3 back into Unicode.

### Converting Large Quantities of Data

If it is necessary to convert a large quantity of data in smaller buffers, use
the same converter to convert each buffer. This will make sure any state is
preserved from one chunk to the next. Doing this conversion is known as
streaming or buffering, and is described in the
[Buffered or Streamed](#3-buffered-or-streamed) section later in this chapter.

### Cloning a Converter

Cloning a converter returns a clone of the converter object along with any
internal state that the converter might be storing. Cloning routines must be
used with extreme care when using converters for stateful or multibyte
encodings. If the converter object is carrying an internal state, and the
newly-created clone is used to convert a new chunk of text, the converter
produces incorrect results. Also note that the caller owns the cloned object and
has to call `ucnv_close()` to dispose of the object. Calling `ucnv_reset()` before
cloning will reset the converter to its original state.

```c
UConverter* newCnv = ucnv_safeClone(oldCnv, NULL, &bufferSize, &err);
```

## Converter Behavior

### Conversion

1. The converters always consume the source buffer as far as possible, and
   advance the source pointer.

2. The converters write to the target all converted output as far as possible,
   and then write any remaining output to the internal services buffer. When
   the conversion routines are called again, the internal buffer is flushed out
   and written to the target buffer before proceeding with any further
   conversion.

3. In conversions to Unicode from Multi-byte encodings or conversions from
   Unicode involving surrogates, if (a) only a partial byte sequence is
   retrieved from the source buffer, (b) the "flush" parameter is set to "true"
   and (c) the end of source is reached, then the callback is called with
   `U_TRUNCATED_CHAR_FOUND`.

### Reset

Converters can be reset explicitly or implicitly. Explicit reset is done by
calling:

1. `ucnv_reset()`: Resets the converter to initial state in both directions.

2. `ucnv_resetToUnicode()`: Resets the converter to initial state to Unicode
   direction.

3. `ucnv_resetFromUnicode()`: Resets the converter to initial state from Unicode
   direction.

The converters are reset implicitly when the conversion functions are called
with the "flush" parameter set to "true" and the source is consumed.

### Error

#### Conversion from Unicode

Not all characters can be converted from Unicode to other codepages. In most
cases, Unicode is a superset of the characters supported by any given codepage.

The default behavior of ICU in this case is to substitute the illegal or
unmappable sequence with the appropriate substitution sequence for that
codepage. For example, ISO-8859-1, along with most ASCII-based codepages, has
the character 0x1A (Control-Z) as the substitution sequence. When converting
from Unicode to ISO-8859-1, any characters which cannot be converted would be
replaced by 0x1A's.

SubChar1 is sometimes used as substitution character in MBCS conversions. For
more information on SubChar1 please see the [Conversion Data](data.md) chapter.

In stateful converters like ISO-2022-JP, if a substitution character has to be
written to the target, then an escape/shift sequence to change the state to
single byte mode followed by a substitution character is written to the target.

The substitution character can be changed by calling the `ucnv_setSubstChars()`
function with the desired codepage byte sequence. However, this has some
limitations: It only allows setting a single character (although the character
can consist of multiple bytes), and it may not work properly for some stateful
converters (like HZ or ISO 2022 variants) when setting a multi-byte substitution
character. (It will work for EBCDIC_STATEFUL ones.) Moreover, for setting a
particular character, the caller needs to know the correct byte sequence for
that character in the converter's codepage. (For example, a space (U+0020) is
encoded as 0x20 in ASCII-based codepages, 0x40 in EBCDIC-based ones, 0x00 0x20
or 0x20 0x00 in UTF-16 depending on the stream's endianness, etc.)

The `ucnv_setSubstString()` function (new in ICU 3.6) lifts these limitations. It
takes a Unicode string and verifies that it can be converted to the codepage
without error and that it is not too long (32 bytes as of ICU 3.6). The string
can contain zero, one or more characters. An empty string has the effect of
using the skip callback. See Error Callbacks below. Stateful converters are
fully supported. The same Unicode string will give equivalent results with all
converters that support its conversion.

Internally, `ucnv_setSubstString()` stores the byte sequence from the test
conversion if the converter is stateless, or the Unicode string itself if the
converter is stateful. If the Unicode string is stored, then it is converted on
the fly during substitution, handling all state transitions.

The function `ucnv_getSubstChars()` can be used to retrieve the substitution byte
sequence if it is the default one, set by `ucnv_setSubstChars()`, or if
`ucnv_setSubstString()` stored the byte sequence for a stateless converter. The
Unicode string set for a stateful converter cannot be retrieved.

#### Conversion to Unicode

In conversion to Unicode, errors are normally due to ill-formed byte sequences:
Unused byte values, or lead bytes not followed by trail bytes according to the
encoding scheme. Well-formed but unmappable sequences are unusual but possible.

The ICU default behavior is to emit a `U+FFFD REPLACEMENT CHARACTER` per
offending sequence.

If the conversion table .ucm file contains a `<subchar1>` entry (such as in the
ibm-943 table), a U+001A C0 control ("SUB") is emitted for single-byte
illegal/unmappable input rather than `U+FFFD REPLACEMENT CHARACTER`. For details
on this behavior look for "001A" in the [Conversion Data](data.md) chapter.

* This behavior originates from mainframes with dedicated single-byte-to-single-byte
  and double-to-double conversions.
* Emitting U+001A for single-byte errors can be avoided by (a) removing the
  `<subchar1>` mapping or (b) using a similar conversion table that does not
  have this mapping (e.g., windows-932 instead of ibm-943) or (c) writing a
  custom callback function.
443 444### Error Codes 445 446Here are some of the `UErrorCode`s which have significant meaning for conversion: 447 448#### U_INDEX_OUTOFBOUNDS_ERROR 449 450In `getNextUChar()` - all source data 451has been consumed without producing a Unicode character 452 453#### U_INVALID_CHAR_FOUND 454No mapping was found from the source to the target encoding. For example, U+0398 455(Capital Theta) has no mapping into ISO-8859-1, and so U_INVALID_CHAR_FOUND 456will result. 457 458#### U_TRUNCATED_CHAR_FOUND 459 460All of the source data was read, and a 461character sequence was incomplete. For example, only half of a double-byte 462sequence may have been encountered. When converting FROM Unicode, this error 463would occur when a conversion ends with a low surrogate (U+D800) at the end of 464the source, with no corresponding high surrogate. 465 466#### U_ILLEGAL_CHAR_FOUND 467 468A character sequence was found in the source which is disallowed in the source 469encoding scheme. For example, many MBCS encodings have only certain byte 470sequences which are allowed as lead bytes. When converting from Unicode, if a 471low surrogate is NOT followed immediately by a high surrogate, or a high 472surrogate without its preceding low surrogate, an illegal sequence results. 473Note: Most, but not all, converters forbid surrogate code points or unpaired 474surrogate code units. (Lead surrogate without trail, or trail without lead.) 475Some converters permit surrogate code points/unpaired surrogates because their 476charset specification permits it. For example, LMBCS, SCSU and 477BOCU-1. 478 479#### U_INVALID_TABLE_FORMAT 480 481An error occurred trying to read the backing data 482for the converter. The data could be corrupt, or the wrong 483version. 484 485#### U_BUFFER_OVERFLOW_ERROR 486 487More output (target) characters were produced 488than fit in the target buffer. 
If in `to/fromUnicode()`, then process the target 489buffer and call the function again to retrieve the overflowed characters. 490 491### Error Callbacks 492 493What actually happens is that an "error callback function" is called at the 494point where the conversion failure occurred. The function can deal with the 495failed characters as it sees fit. Possible options at the callback's disposal 496include ignoring the bad sequence, converting it to a different sequence, and 497returning an error to the caller. The callback can also consume any data past 498where the error occurred, whether or not that data would have caused an error. 499Only one callback is installed at a time, per direction (to or from unicode). 500 501A number of canned functions are provided by ICU, and an application can write 502new ones. The "callbacks" are either From Unicode (to codepage), or To Unicode 503(from codepage). Here is a list of the canned callbacks in ICU: 504 5051. UCNV_**FROM_U**_CALLBACK_SUBSTITUTE: This callback is installed by default. 506 It will write the codepage's substitute sequence or a user-set substitute 507 sequence, or convert a user-set substitute UnicodeString to the codepage. 508 See "Error / Conversion from Unicode" above. 509 5102. UCNV_**TO_U**_CALLBACK_SUBSTITUTE: This callback is installed by default. It 511 will write U+FFFD or sometimes U+001A. See "Error / Conversion to Unicode" 512 above. 513 5143. UCNV_FROM_U_CALLBACK_SKIP, UCNV_TO_U_CALLBACK_SKIP: Simply ignores any 515 invalid characters in the input, no error is returned. 516 5174. UCNV_FROM_U_CALLBACK_STOP, UCNV_TO_U_CALLBACK_STOP: Stop at the error. 518 Return the error to the caller. (When using the 'BUFFER' mode of conversion, 519 the source and target pointers returned can be examined to determine where 520 the error occurred. `ucnv_getInvalidUChars()` and `ucnv_getInvalidChars()` 521 return the actual text which failed). 522 5235. 
UCNV_FROM_U_CALLBACK_ESCAPE, UCNV_TO_U_CALLBACK_ESCAPE: This callback is 524 especially useful for debugging. Missing codepage characters are replaced by 525 strings such as '%U094D' with the Unicode value, and missing Unicode chars 526 are replaced with text of the form '%X0A' where the codepage had the 527 unconvertible byte hex 0A. 528 529 When a callback is set, a "context" pointer is also provided. How this 530 pointer is created depends on the specific callback. There is usually a 531 `createContext()` function for that specific callback, where the caller can 532 set certain options for the callback. Consult the documentation for the 533 specific callback you are using. For ICU's canned callbacks, this pointer 534 may be set to NULL. The functions for setting a different callback also 535 return the old callback, and the old context pointer. These may be stored so 536 that the old callback is re-installed when an operation is finished. 537 538 Additionally the following options can be passed as the context parameter to 539 UCNV_FROM_U_CALLBACK_ESCAPE callback function to produce different outputs. 540 541 | UCNV_ESCAPE_ICU | %U12345 | 542 | ------------------- | ------- | 543 | UCNV_ESCAPE_JAVA | \\u1234 | 544 | UCNV_ESCAPE_C | \\udbc9\\udd36 for Plane 1 and \\u1234 for Plane 0 codepoints | 545 | UCNV_ESCAPE_XML_DEC | \ᅬ number expressed in Decimal | 546 | UCNV_ESCAPE_XML_HEX | \ሴ number expressed in Hexadecimal | 547 548Here are some examples of how to use callbacks. 549 550```c 551UConverter *u; 552void *oldContext, *newContext; 553UConverterFromUCallback oldAction, newAction; 554u = ucnv_open("shift_jis", &myError); 555 556... /* do some conversion with u from unicode.. */ 557 558ucnv_setFromUCallBack( 559 u, MY_FROMU_CALLBACK, newContext, &oldAction, &oldContext, &myError); 560 561... 
/* do some other conversion from unicode */ 562 563/* Now, set the callback back */ 564ucnv_setFromUCallBack( 565 u, oldAction, oldContext, &newAction, &newContext, &myError); 566 567``` 568 569### Custom Callbacks 570 571Writing a callback is somewhat involved, and will be covered more completely in 572a future version of this document. One might look at the source to the provided 573callbacks as a starting point, and address any further questions to the mailing 574list. 575 576Basically, callback, unlike other ICU functions which expect to be called with 577`U_ZERO_ERROR` as the input, is called in an exceptional error condition. The 578callback is a kind of 'last ditch effort' to rectify the error which occurred, 579before it is returned back to the caller. This is why the implementation of STOP 580is very simple: 581 582```c 583void UCNV_FROM_U_CALLBACK_STOP(...) { } 584``` 585 586The error code such as `U_INVALID_CHAR_FOUND` is returned to the user. If the 587callback determines that no error should be returned to the user, then the 588callback must set the error code to `U_ZERO_ERROR`. Note that this is a departure 589from most ICU functions, which are supposed to check the error code and return 590immediately if it is set. 591 592> :point_right: **Note**: See the functions `ucnv_cb_write...()` for 593> functions which a callback may use to perform its task. 594 595#### Ignore Default_Ignorable_Code_Point 596 597Unicode has a number of characters that are not by themselves meaningful but 598assist with line breaking (e.g., U+00AD Soft Hyphen & U+200B Zero Width Space), 599bi-directional text layout (U+200E Left-To-Right Mark), collation and other 600algorithms (U+034F Combining Grapheme Joiner), or indicate a preference for a 601particular glyph variant (U+FE0F Variation Selector 16). These characters are 602"invisible" by default, that is, they should normally not be shown with a glyph 603of their own, except in special circumstances. 
Examples include showing a hyphen 604for when a Soft Hyphen was used for a line break, or modifying the glyph of a 605character preceding a Variation Selector. 606 607Unicode has a character property to identify such characters, as well as 608currently-unassigned code points that are intended to be used for similar 609purposes: Default_Ignorable_Code_Point, or "DI" for short: 610http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=[:DI:] 611 612Most charsets do not have most or any of these characters. 613 614**ICU 54 and above by default skip default-ignorable code points if they are 615unmappable**. (Ticket #[10551](https://unicode-org.atlassian.net/browse/ICU-10551)) 616 617**Older versions of ICU** replaced unmappable default-ignorable code points like 618any other unmappable code points, by a question mark or whatever substitution 619character is defined for the charset. 620 621For best results, a custom from-Unicode callback can be used to ignore 622Default_Ignorable_Code_Point characters that cannot be converted, so that they 623are removed from the charset output rather than replaced by a visible character. 624 625This is a code snippet for use in a custom from-Unicode callback: 626 627```c 628#include "unicode/uchar.h" 629// ... 630(from-Unicode callback) 631 switch(reason) { 632 case UCNV_UNASSIGNED: 633 if(u_hasBinaryProperty(codePoint, UCHAR_DEFAULT_IGNORABLE_CODE_POINT)) { 634 // Ignore/drop default ignorable code points that cannot be converted, 635 // rather than treating them like errors/writing a substitution character etc. 636 // For example, U+200B Zero Width Space, 637 // U+200E Left-To-Right Mark, U+FE0F Variation Selector 16. 638 *pErrorCode = U_ZERO_ERROR; 639 return; 640 } else { 641 // ... 642``` 643 644## Modes of Conversion 645 646When a converter is instantiated, it can be used to convert both in the Unicode 647to Codepage direction, and also in the Codepage to Unicode direction. 
There are
three ways to use the converters, as well as a convenience function which does
not require the instantiation of a converter.

1. **Single-String**: Simplest type of conversion to or from Unicode. The data
   is entirely contained within a single string.

2. **Character**: Converting from the codepage to Unicode, one code point at a
   time.

3. **Buffer**: Convert data which may not fit entirely within a single buffer.
   Usually the most efficient and flexible.

4. **Convenience**: Convert a single buffer from one codepage to another
   through Unicode, without requiring the instantiation of a converter.

### 1. Single-String

Data must be contained entirely within a single string or buffer.

```c
conv = ucnv_open("shift_jis", &status);

/* Convert from Unicode to Shift JIS */
len = ucnv_fromUChars(conv, target, targetLen, source, sourceLen, &status);
ucnv_close(conv);

conv = ucnv_open("iso-8859-3", &status);
/* Convert from ISO-8859-3 to Unicode */
len = ucnv_toUChars(conv, target, targetSize, source, sourceLen, &status);
ucnv_close(conv);
```

### 2. Character

In this mode, the input data is in the specified codepage. Each function call
converts only the next Unicode code point. This might be the most efficient way
to scan for a certain character, or to do other processing one character at a
time, because converters are stateful. This works even for multibyte charsets,
and for stateful ones such as iso-2022-jp.

```c
conv = ucnv_open("Big-5", &status);
UChar32 target;
while(source < sourceLimit) {
    /* Advances 'source' past the bytes of one code point */
    target = ucnv_getNextUChar(conv, &source, sourceLimit, &status);
    ASSERT(status);
    processChar(target);
}
```

### 3. Buffered or Streamed

This is used in situations where a large document may be read in from disk and
processed.
Also, many codepages take multiple bytes to encode a character, or
have state. These factors make it impossible to convert arbitrary chunks of data
without maintaining state across chunks. Even conversion from Unicode may
encounter a leading surrogate at the end of one buffer, which needs to be paired
with the trailing surrogate in the next buffer.

A basic API principle of the ICU to/from Unicode functions is that they will
ALWAYS attempt to consume all of the input (source) data, unless the output
buffer is full or some other error occurs. In other words, there is no need to
ever test whether all of the source data has been consumed.

The basic loop that is used with the ICU buffer conversion routines is the same
in the to and from Unicode directions. In the following pseudocode, either
'source' (for fromUnicode) or 'target' (for toUnicode) are UTF-16 UChars.

```c
UErrorCode err = U_ZERO_ERROR;

while (... /*input data available*/ ) {
    ... /* read input data into buffer */

    source = ... /* beginning of read data */;
    sourceLimit = source + readLength; // end + 1

    UBool flush = ... /* true if no further input data is available (e.g. feof()) */;

    /* loop until all source has been processed */
    do {
        /* set up target pointers */
        target = ... /* beginning of output buffer */;
        targetLimit = target + sizeOfOutput;

        err = U_ZERO_ERROR; /* so that the to/from does not fail */

        ucnv_to/fromUnicode(converter, &target, targetLimit,
                            &source, sourceLimit, NULL, flush, &err);

        ... /* write (target-beginningOfOutputBuffer) items
               starting at beginning of output buffer */
    } while (err == U_BUFFER_OVERFLOW_ERROR);
    if(U_FAILURE(err)) {
        ... /* process error */
        break; /* out of the 'while' loop that reads source data */
    }
} /* end of the loop that reads input data */
if(U_FAILURE(err)) {
    ... /* process error further */
}
```

The above code optimizes for processing entire chunks of input data. An
efficient size for the output buffer can be calculated as follows (in bytes):

```c
/* output is UChars (to Unicode) */
ucnv_getMinCharSize(converter) * inputBufferSize * sizeof(UChar)
/* output is codepage bytes (from Unicode) */
ucnv_getMaxCharSize(converter) * inputBufferSize
```

There are two loops used, an outer and an inner. The outer loop fetches input
data to keep the source buffer full, and the inner loop 'writes' out data to
keep the output buffer empty.

Note that while this efficiently handles data on the input side, there are some
cases where the size of the output buffer is fixed. For instance, in network
applications it is sometimes desirable to fill every output packet completely
(not including the last packet in the sequence). The above loop does not ensure
that every output buffer is completely full. For example, if a 4 UChar input
buffer was used, and a 3 byte output buffer with `fromUnicode()`, the loop would
typically write 3 bytes, then 1, then 3, and so on. If, instead of efficient use
of the input data, the goal is filling output buffers, a slightly different loop
can be used.

In such a scenario, the inner write does not occur unless a buffer overflow
occurs OR 'flush' is true. So, the 'write' and resetting of the target and
targetLimit pointers would only happen
`if (err == U_BUFFER_OVERFLOW_ERROR || flush == true)`

The flush parameter on each conversion call should be set to false until the
conversion function is called for the last time for the buffer. This is because
the conversion is stateful. On the last conversion call, the flush parameter
should be set to true. More details are mentioned in the API reference in
[ucnv.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html).

### 4. Pre-flighting

Preflighting is the process of asking the conversion API for the size of target
buffer required. (For a more general discussion, see the Preflighting section
(§) in the [Strings](../strings/index.md) chapter.)

This is accomplished by calling the `ucnv_fromUChars` and `ucnv_toUChars`
functions with a `NULL` target and a target capacity of zero.

```c
UChar *uchar2;
char input_char_buffer[] = "This is some text";
UErrorCode err = U_ZERO_ERROR;
int32_t targetsize;

/* Preflight: NULL target and zero capacity; returns the required length */
targetsize = ucnv_toUChars(myConverter, NULL, 0,
                           input_char_buffer, sizeof(input_char_buffer), &err);

if(err==U_BUFFER_OVERFLOW_ERROR) {
    err=U_ZERO_ERROR;
    uchar2=(UChar*)malloc((targetsize) * sizeof(UChar));
    targetsize = ucnv_toUChars(myConverter, uchar2, targetsize,
                               input_char_buffer, sizeof(input_char_buffer), &err);
    if(U_FAILURE(err)) {
        printf("ucnv_toUChars() FAILED %s\n", myErrorName(err));
    }
    else {
        printf("ucnv_toUChars() o.k.\n");
    }
    free(uchar2);
}
```

> :point_right: **Note**: *This is inefficient since the conversion is performed
> **twice**, once for finding the size of target and once for writing to the target*.

### 5. Convenience

ICU provides some convenience functions for conversions:

```c
ucnv_toUChars(myConverter, target_uchars, targetsize,
              input_char_buffer, sizeof(input_char_buffer), &err);
ucnv_fromUChars(cnv, cTarget, (cTargetLimit-cTarget),
                uSource, (uSourceLimit-uSource), &errorCode);

char target[100];
UnicodeString str("ABCDEF", "iso-8859-1");
int32_t targetsize = str.extract(0, str.length(), target, sizeof(target), "SJIS");
target[targetsize] = 0; /* NUL termination */
```

## Conversion Examples

See the [ICU Conversion Examples](https://github.com/unicode-org/icu/blob/main/icu4c/source/samples/ucnv/convsamp.cpp) for more information.
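
As a rough, ICU-free illustration of the two buffered loops described under
"Buffered or Streamed", the sketch below replaces `ucnv_fromUnicode()` with a
toy stand-in that copies one byte per 16-bit unit. All names here (`DemoStatus`,
`demo_from_unicode`, `demo_convert`) are hypothetical and are not part of ICU;
the stand-in only mimics how the real function advances the source and target
pointers and reports an overflow when the target is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UErrorCode; not an ICU type. */
typedef enum { DEMO_OK = 0, DEMO_BUFFER_OVERFLOW = 1 } DemoStatus;

/* Toy stand-in for ucnv_fromUnicode(): one output byte per input unit.
 * Advances *target and *source; reports overflow if the target is full
 * while source units remain. */
static void demo_from_unicode(char **target, const char *targetLimit,
                              const uint16_t **source,
                              const uint16_t *sourceLimit,
                              DemoStatus *status)
{
    assert(*target <= targetLimit && *source <= sourceLimit);
    while (*source < sourceLimit) {
        if (*target == targetLimit) {
            *status = DEMO_BUFFER_OVERFLOW;
            return;
        }
        **target = (char)(**source & 0x7F);
        ++*target;
        ++*source;
    }
    *status = DEMO_OK;
}

/* Converts 'input' in chunks of 'chunkLen' units using an output buffer of
 * 'outCap' bytes, and records the size of each "write" in 'writes'.
 * fillOutput == 0: basic loop, write after every conversion call.
 * fillOutput != 0: write only on overflow or when flushing the last chunk.
 * Returns the number of recorded writes. */
static size_t demo_convert(const uint16_t *input, size_t inputLen,
                           size_t chunkLen, size_t outCap, int fillOutput,
                           size_t *writes, size_t maxWrites)
{
    char out[16];
    char *target = out;   /* persists across chunks in fill mode */
    size_t n = 0;
    assert(chunkLen > 0 && outCap > 0 && outCap <= sizeof(out));

    for (size_t pos = 0; pos < inputLen; pos += chunkLen) {
        size_t len = (inputLen - pos < chunkLen) ? inputLen - pos : chunkLen;
        const uint16_t *source = input + pos;
        const uint16_t *sourceLimit = source + len;
        int flush = (pos + len == inputLen);   /* true for the last chunk */
        DemoStatus status;

        do {
            status = DEMO_OK;
            demo_from_unicode(&target, out + outCap,
                              &source, sourceLimit, &status);
            if (!fillOutput || status == DEMO_BUFFER_OVERFLOW || flush) {
                if (target > out && n < maxWrites) {
                    writes[n++] = (size_t)(target - out);   /* "write" */
                }
                target = out;   /* output buffer is empty again */
            }
        } while (status == DEMO_BUFFER_OVERFLOW);
    }
    return n;
}
```

With 8 input units, 4-unit input chunks and a 3-byte output buffer, the basic
mode records writes of 3, 1, 3 and 1 bytes, while the fill mode records 3, 3
and 2 bytes, so only the final write is short. A real converter can produce
several bytes per code unit and keeps internal state, which this stand-in does
not model.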