---
layout: default
title: Conversion Data
nav_order: 2
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Conversion Data
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Introduction

### Algorithmic vs. Data-based

In a comprehensive conversion library, there are three kinds of codepage
converter implementations: converters that use algorithms, converters that use
mapping data, and converters that use both.

1. Most codepages have a simple and straightforward structure but have an
   arbitrary relationship between input and output character codes. Mapping
   tables are necessary to define the conversion. If the codepage characters
   use more than one byte each, then the mapping table must also define the
   structure of the codepage.

2. Algorithmic converters work by transforming the input stream with built-in
   algorithms and possibly small, hard coded tables. The conversion can be
   complex, but the actual mapping of a character code is done numerically if
   the converter is purely algorithmic.

3. In some cases, a converter needs to be algorithmic for its basic operations
   but also relies on mapping data.

ICU provides converter implementations for all three groups of codepages. Since
ICU always converts to or from Unicode, the purely algorithmic converters are
the ones for Unicode encodings (such as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,
UTF-32LE, SCSU, BOCU-1 and UTF-7). Since Unicode is based on US-ASCII and
ISO-8859-1 ("ISO Latin-1"), these encodings also use algorithmic converters for
performance reasons.

Most other codepages use simple byte sequences but are not encodings of Unicode.
They are converted with generic code using mapping data tables.
ICU also
supports a few encodings, like ISO-2022 and its variants, that employ an
algorithmic structure to switch between a set of codepages. The converters for
these encodings are algorithmic but use mapping tables for the embedded
codepages.

### Stateful vs. Stateless

Character encodings are either stateful or stateless:

1. Stateless encodings define a byte sequence for each character. Complete
   character byte sequences can be used in any order, and the same complete
   character byte sequence always encodes the same character. It is
   preferable to always encode one character using the same byte sequence.

2. Stateful encodings define byte sequences that change the state of the text
   stream. Depending on the current state, the same byte sequence may encode a
   different character and the same character may be encoded with different
   byte sequences.

This distinction between stateless and stateful encodings is important, because
it determines if any available ICU converter implementation is used. The
following are some more important considerations related to stateless versus
stateful encodings:

1. A runtime converter object is always stateful, even for "stateless"
   encodings. It is always stateful because an input buffer may end with a
   partial byte sequence that is to be continued in the next input buffer in
   the following conversion call. The information about this is stored in the
   converter object. Similarly, if the input is Unicode text, then an input
   buffer may end with the first of a pair of surrogates. The converter object
   also stores overflow bytes or code units if the result of a character
   mapping did not fit entirely into the output buffer.

2. Stateless encodings are stateful in our converter implementation to
   interpret "complete byte sequences".
   They are "stateful" because many
   encodings can have the same byte value used in different positions of byte
   sequences for different characters; a specific byte value may be a lead byte
   or a trail byte. For instance, the lead and trail byte values overlap in
   codepages like Shift-JIS. If a program does not start reading at a character
   boundary, it may instead interpret the byte sequences from two or more
   separate characters as one character. Often, character boundaries can be
   detected reliably only by reading the non-Unicode text linearly from the
   beginning. This can be a problem for non-Unicode text processing, where text
   insertion, deletion, and searching are common. The UTF-8/16/32 encodings do
   not have this problem because the single, lead, and trail units have disjoint
   values and character boundaries can be easily found.

3. Some stateful encodings only switch between two states: one with one byte
   per character and one with two bytes per character. This type of encoding is
   very common in mainframe systems based on Extended Binary Coded Decimal
   Interchange Code (EBCDIC) and is actually handled in ICU with almost the
   same code and type of mapping tables as stateless codepages.

4. The classifications of algorithmic vs. data-based converters and of
   stateless vs. stateful encodings are independent of each other: UTF-8,
   UTF-16, and UTF-32 encodings are algorithmic but stateless; UTF-7 and SCSU
   encodings are algorithmic and stateful; Windows-1252 and Shift-JIS encodings
   are data-based and stateless; ISO-2022-JP encoding is algorithmic,
   data-based, and stateful.

### Scope of this chapter

The following sections in this chapter discuss the mapping data tables that are
used in ICU. For related material, please see:

1. [ICU character set collection](https://icu.unicode.org/charts/charset)

2. [Unicode Technical Report 22](http://www.unicode.org/reports/tr22/)

3. "Cross Mapping Tables" in [Unicode Online
   Data](http://www.unicode.org/onlinedat/online.html)

## ICU Mapping Table Data Files

### Overview

As stated above, most ICU converters rely on character mapping tables. ICU 1.8
has one single data structure for all character mapping tables, which is used by
a generic Multi-Byte Character Set (MBCS) converter implementation. The
implementation is flexible enough to handle stateless encodings with the
following parameters:

1. Support for variable-length, byte-based encodings with 1 to 4 bytes per
   character.

2. Support for all Unicode characters (code points 0..0x10ffff). Since ICU 1.8
   uses the UTF-16 encoding as its Unicode encoding form, surrogate pairs are
   completely supported.

3. Efficient distinction between unassigned (unmappable) and illegal byte
   sequences.

4. It is not possible to convert from Unicode to byte sequences with leading
   zero bytes.

5. Simple stateful encodings are also handled using only Shift-In and Shift-Out
   (SI/SO) codes and one single-byte and one double-byte state.

> :point_right: **Note**: *In the context of conversion tables, "unassigned" code
> points or codepage byte sequences are valid but do not have a **mapping**. This
> is different from "unassigned" code points in a character set like Unicode or
> Shift-JIS which are codes that do not have assigned **characters**.*

Prior to version 1.8, ICU used more specific, more limited, converter
implementations for Single Byte Character Set (SBCS), Double Byte Character Set
(DBCS), and the stateful Extended Binary Coded Decimal Interchange Code (EBCDIC)
codepages. Mapping table data is provided in text files.
ICU comes with several
dozen .ucm files (UniCode Mapping, in icu/source/data/mappings/) that are
translated at build time by its makeconv tool (source code in
icu/source/tools/makeconv). The makeconv tool writes one binary, memory-mappable
.cnv file per .ucm file. The resulting .cnv files are included by default in the
common data file for use at runtime.

The format of the .ucm files is similar to the format of the UPMAP files as
provided by IBM® in the codepage repository and as used in the uconvdef tool on
AIX. UPMAP is a text file that specifies the mapping of a codepage character to
and from Unicode.

The format of the .cnv files is ICU-specific. The .cnv file format may change
between ICU versions even for the same .ucm files. The .ucm file format may be
extended to include more features.

The following sections concentrate on the .ucm file format. The .cnv file format
is described in the source code in the `icu/source/common/ucnvmbcs.c` file
and is updated using the MBCS converter implementation.

These conversion tables can have more than one name. ICU allows multiple names
("aliases") for the same encoding. It matches a requested encoding name against
a list of names in `icu/source/data/mappings/convrtrs.txt` and when it finds a
match, ICU opens a converter with the name in the leftmost position in the
matching line. The name matching is not case-sensitive and ICU ignores spaces,
dashes, and underscores. At build time, the gencnval tool, located in the
`icu/source/tools/gencnval` directory, generates a binary form of the
convrtrs.txt file for use at runtime: the cnvalias.icu file ("Converter Aliases
data file").

### .ucm File Format

.ucm files are line-oriented text files. Empty lines and comments starting with
'`#`' are ignored.

A .ucm file contains two sections:

1. a header with general specifications of the codepage

2. a mapping table section between the "CHARMAP" and "END CHARMAP" lines.

For example:

```
<code_set_name> "IBM-943"
<char_name_mask> "AXXXX"
<mb_cur_min> 1
<mb_cur_max> 2
<uconv_class> "MBCS"
<subchar> \xFC\xFC
<subchar1> \x7F
<icu:state> 0-7f, 81-9f:1, a0-df, e0-fc:1
<icu:state> 40-7e, 80-fc
#
CHARMAP
#
#
#ISO 10646 IBM-943
#_________ _________
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
...
<UFFE4> \xFA\x55 |1
<UFFE5> \x81\x8F |0
<UFFFD> \xFC\xFC |2
END CHARMAP
```

The header fields are:

1. code_set_name - The name of the codepage. The makeconv tool generates the
   .cnv file name from the .ucm filename but uses this header field for the
   converter name that it writes into the .cnv file for ucnv_getName. The
   makeconv tool prints a warning message if this header field does not match
   the file name. The file name is not case-sensitive.

2. char_name_mask - This is ignored by the makeconv tool. "AXXXX" specifies that
   the POSIX-style character "name" consists of one letter (Alpha) followed by
   4 hexadecimal digits. Since ICU only uses Unicode character "names" (that
   is, code points) the format is fixed (see below).

3. mb_cur_min - The minimum number of bytes per character.

4. mb_cur_max - The maximum number of bytes per character.

5. uconv_class - This can be either "SBCS", "DBCS", "MBCS", or
   "EBCDIC_STATEFUL".
   The most general converter class/type/category is MBCS, which requires that
   the codepage structure be specified in the `<icu:state>` lines that follow.
   The other types of converters are subsets of MBCS. The makeconv tool uses
   predefined state tables for these other converters when their structure is
   not explicitly specified. The following describes how the converter types
   are interpreted:

   a. MBCS: Generic ICU converter type, requires a state table

   b. SBCS: Single-byte, 8-bit codepages

   c. DBCS: Double-byte EBCDIC codepages

   d. EBCDIC_STATEFUL: Mixed Single-Byte or Double-Byte EBCDIC codepages (stateful, using SI/SO)

   The exact implied state tables for non-MBCS types are shown in the "Examples
   for codepage state tables" section. A state table may need to be overridden
   in order to allow supplementary characters (U+10000 and up).

6. subchar - The substitution character byte sequence for this codepage. This sequence must be a valid byte sequence according to the codepage structure.

7. subchar1 - This is the single byte substitution character when subchar is defined. Some IBM converter libraries use different substitution characters for "narrow" and "wide" characters (single-byte and double-byte). ICU uses only one substitution character per codepage because it is common industry practice.

8. icu:state - See the "State table syntax in .ucm files" section for a detailed description of how to specify a codepage structure.

9. icu:charsetFamily - This specifies if the codepage is ASCII or EBCDIC based.

The subchar and subchar1 fields have been known to cause some confusion. The
following conditions outline when each is used:

1. Conversion from Unicode to a codepage occurs and an unassigned code point is
   found

   a. If a subchar1 byte is defined and a subchar1 mapping is defined for the code point (with a |2 precision indicator),
   output the subchar1

   b. Otherwise output the regular subchar

2. Conversion from a codepage to Unicode occurs and an unassigned code point is found

   a. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage, output U+001A

   b. Otherwise output U+FFFD

In the CHARMAP section of a .ucm file, each line contains a Unicode code point
(like <U(*1-6 hexadecimal digits for the code point*)> ), a codepage character
byte sequence (each byte like `\xhh` (2 hexadecimal digits) ), and an optional
"precision" or "fallback" indicator.

The precision indicator must be present either in all mappings or in none of
them. The indicator is a pipe symbol `|` followed by a 0, 1, 2, 3, or 4 that has
the following meaning:

* `|0` - A "normal", roundtrip mapping from a Unicode code point and back.
* `|1` - A "fallback" mapping only from Unicode to the codepage, but not back.
* `|2` - A subchar1 mapping. The code point is unmappable, and if a substitution
  is performed, then the subchar1 should be used rather than the subchar.
  Otherwise, such mappings are ignored.
* `|3` - A "reverse fallback" mapping only from the codepage to Unicode, but not
  back to the codepage.
* `|4` - A "good one-way" mapping only from Unicode to the codepage, but not
  back.

Fallback mappings from Unicode typically do not map codes for the same
character, but for "similar" ones. This mapping is sometimes done if a character
exists in Unicode but not in the codepage. To replace it, ICU maps the Unicode
code point to the codepage code of a similar-looking character for
human-readable output. This mapping feature is not useful for text data
transmission, especially in markup languages where a Unicode code point can be
escaped with its code point value. The ICU application programming interface
(API) `ucnv_setFallback()` controls this fallback behavior.

"Reverse fallbacks" are technically similar, but the same Unicode character can
be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.

A subset of the fallback mappings from Unicode is always used at runtime: those
that map private-use Unicode code points.
Fallbacks from private-use code points
are often introduced as replacements for previous roundtrip mappings for the
same pair of codes. These replacements are used when a Unicode version assigns a
new character that was previously mapped to that private-use code point. The
mapping table is then changed to map the same codepage byte sequence to the new
Unicode code point (as a new roundtrip) and the mapping from the old private-use
code point to the same codepage code is preserved as a fallback.

A "good one-way" mapping is like a fallback, but ICU always uses "good one-way"
mappings at runtime, regardless of the fallback API flag.

The idea is that fallbacks normally lose information, such as mapping from a
compatibility variant of a letter to the ASCII version; however, fallbacks from
PUA and reverse fallbacks are assumed to be for "the same character", just an
older code for it.

Something similar happens with from-Unicode Variation Selector sequences. It is
possible to round-trip (`|0`) either the unadorned character or the sequence with
a variation selector, and add a "good one-way" mapping (`|4`) from the other
version. That "good one-way" mapping does not lose much information, and it is
used even if the "use fallback" API flag is false. Alternatively, both mappings
could be fallbacks (`|1`) that should be controlled by the "use fallback"
attribute.

### State table syntax in .ucm files

The conversion to Unicode uses a state machine to achieve the above capabilities
with reasonable data file sizes. The state machine information itself is loaded
with the conversion data and defines the structure of the codepage, including
which byte sequences are valid, unassigned, and illegal. This data cannot (or
not easily) be computed from the pure mapping data. Instead, the .ucm files for
MBCS encodings have additional entries that are specific to the ICU makeconv
tool.
The state tables for SBCS, DBCS, and EBCDIC_STATEFUL are implied, but they
can be overridden (see the examples below). These state tables are specified in
the header section of the .ucm file that contains the `<icu:state>` element. Each
line defines one aspect of the state machine. The state machine uses a table of
as many rows as there are states (= as many as there are `<icu:state>` lines).
Each row has 256 entries; one for each possible byte value.

The state table lines in the .ucm header conform to the following Extended
Backus-Naur Form (EBNF)-like grammar (whitespace is allowed between all tokens):

```
row=[[firstentry ','] entry (',' entry)*]
firstentry="initial" | "surrogates"
            (initial state (default for state 0), output is all surrogate pairs)
```

Each state table row description (that follows the `<icu:state>`) begins with an
optional initial or surrogates keyword and is followed by one or more column
entries. For the purpose of codepage state tables, the states=rows in the table
are numbered beginning at 0 for the first line in the .ucm file header. The
numbers are assigned implicitly by the makeconv tool in order of the `<icu:state>`
lines.

A row may be empty (nothing following the `<icu:state>`) - that is equivalent to
"all illegal" or 0-ff.i and is useful for trail byte states for all-illegal byte
sequences.

```
entry=range [':' nextstate] ['.' [action]]
range = number ['-' number]
nextstate = number (0..7f)
action = 'u' | 's' | 'p' | 'i'
         (unassigned, state change only, surrogate pair, illegal)
number = (1- or 2-digit hexadecimal number)
```

Each column entry contains at least one hexadecimal byte value or value range
and is separated by a comma. The column entry specifies how to interpret an
input byte in the row's state.
If neither a next state nor an action is
explicitly specified (only the byte range is given) then the byte value
terminates the byte sequence, results in a valid mapping to a Unicode BMP
character, and resets the state number to 0. The first line with `<icu:state>` is
called state 0.

The next state can be explicitly specified with a separating colon ( : )
followed by the number of the state (=number/index of the row, starting at 0).
This specification is mostly used for intermediate byte values (that is, bytes
that are not the last ones in a sequence). The state machine needs to proceed to
the next state and read another byte. In this case, no other action is
specified.

If the byte value(s) terminate(s) a byte sequence, then the byte sequence
results in the following depending on the action that is announced with a period
( . ) followed by a letter:

| letter | meaning |
|--|---------|
| u | Unassigned. The byte sequence is valid but does not encode a character. |
| none | (no letter) - Valid. If no action letter is specified, then the byte sequence is valid and encodes a Unicode character up to U+ffff |
| p | Surrogate Pair. The byte sequence is valid and the result may map to a UTF-16 encoded surrogate pair |
| i | Illegal. The byte sequence is illegal. This is the default for all byte values in a row that are not otherwise specified with column entries |
| s | State change only. The byte sequence does not encode any character but may change the state number. This may be used with simple, stateful encodings (for example, SI/SO codes), but currently it is not used by ICU. |

If an action is specified without a next state, then the next state number
defaults to 0. In other words, a byte value (range) terminates a sequence if
there is an action specified for it, or when there is neither an action nor a
next state.
In this case, the byte value defaults to "valid, next state is 0"
(equivalent to :0.).

If a byte value is not specified in any column entry of a row, then it is
illegal in that state. If a byte value is specified in more than one column
entry of the same row, then ICU uses the last specification. These rules allow
you to assign common properties for a wide byte value range followed by a few
exceptions. This is easier than having to specify mutually exclusive ranges,
especially if many of them have the same properties.

The optional keyword at the beginning of a state line has the following effect:

| keyword | effect |
|---------|--------|
| initial | The state machine can start reading byte sequences in this state. State 0 is always an initial state. Only initial states can be next states for final byte values. In an initial state, the Unicode mappings for all final bytes are also stored directly in the state table. |
| surrogates | All Unicode mappings for final bytes in non-initial states are stored in a separate table of 16-bit Unicode (UTF-16) code units. Since most legacy codepages map only to Unicode code points up to U+ffff (the Basic Multilingual Plane, BMP), the default allocation per mapping result is one 16-bit unit. Individual byte values can be specified to map to surrogate pairs (= two 16-bit units) with action letter p. The surrogates keyword specifies this for the entire state (row). Surrogate pair mapping entries can still hold single units depending on the actual mapping data, but single-unit mapping entries cannot hold a pair of units. Mapping to single-unit entries is the default because the mapping is faster, uses half as much memory in the code units table, and is sufficient for most legacy codepages. |

When converting to Unicode, the state machine starts in state number 0.
In each
iteration, the state machine reads one input (codepage) byte and either proceeds
to the next state as specified, or treats it as a final byte with the specified
action and an optional non-0 next (initial) state. This means that a state table
needs to have at least as many state rows as the maximum number of bytes per
character, which is the maximum length of any byte sequence.

Exception: For EBCDIC_STATEFUL codepages, double-byte sequences start in state
1, with the SI/SO bytes switching from state 0 to state 1 or from state 1 to
state 0. See the default state table below.

### Extension and delta tables

ICU 2.8 adds an additional "extension" data structure to its conversion tables.
The new data structure supports a number of new features. When any of the
following features are used, then all mappings must use a precision indicator.

#### Converting multiple characters as a unit

Before ICU 2.8, only one Unicode code point could be converted to or from one
complete codepage byte sequence. The new data structure supports the conversion
between multiple Unicode code points and multiple complete codepage byte
sequences. (A "complete codepage byte sequence" is a sequence of bytes which is
valid according to the state table.)

Syntax: Simply write more than one Unicode code point on a mapping line, and/or
more than one complete codepage byte sequence. Plus signs (+) are optional
between code points and between bytes. For example,
ibm-1390_P110-2003.ucm contains

    <U304B><U309A> \xEC\xB5 |0

and test3.ucm contains

    <U101234>+<U50005>+<U60006> \x07+\x00+\x01\x02\x0f+\x09 |0

For more examples see the ICU conversion data and the
`icu/source/test/testdata/test*.ucm` test data files.

ICU 2.8 supports up to 19 UChars on the Unicode side of a mapping and up to 31
bytes on the codepage side.
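The multi-character mapping syntax can be illustrated with a minimal Python sketch. It is not part of ICU or makeconv: the function names (`parse_mapping_line`, `utf16_length`, `within_limits`) are hypothetical, and the sketch assumes that the 19-UChar limit counts UTF-16 code units, so a supplementary code point counts as two.

```python
import re

MAX_UCHARS = 19  # Unicode-side limit from the text, assumed to count UTF-16 code units
MAX_BYTES = 31   # codepage-side limit from the text


def parse_mapping_line(line):
    """Split one CHARMAP line into (code_points, byte_string, precision)."""
    # The optional precision indicator follows the pipe symbol, e.g. "|0".
    body, _, flag = line.partition("|")
    precision = int(flag.strip()) if flag.strip() else None
    # Plus signs are optional separators, so matching the tokens ignores them.
    code_points = [int(h, 16) for h in re.findall(r"<U([0-9A-Fa-f]{1,6})>", body)]
    data = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", body))
    return code_points, data, precision


def utf16_length(code_points):
    """Count the UTF-16 code units needed for a list of code points."""
    return sum(2 if cp > 0xFFFF else 1 for cp in code_points)


def within_limits(code_points, data):
    """Check a mapping against the size limits quoted in the text."""
    return utf16_length(code_points) <= MAX_UCHARS and len(data) <= MAX_BYTES
```

For instance, `parse_mapping_line(r"<U304B><U309A> \xEC\xB5 |0")` yields two code points, the two bytes EC B5, and precision 0. The test3.ucm line shown earlier yields three supplementary code points, which occupy six UTF-16 code units.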

The longest match possible is converted in order to properly handle tables where
the source sides of some mappings are prefixes of the source sides of other
mappings.

As a side effect, if conversion offsets are written and a potential match
crosses buffer boundaries, then some of the initial offsets for the following
output may be unknown (-1) because their input was stored in the converter from
a previous buffer while looking for a longer match.

Conversion tables for SI/SO-stateful (usually EBCDIC_STATEFUL) codepages cannot
include mappings with SI or SO bytes or where there are SBCS characters in a
multi-character byte sequence. In other words, for these tables there must be
exactly one byte in a mapping or else a sequence of one or more DBCS characters.

#### Delta (extension-only) conversion table files

Physically, a binary conversion table (.cnv) file automatically contains both a
traditional "base table" data structure for the 1:1 mappings and a new
"extension table" for the m:n mappings if any are encountered in the .ucm file.
An extension table can also be requested manually by splitting the CHARMAP into
two. The first CHARMAP section will be used for the base table, and the second
only for the extension table. Any m:n mappings in the first CHARMAP will be
moved to the extension table.

In order to save space for very similar conversion tables, it is possible to
create delta .cnv files that contain only an extension table and the name of
another .cnv file with a base table. The base file must be split into two
CHARMAPs such that the base file's base table does not contain any mappings that
contradict any of the delta file's mappings.

The delta (extension-only) file uses only a single CHARMAP section. In addition,
it needs a line in the header that both causes building just a delta file and
specifies the name of the base file.
For example, windows-936-2000.ucm contains

    <icu:base> "ibm-1386_P100-2002"

makeconv ignores all mappings for the delta file that are also in the base
file's base table. If the two conversion tables are sufficiently similar, then
the delta file will contain only a relatively small set of mappings, which
results in a small .cnv file. At runtime, both the delta file and its base file
are loaded, and the base file's base table is used together with the extension
file. The base file works as a standalone file, using its own extension table
for its full set of mappings. The base file must be in the same ICU data package
as the delta file.

The hard part is to split the base file's mappings into base and extension
CHARMAPs such that the base table does not overlap with any delta file, while
all shared mappings should be in the base table. (The base table data structure
is more compact than the extension table data structure.)

ICU provides the ucmkbase tool in the
[ucmtools](https://github.com/unicode-org/icu-data/tree/main/charset/source/ucmtools)
collection to do this.

For example, the following illustrates how to use ucmkbase to make a base .ucm
file for three Shift-JIS conversion table variants. (ibm-943_P15A-2003.ucm
becomes the base.)

```
C:\tmp\icu\ucm>ren ibm-943_P15A-2003.ucm ibm-943_P15A-2003.orig
C:\tmp\icu\ucm>ucmkbase ibm-943_P15A-2003.orig ibm-943_P130-1999.ucm ibm-942_P12A-1999.ucm > ibm-943_P15A-2003.ucm
```

After this, the two delta .ucm files only need to get the following line added
before the start of their CHARMAPs:

```
<icu:base> "ibm-943_P15A-2003"
```

The ICU tools and runtime code handle DBCS-only conversion tables specially,
allowing them to be built into delta files with MBCS or EBCDIC_STATEFUL base
files without using their single-byte mappings, and without ucmkbase moving the
single-byte mappings of the base file into the base file's extension table. See
for example ibm-16684_P110-2003.ucm and ibm-1390_P110-2003.ucm.

#### Other enhancements

ICU 2.8 adds support for the specification of which unassigned Unicode code
points should be mapped to subchar1 rather than the default subchar. See the
discussion of subchar1 above for more details.

The extension table data structure also removes one minor limitation on ICU
conversion tables: Fallback mappings to a single byte 00 are now allowed and
handled properly. ICU versions before 2.8 could only handle roundtrips to/from
00.

### Examples for codepage state tables

The following shows the exact implied state tables for non-MBCS types. A state
table may need to be overridden in order to allow supplementary characters
(U+10000 and up).

US-ASCII
```
0-7f
```

This single-row state table describes US-ASCII. Byte values from 0 to 0x7f are
valid and map to Unicode characters up to U+ffff. Byte values from 0x80 to 0xff
are illegal.

Shift-JIS
```
0-7f, 81-9f:1, a0-df, e0-fc:1
40-7e, 80-fc
```

This two-row state table describes the Shift-JIS structure which encodes some
characters with one byte each and others with two bytes each.
Bytes 0 to 0x7f
and 0xa0 to 0xdf are valid single-byte encodings. Bytes 0x81 to 0x9f and 0xe0 to
0xfc are lead bytes. (That is, they are followed by one of the bytes that is
specified as valid in state 1.) A byte sequence of 0x85 0x61 is valid while a
single byte of 0x80 or 0xff is illegal. Similarly, a byte sequence of 0x85 0x31
is illegal.

EUC-JP
```
0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
a1-fe
a1-e4
a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
a1-fe.u
```

This fairly complicated state table describes EUC-JP. Valid byte sequences are
one, two, or three bytes long. Two-byte sequences have a lead byte of 0x8e and
end in state 2, or have lead bytes 0xa1 to 0xfe and end in state 1. Three-byte
sequences have a lead byte of 0x8f and continue in state 3. Some final byte
value ranges are entirely unassigned, therefore they end in state 4 with an
action letter of u for "unassigned" to save significant memory for the code
units table. Assigned three-byte sequences end in state 1 like most two-byte
sequences.

SBCS default state table:
```
0-ff
```
SBCS by default implies the structure for single-byte, 8-bit codepages.

DBCS default state table:
```
0-3f:3, 40:2, 41-fe:1, ff:3
41-fe
40

```

**Important**:
These are four states: the fourth is the empty line (equivalent to 0-ff.i)!
DBCS codepages, by default, are defined with the EBCDIC double-byte structure.
Valid sequences are pairs of bytes from 0x41 to 0xfe and the one pair 0x40/0x40
for the double-byte space. The structure is defined such that all illegal byte
sequences are always two in length. Therefore, every byte in the initial state
is a lead byte.

EBCDIC_STATEFUL default state table:
```
0-ff, e:1.s, f:0.s
initial, 0-3f:4, e:1.s, f:0.s, 40:3, 41-fe:2, ff:4
0-40:1.i, 41-fe:1., ff:1.i
0-ff:1.i, 40:1.
0-ff:1.i
```

This is the structure of Mixed Single-byte and Double-byte EBCDIC codepages,
which are stateful and use the Shift-In/Shift-Out (SI/SO) bytes 0x0f/0x0e. The
initial state 0 is almost the same as for SBCS except for SI and SO. State 1 is
also an initial state and is the basis for a state-shifted version of the DBCS
structure above. All double-byte sequences return to state 1 and SI switches
back to state 0. SI and SO are also allowed in their own states with no effect.

> :point_right: **Note**: *If a DBCS or EBCDIC_STATEFUL codepage maps supplementary (non-BMP) Unicode
> characters, then a modified state table needs to be specified in the .ucm file.
> The state table needs to use the surrogates designation for a table row or .p
> for some entries.*
>
> *The reuse of a final or intermediate state (shown for EUC-JP) is valid for as
> long as there is no cycle in the state chain. The mappings will be unique
> because of the different path to the shared state (sharing a state saves some
> memory; each state table row occupies 1kB in the .cnv file). This table also
> shows the redefinition of byte value ranges within one state row (State number
> 3) as shorthand. State 3 defines bytes a1-fe to go to state 1, but the following
> entries redefine and override certain bytes to go to state 4.*

An initial state never needs a surrogates designation or .p because Unicode
mapping results in initial states are stored directly in the state table,
providing enough room in each cell. The size of a generated .cnv mapping table
file depends primarily on the number and distribution of the mappings and on the
number of valid, multi-byte sequences that the state table allows. Each state
table row takes up one kilobyte.

For single-byte codepages, the state table cells contain all to-Unicode
mappings.
Code point results for multi-byte sequences are stored in an array
with enough room for all valid byte sequences. For all byte sequences that end
in a surrogates or .p state, two code units are allocated.

If possible, valid state table entries may be changed to .u to reduce the number
of valid, assignable sequences and to make the .cnv file smaller. If additional
states are necessary, then each additional state itself adds 1kB to the file
size, diminishing the file size savings. See the EUC-JP example above.

For codepages with up to two bytes per character, the makeconv tool
automatically compacts the bytes, if possible, by introducing one more trail
byte state. This state replaces valid entries in the original trail state with
unassigned entries and changes each lead byte entry to work with the new state
if there are no mappings with that lead byte.

For codepages with up to three or four bytes per character, compaction must be
done manually. However, if the verbose option is set on the command line, the
makeconv tool will print useful information about unassigned byte sequences.
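The example state tables in this section can be explored with a small Python sketch that expands `<icu:state>` rows into 256-entry rows and walks a byte sequence through them. This is only an illustration of the grammar described earlier, not the makeconv implementation. The names `parse_state_row` and `first_sequence` are hypothetical, the marker `"v"` for "valid" is an assumption of this sketch, and the `initial` and `surrogates` keywords are accepted but ignored because they do not affect which sequences are valid.

```python
def parse_state_row(text):
    """Expand the text after <icu:state> into 256 (next_state, kind) cells.

    kind is None for a non-final byte, "v" for a valid final byte, or one
    of the action letters "u", "s", "p", "i" from the grammar.
    """
    cells = [(0, "i")] * 256  # bytes without a column entry are illegal
    entries = [e.strip() for e in text.split(",") if e.strip()]
    if entries and entries[0] in ("initial", "surrogates"):
        entries.pop(0)  # keyword only affects storage, so skip it here
    for entry in entries:
        spec, dot, action = entry.partition(".")
        byte_range, colon, nxt = spec.partition(":")
        low, _, high = byte_range.partition("-")
        first = int(low, 16)
        last = int(high, 16) if high else first
        next_state = int(nxt, 16) if colon else 0
        if dot:
            kind = action.strip() or "v"  # "." without a letter means valid
        elif colon:
            kind = None  # next state only: an intermediate byte
        else:
            kind = "v"  # bare range: valid final byte, back to state 0
        for value in range(first, last + 1):
            cells[value] = (next_state, kind)  # later entries override earlier ones
    return cells


def first_sequence(table, data, state=0):
    """Return (bytes consumed, kind) for the first byte sequence in data."""
    for index, value in enumerate(data):
        next_state, kind = table[state][value]
        if kind is not None:
            return index + 1, kind
        state = next_state
    return len(data), "truncated"


# The Shift-JIS rows from this section.
SHIFT_JIS = [parse_state_row(row) for row in (
    "0-7f, 81-9f:1, a0-df, e0-fc:1",
    "40-7e, 80-fc",
)]
```

With these rows, `first_sequence(SHIFT_JIS, b"\x85\x61")` reports a valid two-byte sequence, while `b"\x85\x31"` and the single byte `b"\x80"` are reported as illegal, matching the description of the Shift-JIS table.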