---
layout: default
title: Conversion Data
nav_order: 2
parent: Conversion
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Conversion Data
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Introduction

### Algorithmic vs. Data-based

In a comprehensive conversion library, there are three kinds of codepage
converter implementations: converters that use algorithms, converters that use
mapping data, and converters that use both.

1.  Most codepages have a simple and straightforward structure but have an
    arbitrary relationship between input and output character codes. Mapping
    tables are necessary to define the conversion. If the codepage characters
    use more than one byte each, then the mapping table must also define the
    structure of the codepage.

2.  Algorithmic converters work by transforming the input stream with built-in
    algorithms and possibly small, hard-coded tables. The conversion can be
    complex, but the actual mapping of a character code is done numerically if
    the converter is purely algorithmic.

3.  In some cases, a converter needs to be algorithmic for its basic operations
    but also relies on mapping data.

ICU provides converter implementations for all three groups of codepages. Since
ICU always converts to or from Unicode, the purely algorithmic converters are
the ones for Unicode encodings (such as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE,
UTF-32LE, SCSU, BOCU-1 and UTF-7). Since Unicode is based on US-ASCII and
ISO-8859-1 ("ISO Latin-1"), these two encodings also use algorithmic converters
for performance reasons.

Most other codepages use simple byte sequences but are not encodings of Unicode.
They are converted with generic code using mapping data tables. ICU also
supports a few encodings, like ISO-2022 and its variants, that employ an
algorithmic structure to switch between a set of codepages. The converters for
these encodings are algorithmic but use mapping tables for the embedded
codepages.

### Stateful vs. Stateless

Character encodings are either stateful or stateless:

1.  Stateless encodings define a byte sequence for each character. Complete
    character byte sequences can be used in any order, and the same complete
    character byte sequence always encodes the same character. It is
    preferable to always encode one character using the same byte sequence.

2.  Stateful encodings define byte sequences that change the state of the text
    stream. Depending on the current state, the same byte sequence may encode a
    different character and the same character may be encoded with different
    byte sequences.

This distinction between stateless and stateful encodings is important, because
it determines which ICU converter implementation, if any, can be used. The
following are some more important considerations related to stateless versus
stateful encodings:

1.  A runtime converter object is always stateful, even for "stateless"
    encodings. It is stateful because an input buffer may end with a
    partial byte sequence that is to be continued in the next input buffer in
    the following conversion call. The information about this is stored in the
    converter object. Similarly, if the input is Unicode text, then an input
    buffer may end with the first of a pair of surrogates. The converter object
    also stores overflow bytes or code units if the result of a character
    mapping did not fit entirely into the output buffer.

2.  Stateless encodings are treated as stateful in the converter implementation
    in order to interpret "complete byte sequences". They are "stateful" because
    many encodings can have the same byte value used in different positions of
    byte sequences for different characters; a specific byte value may be a lead
    byte or a trail byte. For instance, the lead and trail byte values overlap
    in codepages like Shift-JIS. If a program does not start reading at a
    character boundary, it may instead interpret the byte sequences from two or
    more separate characters as one character. Often, character boundaries can
    be detected reliably only by reading the non-Unicode text linearly from the
    beginning. This can be a problem for non-Unicode text processing, where text
    insertion, deletion, and searching are common. The UTF-8/16/32 encodings do
    not have this problem because the single, lead, or trail units have disjoint
    values and character boundaries can be easily found.

3.  Some stateful encodings only switch between two states: one with one byte
    per character and one with two bytes per character. This type of encoding is
    very common in mainframe systems based on Extended Binary Coded Decimal
    Interchange Code (EBCDIC) and is actually handled in ICU with almost the
    same code and type of mapping tables as stateless codepages.

4.  The classifications of algorithmic vs. data-based converters and of
    stateless vs. stateful encodings are independent of each other: UTF-8,
    UTF-16, and UTF-32 encodings are algorithmic but stateless; UTF-7 and SCSU
    encodings are algorithmic and stateful; Windows-1252 and Shift-JIS encodings
    are data-based and stateless; ISO-2022-JP encoding is algorithmic,
    data-based, and stateful.

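The character boundary problem described in item 2 can be made concrete with a
few lines of Python. This is a minimal sketch and not ICU code: the helper names
`is_lead` and `split_characters` are made up for illustration, and the lead-byte
ranges are copied from the Shift-JIS state table in the "Examples for codepage
state tables" section.

```python
def is_lead(byte: int) -> bool:
    """Lead-byte ranges 81-9f and e0-fc, as in the Shift-JIS state table."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_characters(data: bytes, start: int = 0) -> list:
    """Split bytes into character byte sequences, scanning linearly from start."""
    chars = []
    i = start
    while i < len(data):
        # A lead byte consumes the following byte as its trail byte.
        length = 2 if is_lead(data[i]) and i + 1 < len(data) else 1
        chars.append(data[i:i + length])
        i += length
    return chars


# Two double-byte characters. The trail byte 0x8A also lies in a lead-byte range.
text = bytes([0x83, 0x8A, 0x83, 0x41])

from_beginning = split_characters(text)      # the two original characters
from_middle = split_characters(text, 1)      # bytes of two characters fused
print(from_beginning, from_middle)
```

Starting at offset 1 treats the trail byte 0x8A as a lead byte, so the scan
combines bytes from two different characters and then reads 0x41 as a
single-byte character.
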
### Scope of this chapter

The following sections in this chapter discuss the mapping data tables that are
used in ICU. For related material, please see:

1.  [ICU character set collection](https://icu.unicode.org/charts/charset)

2.  [Unicode Technical Report 22](http://www.unicode.org/reports/tr22/)

3.  "Cross Mapping Tables" in [Unicode Online
    Data](http://www.unicode.org/onlinedat/online.html)

## ICU Mapping Table Data Files

### Overview

As stated above, most ICU converters rely on character mapping tables. ICU 1.8
has a single data structure for all character mapping tables, which is used by
a generic Multi-Byte Character Set (MBCS) converter implementation. The
implementation is flexible enough to handle stateless encodings with the
following parameters:

1.  Support for variable-length, byte-based encodings with 1 to 4 bytes per
    character.

2.  Support for all Unicode characters (code points 0..0x10ffff). Since ICU 1.8
    uses the UTF-16 encoding as its Unicode encoding form, surrogate pairs are
    completely supported.

3.  Efficient distinction between unassigned (unmappable) and illegal byte
    sequences.

4.  It is not possible to convert from Unicode to byte sequences with leading
    zero bytes.

5.  Simple stateful encodings that use only Shift-In and Shift-Out (SI/SO)
    codes and one single-byte and one double-byte state are also handled.

> :point_right: **Note**: *In the context of conversion tables, "unassigned" code
> points or codepage byte sequences are valid but do not have a **mapping**. This
> is different from "unassigned" code points in a character set like Unicode or
> Shift-JIS which are codes that do not have assigned **characters**.*

Prior to version 1.8, ICU used more specific, more limited, converter
implementations for Single Byte Character Set (SBCS), Double Byte Character Set
(DBCS), and the stateful Extended Binary Coded Decimal Interchange Code (EBCDIC)
codepages.

Mapping table data is provided in text files. ICU comes with several
dozen .ucm files (UniCode Mapping, in icu/source/data/mappings/) that are
translated at build time by its makeconv tool (source code in
icu/source/tools/makeconv). The makeconv tool writes one binary, memory-mappable
.cnv file per .ucm file. The resulting .cnv files are included by default in the
common data file for use at runtime.

The format of the .ucm files is similar to the format of the UPMAP files as
provided by IBM® in the codepage repository and as used in the uconvdef tool on
AIX. UPMAP is a text file that specifies the mapping of a codepage character to
and from Unicode.

The format of the .cnv files is ICU-specific. The .cnv file format may change
between ICU versions even for the same .ucm files. The .ucm file format may be
extended to include more features.

The following sections concentrate on the .ucm file format. The .cnv file format
is described in the source file `icu/source/common/ucnvmbcs.c` and is updated
together with the MBCS converter implementation.

These conversion tables can have more than one name. ICU allows multiple names
("aliases") for the same encoding. It matches a requested encoding name against
a list of names in `icu/source/data/mappings/convrtrs.txt` and when it finds a
match, ICU opens a converter with the name in the leftmost position in the
matching line. The name matching is not case-sensitive and ICU ignores spaces,
dashes, and underscores. At build time, the gencnval tool, located in the
`icu/source/tools/gencnval` directory, generates a binary form of the
convrtrs.txt file, the cnvalias.icu file ("Converter Aliases data file"), for
use at runtime.

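The name matching described above can be sketched in Python. This is only an
illustration of the rules stated in this section (case-insensitive; spaces,
dashes, and underscores ignored; leftmost name wins), not the ICU
implementation, which may apply further rules. The function names and the two
alias rows are hypothetical and are not copied from convrtrs.txt.

```python
def normalize(name: str) -> str:
    """Fold case and drop spaces, dashes, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


def build_index(alias_rows: list) -> dict:
    """Map every normalized name to the leftmost name of its row."""
    index = {}
    for names in alias_rows:
        converter_name = names[0]
        for name in names:
            index.setdefault(normalize(name), converter_name)
    return index


def resolve(index: dict, requested: str):
    """Return the converter name for a requested encoding name, or None."""
    return index.get(normalize(requested))


# Hypothetical rows: leftmost entry is the converter name, the rest are aliases.
ALIAS_ROWS = [
    ["ibm-943_P15A-2003", "Shift_JIS", "MS_Kanji"],
    ["UTF-8", "unicode-1-1-utf-8"],
]
INDEX = build_index(ALIAS_ROWS)

print(resolve(INDEX, "shift jis"))
print(resolve(INDEX, "utf_8"))
```
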
### .ucm File Format

.ucm files are line-oriented text files. Empty lines and comments starting with
'`#`' are ignored.

A .ucm file contains two sections:

1.  a header with general specifications of the codepage

2.  a mapping table section between the "CHARMAP" and "END CHARMAP" lines.

For example:

```
<code_set_name>               "IBM-943"
<char_name_mask>              "AXXXX"
<mb_cur_min>                  1
<mb_cur_max>                  2
<uconv_class>                 "MBCS"
<subchar>                     \xFC\xFC
<subchar1>                    \x7F
<icu:state>                   0-7f, 81-9f:1, a0-df, e0-fc:1
<icu:state>                   40-7e, 80-fc
#
CHARMAP
#
#
#ISO 10646      IBM-943
#_________      _________
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
...
<UFFE4> \xFA\x55 |1
<UFFE5> \x81\x8F |0
<UFFFD> \xFC\xFC |2
END CHARMAP
```

The header fields are:

1.  code_set_name - The name of the codepage. The makeconv tool generates the
    .cnv file name from the .ucm filename but uses this header field for the
    converter name that it writes into the .cnv file for ucnv_getName. The
    makeconv tool prints a warning message if this header field does not match
    the file name. The file name is not case-sensitive.

2.  char_name_mask - This is ignored by the makeconv tool. "AXXXX" specifies
    that the POSIX-style character "name" consists of one letter (Alpha)
    followed by 4 hexadecimal digits. Since ICU only uses Unicode character
    "names" (that is, code points) the format is fixed (see below).

3.  mb_cur_min - The minimum number of bytes per character.

4.  mb_cur_max - The maximum number of bytes per character.

5.  uconv_class - This can be one of "SBCS", "DBCS", "MBCS", or
    "EBCDIC_STATEFUL".
    The most general converter class/type/category is MBCS, which requires that
    the codepage structure is specified with `<icu:state>` lines in the header.
    The other types of converters are subsets of MBCS. The makeconv tool uses
    predefined state tables for these other converters when their structure is
    not explicitly specified. The following describes how the converter types
    are interpreted:

    a.  MBCS: Generic ICU converter type, requires a state table

    b.  SBCS: Single-byte, 8-bit codepages

    c.  DBCS: Double-byte EBCDIC codepages

    d.  EBCDIC_STATEFUL: Mixed Single-Byte or Double-Byte EBCDIC codepages (stateful, using SI/SO)

The exact implied state tables for non-MBCS types are shown in the "Examples
for codepage state tables" section. A state table may need to be overridden in
order to allow supplementary characters (U+10000 and up).

6.  subchar - The substitution character byte sequence for this codepage. This
    sequence must be a valid byte sequence according to the codepage structure.

7.  subchar1 - This is the single byte substitution character when subchar is
    defined. Some IBM converter libraries use different substitution characters
    for "narrow" and "wide" characters (single-byte and double-byte). ICU uses
    only one substitution character per codepage because it is common industry
    practice.

8.  icu:state - See the "State Table Syntax in .ucm Files" section for a
    detailed description of how to specify a codepage structure.

9.  icu:charsetFamily - This specifies if the codepage is ASCII or EBCDIC based.

The subchar and subchar1 fields have been known to cause some confusion. The
following conditions outline when each is used:

1.  Conversion from Unicode to a codepage occurs and an unassigned code point is
    found

    a.  If a subchar1 byte is defined and a subchar1 mapping is defined for the
        code point (with a |2 precision indicator), output the subchar1

    b.  Otherwise output the regular subchar

2.  Conversion from a codepage to Unicode occurs and an unassigned code point is
    found

    a.  If the input sequence is of length 1 and a subchar1 byte is specified
        for the codepage, output U+001A

    b.  Otherwise output U+FFFD

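The two sets of conditions above can be written down as a small decision
function per direction. The following Python sketch only restates the rules in
this section; the function and parameter names are hypothetical, and the byte
values in the usage lines are the `<subchar>` and `<subchar1>` values from the
IBM-943 example header.

```python
from typing import Optional


def substitution_from_unicode(subchar: bytes,
                              subchar1: Optional[bytes],
                              has_subchar1_mapping: bool) -> bytes:
    """Bytes written for an unassigned code point (Unicode to codepage)."""
    # Rule 1a: subchar1 is used only if it is defined for the codepage AND the
    # code point has a |2 (subchar1) mapping.
    if subchar1 is not None and has_subchar1_mapping:
        return subchar1
    # Rule 1b: otherwise the regular subchar.
    return subchar


def substitution_to_unicode(input_sequence: bytes,
                            subchar1: Optional[bytes]) -> int:
    """Code point written for an unassigned byte sequence (codepage to Unicode)."""
    # Rule 2a: single-byte input and a subchar1 defined for the codepage.
    if len(input_sequence) == 1 and subchar1 is not None:
        return 0x001A
    # Rule 2b: otherwise U+FFFD.
    return 0xFFFD


SUBCHAR = b"\xFC\xFC"   # <subchar> in the IBM-943 example
SUBCHAR1 = b"\x7F"      # <subchar1> in the IBM-943 example

print(substitution_from_unicode(SUBCHAR, SUBCHAR1, True).hex())
print(hex(substitution_to_unicode(b"\x80", SUBCHAR1)))
```
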
In the CHARMAP section of a .ucm file, each line contains a Unicode code point
(like <U(*1-6 hexadecimal digits for the code point*)> ), a codepage character
byte sequence (each byte like `\xhh` (2 hexadecimal digits) ), and an optional
"precision" or "fallback" indicator.

The precision indicator either must be present in all mappings or in none of
them. The indicator is a pipe symbol `|` followed by a 0, 1, 2, 3, or 4 that has
the following meaning:

*   `|0` - A "normal", roundtrip mapping from a Unicode code point and back.
*   `|1` - A "fallback" mapping only from Unicode to the codepage, but not back.
*   `|2` - A subchar1 mapping. The code point is unmappable, and if a substitution
    is performed, then the subchar1 should be used rather than the subchar.
    Otherwise, such mappings are ignored.
*   `|3` - A "reverse fallback" mapping only from the codepage to Unicode, but not
    back to the codepage.
*   `|4` - A "good one-way" mapping only from Unicode to the codepage, but not
    back.

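As an illustration of the line format just described, the following Python
sketch parses a simple CHARMAP line into its three parts. It is not the
makeconv parser: the regular expression and the name `parse_mapping` are
assumptions made for this example, and the sketch handles only lines with a
single code point and a single byte sequence.

```python
import re

# <U + 1-6 hex digits + >, one or more \xhh bytes, optional |0..|4 indicator.
MAPPING_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)(?:\s+\|([0-4]))?\s*$"
)


def parse_mapping(line: str):
    """Return (code_point, byte_sequence, precision), or None for other lines."""
    match = MAPPING_LINE.match(line.strip())
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    # Turn the literal text \xFA\x55 into the two bytes FA 55.
    byte_sequence = bytes.fromhex(match.group(2).replace("\\x", ""))
    precision = int(match.group(3)) if match.group(3) is not None else None
    return code_point, byte_sequence, precision


# Lines taken from the IBM-943 example above.
for sample in (r"<U0000> \x00 |0", r"<UFFE4> \xFA\x55 |1", "#ISO 10646      IBM-943"):
    print(parse_mapping(sample))
```
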
Fallback mappings from Unicode typically do not map codes for the same
character, but for "similar" ones. This mapping is sometimes done if a character
exists in Unicode but not in the codepage. To replace it, ICU maps the Unicode
code point to the codepage code of a similar-looking character for
human-readable output. This mapping feature is not useful for text data
transmission, especially in markup languages where a Unicode code point can be
escaped with its code point value. The ICU application programming interface
(API) `ucnv_setFallback()` controls this fallback behavior.

"Reverse fallbacks" are technically similar, but the same Unicode character can
be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.

A subset of the fallback mappings from Unicode is always used at runtime: those
that map private-use Unicode code points. Fallbacks from private-use code points
are often introduced as replacements for previous roundtrip mappings for the
same pair of codes. These replacements are used when a Unicode version assigns a
new character that was previously mapped to that private-use code point. The
mapping table is then changed to map the same codepage byte sequence to the new
Unicode code point (as a new roundtrip) and the mapping from the old private-use
code point to the same codepage code is preserved as a fallback.

A "good one-way" mapping is like a fallback, but ICU always uses "good one-way"
mappings at runtime, regardless of the fallback API flag.

The idea is that fallbacks normally lose information, such as mapping from a
compatibility variant of a letter to the ASCII version; however, fallbacks from
PUA and reverse fallbacks are assumed to be for "the same character", just an
older code for it.

Something similar happens with from-Unicode Variation Selector sequences. It is
possible to round-trip (`|0`) either the unadorned character or the sequence with
a variation selector, and add a "good one-way" mapping (`|4`) from the other
version. That "good one-way" mapping does not lose much information, and it is
used even if the "use fallback" API flag is false. Alternatively, both mappings
could be fallbacks (`|1`) that should be controlled by the "use fallback"
attribute.

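The rules for when each kind of mapping is used at runtime can be summarized as
two predicates. This Python sketch restates the text above and is not ICU code.
The function names are hypothetical, and the private-use ranges are the ones
defined by the Unicode Standard; whether ICU tests exactly these ranges is an
assumption here.

```python
def is_private_use(code_point: int) -> bool:
    """Unicode private-use ranges: BMP PUA and planes 15 and 16."""
    return (0xE000 <= code_point <= 0xF8FF
            or 0xF0000 <= code_point <= 0xFFFFD
            or 0x100000 <= code_point <= 0x10FFFD)


def used_from_unicode(precision: int, code_point: int, use_fallback: bool) -> bool:
    """Is a mapping with this precision indicator used from Unicode?"""
    if precision in (0, 4):          # roundtrip and "good one-way": always
        return True
    if precision == 1:               # fallback: flag, or private-use source
        return use_fallback or is_private_use(code_point)
    return False                     # |2 marks subchar1, |3 is to-Unicode only


def used_to_unicode(precision: int) -> bool:
    """Is a mapping with this precision indicator used to Unicode?"""
    return precision in (0, 3)       # roundtrip and reverse fallback: always


# <UFFE4> \xFA\x55 |1 from the IBM-943 example is a fallback from Unicode.
print(used_from_unicode(1, 0xFFE4, use_fallback=False))
print(used_from_unicode(1, 0xFFE4, use_fallback=True))
```
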
### State table syntax in .ucm files

The conversion to Unicode uses a state machine to achieve the above capabilities
with reasonable data file sizes. The state machine information itself is loaded
with the conversion data and defines the structure of the codepage, including
which byte sequences are valid, unassigned, and illegal. This data cannot (or
not easily) be computed from the pure mapping data. Instead, the .ucm files for
MBCS encodings have additional entries that are specific to the ICU makeconv
tool. The state tables for SBCS, DBCS, and EBCDIC_STATEFUL are implied, but they
can be overridden (see the examples below). These state tables are specified
with `<icu:state>` lines in the header section of the .ucm file. Each line
defines one state of the state machine. The state machine uses a table of
as many rows as there are states (= as many as there are `<icu:state>` lines).
Each row has 256 entries; one for each possible byte value.

The state table lines in the .ucm header conform to the following Extended
Backus-Naur Form (EBNF)-like grammar (whitespace is allowed between all tokens):

```
row=[[firstentry ','] entry (',' entry)*]
firstentry="initial" | "surrogates"
           (initial state (default for state 0), output is all surrogate pairs)
```

Each state table row description (that follows the `<icu:state>`) begins with an
optional initial or surrogates keyword and is followed by one or more column
entries. For the purpose of codepage state tables, the states=rows in the table
are numbered beginning at 0 for the first line in the .ucm file header. The
numbers are assigned implicitly by the makeconv tool in order of the `<icu:state>`
lines.

A row may be empty (nothing following the `<icu:state>`) - that is equivalent to
"all illegal" or 0-ff.i and is useful for trail byte states for all-illegal byte
sequences.

```
entry=range [':' nextstate] ['.' [action]]
range     = number ['-' number]
nextstate = number (0..7f)
action    = 'u' | 's' | 'p' | 'i'
                (unassigned, state change only, surrogate pair, illegal)
number    = (1- or 2-digit hexadecimal number)
```

Each column entry contains at least one hexadecimal byte value or value range
and is separated by a comma. The column entry specifies how to interpret an
input byte in the row's state. If neither a next state nor an action is
explicitly specified (only the byte range is given) then the byte value
terminates the byte sequence, results in a valid mapping to a Unicode BMP
character, and resets the state number to 0. The first line with `<icu:state>` is
called state 0.

The next state can be explicitly specified with a separating colon ( : )
followed by the number of the state (=number/index of the row, starting at 0).
This specification is mostly used for intermediate byte values (that is, bytes
that are not the last ones in a sequence). The state machine needs to proceed to
the next state and read another byte. In this case, no other action is
specified.

If the byte value(s) terminate(s) a byte sequence, then the byte sequence
results in the following depending on the action that is announced with a period
( . ) followed by a letter:

| letter | meaning |
|--|---------|
| u | Unassigned. The byte sequence is valid but does not encode a character. |
| none | (no letter) - Valid. If no action letter is specified, then the byte sequence is valid and encodes a Unicode character up to U+ffff |
| p | Surrogate Pair. The byte sequence is valid and the result may map to a UTF-16 encoded surrogate pair |
| i | Illegal. The byte sequence is illegal. This is the default for all byte values in a row that are not otherwise specified with column entries|
| s | State change only. The byte sequence does not encode any character but may change the state number. This may be used with simple, stateful encodings (for example, SI/SO codes), but currently it is not used by ICU.|

If an action is specified without a next state, then the next state number
defaults to 0. In other words, a byte value (range) terminates a sequence if
there is an action specified for it, or when there is neither an action nor a
next state. In the latter case, the byte value defaults to "valid, next state
is 0" (equivalent to :0.).

If a byte value is not specified in any column entry of a row, then it is
illegal in that state. If a byte value is specified in more than one column
entry of the same row, then ICU uses the last entry. These specifications allow
you to assign common properties for a wide byte value range followed by a few
exceptions. This is easier than having to specify mutually exclusive ranges,
especially if many of them have the same properties.

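To make the grammar and the defaulting rules concrete, the following Python
sketch expands one state row into its 256 entries. It is an illustration only
and not the makeconv implementation: the name `parse_state_row` and the cell
representation `(next_state, action)` are assumptions. An action of `None`
marks an intermediate byte, `""` marks a valid final byte, and a letter is one
of the action letters from the table above.

```python
def parse_state_row(spec: str):
    """Expand the text after <icu:state> into (keyword, 256 cells)."""
    # Bytes that no entry mentions are illegal, with next state 0.
    row = [(0, "i")] * 256
    entries = [part.strip() for part in spec.split(",") if part.strip()]
    keyword = None
    if entries and entries[0] in ("initial", "surrogates"):
        keyword = entries.pop(0)
    for entry in entries:
        byte_range, dot, action = entry.partition(".")
        byte_range, colon, state_text = byte_range.partition(":")
        next_state = int(state_text, 16) if colon else 0
        if dot:
            cell_action = action.strip()   # final byte: "" (valid) or u/s/p/i
        elif colon:
            cell_action = None             # intermediate byte: read another byte
        else:
            cell_action = ""               # range only: valid, next state 0
        first, dash, last = byte_range.partition("-")
        low = int(first, 16)
        high = int(last, 16) if dash else low
        for byte in range(low, high + 1):
            row[byte] = (next_state, cell_action)  # a later entry overrides
    return keyword, row


# A wide range followed by exceptions that override part of it.
_, row = parse_state_row("a1-fe:1, a1:4, a3-af:4")
print(row[0xA1], row[0xA2], row[0xA3], row[0x20])
```

The last line shows the "wide range plus exceptions" shorthand: bytes a1 and
a3-af end up with next state 4, a2 keeps next state 1, and 0x20 stays illegal.
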
The optional keyword at the beginning of a state line has the following effect:

| keyword | effect |
|---------|--------|
| initial | The state machine can start reading byte sequences in this state. State 0 is always an initial state. Only initial states can be next states for final byte values. In an initial state, the Unicode mappings for all final bytes are also stored directly in the state table.|
| surrogates | All Unicode mappings for final bytes in non-initial states are stored in a separate table of 16-bit Unicode (UTF-16) code units. Since most legacy codepages map only to Unicode code points up to U+ffff (the Basic Multilingual Plane, BMP), the default allocation per mapping result is one 16-bit unit. Individual byte values can be specified to map to surrogate pairs (= two 16-bit units) with action letter p. The surrogates keyword specifies the values for the entire state (row). Surrogate pair mapping entries can still hold single units depending on the actual mapping data, but single-unit mapping entries cannot hold a pair of units. Mapping to single-unit entries is the default because the mapping is faster, uses half as much memory in the code units table, and is sufficient for most legacy codepages.|

When converting to Unicode, the state machine starts in state number 0. In each
iteration, the state machine reads one input (codepage) byte and either proceeds
to the next state as specified, or treats it as a final byte with the specified
action and an optional non-0 next (initial) state. This means that a state table
needs to have at least as many state rows as the maximum number of bytes per
character, which is the maximum length of any byte sequence.

Exception: For EBCDIC_STATEFUL codepages, double-byte sequences start in state
1, with the SI/SO bytes switching from state 0 to state 1 or from state 1 to
state 0. See the default state table below.

### Extension and delta tables

ICU 2.8 adds an "extension" data structure to its conversion tables.
The new data structure supports a number of new features. When any of the
following features are used, then all mappings must use a precision indicator.

#### Converting multiple characters as a unit

Before ICU 2.8, only one Unicode code point could be converted to or from one
complete codepage byte sequence. The new data structure supports the conversion
between multiple Unicode code points and multiple complete codepage byte
sequences. (A "complete codepage byte sequence" is a sequence of bytes which is
valid according to the state table.)

Syntax: Simply write more than one Unicode code point on a mapping line, and/or
more than one complete codepage byte sequence. Plus signs (+) are optional
between code points and between bytes. For example,
ibm-1390_P110-2003.ucm contains

    <U304B><U309A> \xEC\xB5 |0

and test3.ucm contains

    <U101234>+<U50005>+<U60006> \x07+\x00+\x01\x02\x0f+\x09 |0

For more examples see the ICU conversion data and the
`icu/source/test/testdata/test*.ucm` test data files.

ICU 2.8 supports up to 19 UChars on the Unicode side of a mapping and up to 31
bytes on the codepage side.

The longest match possible is converted in order to properly handle tables where
the source sides of some mappings are prefixes of the source sides of other
mappings.

As a side effect, if conversion offsets are written and a potential match
crosses buffer boundaries, then some of the initial offsets for the following
output may be unknown (-1) because their input was stored in the converter from
a previous buffer while looking for a longer match.

Conversion tables for SI/SO-stateful (usually EBCDIC_STATEFUL) codepages cannot
include mappings with SI or SO bytes or where there are SBCS characters in a
multi-character byte sequence. In other words, for these tables there must be
exactly one byte in a mapping or else a sequence of one or more DBCS characters.

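The longest-match rule described in this section can be sketched as a greedy
lookup. The following Python code is an illustration under simplifying
assumptions, not the ICU algorithm: the table is a plain dictionary, the input
is a complete string rather than a sequence of buffers, and
`convert_from_unicode` is a hypothetical name. The mapping for U+304B U+309A is
the one quoted from ibm-1390_P110-2003.ucm; the bytes for U+304B alone are made
up for the example.

```python
def convert_from_unicode(text: str, mappings: dict) -> bytes:
    """Convert text, always taking the longest mapping that matches."""
    longest = max(len(source) for source in mappings)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter prefixes.
        for length in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in mappings:
                out += mappings[chunk]
                i += length
                break
        else:
            raise ValueError(f"no mapping for U+{ord(text[i]):04X}")
    return bytes(out)


MAPPINGS = {
    "\u304B\u309A": b"\xEC\xB5",   # <U304B><U309A> \xEC\xB5 |0 (from the text)
    "\u304B": b"\x44\x8A",         # hypothetical bytes for U+304B alone
}

# The source side of the second mapping is a prefix of the first one.
print(convert_from_unicode("\u304B\u309A", MAPPINGS).hex())
print(convert_from_unicode("\u304B\u304B\u309A", MAPPINGS).hex())
```

Without the longest-match rule, U+304B would always be converted on its own and
the two-code-point mapping could never be reached.
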
#### Delta (extension-only) conversion table files

Physically, a binary conversion table (.cnv) file automatically contains both a
traditional "base table" data structure for the 1:1 mappings and a new
"extension table" for the m:n mappings if any are encountered in the .ucm file.
An extension table can also be requested manually by splitting the CHARMAP into
two. The first CHARMAP section will be used for the base table, and the second
only for the extension table. M:n mappings in the first CHARMAP will be moved to
the extension table.

In order to save space for very similar conversion tables, it is possible to
create delta .cnv files that contain only an extension table and the name of
another .cnv file with a base table. The base file must be split into two
CHARMAPs such that the base file's base table does not contain any mappings that
contradict any of the delta file's mappings.

The delta (extension-only) file uses only a single CHARMAP section. In addition,
it needs a line in the header that both causes building just a delta file and
specifies the name of the base file. For example, windows-936-2000.ucm contains

    <icu:base> "ibm-1386_P100-2002"

makeconv ignores all mappings for the delta file that are also in the base
file's base table. If the two conversion tables are sufficiently similar, then
the delta file will contain only a relatively small set of mappings, which
results in a small .cnv file. At runtime, both the delta file and its base file
are loaded, and the base file's base table is used together with the extension
file. The base file works as a standalone file, using its own extension table
for its full set of mappings. The base file must be in the same ICU data package
as the delta file.

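The relationship between a base table and a delta file can be pictured with
plain dictionaries. This Python sketch is a simplified model, not the .cnv data
layout or the makeconv logic. All names and the sample mappings are
hypothetical, and it assumes that "does not contradict" means every base-table
mapping also appears unchanged in the variant.

```python
def make_delta(variant: dict, base_table: dict) -> dict:
    """Drop every variant mapping that the base table already provides."""
    return {cp: seq for cp, seq in variant.items() if base_table.get(cp) != seq}


def contradictions(variant: dict, base_table: dict) -> list:
    """Base-table mappings that the variant lacks or maps differently."""
    return [cp for cp, seq in base_table.items() if variant.get(cp) != seq]


def lookup(code_point: int, base_table: dict, extension: dict):
    """Runtime view: the base table together with the delta's extension table."""
    if code_point in base_table:
        return base_table[code_point]
    return extension.get(code_point)


# Made-up sample data: the variant shares two mappings with the base table.
BASE_TABLE = {0x41: b"\x41", 0x3042: b"\x82\xA0"}
VARIANT = {0x41: b"\x41", 0x3042: b"\x82\xA0", 0xA5: b"\x5C"}

assert contradictions(VARIANT, BASE_TABLE) == []
DELTA = make_delta(VARIANT, BASE_TABLE)
print(DELTA)                                  # only the non-shared mapping
print(lookup(0x3042, BASE_TABLE, DELTA), lookup(0xA5, BASE_TABLE, DELTA))
```

The more mappings the two tables share, the smaller the delta becomes, which is
the space saving described above.
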
The hard part is to split the base file's mappings into base and extension
CHARMAPs such that the base table does not overlap with any delta file, while
all shared mappings should be in the base table. (The base table data structure
is more compact than the extension table data structure.)

ICU provides the ucmkbase tool in the
[ucmtools](https://github.com/unicode-org/icu-data/tree/main/charset/source/ucmtools)
collection to do this.

For example, the following illustrates how to use ucmkbase to make a base .ucm
file for three Shift-JIS conversion table variants. (ibm-943_P15A-2003.ucm
becomes the base.)

```
C:\tmp\icu\ucm>ren ibm-943_P15A-2003.ucm ibm-943_P15A-2003.orig
C:\tmp\icu\ucm>ucmkbase ibm-943_P15A-2003.orig ibm-943_P130-1999.ucm ibm-942_P12A-1999.ucm > ibm-943_P15A-2003.ucm
```

After this, the two delta .ucm files only need to get the following line added
before the start of their CHARMAPs:

```
<icu:base> "ibm-943_P15A-2003"
```

The ICU tools and runtime code handle DBCS-only conversion tables specially,
allowing them to be built into delta files with MBCS or EBCDIC_STATEFUL base
files without using their single-byte mappings, and without ucmkbase moving the
single-byte mappings of the base file into the base file's extension table. See
for example ibm-16684_P110-2003.ucm and ibm-1390_P110-2003.ucm.

#### Other enhancements

ICU 2.8 adds support for the specification of which unassigned Unicode code
points should be mapped to subchar1 rather than the default subchar. See the
discussion of subchar1 above for more details.

The extension table data structure also removes one minor limitation on ICU
conversion tables: Fallback mappings to a single byte 00 are now allowed and
handled properly. ICU versions before 2.8 could only handle roundtrips to/from
00.

### Examples for codepage state tables

The following examples show explicit state tables for some common encodings,
followed by the exact implied state tables for non-MBCS types. An implied state
table may need to be overridden in order to allow supplementary characters
(U+10000 and up).

US-ASCII
```
0-7f
```

This single-row state table describes US-ASCII. Byte values from 0 to 0x7f are
valid and map to Unicode characters up to U+ffff. Byte values from 0x80 to 0xff
are illegal.

Shift-JIS
```
0-7f, 81-9f:1, a0-df, e0-fc:1
40-7e, 80-fc
```

This two-row state table describes the Shift-JIS structure which encodes some
characters with one byte each and others with two bytes each. Bytes 0 to 0x7f
and 0xa0 to 0xdf are valid single-byte encodings. Bytes 0x81 to 0x9f and 0xe0 to
0xfc are lead bytes. (That is, they must be followed by one of the bytes that is
specified as valid in state 1.) A byte sequence of 0x85 0x61 is valid while a
single byte of 0x80 or 0xff is illegal. Similarly, a byte sequence of 0x85 0x31
is illegal.

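The Shift-JIS claims above can be checked by running the two rows through a
small state machine. This Python sketch is a simplified model and not ICU code:
`parse_row` and `first_sequence` are hypothetical helpers, the parser ignores
the optional initial and surrogates keywords (these rows have none), and the
machine reports only the status and length of the first byte sequence.

```python
def parse_row(spec: str) -> list:
    """Expand one state row into 256 (next_state, action) cells."""
    row = [(0, "i")] * 256                    # unspecified bytes are illegal
    for entry in (part.strip() for part in spec.split(",")):
        if not entry:
            continue
        byte_range, dot, action = entry.partition(".")
        byte_range, colon, state = byte_range.partition(":")
        if dot:
            kind = action.strip()             # final: "" (valid) or u/s/p/i
        elif colon:
            kind = None                       # intermediate byte
        else:
            kind = ""                         # valid, next state 0
        low, dash, high = byte_range.partition("-")
        first, last = int(low, 16), int(high if dash else low, 16)
        for byte in range(first, last + 1):
            row[byte] = (int(state, 16) if colon else 0, kind)
    return row


def first_sequence(table: list, data: bytes):
    """Return (status, length) for the first byte sequence in data."""
    state = 0
    for length, byte in enumerate(data, start=1):
        next_state, action = table[state][byte]
        if action is None:
            state = next_state                # lead byte: read another byte
        elif action == "i":
            return "illegal", length
        elif action == "u":
            return "unassigned", length
        else:
            return "valid", length
    return "truncated", len(data)


SHIFT_JIS = [parse_row("0-7f, 81-9f:1, a0-df, e0-fc:1"), parse_row("40-7e, 80-fc")]

for sample in (b"\x85\x61", b"\x80", b"\xff", b"\x85\x31"):
    print(sample.hex(), first_sequence(SHIFT_JIS, sample))
```
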
EUC-JP
```
0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
a1-fe
a1-e4
a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
a1-fe.u
```

This fairly complicated state table describes EUC-JP. Valid byte sequences are
one, two, or three bytes long. Two-byte sequences have a lead byte of 0x8e and
end in state 2, or have lead bytes 0xa1 to 0xfe and end in state 1. Three-byte
sequences have a lead byte of 0x8f and continue in state 3. Some final byte
value ranges are entirely unassigned, therefore they end in state 4 with an
action letter of u for "unassigned" to save significant memory for the code
units table. Assigned three-byte sequences end in state 1 like most two-byte
sequences.

SBCS default state table:
```
0-ff
```
SBCS by default implies the structure for single-byte, 8-bit codepages.

DBCS default state table:
```
0-3f:3, 40:2, 41-fe:1, ff:3
41-fe
40

```

**Important**:
There are four states; the fourth is an empty line (equivalent to 0-ff.i)!
DBCS codepages, by default, are defined with the EBCDIC double-byte structure.
Valid sequences are pairs of bytes from 0x41 to 0xfe and the one pair 0x40/0x40
for the double-byte space. The structure is defined such that all illegal byte
sequences are always two bytes long. Therefore, every byte in the initial state
is a lead byte.

EBCDIC_STATEFUL default state table:
```
0-ff, e:1.s, f:0.s
initial, 0-3f:4, e:1.s, f:0.s, 40:3, 41-fe:2, ff:4
0-40:1.i, 41-fe:1., ff:1.i
0-ff:1.i, 40:1.
0-ff:1.i
```

This is the structure of Mixed Single-byte and Double-byte EBCDIC codepages,
which are stateful and use the Shift-In/Shift-Out (SI/SO) bytes 0x0f/0x0e. The
initial state 0 is almost the same as for SBCS except for SI and SO. State 1 is
also an initial state and is the basis for a state-shifted version of the DBCS
structure above. All double-byte sequences return to state 1 and SI switches
back to state 0. SI and SO are also allowed in their own states with no effect.

> :point_right: **Note**: *If a DBCS or EBCDIC_STATEFUL codepage maps
> supplementary (non-BMP) Unicode characters, then a modified state table needs
> to be specified in the .ucm file. The state table needs to use the surrogates
> designation for a table row or .p for some entries.*
>
> *The reuse of a final or intermediate state (shown for EUC-JP) is valid for as
> long as there is no cycle in the state chain. The mappings will be unique
> because of the different path to the shared state (sharing a state saves some
> memory; each state table row occupies 1kB in the .cnv file). The EUC-JP table
> also shows the redefinition of byte value ranges within one state row (state
> number 3) as shorthand. State 3 defines bytes a1-fe to go to state 1, but the
> following entries redefine and override certain bytes to go to state 4.*

An initial state never needs a surrogates designation or .p because Unicode
mapping results in initial states are stored directly in the state table,
providing enough room in each cell. The size of a generated .cnv mapping table
file depends primarily on the number and distribution of the mappings and on the
number of valid, multi-byte sequences that the state table allows. Each state
table row takes up one kilobyte.

For single-byte codepages, the state table cells contain all to-Unicode
mappings. Code point results for multi-byte sequences are stored in an array
with enough room for all valid byte sequences. For all byte sequences that end
in a surrogates or .p state, two code units are allocated.

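As a rough illustration of these size rules, the following Python sketch
estimates the to-Unicode data for the two-row Shift-JIS state table shown
earlier. It is a back-of-the-envelope calculation based only on the figures in
this section (1 kB per state row, one 16-bit unit per valid multi-byte
sequence), not a prediction of the actual .cnv file size, which also contains
from-Unicode data and may be compacted by makeconv. All names are hypothetical.

```python
def count_bytes(ranges: list) -> int:
    """Number of byte values covered by a list of inclusive (low, high) ranges."""
    return sum(high - low + 1 for low, high in ranges)


# Ranges from the Shift-JIS state table: lead bytes in state 0, trail bytes in state 1.
LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]
TRAIL_RANGES = [(0x40, 0x7E), (0x80, 0xFC)]

STATE_ROWS = 2
BYTES_PER_STATE_ROW = 1024      # "each state table row takes up one kilobyte"
BYTES_PER_CODE_UNIT = 2         # one 16-bit unit per result (no surrogates, no .p)

state_table_bytes = STATE_ROWS * BYTES_PER_STATE_ROW
valid_two_byte_sequences = count_bytes(LEAD_RANGES) * count_bytes(TRAIL_RANGES)
code_units_bytes = valid_two_byte_sequences * BYTES_PER_CODE_UNIT

print(state_table_bytes, valid_two_byte_sequences, code_units_bytes)
```

Every lead byte that the state table allows reserves room for all of its valid
trail bytes, whether or not the mapping data assigns them.
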
If possible, valid state table entries may be changed to .u to reduce the number
of valid, assignable sequences and to make the .cnv file smaller. If additional
states are necessary, then each additional state itself adds 1kB to the file
size, diminishing the file size savings. See the EUC-JP example above.

For codepages with up to two bytes per character, the makeconv tool
automatically compacts the table, if possible, by introducing one more trail
byte state. This state replaces valid entries in the original trail state with
unassigned entries, and each lead byte entry is changed to use the new state
if there are no mappings with that lead byte.

For codepages with up to three or four bytes per character, compaction must be
done manually. However, if the verbose option is set on the command line, the
makeconv tool will print useful information about unassigned byte sequences.
