---
layout: default
title: ICU Data
nav_order: 13
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU makes use of a wide variety of data tables to provide many of its services.
Examples include converter mapping tables, collation rules, transliteration
rules, break iterator rules and dictionaries, and other locale data. Additional
data can be provided by users, either as customizations of ICU's data or as new
data altogether.

This section describes how ICU data is stored and located at run time. It also
describes how ICU data can be customized to suit the needs of a particular
application.

For simple use of ICU's predefined data, this section on data management can
safely be skipped. The data is built into a library that is loaded along with
the rest of ICU. No specific action or setup is required of either the
application program or the execution environment.

Update: as of ICU 64, the standard data library is over 20 MB in size. We have
introduced a new tool, the [ICU Data Build Tool](./icu_data/buildtool.md),
to give you more control over what goes into your ICU locale data file.

> :point_right: **Note**: ICU for C by default comes with pre-built data.
> The source data files are included as an "icu\*data.zip" file starting in ICU4C 49.
> Previously, they were not included unless ICU was downloaded from the [source repository](http://site.icu-project.org/repository).

## ICU and CLDR Data

Most of ICU's data is sourced from [CLDR](http://cldr.unicode.org), the [Common
Locale Data Repository](http://cldr.unicode.org) project. Do not file bugs
against ICU to request data changes in CLDR; see the CLDR project's own page
instead. Because most ICU data files are autogenerated from CLDR, manually
editing them is not usually recommended.

Data which is NOT sourced from CLDR includes:

*   [Conversion Data](conversion/data.md)
*   Break Iterator Dictionary Data (Thai, CJK, etc.)
*   Break Iterator Rule Data (as of this writing, it is manually kept in sync
    with the CLDR datasets)

For information on building ICU data from CLDR, see the
[cldr-icu-readme](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/cldr-icu-readme.txt).

## ICU Data Directory

The ICU data directory is the default location for all ICU data. Any requests
for data items that do not include an explicit directory path will be resolved
to files located in the ICU data directory.

The ICU data directory is determined as follows:

1.  If the application has called the function `u_setDataDirectory()`, use the
    directory specified there, otherwise:

2.  If the environment variable `ICU_DATA` is set, use that, otherwise:

3.  If the C preprocessor variable `ICU_DATA_DIR` was set at the time ICU was
    built, use its compiled-in value.

4.  Otherwise, the ICU data directory is an empty string. This is the default
    behavior for ICU using a shared library for its data and provides the
    highest data loading performance.

> :point_right: **Note**: `u_setDataDirectory()` is not thread-safe. Call it
> *before* calling ICU APIs from multiple threads. If you use both
> `u_setDataDirectory()` and `u_init()`, then use `u_setDataDirectory()` first.
>
> *Earlier versions of ICU supported two additional schemes: setting a data
> directory relative to the location of the ICU shared libraries, and on Windows,
> taking a location from the registry. These have both been removed to make the
> behavior more predictable and easier to understand.*

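The lookup order listed above can be modeled in a few lines of C. This is a
minimal sketch, not ICU's actual implementation: `resolve_icu_data_dir()` and
its three parameters are hypothetical names standing in for the value passed to
`u_setDataDirectory()`, the value of the `ICU_DATA` environment variable, and
the compiled-in `ICU_DATA_DIR`. It assumes that `NULL` means "not set".

```c
#include <stddef.h>

/*
 * Hypothetical model of the documented precedence (not an ICU function).
 * Each argument is NULL when the corresponding source was not provided.
 */
const char *resolve_icu_data_dir(const char *from_set_data_directory,
                                 const char *from_icu_data_env,
                                 const char *from_compiled_in_default)
{
    if (from_set_data_directory != NULL) {
        return from_set_data_directory;   /* 1. u_setDataDirectory() was called */
    }
    if (from_icu_data_env != NULL) {
        return from_icu_data_env;         /* 2. ICU_DATA environment variable */
    }
    if (from_compiled_in_default != NULL) {
        return from_compiled_in_default;  /* 3. ICU_DATA_DIR set at build time */
    }
    return "";                            /* 4. empty string: no directory */
}
```

With all three arguments `NULL`, the sketch returns the empty string, which
corresponds to the case where ICU relies only on its linked-in data library.
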
The ICU data directory does not need to be set in order to reference the
standard built-in ICU data. Applications that just use standard ICU capabilities
(converters, locales, collation, etc.) but do not build and reference their own
data do not need to specify an ICU data directory.

### Multiple-Item ICU Data Directory Values

The ICU data directory string can contain multiple directories as well as .dat
path/filenames. They must be separated by the path separator that is used on the
platform, for example a semicolon (`;`) on Windows. Data files will be searched in
all directories and .dat package files in the order of the directory string. For
details, see the example below.

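As an illustration of how such a multiple-item string breaks down into ordered
entries, here is a small C sketch. It is not ICU code:
`split_data_dir_string()` and `MAX_ENTRY_LEN` are hypothetical, the separator
is passed in explicitly rather than taken from the platform, and empty or
over-long entries are simply skipped (an assumption of this sketch).

```c
#include <stddef.h>
#include <string.h>

#define MAX_ENTRY_LEN 256  /* arbitrary limit chosen for this sketch */

/*
 * Splits `list` at each `sep` and copies the entries, in order, into
 * `entries`. Returns the number of entries stored (at most `max_entries`).
 */
size_t split_data_dir_string(const char *list, char sep,
                             char entries[][MAX_ENTRY_LEN], size_t max_entries)
{
    size_t count = 0;
    const char *start = list;

    while (*start != '\0' && count < max_entries) {
        const char *end = strchr(start, sep);
        size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);

        if (len > 0 && len < MAX_ENTRY_LEN) {
            memcpy(entries[count], start, len);
            entries[count][len] = '\0';
            count++;
        }
        if (end == NULL) {
            break;          /* last entry reached */
        }
        start = end + 1;    /* continue after the separator */
    }
    return count;
}
```

For the Windows-style string `c:\icu\data;c:\app\appdata.dat` and separator
`;`, the sketch yields the directory first and the .dat package file second,
which is the order in which they would be searched.
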
## Default ICU Data

The default ICU data consists of the data needed for the converters, collators,
locales, etc. that are provided with ICU. Default data must be present in order
for ICU to function.

The default data is most commonly built into a shared library that is installed
with the other ICU libraries. Nothing is required of the application for this
mechanism to work. ICU provides additional options for loading the default data
if more flexibility is required.

Here are the steps followed by ICU to locate its default data. This procedure
happens only once per process, at the time an ICU data item is first requested.

1.  If the application has called the function `udata_setCommonData()`, use the
    data that was provided. The application specifies the address in memory of
    an image of an ICU common format data file (either in shared-library format
    or .dat package file format).

2.  Examine the contents of the default ICU data shared library. If it contains
    data, use that data. If the data library is empty (a stub library), proceed
    to the next step. (A data shared library must always be present in order for
    ICU to successfully link and load. A stub data library is used when the
    actual ICU common data is to be provided from another source.)

3.  Dynamically load (memory map, typically) a common format (.dat) file
    containing the default ICU data. Loading is described in the section
    [How Data Loading Works](icudata#how-data-loading-works). The path to
    the data is of the form "icudt\<version\>\<flag\>", where \<version\> is
    the two-digit ICU version number, and \<flag\> is a letter indicating the
    internal format of the file (see the
    [Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms)
    section).

Once the default ICU data has been located, loading of individual data items
proceeds as described in the section
[How Data Loading Works](icudata#how-data-loading-works).

## Building and Linking against ICU data

When using ICU's configure or runConfigureICU tool to build, several different
methods of packaging are available.

> :point_right: **Note**: in all cases, you **must** link all ICU tools and
applications against a "data library": either a data library containing the ICU
data, or against the "stubdata" library located in icu/source/stubdata. For
example, even if ICU is built in "files" mode, you must still link against the
"stubdata" library or an undefined symbol error occurs.

*   `--with-data-packaging=library`
    This mode builds a shared library (DLL or .so). This is the simplest mode to
    use, and is the default.
    To use: link your application against the common and data libraries.
    This is the only directly supported behavior on Windows builds.
*   `--with-data-packaging=static`
    This option builds ICU data as a single (large) static library. This mode is
    more complex to use. If you encounter errors, you may need to build ICU
    multiple times.
*   `--with-data-packaging=files`
    With this option, ICU outputs separate individual files (.res, .cnv, etc)
    which will be loaded at runtime. Read the rest of this document, especially
    the sections that discuss the ICU directory path.
*   `--with-data-packaging=archive`
    With this option, ICU outputs a single "icudt__.dat" file containing ICU
    data. Read the rest of this document, especially the sections that discuss
    the ICU directory path.

## Time Zone Data

Because time zone data requires frequent updates in response to countries
changing their transition dates for daylight saving time, ICU provides
additional options for loading time zone data from separate files, thus avoiding
the need to update a combined ICU data package. Further information is found
under [Time Zones](datetime/timezone/index.md).

## Application Data

ICU-based applications can ship and use their own data for localized strings,
custom conversion tables, etc. Each data item file must have a package name as a
prefix, and this package name must match the basename of a .dat package file, if
one is used. The package name must be used in ICU APIs, for example in
`udata_setAppData()` (instead of `udata_setCommonData()` which is only used for
ICU's own data) and in the pathname argument of `ures_open()`.

The only real difference from ICU's own data is that application data cannot be
loaded simply by specifying a NULL value for the path arguments of ICU APIs, and
application data will not be used by APIs that do not have path/package name
arguments at all.

The most important APIs that allow application data to be used are for Resource
Bundles, which are most often used for localized strings and other data. There
are also functions like `ucnv_openPackage()` that allow application data to be
specified, and the `udata.h` API can be used to load any data with minimum
requirements on the binary format, and without ICU interpreting the contents of
the data.

The `pkgdata` tool, which is used to package the data into various formats (e.g.
shared library), has an option (`--without-assembly` or `-w`) to not use
assembly code when building and packaging the application specific data into a
shared library. Building the data with assembly code, which is enabled by
default, is faster and more efficient; however, there are some platform
specific issues that may arise. The `--without-assembly` option may be
necessary on certain platforms (e.g. Linux) which have trouble properly loading
application data when it was built with assembly code and is packaged as a
shared library.

## Alignment

ICU data is designed to be 16-aligned, with natural alignment of values inside
the data structure, so that the data is usable as is when memory-mapped.
("16-aligned" means that the start address is a multiple of 16 bytes.)

Memory-mapping (as well as memory allocation) provides at least 16-alignment on
modern platforms. Some CPUs require n-alignment of types of size n bytes (and
crash on unaligned reads), other CPUs usually operate faster on data that is
aligned properly.

Some of the ICU code explicitly checks for proper alignment.

The `icupkg` tool places data items into the .dat file at start offsets that are
multiples of 16 bytes.

When using `genccode` to directly write a .o/.obj file, or to write assembler
code, it specifies at least 16-alignment. When using `genccode` to write C code,
it prepends the data with a double value which should yield at least 8-alignment
on most platforms (usually `sizeof(double)=8`).

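To make "16-aligned" concrete, the following C sketch shows one way to test an
address and to round an offset up to the next multiple of 16, as a packaging
tool would have to do when placing items. The function names are hypothetical
and this is not taken from the ICU sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if the address is a multiple of 16 bytes. */
int is_16_aligned(const void *address)
{
    return ((uintptr_t)address & 0xFu) == 0;
}

/* Rounds an offset up to the next multiple of 16 (unchanged if already aligned). */
size_t align_up_16(size_t offset)
{
    return (offset + 15u) & ~(size_t)15u;
}
```

For example, offsets 1 through 16 all round up to 16, and offset 17 rounds up
to 32. An application that hands its own memory image to ICU could use a check
like `is_16_aligned()` before doing so.
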
## Flexibility vs. Installation vs. Performance

There are choices that affect ICU data loading and depend on application
requirements.

### Data in Shared Libraries/DLLs vs. .dat package files

Building ICU data into shared libraries (`--with-data-packaging=library`) is the
most convenient packaging method because shared libraries (DLLs) are easily
found if they are in the same directory as the application libraries, or if they
are on the system library path. The application installer usually just copies
the ICU shared libraries into the same place. On the other hand, shared
libraries are not portable.

Packaging data into .dat files (`--with-data-packaging=archive`) allows them to
be shared across platforms, but they must either be loaded by the application
and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be
in a known location that is included in the ICU data directory string. This
requires the application installer, or the application itself at runtime, to
locate the ICU and/or application data by setting the ICU data directory (see
the [ICU Data Directory](icudata#icu-data-directory) section above) or by
loading the data and providing it to one of the `udata_setXYZData()` functions.

Unlike shared libraries, .dat package files can be taken apart into separate
data item files with the `decmn` ICU tool. This allows post-installation
modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be
used to reassemble the .dat package file.

For more information about .dat package files see the section [Sharing ICU Data
Between Platforms](icudata#sharing-icu-data-between-platforms) below.

### Data Overriding vs. Loading Performance

If the ICU data directory string is empty, then ICU will not attempt to load
data from the file system. It is then only possible to load data from the
linked-in shared library or via `udata_setCommonData()` and
`udata_setAppData()`. This is inflexible but provides the highest performance.

If the ICU data directory string is not empty, then data items are searched in
all directories and matching .dat files mentioned before checking in
already-loaded package files. This allows overriding of packaged data items with
single files after installation but costs some time for filesystem accesses.
This is usually done only once per data item; see
[User Data Caching](icudata#user-data-caching) below.

### Single Data Files vs. Packages

Single data files (`--with-data-packaging=files`) are easy to replace and can
override items inside data packages. However, it is usually desirable to reduce
the number of files during installation, and package files use less disk space
than many small files.

## How Data Loading Works

ICU data items are referenced by three names - a path, a name and a type. The
following are some examples:

path                         |   name   | type
-----------------------------|----------|-------
 c:\\some\\path\\dataLibName | test     | dat
 no path                     | cnvalias | icu
 no path                     | cp1252   | cnv
 no path                     | en       | res
 no path                     | uprops   | icu

Items with 'no path' specified are loaded from the default ICU data.

Application data items include a path, and will be loaded from user data files,
not from the ICU default data. For application data, the path argument need not
contain an actual directory, but must contain the application data's package
name after the last directory separator character (or by itself if there is no
directory). If the path argument contains a directory, then it is logically
prepended to the ICU data directory string and searched first for data. The path
argument can contain at most one directory. (Path separators like semicolon (;)
are not handled here.)

> :point_right: **Note**: The ICU data directory string itself may
contain multiple directories and path/filenames to .dat package files. See the
[ICU Data Directory](icudata#icu-data-directory) section.

It is recommended not to include the directory in the path argument, but instead
to make sure, by setting the application data or the ICU data directory string,
that the data can be located. This simplifies program maintenance and improves
robustness.

See the API descriptions for the functions `udata_open()` and
`udata_openChoice()` for additional information on opening ICU data from within
an application.

Data items can exist as individual files, or a number of them can be packaged
together in a single file for greater efficiency in loading and convenience of
distribution. The combined files are called Common Files.

Based on the supplied path and name, ICU searches several possible locations
when opening data. To make things more concrete in the following descriptions,
the following values of path, name and type are used:

```
path = "c:\\some\\path\\dataLibName"
name = "test"
type = "res"
```

In this case, "dataLibName" is the "package name" part of the path argument, and
"c:\\some\\path\\" is the directory part of it.

The search sequence for the data for "test.res" is as follows (the first
successful loading attempt wins):

1.  Try to load the file "dataLibName_test.res" from c:\\some\\path\\.

2.  Try to load the file "dataLibName_test.res" from each of the directories in
    the ICU data directory string.

3.  Try to locate the data package for the package name "dataLibName".

    1.  Try to locate the data package in the internal cache.

    2.  Try to load the package file "dataLibName.dat" from c:\\some\\path\\.

    3.  Try to load the package file "dataLibName.dat" from each of the
        directories in the ICU data directory string.

The first steps, loading the data item from an individual file, are omitted if
no directory is specified in either the path argument or the ICU data directory
string.

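The file names used in the search sequence above follow directly from the
path, name and type values. The C sketch below derives them; it is an
illustration only, with hypothetical helper names, and the `_` joiner and
`.dat` suffix are taken from the example rather than from the ICU sources.

```c
#include <stdio.h>
#include <string.h>

/* Returns the package name: the part of `path` after the last separator,
 * or all of `path` if it contains no directory. */
const char *package_name_of(const char *path, char dir_sep)
{
    const char *last = strrchr(path, dir_sep);
    return (last != NULL) ? last + 1 : path;
}

/* Builds the individual item file name, e.g. "dataLibName_test.res". */
int item_file_name(char *buf, size_t size,
                   const char *package, const char *name, const char *type)
{
    return snprintf(buf, size, "%s_%s.%s", package, name, type);
}

/* Builds the package file name, e.g. "dataLibName.dat". */
int package_file_name(char *buf, size_t size, const char *package)
{
    return snprintf(buf, size, "%s.dat", package);
}
```

A caller would try the item file name in each candidate directory first, and
only then look for the package file, matching the order described above.
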
Package files are loaded at most once and then cached. They are identified only
by their package name. Whenever a data item is requested from a package and that
package has been loaded before, then the cached package is used immediately
instead of searching through the filesystem.

> :point_right: **Note**: ICU versions before 2.2 always searched data packages
before looking for individual files, which made it impossible to override
packaged data items. See the ICU 2.2 download page and the readme for more
information about the changes.

## User Data Caching

Once loaded, data package files are cached, and stay loaded for the duration of
the process. Any requests for data items from an already loaded data package
file are routed directly to the cached data. No additional search for loadable
files is made.

The user data cache is keyed by the base file name portion of the requested
path, with any directory portion stripped off and ignored. Using the previous
example, for the path name "c:\\some\\path\\dataLibName", the cache key is
"dataLibName". After this is cached, a subsequent request for "dataLibName", no
matter what directory path is specified, will resolve to the cached data.

Data can be explicitly added to the cache of common format data by means of the
`udata_setAppData()` function. This function takes as input the path (name) and
a pointer to a memory image of a .dat file. The data is added to the cache,
causing any subsequent requests for data items from that file name to be routed
to the cache.

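The keying rule can be illustrated with a toy cache in C. This is a sketch of
the idea only, not ICU's data structure: the type and function names are
hypothetical, the capacity is arbitrary, and keys are borrowed pointers that
the caller must keep alive.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CACHED_PACKAGES 8  /* arbitrary capacity for this sketch */

struct package_cache {
    const char *keys[MAX_CACHED_PACKAGES];    /* package names (not owned) */
    const void *images[MAX_CACHED_PACKAGES];  /* memory images of .dat files */
    size_t count;
};

/* The cache key is the base name: everything after the last separator. */
static const char *cache_key(const char *path, char dir_sep)
{
    const char *last = strrchr(path, dir_sep);
    return (last != NULL) ? last + 1 : path;
}

/* Returns the cached image for the package named by `path`, or NULL. */
const void *cache_find(const struct package_cache *cache,
                       const char *path, char dir_sep)
{
    const char *key = cache_key(path, dir_sep);
    for (size_t i = 0; i < cache->count; i++) {
        if (strcmp(cache->keys[i], key) == 0) {
            return cache->images[i];
        }
    }
    return NULL;
}

/* Adds an image; returns 1 on success, 0 if the key exists or the cache is full.
 * Entries are never replaced or removed. */
int cache_add(struct package_cache *cache,
              const char *path, char dir_sep, const void *image)
{
    if (cache_find(cache, path, dir_sep) != NULL ||
        cache->count == MAX_CACHED_PACKAGES) {
        return 0;
    }
    cache->keys[cache->count] = cache_key(path, dir_sep);
    cache->images[cache->count] = image;
    cache->count++;
    return 1;
}
```

After an image is added under `c:\some\path\dataLibName`, a lookup for
`dataLibName` with any other directory, or with no directory at all, returns
the same image, because only the base name takes part in the comparison.
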
Only data package files are cached. Separate data files that contain just a
single data item are not cached; for these, multiple requests to ICU to open the
data will result in multiple requests to the operating system to open the
underlying file.

However, most ICU services (Resource Bundles, conversion, etc.) themselves cache
loaded data, so that data is usually loaded only once until the end of the
process (or until `u_cleanup()`, `ucnv_flushCache()`, or a similar function is
called).

There is no mechanism for removing or updating cached data files.

## Directory Separator Characters

If a directory separator (generally '/' or '\\') is needed in a path parameter,
use the form that is native to the platform. The ICU header `"putil.h"` defines
`U_FILE_SEP_CHAR` appropriately for the platform.

> :point_right: **Note**: On Windows, the directory separator must be '\\' for
any paths passed to ICU APIs. This is different from native Windows APIs, which
generally allow either '/' or '\\'.

## Sharing ICU Data Between Platforms

ICU's default data was about 8 MB in size when this section was first written
(as of ICU 64, it is over 20 MB). Because it is normally built as a shared
library, the file format is specific to each platform (operating system). The
data libraries cannot be shared between platforms even though the actual data
contents are identical.

By distributing the default data in the form of common format .dat files rather
than as shared libraries, a single data file can be shared among multiple
platforms. This is beneficial if a single distribution of the application (a CD,
for example) includes binaries for many platforms, and the size requirements for
replicating the ICU data for each platform are a problem.

ICU common format data files are not completely interchangeable between
platforms. The format depends on these properties of the platform:

1.  Byte Ordering (little endian vs. big endian)

2.  Base character set - ASCII or EBCDIC

This means, for example, that ICU data files are interchangeable between Windows
and Linux on X86 (both are ASCII little endian), or between Macintosh and
Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC
and Solaris on X86 (different byte ordering).

The single letter following the version number in the file name of the default
ICU data file encodes the properties of the file as follows:

```
icudt19l.dat Little Endian, ASCII
icudt19b.dat Big Endian, ASCII
icudt19e.dat Big Endian, EBCDIC
```

(There are no little endian EBCDIC systems. All non-EBCDIC encodings include an
invariant subset of ASCII that is sufficient to enable these files to
interoperate.)

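The naming scheme above can be expressed as a short C sketch. The helper names
are hypothetical and the code is not part of ICU. It assumes the caller knows
whether the target uses EBCDIC, and it detects byte order of the machine it
runs on with a simple probe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Maps platform properties to the format letter used in the file name. */
char icu_data_format_flag(int big_endian, int ebcdic)
{
    if (ebcdic) {
        return 'e';  /* the text lists only a big endian EBCDIC variant */
    }
    return big_endian ? 'b' : 'l';
}

/* Returns 1 on a big endian host, 0 on a little endian host. */
int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 0;
}

/* Writes e.g. "icudt19l.dat" into buf; returns the length snprintf reports. */
int format_icu_data_name(char *buf, size_t size, int version, char flag)
{
    return snprintf(buf, size, "icudt%d%c.dat", version, flag);
}
```

Calling `format_icu_data_name()` with version 19 and the flag for an ASCII
little endian platform produces `icudt19l.dat`, the first name in the list
above.
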
The packaging of the default ICU data as a .dat file rather than as a shared
library is requested by using an option in the configure script at build time.
Nothing is required at run time; ICU finds and uses whatever form of the data is
available.

> :point_right: **Note**: When the ICU data is built in the form of shared
libraries, the library names have platform-specific prefixes and suffixes. On
Unix-style platforms, all the libraries have the "lib" prefix and one of the
usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and
suffixes, the library names are the same as the above .dat files.

## Customizing ICU's Data Library

ICU includes a standard library of data that is about 16 MB in size. Most of
this consists of conversion tables and locale information. The data itself is
normally placed into a single shared library.

Update: as of ICU 64, the standard data library is over 20 MB in size. We have
introduced a new tool, the [ICU Data Build Tool](icu_data/buildtool.md),
to replace the makefiles explained below and give you more control over what
goes into your ICU locale data file.

### Adding Converters to ICU

The first step is to obtain or create a .ucm (source) mapping data file for the
desired converter. A large archive of converter data is maintained by the ICU
team at <https://github.com/unicode-org/icu-data/tree/master/charset/data/ucm>

We will use `solaris-eucJP-2.7.ucm`, available from the repository mentioned
above, as an example.

#### Build the Converter

Converter source files are compiled into binary converter files (.cnv files) by
using the ICU tool `makeconv`. For the example, you can use this command:

```
makeconv -v solaris-eucJP-2.7.ucm
```

Some of the .ucm files from the repository will need additional header
information before they can be built. Use the error messages from the `makeconv`
tool, .ucm files for similar converters, and the ICU user guide documentation of
.ucm files as a guide when making changes. For the `solaris-eucJP-2.7.ucm`
example, we will borrow the missing header fields from
`source/data/mappings/ibm-33722_P12A-2000.ucm`, which is the standard ICU eucJP
converter data.

The ucm file format is described in the
["Conversion Data" chapter](conversion/data.md) of this user guide.

After adjustment, the header of the `solaris-eucJP-2.7.ucm` file contains these
items:

```
<code_set_name>   "solaris-eucJP-2.7"
<subchar>         \x3F
<uconv_class>     "MBCS"

<mb_cur_max>      3
<mb_cur_min>      1

<icu:state>       0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
<icu:state>       a1-fe
<icu:state>       a1-e4
<icu:state>       a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
<icu:state>       a1-fe
```

The binary converter file produced by the `makeconv` tool is
`solaris-eucJP-2.7.cnv`.

#### Installation

Copy the new .cnv file to the desired location for use. Set the environment
variable `ICU_DATA` to the directory containing the data, or, alternatively,
from within an application, tell ICU the location of the new data with the
function `u_setDataDirectory()` before using the new converter.

If ICU is already obtaining data from files rather than a shared library,
install the new file in the same location as the existing ICU data file(s), and
don't change/set the environment variable or data directory.

If you do not want to add a converter to ICU's base data, you can also generate
a conversion table with `makeconv`, use `pkgdata` to generate your own package,
and use `ucnv_openPackage()` to open a converter with that conversion table
from the generated package.

#### Building the new converter into ICU

The need to install a separate file and inform ICU of the data directory can be
avoided by building the new converter into ICU's standard data library. Here is
the procedure for doing so:

1.  Move the .ucm file(s) for the converter(s) to be added
    (`solaris-eucJP-2.7.ucm` for our example) into the directory
    `source/data/mappings/`

2.  Create, or edit, if it already exists, the file
    `source/data/mappings/ucmlocal.mk`. Add this line:

    ```
    UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm
    ```

    Any number of converters can be listed. Extend the list to new lines with a
    backslash at the end of the line. The `ucmlocal.mk` file is described in
    more detail in `source/data/mappings/ucmfiles.mk`. (Even though they use
    very different build systems, `ucmlocal.mk` is used for both the Windows and
    UNIX builds.)

3.  Add the converter name and aliases to `source/data/mappings/convrtrs.txt`.
    This will allow your converter to be shown in the list of available
    converters when you call the `ucnv_getAvailableName()` function. The file
    syntax is described within the file.

4.  Rebuild the ICU data.
    For Windows, from MSVC choose the makedata project from the GUI, then build
    the project.
    For UNIX, `cd icu/source/data; gmake`

When opening an ICU converter (`ucnv_open()`), the converter name cannot be
qualified with a path that indicates the directory or common data file
containing the corresponding converter data. The required data must be present
either in the main ICU data library or as a separate .cnv file located in the
ICU data directory. This is different from opening resources or other types of
ICU data, which do allow a path.

### Adding Locale Data to ICU's Data

If you have data for a locale that is not included in ICU's standard build, then
you can add it to the build in a very similar way as with conversion tables
above. The ICU project provides a large number of additional locales in its
[locale
repository](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/)
on the web. Most of this locale data is derived from the CLDR ([Common Locale
Data Repository](http://www.unicode.org/cldr/)) project.

Dropping the txt file into the correct place in the source tree is sufficient to
add it to your ICU build. You will need to re-configure in order to pick it up.

## Customizing ICU's Data Library for ICU 63 or earlier

The ICU data library can be easily customized, either by adding additional
converters or locales, or by removing some of the standard ones for the purpose
of saving space.

> :point_right: **Note**: ICU for C by default comes with pre-built data.
The source data files are included as an "icu\*data.zip" file starting in
ICU4C 49. Previously, they were not included unless ICU was downloaded from the
[source repository](https://github.com/unicode-org/icu). Alternatively, the
[Data Customizer](http://apps.icu-project.org/datacustom/) may be used to
customize the pre-built data.

ICU can load data from individual data files as well as from its default
library, so building a customized library when adding additional data is not
strictly necessary. Adding to ICU's library can simplify application
installation by eliminating the need to include separate files with an
application distribution, and the need to tell ICU where they are installed.

Reducing the size of ICU's data by eliminating unneeded resources can make
sense on small systems with limited or no disk, but for desktop or server
systems there is no real advantage to trimming. ICU's data is memory mapped
into an application's address space, and only those portions of the data
actually being used are ever paged in, so there are no significant RAM savings.
As for disk space, with the large size of today's hard drives, saving a few MB
is not worth the bother.

By default, ICU builds with a large set of converters and with all available
locales. This means that any extra items added must be provided by the
application developer. There is no extra ICU-supplied data that could be
specified.

621### Details
622
623The converters and resources that ICU builds are in the following configuration
624files. They are only available when building from ICU's source code repository.
625Normally, the standard ICU distribution do not include these files.
626
627File                              | Description
628----------------------------------|--------------
629source/data/locales/resfiles.mk   | The standard set of locale data resource bundles
630source/data/locales/reslocal.mk   | User-provided file with additional resource bundles
631source/data/coll/colfiles.mk      | The standard set of collation data resource bundles
632source/data/coll/collocal.mk      | User-provided file with additional collation resource bundles
633source/data/brkitr/brkfiles.mk    | The standard set of break iterator data resource bundles
634source/data/brkitr/brklocal.mk    | User-provided file with additional break iterator resource bundles
635source/data/translit/trnsfiles.mk | The standard set of transliterator resource files
636source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files
637source/data/mappings/ucmcore.mk   | Core set of conversion tables for MIME/Unix/Windows
638source/data/mappings/ucmfiles.mk  | Additional, large set of conversion tables for a wide range of uses
639source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables
640source/data/mappings/ucmlocal.mk  | User-provided file with additional conversion tables
641source/data/misc/miscfiles.mk     | Miscellaneous data, like timezone information
642
643These files function identically for both Windows and UNIX builds of ICU. ICU
644will automatically update the list of installed locales returned by
645`uloc_getAvailable()` whenever `resfiles.mk` or `reslocal.mk` are updated and
646the ICU data library is rebuilt. These files are only needed while building ICU.
647If any of these files are removed or renamed, the size of the ICU data library
648will be reduced.
649
650The optional files `reslocal.mk` and `ucmlocal.mk` are not included as part of
651a standard ICU distribution. Thus these customization files do not need to be
652merged or updated when updating versions of ICU.
653
Both `reslocal.mk` and `ucmlocal.mk` are makefile includes, so the usual rules
for makefiles apply. A line is continued onto the next line by ending it with a
backslash. Lines beginning with a # are comments. See
`ucmfiles.mk` and `resfiles.mk` for additional information.
658
659### Reducing the Size of ICU's Data: Conversion Tables
660
661The size of the ICU data file in the standard build configuration is about 8 MB.
662The majority of this is used for conversion tables. ICU comes with so many
663conversion tables because many ICU users need to support many encodings from
664many platforms. There are conversion tables for EBCDIC and DOS codepages, for
665ISO 2022 variants, and for small variations of popular encodings.
666
667> :point_right: **Important**: ICU provides full internationalization
668functionality without **any** conversion table data. The common library
669contains code to handle several important encodings algorithmically: US-ASCII,
670ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e.,
671US-ASCII, ISO-8859-1, and all Unicode charsets; see
672source/data/mappings/convrtrs.txt for the current list).
673
674Therefore, the easiest way to reduce the size of ICU's data by a lot (without
675limitation of I18N support) is to reduce the number of conversion tables that
676are built into the data file.
677
The conversion tables are listed for the build process in several makefiles
`source/data/mappings/ucm*.mk`, roughly grouped by how commonly they are used.
If you remove or rename any of these files, then the ICU build will exclude the
conversion tables that are listed in that file. Beginning with ICU 2.0, all of
these makefiles, including the main one, are optional. If you remove all of them,
then ICU will include only very few conversion tables for "fallback" encodings
(see note below).
685
If you remove or rename all `ucm*.mk` files, then ICU's data is reduced to
about 3.6 MB. If you remove all these files except for `ucmcore.mk`, then ICU's
data is reduced to about 4.7 MB, while keeping support for a core set of common
MIME/Unix/Windows encodings.
690
691> :point_right: **Note**: If you remove the conversion table for an encoding
692that could be a default encoding on one of your platforms, then ICU will not be
693able to instantiate a default converter. In this case, ICU 2.0 and up will
694automatically fall back to a "lowest common denominator" and load a converter
695for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be
696good enough for converting strings that contain only "ASCII" characters (see the
697comment about "invariant characters" in `utypes.h`).
698*When ICU is built with a reduced set of conversion tables, then some tests will
699fail that test the behavior of the converters based on known features of some
700encodings. Also, building the testdata will fail if you remove some conversion
701tables that are necessary for that (to test non-ASCII/Unicode resource bundle
702source files, for example). You can ignore these failures. Build with the
703standard set of conversion tables, if you want to run the tests.*
704
705### Reducing the Size of ICU's Data: Locale Data
706
707If you need to reduce the size of ICU's data even further, then you need to
708remove other files or parts of files from the build as well.
709
710There are a number of different subdirectories of 'data' containing locale data
711split out by section. Each subdirectory has its own **.mk** file listing the
712locales which will be built. Subdirectories include **lang** for language names
713and **curr** for currency names.
714
You can remove data for entire locales by removing their files from
`source/data/locales/resfiles.mk` or the appropriate other .mk file. ICU will
then use the data of the parent locale instead; for a language-level locale the
parent is root (`root.txt`). Even if you remove all resource bundles for a given
language and its country/region/variant sublocales,
**do not remove root.txt!** Also, do not remove a parent locale if
child locales exist. For example, do not remove "en" while retaining "en_US".
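
The rule about not removing a parent locale while a child remains can be checked mechanically before editing `resfiles.mk`. The following Python sketch is illustrative only: `parent_locale` and `orphaned_locales` are hypothetical helper names, and the parent is derived purely by truncating the last subtag, which is an assumption that ignores any explicit parent overrides in the real data.

```python
def parent_locale(locale_id):
    """Parent by truncation (assumed rule): "en_US" -> "en", "en" -> "root"."""
    head, sep, _tail = locale_id.rpartition("_")
    return head if sep else "root"


def orphaned_locales(kept):
    """Return the kept locales whose parent is missing from the kept set."""
    kept = set(kept)
    return sorted(
        loc for loc in kept
        if loc != "root" and parent_locale(loc) not in kept
    )


# "en_US" is kept although its parent "en" was removed, so it is reported.
print(orphaned_locales(["root", "de", "de_DE", "en_US"]))
```

An empty result suggests that the reduced locale list is at least self-consistent under this simplified parent rule.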
721
722### Reducing the Size of ICU's Data: Collation Data
723
Collation data (for sorting, searching and alphabetic indexes) is also large,
especially the collation data for East Asian languages because they define
multiple orderings of tens of thousands of Han characters. You can remove the
collation data for those languages by removing references to those locales from
the `source/data/coll/colfiles.mk` file. When you do that, the collation for those
languages will fall back to the root collator; that is, you lose
language-specific behavior.
731
732A much less radical approach is to keep the collation data tables but remove the
733tailoring rule strings from which they were built. Those rule strings are
734rarely used at runtime. For documentation about their use and how to remove
735them see the section "Building on Existing Locales" in the
736[Collation Customization chapter](collation/customization/index.md).
737
### Adding Locale Data to ICU's Data

To add a locale, write a resource bundle file for it with a structure like the
existing locale resource bundles (e.g. `source/data/locales/ja.txt`, `ru_RU.txt`,
`kok_IN.txt`) and add it by writing a file `source/data/locales/reslocal.mk`,
as described under "Details" above. In this file, define the list of additional
resource bundles as
743
```
GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...
```
747
748Starting in ICU 2.2, these added locales are automatically listed by
749`uloc_getAvailable()`.
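
When the list of additional bundles is long, it can be generated rather than typed by hand. The Python sketch below is a minimal illustration, not part of ICU: `render_reslocal` is a hypothetical helper that emits the `GENRB_SOURCE_LOCAL` assignment shown above and wraps long lists using makefile backslash continuations.

```python
def render_reslocal(bundles, per_line=4):
    """Render reslocal.mk content for the given resource bundle file names."""
    if not bundles:
        raise ValueError("at least one resource bundle is required")
    # Group the names so that each makefile line stays reasonably short.
    chunks = [
        " ".join(bundles[i:i + per_line])
        for i in range(0, len(bundles), per_line)
    ]
    # A trailing backslash continues the assignment on the next line.
    body = " \\\n    ".join(chunks)
    return (
        "# Additional resource bundles (generated)\n"
        "GENRB_SOURCE_LOCAL=" + body + "\n"
    )


print(render_reslocal(["myLocale.txt", "other.txt"]), end="")
```

The output would be saved as `source/data/locales/reslocal.mk` before rebuilding the ICU data.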
750
751## ICU Data File Formats
752
ICU uses several kinds of data files with specific source (plain text) and
binary data formats. The following lists provide links to descriptions of those
formats.
756
757Each ICU data object begins with a header before the actual, specific data. The
758header consists of a 16-bit header length value, the two "magic" bytes DA 27 and
759a [UDataInfo](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/structUDataInfo.html#_details)
760structure which specifies the data object's endianness, charset family, format,
761data version, etc.
762
763(This is not the case for the trie structures, which are not stand-alone,
764loadable data objects.)
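
The header described above can be inspected with a few lines of code. The following Python sketch is illustrative only: `parse_icu_header` is a hypothetical helper, and the byte offsets assume the `UDataInfo` field order documented in `udata.h` (size, reserved word, `isBigEndian`, `charsetFamily`, `sizeofUChar`, reserved byte, then three 4-byte arrays for data format, format version, and data version). Verify the layout against the linked API documentation before relying on it.

```python
import struct


def parse_icu_header(data):
    """Decode the fixed part of an ICU data header from raw bytes."""
    if len(data) < 24:
        raise ValueError("too short for an ICU data header")
    # Bytes 2 and 3 hold the "magic" bytes DA 27.
    if data[2] != 0xDA or data[3] != 0x27:
        raise ValueError("magic bytes DA 27 not found")
    # isBigEndian is a single byte, so it can be read before the byte
    # order of the 16-bit fields is known.
    big_endian = bool(data[8])
    order = ">" if big_endian else "<"
    (header_size,) = struct.unpack_from(order + "H", data, 0)
    info_size, _reserved = struct.unpack_from(order + "HH", data, 4)
    return {
        "header_size": header_size,
        "info_size": info_size,
        "big_endian": big_endian,
        "charset_family": data[9],
        "sizeof_uchar": data[10],
        "data_format": data[12:16].decode("latin-1"),
        "format_version": tuple(data[16:20]),
        "data_version": tuple(data[20:24]),
    }
```

To try it on a real file, pass the first bytes of the file, for example `parse_icu_header(open(path, "rb").read(64))`, where `path` names any stand-alone ICU data file.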
765
766### Public Data Files
767
768#### ICU.dat package files
769*   Source format: (list of files provided as input to the icupkg tool, or
770         on the gencmn tool command line)
771*    Binary format: .dat:
772     [source/tools/toolutil/pkg_gencmn.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/pkg_gencmn.cpp)
773*    Generator tool:
774         [icupkg](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/icupkg)
775         or
776         [gencmn](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gencmn)
777
778#### Resource bundles
779*   Source format: .txt:
780    [icuhtml/design/bnf_rb.txt](https://github.com/unicode-org/icu-docs/blob/master/design/bnf_rb.txt)
781*   Binary format: .res:
782    [source/common/uresdata.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/uresdata.h)
783*   Generator tool:
784    [genrb](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genrb)
785
786#### Unicode conversion mapping tables
787*   Source format: .ucm: [Conversion Data chapter](conversion/data.md)
788*   Binary format: .cnv:
789    [source/common/ucnvmbcs.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnvmbcs.h)
790*   Generator tool:
791    [makeconv](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/makeconv)
792
793#### Conversion (charset) aliases
794*   Source format:
795    [source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt):
796    contains format description. The command "uconv -l --canon" will also
797    generate the alias table from the currently used copy of ICU.
798*   Binary format: cnvalias.icu:
799    [source/common/ucnv_io.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnv_io.cpp)
800*   Generator tool:
801    [gencnval](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gencnval)
802
803#### Unicode Character Data (Properties; for Java only: hardcoded in C common library)
804*   Source format:
805    [source/data/unidata/ppucd.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt):
806    [Preparsed UCD](http://site.icu-project.org/design/props/ppucd)
807*   Binary format: uprops.icu:
808    [tools/unicode/c/genprops/corepropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/corepropsbuilder.cpp)
809*   Generator tool:
810    [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)
811
812#### Unicode Character Data (Case mappings; for Java only: hardcoded in C common library)
813*   Source format:
814    [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata):
815    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
816*   Binary format: ucase.icu:
817    [tools/unicode/c/genprops/casepropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/casepropsbuilder.cpp)
818*   Generator tool:
819    [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)
820
821#### Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library)
822*   Source format:
823    [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata):
824    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
825*   Binary format: ubidi.icu:
826    [tools/unicode/c/genprops/bidipropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/bidipropsbuilder.cpp)
827*   Generator tool:
828    [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)
829
830#### Unicode Character Data (Normalization since ICU 4.4) & custom normalization data
*   Source format:
    [source/data/unidata/norm2/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2):
    Files derived from the [Unicode Character
    Database](http://www.unicode.org/onlinedat/online.html), or custom data.
835*   Binary format: .nrm:
836    [source/common/normalizer2impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/normalizer2impl.h)
837*   Generator tool:
838    [gennorm2](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gennorm2)
839
840#### Unicode Character Data (Character names)
841*   Source format:
842    [source/data/unidata/UnicodeData.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/UnicodeData.txt):
843    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
844*   Binary format: unames.icu:
845    [tools/unicode/c/genprops/namespropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/namespropsbuilder.cpp)
846*   Generator tool:
847    [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)
848
849#### Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8)
850*   Source format: [UCD Property*Aliases.txt](http://www.unicode.org/Public/UNIDATA/):
851                   [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
852*   Binary format: pnames.icu:
853    [source/common/propname.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/propname.h)
854*   Generator tool:
855    [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)
856
857#### Unicode Character Data (Text layout properties since ICU 64)
858*   Source format:
859    [source/data/unidata/ppucd.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt):
860    [Preparsed UCD](http://site.icu-project.org/design/props/ppucd)
861*   Binary format: ulayout.icu:
862    [tools/unicode/c/genprops/layoutpropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/layoutpropsbuilder.cpp)
863*   Generator tool:
864    [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)
865
866#### Collation data (root collation & tailorings; ICU 53 & later)
867*   Source format: Original data from allkeys_CLDR.txt in
868    [CLDR Root Collation Data Files](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files)
869    processed into
870    [source/data/unidata/FractionalUCA.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/FractionalUCA.txt)
871    by
872    [tool at unicode.org maintained by Mark Davis](https://sites.google.com/site/unicodetools/#TOC-UCA)
873    (call the Main class with option writeFractionalUCA); source tailorings (text rules) in
874    [source/data/coll/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/coll)
875    resource bundles: [Collation Customization chapter](collation/customization/index.md).
876*   Binary format: ucadata.icu & binary tailorings in resource bundles:
877    [source/i18n/collationdatareader.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/collationdatareader.h)
878*   Generator tool:
879    [genuca](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genuca),
880    [genrb](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genrb)
881
882#### Rule-based break iterator data
883*   Source format: .txt: [Boundary Analysis chapter](boundaryanalysis/index.md)
884*   Binary format: .brk:
885    [source/common/rbbidata.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/rbbidata.h)
886*   Generator tool:
887    [genbrk](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genbrk)
888
889#### Dictionary-based break iterator data (ICU 50 & later)
*   Source format: .txt: [gendict.cpp
    comments](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gendict/gendict.cpp)
*   Binary format: .dict: see
    [source/common/dictionarydata.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/dictionarydata.h)
894*   Generator tool:
895    [gendict](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gendict)
896
897#### Rule-based transform (transliterator) data
898*   Source format: .txt (in resource bundles): [Transform Rule Tutorial chapter](transforms/general/rules.md)
899*   Binary format: Uses genrb to make binary format
900*   Generator tool: Does not apply
901
902#### Time zone data (ICU 4.4 & later)
903*   Source format:
904    [source/data/misc/zoneinfo64.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/misc/zoneinfo64.txt):
905    ftp://elsie.nci.nih.gov/pub/ tzdata<year><rev>.tar.gz
906*   Binary format: zoneinfo64.res (generated by genrb and
907    [tzcode tools](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/tzcode/readme.txt)).
908*   Generator tool: Does not apply
909
910#### StringPrep profile data
*   Source format:
    [source/data/sprep/rfc3491.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/sprep/rfc3491.txt)
913*   Binary format: .spp:
914    [source/tools/gensprep/store.c](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gensprep/store.c)
915*   Generator tool:
916    [gensprep](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gensprep)
917
918#### Confusables data
919*   Source format:
920    [source/data/unidata/confusables.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/confusables.txt),
921    [source/data/unidata/confusablesWholeScript.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/confusablesWholeScript.txt)
*   Binary format: confusables.cfu:
    [source/i18n/uspoof_impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/uspoof_impl.h)
924*   Generator tool: [gencfu](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gencfu)
925
926### Public Data Files (old versions)
927
928#### Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library)
*   Source format:
    [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata):
    [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
932*   Binary format: unorm.icu:
933    [source/common/unormimp.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unormimp.h)
934*   Generator tool: gennorm
935
936#### Unicode Character Data (Property [value] aliases before ICU 4.8)
937*   Source format: source/data/unidata/Property*Aliases.txt: [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
938*   Binary format: pnames.icu: source/common/propname.h (ICU 4.6)
939*   Generator tool: genpname
940
941#### Collation data (UCA, code points to weights; ICU 52 & earlier)
942*   Source format: Same as in ICU 53
943*   Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52)
944*   Generator tool:
945    [genuca](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genuca),
946    [genrb](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genrb)
947
948#### Collation data (Inverse UCA, weights->code points; ICU 52 & earlier)
949*   Source format: Processed from FractionalUCA.txt like ICU 52 ucadata.icu
950*   Binary format: invuca.icu: source/i18n/ucol_imp.h (ICU 52)
951*   Generator tool:
952    [genuca](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genuca)
953
954#### Dictionary-based break iterator data (ICU 49 & earlier)
*   Source format: .txt: genctd.cpp comments
*   Binary format: .ctd: see CompactTrieHeader in source/common/triedict.cpp
*   Generator tool: genctd
958
959#### Time zone data (Before ICU 4.4)
*   Source format: source/data/misc/zoneinfo.txt (ICU 4.2): ftp://elsie.nci.nih.gov/pub/ tzdata<year><rev>.tar.gz
961*   Binary format: zoneinfo64.res (generated by genrb and
962    [tzcode tools](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/tzcode/readme.txt)).
963*   Generator tool: Does not apply
964
965### Non-File API Binary Data
966
967#### Converter selector data
968*   Source format: none
969*   Binary format:
970    [source/common/ucnvsel.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnvsel.cpp)
971*   Generator tool:
972    [ucnvsel_open()](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnvsel.cpp)
973
974### Test-Only Data Files
975
976#### test.icu (for udata API testing)
977*   Source format: none (fixed output from gentest when not using -r or -j options)
978*   Binary format: test.icu: see `createData()` in
979                   [source/tools/gentest/gentest.c](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gentest/gentest.c)
980*   Generator tool:
981    [gentest](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gentest/gentest.c)
982
983### Other Data Structures
984
985#### UCPTrie (C)/CodePointTrie (Java) (maps code points to integers)
986*   Source format: (public builder API)
987*   Binary format:
988    [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie),
989    [icu4c/source/common/ucptrie_impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucptrie_impl.h)
990*   Generator tool: (builder class)
991
992#### UTrie2 (C)/Trie2 (Java) (maps code points to integers)
993*   Source format: (internal builder API)
994*   Binary format:
995    [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie),
996    [icu4c/source/common/utrie2_impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/utrie2_impl.h)
997*   Generator tool: (builder class)
998
999#### BytesTrie (maps byte sequences to 32-bit integers)
1000*   Source format: (public builder API)
1001*   Binary format:
1002    [BytesTrie design doc](http://site.icu-project.org/design/struct/tries/bytestrie),
1003    [icu4c/source/common/unicode/bytestrie.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/bytestrie.h)
1004*   Generator tool: (builder class)
1005
1006#### UCharsTrie (C++)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers)
1007*   Source format: (public builder API)
1008*   Binary format:
1009    [UCharsTrie design doc](http://site.icu-project.org/design/struct/tries/ucharstrie),
1010    [icu4c/source/common/unicode/ucharstrie.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/ucharstrie.h)
1011*   Generator tool: (builder class)
1012
1013## ICU4J Resource Information
1014
Starting with release 2.1, ICU4J includes its own resource information which is
completely independent of the JRE resource information. (Note: in ICU4J 2.8 to 3.4,
time zone information depends on the underlying JRE.) The new ICU4J information
is equivalent to the information in ICU4C and many resources are, in fact, the
same binary files that ICU4C uses.
1020
1021By default the ICU4J distribution includes all of the standard resource
1022information. It is located under the directory `com/ibm/icu/impl/data`.
1023Depending on the service, the data is in different locations and in different
1024formats. Note: This will continue to change from release to release, so clients
1025should not depend on the exact organization of the data in ICU4J.
1026
1.  The primary **locale data** is under the directory `icudt38b`, as a set of
    ".res" files whose names are the locale identifiers. Locale naming is
    documented in the `com.ibm.icu.util.ULocale` class, and the use of these
    names in searching for resources is documented in
    `com.ibm.icu.util.UResourceBundle`.
1032
10332.  The **collation data** is under the directory `icudt38b/coll`, as a set of
1034    ".res" files.
1035
10363.  The **rule-based transliterator data** is under the directory
1037    `icudt38b/translit` as a set of ".res" files. (**Note:** the Han
1038    transliterator test data is no longer included in the core icu4j.jar file by
1039    default.)
1040
10414.  The **rule-based number format data** is under the directory `icudt38b/rbnf`
1042    as a set of ".res" files.
1043
10445.  The **break iterator data** is directly under the data directory, as a set
1045    of ".brk" files, named according to the type of break and the locale where
1046    there are locale-specific versions.
1047
10486.  The **holiday data** is under the data directory, as a set of ".class"
1049    files, named "HolidayBundle_" followed by the locale ID.
1050
7.  The **character property data** as well as assorted **normalization data**
    and default **Unicode Collation Algorithm (UCA) data** are found under the
    data directory as a set of ".icu" files.
1054
10558.  The **character set converter data** is under the directory `icudt38b/`, as
1056    a set of ".cnv" files. These files are currently included only in
1057    icu-charset.jar.
1058
10599.  The **time zone data** is named `zoneinfo.res` under the directory
1060    `icudt38b`.
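
Because the layout changes from release to release, it can help to look at what a given jar actually contains. The Python sketch below is a rough illustration: `summarize_entries` is a hypothetical helper that counts jar entry names under the data directory by subdirectory and file extension. The default `data_root` is the path named above and is the only ICU-specific assumption.

```python
import posixpath
from collections import Counter


def summarize_entries(names, data_root="com/ibm/icu/impl/data/"):
    """Count data files per (subdirectory, extension) below data_root."""
    counts = Counter()
    for name in names:
        # Skip entries outside the data directory and directory entries.
        if not name.startswith(data_root) or name.endswith("/"):
            continue
        relative = name[len(data_root):]
        directory = posixpath.dirname(relative) or "."
        extension = posixpath.splitext(relative)[1] or "(none)"
        counts[(directory, extension)] += 1
    return counts
```

The entry names can be obtained with `zipfile.ZipFile(jar_path).namelist()`, since a jar file is a zip archive; `jar_path` stands for whichever ICU4J jar is being inspected.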
1061
1062Some of the data files alias or otherwise reference data from other data files.
1063One reason for this is because some locale names have changed. For example,
1064he_IL used to be iw_IL. In order to support both names but not duplicate the
1065data, one of the resource files refers to the other file's data. In other cases,
1066a file may alias a portion of another file's data in order to save space.
1067Currently ICU4J provides no tool for revealing these dependencies.
1068
1069> :point_right: **Note**: Java's Locale class silently converts the language
1070code "he" to "iw" when you construct the Locale (for versions of Java through
1071Java 5). Thus Java cannot be used to locate resources that use the "he" language
1072code. ICU, on the other hand, does not perform this conversion in ULocale, and
1073instead uses aliasing in the locale data to represent the same set of data under
1074different locale ids.
1075
Resource files that use locale ids form a hierarchy, with up to four levels: a
root, language, region (country), and variant. Searches for locale data attempt
to match as far down the hierarchy as possible. For example, "he_IL" will match
he_IL, but "he_US" will match he (since there is no US variant for he), and
"xx_YY" will match root (the default fallback locale) since there is no xx
language code in the locale hierarchy. Again, see `java.util.ResourceBundle` for
more information.
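
The matching behavior in this example can be modeled in a few lines. The Python sketch below is a simplified illustration rather than ICU4J's actual lookup code: `fallback_chain` and `resolve` are hypothetical names, and the model only truncates subtags and ends at root, ignoring aliases and any default-locale step.

```python
def fallback_chain(locale_id):
    """List the locale ids to try, from most specific down to root."""
    chain = []
    current = locale_id
    while current and current != "root":
        chain.append(current)
        # Drop the last subtag: "he_IL" -> "he", "he" -> "".
        current = current.rpartition("_")[0]
    chain.append("root")
    return chain


def resolve(locale_id, available):
    """Return the first locale in the fallback chain that has data."""
    for candidate in fallback_chain(locale_id):
        if candidate in available:
            return candidate
    raise LookupError("no data available, not even root")


available = {"root", "he", "he_IL"}
for requested in ("he_IL", "he_US", "xx_YY"):
    print(requested, "->", resolve(requested, available))
```

With only root, he, and he_IL available, the three requests resolve to he_IL, he, and root respectively, matching the description above.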
1083
1084Currently ICU4J provides no tool for revealing these dependencies between data
1085files, so trimming the data directly in the ICU4J project is a hit-or-miss
1086affair. The key point when you remove data is to make sure to remove all
1087dependencies on that data as well. For example, if you remove he.res, you need
1088to remove he_IL.res, since it is lower in the hierarchy, and you must remove
1089iw.res, since it references he.res, and iw_IL.res, since it depends on it (and
1090also references he_IL.res).
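
Since no tool reveals these dependencies, one option is to record them by hand and compute the full removal set from that record. The Python sketch below is illustrative: `files_to_remove` is a hypothetical helper, and the dependency map simply restates the he/iw example from the paragraph above.

```python
def files_to_remove(target, depends_on):
    """Return target plus every file that directly or indirectly depends on it.

    depends_on maps a file name to the set of files it falls back to or
    references.
    """
    removed = {target}
    changed = True
    while changed:
        changed = False
        for name, dependencies in depends_on.items():
            if name not in removed and dependencies & removed:
                removed.add(name)
                changed = True
    return removed


# Hand-written map for the example in the text (an assumption, not tool output).
depends_on = {
    "he_IL.res": {"he.res"},               # lower in the hierarchy
    "iw.res": {"he.res"},                  # references he.res
    "iw_IL.res": {"iw.res", "he_IL.res"},  # depends on iw, references he_IL
    "en.res": set(),
}
print(sorted(files_to_remove("he.res", depends_on)))
```

Files with no path to the removed file, such as `en.res` here, are left alone.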
1091
Unfortunately, the jar tool in the JDK provides no way to remove items from a
jar file. Thus you have to extract the resources, remove the ones you don't
want, and then create a new jar file with the remaining resources. See the jar
tool information for how to do this. Before re-jarring the files, be sure to
thoroughly test your application with the remaining resources, making sure each
required resource is present.
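
The extract, remove, and re-create steps can also be scripted, because a jar file is a zip archive. The following Python sketch is one possible approach, not an ICU-provided tool: `copy_jar_without` is a hypothetical helper that copies every entry except the unwanted ones into a new archive. It makes no attempt to handle signed jars or to update the manifest.

```python
import zipfile


def copy_jar_without(src_path, dst_path, unwanted):
    """Copy a jar to dst_path, skipping the entry names listed in unwanted."""
    unwanted = set(unwanted)
    kept = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for entry in src.infolist():
            if entry.filename in unwanted:
                continue
            # Reuse the original entry metadata (name, timestamp, compression).
            dst.writestr(entry, src.read(entry.filename))
            kept.append(entry.filename)
    return kept


# Example call (paths and entry names are placeholders):
# copy_jar_without("icu4j.jar", "icu4j-trimmed.jar",
#                  ["com/ibm/icu/impl/data/icudt38b/he.res"])
```

The returned list of kept entries can be compared against the resources the application is known to require before the trimmed jar is deployed.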
1098
1099#### Using additional resource files with ICU4J
1100
1101> :point_right: **Note**: Resource file formats can change across releases of ICU4J!
1102>
1103> *The format of ICU4J resources is not part of the API. Clients who develop their
1104> own resources for use with ICU4J should be prepared to regenerate them when they
1105> move to new releases of ICU4J.*
1106
We are still developing ICU4J's resource mechanism. Currently it is not possible
to mix ICU's new binary .res resources with traditional Java-style .class or
.txt resources. We might allow for this in a future release, but since the
resource data and format are not formally supported, you run the risk of
incompatibilities with future releases of ICU4J.
1112
1113Resource data in ICU4J is checked in to the repository as a jar file containing
1114the resource binaries, icudata.jar. This means that inspecting the contents of
1115these resources is difficult. They currently are compiled from ICU4C .txt file
1116data. You can view the contents of the ICU4C text resource files to understand
1117the contents of the ICU4J resources.
1118
1119The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build
1120directory when the 'core' target is built. Building the 'resources' target will
1121force the resources to once again be extracted. Extraction will overwrite any
1122corresponding resource files already in that directory.
1123
1124### Building ICU4J Resources from ICU4C
1125
1126#### Requirements
1127
11281.  [ICU4C](http://icu-project.org/download/)
1129
11302.  Compilers and tools required for [building ICU4C](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).
1131
11323.  J2SE SDK version 5 or above
1133
1134#### Procedure
1135
1.  Download and build ICU4C on a Windows or Linux machine. For instructions on
    downloading and building ICU4C, see the
    [ICU4C readme](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).
1138
11392.  Follow the remaining instructions in
1140    [*$icu4c_root*/source/data/icu4j-readme.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/icu4j-readme.txt).
1141    *$icu4c_root* is the root directory of ICU4C source package.
1142