---
layout: default
title: ICU Data
nav_order: 13
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU makes use of a wide variety of data tables to provide many of its services.
Examples include converter mapping tables, collation rules, transliteration
rules, break iterator rules and dictionaries, and other locale data. Additional
data can be provided by users, either as customizations of ICU's data or as new
data altogether.

This section describes how ICU data is stored and located at run time. It also
describes how ICU data can be customized to suit the needs of a particular
application.

For simple use of ICU's predefined data, this section on data management can
safely be skipped. The data is built into a library that is loaded along with
the rest of ICU. No specific action or setup is required of either the
application program or the execution environment.

Update: as of ICU 64, the standard data library is over 20 MB in size. We have
introduced a new tool, the [ICU Data Build Tool](./icu_data/buildtool.md),
to give you more control over what goes into your ICU locale data file.

> :point_right: **Note**: ICU for C by default comes with pre-built data.
> The source data files are included as an "icu\*data.zip" file starting in ICU4C 49.
> Previously, they were not included unless ICU was downloaded from the [source repository](http://site.icu-project.org/repository).

## ICU and CLDR Data

Most of ICU's data is sourced from [CLDR](http://cldr.unicode.org), the [Common
Locale Data Repository](http://cldr.unicode.org) project. Do not file bugs
against ICU to request data changes in CLDR; see the CLDR project's page instead.
Also note that most ICU data files are therefore autogenerated from CLDR, and so
manually editing them is not usually recommended.

Data which is NOT sourced from CLDR includes:

* [Conversion Data](conversion/data.md)
* Break Iterator Dictionary Data (Thai, CJK, etc.)
* Break Iterator Rule Data (as of this writing, it is manually kept in sync
  with the CLDR datasets)

For information on building ICU data from CLDR, see the
[cldr-icu-readme](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/cldr-icu-readme.txt).

## ICU Data Directory

The ICU data directory is the default location for all ICU data. Any requests
for data items that do not include an explicit directory path will be resolved
to files located in the ICU data directory.

The ICU data directory is determined as follows:

1. If the application has called the function `u_setDataDirectory()`, use the
   directory specified there, otherwise:

2. If the environment variable `ICU_DATA` is set, use that, otherwise:

3. If the C preprocessor variable `ICU_DATA_DIR` was set at the time ICU was
   built, use its compiled-in value.

4. Otherwise, the ICU data directory is an empty string. This is the default
   behavior for ICU using a shared library for its data and provides the
   highest data loading performance.

> :point_right: **Note**: `u_setDataDirectory()` is not thread-safe. Call it
> *before* calling ICU APIs from multiple threads. If you use both
> `u_setDataDirectory()` and `u_init()`, then use `u_setDataDirectory()` first.
>
> *Earlier versions of ICU supported two additional schemes: setting a data
> directory relative to the location of the ICU shared libraries, and on Windows,
> taking a location from the registry.
> These have both been removed to make the
> behavior more predictable and easier to understand.*

The ICU data directory does not need to be set in order to reference the
standard built-in ICU data. Applications that just use standard ICU capabilities
(converters, locales, collation, etc.) but do not build and reference their own
data do not need to specify an ICU data directory.

### Multiple-Item ICU Data Directory Values

The ICU data directory string can contain multiple directories as well as .dat
path/filenames. They must be separated by the path separator that is used on the
platform, for example a semicolon (`;`) on Windows. Data files will be searched in
all directories and .dat package files in the order of the directory string. For
details, see the example below.

## Default ICU Data

The default ICU data consists of the data needed for the converters, collators,
locales, etc. that are provided with ICU. Default data must be present in order
for ICU to function.

The default data is most commonly built into a shared library that is installed
with the other ICU libraries. Nothing is required of the application for this
mechanism to work. ICU provides additional options for loading the default data
if more flexibility is required.

Here are the steps followed by ICU to locate its default data. This procedure
happens only once per process, at the time an ICU data item is first requested.

1. If the application has called the function `udata_setCommonData()`, use the
   data that was provided. The application specifies the address in memory of
   an image of an ICU common format data file (either in shared-library format
   or .dat package file format).

2. Examine the contents of the default ICU data shared library. If it contains
   data, use that data. If the data library is empty (a stub library), proceed
   to the next step.
   (A data shared library must always be present in order for
   ICU to successfully link and load. A stub data library is used when the
   actual ICU common data is to be provided from another source.)

3. Dynamically load (memory map, typically) a common format (.dat) file
   containing the default ICU data. Loading is described in the section
   [How Data Loading Works](icudata#how-data-loading-works). The path to
   the data is of the form "icudt\<version\>\<flag\>", where \<version\> is
   the two-digit ICU version number, and \<flag\> is a letter indicating the
   internal format of the file (see the
   [Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms)
   section).

Once the default ICU data has been located, loading of individual data items
proceeds as described in the section
[How Data Loading Works](icudata#how-data-loading-works).

## Building and Linking against ICU data

When using ICU's configure or runConfigureICU tool to build, several different
methods of packaging are available.

> :point_right: **Note**: in all cases, you **must** link all ICU tools and
applications against a "data library": either a data library containing the ICU
data, or against the "stubdata" library located in icu/source/stubdata. For
example, even if ICU is built in "files" mode, you must still link against the
"stubdata" library or an undefined symbol error occurs.

* `--with-data-packaging=library`
  This mode builds a shared library (DLL or .so). This is the simplest mode to
  use, and is the default.
  To use: link your application against the common and data libraries.
  This is the only directly supported behavior on Windows builds.
* `--with-data-packaging=static`
  This option builds ICU data as a single (large) static library. This mode is
  more complex to use. If you encounter errors, you may need to build ICU
  multiple times.
* `--with-data-packaging=files`
  With this option, ICU outputs separate individual files (.res, .cnv, etc.)
  which will be loaded at runtime. Read the rest of this document, especially
  the sections that discuss the ICU directory path.
* `--with-data-packaging=archive`
  With this option, ICU outputs a single "icudt__.dat" file containing ICU
  data. Read the rest of this document, especially the sections that discuss
  the ICU directory path.

## Time Zone Data

Because time zone data requires frequent updates in response to countries
changing their transition dates for daylight saving time, ICU provides
additional options for loading time zone data from separate files, thus avoiding
the need to update a combined ICU data package. Further information is found
under [Time Zones](datetime/timezone/index.md).

## Application Data

ICU-based applications can ship and use their own data for localized strings,
custom conversion tables, etc. Each data item file must have a package name as a
prefix, and this package name must match the basename of a .dat package file, if
one is used. The package name must be used in ICU APIs, for example in
`udata_setAppData()` (instead of `udata_setCommonData()`, which is only used for
ICU's own data) and in the pathname argument of `ures_open()`.

The only real difference from ICU's own data is that application data cannot be
loaded simply by specifying a NULL value for the path arguments of ICU APIs, and
application data will not be used by APIs that do not have path/package name
arguments at all.

The most important APIs that allow application data to be used are for Resource
Bundles, which are most often used for localized strings and other data.
There
are also functions like `ucnv_openPackage()` that allow application data to be
specified, and the `udata.h` API can be used to load any data with minimum
requirements on the binary format, and without ICU interpreting the contents of
the data.

The `pkgdata` tool, which is used to package the data into various formats (e.g.
shared library), has an option (`--without-assembly` or `-w`) to not use
assembly code when building and packaging the application-specific data into a
shared library. Building the data with assembly code, which is enabled by
default, is faster and more efficient; however, there are some
platform-specific issues that may arise. The `--without-assembly` option may be
necessary on certain platforms (e.g. Linux) which have trouble properly loading
application data when it was built with assembly code and is packaged as a
shared library.

## Alignment

ICU data is designed to be 16-aligned, with natural alignment of values inside
the data structure, so that the data is usable as is when memory-mapped.
("16-aligned" means that the start address is a multiple of 16 bytes.)

Memory-mapping (as well as memory allocation) provides at least 16-alignment on
modern platforms. Some CPUs require n-alignment of types of size n bytes (and
crash on unaligned reads); other CPUs usually operate faster on data that is
aligned properly.

Some of the ICU code explicitly checks for proper alignment.

The `icupkg` tool places data items into the .dat file at start offsets that are
multiples of 16 bytes.

When using `genccode` to directly write a .o/.obj file, or to write assembler
code, it specifies at least 16-alignment. When using `genccode` to write C code,
it prepends the data with a double value which should yield at least 8-alignment
on most platforms (usually `sizeof(double)=8`).

## Flexibility vs. Installation vs. Performance

There are choices that affect ICU data loading and depend on application
requirements.

### Data in Shared Libraries/DLLs vs. .dat package files

Building ICU data into shared libraries (`--with-data-packaging=library`) is the
most convenient packaging method because shared libraries (DLLs) are easily
found if they are in the same directory as the application libraries, or if they
are on the system library path. The application installer usually just copies
the ICU shared libraries to the same place. On the other hand, shared libraries
are not portable.

Packaging data into .dat files (`--with-data-packaging=archive`) allows them to
be shared across platforms, but they must either be loaded by the application
and set with `udata_setCommonData()` or `udata_setAppData()`, or they must be
in a known location that is included in the ICU data directory string. This
requires the application installer, or the application itself at runtime, to
locate the ICU and/or application data by setting the ICU data directory (see
the [ICU Data Directory](icudata#icu-data-directory) section above) or by
loading the data and providing it to one of the `udata_setXYZData()` functions.

Unlike shared libraries, .dat package files can be taken apart into separate
data item files with the `decmn` ICU tool. This allows post-installation
modification of a package file. The `gencmn` and `pkgdata` ICU tools can then be
used to reassemble the .dat package file.

For more information about .dat package files see the section [Sharing ICU Data
Between Platforms](icudata#sharing-icu-data-between-platforms) below.

### Data Overriding vs. Loading Performance

If the ICU data directory string is empty, then ICU will not attempt to load
data from the file system.
It is then only possible to load data from the
linked-in shared library or via `udata_setCommonData()` and
`udata_setAppData()`. This is inflexible but provides the highest performance.

If the ICU data directory string is not empty, then data items are searched in
all directories and matching .dat files mentioned before checking in
already-loaded package files. This allows overriding of packaged data items with
single files after installation but costs some time for filesystem accesses.
This is usually done only once per data item; see
[User Data Caching](icudata#user-data-caching) below.

### Single Data Files vs. Packages

Single data files (`--with-data-packaging=files`) are easy to replace and can
override items inside data packages. However, it is usually desirable to reduce
the number of files during installation, and package files use less disk space
than many small files.

## How Data Loading Works

ICU data items are referenced by three names: a path, a name, and a type. The
following are some examples:

path                         | name     | type
-----------------------------|----------|-------
 c:\\some\\path\\dataLibName | test     | dat
 no path                     | cnvalias | icu
 no path                     | cp1252   | cnv
 no path                     | en       | res
 no path                     | uprops   | icu


Items with 'no path' specified are loaded from the default ICU data.

Application data items include a path, and will be loaded from user data files,
not from the ICU default data. For application data, the path argument need not
contain an actual directory, but must contain the application data's package
name after the last directory separator character (or by itself if there is no
directory). If the path argument contains a directory, then it is logically
prepended to the ICU data directory string and searched first for data. The path
argument can contain at most one directory.
(Path separators like semicolon (;)
are not handled here.)

> :point_right: **Note**: The ICU data directory string itself may
contain multiple directories and path/filenames to .dat package files. See the
[ICU Data Directory](icudata#icu-data-directory) section.

It is recommended to not include the directory in the path argument but to make
sure via setting the application data or the ICU data directory string that the
data can be located. This simplifies program maintenance and improves
robustness.

See the API descriptions for the functions `udata_open()` and
`udata_openChoice()` for additional information on opening ICU data from within
an application.

Data items can exist as individual files, or a number of them can be packaged
together in a single file for greater efficiency in loading and convenience of
distribution. The combined files are called Common Files.

Based on the supplied path and name, ICU searches several possible locations
when opening data. To make things more concrete in the following descriptions,
the following values of path, name and type are used:

```
path = "c:\\some\\path\\dataLibName"
name = "test"
type = "res"
```

In this case, "dataLibName" is the "package name" part of the path argument, and
"c:\\some\\path\\" is the directory part of it.

The search sequence for the data for "test.res" is as follows (the first
successful loading attempt wins):

1. Try to load the file "dataLibName_test.res" from c:\\some\\path\\.

2. Try to load the file "dataLibName_test.res" from each of the directories in
   the ICU data directory string.

3. Try to locate the data package for the package name "dataLibName".

   1. Try to locate the data package in the internal cache.

   2. Try to load the package file "dataLibName.dat" from c:\\some\\path\\.

   3. Try to load the package file "dataLibName.dat" from each of the directories
      in the ICU data directory string.

The first steps, loading the data item from an individual file, are omitted if
no directory is specified in either the path argument or the ICU data directory
string.

Package files are loaded at most once and then cached. They are identified only
by their package name. Whenever a data item is requested from a package and that
package has been loaded before, then the cached package is used immediately
instead of searching through the filesystem.

> :point_right: **Note**: ICU versions before 2.2 always searched data packages
before looking for individual files, which made it impossible to override
packaged data items. See the ICU 2.2 download page and the readme for more
information about the changes.

## User Data Caching

Once loaded, data package files are cached, and stay loaded for the duration of
the process. Any requests for data items from an already loaded data package
file are routed directly to the cached data. No additional search for loadable
files is made.

The user data cache is keyed by the base file name portion of the requested
path, with any directory portion stripped off and ignored. Using the previous
example, for the path name "c:\\some\\path\\dataLibName", the cache key is
"dataLibName". After this is cached, a subsequent request for "dataLibName", no
matter what directory path is specified, will resolve to the cached data.

Data can be explicitly added to the cache of common format data by means of the
`udata_setAppData()` function. This function takes as input the path (name) and
a pointer to a memory image of a .dat file. The data is added to the cache,
causing any subsequent requests for data items from that file name to be routed
to the cache.

Only data package files are cached.
Separate data files that contain just a
single data item are not cached; for these, multiple requests to ICU to open the
data will result in multiple requests to the operating system to open the
underlying file.

However, most ICU services (Resource Bundles, conversion, etc.) themselves cache
loaded data, so that data is usually loaded only once until the end of the
process (or until `u_cleanup()` or `ucnv_flushCache()` or similar are called).

There is no mechanism for removing or updating cached data files.

## Directory Separator Characters

If a directory separator (generally '/' or '\\') is needed in a path parameter,
use the form that is native to the platform. The ICU header `"putil.h"` defines
`U_FILE_SEP_CHAR` appropriately for the platform.

> :point_right: **Note**: On Windows, the directory separator must be '\\' for
any paths passed to ICU APIs. This is different from native Windows APIs, which
generally allow either '/' or '\\'.

## Sharing ICU Data Between Platforms

ICU's default data is (at the time of this writing) about 8 MB in size. Because
it is normally built as a shared library, the file format is specific to each
platform (operating system). The data libraries cannot be shared between
platforms even though the actual data contents are identical.

By distributing the default data in the form of common format .dat files rather
than as shared libraries, a single data file can be shared among multiple
platforms. This is beneficial if a single distribution of the application (a CD,
for example) includes binaries for many platforms, and the size requirements for
replicating the ICU data for each platform are a problem.

ICU common format data files are not completely interchangeable between
platforms. The format depends on these properties of the platform:

1. Byte Ordering (little endian vs. big endian)

2. Base character set - ASCII or EBCDIC

This means, for example, that ICU data files are interchangeable between Windows
and Linux on X86 (both are ASCII little endian), or between Macintosh and
Solaris on SPARC (both are ASCII big endian), but not between Solaris on SPARC
and Solaris on X86 (different byte ordering).

The single letter following the version number in the file name of the default
ICU data file encodes the properties of the file as follows:

```
icudt19l.dat Little Endian, ASCII
icudt19b.dat Big Endian, ASCII
icudt19e.dat Big Endian, EBCDIC
```

(There are no little endian EBCDIC systems. All non-EBCDIC encodings include an
invariant subset of ASCII that is sufficient to enable these files to
interoperate.)

The packaging of the default ICU data as a .dat file rather than as a shared
library is requested by using an option in the configure script at build time.
Nothing is required at run time; ICU finds and uses whatever form of the data is
available.

> :point_right: **Note**: When the ICU data is built in the form of shared
libraries, the library names have platform-specific prefixes and suffixes. On
Unix-style platforms, all the libraries have the "lib" prefix and one of the
usual (".dll", ".so", ".sl", etc.) suffixes. Other than these prefixes and
suffixes, the library names are the same as the above .dat files.

## Customizing ICU's Data Library

ICU includes a standard library of data that is about 16 MB in size. Most of
this consists of conversion tables and locale information. The data itself is
normally placed into a single shared library.

Update: as of ICU 64, the standard data library is over 20 MB in size. We have
introduced a new tool, the [ICU Data Build Tool](icu_data/buildtool.md),
to replace the makefiles explained below and give you more control over what
goes into your ICU locale data file.
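
A customized .dat file keeps the `icudt<version><flag>` name described in
[Sharing ICU Data Between Platforms](icudata#sharing-icu-data-between-platforms),
so it helps to know which flag letter a target platform expects. The following
is a minimal sketch of that naming rule, written in Python purely for
illustration. The function names `icu_data_flag` and `icu_data_name` are
hypothetical and are not part of ICU.

```python
import sys


def icu_data_flag(byteorder: str = sys.byteorder, ebcdic: bool = False) -> str:
    """Return the format letter for a platform (hypothetical helper, not an ICU API).

    'l' = little endian ASCII, 'b' = big endian ASCII, 'e' = big endian EBCDIC.
    """
    if ebcdic:
        # Only a big endian EBCDIC format is listed in this guide.
        if byteorder != "big":
            raise ValueError("no little endian EBCDIC data format is defined")
        return "e"
    return "b" if byteorder == "big" else "l"


def icu_data_name(version: int, byteorder: str = sys.byteorder,
                  ebcdic: bool = False) -> str:
    """Build the default data file name, e.g. 'icudt19l.dat'."""
    return f"icudt{version}{icu_data_flag(byteorder, ebcdic)}.dat"


if __name__ == "__main__":
    # Name of the data file that the machine running this script would expect.
    print(icu_data_name(19))
```

Called with version 19, the helper reproduces the three file names listed in
that section for little endian ASCII, big endian ASCII, and big endian EBCDIC.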

### Adding Converters to ICU

The first step is to obtain or create a .ucm (source) mapping data file for the
desired converter. A large archive of converter data is maintained by the ICU
team at <https://github.com/unicode-org/icu-data/tree/master/charset/data/ucm>

We will use `solaris-eucJP-2.7.ucm`, available from the repository mentioned
above, as an example.

#### Build the Converter

Converter source files are compiled into binary converter files (.cnv files) by
using the ICU tool `makeconv`. For the example, you can use this command:

```
makeconv -v solaris-eucJP-2.7.ucm
```

Some of the .ucm files from the repository will need additional header
information before they can be built. Use the error messages from the `makeconv`
tool, .ucm files for similar converters, and the ICU user guide documentation of
.ucm files as a guide when making changes. For the `solaris-eucJP-2.7.ucm`
example, we will borrow the missing header fields from
`source/data/mappings/ibm-33722_P12A-2000.ucm`, which is the standard ICU eucJP
converter data.

The ucm file format is described in the
["Conversion Data" chapter](conversion/data.md) of this user guide.

After adjustment, the header of the `solaris-eucJP-2.7.ucm` file contains these
items:

```
<code_set_name> "solaris-eucJP-2.7"
<subchar> \x3F
<uconv_class> "MBCS"

<mb_cur_max> 3
<mb_cur_min> 1

<icu:state> 0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
<icu:state> a1-fe
<icu:state> a1-e4
<icu:state> a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
<icu:state> a1-fe
```

The binary converter file produced by the `makeconv` tool is
`solaris-eucJP-2.7.cnv`.

#### Installation

Copy the new .cnv file to the desired location for use.
Set the environment
variable `ICU_DATA` to the directory containing the data, or, alternatively,
from within an application, tell ICU the location of the new data with the
function `u_setDataDirectory()` before using the new converter.

If ICU is already obtaining data from files rather than a shared library,
install the new file in the same location as the existing ICU data file(s), and
don't change/set the environment variable or data directory.

If you do not want to add a converter to ICU's base data, you can also generate
a conversion table with `makeconv`, use `pkgdata` to generate your own package,
and use `ucnv_openPackage()` to open a converter with that conversion table
from the generated package.

#### Building the new converter into ICU

The need to install a separate file and inform ICU of the data directory can be
avoided by building the new converter into ICU's standard data library. Here is
the procedure for doing so:

1. Move the .ucm file(s) for the converter(s) to be added
   (`solaris-eucJP-2.7.ucm` for our example) into the directory
   `source/data/mappings/`

2. Create, or edit, if it already exists, the file
   `source/data/mappings/ucmlocal.mk`. Add this line:

   ```
   UCM_SOURCE_LOCAL = solaris-eucJP-2.7.ucm
   ```

   Any number of converters can be listed. Extend the list to new lines with a
   backslash at the end of the line. The `ucmlocal.mk` file is described in
   more detail in `source/data/mappings/ucmfiles.mk`. (Even though they use very
   different build systems, `ucmlocal.mk` is used for both the Windows and UNIX
   builds.)

3. Add the converter name and aliases to `source/data/mappings/convrtrs.txt`.
   This will allow your converter to be shown in the list of available
   converters when you call the `ucnv_getAvailableName()` function. The file
   syntax is described within the file.

4. Rebuild the ICU data.
   For Windows, from MSVC choose the makedata project from the GUI, then build
   the project.
   For UNIX, `cd icu/source/data; gmake`

When opening an ICU converter (`ucnv_open()`), the converter name cannot be
qualified with a path that indicates the directory or common data file
containing the corresponding converter data. The required data must be present
either in the main ICU data library or as a separate .cnv file located in the
ICU data directory. This is different from opening resources or other types of
ICU data, which do allow a path.

### Adding Locale Data to ICU's Data

If you have data for a locale that is not included in ICU's standard build, then
you can add it to the build in much the same way as with conversion tables
above. The ICU project provides a large number of additional locales in its
[locale
repository](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/)
on the web. Most of this locale data is derived from the CLDR ([Common Locale
Data Repository](http://www.unicode.org/cldr/)) project.

Dropping the txt file into the correct place in the source tree is sufficient to
add it to your ICU build. You will need to re-configure in order to pick it up.

## Customizing ICU's Data Library for ICU 63 or earlier

The ICU data library can be easily customized, either by adding additional converters or locales, or by removing some of the standard ones for the purpose of saving space.

> :point_right: **Note**: ICU for C by default comes with pre-built data.
The source data files are included as an "icu\*data.zip" file starting in ICU4C
49. Previously, they were not included unless ICU was downloaded from the
[source repository](https://github.com/unicode-org/icu). Alternatively, the
[Data Customizer](http://apps.icu-project.org/datacustom/) may be used to
customize the pre-built data.

ICU can load data from individual data files as well as from its default
library, so building a customized library when adding additional data is not
strictly necessary. Adding to ICU's library can simplify application
installation by eliminating the need to include separate files with an
application distribution, and the need to tell ICU where they are installed.

Reducing the size of ICU's data by eliminating unneeded resources can make
sense on small systems with limited or no disk, but for desktop or server
systems there is no real advantage to trimming. ICU's data is memory mapped
into an application's address space, and only those portions of the data
actually being used are ever paged in, so there are no significant RAM savings.
As for disk space, with the large size of today's hard drives, saving a few MB
is not worth the bother.

By default, ICU builds with a large set of converters and with all available
locales. This means that any extra items added must be provided by the
application developer. There is no extra ICU-supplied data that could be
specified.

### Details

The converters and resources that ICU builds are listed in the following
configuration files. They are only available when building from ICU's source
code repository. Normally, the standard ICU distribution does not include these
files.

File                              | Description
----------------------------------|--------------
source/data/locales/resfiles.mk   | The standard set of locale data resource bundles
source/data/locales/reslocal.mk   | User-provided file with additional resource bundles
source/data/coll/colfiles.mk      | The standard set of collation data resource bundles
source/data/coll/collocal.mk      | User-provided file with additional collation resource bundles
source/data/brkitr/brkfiles.mk    | The standard set of break iterator data resource bundles
source/data/brkitr/brklocal.mk    | User-provided file with additional break iterator resource bundles
source/data/translit/trnsfiles.mk | The standard set of transliterator resource files
source/data/translit/trnslocal.mk | User-provided file with a set of additional transliterator resource files
source/data/mappings/ucmcore.mk   | Core set of conversion tables for MIME/Unix/Windows
source/data/mappings/ucmfiles.mk  | Additional, large set of conversion tables for a wide range of uses
source/data/mappings/ucmebcdic.mk | Large set of EBCDIC conversion tables
source/data/mappings/ucmlocal.mk  | User-provided file with additional conversion tables
source/data/misc/miscfiles.mk     | Miscellaneous data, like timezone information

These files function identically for both Windows and UNIX builds of ICU. ICU
will automatically update the list of installed locales returned by
`uloc_getAvailable()` whenever `resfiles.mk` or `reslocal.mk` are updated and
the ICU data library is rebuilt. These files are only needed while building ICU.
If any of these files are removed or renamed, the size of the ICU data library
will be reduced.

The optional files `reslocal.mk` and `ucmlocal.mk` are not included as part of
a standard ICU distribution. Thus these customization files do not need to be
merged or updated when updating versions of ICU.

Both `reslocal.mk` and `ucmlocal.mk` are makefile includes.
So the usual rules
for makefiles apply. Lines may be continued by preceding the end of the line to
be continued with a backslash. Lines beginning with a `#` are comments. See
`ucmfiles.mk` and `resfiles.mk` for additional information.

### Reducing the Size of ICU's Data: Conversion Tables

The size of the ICU data file in the standard build configuration is about 8 MB.
The majority of this is used for conversion tables. ICU comes with so many
conversion tables because many ICU users need to support many encodings from
many platforms. There are conversion tables for EBCDIC and DOS codepages, for
ISO 2022 variants, and for small variations of popular encodings.

> :point_right: **Important**: ICU provides full internationalization
> functionality without **any** conversion table data. The common library
> contains code to handle several important encodings algorithmically: US-ASCII,
> ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e.,
> US-ASCII, ISO-8859-1, and all Unicode charsets; see
> `source/data/mappings/convrtrs.txt` for the current list).

Therefore, the easiest way to substantially reduce the size of ICU's data
(without limiting I18N support) is to reduce the number of conversion tables
that are built into the data file.

The conversion tables are listed for the build process in several makefiles,
`source/data/mappings/ucm*.mk`, roughly grouped by how commonly they are used.
If you remove or rename any of these files, then the ICU build will exclude the
conversion tables that are listed in that file. Beginning with ICU 2.0, all of
these makefiles including the main one are optional. If you remove all of them,
then ICU will include only very few conversion tables for "fallback" encodings
(see note below).

If you remove or rename all `ucm*.mk` files, then ICU's data is reduced to
about 3.6 MB.
If you remove all these files except for `ucmcore.mk`, then ICU's
data is reduced to about 4.7 MB, while keeping support for a core set of common
MIME/Unix/Windows encodings.

> :point_right: **Note**: If you remove the conversion table for an encoding
> that could be a default encoding on one of your platforms, then ICU will not be
> able to instantiate a default converter. In this case, ICU 2.0 and up will
> automatically fall back to a "lowest common denominator" and load a converter
> for US-ASCII (or, on EBCDIC platforms, for codepages 37 or 1047). This will be
> good enough for converting strings that contain only "ASCII" characters (see the
> comment about "invariant characters" in `utypes.h`).
>
> *When ICU is built with a reduced set of conversion tables, some tests that
> check converter behavior based on known features of some encodings will fail.
> Also, building the testdata will fail if you remove some conversion
> tables that are necessary for that (to test non-ASCII/Unicode resource bundle
> source files, for example). You can ignore these failures. Build with the
> standard set of conversion tables if you want to run the tests.*

### Reducing the Size of ICU's Data: Locale Data

If you need to reduce the size of ICU's data even further, then you need to
remove other files or parts of files from the build as well.

There are a number of different subdirectories of 'data' containing locale data
split out by section. Each subdirectory has its own **.mk** file listing the
locales which will be built. Subdirectories include **lang** for language names
and **curr** for currency names.

You can remove data for entire locales by removing their files from
`source/data/locales/resfiles.mk` or the appropriate other .mk file. ICU will
then use the data of the parent locale instead, which is root.txt.
If you
remove all resource bundles for a given language and its country/region/variant
sublocales, **do not remove root.txt!** Also, do not remove a parent locale if
child locales exist. For example, do not remove "en" while retaining "en_US".

### Reducing the Size of ICU's Data: Collation Data

Collation data (for sorting, searching and alphabetic indexes) is also large,
especially the collation data for East Asian languages because they define
multiple orderings of tens of thousands of Han characters. You can remove the
collation data for those languages by removing references to those locales from
the `source/data/coll/colfiles.mk` file. When you do that, the collation for
those languages will fall back to the root collator, that is, you lose
language-specific behavior.

A much less radical approach is to keep the collation data tables but remove the
tailoring rule strings from which they were built. Those rule strings are
rarely used at runtime. For documentation about their use and how to remove
them see the section "Building on Existing Locales" in the
[Collation Customization chapter](collation/customization/index.md).

### Adding Locale Data to ICU's Data

To add a locale, write a resource bundle file for it with a structure like the
existing locale resource bundles (e.g. `source/data/locales/ja.txt`, `ru_RU.txt`,
`kok_IN.txt`) and add it by writing a file `source/data/locales/reslocal.mk`
in the same way as described above. In this file, define the list of additional
resource bundles as

```
GENRB_SOURCE_LOCAL=myLocale.txt other.txt ...
```

Starting in ICU 2.2, these added locales are automatically listed by
`uloc_getAvailable()`.

## ICU Data File Formats

ICU uses several kinds of data files with specific source (plain text) and
binary data formats. The following lists provide links to descriptions of those
formats.
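
The binary formats listed in this section share a common header, so a data file
can be identified before consulting its format-specific description. The
following is a minimal Python sketch, not part of ICU, that reads the start of
such a header: a 16-bit header length, the magic bytes DA 27, and the leading
fields of `UDataInfo`. The byte offsets are an assumption taken from the public
`UDataInfo` struct declaration in `unicode/udata.h`; the names `IcuHeader` and
`parse_icu_header` are hypothetical.

```python
import struct
from typing import NamedTuple, Tuple

MAGIC = b"\xda\x27"  # the two "magic" bytes that follow the header length
MIN_HEADER = 24      # 4 bytes (length + magic) + 20 bytes of UDataInfo (assumed layout)


class IcuHeader(NamedTuple):
    """Hypothetical container for the fields read from the header."""
    header_size: int                 # total header length in bytes
    is_big_endian: bool
    charset_family: int              # 0 = ASCII, 1 = EBCDIC
    sizeof_uchar: int
    data_format: bytes               # four-byte format identifier
    format_version: Tuple[int, ...]
    data_version: Tuple[int, ...]


def parse_icu_header(buf: bytes) -> IcuHeader:
    """Parse the fixed-size start of an ICU data object header."""
    if len(buf) < MIN_HEADER:
        raise ValueError("buffer too short for an ICU data header")
    if buf[2:4] != MAGIC:
        raise ValueError("magic bytes DA 27 not found")
    # The endianness flag is a single byte, so it can be read
    # before the byte order of the 16-bit fields is known.
    is_big_endian = buf[8] != 0
    order = ">" if is_big_endian else "<"
    (header_size,) = struct.unpack_from(order + "H", buf, 0)
    (info_size,) = struct.unpack_from(order + "H", buf, 4)
    if info_size < 20:
        raise ValueError("UDataInfo is smaller than the assumed layout")
    return IcuHeader(
        header_size=header_size,
        is_big_endian=is_big_endian,
        charset_family=buf[9],
        sizeof_uchar=buf[10],
        data_format=bytes(buf[12:16]),
        format_version=tuple(buf[16:20]),
        data_version=tuple(buf[20:24]),
    )
```

For a file on disk, passing its first 24 bytes (for example
`open(path, "rb").read(24)`) is enough for this sketch; `header_size` should
then give the offset at which the format-specific data begins.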

Each ICU data object begins with a header before the actual, specific data. The
header consists of a 16-bit header length value, the two "magic" bytes DA 27 and
a [UDataInfo](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/structUDataInfo.html#_details)
structure which specifies the data object's endianness, charset family, format,
data version, etc.

(This is not the case for the trie structures, which are not stand-alone,
loadable data objects.)

### Public Data Files

#### ICU.dat package files
* Source format: (list of files provided as input to the icupkg tool, or
  on the gencmn tool command line)
* Binary format: .dat:
  [source/tools/toolutil/pkg_gencmn.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/pkg_gencmn.cpp)
* Generator tool:
  [icupkg](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/icupkg)
  or
  [gencmn](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gencmn)

#### Resource bundles
* Source format: .txt:
  [icuhtml/design/bnf_rb.txt](https://github.com/unicode-org/icu-docs/blob/master/design/bnf_rb.txt)
* Binary format: .res:
  [source/common/uresdata.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/uresdata.h)
* Generator tool:
  [genrb](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genrb)

#### Unicode conversion mapping tables
* Source format: .ucm: [Conversion Data chapter](conversion/data.md)
* Binary format: .cnv:
  [source/common/ucnvmbcs.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnvmbcs.h)
* Generator tool:
  [makeconv](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/makeconv)

#### Conversion (charset) aliases
* Source format:
  [source/data/mappings/convrtrs.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt):
  contains
  format description. The command `uconv -l --canon` will also
  generate the alias table from the currently used copy of ICU.
* Binary format: cnvalias.icu:
  [source/common/ucnv_io.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnv_io.cpp)
* Generator tool:
  [gencnval](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gencnval)

#### Unicode Character Data (Properties; for Java only: hardcoded in C common library)
* Source format:
  [source/data/unidata/ppucd.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt):
  [Preparsed UCD](http://site.icu-project.org/design/props/ppucd)
* Binary format: uprops.icu:
  [tools/unicode/c/genprops/corepropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/corepropsbuilder.cpp)
* Generator tool:
  [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)

#### Unicode Character Data (Case mappings; for Java only: hardcoded in C common library)
* Source format:
  [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata):
  [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
* Binary format: ucase.icu:
  [tools/unicode/c/genprops/casepropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/casepropsbuilder.cpp)
* Generator tool:
  [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)

#### Unicode Character Data (BiDi, and Arabic shaping; for Java only: hardcoded in C common library)
* Source format:
  [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata):
  [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
* Binary format: ubidi.icu:
  [tools/unicode/c/genprops/bidipropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/bidipropsbuilder.cpp)
* Generator tool:
  [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)

#### Unicode Character Data (Normalization since ICU 4.4) & custom normalization data
* Source format:
  [source/data/unidata/norm2/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2):
  Files derived from the [Unicode Character
  Database](http://www.unicode.org/onlinedat/online.html), or custom data.
* Binary format: .nrm:
  [source/common/normalizer2impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/normalizer2impl.h)
* Generator tool:
  [gennorm2](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gennorm2)

#### Unicode Character Data (Character names)
* Source format:
  [source/data/unidata/UnicodeData.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/UnicodeData.txt):
  [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
* Binary format: unames.icu:
  [tools/unicode/c/genprops/namespropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/namespropsbuilder.cpp)
* Generator tool:
  [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)

#### Unicode Character Data (Property [value] aliases since ICU 4.8; for Java only: hardcoded in C common library since ICU 4.8)
* Source format: [UCD Property*Aliases.txt](http://www.unicode.org/Public/UNIDATA/):
  [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
* Binary format: pnames.icu:
  [source/common/propname.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/propname.h)
* Generator tool:
  [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)

#### Unicode Character Data (Text layout properties since ICU 64)
* Source format:
  [source/data/unidata/ppucd.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt):
  [Preparsed UCD](http://site.icu-project.org/design/props/ppucd)
* Binary format: ulayout.icu:
  [tools/unicode/c/genprops/layoutpropsbuilder.cpp](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops/layoutpropsbuilder.cpp)
* Generator tool:
  [genprops](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genprops)

#### Collation data (root collation & tailorings; ICU 53 & later)
* Source format: Original data from allkeys_CLDR.txt in
  [CLDR Root Collation Data Files](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files)
  processed into
  [source/data/unidata/FractionalUCA.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/FractionalUCA.txt)
  by
  [tool at unicode.org maintained by Mark Davis](https://sites.google.com/site/unicodetools/#TOC-UCA)
  (call the Main class with option writeFractionalUCA); source tailorings (text rules) in
  [source/data/coll/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/coll)
  resource bundles: [Collation Customization chapter](collation/customization/index.md).
* Binary format: ucadata.icu & binary tailorings in resource bundles:
  [source/i18n/collationdatareader.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/collationdatareader.h)
* Generator tool:
  [genuca](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genuca),
  [genrb](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genrb)

#### Rule-based break iterator data
* Source format: .txt: [Boundary Analysis chapter](boundaryanalysis/index.md)
* Binary format: .brk:
  [source/common/rbbidata.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/rbbidata.h)
* Generator tool:
  [genbrk](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genbrk)

#### Dictionary-based break iterator data (ICU 50 & later)
* Source format: .txt: [gendict.cpp
  comments](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gendict/gendict.cpp)
* Binary format: .dict: see
  [source/common/dictionarydata.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/dictionarydata.h)
* Generator tool:
  [gendict](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gendict)

#### Rule-based transform (transliterator) data
* Source format: .txt (in resource bundles): [Transform Rule Tutorial chapter](transforms/general/rules.md)
* Binary format: Uses genrb to make binary format
* Generator tool: Does not apply

#### Time zone data (ICU 4.4 & later)
* Source format:
  [source/data/misc/zoneinfo64.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/misc/zoneinfo64.txt):
  ftp://elsie.nci.nih.gov/pub/ `tzdata<year><rev>.tar.gz`
* Binary format: zoneinfo64.res (generated by genrb and
  [tzcode tools](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/tzcode/readme.txt)).
* Generator tool: Does not apply

#### StringPrep profile data
* Source format:
  [source/data/sprep/rfc3491.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/sprep/rfc3491.txt)
* Binary format: .spp:
  [source/tools/gensprep/store.c](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gensprep/store.c)
* Generator tool:
  [gensprep](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gensprep)

#### Confusables data
* Source format:
  [source/data/unidata/confusables.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/confusables.txt),
  [source/data/unidata/confusablesWholeScript.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/confusablesWholeScript.txt)
* Binary format: confusables.cfu:
  [source/i18n/uspoof_impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/uspoof_impl.h)
* Generator tool: [gencfu](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gencfu)

### Public Data Files (old versions)

#### Unicode Character Data (Normalization before ICU 4.4; for Java only: was hardcoded in C common library)
* Source format:
  [source/data/unidata/*.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata):
  [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
* Binary format: unorm.icu:
  [source/common/unormimp.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unormimp.h)
* Generator tool: gennorm

#### Unicode Character Data (Property [value] aliases before ICU 4.8)
* Source format: source/data/unidata/Property*Aliases.txt: [Unicode Character Database](http://www.unicode.org/onlinedat/online.html)
* Binary format: pnames.icu: source/common/propname.h (ICU 4.6)
* Generator tool: genpname

#### Collation data (UCA, code points to weights; ICU 52 & earlier)
* 
Source format: Same as in ICU 53
* Binary format: ucadata.icu & binary tailorings in resource bundles: source/i18n/ucol_imp.h (ICU 52)
* Generator tool:
  [genuca](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genuca),
  [genrb](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/genrb)

#### Collation data (Inverse UCA, weights->code points; ICU 52 & earlier)
* Source format: Processed from FractionalUCA.txt like ICU 52 ucadata.icu
* Binary format: invuca.icu: source/i18n/ucol_imp.h (ICU 52)
* Generator tool:
  [genuca](https://github.com/unicode-org/icu/blob/master/tools/unicode/c/genuca)

#### Dictionary-based break iterator data (ICU 49 & earlier)
* Source format: .txt: genctd.cpp comments
* Binary format: .ctd: see CompactTrieHeader in source/common/triedict.cpp
* Generator tool: genctd

#### Time zone data (Before ICU 4.4)
* Source format: source/data/misc/zoneinfo.txt (ICU 4.2): ftp://elsie.nci.nih.gov/pub/ `tzdata<year><rev>.tar.gz`
* Binary format: zoneinfo64.res (generated by genrb and
  [tzcode tools](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/tzcode/readme.txt)).
* Generator tool: Does not apply

### Non-File API Binary Data

#### Converter selector data
* Source format: none
* Binary format:
  [source/common/ucnvsel.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnvsel.cpp)
* Generator tool:
  [ucnvsel_open()](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucnvsel.cpp)

### Test-Only Data Files

#### test.icu (for udata API testing)
* Source format: none (fixed output from gentest when not using -r or -j options)
* Binary format: test.icu: see `createData()` in
  [source/tools/gentest/gentest.c](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gentest/gentest.c)
* Generator tool:
  [gentest](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/gentest/gentest.c)

### Other Data Structures

#### UCPTrie (C)/CodePointTrie (Java) (maps code points to integers)
* Source format: (public builder API)
* Binary format:
  [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie),
  [icu4c/source/common/ucptrie_impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/ucptrie_impl.h)
* Generator tool: (builder class)

#### UTrie2 (C)/Trie2 (Java) (maps code points to integers)
* Source format: (internal builder API)
* Binary format:
  [ICU Code Point Tries design doc](http://site.icu-project.org/design/struct/utrie),
  [icu4c/source/common/utrie2_impl.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/utrie2_impl.h)
* Generator tool: (builder class)

#### BytesTrie (maps byte sequences to 32-bit integers)
* Source format: (public builder API)
* Binary format:
  [BytesTrie design doc](http://site.icu-project.org/design/struct/tries/bytestrie),
  [icu4c/source/common/unicode/bytestrie.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/bytestrie.h)
* 
Generator tool: (builder class)

#### UCharsTrie (C++)/CharsTrie (Java) (maps 16-bit-Unicode strings to 32-bit integers)
* Source format: (public builder API)
* Binary format:
  [UCharsTrie design doc](http://site.icu-project.org/design/struct/tries/ucharstrie),
  [icu4c/source/common/unicode/ucharstrie.h](https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/ucharstrie.h)
* Generator tool: (builder class)

## ICU4J Resource Information

Starting with release 2.1, ICU4J includes its own resource information which is
completely independent of the JRE resource information. (Note: in ICU4J 2.8 to
3.4, time zone information depends on the underlying JRE.) The new ICU4J
information is equivalent to the information in ICU4C and many resources are,
in fact, the same binary files that ICU4C uses.

By default the ICU4J distribution includes all of the standard resource
information. It is located under the directory `com/ibm/icu/impl/data`.
Depending on the service, the data is in different locations and in different
formats. Note: This will continue to change from release to release, so clients
should not depend on the exact organization of the data in ICU4J.

1. The primary **locale data** is under the directory `icudt38b`, as a set of
   ".res" files whose names are the locale identifiers. Locale naming is
   documented in the `com.ibm.icu.util.ULocale` class, and the use of these
   names in searching for resources is documented in
   `com.ibm.icu.util.UResourceBundle`.

2. The **collation data** is under the directory `icudt38b/coll`, as a set of
   ".res" files.

3. The **rule-based transliterator data** is under the directory
   `icudt38b/translit` as a set of ".res" files. (**Note:** the Han
   transliterator test data is no longer included in the core icu4j.jar file by
   default.)

4. 
The **rule-based number format data** is under the directory `icudt38b/rbnf`
   as a set of ".res" files.

5. The **break iterator data** is directly under the data directory, as a set
   of ".brk" files, named according to the type of break and the locale where
   there are locale-specific versions.

6. The **holiday data** is under the data directory, as a set of ".class"
   files, named "HolidayBundle_" followed by the locale ID.

7. The **character property data** as well as assorted **normalization data**
   and default **Unicode Collation Algorithm (UCA) data** is found under the
   data directory as a set of ".icu" files.

8. The **character set converter data** is under the directory `icudt38b/`, as
   a set of ".cnv" files. These files are currently included only in
   icu-charset.jar.

9. The **time zone data** is named `zoneinfo.res` under the directory
   `icudt38b`.

Some of the data files alias or otherwise reference data from other data files.
One reason for this is that some locale names have changed. For example,
he_IL used to be iw_IL. In order to support both names but not duplicate the
data, one of the resource files refers to the other file's data. In other cases,
a file may alias a portion of another file's data in order to save space.
Currently ICU4J provides no tool for revealing these dependencies.

> :point_right: **Note**: Java's Locale class silently converts the language
> code "he" to "iw" when you construct the Locale (for versions of Java through
> Java 5). Thus Java cannot be used to locate resources that use the "he" language
> code. ICU, on the other hand, does not perform this conversion in ULocale, and
> instead uses aliasing in the locale data to represent the same set of data under
> different locale ids.
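
The aliasing described above can be pictured with a small model. The following
Python sketch is purely illustrative and does not reflect how ICU4J stores or
resolves aliases internally: the `ALIASES` and `BUNDLES` tables and the
`resolve_bundle` function are hypothetical. It shows how an old locale id such
as iw_IL can be served from the data stored under he_IL without duplicating it.

```python
# Hypothetical alias table: old locale id -> id whose data is actually stored.
ALIASES = {"iw": "he", "iw_IL": "he_IL"}

# Hypothetical stored data; only the "he" ids carry real content.
BUNDLES = {
    "he": {"language": "Hebrew"},
    "he_IL": {"language": "Hebrew", "region": "Israel"},
}


def resolve_bundle(locale_id, aliases=ALIASES, bundles=BUNDLES):
    """Follow alias links until an id with stored data is reached."""
    seen = set()
    current = locale_id
    while current in aliases:
        if current in seen:
            # Guard against alias tables that point back at themselves.
            raise ValueError("alias cycle involving " + current)
        seen.add(current)
        current = aliases[current]
    if current not in bundles:
        raise KeyError("no data for " + locale_id)
    return current, bundles[current]
```

In this model, removing `he` from `BUNDLES` while keeping the `iw` entry in
`ALIASES` leaves a dangling reference, which is the kind of hidden dependency
that makes trimming the data by hand error-prone.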

Resource files that use locale ids form a hierarchy, with up to four levels: a
root, language, region (country), and variant. Searches for locale data attempt
to match as far down the hierarchy as possible, for example, "he_IL" will match
he_IL, but "he_US" will match he (since there is no US variant for he), and
"xx_YY" will match root (the default fallback locale) since there is no xx
language code in the locale hierarchy. Again, see `java.util.ResourceBundle` for
more information.

Currently ICU4J provides no tool for revealing these dependencies between data
files, so trimming the data directly in the ICU4J project is a hit-or-miss
affair. The key point when you remove data is to make sure to remove all
dependencies on that data as well. For example, if you remove he.res, you need
to remove he_IL.res, since it is lower in the hierarchy, and you must remove
iw.res, since it references he.res, and iw_IL.res, since it depends on it (and
also references he_IL.res).

Unfortunately, the jar tool in the JDK provides no way to remove items from a
jar file. Thus you have to extract the resources, remove the ones you don't
want, and then create a new jar file with the remaining resources. See the jar
tool information for how to do this. Before re-jarring the files, be sure to
thoroughly test your application with the remaining resources, making sure each
required resource is present.

#### Using additional resource files with ICU4J

> :point_right: **Note**: Resource file formats can change across releases of ICU4J!
>
> *The format of ICU4J resources is not part of the API. Clients who develop their
> own resources for use with ICU4J should be prepared to regenerate them when they
> move to new releases of ICU4J.*

We are still developing ICU4J's resource mechanism.
Currently it is not possible
to mix ICU's new binary .res resources with traditional Java-style .class or
.txt resources. We might allow for this in a future release, but since the
resource data and format are not formally supported, you run the risk of
incompatibilities with future releases of ICU4J.

Resource data in ICU4J is checked in to the repository as a jar file containing
the resource binaries, icudata.jar. This means that inspecting the contents of
these resources is difficult. They currently are compiled from ICU4C .txt file
data. You can view the contents of the ICU4C text resource files to understand
the contents of the ICU4J resources.

The files in icudata.jar get extracted to com/ibm/icu/impl/data in the build
directory when the 'core' target is built. Building the 'resources' target will
force the resources to once again be extracted. Extraction will overwrite any
corresponding resource files already in that directory.

### Building ICU4J Resources from ICU4C

#### Requirements

1. [ICU4C](http://icu-project.org/download/)

2. Compilers and tools required for [building ICU4C](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).

3. J2SE SDK version 5 or above

#### Procedure

1. Download and build ICU4C on a Windows or Linux machine. For instructions on downloading and building ICU4C, see the
   [ICU4C readme](https://htmlpreview.github.io/?https://github.com/unicode-org/icu/blob/master/icu4c/readme.html#HowToBuild).

2. Follow the remaining instructions in
   [*$icu4c_root*/source/data/icu4j-readme.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/icu4j-readme.txt).
   *$icu4c_root* is the root directory of the ICU4C source package.