---
layout: default
title: ICU Data Build Tool
nav_order: 1
parent: ICU Data
---
<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data Build Tool
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU 64 provides a tool for configuring your ICU locale data file with finer
granularity. This page explains how to use this tool to customize and reduce
your data file size.

## Overview: What is in the ICU data file?

There are hundreds of **locales** supported in ICU (including script and
region variants), and ICU supports many different **features**. For each
locale and for each feature, data is stored in one or more data files.

Those data files are compiled and then bundled into a `.dat` file called
something like `icudt64l.dat`, which is little-endian data for ICU 64. This
dat file is packaged into the `libicudata.so` on Linux or `libicudata.dll.a`
on Windows. In ICU4J, it is bundled into a jar file named `icudata.jar`.

At a high level, the size of the ICU data file corresponds to the
cross-product of locales and features, except that not all features require
locale-specific data, and not all locales require data for all features. The
data file contents can be approximately visualized like this:

<img alt="Features vs. Locales" src="../assets/features_locales.svg" style="max-width:600px" />

The `icudt64l.dat` file is 27 MiB uncompressed and 11 MiB gzipped. This file
size is too large for certain use cases, such as bundling the data file into a
smartphone app or an embedded device. This is something the ICU Data Build
Tool aims to solve.

## ICU Data Configuration File

The ICU Data Build Tool enables you to write a configuration file that
specifies what features and locales to include in a custom data bundle.

The configuration file may be written in either [JSON](http://json.org/) or
[Hjson](https://hjson.org/). To build ICU4C with custom data, set the
`ICU_DATA_FILTER_FILE` environment variable when running `runConfigureICU` on
Unix or when building the data package on Windows. For example:

    ICU_DATA_FILTER_FILE=filters.json path/to/icu4c/source/runConfigureICU Linux

**Important:** You *must* have the data sources in order to use the ICU Data
Build Tool. Check for the file icu4c/source/data/locales/root.txt. If that file
is missing, you need to download "icu4c-\*-data.zip", delete the old
icu4c/source/data directory, and replace it with the data directory from the zip
file. If there is a \*.dat file in icu4c/source/data/in, that file will be used
even if you gave ICU custom filter rules.

In order to use Hjson syntax, the `hjson` pip module must be installed on
your system. You should also consider installing the `jsonschema` module to
print messages when errors are found in your config file.

    $ pip3 install --user hjson jsonschema

To build ICU4J with custom data, you must first build ICU4C with custom data
and then generate the JAR file. For more information on building ICU4J, read the
[ICU4J Readme](../icu4j/).

### Locale Slicing

The simplest way to slice ICU data is by locale. The ICU Data Build Tool
makes it easy to select your desired locales to suit a number of use cases.

#### Filtering by Language Only

Here is a *filters.json* file that builds ICU data with support for English,
Chinese, and German, including *all* script and regional variants for those
languages:

    {
      "localeFilter": {
        "filterType": "language",
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

The *filterType* "language" only supports slicing by entire languages.
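
If the language list comes from elsewhere in your build, a filter file of this
shape can be generated instead of written by hand. The following is a minimal
sketch using only the Python standard library; the helper name
`make_language_filter` is an assumption made for illustration and is not part
of the ICU tooling.

```python
import json


def make_language_filter(languages):
    """Return a config dict with the same shape as the JSON example above.

    Hypothetical helper for illustration; not an ICU API.
    """
    return {
        "localeFilter": {
            "filterType": "language",
            # Sort and de-duplicate so the generated file is stable.
            "includelist": sorted(set(languages)),
        }
    }


if __name__ == "__main__":
    # Print the JSON; redirect stdout to filters.json to use it in a build.
    print(json.dumps(make_language_filter(["en", "zh", "de"]), indent=2))
```

Redirecting the output to *filters.json* gives a file that can be passed
through `ICU_DATA_FILTER_FILE` as shown earlier.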

##### Terminology: Includelist, Excludelist, Whitelist, Blacklist

Prior to ICU 68, use `"whitelist"` and `"blacklist"` instead of `"includelist"`
and `"excludelist"`, respectively. ICU 68 allows all four terms.

#### Filtering by Locale

For more control, use *filterType* "locale". Here is a *filters.hjson* file that
includes the same three languages as above, including regional variants, but
only the default script (e.g., Simplified Han for Chinese):

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Adding Script Variants (includeScripts = true)

You may set the *includeScripts* option to true to include all scripts for a
language while using *filterType* "locale". This results in behavior similar
to *filterType* "language". In the following JSON example, all scripts for
Chinese are included:

    {
      "localeFilter": {
        "filterType": "locale",
        "includeScripts": true,
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

If you wish to explicitly list the scripts, you may put the script code in the
locale tag in the includelist, and you do not need the *includeScripts* option
enabled. For example, in Hjson, to include Han Traditional ***but not Han
Simplified***:

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh_Hant
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

**Note:** the option *includeScripts* is only supported at the language level;
i.e., in order to include all scripts for a particular language, you must
specify the language alone, without a region tag.
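
If the same filter configuration has to work with ICU 67 or earlier, the
keyword substitution from the terminology note above can be applied
mechanically. The sketch below assumes the configuration has already been
loaded into Python dictionaries and lists; `to_legacy_keywords` is a
hypothetical helper, not part of `icutools`.

```python
# Keyword pairs taken from the terminology note above.
LEGACY_KEYS = {"includelist": "whitelist", "excludelist": "blacklist"}


def to_legacy_keywords(node):
    """Recursively rename includelist/excludelist keys for ICU 67 and earlier.

    Only dictionary keys are renamed; list items and string values such as
    locale names or resource rules are returned unchanged.
    """
    if isinstance(node, dict):
        return {LEGACY_KEYS.get(key, key): to_legacy_keywords(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [to_legacy_keywords(item) for item in node]
    return node


config = {
    "localeFilter": {
        "filterType": "locale",
        "includeScripts": True,
        "includelist": ["en", "de", "zh"],
    }
}
legacy_config = to_legacy_keywords(config)
```

Since ICU 68 accepts all four terms, a conversion like this is only needed
when targeting older releases.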

#### Removing Regional Variants (includeChildren = false)

If you wish to enumerate exactly which regional variants you wish to support,
you may use *filterType* "locale" with the *includeChildren* setting set to
false. The following *filters.hjson* file includes English (US), English
(UK), German (Germany), and Chinese (China, Han Simplified), as well as their
dependencies, *but not* other regional variants like English (Australia),
German (Switzerland), or Chinese (Taiwan, Han Traditional):

    localeFilter: {
      filterType: locale
      includeChildren: false
      includelist: [
        en_US
        en_GB
        de_DE
        zh_CN
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

Including dependencies, the above filter would include the following data files:

- root.txt
- en.txt
- en_US.txt
- en_001.txt
- en_GB.txt
- de.txt
- de_DE.txt
- zh.txt
- zh_Hans.txt
- zh_Hans_CN.txt
- zh_CN.txt

### File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset
for your application. Feature slicing is a powerful way to prune out data for
any features you are not using.

***CAUTION:*** When slicing by features, you must manually include all
dependencies. For example, if you are formatting dates, you must include not
only the date formatting data but also the number formatting data, since dates
contain numbers. Expect to spend a fair bit of time debugging your feature
filter to get it to work the way you expect it to.

The data for many ICU features live in individual files. The ICU Data Build
Tool puts similar *types* of files into categories.
The following table
summarizes the ICU data files and their corresponding features and categories:

| Feature | Category ID(s) | Data Files <br/> ([icu4c/source/data](https://github.com/unicode-org/icu/tree/main/icu4c/source/data)) | Resource Size <br/> (as of ICU 64) |
|---|---|---|---|
| Break Iteration | `"brkitr_rules"` <br/> `"brkitr_dictionaries"` <br/> `"brkitr_tree"` | brkitr/rules/\*.txt <br/> brkitr/dictionaries/\*.txt <br/> brkitr/\*.txt | 522 KiB <br/> **2.8 MiB** <br/> 14 KiB |
| Charset Conversion | `"conversion_mappings"` | mappings/\*.ucm | **4.9 MiB** |
| Collation <br/> *[more info](#collation-ucadata)* | `"coll_ucadata"` <br/> `"coll_tree"` | in/coll/ucadata-\*.icu <br/> coll/\*.txt | 511 KiB <br/> **2.8 MiB** |
| Confusables | `"confusables"` | unidata/confusables\*.txt | 45 KiB |
| Currencies | `"misc"` <br/> `"curr_supplemental"` <br/> `"curr_tree"` | misc/currencyNumericCodes.txt <br/> curr/supplementalData.txt <br/> curr/\*.txt | 3.1 KiB <br/> 27 KiB <br/> **2.5 MiB** |
| Language Display <br/> Names | `"lang_tree"` | lang/\*.txt | **2.1 MiB** |
| Language Tags | `"misc"` | misc/keyTypeData.txt <br/> misc/langInfo.txt <br/> misc/likelySubtags.txt <br/> misc/metadata.txt | 6.8 KiB <br/> 37 KiB <br/> 53 KiB <br/> 33 KiB |
| Normalization | `"normalization"` | in/\*.nrm except in/nfc.nrm | 160 KiB |
| Plural Rules | `"misc"` | misc/pluralRanges.txt <br/> misc/plurals.txt | 3.3 KiB <br/> 33 KiB |
| Region Display <br/> Names | `"region_tree"` | region/\*.txt | **1.1 MiB** |
| Rule-Based <br/> Number Formatting <br/> (Spellout, Ordinals) | `"rbnf_tree"` | rbnf/\*.txt | 538 KiB |
| StringPrep | `"stringprep"` | sprep/\*.txt | 193 KiB |
| Time Zones | `"misc"` <br/> `"zone_tree"` <br/> `"zone_supplemental"` | misc/metaZones.txt <br/> misc/timezoneTypes.txt <br/> misc/windowsZones.txt <br/> misc/zoneinfo64.txt <br/> zone/\*.txt <br/> zone/tzdbNames.txt | 41 KiB <br/> 20 KiB <br/> 22 KiB <br/> 151 KiB <br/> **2.7 MiB** <br/> 4.8 KiB |
| Transliteration | `"translit"` | translit/\*.txt | 685 KiB |
| Unicode Emoji<br/>Properties | `"uemoji"` | in/uemoji.icu | 13 KiB |
| Unicode Character <br/> Names | `"unames"` | in/unames.icu | 269 KiB |
| Unicode Text Layout | `"ulayout"` | in/ulayout.icu | 14 KiB |
| Units | `"unit_tree"` | unit/\*.txt | **1.7 MiB** |
| **OTHER** | `"cnvalias"` <br/> `"misc"` <br/> `"locales_tree"` | mappings/convrtrs.txt <br/> misc/dayPeriods.txt <br/> misc/genderList.txt <br/> misc/numberingSystems.txt <br/> misc/supplementalData.txt <br/> locales/\*.txt | 63 KiB <br/> 19 KiB <br/> 0.5 KiB <br/> 5.6 KiB <br/> 228 KiB <br/> **2.4 MiB** |

#### Additive and Subtractive Modes

The ICU Data Build Tool allows two strategies for selecting features:
*additive* mode and *subtractive* mode.

The default is to use subtractive mode. This means that all ICU data is
included, and your configurations can remove or change data from that baseline.
Additive mode means that you start with an *empty* ICU data file, and you must
explicitly add the data required for your application.

There are two concrete differences between additive and subtractive mode:

|                         | Additive    | Subtractive |
|-------------------------|-------------|-------------|
| Default Feature Filter  | `"exclude"` | `"include"` |
| Default Resource Filter | `"-/"`, `"+/%%ALIAS"`, `"+/%%Parent"` | `"+/"` |

To enable additive mode, add the following setting to your filter file:

    strategy: "additive"

**Caution:** If using `"-/"` or similar top-level exclusion rules, be aware of
the fields `"+/%%Parent"` and `"+/%%ALIAS"`, which are required in locale tree
resource bundles. Excluding these paths may cause unexpected locale fallback
behavior.

#### Filter Types

You may list *filters* for each category in the *featureFilters* section of
your config file.
What follows are examples of the possible types of filters.

##### Inclusion Filter

To include a category, use the string `"include"` as your filter.

    featureFilters: {
      locales_tree: include
    }

If the category is a locale tree (ends with `_tree`), the inclusion filter
resolves to the `localeFilter`; for more information, see the section
"Locale-Tree Categories." Otherwise, the inclusion filter causes all files in
the category to be included.

**NOTE:** When subtractive mode is used (default), all categories implicitly
start with `"include"` as their filter.

##### Exclusion Filter

To exclude an entire category, use *filterType* "exclude". For example, to
exclude all confusables data:

    featureFilters: {
      confusables: {
        filterType: exclude
      }
    }

Since ICU 65, you can also write simply:

    featureFilters: {
      confusables: exclude
    }

**NOTE:** When additive mode is used, all categories implicitly start with
`"exclude"` as their filter.

##### File Name Filter

To include or exclude specific files within a category, use the file name
filter, which is the default type of filter when *filterType* is not specified.
For example, to include the Burmese break iteration dictionary but not any
other dictionaries:

    featureFilters: {
      brkitr_dictionaries: {
        includelist: [
          burmesedict
        ]
      }
    }

Do *not* include directories or file extensions. They will be added
automatically for you. Note that all files in a particular category have the
same directory and extension.

You can use either `"includelist"` or `"excludelist"` for the file name filter.
*If using ICU 67 or earlier, see note above regarding allowed keywords.*

##### Regex Filter

To exclude filenames matching a certain regular expression, use *filterType*
"regex".
For example, to reject the CJK-specific break iteration rules:

    featureFilters: {
      brkitr_rules: {
        filterType: regex
        excludelist: [
          ^.*_cj$
        ]
      }
    }

The Python standard library [*re*
module](https://docs.python.org/3/library/re.html) is used for evaluating the
regular expressions. In case the regular expression engine is changed in the
future, however, you are encouraged to restrict yourself to a simple set of
regular expression operators.

As above, do not include directories or file extensions, and you can use
either an includelist or an excludelist.

##### Union Filter

You can combine the results of multiple filters with *filterType* "union".
This filter matches files that match *at least one* of the provided filters.
The syntax is:

    {
      filterType: union
      unionOf: [
        { /* filter 1 */ },
        { /* filter 2 */ },
        // ...
      ]
    }

This filter type is useful for combining "locale" filters with different
includeScripts or includeChildren options.

#### Locale-Tree Categories

Several categories have the `_tree` suffix. These categories are for "locale
trees": they contain locale-specific data. ***The [localeFilter configuration
option](#locale-slicing) sets the default file filter for all `_tree`
categories.***

If you want to include different locales for different locale file trees, you
can override their filter in the *featureFilters* section of the config file.
For example, to include only Italian data for currency symbols *instead of*
the common locales specified in *localeFilter*, you can do the following:

    featureFilters: {
      curr_tree: {
        filterType: locale
        includelist: [
          it
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

You can exclude an entire `_tree` category without affecting other categories.
For example, to exclude region display names:

    featureFilters: {
      region_tree: {
        filterType: exclude
      }
    }

Note that you are able to use any of the other filter types for `_tree`
categories, but you must be very careful that you are including all of the
correct files. For example, `en_GB` requires `en_001`, and you must always
include `root`. If you use the "language" or "locale" filter types, this
logic is done for you.

### Resource Bundle Slicing (fine-grained features)

The third section of the ICU filter config file is *resourceFilters*. With
this section, you can dive inside resource bundle files to remove even more
data.

You can apply resource filters to all locale tree categories as well as to
categories that include resource bundles, such as the `"misc"` category.

For example, consider measurement units. There is one unit file per locale (example:
[en.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unit/en.txt)),
and that file contains data for all measurement units in CLDR. However, if
you are only formatting distances, for example, you may need the data for only
a small set of units.

Here is how you could include units of length in the "short" style but no
other units:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/unitsShort/length
        ]
      }
    ]

Conceptually, the rules are applied from top to bottom. First, all data for
all three styles of units are removed, and then the short length units are
added back.

**NOTE:** In subtractive mode, resource paths are *included* by default. In
additive mode, resource paths are *excluded* by default.

#### Wildcard Character

You can use the wildcard character (`*`) to match a piece of the resource
path.
For example, to include length units for all three styles, you can do:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/*/length
        ]
      }
    ]

The wildcard must be the only character in its path segment. Future ICU
versions may expand the syntax.

#### Resource Filter for Specific File

The resource filter object takes an optional *files* setting which accepts a
file filter in the same syntax used above for file filtering. For example, if
you wanted to apply a filter to misc/supplementalData.txt, you could do the
following (this example removes calendar data):

    resourceFilters: [
      {
        categories: ["misc"]
        files: {
          includelist: ["supplementalData"]
        }
        rules: [
          -/calendarData
        ]
      }
    ]

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Combining Multiple Resource Filter Specs

You can also list multiple resource filter objects in the *resourceFilters*
array; the filters are added from top to bottom.
For example, here is an
advanced configuration that includes "mile" for en-US and "kilometer" for
en-CA; this also makes use of the *files* option:

    resourceFilters: [
      {
        categories: ["unit_tree"]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_US"]
        }
        rules: [
          +/*/length/mile
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_CA"]
        }
        rules: [
          +/*/length/kilometer
        ]
      }
    ]

The above example would give en-US these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile

and en-CA these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/kilometer

In accordance with *filterType* "locale", the parent locales *en* and *root*
would get both units; this is required since both en-US and en-CA may inherit
from the parent locale:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile
    +/*/length/kilometer

## Debugging Tips

**Run Python directly:** If you do not want to wait for ./runConfigureICU to
finish, you can directly re-generate the rules using your filter file with the
following command line run from *icu4c/source*.

    $ PYTHONPATH=python python3 -m icutools.databuilder \
      --mode=gnumake --src_dir=data > data/rules.mk

**Install jsonschema:** Install the `jsonschema` pip package to get warnings
about problems with your filter file.

**See what data is being used:** ICU is instrumented to allow you to trace
which resources are used at runtime. This can help you determine what data you
need to include. For more information, see [tracing.md](tracing.md).

**Inspect data/rules.mk:** The Python script outputs the file *rules.mk*
inside *icu4c/source/data*.
To see what is going to get built, you can inspect
that file. First build ICU normally, and copy *rules.mk* to
*rules_default.mk*. Then build ICU with your filter file. Now you can take the
diff between *rules_default.mk* and *rules.mk* to see exactly what your filter
file is removing.

**Inspect the output:** After a `make clean` and `make` with a new *rules.mk*,
you can look inside the directory *icu4c/source/data/out* to see the files
that got built.

**Inspect the compiled resource filter rules:** If you are using a resource
filter, the resource filter rules get compiled for each individual locale
inside *icu4c/source/data/out/tmp/filters*. You can look at those files to see
what filter rules are being applied to each individual locale.

**Run genrb in verbose mode:** For debugging a resource filter, you can run
genrb in verbose mode to see which resources got stripped. To do this, first
inspect the make output and find a command line like this:

    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/genrb --filterDir ./out/tmp/filters/unit_tree -s ./unit -d ./out/build/icudt64l/unit/ -i ./out/build/icudt64l --usePoolBundle ./out/build/icudt64l/unit/ -k en.txt

Copy that command line and re-run it from *icu4c/source/data* with the `-v`
flag added to the end. The command will print out exactly which resource paths
are being included and excluded as well as a model of the filter rules applied
to this file.

**Inspect .res files with derb:** The `derb` tool can convert .res files back
to .txt files after filtering. For example, to convert the above unit res file
back to a txt file, you can run this command from *icu4c/source*:

    LD_LIBRARY_PATH=lib bin/derb data/out/build/icudt64l/unit/en.res

That will produce a file *en.txt* in your current directory, which is the
original *data/unit/en.txt* but after resource filters were applied.

*Tip:* derb expects your res files to be rooted in a directory named
`icudt64l` (corresponding to your current ICU version and endianness). If your
files are not in such a directory, derb fails with U_MISSING_RESOURCE_ERROR.

**Put complex rules first** and **use the wildcard `*` sparingly:** The order
of the filter rules matters a great deal in how effective your data size
reduction can be, and the wildcard `*` can sometimes produce behavior that is
tricky to reason about. For example, these three lists of filter rules look
similar at first glance but actually produce different output:

<table>
<tr>
<th>Unit Resource Filter Rules</th>
<th>Unit Resource Size</th>
<th>Commentary</th>
<th>Result</th>
</tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/digital/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>77 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from digital units. Then, remove duration
unit patterns and long and narrow forms.
</td><td>
Digital units in short form are included; all other units are removed.
</td></tr>
<tr><td><pre>
-/durationUnits
-/units
-/unitsNarrow
-/*/*
+/*/digital
-/*/digital/*/dnam
</pre></td><td>125 KiB</td><td>
First, remove duration unit patterns and long and narrow forms. Then, remove
all unit types. Then, add back digital units across all unit widths. Then,
remove display names from digital units.
</td><td>
Digital units are included <em>in all widths</em>; all other units are removed.
</td></tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/*/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>191 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from all units. Then, remove duration unit
patterns and long and narrow forms.
</td><td>
Digital units in short form are included, as is the <em>tree structure</em>
for all other units, even though the other units have no real data.
</td></tr>
</table>

By design, empty tree structure is retained in the unit bundle. This is
because there are numerous instances in ICU data where the presence of an
empty tree carries meaning. However, it means that you must be careful when
building resource filter rules in order to achieve the optimal data bundle
size.

Using the `-v` option in genrb (described above) is helpful when debugging
these types of issues.

## Other Features of the ICU Data Build Tool

While data filtering is the primary reason the ICU Data Build Tool was
developed, there are additional use cases.

### Running Data Build without Configure/Make

You can build the dat file outside of the ICU build system by directly
invoking the Python icutools.databuilder. Run the following command to see the
help text for the CLI tool:

    $ PYTHONPATH=path/to/icu4c/source/python python3 -m icutools.databuilder --help

### Collation UCAData

For using collation (sorting and searching) in any language, the "root"
collation data file must be included. It provides the Unicode CLDR default
sort order for all code points, and forms the basis for language-specific
tailorings as well as for custom collators built at runtime.

There are two versions of the root collation data file:

- ucadata-unihan.txt (compiled size: 511 KiB)
- ucadata-implicithan.txt (compiled size: 178 KiB)

The unihan version sorts Han characters in radical-stroke order according to
Unicode, which is a somewhat useful default sort order, especially for use
with non-CJK languages.
The implicithan version sorts Han characters in the
order of their Unicode assignment, which is similar to radical-stroke order
for common characters but arbitrary for others. For more information, see
[UTS #10 §10.1.3](https://www.unicode.org/reports/tr10/#Implicit_Weights).

By default, the unihan version is used. The unihan version of the data file
is much larger than that for implicithan, so if you need collation but also
small data, then you may want to select the implicithan version. To use the
implicithan version, put the following setting in your *filters.json* file:

    {
      "collationUCAData": "implicithan"
    }

### Disable Pool Bundle

By default, ICU uses a "pool bundle" to store strings shared between locales.
This saves space and is recommended for most users. However, when developing
a system where locale data files may be added "on the fly" and not included in
the original ICU distribution, those additional data files may not be able to
use a pool bundle due to name collisions with the existing pool bundle.

To disable the pool bundle in the current ICU build, put the following setting
in your *filters.json* file:

    {
      "usePoolBundle": false
    }

### File Substitution

Using the configuration file, you can perform whole-file substitutions. For
example, suppose you want to replace the transliteration rules for
*Zawgyi_my*.
You could create a directory called `my_icu_substitutions`
containing your new `Zawgyi_my.txt` rule file, and then put this in your
configuration file:

    fileReplacements: {
      directory: "/path/to/my_icu_substitutions"
      replacements: [
        {
          src: "Zawgyi_my.txt"
          dest: "translit/Zawgyi_my.txt"
        },
        "misc/dayPeriods.txt"
      ]
    }

`directory` should either be an absolute path, or a path starting with one of
the following, and it should not contain a trailing slash:

- "$SRC" for the *icu4c/source/data* directory in the source tree
- "$FILTERS" for the directory containing filters.json
- "$CWD" for your current working directory

When the entry in the `replacements` array is an object, the `src` and `dest`
fields indicate, for each file in the source directory (`src`), what file in
the ICU hierarchy it should replace (`dest`). When the entry is a string, the
same relative path is used for both `src` and `dest`.

Whole-file substitution happens before all other filters are applied.
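
To make the two entry forms concrete, the following sketch expands a
`replacements` array into explicit source and destination pairs. It only
models the rule stated above and is not the build tool's own code; the
function name `normalize_replacements` is an assumption made for illustration.

```python
def normalize_replacements(replacements):
    """Expand each `replacements` entry into an explicit (src, dest) pair.

    An object entry supplies `src` and `dest`; a bare string entry uses the
    same relative path for both. Illustrative model only, not icutools code.
    """
    pairs = []
    for entry in replacements:
        if isinstance(entry, str):
            pairs.append((entry, entry))
        else:
            pairs.append((entry["src"], entry["dest"]))
    return pairs


# The same entries as in the fileReplacements example above.
pairs = normalize_replacements([
    {"src": "Zawgyi_my.txt", "dest": "translit/Zawgyi_my.txt"},
    "misc/dayPeriods.txt",
])
for src, dest in pairs:
    print(f"{src} -> {dest}")
```

Here `src` is relative to the configured `directory`, and `dest` is the path
of the file it replaces in the ICU data hierarchy.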