---
layout: default
title: ICU Data Build Tool
nav_order: 1
parent: ICU Data
---
<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data Build Tool
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU 64 provides a tool for configuring your ICU locale data file with finer
granularity.  This page explains how to use this tool to customize and reduce
your data file size.

## Overview: What is in the ICU data file?

There are hundreds of **locales** supported in ICU (including script and
region variants), and ICU supports many different **features**.  For each
locale and for each feature, data is stored in one or more data files.

Those data files are compiled and then bundled into a `.dat` file called
something like `icudt64l.dat`, which is little-endian data for ICU 64. This
dat file is packaged into the `libicudata.so` on Linux or `libicudata.dll.a`
on Windows. In ICU4J, it is bundled into a jar file named `icudata.jar`.

At a high level, the size of the ICU data file corresponds to the
cross-product of locales and features, except that not all features require
locale-specific data, and not all locales require data for all features. The
data file contents can be approximately visualized like this:

<img alt="Features vs. Locales" src="../assets/features_locales.svg" style="max-width:600px" />

The `icudt64l.dat` file is 27 MiB uncompressed and 11 MiB gzipped.  This file
size is too large for certain use cases, such as bundling the data file into a
smartphone app or an embedded device.  This is something the ICU Data Build
Tool aims to solve.

## ICU Data Configuration File

The ICU Data Build Tool enables you to write a configuration file that
specifies what features and locales to include in a custom data bundle.

The configuration file may be written in either [JSON](http://json.org/) or
[Hjson](https://hjson.org/).  To build ICU4C with custom data, set the
`ICU_DATA_FILTER_FILE` environment variable when running `runConfigureICU` on
Unix or when building the data package on Windows.  For example:

    ICU_DATA_FILTER_FILE=filters.json path/to/icu4c/source/runConfigureICU Linux

**Important:** You *must* have the data sources in order to use the ICU Data
Build Tool. Check for the file `icu4c/source/data/locales/root.txt`. If that file
is missing, you need to download `icu4c-*-data.zip`, delete the old
`icu4c/source/data` directory, and replace it with the data directory from the zip
file. If there is a `*.dat` file in `icu4c/source/data/in`, that file will be used
even if you gave ICU custom filter rules.

In order to use Hjson syntax, the `hjson` pip module must be installed on
your system.  You should also consider installing the `jsonschema` module to
print messages when errors are found in your config file.

    $ pip3 install --user hjson jsonschema

To build ICU4J with custom data, you must first build ICU4C with custom data
and then generate the JAR file.  For more information, read
[icu4j-readme.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/icu4j-readme.txt).

### Locale Slicing

The simplest way to slice ICU data is by locale.  The ICU Data Build Tool
makes it easy to select your desired locales to suit a number of use cases.

#### Filtering by Language Only

Here is a *filters.json* file that builds ICU data with support for English,
Chinese, and German, including *all* script and regional variants for those
languages:

    {
      "localeFilter": {
        "filterType": "language",
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }
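
If the language list comes from elsewhere in your build, the same file can be generated instead of written by hand. The following is a minimal sketch using only the Python standard library; the helper name `make_language_filter` is hypothetical and not part of the ICU tooling.

```python
import json


def make_language_filter(languages):
    """Build a filter config that keeps whole languages (all scripts and regions)."""
    return {
        "localeFilter": {
            "filterType": "language",
            # Sorted only to keep the generated file stable between runs.
            "includelist": sorted(languages),
        }
    }


config = make_language_filter(["en", "zh", "de"])
filters_json = json.dumps(config, indent=2)
print(filters_json)  # redirect this output to filters.json
```

The printed text can be saved as *filters.json* and passed to the build through `ICU_DATA_FILTER_FILE` as described above.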

The *filterType* "language" only supports slicing by entire languages.

##### Terminology: Includelist, Excludelist, Whitelist, Blacklist

Prior to ICU 68, use `"whitelist"` and `"blacklist"` instead of `"includelist"`
and `"excludelist"`, respectively. ICU 68 allows all four terms.
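
If one filter file has to serve both older and newer ICU versions, the newer keywords can be translated mechanically. This sketch assumes the filter has already been loaded into Python dictionaries and lists; `to_legacy_keywords` is a hypothetical helper, not something shipped with ICU.

```python
# Keyword pairs taken from the note above.
LEGACY_KEYS = {"includelist": "whitelist", "excludelist": "blacklist"}


def to_legacy_keywords(node):
    """Return a copy of a filter config with pre-ICU-68 keyword names."""
    if isinstance(node, dict):
        # Only dictionary keys are renamed; values are processed recursively.
        return {LEGACY_KEYS.get(key, key): to_legacy_keywords(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [to_legacy_keywords(item) for item in node]
    return node


modern = {"localeFilter": {"filterType": "language", "includelist": ["en", "de"]}}
legacy = to_legacy_keywords(modern)
```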

#### Filtering by Locale

For more control, use *filterType* "locale".  Here is a *filters.hjson* file that
includes the same three languages as above, including regional variants, but
only the default script (e.g., Simplified Han for Chinese):

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Adding Script Variants (includeScripts = true)

You may set the *includeScripts* option to true to include all scripts for a
language while using *filterType* "locale".  This results in behavior similar
to *filterType* "language".  In the following JSON example, all scripts for
Chinese are included:

    {
      "localeFilter": {
        "filterType": "locale",
        "includeScripts": true,
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

If you wish to explicitly list the scripts, you may put the script code in the
locale tag in the includelist, and you do not need the *includeScripts* option
enabled.  For example, in Hjson, to include Han Traditional ***but not Han
Simplified***:

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh_Hant
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

**Note:** the option *includeScripts* is only supported at the language level;
i.e., in order to include all scripts for a particular language, you must
specify the language alone, without a region tag.

#### Removing Regional Variants (includeChildren = false)

If you wish to enumerate exactly which regional variants you wish to support,
you may use *filterType* "locale" with the *includeChildren* setting turned to
false.  The following *filters.hjson* file includes English (US), English
(UK), German (Germany), and Chinese (China, Han Simplified), as well as their
dependencies, *but not* other regional variants like English (Australia),
German (Switzerland), or Chinese (Taiwan, Han Traditional):

    localeFilter: {
      filterType: locale
      includeChildren: false
      includelist: [
        en_US
        en_GB
        de_DE
        zh_CN
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

Including dependencies, the above filter would include the following data files:

- root.txt
- en.txt
- en_US.txt
- en_001.txt
- en_GB.txt
- de.txt
- de_DE.txt
- zh.txt
- zh_Hans.txt
- zh_Hans_CN.txt
- zh_CN.txt
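
The list above can be reproduced by following each requested locale to the locales it depends on. The sketch below uses a small hand-written table (`DEPENDS_ON`) chosen only to match this example; the real tool derives these links from ICU's own data, and both names here are hypothetical.

```python
# Illustrative dependency links, written by hand to match the list above.
DEPENDS_ON = {
    "en_US": "en",
    "en_GB": "en_001",
    "en_001": "en",
    "de_DE": "de",
    "zh_CN": "zh_Hans_CN",
    "zh_Hans_CN": "zh_Hans",
    "zh_Hans": "zh",
}


def with_dependencies(locales):
    """Return the requested locales plus everything they depend on."""
    needed = {"root"}  # root is always required
    for locale in locales:
        # Walk up the chain until reaching root or an already-seen locale.
        while locale != "root" and locale not in needed:
            needed.add(locale)
            locale = DEPENDS_ON.get(locale, "root")
    return needed


files = sorted(name + ".txt" for name in
               with_dependencies(["en_US", "en_GB", "de_DE", "zh_CN"]))
```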

### File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset
for your application.  Feature slicing is a powerful way to prune out data for
any features you are not using.

***CAUTION:*** When slicing by features, you must manually include all
dependencies.  For example, if you are formatting dates, you must include not
only the date formatting data but also the number formatting data, since dates
contain numbers.  Expect to spend a fair bit of time debugging your feature
filter to get it to work the way you expect it to.

The data for many ICU features lives in individual files.  The ICU Data Build
Tool puts similar *types* of files into categories.  The following table
summarizes the ICU data files and their corresponding features and categories:

| Feature | Category ID(s) | Data Files <br/> ([icu4c/source/data](https://github.com/unicode-org/icu/tree/master/icu4c/source/data)) | Resource Size <br/> (as of ICU 64) |
|---|---|---|---|
| Break Iteration | `"brkitr_rules"` <br/> `"brkitr_dictionaries"` <br/> `"brkitr_tree"` | brkitr/rules/\*.txt <br/> brkitr/dictionaries/\*.txt <br/> brkitr/\*.txt | 522 KiB <br/> **2.8 MiB** <br/> 14 KiB |
| Charset Conversion | `"conversion_mappings"` | mappings/\*.ucm | **4.9 MiB** |
| Collation <br/> *[more info](#collation-ucadata)* | `"coll_ucadata"` <br/> `"coll_tree"` | in/coll/ucadata-\*.icu <br/> coll/\*.txt | 511 KiB <br/> **2.8 MiB** |
| Confusables | `"confusables"` | unidata/confusables\*.txt | 45 KiB |
| Currencies | `"misc"` <br/> `"curr_supplemental"` <br/> `"curr_tree"` | misc/currencyNumericCodes.txt <br/> curr/supplementalData.txt <br/> curr/\*.txt | 3.1 KiB <br/> 27 KiB <br/> **2.5 MiB** |
| Language Display <br/> Names | `"lang_tree"` | lang/\*.txt | **2.1 MiB** |
| Language Tags | `"misc"` | misc/keyTypeData.txt <br/> misc/langInfo.txt <br/> misc/likelySubtags.txt <br/> misc/metadata.txt | 6.8 KiB <br/> 37 KiB <br/> 53 KiB <br/> 33 KiB |
| Normalization | `"normalization"` | in/\*.nrm except in/nfc.nrm | 160 KiB |
| Plural Rules | `"misc"` | misc/pluralRanges.txt <br/> misc/plurals.txt | 3.3 KiB <br/> 33 KiB |
| Region Display <br/> Names | `"region_tree"` | region/\*.txt | **1.1 MiB** |
| Rule-Based <br/> Number Formatting <br/> (Spellout, Ordinals) | `"rbnf_tree"` | rbnf/\*.txt | 538 KiB |
| StringPrep | `"stringprep"` | sprep/\*.txt | 193 KiB |
| Time Zones | `"misc"` <br/> `"zone_tree"` <br/> `"zone_supplemental"` | misc/metaZones.txt <br/> misc/timezoneTypes.txt <br/> misc/windowsZones.txt <br/> misc/zoneinfo64.txt <br/> zone/\*.txt <br/> zone/tzdbNames.txt | 41 KiB <br/> 20 KiB <br/> 22 KiB <br/> 151 KiB <br/> **2.7 MiB** <br/> 4.8 KiB |
| Transliteration | `"translit"` | translit/\*.txt | 685 KiB |
| Unicode Character <br/> Names | `"unames"` | in/unames.icu | 269 KiB |
| Unicode Text Layout | `"ulayout"` | in/ulayout.icu | 14 KiB |
| Units | `"unit_tree"` | unit/\*.txt | **1.7 MiB** |
| **OTHER** | `"cnvalias"` <br/> `"misc"` <br/> `"locales_tree"` | mappings/convrtrs.txt <br/> misc/dayPeriods.txt <br/> misc/genderList.txt <br/> misc/numberingSystems.txt <br/> misc/supplementalData.txt <br/> locales/\*.txt | 63 KiB <br/> 19 KiB <br/> 0.5 KiB <br/> 5.6 KiB <br/> 228 KiB <br/> **2.4 MiB** |

#### Additive and Subtractive Modes

The ICU Data Build Tool allows two strategies for selecting features:
*additive* mode and *subtractive* mode.

The default is to use subtractive mode. This means that all ICU data is
included, and your configurations can remove or change data from that baseline.
Additive mode means that you start with an *empty* ICU data file, and you must
explicitly add the data required for your application.

There are two concrete differences between additive and subtractive mode:

|                         | Additive    | Subtractive |
|-------------------------|-------------|-------------|
| Default Feature Filter  | `"exclude"` | `"include"` |
| Default Resource Filter | `"-/"`, `"+/%%ALIAS"`, `"+/%%Parent"` | `"+/"` |

To enable additive mode, add the following setting to your filter file:

    strategy: "additive"

**Caution:** If using `"-/"` or similar top-level exclusion rules, be aware of
the fields `"+/%%Parent"` and `"+/%%ALIAS"`, which are required in locale tree
resource bundles. Excluding these paths may cause unexpected locale fallback
behavior.
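
The table above can be restated as a small lookup, which is sometimes handy in scripts that generate or lint filter files. This is only a restatement of the documented defaults; `defaults_for` is a hypothetical helper, not part of the ICU tooling.

```python
# Defaults copied from the additive/subtractive table above.
DEFAULTS = {
    "subtractive": {
        "feature_filter": "include",
        "resource_rules": ["+/"],
    },
    "additive": {
        "feature_filter": "exclude",
        "resource_rules": ["-/", "+/%%ALIAS", "+/%%Parent"],
    },
}


def defaults_for(config):
    """Look up the implicit defaults for a loaded filter config."""
    # Subtractive mode applies when no strategy is given.
    strategy = config.get("strategy", "subtractive")
    return DEFAULTS[strategy]


additive_defaults = defaults_for({"strategy": "additive"})
```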

#### Filter Types

You may list *filters* for each category in the *featureFilters* section of
your config file.  What follows are examples of the possible types of filters.

##### Inclusion Filter

To include a category, use the string `"include"` as your filter.

    featureFilters: {
      locales_tree: include
    }

If the category is a locale tree (ends with `_tree`), the inclusion filter
resolves to the `localeFilter`; for more information, see the section
"Locale-Tree Categories." Otherwise, the inclusion filter causes all files in
the category to be included.

**NOTE:** When subtractive mode is used (default), all categories implicitly
start with `"include"` as their filter.

##### Exclusion Filter

To exclude an entire category, use *filterType* "exclude".  For example, to
exclude all confusables data:

    featureFilters: {
      confusables: {
        filterType: exclude
      }
    }

Since ICU 65, you can also write simply:

    featureFilters: {
      confusables: exclude
    }

**NOTE:** When additive mode is used, all categories implicitly start with
`"exclude"` as their filter.

##### File Name Filter

To include or exclude specific files within a category, use the file name
filter, which is the default type of filter when *filterType* is not specified.
For example, to include the Burmese break iteration dictionary but not any
other dictionaries:

    featureFilters: {
      brkitr_dictionaries: {
        includelist: [
          burmesedict
        ]
      }
    }

Do *not* include directories or file extensions.  They will be added
automatically for you.  Note that all files in a particular category have the
same directory and extension.

You can use either `"includelist"` or `"excludelist"` for the file name filter.
*If using ICU 67 or earlier, see note above regarding allowed keywords.*

##### Regex Filter

To exclude filenames matching a certain regular expression, use *filterType*
"regex".  For example, to reject the CJK-specific break iteration rules:

    featureFilters: {
      brkitr_rules: {
        filterType: regex
        excludelist: [
          ^.*_cj$
        ]
      }
    }

The Python standard library [*re*
module](https://docs.python.org/3/library/re.html) is used for evaluating the
regular expressions.  In case the regular expression engine is changed in the
future, however, you are encouraged to restrict yourself to a simple set of
regular expression operators.

As above, do not include directories or file extensions, and you can use
either an includelist or an excludelist.
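
The effect of the pattern above can be checked with the *re* module before running a build. The file stems below are made up for illustration, and `search` is used here as an assumption about how the pattern is applied; with the `^` and `$` anchors present, `match` would give the same result.

```python
import re

# The pattern from the excludelist above.
pattern = re.compile(r"^.*_cj$")

# Hypothetical file stems: no directory and no extension, as the filter expects.
stems = ["line", "line_cj", "line_loose_cj", "word", "cj_notes"]

# An excludelist drops every stem that matches the pattern.
kept = [stem for stem in stems if not pattern.search(stem)]
dropped = [stem for stem in stems if pattern.search(stem)]
```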

##### Union Filter

You can combine the results of multiple filters with *filterType* "union".
This filter matches files that match *at least one* of the provided filters.
The syntax is:

    {
      filterType: union
      unionOf: [
        { /* filter 1 */ },
        { /* filter 2 */ },
        // ...
      ]
    }

This filter type is useful for combining "locale" filters with different
includeScripts or includeChildren options.
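
The "at least one" semantics can be pictured as a logical OR over predicate functions. This is a conceptual sketch only: `union` and `includelist` are hypothetical helpers that model each filter as a function from a file name to a boolean, which is an assumption and not the tool's actual class structure.

```python
def includelist(names):
    """Model a file name filter that accepts only the listed names."""
    allowed = set(names)
    return lambda name: name in allowed


def union(*filters):
    """Accept a name when at least one of the given filters accepts it."""
    return lambda name: any(accepts(name) for accepts in filters)


# Two simple filters combined into one.
matches = union(includelist(["en", "de"]), includelist(["zh_Hant"]))
results = {name: matches(name) for name in ["en", "zh_Hant", "fr"]}
```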

#### Locale-Tree Categories

Several categories have the `_tree` suffix.  These categories are for "locale
trees": they contain locale-specific data.  ***The [localeFilter configuration
option](#locale-slicing) sets the default file filter for all `_tree`
categories.***

If you want to include different locales for different locale file trees, you
can override their filter in the *featureFilters* section of the config file.
For example, to include only Italian data for currency symbols *instead of*
the common locales specified in *localeFilter*, you can do the following:

    featureFilters: {
      curr_tree: {
        filterType: locale
        includelist: [
          it
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

You can exclude an entire `_tree` category without affecting other categories.
For example, to exclude region display names:

    featureFilters: {
      region_tree: {
        filterType: exclude
      }
    }

Note that you are able to use any of the other filter types for `_tree`
categories, but you must be very careful that you are including all of the
correct files.  For example, `en_GB` requires `en_001`, and you must always
include `root`.  If you use the "language" or "locale" filter types, this
logic is done for you.

### Resource Bundle Slicing (fine-grained features)

The third section of the ICU filter config file is *resourceFilters*.  With
this section, you can dive inside resource bundle files to remove even more
data.

You can apply resource filters to all locale tree categories as well as to
categories that include resource bundles, such as the `"misc"` category.

For example, consider measurement units.  There is one unit file per locale (example:
[en.txt](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unit/en.txt)),
and that file contains data for all measurement units in CLDR.  However, if
you are only formatting distances, for example, you may need the data for only
a small set of units.

Here is how you could include units of length in the "short" style but no
other units:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/unitsShort/length
        ]
      }
    ]

Conceptually, the rules are applied from top to bottom.  First, all data for
all three styles of units are removed, and then the short length units are
added back.

**NOTE:** In subtractive mode, resource paths are *included* by default. In
additive mode, resource paths are *excluded* by default.

#### Wildcard Character

You can use the wildcard character (`*`) to match a piece of the resource
path.  For example, to include length units for all three styles, you can do:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/*/length
        ]
      }
    ]

The wildcard must be the only character in its path segment. Future ICU
versions may expand the syntax.
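
One way to reason about a rule list is a "last matching rule wins" model, sketched below. This is a simplified mental model and not genrb's actual algorithm: it assumes a rule applies to its path and everything beneath it, it only answers whether a single leaf path is kept, and it ignores how empty tree structure is retained. The function names are hypothetical.

```python
def rule_matches(rule_path, resource_path):
    """True if the rule path is a prefix of the resource path ('*' = one segment)."""
    rule_parts = [part for part in rule_path.split("/") if part]
    res_parts = [part for part in resource_path.split("/") if part]
    if len(rule_parts) > len(res_parts):
        return False
    return all(r == "*" or r == p for r, p in zip(rule_parts, res_parts))


def is_included(rules, resource_path, default=True):
    """Apply the rules top to bottom; the last matching rule decides."""
    included = default  # True models subtractive mode, False additive mode
    for rule in rules:
        sign, path = rule[0], rule[1:]
        if rule_matches(path, resource_path):
            included = (sign == "+")
    return included


rules = ["-/units", "-/unitsNarrow", "-/unitsShort", "+/*/length"]
short_length = is_included(rules, "/unitsShort/length/meter")
long_mass = is_included(rules, "/units/mass/gram")
```

Under this model, `short_length` is true and `long_mass` is false, which agrees with the description of the wildcard example above. For the authoritative answer on a real build, use genrb's verbose output described under Debugging Tips.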

#### Resource Filter for Specific File

The resource filter object takes an optional *files* setting which accepts a
file filter in the same syntax used above for file filtering.  For example, if
you wanted to apply a filter to misc/supplementalData.txt, you could do the
following (this example removes calendar data):

    resourceFilters: [
      {
        categories: ["misc"]
        files: {
          includelist: ["supplementalData"]
        }
        rules: [
          -/calendarData
        ]
      }
    ]

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Combining Multiple Resource Filter Specs

You can also list multiple resource filter objects in the *resourceFilters*
array; the filters are added from top to bottom.  For example, here is an
advanced configuration that includes "mile" for en-US and "kilometer" for
en-CA; this also makes use of the *files* option:

    resourceFilters: [
      {
        categories: ["unit_tree"]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_US"]
        }
        rules: [
          +/*/length/mile
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_CA"]
        }
        rules: [
          +/*/length/kilometer
        ]
      }
    ]

The above example would give en-US these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile

and en-CA these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/kilometer

In accordance with *filterType* "locale", the parent locales *en* and *root*
would get both units; this is required since both en-US and en-CA may inherit
from the parent locale:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile
    +/*/length/kilometer
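
The three rule lists above can be derived with a short sketch. It makes several simplifying assumptions that are not taken from the tool: each spec is reduced to a hypothetical `locales` list plus its `rules`, parent locales are found by truncating subtags, and a spec applies to a file when that file is a listed locale, one of its ancestors, or one of its descendants.

```python
def ancestors(locale):
    """'en_US' -> ['en_US', 'en', 'root'] by truncating subtags (a simplification)."""
    parts = locale.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["root"]


def applies(locale_file, targets):
    """A spec with no locale list applies to every file."""
    if targets is None:
        return True
    return any(locale_file in ancestors(target) or target in ancestors(locale_file)
               for target in targets)


def rules_for(locale_file, specs):
    """Concatenate, top to bottom, the rules of every spec that applies."""
    rules = []
    for spec in specs:
        if applies(locale_file, spec.get("locales")):
            rules.extend(spec["rules"])
    return rules


BASE = ["-/units", "-/unitsNarrow", "-/unitsShort"]
SPECS = [
    {"rules": BASE},
    {"locales": ["en_US"], "rules": ["+/*/length/mile"]},
    {"locales": ["en_CA"], "rules": ["+/*/length/kilometer"]},
]
en_rules = rules_for("en", SPECS)
```

To see the rules the tool really compiled for each locale, inspect the files under *icu4c/source/data/out/tmp/filters* as described in Debugging Tips.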

## Debugging Tips

**Run Python directly:** If you do not want to wait for ./runConfigureICU to
finish, you can directly re-generate the rules using your filter file with the
following command line run from *icu4c/source*:

    $ PYTHONPATH=python python3 -m icutools.databuilder \
      --mode=gnumake --src_dir=data > data/rules.mk

**Install jsonschema:** Install the `jsonschema` pip package to get warnings
about problems with your filter file.

**See what data is being used:** ICU is instrumented to allow you to trace
which resources are used at runtime. This can help you determine what data you
need to include. For more information, see [tracing.md](tracing.md).

**Inspect data/rules.mk:** The Python script outputs the file *rules.mk*
inside *icu4c/source/data*. To see what is going to get built, you can inspect
that file. First build ICU normally, and copy *rules.mk* to
*rules_default.mk*. Then build ICU with your filter file. Now you can take the
diff between *rules_default.mk* and *rules.mk* to see exactly what your filter
file is removing.
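
Any diff tool works for this comparison. If you would rather script it, here is a minimal sketch using Python's standard *difflib* module; the helper name and the sample text are made up for illustration and do not reflect the real layout of *rules.mk*.

```python
import difflib


def removed_lines(default_text, filtered_text):
    """Return lines present in the default rules but absent after filtering."""
    diff = difflib.ndiff(default_text.splitlines(), filtered_text.splitlines())
    # ndiff marks lines found only in the first input with a "- " prefix.
    return [line[2:] for line in diff if line.startswith("- ")]


# Hypothetical stand-ins for the contents of rules_default.mk and rules.mk.
default_rules = "a.res\nb.res\nc.res\n"
filtered_rules = "a.res\nc.res\n"
removed = removed_lines(default_rules, filtered_rules)
```

In practice you would read the two files from *icu4c/source/data* and pass their contents to the helper.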

**Inspect the output:** After a `make clean` and `make` with a new *rules.mk*,
you can look inside the directory *icu4c/source/data/out* to see the files
that got built.

**Inspect the compiled resource filter rules:** If you are using a resource
filter, the resource filter rules get compiled for each individual locale
inside *icu4c/source/data/out/tmp/filters*. You can look at those files to see
what filter rules are being applied to each individual locale.

**Run genrb in verbose mode:** For debugging a resource filter, you can run
genrb in verbose mode to see which resources got stripped. To do this, first
inspect the make output and find a command line like this:

    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/genrb --filterDir ./out/tmp/filters/unit_tree -s ./unit -d ./out/build/icudt64l/unit/ -i ./out/build/icudt64l --usePoolBundle ./out/build/icudt64l/unit/ -k en.txt

Copy that command line and re-run it from *icu4c/source/data* with the `-v`
flag added to the end. The command will print out exactly which resource paths
are being included and excluded as well as a model of the filter rules applied
to this file.

**Inspect .res files with derb:** The `derb` tool can convert .res files back
to .txt files after filtering. For example, to convert the above unit res file
back to a txt file, you can run this command from *icu4c/source*:

    LD_LIBRARY_PATH=lib bin/derb data/out/build/icudt64l/unit/en.res

That will produce a file *en.txt* in your current directory, which is the
original *data/unit/en.txt* but after resource filters were applied.

*Tip:* derb expects your res files to be rooted in a directory named
`icudt64l` (corresponding to your current ICU version and endianness). If your
files are not in such a directory, derb fails with U_MISSING_RESOURCE_ERROR.

**Put complex rules first** and **use the wildcard `*` sparingly:** The order
of the filter rules matters a great deal in how effective your data size
reduction can be, and the wildcard `*` can sometimes produce behavior that is
tricky to reason about. For example, these three lists of filter rules look
similar at first glance but actually produce different output:

<table>
<tr>
<th>Unit Resource Filter Rules</th>
<th>Unit Resource Size</th>
<th>Commentary</th>
<th>Result</th>
</tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/digital/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>77 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from digital units. Then, remove duration
unit patterns and long and narrow forms.
</td><td>
Digital units in short form are included; all other units are removed.
</td></tr>
<tr><td><pre>
-/durationUnits
-/units
-/unitsNarrow
-/*/*
+/*/digital
-/*/digital/*/dnam
</pre></td><td>125 KiB</td><td>
First, remove duration unit patterns and long and narrow forms. Then, remove
all unit types. Then, add back digital units across all unit widths. Then,
remove display names from digital units.
</td><td>
Digital units are included <em>in all widths</em>; all other units are removed.
</td></tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/*/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>191 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from all units. Then, remove duration unit
patterns and long and narrow forms.
</td><td>
Digital units in short form are included, as is the <em>tree structure</em>
for all other units, even though the other units have no real data.
</td></tr>
</table>

By design, empty tree structure is retained in the unit bundle. This is
because there are numerous instances in ICU data where the presence of an
empty tree carries meaning. However, it means that you must be careful when
building resource filter rules in order to achieve the optimal data bundle
size.

Using the `-v` option in genrb (described above) is helpful when debugging
these types of issues.

## Other Features of the ICU Data Build Tool

While data filtering is the primary reason the ICU Data Build Tool was
developed, there are additional use cases.

### Running Data Build without Configure/Make

You can build the dat file outside of the ICU build system by directly
invoking the Python icutools.databuilder.  Run the following command to see the
help text for the CLI tool:

    $ PYTHONPATH=path/to/icu4c/source/python python3 -m icutools.databuilder --help

### Collation UCAData

For using collation (sorting and searching) in any language, the "root"
collation data file must be included. It provides the Unicode CLDR default
sort order for all code points, and forms the basis for language-specific
tailorings as well as for custom collators built at runtime.

There are two versions of the root collation data file:

- ucadata-unihan.txt (compiled size: 511 KiB)
- ucadata-implicithan.txt (compiled size: 178 KiB)

The unihan version sorts Han characters in radical-stroke order according to
Unicode, which is a somewhat useful default sort order, especially for use
with non-CJK languages.  The implicithan version sorts Han characters in the
order of their Unicode assignment, which is similar to radical-stroke order
for common characters but arbitrary for others.  For more information, see
[UTS #10 §10.1.3](https://www.unicode.org/reports/tr10/#Implicit_Weights).

By default, the unihan version is used.  The unihan version of the data file
is much larger than that for implicithan, so if you need collation but also
small data, then you may want to select the implicithan version.  To use the
implicithan version, put the following setting in your *filters.json* file:

    {
      "collationUCAData": "implicithan"
    }

### Disable Pool Bundle

By default, ICU uses a "pool bundle" to store strings shared between locales.
This saves space and is recommended for most users. However, when developing
a system where locale data files may be added "on the fly" and not included in
the original ICU distribution, those additional data files may not be able to
use a pool bundle due to name collisions with the existing pool bundle.

To disable the pool bundle in the current ICU build, put the following setting
in your *filters.json* file:

    {
      "usePoolBundle": false
    }

### File Substitution

Using the configuration file, you can perform whole-file substitutions.  For
example, suppose you want to replace the transliteration rules for
*Zawgyi_my*.  You could create a directory called `my_icu_substitutions`
containing your new `Zawgyi_my.txt` rule file, and then put this in your
configuration file:

    fileReplacements: {
      directory: "/path/to/my_icu_substitutions"
      replacements: [
        {
          src: "Zawgyi_my.txt"
          dest: "translit/Zawgyi_my.txt"
        },
        "misc/dayPeriods.txt"
      ]
    }

`directory` should either be an absolute path, or a path starting with one of
the following, and it should not contain a trailing slash:

- "$SRC" for the *icu4c/source/data* directory in the source tree
- "$FILTERS" for the directory containing filters.json
- "$CWD" for your current working directory

When the entry in the `replacements` array is an object, the `src` and `dest`
fields indicate, for each file in the source directory (`src`), what file in
the ICU hierarchy it should replace (`dest`). When the entry is a string, the
same relative path is used for both `src` and `dest`.

Whole-file substitution happens before all other filters are applied.
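
The two conventions described above (directory prefixes and string shorthand entries) can be summarized in a short sketch. It illustrates the documented behavior and is not the tool's implementation; the function names and the example paths are hypothetical.

```python
import os


def resolve_directory(directory, src_dir, filters_dir):
    """Expand a leading $SRC, $FILTERS or $CWD; leave other paths untouched."""
    prefixes = {"$SRC": src_dir, "$FILTERS": filters_dir, "$CWD": os.getcwd()}
    for prefix, base in prefixes.items():
        if directory == prefix or directory.startswith(prefix + "/"):
            return base + directory[len(prefix):]
    return directory  # assumed to be an absolute path already


def normalize_replacement(entry):
    """A plain string means the same relative path for both src and dest."""
    if isinstance(entry, str):
        return {"src": entry, "dest": entry}
    return entry


resolved = resolve_directory("$FILTERS/my_icu_substitutions",
                             src_dir="/work/icu4c/source/data",
                             filters_dir="/work/config")
entries = [normalize_replacement(entry) for entry in [
    {"src": "Zawgyi_my.txt", "dest": "translit/Zawgyi_my.txt"},
    "misc/dayPeriods.txt",
]]
```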