---
layout: default
title: ICU Data Build Tool
nav_order: 1
parent: ICU Data
---
<!--
© 2019 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# ICU Data Build Tool
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

ICU 64 provides a tool for configuring your ICU locale data file with finer
granularity.  This page explains how to use this tool to customize and reduce
your data file size.

## Overview: What is in the ICU data file?

There are hundreds of **locales** supported in ICU (including script and
region variants), and ICU supports many different **features**.  For each
locale and for each feature, data is stored in one or more data files.

Those data files are compiled and then bundled into a `.dat` file called
something like `icudt64l.dat`, which is little-endian data for ICU 64. This
dat file is packaged into the `libicudata.so` on Linux or `libicudata.dll.a`
on Windows. In ICU4J, it is bundled into a jar file named `icudata.jar`.

At a high level, the size of the ICU data file corresponds to the
cross-product of locales and features, except that not all features require
locale-specific data, and not all locales require data for all features. The
data file contents can be approximately visualized like this:

<img alt="Features vs. Locales" src="../assets/features_locales.svg" style="max-width:600px" />

The `icudt64l.dat` file is 27 MiB uncompressed and 11 MiB gzipped.  This file
size is too large for certain use cases, such as bundling the data file into a
smartphone app or an embedded device.  This is something the ICU Data Build
Tool aims to solve.

## ICU Data Configuration File

The ICU Data Build Tool enables you to write a configuration file that
specifies what features and locales to include in a custom data bundle.

The configuration file may be written in either [JSON](http://json.org/) or
[Hjson](https://hjson.org/).  To build ICU4C with custom data, set the
`ICU_DATA_FILTER_FILE` environment variable when running `runConfigureICU` on
Unix or when building the data package on Windows.  For example:

    ICU_DATA_FILTER_FILE=filters.json path/to/icu4c/source/runConfigureICU Linux

**Important:** You *must* have the data sources in order to use the ICU Data
Build Tool. Check for the file `icu4c/source/data/locales/root.txt`. If that file
is missing, you need to download "icu4c-\*-data.zip", delete the old
`icu4c/source/data` directory, and replace it with the data directory from the zip
file. If there is a \*.dat file in `icu4c/source/data/in`, that file will be used
even if you gave ICU custom filter rules.
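
Both checks described above can be run before starting a long build. The following is a minimal sketch and not part of ICU: `check_data_sources` is a hypothetical helper, and it assumes you pass it the path to your `icu4c/source` directory.

```python
from pathlib import Path


def check_data_sources(icu4c_source):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    data_dir = Path(icu4c_source) / "data"

    # root.txt is the marker file named above for a complete data source tree.
    if not (data_dir / "locales" / "root.txt").is_file():
        problems.append("data/locales/root.txt is missing: no data sources")

    # A prebuilt .dat file in data/in is used even when filter rules are given.
    in_dir = data_dir / "in"
    if in_dir.is_dir():
        for dat_file in sorted(in_dir.glob("*.dat")):
            problems.append(
                f"data/in/{dat_file.name} exists: filter rules would be ignored"
            )
    return problems


if __name__ == "__main__":
    # Assumption: the script is run from the directory that contains icu4c/.
    for problem in check_data_sources("icu4c/source"):
        print("WARNING:", problem)
```

An empty result only means that neither of the two conditions above was detected; it does not validate the filter file itself.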

In order to use Hjson syntax, the `hjson` pip module must be installed on
your system.  You should also consider installing the `jsonschema` module to
print messages when errors are found in your config file.

    $ pip3 install --user hjson jsonschema

To build ICU4J with custom data, you must first build ICU4C with custom data
and then generate the JAR file.  For more information on building ICU4J, read the
[ICU4J Readme](../icu4j/).

### Locale Slicing

The simplest way to slice ICU data is by locale.  The ICU Data Build Tool
makes it easy to select your desired locales to suit a number of use cases.

#### Filtering by Language Only

Here is a *filters.json* file that builds ICU data with support for English,
Chinese, and German, including *all* script and regional variants for those
languages:

    {
      "localeFilter": {
        "filterType": "language",
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

The *filterType* "language" only supports slicing by entire languages.

##### Terminology: Includelist, Excludelist, Whitelist, Blacklist

Prior to ICU 68, use `"whitelist"` and `"blacklist"` instead of `"includelist"`
and `"excludelist"`, respectively. ICU 68 allows all four terms.
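
If one filter file has to serve builds on both sides of ICU 68, the keywords can be rewritten mechanically. The following is a minimal sketch that assumes the config has already been parsed into Python dictionaries and lists; `to_legacy_keywords` is a hypothetical helper, not part of the ICU Data Build Tool.

```python
import json

# Keywords accepted from ICU 68 on, mapped to the ones required before ICU 68.
LEGACY_KEYS = {"includelist": "whitelist", "excludelist": "blacklist"}


def to_legacy_keywords(node):
    """Return a copy of a parsed filter config with the pre-ICU-68 keywords."""
    if isinstance(node, dict):
        return {
            LEGACY_KEYS.get(key, key): to_legacy_keywords(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [to_legacy_keywords(item) for item in node]
    return node  # strings, booleans, and numbers are left unchanged


config = {
    "localeFilter": {
        "filterType": "language",
        "includelist": ["en", "de", "zh"],
    }
}

print(json.dumps(to_legacy_keywords(config), indent=2))
```

The function returns a new structure and leaves the original config untouched, so both variants can be written out from the same source.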

#### Filtering by Locale

For more control, use *filterType* "locale".  Here is a *filters.hjson* file that
includes the same three languages as above, including regional variants, but
only the default script (e.g., Simplified Han for Chinese):

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Adding Script Variants (includeScripts = true)

You may set the *includeScripts* option to true to include all scripts for a
language while using *filterType* "locale".  This results in behavior similar
to *filterType* "language".  In the following JSON example, all scripts for
Chinese are included:

    {
      "localeFilter": {
        "filterType": "locale",
        "includeScripts": true,
        "includelist": [
          "en",
          "de",
          "zh"
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

If you wish to explicitly list the scripts, you may put the script code in the
locale tag in the includelist, and you do not need the *includeScripts* option
enabled.  For example, in Hjson, to include Han Traditional ***but not Han
Simplified***:

    localeFilter: {
      filterType: locale
      includelist: [
        en
        de
        zh_Hant
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

**Note:** the option *includeScripts* is only supported at the language level;
i.e., in order to include all scripts for a particular language, you must
specify the language alone, without a region tag.

#### Removing Regional Variants (includeChildren = false)

If you wish to enumerate exactly which regional variants you wish to support,
you may use *filterType* "locale" with the *includeChildren* setting turned to
false.  The following *filters.hjson* file includes English (US), English
(UK), German (Germany), and Chinese (China, Han Simplified), as well as their
dependencies, *but not* other regional variants like English (Australia),
German (Switzerland), or Chinese (Taiwan, Han Traditional):

    localeFilter: {
      filterType: locale
      includeChildren: false
      includelist: [
        en_US
        en_GB
        de_DE
        zh_CN
      ]
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

Including dependencies, the above filter would include the following data files:

- root.txt
- en.txt
- en_US.txt
- en_001.txt
- en_GB.txt
- de.txt
- de_DE.txt
- zh.txt
- zh_Hans.txt
- zh_Hans_CN.txt
- zh_CN.txt
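
The list above is the closure of the requested locales under "depends on". The sketch below reproduces it for this one example only. It is not the build tool's algorithm: the `PARENT_OR_ALIAS` table is hand-written for illustration (the real tool derives these links from the data sources), and both function names are hypothetical.

```python
# Hand-written links for this example only. Every other locale is assumed to
# depend on the locale obtained by dropping its last subtag.
PARENT_OR_ALIAS = {
    "en_GB": "en_001",      # en_GB requires en_001
    "zh_CN": "zh_Hans_CN",  # zh_CN resolves to the script-qualified locale
}


def parent_of(locale):
    """Return the locale that `locale` depends on, or None for root."""
    if locale == "root":
        return None
    if locale in PARENT_OR_ALIAS:
        return PARENT_OR_ALIAS[locale]
    # Drop the last subtag; a bare language code depends on root.
    return locale.rsplit("_", 1)[0] if "_" in locale else "root"


def with_dependencies(requested):
    """Return the requested locales plus everything they depend on."""
    needed = set()
    for locale in requested:
        # Walk up the chain until root or an already-visited locale is reached.
        while locale is not None and locale not in needed:
            needed.add(locale)
            locale = parent_of(locale)
    return needed


for name in sorted(with_dependencies(["en_US", "en_GB", "de_DE", "zh_CN"])):
    print(name + ".txt")
```

With the table above, the result contains the same eleven files as the list, in sorted rather than hierarchical order.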

### File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset
for your application.  Feature slicing is a powerful way to prune out data for
any features you are not using.

***CAUTION:*** When slicing by features, you must manually include all
dependencies.  For example, if you are formatting dates, you must include not
only the date formatting data but also the number formatting data, since dates
contain numbers.  Expect to spend a fair bit of time debugging your feature
filter to get it to work the way you expect it to.

The data for many ICU features live in individual files.  The ICU Data Build
Tool puts similar *types* of files into categories.  The following table
summarizes the ICU data files and their corresponding features and categories:

| Feature | Category ID(s) | Data Files <br/> ([icu4c/source/data](https://github.com/unicode-org/icu/tree/main/icu4c/source/data)) | Resource Size <br/> (as of ICU 64) |
|---|---|---|---|
| Break Iteration | `"brkitr_rules"` <br/> `"brkitr_dictionaries"` <br/> `"brkitr_tree"` | brkitr/rules/\*.txt <br/> brkitr/dictionaries/\*.txt <br/> brkitr/\*.txt | 522 KiB <br/> **2.8 MiB** <br/> 14 KiB |
| Charset Conversion | `"conversion_mappings"` | mappings/\*.ucm | **4.9 MiB** |
| Collation <br/> *[more info](#collation-ucadata)* | `"coll_ucadata"` <br/> `"coll_tree"` | in/coll/ucadata-\*.icu <br/> coll/\*.txt | 511 KiB <br/> **2.8 MiB** |
| Confusables | `"confusables"` | unidata/confusables\*.txt | 45 KiB |
| Currencies | `"misc"` <br/> `"curr_supplemental"` <br/> `"curr_tree"` | misc/currencyNumericCodes.txt <br/> curr/supplementalData.txt <br/> curr/\*.txt | 3.1 KiB <br/> 27 KiB <br/> **2.5 MiB** |
| Language Display <br/> Names | `"lang_tree"` | lang/\*.txt | **2.1 MiB** |
| Language Tags | `"misc"` | misc/keyTypeData.txt <br/> misc/langInfo.txt <br/> misc/likelySubtags.txt <br/> misc/metadata.txt | 6.8 KiB <br/> 37 KiB <br/> 53 KiB <br/> 33 KiB |
| Normalization | `"normalization"` | in/\*.nrm except in/nfc.nrm | 160 KiB |
| Plural Rules | `"misc"` | misc/pluralRanges.txt <br/> misc/plurals.txt | 3.3 KiB <br/> 33 KiB |
| Region Display <br/> Names | `"region_tree"` | region/\*.txt | **1.1 MiB** |
| Rule-Based <br/> Number Formatting <br/> (Spellout, Ordinals) | `"rbnf_tree"` | rbnf/\*.txt | 538 KiB |
| StringPrep | `"stringprep"` | sprep/\*.txt | 193 KiB |
| Time Zones | `"misc"` <br/> `"zone_tree"` <br/> `"zone_supplemental"` | misc/metaZones.txt <br/> misc/timezoneTypes.txt <br/> misc/windowsZones.txt <br/> misc/zoneinfo64.txt <br/> zone/\*.txt <br/> zone/tzdbNames.txt | 41 KiB <br/> 20 KiB <br/> 22 KiB <br/> 151 KiB <br/> **2.7 MiB** <br/> 4.8 KiB |
| Transliteration | `"translit"` | translit/\*.txt | 685 KiB |
| Unicode Emoji<br/>Properties | `"uemoji"` | in/uemoji.icu | 13 KiB |
| Unicode Character <br/> Names | `"unames"` | in/unames.icu | 269 KiB |
| Unicode Text Layout | `"ulayout"` | in/ulayout.icu | 14 KiB |
| Units | `"unit_tree"` | unit/\*.txt | **1.7 MiB** |
| **OTHER** | `"cnvalias"` <br/> `"misc"` <br/> `"locales_tree"` | mappings/convrtrs.txt <br/> misc/dayPeriods.txt <br/> misc/genderList.txt <br/> misc/numberingSystems.txt <br/> misc/supplementalData.txt <br/> locales/\*.txt | 63 KiB <br/> 19 KiB <br/> 0.5 KiB <br/> 5.6 KiB <br/> 228 KiB <br/> **2.4 MiB** |

#### Additive and Subtractive Modes

The ICU Data Build Tool allows two strategies for selecting features:
*additive* mode and *subtractive* mode.

The default is to use subtractive mode. This means that all ICU data is
included, and your configurations can remove or change data from that baseline.
Additive mode means that you start with an *empty* ICU data file, and you must
explicitly add the data required for your application.

There are two concrete differences between additive and subtractive mode:

|                         | Additive    | Subtractive |
|-------------------------|-------------|-------------|
| Default Feature Filter  | `"exclude"` | `"include"` |
| Default Resource Filter | `"-/"`, `"+/%%ALIAS"`, `"+/%%Parent"` | `"+/"` |

To enable additive mode, add the following setting to your filter file:

    strategy: "additive"

**Caution:** If using `"-/"` or similar top-level exclusion rules, keep the
rules `"+/%%Parent"` and `"+/%%ALIAS"` after them. The `%%Parent` and `%%ALIAS`
fields are required in locale tree resource bundles, and excluding these paths
may cause unexpected locale fallback behavior.
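
The table of defaults above can be read as a lookup keyed by strategy. The sketch below models only that lookup, under the assumption that an explicit entry in *featureFilters* always takes precedence over the default; `DEFAULTS` and `effective_feature_filter` are hypothetical names, not part of the build tool.

```python
# Defaults copied from the table above.
DEFAULTS = {
    "subtractive": {"feature": "include", "resource": ["+/"]},
    "additive": {
        "feature": "exclude",
        "resource": ["-/", "+/%%ALIAS", "+/%%Parent"],
    },
}


def effective_feature_filter(config, category):
    """Return the filter in effect for `category` in a parsed filter config."""
    # Subtractive mode is the default when no strategy is given.
    strategy = config.get("strategy", "subtractive")
    explicit = config.get("featureFilters", {}).get(category)
    if explicit is not None:
        return explicit
    return DEFAULTS[strategy]["feature"]


additive_config = {
    "strategy": "additive",
    "featureFilters": {"locales_tree": "include"},
}

print(effective_feature_filter(additive_config, "locales_tree"))  # include
print(effective_feature_filter(additive_config, "confusables"))   # exclude
print(effective_feature_filter({}, "confusables"))                # include
```

In additive mode every category you rely on has to appear in *featureFilters*, which is why the dependency caution earlier on this page matters even more there.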

#### Filter Types

You may list *filters* for each category in the *featureFilters* section of
your config file.  What follows are examples of the possible types of filters.

##### Inclusion Filter

To include a category, use the string `"include"` as your filter.

    featureFilters: {
      locales_tree: include
    }

If the category is a locale tree (ends with `_tree`), the inclusion filter
resolves to the `localeFilter`; for more information, see the section
"Locale-Tree Categories." Otherwise, the inclusion filter causes all files in
the category to be included.

**NOTE:** When subtractive mode is used (default), all categories implicitly
start with `"include"` as their filter.

##### Exclusion Filter

To exclude an entire category, use *filterType* "exclude".  For example, to
exclude all confusables data:

    featureFilters: {
      confusables: {
        filterType: exclude
      }
    }

Since ICU 65, you can also write simply:

    featureFilters: {
      confusables: exclude
    }

**NOTE:** When additive mode is used, all categories implicitly start with
`"exclude"` as their filter.

##### File Name Filter

To include or exclude specific files within a category, use the file name
filter, which is the default type of filter when *filterType* is not specified.
For example, to include the Burmese break iteration dictionary but not any
other dictionaries:

    featureFilters: {
      brkitr_dictionaries: {
        includelist: [
          burmesedict
        ]
      }
    }

Do *not* include directories or file extensions.  They will be added
automatically for you.  Note that all files in a particular category have the
same directory and extension.

You can use either `"includelist"` or `"excludelist"` for the file name filter.
*If using ICU 67 or earlier, see note above regarding allowed keywords.*

##### Regex Filter

To include or exclude filenames matching a certain regular expression, use
*filterType* "regex".  For example, to reject the CJK-specific break iteration
rules:

    featureFilters: {
      brkitr_rules: {
        filterType: regex
        excludelist: [
          ^.*_cj$
        ]
      }
    }

The Python standard library [*re*
module](https://docs.python.org/3/library/re.html) is used for evaluating the
regular expressions.  In case the regular expression engine is changed in the
future, however, you are encouraged to restrict yourself to a simple set of
regular expression operators.

As above, do not include directories or file extensions, and you can use
either an includelist or an excludelist.
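
Because the expressions are evaluated with Python's *re* module, a pattern can be tried out before running a build. The following is a minimal sketch: the file stems are illustrative rather than a listing of *brkitr/rules*, `apply_regex_excludelist` is a hypothetical helper, and the use of `re.search` is an assumption (for a pattern anchored with `^` and `$`, as here, `re.match` gives the same result).

```python
import re


def apply_regex_excludelist(stems, patterns):
    """Return the stems that match none of the excludelist patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        stem for stem in stems
        if not any(regex.search(stem) for regex in compiled)
    ]


# Illustrative stems: no directory and no extension, as the filter expects.
stems = ["char", "line", "line_cj", "line_loose_cj", "word"]

print(apply_regex_excludelist(stems, [r"^.*_cj$"]))
# ['char', 'line', 'word']
```

Anchoring both ends of the pattern keeps the outcome independent of whether the engine searches or matches from the start.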

##### Union Filter

You can combine the results of multiple filters with *filterType* "union".
This filter matches files that match *at least one* of the provided filters.
The syntax is:

    {
      filterType: union
      unionOf: [
        { /* filter 1 */ },
        { /* filter 2 */ },
        // ...
      ]
    }

This filter type is useful for combining "locale" filters with different
includeScripts or includeChildren options.
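
The "at least one" rule can be modeled in a few lines. The sketch below is a simplified model, not the tool's implementation: it supports only the file name filter (no *filterType*), the "regex" filter, and "union", leaves out the "locale" filter for brevity, and uses illustrative dictionary names. `matches` is a hypothetical function.

```python
import re


def matches(flt, name):
    """Return True if file stem `name` passes filter object `flt`."""
    kind = flt.get("filterType")
    if kind == "union":
        # A union passes a file when at least one member filter passes it.
        return any(matches(member, name) for member in flt["unionOf"])
    if kind == "regex":
        def hit(entry):
            return re.search(entry, name) is not None
    elif kind is None:
        # No filterType: plain file name filter, compared by exact stem.
        def hit(entry):
            return entry == name
    else:
        raise ValueError(f"filter type not modeled in this sketch: {kind}")

    if "includelist" in flt:
        return any(hit(entry) for entry in flt["includelist"])
    return not any(hit(entry) for entry in flt.get("excludelist", []))


union_filter = {
    "filterType": "union",
    "unionOf": [
        {"includelist": ["burmesedict"]},
        {"filterType": "regex", "includelist": ["^thai"]},
    ],
}

names = ["burmesedict", "cjdict", "khmerdict", "laodict", "thaidict"]
print([name for name in names if matches(union_filter, name)])
# ['burmesedict', 'thaidict']
```

Note that a union of excludelist filters is rarely what you want: a file passes as soon as any one member does not exclude it.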

#### Locale-Tree Categories

Several categories have the `_tree` suffix.  These categories are for "locale
trees": they contain locale-specific data.  ***The [localeFilter configuration
option](#locale-slicing) sets the default file filter for all `_tree`
categories.***

If you want to include different locales for different locale file trees, you
can override their filter in the *featureFilters* section of the config file.
For example, to include only Italian data for currency symbols *instead of*
the common locales specified in *localeFilter*, you can do the following:

    featureFilters: {
      curr_tree: {
        filterType: locale
        includelist: [
          it
        ]
      }
    }

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

You can exclude an entire `_tree` category without affecting other categories.
For example, to exclude region display names:

    featureFilters: {
      region_tree: {
        filterType: exclude
      }
    }

Note that you are able to use any of the other filter types for `_tree`
categories, but you must be very careful that you are including all of the
correct files.  For example, `en_GB` requires `en_001`, and you must always
include `root`.  If you use the "language" or "locale" filter types, this
logic is done for you.

### Resource Bundle Slicing (fine-grained features)

The third section of the ICU filter config file is *resourceFilters*.  With
this section, you can dive inside resource bundle files to remove even more
data.

You can apply resource filters to all locale tree categories as well as to
categories that include resource bundles, such as the `"misc"` category.

For example, consider measurement units.  There is one unit file per locale (example:
[en.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unit/en.txt)),
and that file contains data for all measurement units in CLDR.  However, if
you are only formatting distances, for example, you may need the data for only
a small set of units.

Here is how you could include units of length in the "short" style but no
other units:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/unitsShort/length
        ]
      }
    ]

Conceptually, the rules are applied from top to bottom.  First, all data for
all three styles of units are removed, and then the short length units are
added back.

**NOTE:** In subtractive mode, resource paths are *included* by default. In
additive mode, resource paths are *excluded* by default.

#### Wildcard Character

You can use the wildcard character (`*`) to match a piece of the resource
path.  For example, to include length units for all three styles, you can do:

    resourceFilters: [
      {
        categories: [
          unit_tree
        ]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
          +/*/length
        ]
      }
    ]

The wildcard must be the only character in its path segment. Future ICU
versions may expand the syntax.
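
The "top to bottom" reading and the wildcard can both be captured in a small conceptual model: for a given resource path, the last rule whose path is a prefix of it decides whether it is kept. This is a sketch of that reading only, not genrb's implementation. It evaluates leaf paths, says nothing about how empty tables are retained, and uses illustrative resource paths; `rule_matches` and `is_included` are hypothetical names.

```python
def rule_matches(rule_path, resource_path):
    """True if rule_path is a segment-wise prefix of resource_path."""
    rule_parts = [part for part in rule_path.split("/") if part]
    res_parts = [part for part in resource_path.split("/") if part]
    if len(rule_parts) > len(res_parts):
        return False
    # "*" stands for exactly one whole path segment.
    return all(r == "*" or r == s for r, s in zip(rule_parts, res_parts))


def is_included(rules, resource_path, default=True):
    """Apply the rules top to bottom; the last matching rule decides.

    default=True models subtractive mode, default=False additive mode.
    """
    included = default
    for rule in rules:
        sign, path = rule[0], rule[1:]
        if rule_matches(path, resource_path):
            included = sign == "+"
    return included


explicit = ["-/units", "-/unitsNarrow", "-/unitsShort", "+/unitsShort/length"]
wildcard = ["-/units", "-/unitsNarrow", "-/unitsShort", "+/*/length"]

for path in ["/unitsShort/length/meter", "/units/length/meter",
             "/unitsShort/mass/gram"]:
    print(path, is_included(explicit, path), is_included(wildcard, path))
```

Under this model the only path whose outcome differs between the two rule lists is `/units/length/meter`, which the wildcard rule adds back and the explicit rule does not.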

#### Resource Filter for Specific File

The resource filter object takes an optional *files* setting which accepts a
file filter in the same syntax used above for file filtering.  For example, if
you wanted to apply a filter to misc/supplementalData.txt, you could do the
following (this example removes calendar data):

    resourceFilters: [
      {
        categories: ["misc"]
        files: {
          includelist: ["supplementalData"]
        }
        rules: [
          -/calendarData
        ]
      }
    ]

*If using ICU 67 or earlier, see note above regarding allowed keywords.*

#### Combining Multiple Resource Filter Specs

You can also list multiple resource filter objects in the *resourceFilters*
array; the filters are added from top to bottom.  For example, here is an
advanced configuration that includes "mile" for en-US and "kilometer" for
en-CA; this also makes use of the *files* option:

    resourceFilters: [
      {
        categories: ["unit_tree"]
        rules: [
          -/units
          -/unitsNarrow
          -/unitsShort
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_US"]
        }
        rules: [
          +/*/length/mile
        ]
      },
      {
        categories: ["unit_tree"]
        files: {
          filterType: locale
          includelist: ["en_CA"]
        }
        rules: [
          +/*/length/kilometer
        ]
      }
    ]

The above example would give en-US these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile

and en-CA these resource filter rules:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/kilometer

In accordance with *filterType* "locale", the parent locales *en* and *root*
would get both units; this is required since both en-US and en-CA may inherit
from the parent locale:

    -/units
    -/unitsNarrow
    -/unitsShort
    +/*/length/mile
    +/*/length/kilometer
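
The three rule lists above can be derived by collecting, for each locale file, the rules of every spec whose *files* filter selects that file. The sketch below does this under loud simplifications: parents are found by dropping the last subtag (so *en_CA* falls back directly to *en*), child locales are not modeled, and only the "locale" form of *files* is handled. All function names are hypothetical.

```python
def ancestors(locale):
    """Return the locale followed by its parents, ending with root."""
    chain = [locale]
    while "_" in locale:
        locale = locale.rsplit("_", 1)[0]  # simplification: drop last subtag
        chain.append(locale)
    chain.append("root")
    return chain


def spec_applies(spec, locale):
    """True if the spec's *files* filter selects the file for `locale`."""
    files = spec.get("files")
    if files is None:
        return True  # no *files* setting: the spec applies to every file
    # A listed locale selects itself and, as dependencies, its parents.
    return any(locale in ancestors(entry) for entry in files["includelist"])


def rules_for(specs, locale):
    """Concatenate, top to bottom, the rules of every spec that applies."""
    rules = []
    for spec in specs:
        if spec_applies(spec, locale):
            rules.extend(spec["rules"])
    return rules


specs = [
    {"rules": ["-/units", "-/unitsNarrow", "-/unitsShort"]},
    {"files": {"filterType": "locale", "includelist": ["en_US"]},
     "rules": ["+/*/length/mile"]},
    {"files": {"filterType": "locale", "includelist": ["en_CA"]},
     "rules": ["+/*/length/kilometer"]},
]

for locale in ["en_US", "en_CA", "en", "root"]:
    print(locale, rules_for(specs, locale))
```

To see the rules the tool actually compiled for each locale, inspect the files under *icu4c/source/data/out/tmp/filters* after a build.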

## Debugging Tips

**Run Python directly:** If you do not want to wait for ./runConfigureICU to
finish, you can directly re-generate the rules using your filter file with the
following command line run from *icu4c/source*.

    $ PYTHONPATH=python python3 -m icutools.databuilder \
      --mode=gnumake --src_dir=data > data/rules.mk

**Install jsonschema:** Install the `jsonschema` pip package to get warnings
about problems with your filter file.

**See what data is being used:** ICU is instrumented to allow you to trace
which resources are used at runtime. This can help you determine what data you
need to include. For more information, see [tracing.md](tracing.md).

**Inspect data/rules.mk:** The Python script outputs the file *rules.mk*
inside *icu4c/source/data*. To see what is going to get built, you can inspect
that file. First build ICU normally, and copy *rules.mk* to
*rules_default.mk*. Then build ICU with your filter file. Now you can take the
diff between *rules_default.mk* and *rules.mk* to see exactly what your filter
file is removing.
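
Any diff tool works for this comparison. As one possibility, here is a minimal sketch using Python's standard `difflib`; the helper names are hypothetical, and the paths in the usage note assume you run it from *icu4c/source* after creating *rules_default.mk* as described above.

```python
import difflib
from pathlib import Path


def diff_rules(default_mk, filtered_mk):
    """Return unified-diff lines between the two rules files."""
    before = Path(default_mk).read_text(encoding="utf-8").splitlines()
    after = Path(filtered_mk).read_text(encoding="utf-8").splitlines()
    return list(difflib.unified_diff(
        before, after,
        fromfile=str(default_mk), tofile=str(filtered_mk), lineterm="",
    ))


def removed_lines(diff):
    """Return the lines present only in the default rules file."""
    # The first two lines of a unified diff are the ---/+++ file headers.
    return [line[1:] for line in diff[2:] if line.startswith("-")]
```

For example, `removed_lines(diff_rules("data/rules_default.mk", "data/rules.mk"))` lists the lines that your filter file caused to disappear from the build rules.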

**Inspect the output:** After a `make clean` and `make` with a new *rules.mk*,
you can look inside the directory *icu4c/source/data/out* to see the files
that got built.

**Inspect the compiled resource filter rules:** If you are using a resource
filter, the resource filter rules get compiled for each individual locale
inside *icu4c/source/data/out/tmp/filters*. You can look at those files to see
what filter rules are being applied to each individual locale.

**Run genrb in verbose mode:** For debugging a resource filter, you can run
genrb in verbose mode to see which resources got stripped. To do this, first
inspect the make output and find a command line like this:

    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/genrb --filterDir ./out/tmp/filters/unit_tree -s ./unit -d ./out/build/icudt64l/unit/ -i ./out/build/icudt64l --usePoolBundle ./out/build/icudt64l/unit/ -k en.txt

Copy that command line and re-run it from *icu4c/source/data* with the `-v`
flag added to the end. The command will print out exactly which resource paths
are being included and excluded as well as a model of the filter rules applied
to this file.

**Inspect .res files with derb:** The `derb` tool can convert .res files back
to .txt files after filtering. For example, to convert the above unit res file
back to a txt file, you can run this command from *icu4c/source*:

    LD_LIBRARY_PATH=lib bin/derb data/out/build/icudt64l/unit/en.res

That will produce a file *en.txt* in your current directory, which is the
original *data/unit/en.txt* but after resource filters were applied.

*Tip:* derb expects your res files to be rooted in a directory named
`icudt64l` (corresponding to your current ICU version and endianness). If your
files are not in such a directory, derb fails with U_MISSING_RESOURCE_ERROR.

**Put complex rules first** and **use the wildcard `*` sparingly:** The order
of the filter rules matters a great deal in how effective your data size
reduction can be, and the wildcard `*` can sometimes produce behavior that is
tricky to reason about. For example, these three lists of filter rules look
similar at first glance but actually produce different output:

<table>
<tr>
<th>Unit Resource Filter Rules</th>
<th>Unit Resource Size</th>
<th>Commentary</th>
<th>Result</th>
</tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/digital/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>77 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from digital units. Then, remove duration
unit patterns and long and narrow forms.
</td><td>
Digital units in short form are included; all other units are removed.
</td></tr>
<tr><td><pre>
-/durationUnits
-/units
-/unitsNarrow
-/*/*
+/*/digital
-/*/digital/*/dnam
</pre></td><td>125 KiB</td><td>
First, remove duration unit patterns and long and narrow forms. Then, remove
all unit types. Then, add back digital units across all unit widths. Then,
remove display names from digital units.
</td><td>
Digital units are included <em>in all widths</em>; all other units are removed.
</td></tr>
<tr><td><pre>
-/*/*
+/*/digital
-/*/*/*/dnam
-/durationUnits
-/units
-/unitsNarrow
</pre></td><td>191 KiB</td><td>
First, remove all unit types. Then, add back digital units across all unit
widths. Then, remove display names from all units. Then, remove duration unit
patterns and long and narrow forms.
</td><td>
Digital units in short form are included, as is the <em>tree structure</em>
for all other units, even though the other units have no real data.
</td></tr>
</table>

By design, empty tree structure is retained in the unit bundle. This is
because there are numerous instances in ICU data where the presence of an
empty tree carries meaning. However, it means that you must be careful when
building resource filter rules in order to achieve the optimal data bundle
size.

Using the `-v` option in genrb (described above) is helpful when debugging
these types of issues.

## Other Features of the ICU Data Build Tool

While data filtering is the primary reason the ICU Data Build Tool was
developed, there are additional use cases.

### Running Data Build without Configure/Make

You can build the dat file outside of the ICU build system by directly
invoking the Python icutools.databuilder.  Run the following command to see the
help text for the CLI tool:

    $ PYTHONPATH=path/to/icu4c/source/python python3 -m icutools.databuilder --help

### Collation UCAData

For using collation (sorting and searching) in any language, the "root"
collation data file must be included. It provides the Unicode CLDR default
sort order for all code points, and forms the basis for language-specific
tailorings as well as for custom collators built at runtime.

There are two versions of the root collation data file:

- ucadata-unihan.txt (compiled size: 511 KiB)
- ucadata-implicithan.txt (compiled size: 178 KiB)

The unihan version sorts Han characters in radical-stroke order according to
Unicode, which is a somewhat useful default sort order, especially for use
with non-CJK languages.  The implicithan version sorts Han characters in the
order of their Unicode assignment, which is similar to radical-stroke order
for common characters but arbitrary for others.  For more information, see
[UTS #10 §10.1.3](https://www.unicode.org/reports/tr10/#Implicit_Weights).

By default, the unihan version is used.  The unihan version of the data file
is much larger than that for implicithan, so if you need collation but also
small data, then you may want to select the implicithan version.  To use the
implicithan version, put the following setting in your *filters.json* file:

    {
      "collationUCAData": "implicithan"
    }

### Disable Pool Bundle

By default, ICU uses a "pool bundle" to store strings shared between locales.
This saves space and is recommended for most users. However, when developing
a system where locale data files may be added "on the fly" and not included in
the original ICU distribution, those additional data files may not be able to
use a pool bundle due to name collisions with the existing pool bundle.

To disable the pool bundle in the current ICU build, put the following setting
in your *filters.json* file:

    {
      "usePoolBundle": false
    }

### File Substitution

Using the configuration file, you can perform whole-file substitutions.  For
example, suppose you want to replace the transliteration rules for
*Zawgyi_my*.  You could create a directory called `my_icu_substitutions`
containing your new `Zawgyi_my.txt` rule file, and then put this in your
configuration file:

    fileReplacements: {
      directory: "/path/to/my_icu_substitutions"
      replacements: [
        {
          src: "Zawgyi_my.txt"
          dest: "translit/Zawgyi_my.txt"
        },
        "misc/dayPeriods.txt"
      ]
    }

`directory` should either be an absolute path, or a path starting with one of
the following, and it should not contain a trailing slash:

- "$SRC" for the *icu4c/source/data* directory in the source tree
- "$FILTERS" for the directory containing filters.json
- "$CWD" for your current working directory

When the entry in the `replacements` array is an object, the `src` and `dest`
fields indicate, for each file in the source directory (`src`), what file in
the ICU hierarchy it should replace (`dest`). When the entry is a string, the
same relative path is used for both `src` and `dest`.

Whole-file substitution happens before all other filters are applied.
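
To see which files a *fileReplacements* block refers to, the entries can be expanded into explicit source and destination pairs. The sketch below only mirrors the rules described above; it is not the tool's code, both helper names are hypothetical, and the three base directories are passed in as arguments instead of being discovered.

```python
def resolve_directory(directory, src_dir, filters_dir, cwd):
    """Expand a leading $SRC, $FILTERS, or $CWD; leave other paths as they are."""
    bases = {"$SRC": src_dir, "$FILTERS": filters_dir, "$CWD": cwd}
    for prefix, base in bases.items():
        if directory == prefix or directory.startswith(prefix + "/"):
            return base + directory[len(prefix):]
    return directory  # assumed to be an absolute path


def expand_replacements(file_replacements, src_dir, filters_dir, cwd):
    """Return (source file path, destination in the ICU hierarchy) pairs."""
    base = resolve_directory(
        file_replacements["directory"], src_dir, filters_dir, cwd
    )
    pairs = []
    for entry in file_replacements["replacements"]:
        if isinstance(entry, str):
            src = dest = entry  # a string uses the same relative path twice
        else:
            src, dest = entry["src"], entry["dest"]
        pairs.append((base + "/" + src, dest))
    return pairs


config = {
    "directory": "$FILTERS/my_icu_substitutions",
    "replacements": [
        {"src": "Zawgyi_my.txt", "dest": "translit/Zawgyi_my.txt"},
        "misc/dayPeriods.txt",
    ],
}

for source, destination in expand_replacements(
    config, src_dir="icu4c/source/data", filters_dir="/home/me/filters", cwd="."
):
    print(source, "->", destination)
```

Since `directory` must not end in a slash, the sketch joins paths with a single `/` and does no further normalization.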