---
layout: default
title: Unicode Update
parent: Release & Milestone Tasks
grand_parent: Contributors
nav_order: 130
---

<!--
© 2021 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Unicode Update
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

The International Components for Unicode (ICU) implement the Unicode Standard
and many of its Standard Annexes and Technical Standards,
and are updated to each new Unicode version.
Usually, the ICU team participates in the Unicode beta process by
updating to a beta snapshot of the new Unicode version and testing it thoroughly.
In the past, this has sometimes uncovered problems that could be
fixed before the release of the new Unicode version.

(Note that ICU does not provide any access to Unihan data,
mostly because of low demand and the large size of the Unihan data.)

## Update process

The last several updates are recorded in a
[change log for Unicode updates](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/changes.txt).

For each new Unicode version, during the beta period:

*   Copy the change log for the previous version to the top of the change log file.
*   Adjust the versions, tickets, URLs, and paths.
*   Work through the steps listed in the log, top to bottom, adjusting the log as necessary.
*   Report problems to the UTC and/or CLDR and/or ICU.

Before the data is final, “turn the crank” several more times,
using appropriate subsets of the steps.

At the start of the process, most of the Unicode data files are copied into the ICU repository, either
without modification or, for some files, with comments removed and lines merged
to reduce their size.
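
As a rough illustration of what "comments removed and lines merged" can mean for a semicolon-delimited data file, here is a minimal Python sketch. It is not the procedure recorded in the change log: the function name `compact_ucd_lines`, the sample lines, and the property values `Upper`, `Lower`, and `Other` are all made up for this example, and it assumes the common `range ; value # comment` layout.

```python
# Illustrative sample only; not copied from a real UCD file.
SAMPLE_LINES = [
    "# a full-line comment",
    "0041..005A ; Upper # capital letters",
    "0061..006A ; Lower",
    "006B..007A ; Lower # adjacent to the previous range",
    "",
    "00AA ; Other",
]


def compact_ucd_lines(lines):
    """Drop comments and blank lines, then merge adjacent ranges with equal values."""
    merged = []  # each entry is [start, end, value]
    for line in lines:
        data = line.split("#", 1)[0].strip()  # remove trailing or full-line comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_range, value = fields[0], ";".join(fields[1:])
        if ".." in code_range:
            start_text, end_text = code_range.split("..")
        else:
            start_text = end_text = code_range
        start, end = int(start_text, 16), int(end_text, 16)
        previous = merged[-1] if merged else None
        if previous and previous[2] == value and previous[1] + 1 == start:
            previous[1] = end  # extend the previous range instead of adding a line
        else:
            merged.append([start, end, value])
    output = []
    for start, end, value in merged:
        if start == end:
            output.append(f"{start:04X};{value}")
        else:
            output.append(f"{start:04X}..{end:04X};{value}")
    return output


if __name__ == "__main__":
    for compact_line in compact_ucd_lines(SAMPLE_LINES):
        print(compact_line)
```

The sketch only merges ranges that are directly adjacent and carry the same value, so the output still describes exactly the same code points as the input.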

Some of the data files are not part of the Unicode release but are output from
various Unicode Tools, as noted in the change log.
(See also https://github.com/unicode-org/unicodetools.)

Note: We have looked at using the [UCD XML](https://www.unicode.org/ucd/#UCDinXML) files,
but decided against them and instead developed a simpler format for a combined Unicode data file.
See https://icu.unicode.org/design/props/ppucd#TOC-Why-not-UCD-XML-files-
(There was an outdated, experimental, partial UCD XML parser here:
<https://github.com/unicode-org/icu-docs/tree/main/design/properties/genudata>)

The ICU Unicode tools parse the text files, process the data somewhat, and write
binary data for runtime use. Most of these tools live in a
[source tree](https://github.com/unicode-org/icu/tree/main/tools/unicode) separate
from the ICU4C/ICU4J sources, and link with ICU4C.

The following steps are necessarily manual:

*   New property values and properties need to be reviewed.
*   For new property values, enum constants are added to the API.
*   For new properties, APIs are added and the tools are modified to write the
    additional data into new fields in the data structures; sometimes new data
    structures need to be developed for new properties.
*   Some properties are not exposed via simple, direct data access APIs but
    via higher-level APIs (like case mapping and normalization functions).
*   Sometimes changes in which property aliases are canonical vs. actual aliases
    require manual changes to helper files or tools.

New properties (whether they are supported via dedicated API or not) should be added to the
[Properties User Guide chapter](https://unicode-org.github.io/icu/userguide/strings/properties).

### Bazel build process

The tools for building ICU data for Unicode properties are in a separate subtree of the ICU repo.
They depend on parts of the ICU libraries and generate files that go back into the source tree
in order to make updated properties available to higher-level parts of the library and tools.

In the past, we bootstrapped this by doing a `make install` on ICU with the old data,
using cmake to build the tools, running some of the tools with their output
going back into the source tree, rebuilding ICU and the tools, running more tools, etc.

This was very manual and cumbersome.

Starting with ICU 70 (2021),
we instead use the [Bazel build system](https://bazel.build/) to build only small parts of the libraries,
just enough to build and run the initial tools.
We still need a layer outside of Bazel in order to copy the tool output into the source tree,
because Bazel on its own does not allow modifying the source tree.
We use a shell script to automate alternately building tools and copying files.
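
The alternation between building and copying can be pictured with the following Python sketch. It only plans the commands and does not run them, and it is not the actual shell script: the stage list, the target labels under `//tools/example`, and the file names are hypothetical. The only assumptions about Bazel itself are that `bazelisk build <label>` builds a target and that build outputs appear under the `bazel-bin` directory, mirroring the package path.

```python
import posixpath

# Hypothetical stages: (Bazel target, output file under bazel-bin, destination in the source tree).
# None of these labels or paths are taken from the real ICU build files.
EXAMPLE_STAGES = [
    ("//tools/example:gen_first", "tools/example/first.h",
     "icu4c/source/common/first.h"),
    ("//tools/example:gen_second", "tools/example/second.icu",
     "icu4c/source/data/in/second.icu"),
]


def plan_commands(stages, icu_src):
    """Return build and copy command lines, alternating stage by stage."""
    commands = []
    for target, built_file, destination in stages:
        # Step 1: let Bazel build the generated file for this stage.
        commands.append(["bazelisk", "build", target])
        # Step 2: copy the result into the source tree, outside of Bazel,
        # so that the next stage is built against the updated file.
        commands.append([
            "cp",
            posixpath.join(icu_src, "bazel-bin", built_file),
            posixpath.join(icu_src, destination),
        ])
    return commands


if __name__ == "__main__":
    for command in plan_commands(EXAMPLE_STAGES, "/path/to/icu"):
        print(" ".join(command))
```

The order of the stages matters in such a scheme, because a later tool may depend on a file that an earlier stage copied into the source tree.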

This simplifies the process.

It should also make it much easier to customize Unicode properties,
for example by patching ppucd.txt with real properties for PUA (private use) characters.

Finally, it should make it easier to modify the binary data file format for a property
because we build the library code that depends on the data only after generating that data.

For the initial setup of this Bazel build system for ICU, see
https://unicode-org.atlassian.net/browse/ICU-21117 “sane build system for Unicode data”.

This was completed while working on
https://unicode-org.atlassian.net/browse/ICU-21635 “Unicode 14”.

#### Bazel setup

It should be possible to run the `bazel` command directly,
but the Bazel team recommends using the `bazelisk` wrapper.
It downloads and runs the latest version of Bazel, or,
if the root folder contains a .bazeliskrc file with an entry like
```
USE_BAZEL_VERSION=3.7.1
```
then it downloads that specific version. Pinning the version insulates us from
any incompatible changes in Bazel behavior.

We do have an $ICU_SRC/.bazeliskrc file with such a line.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and copying that version number into the config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

Right in $ICU_SRC we also have a file called WORKSPACE which tells Bazel that
our repo root is also the root of its build system.
We build library “targets” relative to that. For example,
`//icu4c/source/common:normalizer2` refers to the cc_library named `normalizer2` in
`$ICU_SRC/icu4c/source/common/BUILD`.
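
To make the label-to-file relationship concrete, here is a small Python sketch that splits a label into the BUILD file it points at and the target name inside that file. The helper name `resolve_label` is made up for this illustration, and it assumes the package's build file is named `BUILD`, as in the example above (Bazel also accepts `BUILD.bazel`).

```python
import posixpath


def resolve_label(label, workspace_root):
    """Map a label like //pkg/path:name to (BUILD file path, target name)."""
    if not label.startswith("//"):
        raise ValueError("expected a label starting with '//': " + label)
    package, separator, name = label[2:].partition(":")
    if not separator:
        # Bazel shorthand: //a/b is the same as //a/b:b
        name = posixpath.basename(package)
    build_file = posixpath.join(workspace_root, package, "BUILD")
    return build_file, name


if __name__ == "__main__":
    # "/path/to/icu" stands in for $ICU_SRC.
    print(resolve_label("//icu4c/source/common:normalizer2", "/path/to/icu"))
```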

## Testing

The ICU test suites include some tests for Unicode data. Some just check the
data from the API against the original .txt files. Some tests simply check for
certain hardcoded values, which have to be updated when those values change
deliberately. Other tests perform consistency checks between some properties, or
between different implementations.
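
The first kind of test, checking API results against the text data, can be sketched as follows. This is not ICU test code: it uses Python's standard `unicodedata` module as a stand-in for the ICU API, and the function name `find_category_mismatches` and the four sample lines (written in the `UnicodeData.txt` field layout, where the third field is the General_Category) are assumptions made for this illustration.

```python
import unicodedata

# A few lines in the UnicodeData.txt layout: code point; name; General_Category; ...
SAMPLE_UNICODE_DATA = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]


def find_category_mismatches(lines, lookup=unicodedata.category):
    """Compare the General_Category in each data line with what the API returns."""
    mismatches = []
    for line in lines:
        fields = line.split(";")
        code_point = int(fields[0], 16)
        expected = fields[2]
        actual = lookup(chr(code_point))
        if actual != expected:
            mismatches.append((code_point, expected, actual))
    return mismatches


if __name__ == "__main__":
    problems = find_category_mismatches(SAMPLE_UNICODE_DATA)
    print("mismatches:", problems)
```

Passing the lookup function as a parameter makes it possible to run the same check against a different implementation, which is also the idea behind consistency checks between implementations.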

There is a program, part of CLDR, that uses regular expressions to test the
segmentation rules and properties (LineBreak, WordBreak, etc.). That is, there is
a regular expression corresponding to each of the rules, and a brute-force
evaluation of them. That is used to generate the tables and test data. The
segmentation rules in ICU are later modified to match the
specifications. That has to be done by hand, because there are some areas where
the rules don't correspond 1:1 with the spec. There is a series of ICU
consistency tests for those rules. ICU also includes regression tests with
"golden files" that are used to detect unanticipated side effects of revisions
to the rules.