---
layout: default
title: Unicode Update
parent: Release & Milestone Tasks
grand_parent: Contributors
nav_order: 130
---

<!--
© 2021 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Unicode Update
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

The International Components for Unicode (ICU) implement the Unicode Standard
and many of its Standard Annexes and Technical Standards,
and are updated to each new Unicode version.
Usually, the ICU team participates in the Unicode beta process by
updating to a beta snapshot of the new Unicode version and testing it thoroughly.
In the past, this has sometimes uncovered problems that could be
fixed before the release of the new Unicode version.

(Note that ICU does not provide any access to Unihan data,
mostly because of low demand and the large size of the Unihan data.)

## Update process

The last several updates are recorded in a
[change log for Unicode updates](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/changes.txt).

For each new Unicode version, during the beta period:

* Copy the change log for the previous version to the top of the change log file.
* Adjust the versions, tickets, URLs, and paths.
* Work through the steps listed in the log, top to bottom, adjusting the log as necessary.
* Report problems to the UTC and/or CLDR and/or ICU.

Before the data is final, “turn the crank” several more times,
using appropriate subsets of the steps.

At the start of the process, most of the Unicode data files are copied into the ICU repository, either
without modification or, for some files, with comments removed and lines merged
to reduce their size.

Some of the data files are not part of the Unicode release but are output from
various Unicode Tools, as noted in the change log.
(See also https://github.com/unicode-org/unicodetools)

Note: We have looked at using the [UCD XML](https://www.unicode.org/ucd/#UCDinXML) files,
but decided against it and instead developed a simpler format for a combined Unicode data file.
See https://icu.unicode.org/design/props/ppucd#TOC-Why-not-UCD-XML-files-
(There was an outdated, experimental, partial UCD XML parser here:
<https://github.com/unicode-org/icu-docs/tree/main/design/properties/genudata>)

The ICU Unicode tools parse the text files, process the data somewhat, and write
binary data for runtime use. Most of these tools live in a
[source tree](https://github.com/unicode-org/icu/tree/main/tools/unicode) separate
from the ICU4C/ICU4J sources, and link with ICU4C.

The following steps are necessarily manual:

* New property values and properties need to be reviewed.
* For new property values, enum constants are added to the API.
* For new properties, APIs are added and the tools are modified to write the
  additional data into new fields in the data structures; sometimes new data
  structures need to be developed for new properties.
* Some properties are not exposed via simple, direct data access APIs but
  in more high-level APIs (like case mapping and normalization functions).
* Sometimes changes in which property aliases are canonical vs. actual aliases
  require manual changes to helper files or tools.

New properties (whether they are supported via dedicated API or not) should be added to the
[Properties User Guide chapter](https://unicode-org.github.io/icu/userguide/strings/properties).

### Bazel build process

The tools for building ICU data for Unicode properties are in a separate subtree of the ICU repo.
They depend on parts of the ICU libraries and generate files that go back into the source tree
in order to make updated properties available to higher-level parts of the library and tools.

In the past, we bootstrapped this by doing a `make install` on ICU with the old data,
using cmake to build the tools, running some of the tools with their output
going back into the source tree, rebuilding ICU and the tools, running more tools, etc.

This was very manual and cumbersome.

Instead, starting with ICU 70 (2021),
we now use the [Bazel build system](https://bazel.build/) to build only small parts of the libraries,
just enough to build and run the initial tools.
We still need a layer outside of Bazel in order to copy the tool output into the source tree,
because Bazel on its own does not allow modifying the source tree.
We use a shell script to automate alternately building tools and copying files.

This simplifies the process.

It should also make it much easier to customize Unicode properties,
for example by patching ppucd.txt with real properties for PUA (private use) characters.

Finally, it should make it easier to modify the binary data file format for a property
because we build the library code that depends on the data only after generating that data.

For the initial setup of this Bazel build system for ICU see
https://unicode-org.atlassian.net/browse/ICU-21117 “sane build system for Unicode data”

This was completed while working on
https://unicode-org.atlassian.net/browse/ICU-21635 “Unicode 14”

#### Bazel setup

It should be possible to run the `bazel` command directly,
but the Bazel team recommends using the `bazelisk` wrapper.
It downloads and runs the latest version of Bazel, or,
if the root folder contains a .bazeliskrc file with an entry like
```
USE_BAZEL_VERSION=3.7.1
```
then it downloads that specific version. If there are any incompatible changes in Bazel behavior,
then this insulates us from those.

We do have an $ICU_SRC/.bazeliskrc file with such a line.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and copying that version number into the config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

Right in $ICU_SRC we also have a file called WORKSPACE which tells Bazel that
our repo root is also the root of its build system.
We build library “targets” relative to that. For example,
`//icu4c/source/common:normalizer2` refers to the cc_library named `normalizer2` in
`$ICU_SRC/icu4c/source/common/BUILD`.

## Testing

The ICU test suites include some tests for Unicode data. Some just check the
data from the API against the original .txt files. Some tests simply check for
certain hardcoded values, which have to be updated when those values change
deliberately. Other tests perform consistency checks between some properties, or
between different implementations.

There is a program as a part of CLDR that uses regular expressions to test the
segmentation rules and properties (LineBreak, WordBreak, etc). That is, there is
a regular expression corresponding to each of the rules, and a brute force
evaluation of them. That is used to generate the tables and test data. The
segmentation rules in ICU are later modified by hand to match the
specifications; this cannot be automated, because there are some areas where
the rules don't correspond 1:1 with the spec. There is a series of ICU
consistency tests for those rules. ICU also includes regression tests with
"golden files" that are used to detect unanticipated side effects of revisions
to the rules.
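
The first kind of test described above, checking data from the API against the original .txt files, can be sketched roughly as follows. This is a minimal illustration, not ICU's actual test code: it uses Python's standard `unicodedata` module as a stand-in for the ICU property API, the three sample lines are inlined in UnicodeData.txt format rather than read from a file, and the function names `parse_general_categories` and `find_mismatches` are hypothetical.

```python
import unicodedata

# Three sample lines in UnicodeData.txt format (semicolon-separated fields;
# field 0 is the code point in hex, field 2 is the General_Category).
SAMPLE_UNICODE_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
"""


def parse_general_categories(text):
    """Map code point -> General_Category from UnicodeData.txt-style lines."""
    result = {}
    for line in text.splitlines():
        # Other UCD files use '#' comments; stripping them here is harmless.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split(";")
        result[int(fields[0], 16)] = fields[2]
    return result


def find_mismatches(expected):
    """Return (code point, file value, API value) for every disagreement."""
    mismatches = []
    for cp, gc in sorted(expected.items()):
        # Stand-in for the ICU API call (assumption: not ICU itself).
        actual = unicodedata.category(chr(cp))
        if actual != gc:
            mismatches.append((cp, gc, actual))
    return mismatches


if __name__ == "__main__":
    expected = parse_general_categories(SAMPLE_UNICODE_DATA)
    for cp, gc, actual in find_mismatches(expected):
        print(f"U+{cp:04X}: file says {gc}, API says {actual}")
```

A real test of this kind would read the complete data file and would have to make sure that the file and the library under test are at the same Unicode version; otherwise newly assigned characters show up as mismatches.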