---
layout: default
title: Unicode Update
parent: Release & Milestone Tasks
grand_parent: Contributors
nav_order: 130
---

<!--
© 2021 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Unicode Update
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

The International Components for Unicode (ICU) implement the Unicode Standard
and many of its Standard Annexes and Technical Standards,
and are updated to each new Unicode version.
Usually, the ICU team participates in the Unicode beta process by
updating to a beta snapshot of the new Unicode version and testing it thoroughly.
In the past, this has sometimes uncovered problems that could be
fixed before the release of the new Unicode version.

(Note that ICU does not provide any access to Unihan data,
mostly because of low demand and the large size of the Unihan data.)

## Update process

The steps for the last several updates are recorded in a
[change log for Unicode updates](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/changes.txt).

For each new Unicode version, during the beta period:

*   Copy the change log for the previous version to the top of that file.
*   Adjust the versions, tickets, URLs, and paths.
*   Work through the steps listed in the log, top to bottom, adjusting the log as necessary.
*   Report problems to the UTC and/or CLDR and/or ICU.

Before the data is final, “turn the crank” several more times,
using appropriate subsets of the steps.

At the start of the process, most of the Unicode data files are copied into the ICU repository, either
without modification or, for some files, with comments removed and lines merged
to reduce their size.
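
As a rough illustration of what "comments removed and lines merged" can mean for a UCD-style
property file, the following sketch strips `#` comments and merges adjacent code point ranges
that carry the same value. It is not the ICU preprocessing tool: the function names, the output
format, and the sample input are hypothetical, and the sketch assumes the input lines are
sorted by code point.

```python
def parse_line(line):
    """Parse 'XXXX[..YYYY] ; value # comment' into (start, end, value), or None."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None                        # blank or comment-only line
    fields = [f.strip() for f in data.split(";")]
    cp_range, value = fields[0], fields[1]
    if ".." in cp_range:
        start, end = (int(x, 16) for x in cp_range.split(".."))
    else:
        start = end = int(cp_range, 16)
    return start, end, value


def compact(lines):
    """Strip comments and merge adjacent ranges that share the same value."""
    merged = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        start, end, value = entry
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, value)   # extend the previous range
        else:
            merged.append(entry)
    out = []
    for start, end, value in merged:
        if start == end:
            out.append(f"{start:04X};{value}")
        else:
            out.append(f"{start:04X}..{end:04X};{value}")
    return out


# Hypothetical sample in the style of a UCD property file.
SAMPLE = [
    "# Comment-only line",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..Z",
    "005B          ; Common # Ps  LEFT SQUARE BRACKET",
    "005C          ; Common # Po  REVERSE SOLIDUS",
    "",
    "0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..Z",
]

for row in compact(SAMPLE):
    print(row)
```

With the sample above, the two single-code-point `Common` lines collapse into one range, so
six input lines become three output lines.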

Some of the data files are not part of the Unicode release but are output from
various Unicode Tools, as noted in the change log.
(See also https://github.com/unicode-org/unicodetools)

Note: We have looked at using the [UCD XML](https://www.unicode.org/ucd/#UCDinXML) files,
but decided against them and instead developed a simpler format for a combined Unicode data file.
See <https://icu.unicode.org/design/props/ppucd#TOC-Why-not-UCD-XML-files->.
(There was an outdated, experimental, partial UCD XML parser here:
<https://github.com/unicode-org/icu-docs/tree/main/design/properties/genudata>)

The ICU Unicode tools parse the text files, process the data somewhat, and write
binary data for runtime use. Most of these tools live in a
[source tree](https://github.com/unicode-org/icu/tree/main/tools/unicode) separate
from the ICU4C/ICU4J sources, and link with ICU4C.

The following steps are necessarily manual:

*   New property values and properties need to be reviewed.
*   For new property values, enum constants are added to the API.
*   For new properties, APIs are added and the tools are modified to write the
    additional data into new fields in the data structures; sometimes new data
    structures need to be developed for new properties.
*   Some properties are not exposed via simple, direct data access APIs but
    via higher-level APIs (like case mapping and normalization functions).
*   Sometimes changes in which property aliases are canonical vs. actual aliases
    require manual changes to helper files or tools.

New properties (whether they are supported via dedicated API or not) should be added to the
[Properties User Guide chapter](https://unicode-org.github.io/icu/userguide/strings/properties).

### Bazel build process

The tools for building ICU data for Unicode properties are in a separate subtree of the ICU repo.
They depend on parts of the ICU libraries and generate files that go back into the source tree
in order to make updated properties available to higher-level parts of the library and tools.

In the past, we bootstrapped this by doing a `make install` on ICU with the old data,
using CMake to build the tools, running some of the tools with their output
going back into the source tree, rebuilding ICU and the tools, running more tools, etc.

This was very manual and cumbersome.

Starting with ICU 70 (2021),
we instead use the [Bazel build system](https://bazel.build/) to build only small parts of the libraries,
just enough to build and run the initial tools.
We still need a layer outside of Bazel in order to copy the tool output into the source tree,
because Bazel on its own does not allow modifying the source tree.
We use a shell script to automate alternately building tools and copying files.

This simplifies the process.

It should also make it much easier to customize Unicode properties,
for example by patching ppucd.txt with real properties for PUA (private use) characters.

Finally, it should make it easier to modify the binary data file format for a property
because we build the library code that depends on the data only after generating that data.

For the initial setup of this Bazel build system for ICU see
https://unicode-org.atlassian.net/browse/ICU-21117 “sane build system for Unicode data”

This was completed while working on
https://unicode-org.atlassian.net/browse/ICU-21635 “Unicode 14”

#### Bazel setup

It should be possible to run the `bazel` command directly,
but the Bazel team recommends using the `bazelisk` wrapper.
It downloads and runs the latest version of Bazel, or,
if the root folder contains a .bazeliskrc file with an entry like
```
USE_BAZEL_VERSION=3.7.1
```
then it downloads that specific version. Pinning a version this way insulates us from
incompatible changes in Bazel behavior.

We do have an $ICU_SRC/.bazeliskrc file with such a line.
Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and copying that version number into the config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
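
Before changing the pinned version, it can help to check what the config file currently says.
The following is a minimal sketch of reading that entry; the function name is hypothetical,
and it only assumes the `KEY=value` line format shown above.

```python
def pinned_bazel_version(rc_text):
    """Return the USE_BAZEL_VERSION value from .bazeliskrc-style text, or None."""
    for line in rc_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "USE_BAZEL_VERSION":
            return value.strip()
    return None


print(pinned_bazel_version("USE_BAZEL_VERSION=3.7.1\n"))  # prints 3.7.1
```

In practice the text would come from reading `$ICU_SRC/.bazeliskrc`; a result of `None` means
no version is pinned and `bazelisk` would use the latest Bazel release.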

Right in $ICU_SRC we also have a file called WORKSPACE which tells Bazel that
our repo root is also the root of its build system.
We build library “targets” relative to that. For example,
`//icu4c/source/common:normalizer2` refers to the cc_library named `normalizer2` in
$ICU_SRC/icu4c/source/common/BUILD .
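
To make the relationship between a target label and the file that defines it concrete, here is
a small sketch that splits a workspace-relative label into the BUILD file path and the target
name. The function is hypothetical and handles only `//package:target` labels; it also assumes
the file is named `BUILD`, although Bazel accepts `BUILD.bazel` as well.

```python
from pathlib import PurePosixPath


def resolve_label(label, workspace_root):
    """Split '//pkg/path:target' into (BUILD file path, target name)."""
    if not label.startswith("//"):
        raise ValueError("only workspace-relative labels are handled: " + label)
    package, sep, target = label[2:].partition(":")
    if not sep:
        # Bazel shorthand: '//a/b' means '//a/b:b'.
        target = PurePosixPath(package).name
    build_file = PurePosixPath(workspace_root) / package / "BUILD"
    return str(build_file), target


print(resolve_label("//icu4c/source/common:normalizer2", "$ICU_SRC"))
# prints ('$ICU_SRC/icu4c/source/common/BUILD', 'normalizer2')
```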

## Testing

The ICU test suites include some tests for Unicode data. Some just check the
data from the API against the original .txt files. Some tests simply check for
certain hardcoded values, which have to be updated when those values change
deliberately. Other tests perform consistency checks between some properties, or
between different implementations.
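
A test of the first kind, checking API results against the original text data, might look like
the following sketch. It uses Python's `unicodedata` module as a stand-in for the ICU API and
three hand-copied lines in `UnicodeData.txt` format as a stand-in for the data file. ICU's own
tests are written in C++ and Java and are not reproduced here; the function name is
hypothetical.

```python
import unicodedata

# A few lines in UnicodeData.txt format (semicolon-separated fields).
SAMPLE_UNICODE_DATA = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""


def find_mismatches(unicode_data_text):
    """Compare name, General_Category, ccc, and Bidi_Class against the API."""
    mismatches = []
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        cp = int(fields[0], 16)
        expected = (fields[1], fields[2], int(fields[3]), fields[4])
        c = chr(cp)
        actual = (
            unicodedata.name(c, ""),
            unicodedata.category(c),
            unicodedata.combining(c),
            unicodedata.bidirectional(c),
        )
        if actual != expected:
            mismatches.append((f"U+{cp:04X}", expected, actual))
    return mismatches


print(find_mismatches(SAMPLE_UNICODE_DATA))  # prints []
```

An empty list means the API agrees with the text data for every sampled code point; a real
test would iterate over the whole file rather than a sample.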

There is a program as a part of CLDR that uses regular expressions to test the
segmentation rules and properties (LineBreak, WordBreak, etc.). That is, there is
a regular expression corresponding to each of the rules, and a brute-force
evaluation of them. That is used to generate the tables and test data. The
segmentation rules in ICU are later modified by hand to match the
specifications, because there are some areas where
the rules don't correspond 1:1 with the spec. There is a series of ICU
consistency tests for those rules. ICU also includes regression tests with
"golden files" that are used to detect unanticipated side effects of revisions
to the rules.
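
The golden-file idea can be sketched as follows: run the segmenter over fixed inputs, render
the break positions as text, and diff the result against a stored expectation. Everything here
is a toy stand-in. `mark_breaks` is a hypothetical segmenter that only breaks where whitespace
and non-whitespace meet, not an implementation of the ICU rules, and the golden data is given
inline rather than read from a file.

```python
import difflib


def mark_breaks(text):
    """Toy segmenter: mark boundaries with '|' at whitespace transitions."""
    out = ["|"]
    for i, ch in enumerate(text):
        if i > 0 and text[i - 1].isspace() != ch.isspace():
            out.append("|")
        out.append(ch)
    if text:
        out.append("|")
    return "".join(out)


def check_against_golden(inputs, golden_lines):
    """Return unified-diff lines; an empty list means no behavior change."""
    actual = [mark_breaks(s) for s in inputs]
    return list(
        difflib.unified_diff(golden_lines, actual, "golden", "actual", lineterm="")
    )


print(mark_breaks("ab cd"))                             # prints |ab| |cd|
print(check_against_golden(["ab cd"], ["|ab| |cd|"]))   # prints []
```

Any non-empty diff after a rule change is either an intended change, in which case the golden
file is updated, or an unanticipated side effect to investigate.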