* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others.  All Rights Reserved.
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

Notes:

This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.

Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)
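
One way to catch a forgotten `export` before pasting tool command lines is a small check like the one below.
This is only a sketch: the function name check_update_env is made up for illustration (it is not part of ICU),
and the variable list is copied from the `export` lines of the Unicode 15.1 section of this log.

```shell
# Hypothetical helper (not part of ICU): report which of the environment
# variables used in this log are unset or empty.
check_update_env() {
    missing=0
    for name in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICU_OUT ICUDT \
                ICU4C_DATA_IN ICU4C_UNIDATA UNICODE_TOOLS; do
        # eval-based indirect expansion works in POSIX sh as well as in bash.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return $missing
}
```

Possible use: check_update_env && echo "environment looks complete"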

Syntax of this file:

`***` - section heading
`*` - sub heading
`-` - 1st level bullet
`+` - 2nd level bullet
`=` - 1st level bullet
`->` - "the previous thing leads to...", OR a 2nd level bullet/item

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.

---------------------------------------------------------------------------- ***

Unicode 16.0 update for ICU 76

TODO
- No more hardcoded spoof checker sets: Update change log.
- In the Unicode Tools repo: Delete the org.unicode.text.tools.RecommendedSetGenerator.
- In corepropsbuilder.cpp, remove the isA9CF hack.
- Update instructions for hardcoded properties
        IDS_Unary_Operator, ID_Compat_Math_Start & ID_Compat_Math_Continue:
  + These are still hardcoded, but since ICU 75 they are tested in C++ intltest.
  + No more need to check via grep.
  + Still: If the test fails, then update the hardcoded implementation.

---------------------------------------------------------------------------- ***

Unicode 15.1 update for ICU 74

https://www.unicode.org/versions/Unicode15.1.0/
https://www.unicode.org/versions/beta-15.1.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-31.html

https://unicode-org.atlassian.net/browse/ICU-22404 Unicode 15.1
https://unicode-org.atlassian.net/browse/CLDR-16669 BRS Unicode 15.1

https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1

* Command-line environment setup

Markus:

export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src

Elango:

export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/snapshot
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
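
To double-check the uchar.h entry in the list above after editing, something like the following could be used.
It is a sketch only: the function name is hypothetical, and it assumes that uchar.h defines the version
as a quoted string in the U_UNICODE_VERSION macro.

```shell
# Hypothetical helper: print the value of the U_UNICODE_VERSION macro
# from the uchar.h file given as $1.
unicode_version_in_uchar_h() {
    sed -n 's/^#define[[:space:]][[:space:]]*U_UNICODE_VERSION[[:space:]][[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

Possible use: unicode_version_in_uchar_h $ICU_SRC/icu4c/source/common/unicode/uchar.h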

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
    so that the makefiles see the new version number.
  cd $ICU_OUT/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + new since Unicode 15.1:
    for the pre-release (alpha, beta) data files,
    download all of https://www.unicode.org/Public/draft/
    (you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders)
  + if one of us produces the alpha.zip or beta.zip collection of data files for publication,
    then we can use its contents directly (no FTP from unicode.org necessary)
  + for final-release data files, the source of truth is the set of files in
    https://www.unicode.org/Public/(version) [=UCD],
    https://www.unicode.org/Public/UCA/(version),
    https://www.unicode.org/Public/idna/(version),
    etc.
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
  + subfolders: emoji, idna, security, ucd, uca
  + whichever way you download the files:
    ~ inside ucd: extract Unihan.zip to "here" (.../UCD/ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/UCD/ucd/Unihan
    ~ TODO: for updating ICU, we should not need Unihan.zip contents, correct?
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
    or from the UCD/cldr/ output folder of the Unicode Tools:
    From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
    CLDR used modified grapheme break rules.
    This might happen again.
  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
  cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
  cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
  cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  + TODO: figure out whether we need a CLDR version of LineBreakTest.txt:
    unicodetools issue #492
- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
  + TODO: modify preparseucd.py to copy this file

* Note: Since Unicode 15.1, data files are no longer published with version suffixes
  even during the alpha or beta.
  Thus we no longer need steps & tools to remove those suffixes.
  (remove this note next time)

* process and/or copy files
- cd $ICU_SRC/tools/unicode
  py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values: [('blk', {'CJK_Ext_I'}), ('lb', {'VF', 'VI', 'AS', 'AK', 'AP'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    cd $UNIDATA_ROOT
    $ diff -u uni15.0/ucd/PropertyValueAliases.txt uni15.1/snapshot/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.1                             ; V15_1
    +blk; CJK_Ext_I                        ; CJK_Unified_Ideographs_Extension_I
    +IDSU; N                               ; No                               ; F                                ; False
    +IDSU; Y                               ; Yes                              ; T                                ; True
    +ID_Compat_Math_Continue; N            ; No                               ; F                                ; False
    +ID_Compat_Math_Continue; Y            ; Yes                              ; T                                ; True
    +ID_Compat_Math_Start; N               ; No                               ; F                                ; False
    +ID_Compat_Math_Start; Y               ; Yes                              ; T                                ; True
    +lb ; AK                               ; Aksara
    +lb ; AP                               ; Aksara_Prebase
    +lb ; AS                               ; Aksara_Start
    +lb ; VF                               ; Virama_Final
    +lb ; VI                               ; Virama
  -> add new blocks to uchar.h before UBLOCK_COUNT
    use long property names for enum constants,
    for the trailing comment get the block start code point: diff old & new Blocks.txt
    cd $UNIDATA_ROOT
    $ diff -u uni15.0/ucd/Blocks.txt uni15.1/snapshot/UCD/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
    +2EBF0..2EE4F; CJK Unified Ideographs Extension I
    (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
            replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new line break values to uchar.h & UCharacter.LineBreak
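
For those who do not use Eclipse, the two find/replace expressions above can probably be reproduced with sed.
The following is a sketch under that assumption; the function names are made up for illustration,
and the enum value in the usage line is only a sample, not necessarily the value in uchar.h.

```shell
# Hypothetical helpers: read uchar.h block enum lines from stdin and
# apply the same regular expressions as the Eclipse find/replace above.

# UBLOCK_X = n, /*[start]*/  ->  UnicodeBlock ID constant
ublock_to_java_ids() {
    sed -E 's|UBLOCK_([^ ]+) = ([0-9]+), (/.+)|public static final int \1_ID = \2; \3|'
}

# UBLOCK_X = n, /*[start]*/  ->  UnicodeBlock object declaration
ublock_to_java_objects() {
    sed -E 's|UBLOCK_([^ ]+) = [0-9]+, (/.+)|public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2|'
}
```

Possible use: echo 'UBLOCK_CJK_UNIFIED_IDEOGRAPHS_EXTENSION_I = 328, /*[2EBF0]*/' | ublock_to_java_ids
Lines that do not match the pattern pass through unchanged, so paste only the new enum lines.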

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

  $ICU_OUT/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
  cd $UNICODE_TOOLS
  mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.tools.RecommendedSetGenerator" \
      -Dexec.args="" -am -pl unicodetools  -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
- copy & paste from the Console output into the .cpp & .java files

* check hardcoded IDS_Unary_Operator
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that it has not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt IDS_Unary_Operator)
  ->
    ucd/PropList.txt:2FFE..2FFF    ; IDS_Unary_Operator # So   [2] IDEOGRAPHIC DESCRIPTION CHAR...
- if it has changed, then update the implementation and the tests

* check hardcoded ID_Compat_Math_Start & ID_Compat_Math_Continue
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that they have not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt ID_Compat_Math)
  ->
    ucd/PropList.txt:00B2..00B3    ; ID_Compat_Math_Continue # No   [2] SUPERSCRIPT TWO..SUPERSCRIPT THREE
    ucd/PropList.txt:00B9          ; ID_Compat_Math_Continue # No       SUPERSCRIPT ONE
    ucd/PropList.txt:2070          ; ID_Compat_Math_Continue # No       SUPERSCRIPT ZERO
    ucd/PropList.txt:2074..2079    ; ID_Compat_Math_Continue # No   [6] SUPERSCRIPT FOUR..SUPERSCRIPT NINE
    ucd/PropList.txt:207A..207C    ; ID_Compat_Math_Continue # Sm   [3] SUPERSCRIPT PLUS SIGN..SUPERSCRIPT EQUALS SIGN
    ucd/PropList.txt:207D          ; ID_Compat_Math_Continue # Ps       SUPERSCRIPT LEFT PARENTHESIS
    ucd/PropList.txt:207E          ; ID_Compat_Math_Continue # Pe       SUPERSCRIPT RIGHT PARENTHESIS
    ucd/PropList.txt:2080..2089    ; ID_Compat_Math_Continue # No  [10] SUBSCRIPT ZERO..SUBSCRIPT NINE
    ucd/PropList.txt:208A..208C    ; ID_Compat_Math_Continue # Sm   [3] SUBSCRIPT PLUS SIGN..SUBSCRIPT EQUALS SIGN
    ucd/PropList.txt:208D          ; ID_Compat_Math_Continue # Ps       SUBSCRIPT LEFT PARENTHESIS
    ucd/PropList.txt:208E          ; ID_Compat_Math_Continue # Pe       SUBSCRIPT RIGHT PARENTHESIS
    ucd/PropList.txt:2202          ; ID_Compat_Math_Continue # Sm       PARTIAL DIFFERENTIAL
    ucd/PropList.txt:2207          ; ID_Compat_Math_Continue # Sm       NABLA
    ucd/PropList.txt:221E          ; ID_Compat_Math_Continue # Sm       INFINITY
    ucd/PropList.txt:1D6C1         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL BOLD NABLA
    ucd/PropList.txt:1D6DB         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D6FB         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL ITALIC NABLA
    ucd/PropList.txt:1D715         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D735         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL BOLD ITALIC NABLA
    ucd/PropList.txt:1D74F         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL BOLD ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D76F         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL SANS-SERIF BOLD NABLA
    ucd/PropList.txt:1D789         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL SANS-SERIF BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D7A9         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL SANS-SERIF BOLD ITALIC NABLA
    ucd/PropList.txt:1D7C3         ; ID_Compat_Math_Continue # Sm       MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:2202          ; ID_Compat_Math_Start # Sm       PARTIAL DIFFERENTIAL
    ucd/PropList.txt:2207          ; ID_Compat_Math_Start # Sm       NABLA
    ucd/PropList.txt:221E          ; ID_Compat_Math_Start # Sm       INFINITY
    ucd/PropList.txt:1D6C1         ; ID_Compat_Math_Start # Sm       MATHEMATICAL BOLD NABLA
    ucd/PropList.txt:1D6DB         ; ID_Compat_Math_Start # Sm       MATHEMATICAL BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D6FB         ; ID_Compat_Math_Start # Sm       MATHEMATICAL ITALIC NABLA
    ucd/PropList.txt:1D715         ; ID_Compat_Math_Start # Sm       MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D735         ; ID_Compat_Math_Start # Sm       MATHEMATICAL BOLD ITALIC NABLA
    ucd/PropList.txt:1D74F         ; ID_Compat_Math_Start # Sm       MATHEMATICAL BOLD ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D76F         ; ID_Compat_Math_Start # Sm       MATHEMATICAL SANS-SERIF BOLD NABLA
    ucd/PropList.txt:1D789         ; ID_Compat_Math_Start # Sm       MATHEMATICAL SANS-SERIF BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D7A9         ; ID_Compat_Math_Start # Sm       MATHEMATICAL SANS-SERIF BOLD ITALIC NABLA
    ucd/PropList.txt:1D7C3         ; ID_Compat_Math_Start # Sm       MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL
- if they have changed, then update the implementation and the tests
- TODO: There is a ticket for using ppucd.txt in test code.
  Do that and check these hardcoded properties against that.
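
Comparing grep output by eye is error-prone for the longer lists above.
A possible alternative is to print only the code point ranges of one property, so that they can be compared with a saved list.
This is a sketch; the function name is hypothetical and it assumes the usual "range ; property # comment" layout of PropList.txt.

```shell
# Hypothetical helper: print the code point ranges that the PropList.txt
# file given as $1 assigns to the binary property named $2, one per line.
proplist_ranges() {
    awk -F';' -v prop="$2" '
        /^[0-9A-F]/ {
            range = $1
            name = $2
            sub(/#.*/, "", name)        # drop the trailing comment
            gsub(/[ \t]/, "", range)    # trim the padding around the range
            gsub(/[ \t]/, "", name)     # trim the padding around the name
            if (name == prop) print range
        }' "$1"
}
```

Possible use: proplist_ranges path/to/PropList.txt IDS_Unary_Operator
For the Unicode 15.1 data shown above, this should print the single range 2FFE..2FFF.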

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
      or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* Since Unicode 15.1, the UTS #46 data derivation no longer looks at the decompositions (NFD).
  These characters are now just valid, no longer disallowed_STD3_valid.
  Remove special handling of U+2260, U+226E, U+226F (isNonASCIIDisallowedSTD3Valid())
  from uts46.cpp & UTS46.java,
  and special test code from uts46test.cpp & UTS46Test.java.
  (remove this section next time)

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
    instructions inspired by
    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
        (didn't work without setting JAVA_HOME,
         nor with the Google default of /usr/local/buildtools/java/jdk
         [Google security limitations in the XML parser])
    export TOOLS_ROOT=$ICU_SRC/tools
    export CLDR_DIR=$CLDR_SRC
    export CLDR_DATA_DIR=$CLDR_DIR
        (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C
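
When no visual diff is needed, the diff and copy steps above could be folded into one step that reports whether a file actually changed.
The following is a sketch with a made-up function name; it is not a replacement for reviewing the differences in meld.

```shell
# Hypothetical helper: copy $1 over $2 only when the contents differ,
# and say which case applied.
copy_if_changed() {
    if cmp -s "$1" "$2"; then
        echo "unchanged: $2"
    else
        cp "$1" "$2"
        echo "updated: $2"
    fi
}
```

Possible use: copy_if_changed /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt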

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
    you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt74l.dat ./out/icu4j/icudt74b.dat -s ./out/build/icudt74l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
    cd $ICU_OUT/icu4c/data/out/icu4j
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cd com/ibm/icu/impl/data/$ICUDT/
    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
      $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
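
The `ls | egrep | awk | sh` pipeline in the copy step above generates cp commands as text and pipes them into a shell.
A plain loop might be easier to read and to debug. This is a sketch with a hypothetical function name;
it is intended to copy the same files (all *.icu except cnvalias.icu).

```shell
# Hypothetical helper: copy every .icu file from directory $1 to directory $2,
# skipping cnvalias.icu, like the pipeline above.
copy_icu_files() {
    for f in "$1"/*.icu; do
        [ -e "$f" ] || continue                            # no .icu files at all
        [ "$(basename "$f")" = cnvalias.icu ] && continue  # excluded on purpose
        cp -v "$f" "$2"
    done
}
```

Possible use, from within com/ibm/icu/impl/data/$ICUDT/:
copy_icu_files . $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT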

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.1.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-15.txt /tmp/icu/nv4-15.1.txt
    -->
    (empty this time)
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    (empty this time)
  Unicode 15.1:
    (none this time)
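
The first variant above (two egrep runs plus a diff) could be wrapped into one function that prints only the additions.
This is a sketch: the function name is made up, and it reuses the egrep pattern from above unchanged.

```shell
# Hypothetical helper: print the gc=Nd, nv=4 lines that are in the new
# ppucd.txt ($2) but not in the old ppucd.txt ($1).
new_nv4_digits() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    grep -E ';gc=Nd.+;nv=4' "$1" | sort > "$old_list"
    grep -E ';gc=Nd.+;nv=4' "$2" | sort > "$new_list"
    comm -13 "$old_list" "$new_list"    # keep lines unique to the new file
    rm -f "$old_list" "$new_list"
}
```

Possible use: new_nv4_digits ~/icu/mine/src/icu4c/source/data/unidata/ppucd.txt ~/icu/uni/src/icu4c/source/data/unidata/ppucd.txt
Empty output means that there are no new sets of decimal digits.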

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  +   remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  +   switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  +   rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools
476
477---------------------------------------------------------------------------- ***
478
479CLDR 43 root collation update for ICU 73
480
481Partial update only for the root collation.
482See
483- https://unicode-org.atlassian.net/browse/CLDR-15946
484  Treat quote marks as equivalent when strength=UCOL_PRIMARY
485- https://github.com/unicode-org/cldr/pull/2691
486  CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
487- https://github.com/unicode-org/cldr/pull/2833
488  CLDR-15946 make fancy quotes secondary-different from each other
489
490The related changes to tailorings were already integrated in an earlier PR for
491https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.
492
493This update is for the root collation,
494which is handled by different tools than the locale data updates.
495
496* Command-line environment setup
497
498export UNICODE_DATA=~/unidata/uni15/20220830
499export CLDR_SRC=~/cldr/uni/src
500export ICU_ROOT=~/icu/uni
501export ICU_SRC=$ICU_ROOT/src
502export ICUDT=icudt73b
503export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
504export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
505export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
506
507*** Configure: Build Unicode data for ICU4J
508  cd $ICU_ROOT/dbg/icu4c
509  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
510
511* Bazel build process
512
513See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
514for an overview and for setup instructions.
515
516Consider running `bazelisk --version` outside of the $ICU_SRC folder
517to find out the latest `bazel` version, and
518copying that version number into the $ICU_SRC/.bazeliskrc config file.
519(Revert if you find incompatibilities, or, better, update our build & config files.)
520
521* generate data files
522
523- remember to define the environment variables
524  (see the start of the section for this Unicode version)
525- cd $ICU_SRC
526- optional but not necessary:
527    bazelisk clean
528      or even
529    bazelisk clean --expunge
530- build/bootstrap/generate new files:
531    icu4c/source/data/unidata/generate.sh
532
533* collation: CLDR collation root, UCA DUCET
534
535- UCA DUCET goes into Mark's Unicode tools,
536  and a tool-tailored version goes into CLDR, see
537    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
538
539- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
540    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
541- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
542    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
543    (note removing the underscore before "Rules")
544    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
545- restore TODO diffs in UCARules.txt
546    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
547- update (ICU4C)/source/test/testdata/CollationTest_*.txt
548  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
549  from the CLDR root files (..._CLDR_..._SHORT.txt)
550    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
551    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
552    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
553- if CLDR common/uca/unihan-index.txt changes, then update
554  CLDR common/collation/root.xml <collation type="private-unihan">
555  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
556
557- generate data files, as above (generate.sh), now to pick up new collation data
558- rebuild ICU4C (make clean, make check, as usual)
559
560* run & fix ICU4C tests, now with new CLDR collation root data
561- run all tests with the collation test data *_SHORT.txt or the full files
562  (the full ones have comments, useful for debugging)
563- note on intltest: if collate/UCAConformanceTest fails, then
564  utility/MultithreadTest/TestCollators will fail as well;
565  fix the conformance test before looking into the multi-thread test
566
567* update Java data files
568- refresh just the UCD/UCA-related/derived files, just to be safe
569- see (ICU4C)/source/data/icu4j-readme.txt
570- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
571- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    NOTE: If you get an error like "No rule to make target 'out/build/icudt70l/uprops.icu'"
    (with the current data version in place of 70),
    you need to reconfigure with unicore data; see the "configure" line above.
574  output:
575    ...
576    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
577    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
578    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
579    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
580    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
581    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
582    mkdir -p /tmp/icu4j/main/shared/data
583    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
584    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
585    mkdir -p /tmp/icu4j/main/shared/data
586    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
587    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
588- copy the big-endian Unicode data files to another location,
589  separate from the other data files,
590  and then refresh ICU4J
591    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
592    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
593    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
594    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
595- new for ICU 73: also copy the binary data files directly into the ICU4J tree
596    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
597
598* When refreshing all of ICU4J data from ICU4C
599- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
600- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
601or
602- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
603
604* refresh Java test .txt files
605- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
606    cd $ICU_SRC/icu4c/source/data/unidata
607    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
608    cd ../../test/testdata
609    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
610    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
611
612* run & fix ICU4J tests
613
614*** merge the Unicode update branch back onto the main branch
615- do not merge the icudata.jar and testdata.jar,
616  instead rebuild them from merged & tested ICU4C
617- if there is a merge conflict in icudata.jar, here is one way to deal with it:
618  +   remove icudata.jar from the commit so that rebasing is trivial
619  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
620  + ~/icu/uni/src$ git commit -a --amend
621  +   switch to main, pull updates, switch back to the dev branch
622  + ~/icu/uni/src$ git rebase main
623  +   rebuild icudata.jar
624  + ~/icu/uni/src$ git commit -a --amend
625  + ~/icu/uni/src$ git push -f
626- make sure that changes to Unicode tools are checked in:
627  https://github.com/unicode-org/unicodetools
628
629---------------------------------------------------------------------------- ***
630
631Unicode 15.0 update for ICU 72
632
633https://www.unicode.org/versions/Unicode15.0.0/
634https://www.unicode.org/versions/beta-15.0.0.html
635https://www.unicode.org/Public/15.0.0/ucd/
636https://www.unicode.org/reports/uax-proposed-updates.html
637https://www.unicode.org/reports/tr44/tr44-29.html
638
639https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
640https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
641https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)
642
643* Command-line environment setup
644
645export UNICODE_DATA=~/unidata/uni15/20220830
646export CLDR_SRC=~/cldr/uni/src
647export ICU_ROOT=~/icu/uni
648export ICU_SRC=$ICU_ROOT/src
649export ICUDT=icudt72b
650export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
651export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
652export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
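
Before pasting tool command lines, a quick check that the variables are set can save time.
The following is a minimal sketch; the function name check_update_env is made up for illustration,
and the variable list is simply taken from the export lines above.

```shell
# Hypothetical helper (not part of ICU): report which of the
# environment variables from the export lines are unset or empty.
check_update_env() {
    missing=0
    for name in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    # 0 if all variables are set, 1 otherwise.
    return "$missing"
}
```

Usage: check_update_env && echo "environment looks complete"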
653
654*** Unicode version numbers
655- makedata.mak
656- uchar.h
657- com.ibm.icu.util.VersionInfo
658- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
659
660- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
661    so that the makefiles see the new version number.
662  cd $ICU_ROOT/dbg/icu4c
663  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
664
665*** data files & enums & parser code
666
667* download files
668- same as for the early Unicode Tools setup and data refresh:
669  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
670  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
671- mkdir -p $UNICODE_DATA
672- download Unicode files into $UNICODE_DATA
673  + subfolders: emoji, idna, security, ucd, uca
674  + old way of fetching files: from the "Public" area on unicode.org
675    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
676    ~ split Unihan into single-property files
677      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
678  + new way of fetching files, if available:
679    copy the files from a Unicode Tools workspace that is up to date with
680    https://github.com/unicode-org/unicodetools
681    and which might at this point be *ahead* of "Public"
682    ~ before the Unicode release copy files from "dev" subfolders, for example
683      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools
    (since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules):
687  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
688    or
689  cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
690
691* for manual diffs and for Unicode Tools input data updates:
692  remove version suffixes from the file names
693    ~$ unidata/desuffixucd.py $UNICODE_DATA
694  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
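
To illustrate what "remove version suffixes" means, here is a rough shell sketch.
It is not the desuffixucd.py tool and may differ from it in details.
It assumes suffixes of the form -14.0.0 or -14.0.0d19 directly before the file extension
(as in a name like GraphemeBreakTest-cldr-14.0.0d19.txt), and it handles only one folder level.
The function names are hypothetical.

```shell
# Print a file name with an assumed "-X.Y.Z" or "-X.Y.ZdN" version suffix removed.
desuffix_name() {
    printf '%s\n' "$1" |
        sed -E 's/-[0-9]+\.[0-9]+\.[0-9]+(d[0-9]+)?(\.[A-Za-z]+)$/\2/'
}

# Rename all regular files in folder $1 whose names carry such a suffix.
desuffix_dir() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        old=$(basename "$f")
        new=$(desuffix_name "$old")
        # Only rename when the name actually changes.
        [ "$new" = "$old" ] || mv -v "$f" "$1/$new"
    done
}
```

Usage: desuffix_dir $UNICODE_DATA/ucd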
695
696* process and/or copy files
697- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
698  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
699  + For debugging, and tweaking how ppucd.txt is written,
700    the tool has an --only_ppucd option:
701    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
702
703- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
704
705* new constants for new property values
706- preparseucd.py error:
707    ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
708  = PropertyValueAliases.txt new property values (diff old & new .txt files)
709    ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
710    +age; 15.0                             ; V15_0
711    +blk; Arabic_Ext_C                     ; Arabic_Extended_C
712    +blk; CJK_Ext_H                        ; CJK_Unified_Ideographs_Extension_H
713    +blk; Cyrillic_Ext_D                   ; Cyrillic_Extended_D
714    +blk; Devanagari_Ext_A                 ; Devanagari_Extended_A
715    +blk; Kaktovik_Numerals                ; Kaktovik_Numerals
716    +blk; Kawi                             ; Kawi
717    +blk; Nag_Mundari                      ; Nag_Mundari
718    +sc ; Kawi                             ; Kawi
719    +sc ; Nagm                             ; Nag_Mundari
720  -> add new blocks to uchar.h before UBLOCK_COUNT
721    use long property names for enum constants,
722    for the trailing comment get the block start code point: diff old & new Blocks.txt
723    ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
724    +10EC0..10EFF; Arabic Extended-C
725    +11B00..11B5F; Devanagari Extended-A
726    +11F00..11F5F; Kawi
727    -13430..1343F; Egyptian Hieroglyph Format Controls
728    +13430..1345F; Egyptian Hieroglyph Format Controls
729    +1D2C0..1D2DF; Kaktovik Numerals
730    +1E030..1E08F; Cyrillic Extended-D
731    +1E4D0..1E4FF; Nag Mundari
732    +31350..323AF; CJK Unified Ideographs Extension H
733    (ignore blocks whose end code point changed)
734  -> add new blocks to UCharacter.UnicodeBlock IDs
735    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
736            replace  public static final int \1_ID = \2; \3
737  -> add new blocks to UCharacter.UnicodeBlock objects
738    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
739            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
740  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
741    Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
742            replace  public static final int \1 = \2; \3
743  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
744      and in com.ibm.icu.dev.test.lang.TestUScript.java
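
For working outside of Eclipse, the find/replace patterns above can be approximated with sed.
This is a sketch under the assumption that the new enum lines in uchar.h and uscript.h
have exactly the spacing that the Eclipse patterns expect; the function names are made up.

```shell
# Each function reads C enum lines on stdin and writes Java declarations to stdout.
# Lines that do not match the pattern pass through unchanged.

# UBLOCK_X = 123, /*[start]*/  ->  public static final int X_ID = 123; /*[start]*/
ublock_to_java_id() {
    sed -E 's|UBLOCK_([^ ]+) = ([0-9]+), (/.+)|public static final int \1_ID = \2; \3|'
}

# UBLOCK_X = 123, /*[start]*/  ->  public static final UnicodeBlock X = new UnicodeBlock("X", X_ID); /*[start]*/
ublock_to_java_object() {
    sed -E 's|UBLOCK_([^ ]+) = [0-9]+, (/.+)|public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2|'
}

# USCRIPT_X = 123,/* Xxxx */  ->  public static final int X = 123; /* Xxxx */
uscript_to_java() {
    sed -E 's|USCRIPT_([^ ]+) *= ([0-9]+),(/.+)|public static final int \1 = \2; \3|'
}
```

Usage, with a made-up enum line (the name and the numbers are not real constants):
  echo 'UBLOCK_EXAMPLE_BLOCK = 999, /*[12345]*/' | ublock_to_java_id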
745
746* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
747    (not strictly necessary for NOT_ENCODED scripts)
748  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
749
750* build ICU
751  to make sure that there are no syntax errors
752
753  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
754
755* update spoof checker UnicodeSet initializers:
756    inclusionPat & recommendedPat in i18n/uspoof.cpp
757    INCLUSION & RECOMMENDED in SpoofChecker.java
758- make sure that the Unicode Tools tree contains the latest security data files
759- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
760- run the tool (no special environment variables needed)
761- copy & paste from the Console output into the .cpp & .java files
762
763* Bazel build process
764
765See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
766for an overview and for setup instructions.
767
768Consider running `bazelisk --version` outside of the $ICU_SRC folder
769to find out the latest `bazel` version, and
770copying that version number into the $ICU_SRC/.bazeliskrc config file.
771(Revert if you find incompatibilities, or, better, update our build & config files.)
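
A possible way to script the version lookup is sketched below.
Assumptions: `bazelisk --version` prints a line like "bazel 7.4.1",
and the .bazeliskrc file uses a USE_BAZEL_VERSION=... line as described in the Bazelisk documentation.
Check the existing .bazeliskrc before overwriting anything. The function name is hypothetical.

```shell
# Turn version output such as "bazel 7.4.1" (passed as $1)
# into a line for the .bazeliskrc config file.
bazeliskrc_line() {
    version=$(printf '%s\n' "$1" |
        sed -n -E 's/^bazel ([0-9]+\.[0-9]+\.[0-9]+).*$/\1/p')
    # Fail if no version number was recognized.
    [ -n "$version" ] || return 1
    printf 'USE_BAZEL_VERSION=%s\n' "$version"
}
```

Usage: bazeliskrc_line "$(cd ~ && bazelisk --version)"
and compare the result with the current contents of $ICU_SRC/.bazeliskrc.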
772
773* generate data files
774
775- remember to define the environment variables
776  (see the start of the section for this Unicode version)
777- cd $ICU_SRC
- optional, usually not necessary:
779    bazelisk clean
780- build/bootstrap/generate new files:
781    icu4c/source/data/unidata/generate.sh
782
783* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
784  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
785- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
786    ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
787- Unicode 6.0..15.0: U+2260, U+226E, U+226F
788- nothing new in this Unicode version, no test file to update
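
The grep above also lists ASCII characters. A sketch for narrowing the output to non-ASCII entries follows.
It assumes that each data line starts with a code point or range in hex,
with ASCII code points written as four digits (0000..007F); the function name is made up.

```shell
# Print the disallowed_STD3_valid lines of file $1
# whose (first) code point is outside of ASCII.
non_ascii_std3_valid() {
    grep 'disallowed_STD3_valid' "$1" |
        # Drop lines that start with a four-digit code point 0000..007F.
        grep -Ev '^00[0-7][0-9A-Fa-f]([^0-9A-Fa-f]|$)'
}
```

Usage: non_ascii_std3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt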
789
790* run & fix ICU4C tests
791- Note: Some of the collation data and test data will be updated below,
792  so at this time we might get some collation test failures.
793  Ignore these for now.
794- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
795  (no rule changes in Unicode 15)
796- update CLDR GraphemeBreakTest.txt
797    cd ~/unitools/mine/Generated
798    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
799    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
800    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
801- Andy helps with RBBI & spoof check test failures
802
803* collation: CLDR collation root, UCA DUCET
804
805- UCA DUCET goes into Mark's Unicode tools,
806  and a tool-tailored version goes into CLDR, see
807    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
808
809- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
810    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
811- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
812    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
813    (note removing the underscore before "Rules")
814    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
815- restore TODO diffs in UCARules.txt
816    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
817- update (ICU4C)/source/test/testdata/CollationTest_*.txt
818  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
819  from the CLDR root files (..._CLDR_..._SHORT.txt)
820    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
821    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
822    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
823- if CLDR common/uca/unihan-index.txt changes, then update
824  CLDR common/collation/root.xml <collation type="private-unihan">
825  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
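
Since the ICU4J CollationTest_*.txt files are plain copies of the ICU4C ones,
a check like the following sketch can confirm that both folders are in sync after the cp commands.
The function name is hypothetical.

```shell
# Compare CollationTest_*.txt in the ICU4C folder $1 with the copies in the ICU4J folder $2.
# Returns 0 if all files match, 1 if a copy differs or is missing, 2 if $1 has no such files.
collation_tests_in_sync() {
    c_dir=$1
    j_dir=$2
    status=0
    for f in "$c_dir"/CollationTest_*.txt; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] || { echo "no CollationTest_*.txt in $c_dir" >&2; return 2; }
        name=$(basename "$f")
        if ! cmp -s "$f" "$j_dir/$name"; then
            echo "differs or missing: $name" >&2
            status=1
        fi
    done
    return "$status"
}
```

Usage: collation_tests_in_sync $ICU_SRC/icu4c/source/test/testdata $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data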
826
827- generate data files, as above (generate.sh), now to pick up new collation data
828- update CollationFCD.java:
829  copy & paste the initializers of lcccIndex[] etc. from
830    ICU4C/source/i18n/collationfcd.cpp to
831    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
832- rebuild ICU4C (make clean, make check, as usual)
833
834* Unihan collators
835    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
836- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
837  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
838- generate ICU zh collation data
839    instructions inspired by
840    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
841    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
842  + setup:
843    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
844        (didn't work without setting JAVA_HOME,
845         nor with the Google default of /usr/local/buildtools/java/jdk
846         [Google security limitations in the XML parser])
847    export TOOLS_ROOT=~/icu/uni/src/tools
848    export CLDR_DIR=~/cldr/uni/src
849    export CLDR_DATA_DIR=~/cldr/uni/src
        (pointing to the "raw" data rather than cldr-staging/.../production should be ok for the relevant files)
851    cd "$TOOLS_ROOT/cldr/lib"
852    ./install-cldr-jars.sh "$CLDR_DIR"
853  + generate the files we need
854    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
855    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
856  + diff
857    cd $ICU_SRC
858    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
859    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
860  + copy into the source tree
861    cd $ICU_SRC
862    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
863    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
864- rebuild ICU4C
865
866* run & fix ICU4C tests, now with new CLDR collation root data
867- run all tests with the collation test data *_SHORT.txt or the full files
868  (the full ones have comments, useful for debugging)
869- note on intltest: if collate/UCAConformanceTest fails, then
870  utility/MultithreadTest/TestCollators will fail as well;
871  fix the conformance test before looking into the multi-thread test
872
873* update Java data files
874- refresh just the UCD/UCA-related/derived files, just to be safe
875- see (ICU4C)/source/data/icu4j-readme.txt
876- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
877- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    NOTE: If you get an error like "No rule to make target 'out/build/icudt70l/uprops.icu'"
    (with the current data version in place of 70),
    you need to reconfigure with unicore data; see the "configure" line above.
880  output:
881    ...
882    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
883    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
884    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
885    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
886    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
887    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
888    mkdir -p /tmp/icu4j/main/shared/data
889    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
890    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
891    mkdir -p /tmp/icu4j/main/shared/data
892    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
893    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
894- copy the big-endian Unicode data files to another location,
895  separate from the other data files,
896  and then refresh ICU4J
897    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
898    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
899    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
900    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
901    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
902    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
903    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
904    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
905    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
906    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
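
The mkdir/cp/rm steps above can be wrapped into one function, for example to try them out on a scratch folder first.
This is only a sketch of the same steps; the function name and its parameters are made up.
It does not run the final jar command.

```shell
# $1 = folder that contains com/ibm/icu/impl/data/<name> (for example the out/icu4j folder)
# $2 = staging folder (for example /tmp/icu4j)
# $3 = data folder name (for example the value of $ICUDT)
stage_unicode_data() {
    src=$1/com/ibm/icu/impl/data/$3
    dst=$2/com/ibm/icu/impl/data/$3
    mkdir -p "$dst/coll" "$dst/brkitr" || return 1
    cp "$src/confusables.cfu" "$dst" || return 1
    cp "$src"/*.icu "$dst" || return 1
    # The converter alias table is not Unicode data; remove it again.
    rm -f "$dst/cnvalias.icu"
    cp "$src"/*.nrm "$dst" || return 1
    cp "$src"/coll/* "$dst/coll" || return 1
    cp "$src"/brkitr/* "$dst/brkitr" || return 1
}
```

Usage: stage_unicode_data $ICU_ROOT/dbg/icu4c/data/out/icu4j /tmp/icu4j $ICUDT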
907
908* When refreshing all of ICU4J data from ICU4C
909- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
910- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
911or
912- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
913
914* refresh Java test .txt files
915- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
916    cd $ICU_SRC/icu4c/source/data/unidata
917    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
918    cd ../../test/testdata
919    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
920    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
921
922* run & fix ICU4J tests
923
924*** API additions
925- send notice to icu-design about new born-@stable API (enum constants etc.)
926
927*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
929  for example:
930    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
931    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
932    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
933    -->
934    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
935    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
936  or:
937    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
938    -->
939    +11F50..11F59  ; Nd #  [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
940    +1E4F0..1E4F9  ; Nd #  [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
941  Unicode 15:
942    kawi 11F50..11F59 Kawi
943    nagm 1E4F0..1E4F9 Nag Mundari
944    https://github.com/unicode-org/cldr/pull/2041
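
The digit ranges can be derived from the "digit four" lines with a little hex arithmetic.
The sketch below assumes that each set of decimal digits is a contiguous run of ten code points
from digit zero to digit nine, so that zero is four minus 4 and nine is four plus 5.
The function names are made up.

```shell
# Print the assumed zero..nine range for the code point of a digit four ($1, in hex).
digit_range_from_four() {
    four=$((0x$1))
    printf '%X..%X\n' "$((four - 4))" "$((four + 5))"
}

# Print the digit ranges for all gc=Nd, nv=4 lines of a ppucd.txt-style file $1.
# The code point is the second semicolon-separated field.
digit_ranges_from_ppucd() {
    grep -E ';gc=Nd.+;nv=4' "$1" | cut -d';' -f2 | while read -r cp; do
        digit_range_from_four "$cp"
    done
}
```

For example, digit_range_from_four 11F54 prints 11F50..11F59, which matches the Kawi range listed above.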
945
*** merge the Unicode update branch back onto the main branch
947- do not merge the icudata.jar and testdata.jar,
948  instead rebuild them from merged & tested ICU4C
949- if there is a merge conflict in icudata.jar, here is one way to deal with it:
950  +   remove icudata.jar from the commit so that rebasing is trivial
951  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
952  + ~/icu/uni/src$ git commit -a --amend
953  +   switch to main, pull updates, switch back to the dev branch
954  + ~/icu/uni/src$ git rebase main
955  +   rebuild icudata.jar
956  + ~/icu/uni/src$ git commit -a --amend
957  + ~/icu/uni/src$ git push -f
958- make sure that changes to Unicode tools are checked in:
959  https://github.com/unicode-org/unicodetools
960
961---------------------------------------------------------------------------- ***
962
963Unicode 14.0 update for ICU 70
964
965https://www.unicode.org/versions/Unicode14.0.0/
966https://www.unicode.org/versions/beta-14.0.0.html
967https://www.unicode.org/Public/14.0.0/ucd/
968https://www.unicode.org/reports/uax-proposed-updates.html
969https://www.unicode.org/reports/tr44/tr44-27.html
970
971https://unicode-org.atlassian.net/browse/CLDR-14801
972https://unicode-org.atlassian.net/browse/ICU-21635
973
974* Command-line environment setup
975
976export UNICODE_DATA=~/unidata/uni14/20210903
977export CLDR_SRC=~/cldr/uni/src
978export ICU_ROOT=~/icu/uni
979export ICU_SRC=$ICU_ROOT/src
980export ICUDT=icudt70b
981export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
982export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
983export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
984
985*** Unicode version numbers
986- makedata.mak
987- uchar.h
988- com.ibm.icu.util.VersionInfo
989- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
990
991- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
992    so that the makefiles see the new version number.
993  cd $ICU_ROOT/dbg/icu4c
994  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
995
996*** data files & enums & parser code
997
998* download files
999- same as for the early Unicode Tools setup and data refresh:
1000  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
1001  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
1002- mkdir -p $UNICODE_DATA
1003- download Unicode files into $UNICODE_DATA
1004  + subfolders: emoji, idna, security, ucd, uca
1005  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1006  + split Unihan into single-property files
1007    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools
    (since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules):
1011  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
1012    or
1013  cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
1014
1015* for manual diffs and for Unicode Tools input data updates:
1016  remove version suffixes from the file names
1017    ~$ unidata/desuffixucd.py $UNICODE_DATA
1018  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
1019
1020* process and/or copy files
1021- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1022  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1023  + For debugging, and tweaking how ppucd.txt is written,
1024    the tool has an --only_ppucd option:
1025    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1026
1027- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1028
1029* new constants for new property values
1030- preparseucd.py error:
1031    ValueError: missing uchar.h enum constants for some property values:
1032    [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
1033    (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
1034    (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
1035  = PropertyValueAliases.txt new property values (diff old & new .txt files)
1036    ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
1037    +age; 14.0                             ; V14_0
1038    +blk; Arabic_Ext_B                     ; Arabic_Extended_B
1039    +blk; Cypro_Minoan                     ; Cypro_Minoan
1040    +blk; Ethiopic_Ext_B                   ; Ethiopic_Extended_B
1041    +blk; Kana_Ext_B                       ; Kana_Extended_B
1042    +blk; Latin_Ext_F                      ; Latin_Extended_F
1043    +blk; Latin_Ext_G                      ; Latin_Extended_G
1044    +blk; Old_Uyghur                       ; Old_Uyghur
1045    +blk; Tangsa                           ; Tangsa
1046    +blk; Toto                             ; Toto
1047    +blk; UCAS_Ext_A                       ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
1048    +blk; Vithkuqi                         ; Vithkuqi
1049    +blk; Znamenny_Music                   ; Znamenny_Musical_Notation
1050    +jg ; Thin_Yeh                         ; Thin_Yeh
1051    +jg ; Vertical_Tail                    ; Vertical_Tail
1052    +sc ; Cpmn                             ; Cypro_Minoan
1053    +sc ; Ougr                             ; Old_Uyghur
1054    +sc ; Tnsa                             ; Tangsa
1055    +sc ; Toto                             ; Toto
1056    +sc ; Vith                             ; Vithkuqi
1057  -> add new blocks to uchar.h before UBLOCK_COUNT
1058    use long property names for enum constants,
1059    for the trailing comment get the block start code point: diff old & new Blocks.txt
1060    ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
1061    +0870..089F; Arabic Extended-B
1062    +10570..105BF; Vithkuqi
1063    +10780..107BF; Latin Extended-F
1064    +10F70..10FAF; Old Uyghur
1065    -11700..1173F; Ahom
1066    +11700..1174F; Ahom
1067    +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
1068    +12F90..12FFF; Cypro-Minoan
1069    +16A70..16ACF; Tangsa
1070    -18D00..18D8F; Tangut Supplement
1071    +18D00..18D7F; Tangut Supplement
1072    +1AFF0..1AFFF; Kana Extended-B
1073    +1CF00..1CFCF; Znamenny Musical Notation
1074    +1DF00..1DFFF; Latin Extended-G
1075    +1E290..1E2BF; Toto
1076    +1E7E0..1E7FF; Ethiopic Extended-B
1077    (ignore blocks whose end code point changed)
1078  -> add new blocks to UCharacter.UnicodeBlock IDs
1079    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1080            replace  public static final int \1_ID = \2; \3
1081  -> add new blocks to UCharacter.UnicodeBlock objects
1082    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1083            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1084  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
1085    Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
1086            replace  public static final int \1 = \2; \3
1087  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1088      and in com.ibm.icu.dev.test.lang.TestUScript.java
1089  -> add new joining groups to uchar.h & UCharacter.JoiningGroup
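
The "(ignore blocks whose end code point changed)" step above can be automated roughly as follows.
This is a sketch, not part of the ICU tools. The function name new_blocks is hypothetical,
and it assumes Blocks.txt data lines of the form "start..end; name" as in the diff output above.

```shell
# Print the lines for blocks that are new in file $2 compared with file $1,
# skipping blocks whose start code point already existed (only the end changed).
new_blocks() {
    # diff exits with 1 when the files differ; treat that as success.
    { diff -u "$1" "$2" || [ "$?" -eq 1 ]; } | awk -F'[.;]' '
        # Removed line: remember its start code point.
        /^-[0-9A-F]/ { removed[substr($1, 2)] = 1; next }
        # Added line: remember it, in order of appearance.
        /^\+[0-9A-F]/ { start = substr($1, 2); added[start] = substr($0, 2); order[++n] = start }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in removed)) print added[order[i]]
        }
    '
}
```

Usage, from the ~/unidata folder: new_blocks uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt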
1090
1091* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1092    (not strictly necessary for NOT_ENCODED scripts)
1093  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1094
1095* build ICU
1096  to make sure that there are no syntax errors
1097
1098  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
1099
1100* update spoof checker UnicodeSet initializers:
1101    inclusionPat & recommendedPat in i18n/uspoof.cpp
1102    INCLUSION & RECOMMENDED in SpoofChecker.java
1103- make sure that the Unicode Tools tree contains the latest security data files
1104- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1105- run the tool (no special environment variables needed)
1106- copy & paste from the Console output into the .cpp & .java files
1107
1108* Bazel build process
1109
1110See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
1111for an overview and for setup instructions.
1112
1113Consider running `bazelisk --version` outside of the $ICU_SRC folder
1114to find out the latest `bazel` version, and
1115copying that version number into the $ICU_SRC/.bazeliskrc config file.
1116(Revert if you find incompatibilities, or, better, update our build & config files.)
1117
1118* generate data files
1119
1120- remember to define the environment variables
1121  (see the start of the section for this Unicode version)
1122- cd $ICU_SRC
- optional, usually not necessary:
1124    bazelisk clean
1125- build/bootstrap/generate new files:
1126    icu4c/source/data/unidata/generate.sh
1127
1128* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1129  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1130- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1131- Unicode 6.0..14.0: U+2260, U+226E, U+226F
1132- nothing new in this Unicode version, no test file to update
1133
1134* run & fix ICU4C tests
1135- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
1136- update CLDR GraphemeBreakTest.txt
1137    cd ~/unitools/mine/Generated
1138    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
1139    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
1140    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
1141- Andy helps with RBBI & spoof check test failures
1142
1143* collation: CLDR collation root, UCA DUCET
1144
1145- UCA DUCET goes into Mark's Unicode tools,
1146  and a tool-tailored version goes into CLDR, see
1147    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
1148
1149- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1150    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1151- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1152    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note: the next command drops the underscore before "Rules" in the file name)
1154    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1155- restore TODO diffs in UCARules.txt
1156    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1157- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1158  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1159  from the CLDR root files (..._CLDR_..._SHORT.txt)
1160    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1161    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1162    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1163- if CLDR common/uca/unihan-index.txt changes, then update
1164  CLDR common/collation/root.xml <collation type="private-unihan">
1165  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1166
1167- generate data files, as above (generate.sh), now to pick up new collation data
1168- update CollationFCD.java:
1169  copy & paste the initializers of lcccIndex[] etc. from
1170    ICU4C/source/i18n/collationfcd.cpp to
1171    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1172- rebuild ICU4C (make clean, make check, as usual)
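
The CollationTest_*.txt copy step above renames each file by dropping "_CLDR" from its name. A minimal Python sketch of that mapping follows. The function names are hypothetical, and the sketch only computes source/destination pairs, it does not copy anything.

```python
from pathlib import PurePosixPath


def icu_collation_test_name(cldr_name: str) -> str:
    """Map a CLDR root test file name to the ICU name by dropping "_CLDR"."""
    prefix = "CollationTest_CLDR_"
    if not cldr_name.startswith(prefix):
        raise ValueError(f"unexpected file name: {cldr_name}")
    return "CollationTest_" + cldr_name[len(prefix):]


def copy_plan(cldr_src: str, icu_src: str):
    """List (source, destination) pairs for the two ..._SHORT.txt test files."""
    uca = PurePosixPath(cldr_src) / "common" / "uca"
    testdata = PurePosixPath(icu_src) / "icu4c" / "source" / "test" / "testdata"
    names = [
        "CollationTest_CLDR_NON_IGNORABLE_SHORT.txt",
        "CollationTest_CLDR_SHIFTED_SHORT.txt",
    ]
    return [(str(uca / name), str(testdata / icu_collation_test_name(name))) for name in names]
```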
1173
1174* Unihan collators
1175    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
1176- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
1177  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
1178- generate ICU zh collation data
1179    instructions inspired by
1180    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
1181    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
1182  + setup:
1183    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
1184        (didn't work without setting JAVA_HOME,
1185         nor with the Google default of /usr/local/buildtools/java/jdk
1186         [Google security limitations in the XML parser])
1187    export TOOLS_ROOT=~/icu/uni/src/tools
1188    export CLDR_DIR=~/cldr/uni/src
1189    export CLDR_DATA_DIR=~/cldr/uni/src
        (pointing to the "raw" data, not cldr-staging/.../production, should be ok for the relevant files)
1191    cd "$TOOLS_ROOT/cldr/lib"
1192    ./install-cldr-jars.sh "$CLDR_DIR"
1193  + generate the files we need
1194    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
1195    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
1196  + diff
1197    cd $ICU_SRC
1198    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
1199    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
1200  + copy into the source tree
1201    cd $ICU_SRC
1202    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
1203    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
1204- rebuild ICU4C
1205
1206* run & fix ICU4C tests, now with new CLDR collation root data
1207- run all tests with the collation test data *_SHORT.txt or the full files
1208  (the full ones have comments, useful for debugging)
1209- note on intltest: if collate/UCAConformanceTest fails, then
1210  utility/MultithreadTest/TestCollators will fail as well;
1211  fix the conformance test before looking into the multi-thread test
1212
1213* update Java data files
1214- refresh just the UCD/UCA-related/derived files, just to be safe
1215- see (ICU4C)/source/data/icu4j-readme.txt
1216- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1217- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1218    NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
1219    you need to reconfigure with unicore data; see the "configure" line above.
1220  output:
1221    ...
1222    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1223    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
1224    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
1225    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
1226    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
1227    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
1228    mkdir -p /tmp/icu4j/main/shared/data
1229    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1230    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
1231    mkdir -p /tmp/icu4j/main/shared/data
1232    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1233    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1234- copy the big-endian Unicode data files to another location,
1235  separate from the other data files,
1236  and then refresh ICU4J
1237    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1238    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1239    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1240    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1241    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1242    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1243    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1244    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1245    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1246    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
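
The selection made by the cp/rm commands above can be written down as one predicate. This is a sketch for checking which files end up in the refreshed set: the function name is hypothetical, and paths are taken relative to the $ICUDT folder.

```python
from fnmatch import fnmatchcase


def is_unicode_data_file(rel_path: str) -> bool:
    """True if a path relative to the $ICUDT folder is kept by the commands above."""
    if "/" in rel_path:
        folder, _, name = rel_path.partition("/")
        # coll/* and brkitr/* are copied; cp without -r skips nested folders
        return folder in ("coll", "brkitr") and "/" not in name
    if rel_path == "cnvalias.icu":       # copied by *.icu, then removed again
        return False
    return (
        rel_path == "confusables.cfu"
        or fnmatchcase(rel_path, "*.icu")
        or fnmatchcase(rel_path, "*.nrm")
    )
```

Everything else (locale data, time zone resources and so on) stays out of the UCD/UCA-only refresh.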
1247
1248* When refreshing all of ICU4J data from ICU4C
1249- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1250- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1251or
1252- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1253
1254* refresh Java test .txt files
1255- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1256    cd $ICU_SRC/icu4c/source/data/unidata
1257    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1258    cd ../../test/testdata
1259    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1260    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1261
1262* run & fix ICU4J tests
1263
1264*** API additions
1265- send notice to icu-design about new born-@stable API (enum constants etc.)
1266
1267*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1269  for example:
1270    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
1271    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
1272    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
1273    -->
1274    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
1275  Unicode 14:
1276    tnsa 16AC0..16AC9 Tangsa
1277    https://github.com/unicode-org/cldr/pull/1326
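
The egrep-and-diff steps above can be sketched in Python. This is a minimal sketch with hypothetical function names. It reuses the egrep pattern, and it assumes that the ten digits of a decimal set are encoded contiguously, so that the digit four alone determines the range.

```python
import re

# Same pattern as in the egrep command lines above.
DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4")


def digit_four_code_points(ppucd_lines):
    """Return the code points of all "cp" lines that match the pattern."""
    found = set()
    for line in ppucd_lines:
        fields = line.rstrip("\n").split(";")
        if fields[0] == "cp" and DIGIT_FOUR.search(line):
            start = fields[1].partition("..")[0]      # tolerate start..end notation
            found.add(int(start, 16))
    return found


def new_digit_ranges(old_lines, new_lines):
    """Return (zero, nine) code point pairs for digit sets present only in new_lines."""
    added = digit_four_code_points(new_lines) - digit_four_code_points(old_lines)
    return [(four - 4, four + 5) for four in sorted(added)]
```

For the TANGSA DIGIT FOUR line shown above, this yields the range 16AC0..16AC9, matching the tnsa entry.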
1278
1279*** merge the Unicode update branches back onto the trunk
1280- do not merge the icudata.jar and testdata.jar,
1281  instead rebuild them from merged & tested ICU4C
1282- make sure that changes to Unicode tools are checked in:
1283  https://github.com/unicode-org/unicodetools
1284
1285---------------------------------------------------------------------------- ***
1286
1287Unicode 13.0 update for ICU 66
1288
1289https://www.unicode.org/versions/Unicode13.0.0/
1290https://www.unicode.org/versions/beta-13.0.0.html
1291https://www.unicode.org/Public/13.0.0/ucd/
1292https://www.unicode.org/reports/uax-proposed-updates.html
1293https://www.unicode.org/reports/tr44/tr44-25.html
1294
1295https://unicode-org.atlassian.net/browse/CLDR-13387
1296https://unicode-org.atlassian.net/browse/ICU-20893
1297
1298* Command-line environment setup
1299
1300UNICODE_DATA=~/unidata/uni13/20200212
1301CLDR_SRC=~/cldr/uni/src
1302ICU_ROOT=~/icu/uni
1303ICU_SRC=$ICU_ROOT/src
1304ICUDT=icudt66b
1305ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1306ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1307export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
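
The ICUDT value follows the ICU data naming pattern: "icudt", the ICU major version, and a platform letter (l = little-endian ASCII, b = big-endian ASCII, e = big-endian EBCDIC). ICU4J uses the big-endian "b" flavor, as in the commands of this section. A small sketch with a hypothetical helper name:

```python
def icudt_name(icu_major: int, platform: str = "b") -> str:
    """Build a data folder name such as "icudt66b" from the ICU major version."""
    if platform not in ("l", "b", "e"):
        raise ValueError(f"unknown platform letter: {platform!r}")
    return f"icudt{icu_major}{platform}"
```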
1308
1309*** Unicode version numbers
1310- makedata.mak
1311- uchar.h
1312- com.ibm.icu.util.VersionInfo
1313- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1314
1315- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1316    so that the makefiles see the new version number.
1317  cd $ICU_ROOT/dbg/icu4c
1318  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
1319
1320*** data files & enums & parser code
1321
1322* download files
1323- mkdir -p $UNICODE_DATA
1324- download Unicode files into $UNICODE_DATA
1325  + subfolders: emoji, idna, security, ucd, uca
1326  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1327  + split Unihan into single-property files
1328    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
1329  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
1330    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
1332  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
1333
1334* for manual diffs and for Unicode Tools input data updates:
1335  remove version suffixes from the file names
1336    ~$ unidata/desuffixucd.py $UNICODE_DATA
1337  (see https://sites.google.com/site/unicodetools/inputdata)
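
For illustration only, removing a version suffix from a file name could look like the following sketch. It is not the code of desuffixucd.py. It assumes draft file names of the form Name-13.0.0d6.txt, and the function name is hypothetical.

```python
import re

# stem, "-", version x.y.z, optional draft number "dN", then the file extension
_VERSION_SUFFIX = re.compile(
    r"^(?P<stem>.+?)-\d+\.\d+\.\d+(?:d\d+)?(?P<ext>\.[A-Za-z0-9]+)$"
)


def desuffix(file_name: str) -> str:
    """Return the file name without its version suffix, or unchanged if it has none."""
    match = _VERSION_SUFFIX.match(file_name)
    if match is None:
        return file_name
    return match.group("stem") + match.group("ext")
```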
1338
1339* process and/or copy files
1340- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1341  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1342  + For debugging, and tweaking how ppucd.txt is written,
1343    the tool has an --only_ppucd option:
1344    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1345
1346- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1347
1348* new constants for new property values
1349- preparseucd.py error:
1350    ValueError: missing uchar.h enum constants for some property values:
1351    [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
1352        u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
1353    (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
1354    (u'InPC', set([u'Top_And_Bottom_And_Left']))]
1355  = PropertyValueAliases.txt new property values (diff old & new .txt files)
1356    blk; Chorasmian                       ; Chorasmian
1357    blk; CJK_Ext_G                        ; CJK_Unified_Ideographs_Extension_G
1358    blk; Dives_Akuru                      ; Dives_Akuru
1359    blk; Khitan_Small_Script              ; Khitan_Small_Script
1360    blk; Lisu_Sup                         ; Lisu_Supplement
1361    blk; Symbols_For_Legacy_Computing     ; Symbols_For_Legacy_Computing
1362    blk; Tangut_Sup                       ; Tangut_Supplement
1363    blk; Yezidi                           ; Yezidi
1364  -> add to uchar.h before UBLOCK_COUNT
1365    use long property names for enum constants,
1366    for the trailing comment get the block start code point: diff old & new Blocks.txt
1367  -> add to UCharacter.UnicodeBlock IDs
1368    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1369            replace  public static final int \1_ID = \2; \3
1370  -> add to UCharacter.UnicodeBlock objects
1371    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1372            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1373
1374    sc ; Chrs                             ; Chorasmian
1375    sc ; Diak                             ; Dives_Akuru
1376    sc ; Kits                             ; Khitan_Small_Script
1377    sc ; Yezi                             ; Yezidi
1378  -> uscript.h & com.ibm.icu.lang.UScript
1379  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1380      and in com.ibm.icu.dev.test.lang.TestUScript.java
1381
1382    InPC; Top_And_Bottom_And_Left         ; Top_And_Bottom_And_Left
1383  -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory
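
The two Eclipse find/replace patterns for UCharacter.UnicodeBlock above work the same way with Python's re.sub, which can help when no Eclipse is at hand. This is a sketch with hypothetical function names. Input lines are assumed to look like the uchar.h enum lines, for example "UBLOCK_CHORASMIAN = 301, /*[10FB0]*/" (sample value for illustration).

```python
import re


def block_id_line(uchar_h_line: str) -> str:
    """Turn a uchar.h UBLOCK_ enum line into the Java ..._ID constant line."""
    return re.sub(
        r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
        r"public static final int \1_ID = \2; \3",
        uchar_h_line,
    )


def block_object_line(uchar_h_line: str) -> str:
    """Turn a uchar.h UBLOCK_ enum line into the Java UnicodeBlock object line."""
    return re.sub(
        r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        uchar_h_line,
    )
```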
1384
1385* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1386    (not strictly necessary for NOT_ENCODED scripts)
1387  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1388
1389* build ICU (make install)
1390  to make sure that there are no syntax errors, and
1391  so that the tools build can pick up the new definitions from the installed header files.
1392
1393  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1394
1395* update spoof checker UnicodeSet initializers:
1396    inclusionPat & recommendedPat in i18n/uspoof.cpp
1397    INCLUSION & RECOMMENDED in SpoofChecker.java
1398- make sure that the Unicode Tools tree contains the latest security data files
1399- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1400- update the hardcoded version number there in the DIRECTORY path
1401- run the tool (no special environment variables needed)
1402- copy & paste from the Console output into the .cpp & .java files
1403
1404* generate normalization data files
1405  cd $ICU_ROOT/dbg/icu4c
1406  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1407  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
1408  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1409  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1410  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
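
The five gennorm2 command lines above differ only in the output file and in the list of input files, where each data file layers its inputs on top of nfc.txt. A sketch that builds the same argument lists from a table follows. The function name is hypothetical, and only the options shown above (-o, -s, --csource) are used.

```python
def gennorm2_commands(icu_src: str, data_in: str, unidata: str):
    """Return the argv lists for the gennorm2 invocations listed above."""
    norm2 = f"{unidata}/norm2"
    jobs = [
        # (output file, input files in order, write C source instead of binary data)
        (f"{icu_src}/icu4c/source/common/norm2_nfc_data.h", ["nfc.txt"], True),
        (f"{data_in}/nfc.nrm", ["nfc.txt"], False),
        (f"{data_in}/nfkc.nrm", ["nfc.txt", "nfkc.txt"], False),
        (f"{data_in}/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], False),
        (f"{data_in}/uts46.nrm", ["nfc.txt", "uts46.txt"], False),
    ]
    commands = []
    for output, inputs, csource in jobs:
        argv = ["bin/gennorm2", "-o", output, "-s", norm2, *inputs]
        if csource:
            argv.append("--csource")
        commands.append(argv)
    return commands
```

Each list could be passed to subprocess.run with the working directory set to $ICU_ROOT/dbg/icu4c, as in the cd command above.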
1411
1412* build ICU (make install)
1413  so that the tools build can pick up the new definitions from the installed header files.
1414
1415  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1416
1417* build Unicode tools using CMake+make
1418
1419$ICU_SRC/tools/unicode/c/icudefs.txt:
1420
1421# Location (--prefix) of where ICU was installed.
1422set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
1423# Location of the ICU4C source tree.
1424set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
1425
1426  $ICU_ROOT/dbg$
1427    mkdir -p tools/unicode/c
1428    cd tools/unicode/c
1429
1430  $ICU_ROOT/dbg/tools/unicode/c$
1431    cmake ../../../../src/tools/unicode/c
1432    make
1433
1434* generate core properties data files
1435  $ICU_ROOT/dbg/tools/unicode/c$
1436    genprops/genprops $ICU_SRC/icu4c
1437- tool failure:
1438    genprops: Script_Extensions indexes overflow bit field
1439    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format:
1441     add two more bits to store a script code or Script_Extensions index
1442  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
1443- rebuild ICU (make install) & tools
1444
1445* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1446  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1447- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1448- Unicode 6.0..13.0: U+2260, U+226E, U+226F
1449- nothing new in this Unicode version, no test file to update
1450
1451* run & fix ICU4C tests
1452- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
1453- Andy helps with RBBI & spoof check test failures
1454
1455* collation: CLDR collation root, UCA DUCET
1456
1457- UCA DUCET goes into Mark's Unicode tools, see
1458    https://sites.google.com/site/unicodetools/home#TOC-UCA
1459  diff the main mapping file, look for bad changes
1460  (for example, more bytes per weight for common characters)
1461    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
1462    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt
1463
1464- CLDR root data files are checked into $CLDR_SRC/common/uca/
1465    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1466
1467- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1468    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1469- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1470    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note: the next command drops the underscore before "Rules" in the file name)
1472    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1473- restore TODO diffs in UCARules.txt
1474    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1475- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1476  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1477  from the CLDR root files (..._CLDR_..._SHORT.txt)
1478    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1479    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1480    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1481- if CLDR common/uca/unihan-index.txt changes, then update
1482  CLDR common/collation/root.xml <collation type="private-unihan">
1483  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1484
1485- run genuca
1486  $ICU_ROOT/dbg/tools/unicode/c$
1487    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
1488    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1489- rebuild ICU4C
1490
1491* Unihan collators
1492    https://sites.google.com/site/unicodetools/unihan
1493- run Unicode Tools
1494    org.unicode.draft.GenerateUnihanCollators
1495  with VM arguments
1496    -ea
1497    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1498    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1499    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1500    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
1501    -DUVERSION=13.0.0
1502- run Unicode Tools
1503    org.unicode.draft.GenerateUnihanCollatorFiles
1504  with the same arguments
1505- check CLDR diffs
1506    cd $CLDR_SRC
1507    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1508    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1509- copy to CLDR
1510    cd $CLDR_SRC
1511    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1512    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1513- run CLDR unit tests, commit to CLDR
1514- generate ICU zh collation data: run CLDR
1515    org.unicode.cldr.icu.NewLdml2IcuConverter
1516  with program arguments
1517    -t collation
1518    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
1519    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
1520    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
1521    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
1522    zh
1523  and VM arguments
1524    -ea
1525    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
1526- rebuild ICU4C
1527
1528* run & fix ICU4C tests, now with new CLDR collation root data
1529- run all tests with the collation test data *_SHORT.txt or the full files
1530  (the full ones have comments, useful for debugging)
1531- note on intltest: if collate/UCAConformanceTest fails, then
1532  utility/MultithreadTest/TestCollators will fail as well;
1533  fix the conformance test before looking into the multi-thread test
1534
1535* update Java data files
1536- refresh just the UCD/UCA-related/derived files, just to be safe
1537- see (ICU4C)/source/data/icu4j-readme.txt
1538- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1539- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1540  output:
1541    ...
1542    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1543    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
1544    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
1545    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
1546    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
1547    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
1548    mkdir -p /tmp/icu4j/main/shared/data
1549    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1550    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
1551    mkdir -p /tmp/icu4j/main/shared/data
1552    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1553    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1554- copy the big-endian Unicode data files to another location,
1555  separate from the other data files,
1556  and then refresh ICU4J
1557    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1558    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1559    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1560    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1561    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1562    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1563    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1564    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1565    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1566    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1567
1568* When refreshing all of ICU4J data from ICU4C
1569- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1570- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1571or
1572- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1573
1574* update CollationFCD.java
1575  + copy & paste the initializers of lcccIndex[] etc. from
1576    ICU4C/source/i18n/collationfcd.cpp to
1577    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1578
1579* refresh Java test .txt files
1580- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1581    cd $ICU_SRC/icu4c/source/data/unidata
1582    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1583    cd ../../test/testdata
1584    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1585    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1586
1587* run & fix ICU4J tests
1588
1589*** API additions
1590- send notice to icu-design about new born-@stable API (enum constants etc.)
1591
1592*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1594  for example, look for
1595    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
1596    in new blocks (Blocks.txt)
1597  Unicode 13:
1598    diak 11950..11959 Dives_Akuru
1599
1600*** merge the Unicode update branches back onto the trunk
1601- do not merge the icudata.jar and testdata.jar,
1602  instead rebuild them from merged & tested ICU4C
1603- make sure that changes to Unicode tools are checked in:
1604  http://www.unicode.org/utility/trac/log/trunk/unicodetools
1605
1606---------------------------------------------------------------------------- ***
1607
1608Unicode 12.1 update for ICU 64.2
1609
1610** This is an abbreviated update with one new character for the new
1611** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
1612https://en.wikipedia.org/wiki/Reiwa_period
1613
1614http://www.unicode.org/versions/Unicode12.1.0/
1615
1616ICU-20497 Unicode 12.1
1617
1618cldrbug 11978: Unicode 12.1
1619
1620* Command-line environment setup
1621
1622UNICODE_DATA=~/unidata/uni121/20190403
1623CLDR_SRC=~/svn.cldr/uni
1624ICU_ROOT=~/icu/uni
1625ICU_SRC=$ICU_ROOT/src
1626ICUDT=icudt64b
1627ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1628ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1629export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1630
1631*** Unicode version numbers
1632- makedata.mak
1633- uchar.h
1634- com.ibm.icu.util.VersionInfo
1635- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1636
1637- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1638    so that the makefiles see the new version number.
1639  cd $ICU_ROOT/dbg/icu4c
1640  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
1641
1642*** data files & enums & parser code
1643
1644* download files
1645- mkdir -p $UNICODE_DATA
1646- download Unicode files into $UNICODE_DATA
1647  + subfolders: emoji, idna, security, ucd, uca
1648  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1649
1650* for manual diffs and for Unicode Tools input data updates:
1651  remove version suffixes from the file names
1652    ~$ unidata/desuffixucd.py $UNICODE_DATA
1653  (see https://sites.google.com/site/unicodetools/inputdata)
1654
1655* process and/or copy files
1656- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1657  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1658  + For debugging, and tweaking how ppucd.txt is written,
1659    the tool has an --only_ppucd option:
1660    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1661
1662- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1663
1664* build ICU (make install)
1665  so that the tools build can pick up the new definitions from the installed header files.
1666
1667  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1668
1669* update spoof checker UnicodeSet initializers:
1670    inclusionPat & recommendedPat in uspoof.cpp
1671    INCLUSION & RECOMMENDED in SpoofChecker.java
1672- make sure that the Unicode Tools tree contains the latest security data files
1673- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1674- update the hardcoded version number there in the DIRECTORY path
1675- run the tool (no special environment variables needed)
1676- copy & paste from the Console output into the .cpp & .java files
1677
1678* generate normalization data files
1679  cd $ICU_ROOT/dbg/icu4c
1680  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1681  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
1682  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1683  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1684  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
1685
1686* build ICU (make install)
1687  so that the tools build can pick up the new definitions from the installed header files.
1688
1689  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1690
1691* build Unicode tools using CMake+make
1692
1693$ICU_SRC/tools/unicode/c/icudefs.txt:
1694
1695# Location (--prefix) of where ICU was installed.
1696set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
1697# Location of the ICU4C source tree.
1698set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
1699
1700  $ICU_ROOT/dbg$
1701    mkdir -p tools/unicode/c
1702    cd tools/unicode/c
1703
1704  $ICU_ROOT/dbg/tools/unicode/c$
1705    cmake ../../../../src/tools/unicode/c
1706    make
1707
1708* generate core properties data files
1709  $ICU_ROOT/dbg/tools/unicode/c$
1710    genprops/genprops $ICU_SRC/icu4c
1711    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
1712    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1713- rebuild ICU (make install) & tools
1714
1715* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1716  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1717- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1718- Unicode 6.0..12.1: U+2260, U+226E, U+226F
1719- nothing new in this Unicode version, no test file to update
1720
1721* run & fix ICU4C tests
1722- Andy handles RBBI & spoof check test failures
1723
1724* collation: CLDR collation root, UCA DUCET
1725
1726- UCA DUCET goes into Mark's Unicode tools, see
1727    https://sites.google.com/site/unicodetools/home#TOC-UCA
1728  diff the main mapping file, look for bad changes
1729  (for example, more bytes per weight for common characters)
1730    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
1731    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt
1732
1733- CLDR root data files are checked into $CLDR_SRC/common/uca/
1734    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1735
1736- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1737    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1738- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1739    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1740    (note removing the underscore before "Rules")
1741    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1742- restore TODO diffs in UCARules.txt
1743    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1744- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1745  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1746  from the CLDR root files (..._CLDR_..._SHORT.txt)
1747    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1748    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1749    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1750- if CLDR common/uca/unihan-index.txt changes, then update
1751  CLDR common/collation/root.xml <collation type="private-unihan">
1752  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1753
1754- run genuca, see command line above
1755- rebuild ICU4C
1756
1757* Unihan collators
1758    https://sites.google.com/site/unicodetools/unihan
1759- run Unicode Tools
1760    org.unicode.draft.GenerateUnihanCollators
1761  with VM arguments
1762    -ea
1763    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1764    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1765    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1766    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1767    -DUVERSION=12.1.0
1768- run Unicode Tools
1769    org.unicode.draft.GenerateUnihanCollatorFiles
1770  with the same arguments
1771- check CLDR diffs
1772    cd $CLDR_SRC
1773    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1774    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1775- copy to CLDR
1776    cd $CLDR_SRC
1777    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1778    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1779- run CLDR unit tests, commit to CLDR
1780- generate ICU zh collation data: run CLDR
1781    org.unicode.cldr.icu.NewLdml2IcuConverter
1782  with program arguments
1783    -t collation
1784    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
1785    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
1786    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
1787    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
1788    zh
1789  and VM arguments
1790    -ea
1791    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1792- rebuild ICU4C
1793
1794* run & fix ICU4C tests, now with new CLDR collation root data
1795- run all tests with the collation test data *_SHORT.txt or the full files
1796  (the full ones have comments, useful for debugging)
1797- note on intltest: if collate/UCAConformanceTest fails, then
1798  utility/MultithreadTest/TestCollators will fail as well;
1799  fix the conformance test before looking into the multi-thread test
1800
1801* update Java data files
1802- refresh just the UCD/UCA-related/derived files, just to be safe
1803- see (ICU4C)/source/data/icu4j-readme.txt
1804- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1805- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1806  output:
1807    ...
1808    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1809    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
1810    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
1811    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
1812    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
1813    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
1814    mkdir -p /tmp/icu4j/main/shared/data
1815    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1816    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
1817    mkdir -p /tmp/icu4j/main/shared/data
1818    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1819    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1820- copy the big-endian Unicode data files to another location,
1821  separate from the other data files,
1822  and then refresh ICU4J
1823    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1824    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1825    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1826    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1827    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1828    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1829    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1830    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1831    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1832    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
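
Which files the cp/rm commands above end up staging can be summarized as a small predicate. This Python 3 sketch only restates those commands; the function name and the sample paths are illustrative assumptions.

```python
import posixpath

def is_staged(relative_path):
    """True if the commands above would copy this file (path relative to .../$ICUDT)."""
    directory, name = posixpath.split(relative_path)
    if directory in ("coll", "brkitr"):
        return True                       # cp coll/* and brkitr/*
    if directory:
        return False                      # other subfolders are not copied
    if name == "cnvalias.icu":
        return False                      # copied by *.icu, then removed again
    return name == "confusables.cfu" or name.endswith((".icu", ".nrm"))

# Illustrative paths, not a listing of a real build.
for path in ["uprops.icu", "cnvalias.icu", "nfc.nrm", "coll/ucadata.icu", "zone/en.res"]:
    print(path, is_staged(path))
```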
1833
1834* When refreshing all of ICU4J data from ICU4C
1835- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1836- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1837or
1838- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1839
1840* update CollationFCD.java
1841  + copy & paste the initializers of lcccIndex[] etc. from
1842    ICU4C/source/i18n/collationfcd.cpp to
1843    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1844
1845* refresh Java test .txt files
1846- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1847    cd $ICU_SRC/icu4c/source/data/unidata
1848    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1849    cd ../../test/testdata
1850    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1851    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1852
1853* run & fix ICU4J tests
1854
1855*** API additions
1856- send notice to icu-design about new born-@stable API (enum constants etc.)
1857
1858*** CLDR numbering systems
1859- look for new sets of decimal digits (gc=Nd & nv=4) and add them to CLDR;
1860  for example, run
1861    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
1862  and check the matches that fall in new blocks (Blocks.txt)
1863  Unicode 12: using Unicode 12 CLDR ticket #11478
1864    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
1865    wcho 1E2F0..1E2F9 Wancho
1866  Unicode 11: using Unicode 11 CLDR ticket #10978
1867    rohg 10D30..10D39 Hanifi_Rohingya
1868    gong 11DA0..11DA9 Gunjala_Gondi
1869  Earlier: CLDR tickets specific to adding new numbering systems.
1870  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1871  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
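
A Python 3 sketch of the same search, which also derives the digit range from the digit-four line. It reuses the egrep pattern above; the assumption that the second field holds a single hex code point ("cp;1E144;...") and the sample lines themselves are illustrative, not copied from ppucd.txt.

```python
import re

# Same pattern as the egrep above: a gc=Nd field followed later by nv=4.
DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4")

def decimal_digit_ranges(lines):
    """Yield (zero, nine) code points for each line that describes a digit four."""
    for line in lines:
        if not DIGIT_FOUR.search(line):
            continue
        # Assumption: field 1 is a single hex code point.
        four = int(line.split(";")[1], 16)
        yield four - 4, four + 5

# Made-up, abbreviated lines for illustration.
sample = [
    "cp;1E144;gc=Nd;nt=De;nv=4",
    "cp;0041;gc=Lu",
]
for zero, nine in decimal_digit_ranges(sample):
    print(f"{zero:04X}..{nine:04X}")
```

The printed range has the same form as the entries listed above, for example the Nyiakeng_Puachue_Hmong digits.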
1872
1873*** merge the Unicode update branches back onto the trunk
1874- do not merge the icudata.jar and testdata.jar,
1875  instead rebuild them from merged & tested ICU4C
1876- make sure that changes to Unicode tools are checked in:
1877  http://www.unicode.org/utility/trac/log/trunk/unicodetools
1878
1879---------------------------------------------------------------------------- ***
1880
1881Unicode 12.0 update for ICU 64
1882
1883http://www.unicode.org/versions/Unicode12.0.0/
1884http://unicode.org/versions/beta-12.0.0.html
1885https://www.unicode.org/review/pri389/
1886http://www.unicode.org/reports/uax-proposed-updates.html
1887http://www.unicode.org/reports/tr44/tr44-23.html
1888
1889ICU-20203 Unicode 12
1890
1891ICU-20111 move text layout properties data into a data file
1892
1893cldrbug 11478: Unicode 12
1894Accidentally used ^/trunk instead of ^/branches/markus/uni12
1895
1896* Command-line environment setup
1897
1898UNICODE_DATA=~/unidata/uni12/20190309
1899CLDR_SRC=~/svn.cldr/uni
1900ICU_ROOT=~/icu/uni
1901ICU_SRC=$ICU_ROOT/src
1902ICUDT=icudt63b
1903ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1904ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1905export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1906
1907*** Unicode version numbers
1908- makedata.mak
1909- uchar.h
1910- com.ibm.icu.util.VersionInfo
1911- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1912
1913- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1914  so that the makefiles see the new version number.
1915
1916*** data files & enums & parser code
1917
1918* download files
1919- mkdir -p $UNICODE_DATA
1920- download Unicode files into $UNICODE_DATA
1921  + subfolders: emoji, idna, security, ucd, uca
1922  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1923
1924* for manual diffs and for Unicode Tools input data updates:
1925  remove version suffixes from the file names
1926    ~$ unidata/desuffixucd.py $UNICODE_DATA
1927  (see https://sites.google.com/site/unicodetools/inputdata)
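
To illustrate what removing version suffixes means, here is a Python 3 sketch that works on names only. It is not the code of desuffixucd.py; the function name and the assumed suffix shape (a hyphen, a dotted version, an optional draft marker such as d5) are guesses for illustration.

```python
import re

# Assumed suffix shape: "-12.0.0" or "-12.0.0d5" right before the file extension.
_SUFFIX_RE = re.compile(r"^(.+)-[0-9]+(?:\.[0-9]+)*(?:d[0-9]+)?(\.[a-z]+)$")

def strip_version_suffix(name):
    """Return the file name without its version suffix, or unchanged if it has none."""
    match = _SUFFIX_RE.match(name)
    return match.group(1) + match.group(2) if match else name

for name in ["Blocks-12.0.0.txt", "UnicodeData-12.0.0d5.txt", "ReadMe.txt"]:
    print(name, "->", strip_version_suffix(name))
```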
1928
1929* process and/or copy files
1930- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1931  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1932  + For debugging, and tweaking how ppucd.txt is written,
1933    the tool has an --only_ppucd option:
1934    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1935
1936- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1937
1938* build ICU (make install)
1939  so that the tools build can pick up the new definitions from the installed header files.
1940
1941  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1942
1943* new constants for new property values
1944- preparseucd.py error:
1945    ValueError: missing uchar.h enum constants for some property values:
1946    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
1947        u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
1948        u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
1949    (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
1950  = PropertyValueAliases.txt new property values (diff old & new .txt files)
1951    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
1952    blk; Elymaic                          ; Elymaic
1953    blk; Nandinagari                      ; Nandinagari
1954    blk; Nyiakeng_Puachue_Hmong           ; Nyiakeng_Puachue_Hmong
1955    blk; Ottoman_Siyaq_Numbers            ; Ottoman_Siyaq_Numbers
1956    blk; Small_Kana_Ext                   ; Small_Kana_Extension
1957    blk; Symbols_And_Pictographs_Ext_A    ; Symbols_And_Pictographs_Extended_A
1958    blk; Tamil_Sup                        ; Tamil_Supplement
1959    blk; Wancho                           ; Wancho
1960  -> add to uchar.h
1961    use long property names for enum constants,
1962    for the trailing comment get the block start code point: diff old & new Blocks.txt
1963  -> add to UCharacter.UnicodeBlock IDs
1964    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1965            replace  public static final int \1_ID = \2; \3
1966  -> add to UCharacter.UnicodeBlock objects
1967    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1968            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1969
1970    sc ; Elym                             ; Elymaic
1971    sc ; Hmnp                             ; Nyiakeng_Puachue_Hmong
1972    sc ; Nand                             ; Nandinagari
1973    sc ; Wcho                             ; Wancho
1974  -> uscript.h & com.ibm.icu.lang.UScript
1975  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1976      and in com.ibm.icu.dev.test.lang.TestUScript.java
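
The two Eclipse find/replace steps for the UnicodeBlock IDs and objects can be tried outside the IDE. A Python 3 sketch with the same regular expressions; the input line is only an illustration in the style of uchar.h, and its numeric value and comment are placeholders.

```python
import re

# Illustrative uchar.h-style enum line; value and comment are placeholders.
c_line = "    UBLOCK_WANCHO = 300, /*[1E2C0]*/"

# Step 1: UnicodeBlock IDs (three groups: name, number, trailing comment).
id_line = re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                 r"public static final int \1_ID = \2; \3", c_line)

# Step 2: UnicodeBlock objects (two groups: name, trailing comment).
object_line = re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                     r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                     c_line)

print(id_line)
print(object_line)
```

The second pattern has only two capture groups, so the trailing comment is group 2 there.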
1977
1978* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1979    (not strictly necessary for NOT_ENCODED scripts)
1980  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1981
1982* update spoof checker UnicodeSet initializers:
1983    inclusionPat & recommendedPat in uspoof.cpp
1984    INCLUSION & RECOMMENDED in SpoofChecker.java
1985- make sure that the Unicode Tools tree contains the latest security data files
1986- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1987- update the hardcoded version number there in the DIRECTORY path
1988- run the tool (no special environment variables needed)
1989- copy & paste from the Console output into the .cpp & .java files
1990
1991* generate normalization data files
1992  cd $ICU_ROOT/dbg/icu4c
1993  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1994  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
1995  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1996  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1997  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
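
The .nrm command lines above share one shape: each output is built from nfc.txt plus the files layered on top of it. A Python 3 sketch that only prints the command lines (nothing is executed); the function and table names are made up, and the --csource variant is left out.

```python
# Output file -> input files, as in the gennorm2 command lines above.
NORM_FILES = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}

def gennorm2_commands(data_in, unidata):
    """Build one argument list per .nrm file."""
    return [
        ["bin/gennorm2", "-o", f"{data_in}/{out}", "-s", f"{unidata}/norm2", *inputs]
        for out, inputs in NORM_FILES.items()
    ]

for cmd in gennorm2_commands("$ICU4C_DATA_IN", "$ICU4C_UNIDATA"):
    print(" ".join(cmd))
```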
1998
1999* build ICU (make install)
2000  so that the tools build can pick up the new definitions from the installed header files.
2001
2002  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
2003
2004* build Unicode tools using CMake+make
2005
2006$ICU_SRC/tools/unicode/c/icudefs.txt:
2007
2008# Location (--prefix) of where ICU was installed.
2009set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
2010# Location of the ICU4C source tree.
2011set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
2012
2013  $ICU_ROOT/dbg$
2014    mkdir -p tools/unicode/c
2015    cd tools/unicode/c
2016
2017  $ICU_ROOT/dbg/tools/unicode/c$
2018    cmake ../../../../src/tools/unicode/c
2019    make
2020
2021* generate core properties data files
2022  $ICU_ROOT/dbg/tools/unicode/c$
2023    genprops/genprops $ICU_SRC/icu4c
2024    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
2025    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
2026- rebuild ICU (make install) & tools
2027
2028* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2029  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2030- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2031- Unicode 6.0..12.0: U+2260, U+226E, U+226F
2032- nothing new in this Unicode version, no test file to update
2033
2034* run & fix ICU4C tests
2035- update test of default bidi classes:
2036  Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
2037  see diffs in DerivedBidiClass.txt
2038  + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
2039  + UCharacterTest.java TestIteration() defaultBidi[]
2040- Andy handles RBBI & spoof check test failures
2041
2042* collation: CLDR collation root, UCA DUCET
2043
2044- UCA DUCET goes into Mark's Unicode tools, see
2045    https://sites.google.com/site/unicodetools/home#TOC-UCA
2046  diff the main mapping file, look for bad changes
2047  (for example, more bytes per weight for common characters)
2048    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
2049    ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt
2050
2051- CLDR root data files are checked into $CLDR_SRC/common/uca/
2052    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
2053
2054- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2055    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
2056- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2057    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
2058    (note removing the underscore before "Rules")
2059    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
2060- restore TODO diffs in UCARules.txt
2061    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
2062- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2063  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2064  from the CLDR root files (..._CLDR_..._SHORT.txt)
2065    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2066    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2067    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
2068- if CLDR common/uca/unihan-index.txt changes, then update
2069  CLDR common/collation/root.xml <collation type="private-unihan">
2070  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
2071
2072- run genuca, see command line above;
2073  deal with
2074    Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
2075    FDD1 119CE;	[71 CD 02, 05, 05]	# Nandinagari first primary (compressible)
2076        (add the character to genuca.cpp sampleCharsToScripts[])
2077  + This time, I added code to genuca.cpp that calls uscript_getSampleUnicodeString(script)
2078    and caches the results.
2079    This works as long as the script metadata is updated before the collation data.
2080- rebuild ICU4C
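
The idea of looking up the script for a first-primary sample character from cached script metadata can be sketched as follows. This Python 3 code is not the genuca.cpp implementation; the function name and the stand-in metadata table are hypothetical, and get_sample merely plays the role of uscript_getSampleUnicodeString(script).

```python
def build_sample_map(scripts, get_sample):
    """Map each script's sample character to the script; get_sample is called once per script."""
    sample_to_script = {}
    for script in scripts:
        sample = get_sample(script)
        if sample:
            sample_to_script.setdefault(sample, script)
    return sample_to_script

# Hypothetical stand-in for the script metadata.
FAKE_METADATA = {"Nand": "\U000119CE", "Latn": "L"}

lookup = build_sample_map(FAKE_METADATA, FAKE_METADATA.get)
print(lookup.get("\U000119CE"))
```

A character that is missing from the map corresponds to the "Unknown script" error above, which is why the script metadata has to be current first.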
2081
2082* Unihan collators
2083    https://sites.google.com/site/unicodetools/unihan
2084- run Unicode Tools
2085    org.unicode.draft.GenerateUnihanCollators
2086  with VM arguments
2087    -ea
2088    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
2089    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
2090    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
2091    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2092    -DUVERSION=12.0.0
2093- run Unicode Tools
2094    org.unicode.draft.GenerateUnihanCollatorFiles
2095  with the same arguments
2096- check CLDR diffs
2097    cd $CLDR_SRC
2098    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2099    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2100- copy to CLDR
2101    cd $CLDR_SRC
2102    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2103    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2104- run CLDR unit tests, commit to CLDR
2105- generate ICU zh collation data: run CLDR
2106    org.unicode.cldr.icu.NewLdml2IcuConverter
2107  with program arguments
2108    -t collation
2109    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
2110    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
2111    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
2112    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
2113    zh
2114  and VM arguments
2115    -ea
2116    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2117- rebuild ICU4C
2118
2119* run & fix ICU4C tests, now with new CLDR collation root data
2120- run all tests with the collation test data *_SHORT.txt or the full files
2121  (the full ones have comments, useful for debugging)
2122- note on intltest: if collate/UCAConformanceTest fails, then
2123  utility/MultithreadTest/TestCollators will fail as well;
2124  fix the conformance test before looking into the multi-thread test
2125
2126* update Java data files
2127- refresh just the UCD/UCA-related/derived files, just to be safe
2128- see (ICU4C)/source/data/icu4j-readme.txt
2129- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2130- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2131  output:
2132    ...
2133    Unicode .icu files built to ./out/build/icudt63l
2134    echo timestamp > uni-core-data
2135    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
2136    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
2137    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2138    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
2139    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
2140    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
2141    mkdir -p /tmp/icu4j/main/shared/data
2142    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2143    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
2144    mkdir -p /tmp/icu4j/main/shared/data
2145    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2146    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
2147- copy the big-endian Unicode data files to another location,
2148  separate from the other data files,
2149  and then refresh ICU4J
2150    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2151    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2152    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2153    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2154    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2155    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2156    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2157    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2158    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2159    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2160
2161* When refreshing all of ICU4J data from ICU4C
2162- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2163- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
2164or
2165- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
2166
2167* update CollationFCD.java
2168  + copy & paste the initializers of lcccIndex[] etc. from
2169    ICU4C/source/i18n/collationfcd.cpp to
2170    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2171
2172* refresh Java test .txt files
2173- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2174    cd $ICU_SRC/icu4c/source/data/unidata
2175    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2176    cd ../../test/testdata
2177    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2178    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2179
2180* run & fix ICU4J tests
2181
2182*** API additions
2183- send notice to icu-design about new born-@stable API (enum constants etc.)
2184
2185*** CLDR numbering systems
2186- look for new sets of decimal digits (gc=Nd & nv=4) and add them to CLDR;
2187  for example, run
2188    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
2189  and check the matches that fall in new blocks (Blocks.txt)
2190  Unicode 12: using Unicode 12 CLDR ticket #11478
2191    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
2192    wcho 1E2F0..1E2F9 Wancho
2193  Unicode 11: using Unicode 11 CLDR ticket #10978
2194    rohg 10D30..10D39 Hanifi_Rohingya
2195    gong 11DA0..11DA9 Gunjala_Gondi
2196  Earlier: CLDR tickets specific to adding new numbering systems.
2197  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
2198  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
2199
2200*** merge the Unicode update branches back onto the trunk
2201- do not merge the icudata.jar and testdata.jar,
2202  instead rebuild them from merged & tested ICU4C
2203- make sure that changes to Unicode tools are checked in:
2204  http://www.unicode.org/utility/trac/log/trunk/unicodetools
2205
2206---------------------------------------------------------------------------- ***
2207
2208ICU 63: added support for the text layout properties InPC, InSC, vo
2209
2210* Command-line environment setup
2211
2212UNICODE_DATA=~/unidata/uni11/20180609
2213CLDR_SRC=~/svn.cldr/uni
2214ICU_ROOT=~/icu/mine
2215ICU_SRC=$ICU_ROOT/src
2216ICUDT=icudt62b
2217ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
2218ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
2219export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
2220
2221*** Links
2222
2223https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
2224https://unicode-org.atlassian.net/browse/ICU-12850 vo
2225
2226*** data files & enums & parser code
2227
2228* API additions
2229- for each of the three new enumerated properties
2230  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
2231  + uchar.h: update UCHAR_INT_LIMIT
2232  + uchar.h: add the enum U<long prop name>
2233    with constants U_<short prop name>_<long value name>
2234  + UProperty.java: add the constant <long prop name>
2235  + UProperty.java: update INT_LIMIT
2236  + UCharacter.java: add the interface <long prop name>
2237    with constants <long value name>
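
How the name templates above expand for one property can be shown with a small Python 3 sketch. The casing rules (upper case for constants, underscores removed for type names) are my reading of the placeholders, and the function name is made up.

```python
def api_names(long_prop, short_prop, long_value):
    """Expand the name templates above for one property and one of its values."""
    type_name = long_prop.replace("_", "")
    return {
        "uchar.h UProperty constant": "UCHAR_" + long_prop.upper(),
        "uchar.h enum": "U" + type_name,
        "uchar.h value constant": f"U_{short_prop.upper()}_{long_value.upper()}",
        "UProperty.java constant": long_prop.upper(),
        "UCharacter.java interface": type_name,
        "UCharacter.java value constant": long_value.upper(),
    }

for kind, name in api_names("Indic_Positional_Category", "InPC", "Top_And_Bottom").items():
    print(f"{kind}: {name}")
```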
2238
2239* process and/or copy files
2240- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
2241  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2242  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
2243    names and aliases.
2244  + For debugging, and tweaking how ppucd.txt is written,
2245    the tool has an --only_ppucd option:
2246    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
2247
2248* preparseucd.py changes
2249- add new property short names (uppercase) to _prop_and_value_re
2250  so that ParseUCharHeader() parses the new enum constants
2251
2252* build ICU (make install)
2253  so that the tools build can pick up the new definitions from the installed header files.
2254
2255  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2256
2257* build Unicode tools using CMake+make
2258
2259$ICU_SRC/tools/unicode/c/icudefs.txt:
2260
2261# Location (--prefix) of where ICU was installed.
2262set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
2263# Location of the ICU4C source tree.
2264set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)
2265
2266  $ICU_ROOT/dbg$
2267    mkdir -p tools/unicode/c
2268    cd tools/unicode/c
2269
2270  $ICU_ROOT/dbg/tools/unicode/c$
2271    cmake ../../../../src/tools/unicode/c
2272    make
2273
2274* generate core properties data files
2275  $ICU_ROOT/dbg/tools/unicode/c$
2276    genprops/genprops $ICU_SRC/icu4c
2277- rebuild ICU (make install) & tools
2278
2279* write data for runtime, hardcoded for now
2280- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
2281- generate new icu4c/source/common/ulayout_props_data.h
2282- for each of the three new enumerated properties
2283  + int property max value
2284  + small, 8-bit UCPTrie
2285    (A small 16-bit trie with bit fields for these three properties
2286    is very nearly the same size as the sum of the three.)
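
For comparison, the alternative mentioned in parentheses (one 16-bit value with bit fields for all three properties) could look like this Python 3 sketch. The bit widths are hypothetical; real widths would follow from each property's max value.

```python
# Hypothetical bit widths; 5 + 6 + 2 = 13 bits fit into a 16-bit trie value.
INPC_BITS, INSC_BITS, VO_BITS = 5, 6, 2

def pack(inpc, insc, vo):
    """Combine the three property values into one integer."""
    assert inpc < (1 << INPC_BITS) and insc < (1 << INSC_BITS) and vo < (1 << VO_BITS)
    return (inpc << (INSC_BITS + VO_BITS)) | (insc << VO_BITS) | vo

def unpack(value):
    """Split a packed value back into (inpc, insc, vo)."""
    vo = value & ((1 << VO_BITS) - 1)
    insc = (value >> VO_BITS) & ((1 << INSC_BITS) - 1)
    inpc = value >> (INSC_BITS + VO_BITS)
    return inpc, insc, vo

print(unpack(pack(3, 20, 1)))
```

With three separate 8-bit tries, as described above, each getter reads its value directly and needs no shifting or masking.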
2287
2288* wire into C++
2289- uprops.cpp: #include ulayout_props_data.h
2290- uprops.cpp: add getInPC() etc. functions
2291- uprops.cpp: add lines to intProps[], include max values
2292- uprops.h: add UPropertySource constants
2293- uprops.cpp: add uprops_addPropertyStarts(src)
2294- uniset_props.cpp: add to UnicodeSet_initInclusion()
2295- intltest/ucdtest.cpp: write unit tests
2296
2297* update Java data files
2298- refresh just the pnames.icu file with the new property [value] names, just to be safe
2299- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
2300- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2301- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2302- copy the big-endian Unicode data files to another location,
2303  separate from the other data files,
2304  and then refresh ICU4J
2305    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2306    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2307    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2308
2309* wire into Java
2310- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
2311- UCharacterProperty.java: for each new property
2312  + create a nested class to hold its CodePointTrie
2313  + initialize it from a string literal
2314  + paste in the initializer printed by genprops
2315  + add a new IntProperty object to the intProps[] array
2316  + use the correct max int value for each property, also printed by genprops
2317- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
2318- UnicodeSet.java: add to getInclusions()
2319- UCharacterTest.java: write unit tests
2320
2321---------------------------------------------------------------------------- ***
2322
2323Unicode 11.0 update for ICU 62
2324
2325http://www.unicode.org/versions/Unicode11.0.0/
2326http://unicode.org/versions/beta-11.0.0.html
2327https://www.unicode.org/review/pri372/
2328http://www.unicode.org/reports/uax-proposed-updates.html
2329http://www.unicode.org/reports/tr44/tr44-21.html
2330
2331* Command-line environment setup
2332
2333UNICODE_DATA=~/unidata/uni11/20180521
2334CLDR_SRC=~/svn.cldr/uni
2335ICU_ROOT=~/svn.icu/uni
2336ICU_SRC=$ICU_ROOT/src
2337ICUDT=icudt61b
2338ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
2339ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
2340export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
2341
2342*** ICU Trac
2343
2344- ticket:13630: Unicode 11
2345- ^/branches/markus/uni11
2346
2347*** CLDR Trac
2348
2349- cldrbug 10978: Unicode 11
2350- ^/branches/markus/uni11
2351
2352*** Unicode version numbers
2353- makedata.mak
2354- uchar.h
2355- com.ibm.icu.util.VersionInfo
2356- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2357
2358- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2359  so that the makefiles see the new version number.
2360
2361*** data files & enums & parser code
2362
2363* download files
2364- mkdir -p $UNICODE_DATA
2365- download Unicode files into $UNICODE_DATA
2366  + subfolders: emoji, idna, security, ucd, uca
2367  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2368
2369* for manual diffs and for Unicode Tools input data updates:
2370  remove version suffixes from the file names
2371    ~$ unidata/desuffixucd.py $UNICODE_DATA
2372  (see https://sites.google.com/site/unicodetools/inputdata)
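
The suffix removal can be sketched in Python as follows. This is a minimal
illustration, not the actual desuffixucd.py: the function name
strip_version_suffix and the assumed suffix shape (a hyphen, a dotted version
number, and an optional draft marker such as d1, right before the file
extension) are assumptions made for this example.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>" plus an optional draft
# marker like "d12", immediately before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(filename):
    """Return filename without a trailing version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", filename)


if __name__ == "__main__":
    for name in ("UnicodeData-11.0.0d1.txt", "Blocks-11.0.0.txt", "confusables.txt"):
        print(name, "->", strip_version_suffix(name))
```

File names without a matching suffix are returned unchanged, so the function
can be applied to a whole directory listing.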
2373
2374* process and/or copy files
2375- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
2376  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2377  + For debugging, and tweaking how ppucd.txt is written,
2378    the tool has an --only_ppucd option:
2379    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
2380
2381- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
2382
2383* build ICU (make install)
2384  so that the tools build can pick up the new definitions from the installed header files.
2385
2386  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2387
2388* preparseucd.py changes
2389- fix errors
2390    NameError: unknown property Extended_Pictographic
2391  -> add Extended_Pictographic binary property
2392  -> add new short names for all Emoji properties
2393
2394* new constants for new property values
2395- preparseucd.py error:
2396    ValueError: missing uchar.h enum constants for some property values:
2397    [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
2398                   u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
2399                   u'Indic_Siyaq_Numbers'])),
2400     (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
2401     (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
2402     (u'GCB', set([u'LinkC', u'Virama'])),
2403     (u'WB', set([u'WSegSpace']))]
2404  = PropertyValueAliases.txt new property values (diff old & new .txt files)
2405    blk; Chess_Symbols                    ; Chess_Symbols
2406    blk; Dogra                            ; Dogra
2407    blk; Georgian_Ext                     ; Georgian_Extended
2408    blk; Gunjala_Gondi                    ; Gunjala_Gondi
2409    blk; Hanifi_Rohingya                  ; Hanifi_Rohingya
2410    blk; Indic_Siyaq_Numbers              ; Indic_Siyaq_Numbers
2411    blk; Makasar                          ; Makasar
2412    blk; Mayan_Numerals                   ; Mayan_Numerals
2413    blk; Medefaidrin                      ; Medefaidrin
2414    blk; Old_Sogdian                      ; Old_Sogdian
2415    blk; Sogdian                          ; Sogdian
2416  -> add to uchar.h
2417    use long property names for enum constants,
2418    for the trailing comment get the block start code point: diff old & new Blocks.txt
2419  -> add to UCharacter.UnicodeBlock IDs
2420    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2421            replace  public static final int \1_ID = \2; \3
2422  -> add to UCharacter.UnicodeBlock objects
2423    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2424            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
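
Outside Eclipse, the same two find/replace steps can be reproduced with
Python's re module. This is a sketch under assumptions: the helper names are
made up for this example, and the sample enum value 282 is illustrative only;
take the real values from uchar.h.

```python
import re

# The two "find" patterns from the notes above, unchanged.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_id_constant(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock *_ID constant."""
    return ID_FIND.sub(r"public static final int \1_ID = \2; \3", line)


def to_block_object(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object declaration."""
    return OBJ_FIND.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)


if __name__ == "__main__":
    sample = "UBLOCK_DOGRA = 282, /*[11800]*/"  # illustrative value
    print(to_id_constant(sample))
    print(to_block_object(sample))
```

Leading indentation is preserved because the patterns only replace the part
of the line that starts at UBLOCK_.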
2425
2426    GCB; LinkC                            ; LinkingConsonant
2427    GCB; Virama                           ; Virama
2428  -> uchar.h & UCharacter.GraphemeClusterBreak
2429  -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
2430
2431    InSC; Consonant_Initial_Postfixed     ; Consonant_Initial_Postfixed
2432  -> ignore: ICU does not yet support this property
2433
2434    jg ; Hanifi_Rohingya_Kinna_Ya         ; Hanifi_Rohingya_Kinna_Ya
2435    jg ; Hanifi_Rohingya_Pa               ; Hanifi_Rohingya_Pa
2436  -> uchar.h & UCharacter.JoiningGroup
2437
2438    sc ; Dogr                             ; Dogra
2439    sc ; Gong                             ; Gunjala_Gondi
2440    sc ; Maka                             ; Makasar
2441    sc ; Medf                             ; Medefaidrin
2442    sc ; Rohg                             ; Hanifi_Rohingya
2443    sc ; Sogd                             ; Sogdian
2444    sc ; Sogo                             ; Old_Sogdian
2445  -> uscript.h & com.ibm.icu.lang.UScript
2446  -> Nushu had been added already
2447  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2448      and in com.ibm.icu.dev.test.lang.TestUScript.java
2449
2450    WB ; WSegSpace                        ; WSegSpace
2451  -> uchar.h & UCharacter.WordBreak
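
The "diff old & new .txt files" step for PropertyValueAliases.txt can also be
scripted. The following Python sketch lists the value short names that appear
only in the new file, grouped by property. The function names are hypothetical,
and the parser is deliberately simple: it uses the second field as the value
key, which for ccc lines is the numeric value rather than a short name.

```python
def parse_value_aliases(text):
    """Map each property short name to the set of its value short names."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue  # not a "property; short; long" data line
        values.setdefault(fields[0], set()).add(fields[1])
    return values


def new_values(old_text, new_text):
    """Return {property: sorted list of values present only in new_text}."""
    old = parse_value_aliases(old_text)
    new = parse_value_aliases(new_text)
    added = {}
    for prop, vals in new.items():
        diff = vals - old.get(prop, set())
        if diff:
            added[prop] = sorted(diff)
    return added


if __name__ == "__main__":
    old = "sc ; Nshu ; Nushu\nWB ; XX ; Other\n"
    new = old + "sc ; Dogr ; Dogra\nWB ; WSegSpace ; WSegSpace\n"
    print(new_values(old, new))
```

In practice the two texts would be read from the old and new
PropertyValueAliases.txt files.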
2452
2453* New short names for emoji properties
2454- see UTS #51
2455- short names set in preparseucd.py
2456
2457* New properties
2458- boolean emoji property Extended_Pictographic
2459  -> added in preparseucd.py
2460  -> uchar.h & UProperty.java
2461- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
2462  as shown in PropertyValueAliases.txt
2463  -> ignore for now
2464
2465* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2466    (not strictly necessary for NOT_ENCODED scripts)
2467  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
2468
2469* update spoof checker UnicodeSet initializers:
2470    inclusionPat & recommendedPat in uspoof.cpp
2471    INCLUSION & RECOMMENDED in SpoofChecker.java
2472- make sure that the Unicode Tools tree contains the latest security data files
2473- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
2474- update the hardcoded version number there in the DIRECTORY path
2475- run the tool (no special environment variables needed)
2476- copy & paste from the Console output into the .cpp & .java files
2477
2478* generate normalization data files
2479  cd $ICU_ROOT/dbg/icu4c
2480  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
2481  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
2482  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
2483  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2484  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
2485
2486* build ICU (make install)
2487  so that the tools build can pick up the new definitions from the installed header files.
2488
2489  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2490
2491* build Unicode tools using CMake+make
2492
2493$ICU_SRC/tools/unicode/c/icudefs.txt:
2494
2495# Location (--prefix) of where ICU was installed.
2496set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
2497# Location of the ICU4C source tree.
2498set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
2499
2500  $ICU_ROOT/dbg$
2501    mkdir -p tools/unicode/c
2502    cd tools/unicode/c
2503
2504  $ICU_ROOT/dbg/tools/unicode/c$
2505    cmake ../../../../src/tools/unicode/c
2506    make
2507
2508* generate core properties data files
2509  $ICU_ROOT/dbg/tools/unicode/c$
2510    genprops/genprops $ICU_SRC/icu4c
2511    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
2512    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
2513- rebuild ICU (make install) & tools
2514
2515* Fix case props
2516    genprops error: casepropsbuilder: too many exceptions words
2517    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
2518- With the addition of Georgian Mtavruli capital letters,
2519  there are now too many simple case mappings with big mapping deltas
2520  that yield uncompressible exceptions.
2521- Fix: changed the data structure (now formatVersion 4),
2522  adding one bit for no-simple-case-folding (for Cherokee), and
2523  one optional slot for a big delta (for most faraway mappings),
2524  together with another bit for whether that delta is negative.
2525  This makes most Cherokee, Georgian, and similar case mappings compressible,
2526  reducing the number of exceptions words.
2527- Further changes gain one more bit for the exceptions index,
2528  for future growth. For details, see casepropsbuilder.cpp.
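
To see why these mappings count as "faraway", the following Python sketch
computes the delta between a character and its simple lowercase mapping and
splits it into a magnitude plus a negative flag, in the spirit of the
big-delta slot and sign bit described above. It is illustrative only and does
not reproduce the ICU data format: the function names are hypothetical, and
the 9-bit field width used in the demo is an assumption for the example.

```python
def split_delta(code_point, mapping):
    """Return (magnitude, is_negative) for the delta mapping - code_point."""
    delta = mapping - code_point
    return abs(delta), delta < 0


def fits_signed(delta, bits):
    """True if delta fits into a signed two's-complement field of `bits` bits."""
    limit = 1 << (bits - 1)
    return -limit <= delta < limit


if __name__ == "__main__":
    # U+1C90 GEORGIAN MTAVRULI CAPITAL LETTER AN lowercases to U+10D0.
    print(split_delta(0x1C90, 0x10D0))
    # U+13A0 CHEROKEE LETTER A lowercases to U+AB70.
    print(split_delta(0x13A0, 0xAB70))
    # 9 bits is an assumed width for a small inline delta field.
    print(fits_signed(0x10D0 - 0x1C90, 9))
```

Both deltas are in the thousands, so they do not fit a small signed field
and need either a full exception entry or a compact magnitude-plus-sign form.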
2529
2530* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2531  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2532- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2533- Unicode 6.0..11.0: U+2260, U+226E, U+226F
2534- nothing new in this Unicode version, no test file to update
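
The grep can be narrowed to non-ASCII characters with a short script. This
Python sketch assumes the usual semicolon-separated layout (code point or
range, then status, then optional further fields, with # comments); the
function name and the sample lines in the demo are written for this example
in the style of the data file, not copied from it.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) ranges with status disallowed_STD3_valid
    that start above the ASCII range."""
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if start > 0x7F:
            yield start, end


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # <control-0000>..COMMA",
        "00E0..00F6    ; valid                   # LATIN SMALL LETTER A WITH GRAVE..",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print("U+%04X..U+%04X" % (start, end))
```

Comparing the output for the old and new data files shows whether any
characters beyond U+2260, U+226E, and U+226F need new test cases.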
2535
2536* run & fix ICU4C tests
2537- Andy handles RBBI & spoof check test failures
2538
2539- Errors in char.txt, word.txt, word_POSIX.txt like
2540    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET"  at line 46, column 16
2541  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
2542  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
2543     not empty, just to get ICU building.
2544  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
2545     and properties together with the rules that used them (GB 10, WB 14).
2546  -> Andy adjusts the rule sets further to sync with
2547     Unicode 11 grapheme, word, and line break spec changes.
2548
2549* collation: CLDR collation root, UCA DUCET
2550
2551- UCA DUCET goes into Mark's Unicode tools, see
2552    https://sites.google.com/site/unicodetools/home#TOC-UCA
2553  diff the main mapping file, look for bad changes
2554  (for example, more bytes per weight for common characters)
2555    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
2556    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
2557
2558- CLDR root data files are checked into $CLDR_SRC/common/uca/
2559    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
2560
2561- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2562    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
2563- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2564    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
2565    (note removing the underscore before "Rules")
2566    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
2567- restore TODO diffs in UCARules.txt
2568    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
2569- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2570  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2571  from the CLDR root files (..._CLDR_..._SHORT.txt)
2572    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2573    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2574    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
2575- if CLDR common/uca/unihan-index.txt changes, then update
2576  CLDR common/collation/root.xml <collation type="private-unihan">
2577  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
2578
2579- run genuca, see command line above;
2580  deal with
2581    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
2582    FDD1 1180B;	[71 CC 02, 05, 05]	# Dogra first primary (compressible)
2583        (add the character to genuca.cpp sampleCharsToScripts[])
2584  + look up the USCRIPT_ code for the new sample characters
2585    (should be obvious from the comment in the error output)
2586  + *add* mappings to sampleCharsToScripts[], do not replace them
2587    (in case the script sample characters flip-flop)
2588  + insert new scripts in DUCET script order, see the top_byte table
2589    at the beginning of FractionalUCA.txt
2590- rebuild ICU4C
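
To check sampleCharsToScripts[] ahead of a genuca run, the first-primary lines
can be pulled out of FractionalUCA.txt with a small script. This Python sketch
assumes that every such line has the shape of the one quoted in the error
output above (FDD1, the sample character, then a comment of the form
"<script> first primary"); the function name is hypothetical.

```python
import re

# Shape assumed from the line quoted in the genuca error output.
_FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-F]{4,6});.*#\s*(\S+) first primary")


def first_primary_sample(line):
    """Return (sample code point, script name from the comment), or None."""
    m = _FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)


if __name__ == "__main__":
    line = "FDD1 1180B;\t[71 CC 02, 05, 05]\t# Dogra first primary (compressible)"
    cp, script = first_primary_sample(line)
    print("U+%04X %s" % (cp, script))
```

The script name still has to be matched to its USCRIPT_ constant by hand, as
described in the steps above.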
2591
2592* Unihan collators
2593    https://sites.google.com/site/unicodetools/unihan
2594- run Unicode Tools
2595    org.unicode.draft.GenerateUnihanCollators
2596  with VM arguments
2597    -ea
2598    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
2599    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
2600    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
2601    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2602    -DUVERSION=11.0.0
2603- run Unicode Tools
2604    org.unicode.draft.GenerateUnihanCollatorFiles
2605  with the same arguments
2606- check CLDR diffs
2607    cd $CLDR_SRC
2608    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2609    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2610- copy to CLDR
2611    cd $CLDR_SRC
2612    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2613    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2614- run CLDR unit tests, commit to CLDR
2615- generate ICU zh collation data: run CLDR
2616    org.unicode.cldr.icu.NewLdml2IcuConverter
2617  with program arguments
2618    -t collation
2619    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
2620    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
2621    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
2622    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
2623    zh
2624  and VM arguments
2625    -ea
2626    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2627- rebuild ICU4C
2628
2629* run & fix ICU4C tests, now with new CLDR collation root data
2630- run all tests with the collation test data *_SHORT.txt or the full files
2631  (the full ones have comments, useful for debugging)
2632- note on intltest: if collate/UCAConformanceTest fails, then
2633  utility/MultithreadTest/TestCollators will fail as well;
2634  fix the conformance test before looking into the multi-thread test
2635
2636* update Java data files
2637- refresh just the UCD/UCA-related/derived files, just to be safe
2638- see (ICU4C)/source/data/icu4j-readme.txt
2639- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2640- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2641  output:
2642    ...
2643    Unicode .icu files built to ./out/build/icudt61l
2644    echo timestamp > uni-core-data
2645    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
2646    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
2647    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2648    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
2649    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
2650    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
2651    mkdir -p /tmp/icu4j/main/shared/data
2652    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2653    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
2654    mkdir -p /tmp/icu4j/main/shared/data
2655    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2656    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
2657- copy the big-endian Unicode data files to another location,
2658  separate from the other data files,
2659  and then refresh ICU4J
2660    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2661    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2662    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2663    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2664    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2665    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2666    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2667    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2668    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2669    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2670
2671* When refreshing all of ICU4J data from ICU4C
2672- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2673- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
2674or
2675- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
2676
2677* update CollationFCD.java
2678  + copy & paste the initializers of lcccIndex[] etc. from
2679    ICU4C/source/i18n/collationfcd.cpp to
2680    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2681
2682* refresh Java test .txt files
2683- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2684    cd $ICU_SRC/icu4c/source/data/unidata
2685    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2686    cd ../../test/testdata
2687    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2688    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2689
2690* run & fix ICU4J tests
2691
2692*** API additions
2693- send notice to icu-design about new born-@stable API (enum constants etc.)
2694
2695*** CLDR numbering systems
2696- look for new sets of decimal digits (gc=Nd & nv=4; each digit set has exactly one nv=4 character) and add to CLDR
2697  Unicode 11: using Unicode 11 CLDR ticket #10978
2698    rohg 10D30..10D39 Hanifi_Rohingya
2699    gong 11DA0..11DA9 Gunjala_Gondi
2700  Earlier: CLDR tickets specific to adding new numbering systems.
2701  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
2702  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
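
The gc=Nd & nv=4 search can be approximated with Python's unicodedata module.
This is a sketch, not part of the ICU or CLDR tooling: the function name is
hypothetical, the result depends on the Unicode version built into the Python
interpreter (see unicodedata.unidata_version), and it assumes that each digit
set is a contiguous run of the values 0..9.

```python
import sys
import unicodedata


def decimal_digit_sets():
    """Return (start, end) code point ranges of decimal digit sets.

    Each set is located via its digit with numeric value 4, then extended
    to the assumed contiguous range of values 0..9.
    """
    sets = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 4:
            sets.append((cp - 4, cp + 5))
    return sets


if __name__ == "__main__":
    print("Unicode", unicodedata.unidata_version)
    for start, end in decimal_digit_sets():
        print("%04X..%04X" % (start, end))
```

Comparing this list against the numbering systems already in CLDR shows which
digit sets are new, provided the interpreter's Unicode data is recent enough.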
2703
2704*** merge the Unicode update branches back onto the trunk
2705- do not merge the icudata.jar and testdata.jar,
2706  instead rebuild them from merged & tested ICU4C
2707- make sure that changes to Unicode tools are checked in:
2708  http://www.unicode.org/utility/trac/log/trunk/unicodetools
2709
2710---------------------------------------------------------------------------- ***
2711
2712Unicode 10.0 update for ICU 60
2713
2714http://www.unicode.org/versions/Unicode10.0.0/
2715http://www.unicode.org/versions/beta-10.0.0.html
2716http://blog.unicode.org/2017/03/unicode-100-beta-review.html
2717http://www.unicode.org/review/pri350/
2718http://www.unicode.org/reports/uax-proposed-updates.html
2719http://www.unicode.org/reports/tr44/tr44-19.html
2720
2721* Command-line environment setup
2722
2723UNICODE_DATA=~/unidata/uni10/20170605
2724CLDR_SRC=~/svn.cldr/uni10
2725ICU_ROOT=~/svn.icu/uni10
2726ICU_SRC=$ICU_ROOT/src
2727ICUDT=icudt60b
2728ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
2729ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
2730export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
2731
2732*** ICU Trac
2733
2734- ticket:12985: Unicode 10
2735- ticket:13061: undo hacks from emoji 5.0 update
2736- ticket:13062: add Emoji_Component property
2737- ^/branches/markus/uni10
2738
2739*** CLDR Trac
2740
2741- cldrbug 10055: Unicode 10
2742- cldrbug 9882: Unicode 10 script metadata
2743- cldrbug 10219: numbering systems for Unicode 10
2744
2745*** Unicode version numbers
2746- makedata.mak
2747- uchar.h
2748- com.ibm.icu.util.VersionInfo
2749- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2750
2751- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2752  so that the makefiles see the new version number.
2753
2754*** data files & enums & parser code
2755
2756* download files
2757- mkdir -p $UNICODE_DATA
2758- download Unicode 10.0 files into $UNICODE_DATA
2759  + subfolders: ucd, uca, idna, security
2760  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2761- download emoji 5.0 files into $UNICODE_DATA/emoji
2762
2763* for manual diffs: remove version suffixes from the file names
2764  ~$ unidata/desuffixucd.py $UNICODE_DATA
2765  (see https://sites.google.com/site/unicodetools/inputdata)
2766
2767* process and/or copy files
2768- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
2769  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2770  + For debugging, and tweaking how ppucd.txt is written,
2771    the tool has an --only_ppucd option:
2772    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
2773
2774- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
2775
2776* build ICU (make install)
2777  so that the tools build can pick up the new definitions from the installed header files.
2778
2779  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2780
2781* preparseucd.py changes
2782- remove or add new Unicode scripts from/to the
2783  only-in-ISO-15924 list according to the error messages:
2784    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
2785  -> adjust _scripts_only_in_iso15924 as indicated
2786- fix other errors
2787    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
2788  -> add vo=Vertical_Orientation to _ignored_properties
2789  -> later removed again: the file is now parsed, even though we do not yet store its data for runtime use
2790
2791* new constants for new property values
2792- preparseucd.py error:
2793    ValueError: missing uchar.h enum constants for some property values:
2794    [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
2795                   u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
2796     (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
2797                  u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
2798                  u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
2799     (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
2800  = PropertyValueAliases.txt new property values (diff old & new .txt files)
2801    blk; CJK_Ext_F                        ; CJK_Unified_Ideographs_Extension_F
2802    blk; Kana_Ext_A                       ; Kana_Extended_A
2803    blk; Masaram_Gondi                    ; Masaram_Gondi
2804    blk; Nushu                            ; Nushu
2805    blk; Soyombo                          ; Soyombo
2806    blk; Syriac_Sup                       ; Syriac_Supplement
2807    blk; Zanabazar_Square                 ; Zanabazar_Square
2808  -> add to uchar.h
2809    use long property names for enum constants,
2810    for the trailing comment get the block start code point: diff old & new Blocks.txt
2811  -> add to UCharacter.UnicodeBlock IDs
2812    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2813            replace  public static final int \1_ID = \2; \3
2814  -> add to UCharacter.UnicodeBlock objects
2815    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2816            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2817
2818    jg ; Malayalam_Bha                    ; Malayalam_Bha
2819    jg ; Malayalam_Ja                     ; Malayalam_Ja
2820    jg ; Malayalam_Lla                    ; Malayalam_Lla
2821    jg ; Malayalam_Llla                   ; Malayalam_Llla
2822    jg ; Malayalam_Nga                    ; Malayalam_Nga
2823    jg ; Malayalam_Nna                    ; Malayalam_Nna
2824    jg ; Malayalam_Nnna                   ; Malayalam_Nnna
2825    jg ; Malayalam_Nya                    ; Malayalam_Nya
2826    jg ; Malayalam_Ra                     ; Malayalam_Ra
2827    jg ; Malayalam_Ssa                    ; Malayalam_Ssa
2828    jg ; Malayalam_Tta                    ; Malayalam_Tta
2829  -> uchar.h & UCharacter.JoiningGroup
2830
2831    sc ; Gonm                             ; Masaram_Gondi
2832    sc ; Nshu                             ; Nushu
2833    sc ; Soyo                             ; Soyombo
2834    sc ; Zanb                             ; Zanabazar_Square
2835  -> uscript.h & com.ibm.icu.lang.UScript
2836  -> Nushu had been added already
2837  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2838      and in com.ibm.icu.dev.test.lang.TestUScript.java
2839
2840* New properties as shown in PropertyValueAliases.txt changes
2841- boolean Emoji_Component from emoji 5
2842  -> uchar.h & UProperty.java
2843- boolean
2844    # Regional_Indicator (RI)
2845
2846    RI ; N                                ; No                               ; F                                ; False
2847    RI ; Y                                ; Yes                              ; T                                ; True
2848  -> uchar.h & UProperty.java
2849  -> single immutable range, to be hardcoded
2850- boolean
2851    # Prepended_Concatenation_Mark (PCM)
2852
2853    PCM; N                                ; No                               ; F                                ; False
2854    PCM; Y                                ; Yes                              ; T                                ; True
2855  -> was new in Unicode 9
2856  -> uchar.h & UProperty.java
2857- enumerated
2858    # Vertical_Orientation (vo)
2859
2860    vo ; R                                ; Rotated
2861    vo ; Tr                               ; Transformed_Rotated
2862    vo ; Tu                               ; Transformed_Upright
2863    vo ; U                                ; Upright
2864  -> only pre-parsed for now, but not yet stored for runtime use
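
Because Regional_Indicator covers a single immutable range, a hardcoded check
needs no data lookup. The following Python sketch shows the idea; the constant
and function names are made up for this example and are not the ICU
implementation.

```python
REGIONAL_INDICATOR_START = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
REGIONAL_INDICATOR_END = 0x1F1FF    # REGIONAL INDICATOR SYMBOL LETTER Z


def is_regional_indicator(code_point):
    """True if code_point has the Regional_Indicator property."""
    return REGIONAL_INDICATOR_START <= code_point <= REGIONAL_INDICATOR_END


if __name__ == "__main__":
    print(is_regional_indicator(0x1F1FA))  # REGIONAL INDICATOR SYMBOL LETTER U
    print(is_regional_indicator(0x1F200))  # first code point after the range
```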
2865
2866* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2867    (not strictly necessary for NOT_ENCODED scripts)
2868  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
2869
2870* generate normalization data files
2871  cd $ICU_ROOT/dbg/icu4c
2872  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
2873  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
2874  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
2875  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2876  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
2877
2878* build ICU (make install)
2879  so that the tools build can pick up the new definitions from the installed header files.
2880
2881  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2882
2883* build Unicode tools using CMake+make
2884
2885$ICU_SRC/tools/unicode/c/icudefs.txt:
2886
2887# Location (--prefix) of where ICU was installed.
2888set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
2889# Location of the ICU4C source tree.
2890set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
2891
2892  $ICU_ROOT/dbg/tools/unicode/c$
2893    cmake ../../../../src/tools/unicode/c
2894    make
2895
2896* generate core properties data files
2897  $ICU_ROOT/dbg/tools/unicode/c$
2898    genprops/genprops $ICU_SRC/icu4c
2899    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
2900    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
2901- rebuild ICU (make install) & tools
2902
2903* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2904  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2905- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2906- Unicode 6.0..10.0: U+2260, U+226E, U+226F
2907- nothing new in this Unicode version, no test file to update
2908
2909* run & fix ICU4C tests
2910- Andy handles RBBI & spoof check test failures
2911
2912* collation: CLDR collation root, UCA DUCET
2913
2914- UCA DUCET goes into Mark's Unicode tools, see
2915  https://sites.google.com/site/unicodetools/home#TOC-UCA
2916- CLDR root data files are checked into $CLDR_SRC/common/uca/
2917    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
2918
2919- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2920    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
2921- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2922    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
2923    (note removing the underscore before "Rules")
2924    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
2925- restore TODO diffs in UCARules.txt
2926    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
2927- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2928  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2929  from the CLDR root files (..._CLDR_..._SHORT.txt)
2930    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2931    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2932    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
2933- if CLDR common/uca/unihan-index.txt changes, then update
2934  CLDR common/collation/root.xml <collation type="private-unihan">
2935  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
2936
2937- run genuca, see command line above;
2938  deal with
2939    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
2940    FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)
2941        (add the character to genuca.cpp sampleCharsToScripts[])
2942  + look up the USCRIPT_ code for the new sample characters
2943    (should be obvious from the comment in the error output)
2944  + *add* mappings to sampleCharsToScripts[], do not replace them
2945    (in case the script sample characters flip-flop)
2946  + insert new scripts in DUCET script order, see the top_byte table
2947    at the beginning of FractionalUCA.txt
2948- rebuild ICU4C
2949
2950* Unihan collators
2951    https://sites.google.com/site/unicodetools/unihan
2952- run Unicode Tools
2953    org.unicode.draft.GenerateUnihanCollators
2954  with VM arguments
2955    -ea
2956    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
2957    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
2958    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
2959    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
2960    -DUVERSION=10.0.0
2961- run Unicode Tools
2962    org.unicode.draft.GenerateUnihanCollatorFiles
2963  with the same arguments
2964- check CLDR diffs
2965    cd $CLDR_SRC
2966    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2967    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2968- copy to CLDR
2969    cd $CLDR_SRC
2970    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2971    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2972- run CLDR unit tests, commit to CLDR
2973- generate ICU zh collation data: run CLDR
2974    org.unicode.cldr.icu.NewLdml2IcuConverter
2975  with program arguments
2976    -t collation
2977    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
2978    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
2979    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
2980    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
2981    zh
2982  and VM arguments
2983    -ea
2984    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
2985- rebuild ICU4C
2986
2987* run & fix ICU4C tests, now with new CLDR collation root data
2988- run all tests with the collation test data *_SHORT.txt or the full files
2989  (the full ones have comments, useful for debugging)
2990- note on intltest: if collate/UCAConformanceTest fails, then
2991  utility/MultithreadTest/TestCollators will fail as well;
2992  fix the conformance test before looking into the multi-thread test
2993
2994* update Java data files
2995- refresh just the UCD/UCA-related/derived files, just to be safe
2996- see (ICU4C)/source/data/icu4j-readme.txt
2997- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2998- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2999  output:
3000    ...
3001    Unicode .icu files built to ./out/build/icudt60l
3002    echo timestamp > uni-core-data
3003    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
3004    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
3005    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3006    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
3007    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
3008    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
3009    mkdir -p /tmp/icu4j/main/shared/data
3010    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3011    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
3012    mkdir -p /tmp/icu4j/main/shared/data
3013    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3014    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
3015- copy the big-endian Unicode data files to another location,
3016  separate from the other data files,
3017  and then refresh ICU4J
3018    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
3019    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3020    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3021    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3022    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3023    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3024    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3025    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3026    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3027    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
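
The cp/rm commands above amount to a file filter on the big-endian data folder.
A minimal Python sketch of that filter, for illustration only: the function name
and the sample paths are hypothetical, and the rules are read off the command
lines above, not taken from any ICU build tool.

```python
def is_unicode_data_file(relative_path):
    """Return True if a path below com/ibm/icu/impl/data/$ICUDT would be copied."""
    directory, _, name = relative_path.rpartition("/")
    if directory in ("coll", "brkitr"):
        # cp .../coll/* and cp .../brkitr/* take every file in those folders.
        return True
    if directory:
        # No other subfolder is copied.
        return False
    if name == "cnvalias.icu":
        # Matched by *.icu, then removed again with rm.
        return False
    return name == "confusables.cfu" or name.endswith((".icu", ".nrm"))


# Hypothetical sample listing, not the output of a real build.
SAMPLE_PATHS = [
    "uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu",
    "coll/ucadata.icu", "brkitr/root.res", "zoneinfo64.res", "zone/root.res",
]
SELECTED = [path for path in SAMPLE_PATHS if is_unicode_data_file(path)]
```

Everything that the filter rejects (time zone data, locale data, converter aliases)
stays as it was in the existing icudata.jar.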
3028
3029* When refreshing all of ICU4J data from ICU4C
3030- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3031- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
3032or
3033- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
3034
3035* update CollationFCD.java
3036  + copy & paste the initializers of lcccIndex[] etc. from
3037    ICU4C/source/i18n/collationfcd.cpp to
3038    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
3039
3040* refresh Java test .txt files
3041- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3042    cd $ICU_SRC/icu4c/source/data/unidata
3043    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3044    cd ../../test/testdata
3045    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3046    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3047
3048* run & fix ICU4J tests
3049
3050*** API additions
3051- send notice to icu-design about new born-@stable API (enum constants etc.)
3052
3053*** CLDR numbering systems
3054- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
3055  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
3056  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
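
One way to list the sets of decimal digits is to scan for gc=Nd & nv=4, which
selects one character per set. A minimal sketch using Python's unicodedata module.
Assumption: it only sees the Unicode version bundled with the Python interpreter
(see unicodedata.unidata_version), so it does not replace a look at the new UCD files.
The function name is hypothetical.

```python
import sys
import unicodedata


def digit_fours():
    """Return all code points with General_Category=Nd and a decimal value of 4."""
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 4:
            result.append(cp)
    return result
```

Comparing the result of digit_fours() between an interpreter with the old Unicode
version and one with the new version shows the candidates for the CLDR ticket.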
3057
3058*** merge the Unicode update branches back onto the trunk
3059- do not merge the icudata.jar and testdata.jar,
3060  instead rebuild them from merged & tested ICU4C
3061- make sure that changes to Unicode tools are checked in:
3062  http://www.unicode.org/utility/trac/log/trunk/unicodetools
3063
3064---------------------------------------------------------------------------- ***
3065
3066Emoji 5.0 update for ICU 59
3067- ICU 59 mostly remains on Unicode 9.0
- except that bidi and segmentation data are updated to Unicode 10 beta
3069
3070First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
3071
3072* Command-line environment setup
3073
3074ICU_ROOT=~/svn.icu/trunk
3075ICU_SRC_DIR=$ICU_ROOT/src
3076ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
3077ICUDT=icudt59b
3078export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3079SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
3080UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
3081
3082*** ICU Trac
3083
3084- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
3085- changes directly on trunk
3086
3087*** data files & enums & parser code
3088
3089* download files
3090
3091- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
3092- download emoji 5.0 beta files into the same uni90e50 folder
3093- download Unicode 10.0 beta files: ucd
3094  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
3095    BidiBrackets.txt
3096    BidiCharacterTest.txt
3097    BidiMirroring.txt
3098    BidiTest.txt
3099    extracted/DerivedBidiClass.txt
3100  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
3101    LineBreak.txt
3102    auxiliary/*
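
The resulting ucd folder mixes two Unicode versions. A small sketch of that
mapping, assuming paths relative to uni90e50/ucd; the list and function names
are hypothetical and only restate the file lists above.

```python
from fnmatch import fnmatchcase

# Files taken from Unicode 10 beta, as listed above (paths relative to ucd/).
FROM_UNICODE_10 = [
    "BidiBrackets.txt",
    "BidiCharacterTest.txt",
    "BidiMirroring.txt",
    "BidiTest.txt",
    "extracted/DerivedBidiClass.txt",
    "LineBreak.txt",
    "auxiliary/*",
]


def source_version(ucd_relative_path):
    """Return which download supplies a file in the combined uni90e50/ucd folder."""
    for pattern in FROM_UNICODE_10:
        if fnmatchcase(ucd_relative_path, pattern):
            return "10.0 beta"
    return "9.0"
```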
3103
3104* preparseucd.py changes
3105- adjust for combined trunks
3106- write new copyright lines
3107- ignore new Emoji_Component property for now
3108
3109* process and/or copy files
3110- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
3111  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3112
3113- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
3114
3115* build ICU (make install)
3116  so that the tools build can pick up the new definitions from the installed header files.
3117
3118  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
3119
3120* build Unicode tools using CMake+make
3121
3122~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
3123
3124# Location (--prefix) of where ICU was installed.
3125set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
3126# Location of the ICU4C source tree.
3127set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
3128
3129  ~/svn.icu/trunk/dbg/tools/unicode/c$
3130    cmake ../../../../src/tools/unicode/c
3131    make
3132
3133* generate core properties data files
3134  ~/svn.icu/trunk/dbg/tools/unicode/c$
3135    genprops/genprops $ICU4C_SRC_DIR
3136- rebuild ICU (make install) & tools
3137
3138* run & fix ICU4C tests
3139- Andy handles RBBI & spoof check test failures
3140
3141* update Java data files
3142- refresh just the UCD/UCA-related/derived files, just to be safe
3143- see (ICU4C)/source/data/icu4j-readme.txt
3144- mkdir /tmp/icu4j
3145- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3146  output:
3147    ...
3148    Unicode .icu files built to ./out/build/icudt59l
3149    echo timestamp > uni-core-data
3150    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
3151    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
3152    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3153    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
3154    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
3155    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
3156    mkdir -p /tmp/icu4j/main/shared/data
3157    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3158    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
3159    mkdir -p /tmp/icu4j/main/shared/data
3160    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3161    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
3162- copy the big-endian Unicode data files to another location,
3163  separate from the other data files,
3164  and then refresh ICU4J
3165    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
3166    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3167    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3168    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3169    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3170    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3171    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
3172
3173* When refreshing all of ICU4J data from ICU4C
3174- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3175- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
3176or
3177- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
3178
3179* refresh Java test .txt files
3180- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3181    cd $ICU4C_SRC_DIR/source/data/unidata
3182    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3183    cd ../../test/testdata
3184    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3185    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3186
3187* run & fix ICU4J tests
3188
3189---------------------------------------------------------------------------- ***
3190
3191Unicode 9.0 update for ICU 58
3192
3193* Command-line environment setup
3194
3195ICU_ROOT=~/svn.icu/trunk
3196ICU_SRC_DIR=$ICU_ROOT/src
3197ICUDT=icudt58b
3198export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3199SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3200UNIDATA=$ICU_SRC_DIR/source/data/unidata
3201
3202http://www.unicode.org/review/pri323/  -- beta review
3203http://www.unicode.org/reports/uax-proposed-updates.html
3204http://www.unicode.org/versions/beta-9.0.0.html
3205http://www.unicode.org/versions/Unicode9.0.0/
3206http://www.unicode.org/reports/tr44/tr44-17.html
3207
3208*** ICU Trac
3209
3210- ticket:12526: integrate Unicode 9
3211- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
3212- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
3213
3214*** CLDR Trac
3215
3216- cldrbug 9414: UCA 9
3217- ^/branches/markus/uni90 at r11518 from trunk at r11517
3218
3219- cldrbug 8745: Unicode 9.0 script metadata
3220
3221*** Unicode version numbers
3222- makedata.mak
3223- uchar.h
3224- com.ibm.icu.util.VersionInfo
3225- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3226
3227- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3228  so that the makefiles see the new version number.
3229
3230*** data files & enums & parser code
3231
3232* file preparation
3233
3234- download UCD & IDNA files
3235- make sure that the Unicode data folder passed into preparseucd.py
3236  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3237- only for manual diffs: remove version suffixes from the file names
3238  ~/unidata/uni70/20140403$ ../../desuffixucd.py .
3239  (see https://sites.google.com/site/unicodetools/inputdata)
3240- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
3241- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3243
3244- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
3245  and copy to $UNIDATA
3246    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
3247
3248* preparseucd.py changes
3249- remove or add new Unicode scripts from/to the
3250  only-in-ISO-15924 list according to the error messages:
3251    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
3252    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
3253    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
3254    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
3255  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3256      and in com.ibm.icu.dev.test.lang.TestUScript.java
3257- DerivedNumericValues.txt new numeric values
3258    0D58          ; 0.00625 ; ; 1/160 # No       MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
3259    0D59          ; 0.025 ; ; 1/40 # No       MALAYALAM FRACTION ONE FORTIETH
3260    0D5A          ; 0.0375 ; ; 3/80 # No       MALAYALAM FRACTION THREE EIGHTIETHS
3261    0D5B          ; 0.05 ; ; 1/20 # No       MALAYALAM FRACTION ONE TWENTIETH
3262    0D5D          ; 0.15 ; ; 3/20 # No       MALAYALAM FRACTION THREE TWENTIETHS
3263  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
3264     uchar.c, UCharacterProperty.java
3265     to support a new series of values
3266- adjust preparseucd.py for Tangut algorithmic names
3267  in ppucd.txt:
3268    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
3269  ->
3270    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
3271- avoid block-compressing most String/Miscellaneous property values,
3272  triggered by genprops not coping with a multi-code point Case_Folding on
3273    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
3274  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
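
The new numeric values quoted above from DerivedNumericValues.txt can be
cross-checked mechanically. A minimal sketch with hypothetical names: it parses
the five quoted lines and compares the decimal field with the rational field.
It only checks the data; it says nothing about how uprops.h encodes the values.

```python
from fractions import Fraction

# The five lines quoted above from DerivedNumericValues.txt.
NEW_VALUES = """\
0D58          ; 0.00625 ; ; 1/160 # No       MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
0D59          ; 0.025 ; ; 1/40 # No       MALAYALAM FRACTION ONE FORTIETH
0D5A          ; 0.0375 ; ; 3/80 # No       MALAYALAM FRACTION THREE EIGHTIETHS
0D5B          ; 0.05 ; ; 1/20 # No       MALAYALAM FRACTION ONE TWENTIETH
0D5D          ; 0.15 ; ; 3/20 # No       MALAYALAM FRACTION THREE TWENTIETHS
"""


def parse_numeric_values(text):
    """Map each single code point to (decimal value, rational value)."""
    values = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        # Fields: code point; decimal value; (empty); rational value.
        code, decimal, _, rational = [field.strip() for field in data.split(";")]
        values[int(code, 16)] = (Fraction(decimal), Fraction(rational))
    return values


VALUES = parse_numeric_values(NEW_VALUES)
DENOMINATORS = sorted({rational.denominator for _, rational in VALUES.values()})
```

For these five lines the denominators are 20, 40, 80 and 160 with numerators 1 and 3,
which is the pattern that the existing numeric-value encoding did not cover.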
3275
3276* PropertyAliases.txt changes
3277- 1 new property PCM=Prepended_Concatenation_Mark
3278  Ignore: Only useful for layout engines.
3279  Ok to list in ppucd.txt.
3280
3281* PropertyValueAliases.txt new property values
3282    blk; Adlam                            ; Adlam
3283    blk; Bhaiksuki                        ; Bhaiksuki
3284    blk; Cyrillic_Ext_C                   ; Cyrillic_Extended_C
3285    blk; Glagolitic_Sup                   ; Glagolitic_Supplement
3286    blk; Ideographic_Symbols              ; Ideographic_Symbols_And_Punctuation
3287    blk; Marchen                          ; Marchen
3288    blk; Mongolian_Sup                    ; Mongolian_Supplement
3289    blk; Newa                             ; Newa
3290    blk; Osage                            ; Osage
3291    blk; Tangut                           ; Tangut
3292    blk; Tangut_Components                ; Tangut_Components
3293  -> add to uchar.h
3294    use long property names for enum constants
3295  -> add to UCharacter.UnicodeBlock IDs
3296    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3297            replace  public static final int \1_ID = \2; \3
3298  -> add to UCharacter.UnicodeBlock objects
3299    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
3300            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
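
The two Eclipse find/replace pairs above can be tried outside the IDE.
A sketch with Python's re module, using the same patterns; the sample line and
the function names are hypothetical, and Eclipse's \1 is written as \g<1> here
so that the group number cannot run into the following text.

```python
import re

# Hypothetical line in the style of the UBLOCK_ enum constants in uchar.h.
SAMPLE = "    UBLOCK_EXAMPLE_BLOCK = 999, /*[12345]*/"


def to_block_id(line):
    """First find/replace: C enum constant -> Java int ID constant."""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def to_block_object(line):
    """Second find/replace: C enum constant -> Java UnicodeBlock object."""
    return re.sub(
        r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
        r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>',
        line)
```

Both patterns require a trailing comment that starts with "/", so enum lines
without such a comment are left unchanged and need manual editing.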
3301
3302    GCB; EB                               ; E_Base
3303    GCB; EBG                              ; E_Base_GAZ
3304    GCB; EM                               ; E_Modifier
3305    GCB; GAZ                              ; Glue_After_Zwj
3306    GCB; ZWJ                              ; ZWJ
3307  -> uchar.h & UCharacter.GraphemeClusterBreak
3308
3309    jg ; African_Feh                      ; African_Feh
3310    jg ; African_Noon                     ; African_Noon
3311    jg ; African_Qaf                      ; African_Qaf
3312  -> uchar.h & UCharacter.JoiningGroup
3313
3314    lb ; EB                               ; E_Base
3315    lb ; EM                               ; E_Modifier
3316    lb ; ZWJ                              ; ZWJ
3317  -> uchar.h & UCharacter.LineBreak
3318
3319    sc ; Adlm                             ; Adlam
3320    sc ; Bhks                             ; Bhaiksuki
3321    sc ; Marc                             ; Marchen
3322    sc ; Newa                             ; Newa
3323    sc ; Osge                             ; Osage
3324    sc ; Tang                             ; Tangut
3325  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
3326
3327    WB ; EB                               ; E_Base
3328    WB ; EBG                              ; E_Base_GAZ
3329    WB ; EM                               ; E_Modifier
3330    WB ; GAZ                              ; Glue_After_Zwj
3331    WB ; ZWJ                              ; ZWJ
3332  -> uchar.h & UCharacter.WordBreak
3333
3334* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
3335    (not strictly necessary for NOT_ENCODED scripts)
3336  ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
3337
3338* generate normalization data files
3339  cd $ICU_ROOT/dbg
3340  bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
3341  bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
3342  bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
3343  bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3344  bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
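
The gennorm2 runs above differ only in the output file and in the list of input
files, which build on each other. A sketch that makes this structure explicit;
it only assembles the command lines as strings and does not run anything.
The table and function names are hypothetical.

```python
# (output file, extra flags, input files) for each gennorm2 run listed above.
NORM2_BUILDS = [
    ("$ICU_SRC_DIR/source/common/norm2_nfc_data.h", ["--csource"], ["nfc.txt"]),
    ("$SRC_DATA_IN/nfc.nrm", [], ["nfc.txt"]),
    ("$SRC_DATA_IN/nfkc.nrm", [], ["nfc.txt", "nfkc.txt"]),
    ("$SRC_DATA_IN/nfkc_cf.nrm", [], ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("$SRC_DATA_IN/uts46.nrm", [], ["nfc.txt", "uts46.txt"]),
]


def gennorm2_command(output, flags, inputs):
    """Build one command line; -s names the folder that holds the input files."""
    return " ".join(["bin/gennorm2", "-o", output, "-s", "$UNIDATA/norm2"] + inputs + flags)


COMMANDS = [gennorm2_command(*build) for build in NORM2_BUILDS]
```

Every output starts from nfc.txt, so a change to nfc.txt means regenerating all five files.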
3345
3346* build ICU (make install)
3347  so that the tools build can pick up the new definitions from the installed header files.
3348
3349  $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
3350
3351* build Unicode tools using CMake+make
3352
3353~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
3354
3355  # Location (--prefix) of where ICU was installed.
3356  set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
3357  # Location of the ICU source tree.
3358  set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
3359
3360  ~/svn.icutools/trunk/dbg/unicode/c$
3361    cmake ../../../src/unicode/c
3362    make
3363
3364* generate core properties data files
3365  ~/svn.icutools/trunk/dbg/unicode/c$
3366    genprops/genprops $ICU_SRC_DIR
3367    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
3368    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
3369- rebuild ICU (make install) & tools
3370
3371* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3372  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3373- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3374- Unicode 6.0..9.0: U+2260, U+226E, U+226F
3375- nothing new in 9.0, no test file to update
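
The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched as
follows. Assumptions: the input uses the usual UCD layout of semicolon-separated
fields with "#" comments, and the sample lines are abbreviated, hypothetical
stand-ins, not copied from a released IdnaMappingTable.txt.

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        # The first field is a single code point or a start..end range.
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        found.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return found


# Hypothetical, abbreviated sample lines.
SAMPLE_LINES = [
    "0000..002C    ; disallowed_STD3_valid",
    "002D          ; valid",
    "2260          ; disallowed_STD3_valid",
    "226E..226F    ; disallowed_STD3_valid",
]
```

Any code point in the result that is not yet in the list above is a candidate
for a new test case in uts46test.cpp and UTS46Test.java.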
3376
3377* run & fix ICU4C tests
3378- Andy handles RBBI & spoof check test failures
3379
3380* collation: CLDR collation root, UCA DUCET
3381
3382- UCA DUCET goes into Mark's Unicode tools, see
3383  https://sites.google.com/site/unicodetools/home#TOC-UCA
3384- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
3385    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
3386
3387- cd (CLDR UCA branch)/common/uca/
3388- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3389    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
3390- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3391    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note: the target file name drops the underscore before "Rules")
3393    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3394- restore TODO diffs in UCARules.txt
3395    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3396- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3397  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3398  from the CLDR root files (..._CLDR_..._SHORT.txt)
3399    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
3400    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
3401    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
3402- if CLDR common/uca/unihan-index.txt changes, then update
3403  CLDR common/collation/root.xml <collation type="private-unihan">
3404  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
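
The cp commands above rename the files on the way from CLDR to ICU4C.
A sketch of the name mapping as a table, with hypothetical names; it returns
the copy plan as path pairs and does not touch the file system.

```python
# CLDR common/uca file name -> ICU4C path below $ICU_SRC_DIR/source,
# taken from the cp commands above.
CLDR_TO_ICU4C = {
    "FractionalUCA_SHORT.txt": "data/unidata/FractionalUCA.txt",
    "UCA_Rules_SHORT.txt": "data/unidata/UCARules.txt",
    "CollationTest_CLDR_NON_IGNORABLE_SHORT.txt":
        "test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt",
    "CollationTest_CLDR_SHIFTED_SHORT.txt":
        "test/testdata/CollationTest_SHIFTED_SHORT.txt",
}


def copy_plan(cldr_uca_dir, icu_source_dir):
    """Return (source, destination) path pairs without copying anything."""
    return [(cldr_uca_dir + "/" + name, icu_source_dir + "/" + target)
            for name, target in CLDR_TO_ICU4C.items()]
```

The ICU4J copies of the CollationTest_*.txt files keep the ICU4C names,
so they need no further renaming.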
3405
3406- run genuca, see command line above;
3407  deal with
3408    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
3409    FDD1 104B5;     [75 B8 02, 05, 05]      # Osage first primary (compressible)
3410        (add the character to genuca.cpp sampleCharsToScripts[])
3411  + look up the USCRIPT_ code for the new sample characters
3412    (should be obvious from the comment in the error output)
3413  + *add* mappings to sampleCharsToScripts[], do not replace them
3414    (in case the script sample characters flip-flop)
3415  + insert new scripts in DUCET script order, see the top_byte table
3416    at the beginning of FractionalUCA.txt
3417- rebuild ICU4C
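
The code point and script name needed for sampleCharsToScripts[] can be pulled
out of the FractionalUCA.txt line in the genuca error output. A minimal sketch
with hypothetical names; the guessed USCRIPT_ name is only a starting point and
must be checked against uscript.h.

```python
import re

# The data line from the error output quoted above.
ERROR_LINE = "FDD1 104B5;     [75 B8 02, 05, 05]      # Osage first primary (compressible)"

FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-F]{4,6});.*#\s*(\S+) first primary")


def parse_first_primary(line):
    """Return (sample code point, script name from the comment), or None."""
    match = FIRST_PRIMARY.match(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


def guess_uscript_constant(script_name):
    """Guess the USCRIPT_ constant name; verify it against uscript.h."""
    return "USCRIPT_" + script_name.upper()
```

For the line above, parse_first_primary() yields U+104B5 and "Osage",
and the guessed constant is USCRIPT_OSAGE.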
3418
3419* Unihan collators
3420- run Unicode Tools
3421    org.unicode.draft.GenerateUnihanCollators
3422  with VM arguments
3423    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
3424    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
3425    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
3426    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
3427    -DUVERSION=9.0.0
3428    -ea
3429- run Unicode Tools
3430    org.unicode.draft.GenerateUnihanCollatorFiles
3431  with the same arguments
3432- check CLDR diffs
3433    cd ~/svn.cldr/trunk
3434    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
3435    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
3436- copy to CLDR
3437    cd ~/svn.cldr/trunk
3438    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
3439    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
3440- commit to CLDR
3441- generate ICU zh collation data: run CLDR
3442    org.unicode.cldr.icu.NewLdml2IcuConverter
3443  with program arguments
3444    -t collation
3445    -s /home/mscherer/svn.cldr/trunk/common/collation
3446    -m /home/mscherer/svn.cldr/trunk/common/supplemental
3447    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
3448    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
3449    zh
3450  and VM arguments
3451    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
3452- rebuild ICU4C
3453
3454* run & fix ICU4C tests, now with new CLDR collation root data
3455- run all tests with the collation test data *_SHORT.txt or the full files
3456  (the full ones have comments, useful for debugging)
3457- note on intltest: if collate/UCAConformanceTest fails, then
3458  utility/MultithreadTest/TestCollators will fail as well;
3459  fix the conformance test before looking into the multi-thread test
3460
3461* update Java data files
3462- refresh just the UCD/UCA-related/derived files, just to be safe
3463- see (ICU4C)/source/data/icu4j-readme.txt
3464- mkdir /tmp/icu4j
3465- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3466  output:
3467    ...
3468    Unicode .icu files built to ./out/build/icudt58l
3469    echo timestamp > uni-core-data
3470    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
3471    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
3472    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3473    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
3474    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
3475    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
3476    mkdir -p /tmp/icu4j/main/shared/data
3477    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3478    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
3479    mkdir -p /tmp/icu4j/main/shared/data
3480    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3481    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
3482- copy the big-endian Unicode data files to another location,
3483  separate from the other data files,
3484  and then refresh ICU4J
3485    cd ~/svn.icu/trunk/dbg/data/out/icu4j
3486    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3487    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3488    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3489    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3490    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3491    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3492    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3493    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3494    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
3495
3496* When refreshing all of ICU4J data from ICU4C
3497- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3498- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3499or
3500- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3501
3502* update CollationFCD.java
3503  + copy & paste the initializers of lcccIndex[] etc. from
3504    ICU4C/source/i18n/collationfcd.cpp to
3505    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
3506
3507* refresh Java test .txt files
3508- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3509    cd $ICU_SRC_DIR/source/data/unidata
3510    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3511    cd ../../test/testdata
3512    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3513    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3514
3515* run & fix ICU4J tests
3516
3517*** LayoutEngine script information
3518
3519* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3520  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3521  in the working directory.
3522
3523  (It also generates ScriptRunData.cpp, which is no longer needed.)
3524
3525  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
3526  (a plain text file)
3527  which maps ICU versions to the numbers of script/language constants
3528  that were added then.
3529  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
3530
3531  The generated files have a current copyright date and "@deprecated" statement.
3532
3533* Review changes, fix Java tool if necessary, and copy to ICU4C
3534  cd ~/svn.icu4j/trunk/src
3535  meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3536  cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
3537  cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
3538
3539*** API additions
3540- send notice to icu-design about new born-@stable API (enum constants etc.)
3541
3542*** merge the Unicode update branches back onto the trunk
3543- do not merge the icudata.jar and testdata.jar,
3544  instead rebuild them from merged & tested ICU4C
3545- make sure that changes to Unicode tools & ICU tools are checked in
3546  http://www.unicode.org/utility/trac/log/trunk/unicodetools
3547  http://bugs.icu-project.org/trac/log/tools/trunk
3548
3549---------------------------------------------------------------------------- ***
3550
3551New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764
3552
3553Adding
3554- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
3555- new combination/alias codes: Hanb, Jamo
  + used in CLDR 29 and in spoof checker
3557- new Z* code: Zsye
3558
3559Add new codes to uscript.h & UScript.java, see Unicode update logs.
3560  -> com.ibm.icu.lang.UScript
3561    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3562    replace  public static final int \1 = \2; \3
3563
3564Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
3565add new script codes.
3566"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
3567
3568Note: If we have to run preparseucd.py again before the Unicode 9 update,
3569then we need to manually keep/restore the new script codes.
3570
3571ICU_ROOT=~/svn.icu/trunk
3572ICU_SRC_DIR=$ICU_ROOT/src
3573ICUDT=icudt57b
3574export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3575SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3576UNIDATA=$ICU_SRC_DIR/source/data/unidata
3577
3578Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
3579see https://unicode-org.atlassian.net/browse/ICU-12141
3580
3581make install, then icutools cmake & make, then
3582~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
3583
3584Generate Java data as usual, only update pnames.icu & uprops.icu.
3585
3586*** LayoutEngine script information
3587
3588* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3589  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3590  in the working directory.
3591
3592  (It also generates ScriptRunData.cpp, which is no longer needed.)
3593
3594  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
3595  (a plain text file)
3596  which maps ICU versions to the numbers of script/language constants
3597  that were added then.
3598  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
3599
3600  The generated files have a current copyright date and "@deprecated" statement.
3601
3602* Review changes, fix Java tool if necessary, and copy to ICU4C
3603  cd ~/svn.icu4j/trunk/src
3604  meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3605  cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
3606  cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
3607
3608---------------------------------------------------------------------------- ***
3609
3610Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802
3611
3612Edit preparseucd.py to add & parse new properties.
3613They share the UCD property namespace but are not listed in PropertyAliases.txt.
3614
3615Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
3616Initial data from emoji/2.0/
3617
3618ICU_ROOT=~/svn.icu/trunk
3619ICU_SRC_DIR=$ICU_ROOT/src
3620ICUDT=icudt56b
3621export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3622SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3623UNIDATA=$ICU_SRC_DIR/source/data/unidata
3624
3625Add binary-property constants to uchar.h enum UProperty & UProperty.java.
3626
3627~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
3628(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
3629
3630Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
3631
3632make install, then icutools cmake & make, then
3633~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
3634
3635Generate Java data as usual, only update pnames.icu & uprops.icu.
3636
3637---------------------------------------------------------------------------- ***
3638
3639Unicode 8.0 update for ICU 56
3640
3641* Command-line environment setup
3642
3643ICU_ROOT=~/svn.icu/trunk
3644ICU_SRC_DIR=$ICU_ROOT/src
3645ICUDT=icudt56b
3646export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3647SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3648UNIDATA=$ICU_SRC_DIR/source/data/unidata
3649
3650http://www.unicode.org/review/pri297/  -- beta review
3651http://www.unicode.org/reports/uax-proposed-updates.html
3652http://unicode.org/versions/beta-8.0.0.html
3653http://www.unicode.org/versions/Unicode8.0.0/
3654http://www.unicode.org/reports/tr44/tr44-15.html
3655
3656*** ICU Trac
3657
3658- ticket:11574: Unicode 8
3659- C++ branches/markus/uni80 at r37351 from trunk at r37343
3660- Java branches/markus/uni80 at r37352 from trunk at r37338
3661
3662*** CLDR Trac
3663
3664- cldrbug 8311: UCA 8
3665- branches/markus/uni80 at r11518 from trunk at r11517
3666
3667- cldrbug 8109: Unicode 8.0 script metadata
3668- cldrbug 8418: Updated segmentation for Unicode 8.0
3669
3670*** Unicode version numbers
3671- makedata.mak
3672- uchar.h
3673- com.ibm.icu.util.VersionInfo
3674- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3675
3676- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3677  so that the makefiles see the new version number.
3678
3679*** data files & enums & parser code
3680
3681* file preparation
3682
3683- download UCD & IDNA files
3684- make sure that the Unicode data folder passed into preparseucd.py
3685  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3686- only for manual diffs: remove version suffixes from the file names
3687  ~/unidata/uni70/20140403$ ../../desuffixucd.py .
3688  (see https://sites.google.com/site/unicodetools/inputdata)
3689- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
3690- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
3691- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3692
3693- also: from http://unicode.org/Public/security/8.0.0/ download new
3694  confusables.txt & confusablesWholeScript.txt
3695  and copy to $UNIDATA
3696    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
3697    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
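
For the "remove version suffixes" step in the list above, a sketch of the
renaming rule in Python. This is not the desuffixucd.py source; it assumes
suffixes of the form -8.0.0 or -8.0.0d5 directly before the file extension,
and the file names in the example are illustrative.

```python
import re

# Assumed suffix shape: "-<digits>(.<digits>)*" plus an optional draft
# marker such as "d5", directly before an alphabetic file extension.
_VERSION_SUFFIX = re.compile(r"-[0-9]+(?:\.[0-9]+)*(?:d[0-9]+)?(\.[A-Za-z]+)$")


def strip_version_suffix(name):
    """Return the file name without its version suffix, if it has one."""
    return _VERSION_SUFFIX.sub(r"\1", name)


if __name__ == "__main__":
    for name in ("UnicodeData-8.0.0d5.txt", "Blocks-8.0.0.txt", "emoji-data.txt"):
        print(name, "->", strip_version_suffix(name))
```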
3698
3699* initial preparseucd.py changes
3700- remove new Unicode scripts from the
3701  only-in-ISO-15924 list according to the error message:
3702    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
3703    from _scripts_only_in_iso15924
3704  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3705      and in com.ibm.icu.dev.test.lang.TestUScript.java
3706- property and file name change:
3707    IndicMatraCategory -> IndicPositionalCategory
3708- UnicodeData.txt unusual numeric values (twelfths, most of them not in lowest terms)
3709    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
3710    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
3711    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
3712    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
3713    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
3714    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
3715    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
3716    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
3717    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
3718    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
3719  -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 -> 1/6),
3720     which are the values listed in DerivedNumericValues.txt;
3721     this keeps the storage in the data file simple
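
The mapping of the Meroitic twelfths to the values in DerivedNumericValues.txt
amounts to reducing each fraction to lowest terms. A minimal Python sketch of
that idea follows; it is not the preparseucd.py code, and the function name is
an assumption.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a UnicodeData.txt numeric value with fractions in lowest terms.

    Values without a "/" (integers, empty fields) are returned unchanged.
    """
    if "/" not in nv:
        return nv
    numerator, denominator = nv.split("/")
    value = Fraction(int(numerator), int(denominator))
    if value.denominator == 1:
        return str(value.numerator)
    return "%d/%d" % (value.numerator, value.denominator)


if __name__ == "__main__":
    for nv in ("1/12", "2/12", "6/12", "10/12"):
        print(nv, "->", reduce_numeric_value(nv))
```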
3722
3723* PropertyValueAliases.txt changes
3724- 10 new Block (blk) values:
3725    blk; Ahom                             ; Ahom
3726    blk; Anatolian_Hieroglyphs            ; Anatolian_Hieroglyphs
3727    blk; Cherokee_Sup                     ; Cherokee_Supplement
3728    blk; CJK_Ext_E                        ; CJK_Unified_Ideographs_Extension_E
3729    blk; Early_Dynastic_Cuneiform         ; Early_Dynastic_Cuneiform
3730    blk; Hatran                           ; Hatran
3731    blk; Multani                          ; Multani
3732    blk; Old_Hungarian                    ; Old_Hungarian
3733    blk; Sup_Symbols_And_Pictographs      ; Supplemental_Symbols_And_Pictographs
3734    blk; Sutton_SignWriting               ; Sutton_SignWriting
3735  -> add to uchar.h
3736    use long property names for enum constants
3737  -> add to UCharacter.UnicodeBlock IDs
3738    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3739            replace  public static final int \1_ID = \2; \3
3740  -> add to UCharacter.UnicodeBlock objects
3741    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
3742            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3743- 6 new Script (sc) values:
3744    sc ; Ahom                             ; Ahom
3745    sc ; Hatr                             ; Hatran
3746    sc ; Hluw                             ; Anatolian_Hieroglyphs
3747    sc ; Hung                             ; Old_Hungarian
3748    sc ; Mult                             ; Multani
3749    sc ; Sgnw                             ; SignWriting
3750  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
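
The two Eclipse find/replace pairs for the new Block values can be reproduced
with Python's re module. This is a sketch for illustration only; the function
names and the sample uchar.h line are assumptions.

```python
import re

# Same expressions as the two Eclipse "find" patterns above.
BLOCK_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
BLOCK_OBJECT = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def block_id_line(line):
    """uchar.h enum line -> UCharacter.UnicodeBlock *_ID constant."""
    return BLOCK_ID.sub(r"public static final int \1_ID = \2; \3", line)


def block_object_line(line):
    """uchar.h enum line -> UCharacter.UnicodeBlock object declaration."""
    return BLOCK_OBJECT.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)


if __name__ == "__main__":
    sample = "    UBLOCK_AHOM = 253, /*[11700]*/"  # illustrative input line
    print(block_id_line(sample))
    print(block_object_line(sample))
```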
3751
3752* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
3753    (not strictly necessary for NOT_ENCODED scripts)
3754  ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
3755
3756* generate normalization data files
3757  cd $ICU_ROOT/dbg
3758  bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
3759  bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
3760  bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
3761  bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3762  bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
3763
3764* build ICU (make install)
3765  so that the tools build can pick up the new definitions from the installed header files.
3766
3767  $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
3768
3769* build Unicode tools using CMake+make
3770
3771~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
3772
3773  # Location (--prefix) of where ICU was installed.
3774  set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
3775  # Location of the ICU source tree.
3776  set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
3777
3778  ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
3779  ~/svn.icutools/trunk/dbg/unicode/c$ make
3780
3781* generate core properties data files
3782- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
3783- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
3784- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
3785- rebuild ICU (make install) & tools
3786- run genuca again (see step above) so that it picks up the new nfc.nrm
3787- rebuild ICU (make install) & tools
3788
3789* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3790  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3791- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3792- Unicode 6.0..8.0: U+2260, U+226E, U+226F
3793- nothing new in 8.0, no test file to update
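
Instead of grep, the check for "disallowed_STD3_valid" on non-ASCII characters
can be scripted. The sketch below assumes the usual UCD-style layout of
IdnaMappingTable.txt (semicolon-separated fields, "#" comments, ranges written
as XXXX..YYYY); the function name and the sample lines are illustrative.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) ranges above U+007F with status disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start_cp = int(start, 16)
        end_cp = int(end, 16) if end else start_cp
        if end_cp > 0x7F:
            # Clip ranges that begin in ASCII so only the non-ASCII part is reported.
            yield (max(start_cp, 0x80), end_cp)


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid    # <control>..COMMA",
        "2260          ; disallowed_STD3_valid    # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid    # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print("U+%04X..U+%04X" % (start, end))
```

Any range reported beyond U+2260, U+226E and U+226F would call for an update
of uts46test.cpp and UTS46Test.java.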
3794
3795* run & fix ICU4C tests
3796- bad Cherokee case folding due to difference in fallbacks:
3797  UCD case folding falls back to no mapping,
3798  ICU runtime case folding falls back to lowercasing;
3799  fixed casepropsbuilder.cpp to generate scf mappings to self
3800  when there is an slc mapping but no scf
3801- Andy handles RBBI & spoof check test failures
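
The Cherokee case folding problem above comes from two different fallbacks.
The following Python sketch models both and the fix (explicit scf mappings to
self). It is a simplified model, not the casepropsbuilder.cpp code; the
function names and the dict-based data are assumptions.

```python
def runtime_fold(cp, slc, scf):
    """Model of the ICU runtime: use scf if present, else fall back to lowercasing."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_self_foldings(slc, scf):
    """Return scf plus a mapping to self for each code point with slc but no scf."""
    fixed = dict(scf)
    for cp in slc:
        fixed.setdefault(cp, cp)
    return fixed


if __name__ == "__main__":
    # U+13A0 CHEROKEE LETTER A lowercases to U+AB70 CHEROKEE SMALL LETTER A,
    # while case folding maps U+AB70 to U+13A0.
    slc = {0x13A0: 0xAB70}
    scf = {0xAB70: 0x13A0}
    print("before fix: %04X" % runtime_fold(0x13A0, slc, scf))
    print("after fix:  %04X" % runtime_fold(0x13A0, slc, add_self_foldings(slc, scf)))
```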
3802
3803* collation: CLDR collation root, UCA DUCET
3804
3805- UCA DUCET goes into Mark's Unicode tools, see
3806  https://sites.google.com/site/unicodetools/home#TOC-UCA
3807- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
3808- cd (CLDR UCA branch)/common/uca/
3809- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3810  cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
3811- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3812    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
3813    (note: the next copy removes the underscore before "Rules" from the file name)
3814    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3815- restore TODO diffs in UCARules.txt
3816    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3817- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3818  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3819  from the CLDR root files (..._CLDR_..._SHORT.txt)
3820    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
3821    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
3822    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
3823- if CLDR common/uca/unihan-index.txt changes, then update
3824  CLDR common/collation/root.xml <collation type="private-unihan">
3825  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
3826- run genuca, see command line above;
3827  deal with
3828    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
3829        (add the character to genuca.cpp sampleCharsToScripts[])
3830  + look up the script for the new sample characters
3831    (e.g., in FractionalUCA.txt)
3832  + *add* mappings to sampleCharsToScripts[], do not replace them
3833    (in case the script sample characters flip-flop)
3834  + insert new scripts in DUCET script order, see the top_byte table
3835    at the beginning of FractionalUCA.txt
3836- rebuild ICU4C
3837
3838* run & fix ICU4C tests, now with new CLDR collation root data
3839- run all tests with the collation test data *_SHORT.txt or the full files
3840  (the full ones have comments, useful for debugging)
3841- note on intltest: if collate/UCAConformanceTest fails, then
3842  utility/MultithreadTest/TestCollators will fail as well;
3843  fix the conformance test before looking into the multi-thread test
3844- fixed bug in CollationWeights::getWeightRanges()
3845  exposed by new data and CollationTest::TestRootElements
3846
3847* update Java data files
3848- refresh just the UCD/UCA-related/derived files, just to be safe
3849- see (ICU4C)/source/data/icu4j-readme.txt
3850- mkdir /tmp/icu4j
3851- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3852  output:
3853    ...
3854    Unicode .icu files built to ./out/build/icudt56l
3855    echo timestamp > uni-core-data
3856    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
3857    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
3858    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3859    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
3860    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
3861    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
3862    mkdir -p /tmp/icu4j/main/shared/data
3863    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3864    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
3865    mkdir -p /tmp/icu4j/main/shared/data
3866    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3867    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
3868- copy the big-endian Unicode data files to another location,
3869  separate from the other data files,
3870  and then refresh ICU4J
3871    cd ~/svn.icu/trunk/dbg/data/out/icu4j
3872    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3873    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3874    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3875    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3876    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3877    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3878    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3879    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3880    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
3881
3882* When refreshing all of ICU4J data from ICU4C
3883- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3884- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3885or
3886- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3887
3888* update CollationFCD.java
3889  + copy & paste the initializers of lcccIndex[] etc. from
3890    ICU4C/source/i18n/collationfcd.cpp to
3891    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
3892
3893* refresh Java test .txt files
3894- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3895    cd $ICU_SRC_DIR/source/data/unidata
3896    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3897    cd ../../test/testdata
3898    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3899    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3900
3901* run & fix ICU4J tests
3902
3903*** LayoutEngine script information
3904
3905* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
3906  because the layout engine was deprecated in ICU 54.
3907  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
3908  to write lines that we used to add manually.
3909
3910* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3911  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3912  in the working directory.
3913
3914  (It also generates ScriptRunData.cpp, which is no longer needed.)
3915
3916  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
3917  (a plain text file)
3918  which maps ICU versions to the numbers of script/language constants
3919  that were added then.
3920  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
3921
3922  The generated files have a current copyright date and "@deprecated" statement.
3923
3924* Review changes, fix Java tool if necessary, and copy to ICU4C
3925  cd ~/svn.icu4j/trunk/src
3926  meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3927  cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
3928  cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
3929
3930*** API additions
3931- send notice to icu-design about new born-@stable API (enum constants etc.)
3932
3933*** merge the Unicode update branches back onto the trunk
3934- do not merge the icudata.jar and testdata.jar,
3935  instead rebuild them from merged & tested ICU4C
3936- make sure that changes to Unicode tools & ICU tools are checked in
3937  http://www.unicode.org/utility/trac/log/trunk/unicodetools
3938  http://bugs.icu-project.org/trac/log/tools/trunk
3939
3940---------------------------------------------------------------------------- ***
3941
3942Unicode 7.0 update for ICU 54
3943
3944http://www.unicode.org/review/pri271/  -- beta review
3945http://www.unicode.org/reports/uax-proposed-updates.html
3946http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
3947http://www.unicode.org/reports/tr44/tr44-13.html
3948
3949*** ICU Trac
3950
3951- ticket 10821: Unicode 7.0, UCA 7.0
3952- C++ branches/markus/uni70 at r35584 from trunk at r35580
3953- Java branches/markus/uni70 at r35587 from trunk at r35545
3954
3955*** CLDR Trac
3956
3957- ticket 7195: UCA 7.0 CLDR root collation
3958- branches/markus/uni70 at r10062 from trunk at r10061
3959
3960- ticket 6762: script metadata for Unicode 7.0 new scripts
3961
3962*** Unicode version numbers
3963- makedata.mak
3964- uchar.h
3965- com.ibm.icu.util.VersionInfo
3966- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3967
3968- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3969  so that the makefiles see the new version number.
3970
3971*** data files & enums & parser code
3972
3973* file preparation
3974
3975- download UCD & IDNA files
3976- make sure that the Unicode data folder passed into preparseucd.py
3977  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3978- only for manual diffs: remove version suffixes from the file names
3979  ~/unidata/uni70/20140403$ ../../desuffixucd.py .
3980  (see https://sites.google.com/site/unicodetools/inputdata)
3981- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
3982- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
3983- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3984- Restore TODO diffs in source/data/unidata/UCARules.txt
3985    cd $ICU_SRC_DIR
3986    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
3987- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
3988
3989- also: from http://unicode.org/Public/security/7.0.0/ download new
3990  confusables.txt & confusablesWholeScript.txt
3991  and copy to $ICU_ROOT/src/source/data/unidata/
3992
3993* initial preparseucd.py changes
3994- remove new Unicode scripts from the
3995  only-in-ISO-15924 list according to the error message:
3996    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
3997                        'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
3998                        'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
3999    from _scripts_only_in_iso15924
4000  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
4001      and in com.ibm.icu.dev.test.lang.TestUScript.java
4002- NamesList.txt now has a heading with a non-ASCII character
4003  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
4004  + escape non-ASCII characters in heading comments
4005- preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013
4006  + get the copyright from the first file whose copyright line contains the current year
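
The two preparseucd.py changes above (escaping non-ASCII heading text and
picking the copyright line by year) can be sketched as follows. This is not
the preparseucd.py code: the function names, the \uXXXX escape form and the
sample header lines are assumptions made for illustration.

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for c in text:
        cp = ord(c)
        if cp < 0x80:
            parts.append(c)
        elif cp <= 0xFFFF:
            parts.append("\\u%04X" % cp)
        else:
            parts.append("\\U%08X" % cp)
    return "".join(parts)


def pick_copyright(headers, year):
    """Return the first copyright line that contains the given year.

    headers is a list of (file name, header lines) pairs in processing order.
    Returns None if no file has a copyright line with that year.
    """
    for _name, lines in headers:
        for line in lines:
            if "Copyright" in line and str(year) in line:
                return line.strip()
    return None


if __name__ == "__main__":
    headers = [  # hypothetical header lines
        ("PropertyAliases.txt", ["# Copyright (c) 1991-2013 Unicode, Inc."]),
        ("Blocks.txt", ["# Copyright (c) 1991-2014 Unicode, Inc."]),
    ]
    print(pick_copyright(headers, 2014))
    print(escape_non_ascii("Fa\u00e7ade"))
```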
4007
4008* PropertyValueAliases.txt changes
4009- 32 new Block (blk) values:
4010    blk; Bassa_Vah                        ; Bassa_Vah
4011    blk; Caucasian_Albanian               ; Caucasian_Albanian
4012    blk; Coptic_Epact_Numbers             ; Coptic_Epact_Numbers
4013    blk; Diacriticals_Ext                 ; Combining_Diacritical_Marks_Extended
4014    blk; Duployan                         ; Duployan
4015    blk; Elbasan                          ; Elbasan
4016    blk; Geometric_Shapes_Ext             ; Geometric_Shapes_Extended
4017    blk; Grantha                          ; Grantha
4018    blk; Khojki                           ; Khojki
4019    blk; Khudawadi                        ; Khudawadi
4020    blk; Latin_Ext_E                      ; Latin_Extended_E
4021    blk; Linear_A                         ; Linear_A
4022    blk; Mahajani                         ; Mahajani
4023    blk; Manichaean                       ; Manichaean
4024    blk; Mende_Kikakui                    ; Mende_Kikakui
4025    blk; Modi                             ; Modi
4026    blk; Mro                              ; Mro
4027    blk; Myanmar_Ext_B                    ; Myanmar_Extended_B
4028    blk; Nabataean                        ; Nabataean
4029    blk; Old_North_Arabian                ; Old_North_Arabian
4030    blk; Old_Permic                       ; Old_Permic
4031    blk; Ornamental_Dingbats              ; Ornamental_Dingbats
4032    blk; Pahawh_Hmong                     ; Pahawh_Hmong
4033    blk; Palmyrene                        ; Palmyrene
4034    blk; Pau_Cin_Hau                      ; Pau_Cin_Hau
4035    blk; Psalter_Pahlavi                  ; Psalter_Pahlavi
4036    blk; Shorthand_Format_Controls        ; Shorthand_Format_Controls
4037    blk; Siddham                          ; Siddham
4038    blk; Sinhala_Archaic_Numbers          ; Sinhala_Archaic_Numbers
4039    blk; Sup_Arrows_C                     ; Supplemental_Arrows_C
4040    blk; Tirhuta                          ; Tirhuta
4041    blk; Warang_Citi                      ; Warang_Citi
4042  -> add to uchar.h
4043    use long property names for enum constants
4044  -> add to UCharacter.UnicodeBlock IDs
4045    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
4046            replace  public static final int \1_ID = \2; \3
4047  -> add to UCharacter.UnicodeBlock objects
4048    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
4049            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
4050- 28 new Joining_Group (jg) values:
4051    jg ; Manichaean_Aleph                 ; Manichaean_Aleph
4052    jg ; Manichaean_Ayin                  ; Manichaean_Ayin
4053    jg ; Manichaean_Beth                  ; Manichaean_Beth
4054    jg ; Manichaean_Daleth                ; Manichaean_Daleth
4055    jg ; Manichaean_Dhamedh               ; Manichaean_Dhamedh
4056    jg ; Manichaean_Five                  ; Manichaean_Five
4057    jg ; Manichaean_Gimel                 ; Manichaean_Gimel
4058    jg ; Manichaean_Heth                  ; Manichaean_Heth
4059    jg ; Manichaean_Hundred               ; Manichaean_Hundred
4060    jg ; Manichaean_Kaph                  ; Manichaean_Kaph
4061    jg ; Manichaean_Lamedh                ; Manichaean_Lamedh
4062    jg ; Manichaean_Mem                   ; Manichaean_Mem
4063    jg ; Manichaean_Nun                   ; Manichaean_Nun
4064    jg ; Manichaean_One                   ; Manichaean_One
4065    jg ; Manichaean_Pe                    ; Manichaean_Pe
4066    jg ; Manichaean_Qoph                  ; Manichaean_Qoph
4067    jg ; Manichaean_Resh                  ; Manichaean_Resh
4068    jg ; Manichaean_Sadhe                 ; Manichaean_Sadhe
4069    jg ; Manichaean_Samekh                ; Manichaean_Samekh
4070    jg ; Manichaean_Taw                   ; Manichaean_Taw
4071    jg ; Manichaean_Ten                   ; Manichaean_Ten
4072    jg ; Manichaean_Teth                  ; Manichaean_Teth
4073    jg ; Manichaean_Thamedh               ; Manichaean_Thamedh
4074    jg ; Manichaean_Twenty                ; Manichaean_Twenty
4075    jg ; Manichaean_Waw                   ; Manichaean_Waw
4076    jg ; Manichaean_Yodh                  ; Manichaean_Yodh
4077    jg ; Manichaean_Zayin                 ; Manichaean_Zayin
4078    jg ; Straight_Waw                     ; Straight_Waw
4079  -> uchar.h & UCharacter.JoiningGroup
4080- 23 new Script (sc) values:
4081    sc ; Aghb                             ; Caucasian_Albanian
4082    sc ; Bass                             ; Bassa_Vah
4083    sc ; Dupl                             ; Duployan
4084    sc ; Elba                             ; Elbasan
4085    sc ; Gran                             ; Grantha
4086    sc ; Hmng                             ; Pahawh_Hmong
4087    sc ; Khoj                             ; Khojki
4088    sc ; Lina                             ; Linear_A
4089    sc ; Mahj                             ; Mahajani
4090    sc ; Mani                             ; Manichaean
4091    sc ; Mend                             ; Mende_Kikakui
4092    sc ; Modi                             ; Modi
4093    sc ; Mroo                             ; Mro
4094    sc ; Narb                             ; Old_North_Arabian
4095    sc ; Nbat                             ; Nabataean
4096    sc ; Palm                             ; Palmyrene
4097    sc ; Pauc                             ; Pau_Cin_Hau
4098    sc ; Perm                             ; Old_Permic
4099    sc ; Phlp                             ; Psalter_Pahlavi
4100    sc ; Sidd                             ; Siddham
4101    sc ; Sind                             ; Khudawadi
4102    sc ; Tirh                             ; Tirhuta
4103    sc ; Wara                             ; Warang_Citi
4104  -> uscript.h (many were added before)
4105    comment "Mende Kikakui" for USCRIPT_MENDE
4106    add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
4107  -> com.ibm.icu.lang.UScript
4108    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
4109    replace  public static final int \1 = \2; \3
4110- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
4111  (added 2012-11-01)
4112    Ahom        338     Ahom
4113    Hatr        127     Hatran
4114    Mult        323     Multani
4115  (added 2013-10-12)
4116    Modi        324     Modi
4117    Pauc        263     Pau Cin Hau
4118    Sidd        302     Siddham
4119  -> uscript.h (some overlap with additions from Unicode)
4120  -> com.ibm.icu.lang.UScript
4121    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
4122    replace  public static final int \1 = \2; \3
4123  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
4124  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
4125      and in com.ibm.icu.dev.test.lang.TestUScript.java
4126
4127* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
4128    (not strictly necessary for NOT_ENCODED scripts)
4129  ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
4130
4131* generate normalization data files
4132- cd $ICU_ROOT/dbg
4133- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
4134- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
4135- UNIDATA=$ICU_SRC_DIR/source/data/unidata
4136- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
4137- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
4138- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
4139- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
4140- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
4141
4142* build ICU (make install)
4143  so that the tools build can pick up the new definitions from the installed header files.
4144
4145~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
4146
4147* build Unicode tools using CMake+make
4148
4149~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
4150
4151# Location (--prefix) of where ICU was installed.
4152set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
4153# Location of the ICU source tree.
4154set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
4155
4156~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
4157~/svn.icutools/trunk/dbg/unicode/c$ make
4158
4159* genprops work
4160- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
4161  + add second array of Joining_Group values for at most 10800..10FFF
4162    icutools: unicode/c/genprops/bidipropsbuilder.cpp
4163    icu: source/common/ubidi_props.h/.c/_data.h
4164    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
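
The second Joining_Group array can be pictured as a lookup over two separate
code point ranges, presumably to avoid one long, mostly empty array between
the existing range and 10AC0..10AFF. The Python sketch below shows only the
lookup logic; the class name, the start of the first range and the stored
values are toy assumptions, not the ubidi_props data.

```python
NO_JOINING_GROUP = 0  # result for code points outside both ranges


class JoiningGroupTable:
    """Joining_Group lookup backed by two separate arrays."""

    def __init__(self, start, values, start2, values2):
        self.start = start
        self.values = values
        self.start2 = start2
        self.values2 = values2

    def get(self, cp):
        if self.start <= cp < self.start + len(self.values):
            return self.values[cp - self.start]
        if self.start2 <= cp < self.start2 + len(self.values2):
            return self.values2[cp - self.start2]
        return NO_JOINING_GROUP


if __name__ == "__main__":
    # Toy data: three entries in a first range, two entries at 10AC0.
    table = JoiningGroupTable(0x0620, [1, 2, 3], 0x10AC0, [7, 8])
    print(table.get(0x0621), table.get(0x10AC1), table.get(0x0041))
```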

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update
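
The grep step above can also be scripted. A minimal sketch, assuming the usual IdnaMappingTable.txt layout of `code point or range ; status ; ...  # comment`; the function name and the sample lines are invented for the example.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) for ranges with status disallowed_STD3_valid
    that reach beyond ASCII."""
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            yield (start, end)

# Sample lines in the assumed format, not copied from a real data file.
sample = [
    "0000..002C    ; disallowed_STD3_valid  # sample ASCII range",
    "2260          ; disallowed_STD3_valid  # NOT EQUAL TO",
    "00C0          ; mapped ; 00E0          # sample mapped line",
]
for start, end in non_ascii_std3_valid(sample):
    print("U+%04X..U+%04X" % (start, end))
```

Compare the printed list with the characters already covered by the tests (U+2260, U+226E, U+226F) to see whether anything is new.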

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
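
The copy step above selects only the Unicode-related files. A rough Python equivalent of those cp/rm commands, as a sketch only: the function name is hypothetical, and it skips cnvalias.icu up front instead of copying and then deleting it.

```python
import glob
import os
import shutil

def stage_unicode_data(src, dst):
    """Copy the Unicode-related data files from src to dst.

    src and dst are both .../com/ibm/icu/impl/data/icudtNNb directories.
    Returns the copied paths, relative to src.
    """
    for sub in ("coll", "brkitr"):
        os.makedirs(os.path.join(dst, sub), exist_ok=True)
    # (glob pattern relative to src, destination subfolder)
    patterns = [
        ("confusables.cfu", ""),
        ("*.icu", ""),
        ("*.nrm", ""),
        (os.path.join("coll", "*.icu"), "coll"),
        (os.path.join("brkitr", "*"), "brkitr"),
    ]
    copied = []
    for pattern, sub in patterns:
        for path in sorted(glob.glob(os.path.join(src, pattern))):
            name = os.path.basename(path)
            if not os.path.isfile(path) or name == "cnvalias.icu":
                continue  # cnvalias.icu is converter data, not Unicode data
            shutil.copy(path, os.path.join(dst, sub, name))
            copied.append(os.path.relpath(path, src))
    return copied

# Example call (paths as in the commands above):
# stage_unicode_data("com/ibm/icu/impl/data/icudt54b",
#                    "/tmp/icu4j/com/ibm/icu/impl/data/icudt54b")
```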

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
  ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
  ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
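
Several of the cp commands above rename files on the way into ICU4C (dropping "_SHORT", "_CLDR", or the underscore before "Rules"). The renames can be kept in one table, sketched here; the dictionary and function names are hypothetical, and the function only builds the list of copies without touching any file.

```python
import os

# Tool output name (in CollationAuxiliary/) -> path relative to the ICU4C source root.
UCA_FILE_MAP = {
    "FractionalUCA_SHORT.txt": "source/data/unidata/FractionalUCA.txt",
    # The destination name has no underscore before "Rules".
    "UCA_Rules_SHORT.txt": "source/data/unidata/UCARules.txt",
    "CollationTest_CLDR_NON_IGNORABLE_SHORT.txt":
        "source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt",
    "CollationTest_CLDR_SHIFTED_SHORT.txt":
        "source/test/testdata/CollationTest_SHIFTED_SHORT.txt",
}

def copy_plan(aux_dir, icu_src_dir):
    """Return sorted (source, destination) path pairs; copies nothing itself."""
    return sorted(
        (os.path.join(aux_dir, name), os.path.join(icu_src_dir, rel))
        for name, rel in UCA_FILE_MAP.items()
    )

for src, dst in copy_plan("CollationAuxiliary", "$ICU_SRC_DIR"):
    print("cp", src, dst)
```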

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/  -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
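
To check which version the header currently declares before running configure, something like the following works. It is a sketch: it assumes the version is defined as a quoted string in the U_UNICODE_VERSION macro, the function name is made up, and the regular expression is not the one that configure uses.

```python
import re

def unicode_version_from_header(text):
    """Return the U_UNICODE_VERSION string from uchar.h contents, or None."""
    m = re.search(r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([^"]+)"',
                  text, re.MULTILINE)
    return m.group(1) if m else None

# Sample excerpt, not the full header.
sample = '/* sample excerpt */\n#define U_UNICODE_VERSION "6.3"\n'
print(unicode_version_from_header(sample))
```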

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
  bpt                      ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
  bpb                      ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp
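
For orientation, here is a minimal sketch of reading the two new properties from BidiBrackets.txt. It assumes three semicolon-separated fields per data line (code point; bpb code point; bpt letter) followed by a # comment; the function name is hypothetical and this is not the preparseucd.py code.

```python
def parse_bidi_brackets(lines):
    """Map code point -> (bpb code point, bpt letter) for BidiBrackets.txt-style lines."""
    result = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        cp, bpb, bpt = [f.strip() for f in data.split(";")]
        result[int(cp, 16)] = (int(bpb, 16), bpt)
    return result

# Sample lines in the assumed format.
sample = [
    "# sample comment line",
    "0028; 0029; o # LEFT PARENTHESIS",
    "0029; 0028; c # RIGHT PARENTHESIS",
]
brackets = parse_bidi_brackets(sample)
for cp in sorted(brackets):
    bpb, bpt = brackets[cp]
    print("%04X bpb=%04X bpt=%s" % (cp, bpb, bpt))
```

Code points that are not listed have bpt=None (n) in the property value list of the next section.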

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
  bpt; c                                ; Close
  bpt; n                                ; None
  bpt; o                                ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
  bc ; FSI                              ; First_Strong_Isolate
  bc ; LRI                              ; Left_To_Right_Isolate
  bc ; RLI                              ; Right_To_Left_Isolate
  bc ; PDI                              ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
  WB ; HL                               ; Hebrew_Letter
  WB ; SQ                               ; Single_Quote
  WB ; DQ                               ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
  Aghb  239     Caucasian Albanian
  Mahj  314     Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)
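
The find/replace pair given above for uscript.h -> UScript.java can be tried out with Python's re module. The pattern and replacement are the ones from the notes; the function name and the sample input line (including its numeric value) are only illustrative.

```python
import re

FIND = r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)"
REPLACE = r"public static final int \1 = \2;\3"

def c_enum_to_java(line):
    """Turn one uscript.h enum line into a UScript.java constant line."""
    return re.sub(FIND, REPLACE, line)

# Sample input only; check the real names and values in uscript.h.
sample = "    USCRIPT_MAHAJANI     = 160,/* Mahj */"
print(c_enum_to_java(sample))
```

Lines that do not match the pattern (comments, blank lines) pass through unchanged, so the function can be applied to a whole pasted block line by line.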

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
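
The four gennorm2 calls above differ only in the output file and the list of input files, and every output builds on nfc.txt. A small sketch that generates the same command lines from a table; the names NORM2_INPUTS and gennorm2_commands are made up, and nothing is executed.

```python
# Output .nrm file -> input .txt files, in command-line order.
NORM2_INPUTS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]

def gennorm2_commands(src_data_in, unidata):
    """Return one gennorm2 argument list per output file; runs nothing."""
    return [
        ["bin/gennorm2", "-o", f"{src_data_in}/{out}", "-s", f"{unidata}/norm2"] + inputs
        for out, inputs in NORM2_INPUTS
    ]

for cmd in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
    print(" ".join(cmd))
```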

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/  Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/  Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/  Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html  IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
  lb ; RI                               ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
  WB ; RI                               ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
  GCB; RI                               ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java
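
The arithmetic behind these values can be checked quickly. The sketch below shows -1 as the fraction -1/1 and shows that both large values are small multiples of a power of 60. It does not reproduce the bit layout in uprops.h; the function names and the limits (multiplier 1..9, exponent 1..4) are assumptions chosen for the illustration.

```python
from fractions import Fraction

def as_fraction(value):
    """Represent an integer numeric value as (numerator, denominator)."""
    f = Fraction(value)
    return (f.numerator, f.denominator)

def decompose_base60(value, max_exponent=4):
    """Return (m, e) with value == m * 60**e, 1 <= m <= 9 and
    1 <= e <= max_exponent, or None if there is no such form."""
    if value <= 0:
        return None
    m, e = value, 0
    while e < max_exponent and m % 60 == 0:
        m //= 60
        e += 1
    if e >= 1 and 1 <= m <= 9:
        return (m, e)
    return None

print(as_fraction(-1))             # numerator and denominator for nv=-1
print(decompose_base60(216000))    # multiplier and power of 60
print(decompose_base60(432000))
```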

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools
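
The version check mentioned above (UCA version in FractionalUCA.txt vs. the new Unicode version) can be done before the build. A sketch, assuming the file carries a line of the form `[UCA version = x.y.z]`; that line format and both function names are assumptions, so adjust the pattern to what the file actually contains.

```python
import re

def uca_version(fractional_uca_text):
    """Return the version string from a '[UCA version = x.y.z]' line, or None."""
    m = re.search(r"\[UCA version\s*=\s*([0-9]+(?:\.[0-9]+)*)\]", fractional_uca_text)
    return m.group(1) if m else None

def versions_match(fractional_uca_text, unicode_version):
    """Compare major.minor of the UCA version with the given Unicode version."""
    found = uca_version(fractional_uca_text)
    if found is None:
        return False
    return found.split(".")[:2] == unicode_version.split(".")[:2]

# Sample header text in the assumed format.
sample = "# sample header\n[UCA version = 6.2.0]\n"
print(uca_version(sample), versions_match(sample, "6.2"))
```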

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
    ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
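
The add/remove rule above is plain set arithmetic. A sketch with a hypothetical helper (not a function in preparseucd.py); the script codes in the example are sample data, and "Abcd" is a made-up code.

```python
def update_scripts_only_in_iso15924(current, new_iso_codes, unicode_codes):
    """Add newly registered ISO 15924 codes, then drop every code that
    Unicode now uses as a Script property value."""
    return (set(current) | set(new_iso_codes)) - set(unicode_codes)

# Sample data only.
current = {"Aghb", "Mahj"}
new_iso_codes = {"Abcd"}          # made-up new ISO code
unicode_codes = {"Latn", "Mahj"}  # pretend Unicode has added Mahj
print(sorted(update_scripts_only_in_iso15924(current, new_iso_codes, unicode_codes)))
```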
4709
4710* UnicodeData.txt changes
4711- No more manual changes for CJK ranges for algorithmic names;
4712  those are now written to ppucd.txt and genprops reads them from there.
4713
4714* generate core properties data files (makeprops.sh was deleted)
4715- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
4716
4717* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
4718- it is now generated by preparseucd.py
4719
4720* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
4721- it is now generated by preparseucd.py
4722- make sure that the Unicode data folder passed into preparseucd.py
4723  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
4724  (can be in some subfolder)
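
Finding a file "in some subfolder" can be done with a recursive search. This is only a sketch of the idea, not the actual preparseucd.py logic; `find_ucd_file` is a hypothetical helper:

```python
from pathlib import Path


def find_ucd_file(root, name):
    """Return the first file called `name` anywhere under `root`, else None."""
    matches = sorted(Path(root).rglob(name))
    return matches[0] if matches else None
```

For example, `find_ucd_file(ucd_root, "IdnaMappingTable.txt")` returns the path if the file exists at any depth below `ucd_root`.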
4725
4726* generate normalization data files
4727- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
4728- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
4729- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
4730- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
4731- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
4732- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
4733- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
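
The four gennorm2 invocations differ only in the output name and the list of input files. A small Python sketch that rebuilds the same argument lists; the `-o` and `-s` options are the ones used above, while the table and function names are hypothetical:

```python
import posixpath

# Output file -> input files, transcribed from the commands above.
NORM2_JOBS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Build one argument list per .nrm file; nothing is executed here."""
    cmds = []
    for out_name, inputs in NORM2_JOBS.items():
        cmds.append([tool,
                     "-o", posixpath.join(src_data_in, out_name),
                     "-s", posixpath.join(unidata, "norm2"),
                     *inputs])
    return cmds


for cmd in gennorm2_commands("source/data/in", "source/data/unidata"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` from the build directory once LD_LIBRARY_PATH is set as shown above.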
4734
4735* build ICU (make install)
4736* build Unicode tools using CMake+make
4737
4738* new way to call genuca (makeuca.sh was deleted)
4739- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
4740
4741---------------------------------------------------------------------------- ***
4742
4743Unicode 6.1 update
4744
4745*** ICU Trac
4746
4747- ticket 8995 final update to Unicode 6.1
4748- ticket 8994 regenerate source/layout/CanonData.cpp
4749
4750- ticket 8961 support Unicode "Age" value *names*
4751- ticket 8963 support multiple character name aliases & types
4752
4753- ticket 8827 "update ICU to Unicode 6.1"
4754- C++ branches/markus/uni61 at r30864 from trunk at r30843
4755- Java branches/markus/uni61 at r30865 from trunk at r30863
4756
4757*** Unicode version numbers
4758- makedata.mak
4759- uchar.h
  (configure.in & configure have been modified to extract the version from uchar.h)
4761- com.ibm.icu.util.VersionInfo
4762- icutools/unicode/makedefs.sh
4763  + also review & update other definitions in that file,
4764    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
4765
4766*** data files & enums & parser code
4767
4768* file preparation
4769
4770~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
4771- This prepares both unidata and testdata files in respective output subfolders.
4772- Check test file diffs for previously commented-out, known-failing data lines;
4773  probably need to keep those commented out.
4774
4775* PropertyValueAliases.txt changes
4776- 11 new block names:
4777  Arabic_Extended_A
4778  Arabic_Mathematical_Alphabetic_Symbols
4779  Chakma
4780  Meetei_Mayek_Extensions
4781  Meroitic_Cursive
4782  Meroitic_Hieroglyphs
4783  Miao
4784  Sharada
4785  Sora_Sompeng
4786  Sundanese_Supplement
4787  Takri
4788  -> add to uchar.h
4789  -> add to UCharacter.UnicodeBlock IDs
4790    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
4791            replace  public static final int \1_ID = \2; \3
4792  -> add to UCharacter.UnicodeBlock objects
4793    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
4794            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
4795- 1 new Joining_Group (jg) value:
4796  Rohingya_Yeh
4797  -> uchar.h & UCharacter.JoiningGroup
4798- 2 new Line_Break (lb) values:
4799  CJ=Conditional_Japanese_Starter
4800  HL=Hebrew_Letter
4801  -> uchar.h & UCharacter.LineBreak
4802- 7 new scripts:
4803  sc ; Cakm      ; Chakma
4804  sc ; Merc      ; Meroitic_Cursive
4805  sc ; Mero      ; Meroitic_Hieroglyphs
4806  sc ; Plrd      ; Miao
4807  sc ; Shrd      ; Sharada
4808  sc ; Sora      ; Sora_Sompeng
4809  sc ; Takr      ; Takri
4810  -> remove these from SyntheticPropertyValueAliases.txt
4811  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
4812      and in com.ibm.icu.dev.test.lang.TestUScript.java
4813- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
4814  (added 2011-06-21)
4815  Khoj        322     Khojki
4816  Tirh        326     Tirhuta
4817    and another one added 2011-12-09
4818  Hluw        080     Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
4819  -> uscript.h
4820  -> com.ibm.icu.lang.UScript
4821    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
4822    replace  public static final int \1 = \2;\3
4823  -> SyntheticPropertyValueAliases.txt
4824  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
4825      and in com.ibm.icu.dev.test.lang.TestUScript.java
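
The Eclipse find/replace patterns above also work unchanged with Python's `re.sub`, which uses the same `\1` group references. A sketch for trying them out; the numeric values in the sample C lines are for illustration only:

```python
import re

# (find, replace) pairs transcribed from the patterns in this section.
BLOCK_ID = (r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
            r"public static final int \1_ID = \2; \3")
BLOCK_OBJ = (r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
             r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')
SCRIPT = (r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
          r"public static final int \1 = \2;\3")


def convert(line, rule):
    """Apply one find/replace rule to a line of C enum source."""
    find, replace = rule
    return re.sub(find, replace, line)


c_block = "UBLOCK_CHAKMA = 212, /*[11100]*/"        # sample line, value illustrative
c_script = "USCRIPT_KHOJKI        = 157,/* Khoj */"  # sample line, value illustrative

print(convert(c_block, BLOCK_ID))
print(convert(c_block, BLOCK_OBJ))
print(convert(c_script, SCRIPT))
```

Lines that do not match a pattern are returned unchanged, so the rules can be run over a whole pasted enum section.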
4826
4827* UnicodeData.txt changes
4828- the last Unihan code point changes from U+9FCB to U+9FCC
4829  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
4830  + do change gennames.c
4831  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
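
The case-insensitive search for 9FC[BC] can be scripted as well. A minimal sketch; the function name and the sample lines are hypothetical:

```python
import re

CJK_END = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_cjk_end_refs(lines):
    """Return (line number, text) for every line mentioning 9FCB or 9FCC."""
    return [(n, text) for n, text in enumerate(lines, start=1)
            if CJK_END.search(text)]


sample = ["end = 0x9fcb;", "limit = 0x9FCC;", "old = 0x9FC3;"]
for n, text in find_cjk_end_refs(sample):
    print(n, text)
```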
4832
4833* DerivedBidiClass.txt changes
4834- 2 new default-AL blocks:
4835#     Arabic Extended-A: U+08A0  -  U+08FF  (was default-R)
4836#     Arabic Mathematical Alphabetic Symbols:
4837#                       U+1EE00  - U+1EEFF  (was default-R)
4838- 2 new default-R blocks:
4839#     Meroitic Hieroglyphs:
4840#                        U+10980 - U+1099F
4841#     Meroitic Cursive:  U+109A0 - U+109FF
4842  -> should be picked up by the explicit data in the file
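
For a quick sanity check of the four ranges listed above, a lookup table is enough. This sketch covers only these ranges, not the full set of DerivedBidiClass.txt defaults; the names are hypothetical:

```python
# (first, last, default Bidi_Class), transcribed from the comments above.
NEW_DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def new_default_bidi_class(cp):
    """Return the default class for cp if it is in one of the new ranges, else None."""
    for first, last, bidi_class in NEW_DEFAULT_BIDI_RANGES:
        if first <= cp <= last:
            return bidi_class
    return None


print(new_default_bidi_class(0x08A0))
print(new_default_bidi_class(0x109A0))
```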
4843
4844* NameAliases.txt changes
4845- from
4846    # Each line has two fields
4847    # First field: Code point
4848    # Second field: Alias
4849- to
4850    # Each line has three fields, as described here:
4851    #
4852    # First field:  Code point
4853    # Second field: Alias
4854    # Third field:  Type
- Also, the file previously allowed multiple aliases per code point, but only now does it
  actually provide more than one, even several of the same type. For example,
4857    FEFF;BYTE ORDER MARK;alternate
4858    FEFF;BOM;abbreviation
4859    FEFF;ZWNBSP;abbreviation
4860- This breaks our gennames parser, unames.icu data structure, and API.
4861  Fix gennames to only pick up "correction" aliases.
4862  New ticket #8963 for further changes.
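
Picking up only one alias type from the three-field format can be sketched as follows. This is a Python illustration of the filtering idea, not the gennames C code; the function name is hypothetical, and the U+01A2 line is added as a sample of a "correction" entry:

```python
def parse_name_aliases(lines, wanted_type="correction"):
    """Map code point -> list of aliases of the wanted type."""
    aliases = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        cp_hex, alias, alias_type = [f.strip() for f in line.split(";")]
        if alias_type == wanted_type:
            aliases.setdefault(int(cp_hex, 16), []).append(alias)
    return aliases


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
print(parse_name_aliases(sample))
print(parse_name_aliases(sample, "abbreviation"))
```

Storing a list per code point keeps the parser usable once several aliases of the same type need to be supported.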
4863
4864* run genpname/preparse.pl (on Linux)
4865  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
4866  + make sure that data.h is writable
4867  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
4868  + preparse.pl shows no errors, out.txt Info and Warning lines look ok
4869
4870* build ICU (make install)
4871  so that the tools build can pick up the new definitions from the installed header files.
4872* build Unicode tools (at least genpname) using CMake+make
4873
4874* run genpname
4875  (builds both pnames.icu and propname_data.h)
4876- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
4877- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
4878
4879* build ICU (make install)
4880* build Unicode tools using CMake+make
4881
4882* update source/data/unidata/norm2/nfkc_cf.txt
4883- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
4884
4885* update source/data/unidata/norm2/uts46.txt
4886- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
4887  to ~/svn.icu/tools/trunk/src/unicode/py
4888- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
4889- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
4890- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
4891
4892* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
4893  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
4894- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
4895- Unicode 6.0..6.1: U+2260, U+226E, U+226F
4896- nothing new in 6.1, no test file to update
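
The grep step can also be done with a short filter that ignores the ASCII range. A sketch, assuming the usual `code point(s) ; status ; ...` layout of IdnaMappingTable.txt; the function name is hypothetical and the sample lines are modeled on that layout, not copied from the file:

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) ranges beyond U+007F with status disallowed_STD3_valid."""
    result = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            result.append((start, end))
    return result


sample = [
    "0000..002C    ; disallowed_STD3_valid   # ASCII range, ignored",
    "00E0          ; valid",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid",
]
for start, end in non_ascii_std3_valid(sample):
    print("U+%04X..U+%04X" % (start, end))
```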
4897
4898* generate core properties data files
4899- in initial bootstrapping, change the UCA version
4900  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
4901- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
4902- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
4904    check if the UCA version in FractionalUCA.txt matches the new Unicode version
4905    (see step above)
4906- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
4907  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
4908- rebuild ICU & tools
4909
4910* update Java data files
4911- refresh just the UCD-related files, just to be safe
4912- see (ICU4C)/source/data/icu4j-readme.txt
4913- mkdir /tmp/icu4j
4914- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
4915  output:
4916    ...
4917    Unicode .icu files built to ./out/build/icudt49l
4918    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
4919    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
4920    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
4921    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
4922    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
4923    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
4924    mkdir -p /tmp/icu4j/main/shared/data
4925    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
4926    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
4927    mkdir -p /tmp/icu4j/main/shared/data
4928    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
4929    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
4930- copy the big-endian Unicode data files to another location,
4931  separate from the other data files
4932    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
4933    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
4934    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
4935    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
4936    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
4937    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
4938    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
4939- refresh ICU4J
4940    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
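
The cp/rm sequence above selects a fixed subset of the big-endian data folder. A Python sketch of the same selection rule, useful for checking a file list before copying; `is_unicode_data` is a hypothetical helper, and paths are taken relative to the icudt49b folder:

```python
from fnmatch import fnmatchcase


def is_unicode_data(rel_path):
    """True for files that the copy steps above would keep."""
    if rel_path == "cnvalias.icu":            # copied by *.icu, then removed again
        return False
    if "/" not in rel_path:                   # top level: *.icu and *.nrm
        return fnmatchcase(rel_path, "*.icu") or fnmatchcase(rel_path, "*.nrm")
    if rel_path.startswith("coll/"):          # coll/*.icu only
        return fnmatchcase(rel_path, "coll/*.icu")
    return rel_path.startswith("brkitr/")     # everything in brkitr/


sample = ["uprops.icu", "cnvalias.icu", "nfc.nrm", "root.res",
          "coll/ucadata.icu", "coll/root.res", "brkitr/char.brk"]
print([p for p in sample if is_unicode_data(p)])
```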
4941
4942* refresh Java test .txt files
4943- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
4944
4945* test ICU so far, fix test code where necessary
4946- temporarily ignore collation issues that look like UCA/UCD mismatches,
4947  until UCA data is updated
4948
4949* UCA
4950
4951- get output from Mark's tools; look in
4952    http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
4953- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
4954- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note the removed underscore before "Rules")
4956- update (ICU)/source/test/testdata/CollationTest_*.txt
4957  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
4958  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
4959- check test file diffs for previously commented-out, known-failing data lines;
4960  probably need to keep those commented out
4961- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
4962- run makeuca.sh:
4963  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
4964- rebuild ICU4C
4965- refresh ICU4J collation data:
4966  (subset of instructions above for properties data refresh, except copies all coll/*)
4967    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
4968    ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
4969    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
4970    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
4971- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
4972- note on intltest: if collate/UCAConformanceTest fails, then
4973  utility/MultithreadTest/TestCollators will fail as well;
4974  fix the conformance test before looking into the multi-thread test
4975
4976* When refreshing all of ICU4J data from ICU4C
4977- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
4978- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
4979or
4980- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
4981
4982*** LayoutEngine script information
4983
4984(For details see the Unicode 5.2 change log below.)
4985
4986* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
4987  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
4988  in the working directory.
4989  (It also generates ScriptRunData.cpp, which is no longer needed.)
4990
4991  The generated files have a current copyright date and "@draft" statement.
4992
4993- diff current <icu>/source/layout files vs. generated ones
4994    ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
4995  review and manually merge desired changes;
4996  fix gratuitous changes, incorrect @draft and missing aliases;
4997  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
4998- if you just copy the above files, then
4999  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
5000  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
5001
5002*** merge the Unicode update branches back onto the trunk
5003- do not merge the icudata.jar and testdata.jar,
5004  instead rebuild them from merged & tested ICU4C
5005
5006---------------------------------------------------------------------------- ***
5007
5008ICU 4.8 (no Unicode update, just new script codes)
5009
5010* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
5011  (added 2010-12-21)
5012    Afak    439     Afaka
5013    Jurc    510     Jurchen
5014    Mroo    199     Mro, Mru
5015    Nshu    499     Nüshu
5016    Shrd    319     Sharada, Śāradā
5017    Sora    398     Sora Sompeng
5018    Takr    321     Takri, Ṭākrī, Ṭāṅkrī
5019    Tang    520     Tangut
5020    Wole    480     Woleai
5021  -> uscript.h
5022  -> com.ibm.icu.lang.UScript
5023    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
5024    replace  public static final int \1 = \2;\3
5025  -> genpname/SyntheticPropertyValueAliases.txt
5026  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
5027      and in com.ibm.icu.dev.test.lang.TestUScript.java
5028
5029* run genpname/preparse.pl (on Linux)
5030  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
5031  + make sure that data.h is writable
5032  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
5033  + preparse.pl shows no errors, out.txt Info and Warning lines look ok
5034
5035* rebuild Unicode tools (at least genpname) using make
5036- You might first need to "make install" ICU so that the tools build can pick
5037  up the new definitions from the installed header files.
5038
5039* run genpname
5040  (builds both pnames.icu and propname_data.h)
5041- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
5042- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
5043- rebuild ICU & tools
5044
5045* run genprops
5046- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
5047- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
5048- rebuild ICU & tools
5049
5050* update Java data files
5051- refresh just the UCD-related files, just to be safe
5052- see (ICU4C)/source/data/icu4j-readme.txt
5053- mkdir /tmp/icu4j
5054- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5055- copy the big-endian Unicode data files to another location,
5056  separate from the other data files
5057    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
5058    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
5059    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
5060- refresh ICU4J
5061    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
5062
5063* should have updated the layout engine script codes but forgot
5064
5065---------------------------------------------------------------------------- ***
5066
5067Unicode 6.0 update
5068
5069*** related ICU Trac tickets
5070
7264 Unicode 6.0 Update
5072
5073*** Unicode version numbers
5074- makedata.mak
5075- uchar.h
  (configure.in & configure have been modified to extract the version from uchar.h)
5077- com.ibm.icu.util.VersionInfo
5078
5079*** data files & enums & parser code
5080
5081* file preparation
5082
5083~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
5084- This now prepares both unidata and testdata files in respective output subfolders.
5085
5086* PropertyAliases.txt changes
5087- new Script_Extensions property defined in the new ScriptExtensions.txt file
5088  but not listed in PropertyAliases.txt; reported to unicode.org;
5089  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
5090    scx; Script_Extensions
5091  -> uchar.h with new UProperty section
5092  -> com.ibm.icu.lang.UProperty, parallel with uchar.h
5093
5094* PropertyValueAliases.txt changes
5095- 12 new block names:
5096  Alchemical_Symbols
5097  Bamum_Supplement
5098  Batak
5099  Brahmi
5100  CJK_Unified_Ideographs_Extension_D
5101  Emoticons
5102  Ethiopic_Extended_A
5103  Kana_Supplement
5104  Mandaic
5105  Miscellaneous_Symbols_And_Pictographs
5106  Playing_Cards
5107  Transport_And_Map_Symbols
5108  -> add to uchar.h
5109  -> add to UCharacter.UnicodeBlock
5110    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
5111            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
5112- Joining_Group (jg) values:
  Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal, which becomes an alias
5114  -> uchar.h & UCharacter.JoiningGroup
5115- 3 new scripts:
5116  sc ; Batk      ; Batak
5117  sc ; Brah      ; Brahmi
5118  sc ; Mand      ; Mandaic
5119  -> remove these from SyntheticPropertyValueAliases.txt
5120  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
5121  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
5122      and in com.ibm.icu.dev.test.lang.TestUScript.java
5123- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
5124  (added 2009-11-11..2010-07-18)
5125  Bass        259     Bassa Vah
  Dupl        755     Duployan shorthand
5127  Elba        226     Elbasan
5128  Gran        343     Grantha
5129  Kpel        436     Kpelle
5130  Loma        437     Loma
5131  Mend        438     Mende
5132  Merc        101     Meroitic Cursive
5133  Narb        106     Old North Arabian
5134  Nbat        159     Nabataean
5135  Palm        126     Palmyrene
5136  Sind        318     Sindhi
5137  Wara        262     Warang Citi
5138  -> uscript.h
5139  -> com.ibm.icu.lang.UScript
5140    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
5141    replace  public static final int \1 = \2;\3
5142  -> SyntheticPropertyValueAliases.txt
5143  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
5144      and in com.ibm.icu.dev.test.lang.TestUScript.java
5145- ISO 15924 name change
5146  Mero        100     Meroitic Hieroglyphs (was Meroitic)
5147  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham; it had already been moved out of SyntheticPropertyValueAliases.txt
5149
5150* UnicodeData.txt changes
5151- new CJK block:
5152  2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
5153  2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
5154  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
5155
5156* build Unicode tools using CMake+make
5157
5158* run genpname/preparse.pl (on Linux)
5159  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
5160  + make sure that data.h is writable
5161  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
5162  + preparse.pl shows no errors, out.txt Info and Warning lines look ok
5163
5164* rebuild Unicode tools (at least genpname) using make
5165- You might first need to "make install" ICU so that the tools build can pick
5166  up the new definitions from the installed header files.
5167
5168* run genpname
5169- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
5170- rebuild ICU & tools
5171
5172* update source/data/unidata/norm2/nfkc_cf.txt
5173- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
5174
5175* update source/data/unidata/norm2/uts46.txt
5176- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
5177  to ~/svn.icu/tools/trunk/src/unicode/py
5178- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
5179- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
5180- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
5181
5182* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
5183  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
5184- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
5185- Unicode 6.0: U+2260, U+226E, U+226F
5186
5187* generate core properties data files
5188- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
5189- rebuild ICU & tools
5190- run makeuca.sh so that genuca picks up the new nfc.nrm:
5191  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
5192- rebuild ICU & tools
5193
5194* implement new Script_Extensions property (provisional)
5195- parser & generator: genprops & uprops.icu
5196- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
5197- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
5198
5199* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
5200- (one-time change)
5201- genbidi/gencase/genprops tools changes
5202- re-run makeprops.sh (see above)
5203- UCharacterProperty.java, UCharacterTypeIterator.java,
5204  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
5205  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
5206
5207* update Java data files
5208- refresh just the UCD-related files, just to be safe
5209- see (ICU4C)/source/data/icu4j-readme.txt
5210- mkdir /tmp/icu4j
5211- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5212  output:
5213    ...
5214    Unicode .icu files built to ./out/build/icudt45l
5215    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
5216    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
5217    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
5218    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
5219    mkdir -p /tmp/icu4j/main/shared/data
5220    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
5221- copy the big-endian Unicode data files to another location,
5222  separate from the other data files
5223    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5224    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
5225    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
5226    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
5227    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
5228    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5229    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
5230- refresh ICU4J
5231    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
5232
5233* refresh Java test .txt files
5234- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
5235
5236* un-hardcode normalization skippable (NF*_Inert) test data
5237- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
5238
5239* copy updated break iterator test files
5240- now handled by early ucdcopy.py and
5241  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
5242  (old instructions:
5243   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
5244   to ~/svn.icu/trunk/src/source/test/testdata)
5245- they are not used in ICU4J
5246
5247* UCA
5248
5249- get output from Mark's tools; look in
5250    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
5251    http://www.macchiato.com/unicode/utc/additional-uca-files
5252    http://www.unicode.org/Public/UCA/6.0.0/
5253    http://www.unicode.org/~mdavis/uca/
5254- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
5255- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
5256- update Han-implicit ranges for new CJK extensions:
5257  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
5258- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
5259  do not add it into invuca so that tailoring primary-after an ignorable works
5260- genuca: permit space between [variable top] bytes
5261- ucol.cpp: treat noncharacters like unassigned rather than ignorable
5262- run makeuca.sh:
5263  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
5264- rebuild ICU4C
5265- refresh ICU4J collation data:
5266  (subset of instructions above for properties data refresh, except copies all coll/*)
5267    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5268    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5269    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5270    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
5271- update (ICU)/source/test/testdata/CollationTest_*.txt
5272  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
5273  with output from Mark's Unicode tools
5274- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
5275- note on intltest: if collate/UCAConformanceTest fails, then
5276  utility/MultithreadTest/TestCollators will fail as well;
5277  fix the conformance test before looking into the multi-thread test
5278
5279* When refreshing all of ICU4J data from ICU4C
5280- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5281- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
5282or
5283- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
5284
5285*** LayoutEngine script information
5286
5287(For details see the Unicode 5.2 change log below.)
5288
5289* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
5290ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
5291ScriptRunData.cpp, which is no longer needed.)
5292
5293The generated files have a current copyright date and "@draft" statement.
5294
5295* copy the above files into <icu>/source/layout, replacing the old files.
5296* fix mixed line endings
5297* review the diffs and fix incorrect @draft and missing aliases;
5298  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
5299* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
5300
5301---------------------------------------------------------------------------- ***
5302
5303Unicode 5.2 update
5304
5305*** related ICU Trac tickets
5306
7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2
5317
5318*** Unicode version numbers
5319- makedata.mak
5320- uchar.h
5321- configure.in & configure
5322- update ucdVersion in gennames.c if an algorithmic range changes
5323
5324*** data files & enums & parser code
5325
5326* file preparation
5327
5328python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
5329- includes finding files regardless of version numbers,
5330  copying them, and performing the equivalent processing of the
5331  ucdstrip and ucdmerge tools on the desired set of files
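
"Finding files regardless of version numbers" can be approximated by removing a version suffix from the file name before comparing. This is a sketch of the idea only, not the ucdcopy.py implementation; the suffix pattern (`-x.y.z`, optionally followed by a draft number such as `d3`) is an assumption:

```python
import re

# Version suffix directly before the file extension, e.g. "-5.2.0" or "-5.2.0d3".
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version(filename):
    """Return the file name without a trailing version suffix."""
    return _VERSION_SUFFIX.sub("", filename)


for name in ["LineBreakTest-5.2.0.txt", "Unihan-5.2.0d3.txt", "UnicodeData.txt"]:
    print(name, "->", strip_version(name))
```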
5332
5333* notes on changes
5334- PropertyAliases.txt
5335  moved from numeric to enumerated:
5336    ccc       ; Canonical_Combining_Class
5337  new string properties:
5338    NFKC_CF   ; NFKC_Casefold
5339    Name_Alias; Name_Alias
5340  new binary properties:
5341    Cased     ; Cased
5342    CI        ; Case_Ignorable
5343    CWCF      ; Changes_When_Casefolded
5344    CWCM      ; Changes_When_Casemapped
5345    CWKCF     ; Changes_When_NFKC_Casefolded
5346    CWL       ; Changes_When_Lowercased
5347    CWT       ; Changes_When_Titlecased
5348    CWU       ; Changes_When_Uppercased
5349  new CJK Unihan properties (not supported by ICU)
5350- PropertyValueAliases.txt
5351  new block names
5352  new scripts
5353  one script code change:
5354    sc ; Qaai      ; Inherited
5355    ->
5356    sc ; Zinh      ; Inherited                        ; Qaai
5357  new Line_Break (lb) value:
5358    lb ; CP        ; Close_Parenthesis
5359  new Joining_Group (jg) values: Farsi_Yeh, Nya
5360  other new values:
5361    ccc; 214; ATA  ; Attached_Above
5362- DerivedBidiClass.txt
5363  new default-R range: U+1E800 - U+1EFFF
5364- UnicodeData.txt
5365  all of the ISO comments are gone
5366  new CJK block end:
5367    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
5368  new CJK block:
5369    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
5370    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
5371
5372* genpname
5373- run preparse.pl
5374  + cd \svn\icuproj\icu\trunk\source\tools\genpname
5375  + make sure that data.h is writable
5376  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
5377  + preparse.pl complains with errors like the following:
5378      Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
5379    This is because ICU 4.0 had scripts from ISO 15924 which are now
5380    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
5381    and PropertyValueAliases.txt.
5382    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
5383       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
5384  + preparse.pl complains with errors about block names missing from uchar.h; add them
5385
5386* uchar.h & uscript.h & uprops.h & uprops.c & genprops
5387- new block & script values
5388  + 26 new blocks
5389    copy new blocks from Blocks.txt
5390    MS VC++ 2008 regular expression:
5391      find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
5392      replace with "    UBLOCK_\3 = 172, /*[\1]*/"
5393  + several new script values already added in ICU 4.0 for ISO 15924 coverage
5394    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
5395  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
5396  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
5397    (added to SyntheticPropertyValueAliases.txt)
5398- new Joining Group (JG) values: Farsi_Yeh, Nya
5399- new Line_Break (lb) value:
5400    lb ; CP        ; Close_Parenthesis
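
The Visual Studio find/replace for new blocks above can be sketched in Python as follows.
Visual Studio 2008 writes groups as `{...}`, Python's `re` writes them as `(...)`.
Turning the block name into an upper-case identifier and numbering the constants are
additions that the original regex leaves to manual editing; `block_enum_lines` and its
`first_id` parameter are hypothetical names.

```python
import re

# Matches a Blocks.txt data line such as "A960..A97F; Hangul Jamo Extended-A".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_enum_lines(blocks_txt_lines, first_id):
    """Return uchar.h-style enum lines for the given Blocks.txt lines."""
    out = []
    next_id = first_id
    for line in blocks_txt_lines:
        match = BLOCK_LINE.match(line.strip())
        if not match:
            continue  # comments, empty lines
        start, _end, name = match.groups()
        # Assumption: constant names are the block names in upper case,
        # with every run of non-alphanumeric characters turned into "_".
        const = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
        out.append(f"    UBLOCK_{const} = {next_id}, /*[{start}]*/")
        next_id += 1
    return out


# first_id=1000 is a placeholder, not the real enum value.
enum_lines = block_enum_lines(
    ["A960..A97F; Hangul Jamo Extended-A", "# comment", "D7B0..D7FF; Hangul Jamo Extended-B"],
    first_id=1000,
)
```

Review the generated names against the existing constants in uchar.h before pasting them in.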
5401
5402* hardcoded Unihan range end/limit
5403- Unihan range end moves from 9FC3 to 9FCB
5404  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
5405  + do change gennames.c
5406
5407* Compare definitions of new binary properties with what we used to use
5408  in algorithms, to see if the definitions changed.
5409- Verified that definitions for Cased and Case_Ignorable are unchanged.
5410  The gencase tool now parses the newly public Case_Ignorable values
5411  in case the definition changes in the future.
5412
5413* uchar.c & uprops.h & uprops.c & genprops
5414- new numeric values that didn't exist in Unicode data before:
5415    1/7, 1/9, 1/10, 3/10, 1/16, 3/16
5416  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
5417  therefore redesign the encoding of numeric types and values for formatVersion 6;
5418  design for simple numbers up to at least 144 ("one gross"),
5419  large values up to at least 10^20,
5420  and fractions with numerators -1..17 and denominators 1..16
5421  to cover current and expected future values
5422  (e.g., more Han numeric values, Meroitic twelfths)
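
A small check of the fraction ranges named above, as a sketch. It only tests the stated
design ranges (numerators -1..17, denominators 1..16) and the "denominators >9" limit of
formatVersion 5; it does not model the actual bit layout in uprops.icu. The names are
hypothetical.

```python
from fractions import Fraction

# The new numeric values listed for Unicode 5.2.
NEW_VALUES = [
    Fraction(1, 7), Fraction(1, 9), Fraction(1, 10),
    Fraction(3, 10), Fraction(1, 16), Fraction(3, 16),
]


def fits_format_version_6(value):
    """True if the fraction is inside the ranges designed for formatVersion 6."""
    return -1 <= value.numerator <= 17 and 1 <= value.denominator <= 16


# Values that formatVersion 5 cannot represent according to the note above.
needs_new_format = [v for v in NEW_VALUES if v.denominator > 9]
all_fit = all(fits_format_version_6(v) for v in NEW_VALUES)
```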
5423
5424* reimplement Hangul_Syllable_Type for new Jamo characters
5425- the old code assumed that all Jamo characters are in the 11xx block
5426- Unicode 5.2 fills holes there and adds new Jamo characters in
5427    A960..A97F; Hangul Jamo Extended-A
5428  and in
5429    D7B0..D7FF; Hangul Jamo Extended-B
5430- Hangul_Syllable_Type can be trivially derived from a subset of
5431  Grapheme_Cluster_Break values
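
The derivation mentioned above can be sketched as a plain lookup, using the short property
value names from PropertyValueAliases.txt. The function name is hypothetical and this is
not the ICU implementation.

```python
# Grapheme_Cluster_Break values that correspond one-to-one to
# Hangul_Syllable_Type values of the same name.
HST_FROM_GCB = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb_value):
    """Map a GCB short value name to an hst short value name (NA = Not_Applicable)."""
    return HST_FROM_GCB.get(gcb_value, "NA")
```

With this approach no code needs to know which blocks contain Jamo characters.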
5432
5433* build Unicode data source code for hardcoding core data
5434C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
5435
5436ICU data make path is \svn\icuproj\icu\trunk\source\data\
5437ICU root path is \svn\icuproj\icu\trunk
5438Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
5439Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
5440Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
5441Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
5442Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
5443Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
5444Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
5445Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
5446Creating data file for Unicode Property Names
5447Creating data file for Unicode Character Properties
5448Creating data file for Unicode Case Mapping Properties
5449Creating data file for Unicode BiDi/Shaping Properties
5450Creating data file for Unicode Normalization
5451Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
5452Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
5453
5454- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
5455  and rebuild the common library
5456
5457*** UCA
5458
5459- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
5460- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
5461- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
5462[ Begin obsolete instructions:
5463  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
5464    - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
5465      on Windows:
5466        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
5467        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
5468  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments),
  not just the *_STUB.txt files
5471- note on intltest: if collate/UCAConformanceTest fails, then
5472  utility/MultithreadTest/TestCollators will fail as well;
5473  fix the conformance test before looking into the multi-thread test
5474
5475*** Implement Cased & Case_Ignorable properties
5476- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
5477- Problem: These properties should be disjoint, but aren't
5478- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
5479- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
5480
5481*** Implement Changes_When_Xyz properties
5482- without stored data
5483
5484*** Implement Name_Alias property
5485- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice
- consider it in u_charFromName()
5488
5489*** Break iterators
5490
5491* Update break iterator rules to new UAX versions and new property values
5492* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
5493
5494*** new BidiTest file
5495- review format and data
5496- copy BidiTest.txt to source/test/testdata
5497- write test code using this data
5498- fix ICU code where it fails the conformance test
5499
5500*** Java
5501- generally, find and update code corresponding to C/C++
5502- UCharacter.UnicodeBlock constants:
5503  a) add an _ID integer per new block, update COUNT
5504  b) add a class instance per new block
5505     Visual Studio regex:
5506        find            UBLOCK_{[^ ]+} = [0-9]+, {/.+}
5507        replace with    public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
5508- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
5509
5510- port test changes to Java
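
The Visual Studio regex in step b) above has a direct Python equivalent, sketched here.
The function name is hypothetical, and the sample line uses a placeholder ID rather than
a real enum value.

```python
import re

# Same pattern as the Visual Studio one, with (...) instead of {...} for groups.
ENUM_LINE = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")

JAVA_TEMPLATE = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def to_java_constant(c_enum_line):
    """Turn one uchar.h UBLOCK_ enum line into a UnicodeBlock instance declaration."""
    return ENUM_LINE.sub(JAVA_TEMPLATE, c_enum_line)


java_line = to_java_constant("    UBLOCK_HANGUL_JAMO_EXTENDED_A = 1000, /*[A960]*/")
```

The `_ID` constants from step a) and the COUNT update still have to be added by hand.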
5511
5512*** LayoutEngine script information
5513
5514(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
5515
5516* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
5517ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
5518ScriptRunData.cpp, which is no longer needed.)
5519
5520The generated files have a current copyright date and "@draft" statement.
5521
5522-> Eric Mader wrote in email on 20090930:
5523    "I think the tool has been modified to update @draft to @stable for
5524     older scripts and to add @draft for new scripts.
5525     (I worked with an intern on this last year.)
5526     You should check the output after you run it."
5527
5528* copy the above files into <icu>/source/layout, replacing the old files.
5529* fix mixed line endings
5530* review the diffs and fix incorrect @draft and missing aliases
5531* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
5532
5533Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
5534and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
5535
5536-> Eric Mader wrote in email on 20090930:
5537    "This is just a matter of making sure that all the per-script tables have
5538     entries for any new scripts that were added.
5539     If any new Indic characters were added, then the class tables in
5540     IndicClassTables.cpp should be updated to reflect this.
5541     John Emmons should know how to do this if it's required."
5542
5543* rebuild the layout and layoutex libraries.
5544
5545*** Documentation
5546- Update User Guide
5547  + Jamo_Short_Name, sfc->scf, binary property value aliases
5548
5549---------------------------------------------------------------------------- ***
5550
5551Unicode 5.1 update
5552
5553*** related ICU Trac tickets
5554
5696 Update to Unicode 5.1
5556
5557*** Unicode version numbers
5558- makedata.mak
5559- uchar.h
5560- configure.in & configure
5561- update ucdVersion in gennames.c if an algorithmic range changes
5562
5563*** data files & enums & parser code
5564
5565* file preparation
5566- ucdstrip:
5567    DerivedCoreProperties.txt
5568    DerivedNormalizationProps.txt
5569    NormalizationTest.txt
5570    PropList.txt
5571    Scripts.txt
5572    GraphemeBreakProperty.txt
5573    SentenceBreakProperty.txt
5574    WordBreakProperty.txt
5575- ucdstrip and ucdmerge:
5576    EastAsianWidth.txt
5577    LineBreak.txt
5578
5579* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
5580copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
5581copy 5.1.0\ucd\Blocks.txt ..\unidata\
5582copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
5583copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
5584copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
5585copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
5586copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
5587copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
5588copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
5589copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
5590copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
5591copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
5592copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
5593
5594ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
5595ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
5596ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
5597ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
5598ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
5599ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
5600ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
5601ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
5602ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
5603ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
5604
5605* genpname
5606- run preparse.pl
5607  + cd \svn\icuproj\icu\uni51\source\tools\genpname
5608  + make sure that data.h is writable
5609  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
5610  + preparse.pl complains with errors like the following:
5611      Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
5612    This is because ICU 3.8 had scripts from ISO 15924 which are now
5613    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
5614    and PropertyValueAliases.txt.
5615    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
5616       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
5617  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
5618      N/Y, No/Yes, F/T, False/True
5619    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
5620       It will use further values from the file if present.
5621
5622* uchar.h & uscript.h & uprops.h & uprops.c & genprops
5623- new block & script values
5624  + 17 new blocks
5625  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
5626    (removed from SyntheticPropertyValueAliases.txt)
5627  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
5628    (added to SyntheticPropertyValueAliases.txt)
5629- uprops.icu (uprops.h) only provides 7 bits for script codes.
  ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
  No assigned Unicode character has a script code above 127 yet,
  so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
5634  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
5635  in a parallel bit field, and that overflows now.
5636  Also, future values >=128 would be incompatible anyway.
5637  uprops.h is modified to move around several of the bit fields
5638  in the properties vector words, and now uses 8 bits for the script code.
5639  Two other bit fields also grow to accommodate future growth:
5640  Block (current count: 172) grows from 8 to 9 bits,
5641  and Word_Break grows from 4 to 5 bits.
5642- renamed property Simple_Case_Folding (sfc->scf)
5643  + nothing to be done: handled as normal alias
5644- new property JSN Jamo_Short_Name
5645  + no new API: only contributes to the Name property
5646- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
5647- new Joining Group (JG) value: Burushashki_Yeh_Barree
5648- new Sentence_Break (SB) values:
5649    SB ; CR        ; CR
5650    SB ; EX        ; Extend
5651    SB ; LF        ; LF
5652    SB ; SC        ; SContinue
5653- new Word_Break (WB) values:
5654    WB ; CR        ; CR
5655    WB ; Extend    ; Extend
5656    WB ; LF        ; LF
5657    WB ; MB        ; MidNumLet
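
For the bit field sizes discussed in the uprops.h item above, a quick way to check how many
bits a maximum value needs. This is only the arithmetic, under the assumption that values are
stored as unsigned integers starting at 0; `bits_needed` is a hypothetical helper.

```python
def bits_needed(max_value):
    """Number of bits needed to store unsigned values 0..max_value."""
    return max(1, max_value.bit_length())


USCRIPT_CODE_LIMIT = 130  # ICU 4.0 value from the note above
BLOCK_COUNT = 172         # current block count from the note above

script_bits = bits_needed(USCRIPT_CODE_LIMIT - 1)  # maximum script value 129
block_bits = bits_needed(BLOCK_COUNT - 1)
```

A 7-bit field ends at 127, so the maximum script value 129 no longer fits. The Block field
still fits into 8 bits today; the move to 9 bits is headroom for future growth.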
5658
5659* Further changes in the 2008-02-29 update:
5660- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
5661  because they should not normally be invisible.
5662- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
5663- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
5664- new Word_Break (WB) value: NL=Newline
5665
5666* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
5667- Unihan range end moves from 9FBB to 9FC3
5668  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
5669  + do change gennames.c
5670
5671* build Unicode data source code for hardcoding core data
5672C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
5673
5674ICU data make path is \svn\icuproj\icu\uni51\source\data\
5675ICU root path is \svn\icuproj\icu\uni51
5676Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
5677Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
5678Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
5679Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
5680Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
5681Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
5682Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
5683Creating data file for Unicode Character Properties
5684Creating data file for Unicode Case Mapping Properties
5685Creating data file for Unicode BiDi/Shaping Properties
5686Creating data file for Unicode Normalization
5687Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
5688Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
5689
5690- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
5691  and rebuild the common library
5692
5693*** Break iterators
5694
5695* Update break iterator rules to new UAX versions and new property values
5696
5697*** UCA
5698
5699* update FractionalUCA.txt and UCARules.txt with new canonical closure
5700
5701*** Test suites
5702- Test that APIs using Unicode property value aliases (like UnicodeSet)
5703  support all of the boolean values N/Y, No/Yes, F/T, False/True
5704  -> TestBinaryValues() tests in both cintltst and intltest
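
A sketch of the kind of check such a test performs: every one of the eight aliases must map
to the expected boolean. Case-insensitive matching is an assumption here, and the names are
hypothetical; the real tests go through the ICU property APIs.

```python
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def parse_binary_value(alias):
    """Return the boolean for a binary property value alias, or raise ValueError."""
    try:
        return BINARY_VALUE_ALIASES[alias.strip().lower()]
    except KeyError:
        raise ValueError(f"not a binary property value alias: {alias!r}") from None


false_results = [parse_binary_value(a) for a in ("N", "No", "F", "False")]
true_results = [parse_binary_value(a) for a in ("Y", "Yes", "T", "True")]
```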
5705
5706*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)
5710
5711The generated files have a current copyright date and "@draft" statement.
5712
5713* copy the above files into <icu>/source/layout, replacing the old files.
5714
5715Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
5716and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
5717
5718* rebuild the layout and layoutex libraries.
5719
5720*** Documentation
5721- Update User Guide
5722  + Jamo_Short_Name, sfc->scf, binary property value aliases
5723
5724---------------------------------------------------------------------------- ***
5725
5726Unicode 5.0 update
5727
5728*** related Jitterbugs
5729
5084 RFE: Update to Unicode 5.0
5731
5732*** data files & enums & parser code
5733
5734* file preparation
5735- ucdstrip:
5736    DerivedCoreProperties.txt
5737    DerivedNormalizationProps.txt
5738    NormalizationTest.txt
5739    PropList.txt
5740    Scripts.txt
5741    GraphemeBreakProperty.txt
5742    SentenceBreakProperty.txt
5743    WordBreakProperty.txt
5744- ucdstrip and ucdmerge:
5745    EastAsianWidth.txt
5746    LineBreak.txt
5747
5748* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
5749copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
5750copy 5.0.0\ucd\Blocks.txt ..\unidata\
5751copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
5752copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
5753copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
5754copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
5755copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
5756copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
5757copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
5758copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
5759copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
5760copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
5761copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
5762
5763ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
5764ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
5765ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
5766ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
5767ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
5768ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
5769ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
5770ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
5771ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
5772ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
5773
5774* update FractionalUCA.txt and UCARules.txt with new canonical closure
5775
5776* genpname
5777- run preparse.pl
5778  + make sure that data.h is writable
5779  + perl preparse.pl \cvs\oss\icu > out.txt
5780
5781* uchar.h & uscript.h & uprops.h & uprops.c & genprops
5782- new block & script values
5783  + script values already added in ICU 3.6 because all of ISO 15924 is now covered
5784
5785* build Unicode data source code for hardcoding core data
5786C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
5787
5788ICU data make path is \cvs\oss\icu\source\data\
5789ICU root path is \cvs\oss\icu
5790Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
5791[etc.]
5792Creating data file for Unicode Character Properties
5793Creating data file for Unicode Case Mapping Properties
5794Creating data file for Unicode BiDi/Shaping Properties
5795Creating data file for Unicode Normalization
5796Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
5797Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
5798
5799- copy the .c source files to C:\cvs\oss\icu\source\common
5800  and rebuild the common library
5801
5802*** Unicode version numbers
5803- makedata.mak
5804- uchar.h
5805- configure.in
5806
5807*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)
5811
5812The generated files have a current copyright date and "@draft" statement.
5813
5814* copy the above files into <icu>/source/layout, replacing the old files.
5815
5816Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
5817and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
5818
5819* rebuild the layout and layoutex libraries.
5820
5821---------------------------------------------------------------------------- ***
5822
5823Unicode 4.1 update
5824
5825*** related Jitterbugs
5826
4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates
5829
5830*** data files & enums & parser code
5831
5832* file preparation
5833- ucdstrip:
5834    DerivedCoreProperties.txt
5835    DerivedNormalizationProps.txt
5836    NormalizationTest.txt
5837    GraphemeBreakProperty.txt
5838    SentenceBreakProperty.txt
5839    WordBreakProperty.txt
5840- ucdstrip and ucdmerge:
5841    EastAsianWidth.txt
5842    LineBreak.txt
5843
5844* add new files to the repository
5845    GraphemeBreakProperty.txt
5846    SentenceBreakProperty.txt
5847    WordBreakProperty.txt
5848
5849* update FractionalUCA.txt and UCARules.txt with new canonical closure
5850
5851* genpname
5852- handle new enumerated properties in sub read_uchar
5853- run preparse.pl
5854
5855* uchar.h & uscript.h & uprops.h & uprops.c & genprops
5856- new binary properties
5857  + Pattern_Syntax
5858  + Pattern_White_Space
5859- new enumerated properties
5860  + Grapheme_Cluster_Break
5861  + Sentence_Break
5862  + Word_Break
5863- new block & script & line break values
5864
5865* gencase
5866- case-ignorable changes
5867  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
5868  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
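
The D47a definition quoted above can be sketched as a predicate. Python's `unicodedata`
supplies the General_Category but not Word_Break, so the MidLetter set is passed in; the
sample set below is an illustrative subset, not the contents of WordBreakProperty.txt.
Note that `unicodedata` uses the Unicode version bundled with Python, not Unicode 4.1.
This is not the gencase implementation.

```python
import unicodedata

# General categories named in the definition.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch, mid_letter):
    """D47a sketch: Word_Break=MidLetter, or gc in Mn, Me, Cf, Lm, Sk."""
    return ch in mid_letter or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


# Illustrative subset only (MIDDLE DOT, HYPHENATION POINT).
SAMPLE_MID_LETTER = {"\u00B7", "\u2027"}
```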
5869
5870*** Unicode version numbers
5871- makedata.mak
5872- uchar.h
5873- configure.in
5874
5875*** tests
5876- verify that u_charMirror() round-trips
5877- test all new properties and some new values of old properties
5878
5879*** other code
5880
5881* hardcoded Unihan range end/limit
5882- Unihan range end moves from 9FA5 to 9FBB
5883  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
5884  + do not modify BOCU/BOCSU code because that would change the encoding
5885    and break binary compatibility!
5886  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
5887    NamePrepProfile.txt
5888  + ignore trietest.c: test data is arbitrary
5889  + ignore tstnorm.cpp: test optimization, not important
5890  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
5891  + do change line_th.txt and word_th.txt
5892    by replacing hardcoded ranges with the new property values
5893  + do change gennames.c
5894
5895source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
5896source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
5897source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
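
The search that produced a list like the one above can be sketched in Python. The function
names, the default file suffixes, and the output tuple format are assumptions for
illustration; any grep-like tool with a case-insensitive regex does the same job.

```python
import re
from pathlib import Path


def matching_lines(lines, pattern=r"9FA[56]"):
    """Return (line number, stripped line) for each line matching the regex, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(n, line.strip()) for n, line in enumerate(lines, start=1) if rx.search(line)]


def scan_tree(root, pattern=r"9FA[56]", suffixes=(".c", ".cpp", ".h", ".txt")):
    """Walk a source tree and collect (path, line number, line) for every hit."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend((str(path), n, line) for n, line in matching_lines(text.splitlines(), pattern))
    return hits


hits = matching_lines(["0x4e00, 0x9fa5,", "\\u4E00-\\u9FA5 \\uA000-\\uA48C", "0x9fff"])
```

Each hit still needs the manual review described in the bullets above, because several
matches must deliberately stay unchanged.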
5898
5899* case mappings
5900- compare new special casing context conditions with previous ones
5901  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
5902
5903* genpname
5904- consider storing only the short name if it is the same as the long name
5905
5906*** other reviews
5907- UAX #29 changes (grapheme/word/sentence breaks)
5908- UAX #14 changes (line breaks)
5909- Pattern_Syntax & Pattern_White_Space
5910
5911---------------------------------------------------------------------------- ***
5912
5913Unicode 4.0.1 update
5914
5915*** related Jitterbugs
5916
3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration
5920
5921*** data files & enums & parser code
5922
5923* file preparation
5924- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
5925- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
5926
5927* file fixes
5928- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
5929  according to PRI #26
5930  http://www.unicode.org/review/resolved-pri.html#pri26
5931- undone again because no corrigendum in sight;
5932  instead modified tests to not check consistency on this for Unicode 4.0.1
5933
5934* ucdterms.txt
5935- update from http://www.unicode.org/copyright.html
5936  formatted for plain text
5937
5938* uchar.h & uprops.h & uprops.c & genprops
5939- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
5940- add U_LB_INSEPARABLE due to a spelling fix
5941  + put short name comment only on line with new constant
5942    for genpname perl script parser
5943- new binary properties
5944  + STerm
5945  + Variation_Selector
5946
5947* genpname
5948- fix genpname perl script so that it doesn't choke on more than 2 names per property value
5949- perl script: correctly calculate the maximum number of fields per row
5950
5951* uscript.h
5952- new script code Hrkt=Katakana_Or_Hiragana
5953
5954* gennorm.c track changes in DerivedNormalizationProps.txt
5955- "FNC" -> "FC_NFKC"
5956- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
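
A sketch of a parser that accepts both the old single-field and the new two-field form of
a quick check line. It assumes that the old names follow the pattern `<form>_NO` or
`<form>_MAYBE` and that the new value is the first letter of the old suffix. The function
name is hypothetical; this is not the gennorm.c code, and it handles quick check lines only.

```python
def parse_quick_check(line):
    """Return (start, end, property, value) for a quick check line, else None."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) >= 3 and fields[1].endswith("_QC"):
        # New format: "NFD_QC; N"
        prop, value = fields[1], fields[2]
    elif len(fields) == 2 and fields[1].endswith(("_NO", "_MAYBE")):
        # Old format: "NFD_NO" (assumed naming pattern)
        form, _, suffix = fields[1].rpartition("_")
        prop, value = form + "_QC", suffix[0]
    else:
        return None
    start, _, end = fields[0].partition("..")
    return (int(start, 16), int(end or start, 16), prop, value)
```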
5957
5958* genprops/props2.c track changes in DerivedNumericValues.txt
5959- changed from 3 columns to 2, dropping the numeric type
5960  + assume that the type is always numeric for Han characters,
5961    and that only those are added in addition to what UnicodeData.txt lists
5962
5963*** Unicode version numbers
5964- makedata.mak
5965- uchar.h
5966- configure.in
5967
5968*** tests
5969- update test of default bidi classes according to PRI #28
5970  /tsutil/cucdtst/TestUnicodeData
5971  http://www.unicode.org/review/resolved-pri.html#pri28
5972- bidi tests: change exemplar character for ES depending on Unicode version
5973- change hardcoded expected property values where they change
5974
5975*** other code
5976
5977* name matching
5978- read UCD.html
5979
5980* scripts
5981- use new Hrkt=Katakana_Or_Hiragana
5982
5983* ZWJ & ZWNJ
5984- are now part of combining character sequences
5985- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
5986