1* Copyright (C) 2016 and later: Unicode, Inc. and others.
2* License & terms of use: http://www.unicode.org/copyright.html
3* Copyright (C) 2004-2016, International Business Machines
4* Corporation and others.  All Rights Reserved.
5*
6*   file name:  changes.txt
7*   encoding:   US-ASCII
8*   tab size:   8 (not used)
9*   indentation:4
10*
11*   created on: 2004may06
12*   created by: Markus W. Scherer
13
14* change log for Unicode updates
15
16For an overview, see https://unicode-org.github.io/icu/processes/unicode-update
17
18Notes:
19
20This log includes several command lines as used in the update process.
21Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
22Use a console window that is set to that directory, or cd to there,
23and then paste the command that follows the $ sign.
24
25Most command lines use environment variables to make them more portable across versions
26and machine configurations. When you set up a console window, copy & paste the `export` commands
27from near the top of the current section before pasting tool command lines.
28Adjust the environment variables to the current version and your machine setup.
29(The command lines are currently as used on Linux.)
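
Before pasting tool command lines, it can help to confirm that the `export` commands took effect.
A minimal sketch, assuming the variable names from the setup blocks in this log;
the function name check_unicode_env is made up for illustration:

```sh
# Sketch: report which of the expected environment variables are unset or empty.
# The list mirrors the `export` lines used in this log; adjust it as needed.
check_unicode_env() {
    missing=""
    for name in UNIDATA_ROOT UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICU_OUT \
                ICUDT ICU4C_DATA_IN ICU4C_UNIDATA LD_LIBRARY_PATH UNICODE_TOOLS; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all set"
}
```

Run `check_unicode_env` in the console window after the exports; it prints the missing names, if any.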
30
31Syntax of this file:
32
33`***` - section heading
34`*` - sub heading
35`-` - 1st level bullet
36`+` - 2nd level bullet
37`=` - 1st level bullet
38`->` - "the previous thing leads to...", OR a 2nd level bullet/item
39
40---------------------------------------------------------------------------- ***
41
42* New ISO 15924 script codes
43
44Normally, add new script codes as part of a Unicode update.
45See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
46and see the change logs below.
47
48---------------------------------------------------------------------------- ***
49
50Unicode 16.0 update for ICU 76
51
52https://www.unicode.org/versions/Unicode16.0.0/
53https://www.unicode.org/versions/beta-16.0.0.html
54https://www.unicode.org/Public/draft/
55https://www.unicode.org/reports/uax-proposed-updates.html
56https://www.unicode.org/reports/tr44/tr44-33.html
57
58https://unicode-org.atlassian.net/browse/ICU-22707 Unicode 16
59https://unicode-org.atlassian.net/browse/CLDR-17226 BRS Unicode 16
60
61https://github.com/unicode-org/unicodetools/pull/774 delete the RecommendedSetGenerator
62
63https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1
64
65* Command-line environment setup
66
67Markus:
68
69export UNIDATA_ROOT=~/unidata
70export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
71export CLDR_SRC=~/cldr/uni/src
72export ICU_ROOT=~/icu/uni
73export ICU_SRC=$ICU_ROOT/src
74export ICU_OUT=$ICU_ROOT/dbg
75export ICUDT=icudt76b
76export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
77export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
78export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
79export UNICODE_TOOLS=~/unitools/mine/src
80
81Elango:
82
83export UNIDATA_ROOT=~/oss/unidata
84export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
85export CLDR_SRC=~/oss/cldr/mine/src
86export ICU_ROOT=~/oss/icu
87export ICU_SRC=$ICU_ROOT
88export ICU_OUT=$ICU_ROOT
89export ICUDT=icudt76b
90export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
91export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
92export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
93export UNICODE_TOOLS=~/oss/unicodetools/mine/src
94
95*** Unicode version numbers
96- icu4c/source/data/makedata.mak
97- icu4c/source/common/unicode/uchar.h
98- com.ibm.icu.util.VersionInfo
99- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
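
To double-check the version number in uchar.h after editing, something like the following may be enough.
This is a sketch only: it assumes that the header defines the version as a quoted string
in a single-line `#define U_UNICODE_VERSION "..."`, and the function name is made up.

```sh
# Sketch: print the value of U_UNICODE_VERSION from a uchar.h-style header.
# Usage: unicode_version_in_header path/to/uchar.h
unicode_version_in_header() {
    # Matches lines such as:  #define U_UNICODE_VERSION "16.0"
    sed -n 's/^#define[[:space:]]\{1,\}U_UNICODE_VERSION[[:space:]]\{1,\}"\([^"]*\)".*/\1/p' "$1"
}
```

For example: unicode_version_in_header $ICU_SRC/icu4c/source/common/unicode/uchar.h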
100
101*** Configure: Build Unicode data for ICU4J
102- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
103    so that the makefiles see the new version number.
104- FYI: The option that adds the additional Unicode data files for ICU4J is
105    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data
106- Markus's version:
107  cd $ICU_OUT/icu4c
108  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ../../src/icu4c/source/runConfigureICU --enable-debug --disable-release Linux/clang --prefix=/usr/local/google/home/mscherer/icu/mine/inst/icu4c > config.out 2>&1 ; tail config.out
109- Elango's version (different default C++ compiler & in-source build paths):
110  cd $ICU_OUT/icu4c/source
111  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ./runConfigureICU --enable-debug --disable-release Linux/gcc --prefix=/usr/local/google/home/elango/oss/icu/icu4c > config.out 2>&1 ; tail config.out
112
113*** data files & enums & parser code
114
115* download files
116- same as for the early Unicode Tools setup and data refresh:
117  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
118  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
119- mkdir -p $UNICODE_DATA
120- download Unicode files into $UNICODE_DATA
121  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
122  + subfolders: emoji, idna, security, ucd, uca
123  + for pre-release (alpha, beta) data files:
124    ~ if one of us produces the alpha.zip or beta.zip collection of data files for publication,
125      then we can use its contents directly (no FTP from unicode.org necessary)
126    ~ otherwise download from https://www.unicode.org/Public/draft/
127    ~ you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders
128    ~ you can omit or discard UCD/ucd/Unihan.zip
129  + alternate way of fetching files, if available:
130    copy the files from a Unicode Tools workspace that is up to date with
131    https://github.com/unicode-org/unicodetools
132    and which might at this point be *ahead* of "Public"
133    ~ before the Unicode release copy files from "dev" subfolders, for example
134      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
135  + for final-release data files, the source of truth is the files in
136    https://www.unicode.org/Public/(version) [=UCD],
137    https://www.unicode.org/Public/UCA/(version),
138    https://www.unicode.org/Public/idna/(version),
139    etc.
140- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
141  or from the UCD/cldr/ output folder of the Unicode Tools:
142  From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
143  CLDR used modified grapheme break rules.
144  This might happen again.
145  + To check in the Unicode Tools workspace:
146    ~/unitools/mine/Generated$ meld UCD/16.0.0/auxiliary/*GraphemeBreakTest.txt UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt
147  + If different, and after copying into CLDR:
148    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
149  or
150    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
151    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
152    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
153  + We may need CLDR versions of WordBreakTest.txt and LineBreakTest.txt
154    unless Unicode 16 and CLDR 46 eliminate their differences:
155    unicodetools issue #492
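
Where no GUI diff tool such as meld is at hand, a script can at least answer whether two break test files
carry the same test data. A minimal sketch, assuming that only '#' comments and blank lines may be ignored;
both function names are made up, and a real review of differences still needs a proper diff:

```sh
# Sketch: compare two *BreakTest.txt-style files while ignoring '#' comments.
strip_comments() {
    # Drop everything from '#' to the end of the line, trim trailing
    # whitespace, and skip lines that end up empty.
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Usage: same_test_data FILE_A FILE_B   (exit status 0 means same data)
same_test_data() {
    a=$(strip_comments "$1") || return 2
    b=$(strip_comments "$2") || return 2
    [ "$a" = "$b" ]
}
```

For example: same_test_data A.txt B.txt && echo "same data" || echo "different"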
156
157* process and/or copy files
158- cd $ICU_SRC/tools/unicode
159    py/preparseucd.py $UNICODE_DATA $ICU_SRC
160  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
161  + For debugging, and tweaking how ppucd.txt is written,
162    the tool has an --only_ppucd option:
163      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
164    e.g.
165      py/preparseucd.py $UNICODE_DATA --only_ppucd /tmp/ppucd.txt
166
167* new constants for new property values
168- preparseucd.py error:
169    ValueError: missing uchar.h enum constants for some property values:
170    [('blk', {'Garay', 'Tulu_Tigalari', 'Todhri', 'Sunuwar', 'Egyptian_Hieroglyphs_Ext_A', 'Kirat_Rai', 'Symbols_For_Legacy_Computing_Sup', 'Myanmar_Ext_C', 'Ol_Onal', 'Gurung_Khema'}),
171    ('sc', {'Gara', 'Onao', 'Todr', 'Krai', 'Tutg', 'Sunu', 'Gukh'}),
172    ('InSC', {'Reordering_Killer'})]
173  = PropertyValueAliases.txt new property values (diff old & new .txt files)
174    (cd $UNIDATA_ROOT && diff -u uni15.1/final/ucd/PropertyValueAliases.txt uni16/alpha/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]')
175    +age; 16.0                             ; V16_0
176    +blk; Egyptian_Hieroglyphs_Ext_A       ; Egyptian_Hieroglyphs_Extended_A
177    +blk; Garay                            ; Garay
178    +blk; Gurung_Khema                     ; Gurung_Khema
179    +blk; Kirat_Rai                        ; Kirat_Rai
180    +blk; Myanmar_Ext_C                    ; Myanmar_Extended_C
181    +blk; Ol_Onal                          ; Ol_Onal
182    +blk; Sunuwar                          ; Sunuwar
183    +blk; Symbols_For_Legacy_Computing_Sup ; Symbols_For_Legacy_Computing_Supplement
184    +blk; Todhri                           ; Todhri
185    +blk; Tulu_Tigalari                    ; Tulu_Tigalari
186    +InSC; Reordering_Killer               ; Reordering_Killer
187    -jg ; Teh_Marbuta_Goal                 ; Hamza_On_Heh_Goal
188    +jg ; Teh_Marbuta_Goal                 ; Teh_Marbuta_Goal                 ; Hamza_On_Heh_Goal
189    +sc ; Gara                             ; Garay
190    +sc ; Gukh                             ; Gurung_Khema
191    +sc ; Krai                             ; Kirat_Rai
192    +sc ; Onao                             ; Ol_Onal
193    +sc ; Sunu                             ; Sunuwar
194    +sc ; Todr                             ; Todhri
195    +sc ; Tutg                             ; Tulu_Tigalari
196  + copy new API constants from the preparseucd.py output into the .h/.java files,
197    add/adjust comments, wrap lines, and set numeric values
198  + (ignore Age: no API constants for that)
199  + Block: uchar.h before UBLOCK_COUNT,
200      UCharacter.UnicodeBlock IDs, UCharacter.UnicodeBlock objects
201  + Script: uscript.h & com.ibm.icu.lang.UScript
202  + for new scripts: fix expectedLong names
203      in cintltst/cucdapi.c/TestUScriptCodeAPI()
204      and in com.ibm.icu.dev.test.lang.TestUScript.java
205  + Indic_Syllabic_Category: uchar.h & UCharacter.IndicSyllabicCategory
206  + after adding new API constants, run preparseucd.py again
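
The diff of old and new PropertyValueAliases.txt files can be narrowed down to one property at a time,
which is handy when working through blocks, scripts, etc. one by one.
A sketch along the lines of the diff command above; the function name is made up:

```sh
# Sketch: list alias lines for one property that appear only in the new file.
# Usage: new_property_values OLD_PropertyValueAliases.txt NEW_PropertyValueAliases.txt sc
new_property_values() {
    # diff exits with status 1 when the files differ, so its status is not checked.
    # The pattern keeps added lines that start with the property short name.
    diff -u "$1" "$2" | grep -E "^\+$3[[:space:]]*;" | sed 's/^+//'
}
```

The third argument is the property short name as it appears in the first column (blk, sc, InSC, ...).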
207
208* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
209    (not strictly necessary for NOT_ENCODED scripts)
210  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
211
212* build ICU
213  to make sure that there are no syntax errors
214
215  $ICU_OUT/icu4c$ echo;echo; date; make -j20 tests &> out.txt ; tail -n 30 out.txt ; date
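
The command line above shows the tail of the log, but the exit status of make is lost once tail and date have run.
If the status matters (for example in a script), a small wrapper like this sketch can preserve it;
run_logged is a made-up helper name, not part of the ICU tools:

```sh
# Sketch: run a command with stdout and stderr captured in a log file,
# show the end of the log, and return the command's own exit status.
# Usage: run_logged out.txt make -j20 tests
run_logged() {
    log=$1
    shift
    "$@" > "$log" 2>&1
    status=$?
    tail -n 30 "$log"
    return $status
}
```

For example: run_logged out.txt make -j20 tests && echo "build ok"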
216
217* Bazel build process
218
219See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
220for an overview and for setup instructions.
221
222Consider running `bazelisk --version` outside of the $ICU_SRC folder
223to find out the latest `bazel` version, and
224copying that version number into the $ICU_SRC/.bazeliskrc config file.
225(Revert if you find incompatibilities, or, better, update our build & config files.)
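
Copying the version number can be scripted. The following is a sketch under two assumptions that should be
checked against the Bazelisk documentation: that `bazelisk --version` prints a line of the form "bazel X.Y.Z",
and that .bazeliskrc takes a USE_BAZEL_VERSION=X.Y.Z line. The function name is made up.

```sh
# Sketch: turn a line such as "bazel 7.4.1" (the assumed shape of the
# `bazelisk --version` output; 7.4.1 is only an example value)
# into a .bazeliskrc setting line. Prints nothing for other input.
bazeliskrc_line() {
    # Read stdin and use the second whitespace-separated field of the first line.
    awk 'NR == 1 && $1 == "bazel" && $2 != "" { print "USE_BAZEL_VERSION=" $2 }'
}
```

For example, outside of $ICU_SRC: bazelisk --version | bazeliskrc_line
Review the printed line before pasting it into $ICU_SRC/.bazeliskrc.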
226
227* generate data files
228
229- remember to define the environment variables
230  (see the start of the section for this Unicode version)
231- cd $ICU_SRC
232- optional cleanup, not strictly necessary:
233    bazelisk clean
234      or even
235    bazelisk clean --expunge
236- build/bootstrap/generate new files:
237    icu4c/source/data/unidata/generate.sh
238
239* run & fix ICU4C tests
240- Note: Some of the collation data and test data will be updated below,
241  so at this time we might get some collation test failures.
242  Ignore these for now.
243- Some properties are hardcoded in the ICU libraries because they apply to
244  few characters or ranges, and are not expected to change often.
245  They are tested at least in C++ intltest (e.g., against ppucd.txt).
246  If these tests fail, then update the implementation and the tests.
247- update CLDR GraphemeBreakTest.txt
248  (see the download section above about this file)
249    cd ~/unitools/mine/Generated
250    cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
251    cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
252    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
253- Robin or Andy helps with RBBI & spoof check test failures
254
255* collation: CLDR collation root, UCA DUCET
256
257- UCA DUCET goes into Mark's Unicode tools,
258  and a tool-tailored version goes into CLDR, see
259    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
260
261- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
262    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
263- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
264    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
265    (note removing the underscore before "Rules")
266    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
267- restore TODO diffs in UCARules.txt
268    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
269- update (ICU4C)/source/test/testdata/CollationTest_*.txt
270  and (ICU4J)/main/collate/src/test/resources/com/ibm/icu/dev/data/CollationTest_*.txt
271  from the CLDR root files (..._CLDR_..._SHORT.txt)
272    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
273    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
274    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/collate/src/test/resources/com/ibm/icu/dev/data
275- if CLDR common/uca/unihan-index.txt changes, then update
276  CLDR common/collation/root.xml <collation type="private-unihan">
277  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
278
279- update CollationFCD.java:
280  copy & paste the initializers of lcccIndex[] etc.
281  from
282    $ICU_SRC/icu4c/source/i18n/collationfcd.cpp
283  to
284    $ICU_SRC/icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationFCD.java
285- generate data files, as above (generate.sh), now to pick up new collation data
286- rebuild ICU4C (make clean, make check, as usual)
287
288* Unihan collators
289    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
290- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
291  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
292- generate ICU zh collation data
293    instructions inspired by
294    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
295    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
296  + setup:
297    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
298        (didn't work without setting JAVA_HOME,
299         nor with the Google default of /usr/local/buildtools/java/jdk
300         [Google security limitations in the XML parser])
301    export TOOLS_ROOT=$ICU_SRC/tools
302    export CLDR_DIR=$CLDR_SRC
303    export CLDR_DATA_DIR=$CLDR_DIR
304        (pointing to the "raw" data, not cldr-staging/.../production, should be ok for the relevant files)
305    cd "$TOOLS_ROOT/cldr/lib"
306    ./install-cldr-jars.sh "$CLDR_DIR"
307  + generate the files we need
308    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
309    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
310  + diff
311    cd $ICU_SRC
312    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
313    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
314  + copy into the source tree
315    cd $ICU_SRC
316    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
317    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
318- rebuild ICU4C
319
320* run & fix ICU4C tests, now with new CLDR collation root data
321- run all tests with the collation test data *_SHORT.txt or the full files
322  (the full ones have comments, useful for debugging)
323- note on intltest: if collate/UCAConformanceTest fails, then
324  utility/MultithreadTest/TestCollators will fail as well;
325  fix the conformance test before looking into the multi-thread test
326
327* update Java data files
328- refresh just the UCD/UCA-related/derived files, just to be safe
329- see (ICU4C)/source/data/icu4j-readme.txt
330- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
331- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
332    NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
333    you need to reconfigure with unicore data; see the "configure" line above.
334  output:
335    ...
336    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
337    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt76b
338    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b
339    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt76l.dat ./out/icu4j/icudt76b.dat -s ./out/build/icudt76l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt76b
340    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b"
341    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt76b/
342    mkdir -p /tmp/icu4j/main/shared/data
343    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
344    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt76b/
345    mkdir -p /tmp/icu4j/main/shared/data
346    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
347    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
348- copy the binary data files into the ICU4J tree
349    cd $ICU_OUT/icu4c/data/out/icu4j
350    cp -v com/ibm/icu/impl/data/icudata/coll/* $ICU_SRC/icu4j/main/collate/src/main/resources/com/ibm/icu/impl/data/icudata/coll
351    cp -v com/ibm/icu/impl/data/icudata/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr
352    cp -v com/ibm/icu/impl/data/icudata/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
353    cp -v com/ibm/icu/impl/data/icudata/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
354    cd com/ibm/icu/impl/data/icudata/
355    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata";}' | sh
356- The procedure above is very conservative:
357  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
358  It avoids dealing with any other discrepancies
359  between the source and generated data files.
360  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
361      $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
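
The `ls *.icu | egrep -v ... | awk ... | sh` pipeline in the copy step above generates one cp command per file.
The same selection can be written as a plain loop, which may be easier to read and to adapt.
A sketch, with a made-up function name; note that it excludes exactly cnvalias.icu,
while the egrep pattern is slightly looser:

```sh
# Sketch: copy every *.icu file except cnvalias.icu from one folder to another.
# Usage: copy_icu_files SRC_DIR DEST_DIR
copy_icu_files() {
    for f in "$1"/*.icu; do
        # If nothing matches, the unexpanded pattern is left over; skip it.
        [ -e "$f" ] || continue
        case ${f##*/} in
            cnvalias.icu) ;;             # excluded, like the egrep -v filter
            *) cp -v "$f" "$2"/ ;;
        esac
    done
}
```

The two arguments are the folder with the generated .icu files and the matching ICU4J resources folder.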
362
363* refresh Java test .txt files
364- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
365    cd $ICU_SRC/icu4c/source/data/unidata
366    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
367    cd ../../test/testdata
368    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
369    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
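
After copying, a quick check that the ICU4J copies match their sources can catch a forgotten file.
A minimal sketch using cmp; verify_copies is a made-up helper name:

```sh
# Sketch: report files that are missing or differ between two folders.
# Usage: verify_copies SRC_DIR DEST_DIR file1 [file2 ...]
verify_copies() {
    src=$1
    dest=$2
    shift 2
    bad=0
    for name in "$@"; do
        # cmp -s is silent and only sets the exit status.
        if ! cmp -s "$src/$name" "$dest/$name" 2>/dev/null; then
            echo "differs or missing: $name"
            bad=1
        fi
    done
    return $bad
}
```

For example: verify_copies $ICU_SRC/icu4c/source/test/testdata $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt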
370
371* run & fix ICU4J tests
372
373*** API additions
374- send notice to icu-design about new born-@stable API (enum constants etc.)
375
376*** CLDR numbering systems
377- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
378  for example:
379    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.1.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
380    -->
381      +10D40..10D49  ; Nd #  [10] GARAY DIGIT ZERO..GARAY DIGIT NINE
382      +116D0..116E3  ; Nd #  [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
383      +11BF0..11BF9  ; Nd #  [10] SUNUWAR DIGIT ZERO..SUNUWAR DIGIT NINE
384      +16130..16139  ; Nd #  [10] GURUNG KHEMA DIGIT ZERO..GURUNG KHEMA DIGIT NINE
385      +16D70..16D79  ; Nd #  [10] KIRAT RAI DIGIT ZERO..KIRAT RAI DIGIT NINE
386      +1CCF0..1CCF9  ; Nd #  [10] OUTLINED DIGIT ZERO..OUTLINED DIGIT NINE
387      +1E5F1..1E5FA  ; Nd #  [10] OL ONAL DIGIT ZERO..OL ONAL DIGIT NINE
388  --> https://github.com/unicode-org/cldr/pull/3658
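
A range of 20 code points, as in the Myanmar line above, holds more than one set of ten digits.
To see the size of each added range at a glance, the diff output can be post-processed.
A sketch, assuming unified diff input and the "START..END ; Nd" line format shown above;
the function name is made up:

```sh
# Sketch: from a unified diff of DerivedGeneralCategory.txt on stdin,
# print each added Nd range together with the number of code points it covers.
added_digit_ranges() {
    grep -E '^\+[0-9A-F]+\.\.[0-9A-F]+[[:space:]]*;[[:space:]]*Nd' |
    while IFS= read -r line; do
        range=${line#+}          # drop the leading '+' of the diff
        range=${range%%[ ;]*}    # keep just "START..END"
        start=${range%%..*}
        end=${range##*..}
        # Shell arithmetic accepts hexadecimal numbers with a 0x prefix.
        echo "$range $(( 0x$end - 0x$start + 1 ))"
    done
}
```

Pipe the output of the diff -u command above into added_digit_ranges (without the grep filters).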
389
390*** merge the Unicode update branch back onto the main branch
391- make sure that changes to Unicode tools are checked in:
392  https://github.com/unicode-org/unicodetools
393
394---------------------------------------------------------------------------- ***
395
396Unicode 15.1 update for ICU 74
397
398https://www.unicode.org/versions/Unicode15.1.0/
399https://www.unicode.org/versions/beta-15.1.0.html
400https://www.unicode.org/Public/draft/
401https://www.unicode.org/reports/uax-proposed-updates.html
402https://www.unicode.org/reports/tr44/tr44-31.html
403
404https://unicode-org.atlassian.net/browse/ICU-22404 Unicode 15.1
405https://unicode-org.atlassian.net/browse/CLDR-16669 BRS Unicode 15.1
406
407https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1
408
409* Command-line environment setup
410
411Markus:
412
413export UNIDATA_ROOT=~/unidata
414export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/final
415export CLDR_SRC=~/cldr/uni/src
416export ICU_ROOT=~/icu/uni
417export ICU_SRC=$ICU_ROOT/src
418export ICU_OUT=$ICU_ROOT/dbg
419export ICUDT=icudt74b
420export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
421export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
422export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
423export UNICODE_TOOLS=~/unitools/mine/src
424
425Elango:
426
427export UNIDATA_ROOT=~/oss/unidata
428export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/snapshot
429export CLDR_SRC=~/oss/cldr/mine/src
430export ICU_ROOT=~/oss/icu
431export ICU_SRC=$ICU_ROOT
432export ICU_OUT=$ICU_ROOT
433export ICUDT=icudt74b
434export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
435export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
436export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
437export UNICODE_TOOLS=~/oss/unicodetools/mine/src
438
439*** Unicode version numbers
440- makedata.mak
441- uchar.h
442- com.ibm.icu.util.VersionInfo
443- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
444
445*** Configure: Build Unicode data for ICU4J
446- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
447    so that the makefiles see the new version number.
448  cd $ICU_OUT/icu4c
449  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
450
451*** data files & enums & parser code
452
453* download files
454- same as for the early Unicode Tools setup and data refresh:
455  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
456  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
457- mkdir -p $UNICODE_DATA
458- download Unicode files into $UNICODE_DATA
459  + new since Unicode 15.1:
460    for the pre-release (alpha, beta) data files,
461    download all of https://www.unicode.org/Public/draft/
462    (you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders)
463  + if one of us produces the alpha.zip or beta.zip collection of data files for publication,
464    then we can use its contents directly (no FTP from unicode.org necessary)
465  + for final-release data files, the source of truth is the files in
466    https://www.unicode.org/Public/(version) [=UCD],
467    https://www.unicode.org/Public/UCA/(version),
468    https://www.unicode.org/Public/idna/(version),
469    etc.
470  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
471  + subfolders: emoji, idna, security, ucd, uca
472  + whichever way you download the files:
473    ~ inside ucd: extract Unihan.zip to "here" (.../UCD/ucd/Unihan/*.txt), delete Unihan.zip
474    ~ split Unihan into single-property files
475      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/UCD/ucd/Unihan
476    ~ FYI: for updating ICU, we do not actually need Unihan.zip contents
477  + alternate way of fetching files, if available:
478    copy the files from a Unicode Tools workspace that is up to date with
479    https://github.com/unicode-org/unicodetools
480    and which might at this point be *ahead* of "Public"
481    ~ before the Unicode release copy files from "dev" subfolders, for example
482      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
483- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
484    or from the UCD/cldr/ output folder of the Unicode Tools:
485    From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
486    CLDR used modified grapheme break rules.
487    This might happen again.
488  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
489    or
490  cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
491  cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
492  cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
493  + Done: figure out whether we need a CLDR version of LineBreakTest.txt:
494    unicodetools issue #492
495    We should have had one, and instead rbbitst.cpp has a "known issue" exception.
496    Unicode 16 and CLDR 46 might get back to having the same behavior.
497- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
498  + done in ICU 76: modify preparseucd.py to copy this file
499
500* Note: Since Unicode 15.1, data files are no longer published with version suffixes
501  even during the alpha or beta.
502  Thus we no longer need steps & tools to remove those suffixes.
503  (remove this note next time)
504
505* process and/or copy files
506- cd $ICU_SRC/tools/unicode
507  py/preparseucd.py $UNICODE_DATA $ICU_SRC
508  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
509  + For debugging, and tweaking how ppucd.txt is written,
510    the tool has an --only_ppucd option:
511    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
512
513* new constants for new property values
514- preparseucd.py error:
515    ValueError: missing uchar.h enum constants for some property values: [('blk', {'CJK_Ext_I'}), ('lb', {'VF', 'VI', 'AS', 'AK', 'AP'})]
516  = PropertyValueAliases.txt new property values (diff old & new .txt files)
517    cd $UNIDATA_ROOT
518    $ diff -u uni15.0/ucd/PropertyValueAliases.txt uni15.1/snapshot/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
519    +age; 15.1                             ; V15_1
520    +blk; CJK_Ext_I                        ; CJK_Unified_Ideographs_Extension_I
521    +IDSU; N                               ; No                               ; F                                ; False
522    +IDSU; Y                               ; Yes                              ; T                                ; True
523    +ID_Compat_Math_Continue; N            ; No                               ; F                                ; False
524    +ID_Compat_Math_Continue; Y            ; Yes                              ; T                                ; True
525    +ID_Compat_Math_Start; N               ; No                               ; F                                ; False
526    +ID_Compat_Math_Start; Y               ; Yes                              ; T                                ; True
527    +lb ; AK                               ; Aksara
528    +lb ; AP                               ; Aksara_Prebase
529    +lb ; AS                               ; Aksara_Start
530    +lb ; VF                               ; Virama_Final
531    +lb ; VI                               ; Virama
532  -> add new blocks to uchar.h before UBLOCK_COUNT
533    use long property names for enum constants,
534    for the trailing comment get the block start code point: diff old & new Blocks.txt
535    cd $UNIDATA_ROOT
536    $ diff -u uni15.0/ucd/Blocks.txt uni15.1/snapshot/UCD/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
537    +2EBF0..2EE5F; CJK Unified Ideographs Extension I
538    (ignore blocks whose end code point changed)
539  -> add new blocks to UCharacter.UnicodeBlock IDs
540    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
541            replace  public static final int \1_ID = \2; \3
542  -> add new blocks to UCharacter.UnicodeBlock objects
543    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
544            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
545  -> add new line break values to uchar.h & UCharacter.LineBreak
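
Outside of Eclipse, the two find/replace patterns above can be applied with sed.
This is a sketch, not a tested part of the process: it assumes a sed that supports -E,
and input lines of the form "UBLOCK_NAME = 123, /*[ABCD]*/". The function names are made up.

```sh
# Sketch: sed equivalents of the two Eclipse find/replace patterns.
# Both read uchar.h enum lines on stdin and write Java lines to stdout.
ublock_to_java_ids() {
    sed -E 's#UBLOCK_([^ ]+) = ([0-9]+), (/.+)#public static final int \1_ID = \2; \3#'
}

ublock_to_java_objects() {
    sed -E 's#UBLOCK_([^ ]+) = [0-9]+, (/.+)#public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2#'
}
```

Paste the new uchar.h lines into each function's stdin, then copy the output into UCharacter.java.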
546
547* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
548    (not strictly necessary for NOT_ENCODED scripts)
549  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
550
551* build ICU
552  to make sure that there are no syntax errors
553
554  $ICU_OUT/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
555
556* update spoof checker UnicodeSet initializers:
557    inclusionPat & recommendedPat in i18n/uspoof.cpp
558    INCLUSION & RECOMMENDED in SpoofChecker.java
559- make sure that the Unicode Tools tree contains the latest security data files
560- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
561- run the tool (no special environment variables needed)
562  cd $UNICODE_TOOLS
563  mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.tools.RecommendedSetGenerator" \
564      -Dexec.args="" -am -pl unicodetools  -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
565- copy & paste from the Console output into the .cpp & .java files
566
567* check hardcoded IDS_Unary_Operator
568- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
569- check that it has not changed:
570    (cd $UNICODE_DATA && grep -r --include=PropList.txt IDS_Unary_Operator)
571- if it has changed, then update the implementation and the tests
572- Since ICU 75, this property is tested in C++ intltest against ppucd.txt.
573
574* check hardcoded ID_Compat_Math_Start & ID_Compat_Math_Continue
575- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
576- check that they have not changed:
577    (cd $UNICODE_DATA && grep -r --include=PropList.txt ID_Compat_Math)
578- if they have changed, then update the implementation and the tests
579- Since ICU 75, these properties are tested in C++ intltest against ppucd.txt.
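The grep commands above only show the relevant PropList.txt lines; deciding whether anything changed is still a manual comparison. As a rough sketch, assuming the usual `range ; property # comment` syntax of PropList.txt, the ranges could be extracted and compared against the hardcoded values like this. The excerpt below is hand-written for the sketch, and `property_ranges` is a hypothetical helper, not part of the ICU tools.

```python
def property_ranges(prop_list_text, property_name):
    """Return sorted (start, end) code point ranges listed for property_name."""
    ranges = []
    for line in prop_list_text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != property_name:
            continue
        start, _, end = fields[0].partition("..")
        # A single code point has no "..", so it is its own range end.
        ranges.append((int(start, 16), int(end or start, 16)))
    return sorted(ranges)


# Hand-written excerpt in PropList.txt syntax, for illustration only.
SAMPLE = """\
2FFE..2FFF    ; IDS_Unary_Operator # So   [2] ...
2202          ; ID_Compat_Math_Start # Sm       ...
"""

print(property_ranges(SAMPLE, "IDS_Unary_Operator"))
```

Comparing the returned list with the ranges hardcoded in the implementation would flag a change; the real input would be the PropList.txt under $UNICODE_DATA.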
580
581* Bazel build process
582
583See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
584for an overview and for setup instructions.
585
586Consider running `bazelisk --version` outside of the $ICU_SRC folder
587to find out the latest `bazel` version, and
588copying that version number into the $ICU_SRC/.bazeliskrc config file.
589(Revert if you find incompatibilities, or, better, update our build & config files.)
590
591* generate data files
592
593- remember to define the environment variables
594  (see the start of the section for this Unicode version)
595- cd $ICU_SRC
596- optional, usually not necessary:
597    bazelisk clean
598      or even
599    bazelisk clean --expunge
600- build/bootstrap/generate new files:
601    icu4c/source/data/unidata/generate.sh
602
603* Since Unicode 15.1, the UTS #46 data derivation no longer looks at the decompositions (NFD).
604  U+2260, U+226E, U+226F are now just valid, no longer disallowed_STD3_valid.
605  Remove the special handling of these characters (isNonASCIIDisallowedSTD3Valid())
606  from uts46.cpp & UTS46.java,
607  and special test code from uts46test.cpp & UTS46Test.java.
608  (remove this section next time)
609
610* run & fix ICU4C tests
611- Note: Some of the collation data and test data will be updated below,
612  so at this time we might get some collation test failures.
613  Ignore these for now.
614- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
615- update CLDR GraphemeBreakTest.txt
616    cd ~/unitools/mine/Generated
617    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
618    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
619    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
620- Robin or Andy can help with RBBI & spoof check test failures
621
622* collation: CLDR collation root, UCA DUCET
623
624- UCA DUCET goes into Mark's Unicode tools,
625  and a tool-tailored version goes into CLDR, see
626    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
627
628- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
629    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
630- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
631    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
632    (note removing the underscore before "Rules")
633    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
634- restore TODO diffs in UCARules.txt
635    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
636- update (ICU4C)/source/test/testdata/CollationTest_*.txt
637  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
638  from the CLDR root files (..._CLDR_..._SHORT.txt)
639    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
640    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
641    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
642- if CLDR common/uca/unihan-index.txt changes, then update
643  CLDR common/collation/root.xml <collation type="private-unihan">
644  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
645
646- generate data files, as above (generate.sh), now to pick up new collation data
647- update CollationFCD.java:
648  copy & paste the initializers of lcccIndex[] etc. from
649    ICU4C/source/i18n/collationfcd.cpp to
650    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
651- rebuild ICU4C (make clean, make check, as usual)
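The cp commands above rename the CLDR files while copying them (dropping "_SHORT", "_CLDR", or the underscore in "UCA_Rules"). A minimal sketch of the same mapping in Python follows, assuming flat source and destination folders; `copy_renamed` is a hypothetical helper. It does not keep the /tmp/UCARules-old.txt backup that is needed for restoring the TODO diffs.

```python
import shutil
import tempfile
from pathlib import Path

# CLDR common/uca file name -> ICU file name, as in the cp commands in this log.
UNIDATA_NAMES = {
    "FractionalUCA_SHORT.txt": "FractionalUCA.txt",
    "UCA_Rules_SHORT.txt": "UCARules.txt",
}
TESTDATA_NAMES = {
    "CollationTest_CLDR_NON_IGNORABLE_SHORT.txt": "CollationTest_NON_IGNORABLE_SHORT.txt",
    "CollationTest_CLDR_SHIFTED_SHORT.txt": "CollationTest_SHIFTED_SHORT.txt",
}


def copy_renamed(src_dir, dest_dir, names):
    """Copy each file listed in names from src_dir to dest_dir under its new name."""
    copied = []
    for src_name, dest_name in names.items():
        shutil.copyfile(Path(src_dir) / src_name, Path(dest_dir) / dest_name)
        copied.append(dest_name)
    return copied


if __name__ == "__main__":
    # Demonstration with throwaway folders instead of $CLDR_SRC and $ICU4C_UNIDATA.
    with tempfile.TemporaryDirectory() as tmp:
        src, dest = Path(tmp, "uca"), Path(tmp, "unidata")
        src.mkdir()
        dest.mkdir()
        for name in UNIDATA_NAMES:
            (src / name).write_text("placeholder\n")
        print(copy_renamed(src, dest, UNIDATA_NAMES))
```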
652
653* Unihan collators
654    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
655- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
656  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
657- generate ICU zh collation data
658    instructions inspired by
659    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
660    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
661  + setup:
662    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
663        (didn't work without setting JAVA_HOME,
664         nor with the Google default of /usr/local/buildtools/java/jdk
665         [Google security limitations in the XML parser])
666    export TOOLS_ROOT=$ICU_SRC/tools
667    export CLDR_DIR=$CLDR_SRC
668    export CLDR_DATA_DIR=$CLDR_DIR
669        (pointing to the "raw" data rather than to cldr-staging/.../production should be ok for the relevant files)
670    cd "$TOOLS_ROOT/cldr/lib"
671    ./install-cldr-jars.sh "$CLDR_DIR"
672  + generate the files we need
673    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
674    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
675  + diff
676    cd $ICU_SRC
677    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
678    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
679  + copy into the source tree
680    cd $ICU_SRC
681    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
682    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
683- rebuild ICU4C
684
685* run & fix ICU4C tests, now with new CLDR collation root data
686- run all tests with the collation test data *_SHORT.txt or the full files
687  (the full ones have comments, useful for debugging)
688- note on intltest: if collate/UCAConformanceTest fails, then
689  utility/MultithreadTest/TestCollators will fail as well;
690  fix the conformance test before looking into the multi-thread test
691
692* update Java data files
693- refresh just the UCD/UCA-related/derived files, just to be safe
694- see (ICU4C)/source/data/icu4j-readme.txt
695- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
696- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
697    NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
698    you need to reconfigure with unicore data; see the "configure" line above.
699  output:
700    ...
701    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
702    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt74b
703    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b
704    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt74l.dat ./out/icu4j/icudt74b.dat -s ./out/build/icudt74l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt74b
705    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b"
706    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt74b/
707    mkdir -p /tmp/icu4j/main/shared/data
708    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
709    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt74b/
710    mkdir -p /tmp/icu4j/main/shared/data
711    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
712    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
713- copy the binary data files into the ICU4J tree
714    cd $ICU_OUT/icu4c/data/out/icu4j
715    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
716    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/brkitr
717    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
718    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
719    cd com/ibm/icu/impl/data/$ICUDT/
720    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT";}' | sh
721- The procedure above is very conservative:
722  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
723  It avoids dealing with any other discrepancies
724  between the source and generated data files.
725  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
726      $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
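The `ls *.icu | egrep -v ... | awk ... | sh` pipeline above copies every .icu file except cnvalias.icu. The same selection can be sketched in Python, under the assumption that all files sit directly in one folder; the function names are hypothetical. Unlike the egrep pattern, which is an unanchored regular expression, this sketch excludes by exact file name.

```python
import shutil
import tempfile
from pathlib import Path


def icu_files_to_copy(data_dir, excluded=("cnvalias.icu",)):
    """List the *.icu file names in data_dir, minus the excluded names."""
    return sorted(
        path.name for path in Path(data_dir).glob("*.icu") if path.name not in excluded
    )


def copy_icu_files(data_dir, dest_dir):
    """Copy the selected .icu files into dest_dir and return their names."""
    names = icu_files_to_copy(data_dir)
    for name in names:
        shutil.copy2(Path(data_dir) / name, Path(dest_dir) / name)
    return names


if __name__ == "__main__":
    # Demonstration with empty placeholder files in throwaway folders.
    with tempfile.TemporaryDirectory() as tmp:
        src, dest = Path(tmp, "icudt"), Path(tmp, "icu4j")
        src.mkdir()
        dest.mkdir()
        for name in ("uprops.icu", "ucase.icu", "cnvalias.icu"):
            (src / name).write_bytes(b"")
        print(copy_icu_files(src, dest))
```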
727
728* refresh Java test .txt files
729- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
730    cd $ICU_SRC/icu4c/source/data/unidata
731    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
732    cd ../../test/testdata
733    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
734    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
735
736* run & fix ICU4J tests
737
738*** API additions
739- send notice to icu-design about new born-@stable API (enum constants etc.)
740
741*** CLDR numbering systems
742- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
743  for example:
744    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
745    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.1.txt
746    ~/icu/uni/src$ diff -u /tmp/icu/nv4-15.txt /tmp/icu/nv4-15.1.txt
747    -->
748    (empty this time)
749  or:
750    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
751    -->
752    (empty this time)
753  Unicode 15.1:
754    (none this time)
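The egrep-and-diff steps above boil down to "which digit-four lines exist in the new ppucd.txt but not in the old one". A minimal sketch in Python follows, reusing the same regular expression. The sample lines are simplified, made-up stand-ins for ppucd.txt lines, and the function names are hypothetical.

```python
import re

# Same pattern as in the egrep commands in this log.
DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4")


def digit_four_lines(ppucd_text):
    """Return the lines that describe a decimal digit with numeric value 4."""
    return [line for line in ppucd_text.splitlines() if DIGIT_FOUR.search(line)]


def new_digit_four_lines(old_text, new_text):
    """Return digit-four lines present in new_text but not in old_text."""
    old = set(digit_four_lines(old_text))
    return [line for line in digit_four_lines(new_text) if line not in old]


# Simplified, hypothetical lines; real ppucd.txt lines carry more fields.
OLD = "cp;0034;gc=Nd;na=DIGIT FOUR;nv=4\n"
NEW = OLD + (
    "cp;F0004;gc=Nd;na=EXAMPLE DIGIT FOUR;nv=4\n"
    "cp;F0005;gc=Nd;na=EXAMPLE DIGIT FIVE;nv=5\n"
)

print(new_digit_four_lines(OLD, NEW))
```

Each reported line would stand for one candidate set of decimal digits to add to the CLDR numbering systems; an empty list corresponds to "(empty this time)".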
755
756*** merge the Unicode update branch back onto the main branch
757- do not merge icudata.jar and testdata.jar;
758  instead, rebuild them from the merged & tested ICU4C
759- if there is a merge conflict in icudata.jar, here is one way to deal with it:
760  +   remove icudata.jar from the commit so that rebasing is trivial
761  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
762  + ~/icu/uni/src$ git commit -a --amend
763  +   switch to main, pull updates, switch back to the dev branch
764  + ~/icu/uni/src$ git rebase main
765  +   rebuild icudata.jar
766  + ~/icu/uni/src$ git commit -a --amend
767  + ~/icu/uni/src$ git push -f
768- make sure that changes to Unicode tools are checked in:
769  https://github.com/unicode-org/unicodetools
770
771---------------------------------------------------------------------------- ***
772
773CLDR 43 root collation update for ICU 73
774
775Partial update only for the root collation.
776See
777- https://unicode-org.atlassian.net/browse/CLDR-15946
778  Treat quote marks as equivalent when strength=UCOL_PRIMARY
779- https://github.com/unicode-org/cldr/pull/2691
780  CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
781- https://github.com/unicode-org/cldr/pull/2833
782  CLDR-15946 make fancy quotes secondary-different from each other
783
784The related changes to tailorings were already integrated in an earlier PR for
785https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.
786
787This update is for the root collation,
788which is handled by different tools than the locale data updates.
789
790* Command-line environment setup
791
792export UNICODE_DATA=~/unidata/uni15/20220830
793export CLDR_SRC=~/cldr/uni/src
794export ICU_ROOT=~/icu/uni
795export ICU_SRC=$ICU_ROOT/src
796export ICUDT=icudt73b
797export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
798export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
799export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
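Most command lines in this section silently misbehave when one of these variables is unset. As an optional pre-flight check, one could sketch something like the following in Python. The function names are hypothetical, and the `icudt<major>b` naming rule is inferred from the ICUDT values in this log rather than from any ICU tool.

```python
import os

# Variable names as exported in the setup block in this log.
REQUIRED = [
    "UNICODE_DATA",
    "CLDR_SRC",
    "ICU_ROOT",
    "ICU_SRC",
    "ICUDT",
    "ICU4C_DATA_IN",
    "ICU4C_UNIDATA",
    "LD_LIBRARY_PATH",
]


def missing_variables(env, required=REQUIRED):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def icudt_matches(env, icu_major):
    """Check that ICUDT looks like icudt<major>b (big-endian data for ICU4J)."""
    return env.get("ICUDT") == f"icudt{icu_major}b"


if __name__ == "__main__":
    print("missing:", missing_variables(os.environ))
```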
800
801*** Configure: Build Unicode data for ICU4J
802  cd $ICU_ROOT/dbg/icu4c
803  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
804
805* Bazel build process
806
807See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
808for an overview and for setup instructions.
809
810Consider running `bazelisk --version` outside of the $ICU_SRC folder
811to find out the latest `bazel` version, and
812copying that version number into the $ICU_SRC/.bazeliskrc config file.
813(Revert if you find incompatibilities, or, better, update our build & config files.)
814
815* generate data files
816
817- remember to define the environment variables
818  (see the start of the section for this Unicode version)
819- cd $ICU_SRC
820- optional, usually not necessary:
821    bazelisk clean
822      or even
823    bazelisk clean --expunge
824- build/bootstrap/generate new files:
825    icu4c/source/data/unidata/generate.sh
826
827* collation: CLDR collation root, UCA DUCET
828
829- UCA DUCET goes into Mark's Unicode tools,
830  and a tool-tailored version goes into CLDR, see
831    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
832
833- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
834    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
835- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
836    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
837    (note removing the underscore before "Rules")
838    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
839- restore TODO diffs in UCARules.txt
840    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
841- update (ICU4C)/source/test/testdata/CollationTest_*.txt
842  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
843  from the CLDR root files (..._CLDR_..._SHORT.txt)
844    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
845    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
846    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
847- if CLDR common/uca/unihan-index.txt changes, then update
848  CLDR common/collation/root.xml <collation type="private-unihan">
849  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
850
851- generate data files, as above (generate.sh), now to pick up new collation data
852- rebuild ICU4C (make clean, make check, as usual)
853
854* run & fix ICU4C tests, now with new CLDR collation root data
855- run all tests with the collation test data *_SHORT.txt or the full files
856  (the full ones have comments, useful for debugging)
857- note on intltest: if collate/UCAConformanceTest fails, then
858  utility/MultithreadTest/TestCollators will fail as well;
859  fix the conformance test before looking into the multi-thread test
860
861* update Java data files
862- refresh just the UCD/UCA-related/derived files, just to be safe
863- see (ICU4C)/source/data/icu4j-readme.txt
864- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
865- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
866    NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
867    you need to reconfigure with unicore data; see the "configure" line above.
868  output:
869    ...
870    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
871    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
872    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
873    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
874    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
875    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
876    mkdir -p /tmp/icu4j/main/shared/data
877    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
878    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
879    mkdir -p /tmp/icu4j/main/shared/data
880    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
881    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
882- copy the big-endian Unicode data files to another location,
883  separate from the other data files,
884  and then refresh ICU4J
885    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
886    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
887    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
888    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
889- new for ICU 73: also copy the binary data files directly into the ICU4J tree
890    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
891
892* When refreshing all of ICU4J data from ICU4C
893- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
894- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
895or
896- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
897
898* refresh Java test .txt files
899- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
900    cd $ICU_SRC/icu4c/source/data/unidata
901    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
902    cd ../../test/testdata
903    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
904    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
905
906* run & fix ICU4J tests
907
908*** merge the Unicode update branch back onto the main branch
909- do not merge icudata.jar and testdata.jar;
910  instead, rebuild them from the merged & tested ICU4C
911- if there is a merge conflict in icudata.jar, here is one way to deal with it:
912  +   remove icudata.jar from the commit so that rebasing is trivial
913  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
914  + ~/icu/uni/src$ git commit -a --amend
915  +   switch to main, pull updates, switch back to the dev branch
916  + ~/icu/uni/src$ git rebase main
917  +   rebuild icudata.jar
918  + ~/icu/uni/src$ git commit -a --amend
919  + ~/icu/uni/src$ git push -f
920- make sure that changes to Unicode tools are checked in:
921  https://github.com/unicode-org/unicodetools
922
923---------------------------------------------------------------------------- ***
924
925Unicode 15.0 update for ICU 72
926
927https://www.unicode.org/versions/Unicode15.0.0/
928https://www.unicode.org/versions/beta-15.0.0.html
929https://www.unicode.org/Public/15.0.0/ucd/
930https://www.unicode.org/reports/uax-proposed-updates.html
931https://www.unicode.org/reports/tr44/tr44-29.html
932
933https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
934https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
935https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)
936
937* Command-line environment setup
938
939export UNICODE_DATA=~/unidata/uni15/20220830
940export CLDR_SRC=~/cldr/uni/src
941export ICU_ROOT=~/icu/uni
942export ICU_SRC=$ICU_ROOT/src
943export ICUDT=icudt72b
944export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
945export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
946export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
947
948*** Unicode version numbers
949- makedata.mak
950- uchar.h
951- com.ibm.icu.util.VersionInfo
952- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
953
954- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
955    so that the makefiles see the new version number.
956  cd $ICU_ROOT/dbg/icu4c
957  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
958
959*** data files & enums & parser code
960
961* download files
962- same as for the early Unicode Tools setup and data refresh:
963  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
964  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
965- mkdir -p $UNICODE_DATA
966- download Unicode files into $UNICODE_DATA
967  + subfolders: emoji, idna, security, ucd, uca
968  + old way of fetching files: from the "Public" area on unicode.org
969    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
970    ~ split Unihan into single-property files
971      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
972  + new way of fetching files, if available:
973    copy the files from a Unicode Tools workspace that is up to date with
974    https://github.com/unicode-org/unicodetools
975    and which might at this point be *ahead* of "Public"
976    ~ before the Unicode release copy files from "dev" subfolders, for example
977      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
978  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
979    or from the UCD/cldr/ output folder of the Unicode Tools:
980    Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
981  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
982    or
983  cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
984
985* for manual diffs and for Unicode Tools input data updates:
986  remove version suffixes from the file names
987    ~$ unidata/desuffixucd.py $UNICODE_DATA
988  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
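To illustrate what "remove version suffixes" means, here is a minimal sketch in Python. It is not the logic of desuffixucd.py; it assumes that a suffix looks like `-15.0.0` or `-15.0.0d3` and sits directly before the file extension. The pattern and the function name are assumptions made for this sketch.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>" plus an optional draft
# number such as "d3", directly before the file extension.
VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a version suffix; other names pass through."""
    return VERSION_SUFFIX.sub("", file_name, count=1)


for name in ("Blocks-15.0.0d3.txt", "Unihan-15.0.0.zip", "UnicodeData.txt"):
    print(name, "->", strip_version_suffix(name))
```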
989
990* process and/or copy files
991- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
992  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
993  + For debugging, and tweaking how ppucd.txt is written,
994    the tool has an --only_ppucd option:
995    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
996
997- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
998
999* new constants for new property values
1000- preparseucd.py error:
1001    ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
1002  = PropertyValueAliases.txt new property values (diff old & new .txt files)
1003    ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
1004    +age; 15.0                             ; V15_0
1005    +blk; Arabic_Ext_C                     ; Arabic_Extended_C
1006    +blk; CJK_Ext_H                        ; CJK_Unified_Ideographs_Extension_H
1007    +blk; Cyrillic_Ext_D                   ; Cyrillic_Extended_D
1008    +blk; Devanagari_Ext_A                 ; Devanagari_Extended_A
1009    +blk; Kaktovik_Numerals                ; Kaktovik_Numerals
1010    +blk; Kawi                             ; Kawi
1011    +blk; Nag_Mundari                      ; Nag_Mundari
1012    +sc ; Kawi                             ; Kawi
1013    +sc ; Nagm                             ; Nag_Mundari
1014  -> add new blocks to uchar.h before UBLOCK_COUNT
1015    use long property names for enum constants,
1016    for the trailing comment get the block start code point: diff old & new Blocks.txt
1017    ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
1018    +10EC0..10EFF; Arabic Extended-C
1019    +11B00..11B5F; Devanagari Extended-A
1020    +11F00..11F5F; Kawi
1021    -13430..1343F; Egyptian Hieroglyph Format Controls
1022    +13430..1345F; Egyptian Hieroglyph Format Controls
1023    +1D2C0..1D2DF; Kaktovik Numerals
1024    +1E030..1E08F; Cyrillic Extended-D
1025    +1E4D0..1E4FF; Nag Mundari
1026    +31350..323AF; CJK Unified Ideographs Extension H
1027    (ignore blocks whose end code point changed)
1028  -> add new blocks to UCharacter.UnicodeBlock IDs
1029    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1030            replace  public static final int \1_ID = \2; \3
1031  -> add new blocks to UCharacter.UnicodeBlock objects
1032    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1033            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1034  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
1035    Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
1036            replace  public static final int \1 = \2; \3
1037  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1038      and in com.ibm.icu.dev.test.lang.TestUScript.java
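The Blocks.txt diff above mixes new blocks with blocks whose end code point merely changed. A small sketch in Python can separate the two by comparing block names, assuming the `start..end; name` syntax shown in the diff output. The function names are hypothetical; the sample data repeats a few lines from the diff above.

```python
def parse_blocks(blocks_text):
    """Map block name -> (start, end) from Blocks.txt-style lines."""
    blocks = {}
    for line in blocks_text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code_range, name = (part.strip() for part in data.split(";", 1))
        start, end = code_range.split("..")
        blocks[name] = (int(start, 16), int(end, 16))
    return blocks


def new_blocks(old_text, new_text):
    """Return (start code point, name) for blocks whose name is new."""
    old = parse_blocks(old_text)
    new = parse_blocks(new_text)
    added = sorted((rng[0], name) for name, rng in new.items() if name not in old)
    return [(f"{start:04X}", name) for start, name in added]


OLD = "13430..1343F; Egyptian Hieroglyph Format Controls\n"
NEW = """\
10EC0..10EFF; Arabic Extended-C
11F00..11F5F; Kawi
13430..1345F; Egyptian Hieroglyph Format Controls
"""

print(new_blocks(OLD, NEW))
```

The start code points are what goes into the trailing comments of the new uchar.h enum constants; the resized Egyptian Hieroglyph Format Controls block is ignored because its name already exists.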
1039
1040* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1041    (not strictly necessary for NOT_ENCODED scripts)
1042  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1043
1044* build ICU
1045  to make sure that there are no syntax errors
1046
1047  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
1048
1049* update spoof checker UnicodeSet initializers:
1050    inclusionPat & recommendedPat in i18n/uspoof.cpp
1051    INCLUSION & RECOMMENDED in SpoofChecker.java
1052- make sure that the Unicode Tools tree contains the latest security data files
1053- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1054- run the tool (no special environment variables needed)
1055- copy & paste from the Console output into the .cpp & .java files
1056
1057* Bazel build process
1058
1059See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
1060for an overview and for setup instructions.
1061
1062Consider running `bazelisk --version` outside of the $ICU_SRC folder
1063to find out the latest `bazel` version, and
1064copying that version number into the $ICU_SRC/.bazeliskrc config file.
1065(Revert if you find incompatibilities, or, better, update our build & config files.)
1066
1067* generate data files
1068
1069- remember to define the environment variables
1070  (see the start of the section for this Unicode version)
1071- cd $ICU_SRC
1072- optional, usually not necessary:
1073    bazelisk clean
1074- build/bootstrap/generate new files:
1075    icu4c/source/data/unidata/generate.sh
1076
1077* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1078  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1079- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1080    ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
1081- Unicode 6.0..15.0: U+2260, U+226E, U+226F
1082- nothing new in this Unicode version, no test file to update
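The grep above lists all disallowed_STD3_valid lines, including the ASCII ones, which then have to be skipped by eye. A minimal sketch in Python that keeps only the non-ASCII entries follows, assuming the `range ; status # comment` syntax of IdnaMappingTable.txt. The excerpt is hand-written for the sketch, and `non_ascii_std3_valid` is a hypothetical helper.

```python
def non_ascii_std3_valid(idna_table_text):
    """Return (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    found = []
    for line in idna_table_text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first, last = int(start, 16), int(end or start, 16)
        if last > 0x7F:  # keep only ranges that reach beyond ASCII
            found.append((first, last))
    return found


# Hand-written excerpt in IdnaMappingTable.txt syntax, for illustration only.
SAMPLE = """\
0000..002C    ; disallowed_STD3_valid   # ASCII range
00E0          ; valid                   # not of interest here
2260          ; disallowed_STD3_valid   # NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN
"""

print([(f"U+{a:04X}", f"U+{b:04X}") for a, b in non_ascii_std3_valid(SAMPLE)])
```

Anything beyond U+2260, U+226E, U+226F in the result would call for new cases in uts46test.cpp and UTS46Test.java.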
1083
1084* run & fix ICU4C tests
1085- Note: Some of the collation data and test data will be updated below,
1086  so at this time we might get some collation test failures.
1087  Ignore these for now.
1088- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
1089  (no rule changes in Unicode 15)
1090- update CLDR GraphemeBreakTest.txt
1091    cd ~/unitools/mine/Generated
1092    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
1093    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
1094    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
1095- Andy helps with RBBI & spoof check test failures
1096
1097* collation: CLDR collation root, UCA DUCET
1098
1099- UCA DUCET goes into Mark's Unicode tools,
1100  and a tool-tailored version goes into CLDR, see
1101    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
1102
1103- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1104    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1105- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1106    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1107    (note removing the underscore before "Rules")
1108    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1109- restore TODO diffs in UCARules.txt
1110    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1111- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1112  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1113  from the CLDR root files (..._CLDR_..._SHORT.txt)
1114    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1115    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1116    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1117- if CLDR common/uca/unihan-index.txt changes, then update
1118  CLDR common/collation/root.xml <collation type="private-unihan">
1119  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1120
1121- generate data files, as above (generate.sh), now to pick up new collation data
1122- update CollationFCD.java:
1123  copy & paste the initializers of lcccIndex[] etc. from
1124    ICU4C/source/i18n/collationfcd.cpp to
1125    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1126- rebuild ICU4C (make clean, make check, as usual)
1127
1128* Unihan collators
1129    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
1130- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
1131  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
1132- generate ICU zh collation data
1133    instructions inspired by
1134    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
1135    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
1136  + setup:
1137    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
1138        (didn't work without setting JAVA_HOME,
1139         nor with the Google default of /usr/local/buildtools/java/jdk
1140         [Google security limitations in the XML parser])
1141    export TOOLS_ROOT=~/icu/uni/src/tools
1142    export CLDR_DIR=~/cldr/uni/src
1143    export CLDR_DATA_DIR=~/cldr/uni/src
1144        (pointing to the "raw" data rather than cldr-staging/.../production should be ok for the relevant files)
1145    cd "$TOOLS_ROOT/cldr/lib"
1146    ./install-cldr-jars.sh "$CLDR_DIR"
1147  + generate the files we need
1148    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
1149    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
1150  + diff
1151    cd $ICU_SRC
1152    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
1153    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
1154  + copy into the source tree
1155    cd $ICU_SRC
1156    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
1157    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
1158- rebuild ICU4C
1159
1160* run & fix ICU4C tests, now with new CLDR collation root data
1161- run all tests with the collation test data *_SHORT.txt or the full files
1162  (the full ones have comments, useful for debugging)
1163- note on intltest: if collate/UCAConformanceTest fails, then
1164  utility/MultithreadTest/TestCollators will fail as well;
1165  fix the conformance test before looking into the multi-thread test
1166
1167* update Java data files
1168- refresh just the UCD/UCA-related/derived files, just to be safe
1169- see (ICU4C)/source/data/icu4j-readme.txt
1170- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1171- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1172    NOTE: If you get an error like "No rule to make target 'out/build/icudt72l/uprops.icu'",
1173    you need to reconfigure with --include_uni_core_data; see the "configure" line above.
1174  output:
1175    ...
1176    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1177    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
1178    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
1179    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
1180    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
1181    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
1182    mkdir -p /tmp/icu4j/main/shared/data
1183    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1184    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
1185    mkdir -p /tmp/icu4j/main/shared/data
1186    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1187    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1188- copy the big-endian Unicode data files to another location,
1189  separate from the other data files,
1190  and then refresh ICU4J
1191    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1192    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1193    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1194    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1195    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1196    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1197    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1198    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1199    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1200    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1201
1202* When refreshing all of ICU4J data from ICU4C
1203- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1204- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1205or
1206- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1207
1208* refresh Java test .txt files
1209- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1210    cd $ICU_SRC/icu4c/source/data/unidata
1211    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1212    cd ../../test/testdata
1213    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1214    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1215
1216* run & fix ICU4J tests
1217
1218*** API additions
1219- send notice to icu-design about new born-@stable API (enum constants etc.)
1220
1221*** CLDR numbering systems
1222- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1223  for example:
1224    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
1225    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
1226    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
1227    -->
1228    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
1229    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
1230  or:
1231    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
1232    -->
1233    +11F50..11F59  ; Nd #  [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
1234    +1E4F0..1E4F9  ; Nd #  [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
1235  Unicode 15:
1236    kawi 11F50..11F59 Kawi
1237    nagm 1E4F0..1E4F9 Nag Mundari
1238    https://github.com/unicode-org/cldr/pull/2041
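
The step from an nv=4 hit to the full digit range can be sketched as follows.
This is a minimal illustration, not part of the ICU or CLDR tooling: the function
name digit_range is hypothetical, and it assumes that the ten digits of a gc=Nd
set are contiguous and in ascending order, so that the zero digit sits four code
points below the digit four and the nine digit five above it.

```shell
# Hypothetical helper (not part of the ICU tools): given the code point of
# a "DIGIT FOUR" (hex, no U+ prefix), print the assumed zero..nine range.
digit_range() {
  four=$((0x$1))                 # parse the hex code point
  printf '%X..%X\n' "$((four - 4))" "$((four + 5))"
}

digit_range 11F54   # prints 11F50..11F59
```

The result should still be compared with DerivedGeneralCategory.txt, as in the
second command shown above, before adding the numbering system to CLDR.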
1239
1240*** merge the Unicode update branches back onto the trunk
1241- do not merge the icudata.jar and testdata.jar,
1242  instead rebuild them from merged & tested ICU4C
1243- if there is a merge conflict in icudata.jar, here is one way to deal with it:
1244  +   remove icudata.jar from the commit so that rebasing is trivial
1245  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
1246  + ~/icu/uni/src$ git commit -a --amend
1247  +   switch to main, pull updates, switch back to the dev branch
1248  + ~/icu/uni/src$ git rebase main
1249  +   rebuild icudata.jar
1250  + ~/icu/uni/src$ git commit -a --amend
1251  + ~/icu/uni/src$ git push -f
1252- make sure that changes to Unicode tools are checked in:
1253  https://github.com/unicode-org/unicodetools
1254
1255---------------------------------------------------------------------------- ***
1256
1257Unicode 14.0 update for ICU 70
1258
1259https://www.unicode.org/versions/Unicode14.0.0/
1260https://www.unicode.org/versions/beta-14.0.0.html
1261https://www.unicode.org/Public/14.0.0/ucd/
1262https://www.unicode.org/reports/uax-proposed-updates.html
1263https://www.unicode.org/reports/tr44/tr44-27.html
1264
1265https://unicode-org.atlassian.net/browse/CLDR-14801
1266https://unicode-org.atlassian.net/browse/ICU-21635
1267
1268* Command-line environment setup
1269
1270export UNICODE_DATA=~/unidata/uni14/20210903
1271export CLDR_SRC=~/cldr/uni/src
1272export ICU_ROOT=~/icu/uni
1273export ICU_SRC=$ICU_ROOT/src
1274export ICUDT=icudt70b
1275export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1276export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1277export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
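
Before pasting tool command lines, it can help to confirm that the variables
above are set in the current console. The following is a sketch only: the
function name check_unicode_update_env is hypothetical, and LD_LIBRARY_PATH is
deliberately left out because it may be set for unrelated reasons.

```shell
# Hypothetical helper: report which of the update variables are unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_unicode_update_env() {
  missing=0
  for name in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA; do
    eval "value=\${$name:-}"     # indirect lookup; names come from the fixed list above
    if [ -z "$value" ]; then
      echo "not set: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: run check_unicode_update_env after the export lines; it prints nothing
when every variable has a value.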
1278
1279*** Unicode version numbers
1280- makedata.mak
1281- uchar.h
1282- com.ibm.icu.util.VersionInfo
1283- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1284
1285- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1286    so that the makefiles see the new version number.
1287  cd $ICU_ROOT/dbg/icu4c
1288  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
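
A quick way to confirm that uchar.h already carries the new version before
running configure is sketched below. It assumes that the header defines the
version as a quoted string in a line of the form
#define U_UNICODE_VERSION "14.0"; the function name is hypothetical.

```shell
# Hypothetical helper: print the Unicode version string found in a header file.
# Assumes a line of the form:  #define U_UNICODE_VERSION "14.0"
unicode_version_in_header() {
  sed -n 's/^#define U_UNICODE_VERSION "\(.*\)"/\1/p' "$1"
}
```

Usage, assuming the header is at its usual place in the source tree:
unicode_version_in_header $ICU_SRC/icu4c/source/common/unicode/uchar.h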
1289
1290*** data files & enums & parser code
1291
1292* download files
1293- same as for the early Unicode Tools setup and data refresh:
1294  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
1295  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
1296- mkdir -p $UNICODE_DATA
1297- download Unicode files into $UNICODE_DATA
1298  + subfolders: emoji, idna, security, ucd, uca
1299  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1300  + split Unihan into single-property files
1301    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
1302  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
1303    or from the UCD/cldr/ output folder of the Unicode Tools:
1304    Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
1305  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
1306    or
1307  cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
1308
1309* for manual diffs and for Unicode Tools input data updates:
1310  remove version suffixes from the file names
1311    ~$ unidata/desuffixucd.py $UNICODE_DATA
1312  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)
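
To illustrate what "removing version suffixes" means for a single file name,
here is a small sketch. It is not the logic of desuffixucd.py, which may differ;
it only assumes suffixes of the form -14.0.0 or -14.0.0d19 directly before the
file extension, as in GraphemeBreakTest-cldr-14.0.0d19.txt above. The function
name is hypothetical.

```shell
# Hypothetical helper: strip a "-X.Y.Z" or "-X.Y.ZdN" version suffix that
# directly precedes the file extension; other names pass through unchanged.
strip_version_suffix() {
  printf '%s\n' "$1" |
    sed -E 's/-[0-9]+\.[0-9]+\.[0-9]+(d[0-9]+)?(\.[A-Za-z]+)$/\2/'
}

strip_version_suffix GraphemeBreakTest-cldr-14.0.0d19.txt   # prints GraphemeBreakTest-cldr.txt
```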
1313
1314* process and/or copy files
1315- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1316  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1317  + For debugging, and tweaking how ppucd.txt is written,
1318    the tool has an --only_ppucd option:
1319    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1320
1321- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1322
1323* new constants for new property values
1324- preparseucd.py error:
1325    ValueError: missing uchar.h enum constants for some property values:
1326    [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
1327    (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
1328    (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
1329  = PropertyValueAliases.txt new property values (diff old & new .txt files)
1330    ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
1331    +age; 14.0                             ; V14_0
1332    +blk; Arabic_Ext_B                     ; Arabic_Extended_B
1333    +blk; Cypro_Minoan                     ; Cypro_Minoan
1334    +blk; Ethiopic_Ext_B                   ; Ethiopic_Extended_B
1335    +blk; Kana_Ext_B                       ; Kana_Extended_B
1336    +blk; Latin_Ext_F                      ; Latin_Extended_F
1337    +blk; Latin_Ext_G                      ; Latin_Extended_G
1338    +blk; Old_Uyghur                       ; Old_Uyghur
1339    +blk; Tangsa                           ; Tangsa
1340    +blk; Toto                             ; Toto
1341    +blk; UCAS_Ext_A                       ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
1342    +blk; Vithkuqi                         ; Vithkuqi
1343    +blk; Znamenny_Music                   ; Znamenny_Musical_Notation
1344    +jg ; Thin_Yeh                         ; Thin_Yeh
1345    +jg ; Vertical_Tail                    ; Vertical_Tail
1346    +sc ; Cpmn                             ; Cypro_Minoan
1347    +sc ; Ougr                             ; Old_Uyghur
1348    +sc ; Tnsa                             ; Tangsa
1349    +sc ; Toto                             ; Toto
1350    +sc ; Vith                             ; Vithkuqi
1351  -> add new blocks to uchar.h before UBLOCK_COUNT
1352    use long property names for enum constants,
1353    for the trailing comment get the block start code point: diff old & new Blocks.txt
1354    ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
1355    +0870..089F; Arabic Extended-B
1356    +10570..105BF; Vithkuqi
1357    +10780..107BF; Latin Extended-F
1358    +10F70..10FAF; Old Uyghur
1359    -11700..1173F; Ahom
1360    +11700..1174F; Ahom
1361    +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
1362    +12F90..12FFF; Cypro-Minoan
1363    +16A70..16ACF; Tangsa
1364    -18D00..18D8F; Tangut Supplement
1365    +18D00..18D7F; Tangut Supplement
1366    +1AFF0..1AFFF; Kana Extended-B
1367    +1CF00..1CFCF; Znamenny Musical Notation
1368    +1DF00..1DFFF; Latin Extended-G
1369    +1E290..1E2BF; Toto
1370    +1E7E0..1E7FF; Ethiopic Extended-B
1371    (ignore blocks whose end code point changed)
1372  -> add new blocks to UCharacter.UnicodeBlock IDs
1373    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1374            replace  public static final int \1_ID = \2; \3
1375  -> add new blocks to UCharacter.UnicodeBlock objects
1376    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1377            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1378  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
1379    Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
1380            replace  public static final int \1 = \2; \3
1381  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1382      and in com.ibm.icu.dev.test.lang.TestUScript.java
1383  -> add new joining groups to uchar.h & UCharacter.JoiningGroup
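
The Eclipse find/replace patterns for the block constants can also be applied on
the command line. The sketch below uses sed -E with the same regular expressions;
the function names are hypothetical, and the sample input line uses a made-up
numeric value (999), not the real constant from uchar.h.

```shell
# Hypothetical helpers: turn uchar.h block enum lines (read from stdin) into
# the Java declarations, using the same patterns as the Eclipse find/replace.
ublock_to_java_id() {
  sed -E 's|UBLOCK_([^ ]+) = ([0-9]+), (/.+)|public static final int \1_ID = \2; \3|'
}
ublock_to_java_object() {
  sed -E 's|UBLOCK_([^ ]+) = [0-9]+, (/.+)|public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2|'
}

# Sample line with a placeholder value:
echo 'UBLOCK_TOTO = 999, /*[1E290]*/' | ublock_to_java_id
```

Leading indentation is left as-is because the patterns are not anchored, so the
output can be pasted and re-indented in the Java files.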
1384
1385* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1386    (not strictly necessary for NOT_ENCODED scripts)
1387  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1388
1389* build ICU
1390  to make sure that there are no syntax errors
1391
1392  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date
1393
1394* update spoof checker UnicodeSet initializers:
1395    inclusionPat & recommendedPat in i18n/uspoof.cpp
1396    INCLUSION & RECOMMENDED in SpoofChecker.java
1397- make sure that the Unicode Tools tree contains the latest security data files
1398- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1399- run the tool (no special environment variables needed)
1400- copy & paste from the Console output into the .cpp & .java files
1401
1402* Bazel build process
1403
1404See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
1405for an overview and for setup instructions.
1406
1407Consider running `bazelisk --version` outside of the $ICU_SRC folder
1408to find out the latest `bazel` version, and
1409copying that version number into the $ICU_SRC/.bazeliskrc config file.
1410(Revert if you find incompatibilities, or, better, update our build & config files.)
1411
1412* generate data files
1413
1414- remember to define the environment variables
1415  (see the start of the section for this Unicode version)
1416- cd $ICU_SRC
1417- optional (not required):
1418    bazelisk clean
1419- build/bootstrap/generate new files:
1420    icu4c/source/data/unidata/generate.sh
1421
1422* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1423  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1424- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1425- Unicode 6.0..14.0: U+2260, U+226E, U+226F
1426- nothing new in this Unicode version, no test file to update
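
The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched as
a small filter. This assumes the usual data line layout, with the code point or
range first, then a semicolon and the status; the function name is hypothetical
and the filter is not part of the ICU tools.

```shell
# Hypothetical helper: print disallowed_STD3_valid data lines whose first
# code point is above U+007F. Comment lines are skipped.
non_ascii_std3_valid() {
  grep 'disallowed_STD3_valid' "$1" | while IFS= read -r line; do
    case $line in
      '#'*) continue ;;
    esac
    range=${line%%;*}                  # field before the first semicolon
    start=${range%%[!0-9A-Fa-f]*}      # leading hex digits = first code point
    [ -n "$start" ] || continue
    if [ "$((0x$start))" -gt 127 ]; then
      printf '%s\n' "$line"
    fi
  done
}
```

Usage: non_ascii_std3_valid $UNICODE_DATA/idna/IdnaMappingTable.txt
(the path assumes the idna subfolder from the download step).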
1427
1428* run & fix ICU4C tests
1429- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
1430- update CLDR GraphemeBreakTest.txt
1431    cd ~/unitools/mine/Generated
1432    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
1433    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
1434    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
1435- Andy helps with RBBI & spoof check test failures
1436
1437* collation: CLDR collation root, UCA DUCET
1438
1439- UCA DUCET goes into Mark's Unicode tools,
1440  and a tool-tailored version goes into CLDR, see
1441    https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md
1442
1443- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1444    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1445- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1446    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1447    (note: the target file name drops the underscore before "Rules")
1448    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1449- restore TODO diffs in UCARules.txt
1450    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1451- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1452  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1453  from the CLDR root files (..._CLDR_..._SHORT.txt)
1454    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1455    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1456    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1457- if CLDR common/uca/unihan-index.txt changes, then update
1458  CLDR common/collation/root.xml <collation type="private-unihan">
1459  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1460
1461- generate data files, as above (generate.sh), now to pick up new collation data
1462- update CollationFCD.java:
1463  copy & paste the initializers of lcccIndex[] etc. from
1464    ICU4C/source/i18n/collationfcd.cpp to
1465    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1466- rebuild ICU4C (make clean, make check, as usual)
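
After copying the CollationTest_*.txt files, one way to confirm that the ICU4C
and ICU4J copies are identical is sketched below. The function name is
hypothetical; it compares byte for byte with cmp and reports files that differ
or are missing on the ICU4J side.

```shell
# Hypothetical helper: check that every CollationTest_*.txt in the ICU4C
# test data folder has an identical copy in the ICU4J folder.
# Returns 0 if all files match, 1 otherwise.
collation_tests_in_sync() {
  c_dir=$1
  j_dir=$2
  status=0
  for f in "$c_dir"/CollationTest_*.txt; do
    [ -e "$f" ] || continue            # no matching files at all
    name=$(basename "$f")
    if ! cmp -s "$f" "$j_dir/$name"; then
      echo "differs or missing: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

Usage: collation_tests_in_sync $ICU_SRC/icu4c/source/test/testdata $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data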
1467
1468* Unihan collators
1469    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
1470- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
1471  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
1472- generate ICU zh collation data
1473    instructions inspired by
1474    https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
1475    https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
1476  + setup:
1477    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
1478        (didn't work without setting JAVA_HOME,
1479         nor with the Google default of /usr/local/buildtools/java/jdk
1480         [Google security limitations in the XML parser])
1481    export TOOLS_ROOT=~/icu/uni/src/tools
1482    export CLDR_DIR=~/cldr/uni/src
1483    export CLDR_DATA_DIR=~/cldr/uni/src
1484        (pointing to the "raw" data rather than cldr-staging/.../production should be ok for the relevant files)
1485    cd "$TOOLS_ROOT/cldr/lib"
1486    ./install-cldr-jars.sh "$CLDR_DIR"
1487  + generate the files we need
1488    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
1489    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
1490  + diff
1491    cd $ICU_SRC
1492    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
1493    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
1494  + copy into the source tree
1495    cd $ICU_SRC
1496    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
1497    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
1498- rebuild ICU4C
1499
1500* run & fix ICU4C tests, now with new CLDR collation root data
1501- run all tests with the collation test data *_SHORT.txt or the full files
1502  (the full ones have comments, useful for debugging)
1503- note on intltest: if collate/UCAConformanceTest fails, then
1504  utility/MultithreadTest/TestCollators will fail as well;
1505  fix the conformance test before looking into the multi-thread test
1506
1507* update Java data files
1508- refresh just the UCD/UCA-related/derived files, just to be safe
1509- see (ICU4C)/source/data/icu4j-readme.txt
1510- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1511- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1512    NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
1513    you need to reconfigure with --include_uni_core_data; see the "configure" line above.
1514  output:
1515    ...
1516    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1517    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
1518    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
1519    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
1520    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
1521    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
1522    mkdir -p /tmp/icu4j/main/shared/data
1523    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1524    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
1525    mkdir -p /tmp/icu4j/main/shared/data
1526    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1527    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1528- copy the big-endian Unicode data files to another location,
1529  separate from the other data files,
1530  and then refresh ICU4J
1531    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1532    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1533    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1534    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1535    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1536    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1537    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1538    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1539    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1540    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1541
1542* When refreshing all of ICU4J data from ICU4C
1543- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1544- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1545or
1546- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1547
1548* refresh Java test .txt files
1549- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1550    cd $ICU_SRC/icu4c/source/data/unidata
1551    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1552    cd ../../test/testdata
1553    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1554    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1555
1556* run & fix ICU4J tests
1557
1558*** API additions
1559- send notice to icu-design about new born-@stable API (enum constants etc.)
1560
1561*** CLDR numbering systems
1562- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1563  for example:
1564    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
1565    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
1566    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
1567    -->
1568    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
1569  Unicode 14:
1570    tnsa 16AC0..16AC9 Tangsa
1571    https://github.com/unicode-org/cldr/pull/1326
1572
1573*** merge the Unicode update branches back onto the trunk
1574- do not merge the icudata.jar and testdata.jar,
1575  instead rebuild them from merged & tested ICU4C
1576- make sure that changes to Unicode tools are checked in:
1577  https://github.com/unicode-org/unicodetools
1578
1579---------------------------------------------------------------------------- ***
1580
1581Unicode 13.0 update for ICU 66
1582
1583https://www.unicode.org/versions/Unicode13.0.0/
1584https://www.unicode.org/versions/beta-13.0.0.html
1585https://www.unicode.org/Public/13.0.0/ucd/
1586https://www.unicode.org/reports/uax-proposed-updates.html
1587https://www.unicode.org/reports/tr44/tr44-25.html
1588
1589https://unicode-org.atlassian.net/browse/CLDR-13387
1590https://unicode-org.atlassian.net/browse/ICU-20893
1591
1592* Command-line environment setup
1593
1594UNICODE_DATA=~/unidata/uni13/20200212
1595CLDR_SRC=~/cldr/uni/src
1596ICU_ROOT=~/icu/uni
1597ICU_SRC=$ICU_ROOT/src
1598ICUDT=icudt66b
1599ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1600ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1601export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1602
1603*** Unicode version numbers
1604- makedata.mak
1605- uchar.h
1606- com.ibm.icu.util.VersionInfo
1607- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1608
1609- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1610    so that the makefiles see the new version number.
1611  cd $ICU_ROOT/dbg/icu4c
1612  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
1613
1614*** data files & enums & parser code
1615
1616* download files
1617- mkdir -p $UNICODE_DATA
1618- download Unicode files into $UNICODE_DATA
1619  + subfolders: emoji, idna, security, ucd, uca
1620  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1621  + split Unihan into single-property files
1622    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
1623  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
1624    or from the ucd/cldr/ output folder of the Unicode Tools:
1625    Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
1626  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
1627
1628* for manual diffs and for Unicode Tools input data updates:
1629  remove version suffixes from the file names
1630    ~$ unidata/desuffixucd.py $UNICODE_DATA
1631  (see https://sites.google.com/site/unicodetools/inputdata)
1632
1633* process and/or copy files
1634- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1635  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1636  + For debugging, and tweaking how ppucd.txt is written,
1637    the tool has an --only_ppucd option:
1638    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1639
1640- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1641
1642* new constants for new property values
1643- preparseucd.py error:
1644    ValueError: missing uchar.h enum constants for some property values:
1645    [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
1646        u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
1647    (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
1648    (u'InPC', set([u'Top_And_Bottom_And_Left']))]
1649  = PropertyValueAliases.txt new property values (diff old & new .txt files)
1650    blk; Chorasmian                       ; Chorasmian
1651    blk; CJK_Ext_G                        ; CJK_Unified_Ideographs_Extension_G
1652    blk; Dives_Akuru                      ; Dives_Akuru
1653    blk; Khitan_Small_Script              ; Khitan_Small_Script
1654    blk; Lisu_Sup                         ; Lisu_Supplement
1655    blk; Symbols_For_Legacy_Computing     ; Symbols_For_Legacy_Computing
1656    blk; Tangut_Sup                       ; Tangut_Supplement
1657    blk; Yezidi                           ; Yezidi
1658  -> add to uchar.h before UBLOCK_COUNT
1659    use long property names for enum constants,
1660    for the trailing comment get the block start code point: diff old & new Blocks.txt
1661  -> add to UCharacter.UnicodeBlock IDs
1662    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1663            replace  public static final int \1_ID = \2; \3
1664  -> add to UCharacter.UnicodeBlock objects
1665    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1666            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1667
1668    sc ; Chrs                             ; Chorasmian
1669    sc ; Diak                             ; Dives_Akuru
1670    sc ; Kits                             ; Khitan_Small_Script
1671    sc ; Yezi                             ; Yezidi
1672  -> uscript.h & com.ibm.icu.lang.UScript
1673  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1674      and in com.ibm.icu.dev.test.lang.TestUScript.java
1675
1676    InPC; Top_And_Bottom_And_Left         ; Top_And_Bottom_And_Left
1677  -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory
1678
1679* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1680    (not strictly necessary for NOT_ENCODED scripts)
1681  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1682
1683* build ICU (make install)
1684  to make sure that there are no syntax errors, and
1685  so that the tools build can pick up the new definitions from the installed header files.
1686
1687  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1688
1689* update spoof checker UnicodeSet initializers:
1690    inclusionPat & recommendedPat in i18n/uspoof.cpp
1691    INCLUSION & RECOMMENDED in SpoofChecker.java
1692- make sure that the Unicode Tools tree contains the latest security data files
1693- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1694- update the hardcoded version number there in the DIRECTORY path
1695- run the tool (no special environment variables needed)
1696- copy & paste from the Console output into the .cpp & .java files
1697
1698* generate normalization data files
1699  cd $ICU_ROOT/dbg/icu4c
1700  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1701  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
1702  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1703  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1704  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
1705
1706* build ICU (make install)
1707  so that the tools build can pick up the new definitions from the installed header files.
1708
1709  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1710
1711* build Unicode tools using CMake+make
1712
1713$ICU_SRC/tools/unicode/c/icudefs.txt:
1714
1715# Location (--prefix) of where ICU was installed.
1716set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
1717# Location of the ICU4C source tree.
1718set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
1719
1720  $ICU_ROOT/dbg$
1721    mkdir -p tools/unicode/c
1722    cd tools/unicode/c
1723
1724  $ICU_ROOT/dbg/tools/unicode/c$
1725    cmake ../../../../src/tools/unicode/c
1726    make
1727
1728* generate core properties data files
1729  $ICU_ROOT/dbg/tools/unicode/c$
1730    genprops/genprops $ICU_SRC/icu4c
1731- tool failure:
1732    genprops: Script_Extensions indexes overflow bit field
1733    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
1734  -> uprops.icu data file format:
1735     add two more bits to store a script code or Script_Extensions index
1736  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
1737- rebuild ICU (make install) & tools
1738
1739* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1740  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1741- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1742- Unicode 6.0..13.0: U+2260, U+226E, U+226F
1743- nothing new in this Unicode version, no test file to update
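
The grep step above could be sketched as follows. This is an illustration only: the function name is made up, and it assumes that each data line starts with a hex code point (or a range) of at least four digits, so that ASCII entries start with 0000..007F.

```shell
# Hypothetical helper: list the disallowed_STD3_valid lines
# whose first code point is above U+007F.
# Usage: non_ascii_std3_valid path/to/IdnaMappingTable.txt
non_ascii_std3_valid() {
    # First keep only lines with the status of interest,
    # then drop comment lines and lines that start with 0000..007F.
    grep 'disallowed_STD3_valid' "$1" |
        grep -Ev '^(#|00[0-7][0-9A-Fa-f]([^0-9A-Fa-f]|$))'
}
```

Compare the output with the list of known characters (U+2260, U+226E, U+226F); anything else would be a candidate for the test files.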
1744
1745* run & fix ICU4C tests
1746- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
1747- Andy helps with RBBI & spoof check test failures
1748
1749* collation: CLDR collation root, UCA DUCET
1750
1751- UCA DUCET goes into Mark's Unicode tools, see
1752    https://sites.google.com/site/unicodetools/home#TOC-UCA
1753  diff the main mapping file, look for bad changes
1754  (for example, more bytes per weight for common characters)
1755    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
1756    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt
1757
1758- CLDR root data files are checked into $CLDR_SRC/common/uca/
1759    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1760
1761- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1762    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1763- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1764    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1765    (note removing the underscore before "Rules")
1766    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1767- restore TODO diffs in UCARules.txt
1768    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1769- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1770  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1771  from the CLDR root files (..._CLDR_..._SHORT.txt)
1772    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1773    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1774    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1775- if CLDR common/uca/unihan-index.txt changes, then update
1776  CLDR common/collation/root.xml <collation type="private-unihan">
1777  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
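
The CollationTest copy commands above drop the "_CLDR" part of the file name on the way into the ICU test data folders. A minimal sketch of that name mapping; the function name is made up for this note.

```shell
# Hypothetical helper: map a CLDR root test file name (with or without a path)
# to the file name used in the ICU test data folders.
icu_collation_test_name() {
    basename "$1" | sed 's/^CollationTest_CLDR_/CollationTest_/'
}
```

With it, the two explicit cp commands could be written as a loop, assuming that the glob matches only the two _SHORT files:
  for f in "$CLDR_SRC"/common/uca/CollationTest_CLDR_*_SHORT.txt; do
    cp -v "$f" "$ICU_SRC/icu4c/source/test/testdata/$(icu_collation_test_name "$f")"
  done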
1778
1779- run genuca
1780  $ICU_ROOT/dbg/tools/unicode/c$
1781    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
1782    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1783- rebuild ICU4C
1784
1785* Unihan collators
1786    https://sites.google.com/site/unicodetools/unihan
1787- run Unicode Tools
1788    org.unicode.draft.GenerateUnihanCollators
1789  with VM arguments
1790    -ea
1791    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1792    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1793    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1794    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
1795    -DUVERSION=13.0.0
1796- run Unicode Tools
1797    org.unicode.draft.GenerateUnihanCollatorFiles
1798  with the same arguments
1799- check CLDR diffs
1800    cd $CLDR_SRC
1801    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1802    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1803- copy to CLDR
1804    cd $CLDR_SRC
1805    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1806    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1807- run CLDR unit tests, commit to CLDR
1808- generate ICU zh collation data: run CLDR
1809    org.unicode.cldr.icu.NewLdml2IcuConverter
1810  with program arguments
1811    -t collation
1812    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
1813    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
1814    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
1815    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
1816    zh
1817  and VM arguments
1818    -ea
1819    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
1820- rebuild ICU4C
1821
1822* run & fix ICU4C tests, now with new CLDR collation root data
1823- run all tests with the collation test data *_SHORT.txt or the full files
1824  (the full ones have comments, useful for debugging)
1825- note on intltest: if collate/UCAConformanceTest fails, then
1826  utility/MultithreadTest/TestCollators will fail as well;
1827  fix the conformance test before looking into the multi-thread test
1828
1829* update Java data files
1830- refresh just the UCD/UCA-related/derived files, just to be safe
1831- see (ICU4C)/source/data/icu4j-readme.txt
1832- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1833- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1834  output:
1835    ...
1836    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1837    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
1838    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
1839    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
1840    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
1841    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
1842    mkdir -p /tmp/icu4j/main/shared/data
1843    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1844    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
1845    mkdir -p /tmp/icu4j/main/shared/data
1846    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1847    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
1848- copy the big-endian Unicode data files to another location,
1849  separate from the other data files,
1850  and then refresh ICU4J
1851    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1852    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1853    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1854    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1855    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1856    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1857    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1858    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1859    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1860    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1861
1862* When refreshing all of ICU4J data from ICU4C
1863- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1864- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1865or
1866- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1867
1868* update CollationFCD.java
1869  + copy & paste the initializers of lcccIndex[] etc. from
1870    ICU4C/source/i18n/collationfcd.cpp to
1871    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1872
1873* refresh Java test .txt files
1874- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1875    cd $ICU_SRC/icu4c/source/data/unidata
1876    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1877    cd ../../test/testdata
1878    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1879    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1880
1881* run & fix ICU4J tests
1882
1883*** API additions
1884- send notice to icu-design about new born-@stable API (enum constants etc.)
1885
1886*** CLDR numbering systems
1887- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1888  for example, look for
1889    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
1890    in new blocks (Blocks.txt)
1891  Unicode 13:
1892    diak 11950..11959 Dives_Akuru
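
The egrep above finds the digit four of each set. A minimal sketch of turning that code point into the range recorded here, assuming that the ten digits zero..nine are encoded contiguously in ascending order (true for the ranges listed in this file). The function name is made up for this note.

```shell
# Hypothetical helper: given the hex code point of a digit four (nv=4),
# print the range of the ten decimal digits as start..end.
digit_range_from_four() {
    # Digit zero is 4 below the digit four, digit nine is 5 above it.
    printf '%04X..%04X\n' "$((0x$1 - 4))" "$((0x$1 + 5))"
}
```

For example, digit_range_from_four 11954 prints 11950..11959, the Dives_Akuru range above.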
1893
1894*** merge the Unicode update branches back onto the trunk
1895- do not merge the icudata.jar and testdata.jar,
1896  instead rebuild them from merged & tested ICU4C
1897- make sure that changes to Unicode tools are checked in:
1898  http://www.unicode.org/utility/trac/log/trunk/unicodetools
1899
1900---------------------------------------------------------------------------- ***
1901
1902Unicode 12.1 update for ICU 64.2
1903
1904** This is an abbreviated update with one new character for the new
1905** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
1906https://en.wikipedia.org/wiki/Reiwa_period
1907
1908http://www.unicode.org/versions/Unicode12.1.0/
1909
1910ICU-20497 Unicode 12.1
1911
1912cldrbug 11978: Unicode 12.1
1913
1914* Command-line environment setup
1915
1916UNICODE_DATA=~/unidata/uni121/20190403
1917CLDR_SRC=~/svn.cldr/uni
1918ICU_ROOT=~/icu/uni
1919ICU_SRC=$ICU_ROOT/src
1920ICUDT=icudt64b
1921ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1922ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1923export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
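
Only LD_LIBRARY_PATH is exported above; the other settings are plain shell variables, which is enough as long as the later command lines are pasted into the same console window. A minimal sketch of a check that the setup was pasted before running tools; the function name is made up for this note.

```shell
# Hypothetical helper: report which of the named variables are unset or empty.
# Returns non-zero if any of them is missing.
check_update_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Possible use:
  check_update_env UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA LD_LIBRARY_PATH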
1924
1925*** Unicode version numbers
1926- makedata.mak
1927- uchar.h
1928- com.ibm.icu.util.VersionInfo
1929- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1930
1931- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1932    so that the makefiles see the new version number.
1933  cd $ICU_ROOT/dbg/icu4c
1934  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
1935
1936*** data files & enums & parser code
1937
1938* download files
1939- mkdir -p $UNICODE_DATA
1940- download Unicode files into $UNICODE_DATA
1941  + subfolders: emoji, idna, security, ucd, uca
1942  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1943
1944* for manual diffs and for Unicode Tools input data updates:
1945  remove version suffixes from the file names
1946    ~$ unidata/desuffixucd.py $UNICODE_DATA
1947  (see https://sites.google.com/site/unicodetools/inputdata)
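
As an illustration of what "remove version suffixes" means for a single file name, here is a minimal sed sketch. It is not the logic of desuffixucd.py; the function name and the assumed suffix shape (-major.minor.update, optionally followed by a draft marker such as d3) are assumptions made for this note.

```shell
# Hypothetical helper; illustrates the idea, not the actual desuffixucd.py code.
# Strips "-N.N.N" (optionally followed by "dN") in front of the file extension.
strip_version_suffix() {
    printf '%s\n' "$1" |
        sed -E 's/-[0-9]+\.[0-9]+\.[0-9]+(d[0-9]+)?(\.[A-Za-z]+)$/\2/'
}
```

For example, strip_version_suffix UnicodeData-12.1.0.txt prints UnicodeData.txt; names without such a suffix are printed unchanged.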
1948
1949* process and/or copy files
1950- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1951  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1952  + For debugging, and tweaking how ppucd.txt is written,
1953    the tool has an --only_ppucd option:
1954    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1955
1956- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1957
1958* build ICU (make install)
1959  so that the tools build can pick up the new definitions from the installed header files.
1960
1961  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1962
1963* update spoof checker UnicodeSet initializers:
1964    inclusionPat & recommendedPat in uspoof.cpp
1965    INCLUSION & RECOMMENDED in SpoofChecker.java
1966- make sure that the Unicode Tools tree contains the latest security data files
1967- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1968- update the hardcoded version number there in the DIRECTORY path
1969- run the tool (no special environment variables needed)
1970- copy & paste from the Console output into the .cpp & .java files
1971
1972* generate normalization data files
1973  cd $ICU_ROOT/dbg/icu4c
1974  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1975  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
1976  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1977  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1978  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
1979
1980* build ICU (make install)
1981  so that the tools build can pick up the new definitions from the installed header files.
1982
1983  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
1984
1985* build Unicode tools using CMake+make
1986
1987$ICU_SRC/tools/unicode/c/icudefs.txt:
1988
1989# Location (--prefix) of where ICU was installed.
1990set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
1991# Location of the ICU4C source tree.
1992set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
1993
1994  $ICU_ROOT/dbg$
1995    mkdir -p tools/unicode/c
1996    cd tools/unicode/c
1997
1998  $ICU_ROOT/dbg/tools/unicode/c$
1999    cmake ../../../../src/tools/unicode/c
2000    make
2001
2002* generate core properties data files
2003  $ICU_ROOT/dbg/tools/unicode/c$
2004    genprops/genprops $ICU_SRC/icu4c
2005    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
2006    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
2007- rebuild ICU (make install) & tools
2008
2009* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2010  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2011- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2012- Unicode 6.0..12.1: U+2260, U+226E, U+226F
2013- nothing new in this Unicode version, no test file to update
2014
2015* run & fix ICU4C tests
2016- Andy handles RBBI & spoof check test failures
2017
2018* collation: CLDR collation root, UCA DUCET
2019
2020- UCA DUCET goes into Mark's Unicode tools, see
2021    https://sites.google.com/site/unicodetools/home#TOC-UCA
2022  diff the main mapping file, look for bad changes
2023  (for example, more bytes per weight for common characters)
2024    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
2025    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt
2026
2027- CLDR root data files are checked into $CLDR_SRC/common/uca/
2028    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
2029
2030- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2031    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
2032- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2033    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
2034    (note removing the underscore before "Rules")
2035    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
2036- restore TODO diffs in UCARules.txt
2037    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
2038- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2039  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2040  from the CLDR root files (..._CLDR_..._SHORT.txt)
2041    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2042    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2043    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
2044- if CLDR common/uca/unihan-index.txt changes, then update
2045  CLDR common/collation/root.xml <collation type="private-unihan">
2046  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
2047
2048- run genuca, see command line above
2049- rebuild ICU4C
2050
2051* Unihan collators
2052    https://sites.google.com/site/unicodetools/unihan
2053- run Unicode Tools
2054    org.unicode.draft.GenerateUnihanCollators
2055  with VM arguments
2056    -ea
2057    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
2058    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
2059    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
2060    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2061    -DUVERSION=12.1.0
2062- run Unicode Tools
2063    org.unicode.draft.GenerateUnihanCollatorFiles
2064  with the same arguments
2065- check CLDR diffs
2066    cd $CLDR_SRC
2067    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2068    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2069- copy to CLDR
2070    cd $CLDR_SRC
2071    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2072    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2073- run CLDR unit tests, commit to CLDR
2074- generate ICU zh collation data: run CLDR
2075    org.unicode.cldr.icu.NewLdml2IcuConverter
2076  with program arguments
2077    -t collation
2078    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
2079    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
2080    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
2081    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
2082    zh
2083  and VM arguments
2084    -ea
2085    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2086- rebuild ICU4C
2087
2088* run & fix ICU4C tests, now with new CLDR collation root data
2089- run all tests with the collation test data *_SHORT.txt or the full files
2090  (the full ones have comments, useful for debugging)
2091- note on intltest: if collate/UCAConformanceTest fails, then
2092  utility/MultithreadTest/TestCollators will fail as well;
2093  fix the conformance test before looking into the multi-thread test
2094
2095* update Java data files
2096- refresh just the UCD/UCA-related/derived files, just to be safe
2097- see (ICU4C)/source/data/icu4j-readme.txt
2098- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2099- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2100  output:
2101    ...
2102    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
2103    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
2104    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
2105    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
2106    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
2107    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
2108    mkdir -p /tmp/icu4j/main/shared/data
2109    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2110    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
2111    mkdir -p /tmp/icu4j/main/shared/data
2112    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2113    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
2114- copy the big-endian Unicode data files to another location,
2115  separate from the other data files,
2116  and then refresh ICU4J
2117    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2118    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2119    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2120    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2121    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2122    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2123    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2124    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2125    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2126    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2127
2128* When refreshing all of ICU4J data from ICU4C
2129- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2130- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
2131or
2132- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
2133
2134* update CollationFCD.java
2135  + copy & paste the initializers of lcccIndex[] etc. from
2136    ICU4C/source/i18n/collationfcd.cpp to
2137    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2138
2139* refresh Java test .txt files
2140- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2141    cd $ICU_SRC/icu4c/source/data/unidata
2142    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2143    cd ../../test/testdata
2144    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2145    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2146
2147* run & fix ICU4J tests
2148
2149*** API additions
2150- send notice to icu-design about new born-@stable API (enum constants etc.)
2151
2152*** CLDR numbering systems
2153- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
2154  for example, look for
2155    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
2156    in new blocks (Blocks.txt)
2157  Unicode 12: using Unicode 12 CLDR ticket #11478
2158    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
2159    wcho 1E2F0..1E2F9 Wancho
2160  Unicode 11: using Unicode 11 CLDR ticket #10978
2161    rohg 10D30..10D39 Hanifi_Rohingya
2162    gong 11DA0..11DA9 Gunjala_Gondi
2163  Earlier: CLDR tickets specific to adding new numbering systems.
2164  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
2165  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
2166
2167*** merge the Unicode update branches back onto the trunk
2168- do not merge the icudata.jar and testdata.jar,
2169  instead rebuild them from merged & tested ICU4C
2170- make sure that changes to Unicode tools are checked in:
2171  http://www.unicode.org/utility/trac/log/trunk/unicodetools
2172
2173---------------------------------------------------------------------------- ***
2174
2175Unicode 12.0 update for ICU 64
2176
2177http://www.unicode.org/versions/Unicode12.0.0/
2178http://unicode.org/versions/beta-12.0.0.html
2179https://www.unicode.org/review/pri389/
2180http://www.unicode.org/reports/uax-proposed-updates.html
2181http://www.unicode.org/reports/tr44/tr44-23.html
2182
2183ICU-20203 Unicode 12
2184
2185ICU-20111 move text layout properties data into a data file
2186
2187cldrbug 11478: Unicode 12
2188Accidentally used ^/trunk instead of ^/branches/markus/uni12
2189
2190* Command-line environment setup
2191
2192UNICODE_DATA=~/unidata/uni12/20190309
2193CLDR_SRC=~/svn.cldr/uni
2194ICU_ROOT=~/icu/uni
2195ICU_SRC=$ICU_ROOT/src
2196ICUDT=icudt63b
2197ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
2198ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
2199export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
2200
2201*** Unicode version numbers
2202- makedata.mak
2203- uchar.h
2204- com.ibm.icu.util.VersionInfo
2205- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2206
2207- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2208  so that the makefiles see the new version number.
2209
2210*** data files & enums & parser code
2211
2212* download files
2213- mkdir -p $UNICODE_DATA
2214- download Unicode files into $UNICODE_DATA
2215  + subfolders: emoji, idna, security, ucd, uca
2216  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2217
2218* for manual diffs and for Unicode Tools input data updates:
2219  remove version suffixes from the file names
2220    ~$ unidata/desuffixucd.py $UNICODE_DATA
2221  (see https://sites.google.com/site/unicodetools/inputdata)
2222
2223* process and/or copy files
2224- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
2225  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2226  + For debugging, and tweaking how ppucd.txt is written,
2227    the tool has an --only_ppucd option:
2228    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
2229
2230- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
2231
2232* build ICU (make install)
2233  so that the tools build can pick up the new definitions from the installed header files.
2234
2235  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
2236
2237* new constants for new property values
2238- preparseucd.py error:
2239    ValueError: missing uchar.h enum constants for some property values:
2240    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
2241        u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
2242        u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
2243    (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
2244  = PropertyValueAliases.txt new property values (diff old & new .txt files)
2245    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
2246    blk; Elymaic                          ; Elymaic
2247    blk; Nandinagari                      ; Nandinagari
2248    blk; Nyiakeng_Puachue_Hmong           ; Nyiakeng_Puachue_Hmong
2249    blk; Ottoman_Siyaq_Numbers            ; Ottoman_Siyaq_Numbers
2250    blk; Small_Kana_Ext                   ; Small_Kana_Extension
2251    blk; Symbols_And_Pictographs_Ext_A    ; Symbols_And_Pictographs_Extended_A
2252    blk; Tamil_Sup                        ; Tamil_Supplement
2253    blk; Wancho                           ; Wancho
2254  -> add to uchar.h
2255    use long property names for enum constants,
2256    for the trailing comment get the block start code point: diff old & new Blocks.txt
2257  -> add to UCharacter.UnicodeBlock IDs
2258    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2259            replace  public static final int \1_ID = \2; \3
2260  -> add to UCharacter.UnicodeBlock objects
2261    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2262            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
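
For use outside Eclipse, the two find/replace steps could be sketched with sed -E. The function names are made up for this note. In the second pattern the numeric value is not captured, so the trailing comment is group 2 there.

```shell
# Hypothetical helpers mirroring the Eclipse find/replace patterns.
# Both read uchar.h enum lines on stdin and write Java lines to stdout.

# UBLOCK_NAME = 123, /*[XXXX]*/  ->  public static final int NAME_ID = 123; /*[XXXX]*/
ublock_to_java_id() {
    sed -E 's|UBLOCK_([^ ]+) = ([0-9]+), (/.+)|public static final int \1_ID = \2; \3|'
}

# UBLOCK_NAME = 123, /*[XXXX]*/  ->  public static final UnicodeBlock NAME = new UnicodeBlock("NAME", NAME_ID); /*[XXXX]*/
ublock_to_java_object() {
    sed -E 's|UBLOCK_([^ ]+) = [0-9]+, (/.+)|public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2|'
}
```

Possible use: paste the new enum lines from uchar.h into a scratch file and run each function over it, then paste the results into UCharacter.java.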
2263
2264    sc ; Elym                             ; Elymaic
2265    sc ; Hmnp                             ; Nyiakeng_Puachue_Hmong
2266    sc ; Nand                             ; Nandinagari
2267    sc ; Wcho                             ; Wancho
2268  -> uscript.h & com.ibm.icu.lang.UScript
2269  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2270      and in com.ibm.icu.dev.test.lang.TestUScript.java
2271
2272* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2273    (not strictly necessary for NOT_ENCODED scripts)
2274  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
2275
2276* update spoof checker UnicodeSet initializers:
2277    inclusionPat & recommendedPat in uspoof.cpp
2278    INCLUSION & RECOMMENDED in SpoofChecker.java
2279- make sure that the Unicode Tools tree contains the latest security data files
2280- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
2281- update the hardcoded version number there in the DIRECTORY path
2282- run the tool (no special environment variables needed)
2283- copy & paste from the Console output into the .cpp & .java files
2284
2285* generate normalization data files
2286  cd $ICU_ROOT/dbg/icu4c
2287  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
2288  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
2289  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
2290  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2291  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
2292
2293* build ICU (make install)
2294  so that the tools build can pick up the new definitions from the installed header files.
2295
2296  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
2297
2298* build Unicode tools using CMake+make
2299
2300$ICU_SRC/tools/unicode/c/icudefs.txt:
2301
2302# Location (--prefix) of where ICU was installed.
2303set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
2304# Location of the ICU4C source tree.
2305set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
2306
2307  $ICU_ROOT/dbg$
2308    mkdir -p tools/unicode/c
2309    cd tools/unicode/c
2310
2311  $ICU_ROOT/dbg/tools/unicode/c$
2312    cmake ../../../../src/tools/unicode/c
2313    make
2314
2315* generate core properties data files
2316  $ICU_ROOT/dbg/tools/unicode/c$
2317    genprops/genprops $ICU_SRC/icu4c
2318    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
2319    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
2320- rebuild ICU (make install) & tools
2321
2322* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2323  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2324- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2325- Unicode 6.0..12.0: U+2260, U+226E, U+226F
2326- nothing new in this Unicode version, no test file to update
2327
2328* run & fix ICU4C tests
2329- update test of default bidi classes:
2330  Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
2331  see diffs in DerivedBidiClass.txt
2332  + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
2333  + UCharacterTest.java TestIteration() defaultBidi[]
2334- Andy handles RBBI & spoof check test failures
2335
2336* collation: CLDR collation root, UCA DUCET
2337
2338- UCA DUCET goes into Mark's Unicode tools, see
2339    https://sites.google.com/site/unicodetools/home#TOC-UCA
2340  diff the main mapping file, look for bad changes
2341  (for example, more bytes per weight for common characters)
2342    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
2343    ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt
2344
2345- CLDR root data files are checked into $CLDR_SRC/common/uca/
2346    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
2347
2348- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2349    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
2350- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2351    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
2352    (note removing the underscore before "Rules")
2353    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
2354- restore TODO diffs in UCARules.txt
2355    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
2356- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2357  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2358  from the CLDR root files (..._CLDR_..._SHORT.txt)
2359    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2360    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2361    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
2362- if CLDR common/uca/unihan-index.txt changes, then update
2363  CLDR common/collation/root.xml <collation type="private-unihan">
2364  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
2365
2366- run genuca, see command line above;
2367  deal with
2368    Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
2369    FDD1 119CE;	[71 CD 02, 05, 05]	# Nandinagari first primary (compressible)
2370        (add the character to genuca.cpp sampleCharsToScripts[])
2371  + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
2372    and cache its values.
2373    Works as long as the script metadata is updated before the collation data.
2374- rebuild ICU4C
2375
2376* Unihan collators
2377    https://sites.google.com/site/unicodetools/unihan
2378- run Unicode Tools
2379    org.unicode.draft.GenerateUnihanCollators
2380  with VM arguments
2381    -ea
2382    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
2383    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
2384    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
2385    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2386    -DUVERSION=12.0.0
2387- run Unicode Tools
2388    org.unicode.draft.GenerateUnihanCollatorFiles
2389  with the same arguments
2390- check CLDR diffs
2391    cd $CLDR_SRC
2392    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2393    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2394- copy to CLDR
2395    cd $CLDR_SRC
2396    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2397    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2398- run CLDR unit tests, commit to CLDR
2399- generate ICU zh collation data: run CLDR
2400    org.unicode.cldr.icu.NewLdml2IcuConverter
2401  with program arguments
2402    -t collation
2403    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
2404    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
2405    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
2406    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
2407    zh
2408  and VM arguments
2409    -ea
2410    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2411- rebuild ICU4C
2412
2413* run & fix ICU4C tests, now with new CLDR collation root data
2414- run all tests with the collation test data *_SHORT.txt or the full files
2415  (the full ones have comments, useful for debugging)
2416- note on intltest: if collate/UCAConformanceTest fails, then
2417  utility/MultithreadTest/TestCollators will fail as well;
2418  fix the conformance test before looking into the multi-thread test
2419
2420* update Java data files
2421- refresh just the UCD/UCA-related/derived files, just to be safe
2422- see (ICU4C)/source/data/icu4j-readme.txt
2423- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2424- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2425  output:
2426    ...
2427    Unicode .icu files built to ./out/build/icudt63l
2428    echo timestamp > uni-core-data
2429    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
2430    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
2431    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2432    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
2433    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
2434    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
2435    mkdir -p /tmp/icu4j/main/shared/data
2436    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2437    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
2438    mkdir -p /tmp/icu4j/main/shared/data
2439    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2440    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
2441- copy the big-endian Unicode data files to another location,
2442  separate from the other data files,
2443  and then refresh ICU4J
2444    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2445    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2446    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2447    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2448    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2449    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2450    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2451    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2452    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2453    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
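
The cp/rm sequence above selects a subset of the files under the $ICUDT folder.
A hedged Python sketch of that selection rule, for checking a file listing before
updating the jar (the function names are hypothetical, and the example paths in
the usage note are only illustrations):

```python
def is_unicode_data_file(rel_path):
    """Return True if rel_path (relative to the $ICUDT folder)
    is one of the files copied by the commands above."""
    folder, _, name = rel_path.rpartition("/")
    if folder in ("coll", "brkitr"):
        return True       # cp coll/* and cp brkitr/*
    if folder:
        return False      # other subfolders are left alone
    if name == "cnvalias.icu":
        return False      # copied by *.icu, then removed again by rm
    return name == "confusables.cfu" or name.endswith((".icu", ".nrm"))

def select_unicode_data_files(rel_paths):
    """Keep the input order; drop everything that is not refreshed."""
    return [p for p in rel_paths if is_unicode_data_file(p)]
```

For example, given uprops.icu, cnvalias.icu, nfc.nrm and coll/ucadata.icu,
the sketch keeps all but cnvalias.icu.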
2454
2455* When refreshing all of ICU4J data from ICU4C
2456- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2457- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
2458or
2459- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
2460
2461* update CollationFCD.java
2462  + copy & paste the initializers of lcccIndex[] etc. from
2463    ICU4C/source/i18n/collationfcd.cpp to
2464    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2465
2466* refresh Java test .txt files
2467- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2468    cd $ICU_SRC/icu4c/source/data/unidata
2469    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2470    cd ../../test/testdata
2471    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2472    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2473
2474* run & fix ICU4J tests
2475
2476*** API additions
2477- send notice to icu-design about new born-@stable API (enum constants etc.)
2478
2479*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
2481  for example, look for
2482    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
2483    in new blocks (Blocks.txt)
2484  Unicode 12: using Unicode 12 CLDR ticket #11478
2485    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
2486    wcho 1E2F0..1E2F9 Wancho
2487  Unicode 11: using Unicode 11 CLDR ticket #10978
2488    rohg 10D30..10D39 Hanifi_Rohingya
2489    gong 11DA0..11DA9 Gunjala_Gondi
2490  Earlier: CLDR tickets specific to adding new numbering systems.
2491  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
2492  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
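
A Python sketch of the same search as the egrep command above, turning each
digit-four hit into the full 0..9 range. Assumptions: ppucd.txt code point lines
start with "cp;" followed by the code point, and each set of decimal digits is
encoded contiguously in ascending order. The function name is hypothetical.

```python
import re

# Same pattern as the egrep command line.
_DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4")

def decimal_digit_sets(ppucd_lines):
    """Return (zero, nine) code point pairs, one per digit-four line found."""
    result = []
    for line in ppucd_lines:
        if not line.startswith("cp;") or not _DIGIT_FOUR.search(line):
            continue
        cp_field = line.split(";")[1]
        if ".." in cp_field:
            continue  # a single code point is expected here, not a range
        four = int(cp_field, 16)
        # Digit four sits four code points after zero, five before nine.
        result.append((four - 4, four + 5))
    return result
```

Compare the resulting ranges with the new blocks in Blocks.txt and with the
numbering systems that CLDR already has.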
2493
2494*** merge the Unicode update branches back onto the trunk
2495- do not merge the icudata.jar and testdata.jar,
2496  instead rebuild them from merged & tested ICU4C
2497- make sure that changes to Unicode tools are checked in:
2498  http://www.unicode.org/utility/trac/log/trunk/unicodetools
2499
2500---------------------------------------------------------------------------- ***
2501
ICU 63: add support for text layout properties InPC, InSC, vo
2503
2504* Command-line environment setup
2505
2506UNICODE_DATA=~/unidata/uni11/20180609
2507CLDR_SRC=~/svn.cldr/uni
2508ICU_ROOT=~/icu/mine
2509ICU_SRC=$ICU_ROOT/src
2510ICUDT=icudt62b
2511ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
2512ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
2513export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
2514
2515*** Links
2516
2517https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
2518https://unicode-org.atlassian.net/browse/ICU-12850 vo
2519
2520*** data files & enums & parser code
2521
2522* API additions
2523- for each of the three new enumerated properties
2524  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
2525  + uchar.h: update UCHAR_INT_LIMIT
2526  + uchar.h: add the enum U<long prop name>
2527    with constants U_<short prop name>_<long value name>
2528  + UProperty.java: add the constant <long prop name>
2529  + UProperty.java: update INT_LIMIT
2530  + UCharacter.java: add the interface <long prop name>
2531    with constants <long value name>
2532
2533* process and/or copy files
2534- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
2535  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2536  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
2537    names and aliases.
2538  + For debugging, and tweaking how ppucd.txt is written,
2539    the tool has an --only_ppucd option:
2540    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
2541
2542* preparseucd.py changes
2543- add new property short names (uppercase) to _prop_and_value_re
2544  so that ParseUCharHeader() parses the new enum constants
2545
2546* build ICU (make install)
2547  so that the tools build can pick up the new definitions from the installed header files.
2548
2549  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2550
2551* build Unicode tools using CMake+make
2552
2553$ICU_SRC/tools/unicode/c/icudefs.txt:
2554
2555# Location (--prefix) of where ICU was installed.
2556set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
2557# Location of the ICU4C source tree.
2558set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)
2559
2560  $ICU_ROOT/dbg$
2561    mkdir -p tools/unicode/c
2562    cd tools/unicode/c
2563
  $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make
2567
2568* generate core properties data files
2569  $ICU_ROOT/dbg/tools/unicode/c$
2570    genprops/genprops $ICU_SRC/icu4c
2571- rebuild ICU (make install) & tools
2572
2573* write data for runtime, hardcoded for now
2574- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
2575- generate new icu4c/source/common/ulayout_props_data.h
2576- for each of the three new enumerated properties
2577  + int property max value
2578  + small, 8-bit UCPTrie
2579    (A small 16-bit trie with bit fields for these three properties
2580    is very nearly the same size as the sum of the three.)
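
To illustrate the alternative mentioned in parentheses (one 16-bit value with bit
fields instead of three 8-bit values), here is a small Python sketch. The field
widths are assumptions chosen only so that the three values fit into 16 bits;
they are not taken from ICU, which uses three separate tries.

```python
# Hypothetical field widths in bits; not the layout of any ICU data file.
INPC_BITS, INSC_BITS, VO_BITS = 5, 6, 2

INSC_SHIFT = INPC_BITS
VO_SHIFT = INPC_BITS + INSC_BITS

def pack(inpc, insc, vo):
    """Combine the three property values into one 16-bit word."""
    for value, bits in ((inpc, INPC_BITS), (insc, INSC_BITS), (vo, VO_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("value %d does not fit in %d bits" % (value, bits))
    return inpc | (insc << INSC_SHIFT) | (vo << VO_SHIFT)

def unpack(word):
    """Split a packed word back into (inpc, insc, vo)."""
    inpc = word & ((1 << INPC_BITS) - 1)
    insc = (word >> INSC_SHIFT) & ((1 << INSC_BITS) - 1)
    vo = (word >> VO_SHIFT) & ((1 << VO_BITS) - 1)
    return inpc, insc, vo
```

With separate 8-bit tries, each getter is a plain lookup without shifting or
masking, which is part of why similar total size does not favor the packed form.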
2581
2582* wire into C++
2583- uprops.cpp: #include ulayout_props_data.h
2584- uprops.cpp: add getInPC() etc. functions
2585- uprops.cpp: add lines to intProps[], include max values
2586- uprops.h: add UPropertySource constants
2587- uprops.cpp: add uprops_addPropertyStarts(src)
2588- uniset_props.cpp: add to UnicodeSet_initInclusion()
2589- intltest/ucdtest.cpp: write unit tests
2590
2591* update Java data files
2592- refresh just the pnames.icu file with the new property [value] names, just to be safe
2593- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
2594- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2595- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2596- copy the big-endian Unicode data files to another location,
2597  separate from the other data files,
2598  and then refresh ICU4J
2599    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2600    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2601    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2602
2603* wire into Java
2604- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
2605- UCharacterProperty.java: for each new property
2606  + create a nested class to hold its CodePointTrie
2607  + initialize it from a string literal
2608  + paste in the initializer printed by genprops
2609  + add a new IntProperty object to the intProps[] array
2610  + use the correct max int value for each property, also printed by genprops
2611- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
2612- UnicodeSet.java: add to getInclusions()
2613- UCharacterTest.java: write unit tests
2614
2615---------------------------------------------------------------------------- ***
2616
2617Unicode 11.0 update for ICU 62
2618
2619http://www.unicode.org/versions/Unicode11.0.0/
2620http://unicode.org/versions/beta-11.0.0.html
2621https://www.unicode.org/review/pri372/
2622http://www.unicode.org/reports/uax-proposed-updates.html
2623http://www.unicode.org/reports/tr44/tr44-21.html
2624
2625* Command-line environment setup
2626
2627UNICODE_DATA=~/unidata/uni11/20180521
2628CLDR_SRC=~/svn.cldr/uni
2629ICU_ROOT=~/svn.icu/uni
2630ICU_SRC=$ICU_ROOT/src
2631ICUDT=icudt61b
2632ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
2633ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
2634export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
2635
2636*** ICU Trac
2637
2638- ticket:13630: Unicode 11
2639- ^/branches/markus/uni11
2640
2641*** CLDR Trac
2642
2643- cldrbug 10978: Unicode 11
2644- ^/branches/markus/uni11
2645
2646*** Unicode version numbers
2647- makedata.mak
2648- uchar.h
2649- com.ibm.icu.util.VersionInfo
2650- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2651
2652- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2653  so that the makefiles see the new version number.
2654
2655*** data files & enums & parser code
2656
2657* download files
2658- mkdir -p $UNICODE_DATA
2659- download Unicode files into $UNICODE_DATA
2660  + subfolders: emoji, idna, security, ucd, uca
2661  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2662
2663* for manual diffs and for Unicode Tools input data updates:
2664  remove version suffixes from the file names
2665    ~$ unidata/desuffixucd.py $UNICODE_DATA
2666  (see https://sites.google.com/site/unicodetools/inputdata)
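
What "remove version suffixes" means can be sketched in Python as follows. This
is not the code of desuffixucd.py; it assumes suffixes of the form
"-<major>.<minor>.<update>", optionally followed by a draft number "d<N>",
directly before the file extension.

```python
import re

# Assumed suffix shape, for example "-11.0.0" or "-11.0.0d12" before ".txt".
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")

def strip_version_suffix(file_name):
    """Return file_name without a trailing Unicode version suffix.
    Names without such a suffix are returned unchanged."""
    return _VERSION_SUFFIX.sub("", file_name)
```

With unversioned names, files from two download folders can be diffed directly
by name.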
2667
2668* process and/or copy files
2669- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
2670  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2671  + For debugging, and tweaking how ppucd.txt is written,
2672    the tool has an --only_ppucd option:
2673    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
2674
2675- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
2676
2677* build ICU (make install)
2678  so that the tools build can pick up the new definitions from the installed header files.
2679
2680  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2681
2682* preparseucd.py changes
2683- fix other errors
2684    NameError: unknown property Extended_Pictographic
2685  -> add Extended_Pictographic binary property
2686  -> add new short names for all Emoji properties
2687
2688* new constants for new property values
2689- preparseucd.py error:
2690    ValueError: missing uchar.h enum constants for some property values:
2691    [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
2692                   u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
2693                   u'Indic_Siyaq_Numbers'])),
2694     (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
2695     (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
2696     (u'GCB', set([u'LinkC', u'Virama'])),
2697     (u'WB', set([u'WSegSpace']))]
2698  = PropertyValueAliases.txt new property values (diff old & new .txt files)
2699    blk; Chess_Symbols                    ; Chess_Symbols
2700    blk; Dogra                            ; Dogra
2701    blk; Georgian_Ext                     ; Georgian_Extended
2702    blk; Gunjala_Gondi                    ; Gunjala_Gondi
2703    blk; Hanifi_Rohingya                  ; Hanifi_Rohingya
2704    blk; Indic_Siyaq_Numbers              ; Indic_Siyaq_Numbers
2705    blk; Makasar                          ; Makasar
2706    blk; Mayan_Numerals                   ; Mayan_Numerals
2707    blk; Medefaidrin                      ; Medefaidrin
2708    blk; Old_Sogdian                      ; Old_Sogdian
2709    blk; Sogdian                          ; Sogdian
2710  -> add to uchar.h
2711    use long property names for enum constants,
2712    for the trailing comment get the block start code point: diff old & new Blocks.txt
2713  -> add to UCharacter.UnicodeBlock IDs
2714    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2715            replace  public static final int \1_ID = \2; \3
2716  -> add to UCharacter.UnicodeBlock objects
2717    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2718            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
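
The two Eclipse find/replace steps above can also be run outside the IDE.
A Python sketch using the same regular expressions; the helper names are
hypothetical, and the sample line in the usage note uses a placeholder number,
not a real UBLOCK_ value.

```python
import re

# Find patterns as given for Eclipse above.
_ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
_OBJECT_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"

def to_java_block_id(uchar_h_line):
    """uchar.h enum line -> UCharacter.UnicodeBlock ID constant."""
    return re.sub(_ID_FIND,
                  r"public static final int \1_ID = \2; \3",
                  uchar_h_line)

def to_java_block_object(uchar_h_line):
    """uchar.h enum line -> UCharacter.UnicodeBlock object declaration."""
    return re.sub(_OBJECT_FIND,
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  uchar_h_line)
```

For a line such as "    UBLOCK_DOGRA = 999, /*[11800]*/" (999 is a placeholder),
the first helper yields the _ID constant and the second the UnicodeBlock object,
both keeping the indentation and the trailing comment.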
2719
2720    GCB; LinkC                            ; LinkingConsonant
2721    GCB; Virama                           ; Virama
2722  -> uchar.h & UCharacter.GraphemeClusterBreak
2723  -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
2724
2725    InSC; Consonant_Initial_Postfixed     ; Consonant_Initial_Postfixed
2726  -> ignore: ICU does not yet support this property
2727
2728    jg ; Hanifi_Rohingya_Kinna_Ya         ; Hanifi_Rohingya_Kinna_Ya
2729    jg ; Hanifi_Rohingya_Pa               ; Hanifi_Rohingya_Pa
2730  -> uchar.h & UCharacter.JoiningGroup
2731
2732    sc ; Dogr                             ; Dogra
2733    sc ; Gong                             ; Gunjala_Gondi
2734    sc ; Maka                             ; Makasar
2735    sc ; Medf                             ; Medefaidrin
2736    sc ; Rohg                             ; Hanifi_Rohingya
2737    sc ; Sogd                             ; Sogdian
2738    sc ; Sogo                             ; Old_Sogdian
2739  -> uscript.h & com.ibm.icu.lang.UScript
2740  -> Nushu had been added already
2741  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2742      and in com.ibm.icu.dev.test.lang.TestUScript.java
2743
2744    WB ; WSegSpace                        ; WSegSpace
2745  -> uchar.h & UCharacter.WordBreak
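
The "diff old & new .txt files" step for PropertyValueAliases.txt can be
approximated with a set difference. A Python sketch with hypothetical function
names; it assumes the "property ; short name ; long name" line shape shown above.
For ccc lines, which carry an extra numeric field, the tuple holds the first
three fields as they appear.

```python
def parse_value_aliases(lines):
    """Return a set of tuples of the first three fields of each data line."""
    entries = set()
    for line in lines:
        data = line.split("#", 1)[0]            # drop trailing comments
        fields = [f.strip() for f in data.split(";")]
        if len(fields) >= 3 and fields[0]:
            entries.add((fields[0], fields[1], fields[2]))
    return entries

def new_property_values(old_lines, new_lines):
    """Return the entries that are in the new file but not in the old one."""
    return sorted(parse_value_aliases(new_lines) - parse_value_aliases(old_lines))
```

The result lists the values that need new enum constants, grouped by property
because of the sort order.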
2746
2747* New short names for emoji properties
2748- see UTS #51
2749- short names set in preparseucd.py
2750
2751* New properties
2752- boolean emoji property Extended_Pictographic
2753  -> added in preparseucd.py
2754  -> uchar.h & UProperty.java
2755- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
2756  as shown in PropertyValueAliases.txt
2757  -> ignore for now
2758
2759* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2760    (not strictly necessary for NOT_ENCODED scripts)
2761  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
2762
2763* update spoof checker UnicodeSet initializers:
2764    inclusionPat & recommendedPat in uspoof.cpp
2765    INCLUSION & RECOMMENDED in SpoofChecker.java
2766- make sure that the Unicode Tools tree contains the latest security data files
2767- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
2768- update the hardcoded version number there in the DIRECTORY path
2769- run the tool (no special environment variables needed)
2770- copy & paste from the Console output into the .cpp & .java files
2771
2772* generate normalization data files
2773  cd $ICU_ROOT/dbg/icu4c
2774  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
2775  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
2776  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
2777  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2778  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
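
The gennorm2 command lines above layer their inputs: each .nrm file is built
from nfc.txt plus its own additions. A Python sketch that generates the same
argument lists from a table; the function name and the table are illustrative,
and the sketch only builds the commands, it does not run them.

```python
# Output file -> input files, in the order used by the commands above.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}

def gennorm2_commands(icu_src, data_in, unidata):
    """Return argv lists; the parameters stand for $ICU_SRC,
    $ICU4C_DATA_IN and $ICU4C_UNIDATA."""
    source_dir = unidata + "/norm2"
    commands = [[
        "bin/gennorm2",
        "-o", icu_src + "/icu4c/source/common/norm2_nfc_data.h",
        "-s", source_dir, "nfc.txt", "--csource",
    ]]
    for output, inputs in NORM2_INPUTS.items():
        commands.append(["bin/gennorm2", "-o", data_in + "/" + output,
                         "-s", source_dir] + inputs)
    return commands
```

Each list could be passed to subprocess.run() from $ICU_ROOT/dbg/icu4c.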
2779
2780* build ICU (make install)
2781  so that the tools build can pick up the new definitions from the installed header files.
2782
2783  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
2784
2785* build Unicode tools using CMake+make
2786
2787$ICU_SRC/tools/unicode/c/icudefs.txt:
2788
2789# Location (--prefix) of where ICU was installed.
2790set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
2791# Location of the ICU4C source tree.
2792set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
2793
2794  $ICU_ROOT/dbg$
2795    mkdir -p tools/unicode/c
2796    cd tools/unicode/c
2797
2798  $ICU_ROOT/dbg/tools/unicode/c$
2799    cmake ../../../../src/tools/unicode/c
2800    make
2801
2802* generate core properties data files
2803  $ICU_ROOT/dbg/tools/unicode/c$
2804    genprops/genprops $ICU_SRC/icu4c
2805    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
2806    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
2807- rebuild ICU (make install) & tools
2808
2809* Fix case props
2810    genprops error: casepropsbuilder: too many exceptions words
2811    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
2812- With the addition of Georgian Mtavruli capital letters,
2813  there are now too many simple case mappings with big mapping deltas
2814  that yield uncompressible exceptions.
- Fix: change the data structure (now formatVersion 4) by
  adding one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that delta is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Further changes gain one more bit for the exceptions index,
  for future growth. For details, see casepropsbuilder.cpp.
2823
2824* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2825  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2826- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2827- Unicode 6.0..11.0: U+2260, U+226E, U+226F
2828- nothing new in this Unicode version, no test file to update
2829
2830* run & fix ICU4C tests
2831- Andy handles RBBI & spoof check test failures
2832
2833- Errors in char.txt, word.txt, word_POSIX.txt like
2834    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET"  at line 46, column 16
2835  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
2836  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
2837     not empty, just to get ICU building.
2838  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
2839     and properties together with the rules that used them (GB 10, WB 14).
2840  -> Andy adjusts the rule sets further to sync with
2841     Unicode 11 grapheme, word, and line break spec changes.
2842
2843* collation: CLDR collation root, UCA DUCET
2844
2845- UCA DUCET goes into Mark's Unicode tools, see
2846    https://sites.google.com/site/unicodetools/home#TOC-UCA
2847  diff the main mapping file, look for bad changes
2848  (for example, more bytes per weight for common characters)
2849    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
2850    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
2851
2852- CLDR root data files are checked into $CLDR_SRC/common/uca/
2853    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
2854
2855- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2856    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
2857- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2858    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
2859    (note removing the underscore before "Rules")
2860    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
2861- restore TODO diffs in UCARules.txt
2862    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
2863- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2864  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2865  from the CLDR root files (..._CLDR_..._SHORT.txt)
2866    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2867    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2868    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
2869- if CLDR common/uca/unihan-index.txt changes, then update
2870  CLDR common/collation/root.xml <collation type="private-unihan">
2871  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
2872
2873- run genuca, see command line above;
2874  deal with
2875    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
2876    FDD1 1180B;	[71 CC 02, 05, 05]	# Dogra first primary (compressible)
2877        (add the character to genuca.cpp sampleCharsToScripts[])
2878  + look up the USCRIPT_ code for the new sample characters
2879    (should be obvious from the comment in the error output)
2880  + *add* mappings to sampleCharsToScripts[], do not replace them
2881    (in case the script sample characters flip-flop)
2882  + insert new scripts in DUCET script order, see the top_byte table
2883    at the beginning of FractionalUCA.txt
2884- rebuild ICU4C
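
To see all first-primary sample characters before running genuca, the
FractionalUCA.txt lines can be listed with a short Python sketch. The line shape
is assumed from the error output quoted above ("FDD1 <code point>; [...] # <script>
first primary ..."); the function name is hypothetical.

```python
import re

# Assumed shape: FDD1 <sample code point>;<tab>[weights]<tab># <name> first primary ...
_FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-F]{4,6});.*#\s*(.+?) first primary")

def first_primary_samples(lines):
    """Return (sample code point, script name from the comment) pairs."""
    samples = []
    for line in lines:
        m = _FIRST_PRIMARY.match(line)
        if m:
            samples.append((int(m.group(1), 16), m.group(2)))
    return samples
```

Sample characters that are missing from sampleCharsToScripts[] in genuca.cpp
are the ones that trigger the "Unknown script" error.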
2885
2886* Unihan collators
2887    https://sites.google.com/site/unicodetools/unihan
2888- run Unicode Tools
2889    org.unicode.draft.GenerateUnihanCollators
2890  with VM arguments
2891    -ea
2892    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
2893    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
2894    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
2895    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2896    -DUVERSION=11.0.0
2897- run Unicode Tools
2898    org.unicode.draft.GenerateUnihanCollatorFiles
2899  with the same arguments
2900- check CLDR diffs
2901    cd $CLDR_SRC
2902    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2903    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2904- copy to CLDR
2905    cd $CLDR_SRC
2906    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2907    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2908- run CLDR unit tests, commit to CLDR
2909- generate ICU zh collation data: run CLDR
2910    org.unicode.cldr.icu.NewLdml2IcuConverter
2911  with program arguments
2912    -t collation
2913    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
2914    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
2915    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
2916    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
2917    zh
2918  and VM arguments
2919    -ea
2920    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
2921- rebuild ICU4C
2922
2923* run & fix ICU4C tests, now with new CLDR collation root data
2924- run all tests with the collation test data *_SHORT.txt or the full files
2925  (the full ones have comments, useful for debugging)
2926- note on intltest: if collate/UCAConformanceTest fails, then
2927  utility/MultithreadTest/TestCollators will fail as well;
2928  fix the conformance test before looking into the multi-thread test
2929
2930* update Java data files
2931- refresh just the UCD/UCA-related/derived files, just to be safe
2932- see (ICU4C)/source/data/icu4j-readme.txt
2933- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2934- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2935  output:
2936    ...
2937    Unicode .icu files built to ./out/build/icudt61l
2938    echo timestamp > uni-core-data
2939    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
2940    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
2941    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2942    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
2943    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
2944    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
2945    mkdir -p /tmp/icu4j/main/shared/data
2946    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2947    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
2948    mkdir -p /tmp/icu4j/main/shared/data
2949    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2950    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
2951- copy the big-endian Unicode data files to another location,
2952  separate from the other data files,
2953  and then refresh ICU4J
2954    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
2955    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2956    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2957    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2958    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2959    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2960    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2961    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2962    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2963    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2964
2965* When refreshing all of ICU4J data from ICU4C
2966- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2967- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
2968or
2969- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
2970
2971* update CollationFCD.java
2972  + copy & paste the initializers of lcccIndex[] etc. from
2973    ICU4C/source/i18n/collationfcd.cpp to
2974    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2975
2976* refresh Java test .txt files
2977- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2978    cd $ICU_SRC/icu4c/source/data/unidata
2979    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2980    cd ../../test/testdata
2981    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2982    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
2983
2984* run & fix ICU4J tests
2985
2986*** API additions
2987- send notice to icu-design about new born-@stable API (enum constants etc.)
2988
2989*** CLDR numbering systems
2990- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
2991  Unicode 11: using Unicode 11 CLDR ticket #10978
2992    rohg 10D30..10D39 Hanifi_Rohingya
2993    gong 11DA0..11DA9 Gunjala_Gondi
2994  Earlier: CLDR tickets specific to adding new numbering systems.
2995  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
2996  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
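A rough way to spot candidate digit sets (a sketch, not part of the ICU or CLDR tools): scan UnicodeData.txt for gc=Nd characters whose decimal digit value is 4, and report the enclosing zero..nine range. The function name and the sample lines are made up for illustration. It assumes that each digit set is a contiguous run of ten code points, which holds for the ranges listed above.

```python
def find_decimal_digit_sets(unicode_data_lines):
    """Returns (zero, nine) code point pairs, one per gc=Nd character with value 4.

    Assumes the UnicodeData.txt field layout:
    code point; name; gc; ccc; bc; decomposition; decimal digit value; ...
    """
    sets = []
    for line in unicode_data_lines:
        fields = line.strip().split(";")
        if len(fields) < 7:
            continue  # not a data line
        if fields[2] == "Nd" and fields[6] == "4":
            four = int(fields[0], 16)
            # Assumption: digits 0..9 are contiguous, so digit 4 sits at offset 4.
            sets.append((four - 4, four + 5))
    return sets


# Illustrative input in UnicodeData.txt format.
sample = [
    "0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]
for zero, nine in find_decimal_digit_sets(sample):
    print("%04X..%04X" % (zero, nine))
```

Run it over the new $UNICODE_DATA/ucd/UnicodeData.txt and compare the ranges against the numbering systems that CLDR already has; anything left over is a candidate for a new entry.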
2997
2998*** merge the Unicode update branches back onto the trunk
2999- do not merge the icudata.jar and testdata.jar,
3000  instead rebuild them from merged & tested ICU4C
3001- make sure that changes to Unicode tools are checked in:
3002  http://www.unicode.org/utility/trac/log/trunk/unicodetools
3003
3004---------------------------------------------------------------------------- ***
3005
3006Unicode 10.0 update for ICU 60
3007
3008http://www.unicode.org/versions/Unicode10.0.0/
3009http://www.unicode.org/versions/beta-10.0.0.html
3010http://blog.unicode.org/2017/03/unicode-100-beta-review.html
3011http://www.unicode.org/review/pri350/
3012http://www.unicode.org/reports/uax-proposed-updates.html
3013http://www.unicode.org/reports/tr44/tr44-19.html
3014
3015* Command-line environment setup
3016
3017UNICODE_DATA=~/unidata/uni10/20170605
3018CLDR_SRC=~/svn.cldr/uni10
3019ICU_ROOT=~/svn.icu/uni10
3020ICU_SRC=$ICU_ROOT/src
3021ICUDT=icudt60b
3022ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
3023ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
3024export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
3025
3026*** ICU Trac
3027
3028- ticket:12985: Unicode 10
3029- ticket:13061: undo hacks from emoji 5.0 update
3030- ticket:13062: add Emoji_Component property
3031- ^/branches/markus/uni10
3032
3033*** CLDR Trac
3034
3035- cldrbug 10055: Unicode 10
3036- cldrbug 9882: Unicode 10 script metadata
3037- cldrbug 10219: numbering systems for Unicode 10
3038
3039*** Unicode version numbers
3040- makedata.mak
3041- uchar.h
3042- com.ibm.icu.util.VersionInfo
3043- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3044
3045- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3046  so that the makefiles see the new version number.
3047
3048*** data files & enums & parser code
3049
3050* download files
3051- mkdir -p $UNICODE_DATA
3052- download Unicode 10.0 files into $UNICODE_DATA
3053  + subfolders: ucd, uca, idna, security
3054  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
3055- download emoji 5.0 files into $UNICODE_DATA/emoji
3056
3057* for manual diffs: remove version suffixes from the file names
3058  ~$ unidata/desuffixucd.py $UNICODE_DATA
3059  (see https://sites.google.com/site/unicodetools/inputdata)
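What the suffix removal amounts to, as a sketch (this is not the code of desuffixucd.py; the regular expression, the function name and the example file names are assumptions): strip a trailing "-<version>" from the base name and leave everything else alone.

```python
import re

# Matches "-10.0.0" or "-10.0.0d3" directly before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?:d\d+)?(?=\.[A-Za-z]+$)")


def desuffix(file_name):
    """Returns file_name without a version suffix before its extension."""
    return _VERSION_SUFFIX.sub("", file_name)


for name in ("UnicodeData-10.0.0d3.txt", "emoji-data.txt"):
    print(name, "->", desuffix(name))
```

Hyphens that are part of the real name (as in emoji-data.txt) are kept because the pattern requires digits right after the hyphen.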
3060
3061* process and/or copy files
3062- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
3063  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3064  + For debugging, and tweaking how ppucd.txt is written,
3065    the tool has an --only_ppucd option:
3066    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
3067
3068- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
3069
3070* build ICU (make install)
3071  so that the tools build can pick up the new definitions from the installed header files.
3072
3073  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
3074
3075* preparseucd.py changes
3076- remove or add new Unicode scripts from/to the
3077  only-in-ISO-15924 list according to the error messages:
3078    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
3079  -> adjust _scripts_only_in_iso15924 as indicated
3080- fix other errors
3081    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
3082  -> add vo=Vertical_Orientation to _ignored_properties
3083  -> later removed again: the file is now parsed, even though we do not yet store its data for runtime use
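The first error above points to a consistency check between the scripts found in the UCD and the only-in-ISO-15924 list. A sketch of such a check (the function and its arguments are hypothetical, and preparseucd.py may implement this differently):

```python
def check_scripts_only_in_iso15924(ucd_scripts, only_in_iso15924):
    """Raises ValueError if a script listed as ISO-only now occurs in the UCD."""
    stale = sorted(set(only_in_iso15924) & set(ucd_scripts))
    if stale:
        raise ValueError("remove %s from _scripts_only_in_iso15924" % stale)


try:
    # Made-up inputs: Nshu is both in the UCD set and in the ISO-only list.
    check_scripts_only_in_iso15924({"Latn", "Nshu"}, {"Nshu", "Hanb"})
except ValueError as e:
    print(e)
```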
3084
3085* new constants for new property values
3086- preparseucd.py error:
3087    ValueError: missing uchar.h enum constants for some property values:
3088    [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
3089                   u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
3090     (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
3091                  u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
3092                  u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
3093     (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
3094  = PropertyValueAliases.txt new property values (diff old & new .txt files)
3095    blk; CJK_Ext_F                        ; CJK_Unified_Ideographs_Extension_F
3096    blk; Kana_Ext_A                       ; Kana_Extended_A
3097    blk; Masaram_Gondi                    ; Masaram_Gondi
3098    blk; Nushu                            ; Nushu
3099    blk; Soyombo                          ; Soyombo
3100    blk; Syriac_Sup                       ; Syriac_Supplement
3101    blk; Zanabazar_Square                 ; Zanabazar_Square
3102  -> add to uchar.h
3103    use long property names for enum constants,
3104    for the trailing comment get the block start code point: diff old & new Blocks.txt
3105  -> add to UCharacter.UnicodeBlock IDs
3106    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3107            replace  public static final int \1_ID = \2; \3
3108  -> add to UCharacter.UnicodeBlock objects
3109    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
3110            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3111
3112    jg ; Malayalam_Bha                    ; Malayalam_Bha
3113    jg ; Malayalam_Ja                     ; Malayalam_Ja
3114    jg ; Malayalam_Lla                    ; Malayalam_Lla
3115    jg ; Malayalam_Llla                   ; Malayalam_Llla
3116    jg ; Malayalam_Nga                    ; Malayalam_Nga
3117    jg ; Malayalam_Nna                    ; Malayalam_Nna
3118    jg ; Malayalam_Nnna                   ; Malayalam_Nnna
3119    jg ; Malayalam_Nya                    ; Malayalam_Nya
3120    jg ; Malayalam_Ra                     ; Malayalam_Ra
3121    jg ; Malayalam_Ssa                    ; Malayalam_Ssa
3122    jg ; Malayalam_Tta                    ; Malayalam_Tta
3123  -> uchar.h & UCharacter.JoiningGroup
3124
3125    sc ; Gonm                             ; Masaram_Gondi
3126    sc ; Nshu                             ; Nushu
3127    sc ; Soyo                             ; Soyombo
3128    sc ; Zanb                             ; Zanabazar_Square
3129  -> uscript.h & com.ibm.icu.lang.UScript
3130  -> Nushu had been added already
3131  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3132      and in com.ibm.icu.dev.test.lang.TestUScript.java
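The two Eclipse find/replace steps for UCharacter.UnicodeBlock above can also be scripted. A sketch with Python's re module, using the same patterns (the sample input line only imitates the uchar.h enum layout; its numeric value and trailing comment are illustrative, not copied from uchar.h):

```python
import re

# Same patterns as in the Eclipse find fields above.
ID_PATTERN = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJECT_PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_java_id(line):
    """Turns a uchar.h UBLOCK_ enum line into a UnicodeBlock ..._ID constant."""
    # \g<1> is Python's unambiguous spelling of Eclipse's \1.
    return ID_PATTERN.sub(r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def to_java_object(line):
    """Turns a uchar.h UBLOCK_ enum line into a UnicodeBlock object declaration."""
    return OBJECT_PATTERN.sub(
        r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>',
        line)


sample = "    UBLOCK_SOYOMBO = 278, /*[11A50]*/"  # illustrative input line
print(to_java_id(sample))
print(to_java_object(sample))
```

Leading whitespace is preserved because the patterns only match from UBLOCK_ onward, so the output can be pasted into the Java source with its indentation intact.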
3133
3134* New properties as shown in PropertyValueAliases.txt changes
3135- boolean Emoji_Component from emoji 5
3136  -> uchar.h & UProperty.java
3137- boolean
3138    # Regional_Indicator (RI)
3139
3140    RI ; N                                ; No                               ; F                                ; False
3141    RI ; Y                                ; Yes                              ; T                                ; True
3142  -> uchar.h & UProperty.java
3143  -> single immutable range, to be hardcoded
3144- boolean
3145    # Prepended_Concatenation_Mark (PCM)
3146
3147    PCM; N                                ; No                               ; F                                ; False
3148    PCM; Y                                ; Yes                              ; T                                ; True
3149  -> was new in Unicode 9
3150  -> uchar.h & UProperty.java
3151- enumerated
3152    # Vertical_Orientation (vo)
3153
3154    vo ; R                                ; Rotated
3155    vo ; Tr                               ; Transformed_Rotated
3156    vo ; Tu                               ; Transformed_Upright
3157    vo ; U                                ; Upright
3158  -> only pre-parsed for now, but not yet stored for runtime use
3159
3160* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
3161    (not strictly necessary for NOT_ENCODED scripts)
3162  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
3163
3164* generate normalization data files
3165  cd $ICU_ROOT/dbg/icu4c
3166  bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
3167  bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
3168  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
3169  bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3170  bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
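Each .nrm file is built from a stack of input files, always starting with nfc.txt. The same command lines expressed as a table (a sketch; it uses only the options shown above, leaves out the --csource header for nfc, and the helper name is made up):

```python
import posixpath

# Output file -> ordered list of norm2 input files, as in the commands above.
NORM2_FILES = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(data_in, unidata):
    """Returns one argv list per .nrm file; does not run anything."""
    src = posixpath.join(unidata, "norm2")
    return [
        ["bin/gennorm2", "-o", posixpath.join(data_in, out), "-s", src] + inputs
        for out, inputs in NORM2_FILES
    ]


for argv in gennorm2_commands("$ICU4C_DATA_IN", "$ICU4C_UNIDATA"):
    print(" ".join(argv))
```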
3171
3172* build ICU (make install)
3173  so that the tools build can pick up the new definitions from the installed header files.
3174
3175  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
3176
3177* build Unicode tools using CMake+make
3178
3179$ICU_SRC/tools/unicode/c/icudefs.txt:
3180
3181# Location (--prefix) of where ICU was installed.
3182set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
3183# Location of the ICU4C source tree.
3184set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
3185
3186  $ICU_ROOT/dbg/tools/unicode/c$
3187    cmake ../../../../src/tools/unicode/c
3188    make
3189
3190* generate core properties data files
3191  $ICU_ROOT/dbg/tools/unicode/c$
3192    genprops/genprops $ICU_SRC/icu4c
3193    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
3194    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
3195- rebuild ICU (make install) & tools
3196
3197* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3198  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3199- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3200- Unicode 6.0..10.0: U+2260, U+226E, U+226F
3201- nothing new in this Unicode version, no test file to update
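The grep for non-ASCII "disallowed_STD3_valid" entries can be sketched as follows (not an ICU tool; the function name is made up, and the sample lines only imitate the IdnaMappingTable.txt layout of "code point(s) ; status ; ... # comment"):

```python
def non_ascii_std3_valid(lines):
    """Returns (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start = int(start, 16)
        end = int(end, 16) if end else start
        if start > 0x7F:
            found.append((start, end))
    return found


# Illustrative lines, not copied from the data file.
sample = [
    "003D          ; disallowed_STD3_valid                  # EQUALS SIGN",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
    "00C0          ; mapped                 ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE",
]
for start, end in non_ascii_std3_valid(sample):
    print("U+%04X..U+%04X" % (start, end))
```

Any range reported beyond U+2260, U+226E and U+226F would mean that the tests need new cases.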
3202
3203* run & fix ICU4C tests
3204- Andy handles RBBI & spoof check test failures
3205
3206* collation: CLDR collation root, UCA DUCET
3207
3208- UCA DUCET goes into Mark's Unicode tools, see
3209  https://sites.google.com/site/unicodetools/home#TOC-UCA
3210- CLDR root data files are checked into $CLDR_SRC/common/uca/
3211    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
3212
3213- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3214    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
3215- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3216    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
3217    (note removing the underscore before "Rules")
3218    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
3219- restore TODO diffs in UCARules.txt
3220    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
3221- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3222  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3223  from the CLDR root files (..._CLDR_..._SHORT.txt)
3224    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
3225    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
3226    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
3227- if CLDR common/uca/unihan-index.txt changes, then update
3228  CLDR common/collation/root.xml <collation type="private-unihan">
3229  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
3230
3231- run genuca, see command line above;
3232  deal with
3233    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
3234    FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)
3235        (add the character to genuca.cpp sampleCharsToScripts[])
3236  + look up the USCRIPT_ code for the new sample characters
3237    (should be obvious from the comment in the error output)
3238  + *add* mappings to sampleCharsToScripts[], do not replace them
3239    (in case the script sample characters flip-flop)
3240  + insert new scripts in DUCET script order, see the top_byte table
3241    at the beginning of FractionalUCA.txt
3242- rebuild ICU4C
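For the genuca error above, the sample character and the script name can be read straight off the FractionalUCA.txt line quoted in the message. A sketch (the regular expression and the function name are assumptions, and this is not part of genuca):

```python
import re

# "FDD1 <sample char>; [<weights>] # <Script_Name> first primary ..."
_FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6});\s*\[[^\]]*\]\s*#\s*(\S+) first primary")


def parse_first_primary(line):
    """Returns (sample code point, script name), or None if the line does not match."""
    m = _FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)


line = "FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)"
cp, script = parse_first_primary(line)
print("U+%04X -> %s" % (cp, script))
```

The script name in the comment indicates which USCRIPT_ constant to pair with the sample character when adding the mapping to sampleCharsToScripts[].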
3243
3244* Unihan collators
3245    https://sites.google.com/site/unicodetools/unihan
3246- run Unicode Tools
3247    org.unicode.draft.GenerateUnihanCollators
3248  with VM arguments
3249    -ea
3250    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
3251    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
3252    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
3253    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
3254    -DUVERSION=10.0.0
3255- run Unicode Tools
3256    org.unicode.draft.GenerateUnihanCollatorFiles
3257  with the same arguments
3258- check CLDR diffs
3259    cd $CLDR_SRC
3260    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
3261    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
3262- copy to CLDR
3263    cd $CLDR_SRC
3264    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
3265    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
3266- run CLDR unit tests, commit to CLDR
3267- generate ICU zh collation data: run CLDR
3268    org.unicode.cldr.icu.NewLdml2IcuConverter
3269  with program arguments
3270    -t collation
3271    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
3272    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
3273    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
3274    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
3275    zh
3276  and VM arguments
3277    -ea
3278    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
3279- rebuild ICU4C
3280
3281* run & fix ICU4C tests, now with new CLDR collation root data
3282- run all tests with the collation test data *_SHORT.txt or the full files
3283  (the full ones have comments, useful for debugging)
3284- note on intltest: if collate/UCAConformanceTest fails, then
3285  utility/MultithreadTest/TestCollators will fail as well;
3286  fix the conformance test before looking into the multi-thread test
3287
3288* update Java data files
3289- refresh just the UCD/UCA-related/derived files, just to be safe
3290- see (ICU4C)/source/data/icu4j-readme.txt
3291- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3292- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3293  output:
3294    ...
3295    Unicode .icu files built to ./out/build/icudt60l
3296    echo timestamp > uni-core-data
3297    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
3298    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
3299    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3300    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
3301    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
3302    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
3303    mkdir -p /tmp/icu4j/main/shared/data
3304    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3305    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
3306    mkdir -p /tmp/icu4j/main/shared/data
3307    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3308    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
3309- copy the big-endian Unicode data files to another location,
3310  separate from the other data files,
3311  and then refresh ICU4J
3312    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
3313    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3314    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3315    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3316    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3317    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3318    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3319    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3320    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3321    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
3322
3323* When refreshing all of ICU4J data from ICU4C
3324- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3325- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
3326or
3327- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
3328
3329* update CollationFCD.java
3330  + copy & paste the initializers of lcccIndex[] etc. from
3331    ICU4C/source/i18n/collationfcd.cpp to
3332    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
3333
3334* refresh Java test .txt files
3335- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3336    cd $ICU_SRC/icu4c/source/data/unidata
3337    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3338    cd ../../test/testdata
3339    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3340    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3341
3342* run & fix ICU4J tests
3343
3344*** API additions
3345- send notice to icu-design about new born-@stable API (enum constants etc.)
3346
3347*** CLDR numbering systems
3348- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
3349  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
3350  Unicode 9: http://unicode.org/cldr/trac/ticket/9692
3351
3352*** merge the Unicode update branches back onto the trunk
3353- do not merge the icudata.jar and testdata.jar,
3354  instead rebuild them from merged & tested ICU4C
3355- make sure that changes to Unicode tools are checked in:
3356  http://www.unicode.org/utility/trac/log/trunk/unicodetools
3357
3358---------------------------------------------------------------------------- ***
3359
3360Emoji 5.0 update for ICU 59
3361- ICU 59 mostly remains on Unicode 9.0
3362- except that it updates bidi and segmentation data to Unicode 10 beta
3363
3364First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
3365
3366* Command-line environment setup
3367
3368ICU_ROOT=~/svn.icu/trunk
3369ICU_SRC_DIR=$ICU_ROOT/src
3370ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
3371ICUDT=icudt59b
3372export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3373SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
3374UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
3375
3376*** ICU Trac
3377
3378- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
3379- changes directly on trunk
3380
3381*** data files & enums & parser code
3382
3383* download files
3384
3385- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
3386- download emoji 5.0 beta files into the same uni90e50 folder
3387- download Unicode 10.0 beta files: ucd
3388  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
3389    BidiBrackets.txt
3390    BidiCharacterTest.txt
3391    BidiMirroring.txt
3392    BidiTest.txt
3393    extracted/DerivedBidiClass.txt
3394  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
3395    LineBreak.txt
3396    auxiliary/*
3397
3398* preparseucd.py changes
3399- adjust for combined trunks
3400- write new copyright lines
3401- ignore new Emoji_Component property for now
3402
3403* process and/or copy files
3404- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
3405  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3406
3407- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
3408
3409* build ICU (make install)
3410  so that the tools build can pick up the new definitions from the installed header files.
3411
3412  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
3413
3414* build Unicode tools using CMake+make
3415
3416~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
3417
3418# Location (--prefix) of where ICU was installed.
3419set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
3420# Location of the ICU4C source tree.
3421set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
3422
3423  ~/svn.icu/trunk/dbg/tools/unicode/c$
3424    cmake ../../../../src/tools/unicode/c
3425    make
3426
3427* generate core properties data files
3428  ~/svn.icu/trunk/dbg/tools/unicode/c$
3429    genprops/genprops $ICU4C_SRC_DIR
3430- rebuild ICU (make install) & tools
3431
3432* run & fix ICU4C tests
3433- Andy handles RBBI & spoof check test failures
3434
3435* update Java data files
3436- refresh just the UCD/UCA-related/derived files, just to be safe
3437- see (ICU4C)/source/data/icu4j-readme.txt
3438- mkdir /tmp/icu4j
3439- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3440  output:
3441    ...
3442    Unicode .icu files built to ./out/build/icudt59l
3443    echo timestamp > uni-core-data
3444    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
3445    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
3446    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3447    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
3448    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
3449    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
3450    mkdir -p /tmp/icu4j/main/shared/data
3451    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3452    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
3453    mkdir -p /tmp/icu4j/main/shared/data
3454    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3455    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
3456- copy the big-endian Unicode data files to another location,
3457  separate from the other data files,
3458  and then refresh ICU4J
3459    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
3460    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3461    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3462    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3463    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3464    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3465    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
3466
3467* When refreshing all of ICU4J data from ICU4C
3468- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3469- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
3470or
3471- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
3472
3473* refresh Java test .txt files
3474- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3475    cd $ICU4C_SRC_DIR/source/data/unidata
3476    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3477    cd ../../test/testdata
3478    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3479    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
3480
3481* run & fix ICU4J tests
3482
3483---------------------------------------------------------------------------- ***
3484
3485Unicode 9.0 update for ICU 58
3486
3487* Command-line environment setup
3488
3489ICU_ROOT=~/svn.icu/trunk
3490ICU_SRC_DIR=$ICU_ROOT/src
3491ICUDT=icudt58b
3492export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3493SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3494UNIDATA=$ICU_SRC_DIR/source/data/unidata
3495
3496http://www.unicode.org/review/pri323/  -- beta review
3497http://www.unicode.org/reports/uax-proposed-updates.html
3498http://www.unicode.org/versions/beta-9.0.0.html
3499http://www.unicode.org/versions/Unicode9.0.0/
3500http://www.unicode.org/reports/tr44/tr44-17.html
3501
3502*** ICU Trac
3503
3504- ticket:12526: integrate Unicode 9
3505- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
3506- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
3507
3508*** CLDR Trac
3509
3510- cldrbug 9414: UCA 9
3511- ^/branches/markus/uni90 at r11518 from trunk at r11517
3512
3513- cldrbug 8745: Unicode 9.0 script metadata
3514
3515*** Unicode version numbers
3516- makedata.mak
3517- uchar.h
3518- com.ibm.icu.util.VersionInfo
3519- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3520
3521- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3522  so that the makefiles see the new version number.
3523
3524*** data files & enums & parser code
3525
3526* file preparation
3527
3528- download UCD & IDNA files
3529- make sure that the Unicode data folder passed into preparseucd.py
3530  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3531- only for manual diffs: remove version suffixes from the file names
3532  ~/unidata/uni70/20140403$ ../../desuffixucd.py .
3533  (see https://sites.google.com/site/unicodetools/inputdata)
3534- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
3535- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
3536  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3537
3538- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
3539  and copy to $UNIDATA
3540    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
3541
3542* preparseucd.py changes
3543- remove or add new Unicode scripts from/to the
3544  only-in-ISO-15924 list according to the error messages:
3545    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
3546    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
3547    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
3548    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
3549  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3550      and in com.ibm.icu.dev.test.lang.TestUScript.java
3551- DerivedNumericValues.txt new numeric values
3552    0D58          ; 0.00625 ; ; 1/160 # No       MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
3553    0D59          ; 0.025 ; ; 1/40 # No       MALAYALAM FRACTION ONE FORTIETH
3554    0D5A          ; 0.0375 ; ; 3/80 # No       MALAYALAM FRACTION THREE EIGHTIETHS
3555    0D5B          ; 0.05 ; ; 1/20 # No       MALAYALAM FRACTION ONE TWENTIETH
3556    0D5D          ; 0.15 ; ; 3/20 # No       MALAYALAM FRACTION THREE TWENTIETHS
3557  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
3558     uchar.c, UCharacterProperty.java
3559     to support a new series of values
3560- adjust preparseucd.py for Tangut algorithmic names
3561  in ppucd.txt:
3562    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
3563  ->
3564    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
3565- avoid block-compressing most String/Miscellaneous property values,
3566  triggered by genprops not coping with a multi-code point Case_Folding on
3567    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
3568  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
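A quick cross-check of the new DerivedNumericValues.txt entries listed above (a sketch, not ICU code): the decimal and the fraction in each line should agree exactly. The second assertion records an observation read off these five values, namely that every denominator is 20 times a power of two; the log does not say how the new series is actually encoded in uprops.h.

```python
from fractions import Fraction

# (code point, decimal field, fraction field) from the lines above.
NEW_VALUES = [
    ("0D58", "0.00625", "1/160"),
    ("0D59", "0.025", "1/40"),
    ("0D5A", "0.0375", "3/80"),
    ("0D5B", "0.05", "1/20"),
    ("0D5D", "0.15", "3/20"),
]


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def check(values):
    """Returns True if every entry is self-consistent; raises AssertionError otherwise."""
    for cp, decimal, fraction in values:
        frac = Fraction(fraction)
        # The decimal and fraction fields must denote the same rational number.
        assert Fraction(decimal) == frac, cp
        # Observation only: denominator = 20 * 2^k for these values.
        assert frac.denominator % 20 == 0, cp
        assert is_power_of_two(frac.denominator // 20), cp
    return True


print(check(NEW_VALUES))
```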
3569
3570* PropertyAliases.txt changes
3571- 1 new property PCM=Prepended_Concatenation_Mark
3572  Ignore: Only useful for layout engines.
3573  Ok to list in ppucd.txt.
3574
3575* PropertyValueAliases.txt new property values
3576    blk; Adlam                            ; Adlam
3577    blk; Bhaiksuki                        ; Bhaiksuki
3578    blk; Cyrillic_Ext_C                   ; Cyrillic_Extended_C
3579    blk; Glagolitic_Sup                   ; Glagolitic_Supplement
3580    blk; Ideographic_Symbols              ; Ideographic_Symbols_And_Punctuation
3581    blk; Marchen                          ; Marchen
3582    blk; Mongolian_Sup                    ; Mongolian_Supplement
3583    blk; Newa                             ; Newa
3584    blk; Osage                            ; Osage
3585    blk; Tangut                           ; Tangut
3586    blk; Tangut_Components                ; Tangut_Components
3587  -> add to uchar.h
3588    use long property names for enum constants
3589  -> add to UCharacter.UnicodeBlock IDs
3590    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3591            replace  public static final int \1_ID = \2; \3
3592  -> add to UCharacter.UnicodeBlock objects
3593    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
3594            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
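
The two Eclipse find/replace pairs above can be tried outside Eclipse. A minimal Python
sketch with the same patterns follows; the sample input line is only an illustration of the
enum layout in uchar.h, and the function names are made up for this note. Python needs
\g<1> where a group reference is directly followed by "_ID".

```python
import re

# Illustrative line in the layout of the UBlockCode enum in uchar.h.
C_LINE = "UBLOCK_ADLAM = 263, /*[1E900]*/"

# Same patterns as the Eclipse find/replace pairs for IDs and objects.
ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
ID_REPLACE = r"public static final int \g<1>_ID = \2; \3"

OBJECT_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
OBJECT_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \g<1>_ID); \2'
)


def to_java_id(line):
    """Turn a C enum line into a UCharacter.UnicodeBlock ID constant."""
    return re.sub(ID_FIND, ID_REPLACE, line)


def to_java_object(line):
    """Turn a C enum line into a UCharacter.UnicodeBlock object declaration."""
    return re.sub(OBJECT_FIND, OBJECT_REPLACE, line)


print(to_java_id(C_LINE))
print(to_java_object(C_LINE))
```

Lines that do not match (for example comment-only lines) pass through unchanged.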
3595
3596    GCB; EB                               ; E_Base
3597    GCB; EBG                              ; E_Base_GAZ
3598    GCB; EM                               ; E_Modifier
3599    GCB; GAZ                              ; Glue_After_Zwj
3600    GCB; ZWJ                              ; ZWJ
3601  -> uchar.h & UCharacter.GraphemeClusterBreak
3602
3603    jg ; African_Feh                      ; African_Feh
3604    jg ; African_Noon                     ; African_Noon
3605    jg ; African_Qaf                      ; African_Qaf
3606  -> uchar.h & UCharacter.JoiningGroup
3607
3608    lb ; EB                               ; E_Base
3609    lb ; EM                               ; E_Modifier
3610    lb ; ZWJ                              ; ZWJ
3611  -> uchar.h & UCharacter.LineBreak
3612
3613    sc ; Adlm                             ; Adlam
3614    sc ; Bhks                             ; Bhaiksuki
3615    sc ; Marc                             ; Marchen
3616    sc ; Newa                             ; Newa
3617    sc ; Osge                             ; Osage
3618    sc ; Tang                             ; Tangut
3619  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
3620
3621    WB ; EB                               ; E_Base
3622    WB ; EBG                              ; E_Base_GAZ
3623    WB ; EM                               ; E_Modifier
3624    WB ; GAZ                              ; Glue_After_Zwj
3625    WB ; ZWJ                              ; ZWJ
3626  -> uchar.h & UCharacter.WordBreak
3627
3628* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
3629    (not strictly necessary for NOT_ENCODED scripts)
3630  ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
3631
3632* generate normalization data files
3633  cd $ICU_ROOT/dbg
3634  bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
3635  bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
3636  bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
3637  bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3638  bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
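
The five gennorm2 commands above layer their inputs: every output starts from nfc.txt,
and nfkc_cf builds on the nfkc inputs. The Python sketch below assembles the same argument
lists to make that layering explicit. The function name gennorm2_commands and the
placeholder paths are assumptions for illustration; only the options shown above
(-o, -s, --csource) are used.

```python
import posixpath


def gennorm2_commands(icu_src_dir, src_data_in, unidata):
    """Build the five gennorm2 argument lists from the step above."""
    norm2_dir = posixpath.join(unidata, "norm2")

    def command(output, inputs, extra=()):
        return ["bin/gennorm2", "-o", output, "-s", norm2_dir, *inputs, *extra]

    nfc = ["nfc.txt"]
    nfkc = nfc + ["nfkc.txt"]  # NFKC data is layered on top of NFC
    return [
        command(posixpath.join(icu_src_dir, "source/common/norm2_nfc_data.h"),
                nfc, ["--csource"]),
        command(posixpath.join(src_data_in, "nfc.nrm"), nfc),
        command(posixpath.join(src_data_in, "nfkc.nrm"), nfkc),
        command(posixpath.join(src_data_in, "nfkc_cf.nrm"), nfkc + ["nfkc_cf.txt"]),
        command(posixpath.join(src_data_in, "uts46.nrm"), nfc + ["uts46.txt"]),
    ]


# Placeholder paths; substitute the values of the environment variables.
for argv in gennorm2_commands("/icu/src", "/icu/src/source/data/in",
                              "/icu/src/source/data/unidata"):
    print(" ".join(argv))
```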
3639
3640* build ICU (make install)
3641  so that the tools build can pick up the new definitions from the installed header files.
3642
3643  $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
3644
3645* build Unicode tools using CMake+make
3646
3647~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
3648
3649  # Location (--prefix) of where ICU was installed.
3650  set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
3651  # Location of the ICU source tree.
3652  set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
3653
3654  ~/svn.icutools/trunk/dbg/unicode/c$
3655    cmake ../../../src/unicode/c
3656    make
3657
3658* generate core properties data files
3659  ~/svn.icutools/trunk/dbg/unicode/c$
3660    genprops/genprops $ICU_SRC_DIR
3661    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
3662    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
3663- rebuild ICU (make install) & tools
3664
3665* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3666  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3667- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3668- Unicode 6.0..9.0: U+2260, U+226E, U+226F
3669- nothing new in 9.0, no test file to update
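
The grep above can also be done with a small filter that skips the ASCII entries.
This is a Python sketch, assuming the "code point(s) ; status ; ... # comment" line layout;
the sample lines are modeled on that layout for illustration and are not copied from
IdnaMappingTable.txt. The function name is made up for this note.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) code point ranges whose status is
    disallowed_STD3_valid and that reach beyond ASCII."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            hits.append((start, end))
    return hits


# Illustrative sample lines (assumed layout, not real file content).
SAMPLE = [
    "003D          ; disallowed_STD3_valid   # EQUALS SIGN",
    "00E0..00F6    ; valid                   # LATIN SMALL LETTER A WITH GRAVE..",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
]

for start, end in non_ascii_std3_valid(SAMPLE):
    print("U+%04X..U+%04X" % (start, end))
```

Any range in the output that is not already covered by uts46test.cpp and UTS46Test.java
is a candidate for a new test case.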
3670
3671* run & fix ICU4C tests
3672- Andy handles RBBI & spoof check test failures
3673
3674* collation: CLDR collation root, UCA DUCET
3675
3676- UCA DUCET goes into Mark's Unicode tools, see
3677  https://sites.google.com/site/unicodetools/home#TOC-UCA
3678- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
3679    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
3680
3681- cd (CLDR UCA branch)/common/uca/
3682- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3683    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (the target file name drops the underscore before "Rules");
  first save the old file, for restoring the TODO diffs in the next step
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3688- restore TODO diffs in UCARules.txt
3689    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3690- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3691  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3692  from the CLDR root files (..._CLDR_..._SHORT.txt)
3693    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
3694    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
3695    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
3696- if CLDR common/uca/unihan-index.txt changes, then update
3697  CLDR common/collation/root.xml <collation type="private-unihan">
3698  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
3699
3700- run genuca, see command line above;
3701  deal with
3702    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
3703    FDD1 104B5;     [75 B8 02, 05, 05]      # Osage first primary (compressible)
3704        (add the character to genuca.cpp sampleCharsToScripts[])
3705  + look up the USCRIPT_ code for the new sample characters
3706    (should be obvious from the comment in the error output)
3707  + *add* mappings to sampleCharsToScripts[], do not replace them
3708    (in case the script sample characters flip-flop)
3709  + insert new scripts in DUCET script order, see the top_byte table
3710    at the beginning of FractionalUCA.txt
3711- rebuild ICU4C
3712
3713* Unihan collators
3714- run Unicode Tools
3715    org.unicode.draft.GenerateUnihanCollators
3716  with VM arguments
3717    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
3718    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
3719    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
3720    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
3721    -DUVERSION=9.0.0
3722    -ea
3723- run Unicode Tools
3724    org.unicode.draft.GenerateUnihanCollatorFiles
3725  with the same arguments
3726- check CLDR diffs
3727    cd ~/svn.cldr/trunk
3728    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
3729    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
3730- copy to CLDR
3731    cd ~/svn.cldr/trunk
3732    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
3733    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
3734- commit to CLDR
3735- generate ICU zh collation data: run CLDR
3736    org.unicode.cldr.icu.NewLdml2IcuConverter
3737  with program arguments
3738    -t collation
3739    -s /home/mscherer/svn.cldr/trunk/common/collation
3740    -m /home/mscherer/svn.cldr/trunk/common/supplemental
3741    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
3742    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
3743    zh
3744  and VM arguments
3745    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
3746- rebuild ICU4C
3747
3748* run & fix ICU4C tests, now with new CLDR collation root data
3749- run all tests with the collation test data *_SHORT.txt or the full files
3750  (the full ones have comments, useful for debugging)
3751- note on intltest: if collate/UCAConformanceTest fails, then
3752  utility/MultithreadTest/TestCollators will fail as well;
3753  fix the conformance test before looking into the multi-thread test
3754
3755* update Java data files
- to be safe, refresh only the UCD/UCA-related/derived files
3757- see (ICU4C)/source/data/icu4j-readme.txt
3758- mkdir /tmp/icu4j
3759- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3760  output:
3761    ...
3762    Unicode .icu files built to ./out/build/icudt58l
3763    echo timestamp > uni-core-data
3764    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
3765    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
3766    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
3767    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
3768    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
3769    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
3770    mkdir -p /tmp/icu4j/main/shared/data
3771    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3772    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
3773    mkdir -p /tmp/icu4j/main/shared/data
3774    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3775    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
3776- copy the big-endian Unicode data files to another location,
3777  separate from the other data files,
3778  and then refresh ICU4J
3779    cd ~/svn.icu/trunk/dbg/data/out/icu4j
3780    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3781    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3782    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3783    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3784    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
3785    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
3786    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3787    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
3788    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
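
The cp/rm sequence above selects a subset of the generated files. As a Python sketch,
the same selection can be written as a predicate on paths relative to
com/ibm/icu/impl/data/$ICUDT. The function name is_unicode_data_file and the sample
file names in the list are made up for this note.

```python
from pathlib import PurePosixPath


def is_unicode_data_file(rel_path):
    """True if the cp/rm sequence above would keep this file.
    rel_path is relative to com/ibm/icu/impl/data/$ICUDT."""
    path = PurePosixPath(rel_path)
    if len(path.parts) == 1:
        if path.name == "cnvalias.icu":  # copied by *.icu, then removed again
            return False
        return path.name == "confusables.cfu" or path.suffix in (".icu", ".nrm")
    # coll/* and brkitr/* are copied without recursing into subfolders.
    return len(path.parts) == 2 and path.parts[0] in ("coll", "brkitr")


# Hypothetical file names, only to exercise the predicate.
for name in ["uprops.icu", "cnvalias.icu", "nfc.nrm", "coll/root.res", "zone/root.res"]:
    print(name, is_unicode_data_file(name))
```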
3789
3790* When refreshing all of ICU4J data from ICU4C
3791- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3792- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3793or
3794- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3795
3796* update CollationFCD.java
3797  + copy & paste the initializers of lcccIndex[] etc. from
3798    ICU4C/source/i18n/collationfcd.cpp to
3799    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
3800
3801* refresh Java test .txt files
3802- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3803    cd $ICU_SRC_DIR/source/data/unidata
3804    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3805    cd ../../test/testdata
3806    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3807    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
3808
3809* run & fix ICU4J tests
3810
3811*** LayoutEngine script information
3812
3813* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3814  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3815  in the working directory.
3816
3817  (It also generates ScriptRunData.cpp, which is no longer needed.)
3818
3819  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
3820  (a plain text file)
3821  which maps ICU versions to the numbers of script/language constants
3822  that were added then.
3823  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
3824
3825  The generated files have a current copyright date and "@deprecated" statement.
3826
3827* Review changes, fix Java tool if necessary, and copy to ICU4C
3828  cd ~/svn.icu4j/trunk/src
3829  meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3830  cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
3831  cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
3832
3833*** API additions
3834- send notice to icu-design about new born-@stable API (enum constants etc.)
3835
3836*** merge the Unicode update branches back onto the trunk
3837- do not merge the icudata.jar and testdata.jar,
3838  instead rebuild them from merged & tested ICU4C
3839- make sure that changes to Unicode tools & ICU tools are checked in
3840  http://www.unicode.org/utility/trac/log/trunk/unicodetools
3841  http://bugs.icu-project.org/trac/log/tools/trunk
3842
3843---------------------------------------------------------------------------- ***
3844
3845New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764
3846
3847Adding
3848- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
3849- new combination/alias codes: Hanb, Jamo
3850  - used in CLDR 29 and in spoof checker
3851- new Z* code: Zsye
3852
3853Add new codes to uscript.h & UScript.java, see Unicode update logs.
3854  -> com.ibm.icu.lang.UScript
3855    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3856    replace  public static final int \1 = \2; \3
3857
3858Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
3859add new script codes.
3860"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
3861
3862Note: If we have to run preparseucd.py again before the Unicode 9 update,
3863then we need to manually keep/restore the new script codes.
3864
3865ICU_ROOT=~/svn.icu/trunk
3866ICU_SRC_DIR=$ICU_ROOT/src
3867ICUDT=icudt57b
3868export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3869SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3870UNIDATA=$ICU_SRC_DIR/source/data/unidata
3871
3872Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
3873see https://unicode-org.atlassian.net/browse/ICU-12141
3874
3875make install, then icutools cmake & make, then
3876~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
3877
3878Generate Java data as usual, only update pnames.icu & uprops.icu.
3879
3880*** LayoutEngine script information
3881
3882* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3883  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3884  in the working directory.
3885
3886  (It also generates ScriptRunData.cpp, which is no longer needed.)
3887
3888  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
3889  (a plain text file)
3890  which maps ICU versions to the numbers of script/language constants
3891  that were added then.
3892  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
3893
3894  The generated files have a current copyright date and "@deprecated" statement.
3895
3896* Review changes, fix Java tool if necessary, and copy to ICU4C
3897  cd ~/svn.icu4j/trunk/src
3898  meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3899  cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
3900  cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
3901
3902---------------------------------------------------------------------------- ***
3903
3904Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802
3905
3906Edit preparseucd.py to add & parse new properties.
3907They share the UCD property namespace but are not listed in PropertyAliases.txt.
3908
3909Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
3910Initial data from emoji/2.0/
3911
3912ICU_ROOT=~/svn.icu/trunk
3913ICU_SRC_DIR=$ICU_ROOT/src
3914ICUDT=icudt56b
3915export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3916SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3917UNIDATA=$ICU_SRC_DIR/source/data/unidata
3918
3919Add binary-property constants to uchar.h enum UProperty & UProperty.java.
3920
3921~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
3922(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
3923
3924Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
3925
3926make install, then icutools cmake & make, then
3927~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
3928
3929Generate Java data as usual, only update pnames.icu & uprops.icu.
3930
3931---------------------------------------------------------------------------- ***
3932
3933Unicode 8.0 update for ICU 56
3934
3935* Command-line environment setup
3936
3937ICU_ROOT=~/svn.icu/trunk
3938ICU_SRC_DIR=$ICU_ROOT/src
3939ICUDT=icudt56b
3940export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
3941SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
3942UNIDATA=$ICU_SRC_DIR/source/data/unidata
3943
3944http://www.unicode.org/review/pri297/  -- beta review
3945http://www.unicode.org/reports/uax-proposed-updates.html
3946http://unicode.org/versions/beta-8.0.0.html
3947http://www.unicode.org/versions/Unicode8.0.0/
3948http://www.unicode.org/reports/tr44/tr44-15.html
3949
3950*** ICU Trac
3951
3952- ticket:11574: Unicode 8
3953- C++ branches/markus/uni80 at r37351 from trunk at r37343
3954- Java branches/markus/uni80 at r37352 from trunk at r37338
3955
3956*** CLDR Trac
3957
3958- cldrbug 8311: UCA 8
3959- branches/markus/uni80 at r11518 from trunk at r11517
3960
3961- cldrbug 8109: Unicode 8.0 script metadata
3962- cldrbug 8418: Updated segmentation for Unicode 8.0
3963
3964*** Unicode version numbers
3965- makedata.mak
3966- uchar.h
3967- com.ibm.icu.util.VersionInfo
3968- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3969
3970- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3971  so that the makefiles see the new version number.
3972
3973*** data files & enums & parser code
3974
3975* file preparation
3976
3977- download UCD & IDNA files
3978- make sure that the Unicode data folder passed into preparseucd.py
3979  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3980- only for manual diffs: remove version suffixes from the file names
3981  ~/unidata/uni70/20140403$ ../../desuffixucd.py .
3982  (see https://sites.google.com/site/unicodetools/inputdata)
3983- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
3984- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
3985- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3986
3987- also: from http://unicode.org/Public/security/8.0.0/ download new
3988  confusables.txt & confusablesWholeScript.txt
3989  and copy to $UNIDATA
3990    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
3991    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
3992
3993* initial preparseucd.py changes
3994- remove new Unicode scripts from the
3995  only-in-ISO-15924 list according to the error message:
3996    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
3997    from _scripts_only_in_iso15924
3998  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3999      and in com.ibm.icu.dev.test.lang.TestUScript.java
4000- property and file name change:
4001    IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (fractions not in lowest terms)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to reduce them to lowest terms (e.g., 2/12 -> 1/6),
     which is how they are listed in DerivedNumericValues.txt;
     this keeps storage in the data file simple
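
The mapping of the Meroitic twelfths to the values in DerivedNumericValues.txt is plain
fraction reduction. A minimal Python sketch follows; it is not the code in preparseucd.py,
and the function name is made up for this note. It reads the numeric value from
field 8 of a UnicodeData.txt line.

```python
from fractions import Fraction


def reduced_numeric_value(unicode_data_line):
    """Return the numeric value (field 8) in lowest terms, or None if empty."""
    fields = unicode_data_line.split(";")
    numeric = fields[8]
    if not numeric:
        return None
    # Fraction() parses "n/d" and reduces to lowest terms.
    return str(Fraction(numeric))


LINES = [
    "109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;",
    "109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;",
    "109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;",
    "109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;",
]

for line in LINES:
    print(line.split(";")[0], reduced_numeric_value(line))
```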
4016
4017* PropertyValueAliases.txt changes
4018- 10 new Block (blk) values:
4019    blk; Ahom                             ; Ahom
4020    blk; Anatolian_Hieroglyphs            ; Anatolian_Hieroglyphs
4021    blk; Cherokee_Sup                     ; Cherokee_Supplement
4022    blk; CJK_Ext_E                        ; CJK_Unified_Ideographs_Extension_E
4023    blk; Early_Dynastic_Cuneiform         ; Early_Dynastic_Cuneiform
4024    blk; Hatran                           ; Hatran
4025    blk; Multani                          ; Multani
4026    blk; Old_Hungarian                    ; Old_Hungarian
4027    blk; Sup_Symbols_And_Pictographs      ; Supplemental_Symbols_And_Pictographs
4028    blk; Sutton_SignWriting               ; Sutton_SignWriting
4029  -> add to uchar.h
4030    use long property names for enum constants
4031  -> add to UCharacter.UnicodeBlock IDs
4032    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
4033            replace  public static final int \1_ID = \2; \3
4034  -> add to UCharacter.UnicodeBlock objects
4035    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
4036            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
4037- 6 new Script (sc) values:
4038    sc ; Ahom                             ; Ahom
4039    sc ; Hatr                             ; Hatran
4040    sc ; Hluw                             ; Anatolian_Hieroglyphs
4041    sc ; Hung                             ; Old_Hungarian
4042    sc ; Mult                             ; Multani
4043    sc ; Sgnw                             ; SignWriting
4044  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
4045
4046* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
4047    (not strictly necessary for NOT_ENCODED scripts)
4048  ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
4049
4050* generate normalization data files
4051  cd $ICU_ROOT/dbg
4052  bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
4053  bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
4054  bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
4055  bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
4056  bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
4057
4058* build ICU (make install)
4059  so that the tools build can pick up the new definitions from the installed header files.
4060
4061  $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
4062
4063* build Unicode tools using CMake+make
4064
4065~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
4066
4067  # Location (--prefix) of where ICU was installed.
4068  set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
4069  # Location of the ICU source tree.
4070  set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
4071
4072  ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
4073  ~/svn.icutools/trunk/dbg/unicode/c$ make
4074
4075* generate core properties data files
4076- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
4077- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
4078- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
4079- rebuild ICU (make install) & tools
4080- run genuca again (see step above) so that it picks up the new nfc.nrm
4081- rebuild ICU (make install) & tools
4082
4083* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
4084  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
4085- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
4086- Unicode 6.0..8.0: U+2260, U+226E, U+226F
4087- nothing new in 8.0, no test file to update
4088
4089* run & fix ICU4C tests
4090- bad Cherokee case folding due to difference in fallbacks:
4091  UCD case folding falls back to no mapping,
4092  ICU runtime case folding falls back to lowercasing;
4093  fixed casepropsbuilder.cpp to generate scf mappings to self
4094  when there is an slc mapping but no scf
4095- Andy handles RBBI & spoof check test failures
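
The Cherokee case folding problem above comes down to two different fallbacks.
The Python sketch below is a simplified model with plain dictionaries, not the ICU data
structures; the function names are made up for this note. It uses U+13A0 CHEROKEE LETTER A,
which lowercases to U+AB70, while U+AB70 case-folds to U+13A0.

```python
# Simple mappings for one Cherokee letter pair.
SLC = {0x13A0: 0xAB70}  # simple lowercase: U+13A0 -> U+AB70
SCF = {0xAB70: 0x13A0}  # simple case folding: U+AB70 -> U+13A0


def ucd_fold(cp, scf):
    """UCD semantics: no scf mapping means the character folds to itself."""
    return scf.get(cp, cp)


def runtime_fold(cp, scf, slc):
    """Modeled runtime semantics: no scf mapping falls back to lowercasing."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_self_folds(scf, slc):
    """Modeled builder fix: add scf mappings to self where there is slc but no scf."""
    fixed = dict(scf)
    for cp in slc:
        fixed.setdefault(cp, cp)
    return fixed


print(hex(ucd_fold(0x13A0, SCF)))           # UCD result
print(hex(runtime_fold(0x13A0, SCF, SLC)))  # differs before the fix
FIXED = add_self_folds(SCF, SLC)
print(hex(runtime_fold(0x13A0, FIXED, SLC)))  # agrees with the UCD after the fix
```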
4096
4097* collation: CLDR collation root, UCA DUCET
4098
4099- UCA DUCET goes into Mark's Unicode tools, see
4100  https://sites.google.com/site/unicodetools/home#TOC-UCA
4101- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
4102- cd (CLDR UCA branch)/common/uca/
4103- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (the target file name drops the underscore before "Rules");
  first save the old file, for restoring the TODO diffs in the next step
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
4109- restore TODO diffs in UCARules.txt
4110    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
4111- update (ICU4C)/source/test/testdata/CollationTest_*.txt
4112  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
4113  from the CLDR root files (..._CLDR_..._SHORT.txt)
4114    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
4115    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
4116    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
4117- if CLDR common/uca/unihan-index.txt changes, then update
4118  CLDR common/collation/root.xml <collation type="private-unihan">
4119  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
4120- run genuca, see command line above;
4121  deal with
4122    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
4123        (add the character to genuca.cpp sampleCharsToScripts[])
4124  + look up the script for the new sample characters
4125    (e.g., in FractionalUCA.txt)
4126  + *add* mappings to sampleCharsToScripts[], do not replace them
4127    (in case the script sample characters flip-flop)
4128  + insert new scripts in DUCET script order, see the top_byte table
4129    at the beginning of FractionalUCA.txt
4130- rebuild ICU4C
4131
4132* run & fix ICU4C tests, now with new CLDR collation root data
4133- run all tests with the collation test data *_SHORT.txt or the full files
4134  (the full ones have comments, useful for debugging)
4135- note on intltest: if collate/UCAConformanceTest fails, then
4136  utility/MultithreadTest/TestCollators will fail as well;
4137  fix the conformance test before looking into the multi-thread test
4138- fixed bug in CollationWeights::getWeightRanges()
4139  exposed by new data and CollationTest::TestRootElements
4140
4141* update Java data files
- to be safe, refresh only the UCD/UCA-related/derived files
4143- see (ICU4C)/source/data/icu4j-readme.txt
4144- mkdir /tmp/icu4j
4145- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
4146  output:
4147    ...
4148    Unicode .icu files built to ./out/build/icudt56l
4149    echo timestamp > uni-core-data
4150    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
4151    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
4152    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
4153    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
4154    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
4155    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
4156    mkdir -p /tmp/icu4j/main/shared/data
4157    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
4158    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
4159    mkdir -p /tmp/icu4j/main/shared/data
4160    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
4161    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
  because the layout engine was deprecated in ICU 54.
  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
  to write lines that we used to add manually.

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
  cd ~/svn.icu4j/trunk/src
  meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
  cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
  http://www.unicode.org/utility/trac/log/trunk/unicodetools
  http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/  -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
  ~/unidata/uni70/20140403$ ../../desuffixucd.py .
  (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/
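
Sketch of the "remove version suffixes" step, for readers who do not have
desuffixucd.py at hand. This is not the real script: the suffix shape
("-<major>.<minor>.<update>" with an optional "d<N>" draft marker) and the
function name strip_version_suffix are assumptions made for illustration only.

```python
import re

# Assumed suffix shape, e.g. "UnicodeData-7.0.0d3.txt"; the real
# desuffixucd.py may recognize other forms.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a trailing version suffix before the extension."""
    return _VERSION_SUFFIX.sub("", file_name)


if __name__ == "__main__":
    for name in ("UnicodeData-7.0.0d3.txt", "ReadMe.txt"):
        print(name, "->", strip_version_suffix(name))
```

File names without such a suffix pass through unchanged, so the function is
safe to apply to a whole folder listing.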

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
                        'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
                        'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year
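
Sketch of the kind of consistency check that produces the ValueError quoted
above. The function name and parameters are hypothetical; only the list name
_scripts_only_in_iso15924 and the wording of the message come from this log.
The real check in preparseucd.py may be structured differently.

```python
def check_scripts_only_in_iso15924(only_in_iso15924, unicode_scripts):
    """Raise if a script listed as ISO-15924-only is now a Unicode script.

    Both arguments are sets of four-letter script codes (hypothetical inputs).
    """
    stale = sorted(only_in_iso15924 & unicode_scripts)
    if stale:
        raise ValueError("remove %s from _scripts_only_in_iso15924" % stale)


if __name__ == "__main__":
    try:
        # 'Mani' became a Unicode script in 7.0, so it must leave the list.
        check_scripts_only_in_iso15924({"Ahom", "Hatr", "Mani"}, {"Latn", "Mani"})
    except ValueError as e:
        print(e)
```

The fix is always the same: delete the reported codes from the list and rerun
the script.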

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah                        ; Bassa_Vah
    blk; Caucasian_Albanian               ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers             ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext                 ; Combining_Diacritical_Marks_Extended
    blk; Duployan                         ; Duployan
    blk; Elbasan                          ; Elbasan
    blk; Geometric_Shapes_Ext             ; Geometric_Shapes_Extended
    blk; Grantha                          ; Grantha
    blk; Khojki                           ; Khojki
    blk; Khudawadi                        ; Khudawadi
    blk; Latin_Ext_E                      ; Latin_Extended_E
    blk; Linear_A                         ; Linear_A
    blk; Mahajani                         ; Mahajani
    blk; Manichaean                       ; Manichaean
    blk; Mende_Kikakui                    ; Mende_Kikakui
    blk; Modi                             ; Modi
    blk; Mro                              ; Mro
    blk; Myanmar_Ext_B                    ; Myanmar_Extended_B
    blk; Nabataean                        ; Nabataean
    blk; Old_North_Arabian                ; Old_North_Arabian
    blk; Old_Permic                       ; Old_Permic
    blk; Ornamental_Dingbats              ; Ornamental_Dingbats
    blk; Pahawh_Hmong                     ; Pahawh_Hmong
    blk; Palmyrene                        ; Palmyrene
    blk; Pau_Cin_Hau                      ; Pau_Cin_Hau
    blk; Psalter_Pahlavi                  ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls        ; Shorthand_Format_Controls
    blk; Siddham                          ; Siddham
    blk; Sinhala_Archaic_Numbers          ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C                     ; Supplemental_Arrows_C
    blk; Tirhuta                          ; Tirhuta
    blk; Warang_Citi                      ; Warang_Citi
  -> add to uchar.h
    use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
            replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph                 ; Manichaean_Aleph
    jg ; Manichaean_Ayin                  ; Manichaean_Ayin
    jg ; Manichaean_Beth                  ; Manichaean_Beth
    jg ; Manichaean_Daleth                ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh               ; Manichaean_Dhamedh
    jg ; Manichaean_Five                  ; Manichaean_Five
    jg ; Manichaean_Gimel                 ; Manichaean_Gimel
    jg ; Manichaean_Heth                  ; Manichaean_Heth
    jg ; Manichaean_Hundred               ; Manichaean_Hundred
    jg ; Manichaean_Kaph                  ; Manichaean_Kaph
    jg ; Manichaean_Lamedh                ; Manichaean_Lamedh
    jg ; Manichaean_Mem                   ; Manichaean_Mem
    jg ; Manichaean_Nun                   ; Manichaean_Nun
    jg ; Manichaean_One                   ; Manichaean_One
    jg ; Manichaean_Pe                    ; Manichaean_Pe
    jg ; Manichaean_Qoph                  ; Manichaean_Qoph
    jg ; Manichaean_Resh                  ; Manichaean_Resh
    jg ; Manichaean_Sadhe                 ; Manichaean_Sadhe
    jg ; Manichaean_Samekh                ; Manichaean_Samekh
    jg ; Manichaean_Taw                   ; Manichaean_Taw
    jg ; Manichaean_Ten                   ; Manichaean_Ten
    jg ; Manichaean_Teth                  ; Manichaean_Teth
    jg ; Manichaean_Thamedh               ; Manichaean_Thamedh
    jg ; Manichaean_Twenty                ; Manichaean_Twenty
    jg ; Manichaean_Waw                   ; Manichaean_Waw
    jg ; Manichaean_Yodh                  ; Manichaean_Yodh
    jg ; Manichaean_Zayin                 ; Manichaean_Zayin
    jg ; Straight_Waw                     ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb                             ; Caucasian_Albanian
    sc ; Bass                             ; Bassa_Vah
    sc ; Dupl                             ; Duployan
    sc ; Elba                             ; Elbasan
    sc ; Gran                             ; Grantha
    sc ; Hmng                             ; Pahawh_Hmong
    sc ; Khoj                             ; Khojki
    sc ; Lina                             ; Linear_A
    sc ; Mahj                             ; Mahajani
    sc ; Mani                             ; Manichaean
    sc ; Mend                             ; Mende_Kikakui
    sc ; Modi                             ; Modi
    sc ; Mroo                             ; Mro
    sc ; Narb                             ; Old_North_Arabian
    sc ; Nbat                             ; Nabataean
    sc ; Palm                             ; Palmyrene
    sc ; Pauc                             ; Pau_Cin_Hau
    sc ; Perm                             ; Old_Permic
    sc ; Phlp                             ; Psalter_Pahlavi
    sc ; Sidd                             ; Siddham
    sc ; Sind                             ; Khudawadi
    sc ; Tirh                             ; Tirhuta
    sc ; Wara                             ; Warang_Citi
  -> uscript.h (many were added before)
    comment "Mende Kikakui" for USCRIPT_MENDE
    add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-11-01)
    Ahom        338     Ahom
    Hatr        127     Hatran
    Mult        323     Multani
  (added 2013-10-12)
    Modi        324     Modi
    Pauc        263     Pau Cin Hau
    Sidd        302     Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java
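
The find/replace pairs above can also be applied outside Eclipse. The sketch
below feeds the same three patterns to Python's re.sub; the rule names and the
convert() helper are made up for illustration, and the sample enum lines
(including their numeric values and comments) are illustrative rather than
copied from uchar.h or uscript.h.

```python
import re

# (find, replace) pairs exactly as listed in this log.
BLOCK_ID_RULE = (
    r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
    r"public static final int \1_ID = \2; \3",
)
BLOCK_OBJECT_RULE = (
    r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
)
SCRIPT_RULE = (
    r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
    r"public static final int \1 = \2; \3",
)


def convert(line, rule):
    """Rewrite one C enum line into the matching Java declaration."""
    find, replace = rule
    return re.sub(find, replace, line)


if __name__ == "__main__":
    c_block = "UBLOCK_BASSA_VAH = 221, /*[16AD0]*/"   # illustrative line
    c_script = "USCRIPT_KHUDAWADI = 145,/* Sind */"   # illustrative line
    print(convert(c_block, BLOCK_ID_RULE))
    print(convert(c_block, BLOCK_OBJECT_RULE))
    print(convert(c_script, SCRIPT_RULE))
```

Lines that do not match a pattern are returned unchanged, so the output still
needs the same manual review as the Eclipse result.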

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
  ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
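
The four .nrm command lines above differ only in output name and input list;
every data file layers its own .txt on top of nfc.txt. A small sketch that
makes this layering explicit and rebuilds the same argument lists (the table
and function names are hypothetical; the flags and file names are the ones
used above, and the --csource header build is left out):

```python
# Output file -> input files, in the order given on the command lines above.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_command(output, src_data_in, unidata):
    """Return the argument list for building one .nrm file."""
    return (["bin/gennorm2", "-o", src_data_in + "/" + output,
             "-s", unidata + "/norm2"] + NORM2_INPUTS[output])


if __name__ == "__main__":
    for name in NORM2_INPUTS:
        print(" ".join(gennorm2_command(name, "$SRC_DATA_IN", "$UNIDATA")))
```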

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
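
Sketch of a lookup that consults two separate Joining_Group arrays, as the
genprops change above requires. The names, the "no joining group" value of 0,
and the sample array contents are assumptions for illustration; the actual
layout in ubidi_props.c and UBiDiProps.java is not reproduced here.

```python
NO_JOINING_GROUP = 0  # assumed default for code points outside both ranges


def joining_group(c, start1, values1, start2, values2):
    """Look up the Joining_Group of code point c in two contiguous arrays.

    values1 covers [start1, start1 + len(values1)), values2 likewise from start2.
    """
    if start1 <= c < start1 + len(values1):
        return values1[c - start1]
    if start2 <= c < start2 + len(values2):
        return values2[c - start2]
    return NO_JOINING_GROUP


if __name__ == "__main__":
    # Placeholder data: a short first array and a short second array that
    # starts at the new Manichaean range 10AC0.
    first = [0, 7, 7]
    second = [40, 41, 42]
    print(joining_group(0x10AC1, 0x620, first, 0x10AC0, second))
```

Keeping two short arrays avoids one long, mostly empty array spanning the gap
between the two ranges.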

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update
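
The grep above can be scripted. The sketch below assumes the usual
semicolon-separated layout of IdnaMappingTable.txt (code point or range,
then status, with a "#" comment); the function name is hypothetical and the
sample lines are modeled on that layout, not copied from the 7.0 file.

```python
def non_ascii_std3_valid(lines):
    """Return the non-ASCII code points whose status is disallowed_STD3_valid."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0]              # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        # Skip the ASCII part of a range; only U+0080 and above are of interest.
        result.extend(range(max(first, 0x80), last + 1))
    return result


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid                  # <control-0000>..COMMA",
        "00E0          ; valid                                  # LATIN SMALL LETTER A WITH GRAVE",
        "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    ]
    print(["U+%04X" % cp for cp in non_ascii_std3_valid(sample)])
```

Comparing the result for the new version against the list recorded for the
previous version shows whether the test files need new cases.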

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
  ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
  ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/  -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
  bpt                      ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
  bpb                      ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
  bpt; c                                ; Close
  bpt; n                                ; None
  bpt; o                                ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
  bc ; FSI                              ; First_Strong_Isolate
  bc ; LRI                              ; Left_To_Right_Isolate
  bc ; RLI                              ; Right_To_Left_Isolate
  bc ; PDI                              ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
  WB ; HL                               ; Hebrew_Letter
  WB ; SQ                               ; Single_Quote
  WB ; DQ                               ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
  Aghb  239     Caucasian Albanian
  Mahj  314     Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)
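
Worked check of the Word_Break note above: with 17 values numbered from 0,
the largest constant is 16, which no longer fits into 4 bits. A minimal
sketch (the helper name is made up; it assumes the constants are the
consecutive integers 0..count-1):

```python
def bits_needed(value_count):
    """Bits needed to store constants numbered 0..value_count-1."""
    return max(1, (value_count - 1).bit_length())


if __name__ == "__main__":
    for count in (14, 16, 17):
        print(count, "values ->", bits_needed(count), "bits")
```

Any bit field that stores Word_Break in 4 bits therefore has to be widened
before the three new values are added.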

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
4739    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
4740    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
4741    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
4742    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
4743    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
4744    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
4745    mkdir -p /tmp/icu4j/main/shared/data
4746    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
4747    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
4748    mkdir -p /tmp/icu4j/main/shared/data
4749    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
4750    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
4751- copy the big-endian Unicode data files to another location,
4752  separate from the other data files
4753    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
4754    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
4755    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
4756    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
4757    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
4758    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
4759    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
4760- refresh ICU4J
4761    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
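
The file selection done by the cp/rm commands above can be sketched in Python: top-level *.icu except cnvalias.icu, top-level *.nrm, coll/*.icu, and everything in brkitr/. This is only an illustration under assumptions. The function name is hypothetical, and the demo works on empty stand-in files in a temporary directory rather than on a real ICU build tree.

```python
import shutil
import tempfile
from pathlib import Path


def copy_unicode_data(src: Path, dst: Path) -> list[str]:
    """Copy the UCD-related files from src (an icudtNNb tree) to dst
    and return the relative paths that were copied."""
    selected = [p for p in sorted(src.glob("*.icu")) if p.name != "cnvalias.icu"]
    selected += sorted(src.glob("*.nrm"))
    selected += sorted(src.glob("coll/*.icu"))
    selected += sorted(p for p in src.glob("brkitr/*") if p.is_file())
    copied = []
    for path in selected:
        rel = path.relative_to(src)
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel.as_posix())
    return copied


def demo():
    """Build a tiny fake icudt52b tree (stand-in file names) and copy from it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "out" / "icudt52b"
        for name in ["uprops.icu", "cnvalias.icu", "nfc.nrm", "root.res",
                     "coll/ucadata.icu", "coll/root.res", "brkitr/char.brk"]:
            f = src / name
            f.parent.mkdir(parents=True, exist_ok=True)
            f.write_bytes(b"")
        return copy_unicode_data(src, Path(tmp) / "icu4j" / "icudt52b")


print(demo())
```

The copied tree is what the final `jar uf ... -C /tmp/icu4j` step would then merge into icudata.jar.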

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/  Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/  Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/  Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html  IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
  lb ; RI                               ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
  WB ; RI                               ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
  GCB; RI                               ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java
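
To make the three new values concrete, here is a small Python sketch that parses the ppucd.txt-style lines quoted above and looks at their numeric values. The helper names are hypothetical. The base-60 factoring only illustrates why 216000 and 432000 are large but regular; it is an assumption for illustration and does not reproduce the actual encoding in uprops.h.

```python
from fractions import Fraction


def parse_ppucd_cp(line):
    """Split a 'cp;XXXX;key=value;...' line into (code point, properties)."""
    fields = line.strip().split(";")
    if fields[0] != "cp":
        raise ValueError("not a cp line: " + line)
    props = {}
    for field in fields[2:]:
        key, _, value = field.partition("=")
        props[key] = value
    return int(fields[1], 16), props


def base60(value):
    """Factor a positive integer as (mantissa, exponent) with
    value == mantissa * 60**exponent and the exponent as large as possible."""
    if value <= 0:
        raise ValueError("positive integer expected")
    exponent = 0
    while value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


lines = [
    "cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1",
    "cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000",
    "cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000",
]
for line in lines:
    cp, props = parse_ppucd_cp(line)
    nv = Fraction(props["nv"])  # -1 becomes the "fraction" -1/1
    print(f"U+{cp:04X} nv={nv.numerator}/{nv.denominator}")
```

With these helpers, `base60(216000)` gives `(1, 3)` and `base60(432000)` gives `(2, 3)`, that is, 1 and 2 times 60 cubed.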

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
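
The four gennorm2 calls above differ only in the output file and the list of input .txt files, so they can be generated from a small table. A minimal Python sketch, assuming nothing beyond the `-o` and `-s` options already shown; the table and function names are made up for this example.

```python
import shlex

# Output file -> input .txt files, in the order used in the commands above.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata):
    """Return one argv list per .nrm file to build."""
    return [
        ["bin/gennorm2", "-o", f"{src_data_in}/{out}", "-s", f"{unidata}/norm2", *inputs]
        for out, inputs in NORM2_INPUTS.items()
    ]


commands = gennorm2_commands("source/data/in", "source/data/unidata")
for argv in commands:
    print(shlex.join(argv))
```

Each argv list could be passed to `subprocess.run` from the build directory, with LD_LIBRARY_PATH set as in the first command of the list.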

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
    ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 11 new block names:
  Arabic_Extended_A
  Arabic_Mathematical_Alphabetic_Symbols
  Chakma
  Meetei_Mayek_Extensions
  Meroitic_Cursive
  Meroitic_Hieroglyphs
  Miao
  Sharada
  Sora_Sompeng
  Sundanese_Supplement
  Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
    Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
            replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
  Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
  CJ=Conditional_Japanese_Starter
  HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
  sc ; Cakm      ; Chakma
  sc ; Merc      ; Meroitic_Cursive
  sc ; Mero      ; Meroitic_Hieroglyphs
  sc ; Plrd      ; Miao
  sc ; Shrd      ; Sharada
  sc ; Sora      ; Sora_Sompeng
  sc ; Takr      ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
  Khoj        322     Khojki
  Tirh        326     Tirhuta
    and another one added 2011-12-09
  Hluw        080     Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java
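
The find/replace patterns listed above for turning C enum lines into Java constants also work as-is with Python's `re.sub`, which uses the same `\1` back-reference syntax in the replacement. A sketch for trying them outside Eclipse; the rule names are made up, and the sample enum lines (including their numeric values and comments) are illustrative rather than copied from uchar.h or uscript.h.

```python
import re

# (find pattern, replacement) pairs taken from the log entries above.
BLOCK_ID = (
    r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
    r"public static final int \1_ID = \2; \3",
)
BLOCK_OBJECT = (
    r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
)
SCRIPT_CODE = (
    r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
    r"public static final int \1 = \2;\3",
)


def convert(rule, c_line):
    """Apply one find/replace rule to a single C enum line."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, c_line)


# Illustrative input lines (values are assumptions for the example).
block_line = "UBLOCK_CHAKMA = 212, /*[11100]*/"
script_line = "USCRIPT_KHOJKI = 157,/* Khoj */"

print(convert(BLOCK_ID, block_line))
print(convert(BLOCK_OBJECT, block_line))
print(convert(SCRIPT_CODE, script_line))
```

Lines that do not match a pattern pass through unchanged, so a whole copied enum block can be fed through line by line and the leftovers fixed by hand.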

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
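
The case-insensitive search for 9FC[BC] can be run over source lines with a few lines of Python. This is a sketch only: the function name is hypothetical and the sample lines are invented to show what does and does not match.

```python
import re

CJK_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_cjk_constants(lines):
    """Return (line number, stripped text) for lines mentioning 9FCB or 9FCC."""
    return [
        (number, line.strip())
        for number, line in enumerate(lines, 1)
        if CJK_END_OR_LIMIT.search(line)
    ]


# Invented sample lines, not taken from the ICU sources.
source_lines = [
    "#define CJK_END 0x9fcb",
    "static const int32_t CJK_LIMIT = 0x9FCC;",
    "static const int32_t OTHER = 0x9FCA;",
]
for number, text in find_cjk_constants(source_lines):
    print(number, text)
```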

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
#     Arabic Extended-A: U+08A0  -  U+08FF  (was default-R)
#     Arabic Mathematical Alphabetic Symbols:
#                       U+1EE00  - U+1EEFF  (was default-R)
- 2 new default-R blocks:
#     Meroitic Hieroglyphs:
#                        U+10980 - U+1099F
#     Meroitic Cursive:  U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file
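
For a quick sanity check of the four default ranges listed above, they can be written as a small lookup table. A minimal Python sketch that covers only these ranges; the names are made up, and the function returns None elsewhere because the other default Bidi_Class ranges from DerivedBidiClass.txt are deliberately not reproduced here.

```python
# (first, last, default Bidi_Class) for the ranges named in this log entry.
DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A (was default-R)
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols (was default-R)
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def default_bidi_class(code_point):
    """Return the default Bidi_Class for the listed ranges, else None."""
    for first, last, bidi_class in DEFAULT_BIDI_RANGES:
        if first <= code_point <= last:
            return bidi_class
    return None


for cp in (0x08A0, 0x1EEFF, 0x10980, 0x109FF, 0x0041):
    print(f"U+{cp:04X}", default_bidi_class(cp))
```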

* NameAliases.txt changes
- from
    # Each line has two fields
    # First field: Code point
    # Second field: Alias
- to
    # Each line has three fields, as described here:
    #
    # First field:  Code point
    # Second field: Alias
    # Third field:  Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.
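
The three-field format and the "correction only" filter can be illustrated with a short Python sketch. It is not the gennames code (gennames is written in C); the function name is hypothetical, and the sample combines the FEFF lines quoted above with one correction-type line.

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type is 'correction'."""
    aliases = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # skip comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 3:
            raise ValueError("expected code point;alias;type: " + line)
        code_point, alias, alias_type = fields
        if alias_type == "correction":
            aliases.setdefault(int(code_point, 16), []).append(alias)
    return aliases


sample = [
    "# First field:  Code point",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
print(correction_aliases(sample))
```

Keeping a list per code point avoids assuming that each character has at most one alias of a given type, which is exactly the assumption the new file format broke.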

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak    439     Afaka
    Jurc    510     Jurchen
    Mroo    199     Mro, Mru
    Nshu    499     Nüshu
    Shrd    319     Sharada, Śāradā
    Sora    398     Sora Sompeng
    Takr    321     Takri, Ṭākrī, Ṭāṅkrī
    Tang    520     Tangut
    Wole    480     Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
      and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***
5360
5361Unicode 6.0 update
5362
5363*** related ICU Trac tickets
5364
53657264 Unicode 6.0 Update
5366
5367*** Unicode version numbers
5368- makedata.mak
5369- uchar.h
5370  (configure.in & configure: have been modified to extract the version from uchar.h)
5371- com.ibm.icu.util.VersionInfo
5372
5373*** data files & enums & parser code
5374
5375* file preparation
5376
5377~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
5378- This now prepares both unidata and testdata files in respective output subfolders.
5379
5380* PropertyAliases.txt changes
5381- new Script_Extensions property defined in the new ScriptExtensions.txt file
5382  but not listed in PropertyAliases.txt; reported to unicode.org;
5383  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
5384    scx; Script_Extensions
5385  -> uchar.h with new UProperty section
5386  -> com.ibm.icu.lang.UProperty, parallel with uchar.h
5387
5388* PropertyValueAliases.txt changes
5389- 12 new block names:
5390  Alchemical_Symbols
5391  Bamum_Supplement
5392  Batak
5393  Brahmi
5394  CJK_Unified_Ideographs_Extension_D
5395  Emoticons
5396  Ethiopic_Extended_A
5397  Kana_Supplement
5398  Mandaic
5399  Miscellaneous_Symbols_And_Pictographs
5400  Playing_Cards
5401  Transport_And_Map_Symbols
5402  -> add to uchar.h
5403  -> add to UCharacter.UnicodeBlock
5404    Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
5405            replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
5406- Joining_Group (jg) values:
5407  Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
5408  -> uchar.h & UCharacter.JoiningGroup
5409- 3 new scripts:
5410  sc ; Batk      ; Batak
5411  sc ; Brah      ; Brahmi
5412  sc ; Mand      ; Mandaic
5413  -> remove these from SyntheticPropertyValueAliases.txt
5414  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
5415  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
5416      and in com.ibm.icu.dev.test.lang.TestUScript.java
5417- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
5418  (added 2009-11-11..2010-07-18)
5419  Bass        259     Bassa Vah
5420  Dupl        755     Duployan shorthand
5421  Elba        226     Elbasan
5422  Gran        343     Grantha
5423  Kpel        436     Kpelle
5424  Loma        437     Loma
5425  Mend        438     Mende
5426  Merc        101     Meroitic Cursive
5427  Narb        106     Old North Arabian
5428  Nbat        159     Nabataean
5429  Palm        126     Palmyrene
5430  Sind        318     Sindhi
5431  Wara        262     Warang Citi
5432  -> uscript.h
5433  -> com.ibm.icu.lang.UScript
5434    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
5435    replace  public static final int \1 = \2;\3
5436  -> SyntheticPropertyValueAliases.txt
5437  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
5438      and in com.ibm.icu.dev.test.lang.TestUScript.java
5439- ISO 15924 name change
5440  Mero        100     Meroitic Hieroglyphs (was Meroitic)
5441  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
5442- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
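
The alias lines quoted in this section all share one semicolon-separated layout. A minimal sketch of splitting such a line, assuming the usual UCD conventions (fields separated by ';', '#' starts a comment); parse_value_alias is a hypothetical helper, not part of preparse.pl or genpname:

```python
def parse_value_alias(line: str):
    """Split one alias line into (property, short name, long name, other aliases).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # At least three fields are expected; fewer raises ValueError here.
    prop, short_name, long_name, *others = fields
    return prop, short_name, long_name, others


print(parse_value_alias("sc ; Batk      ; Batak"))
```

Any further fields on a line end up in the last element as additional aliases.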
5443
5444* UnicodeData.txt changes
5445- new CJK block:
5446  2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
5447  2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
5448  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
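
The First/Last pair above describes one algorithmic range. A minimal sketch of turning such a pair into (start, end, name), assuming the standard UnicodeData.txt layout with the code point in field 0 and the range label in field 1; parse_range is a hypothetical helper and not how gennames.c reads the file:

```python
import re

# Matches range labels such as "<CJK Ideograph Extension D, First>".
RANGE_LABEL = re.compile(r"<(.+), (First|Last)>")


def parse_range(first_line: str, last_line: str):
    """Return (start, end, name) for a First/Last pair of UnicodeData.txt lines."""
    first_fields = first_line.split(";")
    last_fields = last_line.split(";")
    first = RANGE_LABEL.fullmatch(first_fields[1])
    last = RANGE_LABEL.fullmatch(last_fields[1])
    if first is None or last is None:
        raise ValueError("not a range label")
    if first.group(2) != "First" or last.group(2) != "Last":
        raise ValueError("lines are not in First/Last order")
    if first.group(1) != last.group(1):
        raise ValueError("range names differ")
    return int(first_fields[0], 16), int(last_fields[0], 16), first.group(1)


start, end, name = parse_range(
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
)
print(f"{name}: U+{start:04X}..U+{end:04X}")
```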
5449
5450* build Unicode tools using CMake+make
5451
5452* run genpname/preparse.pl (on Linux)
5453  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
5454  + make sure that data.h is writable
5455  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
5456  + preparse.pl shows no errors, out.txt Info and Warning lines look ok
5457
5458* rebuild Unicode tools (at least genpname) using make
5459- You might first need to "make install" ICU so that the tools build can pick
5460  up the new definitions from the installed header files.
5461
5462* run genpname
5463- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
5464- rebuild ICU & tools
5465
5466* update source/data/unidata/norm2/nfkc_cf.txt
5467- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
5468
5469* update source/data/unidata/norm2/uts46.txt
5470- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
5471  to ~/svn.icu/tools/trunk/src/unicode/py
5472- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
5473- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
5474- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
5475
5476* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
5477  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
5478- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
5479- Unicode 6.0: U+2260, U+226E, U+226F
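
The grep step above can be sketched as a small filter. This assumes the mapping table uses "code point or range ; status ; ..." lines with '#' comments; the sample lines only illustrate that assumed layout and are not copied from IdnaMappingTable.txt, and non_ascii_std3_valid is a hypothetical helper:

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) ranges above U+007F whose status is disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0]
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start_text, _, end_text = fields[0].partition("..")
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if end > 0x7F:
            # Clip ranges that begin in ASCII so only the non-ASCII part is reported.
            yield max(start, 0x80), end


# Illustrative lines in the assumed format.
SAMPLE = [
    "0000..002C    ; disallowed_STD3_valid   # ASCII, ignored by the filter",
    "00C0          ; mapped ; 00E0           # not the status we look for",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
]
for start, end in non_ascii_std3_valid(SAMPLE):
    print(f"U+{start:04X}..U+{end:04X}")
```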
5480
5481* generate core properties data files
5482- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
5483- rebuild ICU & tools
5484- run makeuca.sh so that genuca picks up the new nfc.nrm:
5485  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
5486- rebuild ICU & tools
5487
5488* implement new Script_Extensions property (provisional)
5489- parser & generator: genprops & uprops.icu
5490- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
5491- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
5492
5493* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
5494- (one-time change)
5495- genbidi/gencase/genprops tools changes
5496- re-run makeprops.sh (see above)
5497- UCharacterProperty.java, UCharacterTypeIterator.java,
5498  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
5499  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
5500
5501* update Java data files
5502- refresh just the UCD-related files, just to be safe
5503- see (ICU4C)/source/data/icu4j-readme.txt
5504- mkdir /tmp/icu4j
5505- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5506  output:
5507    ...
5508    Unicode .icu files built to ./out/build/icudt45l
5509    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
5510    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
5511    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
5512    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
5513    mkdir -p /tmp/icu4j/main/shared/data
5514    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
5515- copy the big-endian Unicode data files to another location,
5516  separate from the other data files
5517    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5518    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
5519    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
5520    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
5521    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
5522    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5523    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
5524- refresh ICU4J
5525    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
5526
5527* refresh Java test .txt files
5528- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
5529
5530* un-hardcode normalization skippable (NF*_Inert) test data
5531- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
5532
5533* copy updated break iterator test files
5534- now handled by early ucdcopy.py and
5535  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
5536  (old instructions:
5537   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
5538   to ~/svn.icu/trunk/src/source/test/testdata)
5539- they are not used in ICU4J
5540
5541* UCA
5542
5543- get output from Mark's tools; look in
5544    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
5545    http://www.macchiato.com/unicode/utc/additional-uca-files
5546    http://www.unicode.org/Public/UCA/6.0.0/
5547    http://www.unicode.org/~mdavis/uca/
5548- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
5549- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
5550- update Han-implicit ranges for new CJK extensions:
5551  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
5552- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
5553  do not add it into invuca so that tailoring primary-after an ignorable works
5554- genuca: permit space between [variable top] bytes
5555- ucol.cpp: treat noncharacters like unassigned rather than ignorable
5556- run makeuca.sh:
5557  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
5558- rebuild ICU4C
5559- refresh ICU4J collation data:
5560  (subset of instructions above for properties data refresh, except copies all coll/*)
5561    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5562    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5563    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
5564    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
5565- update (ICU)/source/test/testdata/CollationTest_*.txt
5566  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
5567  with output from Mark's Unicode tools
5568- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
5569- note on intltest: if collate/UCAConformanceTest fails, then
5570  utility/MultithreadTest/TestCollators will fail as well;
5571  fix the conformance test before looking into the multi-thread test
5572
5573* When refreshing all of ICU4J data from ICU4C
5574- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
5575- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
5576or
5577- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
5578
5579*** LayoutEngine script information
5580
5581(For details see the Unicode 5.2 change log below.)
5582
5583* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
5584ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
5585ScriptRunData.cpp, which is no longer needed.)
5586
5587The generated files have a current copyright date and "@draft" statement.
5588
5589* copy the above files into <icu>/source/layout, replacing the old files.
5590* fix mixed line endings
5591* review the diffs and fix incorrect @draft and missing aliases;
5592  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
5593* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
5594
5595---------------------------------------------------------------------------- ***
5596
5597Unicode 5.2 update
5598
5599*** related ICU Trac tickets
5600
56017084 Unicode 5.2
5602
56037167 verify collation bytes
56047235 Java test NAME_ALIAS
56057236 Java DerivedCoreProperties.txt test
56067237 Java BidiTest.txt
56077238 UTrie2 in core unidata
56087239 test for tailoring gaps
56097240 Java fix CollationMiscTest
56107243 update layout engine for Unicode 5.2
5611
5612*** Unicode version numbers
5613- makedata.mak
5614- uchar.h
5615- configure.in & configure
5616- update ucdVersion in gennames.c if an algorithmic range changes
5617
5618*** data files & enums & parser code
5619
5620* file preparation
5621
5622python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
5623- includes finding files regardless of version numbers,
5624  copying them, and performing the equivalent processing of the
5625  ucdstrip and ucdmerge tools on the desired set of files
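
A rough sketch of two of the steps described above, for orientation only. It assumes that versioned file names carry a "-x.y.z" suffix (optionally with a draft marker such as "d6") before ".txt", and that the stripping step drops '#' comments and blank lines. Both helpers are hypothetical and much simpler than ucdcopy.py itself:

```python
import re

# Version suffix such as "-5.2.0" or "-5.2.0d6" directly before ".txt" (assumed naming).
VERSION_SUFFIX = re.compile(r"-[0-9]+(?:\.[0-9]+)*(?:d[0-9]+)?(?=\.txt$)")


def base_name(file_name: str) -> str:
    """Map a possibly versioned UCD file name to its unversioned name."""
    return VERSION_SUFFIX.sub("", file_name)


def strip_comments(lines):
    """Yield data lines without '#' comments; skip lines that become empty."""
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            yield data


print(base_name("UnicodeData-5.2.0d6.txt"))
print(list(strip_comments(["# header", "", "0041 ; L # LATIN CAPITAL LETTER A"])))
```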
5626
5627* notes on changes
5628- PropertyAliases.txt
5629  moved from numeric to enumerated:
5630    ccc       ; Canonical_Combining_Class
5631  new string properties:
5632    NFKC_CF   ; NFKC_Casefold
5633    Name_Alias; Name_Alias
5634  new binary properties:
5635    Cased     ; Cased
5636    CI        ; Case_Ignorable
5637    CWCF      ; Changes_When_Casefolded
5638    CWCM      ; Changes_When_Casemapped
5639    CWKCF     ; Changes_When_NFKC_Casefolded
5640    CWL       ; Changes_When_Lowercased
5641    CWT       ; Changes_When_Titlecased
5642    CWU       ; Changes_When_Uppercased
5643  new CJK Unihan properties (not supported by ICU)
5644- PropertyValueAliases.txt
5645  new block names
5646  new scripts
5647  one script code change:
5648    sc ; Qaai      ; Inherited
5649    ->
5650    sc ; Zinh      ; Inherited                        ; Qaai
5651  new Line_Break (lb) value:
5652    lb ; CP        ; Close_Parenthesis
5653  new Joining_Group (jg) values: Farsi_Yeh, Nya
5654  other new values:
5655    ccc; 214; ATA  ; Attached_Above
5656- DerivedBidiClass.txt
5657  new default-R range: U+1E800 - U+1EFFF
5658- UnicodeData.txt
5659  all of the ISO comments are gone
5660  new CJK block end:
5661    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
5662  new CJK block:
5663    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
5664    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
5665
5666* genpname
5667- run preparse.pl
5668  + cd \svn\icuproj\icu\trunk\source\tools\genpname
5669  + make sure that data.h is writable
5670  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
5671  + preparse.pl complains with errors like the following:
5672      Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
5673    This is because ICU 4.0 had scripts from ISO 15924 which are now
5674    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
5675    and PropertyValueAliases.txt.
5676    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
5677       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
5678  + preparse.pl complains with errors about block names missing from uchar.h; add them
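
The conflict that preparse.pl reports can be anticipated by comparing the two alias lists before running it. A minimal sketch, assuming the script codes have already been extracted from both files; duplicate_script_codes is a hypothetical helper and the sample lists are illustrative:

```python
def duplicate_script_codes(synthetic, official):
    """Return the codes from the synthetic list that the official list now defines."""
    official_set = set(official)
    return [code for code in synthetic if code in official_set]


# Illustrative data: two codes that moved into PropertyValueAliases.txt,
# plus one that is assumed to remain synthetic.
synthetic_codes = ["Egyp", "Java", "Blis"]
official_codes = ["Egyp", "Java", "Latn"]
print(duplicate_script_codes(synthetic_codes, official_codes))
```

Every code in the result is a candidate for removal from SyntheticPropertyValueAliases.txt.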
5679
5680* uchar.h & uscript.h & uprops.h & uprops.c & genprops
5681- new block & script values
5682  + 26 new blocks
5683    copy new blocks from Blocks.txt
5684    MS VC++ 2008 regular expression:
5685      find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
5686      replace with "    UBLOCK_\3 = 172, /*[\1]*/"
5687  + several new script values already added in ICU 4.0 for ISO 15924 coverage
5688    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
5689  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
5690  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
5691    (added to SyntheticPropertyValueAliases.txt)
5692- new Joining Group (JG) values: Farsi_Yeh, Nya
5693- new Line_Break (lb) value:
5694    lb ; CP        ; Close_Parenthesis
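
The Blocks.txt find/replace above leaves the block name in its original spelling and uses one fixed number. A minimal sketch that also upper-cases the name and takes the ID as a parameter, assuming that spaces and hyphens both become underscores; block_to_enum is a hypothetical helper and the ID passed in below is illustrative:

```python
import re

# Same shape as the editor pattern above, with Python-style groups.
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_to_enum(line: str, block_id: int) -> str:
    """Turn one Blocks.txt data line into a uchar.h-style enum entry."""
    match = BLOCK_LINE.match(line)
    if match is None:
        raise ValueError(f"not a Blocks.txt data line: {line!r}")
    start, _end, name = match.groups()
    constant = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
    return f"    UBLOCK_{constant} = {block_id}, /*[{start}]*/"


print(block_to_enum("A960..A97F; Hangul Jamo Extended-A", 172))
```

The generated entries still need review against uchar.h for numbering and naming.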
5695
5696* hardcoded Unihan range end/limit
5697- Unihan range end moves from 9FC3 to 9FCB
5698  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
5699  + do change gennames.c
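
The search above is easy to script. A minimal sketch, assuming the source text has already been read into a string; find_hardcoded_ends is a hypothetical helper and the sample text is made up:

```python
import re

# Old range end (9FC3) and limit (9FC4), matched case-insensitively as in the notes.
OLD_UNIHAN_END = re.compile(r"9FC[34]", re.IGNORECASE)


def find_hardcoded_ends(text: str):
    """Return (line number, line) pairs that mention the old end or limit."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if OLD_UNIHAN_END.search(line)
    ]


sample = "if (c <= 0x9fc3) {\n    limit = 0x9FC4;\n    other = 0x9FCB;\n}"
for number, line in find_hardcoded_ends(sample):
    print(number, line.strip())
```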
5700
5701* Compare definitions of new binary properties with what we used to use
5702  in algorithms, to see if the definitions changed.
5703- Verified that definitions for Cased and Case_Ignorable are unchanged.
5704  The gencase tool now parses the newly public Case_Ignorable values
5705  in case the definition changes in the future.
5706
5707* uchar.c & uprops.h & uprops.c & genprops
5708- new numeric values that didn't exist in Unicode data before:
5709    1/7, 1/9, 1/10, 3/10, 1/16, 3/16
5710  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
5711  therefore redesign the encoding of numeric types and values for formatVersion 6;
5712  design for simple numbers up to at least 144 ("one gross"),
5713  large values up to at least 10^20,
5714  and fractions with numerators -1..17 and denominators 1..16
5715  to cover current and expected future values
5716  (e.g., more Han numeric values, Meroitic twelfths)
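
To make the fraction ranges above concrete: numerators -1..17 give 19 values and denominators 1..16 give 16, so a fraction fits in 19*16 = 304 codes. The sketch below shows one possible packing under that assumption. It illustrates the arithmetic only and is not claimed to be the layout that uprops.icu formatVersion 6 uses; the function names are hypothetical:

```python
from fractions import Fraction

MIN_NUMERATOR, MAX_NUMERATOR = -1, 17
MIN_DENOMINATOR, MAX_DENOMINATOR = 1, 16


def pack_fraction(numerator: int, denominator: int) -> int:
    """Pack a fraction into one small integer (illustrative layout)."""
    if not MIN_NUMERATOR <= numerator <= MAX_NUMERATOR:
        raise ValueError("numerator out of range")
    if not MIN_DENOMINATOR <= denominator <= MAX_DENOMINATOR:
        raise ValueError("denominator out of range")
    # High bits: numerator offset to 0..18; low 4 bits: denominator offset to 0..15.
    return ((numerator - MIN_NUMERATOR) << 4) | (denominator - MIN_DENOMINATOR)


def unpack_fraction(code: int) -> Fraction:
    """Inverse of pack_fraction."""
    return Fraction((code >> 4) + MIN_NUMERATOR, (code & 0xF) + MIN_DENOMINATOR)


# The numeric values listed above as new in Unicode 5.2.
for numerator, denominator in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
    code = pack_fraction(numerator, denominator)
    print(numerator, denominator, code, unpack_fraction(code))
```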
5717
5718* reimplement Hangul_Syllable_Type for new Jamo characters
5719- the old code assumed that all Jamo characters are in the 11xx block
5720- Unicode 5.2 fills holes there and adds new Jamo characters in
5721    A960..A97F; Hangul Jamo Extended-A
5722  and in
5723    D7B0..D7FF; Hangul Jamo Extended-B
5724- Hangul_Syllable_Type can be trivially derived from a subset of
5725  Grapheme_Cluster_Break values
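
The derivation mentioned above can be sketched in a few lines. This assumes the UCD short value names, where the Hangul-related Grapheme_Cluster_Break values L, V, T, LV and LVT carry the same names as the Hangul_Syllable_Type values and everything else maps to NA. The function is hypothetical and is not ICU's C implementation:

```python
# GCB values that correspond one-to-one to Hangul_Syllable_Type values.
HANGUL_GCB_VALUES = frozenset({"L", "V", "T", "LV", "LVT"})


def hangul_syllable_type(gcb: str) -> str:
    """Derive the Hangul_Syllable_Type short name from a GCB short name."""
    return gcb if gcb in HANGUL_GCB_VALUES else "NA"


print(hangul_syllable_type("LV"), hangul_syllable_type("CR"))
```

With this approach new Jamo characters need no block-based special case, since their type follows from the break property data.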
5726
5727* build Unicode data source code for hardcoding core data
5728C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
5729
5730ICU data make path is \svn\icuproj\icu\trunk\source\data\
5731ICU root path is \svn\icuproj\icu\trunk
5732Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
5733Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
5734Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
5735Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
5736Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
5737Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
5738Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
5739Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
5740Creating data file for Unicode Property Names
5741Creating data file for Unicode Character Properties
5742Creating data file for Unicode Case Mapping Properties
5743Creating data file for Unicode BiDi/Shaping Properties
5744Creating data file for Unicode Normalization
5745Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
5746Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
5747
5748- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
5749  and rebuild the common library
5750
5751*** UCA
5752
5753- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
5754- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
5755- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
5756[ Begin obsolete instructions:
5757  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
5758    - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
5759      on Windows:
5760        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
5761        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
5762  End obsolete instructions]
5763- run all tests with the *_SHORT.txt or the full files (the full ones have comments),
5764  not just the *_STUB.txt files
5765- note on intltest: if collate/UCAConformanceTest fails, then
5766  utility/MultithreadTest/TestCollators will fail as well;
5767  fix the conformance test before looking into the multi-thread test
5768
5769*** Implement Cased & Case_Ignorable properties
5770- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
5771- Problem: These properties should be disjoint, but aren't
5772- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
5773- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
5774
5775*** Implement Changes_When_Xyz properties
5776- without stored data
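
"Without stored data" means the value is computed from existing case mapping and normalization data. A minimal sketch, assuming the UCD derivation toLowercase(toNFD(X)) != toNFD(X) for Changes_When_Lowercased and the analogous one for uppercasing. It uses Python's built-in Unicode data, whose version depends on the interpreter, and it is not ICU's code:

```python
import unicodedata


def changes_when_lowercased(char: str) -> bool:
    """True if lowercasing the NFD form of char changes it."""
    nfd = unicodedata.normalize("NFD", char)
    return nfd.lower() != nfd


def changes_when_uppercased(char: str) -> bool:
    """True if uppercasing the NFD form of char changes it."""
    nfd = unicodedata.normalize("NFD", char)
    return nfd.upper() != nfd


for char in ["A", "a", "1", "\u00c9"]:
    print(repr(char), changes_when_lowercased(char), changes_when_uppercased(char))
```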
5777
5778*** Implement Name_Alias property
5779- add it as another name field in unames.icu
5780- make it available via u_charName() and UCharNameChoice
5781- consider it in u_charFromName()
5782
5783*** Break iterators
5784
5785* Update break iterator rules to new UAX versions and new property values
5786* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
5787
5788*** new BidiTest file
5789- review format and data
5790- copy BidiTest.txt to source/test/testdata
5791- write test code using this data
5792- fix ICU code where it fails the conformance test
5793
5794*** Java
5795- generally, find and update code corresponding to C/C++
5796- UCharacter.UnicodeBlock constants:
5797  a) add an _ID integer per new block, update COUNT
5798  b) add a class instance per new block
5799     Visual Studio regex:
5800        find            UBLOCK_{[^ ]+} = [0-9]+, {/.+}
5801        replace with    public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
5802- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
5803
5804- port test changes to Java
5805
5806*** LayoutEngine script information
5807
5808(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
5809
5810* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
5811ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
5812ScriptRunData.cpp, which is no longer needed.)
5813
5814The generated files have a current copyright date and "@draft" statement.
5815
5816-> Eric Mader wrote in email on 20090930:
5817    "I think the tool has been modified to update @draft to @stable for
5818     older scripts and to add @draft for new scripts.
5819     (I worked with an intern on this last year.)
5820     You should check the output after you run it."
5821
5822* copy the above files into <icu>/source/layout, replacing the old files.
5823* fix mixed line endings
5824* review the diffs and fix incorrect @draft and missing aliases
5825* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
5826
5827Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
5828and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
5829
5830-> Eric Mader wrote in email on 20090930:
5831    "This is just a matter of making sure that all the per-script tables have
5832     entries for any new scripts that were added.
5833     If any new Indic characters were added, then the class tables in
5834     IndicClassTables.cpp should be updated to reflect this.
5835     John Emmons should know how to do this if it's required."
5836
5837* rebuild the layout and layoutex libraries.
5838
5839*** Documentation
5840- Update User Guide
5841  + Jamo_Short_Name, sfc->scf, binary property value aliases
5842
5843---------------------------------------------------------------------------- ***
5844
5845Unicode 5.1 update
5846
5847*** related ICU Trac tickets
5848
58495696 Update to Unicode 5.1
5850
5851*** Unicode version numbers
5852- makedata.mak
5853- uchar.h
5854- configure.in & configure
5855- update ucdVersion in gennames.c if an algorithmic range changes
5856
5857*** data files & enums & parser code
5858
5859* file preparation
5860- ucdstrip:
5861    DerivedCoreProperties.txt
5862    DerivedNormalizationProps.txt
5863    NormalizationTest.txt
5864    PropList.txt
5865    Scripts.txt
5866    GraphemeBreakProperty.txt
5867    SentenceBreakProperty.txt
5868    WordBreakProperty.txt
5869- ucdstrip and ucdmerge:
5870    EastAsianWidth.txt
5871    LineBreak.txt
5872
5873* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
5874copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
5875copy 5.1.0\ucd\Blocks.txt ..\unidata\
5876copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
5877copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
5878copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
5879copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
5880copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
5881copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
5882copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
5883copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
5884copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
5885copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
5886copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
5887
5888ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
5889ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
5890ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
5891ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
5892ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
5893ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
5894ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
5895ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
5896ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
5897ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
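
For the two files that are piped through ucdmerge, the idea is that adjacent ranges with the same value collapse into one. A minimal sketch under that assumption, working on already parsed (start, end, value) tuples; merge_ranges is a hypothetical helper and does not reproduce the tool's file handling:

```python
def merge_ranges(entries):
    """Merge adjacent (start, end, value) entries that share the same value.

    Entries are assumed to be sorted by start and not to overlap.
    """
    merged = []
    for start, end, value in entries:
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            # Extend the previous range instead of adding a new one.
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((start, end, value))
    return merged


# Illustrative entries, not taken from EastAsianWidth.txt.
sample = [(0x20, 0x20, "Na"), (0x21, 0x23, "Na"), (0x24, 0x24, "Na"), (0xA1, 0xA1, "A")]
print(merge_ranges(sample))
```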
5898
5899* genpname
5900- run preparse.pl
5901  + cd \svn\icuproj\icu\uni51\source\tools\genpname
5902  + make sure that data.h is writable
5903  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
5904  + preparse.pl complains with errors like the following:
5905      Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
5906    This is because ICU 3.8 had scripts from ISO 15924 which are now
5907    added to Unicode 5.1, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
5908    and PropertyValueAliases.txt.
5909    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
5910       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
5911  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
5912      N/Y, No/Yes, F/T, False/True
5913    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
5914       It will use further values from the file if present.
5915
5916* uchar.h & uscript.h & uprops.h & uprops.c & genprops
5917- new block & script values
5918  + 17 new blocks
5919  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
5920    (removed from SyntheticPropertyValueAliases.txt)
5921  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
5922    (added to SyntheticPropertyValueAliases.txt)
5923- uprops.icu (uprops.h) only provides 7 bits for script codes.
5924  ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
5925  No code above 127 is yet the script code of an
5926  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
5927  script code values greater than 127.
5928  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
5929  in a parallel bit field, and that overflows now.
5930  Also, future values >=128 would be incompatible anyway.
5931  uprops.h is modified to move around several of the bit fields
5932  in the properties vector words, and now uses 8 bits for the script code.
5933  Two other bit fields also grow to accommodate future growth:
5934  Block (current count: 172) grows from 8 to 9 bits,
5935  and Word_Break grows from 4 to 5 bits.
5936- renamed property Simple_Case_Folding (sfc->scf)
5937  + nothing to be done: handled as normal alias
5938- new property JSN Jamo_Short_Name
5939  + no new API: only contributes to the Name property
5940- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
5941- new Joining Group (JG) value: Burushashki_Yeh_Barree
5942- new Sentence_Break (SB) values:
5943    SB ; CR        ; CR
5944    SB ; EX        ; Extend
5945    SB ; LF        ; LF
5946    SB ; SC        ; SContinue
5947- new Word_Break (WB) values:
5948    WB ; CR        ; CR
5949    WB ; Extend    ; Extend
5950    WB ; LF        ; LF
5951    WB ; MB        ; MidNumLet
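
The bit-field arithmetic behind the uprops.h change above is simple to check. A minimal sketch; bits_needed is a hypothetical helper, and the numbers are the ones quoted in the notes:

```python
def bits_needed(limit: int) -> int:
    """Bits needed to store every value in 0..limit-1."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return max(1, (limit - 1).bit_length())


# USCRIPT_CODE_LIMIT=130: the maximum value 129 no longer fits into 7 bits.
print(bits_needed(128), bits_needed(130))
# 172 blocks still fit into 8 bits; the notes grow the field to 9 for headroom.
print(bits_needed(172))
```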
5952
5953* Further changes in the 2008-02-29 update:
5954- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
5955  because they should not normally be invisible.
5956- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
5957- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
5958- new Word_Break (WB) value: NL=Newline
5959
5960* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
5961- Unihan range end moves from 9FBB to 9FC3
5962  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
5963  + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest
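The shape of such a test can be illustrated outside of ICU. The sketch below is not TestBinaryValues() and does not call any ICU API; parse_binary_value is a hypothetical stand-in for whatever API resolves a binary property value alias, and case-insensitive matching is an assumption made here for illustration.

```python
TRUE_ALIASES = {"y", "yes", "t", "true"}
FALSE_ALIASES = {"n", "no", "f", "false"}


def parse_binary_value(alias: str) -> bool:
    """Map one of the eight boolean value aliases to a Python bool."""
    key = alias.strip().lower()
    if key in TRUE_ALIASES:
        return True
    if key in FALSE_ALIASES:
        return False
    raise ValueError(f"not a binary property value alias: {alias!r}")


# The four alias pairs named in the note above, each as (false, true).
ALIAS_PAIRS = [("N", "Y"), ("No", "Yes"), ("F", "T"), ("False", "True")]

for false_alias, true_alias in ALIAS_PAIRS:
    assert parse_binary_value(false_alias) is False
    assert parse_binary_value(true_alias) is True
print("all", 2 * len(ALIAS_PAIRS), "aliases accepted")
```

A real test would run the same loop against each API that accepts property value aliases.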

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
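For readers without the tools at hand, the strip-then-merge pipeline in the last two commands can be approximated as follows. This is a rough sketch and not a port of ucdstrip or ucdmerge: it assumes that stripping means removing the trailing "#" comment from data lines only, and that merging means collapsing consecutive code points with identical values into one range line. The sample lines are hypothetical excerpts in the style of LineBreak.txt.

```python
def strip_comments(lines):
    """Remove the trailing '# ...' comment from data lines; keep other lines."""
    out = []
    for line in lines:
        data, sep, _comment = line.partition("#")
        if sep and ";" in data:      # a data line with a comment behind it
            line = data.rstrip()
        out.append(line)
    return out


def parse_data_line(line):
    """Split 'XXXX;value' or 'XXXX..YYYY;value' into (start, end, value)."""
    code_range, value = (field.strip() for field in line.split(";", 1))
    start, _, end = code_range.partition("..")
    return int(start, 16), int(end or start, 16), value


def merge_ranges(data_lines):
    """Merge adjacent lines whose code points are consecutive and values equal."""
    merged = []
    for line in data_lines:
        start, end, value = parse_data_line(line)
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((start, end, value))
    return [
        f"{s:04X}..{e:04X};{v}" if s != e else f"{s:04X};{v}"
        for s, e, v in merged
    ]


sample = [
    "# hypothetical excerpt",
    "0041;AL # LATIN CAPITAL LETTER A",
    "0042;AL # LATIN CAPITAL LETTER B",
    "0043;AL # LATIN CAPITAL LETTER C",
    "005B;OP # LEFT SQUARE BRACKET",
]
stripped = strip_comments(sample)
data = [line for line in stripped if line and not line.startswith("#")]
print(merge_ranges(data))
```

The sketch only shows why the merge step comes after the strip step: once the per-character comments are gone, neighboring lines become identical apart from their code points.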

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
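The D47a rule quoted above is a simple union of one property value and five general categories. A minimal sketch of that predicate follows, with two caveats stated as assumptions: Python's unicodedata module reflects the Unicode version shipped with the interpreter, not Unicode 4.1, and it has no Word_Break data, so MIDLETTER_SAMPLE is a small hypothetical stand-in for the real Word_Break=MidLetter set.

```python
import unicodedata

# General categories named in the rule above.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Hypothetical stand-in for Word_Break=MidLetter (illustrative subset only).
MIDLETTER_SAMPLE = {0x0027, 0x00B7}


def is_case_ignorable(ch: str, midletter=MIDLETTER_SAMPLE) -> bool:
    """True if ch is Word_Break=MidLetter or has gc in Mn, Me, Cf, Lm, Sk."""
    return (
        ord(ch) in midletter
        or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES
    )


for ch in ["'", "\u0301", "\u00AD", "^", "a"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_case_ignorable(ch))
```

In gencase the MidLetter set would come from WordBreakProperty.txt rather than from a hardcoded sample.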

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
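The round-trip check in the first bullet means that mirroring a code point twice must give back the original. The sketch below shows the idea with a tiny hand-written table; char_mirror is a hypothetical stand-in for u_charMirror(), and the five pairs are a small sample of well-known mirrored pairs, not the full BidiMirroring.txt data.

```python
# A few mirrored pairs: ( ), < >, [ ], { }, and the two guillemets.
SAMPLE_PAIRS = {0x0028: 0x0029, 0x003C: 0x003E, 0x005B: 0x005D,
                0x007B: 0x007D, 0x00AB: 0x00BB}

# Make the table symmetric so that each pair maps in both directions.
MIRROR = {**SAMPLE_PAIRS, **{v: k for k, v in SAMPLE_PAIRS.items()}}


def char_mirror(cp: int, table=MIRROR) -> int:
    """Return the mirrored code point, or cp itself if it has no mirror."""
    return table.get(cp, cp)


def find_round_trip_failures(code_points, table=MIRROR):
    """List code points for which mirror(mirror(cp)) != cp."""
    return [cp for cp in code_points
            if char_mirror(char_mirror(cp, table), table) != cp]


print(find_round_trip_failures(range(0x100)))

# A one-directional table is exactly the kind of data error this catches.
broken = {0x0028: 0x0029}
print([hex(cp) for cp in find_round_trip_failures(range(0x100), broken)])
```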

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
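A search like the one that produced the three hits above can be reproduced with a few lines of script. This is a minimal sketch under stated assumptions: the function names, the list of file suffixes, and the exact output spacing are choices made here, not part of the original procedure, which does not say which search tool was used.

```python
import re
from pathlib import Path

# The regex from the note above: old end 9FA5 or old limit 9FA6, any case.
PATTERN = re.compile(r"9FA[56]", re.IGNORECASE)


def scan_text(name, text, pattern=PATTERN):
    """Return 'name(line): text' for every line of text that matches."""
    return [
        f"{name}({number}): {line.strip()}"
        for number, line in enumerate(text.splitlines(), 1)
        if pattern.search(line)
    ]


def scan_tree(root, suffixes=(".c", ".cpp", ".h", ".txt")):
    """Scan all files under root that have one of the given suffixes."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits += scan_text(str(path), path.read_text(errors="replace"))
    return hits


print(scan_text("gennames.c", "/* ranges */\n        0x4e00, 0x9fa5,\n"))
```

Each hit still has to be reviewed by hand, because several of the matches listed above must deliberately stay unchanged.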

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
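The two format changes above can be pictured as a parser that accepts both the old and the new line shapes and returns the same result. This is a sketch only and not the gennorm.c code: parse_line is a hypothetical name, the sample lines are illustrative, and the mapping of an old "MAYBE" suffix to "M" is an assumption covering the "etc." in the note.

```python
RENAMED = {"FNC": "FC_NFKC"}             # old property name -> new name
QC_VALUES = {"NO": "N", "MAYBE": "M"}    # old name suffix -> new value field


def parse_line(line):
    """Return (code point range, property name, value) for either format."""
    data = line.split("#", 1)[0]         # drop a trailing comment, if any
    fields = [field.strip() for field in data.split(";")]
    if len(fields) == 2:
        # Old single-field form such as "NFD_NO".
        form, _, word = fields[1].rpartition("_")
        if word not in QC_VALUES:
            raise ValueError(f"unexpected property field: {fields[1]!r}")
        return fields[0], form + "_QC", QC_VALUES[word]
    if len(fields) == 3:
        # New two-field form, or a mapping property such as FNC/FC_NFKC.
        return fields[0], RENAMED.get(fields[1], fields[1]), fields[2]
    raise ValueError(f"unexpected number of fields: {line!r}")


print(parse_line("00C0..00C5 ; NFD_NO"))
print(parse_line("00C0..00C5 ; NFD_QC; N # illustrative comment"))
print(parse_line("037A ; FNC; 0020 03B9"))
```

Normalizing both shapes to one tuple keeps the rest of a generator independent of which file version it reads.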

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
