*   Copyright (C) 2016 and later: Unicode, Inc. and others.
*   License & terms of use: http://www.unicode.org/copyright.html
*   Copyright (C) 2004-2016, International Business Machines
*   Corporation and others.  All Rights Reserved.
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

Notes:

This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.

Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)

Syntax of this file:

`***` - section heading
`*` - sub heading
`-` - 1st level bullet
`+` - 2nd level bullet
`~` - 3rd level bullet
`=` - 1st level bullet
`->` - "the previous thing leads to...", OR a 2nd level bullet/item

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.

---------------------------------------------------------------------------- ***

Unicode 16.0 update for ICU 76

https://www.unicode.org/versions/Unicode16.0.0/
https://www.unicode.org/versions/beta-16.0.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-33.html

https://unicode-org.atlassian.net/browse/ICU-22707 Unicode 16
https://unicode-org.atlassian.net/browse/CLDR-17226 BRS Unicode 16

https://github.com/unicode-org/unicodetools/pull/774 delete the RecommendedSetGenerator

https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1

* Command-line environment setup

Markus:

export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt76b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src

Elango:

export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni16/final
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt76b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- icu4c/source/data/makedata.mak
- icu4c/source/common/unicode/uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
- FYI: The option that adds the additional Unicode data files for ICU4J is
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data
- Markus's version:
    cd $ICU_OUT/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ../../src/icu4c/source/runConfigureICU --enable-debug --disable-release Linux/clang --prefix=/usr/local/google/home/mscherer/icu/mine/inst/icu4c > config.out 2>&1 ; tail config.out
- Elango's version (diff default C++ compiler & in-source build paths):
    cd $ICU_OUT/icu4c/source
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data CXXFLAGS="-DU_USING_ICU_NAMESPACE=0 -Wimplicit-fallthrough" CPPFLAGS="-DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1 -fsanitize=bounds" LDFLAGS=-fsanitize=bounds ./runConfigureICU --enable-debug --disable-release Linux/gcc --prefix=/usr/local/google/home/elango/oss/icu/icu4c > config.out 2>&1 ; tail config.out

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
  + subfolders: emoji, idna, security, ucd, uca
  + for pre-release (alpha, beta) data files:
    ~ if one of us produces the alpha.zip or beta.zip collection of data files for publication,
      then we can use its contents directly (no FTP from unicode.org necessary)
    ~ otherwise download from https://www.unicode.org/Public/draft/
    ~ you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders
    ~ you can omit or discard UCD/ucd/Unihan.zip
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + for final-release data files, the source of truth is the files in
    https://www.unicode.org/Public/(version) [=UCD],
    https://www.unicode.org/Public/UCA/(version),
    https://www.unicode.org/Public/idna/(version),
    etc.
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
  or from the UCD/cldr/ output folder of the Unicode Tools:
  From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
  CLDR used modified grapheme break rules.
  This might happen again.
  + To check in the Unicode Tools workspace:
    ~/unitools/mine/Generated$ meld UCD/16.0.0/auxiliary/*GraphemeBreakTest.txt UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt
  + If different, and after copying into CLDR:
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  + We may need CLDR versions of WordBreakTest.txt and LineBreakTest.txt
    unless Unicode 16 and CLDR 46 eliminate their differences:
    unicodetools issue #492

* process and/or copy files
- cd $ICU_SRC/tools/unicode
    py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
    e.g.
    py/preparseucd.py $UNICODE_DATA --only_ppucd /tmp/ppucd.txt

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [('blk', {'Garay', 'Tulu_Tigalari', 'Todhri', 'Sunuwar', 'Egyptian_Hieroglyphs_Ext_A', 'Kirat_Rai', 'Symbols_For_Legacy_Computing_Sup', 'Myanmar_Ext_C', 'Ol_Onal', 'Gurung_Khema'}),
     ('sc', {'Gara', 'Onao', 'Todr', 'Krai', 'Tutg', 'Sunu', 'Gukh'}),
     ('InSC', {'Reordering_Killer'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    (cd $UNIDATA_ROOT && diff -u uni15.1/final/ucd/PropertyValueAliases.txt uni16/alpha/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]')
    +age; 16.0 ; V16_0
    +blk; Egyptian_Hieroglyphs_Ext_A ; Egyptian_Hieroglyphs_Extended_A
    +blk; Garay ; Garay
    +blk; Gurung_Khema ; Gurung_Khema
    +blk; Kirat_Rai ; Kirat_Rai
    +blk; Myanmar_Ext_C ; Myanmar_Extended_C
    +blk; Ol_Onal ; Ol_Onal
    +blk; Sunuwar ; Sunuwar
    +blk; Symbols_For_Legacy_Computing_Sup ; Symbols_For_Legacy_Computing_Supplement
    +blk; Todhri ; Todhri
    +blk; Tulu_Tigalari ; Tulu_Tigalari
    +InSC; Reordering_Killer ; Reordering_Killer
    -jg ; Teh_Marbuta_Goal ; Hamza_On_Heh_Goal
    +jg ; Teh_Marbuta_Goal ; Teh_Marbuta_Goal ; Hamza_On_Heh_Goal
    +sc ; Gara ; Garay
    +sc ; Gukh ; Gurung_Khema
    +sc ; Krai ; Kirat_Rai
    +sc ; Onao ; Ol_Onal
    +sc ; Sunu ; Sunuwar
    +sc ; Todr ; Todhri
    +sc ; Tutg ; Tulu_Tigalari
  + copy new API constants from the preparseucd.py output into the .h/.java files,
    add/adjust comments, wrap lines, and set numeric values
  + (ignore Age: no API constants for that)
  + Block: uchar.h before UBLOCK_COUNT,
    UCharacter.UnicodeBlock IDs, UCharacter.UnicodeBlock objects
  + Script: uscript.h & com.ibm.icu.lang.UScript
  + for new scripts: fix expectedLong names
    in cintltst/cucdapi.c/TestUScriptCodeAPI()
    and in com.ibm.icu.dev.test.lang.TestUScript.java
  + Indic_Syllabic_Category: uchar.h & UCharacter.IndicSyllabicCategory
  + after adding new API constants, run preparseucd.py again

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_OUT/icu4c$ echo;echo; date; make -j20 tests &> out.txt ; tail -n 30 out.txt ; date

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- Some properties are hardcoded in the ICU libraries because they apply to
  few characters or ranges, and are not expected to change often.
  They are tested at least in C++ intltest (e.g., against ppucd.txt).
  If these tests fail, then update the implementation and the tests.
- update CLDR GraphemeBreakTest.txt
  (see the download section above about this file)
    cd ~/unitools/mine/Generated
    cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/16.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/collate/src/test/resources/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/collate/src/test/resources/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc.
  from
    $ICU_SRC/icu4c/source/i18n/collationfcd.cpp
  to
    $ICU_SRC/icu4j/main/collate/src/main/java/com/ibm/icu/impl/coll/CollationFCD.java
- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export TOOLS_ROOT=$ICU_SRC/tools
    export CLDR_DIR=$CLDR_SRC
    export CLDR_DATA_DIR=$CLDR_DIR
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt76b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt76l.dat ./out/icu4j/icudt76b.dat -s ./out/build/icudt76l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt76b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt76b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt76b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt76b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt76b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
    cd $ICU_OUT/icu4c/data/out/icu4j
    cp -v com/ibm/icu/impl/data/icudata/coll/* $ICU_SRC/icu4j/main/collate/src/main/resources/com/ibm/icu/impl/data/icudata/coll
    cp -v com/ibm/icu/impl/data/icudata/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata/brkitr
    cp -v com/ibm/icu/impl/data/icudata/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
    cp -v com/ibm/icu/impl/data/icudata/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata
    cd com/ibm/icu/impl/data/icudata/
    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/icudata";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
    $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.1.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
  -->
    +10D40..10D49 ; Nd # [10] GARAY DIGIT ZERO..GARAY DIGIT NINE
    +116D0..116E3 ; Nd # [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
    +11BF0..11BF9 ; Nd # [10] SUNUWAR DIGIT ZERO..SUNUWAR DIGIT NINE
    +16130..16139 ; Nd # [10] GURUNG KHEMA DIGIT ZERO..GURUNG KHEMA DIGIT NINE
    +16D70..16D79 ; Nd # [10] KIRAT RAI DIGIT ZERO..KIRAT RAI DIGIT NINE
    +1CCF0..1CCF9 ; Nd # [10] OUTLINED DIGIT ZERO..OUTLINED DIGIT NINE
    +1E5F1..1E5FA ; Nd # [10] OL ONAL DIGIT ZERO..OL ONAL DIGIT NINE
  --> https://github.com/unicode-org/cldr/pull/3658

*** merge the Unicode update branch back onto the main branch
- make sure that changes to Unicode tools are checked in:
    https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 15.1 update for ICU 74

https://www.unicode.org/versions/Unicode15.1.0/
https://www.unicode.org/versions/beta-15.1.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-31.html

https://unicode-org.atlassian.net/browse/ICU-22404 Unicode 15.1
https://unicode-org.atlassian.net/browse/CLDR-16669 BRS Unicode 15.1

https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1

* Command-line environment setup

Markus:

export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src

Elango:

export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/snapshot
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_OUT/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + new since Unicode 15.1:
    for the pre-release (alpha, beta) data files,
    download all of https://www.unicode.org/Public/draft/
    (you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders)
  + if one of us produces the alpha.zip or beta.zip collection of data files for publication,
    then we can use its contents directly (no FTP from unicode.org necessary)
  + for final-release data files, the source of truth is the files in
    https://www.unicode.org/Public/(version) [=UCD],
    https://www.unicode.org/Public/UCA/(version),
    https://www.unicode.org/Public/idna/(version),
    etc.
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
  + subfolders: emoji, idna, security, ucd, uca
  + whichever way you download the files:
    ~ inside ucd: extract Unihan.zip to "here" (.../UCD/ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/UCD/ucd/Unihan
    ~ FYI: for updating ICU, we do not actually need Unihan.zip contents
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
  or from the UCD/cldr/ output folder of the Unicode Tools:
  From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
  CLDR used modified grapheme break rules.
  This might happen again.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
  or
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  + Done: figure out whether we need a CLDR version of LineBreakTest.txt:
    unicodetools issue #492
    We should have had one, and instead rbbitst.cpp has a "known issue" exception.
    Unicode 16 and CLDR 46 might get back to having the same behavior.
- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
  + done in ICU 76: modify preparseucd.py to copy this file

* Note: Since Unicode 15.1, data files are no longer published with version suffixes
  even during the alpha or beta.
  Thus we no longer need steps & tools to remove those suffixes.
  (remove this note next time)

* process and/or copy files
- cd $ICU_SRC/tools/unicode
    py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values: [('blk', {'CJK_Ext_I'}), ('lb', {'VF', 'VI', 'AS', 'AK', 'AP'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    cd $UNIDATA_ROOT
    $ diff -u uni15.0/ucd/PropertyValueAliases.txt uni15.1/snapshot/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.1 ; V15_1
    +blk; CJK_Ext_I ; CJK_Unified_Ideographs_Extension_I
    +IDSU; N ; No ; F ; False
    +IDSU; Y ; Yes ; T ; True
    +ID_Compat_Math_Continue; N ; No ; F ; False
    +ID_Compat_Math_Continue; Y ; Yes ; T ; True
    +ID_Compat_Math_Start; N ; No ; F ; False
    +ID_Compat_Math_Start; Y ; Yes ; T ; True
    +lb ; AK ; Aksara
    +lb ; AP ; Aksara_Prebase
    +lb ; AS ; Aksara_Start
    +lb ; VF ; Virama_Final
    +lb ; VI ; Virama
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
     cd $UNIDATA_ROOT
     $ diff -u uni15.0/ucd/Blocks.txt uni15.1/snapshot/UCD/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
     +2EBF0..2EE4F; CJK Unified Ideographs Extension I
     (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new line break values to uchar.h & UCharacter.LineBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_OUT/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
    cd $UNICODE_TOOLS
    mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.tools.RecommendedSetGenerator" \
      -Dexec.args="" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
- copy & paste from the Console output into the .cpp & .java files

* check hardcoded IDS_Unary_Operator
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that it has not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt IDS_Unary_Operator)
- if it has changed, then update the implementation and the tests
- Since ICU 75, this property is tested in C++ intltest against ppucd.txt.
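
The grep check above still needs someone to compare the output against the hardcoded ranges by eye.
A minimal sketch of automating that comparison follows. It assumes the usual UCD line layout
("start..end ; Property # comment"); the helper names property_ranges and check_unchanged are
hypothetical and not part of any ICU or Unicode tool, and the expected ranges have to be taken
from the ICU implementation by whoever runs it.

```python
import re
from pathlib import Path

# Assumed UCD line layout, e.g. "2FFE..2FFF    ; IDS_Unary_Operator # ...":
# one code point or a start..end range, a semicolon, then the property name.
_LINE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")


def property_ranges(text, prop):
    """Return the sorted (start, end) code point ranges listed for prop."""
    ranges = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            # A single code point is treated as a one-element range.
            end = int(m.group(2) or m.group(1), 16)
            ranges.append((start, end))
    return sorted(ranges)


def check_unchanged(proplist_path, prop, expected):
    """Hypothetical helper: True if the file lists exactly the expected ranges."""
    text = Path(proplist_path).read_text(encoding="utf-8")
    return property_ranges(text, prop) == sorted(expected)
```

The same helper should work for any other binary property that lives in PropList.txt,
as long as the caller supplies the ranges that the ICU code currently hardcodes.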

* check hardcoded ID_Compat_Math_Start & ID_Compat_Math_Continue
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that they have not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt ID_Compat_Math)
- if they have changed, then update the implementation and the tests
- Since ICU 75, these properties are tested in C++ intltest against ppucd.txt.

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* Since Unicode 15.1, the UTS #46 data derivation no longer looks at the decompositions (NFD).
  These characters are now just valid, no longer disallowed_STD3_valid.
  Remove special handling of U+2260, U+226E, U+226F (isNonASCIIDisallowedSTD3Valid())
  from uts46.cpp & UTS46.java,
  and special test code from uts46test.cpp & UTS46Test.java.
  (remove this section next time)

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
    copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
     nor with the Google default of /usr/local/buildtools/java/jdk
     [Google security limitations in the XML parser])
    export TOOLS_ROOT=$ICU_SRC/tools
    export CLDR_DIR=$CLDR_SRC
    export CLDR_DATA_DIR=$CLDR_DIR
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt74l.dat ./out/icu4j/icudt74b.dat -s ./out/build/icudt74l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
    cd $ICU_OUT/icu4c/data/out/icu4j
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cd com/ibm/icu/impl/data/$ICUDT/
    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
    $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.1.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-15.txt /tmp/icu/nv4-15.1.txt
    -->
    (empty this time)
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    (empty this time)
  Unicode 15.1:
    (none this time)

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

CLDR 43 root collation update for ICU 73

Partial update only for the root collation.
See
- https://unicode-org.atlassian.net/browse/CLDR-15946
  Treat quote marks as equivalent when strength=UCOL_PRIMARY
- https://github.com/unicode-org/cldr/pull/2691
  CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
- https://github.com/unicode-org/cldr/pull/2833
  CLDR-15946 make fancy quotes secondary-different from each other

The related changes to tailorings were already integrated in an earlier PR for
https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.

This update is for the root collation,
which is handled by different tools than the locale data updates.

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt73b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Configure: Build Unicode data for ICU4J
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- new for ICU 73: also copy the binary data files directly into the ICU4J tree
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 15.0 update for ICU 72

https://www.unicode.org/versions/Unicode15.0.0/
https://www.unicode.org/versions/beta-15.0.0.html
https://www.unicode.org/Public/15.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-29.html

https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt72b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + old way of fetching files: from the "Public" area on unicode.org
    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + new way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.0 ; V15_0
    +blk; Arabic_Ext_C ; Arabic_Extended_C
    +blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
    +blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
    +blk; Devanagari_Ext_A ; Devanagari_Extended_A
    +blk; Kaktovik_Numerals ; Kaktovik_Numerals
    +blk; Kawi ; Kawi
    +blk; Nag_Mundari ; Nag_Mundari
    +sc ; Kawi ; Kawi
    +sc ; Nagm ; Nag_Mundari
  -> add new blocks to uchar.h before UBLOCK_COUNT
    use long property names for enum constants,
    for the trailing comment get the block start code point: diff old & new Blocks.txt
    ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
    +10EC0..10EFF; Arabic Extended-C
    +11B00..11B5F; Devanagari Extended-A
    +11F00..11F5F; Kawi
    -13430..1343F; Egyptian Hieroglyph Format Controls
    +13430..1345F; Egyptian Hieroglyph Format Controls
    +1D2C0..1D2DF; Kaktovik Numerals
    +1E030..1E08F; Cyrillic Extended-D
    +1E4D0..1E4FF; Nag Mundari
    +31350..323AF; CJK Unified Ideographs Extension H
    (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
    Eclipse find    UBLOCK_([^ ]+) = ([0-9]+), (/.+)
            replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
    Eclipse find    UBLOCK_([^ ]+) = [0-9]+, (/.+)
            replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
    Eclipse find    USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
            replace public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
    and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

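The three Eclipse find/replace pairs listed under "new constants for new property values" can also be applied outside of Eclipse. The following is a minimal sketch in Python, assuming that the C enum lines have the shape that the find patterns expect. The helper name to_java and the sample lines are made up for illustration, and the numeric values in the samples are placeholders, not the real constants.

```python
import re

# (pattern, replacement) pairs copied from the Eclipse find/replace steps;
# Python's re module accepts the same \1, \2, \3 group references.
BLOCK_ID = (r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
            r"public static final int \1_ID = \2; \3")
BLOCK_OBJECT = (r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')
SCRIPT_CODE = (r"USCRIPT_([^ ]+) *= ([0-9]+),(/.+)",
               r"public static final int \1 = \2; \3")


def to_java(c_enum_line, rule):
    """Rewrite one C enum line (hypothetical helper, not part of the ICU tools)."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, c_enum_line.strip())


# Sample input lines; the numeric values are placeholders, not the real constants.
block_line = "    UBLOCK_KAWI = 999, /*[11F00]*/"
script_line = "      USCRIPT_KAWI = 999,/* Kawi */"

print(to_java(block_line, BLOCK_ID))
print(to_java(block_line, BLOCK_OBJECT))
print(to_java(script_line, SCRIPT_CODE))
```

The output lines still need to be pasted into UCharacter.UnicodeBlock and UScript by hand, and lines that do not match a pattern pass through unchanged, so it is worth checking the result before pasting.
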
* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
    ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
- Unicode 6.0..15.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
  (no rule changes in Unicode 15)
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
    copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
     nor with the Google default of /usr/local/buildtools/java/jdk
     [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
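
The "refresh Java test .txt files" step above is a fixed list of cp commands, so it could be scripted. The following is a minimal sketch in Python, assuming the source-tree layout used in this section (Unicode 15.0 / ICU 72). The helper name refresh_java_test_files is made up for illustration and is not part of the ICU tools. Unlike the plain cp commands, it checks that all input files exist before it copies anything.

```python
import shutil
from pathlib import Path

# File lists taken from the cp commands in "refresh Java test .txt files".
UNIDATA_FILES = [
    "confusables.txt", "confusablesWholeScript.txt",
    "NormalizationCorrections.txt", "NormalizationTest.txt",
    "SpecialCasing.txt", "UnicodeData.txt",
]
TESTDATA_FILES = ["BidiCharacterTest.txt", "BidiTest.txt", "IdnaTestV2.txt"]
# Destination inside the ICU4J tree as used in this section;
# other sections of this log use a different folder.
JAVA_TEST_DATA = "icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode"


def refresh_java_test_files(icu_src, unicode_data, dest_rel=JAVA_TEST_DATA):
    """Copy the Unicode test .txt files into the ICU4J test data folder.

    Hypothetical helper: verifies every source file first, so that a missing
    file does not leave a half-refreshed folder. Returns the copied paths.
    """
    icu_src = Path(icu_src)
    dest = icu_src / dest_rel
    sources = [icu_src / "icu4c/source/data/unidata" / name for name in UNIDATA_FILES]
    sources += [icu_src / "icu4c/source/test/testdata" / name for name in TESTDATA_FILES]
    sources.append(Path(unicode_data) / "ucd" / "CompositionExclusions.txt")

    missing = [str(src) for src in sources if not src.is_file()]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    if not dest.is_dir():
        raise NotADirectoryError(str(dest))

    return [Path(shutil.copy2(src, dest / src.name)) for src in sources]
```

It would be called with the values of $ICU_SRC and $UNICODE_DATA as the two arguments; pass a different dest_rel for a tree that keeps the test data elsewhere.
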

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
    -->
    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
    +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
  Unicode 15:
    kawi 11F50..11F59 Kawi
    nagm 1E4F0..1E4F9 Nag Mundari
    https://github.com/unicode-org/cldr/pull/2041

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 14.0 update for ICU 70

https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/versions/beta-14.0.0.html
https://www.unicode.org/Public/14.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-27.html

https://unicode-org.atlassian.net/browse/CLDR-14801
https://unicode-org.atlassian.net/browse/ICU-21635

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni14/20210903
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt70b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
     (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
     (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 14.0 ; V14_0
    +blk; Arabic_Ext_B ; Arabic_Extended_B
    +blk; Cypro_Minoan ; Cypro_Minoan
    +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B
    +blk; Kana_Ext_B ; Kana_Extended_B
    +blk; Latin_Ext_F ; Latin_Extended_F
    +blk; Latin_Ext_G ; Latin_Extended_G
    +blk; Old_Uyghur ; Old_Uyghur
    +blk; Tangsa ; Tangsa
    +blk; Toto ; Toto
    +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
    +blk; Vithkuqi ; Vithkuqi
    +blk; Znamenny_Music ; Znamenny_Musical_Notation
    +jg ; Thin_Yeh ; Thin_Yeh
    +jg ; Vertical_Tail ; Vertical_Tail
    +sc ; Cpmn ; Cypro_Minoan
    +sc ; Ougr ; Old_Uyghur
    +sc ; Tnsa ; Tangsa
    +sc ; Toto ; Toto
    +sc ; Vith ; Vithkuqi
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
    ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
    +0870..089F; Arabic Extended-B
    +10570..105BF; Vithkuqi
    +10780..107BF; Latin Extended-F
    +10F70..10FAF; Old Uyghur
    -11700..1173F; Ahom
    +11700..1174F; Ahom
    +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
    +12F90..12FFF; Cypro-Minoan
    +16A70..16ACF; Tangsa
    -18D00..18D8F; Tangut Supplement
    +18D00..18D7F; Tangut Supplement
    +1AFF0..1AFFF; Kana Extended-B
    +1CF00..1CFCF; Znamenny Musical Notation
    +1DF00..1DFFF; Latin Extended-G
    +1E290..1E2BF; Toto
    +1E7E0..1E7FF; Ethiopic Extended-B
    (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
             replace public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> add new joining groups to uchar.h & UCharacter.JoiningGroup

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..14.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
     nor with the Google default of /usr/local/buildtools/java/jdk
     [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
    -->
    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  Unicode 14:
    tnsa 16AC0..16AC9 Tangsa
    https://github.com/unicode-org/cldr/pull/1326

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                   u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
     (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
     (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chorasmian ; Chorasmian
    blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G
    blk; Dives_Akuru ; Dives_Akuru
    blk; Khitan_Small_Script ; Khitan_Small_Script
    blk; Lisu_Sup ; Lisu_Supplement
    blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
    blk; Tangut_Sup ; Tangut_Supplement
    blk; Yezidi ; Yezidi
  -> add to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Chrs ; Chorasmian
    sc ; Diak ; Dives_Akuru
    sc ; Kits ; Khitan_Small_Script
    sc ; Yezi ; Yezidi
  -> uscript.h & com.ibm.icu.lang.UScript
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

    InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
  -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format :
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
    $ICU_ROOT/dbg/tools/unicode/c$
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
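
Note on the "update Java data files" step above: the selective copy
(confusables.cfu, *.icu without cnvalias.icu, *.nrm, coll/*, brkitr/*) could
also be scripted. The following is only a minimal Python sketch of that
selection, not part of the ICU tools. The function name copy_unicode_data and
its two directory arguments are made up for illustration; the file patterns
are the ones from the cp/rm command lines in this log.

```python
import shutil
from pathlib import Path

# Patterns taken from the cp/rm command lines in this log.
TOP_LEVEL_PATTERNS = ("confusables.cfu", "*.icu", "*.nrm")
SUBFOLDERS = ("coll", "brkitr")
EXCLUDED = {"cnvalias.icu"}  # the log copies *.icu and then removes this file


def copy_unicode_data(src_dir, dst_dir):
    """Copy only the Unicode-related data files from src_dir to dst_dir.

    Hypothetical helper: src_dir stands for .../out/icu4j/com/ibm/icu/impl/data/$ICUDT,
    dst_dir for /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT.
    Returns the sorted list of copied paths, relative to dst_dir.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in TOP_LEVEL_PATTERNS:
        for path in sorted(src.glob(pattern)):
            if path.is_file() and path.name not in EXCLUDED:
                shutil.copy2(path, dst / path.name)
                copied.append(path.name)
    for sub in SUBFOLDERS:
        (dst / sub).mkdir(parents=True, exist_ok=True)
        # Like "cp dir/*" without -r, this skips nested directories.
        for path in sorted((src / sub).glob("*")):
            if path.is_file():
                shutil.copy2(path, dst / sub / path.name)
                copied.append(f"{sub}/{path.name}")
    return sorted(copied)
```

The "jar uvf" command from the log would still be run afterwards; the sketch
only replaces the mkdir/cp/rm lines.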

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
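
Note on the CLDR numbering systems check used in this log
(egrep ';gc=Nd.+;nv=4' on ppucd.txt): the matching lines give the digit FOUR of
each decimal digit set, and the range listed for CLDR is then start..start+9.
The following Python sketch shows one possible way to derive the ranges. It is
not part of the ICU tools. It assumes that ppucd.txt code point lines have the
form "cp;XXXX;prop=value;..." and that each digit set is a contiguous run of
ten code points; the function names are made up for illustration.

```python
import re

# Same pattern as the egrep command line in this log.
DIGIT_FOUR_RE = re.compile(r";gc=Nd.+;nv=4")


def find_digit_ranges(lines):
    """Return (start, end) code point pairs, one per decimal digit set.

    Assumption: lines look like "cp;11954;gc=Nd;...;nv=4;..." and the digit
    with nv=4 sits at start + 4 of a contiguous 0..9 run.
    """
    ranges = []
    for line in lines:
        if not line.startswith("cp;") or not DIGIT_FOUR_RE.search(line):
            continue
        # Second field is the code point (take the start if it is a range).
        cp_field = line.split(";")[1].split("..")[0]
        digit_four = int(cp_field, 16)
        ranges.append((digit_four - 4, digit_four + 5))
    return ranges


def format_range(start, end):
    """Format a range the way this log lists it, for example 11950..11959."""
    return f"{start:04X}..{end:04X}"
```

The resulting ranges still need to be compared against the new blocks in
Blocks.txt by hand, as described in this log.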

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
                   u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
                   u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
     (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic                            ; Elymaic
    blk; Nandinagari                        ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong             ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers              ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext                     ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A      ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup                          ; Tamil_Supplement
    blk; Wancho                             ; Wancho
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- update test of default bidi classes:
  Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
  see diffs in DerivedBidiClass.txt
  + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
  + UCharacterTest.java TestIteration() defaultBidi[]
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
    ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
        (add the character to genuca.cpp sampleCharsToScripts[])
  + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
    and cache its values.
    Works as long as the script metadata is updated before the collation data.
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt63l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
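
Note on the new UnicodeBlock constants: the Eclipse find/replace patterns
listed under "new constants for new property values" above can be tried out
outside the IDE. The following Python sketch applies the same two patterns
with re.sub. It is only an illustration, not part of the ICU tools; the
function names are made up, and the sample line in the usage note is
illustrative rather than copied from uchar.h. The second find pattern has only
two capture groups, so the sketch refers to the trailing comment as \2 there.

```python
import re

# Patterns as given for Eclipse in this log.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

# Only two groups here: the numeric value is matched but not captured.
OBJECT_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJECT_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'
)


def to_block_id(uchar_h_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock *_ID constant."""
    return ID_FIND.sub(ID_REPLACE, uchar_h_line)


def to_block_object(uchar_h_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object constant."""
    return OBJECT_FIND.sub(OBJECT_REPLACE, uchar_h_line)
```

For example, to_block_id("UBLOCK_ELYMAIC = 291, /*[10FE0]*/") yields a line of
the form "public static final int ELYMAIC_ID = 291; /*[10FE0]*/". Lines that do
not match the pattern are returned unchanged.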

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

ICU 63 addition of ICU support of text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
  + uchar.h: update UCHAR_INT_LIMIT
  + uchar.h: add the enum U<long prop name>
    with constants U_<short prop name>_<long value name>
  + UProperty.java: add the constant <long prop name>
  + UProperty.java: update INT_LIMIT
  + UCharacter.java: add the interface <long prop name>
    with constants <long value name>

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
    names and aliases.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* preparseucd.py changes
- add new property short names (uppercase) to _prop_and_value_re
  so that ParseUCharHeader() parses the new enum constants

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
2558set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c) 2559 2560 $ICU_ROOT/dbg$ 2561 mkdir -p tools/unicode/c 2562 cd tools/unicode/c 2563 2564 $ICU_ROOT/dbg/tools/unicode/c$ 2565 cmake ../../../../../src/tools/unicode/c 2566 make 2567 2568* generate core properties data files 2569 $ICU_ROOT/dbg/tools/unicode/c$ 2570 genprops/genprops $ICU_SRC/icu4c 2571- rebuild ICU (make install) & tools 2572 2573* write data for runtime, hardcoded for now 2574- add genprops/layoutpropsbuilder.cpp with pieces from sibling files 2575- generate new icu4c/source/common/ulayout_props_data.h 2576- for each of the three new enumerated properties 2577 + int property max value 2578 + small, 8-bit UCPTrie 2579 (A small 16-bit trie with bit fields for these three properties 2580 is very nearly the same size as the sum of the three.) 2581 2582* wire into C++ 2583- uprops.cpp: #include ulayout_props_data.h 2584- uprops.cpp: add getInPC() etc. functions 2585- uprops.cpp: add lines to intProps[], include max values 2586- uprops.h: add UPropertySource constants 2587- uprops.cpp: add uprops_addPropertyStarts(src) 2588- uniset_props.cpp: add to UnicodeSet_initInclusion() 2589- intltest/ucdtest.cpp: write unit tests 2590 2591* update Java data files 2592- refresh just the pnames.icu file with the new property [value] names, just to be safe 2593- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt 2594- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2595- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2596- copy the big-endian Unicode data files to another location, 2597 separate from the other data files, 2598 and then refresh ICU4J 2599 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 2600 cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2601 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2602 2603* wire into Java 2604- UCharacterProperty.java: add new SRC_INPC etc. 
constants as in C++ 2605- UCharacterProperty.java: for each new property 2606 + create a nested class to hold its CodePointTrie 2607 + initialize it from a string literal 2608 + paste in the initializer printed by genprops 2609 + add a new IntProperty object to the intProps[] array 2610 + use the correct max int value for each property, also printed by genprops 2611- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) 2612- UnicodeSet.java: add to getInclusions() 2613- UCharacterTest.java: write unit tests 2614 2615---------------------------------------------------------------------------- *** 2616 2617Unicode 11.0 update for ICU 62 2618 2619http://www.unicode.org/versions/Unicode11.0.0/ 2620http://unicode.org/versions/beta-11.0.0.html 2621https://www.unicode.org/review/pri372/ 2622http://www.unicode.org/reports/uax-proposed-updates.html 2623http://www.unicode.org/reports/tr44/tr44-21.html 2624 2625* Command-line environment setup 2626 2627UNICODE_DATA=~/unidata/uni11/20180521 2628CLDR_SRC=~/svn.cldr/uni 2629ICU_ROOT=~/svn.icu/uni 2630ICU_SRC=$ICU_ROOT/src 2631ICUDT=icudt61b 2632ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 2633ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 2634export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 2635 2636*** ICU Trac 2637 2638- ticket:13630: Unicode 11 2639- ^/branches/markus/uni11 2640 2641*** CLDR Trac 2642 2643- cldrbug 10978: Unicode 11 2644- ^/branches/markus/uni11 2645 2646*** Unicode version numbers 2647- makedata.mak 2648- uchar.h 2649- com.ibm.icu.util.VersionInfo 2650- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 2651 2652- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 2653 so that the makefiles see the new version number. 
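
Before pasting the command lines in the following sections, it can help to confirm that the variables from the "Command-line environment setup" above are set in the current console window. A minimal sketch; check_vars is a hypothetical helper, not part of ICU or the Unicode tools.

```shell
# Hypothetical helper: print "ok" if every named variable is set and
# non-empty in this shell, otherwise list the missing names and return 1.
check_vars() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable called $name; empty if unset.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Variable names as used in this section.
check_vars UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT \
    ICU4C_DATA_IN ICU4C_UNIDATA LD_LIBRARY_PATH \
    || echo "set the variables from the environment setup first"
```

Only LD_LIBRARY_PATH is exported in the setup above; the other variables are plain shell variables, which is enough because they are expanded on the command line before the tools run.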

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- fix other errors
    NameError: unknown property Extended_Pictographic
  -> add Extended_Pictographic binary property
  -> add new short names for all Emoji properties

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
                     u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
                     u'Indic_Siyaq_Numbers'])),
       (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
       (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
       (u'GCB', set([u'LinkC', u'Virama'])),
       (u'WB', set([u'WSegSpace']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chess_Symbols ; Chess_Symbols
    blk; Dogra ; Dogra
    blk; Georgian_Ext ; Georgian_Extended
    blk; Gunjala_Gondi ; Gunjala_Gondi
    blk; Hanifi_Rohingya ; Hanifi_Rohingya
    blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
    blk; Makasar ; Makasar
    blk; Mayan_Numerals ; Mayan_Numerals
    blk; Medefaidrin ; Medefaidrin
    blk; Old_Sogdian ; Old_Sogdian
    blk; Sogdian ; Sogdian
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    GCB; LinkC ; LinkingConsonant
    GCB; Virama ; Virama
    -> uchar.h & UCharacter.GraphemeClusterBreak
    -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76

    InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
    -> ignore: ICU does not yet support this property

    jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
    jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
    -> uchar.h & UCharacter.JoiningGroup

    sc ; Dogr ; Dogra
    sc ; Gong ; Gunjala_Gondi
    sc ; Maka ; Makasar
    sc ; Medf ; Medefaidrin
    sc ; Rohg ; Hanifi_Rohingya
    sc ; Sogd ; Sogdian
    sc ; Sogo ; Old_Sogdian
    -> uscript.h & com.ibm.icu.lang.UScript
    -> Nushu had been added already
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

    WB ; WSegSpace ; WSegSpace
    -> uchar.h & UCharacter.WordBreak

* New short names for emoji properties
- see UTS #51
- short names set in preparseucd.py

* New properties
- boolean emoji property Extended_Pictographic
  -> added in preparseucd.py
  -> uchar.h & UProperty.java
- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
  as shown in PropertyValueAliases.txt
  -> ignore for now

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* Fix case props
    genprops error: casepropsbuilder: too many exceptions words
    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
- With the addition of Georgian Mtavruli capital letters,
  there are now too many simple case mappings with big mapping deltas
  that yield uncompressible exceptions.
- Changing the data structure (now formatVersion 4),
  adding one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Further changes to gain one more bit for the exceptions index,
  for future growth. Details see casepropsbuilder.cpp.
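
To make "big mapping deltas" concrete, the arithmetic below computes the distance between a few lowercase/uppercase pairs. This is only an illustration: the MIN_DELTA/MAX_DELTA limits are assumptions chosen for the example, not values taken from this log or from casepropsbuilder.cpp, and delta_fits is a hypothetical helper.

```shell
# Assumed, illustrative limits for a small signed per-character delta field.
# The real field width is defined by the ucase data format, not by this sketch.
MIN_DELTA=-256
MAX_DELTA=255

# Print (to - from) and whether that delta fits the assumed small field.
# Arguments are code points; hex constants such as 0x10D0 are accepted.
delta_fits() {
    delta=$(( $2 - $1 ))
    if [ "$delta" -ge "$MIN_DELTA" ] && [ "$delta" -le "$MAX_DELTA" ]; then
        echo "$delta fits"
    else
        echo "$delta needs-exception"
    fi
}

delta_fits 0x0061 0x0041   # Latin a -> A:                   -32 fits
delta_fits 0x10D0 0x1C90   # Georgian an -> Mtavruli an:     3008 needs-exception
delta_fits 0xAB70 0x13A0   # Cherokee small a -> Cherokee A: -38864 needs-exception
```

Under these assumed limits, every Georgian and Cherokee pair falls outside the small field, which is the kind of growth in exceptions that the format change described above addresses.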

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..11.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

- Errors in char.txt, word.txt, word_POSIX.txt like
    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
     not empty, just to get ICU building.
  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
     and properties together with the rules that used them (GB 10, WB 14).
  -> Andy adjusts the rule sets further to sync with
     Unicode 11 grapheme, word, and line break spec changes.
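
The grep for "disallowed_STD3_valid" on non-ASCII characters, from the uts46test.cpp step above, can be sketched as follows. It assumes the IdnaMappingTable.txt layout: the code point or range comes first on the line, in uppercase hex with at least four digits. The function name is hypothetical, and the file location under $UNICODE_DATA/idna is an assumption based on the download subfolders listed earlier.

```shell
# Hypothetical helper: print the lines of an IdnaMappingTable.txt-style file
# whose status is disallowed_STD3_valid and whose first code point is above
# U+007F. Lines starting with 0000..007F are filtered out.
non_ascii_std3_valid() {
    grep 'disallowed_STD3_valid' "$1" | grep -v '^00[0-7][0-9A-F][^0-9A-F]'
}

# Run against the downloaded file if it is present (assumed location).
idna_table="${UNICODE_DATA:-}/idna/IdnaMappingTable.txt"
if [ -f "$idna_table" ]; then
    non_ascii_std3_valid "$idna_table"
fi
```

Any code point printed beyond U+2260, U+226E, and U+226F would be a candidate for the test files.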

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=11.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt61l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
    Unicode 11: using Unicode 11 CLDR ticket #10978
      rohg 10D30..10D39 Hanifi_Rohingya
      gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
      Unicode 10: http://unicode.org/cldr/trac/ticket/10219
      Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
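
A quick way to see which Unicode version uchar.h currently declares, before running "configure". A minimal sketch, assuming the header defines the version as a quoted U_UNICODE_VERSION macro on a single line; unicode_version_in_header is a hypothetical helper, not an ICU tool.

```shell
# Hypothetical helper: print the quoted value of U_UNICODE_VERSION from a
# uchar.h-style header. Prints nothing if no such #define line is found.
unicode_version_in_header() {
    sed -n 's/^#define U_UNICODE_VERSION "\(.*\)"$/\1/p' "$1"
}

# Check the header in the source tree, if the path exists in this setup.
uchar_h="${ICU_SRC:-}/icu4c/source/common/unicode/uchar.h"
if [ -f "$uchar_h" ]; then
    unicode_version_in_header "$uchar_h"
fi
```

The other three places in the list above (makedata.mak, VersionInfo, UCharacterTest.VERSION_) still need to be checked by hand.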

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
  -> adjust _scripts_only_in_iso15924 as indicated
- fix other errors
    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
  -> add vo=Vertical_Orientation to _ignored_properties
  -> later removed again, parsing the file, even though we do not yet store data for runtime use

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
                     u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
       (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
                    u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
                    u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
       (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
    blk; Kana_Ext_A ; Kana_Extended_A
    blk; Masaram_Gondi ; Masaram_Gondi
    blk; Nushu ; Nushu
    blk; Soyombo ; Soyombo
    blk; Syriac_Sup ; Syriac_Supplement
    blk; Zanabazar_Square ; Zanabazar_Square
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    jg ; Malayalam_Bha ; Malayalam_Bha
    jg ; Malayalam_Ja ; Malayalam_Ja
    jg ; Malayalam_Lla ; Malayalam_Lla
    jg ; Malayalam_Llla ; Malayalam_Llla
    jg ; Malayalam_Nga ; Malayalam_Nga
    jg ; Malayalam_Nna ; Malayalam_Nna
    jg ; Malayalam_Nnna ; Malayalam_Nnna
    jg ; Malayalam_Nya ; Malayalam_Nya
    jg ; Malayalam_Ra ; Malayalam_Ra
    jg ; Malayalam_Ssa ; Malayalam_Ssa
    jg ; Malayalam_Tta ; Malayalam_Tta
    -> uchar.h & UCharacter.JoiningGroup

    sc ; Gonm ; Masaram_Gondi
    sc ; Nshu ; Nushu
    sc ; Soyo ; Soyombo
    sc ; Zanb ; Zanabazar_Square
    -> uscript.h & com.ibm.icu.lang.UScript
    -> Nushu had been added already
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* New properties as shown in PropertyValueAliases.txt changes
- boolean Emoji_Component from emoji 5
  -> uchar.h & UProperty.java
- boolean
    # Regional_Indicator (RI)

    RI ; N ; No ; F ; False
    RI ; Y ; Yes ; T ; True
  -> uchar.h & UProperty.java
  -> single immutable range, to be hardcoded
- boolean
    # Prepended_Concatenation_Mark (PCM)

    PCM; N ; No ; F ; False
    PCM; Y ; Yes ; T ; True
  -> was new in Unicode 9
  -> uchar.h & UProperty.java
- enumerated
    # Vertical_Orientation (vo)

    vo ; R ; Rotated
    vo ; Tr ; Transformed_Rotated
    vo ; Tu ; Transformed_Upright
    vo ; U ; Upright
  -> only pre-parsed for now, but not yet stored for runtime use

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..10.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
    -DUVERSION=10.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt60l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
- copy the big-endian Unicode data
files to another location, 3310 separate from the other data files, 3311 and then refresh ICU4J 3312 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 3313 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 3314 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3315 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3316 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3317 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 3318 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3319 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 3320 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3321 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 3322 3323* When refreshing all of ICU4J data from ICU4C 3324- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3325- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 3326or 3327- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 3328 3329* update CollationFCD.java 3330 + copy & paste the initializers of lcccIndex[] etc. 
from 3331 ICU4C/source/i18n/collationfcd.cpp to 3332 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 3333 3334* refresh Java test .txt files 3335- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 3336 cd $ICU_SRC/icu4c/source/data/unidata 3337 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3338 cd ../../test/testdata 3339 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3340 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3341 3342* run & fix ICU4J tests 3343 3344*** API additions 3345- send notice to icu-design about new born-@stable API (enum constants etc.) 3346 3347*** CLDR numbering systems 3348- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket 3349 Unicode 10: http://unicode.org/cldr/trac/ticket/10219 3350 Unicode 9: http://unicode.org/cldr/trac/ticket/9692 3351 3352*** merge the Unicode update branches back onto the trunk 3353- do not merge the icudata.jar and testdata.jar, 3354 instead rebuild them from merged & tested ICU4C 3355- make sure that changes to Unicode tools are checked in: 3356 http://www.unicode.org/utility/trac/log/trunk/unicodetools 3357 3358---------------------------------------------------------------------------- *** 3359 3360Emoji 5.0 update for ICU 59 3361- ICU 59 mostly remains on Unicode 9.0 3362- except updates bidi and segmentation data to Unicode 10 beta 3363 3364First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg. 

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
ICUDT=icudt59b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

*** ICU Trac

- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
- changes directly on trunk

*** data files & enums & parser code

* download files

- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
- download emoji 5.0 beta files into the same uni90e50 folder
- download Unicode 10.0 beta files: ucd
  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
    BidiBrackets.txt
    BidiCharacterTest.txt
    BidiMirroring.txt
    BidiTest.txt
    extracted/DerivedBidiClass.txt
  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
    LineBreak.txt
    auxiliary/*

* preparseucd.py changes
- adjust for combined trunks
- write new copyright lines
- ignore new Emoji_Component property for now

* process and/or copy files
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)

    ~/svn.icu/trunk/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    ~/svn.icu/trunk/dbg/tools/unicode/c$
    genprops/genprops $ICU4C_SRC_DIR
- rebuild ICU (make install) & tools

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt59l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
or
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU4C_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

---------------------------------------------------------------------------- ***

Unicode 9.0 update for ICU 58

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
  in ppucd.txt:
    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
  ->
    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.
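
Note on the Tangut algnamesrange line under "preparseucd.py changes" above:
the line can be read as a rule for deriving character names.
The following is a minimal Python sketch, not code from preparseucd.py.
It assumes that the "han" type means "prefix followed by the code point in
uppercase hex"; the function name algorithmic_name is hypothetical.

```python
def algorithmic_name(range_line, code_point):
    """Return the algorithmic name for code_point, or None if out of range.

    Assumption: for type "han", the name is the prefix plus the code point
    in uppercase hexadecimal (at least 4 digits).
    """
    record, first_last, name_type, prefix = range_line.split(";")
    if record != "algnamesrange" or name_type != "han":
        raise ValueError("unsupported line: " + range_line)
    first, last = (int(cp, 16) for cp in first_last.split(".."))
    if not first <= code_point <= last:
        return None
    return "%s%04X" % (prefix, code_point)


# The corrected ppucd.txt line from this log.
line = "algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-"
print(algorithmic_name(line, 0x17000))  # prints TANGUT IDEOGRAPH-17000
```

With the old prefix the same code point would have been named
CJK UNIFIED IDEOGRAPH-17000, which is why the prefix had to change.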

* PropertyValueAliases.txt new property values
  blk; Adlam               ; Adlam
  blk; Bhaiksuki           ; Bhaiksuki
  blk; Cyrillic_Ext_C      ; Cyrillic_Extended_C
  blk; Glagolitic_Sup      ; Glagolitic_Supplement
  blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
  blk; Marchen             ; Marchen
  blk; Mongolian_Sup       ; Mongolian_Supplement
  blk; Newa                ; Newa
  blk; Osage               ; Osage
  blk; Tangut              ; Tangut
  blk; Tangut_Components   ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

  GCB; EB  ; E_Base
  GCB; EBG ; E_Base_GAZ
  GCB; EM  ; E_Modifier
  GCB; GAZ ; Glue_After_Zwj
  GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

  jg ; African_Feh  ; African_Feh
  jg ; African_Noon ; African_Noon
  jg ; African_Qaf  ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

  lb ; EB  ; E_Base
  lb ; EM  ; E_Modifier
  lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

  sc ; Adlm ; Adlam
  sc ; Bhks ; Bhaiksuki
  sc ; Marc ; Marchen
  sc ; Newa ; Newa
  sc ; Osge ; Osage
  sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

  WB ; EB  ; E_Base
  WB ; EBG ; E_Base_GAZ
  WB ; EM  ; E_Modifier
  WB ; GAZ ; Glue_After_Zwj
  WB ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
    cmake ../../../src/unicode/c
    make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
    genprops/genprops $ICU_SRC_DIR
    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
        FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
        (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
  http://www.unicode.org/utility/trac/log/trunk/unicodetools
  http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
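
The find/replace pair for com.ibm.icu.lang.UScript above is a plain regular
expression, so it can be tried outside of the editor before applying it.
A minimal Python sketch, assuming a sample uscript.h line written here for
illustration (the numeric value is not taken from this log);
to_java_constant is a hypothetical helper name.

```python
import re

# Same pattern and replacement as in the find/replace step.
FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACE = r"public static final int \1 = \2; \3"


def to_java_constant(c_enum_line):
    """Rewrite one C enum line into a Java int constant declaration."""
    return FIND.sub(REPLACE, c_enum_line)


# Illustrative input line; check the real value in uscript.h.
sample = "USCRIPT_ADLAM = 167,/* Adlm */"
print(to_java_constant(sample))
# prints: public static final int ADLAM = 167; /* Adlm */
```

Lines that do not match the pattern (for example, a last enum constant
without a trailing comma) pass through unchanged and need manual editing.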
3864 3865ICU_ROOT=~/svn.icu/trunk 3866ICU_SRC_DIR=$ICU_ROOT/src 3867ICUDT=icudt57b 3868export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib 3869SRC_DATA_IN=$ICU_SRC_DIR/source/data/in 3870UNIDATA=$ICU_SRC_DIR/source/data/unidata 3871 3872Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files, 3873see https://unicode-org.atlassian.net/browse/ICU-12141 3874 3875make install, then icutools cmake & make, then 3876~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR 3877 3878Generate Java data as usual, only update pnames.icu & uprops.icu. 3879 3880*** LayoutEngine script information 3881 3882* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 3883 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 3884 in the working directory. 3885 3886 (It also generates ScriptRunData.cpp, which is no longer needed.) 3887 3888 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages 3889 (a plain text file) 3890 which maps ICU versions to the numbers of script/language constants 3891 that were added then. 3892 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) 3893 3894 The generated files have a current copyright date and "@deprecated" statement. 3895 3896* Review changes, fix Java tool if necessary, and copy to ICU4C 3897 cd ~/svn.icu4j/trunk/src 3898 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 3899 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout 3900 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout 3901 3902---------------------------------------------------------------------------- *** 3903 3904Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802 3905 3906Edit preparseucd.py to add & parse new properties. 
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.
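
As a rough illustration of what "add & parse new properties" involves, the sketch below reads binary-property lines in the usual UCD layout ("range ; property # comment"). It assumes that emoji-data.txt follows this layout; it is not the code in preparseucd.py, the function name is hypothetical, and the sample lines are illustrative rather than copied from the data file.

```python
from collections import defaultdict


def parse_binary_props(lines):
    """Return {property_name: [(start, end), ...]} from UCD-style data lines."""
    props = defaultdict(list)
    for line in lines:
        # Drop trailing comments and skip empty or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        rng, name = fields[0], fields[1]
        if ".." in rng:
            start, end = rng.split("..")
        else:
            start = end = rng
        props[name].append((int(start, 16), int(end, 16)))
    return dict(props)


# Illustrative sample lines, assuming the UCD-style layout.
SAMPLE = """\
# sample data
0023          ; Emoji                # single code point
1F3FB..1F3FF  ; Emoji_Modifier       # code point range
""".splitlines()

for name, ranges in parse_binary_props(SAMPLE).items():
    print(name, [(hex(s), hex(e)) for s, e in ranges])
```

Each property name found this way still needs a matching constant in uchar.h and UProperty.java before genprops can store it.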

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
  IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (fractions not in lowest terms)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 becomes 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
    blk; Ahom                        ; Ahom
    blk; Anatolian_Hieroglyphs       ; Anatolian_Hieroglyphs
    blk; Cherokee_Sup                ; Cherokee_Supplement
    blk; CJK_Ext_E                   ; CJK_Unified_Ideographs_Extension_E
    blk; Early_Dynastic_Cuneiform    ; Early_Dynastic_Cuneiform
    blk; Hatran                      ; Hatran
    blk; Multani                     ; Multani
    blk; Old_Hungarian               ; Old_Hungarian
    blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
    blk; Sutton_SignWriting          ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
    sc ; Ahom ; Ahom
    sc ; Hatr ; Hatran
    sc ; Hluw ; Anatolian_Hieroglyphs
    sc ; Hung ; Old_Hungarian
    sc ; Mult ; Multani
    sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
    ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
- nothing new in 8.0, no test file to update

* run & fix ICU4C tests
- bad Cherokee case folding due to difference in fallbacks:
  UCD case folding falls back to no mapping,
  ICU runtime case folding falls back to lowercasing;
  fixed casepropsbuilder.cpp to generate scf mappings to self
  when there is an slc mapping but no scf
- Andy handles RBBI & spoof check test failures

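
The Cherokee case folding issue above comes down to two different fallbacks. The following Python sketch models it with plain dictionaries; it is not the code in casepropsbuilder.cpp, and both function names are hypothetical. It assumes the Unicode 8 Cherokee data in which U+13A0 lowercases to U+AB70 while U+AB70 case-folds to U+13A0.

```python
def runtime_fold(cp, slc, scf):
    """Model of the runtime lookup: use scf if present, else fall back to lowercasing."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_self_mappings(slc, scf):
    """Builder-side fix: add scf mappings to self where slc exists but scf does not."""
    fixed = dict(scf)
    for cp in slc:
        if cp not in fixed:
            fixed[cp] = cp
    return fixed


# Assumed sample data for one Cherokee letter pair.
slc = {0x13A0: 0xAB70}  # simple lowercase mapping
scf = {0xAB70: 0x13A0}  # simple case folding

# Before the fix, the lowercasing fallback folds U+13A0 to U+AB70,
# which disagrees with the UCD (no scf mapping means "folds to itself").
print(hex(runtime_fold(0x13A0, slc, scf)))

fixed = add_self_mappings(slc, scf)
print(hex(runtime_fold(0x13A0, slc, fixed)))
print(hex(runtime_fold(0xAB70, slc, fixed)))
```

With the explicit self mapping in place, both letters of the pair fold to the same code point, which is what the UCD data intends.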
4097* collation: CLDR collation root, UCA DUCET 4098 4099- UCA DUCET goes into Mark's Unicode tools, see 4100 https://sites.google.com/site/unicodetools/home#TOC-UCA 4101- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ 4102- cd (CLDR UCA branch)/common/uca/ 4103- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 4104 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt 4105- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 4106 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt 4107 (note removing the underscore before "Rules") 4108 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 4109- restore TODO diffs in UCARules.txt 4110 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 4111- update (ICU4C)/source/test/testdata/CollationTest_*.txt 4112 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 4113 from the CLDR root files (..._CLDR_..._SHORT.txt) 4114 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 4115 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 4116 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data 4117- if CLDR common/uca/unihan-index.txt changes, then update 4118 CLDR common/collation/root.xml <collation type="private-unihan"> 4119 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt 4120- run genuca, see command line above; 4121 deal with 4122 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt 4123 (add the character to genuca.cpp sampleCharsToScripts[]) 4124 + look up the script for the new sample characters 4125 (e.g., in FractionalUCA.txt) 4126 + *add* mappings to 
sampleCharsToScripts[], do not replace them 4127 (in case the script sample characters flip-flop) 4128 + insert new scripts in DUCET script order, see the top_byte table 4129 at the beginning of FractionalUCA.txt 4130- rebuild ICU4C 4131 4132* run & fix ICU4C tests, now with new CLDR collation root data 4133- run all tests with the collation test data *_SHORT.txt or the full files 4134 (the full ones have comments, useful for debugging) 4135- note on intltest: if collate/UCAConformanceTest fails, then 4136 utility/MultithreadTest/TestCollators will fail as well; 4137 fix the conformance test before looking into the multi-thread test 4138- fixed bug in CollationWeights::getWeightRanges() 4139 exposed by new data and CollationTest::TestRootElements 4140 4141* update Java data files 4142- refresh just the UCD/UCA-related/derived files, just to be safe 4143- see (ICU4C)/source/data/icu4j-readme.txt 4144- mkdir /tmp/icu4j 4145- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4146 output: 4147 ... 
4148 Unicode .icu files built to ./out/build/icudt56l 4149 echo timestamp > uni-core-data 4150 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b 4151 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b 4152 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 4153 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b 4154 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" 4155 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ 4156 mkdir -p /tmp/icu4j/main/shared/data 4157 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 4158 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ 4159 mkdir -p /tmp/icu4j/main/shared/data 4160 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 4161 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' 4162- copy the big-endian Unicode data files to another location, 4163 separate from the other data files, 4164 and then refresh ICU4J 4165 cd ~/svn.icu/trunk/dbg/data/out/icu4j 4166 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 4167 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 4168 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4169 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4170 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 4171 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4172 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 
4173 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 4174 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 4175 4176* When refreshing all of ICU4J data from ICU4C 4177- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4178- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 4179or 4180- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 4181 4182* update CollationFCD.java 4183 + copy & paste the initializers of lcccIndex[] etc. from 4184 ICU4C/source/i18n/collationfcd.cpp to 4185 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 4186 4187* refresh Java test .txt files 4188- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 4189 cd $ICU_SRC_DIR/source/data/unidata 4190 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 4191 cd ../../test/testdata 4192 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 4193 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 4194 4195* run & fix ICU4J tests 4196 4197*** LayoutEngine script information 4198 4199* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, 4200 because the layout engine was deprecated in ICU 54. 4201 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java 4202 to write lines that we used to add manually. 4203 4204* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 4205 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 4206 in the working directory. 
4207 4208 (It also generates ScriptRunData.cpp, which is no longer needed.) 4209 4210 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages 4211 (a plain text file) 4212 which maps ICU versions to the numbers of script/language constants 4213 that were added then. 4214 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) 4215 4216 The generated files have a current copyright date and "@deprecated" statement. 4217 4218* Review changes, fix Java tool if necessary, and copy to ICU4C 4219 cd ~/svn.icu4j/trunk/src 4220 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 4221 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout 4222 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout 4223 4224*** API additions 4225- send notice to icu-design about new born-@stable API (enum constants etc.) 4226 4227*** merge the Unicode update branches back onto the trunk 4228- do not merge the icudata.jar and testdata.jar, 4229 instead rebuild them from merged & tested ICU4C 4230- make sure that changes to Unicode tools & ICU tools are checked in 4231 http://www.unicode.org/utility/trac/log/trunk/unicodetools 4232 http://bugs.icu-project.org/trac/log/tools/trunk 4233 4234---------------------------------------------------------------------------- *** 4235 4236Unicode 7.0 update for ICU 54 4237 4238http://www.unicode.org/review/pri271/ -- beta review 4239http://www.unicode.org/reports/uax-proposed-updates.html 4240http://www.unicode.org/versions/beta-7.0.0.html#notable_issues 4241http://www.unicode.org/reports/tr44/tr44-13.html 4242 4243*** ICU Trac 4244 4245- ticket 10821: Unicode 7.0, UCA 7.0 4246- C++ branches/markus/uni70 at r35584 from trunk at r35580 4247- Java branches/markus/uni70 at r35587 from trunk at r35545 4248 4249*** CLDR Trac 4250 4251- ticket 7195: UCA 7.0 CLDR root collation 4252- branches/markus/uni70 at 
r10062 from trunk at r10061 4253 4254- ticket 6762: script metadata for Unicode 7.0 new scripts 4255 4256*** Unicode version numbers 4257- makedata.mak 4258- uchar.h 4259- com.ibm.icu.util.VersionInfo 4260- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 4261 4262- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 4263 so that the makefiles see the new version number. 4264 4265*** data files & enums & parser code 4266 4267* file preparation 4268 4269- download UCD & IDNA files 4270- make sure that the Unicode data folder passed into preparseucd.py 4271 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) 4272- only for manual diffs: remove version suffixes from the file names 4273 ~/unidata/uni70/20140403$ ../../desuffixucd.py . 4274 (see https://sites.google.com/site/unicodetools/inputdata) 4275- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 4276- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src 4277- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 
4278- Restore TODO diffs in source/data/unidata/UCARules.txt 4279 cd $ICU_SRC_DIR 4280 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt 4281- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt 4282 4283- also: from http://unicode.org/Public/security/7.0.0/ download new 4284 confusables.txt & confusablesWholeScript.txt 4285 and copy to $ICU_ROOT/src/source/data/unidata/ 4286 4287* initial preparseucd.py changes 4288- remove new Unicode scripts from the 4289 only-in-ISO-15924 list according to the error message: 4290 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass', 4291 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm', 4292 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj'] 4293 from _scripts_only_in_iso15924 4294 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 4295 and in com.ibm.icu.dev.test.lang.TestUScript.java 4296- NamesList.txt now has a heading with a non-ASCII character 4297 + keep ppucd.txt in platform charset, rather than changing tool/test parsers 4298 + escape non-ASCII characters in heading comments 4299- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013 4300 + get the copyright from the first file whose copyright line contains the current year 4301 4302* PropertyValueAliases.txt changes 4303- 32 new Block (blk) values: 4304 blk; Bassa_Vah ; Bassa_Vah 4305 blk; Caucasian_Albanian ; Caucasian_Albanian 4306 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers 4307 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended 4308 blk; Duployan ; Duployan 4309 blk; Elbasan ; Elbasan 4310 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended 4311 blk; Grantha ; Grantha 4312 blk; Khojki ; Khojki 4313 blk; Khudawadi ; Khudawadi 4314 blk; Latin_Ext_E ; Latin_Extended_E 4315 blk; Linear_A ; Linear_A 4316 blk; Mahajani ; Mahajani 4317 blk; Manichaean ; Manichaean 4318 blk; Mende_Kikakui ; Mende_Kikakui 4319 blk; Modi ; Modi 
4320 blk; Mro ; Mro 4321 blk; Myanmar_Ext_B ; Myanmar_Extended_B 4322 blk; Nabataean ; Nabataean 4323 blk; Old_North_Arabian ; Old_North_Arabian 4324 blk; Old_Permic ; Old_Permic 4325 blk; Ornamental_Dingbats ; Ornamental_Dingbats 4326 blk; Pahawh_Hmong ; Pahawh_Hmong 4327 blk; Palmyrene ; Palmyrene 4328 blk; Pau_Cin_Hau ; Pau_Cin_Hau 4329 blk; Psalter_Pahlavi ; Psalter_Pahlavi 4330 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls 4331 blk; Siddham ; Siddham 4332 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers 4333 blk; Sup_Arrows_C ; Supplemental_Arrows_C 4334 blk; Tirhuta ; Tirhuta 4335 blk; Warang_Citi ; Warang_Citi 4336 -> add to uchar.h 4337 use long property names for enum constants 4338 -> add to UCharacter.UnicodeBlock IDs 4339 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 4340 replace public static final int \1_ID = \2; \3 4341 -> add to UCharacter.UnicodeBlock objects 4342 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 4343 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 4344- 28 new Joining_Group (jg) values: 4345 jg ; Manichaean_Aleph ; Manichaean_Aleph 4346 jg ; Manichaean_Ayin ; Manichaean_Ayin 4347 jg ; Manichaean_Beth ; Manichaean_Beth 4348 jg ; Manichaean_Daleth ; Manichaean_Daleth 4349 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh 4350 jg ; Manichaean_Five ; Manichaean_Five 4351 jg ; Manichaean_Gimel ; Manichaean_Gimel 4352 jg ; Manichaean_Heth ; Manichaean_Heth 4353 jg ; Manichaean_Hundred ; Manichaean_Hundred 4354 jg ; Manichaean_Kaph ; Manichaean_Kaph 4355 jg ; Manichaean_Lamedh ; Manichaean_Lamedh 4356 jg ; Manichaean_Mem ; Manichaean_Mem 4357 jg ; Manichaean_Nun ; Manichaean_Nun 4358 jg ; Manichaean_One ; Manichaean_One 4359 jg ; Manichaean_Pe ; Manichaean_Pe 4360 jg ; Manichaean_Qoph ; Manichaean_Qoph 4361 jg ; Manichaean_Resh ; Manichaean_Resh 4362 jg ; Manichaean_Sadhe ; Manichaean_Sadhe 4363 jg ; Manichaean_Samekh ; Manichaean_Samekh 4364 jg ; Manichaean_Taw ; Manichaean_Taw 4365 jg ; 
Manichaean_Ten ; Manichaean_Ten 4366 jg ; Manichaean_Teth ; Manichaean_Teth 4367 jg ; Manichaean_Thamedh ; Manichaean_Thamedh 4368 jg ; Manichaean_Twenty ; Manichaean_Twenty 4369 jg ; Manichaean_Waw ; Manichaean_Waw 4370 jg ; Manichaean_Yodh ; Manichaean_Yodh 4371 jg ; Manichaean_Zayin ; Manichaean_Zayin 4372 jg ; Straight_Waw ; Straight_Waw 4373 -> uchar.h & UCharacter.JoiningGroup 4374- 23 new Script (sc) values: 4375 sc ; Aghb ; Caucasian_Albanian 4376 sc ; Bass ; Bassa_Vah 4377 sc ; Dupl ; Duployan 4378 sc ; Elba ; Elbasan 4379 sc ; Gran ; Grantha 4380 sc ; Hmng ; Pahawh_Hmong 4381 sc ; Khoj ; Khojki 4382 sc ; Lina ; Linear_A 4383 sc ; Mahj ; Mahajani 4384 sc ; Mani ; Manichaean 4385 sc ; Mend ; Mende_Kikakui 4386 sc ; Modi ; Modi 4387 sc ; Mroo ; Mro 4388 sc ; Narb ; Old_North_Arabian 4389 sc ; Nbat ; Nabataean 4390 sc ; Palm ; Palmyrene 4391 sc ; Pauc ; Pau_Cin_Hau 4392 sc ; Perm ; Old_Permic 4393 sc ; Phlp ; Psalter_Pahlavi 4394 sc ; Sidd ; Siddham 4395 sc ; Sind ; Khudawadi 4396 sc ; Tirh ; Tirhuta 4397 sc ; Wara ; Warang_Citi 4398 -> uscript.h (many were added before) 4399 comment "Mende Kikakui" for USCRIPT_MENDE 4400 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias 4401 -> com.ibm.icu.lang.UScript 4402 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 4403 replace public static final int \1 = \2; \3 4404- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 4405 (added 2012-11-01) 4406 Ahom 338 Ahom 4407 Hatr 127 Hatran 4408 Mult 323 Multani 4409 (added 2013-10-12) 4410 Modi 324 Modi 4411 Pauc 263 Pau Cin Hau 4412 Sidd 302 Siddham 4413 -> uscript.h (some overlap with additions from Unicode) 4414 -> com.ibm.icu.lang.UScript 4415 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 4416 replace public static final int \1 = \2; \3 4417 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924 4418 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() 4419 and in com.ibm.icu.dev.test.lang.TestUScript.java 
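
The bookkeeping between PropertyValueAliases.txt and the _scripts_only_in_iso15924 list can be pictured as follows. This is a minimal Python sketch under stated assumptions, not the logic in preparseucd.py: the function names are hypothetical, the alias lines are taken from the list above, and the ISO-only set is a made-up example.

```python
def parse_aliases(lines, prop="sc"):
    """Map short value names to long names for one property (hypothetical helper)."""
    result = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[0] == prop and len(fields) >= 3:
            result[fields[1]] = fields[2]
    return result


def check_iso_only(iso_only, unicode_scripts):
    """Raise if a script listed as ISO-only is now also a Unicode script."""
    overlap = sorted(set(iso_only) & set(unicode_scripts))
    if overlap:
        raise ValueError("remove %s from the ISO-15924-only list" % overlap)


ALIAS_LINES = [
    "sc ; Aghb ; Caucasian_Albanian",
    "sc ; Modi ; Modi",
    "blk; Mro  ; Mro",
]
scripts = parse_aliases(ALIAS_LINES)

# Made-up ISO-only set: Modi overlaps with the Unicode scripts above.
try:
    check_iso_only({"Ahom", "Hatr", "Mult", "Modi"}, scripts)
except ValueError as e:
    print(e)
```

A script code therefore lives in exactly one place at a time: in the ISO-only list until Unicode encodes the script, and in PropertyValueAliases.txt afterwards.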
4420 4421* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 4422 (not strictly necessary for NOT_ENCODED scripts) 4423 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt 4424 4425* generate normalization data files 4426- cd $ICU_ROOT/dbg 4427- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib 4428- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in 4429- UNIDATA=$ICU_SRC_DIR/source/data/unidata 4430- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource 4431- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 4432- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 4433- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 4434- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 4435 4436* build ICU (make install) 4437 so that the tools build can pick up the new definitions from the installed header files. 4438 4439~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt 4440 4441* build Unicode tools using CMake+make 4442 4443~/svn.icutools/trunk/src/unicode/c/icudefs.txt: 4444 4445# Location (--prefix) of where ICU was installed. 4446set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst) 4447# Location of the ICU source tree. 
4448set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src) 4449 4450~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c 4451~/svn.icutools/trunk/dbg/unicode/c$ make 4452 4453* genprops work 4454- new code point range for Joining_Group values: 10AC0..10AFF Manichaean 4455 + add second array of Joining_Group values for at most 10800..10FFF 4456 icutools: unicode/c/genprops/bidipropsbuilder.cpp 4457 icu: source/common/ubidi_props.h/.c/_data.h 4458 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java 4459 4460* generate core properties data files 4461- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR 4462- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR 4463- rebuild ICU (make install) & tools 4464- run genuca again (see step above) so that it picks up the new nfc.nrm 4465- rebuild ICU (make install) & tools 4466 4467* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 4468 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 4469- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 4470- Unicode 6.0..7.0: U+2260, U+226E, U+226F 4471- nothing new in 7.0, no test file to update 4472 4473* run & fix ICU4C tests 4474 4475* update Java data files 4476- refresh just the UCD-related files, just to be safe 4477- see (ICU4C)/source/data/icu4j-readme.txt 4478- mkdir /tmp/icu4j 4479- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4480 output: 4481 ... 
4482 Unicode .icu files built to ./out/build/icudt53l 4483 echo timestamp > uni-core-data 4484 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b 4485 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b 4486 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 4487 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b 4488 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b" 4489 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/ 4490 mkdir -p /tmp/icu4j/main/shared/data 4491 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 4492 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/ 4493 mkdir -p /tmp/icu4j/main/shared/data 4494 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 4495 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data' 4496- copy the big-endian Unicode data files to another location, 4497 separate from the other data files 4498 ICUDT=icudt54b 4499 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 4500 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 4501 cd ~/svn.icu/uni70/dbg/data/out/icu4j 4502 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4503 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4504 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 4505 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 4506 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 4507 cp 
com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
    bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
    -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
    bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
    bpt; c ; Close
    bpt; n ; None
    bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
    bc ; FSI ; First_Strong_Isolate
    bc ; LRI ; Left_To_Right_Isolate
    bc ; RLI ; Right_To_Left_Isolate
    bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
    WB ; HL ; Hebrew_Letter
    WB ; SQ ; Single_Quote
    WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
    Aghb 239 Caucasian Albanian
    Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
       find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
       replace public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
    lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
    WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
    GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
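
Note on the gennorm2 step above: the four runs share one pattern. Each output .nrm file
is built from nfc.txt plus the files specific to that normalization form, and only the
-o and -s options are used. A minimal Python sketch of that pattern follows. The helper
name gennorm2_commands() and its parameters are hypothetical and not part of the ICU
tools; the file lists and options are the ones shown in this log.

```python
# Sketch only: builds the four gennorm2 command lines used in this log.
# gennorm2_commands() is a hypothetical helper, not part of ICU.
# Each .nrm output maps to its input files; nfc.txt always comes first.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, gennorm2="bin/gennorm2"):
    """Returns one argument list per output file, in the order of this log."""
    commands = []
    for output, inputs in NORM2_INPUTS.items():
        commands.append(
            [gennorm2, "-o", f"{src_data_in}/{output}", "-s", f"{unidata}/norm2", *inputs]
        )
    return commands


if __name__ == "__main__":
    # Print the commands for pasting into a console set up as described above.
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

The printed lines can be compared against the commands in the list above before running them.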
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
    ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
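
Note on the rule above for ISO-only script codes: it amounts to a set update, adding
newly registered ISO 15924 codes and dropping any code that Unicode now encodes. A
minimal Python sketch follows. The function name and its arguments are hypothetical,
and the actual data structure of _scripts_only_in_iso15924 in preparseucd.py may differ.
The sample codes are taken from the Unicode 6.1 section of this log and are used only
as an illustration.

```python
# Sketch only: update_iso_only_scripts() is a hypothetical helper;
# the real _scripts_only_in_iso15924 variable may be structured differently.
def update_iso_only_scripts(iso_only, new_iso_codes, unicode_scripts):
    """Returns the sorted ISO-15924-only codes after an update.

    iso_only:        codes currently listed as ISO-only
    new_iso_codes:   codes newly registered in ISO 15924
    unicode_scripts: codes that Unicode now encodes as sc values
    """
    updated = (set(iso_only) | set(new_iso_codes)) - set(unicode_scripts)
    return sorted(updated)


if __name__ == "__main__":
    # Illustrative input only, using codes mentioned in the 6.1 section.
    before = {"Cakm", "Shrd", "Takr"}
    new_iso = {"Khoj", "Tirh", "Hluw"}
    now_in_unicode = {"Cakm", "Merc", "Mero", "Plrd", "Shrd", "Sora", "Takr"}
    print(update_iso_only_scripts(before, new_iso, now_in_unicode))
```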

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
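
Note on the diff check above: one way to spot data lines that were commented out in
the old test file but are active again in a freshly generated one is a small script
like the following Python sketch. It assumes that known-failing lines were disabled by
prefixing them with "#"; the function name is hypothetical and the sample data in the
usage example is made up.

```python
# Sketch only: reactivated_lines() is a hypothetical helper.
# Assumption: known-failing data lines were disabled by a leading "#".
def reactivated_lines(old_text, new_text, comment_prefix="#"):
    """Lists lines that are commented out in old_text but active in new_text."""
    commented_out = set()
    for line in old_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(comment_prefix):
            body = stripped[len(comment_prefix):].strip()
            if body:
                commented_out.add(body)
    # A line in the new file is a candidate if its text matches a line
    # that was commented out in the old file.
    return [
        line.strip()
        for line in new_text.splitlines()
        if line.strip() in commented_out
    ]


if __name__ == "__main__":
    old = "0041\n# 0042\n0043\n"   # made-up sample: 0042 was disabled
    new = "0041\n0042\n0043\n"     # regenerated file has it active again
    for line in reactivated_lines(old, new):
        print("needs review:", line)
```

Lines reported this way probably need to be commented out again, as noted above.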

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
       Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
       replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
       Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
       replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
       find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
       replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
# Arabic Mathematical Alphabetic Symbols:
# U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
# Meroitic Hieroglyphs:
# U+10980 - U+1099F
# Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
    # Each line has two fields
    # First field: Code point
    # Second field: Alias
- to
    # Each line has three fields, as described here:
    #
    # First field: Code point
    # Second field: Alias
    # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
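
Note on the NameAliases.txt change above: the three-field format and the "correction"
filter can be illustrated with a short Python sketch. This is not the gennames code
(gennames is written in C). The function name is hypothetical, and the sketch only
shows the filtering rule described in this log. The FEFF lines are the examples quoted
above; 01A2 is an entry of type "correction" from NameAliases.txt.

```python
# Sketch only: correction_aliases() is a hypothetical helper that mirrors
# the rule "only pick up correction aliases"; it is not the gennames parser.
def correction_aliases(lines):
    """Returns (code_point, alias) pairs for aliases of type 'correction'."""
    result = []
    for line in lines:
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            continue  # old two-field format or malformed line
        code_point, alias, alias_type = fields
        if alias_type == "correction":
            result.append((int(code_point, 16), alias))
    return result


if __name__ == "__main__":
    sample = [
        "# First field: Code point",
        "01A2;LATIN CAPITAL LETTER GHA;correction",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
    ]
    for code_point, alias in correction_aliases(sample):
        print(f"U+{code_point:04X} {alias}")
```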
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev.
version>.txt 5247- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 5248- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 5249 (note removing the underscore before "Rules") 5250- update (ICU)/source/test/testdata/CollationTest_*.txt 5251 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 5252 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 5253- check test file diffs for previously commented-out, known-failing data lines; 5254 probably need to keep those commented out 5255- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 5256- run makeuca.sh: 5257 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 5258- rebuild ICU4C 5259- refresh ICU4J collation data: 5260 (subset of instructions above for properties data refresh, except copies all coll/*) 5261 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5262 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 5263 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 5264 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b 5265- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 5266- note on intltest: if collate/UCAConformanceTest fails, then 5267 utility/MultithreadTest/TestCollators will fail as well; 5268 fix the conformance test before looking into the multi-thread test 5269 5270* When refreshing all of ICU4J data from ICU4C 5271- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5272- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 5273or 5274- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 5275 5276*** LayoutEngine script information 5277 
5278(For details see the Unicode 5.2 change log below.) 5279 5280* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 5281 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 5282 in the working directory. 5283 (It also generates ScriptRunData.cpp, which is no longer needed.) 5284 5285 The generated files have a current copyright date and "@draft" statement. 5286 5287- diff current <icu>/source/layout files vs. generated ones 5288 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 5289 review and manually merge desired changes; 5290 fix gratuitous changes, incorrect @draft and missing aliases; 5291 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 5292- if you just copy the above files, then 5293 fix mixed line endings, review the diffs as above and restore changes to API tags etc.; 5294 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h 5295 5296*** merge the Unicode update branches back onto the trunk 5297- do not merge the icudata.jar and testdata.jar, 5298 instead rebuild them from merged & tested ICU4C 5299 5300---------------------------------------------------------------------------- *** 5301 5302ICU 4.8 (no Unicode update, just new script codes) 5303 5304* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 5305 (added 2010-12-21) 5306 Afak 439 Afaka 5307 Jurc 510 Jurchen 5308 Mroo 199 Mro, Mru 5309 Nshu 499 Nüshu 5310 Shrd 319 Sharada, Śāradā 5311 Sora 398 Sora Sompeng 5312 Takr 321 Takri, Ṭākrī, Ṭāṅkrī 5313 Tang 520 Tangut 5314 Wole 480 Woleai 5315 -> uscript.h 5316 -> com.ibm.icu.lang.UScript 5317 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 5318 replace public static final int \1 = \2;\3 5319 -> genpname/SyntheticPropertyValueAliases.txt 5320 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() 5321 and in 
com.ibm.icu.dev.test.lang.TestUScript.java 5322 5323* run genpname/preparse.pl (on Linux) 5324 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname 5325 + make sure that data.h is writable 5326 + perl preparse.pl ~/svn.icu/trunk/src > out.txt 5327 + preparse.pl shows no errors, out.txt Info and Warning lines look ok 5328 5329* rebuild Unicode tools (at least genpname) using make 5330- You might first need to "make install" ICU so that the tools build can pick 5331 up the new definitions from the installed header files. 5332 5333* run genpname 5334 (builds both pnames.icu and propname_data.h) 5335- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in 5336- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource 5337- rebuild ICU & tools 5338 5339* run genprops 5340- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 5341- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 5342- rebuild ICU & tools 5343 5344* update Java data files 5345- refresh just the UCD-related files, just to be safe 5346- see (ICU4C)/source/data/icu4j-readme.txt 5347- mkdir /tmp/icu4j 5348- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5349- copy the big-endian Unicode data files to another location, 5350 separate from the other data files 5351 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b 5352 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b 5353 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b 5354- refresh ICU4J 5355 ~/svn.icu/trunk/dbg/data/out/icu4j$ 
jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b 5356 5357* should have updated the layout engine script codes but forgot 5358 5359---------------------------------------------------------------------------- *** 5360 5361Unicode 6.0 update 5362 5363*** related ICU Trac tickets 5364 53657264 Unicode 6.0 Update 5366 5367*** Unicode version numbers 5368- makedata.mak 5369- uchar.h 5370 (configure.in & configure: have been modified to extract the version from uchar.h) 5371- com.ibm.icu.util.VersionInfo 5372 5373*** data files & enums & parser code 5374 5375* file preparation 5376 5377~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed 5378- This now prepares both unidata and testdata files in respective output subfolders. 5379 5380* PropertyAliases.txt changes 5381- new Script_Extensions property defined in the new ScriptExtensions.txt file 5382 but not listed in PropertyAliases.txt; reported to unicode.org; 5383 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt 5384 scx; Script_Extensions 5385 -> uchar.h with new UProperty section 5386 -> com.ibm.icu.lang.UProperty, parallel with uchar.h 5387 5388* PropertyValueAliases.txt changes 5389- 12 new block names: 5390 Alchemical_Symbols 5391 Bamum_Supplement 5392 Batak 5393 Brahmi 5394 CJK_Unified_Ideographs_Extension_D 5395 Emoticons 5396 Ethiopic_Extended_A 5397 Kana_Supplement 5398 Mandaic 5399 Miscellaneous_Symbols_And_Pictographs 5400 Playing_Cards 5401 Transport_And_Map_Symbols 5402 -> add to uchar.h 5403 -> add to UCharacter.UnicodeBlock 5404 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 5405 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 5406- Joining_Group (jg) values: 5407 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias 5408 -> uchar.h & UCharacter.JoiningGroup 5409- 3 new scripts: 5410 sc ; Batk ; 
Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
    -> remove these from SyntheticPropertyValueAliases.txt
    -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
        and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
    -> uscript.h
    -> com.ibm.icu.lang.UScript
        find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace public static final int \1 = \2;\3
    -> SyntheticPropertyValueAliases.txt
    -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
        and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
    -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
    -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
    + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
    + make sure that data.h is writable
    + perl preparse.pl ~/svn.icu/trunk/src > out.txt
    + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to
"make install" ICU so that the tools build can pick 5460 up the new definitions from the installed header files. 5461 5462* run genpname 5463- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in 5464- rebuild ICU & tools 5465 5466* update source/data/unidata/norm2/nfkc_cf.txt 5467- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt 5468 5469* update source/data/unidata/norm2/uts46.txt 5470- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt 5471 to ~/svn.icu/tools/trunk/src/unicode/py 5472- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values 5473- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py 5474- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 5475 5476* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 5477 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 5478- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 5479- Unicode 6.0: U+2260, U+226E, U+226F 5480 5481* generate core properties data files 5482- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 5483- rebuild ICU & tools 5484- run makeuca.sh so that genuca picks up the new nfc.nrm: 5485 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 5486- rebuild ICU & tools 5487 5488* implement new Script_Extensions property (provisional) 5489- parser & generator: genprops & uprops.icu 5490- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp 5491- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java 5492 5493* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2 5494- (one-time change) 5495- genbidi/gencase/genprops tools changes 
5496- re-run makeprops.sh (see above) 5497- UCharacterProperty.java, UCharacterTypeIterator.java, 5498 UBiDiProps.java, UCaseProps.java, and several others with minor changes; 5499 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java 5500 5501* update Java data files 5502- refresh just the UCD-related files, just to be safe 5503- see (ICU4C)/source/data/icu4j-readme.txt 5504- mkdir /tmp/icu4j 5505- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5506 output: 5507 ... 5508 Unicode .icu files built to ./out/build/icudt45l 5509 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b 5510 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 5511 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b 5512 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b 5513 mkdir -p /tmp/icu4j/main/shared/data 5514 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 5515- copy the big-endian Unicode data files to another location, 5516 separate from the other data files 5517 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll 5518 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr 5519 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b 5520 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu 5521 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b 5522 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll 5523 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr 5524- refresh ICU4J 5525 
~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b 5526 5527* refresh Java test .txt files 5528- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 5529 5530* un-hardcode normalization skippable (NF*_Inert) test data 5531- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools 5532 5533* copy updated break iterator test files 5534- now handled by early ucdcopy.py and 5535 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata 5536 (old instructions: 5537 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt 5538 to ~/svn.icu/trunk/src/source/test/testdata) 5539- they are not used in ICU4J 5540 5541* UCA 5542 5543- get output from Mark's tools; look in 5544 http://www.unicode.org/~book/incoming/mark/uca6.0.0/ 5545 http://www.macchiato.com/unicode/utc/additional-uca-files 5546 http://www.unicode.org/Public/UCA/6.0.0/ 5547 http://www.unicode.org/~mdavis/uca/ 5548- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 5549- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 5550- update Han-implicit ranges for new CJK extensions: 5551 swapCJK() in ucol.cpp & ImplicitCEGenerator.java 5552- genuca: allow bytes 02 for U+FFFE, new merge-sort character; 5553 do not add it into invuca so that tailoring primary-after an ignorable works 5554- genuca: permit space between [variable top] bytes 5555- ucol.cpp: treat noncharacters like unassigned rather than ignorable 5556- run makeuca.sh: 5557 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 5558- rebuild ICU4C 5559- refresh ICU4J collation data: 5560 (subset of instructions above for properties data refresh, except copies all coll/*) 5561 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5562 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll 
5563 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll 5564 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b 5565- update (ICU)/source/test/testdata/CollationTest_*.txt 5566 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 5567 with output from Mark's Unicode tools 5568- run all tests with the *_SHORT.txt or the full files (the full ones have comments) 5569- note on intltest: if collate/UCAConformanceTest fails, then 5570 utility/MultithreadTest/TestCollators will fail as well; 5571 fix the conformance test before looking into the multi-thread test 5572 5573* When refreshing all of ICU4J data from ICU4C 5574- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 5575- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 5576or 5577- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 5578 5579*** LayoutEngine script information 5580 5581(For details see the Unicode 5.2 change log below.) 5582 5583* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, 5584ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates 5585ScriptRunData.cpp, which is no longer needed.) 5586 5587The generated files have a current copyright date and "@draft" statement. 5588 5589* copy the above files into <icu>/source/layout, replacing the old files. 5590* fix mixed line endings 5591* review the diffs and fix incorrect @draft and missing aliases; 5592 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 
5593* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h 5594 5595---------------------------------------------------------------------------- *** 5596 5597Unicode 5.2 update 5598 5599*** related ICU Trac tickets 5600 56017084 Unicode 5.2 5602 56037167 verify collation bytes 56047235 Java test NAME_ALIAS 56057236 Java DerivedCoreProperties.txt test 56067237 Java BidiTest.txt 56077238 UTrie2 in core unidata 56087239 test for tailoring gaps 56097240 Java fix CollationMiscTest 56107243 update layout engine for Unicode 5.2 5611 5612*** Unicode version numbers 5613- makedata.mak 5614- uchar.h 5615- configure.in & configure 5616- update ucdVersion in gennames.c if an algorithmic range changes 5617 5618*** data files & enums & parser code 5619 5620* file preparation 5621 5622python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata 5623- includes finding files regardless of version numbers, 5624 copying them, and performing the equivalent processing of the 5625 ucdstrip and ucdmerge tools on the desired set of files 5626 5627* notes on changes 5628- PropertyAliases.txt 5629 moved from numeric to enumerated: 5630 ccc ; Canonical_Combining_Class 5631 new string properties: 5632 NFKC_CF ; NFKC_Casefold 5633 Name_Alias; Name_Alias 5634 new binary properties: 5635 Cased ; Cased 5636 CI ; Case_Ignorable 5637 CWCF ; Changes_When_Casefolded 5638 CWCM ; Changes_When_Casemapped 5639 CWKCF ; Changes_When_NFKC_Casefolded 5640 CWL ; Changes_When_Lowercased 5641 CWT ; Changes_When_Titlecased 5642 CWU ; Changes_When_Uppercased 5643 new CJK Unihan properties (not supported by ICU) 5644- PropertyValueAliases.txt 5645 new block names 5646 new scripts 5647 one script code change: 5648 sc ; Qaai ; Inherited 5649 -> 5650 sc ; Zinh ; Inherited ; Qaai 5651 new Line_Break (lb) value: 5652 lb ; CP ; Close_Parenthesis 5653 new Joining_Group (jg) values: Farsi_Yeh, Nya 5654 
other new values:
        ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
    new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
    all of the ISO comments are gone
    new CJK block end:
        9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
    new CJK block:
        2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
        2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
    + cd \svn\icuproj\icu\trunk\source\tools\genpname
    + make sure that data.h is writable
    + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
    + preparse.pl complains with errors like the following:
        Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
        This is because ICU 4.0 had scripts from ISO 15924 which are now
        added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
        and PropertyValueAliases.txt.
        -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
            Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
    + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + 26 new blocks
        copy new blocks from Blocks.txt
        MS VC++ 2008 regular expression:
        find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
        replace with " UBLOCK_\3 = 172, /*[\1]*/"
    + several new script values already added in ICU 4.0 for ISO 15924 coverage
        (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
    + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
    + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
        (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
    search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
    + do change gennames.c

* Compare definitions of new binary properties with what we used to use
    in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
    The gencase tool now parses the newly public Case_Ignorable values
    in case the definition changes in the future.
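The search for the hardcoded Unihan range end/limit described above (regex 9FC[34], case-insensitive) can be sketched in Python as follows. This is only an illustration of that search step, not a tool from the ICU repository. The function names and the list of file suffixes are assumptions.

```python
import os
import re

# 9FC3 is the old range end; 9FC4 is the old limit (end + 1).
OLD_UNIHAN_END_OR_LIMIT = re.compile(r"9FC[34]", re.IGNORECASE)


def matching_lines(lines, pattern=OLD_UNIHAN_END_OR_LIMIT):
    """Return (line_number, text) for each line that contains the pattern."""
    return [(number, text.rstrip("\r\n"))
            for number, text in enumerate(lines, 1)
            if pattern.search(text)]


def find_hardcoded(root, suffixes=(".c", ".cpp", ".h", ".java")):
    """Walk a source tree and return sorted (path, line_number, text) hits.

    The suffix list is an assumption; widen it for other file types.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as source:
                for number, text in matching_lines(source):
                    hits.append((path, number, text))
    return sorted(hits)
```

Calling find_hardcoded() on the source root lists candidate lines to review by hand. Not every hit is necessarily a Unihan boundary, so each one still needs a manual check before it is changed.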

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
    1/7, 1/9, 1/10, 3/10, 1/16, 3/16
    the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
    therefore redesign the encoding of numeric types and values for formatVersion 6;
    design for simple numbers up to at least 144 ("one gross"),
    large values up to at least 10^20,
    and fractions with numerators -1..17 and denominators 1..16
    to cover current and expected future values
    (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
    and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
    Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
    and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
    Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
5758 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py 5759 on Windows: 5760 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt 5761 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt 5762 End obsolete instructions] 5763- run all tests with the *_SHORT.txt or the full files (the full ones have comments) 5764 not just the *_STUB.txt files 5765- note on intltest: if collate/UCAConformanceTest fails, then 5766 utility/MultithreadTest/TestCollators will fail as well; 5767 fix the conformance test before looking into the multi-thread test 5768 5769*** Implement Cased & Case_Ignorable properties 5770- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable() 5771- Problem: These properties should be disjoint, but aren't 5772- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not 5773- change ucase.icu to be able to store any combination of Cased and Case_Ignorable 5774 5775*** Implement Changes_When_Xyz properties 5776- without stored data 5777 5778*** Implement Name_Alias property 5779- add it as another name field in unames.icu 5780- make it available via u_charName() and UCharNameChoice and 5781- consider it in u_charFromName() 5782 5783*** Break iterators 5784 5785* Update break iterator rules to new UAX versions and new property values 5786* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary 5787 5788*** new BidiTest file 5789- review format and data 5790- copy BidiTest.txt to source/test/testdata 5791- write test code using this data 5792- fix ICU code where it fails the conformance test 5793 5794*** Java 5795- generally, find and update code corresponding to C/C++ 5796- UCharacter.UnicodeBlock constants: 5797 a) add an _ID integer per new 
block, update COUNT
    b) add a class instance per new block
        Visual Studio regex:
        find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
        replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
    "I think the tool has been modified to update @draft to @stable for
    older scripts and to add @draft for new scripts.
    (I worked with an intern on this last year.)
    You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
    "This is just a matter of making sure that all the per-script tables have
    entries for any new scripts that were added.
    If any new Indic characters were added, then the class tables in
    IndicClassTables.cpp should be updated to reflect this.
    John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
    + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
    + cd \svn\icuproj\icu\uni51\source\tools\genpname
    + make sure that data.h is writable
    + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
    + preparse.pl complains with errors like the following:
        Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
      This is because ICU 3.8 had scripts from ISO 15924 which are now
      added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
      and PropertyValueAliases.txt.
      -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
         Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
    + PropertyValueAliases.txt now explicitly contains values for boolean properties:
      N/Y, No/Yes, F/T, False/True
      -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
         It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + 17 new blocks
    + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
      (removed from SyntheticPropertyValueAliases.txt)
    + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
      (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  No script code above 127 is assigned to a Unicode character yet,
  so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
    + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
    + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
    + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.
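
Illustration for the "Test suites" item of this update: a minimal Python sketch of accepting
all eight boolean value aliases (N/Y, No/Yes, F/T, False/True). This is not ICU code and not
how TestBinaryValues() is written; parse_binary_value is a hypothetical helper, and the loose
matching (ignoring case, spaces, underscores and hyphens) is an assumption made for the sketch.

```python
# Hypothetical helper, not an ICU API: maps the boolean property value
# aliases listed in the log (N/Y, No/Yes, F/T, False/True) to bool.
_TRUE_ALIASES = {"y", "yes", "t", "true"}
_FALSE_ALIASES = {"n", "no", "f", "false"}


def parse_binary_value(alias: str) -> bool:
    """Return the boolean meaning of a binary property value alias."""
    # Assumed loose matching: ignore case, spaces, underscores, hyphens.
    key = "".join(ch for ch in alias.lower() if ch not in " _-")
    if key in _TRUE_ALIASES:
        return True
    if key in _FALSE_ALIASES:
        return False
    raise ValueError(f"not a binary property value alias: {alias!r}")


if __name__ == "__main__":
    for alias in ("N", "Y", "No", "Yes", "F", "T", "False", "True"):
        print(alias, parse_binary_value(alias))
```

A test in this spirit would loop over all eight spellings for each binary property
and fail on the first alias that is rejected.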

*** Documentation
- Update User Guide
    + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
    + make sure that data.h is writable
    + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
    + Pattern_Syntax
    + Pattern_White_Space
- new enumerated properties
    + Grapheme_Cluster_Break
    + Sentence_Break
    + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
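
Illustration for "verify that u_charMirror() round-trips": a minimal Python sketch of the
check, run against a few lines in the BidiMirroring.txt format rather than against ICU itself.
The sample rows, parse_mirroring(), mirror() and round_trip_failures() are all assumptions
made for this sketch; the real test calls u_charMirror() for every code point.

```python
# A handful of rows in BidiMirroring.txt format: "code; mirrored code # name".
SAMPLE = """\
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
003C; 003E # LESS-THAN SIGN
003E; 003C # GREATER-THAN SIGN
005B; 005D # LEFT SQUARE BRACKET
005D; 005B # RIGHT SQUARE BRACKET
"""


def parse_mirroring(text):
    """Parse mirroring rows into a dict {code point: mirrored code point}."""
    pairs = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        src, dst = (int(field.strip(), 16) for field in data.split(";"))
        pairs[src] = dst
    return pairs


def mirror(pairs, c):
    """Like u_charMirror(): return c itself when it has no mirror mapping."""
    return pairs.get(c, c)


def round_trip_failures(pairs):
    """Code points c for which mirror(mirror(c)) != c."""
    return [c for c in pairs if mirror(pairs, mirror(pairs, c)) != c]


if __name__ == "__main__":
    failures = round_trip_failures(parse_mirroring(SAMPLE))
    print("round-trip failures:", [f"U+{c:04X}" for c in failures])
```

A one-directional mapping (one code point mirrors to another that does not map back)
shows up as a failure, which is the kind of data error this test is meant to catch.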

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
    + do not modify BOCU/BOCSU code because that would change the encoding
      and break binary compatibility!
    + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
      NamePrepProfile.txt
    + ignore trietest.c: test data is arbitrary
    + ignore tstnorm.cpp: test optimization, not important
    + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
    + do change line_th.txt and word_th.txt
      by replacing hardcoded ranges with the new property values
    + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
    + put short name comment only on line with new constant
      for genpname perl script parser
- new binary properties
    + STerm
    + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
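
Illustration for the gennorm.c item: a minimal Python sketch of a parser that accepts both the
old and the new DerivedNormalizationProps.txt field layouts and reports them in the new form.
This is not the gennorm.c code. normalize_fields() is a hypothetical helper, the sample lines in
the usage part are made up for the sketch, and the mapping of the old "_NO"/"_MAYBE" suffixes to
the values N/M is an assumption extrapolated from the "NFD_NO" -> "NFD_QC; N" example.

```python
# Assumed mapping from the old single-field suffixes to quick-check values.
_OLD_QC_VALUES = {"NO": "N", "MAYBE": "M"}


def normalize_fields(line):
    """Return (range, property, *values) in the newer layout, or None."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 2:
        return None
    code_range, prop, rest = fields[0], fields[1], fields[2:]
    if prop == "FNC":
        # Property was renamed: "FNC" -> "FC_NFKC".
        prop = "FC_NFKC"
    elif not rest:
        # Old layout: one field such as "NFD_NO"; split it into
        # property "NFD_QC" and value "N". Other names pass through.
        base, _, suffix = prop.rpartition("_")
        if base and suffix in _OLD_QC_VALUES:
            prop, rest = base + "_QC", [_OLD_QC_VALUES[suffix]]
    return (code_range, prop, *rest)


if __name__ == "__main__":
    for sample in (
        "00C0..00C5 ; NFD_NO # old layout (illustrative line)",
        "00C0..00C5 ; NFD_QC; N # new layout (illustrative line)",
        "037A ; FNC; 0020 03B9 # old property name (illustrative line)",
    ):
        print(normalize_fields(sample))
```

Normalizing at parse time keeps the rest of a tool independent of which file version it reads.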

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
    + assume that the type is always numeric for Han characters,
      and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ