* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

Notes:

This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.

Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.

---------------------------------------------------------------------------- ***

CLDR 43 root collation update for ICU 73

Partial update only for the root collation.
See
- https://unicode-org.atlassian.net/browse/CLDR-15946
  Treat quote marks as equivalent when strength=UCOL_PRIMARY
- https://github.com/unicode-org/cldr/pull/2691
  CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
- https://github.com/unicode-org/cldr/pull/2833
  CLDR-15946 make fancy quotes secondary-different from each other

The related changes to tailorings were already integrated in an earlier PR for
https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.

This update is for the root collation,
which is handled by different tools than the locale data updates.

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt73b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Configure: Build Unicode data for ICU4J
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)
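
The version check described in the Bazel paragraph above could be scripted roughly as follows. This is only a sketch under assumptions that are not stated in this log: that `bazelisk --version` prints a line of the form `bazel X.Y.Z`, and that .bazeliskrc pins the version with a `USE_BAZEL_VERSION=X.Y.Z` entry. The helper names `bazel_version_from` and `bazeliskrc_line` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract "X.Y.Z" from a line such as "bazel 6.1.1".
# Assumption: the version is the second whitespace-separated field.
bazel_version_from() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Hypothetical helper: print a .bazeliskrc entry for the given version.
# Assumption: the file uses the USE_BAZEL_VERSION key; look at the
# existing $ICU_SRC/.bazeliskrc before changing it.
bazeliskrc_line() {
    printf 'USE_BAZEL_VERSION=%s\n' "$1"
}

# Manual usage, from a folder outside $ICU_SRC so that the version
# already pinned in $ICU_SRC/.bazeliskrc does not affect the answer:
#   cd ~
#   latest=$(bazel_version_from "$(bazelisk --version)")
#   bazeliskrc_line "$latest"     # compare with $ICU_SRC/.bazeliskrc
```

The helpers only print a candidate line; editing .bazeliskrc stays a manual step so that the change is easy to revert if the newer version turns out to be incompatible.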

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
  or even
  bazelisk clean --expunge
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
  cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
  cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
  jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- new for ICU 73: also copy the binary data files directly into the ICU4J tree
  cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
  cd $ICU_SRC/icu4c/source/data/unidata
  cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
  cd ../../test/testdata
  cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
  cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 15.0 update for ICU 72

https://www.unicode.org/versions/Unicode15.0.0/
https://www.unicode.org/versions/beta-15.0.0.html
https://www.unicode.org/Public/15.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-29.html

https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt72b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
  cd $ICU_ROOT/dbg/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + old way of fetching files: from the "Public" area on unicode.org
    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + new way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
    cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
  ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
  ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
  ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
+age; 15.0 ; V15_0
+blk; Arabic_Ext_C ; Arabic_Extended_C
+blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
+blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
+blk; Devanagari_Ext_A ; Devanagari_Extended_A
+blk; Kaktovik_Numerals ; Kaktovik_Numerals
+blk; Kawi ; Kawi
+blk; Nag_Mundari ; Nag_Mundari
+sc ; Kawi ; Kawi
+sc ; Nagm ; Nag_Mundari
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
+10EC0..10EFF; Arabic Extended-C
+11B00..11B5F; Devanagari Extended-A
+11F00..11F5F; Kawi
-13430..1343F; Egyptian Hieroglyph Format Controls
+13430..1345F; Egyptian Hieroglyph Format Controls
+1D2C0..1D2DF; Kaktovik Numerals
+1E030..1E08F; Cyrillic Extended-D
+1E4D0..1E4FF; Nag Mundari
+31350..323AF; CJK Unified Ideographs Extension H
     (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
     replace public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
  $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
  inclusionPat & recommendedPat in i18n/uspoof.cpp
  INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
  ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
- Unicode 6.0..15.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
  (no rule changes in Unicode 15)
- update CLDR GraphemeBreakTest.txt
  cd ~/unitools/mine/Generated
  cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
  cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
  ICU4C/source/i18n/collationfcd.cpp to
  ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
  cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
  cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
  cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
  rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
  cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
  cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
  cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
  jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
  cd $ICU_SRC/icu4c/source/data/unidata
  cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
  cd ../../test/testdata
  cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
  cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
    -->
    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
    +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
  Unicode 15:
    kawi 11F50..11F59 Kawi
    nagm 1E4F0..1E4F9 Nag Mundari
    https://github.com/unicode-org/cldr/pull/2041

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

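Addendum to the Unicode 15.0 log above, on the "CLDR numbering systems" step: the three egrep/diff commands could be wrapped into one function, roughly as sketched here. The function name `new_nv4_lines` is hypothetical, the two arguments are assumed to be the old and the new ppucd.txt, and `grep -E` is used in place of `egrep` on the assumption that both accept the same pattern.

```shell
#!/bin/sh
# Hypothetical helper: print the "digit four" lines (gc=Nd and nv=4)
# that are present in the new ppucd.txt ($2) but not in the old one ($1).
# Runs in a subshell so that the temporary variables do not leak.
new_nv4_lines() (
    old_tmp=$(mktemp) || exit 1
    new_tmp=$(mktemp) || exit 1
    # Same pattern as in the log: a decimal digit with numeric value 4.
    grep -E ';gc=Nd.+;nv=4' "$1" > "$old_tmp"
    grep -E ';gc=Nd.+;nv=4' "$2" > "$new_tmp"
    # Keep only added code point lines from the unified diff;
    # the "+++" file header does not match '^\+cp;'.
    diff -u "$old_tmp" "$new_tmp" | grep -E '^\+cp;'
    rm -f "$old_tmp" "$new_tmp"
)

# Manual usage, with the workspace paths from the log:
#   new_nv4_lines ~/icu/mine/src/icu4c/source/data/unidata/ppucd.txt \
#                 ~/icu/uni/src/icu4c/source/data/unidata/ppucd.txt
```

Each printed line stands for one new set of decimal digits, which is the candidate list for new CLDR numbering systems.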
Unicode 14.0 update for ICU 70

https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/versions/beta-14.0.0.html
https://www.unicode.org/Public/14.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-27.html

https://unicode-org.atlassian.net/browse/CLDR-14801
https://unicode-org.atlassian.net/browse/ICU-21635

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni14/20210903
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt70b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
  cd $ICU_ROOT/dbg/icu4c
  ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
    cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
  ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
585 + For debugging, and tweaking how ppucd.txt is written, 586 the tool has an --only_ppucd option: 587 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 588 589- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 590 591* new constants for new property values 592- preparseucd.py error: 593 ValueError: missing uchar.h enum constants for some property values: 594 [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])), 595 (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])), 596 (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))] 597 = PropertyValueAliases.txt new property values (diff old & new .txt files) 598 ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]' 599 +age; 14.0 ; V14_0 600 +blk; Arabic_Ext_B ; Arabic_Extended_B 601 +blk; Cypro_Minoan ; Cypro_Minoan 602 +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B 603 +blk; Kana_Ext_B ; Kana_Extended_B 604 +blk; Latin_Ext_F ; Latin_Extended_F 605 +blk; Latin_Ext_G ; Latin_Extended_G 606 +blk; Old_Uyghur ; Old_Uyghur 607 +blk; Tangsa ; Tangsa 608 +blk; Toto ; Toto 609 +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A 610 +blk; Vithkuqi ; Vithkuqi 611 +blk; Znamenny_Music ; Znamenny_Musical_Notation 612 +jg ; Thin_Yeh ; Thin_Yeh 613 +jg ; Vertical_Tail ; Vertical_Tail 614 +sc ; Cpmn ; Cypro_Minoan 615 +sc ; Ougr ; Old_Uyghur 616 +sc ; Tnsa ; Tangsa 617 +sc ; Toto ; Toto 618 +sc ; Vith ; Vithkuqi 619 -> add new blocks to uchar.h before UBLOCK_COUNT 620 use long property names for enum constants, 621 for the trailing comment get the block start code point: diff old & new Blocks.txt 622 ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]' 623 +0870..089F; Arabic Extended-B 624 +10570..105BF; Vithkuqi 625 
     +10780..107BF; Latin Extended-F
     +10F70..10FAF; Old Uyghur
     -11700..1173F; Ahom
     +11700..1174F; Ahom
     +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
     +12F90..12FFF; Cypro-Minoan
     +16A70..16ACF; Tangsa
     -18D00..18D8F; Tangut Supplement
     +18D00..18D7F; Tangut Supplement
     +1AFF0..1AFFF; Kana Extended-B
     +1CF00..1CFCF; Znamenny Musical Notation
     +1DF00..1DFFF; Latin Extended-G
     +1E290..1E2BF; Toto
     +1E7E0..1E7FF; Ethiopic Extended-B
     (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
             replace  public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> add new joining groups to uchar.h & UCharacter.JoiningGroup

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools
  org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
  bazelisk clean
- build/bootstrap/generate new files:
  icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..14.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored
  version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc.
  from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
     nor with the Google default of /usr/local/buildtools/java/jdk
     [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
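
Optional check for the "refresh Java test .txt files" step above: before running
the ICU4J tests, it may help to confirm that the ICU4J copies really match their
ICU4C sources. The following is only a sketch; `check_copies` is a hypothetical
helper that is not part of ICU or of this update process, and it relies on nothing
beyond `cmp` and a shell that supports `local` (bash, for example).

```shell
# Hypothetical helper, not part of ICU: report files whose copy in DST_DIR
# is missing or differs from the original in SRC_DIR.
# Usage: check_copies SRC_DIR DST_DIR FILE...
# Returns 0 if every listed file is present and identical, 1 otherwise.
check_copies() {
    local src="$1" dst="$2" status=0 f
    shift 2
    for f in "$@"; do
        if [ ! -f "$dst/$f" ]; then
            echo "missing: $f"
            status=1
        elif ! cmp -s "$src/$f" "$dst/$f"; then
            # cmp -s compares byte by byte and prints nothing itself.
            echo "differs: $f"
            status=1
        fi
    done
    return $status
}
```

With the environment variables of this section set, it could be called like this:
    check_copies $ICU_SRC/icu4c/source/data/unidata $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode confusables.txt UnicodeData.txt
No output and exit status 0 mean that the listed copies are identical.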

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
    -->
    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  Unicode 14:
    tnsa 16AC0..16AC9 Tangsa
    https://github.com/unicode-org/cldr/pull/1326

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
    so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                     u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
       (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
       (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
  blk; Chorasmian ; Chorasmian
  blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G
  blk; Dives_Akuru ; Dives_Akuru
  blk; Khitan_Small_Script ; Khitan_Small_Script
  blk; Lisu_Sup ; Lisu_Supplement
  blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
  blk; Tangut_Sup ; Tangut_Supplement
  blk; Yezidi ; Yezidi
  -> add to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

  sc ; Chrs ; Chorasmian
  sc ; Diak ; Dives_Akuru
  sc ; Kits ; Khitan_Small_Script
  sc ; Yezi ; Yezidi
  -> uscript.h & com.ibm.icu.lang.UScript
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

  InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
  -> uchar.h enum UIndicPositionalCategory & UCharacter.java
     IndicPositionalCategory

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format :
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
    $ICU_ROOT/dbg/tools/unicode/c$
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh
  just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
    so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
1260set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 1261 1262 $ICU_ROOT/dbg$ 1263 mkdir -p tools/unicode/c 1264 cd tools/unicode/c 1265 1266 $ICU_ROOT/dbg/tools/unicode/c$ 1267 cmake ../../../../src/tools/unicode/c 1268 make 1269 1270* generate core properties data files 1271 $ICU_ROOT/dbg/tools/unicode/c$ 1272 genprops/genprops $ICU_SRC/icu4c 1273 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 1274 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 1275- rebuild ICU (make install) & tools 1276 1277* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1278 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1279- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1280- Unicode 6.0..12.1: U+2260, U+226E, U+226F 1281- nothing new in this Unicode version, no test file to update 1282 1283* run & fix ICU4C tests 1284- Andy handles RBBI & spoof check test failures 1285 1286* collation: CLDR collation root, UCA DUCET 1287 1288- UCA DUCET goes into Mark's Unicode tools, see 1289 https://sites.google.com/site/unicodetools/home#TOC-UCA 1290 diff the main mapping file, look for bad changes 1291 (for example, more bytes per weight for common characters) 1292 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt 1293 ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt 1294 1295- CLDR root data files are checked into $CLDR_SRC/common/uca/ 1296 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 1297 1298- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1299 cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 1300- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1301 cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 1302 
(note removing the underscore before "Rules") 1303 cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 1304- restore TODO diffs in UCARules.txt 1305 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 1306- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1307 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1308 from the CLDR root files (..._CLDR_..._SHORT.txt) 1309 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 1310 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 1311 cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 1312- if CLDR common/uca/unihan-index.txt changes, then update 1313 CLDR common/collation/root.xml <collation type="private-unihan"> 1314 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 1315 1316- run genuca, see command line above 1317- rebuild ICU4C 1318 1319* Unihan collators 1320 https://sites.google.com/site/unicodetools/unihan 1321- run Unicode Tools 1322 org.unicode.draft.GenerateUnihanCollators 1323 with VM arguments 1324 -ea 1325 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 1326 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 1327 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 1328 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1329 -DUVERSION=12.1.0 1330- run Unicode Tools 1331 org.unicode.draft.GenerateUnihanCollatorFiles 1332 with the same arguments 1333- check CLDR diffs 1334 cd $CLDR_SRC 1335 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 1336 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 1337- copy to CLDR 1338 cd $CLDR_SRC 1339 cp ../Generated/cldr/han/replace/zh.xml 
common/collation/zh.xml 1340 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 1341- run CLDR unit tests, commit to CLDR 1342- generate ICU zh collation data: run CLDR 1343 org.unicode.cldr.icu.NewLdml2IcuConverter 1344 with program arguments 1345 -t collation 1346 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation 1347 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 1348 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 1349 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 1350 zh 1351 and VM arguments 1352 -ea 1353 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1354- rebuild ICU4C 1355 1356* run & fix ICU4C tests, now with new CLDR collation root data 1357- run all tests with the collation test data *_SHORT.txt or the full files 1358 (the full ones have comments, useful for debugging) 1359- note on intltest: if collate/UCAConformanceTest fails, then 1360 utility/MultithreadTest/TestCollators will fail as well; 1361 fix the conformance test before looking into the multi-thread test 1362 1363* update Java data files 1364- refresh just the UCD/UCA-related/derived files, just to be safe 1365- see (ICU4C)/source/data/icu4j-readme.txt 1366- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1367- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1368 output: 1369 ... 
1370 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1371 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b 1372 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b 1373 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b 1374 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b" 1375 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/ 1376 mkdir -p /tmp/icu4j/main/shared/data 1377 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 1378 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/ 1379 mkdir -p /tmp/icu4j/main/shared/data 1380 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 1381 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1382- copy the big-endian Unicode data files to another location, 1383 separate from the other data files, 1384 and then refresh ICU4J 1385 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 1386 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1387 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1388 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1389 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1390 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 1391 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1392 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1393 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1394 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 1395 1396* When refreshing all of ICU4J data from ICU4C 1397- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1398- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 1399or 1400- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 1401 1402* update CollationFCD.java 1403 + copy & paste the initializers of lcccIndex[] etc. from 1404 ICU4C/source/i18n/collationfcd.cpp to 1405 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 1406 1407* refresh Java test .txt files 1408- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1409 cd $ICU_SRC/icu4c/source/data/unidata 1410 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1411 cd ../../test/testdata 1412 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1413 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1414 1415* run & fix ICU4J tests 1416 1417*** API additions 1418- send notice to icu-design about new born-@stable API (enum constants etc.) 
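
A possible sanity check after the "refresh Java test .txt files" step above: a minimal sketch, not part of the recorded update, that lists files whose ICU4J copy is missing or differs from the ICU4C one. The function name report_stale_copies is made up for this illustration; it relies only on the standard cmp utility.

```shell
# Hypothetical helper, not part of the ICU tools.
# usage: report_stale_copies SRC_DIR DST_DIR FILE...
# Prints each FILE whose copy in DST_DIR is missing or differs from SRC_DIR.
# Returns 1 if at least one such file was found, otherwise 0.
report_stale_copies() {
    src_dir=$1
    dst_dir=$2
    shift 2
    stale=0
    for name in "$@"; do
        # cmp -s is silent and exits non-zero on a difference or a missing file.
        if ! cmp -s "$src_dir/$name" "$dst_dir/$name" 2>/dev/null; then
            echo "$name"
            stale=1
        fi
    done
    return "$stale"
}
```

For example, with the environment variables of this section:
    report_stale_copies $ICU_SRC/icu4c/source/data/unidata $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode confusables.txt UnicodeData.txt
No output means that the listed copies are identical.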

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
    for example, look for
        ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
    in new blocks (Blocks.txt)
    Unicode 12: using Unicode 12 CLDR ticket #11478
        hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
        wcho 1E2F0..1E2F9 Wancho
    Unicode 11: using Unicode 11 CLDR ticket #10978
        rohg 10D30..10D39 Hanifi_Rohingya
        gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
        Unicode 10: http://unicode.org/cldr/trac/ticket/10219
        Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
-
com.ibm.icu.util.VersionInfo 1473- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 1474 1475- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 1476 so that the makefiles see the new version number. 1477 1478*** data files & enums & parser code 1479 1480* download files 1481- mkdir -p $UNICODE_DATA 1482- download Unicode files into $UNICODE_DATA 1483 + subfolders: emoji, idna, security, ucd, uca 1484 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 1485 1486* for manual diffs and for Unicode Tools input data updates: 1487 remove version suffixes from the file names 1488 ~$ unidata/desuffixucd.py $UNICODE_DATA 1489 (see https://sites.google.com/site/unicodetools/inputdata) 1490 1491* process and/or copy files 1492- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1493 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1494 + For debugging, and tweaking how ppucd.txt is written, 1495 the tool has an --only_ppucd option: 1496 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1497 1498- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 1499 1500* build ICU (make install) 1501 so that the tools build can pick up the new definitions from the installed header files. 

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
                   u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
                   u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
     (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic                            ; Elymaic
    blk; Nandinagari                        ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong             ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers              ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext                     ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A      ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup                          ; Tamil_Supplement
    blk; Wancho                             ; Wancho
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
1542 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 1543 1544* update spoof checker UnicodeSet initializers: 1545 inclusionPat & recommendedPat in uspoof.cpp 1546 INCLUSION & RECOMMENDED in SpoofChecker.java 1547- make sure that the Unicode Tools tree contains the latest security data files 1548- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 1549- update the hardcoded version number there in the DIRECTORY path 1550- run the tool (no special environment variables needed) 1551- copy & paste from the Console output into the .cpp & .java files 1552 1553* generate normalization data files 1554 cd $ICU_ROOT/dbg/icu4c 1555 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1556 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1557 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1558 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1559 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1560 1561* build ICU (make install) 1562 so that the tools build can pick up the new definitions from the installed header files. 1563 1564 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 1565 1566* build Unicode tools using CMake+make 1567 1568$ICU_SRC/tools/unicode/c/icudefs.txt: 1569 1570# Location (--prefix) of where ICU was installed. 1571set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 1572# Location of the ICU4C source tree. 
1573set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 1574 1575 $ICU_ROOT/dbg$ 1576 mkdir -p tools/unicode/c 1577 cd tools/unicode/c 1578 1579 $ICU_ROOT/dbg/tools/unicode/c$ 1580 cmake ../../../../src/tools/unicode/c 1581 make 1582 1583* generate core properties data files 1584 $ICU_ROOT/dbg/tools/unicode/c$ 1585 genprops/genprops $ICU_SRC/icu4c 1586 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 1587 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 1588- rebuild ICU (make install) & tools 1589 1590* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1591 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1592- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1593- Unicode 6.0..12.0: U+2260, U+226E, U+226F 1594- nothing new in this Unicode version, no test file to update 1595 1596* run & fix ICU4C tests 1597- update test of default bidi classes: 1598 Bidi range \U0001ED00-\U0001ED4F changes default from R to AL, 1599 see diffs in DerivedBidiClass.txt 1600 + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[] 1601 + UCharacterTest.java TestIteration() defaultBidi[] 1602- Andy handles RBBI & spoof check test failures 1603 1604* collation: CLDR collation root, UCA DUCET 1605 1606- UCA DUCET goes into Mark's Unicode tools, see 1607 https://sites.google.com/site/unicodetools/home#TOC-UCA 1608 diff the main mapping file, look for bad changes 1609 (for example, more bytes per weight for common characters) 1610 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt 1611 ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt 1612 1613- CLDR root data files are checked into $CLDR_SRC/common/uca/ 1614 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 1615 1616- update 
source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1617 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 1618- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1619 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 1620 (note removing the underscore before "Rules") 1621 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 1622- restore TODO diffs in UCARules.txt 1623 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 1624- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1625 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1626 from the CLDR root files (..._CLDR_..._SHORT.txt) 1627 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 1628 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 1629 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 1630- if CLDR common/uca/unihan-index.txt changes, then update 1631 CLDR common/collation/root.xml <collation type="private-unihan"> 1632 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 1633 1634- run genuca, see command line above; 1635 deal with 1636 Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 1637 FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible) 1638 (add the character to genuca.cpp sampleCharsToScripts[]) 1639 + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script) 1640 and cache its values. 1641 Works as long as the script metadata is updated before the collation data. 
1642- rebuild ICU4C 1643 1644* Unihan collators 1645 https://sites.google.com/site/unicodetools/unihan 1646- run Unicode Tools 1647 org.unicode.draft.GenerateUnihanCollators 1648 with VM arguments 1649 -ea 1650 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 1651 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 1652 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 1653 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1654 -DUVERSION=12.0.0 1655- run Unicode Tools 1656 org.unicode.draft.GenerateUnihanCollatorFiles 1657 with the same arguments 1658- check CLDR diffs 1659 cd $CLDR_SRC 1660 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 1661 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 1662- copy to CLDR 1663 cd $CLDR_SRC 1664 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 1665 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 1666- run CLDR unit tests, commit to CLDR 1667- generate ICU zh collation data: run CLDR 1668 org.unicode.cldr.icu.NewLdml2IcuConverter 1669 with program arguments 1670 -t collation 1671 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation 1672 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 1673 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 1674 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 1675 zh 1676 and VM arguments 1677 -ea 1678 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1679- rebuild ICU4C 1680 1681* run & fix ICU4C tests, now with new CLDR collation root data 1682- run all tests with the collation test data *_SHORT.txt or the full files 1683 (the full ones have comments, useful for debugging) 1684- note on intltest: if collate/UCAConformanceTest fails, then 1685 utility/MultithreadTest/TestCollators will fail as well; 1686 fix the conformance test before looking into the multi-thread 
test 1687 1688* update Java data files 1689- refresh just the UCD/UCA-related/derived files, just to be safe 1690- see (ICU4C)/source/data/icu4j-readme.txt 1691- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1692- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1693 output: 1694 ... 1695 Unicode .icu files built to ./out/build/icudt63l 1696 echo timestamp > uni-core-data 1697 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b 1698 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b 1699 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 1700 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b 1701 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b" 1702 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/ 1703 mkdir -p /tmp/icu4j/main/shared/data 1704 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 1705 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/ 1706 mkdir -p /tmp/icu4j/main/shared/data 1707 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 1708 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1709- copy the big-endian Unicode data files to another location, 1710 separate from the other data files, 1711 and then refresh ICU4J 1712 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 1713 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1714 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1715 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1716 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1717 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 1718 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1719 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1720 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1721 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 1722 1723* When refreshing all of ICU4J data from ICU4C 1724- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1725- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 1726or 1727- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 1728 1729* update CollationFCD.java 1730 + copy & paste the initializers of lcccIndex[] etc. from 1731 ICU4C/source/i18n/collationfcd.cpp to 1732 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 1733 1734* refresh Java test .txt files 1735- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1736 cd $ICU_SRC/icu4c/source/data/unidata 1737 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1738 cd ../../test/testdata 1739 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1740 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1741 1742* run & fix ICU4J tests 1743 1744*** API additions 1745- send notice to icu-design about new born-@stable API (enum constants etc.) 
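
The new block enum constants are part of the born-@stable API of this update. The two Eclipse find/replace pairs recorded under "new constants for new property values" above can also be applied from a shell. The following is a minimal sketch that uses sed -E with the same find patterns; the function names and the sample line in the usage note are made up for illustration, and real input would be the new UBLOCK_ lines copied from uchar.h.

```shell
# Hypothetical helpers, not part of the ICU tools.
# Each reads uchar.h UBLOCK_ enum lines on stdin and writes Java source lines.
# '#' is used as the sed delimiter so that '/' in the comment needs no escaping.

# UBLOCK_NAME = 123, /*[XXXX]*/
#   -> public static final int NAME_ID = 123; /*[XXXX]*/
ublock_to_java_ids() {
    sed -E 's#UBLOCK_([^ ]+) = ([0-9]+), (/.+)#public static final int \1_ID = \2; \3#'
}

# UBLOCK_NAME = 123, /*[XXXX]*/
#   -> public static final UnicodeBlock NAME = new UnicodeBlock("NAME", NAME_ID); /*[XXXX]*/
# This pattern has only two capture groups, so the trailing comment is \2.
ublock_to_java_objects() {
    sed -E 's#UBLOCK_([^ ]+) = [0-9]+, (/.+)#public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2#'
}
```

Usage with an invented enum line (the name and number are placeholders, not real ICU values):
    printf '    UBLOCK_EXAMPLE_BLOCK = 999, /*[12345]*/\n' | ublock_to_java_ids
The leading indentation of each input line is kept because the patterns are not anchored.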

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
    for example, look for
        ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
    in new blocks (Blocks.txt)
    Unicode 12: using Unicode 12 CLDR ticket #11478
        hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
        wcho 1E2F0..1E2F9 Wancho
    Unicode 11: using Unicode 11 CLDR ticket #10978
        rohg 10D30..10D39 Hanifi_Rohingya
        gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
        Unicode 10: http://unicode.org/cldr/trac/ticket/10219
        Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

ICU 63 addition of support for the text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
    + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
    + uchar.h: update UCHAR_INT_LIMIT
    + uchar.h: add the enum U<long prop name>
        with constants U_<short prop name>_<long value name>
    +
UProperty.java: add the constant <long prop name> 1797 + UProperty.java: update INT_LIMIT 1798 + UCharacter.java: add the interface <long prop name> 1799 with constants <long value name> 1800 1801* process and/or copy files 1802- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1803 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1804 + It also writes tools/unicode/c/genprops/pnames_data.h with property and value 1805 names and aliases. 1806 + For debugging, and tweaking how ppucd.txt is written, 1807 the tool has an --only_ppucd option: 1808 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1809 1810* preparseucd.py changes 1811- add new property short names (uppercase) to _prop_and_value_re 1812 so that ParseUCharHeader() parses the new enum constants 1813 1814* build ICU (make install) 1815 so that the tools build can pick up the new definitions from the installed header files. 1816 1817 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1818 1819* build Unicode tools using CMake+make 1820 1821$ICU_SRC/tools/unicode/c/icudefs.txt: 1822 1823# Location (--prefix) of where ICU was installed. 1824set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 1825# Location of the ICU4C source tree. 
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* write data for runtime, hardcoded for now
- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
- generate new icu4c/source/common/ulayout_props_data.h
- for each of the three new enumerated properties
    + int property max value
    + small, 8-bit UCPTrie
    (A small 16-bit trie with bit fields for these three properties
    is very nearly the same size as the sum of the three.)

* wire into C++
- uprops.cpp: #include ulayout_props_data.h
- uprops.cpp: add getInPC() etc. functions
- uprops.cpp: add lines to intProps[], include max values
- uprops.h: add UPropertySource constants
- uprops.cpp: add uprops_addPropertyStarts(src)
- uniset_props.cpp: add to UnicodeSet_initInclusion()
- intltest/ucdtest.cpp: write unit tests

* update Java data files
- refresh just the pnames.icu file with the new property [value] names, just to be safe
- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* wire into Java
- UCharacterProperty.java: add new SRC_INPC etc.
constants as in C++ 1873- UCharacterProperty.java: for each new property 1874 + create a nested class to hold its CodePointTrie 1875 + initialize it from a string literal 1876 + paste in the initializer printed by genprops 1877 + add a new IntProperty object to the intProps[] array 1878 + use the correct max int value for each property, also printed by genprops 1879- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) 1880- UnicodeSet.java: add to getInclusions() 1881- UCharacterTest.java: write unit tests 1882 1883---------------------------------------------------------------------------- *** 1884 1885Unicode 11.0 update for ICU 62 1886 1887http://www.unicode.org/versions/Unicode11.0.0/ 1888http://unicode.org/versions/beta-11.0.0.html 1889https://www.unicode.org/review/pri372/ 1890http://www.unicode.org/reports/uax-proposed-updates.html 1891http://www.unicode.org/reports/tr44/tr44-21.html 1892 1893* Command-line environment setup 1894 1895UNICODE_DATA=~/unidata/uni11/20180521 1896CLDR_SRC=~/svn.cldr/uni 1897ICU_ROOT=~/svn.icu/uni 1898ICU_SRC=$ICU_ROOT/src 1899ICUDT=icudt61b 1900ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 1901ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 1902export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 1903 1904*** ICU Trac 1905 1906- ticket:13630: Unicode 11 1907- ^/branches/markus/uni11 1908 1909*** CLDR Trac 1910 1911- cldrbug 10978: Unicode 11 1912- ^/branches/markus/uni11 1913 1914*** Unicode version numbers 1915- makedata.mak 1916- uchar.h 1917- com.ibm.icu.util.VersionInfo 1918- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 1919 1920- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 1921 so that the makefiles see the new version number. 
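
Before running configure, it can help to confirm which version uchar.h actually declares. The following is a minimal sketch, assuming that the header defines the version as a quoted string on a line of the form #define U_UNICODE_VERSION "..."; the function name is made up for illustration and is not part of the ICU tools.

```shell
# Hypothetical helper, not part of the ICU tools.
# usage: unicode_version_in_header PATH_TO_UCHAR_H
# Prints the quoted value of the U_UNICODE_VERSION macro, if the header has one.
# Assumption: the macro is defined on a single line as a double-quoted string.
unicode_version_in_header() {
    sed -n 's/^#define[[:space:]][[:space:]]*U_UNICODE_VERSION[[:space:]][[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

For example:
    unicode_version_in_header $ICU_SRC/icu4c/source/common/unicode/uchar.h
An empty result suggests that the macro was not found in the expected form. The other files in the list above still need to be checked by hand.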

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- fix other errors
    NameError: unknown property Extended_Pictographic
    -> add Extended_Pictographic binary property
    -> add new short names for all Emoji properties

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
                     u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
                     u'Indic_Siyaq_Numbers'])),
       (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
       (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
       (u'GCB', set([u'LinkC', u'Virama'])),
       (u'WB', set([u'WSegSpace']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chess_Symbols       ; Chess_Symbols
    blk; Dogra               ; Dogra
    blk; Georgian_Ext        ; Georgian_Extended
    blk; Gunjala_Gondi       ; Gunjala_Gondi
    blk; Hanifi_Rohingya     ; Hanifi_Rohingya
    blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
    blk; Makasar             ; Makasar
    blk; Mayan_Numerals      ; Mayan_Numerals
    blk; Medefaidrin         ; Medefaidrin
    blk; Old_Sogdian         ; Old_Sogdian
    blk; Sogdian             ; Sogdian
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    GCB; LinkC  ; LinkingConsonant
    GCB; Virama ; Virama
  -> uchar.h & UCharacter.GraphemeClusterBreak
  -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76

    InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
  -> ignore: ICU does not yet support this property

    jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
    jg ; Hanifi_Rohingya_Pa       ; Hanifi_Rohingya_Pa
  -> uchar.h & UCharacter.JoiningGroup

    sc ; Dogr ; Dogra
    sc ; Gong ; Gunjala_Gondi
    sc ; Maka ; Makasar
    sc ; Medf ; Medefaidrin
    sc ; Rohg ; Hanifi_Rohingya
    sc ; Sogd ; Sogdian
    sc ; Sogo ; Old_Sogdian
  -> uscript.h & com.ibm.icu.lang.UScript
  -> Nushu had been added already
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

    WB ; WSegSpace ; WSegSpace
  -> uchar.h & UCharacter.WordBreak

* New short names for emoji properties
- see UTS #51
- short names set in preparseucd.py

* New properties
- boolean emoji property Extended_Pictographic
  -> added in preparseucd.py
  -> uchar.h & UProperty.java
- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
  as shown in PropertyValueAliases.txt
  -> ignore for now

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* Fix case props
    genprops error: casepropsbuilder: too many exceptions words
    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
- With the addition of Georgian Mtavruli capital letters,
  there are now too many simple case mappings with big mapping deltas
  that yield uncompressible exceptions.
- Changing the data structure (now formatVersion 4),
  adding one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Further changes to gain one more bit for the exceptions index,
  for future growth. For details, see casepropsbuilder.cpp.
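
The "big delta plus sign bit" idea from the case props notes above can be sketched roughly as follows. This is not the actual ucase.icu layout (casepropsbuilder.cpp is the authority for that). The inline delta range, the tuple representation, and the function names are assumptions chosen only to show why storing a magnitude and a sign flag keeps faraway simple mappings compact.

```python
# Assumption for illustration: a delta that fits in a small signed field
# is stored inline; the real bit widths are defined by the ICU data format.
SMALL_DELTA_MIN = -256
SMALL_DELTA_MAX = 255


def encode_simple_mapping(code_point, mapped):
    """Return ("inline", delta) or ("exception", magnitude, is_negative)."""
    delta = mapped - code_point
    if SMALL_DELTA_MIN <= delta <= SMALL_DELTA_MAX:
        return ("inline", delta)
    # Faraway mapping: one slot holds the unsigned magnitude of the delta,
    # and one extra bit records whether the delta is negative.
    return ("exception", abs(delta), delta < 0)


def decode_simple_mapping(code_point, encoded):
    """Recover the mapped code point from either representation."""
    if encoded[0] == "inline":
        return code_point + encoded[1]
    _, magnitude, is_negative = encoded
    return code_point - magnitude if is_negative else code_point + magnitude


if __name__ == "__main__":
    # U+1C90 (Georgian Mtavruli AN) lowercases to U+10D0;
    # U+AB70 (Cherokee small A) uppercases to U+13A0.
    for cp, mapped in [(0x41, 0x61), (0x1C90, 0x10D0), (0xAB70, 0x13A0)]:
        enc = encode_simple_mapping(cp, mapped)
        print("U+%04X -> U+%04X" % (cp, decode_simple_mapping(cp, enc)), enc)
```

Storing a delta instead of the full target code point lets many letters of one script share the same exception value, which is what makes such mappings compressible.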

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..11.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

- Errors in char.txt, word.txt, word_POSIX.txt like
    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
     not empty, just to get ICU building.
  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
     and properties together with the rules that used them (GB 10, WB 14).
  -> Andy adjusts the rule sets further to sync with
     Unicode 11 grapheme, word, and line break spec changes.
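
For the uts46 check above (the grep for "disallowed_STD3_valid" on non-ASCII characters), a small script can do the filtering that plain grep leaves to the eye. This is a minimal sketch that assumes the semicolon-separated "code point(s) ; status ; ..." layout of IdnaMappingTable.txt; it does not handle the uts46.txt format. The function names and the sample lines are illustrative, not copied from a real data file.

```python
def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' into (start, end) code points."""
    start, _, end = field.partition("..")
    return int(start, 16), int(end or start, 16)


def non_ascii_std3_valid(lines):
    """Yield non-ASCII code points whose status is disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, end = parse_range(fields[0])
        # Skip the ASCII part of a range; only non-ASCII matters here.
        for cp in range(max(start, 0x80), end + 1):
            yield cp


# Illustrative lines in the assumed field layout.
SAMPLE = """\
0000..002C    ; disallowed_STD3_valid          # <control>..COMMA
0041          ; mapped                 ; 0061  # LATIN CAPITAL LETTER A
2260          ; disallowed_STD3_valid          # NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid          # NOT LESS-THAN..NOT GREATER-THAN
""".splitlines()

if __name__ == "__main__":
    for cp in non_ascii_std3_valid(SAMPLE):
        print("U+%04X" % cp)
```

Comparing the output for the old and new data files shows whether the list has grown beyond U+2260, U+226E, U+226F.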

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
        (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=11.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt61l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9:  http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
  -> adjust _scripts_only_in_iso15924 as indicated
- fix other errors
    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
  -> add vo=Vertical_Orientation to _ignored_properties
  -> later removed again, parsing the file, even though we do not yet store data for runtime use

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
                     u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
       (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
                    u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
                    u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
       (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; CJK_Ext_F        ; CJK_Unified_Ideographs_Extension_F
    blk; Kana_Ext_A       ; Kana_Extended_A
    blk; Masaram_Gondi    ; Masaram_Gondi
    blk; Nushu            ; Nushu
    blk; Soyombo          ; Soyombo
    blk; Syriac_Sup       ; Syriac_Supplement
    blk; Zanabazar_Square ; Zanabazar_Square
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    jg ; Malayalam_Bha  ; Malayalam_Bha
    jg ; Malayalam_Ja   ; Malayalam_Ja
    jg ; Malayalam_Lla  ; Malayalam_Lla
    jg ; Malayalam_Llla ; Malayalam_Llla
    jg ; Malayalam_Nga  ; Malayalam_Nga
    jg ; Malayalam_Nna  ; Malayalam_Nna
    jg ; Malayalam_Nnna ; Malayalam_Nnna
    jg ; Malayalam_Nya  ; Malayalam_Nya
    jg ; Malayalam_Ra   ; Malayalam_Ra
    jg ; Malayalam_Ssa  ; Malayalam_Ssa
    jg ; Malayalam_Tta  ; Malayalam_Tta
  -> uchar.h & UCharacter.JoiningGroup

    sc ; Gonm ; Masaram_Gondi
    sc ; Nshu ; Nushu
    sc ; Soyo ; Soyombo
    sc ; Zanb ; Zanabazar_Square
  -> uscript.h & com.ibm.icu.lang.UScript
  -> Nushu had been added already
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* New properties as shown in PropertyValueAliases.txt changes
- boolean Emoji_Component from emoji 5
  -> uchar.h & UProperty.java
- boolean
    # Regional_Indicator (RI)

    RI ; N ; No  ; F ; False
    RI ; Y ; Yes ; T ; True
  -> uchar.h & UProperty.java
  -> single immutable range, to be hardcoded
- boolean
    # Prepended_Concatenation_Mark (PCM)

    PCM; N ; No  ; F ; False
    PCM; Y ; Yes ; T ; True
  -> was new in Unicode 9
  -> uchar.h & UProperty.java
- enumerated
    # Vertical_Orientation (vo)

    vo ; R  ; Rotated
    vo ; Tr ; Transformed_Rotated
    vo ; Tu ; Transformed_Upright
    vo ; U  ; Upright
  -> only pre-parsed for now, but not yet stored for runtime use

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..10.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
        (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
    -DUVERSION=10.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt60l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
  Unicode 9:  http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Emoji 5.0 update for ICU 59
- ICU 59 mostly remains on Unicode 9.0
- except updates bidi and segmentation data to Unicode 10 beta

First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
ICUDT=icudt59b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

*** ICU Trac

- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
- changes directly on trunk

*** data files & enums & parser code

* download files

- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
- download emoji 5.0 beta files into the same uni90e50 folder
- download Unicode 10.0 beta files: ucd
  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
    BidiBrackets.txt
    BidiCharacterTest.txt
    BidiMirroring.txt
    BidiTest.txt
    extracted/DerivedBidiClass.txt
  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
    LineBreak.txt
    auxiliary/*

* preparseucd.py changes
- adjust for combined trunks
- write new copyright lines
- ignore new Emoji_Component property for now

* process and/or copy files
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)

    ~/svn.icu/trunk/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    ~/svn.icu/trunk/dbg/tools/unicode/c$
    genprops/genprops $ICU4C_SRC_DIR
- rebuild ICU (make install) & tools

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt59l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
or
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU4C_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

---------------------------------------------------------------------------- ***

Unicode 9.0 update for ICU 58

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
  (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
    in ppucd.txt:
      algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
    ->
      algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.
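
Illustration (not part of the ICU tools): the new Malayalam fraction values
listed under "preparseucd.py changes" above carry both a decimal field and a
fraction field. The following minimal Python sketch checks that the two fields
agree for those five lines. The helper names parse_numeric_value_line() and
find_mismatches() are hypothetical, and the exact comparison only works for
values whose decimal form terminates; it is an assumption that the check
would be useful in this form, not a description of what preparseucd.py does.

```python
from fractions import Fraction

# Sample lines copied from the DerivedNumericValues.txt excerpt in this log.
SAMPLE_LINES = [
    "0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH",
    "0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH",
    "0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS",
    "0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH",
    "0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS",
]


def parse_numeric_value_line(line):
    """Hypothetical helper: return (code point, decimal value, fraction value).

    Both values are returned as exact Fraction objects.
    """
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    code_point = int(fields[0], 16)
    decimal_value = Fraction(fields[1])   # e.g. "0.00625"
    fraction_value = Fraction(fields[3])  # e.g. "1/160"
    return code_point, decimal_value, fraction_value


def find_mismatches(lines):
    """Return the code points whose decimal and fraction fields disagree."""
    mismatches = []
    for line in lines:
        code_point, decimal_value, fraction_value = parse_numeric_value_line(line)
        if decimal_value != fraction_value:
            mismatches.append(code_point)
    return mismatches


if __name__ == "__main__":
    for cp in find_mismatches(SAMPLE_LINES):
        print("mismatch at U+%04X" % cp)
```

A non-terminating value such as 1/3 would need a tolerance instead of exact
equality, because its decimal field is necessarily rounded.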

* PropertyValueAliases.txt new property values
    blk; Adlam               ; Adlam
    blk; Bhaiksuki           ; Bhaiksuki
    blk; Cyrillic_Ext_C      ; Cyrillic_Extended_C
    blk; Glagolitic_Sup      ; Glagolitic_Supplement
    blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
    blk; Marchen             ; Marchen
    blk; Mongolian_Sup       ; Mongolian_Supplement
    blk; Newa                ; Newa
    blk; Osage               ; Osage
    blk; Tangut              ; Tangut
    blk; Tangut_Components   ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    GCB; EB  ; E_Base
    GCB; EBG ; E_Base_GAZ
    GCB; EM  ; E_Modifier
    GCB; GAZ ; Glue_After_Zwj
    GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

    jg ; African_Feh  ; African_Feh
    jg ; African_Noon ; African_Noon
    jg ; African_Qaf  ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

    lb ; EB  ; E_Base
    lb ; EM  ; E_Modifier
    lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

    sc ; Adlm ; Adlam
    sc ; Bhks ; Bhaiksuki
    sc ; Marc ; Marchen
    sc ; Newa ; Newa
    sc ; Osge ; Osage
    sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

    WB ; EB  ; E_Base
    WB ; EBG ; E_Base_GAZ
    WB ; EM  ; E_Modifier
    WB ; GAZ ; Glue_After_Zwj
    WB ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
    cmake ../../../src/unicode/c
    make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
    genprops/genprops $ICU_SRC_DIR
    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
        FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
  http://www.unicode.org/utility/trac/log/trunk/unicodetools
  http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
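
Illustration (not part of the ICU tools): the find/replace pair given above
for com.ibm.icu.lang.UScript can also be tried outside of an editor. The
sketch below applies the same pattern with Python's re module. The function
name c_enum_to_java() and the sample input line are hypothetical; the sample
is made up for the demonstration and is not copied from uscript.h.

```python
import re

# Same pattern as the "find" expression in this log.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")

# Same text as the "replace" expression in this log.
JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def c_enum_to_java(line):
    """Hypothetical helper: turn one C enum line into a Java int constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


if __name__ == "__main__":
    # Made-up input line, for illustration only.
    print(c_enum_to_java("USCRIPT_EXAMPLE = 999,/* Xxxx */"))
```

Text before the match, such as leading indentation, is left in place by
re.sub(), so the result may still need re-indenting in the Java file.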

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt57b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
see https://unicode-org.atlassian.net/browse/ICU-12141

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

---------------------------------------------------------------------------- ***

Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802

Edit preparseucd.py to add & parse new properties.
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
  IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (improper fractions)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to proper fractions (e.g., 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
    blk; Ahom ; Ahom
    blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
    blk; Cherokee_Sup ; Cherokee_Supplement
    blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
    blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
    blk; Hatran ; Hatran
    blk; Multani ; Multani
    blk; Old_Hungarian ; Old_Hungarian
    blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
    blk; Sutton_SignWriting ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
    sc ; Ahom ; Ahom
    sc ; Hatr ; Hatran
    sc ; Hluw ; Anatolian_Hieroglyphs
    sc ; Hung ; Old_Hungarian
    sc ; Mult ; Multani
    sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
    ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
- nothing new in 8.0, no test file to update

* run & fix ICU4C tests
- bad Cherokee case folding due to difference in fallbacks:
  UCD case folding falls back to no mapping,
  ICU runtime case folding falls back to lowercasing;
  fixed casepropsbuilder.cpp to generate scf mappings to self
  when there is an slc mapping but no scf
- Andy handles RBBI & spoof check test failures

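Illustration of the Cherokee case folding note above. This is only a minimal
Python sketch of the idea, not the code in casepropsbuilder.cpp; the function
names and the dict-based data model are made up for this example. It assumes
that the runtime prefers an explicit scf mapping and otherwise falls back to
slc, as described in the note. The sample pair is U+13A0 (CHEROKEE LETTER A)
and U+AB70 (CHEROKEE SMALL LETTER A).

```python
def runtime_fold(c, slc, scf):
    """Model of the runtime fallback: use scf if present, else slc, else c."""
    if c in scf:
        return scf[c]
    return slc.get(c, c)


def add_self_foldings(slc, scf):
    """Return a copy of scf with c->c added wherever slc maps c but scf does not."""
    fixed = dict(scf)
    for c in slc:
        fixed.setdefault(c, c)
    return fixed


# Hypothetical one-character excerpt of the data:
# the uppercase letter lowercases to U+AB70, and U+AB70 folds to uppercase.
slc = {0x13A0: 0xAB70}
scf = {0xAB70: 0x13A0}

# Without the fix, the fallback folds U+13A0 to its lowercase form.
print(hex(runtime_fold(0x13A0, slc, scf)))

# With explicit self-mappings, both letters fold to U+13A0.
fixed_scf = add_self_foldings(slc, scf)
print(hex(runtime_fold(0x13A0, slc, fixed_scf)))
print(hex(runtime_fold(0xAB70, slc, fixed_scf)))
```

The explicit self-mapping costs one extra entry per affected character but
keeps the runtime fallback unchanged for all other characters.
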
* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the script for the new sample characters
    (e.g., in FractionalUCA.txt)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- fixed bug in CollationWeights::getWeightRanges()
  exposed by new data and CollationTest::TestRootElements

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt56l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
  because the layout engine was deprecated in ICU 54.
  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
  to write lines that we used to add manually.

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ; Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
    (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
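
The USCRIPT find/replace above can also be tried outside of Eclipse.
A minimal Python sketch, assuming that Python's re module treats this pattern
and the \1..\3 group references the same way as the Eclipse dialog does;
the helper name and the sample enum line (including its value 999) are
hypothetical and not taken from uscript.h:

```python
import re

# Pattern and replacement copied from the find/replace pair in this log.
USCRIPT_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def c_enum_to_java(line):
    """Rewrite one C enum line from uscript.h into a Java int constant."""
    return USCRIPT_LINE.sub(JAVA_CONSTANT, line)


# Hypothetical input line, for illustration only.
sample = "    USCRIPT_EXAMPLE = 999,/** @stable ICU 54 */"
print(c_enum_to_java(sample))
```

Lines that do not match the pattern (comments, blank lines) pass through
unchanged, so the function can be applied to a whole copied enum block
line by line.
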

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
    bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
    -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
    bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
    bpt; c ; Close
    bpt; n ; None
    bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
    bc ; FSI ; First_Strong_Isolate
    bc ; LRI ; Left_To_Right_Isolate
    bc ; RLI ; Right_To_Left_Isolate
    bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
    WB ; HL ; Hebrew_Letter
    WB ; SQ ; Single_Quote
    WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
    Aghb 239 Caucasian Albanian
    Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
      find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
      replace public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
    lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
    WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
    GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
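
Note: the four gennorm2 command lines above differ only in the output name and
the list of input files. A minimal sketch that prints them for review instead
of running them; gennorm2_cmds is a hypothetical helper name (not part of ICU),
and the two arguments stand for $SRC_DATA_IN and $UNIDATA as set above.

```shell
# Sketch only: print the gennorm2 command lines, do not execute them.
# gennorm2_cmds is a hypothetical helper, not an ICU tool.
gennorm2_cmds() {
    src_data_in=$1   # e.g. ~/svn.icu/uni62/src/source/data/in
    unidata=$2       # e.g. ~/svn.icu/uni62/src/source/data/unidata
    # Each entry is "<output name>:<input files>"; every table starts from nfc.txt.
    for spec in \
        "nfc:nfc.txt" \
        "nfkc:nfc.txt nfkc.txt" \
        "nfkc_cf:nfc.txt nfkc.txt nfkc_cf.txt" \
        "uts46:nfc.txt uts46.txt"
    do
        name=${spec%%:*}     # text before the first colon
        inputs=${spec#*:}    # text after the first colon
        echo "bin/gennorm2 -o $src_data_in/$name.nrm -s $unidata/norm2 $inputs"
    done
}
```

Call it as gennorm2_cmds "$SRC_DATA_IN" "$UNIDATA" from the dbg folder and
compare the printed lines with the ones listed above before pasting them.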
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
    ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
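
Note: one way to spot ISO-only script codes that Unicode has since added is to
look for their "sc" lines in PropertyValueAliases.txt. A minimal sketch;
scripts_now_in_unicode is a hypothetical helper name, the codes are passed by
hand (it does not read preparseucd.py), and it assumes "sc ; Code ; Name"
lines as quoted elsewhere in this log.

```shell
# Sketch only: print each given script code that has an "sc" line in the
# given PropertyValueAliases.txt. scripts_now_in_unicode is hypothetical.
scripts_now_in_unicode() {
    pva_file=$1
    shift
    for code in "$@"; do
        # Matches lines such as "sc ; Cakm ; Chakma".
        if grep -Eq "^sc *; *$code *;" "$pva_file"; then
            echo "$code"
        fi
    done
}
```

Codes printed by this check are candidates for removal from the
_scripts_only_in_iso15924 variable; codes not printed stay ISO-only.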

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
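
Note: a quick way to find commented-out lines that a fresh copy of a test file
dropped is to diff the old and new versions and keep only old lines that start
with "#". A minimal sketch; dropped_comment_lines is a hypothetical helper name,
and it assumes that known-failing data lines were commented out with a leading
"#". Changed header comments show up as well and need to be skipped by eye.

```shell
# Sketch only: list "#" lines present in the old file but not in the new one.
# dropped_comment_lines is a hypothetical helper, not part of the ICU tools.
dropped_comment_lines() {
    old_file=$1
    new_file=$2
    # In diff's default output, lines that exist only in the old file
    # are prefixed with "< ".
    diff "$old_file" "$new_file" | grep '^< *#'
}
```

Each line printed is a candidate for being commented out again in the new file.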

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
      Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
      replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
      Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
      replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
      find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
      replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
# Arabic Mathematical Alphabetic Symbols:
# U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
# Meroitic Hieroglyphs:
# U+10980 - U+1099F
# Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
    # Each line has two fields
    # First field: Code point
    # Second field: Alias
- to
    # Each line has three fields, as described here:
    #
    # First field: Code point
    # Second field: Alias
    # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
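
Note: the NameAliases.txt change above restricts gennames to "correction"
aliases. The same filter can be sketched on the command line to preview which
lines gennames is expected to keep. correction_aliases is a hypothetical helper
name; it assumes three semicolon-separated fields with nothing after the type
field, as in the sample lines quoted above. It is not the gennames parser.

```shell
# Sketch only: print "code point;alias" for NameAliases.txt lines whose
# third field is "correction". correction_aliases is a hypothetical helper.
correction_aliases() {
    # Skip comment lines; short or blank lines fail the NF test.
    awk -F';' '!/^#/ && NF >= 3 && $3 == "correction" { print $1 ";" $2 }' "$1"
}
```

For the FEFF lines quoted above this prints nothing, because none of them has
the type "correction".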
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
    This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
    in the working directory.
    (It also generates ScriptRunData.cpp, which is no longer needed.)

    The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    review and manually merge desired changes;
    fix gratuitous changes, incorrect @draft and missing aliases;
    Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
    fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
    manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
    instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2010-12-21)
        Afak 439 Afaka
        Jurc 510 Jurchen
        Mroo 199 Mro, Mru
        Nshu 499 Nüshu
        Shrd 319 Sharada, Śāradā
        Sora 398 Sora Sompeng
        Takr 321 Takri, Ṭākrī, Ṭāṅkrī
        Tang 520 Tangut
        Wole 480 Woleai
    -> uscript.h
    -> com.ibm.icu.lang.UScript
        find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace public static final int \1 = \2;\3
    -> genpname/SyntheticPropertyValueAliases.txt
    -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
        and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
    + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
    + make sure that data.h is writable
    + perl preparse.pl ~/svn.icu/trunk/src > out.txt
    + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
    up the new definitions from the installed header files.

* run genpname
    (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
    separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
    (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
    but not listed in PropertyAliases.txt; reported to unicode.org;
    -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
        scx; Script_Extensions
    -> uchar.h with new UProperty section
    -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
        Alchemical_Symbols
        Bamum_Supplement
        Batak
        Brahmi
        CJK_Unified_Ideographs_Extension_D
        Emoticons
        Ethiopic_Extended_A
        Kana_Supplement
        Mandaic
        Miscellaneous_Symbols_And_Pictographs
        Playing_Cards
        Transport_And_Map_Symbols
    -> add to uchar.h
    -> add to UCharacter.UnicodeBlock
        Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
        replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
    -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
        sc ; Batk ; Batak
        sc ; Brah ; Brahmi
        sc ; Mand ; Mandaic
    -> remove these from SyntheticPropertyValueAliases.txt
    -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
        and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2009-11-11..2010-07-18)
        Bass 259 Bassa Vah
        Dupl 755 Duployan shorthand
        Elba 226 Elbasan
        Gran 343 Grantha
        Kpel 436 Kpelle
        Loma 437 Loma
        Mend 438 Mende
        Merc 101 Meroitic Cursive
        Narb 106 Old North Arabian
        Nbat 159 Nabataean
        Palm 126 Palmyrene
        Sind 318 Sindhi
        Wara 262 Warang Citi
    -> uscript.h
    -> com.ibm.icu.lang.UScript
        find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace public static final int \1 = \2;\3
    -> SyntheticPropertyValueAliases.txt
    -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
        and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
        Mero 100 Meroitic Hieroglyphs (was Meroitic)
    -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
        2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
        2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
    -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
    + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
    + make sure that data.h is writable
    + perl preparse.pl ~/svn.icu/trunk/src > out.txt
    + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
    up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
    to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
    sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
    UBiDiProps.java, UCaseProps.java, and several others with minor changes;
    UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
    separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
    copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
    (old instructions:
    copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
    to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
    http://www.macchiato.com/unicode/utc/additional-uca-files
    http://www.unicode.org/Public/UCA/6.0.0/
    http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
    swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
    do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
    (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
    and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
    with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
    utility/MultithreadTest/TestCollators will fail as well;
    fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
    Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
    copying them, and performing the equivalent processing of the
    ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
    moved from numeric to enumerated:
        ccc ; Canonical_Combining_Class
    new string properties:
        NFKC_CF ; NFKC_Casefold
        Name_Alias; Name_Alias
    new binary properties:
        Cased ; Cased
        CI ; Case_Ignorable
        CWCF ; Changes_When_Casefolded
        CWCM ; Changes_When_Casemapped
        CWKCF ; Changes_When_NFKC_Casefolded
        CWL ; Changes_When_Lowercased
        CWT ; Changes_When_Titlecased
        CWU ; Changes_When_Uppercased
    new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
    new block names
    new scripts
    one script code change:
        sc ; Qaai ; Inherited
        ->
        sc ; Zinh ; Inherited ; Qaai
    new Line_Break (lb) value:
        lb ; CP ; Close_Parenthesis
    new Joining_Group (jg) values: Farsi_Yeh, Nya
    other new values:
        ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
    new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
    all of the ISO comments are gone
    new CJK block end:
        9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
    new CJK block:
        2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
        2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
    + cd \svn\icuproj\icu\trunk\source\tools\genpname
    + make sure that data.h is writable
    + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
    + preparse.pl complains with errors like the following:
        Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
        This is because ICU 4.0 had scripts from ISO 15924 which are now
        added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
        and PropertyValueAliases.txt.
        -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
            Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
    + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + 26 new blocks
        copy new blocks from Blocks.txt
        MS VC++ 2008 regular expression:
            find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
            replace with " UBLOCK_\3 = 172, /*[\1]*/"
    + several new script values already added in ICU 4.0 for ISO 15924 coverage
        (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
    + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
    + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
        (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
        lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
    search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
    + do change gennames.c

* Compare definitions of new binary properties with what we used to use
    in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
    The gencase tool now parses the newly public Case_Ignorable values
    in case the definition changes in the future.
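
The Blocks.txt find/replace in the uchar.h notes above can also be scripted. What follows is only a minimal sketch, not a tool from the ICU repository: the function name block_enum_lines, its first_value parameter, and the naming rule (upper-case the block name, turn every run of other characters into one underscore) are assumptions made for illustration. The two sample lines reuse the Hangul Jamo Extended block ranges mentioned in this log, and the numeric values are placeholders that still need manual renumbering, just like the fixed 172 in the editor regex.

```python
import re

# Matches Blocks.txt data lines such as "A960..A97F; Hangul Jamo Extended-A".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); (.+)$")


def block_enum_lines(blocks_txt, first_value):
    """Return draft uchar.h enum lines for the block lines in blocks_txt.

    first_value is a hypothetical parameter: the number given to the first
    block found in the input. Review and renumber the result by hand.
    """
    lines = []
    value = first_value
    for raw in blocks_txt.splitlines():
        match = BLOCK_LINE.match(raw.strip())
        if match is None:
            continue  # skip comments, blank lines, and anything unexpected
        start, _end, name = match.groups()
        # Assumed naming rule: upper-case, other characters become underscores.
        constant = re.sub(r"[^A-Z0-9]+", "_", name.upper()).strip("_")
        lines.append("    UBLOCK_%s = %d, /*[%s]*/" % (constant, value, start))
        value += 1
    return lines


sample = """\
# excerpt in Blocks.txt format
A960..A97F; Hangul Jamo Extended-A
D7B0..D7FF; Hangul Jamo Extended-B
"""

for line in block_enum_lines(sample, 172):
    print(line)
```

Compare the generated lines against the existing enum before pasting them in; constants and aliases that deviate from the plain naming rule still have to be handled by hand.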

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
    1/7, 1/9, 1/10, 3/10, 1/16, 3/16
    the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
    therefore redesign the encoding of numeric types and values for formatVersion 6;
    design for simple numbers up to at least 144 ("one gross"),
    large values up to at least 10^20,
    and fractions with numerators -1..17 and denominators 1..16
    to cover current and expected future values
    (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
    and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
    Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
    and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
    Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
    - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
        on Windows:
        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
        python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
    End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
    not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
    utility/MultithreadTest/TestCollators will fail as well;
    fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice and
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
    a) add an _ID integer per new block, update COUNT
    b) add a class instance per new block
        Visual Studio regex:
        find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
        replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
    "I think the tool has been modified to update @draft to @stable for
    older scripts and to add @draft for new scripts.
    (I worked with an intern on this last year.)
    You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
    "This is just a matter of making sure that all the per-script tables have
    entries for any new scripts that were added.
    If any new Indic characters were added, then the class tables in
    IndicClassTables.cpp should be updated to reflect this.
    John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
    + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
    + cd \svn\icuproj\icu\uni51\source\tools\genpname
    + make sure that data.h is writable
    + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
    + preparse.pl complains with errors like the following:
        Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
        This is because ICU 3.8 had scripts from ISO 15924 which are now
        added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
        and PropertyValueAliases.txt.
        -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
            Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
    + PropertyValueAliases.txt now explicitly contains values for boolean properties:
        N/Y, No/Yes, F/T, False/True
        -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
            It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + 17 new blocks
    + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
        (removed from SyntheticPropertyValueAliases.txt)
    + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
        (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  None of the codes above 127 is yet the script code of an
  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
    + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
    + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
    + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (it also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.
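
Sketch for the "Test suites" item of this update: the boolean value aliases
N/Y, No/Yes, F/T, False/True should all be accepted. The following Python
snippet only illustrates the expected behavior; it is not ICU code and not the
TestBinaryValues() implementation. The function name parse_binary_value is
hypothetical, and the matching is a simplified form of loose matching
(ignores case, spaces, underscores and hyphens).

```python
# Hypothetical helper, not part of ICU: maps a binary property value alias
# (as listed in PropertyValueAliases.txt) to a Python bool.
_FALSE_ALIASES = {"n", "no", "f", "false"}
_TRUE_ALIASES = {"y", "yes", "t", "true"}


def parse_binary_value(alias: str) -> bool:
    """Return True/False for a binary property value alias.

    Simplified loose matching (an assumption of this sketch): case, spaces,
    underscores and hyphens are ignored.
    """
    key = "".join(ch for ch in alias.lower() if ch not in " _-")
    if key in _TRUE_ALIASES:
        return True
    if key in _FALSE_ALIASES:
        return False
    raise ValueError(f"not a binary property value alias: {alias!r}")


if __name__ == "__main__":
    for name in ("N", "No", "F", "False", "Y", "Yes", "T", "True"):
        print(name, parse_binary_value(name))
```

A test along these lines would loop over all eight aliases for each binary
property and compare the resulting sets.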

*** Documentation
- Update User Guide
    + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
    + make sure that data.h is writable
    + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (it also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
    + Pattern_Syntax
    + Pattern_White_Space
- new enumerated properties
    + Grapheme_Cluster_Break
    + Sentence_Break
    + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
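
Sketch of the u_charMirror() round-trip check listed above, in Python for
illustration only. It is not the ICU test code. MIRROR_SAMPLE is a small
hand-picked subset of BidiMirroring.txt pairs, and char_mirror and
find_round_trip_failures are hypothetical names. The sketch assumes, like
u_charMirror(), that a character without a mirror mapping maps to itself.

```python
# A few mirror pairs from BidiMirroring.txt (sample only; a real test would
# cover the whole file or all code points).
MIRROR_SAMPLE = {
    0x0028: 0x0029, 0x0029: 0x0028,  # ( )
    0x003C: 0x003E, 0x003E: 0x003C,  # < >
    0x005B: 0x005D, 0x005D: 0x005B,  # [ ]
    0x007B: 0x007D, 0x007D: 0x007B,  # { }
    0x00AB: 0x00BB, 0x00BB: 0x00AB,  # guillemets
}


def char_mirror(c, table=MIRROR_SAMPLE):
    """Stand-in for u_charMirror(): the mirror image of c, or c itself."""
    return table.get(c, c)


def find_round_trip_failures(code_points, mirror=char_mirror):
    """Return the code points c for which mirror(mirror(c)) != c."""
    return [c for c in code_points if mirror(mirror(c)) != c]


if __name__ == "__main__":
    failures = find_round_trip_failures(range(0x100))
    print("round-trip failures:", [f"U+{c:04X}" for c in failures])
```

A one-directional entry (for example only 0028 -> 0029) would show up as a
failure for 0028, which is the kind of data error this check is meant to catch.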

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
    + do not modify BOCU/BOCSU code because that would change the encoding
      and break binary compatibility!
    + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
      NamePrepProfile.txt
    + ignore trietest.c: test data is arbitrary
    + ignore tstnorm.cpp: test optimization, not important
    + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
    + do change line_th.txt and word_th.txt
      by replacing hardcoded ranges with the new property values
    + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
    + put short name comment only on line with new constant
      for genpname perl script parser
- new binary properties
    + STerm
    + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
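
Sketch of the DerivedNormalizationProps.txt format change that gennorm.c had
to track, in Python for illustration only (gennorm.c itself is C and is not
reproduced here). The function name parse_norm_prop_line is hypothetical. The
sketch assumes that the "etc." above covers the other old single-field names
ending in _NO or _MAYBE, and it maps both the old and the new spelling to the
new (property, value) form.

```python
def parse_norm_prop_line(line):
    """Parse one data line; return (range_text, property, value) or None.

    Old style:  "00C0..00C5 ; NFD_NO"      -> ("00C0..00C5", "NFD_QC", "N")
    New style:  "00C0..00C5 ; NFD_QC; N"   -> ("00C0..00C5", "NFD_QC", "N")
    Old name "FNC" is reported under its new name "FC_NFKC".
    """
    data = line.split("#", 1)[0].strip()  # drop comments and whitespace
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 fields: {line!r}")
    code_range, name = fields[0], fields[1]
    value = fields[2] if len(fields) > 2 else None
    if name == "FNC":
        name = "FC_NFKC"
    elif value is None and name.endswith(("_NO", "_MAYBE")):
        # Assumption: old single-field quick-check names split into
        # "<form>_QC" plus a one-letter value (N or M).
        form, _, answer = name.rpartition("_")
        name, value = form + "_QC", answer[0]
    return code_range, name, value


if __name__ == "__main__":
    for sample in ("00C0..00C5 ; NFD_NO", "00C0..00C5 ; NFD_QC; N",
                   "037A ; FNC; 0020 03B9"):
        print(parse_norm_prop_line(sample))
```

Normalizing both spellings to one form keeps the rest of a parser independent
of the file version.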

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
    + assume that the type is always numeric for Han characters,
      and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ