* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

Notes:

This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.

Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)

Syntax of this file:

`***` - section heading
`*` - sub heading
`-` - 1st level bullet
`+` - 2nd level bullet
`=` - 1st level bullet
`->` - "the previous thing leads to...", OR a 2nd level bullet/item

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.

---------------------------------------------------------------------------- ***

Unicode 16.0 update for ICU 76

TODO
- No more hardcoded spoof checker sets: Update change log.
- In the Unicode Tools repo: Delete the org.unicode.text.tools.RecommendedSetGenerator.
- In corepropsbuilder.cpp, remove the isA9CF hack.
- Update instructions for hardcoded properties
  IDS_Unary_Operator, ID_Compat_Math_Start & ID_Compat_Math_Continue:
  + These are still hardcoded, but since ICU 75 they are tested in C++ intltest.
  + No more need to check via grep.
  + Still: If the test fails, then update the hardcoded implementation.

---------------------------------------------------------------------------- ***

Unicode 15.1 update for ICU 74

https://www.unicode.org/versions/Unicode15.1.0/
https://www.unicode.org/versions/beta-15.1.0.html
https://www.unicode.org/Public/draft/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-31.html

https://unicode-org.atlassian.net/browse/ICU-22404 Unicode 15.1
https://unicode-org.atlassian.net/browse/CLDR-16669 BRS Unicode 15.1

https://github.com/unicode-org/unicodetools/issues/492 adjust cldr/*BreakTest generation for Unicode 15.1

* Command-line environment setup

Markus:

export UNIDATA_ROOT=~/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/final
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICU_OUT=$ICU_ROOT/dbg
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/unitools/mine/src

Elango:

export UNIDATA_ROOT=~/oss/unidata
export UNICODE_DATA=$UNIDATA_ROOT/uni15.1/snapshot
export CLDR_SRC=~/oss/cldr/mine/src
export ICU_ROOT=~/oss/icu
export ICU_SRC=$ICU_ROOT
export ICU_OUT=$ICU_ROOT
export ICUDT=icudt74b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_OUT/icu4c/lib
export UNICODE_TOOLS=~/oss/unicodetools/mine/src

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** Configure: Build Unicode data for ICU4J
- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_OUT/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + new since Unicode 15.1:
    for the pre-release (alpha, beta) data files,
    download all of https://www.unicode.org/Public/draft/
    (you can omit or discard the UCD/charts/ and UCD/ucdxml/ files/folders)
  + if one of us produces the alpha.zip or beta.zip collection of data files for publication,
    then we can use its contents directly (no FTP from unicode.org necessary)
  + for final-release data files, the sources of truth are the files in
    https://www.unicode.org/Public/(version) [=UCD],
    https://www.unicode.org/Public/UCA/(version),
    https://www.unicode.org/Public/idna/(version),
    etc.
  + use an FTP client; anonymous FTP from www.unicode.org at /Public/draft etc.
  + subfolders: emoji, idna, security, ucd, uca
  + whichever way you download the files:
    ~ inside ucd: extract Unihan.zip to "here" (.../UCD/ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/UCD/ucd/Unihan
    ~ TODO: for updating ICU, we should not need Unihan.zip contents, correct?
  + alternate way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
- get the CLDR version of GraphemeBreakTest.txt from CLDR (if it has been updated there already)
  or from the UCD/cldr/ output folder of the Unicode Tools:
  From Unicode 12/CLDR 35/ICU 64 to Unicode 15.0/CLDR 43/ICU 73,
  CLDR used modified grapheme break rules.
  This might happen again.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
  or
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp ~/unitools/mine/Generated/UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
  + TODO: figure out whether we need a CLDR version of LineBreakTest.txt:
    unicodetools issue #492
- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
  + TODO: modify preparseucd.py to copy this file

* Note: Since Unicode 15.1, data files are no longer published with version suffixes
  even during the alpha or beta.
  Thus we no longer need steps & tools to remove those suffixes.
  (remove this note next time)

* process and/or copy files
- cd $ICU_SRC/tools/unicode
  py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* new constants for new property values
- preparseucd.py error:
  ValueError: missing uchar.h enum constants for some property values: [('blk', {'CJK_Ext_I'}), ('lb', {'VF', 'VI', 'AS', 'AK', 'AP'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    cd $UNIDATA_ROOT
    $ diff -u uni15.0/ucd/PropertyValueAliases.txt uni15.1/snapshot/UCD/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.1 ; V15_1
    +blk; CJK_Ext_I ; CJK_Unified_Ideographs_Extension_I
    +IDSU; N ; No ; F ; False
    +IDSU; Y ; Yes ; T ; True
    +ID_Compat_Math_Continue; N ; No ; F ; False
    +ID_Compat_Math_Continue; Y ; Yes ; T ; True
    +ID_Compat_Math_Start; N ; No ; F ; False
    +ID_Compat_Math_Start; Y ; Yes ; T ; True
    +lb ; AK ; Aksara
    +lb ; AP ; Aksara_Prebase
    +lb ; AS ; Aksara_Start
    +lb ; VF ; Virama_Final
    +lb ; VI ; Virama
    -> add new blocks to uchar.h before UBLOCK_COUNT
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
       cd $UNIDATA_ROOT
       $ diff -u uni15.0/ucd/Blocks.txt uni15.1/snapshot/UCD/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
       +2EBF0..2EE4F; CJK Unified Ideographs Extension I
       (ignore blocks whose end code point changed)
    -> add new blocks to UCharacter.UnicodeBlock IDs
       Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
       replace public static final int \1_ID = \2; \3
    -> add new blocks to UCharacter.UnicodeBlock objects
       Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
       replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
    -> add new line break values to uchar.h & UCharacter.LineBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_OUT/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
    cd $UNICODE_TOOLS
    mvn -s ~/.m2/settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.tools.RecommendedSetGenerator" \
      -Dexec.args="" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd)
- copy & paste from the Console output into the .cpp & .java files

* check hardcoded IDS_Unary_Operator
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that it has not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt IDS_Unary_Operator)
    ->
    ucd/PropList.txt:2FFE..2FFF ; IDS_Unary_Operator # So [2] IDEOGRAPHIC DESCRIPTION CHAR...
- if it has changed, then update the implementation and the tests

* check hardcoded ID_Compat_Math_Start & ID_Compat_Math_Continue
- new in Unicode 15.1, hardcoded because trivial, and unlikely to change
- check that they have not changed:
    (cd $UNICODE_DATA && grep -r --include=PropList.txt ID_Compat_Math)
    ->
    ucd/PropList.txt:00B2..00B3 ; ID_Compat_Math_Continue # No [2] SUPERSCRIPT TWO..SUPERSCRIPT THREE
    ucd/PropList.txt:00B9 ; ID_Compat_Math_Continue # No SUPERSCRIPT ONE
    ucd/PropList.txt:2070 ; ID_Compat_Math_Continue # No SUPERSCRIPT ZERO
    ucd/PropList.txt:2074..2079 ; ID_Compat_Math_Continue # No [6] SUPERSCRIPT FOUR..SUPERSCRIPT NINE
    ucd/PropList.txt:207A..207C ; ID_Compat_Math_Continue # Sm [3] SUPERSCRIPT PLUS SIGN..SUPERSCRIPT EQUALS SIGN
    ucd/PropList.txt:207D ; ID_Compat_Math_Continue # Ps SUPERSCRIPT LEFT PARENTHESIS
    ucd/PropList.txt:207E ; ID_Compat_Math_Continue # Pe SUPERSCRIPT RIGHT PARENTHESIS
    ucd/PropList.txt:2080..2089 ; ID_Compat_Math_Continue # No [10] SUBSCRIPT ZERO..SUBSCRIPT NINE
    ucd/PropList.txt:208A..208C ; ID_Compat_Math_Continue # Sm [3] SUBSCRIPT PLUS SIGN..SUBSCRIPT EQUALS SIGN
    ucd/PropList.txt:208D ; ID_Compat_Math_Continue # Ps SUBSCRIPT LEFT PARENTHESIS
    ucd/PropList.txt:208E ; ID_Compat_Math_Continue # Pe SUBSCRIPT RIGHT PARENTHESIS
    ucd/PropList.txt:2202 ; ID_Compat_Math_Continue # Sm PARTIAL DIFFERENTIAL
    ucd/PropList.txt:2207 ; ID_Compat_Math_Continue # Sm NABLA
    ucd/PropList.txt:221E ; ID_Compat_Math_Continue # Sm INFINITY
    ucd/PropList.txt:1D6C1 ; ID_Compat_Math_Continue # Sm MATHEMATICAL BOLD NABLA
    ucd/PropList.txt:1D6DB ; ID_Compat_Math_Continue # Sm MATHEMATICAL BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D6FB ; ID_Compat_Math_Continue # Sm MATHEMATICAL ITALIC NABLA
    ucd/PropList.txt:1D715 ; ID_Compat_Math_Continue # Sm MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D735 ; ID_Compat_Math_Continue # Sm MATHEMATICAL BOLD ITALIC NABLA
    ucd/PropList.txt:1D74F ; ID_Compat_Math_Continue # Sm MATHEMATICAL BOLD ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D76F ; ID_Compat_Math_Continue # Sm MATHEMATICAL SANS-SERIF BOLD NABLA
    ucd/PropList.txt:1D789 ; ID_Compat_Math_Continue # Sm MATHEMATICAL SANS-SERIF BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D7A9 ; ID_Compat_Math_Continue # Sm MATHEMATICAL SANS-SERIF BOLD ITALIC NABLA
    ucd/PropList.txt:1D7C3 ; ID_Compat_Math_Continue # Sm MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:2202 ; ID_Compat_Math_Start # Sm PARTIAL DIFFERENTIAL
    ucd/PropList.txt:2207 ; ID_Compat_Math_Start # Sm NABLA
    ucd/PropList.txt:221E ; ID_Compat_Math_Start # Sm INFINITY
    ucd/PropList.txt:1D6C1 ; ID_Compat_Math_Start # Sm MATHEMATICAL BOLD NABLA
    ucd/PropList.txt:1D6DB ; ID_Compat_Math_Start # Sm MATHEMATICAL BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D6FB ; ID_Compat_Math_Start # Sm MATHEMATICAL ITALIC NABLA
    ucd/PropList.txt:1D715 ; ID_Compat_Math_Start # Sm MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D735 ; ID_Compat_Math_Start # Sm MATHEMATICAL BOLD ITALIC NABLA
    ucd/PropList.txt:1D74F ; ID_Compat_Math_Start # Sm MATHEMATICAL BOLD ITALIC PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D76F ; ID_Compat_Math_Start # Sm MATHEMATICAL SANS-SERIF BOLD NABLA
    ucd/PropList.txt:1D789 ; ID_Compat_Math_Start # Sm MATHEMATICAL SANS-SERIF BOLD PARTIAL DIFFERENTIAL
    ucd/PropList.txt:1D7A9 ; ID_Compat_Math_Start # Sm MATHEMATICAL SANS-SERIF BOLD ITALIC NABLA
    ucd/PropList.txt:1D7C3 ; ID_Compat_Math_Start # Sm MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL
- if they have changed, then update the implementation and the tests
- TODO: There is a ticket for using ppucd.txt in test code.
  Do that and check these hardcoded properties against that.
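
* sketch: scripted check of the hardcoded properties

The two grep checks above compare PropList.txt against expected ranges by eye.
The following is a minimal sketch of how that comparison could be scripted; it is not part of the ICU tools.
The names parse_ranges, changed_properties and EXPECTED are hypothetical, and the expected ranges
are copied from the Unicode 15.1 grep output in this log.
ID_Compat_Math_Continue could be added to EXPECTED in the same way.

```python
import re
import sys

# Matches PropList.txt data lines such as
#   "2FFE..2FFF ; IDS_Unary_Operator" or "00B9 ; ID_Compat_Math_Continue"
_LINE_RE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(lines, prop):
    """Return the sorted (start, end) code point ranges that have property prop."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        m = _LINE_RE.match(data)
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            end = int(m.group(2), 16) if m.group(2) else start
            ranges.append((start, end))
    return sorted(ranges)


# Hypothetical table: values copied from the Unicode 15.1 grep output in this log.
EXPECTED = {
    "IDS_Unary_Operator": [(0x2FFE, 0x2FFF)],
    "ID_Compat_Math_Start": [
        (c, c) for c in (
            0x2202, 0x2207, 0x221E, 0x1D6C1, 0x1D6DB, 0x1D6FB, 0x1D715,
            0x1D735, 0x1D74F, 0x1D76F, 0x1D789, 0x1D7A9, 0x1D7C3,
        )
    ],
}


def changed_properties(lines, expected=EXPECTED):
    """Return the names of the properties whose ranges differ from the expected ones."""
    lines = list(lines)
    return sorted(p for p, r in expected.items() if parse_ranges(lines, p) != r)


# Usage (assumption: the path to PropList.txt is passed as the first argument).
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        changed = changed_properties(f)
    print("changed:", changed if changed else "(none)")
```

If the script reports a changed property, update the hardcoded implementation and the tests, as described above.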

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* Since Unicode 15.1, the UTS #46 data derivation no longer looks at the decompositions (NFD).
  These characters are now just valid, no longer disallowed_STD3_valid.
  Remove special handling of U+2260, U+226E, U+226F (isNonASCIIDisallowedSTD3Valid())
  from uts46.cpp & UTS46.java,
  and special test code from uts46test.cpp & UTS46Test.java.
  (remove this section next time)

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.1.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Robin or Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
      export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
      (didn't work without setting JAVA_HOME,
       nor with the Google default of /usr/local/buildtools/java/jdk
       [Google security limitations in the XML parser])
      export TOOLS_ROOT=$ICU_SRC/tools
      export CLDR_DIR=$CLDR_SRC
      export CLDR_DATA_DIR=$CLDR_DIR
      (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
      cd "$TOOLS_ROOT/cldr/lib"
      ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
      cd "$TOOLS_ROOT/cldr/cldr-to-icu"
      ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
      cd $ICU_SRC
      meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
      meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
      cd $ICU_SRC
      cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
      cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_OUT/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt74l.dat ./out/icu4j/icudt74b.dat -s ./out/build/icudt74l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt74b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt74b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt74b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt74b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the binary data files into the ICU4J tree
    cd $ICU_OUT/icu4c/data/out/icu4j
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT
    cd com/ibm/icu/impl/data/$ICUDT/
    ls *.icu | egrep -v "cnvalias.icu" | awk '{print "cp " $0 " $ICU_SRC/icu4j/main/core/src/main/resources/com/ibm/icu/impl/data/$ICUDT";}' | sh
- The procedure above is very conservative:
  It refreshes only the parts of the ICU4J data that we think are affected by a Unicode data update.
  It avoids dealing with any other discrepancies
  between the source and generated data files.
  *If* instead we wanted to refresh *all* of the ICU4J data from ICU4C:
    $ICU_OUT/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/UCD/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.1.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-15.txt /tmp/icu/nv4-15.1.txt
    -->
    (empty this time)
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/15.0.0/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    (empty this time)
  Unicode 15.1:
    (none this time)

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

CLDR 43 root collation update for ICU 73

Partial update only for the root collation.
See
- https://unicode-org.atlassian.net/browse/CLDR-15946
  Treat quote marks as equivalent when strength=UCOL_PRIMARY
- https://github.com/unicode-org/cldr/pull/2691
  CLDR-15946 make fancy quotes primary-equal to ASCII fallbacks
- https://github.com/unicode-org/cldr/pull/2833
  CLDR-15946 make fancy quotes secondary-different from each other

The related changes to tailorings were already integrated in an earlier PR for
https://unicode-org.atlassian.net/browse/ICU-22220 ICU 73rc BRS.

This update is for the root collation,
which is handled by different tools than the locale data updates.

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt73b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Configure: Build Unicode data for ICU4J
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional:
    bazelisk clean
  or even
    bazelisk clean --expunge
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- rebuild ICU4C (make clean, make check, as usual)

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
576 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 577 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt73b 578 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b 579 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt73l.dat ./out/icu4j/icudt73b.dat -s ./out/build/icudt73l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt73b 580 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt73b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt73b" 581 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt73b/ 582 mkdir -p /tmp/icu4j/main/shared/data 583 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 584 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt73b/ 585 mkdir -p /tmp/icu4j/main/shared/data 586 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 587 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 588- copy the big-endian Unicode data files to another location, 589 separate from the other data files, 590 and then refresh ICU4J 591 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 592 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 593 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 594 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 595- new for ICU 73: also copy the binary data files directly into the ICU4J tree 596 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* $ICU_SRC/icu4j/maven-build/maven-icu4j-datafiles/src/main/resources/com/ibm/icu/impl/data/$ICUDT/coll 597 598* When refreshing all of ICU4J data from ICU4C 599- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 600- cp 
/tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** merge the Unicode update branch back onto the main branch
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 15.0 update for ICU 72

https://www.unicode.org/versions/Unicode15.0.0/
https://www.unicode.org/versions/beta-15.0.0.html
https://www.unicode.org/Public/15.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-29.html

https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt72b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + old way of fetching files: from the "Public" area on unicode.org
    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
        ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + new way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
        cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
        cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
        py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 15.0 ; V15_0
    +blk; Arabic_Ext_C ; Arabic_Extended_C
    +blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
    +blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
    +blk; Devanagari_Ext_A ; Devanagari_Extended_A
    +blk; Kaktovik_Numerals ; Kaktovik_Numerals
    +blk; Kawi ; Kawi
    +blk; Nag_Mundari ; Nag_Mundari
    +sc ; Kawi ; Kawi
    +sc ; Nagm ; Nag_Mundari
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
       ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
       +10EC0..10EFF; Arabic Extended-C
       +11B00..11B5F; Devanagari Extended-A
       +11F00..11F5F; Kawi
       -13430..1343F; Egyptian Hieroglyph Format Controls
       +13430..1345F; Egyptian Hieroglyph Format Controls
       +1D2C0..1D2DF; Kaktovik Numerals
       +1E030..1E08F; Cyrillic Extended-D
       +1E4D0..1E4FF; Nag Mundari
       +31350..323AF; CJK Unified Ideographs Extension H
       (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
     replace public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
    ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
- Unicode 6.0..15.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
  (no rule changes in Unicode 15)
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR
common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
        export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
        (didn't work without setting JAVA_HOME,
         nor with the Google default of /usr/local/buildtools/java/jdk
         [Google security limitations in the XML parser])
        export TOOLS_ROOT=~/icu/uni/src/tools
        export CLDR_DIR=~/cldr/uni/src
        export CLDR_DATA_DIR=~/cldr/uni/src
        (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
        cd "$TOOLS_ROOT/cldr/lib"
        ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
        cd "$TOOLS_ROOT/cldr/cldr-to-icu"
        ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
        cd $ICU_SRC
        meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
        meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
        cd $ICU_SRC
        cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
        cp /tmp/icu/translit/Hani_Latn.txt
icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar
/tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
    -->
    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
    +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
  Unicode 15:
    kawi 11F50..11F59 Kawi
    nagm 1E4F0..1E4F9 Nag Mundari
    https://github.com/unicode-org/cldr/pull/2041

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 14.0 update for ICU 70

https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/versions/beta-14.0.0.html
https://www.unicode.org/Public/14.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-27.html

https://unicode-org.atlassian.net/browse/CLDR-14801
https://unicode-org.atlassian.net/browse/ICU-21635

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni14/20210903
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt70b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
        ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
        cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
        cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
        py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
       (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
       (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 14.0 ; V14_0
    +blk; Arabic_Ext_B ; Arabic_Extended_B
    +blk; Cypro_Minoan ; Cypro_Minoan
    +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B
    +blk; Kana_Ext_B ; Kana_Extended_B
    +blk; Latin_Ext_F ; Latin_Extended_F
    +blk; Latin_Ext_G ; Latin_Extended_G
    +blk; Old_Uyghur ; Old_Uyghur
    +blk; Tangsa ; Tangsa
    +blk; Toto ; Toto
    +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
    +blk; Vithkuqi ; Vithkuqi
    +blk; Znamenny_Music ; Znamenny_Musical_Notation
    +jg ; Thin_Yeh ; Thin_Yeh
    +jg ; Vertical_Tail ; Vertical_Tail
    +sc ; Cpmn ; Cypro_Minoan
    +sc ; Ougr ; Old_Uyghur
    +sc ; Tnsa ; Tangsa
    +sc ; Toto ; Toto
    +sc ; Vith ; Vithkuqi
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
       ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
       +0870..089F; Arabic
Extended-B
       +10570..105BF; Vithkuqi
       +10780..107BF; Latin Extended-F
       +10F70..10FAF; Old Uyghur
       -11700..1173F; Ahom
       +11700..1174F; Ahom
       +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
       +12F90..12FFF; Cypro-Minoan
       +16A70..16ACF; Tangsa
       -18D00..18D8F; Tangut Supplement
       +18D00..18D7F; Tangut Supplement
       +1AFF0..1AFFF; Kana Extended-B
       +1CF00..1CFCF; Znamenny Musical Notation
       +1DF00..1DFFF; Latin Extended-G
       +1E290..1E2BF; Toto
       +1E7E0..1E7FF; Ethiopic Extended-B
       (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
     replace public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> add new joining groups to uchar.h & UCharacter.JoiningGroup

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools
tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..14.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

*
collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc.
from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
        export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
        (didn't work without setting JAVA_HOME,
         nor with the Google default of /usr/local/buildtools/java/jdk
         [Google security limitations in the XML parser])
        export TOOLS_ROOT=~/icu/uni/src/tools
        export CLDR_DIR=~/cldr/uni/src
        export CLDR_DATA_DIR=~/cldr/uni/src
        (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
        cd "$TOOLS_ROOT/cldr/lib"
        ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
        cd "$TOOLS_ROOT/cldr/cldr-to-icu"
        ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
        cd $ICU_SRC
        meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
        meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
        cd $ICU_SRC
        cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
        cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if
collate/UCAConformanceTest fails, then 1210 utility/MultithreadTest/TestCollators will fail as well; 1211 fix the conformance test before looking into the multi-thread test 1212 1213* update Java data files 1214- refresh just the UCD/UCA-related/derived files, just to be safe 1215- see (ICU4C)/source/data/icu4j-readme.txt 1216- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1217- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1218 NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'", 1219 you need to reconfigure with unicore data; see the "configure" line above. 1220 output: 1221 ... 1222 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1223 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b 1224 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b 1225 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b 1226 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b" 1227 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/ 1228 mkdir -p /tmp/icu4j/main/shared/data 1229 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 1230 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/ 1231 mkdir -p /tmp/icu4j/main/shared/data 1232 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 1233 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1234- copy the big-endian Unicode data files to another location, 1235 separate from the other data files, 1236 and then refresh ICU4J 1237 cd 
$ICU_ROOT/dbg/icu4c/data/out/icu4j 1238 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1239 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1240 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1241 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1242 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 1243 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1244 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1245 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1246 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 1247 1248* When refreshing all of ICU4J data from ICU4C 1249- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1250- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 1251or 1252- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 1253 1254* refresh Java test .txt files 1255- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1256 cd $ICU_SRC/icu4c/source/data/unidata 1257 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1258 cd ../../test/testdata 1259 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1260 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1261 1262* run & fix ICU4J tests 1263 1264*** API additions 1265- send notice to icu-design about new born-@stable API (enum constants etc.) 
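
For the "refresh Java test .txt files" step above, it can help to confirm afterwards that the ICU4J copies really match the ICU4C originals. The following is only a sketch: check_copies is a hypothetical helper name, not an ICU tool, and it relies on nothing but the standard cmp utility.

```shell
# Hypothetical helper (not part of ICU): report whether each named file
# is byte-identical in a source directory and a destination directory.
# Usage: check_copies SRC_DIR DST_DIR FILE...
check_copies() {
    src=$1
    dst=$2
    shift 2
    rc=0
    for f in "$@"; do
        # cmp -s is silent and only sets the exit status;
        # a missing file also counts as a mismatch here.
        if cmp -s "$src/$f" "$dst/$f" 2>/dev/null; then
            echo "same: $f"
        else
            echo "DIFFERS or missing: $f"
            rc=1
        fi
    done
    return $rc
}
```

Possible use, assuming the environment variables from this section:
    check_copies $ICU_SRC/icu4c/source/data/unidata $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode confusables.txt UnicodeData.txt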

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
    -->
    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  Unicode 14:
    tnsa 16AC0..16AC9 Tangsa
    https://github.com/unicode-org/cldr/pull/1326

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                     u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
       (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
       (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chorasmian                   ; Chorasmian
    blk; CJK_Ext_G                    ; CJK_Unified_Ideographs_Extension_G
    blk; Dives_Akuru                  ; Dives_Akuru
    blk; Khitan_Small_Script          ; Khitan_Small_Script
    blk; Lisu_Sup                     ; Lisu_Supplement
    blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
    blk; Tangut_Sup                   ; Tangut_Supplement
    blk; Yezidi                       ; Yezidi
      -> add to uchar.h before UBLOCK_COUNT
         use long property names for enum constants,
         for the trailing comment get the block start code point: diff old & new Blocks.txt
      -> add to UCharacter.UnicodeBlock IDs
         Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
         replace public static final int \1_ID = \2; \3
      -> add to UCharacter.UnicodeBlock objects
         Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
         replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Chrs ; Chorasmian
    sc ; Diak ; Dives_Akuru
    sc ; Kits ; Khitan_Small_Script
    sc ; Yezi ; Yezidi
      -> uscript.h & com.ibm.icu.lang.UScript
      -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
         and in com.ibm.icu.dev.test.lang.TestUScript.java

    InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
      -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format :
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
    $ICU_ROOT/dbg/tools/unicode/c$
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
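
The selective copy in the "update Java data files" step above (only the Unicode-related files, without cnvalias.icu) could be wrapped in a small function so that it is easy to repeat. This is a sketch under assumptions: copy_unicode_data is a hypothetical name and not part of the ICU build, and the file patterns are simply the ones listed above. Unlike the commands above it uses rm -f, so a missing cnvalias.icu is not treated as an error.

```shell
# Hypothetical helper (not part of ICU): copy only the Unicode-related
# data files from SRC to DST, mirroring the cp/rm commands listed above.
# Usage: copy_unicode_data SRC DST
copy_unicode_data() {
    src=$1
    dst=$2
    # Each step runs only if the previous one succeeded.
    mkdir -p "$dst/coll" "$dst/brkitr" &&
    cp "$src"/confusables.cfu "$dst" &&
    cp "$src"/*.icu "$dst" &&
    rm -f "$dst/cnvalias.icu" &&
    cp "$src"/*.nrm "$dst" &&
    cp "$src"/coll/* "$dst/coll" &&
    cp "$src"/brkitr/* "$dst/brkitr"
}
```

Possible use, assuming the environment variables from this section:
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    copy_unicode_data com/ibm/icu/impl/data/$ICUDT /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT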

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
    http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
        u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
        u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
     (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic ; Elymaic
    blk; Nandinagari ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup ; Tamil_Supplement
    blk; Wancho ; Wancho
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- update test of default bidi classes:
  Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
  see diffs in DerivedBidiClass.txt
  + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
  + UCharacterTest.java TestIteration() defaultBidi[]
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
    ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
    and cache its values.
    Works as long as the script metadata is updated before the collation data.
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt63l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
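
Note on the "refresh Java test .txt files" step above: a plain cp stops being
helpful when one of the listed files has been renamed or dropped upstream.
The following is only a sketch of the same copy in Python. The function name
refresh_test_files is hypothetical (not part of any ICU tool); the file lists
are the ones from the cp commands above.

```python
import shutil
from pathlib import Path

# File lists taken from the cp commands in the step above.
UNIDATA_FILES = [
    "confusables.txt", "confusablesWholeScript.txt",
    "NormalizationCorrections.txt", "NormalizationTest.txt",
    "SpecialCasing.txt", "UnicodeData.txt",
]
TESTDATA_FILES = ["BidiCharacterTest.txt", "BidiTest.txt", "IdnaTestV2.txt"]


def refresh_test_files(src_dir, names, dest_dir):
    """Copy the named files from src_dir into dest_dir.

    Returns (copied, missing) lists of file names. Missing source files
    are reported to the caller instead of aborting the whole copy.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for name in names:
        src = src_dir / name
        if src.is_file():
            shutil.copy2(src, dest_dir / name)
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing
```

Call it once per source folder (unidata with UNIDATA_FILES, testdata with
TESTDATA_FILES) and check the returned "missing" list before running the
ICU4J tests.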

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
    http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

ICU 63 addition of ICU support of text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
  + uchar.h: update UCHAR_INT_LIMIT
  + uchar.h: add the enum U<long prop name>
    with constants U_<short prop name>_<long value name>
  + UProperty.java: add the constant <long prop name>
  + UProperty.java: update INT_LIMIT
  + UCharacter.java: add the interface <long prop name>
    with constants <long value name>

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
    names and aliases.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* preparseucd.py changes
- add new property short names (uppercase) to _prop_and_value_re
  so that ParseUCharHeader() parses the new enum constants

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* write data for runtime, hardcoded for now
- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
- generate new icu4c/source/common/ulayout_props_data.h
- for each of the three new enumerated properties
  + int property max value
  + small, 8-bit UCPTrie
    (A small 16-bit trie with bit fields for these three properties
    is very nearly the same size as the sum of the three.)

* wire into C++
- uprops.cpp: #include ulayout_props_data.h
- uprops.cpp: add getInPC() etc. functions
- uprops.cpp: add lines to intProps[], include max values
- uprops.h: add UPropertySource constants
- uprops.cpp: add uprops_addPropertyStarts(src)
- uniset_props.cpp: add to UnicodeSet_initInclusion()
- intltest/ucdtest.cpp: write unit tests

* update Java data files
- refresh just the pnames.icu file with the new property [value] names, just to be safe
- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* wire into Java
- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
- UCharacterProperty.java: for each new property
  + create a nested class to hold its CodePointTrie
  + initialize it from a string literal
  + paste in the initializer printed by genprops
  + add a new IntProperty object to the intProps[] array
  + use the correct max int value for each property, also printed by genprops
- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
- UnicodeSet.java: add to getInclusions()
- UCharacterTest.java: write unit tests

---------------------------------------------------------------------------- ***

Unicode 11.0 update for ICU 62

http://www.unicode.org/versions/Unicode11.0.0/
http://unicode.org/versions/beta-11.0.0.html
https://www.unicode.org/review/pri372/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-21.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180521
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/svn.icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt61b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:13630: Unicode 11
- ^/branches/markus/uni11

*** CLDR Trac

- cldrbug 10978: Unicode 11
- ^/branches/markus/uni11

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
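
Note on the version-number step above: it is easy to run "configure" before
uchar.h has actually been edited. The following is only a sketch of a quick
check, assuming uchar.h defines the version as a quoted string via the
U_UNICODE_VERSION macro. The helper names unicode_version_in_header and
check_version are hypothetical, not part of the ICU tools.

```python
import re

# Matches a line such as:  #define U_UNICODE_VERSION "11.0"
# (assumed layout of the macro in uchar.h)
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_in_header(header_text):
    """Return the version string found in uchar.h text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


def check_version(header_text, expected):
    """Raise ValueError unless the header carries the expected version."""
    found = unicode_version_in_header(header_text)
    if found != expected:
        raise ValueError(f"uchar.h has {found!r}, expected {expected!r}")
    return found
```

Read $ICU_SRC/icu4c/source/common/unicode/uscript.h's sibling uchar.h into a
string and pass it to check_version() together with the version being
integrated, before running "configure".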

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- fix other errors
    NameError: unknown property Extended_Pictographic
    -> add Extended_Pictographic binary property
    -> add new short names for all Emoji properties

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
        u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
        u'Indic_Siyaq_Numbers'])),
     (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
     (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
     (u'GCB', set([u'LinkC', u'Virama'])),
     (u'WB', set([u'WSegSpace']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chess_Symbols ; Chess_Symbols
    blk; Dogra ; Dogra
    blk; Georgian_Ext ; Georgian_Extended
    blk; Gunjala_Gondi ; Gunjala_Gondi
    blk; Hanifi_Rohingya ; Hanifi_Rohingya
    blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
    blk; Makasar ; Makasar
    blk; Mayan_Numerals ; Mayan_Numerals
    blk; Medefaidrin ; Medefaidrin
    blk; Old_Sogdian ; Old_Sogdian
    blk; Sogdian ; Sogdian
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    GCB; LinkC ; LinkingConsonant
    GCB; Virama ; Virama
    -> uchar.h & UCharacter.GraphemeClusterBreak
    -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76

    InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
    -> ignore: ICU does not yet support this property

    jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
    jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
    -> uchar.h & UCharacter.JoiningGroup

    sc ; Dogr ; Dogra
    sc ; Gong ; Gunjala_Gondi
    sc ; Maka ; Makasar
    sc ; Medf ; Medefaidrin
    sc ; Rohg ; Hanifi_Rohingya
    sc ; Sogd ; Sogdian
    sc ; Sogo ; Old_Sogdian
    -> uscript.h & com.ibm.icu.lang.UScript
    -> Nushu had been added already
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

    WB ; WSegSpace ; WSegSpace
    -> uchar.h & UCharacter.WordBreak

* New short names for emoji properties
- see UTS #51
- short names set in preparseucd.py

* New properties
- boolean emoji property Extended_Pictographic
    -> added in preparseucd.py
    -> uchar.h & UProperty.java
- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
  as shown in PropertyValueAliases.txt
    -> ignore for now

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* Fix case props
    genprops error: casepropsbuilder: too many exceptions words
    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
- With the addition of Georgian Mtavruli capital letters,
  there are now too many simple case mappings with big mapping deltas
  that yield uncompressible exceptions.
- Changed the data structure (now formatVersion 4):
  added one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that delta is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Further changes gain one more bit for the exceptions index,
  for future growth. For details, see casepropsbuilder.cpp.
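
Note on why a big-delta slot helps: the following is only an illustration of
the idea, not the casepropsbuilder.cpp implementation. It assumes that
identical exception records can be shared between characters, and it uses a
made-up inline delta range (INLINE_DELTA_MIN/MAX); the real limits are
defined by the ucase data format. The function name exception_records is
hypothetical.

```python
# Assumed range of deltas that fit directly next to the character,
# for illustration only.
INLINE_DELTA_MIN = -256
INLINE_DELTA_MAX = 255


def exception_records(mappings, use_delta):
    """Return the set of distinct exception records for far-away mappings.

    mappings: iterable of (code_point, mapped_code_point) pairs.
    use_delta: if True, a record is (magnitude, is_negative), modelling a
        big-delta slot plus a sign bit; if False, a record is the full
        target code point.
    """
    records = set()
    for cp, mapped in mappings:
        delta = mapped - cp
        if INLINE_DELTA_MIN <= delta <= INLINE_DELTA_MAX:
            continue  # small deltas need no exception record
        if use_delta:
            records.add((abs(delta), delta < 0))
        else:
            records.add(mapped)
    return records


# Georgian Mtavruli U+1C90..U+1CBA lowercase to U+10D0..U+10FA.
mtavruli = [(0x1C90 + i, 0x10D0 + i) for i in range(43)]
full = exception_records(mtavruli, use_delta=False)
shared = exception_records(mtavruli, use_delta=True)
print(len(full), len(shared))  # 43 distinct targets, but a single delta record
```

Storing the target code point gives one distinct record per letter, while
storing the delta gives one record for the whole range, which is what makes
these mappings compressible under the sharing assumption above.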

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..11.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

- Errors in char.txt, word.txt, word_POSIX.txt like
    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
     not empty, just to get ICU building.
  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
     and properties together with the rules that used them (GB 10, WB 14).
  -> Andy adjusts the rule sets further to sync with
     Unicode 11 grapheme, word, and line break spec changes.
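
Sketch for the "disallowed_STD3_valid" grep in the uts46test step above: a plain grep also returns the ASCII ranges, so a small filter that keeps only non-ASCII hits may be easier to read. This is a minimal sketch, assuming the usual "code point(s) ; status ; ... # comment" layout of IdnaMappingTable.txt. The function name and the sample lines are made up for illustration; only the three expected code points come from this log.

```python
def non_ascii_std3_valid(lines):
    """Yield (first, last) ranges with status disallowed_STD3_valid
    that reach beyond ASCII."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            yield first, last

# Sample lines in the assumed file layout (illustrative, not copied from the file).
sample = """\
0000..002C    ; disallowed_STD3_valid                  # controls..COMMA
0041          ; mapped                 ; 0061          # LATIN CAPITAL LETTER A
2260          ; disallowed_STD3_valid                  # NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN
""".splitlines()

for first, last in non_ascii_std3_valid(sample):
    print(f"U+{first:04X}..U+{last:04X}")
```

For a real run, pass an open file instead of the sample, for example open(path, encoding="utf-8") with path pointing at the downloaded IdnaMappingTable.txt. Any range beyond U+2260, U+226E, U+226F would mean the test files need an update.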
2548 2549* collation: CLDR collation root, UCA DUCET 2550 2551- UCA DUCET goes into Mark's Unicode tools, see 2552 https://sites.google.com/site/unicodetools/home#TOC-UCA 2553 diff the main mapping file, look for bad changes 2554 (for example, more bytes per weight for common characters) 2555 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt 2556 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt 2557 2558- CLDR root data files are checked into $CLDR_SRC/common/uca/ 2559 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 2560 2561- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 2562 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 2563- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 2564 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 2565 (note removing the underscore before "Rules") 2566 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 2567- restore TODO diffs in UCARules.txt 2568 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 2569- update (ICU4C)/source/test/testdata/CollationTest_*.txt 2570 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 2571 from the CLDR root files (..._CLDR_..._SHORT.txt) 2572 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 2573 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 2574 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 2575- if CLDR common/uca/unihan-index.txt changes, then update 2576 CLDR common/collation/root.xml <collation type="private-unihan"> 2577 and regenerate (or update in parallel) 
$ICU_SRC/icu4c/source/data/coll/root.txt 2578 2579- run genuca, see command line above; 2580 deal with 2581 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 2582 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible) 2583 (add the character to genuca.cpp sampleCharsToScripts[]) 2584 + look up the USCRIPT_ code for the new sample characters 2585 (should be obvious from the comment in the error output) 2586 + *add* mappings to sampleCharsToScripts[], do not replace them 2587 (in case the script sample characters flip-flop) 2588 + insert new scripts in DUCET script order, see the top_byte table 2589 at the beginning of FractionalUCA.txt 2590- rebuild ICU4C 2591 2592* Unihan collators 2593 https://sites.google.com/site/unicodetools/unihan 2594- run Unicode Tools 2595 org.unicode.draft.GenerateUnihanCollators 2596 with VM arguments 2597 -ea 2598 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 2599 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 2600 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 2601 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 2602 -DUVERSION=11.0.0 2603- run Unicode Tools 2604 org.unicode.draft.GenerateUnihanCollatorFiles 2605 with the same arguments 2606- check CLDR diffs 2607 cd $CLDR_SRC 2608 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 2609 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 2610- copy to CLDR 2611 cd $CLDR_SRC 2612 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 2613 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 2614- run CLDR unit tests, commit to CLDR 2615- generate ICU zh collation data: run CLDR 2616 org.unicode.cldr.icu.NewLdml2IcuConverter 2617 with program arguments 2618 -t collation 2619 -s 
/usr/local/google/home/mscherer/svn.cldr/uni/common/collation 2620 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 2621 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll 2622 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation 2623 zh 2624 and VM arguments 2625 -ea 2626 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 2627- rebuild ICU4C 2628 2629* run & fix ICU4C tests, now with new CLDR collation root data 2630- run all tests with the collation test data *_SHORT.txt or the full files 2631 (the full ones have comments, useful for debugging) 2632- note on intltest: if collate/UCAConformanceTest fails, then 2633 utility/MultithreadTest/TestCollators will fail as well; 2634 fix the conformance test before looking into the multi-thread test 2635 2636* update Java data files 2637- refresh just the UCD/UCA-related/derived files, just to be safe 2638- see (ICU4C)/source/data/icu4j-readme.txt 2639- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2640- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2641 output: 2642 ... 
2643 Unicode .icu files built to ./out/build/icudt61l 2644 echo timestamp > uni-core-data 2645 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b 2646 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b 2647 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 2648 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b 2649 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b" 2650 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/ 2651 mkdir -p /tmp/icu4j/main/shared/data 2652 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 2653 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/ 2654 mkdir -p /tmp/icu4j/main/shared/data 2655 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 2656 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data' 2657- copy the big-endian Unicode data files to another location, 2658 separate from the other data files, 2659 and then refresh ICU4J 2660 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 2661 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2662 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2663 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2664 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2665 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 2666 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2667 cp com/ibm/icu/impl/data/$ICUDT/coll/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2668 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2669 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2670 2671* When refreshing all of ICU4J data from ICU4C 2672- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2673- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 2674or 2675- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 2676 2677* update CollationFCD.java 2678 + copy & paste the initializers of lcccIndex[] etc. from 2679 ICU4C/source/i18n/collationfcd.cpp to 2680 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 2681 2682* refresh Java test .txt files 2683- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 2684 cd $ICU_SRC/icu4c/source/data/unidata 2685 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2686 cd ../../test/testdata 2687 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2688 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2689 2690* run & fix ICU4J tests 2691 2692*** API additions 2693- send notice to icu-design about new born-@stable API (enum constants etc.) 2694 2695*** CLDR numbering systems 2696- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR 2697 Unicode 11: using Unicode 11 CLDR ticket #10978 2698 rohg 10D30..10D39 Hanifi_Rohingya 2699 gong 11DA0..11DA9 Gunjala_Gondi 2700 Earlier: CLDR tickets specific to adding new numbering systems. 
  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
  Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
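
Optional check before running "configure" (a minimal sketch, not part of the ICU tools): read the version string back out of uchar.h to confirm that the edit took effect. It assumes the version is defined as the U_UNICODE_VERSION string macro. The helper name and the sample header text are made up for illustration.

```python
import re

def unicode_version_from_uchar_h(text):
    """Return the U_UNICODE_VERSION string found in uchar.h contents, or None."""
    m = re.search(r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([^"]+)"',
                  text, re.MULTILINE)
    return m.group(1) if m else None

# Sample header excerpt (illustrative only).
header = '/* sample excerpt */\n#define U_UNICODE_VERSION "10.0"\n'
print(unicode_version_from_uchar_h(header))
```

For a real run, pass the contents of uchar.h (expected under $ICU_SRC/icu4c/source/common/unicode/, next to uscript.h) and compare the result with the version being integrated.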

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
    (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
2778 2779 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 2780 2781* preparseucd.py changes 2782- remove or add new Unicode scripts from/to the 2783 only-in-ISO-15924 list according to the error messages: 2784 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924 2785 -> adjust _scripts_only_in_iso15924 as indicated 2786- fix other errors 2787 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo'] 2788 -> add vo=Vertical_Orientation to _ignored_properties 2789 -> later removed again, parsing the file, even though we do not yet store data for runtime use 2790 2791* new constants for new property values 2792- preparseucd.py error: 2793 ValueError: missing uchar.h enum constants for some property values: 2794 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F', 2795 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])), 2796 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla', 2797 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra', 2798 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])), 2799 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))] 2800 = PropertyValueAliases.txt new property values (diff old & new .txt files) 2801 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F 2802 blk; Kana_Ext_A ; Kana_Extended_A 2803 blk; Masaram_Gondi ; Masaram_Gondi 2804 blk; Nushu ; Nushu 2805 blk; Soyombo ; Soyombo 2806 blk; Syriac_Sup ; Syriac_Supplement 2807 blk; Zanabazar_Square ; Zanabazar_Square 2808 -> add to uchar.h 2809 use long property names for enum constants, 2810 for the trailing comment get the block start code point: diff old & new Blocks.txt 2811 -> add to UCharacter.UnicodeBlock IDs 2812 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 2813 replace public static final int \1_ID = \2; \3 2814 -> add to UCharacter.UnicodeBlock objects 2815 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 2816 replace public static 
final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 2817 2818 jg ; Malayalam_Bha ; Malayalam_Bha 2819 jg ; Malayalam_Ja ; Malayalam_Ja 2820 jg ; Malayalam_Lla ; Malayalam_Lla 2821 jg ; Malayalam_Llla ; Malayalam_Llla 2822 jg ; Malayalam_Nga ; Malayalam_Nga 2823 jg ; Malayalam_Nna ; Malayalam_Nna 2824 jg ; Malayalam_Nnna ; Malayalam_Nnna 2825 jg ; Malayalam_Nya ; Malayalam_Nya 2826 jg ; Malayalam_Ra ; Malayalam_Ra 2827 jg ; Malayalam_Ssa ; Malayalam_Ssa 2828 jg ; Malayalam_Tta ; Malayalam_Tta 2829 -> uchar.h & UCharacter.JoiningGroup 2830 2831 sc ; Gonm ; Masaram_Gondi 2832 sc ; Nshu ; Nushu 2833 sc ; Soyo ; Soyombo 2834 sc ; Zanb ; Zanabazar_Square 2835 -> uscript.h & com.ibm.icu.lang.UScript 2836 -> Nushu had been added already 2837 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 2838 and in com.ibm.icu.dev.test.lang.TestUScript.java 2839 2840* New properties as shown in PropertyValueAliases.txt changes 2841- boolean Emoji_Component from emoji 5 2842 -> uchar.h & UProperty.java 2843- boolean 2844 # Regional_Indicator (RI) 2845 2846 RI ; N ; No ; F ; False 2847 RI ; Y ; Yes ; T ; True 2848 -> uchar.h & UProperty.java 2849 -> single immutable range, to be hardcoded 2850- boolean 2851 # Prepended_Concatenation_Mark (PCM) 2852 2853 PCM; N ; No ; F ; False 2854 PCM; Y ; Yes ; T ; True 2855 -> was new in Unicode 9 2856 -> uchar.h & UProperty.java 2857- enumerated 2858 # Vertical_Orientation (vo) 2859 2860 vo ; R ; Rotated 2861 vo ; Tr ; Transformed_Rotated 2862 vo ; Tu ; Transformed_Upright 2863 vo ; U ; Upright 2864 -> only pre-parsed for now, but not yet stored for runtime use 2865 2866* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 2867 (not strictly necessary for NOT_ENCODED scripts) 2868 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 2869 2870* generate normalization data files 2871 cd $ICU_ROOT/dbg/icu4c 2872 
bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 2873 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 2874 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 2875 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 2876 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 2877 2878* build ICU (make install) 2879 so that the tools build can pick up the new definitions from the installed header files. 2880 2881 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 2882 2883* build Unicode tools using CMake+make 2884 2885$ICU_SRC/tools/unicode/c/icudefs.txt: 2886 2887# Location (--prefix) of where ICU was installed. 2888set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 2889# Location of the ICU4C source tree. 2890set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c) 2891 2892 $ICU_ROOT/dbg/tools/unicode/c$ 2893 cmake ../../../../src/tools/unicode/c 2894 make 2895 2896* generate core properties data files 2897 $ICU_ROOT/dbg/tools/unicode/c$ 2898 genprops/genprops $ICU_SRC/icu4c 2899 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c 2900 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 2901- rebuild ICU (make install) & tools 2902 2903* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 2904 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 2905- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 2906- Unicode 6.0..10.0: U+2260, U+226E, U+226F 2907- nothing new in this Unicode version, no test file to update 2908 2909* run & fix ICU4C tests 2910- Andy handles RBBI & spoof check test failures 2911 2912* collation: CLDR collation root, UCA DUCET 2913 2914- UCA DUCET goes into Mark's Unicode 
tools, see 2915 https://sites.google.com/site/unicodetools/home#TOC-UCA 2916- CLDR root data files are checked into $CLDR_SRC/common/uca/ 2917 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 2918 2919- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 2920 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 2921- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 2922 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 2923 (note removing the underscore before "Rules") 2924 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 2925- restore TODO diffs in UCARules.txt 2926 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 2927- update (ICU4C)/source/test/testdata/CollationTest_*.txt 2928 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 2929 from the CLDR root files (..._CLDR_..._SHORT.txt) 2930 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 2931 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 2932 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 2933- if CLDR common/uca/unihan-index.txt changes, then update 2934 CLDR common/collation/root.xml <collation type="private-unihan"> 2935 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 2936 2937- run genuca, see command line above; 2938 deal with 2939 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt: 2940 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible) 2941 (add the character to genuca.cpp sampleCharsToScripts[]) 2942 + look up the USCRIPT_ code for the new 
sample characters 2943 (should be obvious from the comment in the error output) 2944 + *add* mappings to sampleCharsToScripts[], do not replace them 2945 (in case the script sample characters flip-flop) 2946 + insert new scripts in DUCET script order, see the top_byte table 2947 at the beginning of FractionalUCA.txt 2948- rebuild ICU4C 2949 2950* Unihan collators 2951 https://sites.google.com/site/unicodetools/unihan 2952- run Unicode Tools 2953 org.unicode.draft.GenerateUnihanCollators 2954 with VM arguments 2955 -ea 2956 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 2957 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 2958 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 2959 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 2960 -DUVERSION=10.0.0 2961- run Unicode Tools 2962 org.unicode.draft.GenerateUnihanCollatorFiles 2963 with the same arguments 2964- check CLDR diffs 2965 cd $CLDR_SRC 2966 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 2967 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 2968- copy to CLDR 2969 cd $CLDR_SRC 2970 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 2971 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 2972- run CLDR unit tests, commit to CLDR 2973- generate ICU zh collation data: run CLDR 2974 org.unicode.cldr.icu.NewLdml2IcuConverter 2975 with program arguments 2976 -t collation 2977 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation 2978 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental 2979 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll 2980 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation 2981 zh 2982 and VM arguments 2983 -ea 2984 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 2985- rebuild ICU4C 2986 2987* run & fix ICU4C tests, now with new CLDR collation 
root data 2988- run all tests with the collation test data *_SHORT.txt or the full files 2989 (the full ones have comments, useful for debugging) 2990- note on intltest: if collate/UCAConformanceTest fails, then 2991 utility/MultithreadTest/TestCollators will fail as well; 2992 fix the conformance test before looking into the multi-thread test 2993 2994* update Java data files 2995- refresh just the UCD/UCA-related/derived files, just to be safe 2996- see (ICU4C)/source/data/icu4j-readme.txt 2997- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2998- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2999 output: 3000 ... 3001 Unicode .icu files built to ./out/build/icudt60l 3002 echo timestamp > uni-core-data 3003 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b 3004 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b 3005 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 3006 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b 3007 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b" 3008 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/ 3009 mkdir -p /tmp/icu4j/main/shared/data 3010 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 3011 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/ 3012 mkdir -p /tmp/icu4j/main/shared/data 3013 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 3014 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data' 3015- copy the big-endian Unicode data 
files to another location, 3016 separate from the other data files, 3017 and then refresh ICU4J 3018 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 3019 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 3020 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3021 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3022 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3023 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 3024 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3025 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 3026 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3027 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 3028 3029* When refreshing all of ICU4J data from ICU4C 3030- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3031- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 3032or 3033- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 3034 3035* update CollationFCD.java 3036 + copy & paste the initializers of lcccIndex[] etc. 
from 3037 ICU4C/source/i18n/collationfcd.cpp to 3038 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 3039 3040* refresh Java test .txt files 3041- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 3042 cd $ICU_SRC/icu4c/source/data/unidata 3043 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3044 cd ../../test/testdata 3045 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3046 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3047 3048* run & fix ICU4J tests 3049 3050*** API additions 3051- send notice to icu-design about new born-@stable API (enum constants etc.) 3052 3053*** CLDR numbering systems 3054- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket 3055 Unicode 10: http://unicode.org/cldr/trac/ticket/10219 3056 Unicode 9: http://unicode.org/cldr/trac/ticket/9692 3057 3058*** merge the Unicode update branches back onto the trunk 3059- do not merge the icudata.jar and testdata.jar, 3060 instead rebuild them from merged & tested ICU4C 3061- make sure that changes to Unicode tools are checked in: 3062 http://www.unicode.org/utility/trac/log/trunk/unicodetools 3063 3064---------------------------------------------------------------------------- *** 3065 3066Emoji 5.0 update for ICU 59 3067- ICU 59 mostly remains on Unicode 9.0 3068- except updates bidi and segmentation data to Unicode 10 beta 3069 3070First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg. 

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
ICUDT=icudt59b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

*** ICU Trac

- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
- changes directly on trunk

*** data files & enums & parser code

* download files

- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
- download emoji 5.0 beta files into the same uni90e50 folder
- download Unicode 10.0 beta files: ucd
  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
    BidiBrackets.txt
    BidiCharacterTest.txt
    BidiMirroring.txt
    BidiTest.txt
    extracted/DerivedBidiClass.txt
  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
    LineBreak.txt
    auxiliary/*

* preparseucd.py changes
- adjust for combined trunks
- write new copyright lines
- ignore new Emoji_Component property for now

* process and/or copy files
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
3127set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c) 3128 3129 ~/svn.icu/trunk/dbg/tools/unicode/c$ 3130 cmake ../../../../src/tools/unicode/c 3131 make 3132 3133* generate core properties data files 3134 ~/svn.icu/trunk/dbg/tools/unicode/c$ 3135 genprops/genprops $ICU4C_SRC_DIR 3136- rebuild ICU (make install) & tools 3137 3138* run & fix ICU4C tests 3139- Andy handles RBBI & spoof check test failures 3140 3141* update Java data files 3142- refresh just the UCD/UCA-related/derived files, just to be safe 3143- see (ICU4C)/source/data/icu4j-readme.txt 3144- mkdir /tmp/icu4j 3145- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3146 output: 3147 ... 3148 Unicode .icu files built to ./out/build/icudt59l 3149 echo timestamp > uni-core-data 3150 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b 3151 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b 3152 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 3153 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b 3154 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b" 3155 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/ 3156 mkdir -p /tmp/icu4j/main/shared/data 3157 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 3158 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/ 3159 mkdir -p /tmp/icu4j/main/shared/data 3160 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 3161 make[1]: Leaving directory 
`/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data' 3162- copy the big-endian Unicode data files to another location, 3163 separate from the other data files, 3164 and then refresh ICU4J 3165 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j 3166 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3167 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3168 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3169 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 3170 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3171 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 3172 3173* When refreshing all of ICU4J data from ICU4C 3174- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3175- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data 3176or 3177- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install 3178 3179* refresh Java test .txt files 3180- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 3181 cd $ICU4C_SRC_DIR/source/data/unidata 3182 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3183 cd ../../test/testdata 3184 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3185 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 3186 3187* run & fix ICU4J tests 3188 3189---------------------------------------------------------------------------- *** 3190 3191Unicode 9.0 update for ICU 58 3192 3193* Command-line environment setup 3194 3195ICU_ROOT=~/svn.icu/trunk 
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
  in ppucd.txt:
    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
  ->
    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.
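
Illustration for the Tangut algorithmic names item under "preparseucd.py changes" above:
a minimal Python sketch of what an algnamesrange entry implies, namely that a character
name is the prefix followed by the code point in uppercase hex. This is not code from
preparseucd.py or genprops; the table and the function name algorithmic_name are
hypothetical, and only the range and prefix are taken from the ppucd.txt line above.

```python
# Hypothetical sketch, not the preparseucd.py/genprops implementation.
# Each entry is (first code point, last code point, name prefix).
ALG_NAME_RANGES = [
    # Range and prefix copied from the corrected ppucd.txt line in this log.
    (0x17000, 0x187EC, "TANGUT IDEOGRAPH-"),
]


def algorithmic_name(code_point):
    """Return prefix + uppercase hex code point, or None outside all ranges."""
    for start, end, prefix in ALG_NAME_RANGES:
        if start <= code_point <= end:
            return prefix + format(code_point, "04X")
    return None


if __name__ == "__main__":
    print(algorithmic_name(0x17000))
```

With the old "CJK UNIFIED IDEOGRAPH-" prefix the same lookup would have produced
CJK names for Tangut code points, which is why the prefix had to change.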

* PropertyValueAliases.txt new property values
    blk; Adlam ; Adlam
    blk; Bhaiksuki ; Bhaiksuki
    blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
    blk; Glagolitic_Sup ; Glagolitic_Supplement
    blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
    blk; Marchen ; Marchen
    blk; Mongolian_Sup ; Mongolian_Supplement
    blk; Newa ; Newa
    blk; Osage ; Osage
    blk; Tangut ; Tangut
    blk; Tangut_Components ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    GCB; EB ; E_Base
    GCB; EBG ; E_Base_GAZ
    GCB; EM ; E_Modifier
    GCB; GAZ ; Glue_After_Zwj
    GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

    jg ; African_Feh ; African_Feh
    jg ; African_Noon ; African_Noon
    jg ; African_Qaf ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

    lb ; EB ; E_Base
    lb ; EM ; E_Modifier
    lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

    sc ; Adlm ; Adlam
    sc ; Bhks ; Bhaiksuki
    sc ; Marc ; Marchen
    sc ; Newa ; Newa
    sc ; Osge ; Osage
    sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

    WB ; EB ; E_Base
    WB ; EBG ; E_Base_GAZ
    WB ; EM ; E_Modifier
    WB ; GAZ ; Glue_After_Zwj
    WB ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
    cmake ../../../src/unicode/c
    make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
    genprops/genprops $ICU_SRC_DIR
    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
    FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
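
Illustration for the USCRIPT_ find/replace pair above: a small Python sketch that applies
the same pattern and replacement with re.sub instead of an editor dialog. The helper name
c_enum_to_java and the sample enum line (name and number) are made up for illustration;
the pattern is copied as it reads in this log, so its exact spacing is an assumption.

```python
import re

# Pattern and replacement as given in the log, in Python regex syntax.
PATTERN = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACEMENT = r"public static final int \1 = \2; \3"


def c_enum_to_java(line):
    """Rewrite one uscript.h enum line as a UScript.java int constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return PATTERN.sub(REPLACEMENT, line)


if __name__ == "__main__":
    # Hypothetical input line; not a real script code.
    sample = "      USCRIPT_EXAMPLE_SCRIPT                = 999,/* Xmpl */"
    print(c_enum_to_java(sample))
```

The leading indentation survives because only the matched part of the line is replaced.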

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt57b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
see https://unicode-org.atlassian.net/browse/ICU-12141

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

---------------------------------------------------------------------------- ***

Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802

Edit preparseucd.py to add & parse new properties.
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.
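
Illustration for "add & parse new properties" above: a rough Python sketch of reading
binary-property lines, assuming emoji-data.txt uses the familiar UCD layout
"code point or range ; property # comment". That layout is an assumption here, this is
not the preparseucd.py code, and the function name and sample lines are made up.

```python
def parse_ucd_style_lines(lines):
    """Yield (start, end, property_name) from 'range ; property # comment' lines.

    Comment-only and empty lines are skipped. A single code point is
    returned as a range whose start and end are equal.
    """
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_points, prop = fields[0], fields[1]
        if ".." in code_points:
            start, end = code_points.split("..")
        else:
            start = end = code_points
        yield int(start, 16), int(end, 16), prop


if __name__ == "__main__":
    # Made-up sample lines for illustration, not copied from emoji-data.txt.
    sample = [
        "# comment-only line",
        "1F600..1F64F ; Emoji # illustrative range",
        "1F3FB..1F3FF ; Emoji_Modifier",
        "",
    ]
    for entry in parse_ucd_style_lines(sample):
        print(entry)
```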

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
  IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (improper fractions)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to proper fractions (e.g., 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
    blk; Ahom ; Ahom
    blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
    blk; Cherokee_Sup ; Cherokee_Supplement
    blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
    blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
    blk; Hatran ; Hatran
    blk; Multani ; Multani
    blk; Old_Hungarian ; Old_Hungarian
    blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
    blk; Sutton_SignWriting ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
    sc ; Ahom ; Ahom
    sc ; Hatr ; Hatran
    sc ; Hluw ; Anatolian_Hieroglyphs
    sc ; Hung ; Old_Hungarian
    sc ; Mult ; Multani
    sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
    ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
- nothing new in 8.0, no test file to update

* run & fix ICU4C tests
- bad Cherokee case folding due to difference in fallbacks:
  UCD case folding falls back to no mapping,
  ICU runtime case folding falls back to lowercasing;
  fixed casepropsbuilder.cpp to generate scf mappings to self
  when there is an slc mapping but no scf
- Andy handles RBBI & spoof check test failures

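Illustration for the Cherokee case folding item above: a toy Python model of the two
fallbacks and of the builder-side fix. It is not the casepropsbuilder.cpp code. The
dictionaries, function names, and the choice of U+13A0 / U+AB70 as a representative
uppercase/lowercase Cherokee pair are assumptions made for this sketch.

```python
# Toy data, for illustration only.
SLC = {0x13A0: 0xAB70}           # simple lowercase: uppercase -> lowercase
SCF_FROM_UCD = {0xAB70: 0x13A0}  # simple case folding: lowercase -> uppercase


def runtime_fold(code_point, scf, slc=SLC):
    """Model of the runtime lookup: without an scf entry, fall back to slc."""
    if code_point in scf:
        return scf[code_point]
    return slc.get(code_point, code_point)


def add_self_mappings(scf, slc):
    """Model of the fix: add scf-to-self where slc exists but scf does not."""
    fixed = dict(scf)
    for code_point in slc:
        fixed.setdefault(code_point, code_point)
    return fixed


if __name__ == "__main__":
    # Before the fix the uppercase letter folds to its lowercase form,
    # which disagrees with the UCD (no scf entry means "folds to itself").
    print(hex(runtime_fold(0x13A0, SCF_FROM_UCD)))
    fixed = add_self_mappings(SCF_FROM_UCD, SLC)
    print(hex(runtime_fold(0x13A0, fixed)))
```

The explicit self-mapping keeps the runtime fallback unchanged for all other characters.
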
* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the script for the new sample characters
    (e.g., in FractionalUCA.txt)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- fixed bug in CollationWeights::getWeightRanges()
  exposed by new data and CollationTest::TestRootElements

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt56l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
  because the layout engine was deprecated in ICU 54.
  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
  to write lines that we used to add manually.

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
  (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ; Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
    (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
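
Note: the uscript.h find/replace step above can also be scripted instead of
run in Eclipse. A minimal sketch in Python, assuming the enum lines have the
shape of the sample below. The helper name uscript_line_to_java and the
sample constant (name, number, comment) are made up for illustration; they
are not part of the ICU tools or of uscript.h.

```python
import re

# Same pattern as the "find" expression used above for uscript.h.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")


def uscript_line_to_java(line):
    """Rewrite one C enum line as a Java int constant; other lines pass through."""
    return USCRIPT_ENUM.sub(r"public static final int \1 = \2; \3", line)


# Illustrative input only: the name and number are NOT taken from uscript.h.
sample = "      USCRIPT_EXAMPLE_SCRIPT = 999,/* Xmpl */"
print(uscript_line_to_java(sample))
```

Lines that do not match the pattern are returned unchanged, so the helper
can be applied to a whole copied enum block; review the result by hand, as
with the Eclipse replacement.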

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
    bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
     -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
    bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
    bpt; c ; Close
    bpt; n ; None
    bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
    bc ; FSI ; First_Strong_Isolate
    bc ; LRI ; Left_To_Right_Isolate
    bc ; RLI ; Right_To_Left_Isolate
    bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
    WB ; HL ; Hebrew_Letter
    WB ; SQ ; Single_Quote
    WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-10-16)
    Aghb 239 Caucasian Albanian
    Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
    lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
    WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
    GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
        cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
        cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
    -> uprops.h, uchar.c & UCharacterProperty.java
    -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with an U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
    ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
    ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
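
A rough sketch of the bookkeeping for the ISO-only script codes described above.
Only the variable name _scripts_only_in_iso15924 comes from preparseucd.py;
the helper function, its parameters and the sample data are hypothetical,
and the real variable may be stored in a different form.

```python
def reconcile_iso_only_scripts(iso_only, unicode_scripts, new_iso_codes=()):
    """Returns (updated ISO-only set, codes that Unicode has adopted).

    iso_only: script codes currently listed as ISO 15924-only
    unicode_scripts: script codes found in PropertyValueAliases.txt
    new_iso_codes: codes newly registered in ISO 15924
    """
    adopted = set(iso_only) & set(unicode_scripts)
    # Add new ISO codes, then drop everything that Unicode now defines.
    updated = (set(iso_only) | set(new_iso_codes)) - set(unicode_scripts)
    return updated, adopted


# Sample data for illustration only; not the real contents of the .py variable.
_scripts_only_in_iso15924 = {"Khoj", "Tirh", "Hluw", "Sora"}
unicode_script_codes = {"Latn", "Sora", "Takr"}

updated, adopted = reconcile_iso_only_scripts(
    _scripts_only_in_iso15924, unicode_script_codes, new_iso_codes=["Afak"])
print(sorted(updated))
print(sorted(adopted))
```

The codes in "adopted" are the ones to remove from the .py variable by hand.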

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
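
One possible way to automate the check for lost comment-outs in refreshed test files,
as a minimal sketch. It assumes that known-failing data lines were disabled
by prefixing them with "#"; the function name and the sample lines are hypothetical,
and the real test files may mark disabled lines differently.

```python
def lost_comment_outs(old_lines, new_lines, comment_prefix="#"):
    """Returns data lines that were commented out in old_lines
    but are active (not commented out) in new_lines."""
    previously_disabled = set()
    for line in old_lines:
        stripped = line.strip()
        if stripped.startswith(comment_prefix):
            data = stripped[len(comment_prefix):].strip()
            if data:
                previously_disabled.add(data)
    # A new line is suspicious if its text equals a formerly disabled line.
    return [line.strip() for line in new_lines
            if line.strip() in previously_disabled]


# Made-up sample content for illustration only.
old_file = ["# 0041 0042", "0043 0044", "# ordinary comment"]
new_file = ["0041 0042", "0043 0044", "# ordinary comment"]

suspects = lost_comment_outs(old_file, new_file)
print(suspects)
```

Each reported line is a candidate for commenting out again in the new file.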

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
    -> add to uchar.h
    -> add to UCharacter.UnicodeBlock IDs
        Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
                replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
        Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
                replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
    -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
    -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
    -> remove these from SyntheticPropertyValueAliases.txt
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
    -> uscript.h
    -> com.ibm.icu.lang.UScript
        find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace  public static final int \1 = \2;\3
    -> SyntheticPropertyValueAliases.txt
    -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
# Arabic Mathematical Alphabetic Symbols:
#   U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
# Meroitic Hieroglyphs:
#   U+10980 - U+1099F
# Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
    # Each line has two fields
    # First field: Code point
    # Second field: Alias
- to
    # Each line has three fields, as described here:
    #
    # First field: Code point
    # Second field: Alias
    # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
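
For the three-field NameAliases.txt format noted in this section, filtering for
"correction" aliases could look roughly like the following sketch. This is not
the gennames code (which is C); the function name is hypothetical, and the
01A2 sample line is included only to have one entry of type "correction".

```python
def parse_name_aliases(lines, wanted_type="correction"):
    """Yields (code_point, alias) for aliases of the wanted type."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields: %r" % line)
        code_point, alias, alias_type = fields
        if alias_type == wanted_type:
            yield int(code_point, 16), alias


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",  # illustrative sample line
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]

corrections = list(parse_name_aliases(sample))
print(corrections)
```

Passing wanted_type="abbreviation" instead returns both FEFF abbreviations,
which is the multiple-aliases-per-type case that ticket #8963 covers.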
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with an U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
    -> uscript.h
    -> com.ibm.icu.lang.UScript
        find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace  public static final int \1 = \2;\3
    -> genpname/SyntheticPropertyValueAliases.txt
    -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
    -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
       scx; Script_Extensions
    -> uchar.h with new UProperty section
    -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
    -> add to uchar.h
    -> add to UCharacter.UnicodeBlock
        Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
                replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
    -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
    -> remove these from SyntheticPropertyValueAliases.txt
    -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
    -> uscript.h
    -> com.ibm.icu.lang.UScript
        find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
        replace  public static final int \1 = \2;\3
    -> SyntheticPropertyValueAliases.txt
    -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
    -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
    -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
   to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/~book/incoming/mark/uca6.0.0/
  http://www.macchiato.com/unicode/utc/additional-uca-files
  http://www.unicode.org/Public/UCA/6.0.0/
  http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
    moved from numeric to enumerated:
        ccc ; Canonical_Combining_Class
    new string properties:
        NFKC_CF ; NFKC_Casefold
        Name_Alias; Name_Alias
    new binary properties:
        Cased ; Cased
        CI ; Case_Ignorable
        CWCF ; Changes_When_Casefolded
        CWCM ; Changes_When_Casemapped
        CWKCF ; Changes_When_NFKC_Casefolded
        CWL ; Changes_When_Lowercased
        CWT ; Changes_When_Titlecased
        CWU ; Changes_When_Uppercased
    new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
    new block names
    new scripts
    one script code change:
        sc ; Qaai ; Inherited
        ->
        sc ; Zinh ; Inherited ; Qaai
    new Line_Break (lb) value:
        lb ; CP ; Close_Parenthesis
    new Joining_Group (jg) values: Farsi_Yeh, Nya
    other new values:
        ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
    new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
    all of the ISO comments are gone
    new CJK block end:
        9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
    new CJK block:
        2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
        2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
    find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
    replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
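
The search for hardcoded Unihan range end/limit values noted above (regex 9FC[34], case-insensitive) could be scripted roughly as follows. This is only a sketch, not part of the ICU tools: the function name find_hardcoded_ranges and the list of file suffixes are assumptions made for illustration, and every hit still needs manual review before anything is changed.

```python
import re
from pathlib import Path

# Old Unihan range end (9FC3) and limit (9FC4), as listed in the notes above.
OLD_RANGE = re.compile(r"9FC[34]", re.IGNORECASE)


def find_hardcoded_ranges(root, suffixes=(".c", ".cpp", ".h", ".java", ".txt")):
    """Yield (path, line_number, stripped_line) for lines that mention the old range.

    The suffix list is an assumption; adjust it to the files you want to scan.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if OLD_RANGE.search(line):
                yield path, number, line.strip()


if __name__ == "__main__":
    # Scan the current directory and print hits in a grep-like format.
    for path, number, line in find_hardcoded_ranges("."):
        print(f"{path}({number}): {line}")
```

The pattern also matches longer hex strings that merely contain 9FC3 or 9FC4, so the output is a list of candidates, not a list of required edits.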

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
  and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
  - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
    on Windows:
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice and
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
     find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
     replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
   "I think the tool has been modified to update @draft to @stable for
   older scripts and to add @draft for new scripts.
   (I worked with an intern on this last year.)
   You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
   "This is just a matter of making sure that all the per-script tables have
   entries for any new scripts that were added.
   If any new Indic characters were added, then the class tables in
   IndicClassTables.cpp should be updated to reflect this.
   John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  There is none above 127 yet which is the script code for an
  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
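
As a data-level companion to the u_charMirror() round-trip check listed under tests, the mapping in BidiMirroring.txt can be checked directly. This is a sketch only: it does not call ICU, the function names parse_bidi_mirroring and non_round_tripping are hypothetical, and it assumes the usual "code; mirrored code # comment" line format of BidiMirroring.txt.

```python
def parse_bidi_mirroring(lines):
    """Parse 'XXXX; YYYY # comment' lines into a dict of code point -> mirrored code point."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not data:
            continue  # blank or comment-only line
        source, target = (int(field.strip(), 16) for field in data.split(";")[:2])
        mapping[source] = target
    return mapping


def non_round_tripping(mapping):
    """Return the sorted code points c for which mirror(mirror(c)) != c."""
    return sorted(c for c, m in mapping.items() if mapping.get(m) != c)


if __name__ == "__main__":
    # The file location is an assumption; point this at your unidata copy.
    with open("BidiMirroring.txt", encoding="utf-8") as f:
        bad = non_round_tripping(parse_bidi_mirroring(f))
    print("code points that do not round-trip:", [f"U+{c:04X}" for c in bad])
```

An empty result only says that the data file is symmetric; the API-level test in cintltst/intltest is still needed to confirm that the built data behaves the same way.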

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
    + do not modify BOCU/BOCSU code because that would change the encoding
      and break binary compatibility!
    + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
      NamePrepProfile.txt
    + ignore trietest.c: test data is arbitrary
    + ignore tstnorm.cpp: test optimization, not important
    + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
    + do change line_th.txt and word_th.txt
      by replacing hardcoded ranges with the new property values
    + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
    + put short name comment only on line with new constant
      for genpname perl script parser
- new binary properties
    + STerm
    + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
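
The two DerivedNormalizationProps.txt changes above can be illustrated with a small sketch.
It is not the gennorm.c code (which is C); parse_line and upgrade_fields are hypothetical helpers,
the sample lines are hand-written in the semicolon-separated UCD layout, and the handling of
"_MAYBE" values is an assumption about what the "etc." covers.

```python
import re

# Assumption: old-style quick-check fields look like NFD_NO, NFC_MAYBE, NFKC_NO, ...
_OLD_QC = re.compile(r"^(NFK?[CD])_(NO|MAYBE)$")


def parse_line(line):
    """Split a UCD-style line into (code point or range, list of property fields).

    Returns None for empty and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    parts = [part.strip() for part in data.split(";")]
    return parts[0], parts[1:]


def upgrade_fields(fields):
    """Map old-style property fields to the newer style; leave anything else unchanged."""
    if not fields:
        return fields
    if fields[0] == "FNC":
        # "FNC" -> "FC_NFKC"; the mapping string in the following field is kept
        return ["FC_NFKC"] + fields[1:]
    match = _OLD_QC.match(fields[0])
    if match and len(fields) == 1:
        # single field "NFD_NO" -> two fields "NFD_QC", "N"
        return [match.group(1) + "_QC", match.group(2)[0]]
    return fields


# Hand-written sample lines (old style, old style, new style).
for sample in ("00C0..00C5 ; NFD_NO # sample comment",
               "037A ; FNC; 0020 03B9 # sample comment",
               "00C0..00C5 ; NFD_QC; N # sample comment"):
    code, fields = parse_line(sample)
    print(code, "->", "; ".join(upgrade_fields(fields)))
```

A parser that accepts both styles this way can read data files from before and after the change,
which is convenient while comparing an old and a new UCD version side by side.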

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
    + assume that the type is always numeric for Han characters,
      and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ