* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

Notes:

This log includes several command lines as used in the update process.
Some of them include a console prompt with the present working directory (pwd) followed by a $ sign.
Use a console window that is set to that directory, or cd to there,
and then paste the command that follows the $ sign.

Most command lines use environment variables to make them more portable across versions
and machine configurations. When you set up a console window, copy & paste the `export` commands
from near the top of the current section before pasting tool command lines.
Adjust the environment variables to the current version and your machine setup.
(The command lines are currently as used on Linux.)

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.
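
The note above about environment variables can be backed by a small pre-flight check before pasting tool command lines. The following is only a sketch, not part of the documented process: the function name check_update_env is hypothetical, and the variable list is an assumption taken from the `export` lines used in the per-version sections of this log.

```shell
# Hypothetical helper: report update-process variables that are unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_update_env() {
    missing=0
    for name in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Possible use in a freshly opened console window: run `check_update_env && echo "environment looks complete"` after pasting the `export` commands of the current section.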

---------------------------------------------------------------------------- ***

Unicode 15.0 update for ICU 72

https://www.unicode.org/versions/Unicode15.0.0/
https://www.unicode.org/versions/beta-15.0.0.html
https://www.unicode.org/Public/15.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-29.html

https://unicode-org.atlassian.net/browse/ICU-21980 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15516 Unicode 15
https://unicode-org.atlassian.net/browse/CLDR-15253 Unicode 15 script metadata (in CLDR 41)

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni15/20220830
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt72b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + old way of fetching files: from the "Public" area on unicode.org
    ~ inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
    ~ split Unihan into single-property files
      ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + new way of fetching files, if available:
    copy the files from a Unicode Tools workspace that is up to date with
    https://github.com/unicode-org/unicodetools
    and which might at this point be *ahead* of "Public"
    ~ before the Unicode release copy files from "dev" subfolders, for example
      https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values: [('blk', {'Nag_Mundari', 'CJK_Ext_H', 'Kawi', 'Kaktovik_Numerals', 'Devanagari_Ext_A', 'Arabic_Ext_C', 'Cyrillic_Ext_D'}), ('sc', {'Nagm', 'Kawi'})]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
  ~/unidata$ diff -u uni14/20210922/ucd/PropertyValueAliases.txt uni15/beta/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
  +age; 15.0 ; V15_0
  +blk; Arabic_Ext_C ; Arabic_Extended_C
  +blk; CJK_Ext_H ; CJK_Unified_Ideographs_Extension_H
  +blk; Cyrillic_Ext_D ; Cyrillic_Extended_D
  +blk; Devanagari_Ext_A ; Devanagari_Extended_A
  +blk; Kaktovik_Numerals ; Kaktovik_Numerals
  +blk; Kawi ; Kawi
  +blk; Nag_Mundari ; Nag_Mundari
  +sc ; Kawi ; Kawi
  +sc ; Nagm ; Nag_Mundari
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  ~/unidata$ diff -u uni14/20210922/ucd/Blocks.txt uni15/beta/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
  +10EC0..10EFF; Arabic Extended-C
  +11B00..11B5F; Devanagari Extended-A
  +11F00..11F5F; Kawi
  -13430..1343F; Egyptian Hieroglyph Format Controls
  +13430..1345F; Egyptian Hieroglyph Format Controls
  +1D2C0..1D2DF; Kaktovik Numerals
  +1E030..1E08F; Cyrillic Extended-D
  +1E4D0..1E4FF; Nag Mundari
  +31350..323AF; CJK Unified Ideographs Extension H
     (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
             replace  public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
    ~/unitools/mine/src$ grep disallowed_STD3_valid unicodetools/data/idna/dev/IdnaMappingTable.txt
- Unicode 6.0..15.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Note: Some of the collation data and test data will be updated below,
  so at this time we might get some collation test failures.
  Ignore these for now.
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
  (no rule changes in Unicode 15)
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/15.0.0/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
     nor with the Google default of /usr/local/buildtools/java/jdk
     [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt72l.dat ./out/icu4j/icudt72b.dat -s ./out/build/icudt72l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt72b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt72b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt72b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt72b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-15.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-14.txt /tmp/icu/nv4-15.txt
    -->
    +cp;11F54;-Alpha;gc=Nd;InSC=Number;lb=NU;na=KAWI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
    +cp;1E4F4;-Alpha;gc=Nd;-IDS;lb=NU;na=NAG MUNDARI DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  or:
    ~/unitools/mine/src$ diff -u unicodetools/data/ucd/14.0.0-Update/extracted/DerivedGeneralCategory.txt unicodetools/data/ucd/dev/extracted/DerivedGeneralCategory.txt | grep '; Nd' | egrep '^\+'
    -->
    +11F50..11F59 ; Nd # [10] KAWI DIGIT ZERO..KAWI DIGIT NINE
    +1E4F0..1E4F9 ; Nd # [10] NAG MUNDARI DIGIT ZERO..NAG MUNDARI DIGIT NINE
  Unicode 15:
    kawi 11F50..11F59 Kawi
    nagm 1E4F0..1E4F9 Nag Mundari
    https://github.com/unicode-org/cldr/pull/2041

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- if there is a merge conflict in icudata.jar, here is one way to deal with it:
  + remove icudata.jar from the commit so that rebasing is trivial
  + ~/icu/uni/src$ git restore --source=main icu4j/main/shared/data/icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + switch to main, pull updates, switch back to the dev branch
  + ~/icu/uni/src$ git rebase main
  + rebuild icudata.jar
  + ~/icu/uni/src$ git commit -a --amend
  + ~/icu/uni/src$ git push -f
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 14.0 update for ICU 70

https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/versions/beta-14.0.0.html
https://www.unicode.org/Public/14.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-27.html

https://unicode-org.atlassian.net/browse/CLDR-14801
https://unicode-org.atlassian.net/browse/ICU-21635

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni14/20210903
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt70b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
    (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
    (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
  ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
  +age; 14.0 ; V14_0
  +blk; Arabic_Ext_B ; Arabic_Extended_B
  +blk; Cypro_Minoan ; Cypro_Minoan
  +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B
  +blk; Kana_Ext_B ; Kana_Extended_B
  +blk; Latin_Ext_F ; Latin_Extended_F
  +blk; Latin_Ext_G ; Latin_Extended_G
  +blk; Old_Uyghur ; Old_Uyghur
  +blk; Tangsa ; Tangsa
  +blk; Toto ; Toto
  +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
  +blk; Vithkuqi ; Vithkuqi
  +blk; Znamenny_Music ; Znamenny_Musical_Notation
  +jg ; Thin_Yeh ; Thin_Yeh
  +jg ; Vertical_Tail ; Vertical_Tail
  +sc ; Cpmn ; Cypro_Minoan
  +sc ; Ougr ; Old_Uyghur
  +sc ; Tnsa ; Tangsa
  +sc ; Toto ; Toto
  +sc ; Vith ; Vithkuqi
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
  +0870..089F; Arabic Extended-B
  +10570..105BF; Vithkuqi
  +10780..107BF; Latin Extended-F
  +10F70..10FAF; Old Uyghur
  -11700..1173F; Ahom
  +11700..1174F; Ahom
  +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
  +12F90..12FFF; Cypro-Minoan
  +16A70..16ACF; Tangsa
  -18D00..18D8F; Tangut Supplement
  +18D00..18D7F; Tangut Supplement
  +1AFF0..1AFFF; Kana Extended-B
  +1CF00..1CFCF; Znamenny Musical Notation
  +1DF00..1DFFF; Latin Extended-G
  +1E290..1E2BF; Toto
  +1E7E0..1E7FF; Ethiopic Extended-B
     (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
             replace  public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> add new joining groups to uchar.h & UCharacter.JoiningGroup

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..14.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
  https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
     nor with the Google default of /usr/local/buildtools/java/jdk
     [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
620 utility/MultithreadTest/TestCollators will fail as well; 621 fix the conformance test before looking into the multi-thread test 622 623* update Java data files 624- refresh just the UCD/UCA-related/derived files, just to be safe 625- see (ICU4C)/source/data/icu4j-readme.txt 626- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 627- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 628 NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'", 629 you need to reconfigure with unicore data; see the "configure" line above. 630 output: 631 ... 632 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 633 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b 634 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b 635 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b 636 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b" 637 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/ 638 mkdir -p /tmp/icu4j/main/shared/data 639 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 640 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/ 641 mkdir -p /tmp/icu4j/main/shared/data 642 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 643 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 644- copy the big-endian Unicode data files to another location, 645 separate from the other data files, 646 and then refresh ICU4J 647 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 648 mkdir -p 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 649 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 650 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 651 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 652 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 653 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 654 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 655 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 656 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 657 658* When refreshing all of ICU4J data from ICU4C 659- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 660- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 661or 662- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 663 664* refresh Java test .txt files 665- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 666 cd $ICU_SRC/icu4c/source/data/unidata 667 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 668 cd ../../test/testdata 669 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 670 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 671 672* run & fix ICU4J tests 673 674*** API additions 675- send notice to icu-design about new born-@stable API (enum constants etc.) 
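
* optional sketch: verify the refreshed Java test .txt files
  This is not part of the documented process. After the "refresh Java test .txt files"
  step above, a small helper can confirm that each ICU4J copy is byte-identical to its
  ICU4C source. The function name verify_copies is hypothetical; it relies only on cmp.

```shell
#!/bin/sh
# verify_copies SRC_DIR DEST_DIR FILE...
# Hypothetical helper: prints "MISMATCH: name" for every file that is
# missing or differs between the two folders; returns 1 if any does.
verify_copies() {
    src_dir=$1
    dest_dir=$2
    shift 2
    rc=0
    for name in "$@"; do
        if ! cmp -s "$src_dir/$name" "$dest_dir/$name"; then
            echo "MISMATCH: $name"
            rc=1
        fi
    done
    return $rc
}

# Example call, using the folders and file names from the step above:
#   verify_copies "$ICU_SRC/icu4c/source/data/unidata" \
#       "$ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode" \
#       confusables.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt
```

  A silent run with exit status 0 means that all listed files match.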

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
    -->
    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  Unicode 14:
    tnsa 16AC0..16AC9 Tangsa
    https://github.com/unicode-org/cldr/pull/1326

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
727 cd $ICU_ROOT/dbg/icu4c 728 ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh 729 730*** data files & enums & parser code 731 732* download files 733- mkdir -p $UNICODE_DATA 734- download Unicode files into $UNICODE_DATA 735 + subfolders: emoji, idna, security, ucd, uca 736 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 737 + split Unihan into single-property files 738 ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan 739 + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt 740 or from the ucd/cldr/ output folder of the Unicode Tools: 741 Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules. 742 cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata 743 744* for manual diffs and for Unicode Tools input data updates: 745 remove version suffixes from the file names 746 ~$ unidata/desuffixucd.py $UNICODE_DATA 747 (see https://sites.google.com/site/unicodetools/inputdata) 748 749* process and/or copy files 750- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 751 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 
752 + For debugging, and tweaking how ppucd.txt is written, 753 the tool has an --only_ppucd option: 754 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 755 756- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 757 758* new constants for new property values 759- preparseucd.py error: 760 ValueError: missing uchar.h enum constants for some property values: 761 [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi', 762 u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])), 763 (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])), 764 (u'InPC', set([u'Top_And_Bottom_And_Left']))] 765 = PropertyValueAliases.txt new property values (diff old & new .txt files) 766 blk; Chorasmian ; Chorasmian 767 blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G 768 blk; Dives_Akuru ; Dives_Akuru 769 blk; Khitan_Small_Script ; Khitan_Small_Script 770 blk; Lisu_Sup ; Lisu_Supplement 771 blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing 772 blk; Tangut_Sup ; Tangut_Supplement 773 blk; Yezidi ; Yezidi 774 -> add to uchar.h before UBLOCK_COUNT 775 use long property names for enum constants, 776 for the trailing comment get the block start code point: diff old & new Blocks.txt 777 -> add to UCharacter.UnicodeBlock IDs 778 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 779 replace public static final int \1_ID = \2; \3 780 -> add to UCharacter.UnicodeBlock objects 781 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 782 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 783 784 sc ; Chrs ; Chorasmian 785 sc ; Diak ; Dives_Akuru 786 sc ; Kits ; Khitan_Small_Script 787 sc ; Yezi ; Yezidi 788 -> uscript.h & com.ibm.icu.lang.UScript 789 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 790 and in com.ibm.icu.dev.test.lang.TestUScript.java 791 792 InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left 793 -> uchar.h enum UIndicPositionalCategory & UCharacter.java 
IndicPositionalCategory 794 795* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 796 (not strictly necessary for NOT_ENCODED scripts) 797 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 798 799* build ICU (make install) 800 to make sure that there are no syntax errors, and 801 so that the tools build can pick up the new definitions from the installed header files. 802 803 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 804 805* update spoof checker UnicodeSet initializers: 806 inclusionPat & recommendedPat in i18n/uspoof.cpp 807 INCLUSION & RECOMMENDED in SpoofChecker.java 808- make sure that the Unicode Tools tree contains the latest security data files 809- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 810- update the hardcoded version number there in the DIRECTORY path 811- run the tool (no special environment variables needed) 812- copy & paste from the Console output into the .cpp & .java files 813 814* generate normalization data files 815 cd $ICU_ROOT/dbg/icu4c 816 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 817 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 818 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 819 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 820 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 821 822* build ICU (make install) 823 so that the tools build can pick up the new definitions from the installed header files. 
824 825 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 826 827* build Unicode tools using CMake+make 828 829$ICU_SRC/tools/unicode/c/icudefs.txt: 830 831# Location (--prefix) of where ICU was installed. 832set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 833# Location of the ICU4C source tree. 834set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 835 836 $ICU_ROOT/dbg$ 837 mkdir -p tools/unicode/c 838 cd tools/unicode/c 839 840 $ICU_ROOT/dbg/tools/unicode/c$ 841 cmake ../../../../src/tools/unicode/c 842 make 843 844* generate core properties data files 845 $ICU_ROOT/dbg/tools/unicode/c$ 846 genprops/genprops $ICU_SRC/icu4c 847- tool failure: 848 genprops: Script_Extensions indexes overflow bit field 849 genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR 850 -> uprops.icu data file format : 851 add two more bits to store a script code or Script_Extensions index 852 -> generator code, C++ & Java runtime, uprops.icu format version 7.7 853- rebuild ICU (make install) & tools 854 855* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 856 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 857- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 858- Unicode 6.0..13.0: U+2260, U+226E, U+226F 859- nothing new in this Unicode version, no test file to update 860 861* run & fix ICU4C tests 862- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files 863- Andy helps with RBBI & spoof check test failures 864 865* collation: CLDR collation root, UCA DUCET 866 867- UCA DUCET goes into Mark's Unicode tools, see 868 https://sites.google.com/site/unicodetools/home#TOC-UCA 869 diff the main mapping file, look for bad changes 870 (for example, more bytes per weight for common characters) 871 ~/svn.unitools/trunk$ sed -r -f 
~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt 872 ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt 873 874- CLDR root data files are checked into $CLDR_SRC/common/uca/ 875 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 876 877- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 878 cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 879- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 880 cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 881 (note removing the underscore before "Rules") 882 cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 883- restore TODO diffs in UCARules.txt 884 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 885- update (ICU4C)/source/test/testdata/CollationTest_*.txt 886 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 887 from the CLDR root files (..._CLDR_..._SHORT.txt) 888 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 889 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 890 cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 891- if CLDR common/uca/unihan-index.txt changes, then update 892 CLDR common/collation/root.xml <collation type="private-unihan"> 893 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 894 895- run genuca 896 $ICU_ROOT/dbg/tools/unicode/c$ 897 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 898 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 899- rebuild ICU4C 900 901* Unihan collators 902 https://sites.google.com/site/unicodetools/unihan 903- run Unicode Tools 904 
org.unicode.draft.GenerateUnihanCollators 905 with VM arguments 906 -ea 907 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 908 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 909 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 910 -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src 911 -DUVERSION=13.0.0 912- run Unicode Tools 913 org.unicode.draft.GenerateUnihanCollatorFiles 914 with the same arguments 915- check CLDR diffs 916 cd $CLDR_SRC 917 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 918 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 919- copy to CLDR 920 cd $CLDR_SRC 921 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 922 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 923- run CLDR unit tests, commit to CLDR 924- generate ICU zh collation data: run CLDR 925 org.unicode.cldr.icu.NewLdml2IcuConverter 926 with program arguments 927 -t collation 928 -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation 929 -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental 930 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 931 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 932 zh 933 and VM arguments 934 -ea 935 -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src 936- rebuild ICU4C 937 938* run & fix ICU4C tests, now with new CLDR collation root data 939- run all tests with the collation test data *_SHORT.txt or the full files 940 (the full ones have comments, useful for debugging) 941- note on intltest: if collate/UCAConformanceTest fails, then 942 utility/MultithreadTest/TestCollators will fail as well; 943 fix the conformance test before looking into the multi-thread test 944 945* update Java data files 946- refresh just the UCD/UCA-related/derived files, just to be safe 947- see (ICU4C)/source/data/icu4j-readme.txt 948- mkdir -p 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 949- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 950 output: 951 ... 952 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 953 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b 954 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b 955 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b 956 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b" 957 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/ 958 mkdir -p /tmp/icu4j/main/shared/data 959 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 960 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/ 961 mkdir -p /tmp/icu4j/main/shared/data 962 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 963 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 964- copy the big-endian Unicode data files to another location, 965 separate from the other data files, 966 and then refresh ICU4J 967 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 968 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 969 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 970 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 971 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 972 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 973 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 974 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 975 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 976 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 977 978* When refreshing all of ICU4J data from ICU4C 979- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 980- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 981or 982- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 983 984* update CollationFCD.java 985 + copy & paste the initializers of lcccIndex[] etc. from 986 ICU4C/source/i18n/collationfcd.cpp to 987 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 988 989* refresh Java test .txt files 990- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 991 cd $ICU_SRC/icu4c/source/data/unidata 992 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 993 cd ../../test/testdata 994 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 995 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 996 997* run & fix ICU4J tests 998 999*** API additions 1000- send notice to icu-design about new born-@stable API (enum constants etc.) 
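
* optional sketch: print one initializer from collationfcd.cpp
  For the "update CollationFCD.java" step above, the text to copy can be printed with sed
  instead of being selected by hand. This is a sketch under an assumption about the layout
  of the generated file: each initializer starts on a line that contains the array name
  followed by "[" and ends at the next line that begins with "};". The function name
  print_initializer is hypothetical. Adapting the C++ declaration to Java syntax remains
  a manual edit.

```shell
#!/bin/sh
# print_initializer FILE ARRAY_NAME
# Hypothetical helper: prints the lines from the first line containing
# "ARRAY_NAME[" through the next line that starts with "};".
# Assumes the array name occurs only at its definition in FILE.
print_initializer() {
    sed -n "/$2\\[/,/^};/p" "$1"
}

# Example call:
#   print_initializer "$ICU_SRC/icu4c/source/i18n/collationfcd.cpp" lcccIndex
```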

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
1049 cd $ICU_ROOT/dbg/icu4c 1050 ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh 1051 1052*** data files & enums & parser code 1053 1054* download files 1055- mkdir -p $UNICODE_DATA 1056- download Unicode files into $UNICODE_DATA 1057 + subfolders: emoji, idna, security, ucd, uca 1058 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 1059 1060* for manual diffs and for Unicode Tools input data updates: 1061 remove version suffixes from the file names 1062 ~$ unidata/desuffixucd.py $UNICODE_DATA 1063 (see https://sites.google.com/site/unicodetools/inputdata) 1064 1065* process and/or copy files 1066- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1067 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1068 + For debugging, and tweaking how ppucd.txt is written, 1069 the tool has an --only_ppucd option: 1070 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1071 1072- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 1073 1074* build ICU (make install) 1075 so that the tools build can pick up the new definitions from the installed header files. 
1076 1077 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 1078 1079* update spoof checker UnicodeSet initializers: 1080 inclusionPat & recommendedPat in uspoof.cpp 1081 INCLUSION & RECOMMENDED in SpoofChecker.java 1082- make sure that the Unicode Tools tree contains the latest security data files 1083- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 1084- update the hardcoded version number there in the DIRECTORY path 1085- run the tool (no special environment variables needed) 1086- copy & paste from the Console output into the .cpp & .java files 1087 1088* generate normalization data files 1089 cd $ICU_ROOT/dbg/icu4c 1090 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1091 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1092 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1093 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1094 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1095 1096* build ICU (make install) 1097 so that the tools build can pick up the new definitions from the installed header files. 1098 1099 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 1100 1101* build Unicode tools using CMake+make 1102 1103$ICU_SRC/tools/unicode/c/icudefs.txt: 1104 1105# Location (--prefix) of where ICU was installed. 1106set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 1107# Location of the ICU4C source tree. 
1108set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 1109 1110 $ICU_ROOT/dbg$ 1111 mkdir -p tools/unicode/c 1112 cd tools/unicode/c 1113 1114 $ICU_ROOT/dbg/tools/unicode/c$ 1115 cmake ../../../../src/tools/unicode/c 1116 make 1117 1118* generate core properties data files 1119 $ICU_ROOT/dbg/tools/unicode/c$ 1120 genprops/genprops $ICU_SRC/icu4c 1121 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 1122 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 1123- rebuild ICU (make install) & tools 1124 1125* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1126 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1127- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1128- Unicode 6.0..12.1: U+2260, U+226E, U+226F 1129- nothing new in this Unicode version, no test file to update 1130 1131* run & fix ICU4C tests 1132- Andy handles RBBI & spoof check test failures 1133 1134* collation: CLDR collation root, UCA DUCET 1135 1136- UCA DUCET goes into Mark's Unicode tools, see 1137 https://sites.google.com/site/unicodetools/home#TOC-UCA 1138 diff the main mapping file, look for bad changes 1139 (for example, more bytes per weight for common characters) 1140 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt 1141 ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt 1142 1143- CLDR root data files are checked into $CLDR_SRC/common/uca/ 1144 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 1145 1146- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1147 cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 1148- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1149 cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 1150 
(note removing the underscore before "Rules") 1151 cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 1152- restore TODO diffs in UCARules.txt 1153 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 1154- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1155 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1156 from the CLDR root files (..._CLDR_..._SHORT.txt) 1157 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 1158 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 1159 cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 1160- if CLDR common/uca/unihan-index.txt changes, then update 1161 CLDR common/collation/root.xml <collation type="private-unihan"> 1162 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 1163 1164- run genuca, see command line above 1165- rebuild ICU4C 1166 1167* Unihan collators 1168 https://sites.google.com/site/unicodetools/unihan 1169- run Unicode Tools 1170 org.unicode.draft.GenerateUnihanCollators 1171 with VM arguments 1172 -ea 1173 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 1174 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 1175 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 1176 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1177 -DUVERSION=12.1.0 1178- run Unicode Tools 1179 org.unicode.draft.GenerateUnihanCollatorFiles 1180 with the same arguments 1181- check CLDR diffs 1182 cd $CLDR_SRC 1183 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 1184 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 1185- copy to CLDR 1186 cd $CLDR_SRC 1187 cp ../Generated/cldr/han/replace/zh.xml 
common/collation/zh.xml 1188 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 1189- run CLDR unit tests, commit to CLDR 1190- generate ICU zh collation data: run CLDR 1191 org.unicode.cldr.icu.NewLdml2IcuConverter 1192 with program arguments 1193 -t collation 1194 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation 1195 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 1196 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 1197 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 1198 zh 1199 and VM arguments 1200 -ea 1201 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1202- rebuild ICU4C 1203 1204* run & fix ICU4C tests, now with new CLDR collation root data 1205- run all tests with the collation test data *_SHORT.txt or the full files 1206 (the full ones have comments, useful for debugging) 1207- note on intltest: if collate/UCAConformanceTest fails, then 1208 utility/MultithreadTest/TestCollators will fail as well; 1209 fix the conformance test before looking into the multi-thread test 1210 1211* update Java data files 1212- refresh just the UCD/UCA-related/derived files, just to be safe 1213- see (ICU4C)/source/data/icu4j-readme.txt 1214- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1215- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1216 output: 1217 ... 
1218 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1219 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b 1220 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b 1221 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b 1222 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b" 1223 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/ 1224 mkdir -p /tmp/icu4j/main/shared/data 1225 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 1226 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/ 1227 mkdir -p /tmp/icu4j/main/shared/data 1228 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 1229 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1230- copy the big-endian Unicode data files to another location, 1231 separate from the other data files, 1232 and then refresh ICU4J 1233 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 1234 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1235 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1236 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1237 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1238 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 1239 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1240 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1241 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1242 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 1243 1244* When refreshing all of ICU4J data from ICU4C 1245- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1246- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 1247or 1248- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 1249 1250* update CollationFCD.java 1251 + copy & paste the initializers of lcccIndex[] etc. from 1252 ICU4C/source/i18n/collationfcd.cpp to 1253 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 1254 1255* refresh Java test .txt files 1256- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1257 cd $ICU_SRC/icu4c/source/data/unidata 1258 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1259 cd ../../test/testdata 1260 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1261 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1262 1263* run & fix ICU4J tests 1264 1265*** API additions 1266- send notice to icu-design about new born-@stable API (enum constants etc.) 
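
* optional sketch: stage the Unicode-related data files in one function
  The "copy the big-endian Unicode data files to another location" step above can be
  collected into one shell function. This is a sketch, not the documented procedure:
  the name stage_unicode_data is hypothetical, and instead of copying all *.icu files
  and then removing cnvalias.icu, it skips that file while copying. The jar update
  remains a separate command.

```shell
#!/bin/sh
# stage_unicode_data SRC DEST
# Hypothetical helper. SRC is the com/ibm/icu/impl/data/$ICUDT folder under
# data/out/icu4j; DEST is the matching folder under the staging root.
stage_unicode_data() {
    src=$1
    dest=$2
    mkdir -p "$dest/coll" "$dest/brkitr"
    cp "$src/confusables.cfu" "$dest"
    for f in "$src"/*.icu "$src"/*.nrm; do
        [ -e "$f" ] || continue        # pattern matched no file
        case ${f##*/} in
            cnvalias.icu) ;;           # not Unicode data, left out on purpose
            *) cp "$f" "$dest" ;;
        esac
    done
    cp "$src"/coll/* "$dest/coll"
    cp "$src"/brkitr/* "$dest/brkitr"
}

# Example call, matching the paths used above:
#   cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
#   stage_unicode_data com/ibm/icu/impl/data/$ICUDT /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
#   jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
```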

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
    u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
    u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
    (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic ; Elymaic
    blk; Nandinagari ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup ; Tamil_Supplement
    blk; Wancho ; Wancho
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
       replace public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
       replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
1390 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 1391 1392* update spoof checker UnicodeSet initializers: 1393 inclusionPat & recommendedPat in uspoof.cpp 1394 INCLUSION & RECOMMENDED in SpoofChecker.java 1395- make sure that the Unicode Tools tree contains the latest security data files 1396- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 1397- update the hardcoded version number there in the DIRECTORY path 1398- run the tool (no special environment variables needed) 1399- copy & paste from the Console output into the .cpp & .java files 1400 1401* generate normalization data files 1402 cd $ICU_ROOT/dbg/icu4c 1403 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1404 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1405 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1406 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1407 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1408 1409* build ICU (make install) 1410 so that the tools build can pick up the new definitions from the installed header files. 1411 1412 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 1413 1414* build Unicode tools using CMake+make 1415 1416$ICU_SRC/tools/unicode/c/icudefs.txt: 1417 1418# Location (--prefix) of where ICU was installed. 1419set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 1420# Location of the ICU4C source tree. 
1421set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 1422 1423 $ICU_ROOT/dbg$ 1424 mkdir -p tools/unicode/c 1425 cd tools/unicode/c 1426 1427 $ICU_ROOT/dbg/tools/unicode/c$ 1428 cmake ../../../../src/tools/unicode/c 1429 make 1430 1431* generate core properties data files 1432 $ICU_ROOT/dbg/tools/unicode/c$ 1433 genprops/genprops $ICU_SRC/icu4c 1434 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 1435 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 1436- rebuild ICU (make install) & tools 1437 1438* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1439 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1440- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1441- Unicode 6.0..12.0: U+2260, U+226E, U+226F 1442- nothing new in this Unicode version, no test file to update 1443 1444* run & fix ICU4C tests 1445- update test of default bidi classes: 1446 Bidi range \U0001ED00-\U0001ED4F changes default from R to AL, 1447 see diffs in DerivedBidiClass.txt 1448 + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[] 1449 + UCharacterTest.java TestIteration() defaultBidi[] 1450- Andy handles RBBI & spoof check test failures 1451 1452* collation: CLDR collation root, UCA DUCET 1453 1454- UCA DUCET goes into Mark's Unicode tools, see 1455 https://sites.google.com/site/unicodetools/home#TOC-UCA 1456 diff the main mapping file, look for bad changes 1457 (for example, more bytes per weight for common characters) 1458 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt 1459 ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt 1460 1461- CLDR root data files are checked into $CLDR_SRC/common/uca/ 1462 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 1463 1464- update 
source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1465 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 1466- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1467 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 1468 (note removing the underscore before "Rules") 1469 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 1470- restore TODO diffs in UCARules.txt 1471 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 1472- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1473 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1474 from the CLDR root files (..._CLDR_..._SHORT.txt) 1475 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 1476 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 1477 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 1478- if CLDR common/uca/unihan-index.txt changes, then update 1479 CLDR common/collation/root.xml <collation type="private-unihan"> 1480 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 1481 1482- run genuca, see command line above; 1483 deal with 1484 Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 1485 FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible) 1486 (add the character to genuca.cpp sampleCharsToScripts[]) 1487 + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script) 1488 and cache its values. 1489 Works as long as the script metadata is updated before the collation data. 
1490- rebuild ICU4C 1491 1492* Unihan collators 1493 https://sites.google.com/site/unicodetools/unihan 1494- run Unicode Tools 1495 org.unicode.draft.GenerateUnihanCollators 1496 with VM arguments 1497 -ea 1498 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 1499 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 1500 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 1501 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1502 -DUVERSION=12.0.0 1503- run Unicode Tools 1504 org.unicode.draft.GenerateUnihanCollatorFiles 1505 with the same arguments 1506- check CLDR diffs 1507 cd $CLDR_SRC 1508 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 1509 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 1510- copy to CLDR 1511 cd $CLDR_SRC 1512 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 1513 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 1514- run CLDR unit tests, commit to CLDR 1515- generate ICU zh collation data: run CLDR 1516 org.unicode.cldr.icu.NewLdml2IcuConverter 1517 with program arguments 1518 -t collation 1519 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation 1520 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 1521 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 1522 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 1523 zh 1524 and VM arguments 1525 -ea 1526 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1527- rebuild ICU4C 1528 1529* run & fix ICU4C tests, now with new CLDR collation root data 1530- run all tests with the collation test data *_SHORT.txt or the full files 1531 (the full ones have comments, useful for debugging) 1532- note on intltest: if collate/UCAConformanceTest fails, then 1533 utility/MultithreadTest/TestCollators will fail as well; 1534 fix the conformance test before looking into the multi-thread 
test 1535 1536* update Java data files 1537- refresh just the UCD/UCA-related/derived files, just to be safe 1538- see (ICU4C)/source/data/icu4j-readme.txt 1539- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1540- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1541 output: 1542 ... 1543 Unicode .icu files built to ./out/build/icudt63l 1544 echo timestamp > uni-core-data 1545 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b 1546 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b 1547 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 1548 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b 1549 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b" 1550 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/ 1551 mkdir -p /tmp/icu4j/main/shared/data 1552 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 1553 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/ 1554 mkdir -p /tmp/icu4j/main/shared/data 1555 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 1556 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 1557- copy the big-endian Unicode data files to another location, 1558 separate from the other data files, 1559 and then refresh ICU4J 1560 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 1561 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1562 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1563 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1564 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1565 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 1566 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1567 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1568 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1569 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 1570 1571* When refreshing all of ICU4J data from ICU4C 1572- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1573- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 1574or 1575- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 1576 1577* update CollationFCD.java 1578 + copy & paste the initializers of lcccIndex[] etc. from 1579 ICU4C/source/i18n/collationfcd.cpp to 1580 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 1581 1582* refresh Java test .txt files 1583- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1584 cd $ICU_SRC/icu4c/source/data/unidata 1585 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1586 cd ../../test/testdata 1587 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1588 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1589 1590* run & fix ICU4J tests 1591 1592*** API additions 1593- send notice to icu-design about new born-@stable API (enum constants etc.) 

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

ICU 63 addition of support for the text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
  + uchar.h: update UCHAR_INT_LIMIT
  + uchar.h: add the enum U<long prop name>
    with constants U_<short prop name>_<long value name>
  + UProperty.java: add the constant <long prop name>
  + UProperty.java: update INT_LIMIT
  + UCharacter.java: add the interface <long prop name>
    with constants <long value name>

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
    names and aliases.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* preparseucd.py changes
- add new property short names (uppercase) to _prop_and_value_re
  so that ParseUCharHeader() parses the new enum constants

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* write data for runtime, hardcoded for now
- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
- generate new icu4c/source/common/ulayout_props_data.h
- for each of the three new enumerated properties
  + int property max value
  + small, 8-bit UCPTrie
    (A small 16-bit trie with bit fields for these three properties
    is very nearly the same size as the sum of the three.)

* wire into C++
- uprops.cpp: #include ulayout_props_data.h
- uprops.cpp: add getInPC() etc. functions
- uprops.cpp: add lines to intProps[], include max values
- uprops.h: add UPropertySource constants
- uprops.cpp: add uprops_addPropertyStarts(src)
- uniset_props.cpp: add to UnicodeSet_initInclusion()
- intltest/ucdtest.cpp: write unit tests

* update Java data files
- refresh just the pnames.icu file with the new property [value] names, just to be safe
- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* wire into Java
- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
- UCharacterProperty.java: for each new property
  + create a nested class to hold its CodePointTrie
  + initialize it from a string literal
  + paste in the initializer printed by genprops
  + add a new IntProperty object to the intProps[] array
  + use the correct max int value for each property, also printed by genprops
- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
- UnicodeSet.java: add to getInclusions()
- UCharacterTest.java: write unit tests

---------------------------------------------------------------------------- ***

Unicode 11.0 update for ICU 62

http://www.unicode.org/versions/Unicode11.0.0/
http://unicode.org/versions/beta-11.0.0.html
https://www.unicode.org/review/pri372/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-21.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180521
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/svn.icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt61b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:13630: Unicode 11
- ^/branches/markus/uni11

*** CLDR Trac

- cldrbug 10978: Unicode 11
- ^/branches/markus/uni11

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
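
A quick cross-check for the version-number step above could look like the following Python sketch.
It pulls the Unicode version out of header text so that it can be compared with the version being
updated to. It assumes that uchar.h defines the version as #define U_UNICODE_VERSION "...";
the function name and the sample header text are hypothetical and not part of the ICU tools.

```python
import re

# Assumption: uchar.h carries a line of the form
#   #define U_UNICODE_VERSION "11.0"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_in_header(header_text):
    """Return the version string found in uchar.h-style text, or None if absent."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


# Hypothetical excerpt, for illustration only.
sample_header = '''
/* Unicode version number */
#define U_UNICODE_VERSION "11.0"
'''

if __name__ == "__main__":
    expected = "11.0"
    found = unicode_version_in_header(sample_header)
    print("header version:", found, "matches expected:", found == expected)
```

The same function could be pointed at the real header by reading
$ICU_SRC/icu4c/source/common/unicode/uscript.h's sibling file uchar.h from disk
before running "configure".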
1770 1771*** data files & enums & parser code 1772 1773* download files 1774- mkdir -p $UNICODE_DATA 1775- download Unicode files into $UNICODE_DATA 1776 + subfolders: emoji, idna, security, ucd, uca 1777 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 1778 1779* for manual diffs and for Unicode Tools input data updates: 1780 remove version suffixes from the file names 1781 ~$ unidata/desuffixucd.py $UNICODE_DATA 1782 (see https://sites.google.com/site/unicodetools/inputdata) 1783 1784* process and/or copy files 1785- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1786 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1787 + For debugging, and tweaking how ppucd.txt is written, 1788 the tool has an --only_ppucd option: 1789 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1790 1791- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 1792 1793* build ICU (make install) 1794 so that the tools build can pick up the new definitions from the installed header files. 
1795 1796 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1797 1798* preparseucd.py changes 1799- fix other errors 1800 NameError: unknown property Extended_Pictographic 1801 -> add Extended_Pictographic binary property 1802 -> add new short names for all Emoji properties 1803 1804* new constants for new property values 1805- preparseucd.py error: 1806 ValueError: missing uchar.h enum constants for some property values: 1807 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar', 1808 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals', 1809 u'Indic_Siyaq_Numbers'])), 1810 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])), 1811 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])), 1812 (u'GCB', set([u'LinkC', u'Virama'])), 1813 (u'WB', set([u'WSegSpace']))] 1814 = PropertyValueAliases.txt new property values (diff old & new .txt files) 1815 blk; Chess_Symbols ; Chess_Symbols 1816 blk; Dogra ; Dogra 1817 blk; Georgian_Ext ; Georgian_Extended 1818 blk; Gunjala_Gondi ; Gunjala_Gondi 1819 blk; Hanifi_Rohingya ; Hanifi_Rohingya 1820 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers 1821 blk; Makasar ; Makasar 1822 blk; Mayan_Numerals ; Mayan_Numerals 1823 blk; Medefaidrin ; Medefaidrin 1824 blk; Old_Sogdian ; Old_Sogdian 1825 blk; Sogdian ; Sogdian 1826 -> add to uchar.h 1827 use long property names for enum constants, 1828 for the trailing comment get the block start code point: diff old & new Blocks.txt 1829 -> add to UCharacter.UnicodeBlock IDs 1830 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 1831 replace public static final int \1_ID = \2; \3 1832 -> add to UCharacter.UnicodeBlock objects 1833 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 1834 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 1835 1836 GCB; LinkC ; LinkingConsonant 1837 GCB; Virama ; Virama 1838 -> uchar.h & 
UCharacter.GraphemeClusterBreak 1839 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76 1840 1841 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed 1842 -> ignore: ICU does not yet support this property 1843 1844 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya 1845 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa 1846 -> uchar.h & UCharacter.JoiningGroup 1847 1848 sc ; Dogr ; Dogra 1849 sc ; Gong ; Gunjala_Gondi 1850 sc ; Maka ; Makasar 1851 sc ; Medf ; Medefaidrin 1852 sc ; Rohg ; Hanifi_Rohingya 1853 sc ; Sogd ; Sogdian 1854 sc ; Sogo ; Old_Sogdian 1855 -> uscript.h & com.ibm.icu.lang.UScript 1856 -> Nushu had been added already 1857 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 1858 and in com.ibm.icu.dev.test.lang.TestUScript.java 1859 1860 WB ; WSegSpace ; WSegSpace 1861 -> uchar.h & UCharacter.WordBreak 1862 1863* New short names for emoji properties 1864- see UTS #51 1865- short names set in preparseucd.py 1866 1867* New properties 1868- boolean emoji property Extended_Pictographic 1869 -> added in preparseucd.py 1870 -> uchar.h & UProperty.java 1871- misc. 
property Equivalent_Unified_Ideograph (EqUIdeo) 1872 as shown in PropertyValueAliases.txt 1873 -> ignore for now 1874 1875* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 1876 (not strictly necessary for NOT_ENCODED scripts) 1877 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 1878 1879* update spoof checker UnicodeSet initializers: 1880 inclusionPat & recommendedPat in uspoof.cpp 1881 INCLUSION & RECOMMENDED in SpoofChecker.java 1882- make sure that the Unicode Tools tree contains the latest security data files 1883- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 1884- update the hardcoded version number there in the DIRECTORY path 1885- run the tool (no special environment variables needed) 1886- copy & paste from the Console output into the .cpp & .java files 1887 1888* generate normalization data files 1889 cd $ICU_ROOT/dbg/icu4c 1890 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1891 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1892 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1893 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1894 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1895 1896* build ICU (make install) 1897 so that the tools build can pick up the new definitions from the installed header files. 1898 1899 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1900 1901* build Unicode tools using CMake+make 1902 1903$ICU_SRC/tools/unicode/c/icudefs.txt: 1904 1905# Location (--prefix) of where ICU was installed. 1906set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 1907# Location of the ICU4C source tree. 
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* Fix case props
    genprops error: casepropsbuilder: too many exceptions words
    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
- With the addition of Georgian Mtavruli capital letters,
  there are now too many simple case mappings with big mapping deltas
  that yield uncompressible exceptions.
- Changed the data structure (now formatVersion 4):
  added one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that delta is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Made further changes to gain one more bit for the exceptions index,
  for future growth. For details, see casepropsbuilder.cpp.
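
The idea behind the "big delta" slot can be sketched in Python as follows. This is an illustration
only, not the actual data layout: the inline delta width, the function names, and the dictionary
fields are assumptions made for the example; the real format is defined in casepropsbuilder.cpp.
The sketch stores a small mapping delta inline and otherwise stores the magnitude of the delta
in one extra slot plus a flag for its sign.

```python
# Hypothetical illustration; the real layout lives in casepropsbuilder.cpp.
INLINE_DELTA_BITS = 9  # assumed width of a signed delta stored inline
INLINE_MIN = -(1 << (INLINE_DELTA_BITS - 1))
INLINE_MAX = (1 << (INLINE_DELTA_BITS - 1)) - 1


def encode_simple_mapping(code_point, mapped):
    """Encode a simple case mapping as an inline delta or as a big-delta exception."""
    delta = mapped - code_point
    if INLINE_MIN <= delta <= INLINE_MAX:
        return {"kind": "inline", "delta": delta}
    # One slot for the magnitude, one bit for the sign.
    return {"kind": "exception",
            "big_delta": abs(delta),
            "delta_is_negative": delta < 0}


def decode_simple_mapping(code_point, encoded):
    """Recover the mapped code point from either encoding."""
    if encoded["kind"] == "inline":
        return code_point + encoded["delta"]
    magnitude = encoded["big_delta"]
    if encoded["delta_is_negative"]:
        return code_point - magnitude
    return code_point + magnitude


if __name__ == "__main__":
    # Latin A -> a, Georgian Mtavruli An -> Mkhedruli An, Cherokee small a -> capital A
    for cp, mapped in [(0x41, 0x61), (0x1C90, 0x10D0), (0xAB70, 0x13A0)]:
        enc = encode_simple_mapping(cp, mapped)
        print(f"U+{cp:04X} -> U+{mapped:04X}: {enc}")
```

A single slot per mapping is enough here because the whole block of Georgian or Cherokee letters
shares the same kind of faraway delta, so each character needs only one extra value rather than
a full set of explicit mappings.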

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..11.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

- Errors in char.txt, word.txt, word_POSIX.txt like
    createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
  because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
  -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
     not empty, just to get ICU building.
  -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
     and properties together with the rules that used them (GB 10, WB 14).
  -> Andy adjusts the rule sets further to sync with
     Unicode 11 grapheme, word, and line break spec changes.
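
To find the rule lines affected by such empty property values, a small Python helper along these
lines may be useful. It is a sketch only: the function name, the sample rule text, and the way the
empty values are supplied are hypothetical, and it is not part of the ICU build. It lists each
\p{Property = Value} reference whose value the caller has marked as empty.

```python
import re

# Matches references such as \p{Grapheme_Cluster_Break = EBG}
_PROP_REF = re.compile(r'\\p\{\s*(\w+)\s*=\s*(\w+)\s*\}')


def find_empty_set_references(rule_text, empty_values):
    """Return (line_number, property, value) for each reference to a value listed as empty."""
    hits = []
    for line_number, line in enumerate(rule_text.splitlines(), start=1):
        for prop, value in _PROP_REF.findall(line):
            if (prop, value) in empty_values:
                hits.append((line_number, prop, value))
    return hits


# Hypothetical rule excerpt, modeled on the variable name mentioned above.
sample_rules = r'''
$Extend = [\p{Grapheme_Cluster_Break = Extend}];
$E_Base_GAZ = [\p{Grapheme_Cluster_Break = EBG}];
'''

# Values that became empty in this Unicode version, supplied by the caller.
empty = {("Grapheme_Cluster_Break", "EBG"), ("Word_Break", "EBG")}

if __name__ == "__main__":
    for line_number, prop, value in find_empty_set_references(sample_rules, empty):
        print(f"line {line_number}: \\p{{{prop} = {value}}} is empty")
```

Each reported line is a candidate for the workarounds listed above: either pad the set temporarily
or remove the variable together with the rules that use it.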
1958 1959* collation: CLDR collation root, UCA DUCET 1960 1961- UCA DUCET goes into Mark's Unicode tools, see 1962 https://sites.google.com/site/unicodetools/home#TOC-UCA 1963 diff the main mapping file, look for bad changes 1964 (for example, more bytes per weight for common characters) 1965 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt 1966 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt 1967 1968- CLDR root data files are checked into $CLDR_SRC/common/uca/ 1969 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 1970 1971- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1972 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 1973- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1974 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 1975 (note removing the underscore before "Rules") 1976 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 1977- restore TODO diffs in UCARules.txt 1978 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 1979- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1980 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1981 from the CLDR root files (..._CLDR_..._SHORT.txt) 1982 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 1983 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 1984 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 1985- if CLDR common/uca/unihan-index.txt changes, then update 1986 CLDR common/collation/root.xml <collation type="private-unihan"> 1987 and regenerate (or update in parallel) 
$ICU_SRC/icu4c/source/data/coll/root.txt 1988 1989- run genuca, see command line above; 1990 deal with 1991 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 1992 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible) 1993 (add the character to genuca.cpp sampleCharsToScripts[]) 1994 + look up the USCRIPT_ code for the new sample characters 1995 (should be obvious from the comment in the error output) 1996 + *add* mappings to sampleCharsToScripts[], do not replace them 1997 (in case the script sample characters flip-flop) 1998 + insert new scripts in DUCET script order, see the top_byte table 1999 at the beginning of FractionalUCA.txt 2000- rebuild ICU4C 2001 2002* Unihan collators 2003 https://sites.google.com/site/unicodetools/unihan 2004- run Unicode Tools 2005 org.unicode.draft.GenerateUnihanCollators 2006 with VM arguments 2007 -ea 2008 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 2009 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 2010 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 2011 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 2012 -DUVERSION=11.0.0 2013- run Unicode Tools 2014 org.unicode.draft.GenerateUnihanCollatorFiles 2015 with the same arguments 2016- check CLDR diffs 2017 cd $CLDR_SRC 2018 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 2019 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 2020- copy to CLDR 2021 cd $CLDR_SRC 2022 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 2023 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 2024- run CLDR unit tests, commit to CLDR 2025- generate ICU zh collation data: run CLDR 2026 org.unicode.cldr.icu.NewLdml2IcuConverter 2027 with program arguments 2028 -t collation 2029 -s 
/usr/local/google/home/mscherer/svn.cldr/uni/common/collation 2030 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 2031 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll 2032 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation 2033 zh 2034 and VM arguments 2035 -ea 2036 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 2037- rebuild ICU4C 2038 2039* run & fix ICU4C tests, now with new CLDR collation root data 2040- run all tests with the collation test data *_SHORT.txt or the full files 2041 (the full ones have comments, useful for debugging) 2042- note on intltest: if collate/UCAConformanceTest fails, then 2043 utility/MultithreadTest/TestCollators will fail as well; 2044 fix the conformance test before looking into the multi-thread test 2045 2046* update Java data files 2047- refresh just the UCD/UCA-related/derived files, just to be safe 2048- see (ICU4C)/source/data/icu4j-readme.txt 2049- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2050- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2051 output: 2052 ... 
2053 Unicode .icu files built to ./out/build/icudt61l 2054 echo timestamp > uni-core-data 2055 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b 2056 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b 2057 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 2058 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b 2059 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b" 2060 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/ 2061 mkdir -p /tmp/icu4j/main/shared/data 2062 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 2063 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/ 2064 mkdir -p /tmp/icu4j/main/shared/data 2065 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 2066 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data' 2067- copy the big-endian Unicode data files to another location, 2068 separate from the other data files, 2069 and then refresh ICU4J 2070 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 2071 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2072 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2073 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2074 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2075 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 2076 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2077 cp com/ibm/icu/impl/data/$ICUDT/coll/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2078 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2079 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2080 2081* When refreshing all of ICU4J data from ICU4C 2082- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2083- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 2084or 2085- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 2086 2087* update CollationFCD.java 2088 + copy & paste the initializers of lcccIndex[] etc. from 2089 ICU4C/source/i18n/collationfcd.cpp to 2090 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 2091 2092* refresh Java test .txt files 2093- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 2094 cd $ICU_SRC/icu4c/source/data/unidata 2095 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2096 cd ../../test/testdata 2097 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2098 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2099 2100* run & fix ICU4J tests 2101 2102*** API additions 2103- send notice to icu-design about new born-@stable API (enum constants etc.) 2104 2105*** CLDR numbering systems 2106- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR 2107 Unicode 11: using Unicode 11 CLDR ticket #10978 2108 rohg 10D30..10D39 Hanifi_Rohingya 2109 gong 11DA0..11DA9 Gunjala_Gondi 2110 Earlier: CLDR tickets specific to adding new numbering systems. 
  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
  Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
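
Before running "configure", it can help to confirm that the header really carries the new version. The following is a minimal sketch, assuming uchar.h defines the version as a quoted U_UNICODE_VERSION macro. The helper name and the sample text are hypothetical.

```python
import re

# Matches a line such as:  #define U_UNICODE_VERSION "10.0"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_in_header(header_text):
    """Return the version string from the macro, or None if it is not found."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


# Illustrative header fragment; a real check would read the uchar.h file instead.
SAMPLE_HEADER = '/* ... */\n#define U_UNICODE_VERSION "10.0"\n'

if __name__ == "__main__":
    print(unicode_version_in_header(SAMPLE_HEADER))
```

The same idea could be extended to the other places listed above, which would each need their own pattern.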

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
    (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
2188 2189 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 2190 2191* preparseucd.py changes 2192- remove or add new Unicode scripts from/to the 2193 only-in-ISO-15924 list according to the error messages: 2194 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924 2195 -> adjust _scripts_only_in_iso15924 as indicated 2196- fix other errors 2197 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo'] 2198 -> add vo=Vertical_Orientation to _ignored_properties 2199 -> later removed again, parsing the file, even though we do not yet store data for runtime use 2200 2201* new constants for new property values 2202- preparseucd.py error: 2203 ValueError: missing uchar.h enum constants for some property values: 2204 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F', 2205 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])), 2206 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla', 2207 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra', 2208 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])), 2209 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))] 2210 = PropertyValueAliases.txt new property values (diff old & new .txt files) 2211 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F 2212 blk; Kana_Ext_A ; Kana_Extended_A 2213 blk; Masaram_Gondi ; Masaram_Gondi 2214 blk; Nushu ; Nushu 2215 blk; Soyombo ; Soyombo 2216 blk; Syriac_Sup ; Syriac_Supplement 2217 blk; Zanabazar_Square ; Zanabazar_Square 2218 -> add to uchar.h 2219 use long property names for enum constants, 2220 for the trailing comment get the block start code point: diff old & new Blocks.txt 2221 -> add to UCharacter.UnicodeBlock IDs 2222 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 2223 replace public static final int \1_ID = \2; \3 2224 -> add to UCharacter.UnicodeBlock objects 2225 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 2226 replace public static 
final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 2227 2228 jg ; Malayalam_Bha ; Malayalam_Bha 2229 jg ; Malayalam_Ja ; Malayalam_Ja 2230 jg ; Malayalam_Lla ; Malayalam_Lla 2231 jg ; Malayalam_Llla ; Malayalam_Llla 2232 jg ; Malayalam_Nga ; Malayalam_Nga 2233 jg ; Malayalam_Nna ; Malayalam_Nna 2234 jg ; Malayalam_Nnna ; Malayalam_Nnna 2235 jg ; Malayalam_Nya ; Malayalam_Nya 2236 jg ; Malayalam_Ra ; Malayalam_Ra 2237 jg ; Malayalam_Ssa ; Malayalam_Ssa 2238 jg ; Malayalam_Tta ; Malayalam_Tta 2239 -> uchar.h & UCharacter.JoiningGroup 2240 2241 sc ; Gonm ; Masaram_Gondi 2242 sc ; Nshu ; Nushu 2243 sc ; Soyo ; Soyombo 2244 sc ; Zanb ; Zanabazar_Square 2245 -> uscript.h & com.ibm.icu.lang.UScript 2246 -> Nushu had been added already 2247 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 2248 and in com.ibm.icu.dev.test.lang.TestUScript.java 2249 2250* New properties as shown in PropertyValueAliases.txt changes 2251- boolean Emoji_Component from emoji 5 2252 -> uchar.h & UProperty.java 2253- boolean 2254 # Regional_Indicator (RI) 2255 2256 RI ; N ; No ; F ; False 2257 RI ; Y ; Yes ; T ; True 2258 -> uchar.h & UProperty.java 2259 -> single immutable range, to be hardcoded 2260- boolean 2261 # Prepended_Concatenation_Mark (PCM) 2262 2263 PCM; N ; No ; F ; False 2264 PCM; Y ; Yes ; T ; True 2265 -> was new in Unicode 9 2266 -> uchar.h & UProperty.java 2267- enumerated 2268 # Vertical_Orientation (vo) 2269 2270 vo ; R ; Rotated 2271 vo ; Tr ; Transformed_Rotated 2272 vo ; Tu ; Transformed_Upright 2273 vo ; U ; Upright 2274 -> only pre-parsed for now, but not yet stored for runtime use 2275 2276* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 2277 (not strictly necessary for NOT_ENCODED scripts) 2278 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 2279 2280* generate normalization data files 2281 cd $ICU_ROOT/dbg/icu4c 2282 
bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 2283 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 2284 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 2285 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 2286 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 2287 2288* build ICU (make install) 2289 so that the tools build can pick up the new definitions from the installed header files. 2290 2291 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 2292 2293* build Unicode tools using CMake+make 2294 2295$ICU_SRC/tools/unicode/c/icudefs.txt: 2296 2297# Location (--prefix) of where ICU was installed. 2298set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 2299# Location of the ICU4C source tree. 2300set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c) 2301 2302 $ICU_ROOT/dbg/tools/unicode/c$ 2303 cmake ../../../../src/tools/unicode/c 2304 make 2305 2306* generate core properties data files 2307 $ICU_ROOT/dbg/tools/unicode/c$ 2308 genprops/genprops $ICU_SRC/icu4c 2309 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c 2310 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 2311- rebuild ICU (make install) & tools 2312 2313* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 2314 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 2315- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 2316- Unicode 6.0..10.0: U+2260, U+226E, U+226F 2317- nothing new in this Unicode version, no test file to update 2318 2319* run & fix ICU4C tests 2320- Andy handles RBBI & spoof check test failures 2321 2322* collation: CLDR collation root, UCA DUCET 2323 2324- UCA DUCET goes into Mark's Unicode 
tools, see 2325 https://sites.google.com/site/unicodetools/home#TOC-UCA 2326- CLDR root data files are checked into $CLDR_SRC/common/uca/ 2327 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 2328 2329- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 2330 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 2331- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 2332 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 2333 (note removing the underscore before "Rules") 2334 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 2335- restore TODO diffs in UCARules.txt 2336 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 2337- update (ICU4C)/source/test/testdata/CollationTest_*.txt 2338 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 2339 from the CLDR root files (..._CLDR_..._SHORT.txt) 2340 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 2341 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 2342 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 2343- if CLDR common/uca/unihan-index.txt changes, then update 2344 CLDR common/collation/root.xml <collation type="private-unihan"> 2345 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 2346 2347- run genuca, see command line above; 2348 deal with 2349 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt: 2350 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible) 2351 (add the character to genuca.cpp sampleCharsToScripts[]) 2352 + look up the USCRIPT_ code for the new 
sample characters 2353 (should be obvious from the comment in the error output) 2354 + *add* mappings to sampleCharsToScripts[], do not replace them 2355 (in case the script sample characters flip-flop) 2356 + insert new scripts in DUCET script order, see the top_byte table 2357 at the beginning of FractionalUCA.txt 2358- rebuild ICU4C 2359 2360* Unihan collators 2361 https://sites.google.com/site/unicodetools/unihan 2362- run Unicode Tools 2363 org.unicode.draft.GenerateUnihanCollators 2364 with VM arguments 2365 -ea 2366 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 2367 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 2368 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 2369 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 2370 -DUVERSION=10.0.0 2371- run Unicode Tools 2372 org.unicode.draft.GenerateUnihanCollatorFiles 2373 with the same arguments 2374- check CLDR diffs 2375 cd $CLDR_SRC 2376 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 2377 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 2378- copy to CLDR 2379 cd $CLDR_SRC 2380 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 2381 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 2382- run CLDR unit tests, commit to CLDR 2383- generate ICU zh collation data: run CLDR 2384 org.unicode.cldr.icu.NewLdml2IcuConverter 2385 with program arguments 2386 -t collation 2387 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation 2388 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental 2389 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll 2390 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation 2391 zh 2392 and VM arguments 2393 -ea 2394 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 2395- rebuild ICU4C 2396 2397* run & fix ICU4C tests, now with new CLDR collation 
root data 2398- run all tests with the collation test data *_SHORT.txt or the full files 2399 (the full ones have comments, useful for debugging) 2400- note on intltest: if collate/UCAConformanceTest fails, then 2401 utility/MultithreadTest/TestCollators will fail as well; 2402 fix the conformance test before looking into the multi-thread test 2403 2404* update Java data files 2405- refresh just the UCD/UCA-related/derived files, just to be safe 2406- see (ICU4C)/source/data/icu4j-readme.txt 2407- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2408- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2409 output: 2410 ... 2411 Unicode .icu files built to ./out/build/icudt60l 2412 echo timestamp > uni-core-data 2413 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b 2414 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b 2415 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 2416 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b 2417 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b" 2418 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/ 2419 mkdir -p /tmp/icu4j/main/shared/data 2420 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 2421 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/ 2422 mkdir -p /tmp/icu4j/main/shared/data 2423 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 2424 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data' 2425- copy the big-endian Unicode data 
files to another location, 2426 separate from the other data files, 2427 and then refresh ICU4J 2428 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 2429 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2430 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2431 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2432 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2433 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 2434 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2435 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2436 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2437 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2438 2439* When refreshing all of ICU4J data from ICU4C 2440- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2441- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 2442or 2443- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 2444 2445* update CollationFCD.java 2446 + copy & paste the initializers of lcccIndex[] etc. 
from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
  Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Emoji 5.0 update for ICU 59
- ICU 59 mostly remains on Unicode 9.0
- except updates bidi and segmentation data to Unicode 10 beta

First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
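
For the "CLDR numbering systems" step in the Unicode 10 section above, the search for gc=Nd & nv=4 can be sketched against UnicodeData.txt. This is an illustration only. It assumes the standard semicolon-separated UnicodeData.txt fields and that each set of decimal digits occupies ten contiguous code points. The function name and the sample lines are hypothetical.

```python
def decimal_digit_ranges(unicode_data_lines):
    """Return (zero, nine) code point pairs, one per set of decimal digits."""
    ranges = []
    for line in unicode_data_lines:
        fields = line.strip().split(";")
        if len(fields) < 9:
            continue  # blank or malformed line
        # fields[2] is General_Category, fields[6] is the decimal digit value.
        if fields[2] == "Nd" and fields[6] == "4":
            four = int(fields[0], 16)
            # Assumption: digits 0..9 are contiguous, so the zero is 4 below.
            ranges.append((four - 4, four + 5))
    return ranges


# Illustrative lines in UnicodeData.txt format; only fields 0, 2 and 6 are used.
SAMPLE = [
    "10D34;HANIFI ROHINGYA DIGIT FOUR;Nd;0;AN;;4;4;4;N;;;;;",
    "11DA4;GUNJALA GONDI DIGIT FOUR;Nd;0;L;;4;4;4;N;;;;;",
    "2463;CIRCLED DIGIT FOUR;No;0;ON;<circle> 0034;;4;4;N;;;;;",
]

if __name__ == "__main__":
    for zero, nine in decimal_digit_ranges(SAMPLE):
        print("%04X..%04X" % (zero, nine))
```

The resulting ranges still have to be compared by hand with the numbering systems that CLDR already defines. Only the ones that are missing need a ticket.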
2481 2482* Command-line environment setup 2483 2484ICU_ROOT=~/svn.icu/trunk 2485ICU_SRC_DIR=$ICU_ROOT/src 2486ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c 2487ICUDT=icudt59b 2488export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib 2489SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in 2490UNIDATA=$ICU4C_SRC_DIR/source/data/unidata 2491 2492*** ICU Trac 2493 2494- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released 2495- changes directly on trunk 2496 2497*** data files & enums & parser code 2498 2499* download files 2500 2501- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca) 2502- download emoji 5.0 beta files into the same uni90e50 folder 2503- download Unicode 10.0 beta files: ucd 2504 + copy Unicode 10 bidi files to the uni90e50/ucd folder: 2505 BidiBrackets.txt 2506 BidiCharacterTest.txt 2507 BidiMirroring.txt 2508 BidiTest.txt 2509 extracted/DerivedBidiClass.txt 2510 + copy Unicode 10 segmentation files to the uni90e50/ucd folder: 2511 LineBreak.txt 2512 auxiliary/* 2513 2514* preparseucd.py changes 2515- adjust for combined trunks 2516- write new copyright lines 2517- ignore new Emoji_Component property for now 2518 2519* process and/or copy files 2520- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR 2521 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 2522 2523- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA 2524 2525* build ICU (make install) 2526 so that the tools build can pick up the new definitions from the installed header files. 2527 2528 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 2529 2530* build Unicode tools using CMake+make 2531 2532~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt: 2533 2534# Location (--prefix) of where ICU was installed. 2535set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 2536# Location of the ICU4C source tree. 
2537set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c) 2538 2539 ~/svn.icu/trunk/dbg/tools/unicode/c$ 2540 cmake ../../../../src/tools/unicode/c 2541 make 2542 2543* generate core properties data files 2544 ~/svn.icu/trunk/dbg/tools/unicode/c$ 2545 genprops/genprops $ICU4C_SRC_DIR 2546- rebuild ICU (make install) & tools 2547 2548* run & fix ICU4C tests 2549- Andy handles RBBI & spoof check test failures 2550 2551* update Java data files 2552- refresh just the UCD/UCA-related/derived files, just to be safe 2553- see (ICU4C)/source/data/icu4j-readme.txt 2554- mkdir /tmp/icu4j 2555- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2556 output: 2557 ... 2558 Unicode .icu files built to ./out/build/icudt59l 2559 echo timestamp > uni-core-data 2560 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b 2561 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b 2562 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 2563 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b 2564 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b" 2565 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/ 2566 mkdir -p /tmp/icu4j/main/shared/data 2567 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 2568 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/ 2569 mkdir -p /tmp/icu4j/main/shared/data 2570 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 2571 make[1]: Leaving directory 
`/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data' 2572- copy the big-endian Unicode data files to another location, 2573 separate from the other data files, 2574 and then refresh ICU4J 2575 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j 2576 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2577 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2578 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2579 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 2580 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2581 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2582 2583* When refreshing all of ICU4J data from ICU4C 2584- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2585- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data 2586or 2587- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install 2588 2589* refresh Java test .txt files 2590- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 2591 cd $ICU4C_SRC_DIR/source/data/unidata 2592 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2593 cd ../../test/testdata 2594 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2595 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 2596 2597* run & fix ICU4J tests 2598 2599---------------------------------------------------------------------------- *** 2600 2601Unicode 9.0 update for ICU 58 2602 2603* Command-line environment setup 2604 2605ICU_ROOT=~/svn.icu/trunk 
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2653 2654- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt 2655 and copy to $UNIDATA 2656 cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA 2657 2658* preparseucd.py changes 2659- remove or add new Unicode scripts from/to the 2660 only-in-ISO-15924 list according to the error messages: 2661 ValueError: remove ['Tang'] from _scripts_only_in_iso15924 2662 ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD 2663 ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD 2664 ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD 2665 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 2666 and in com.ibm.icu.dev.test.lang.TestUScript.java 2667- DerivedNumericValues.txt new numeric values 2668 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH 2669 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH 2670 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS 2671 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH 2672 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS 2673 -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(), 2674 uchar.c, UCharacterProperty.java 2675 to support a new series of values 2676- adjust preparseucd.py for Tangut algorithmic names 2677 in ppucd.txt: 2678 algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH- 2679 -> 2680 algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH- 2681- avoid block-compressing most String/Miscellaneous property values, 2682 triggered by genprops not coping with a multi-code point Case_Folding on 2683 block;1C80..1C8F;...;Cased;cf=0442;CWCF;... 2684 keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors 2685 2686* PropertyAliases.txt changes 2687- 1 new property PCM=Prepended_Concatenation_Mark 2688 Ignore: Only useful for layout engines. 2689 Ok to list in ppucd.txt. 
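A minimal sketch of the Tangut algorithmic-name rule noted under the preparseucd.py changes above: the name is the prefix followed by the code point in uppercase hexadecimal. The range and the prefix come from the algnamesrange line quoted above; the function name and the bounds check are illustrative assumptions and are not taken from preparseucd.py.

```python
# Hypothetical helper for illustration only; not part of preparseucd.py.
TANGUT_FIRST = 0x17000  # start of the algnamesrange quoted in the log
TANGUT_LAST = 0x187EC   # end of the algnamesrange quoted in the log
TANGUT_PREFIX = "TANGUT IDEOGRAPH-"


def tangut_name(code_point):
    """Build the algorithmic name for a code point in the Tangut range."""
    if not TANGUT_FIRST <= code_point <= TANGUT_LAST:
        raise ValueError("U+%04X is outside the Tangut range" % code_point)
    # Algorithmic names append the code point as uppercase hex digits.
    return TANGUT_PREFIX + format(code_point, "04X")


print(tangut_name(0x17000))  # TANGUT IDEOGRAPH-17000
```

Before the fix described above, the same range would have produced names with the "CJK UNIFIED IDEOGRAPH-" prefix, which is why the prefix had to be adjusted per range.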
2690 2691* PropertyValueAliases.txt new property values 2692 blk; Adlam ; Adlam 2693 blk; Bhaiksuki ; Bhaiksuki 2694 blk; Cyrillic_Ext_C ; Cyrillic_Extended_C 2695 blk; Glagolitic_Sup ; Glagolitic_Supplement 2696 blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation 2697 blk; Marchen ; Marchen 2698 blk; Mongolian_Sup ; Mongolian_Supplement 2699 blk; Newa ; Newa 2700 blk; Osage ; Osage 2701 blk; Tangut ; Tangut 2702 blk; Tangut_Components ; Tangut_Components 2703 -> add to uchar.h 2704 use long property names for enum constants 2705 -> add to UCharacter.UnicodeBlock IDs 2706 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 2707 replace public static final int \1_ID = \2; \3 2708 -> add to UCharacter.UnicodeBlock objects 2709 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 2710 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 2711 2712 GCB; EB ; E_Base 2713 GCB; EBG ; E_Base_GAZ 2714 GCB; EM ; E_Modifier 2715 GCB; GAZ ; Glue_After_Zwj 2716 GCB; ZWJ ; ZWJ 2717 -> uchar.h & UCharacter.GraphemeClusterBreak 2718 2719 jg ; African_Feh ; African_Feh 2720 jg ; African_Noon ; African_Noon 2721 jg ; African_Qaf ; African_Qaf 2722 -> uchar.h & UCharacter.JoiningGroup 2723 2724 lb ; EB ; E_Base 2725 lb ; EM ; E_Modifier 2726 lb ; ZWJ ; ZWJ 2727 -> uchar.h & UCharacter.LineBreak 2728 2729 sc ; Adlm ; Adlam 2730 sc ; Bhks ; Bhaiksuki 2731 sc ; Marc ; Marchen 2732 sc ; Newa ; Newa 2733 sc ; Osge ; Osage 2734 sc ; Tang ; Tangut 2735 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript 2736 2737 WB ; EB ; E_Base 2738 WB ; EBG ; E_Base_GAZ 2739 WB ; EM ; E_Modifier 2740 WB ; GAZ ; Glue_After_Zwj 2741 WB ; ZWJ ; ZWJ 2742 -> uchar.h & UCharacter.WordBreak 2743 2744* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 2745 (not strictly necessary for NOT_ENCODED scripts) 2746 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h 
~/svn.cldr/trunk/common/properties/scriptMetadata.txt 2747 2748* generate normalization data files 2749 cd $ICU_ROOT/dbg 2750 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource 2751 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 2752 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 2753 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 2754 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 2755 2756* build ICU (make install) 2757 so that the tools build can pick up the new definitions from the installed header files. 2758 2759 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt 2760 2761* build Unicode tools using CMake+make 2762 2763~/svn.icutools/trunk/src/unicode/c/icudefs.txt: 2764 2765 # Location (--prefix) of where ICU was installed. 2766 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) 2767 # Location of the ICU source tree. 
2768 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) 2769 2770 ~/svn.icutools/trunk/dbg/unicode/c$ 2771 cmake ../../../src/unicode/c 2772 make 2773 2774* generate core properties data files 2775 ~/svn.icutools/trunk/dbg/unicode/c$ 2776 genprops/genprops $ICU_SRC_DIR 2777 genuca/genuca --hanOrder implicit $ICU_SRC_DIR 2778 genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR 2779- rebuild ICU (make install) & tools 2780 2781* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 2782 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 2783- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 2784- Unicode 6.0..9.0: U+2260, U+226E, U+226F 2785- nothing new in 9.0, no test file to update 2786 2787* run & fix ICU4C tests 2788- Andy handles RBBI & spoof check test failures 2789 2790* collation: CLDR collation root, UCA DUCET 2791 2792- UCA DUCET goes into Mark's Unicode tools, see 2793 https://sites.google.com/site/unicodetools/home#TOC-UCA 2794- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ 2795 cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/ 2796 2797- cd (CLDR UCA branch)/common/uca/ 2798- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 2799 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt 2800- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 2801 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt 2802 (note removing the underscore before "Rules") 2803 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 2804- restore TODO diffs in UCARules.txt 2805 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 2806- update (ICU4C)/source/test/testdata/CollationTest_*.txt 2807 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 2808 from the CLDR root files 
(..._CLDR_..._SHORT.txt) 2809 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 2810 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 2811 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data 2812- if CLDR common/uca/unihan-index.txt changes, then update 2813 CLDR common/collation/root.xml <collation type="private-unihan"> 2814 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt 2815 2816- run genuca, see command line above; 2817 deal with 2818 Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt: 2819 FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible) 2820 (add the character to genuca.cpp sampleCharsToScripts[]) 2821 + look up the USCRIPT_ code for the new sample characters 2822 (should be obvious from the comment in the error output) 2823 + *add* mappings to sampleCharsToScripts[], do not replace them 2824 (in case the script sample characters flip-flop) 2825 + insert new scripts in DUCET script order, see the top_byte table 2826 at the beginning of FractionalUCA.txt 2827- rebuild ICU4C 2828 2829* Unihan collators 2830- run Unicode Tools 2831 org.unicode.draft.GenerateUnihanCollators 2832 with VM arguments 2833 -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk 2834 -DOTHER_WORKSPACE=/home/mscherer/svn.unitools 2835 -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data 2836 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk 2837 -DUVERSION=9.0.0 2838 -ea 2839- run Unicode Tools 2840 org.unicode.draft.GenerateUnihanCollatorFiles 2841 with the same arguments 2842- check CLDR diffs 2843 cd ~/svn.cldr/trunk 2844 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 2845 meld common/transforms/Han-Latin.xml 
../Generated/cldr/han/replace/Han-Latin.xml 2846- copy to CLDR 2847 cd ~/svn.cldr/trunk 2848 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 2849 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 2850- commit to CLDR 2851- generate ICU zh collation data: run CLDR 2852 org.unicode.cldr.icu.NewLdml2IcuConverter 2853 with program arguments 2854 -t collation 2855 -s /home/mscherer/svn.cldr/trunk/common/collation 2856 -m /home/mscherer/svn.cldr/trunk/common/supplemental 2857 -d /home/mscherer/svn.icu/trunk/src/source/data/coll 2858 -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation 2859 zh 2860 and VM arguments 2861 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk 2862- rebuild ICU4C 2863 2864* run & fix ICU4C tests, now with new CLDR collation root data 2865- run all tests with the collation test data *_SHORT.txt or the full files 2866 (the full ones have comments, useful for debugging) 2867- note on intltest: if collate/UCAConformanceTest fails, then 2868 utility/MultithreadTest/TestCollators will fail as well; 2869 fix the conformance test before looking into the multi-thread test 2870 2871* update Java data files 2872- refresh just the UCD/UCA-related/derived files, just to be safe 2873- see (ICU4C)/source/data/icu4j-readme.txt 2874- mkdir /tmp/icu4j 2875- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2876 output: 2877 ... 
2878 Unicode .icu files built to ./out/build/icudt58l 2879 echo timestamp > uni-core-data 2880 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b 2881 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b 2882 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 2883 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b 2884 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b" 2885 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/ 2886 mkdir -p /tmp/icu4j/main/shared/data 2887 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 2888 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/ 2889 mkdir -p /tmp/icu4j/main/shared/data 2890 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 2891 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' 2892- copy the big-endian Unicode data files to another location, 2893 separate from the other data files, 2894 and then refresh ICU4J 2895 cd ~/svn.icu/trunk/dbg/data/out/icu4j 2896 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 2897 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2898 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2899 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2900 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 2901 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 2902 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 
2903 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 2904 jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 2905 2906* When refreshing all of ICU4J data from ICU4C 2907- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 2908- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 2909or 2910- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 2911 2912* update CollationFCD.java 2913 + copy & paste the initializers of lcccIndex[] etc. from 2914 ICU4C/source/i18n/collationfcd.cpp to 2915 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 2916 2917* refresh Java test .txt files 2918- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 2919 cd $ICU_SRC_DIR/source/data/unidata 2920 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 2921 cd ../../test/testdata 2922 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 2923 cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 2924 2925* run & fix ICU4J tests 2926 2927*** LayoutEngine script information 2928 2929* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 2930 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 2931 in the working directory. 2932 2933 (It also generates ScriptRunData.cpp, which is no longer needed.) 2934 2935 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages 2936 (a plain text file) 2937 which maps ICU versions to the numbers of script/language constants 2938 that were added then. 
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
    find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
    replace  public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
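The find/replace pair given above for com.ibm.icu.lang.UScript can also be applied outside an IDE. The following is a minimal sketch using Python's re module with the same pattern and replacement; the function name and the sample input line are illustrative assumptions (the line is written in the style of uscript.h, not copied from it).

```python
import re

# The find/replace pair from the log, expressed in Python re syntax.
FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACE = r"public static final int \1 = \2; \3"


def to_java_constant(c_enum_line):
    """Turn one C enum line into a Java int constant declaration."""
    # Only the matched part is rewritten; leading indentation is kept.
    return FIND.sub(REPLACE, c_enum_line)


# Illustrative input; the spacing and comment are assumed for this example.
print(to_java_constant("USCRIPT_ADLAM = 167,/* Adlm */"))
# public static final int ADLAM = 167; /* Adlm */
```

Lines that do not match the pattern pass through unchanged, so the function can be mapped over a whole copied enum block and the result reviewed by hand.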

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt57b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
see https://unicode-org.atlassian.net/browse/ICU-12141

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

---------------------------------------------------------------------------- ***

Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802

Edit preparseucd.py to add & parse new properties.
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
3098 (see https://sites.google.com/site/unicodetools/inputdata) 3099- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 3100- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src 3101- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 3102 3103- also: from http://unicode.org/Public/security/8.0.0/ download new 3104 confusables.txt & confusablesWholeScript.txt 3105 and copy to $UNIDATA 3106 ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA 3107 ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA 3108 3109* initial preparseucd.py changes 3110- remove new Unicode scripts from the 3111 only-in-ISO-15924 list according to the error message: 3112 ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw'] 3113 from _scripts_only_in_iso15924 3114 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 3115 and in com.ibm.icu.dev.test.lang.TestUScript.java 3116- property and file name change: 3117 IndicMatraCategory -> IndicPositionalCategory 3118- UnicodeData.txt unusual numeric values (improper fractions) 3119 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;; 3120 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;; 3121 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;; 3122 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;; 3123 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;; 3124 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; 3125 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;; 3126 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;; 3127 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;; 3128 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;; 3129 -> change preparseucd.py to map them to proper fractions 
(e.g., 1/6) 3130 which are listed in DerivedNumericValues.txt; 3131 keeps storage in data file simple 3132 3133* PropertyValueAliases.txt changes 3134- 10 new Block (blk) values: 3135 blk; Ahom ; Ahom 3136 blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs 3137 blk; Cherokee_Sup ; Cherokee_Supplement 3138 blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E 3139 blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform 3140 blk; Hatran ; Hatran 3141 blk; Multani ; Multani 3142 blk; Old_Hungarian ; Old_Hungarian 3143 blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs 3144 blk; Sutton_SignWriting ; Sutton_SignWriting 3145 -> add to uchar.h 3146 use long property names for enum constants 3147 -> add to UCharacter.UnicodeBlock IDs 3148 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 3149 replace public static final int \1_ID = \2; \3 3150 -> add to UCharacter.UnicodeBlock objects 3151 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 3152 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 3153- 6 new Script (sc) values: 3154 sc ; Ahom ; Ahom 3155 sc ; Hatr ; Hatran 3156 sc ; Hluw ; Anatolian_Hieroglyphs 3157 sc ; Hung ; Old_Hungarian 3158 sc ; Mult ; Multani 3159 sc ; Sgnw ; SignWriting 3160 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript 3161 3162* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 3163 (not strictly necessary for NOT_ENCODED scripts) 3164 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt 3165 3166* generate normalization data files 3167 cd $ICU_ROOT/dbg 3168 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource 3169 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 3170 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 3171 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm 
-s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 3172 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 3173 3174* build ICU (make install) 3175 so that the tools build can pick up the new definitions from the installed header files. 3176 3177 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt 3178 3179* build Unicode tools using CMake+make 3180 3181~/svn.icutools/trunk/src/unicode/c/icudefs.txt: 3182 3183 # Location (--prefix) of where ICU was installed. 3184 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) 3185 # Location of the ICU source tree. 3186 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) 3187 3188 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c 3189 ~/svn.icutools/trunk/dbg/unicode/c$ make 3190 3191* generate core properties data files 3192- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR 3193- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR 3194- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR 3195- rebuild ICU (make install) & tools 3196- run genuca again (see step above) so that it picks up the new nfc.nrm 3197- rebuild ICU (make install) & tools 3198 3199* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 3200 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 3201- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 3202- Unicode 6.0..8.0: U+2260, U+226E, U+226F 3203- nothing new in 8.0, no test file to update 3204 3205* run & fix ICU4C tests 3206- bad Cherokee case folding due to difference in fallbacks: 3207 UCD case folding falls back to no mapping, 3208 ICU runtime case folding falls back to lowercasing; 3209 fixed casepropsbuilder.cpp to generate scf mappings to self 3210 when there is an slc mapping but no scf 3211- Andy handles RBBI & spoof check test failures 3212 
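The Cherokee case-folding fix described above (generate scf mappings to self when there is an slc mapping but no scf) can be sketched as follows. This is an illustration of the rule only, not the casepropsbuilder.cpp code; the function name and the dictionary representation are assumptions, and the sample code points reflect my understanding of the Unicode 8.0 Cherokee data and should be checked against UnicodeData.txt and CaseFolding.txt.

```python
def add_self_folding(simple_lowercase, simple_case_folding):
    """Return a copy of the scf map with scf[c] = c wherever c has slc but no scf.

    Without the explicit self-mapping, a runtime that falls back to
    lowercasing would fold such a character to its lowercase form,
    which disagrees with the UCD fallback of "no mapping".
    """
    result = dict(simple_case_folding)
    for code_point in simple_lowercase:
        if code_point not in result:
            result[code_point] = code_point
    return result


# Assumed sample data: U+13A0 lowercases to U+AB70, and U+AB70 folds to U+13A0.
slc = {0x13A0: 0xAB70}
scf = {0xAB70: 0x13A0}
fixed = add_self_folding(slc, scf)
print(sorted("%04X->%04X" % item for item in fixed.items()))
```

With the self-mapping in place, both the uppercase and the small letter fold to the same code point regardless of which fallback the runtime uses.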
3213* collation: CLDR collation root, UCA DUCET 3214 3215- UCA DUCET goes into Mark's Unicode tools, see 3216 https://sites.google.com/site/unicodetools/home#TOC-UCA 3217- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ 3218- cd (CLDR UCA branch)/common/uca/ 3219- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 3220 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt 3221- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 3222 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt 3223 (note removing the underscore before "Rules") 3224 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 3225- restore TODO diffs in UCARules.txt 3226 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 3227- update (ICU4C)/source/test/testdata/CollationTest_*.txt 3228 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 3229 from the CLDR root files (..._CLDR_..._SHORT.txt) 3230 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 3231 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 3232 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data 3233- if CLDR common/uca/unihan-index.txt changes, then update 3234 CLDR common/collation/root.xml <collation type="private-unihan"> 3235 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt 3236- run genuca, see command line above; 3237 deal with 3238 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt 3239 (add the character to genuca.cpp sampleCharsToScripts[]) 3240 + look up the script for the new sample characters 3241 (e.g., in FractionalUCA.txt) 3242 + *add* mappings to 
sampleCharsToScripts[], do not replace them 3243 (in case the script sample characters flip-flop) 3244 + insert new scripts in DUCET script order, see the top_byte table 3245 at the beginning of FractionalUCA.txt 3246- rebuild ICU4C 3247 3248* run & fix ICU4C tests, now with new CLDR collation root data 3249- run all tests with the collation test data *_SHORT.txt or the full files 3250 (the full ones have comments, useful for debugging) 3251- note on intltest: if collate/UCAConformanceTest fails, then 3252 utility/MultithreadTest/TestCollators will fail as well; 3253 fix the conformance test before looking into the multi-thread test 3254- fixed bug in CollationWeights::getWeightRanges() 3255 exposed by new data and CollationTest::TestRootElements 3256 3257* update Java data files 3258- refresh just the UCD/UCA-related/derived files, just to be safe 3259- see (ICU4C)/source/data/icu4j-readme.txt 3260- mkdir /tmp/icu4j 3261- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3262 output: 3263 ... 
3264 Unicode .icu files built to ./out/build/icudt56l 3265 echo timestamp > uni-core-data 3266 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b 3267 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b 3268 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 3269 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b 3270 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" 3271 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ 3272 mkdir -p /tmp/icu4j/main/shared/data 3273 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 3274 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ 3275 mkdir -p /tmp/icu4j/main/shared/data 3276 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 3277 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' 3278- copy the big-endian Unicode data files to another location, 3279 separate from the other data files, 3280 and then refresh ICU4J 3281 cd ~/svn.icu/trunk/dbg/data/out/icu4j 3282 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 3283 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3284 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3285 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3286 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 3287 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 3288 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 
3289 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 3290 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 3291 3292* When refreshing all of ICU4J data from ICU4C 3293- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3294- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 3295or 3296- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 3297 3298* update CollationFCD.java 3299 + copy & paste the initializers of lcccIndex[] etc. from 3300 ICU4C/source/i18n/collationfcd.cpp to 3301 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 3302 3303* refresh Java test .txt files 3304- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 3305 cd $ICU_SRC_DIR/source/data/unidata 3306 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 3307 cd ../../test/testdata 3308 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 3309 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode 3310 3311* run & fix ICU4J tests 3312 3313*** LayoutEngine script information 3314 3315* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, 3316 because the layout engine was deprecated in ICU 54. 3317 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java 3318 to write lines that we used to add manually. 3319 3320* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 3321 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 3322 in the working directory. 

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ; Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
  (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
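
  The USCRIPT_ find/replace pair noted above for com.ibm.icu.lang.UScript can also be
  applied outside of Eclipse. The following is a minimal sketch in Python, assuming the
  Eclipse pattern carries over unchanged to Python's re syntax. The function name and the
  sample input line (including its number and comment) are hypothetical, not taken from uscript.h.

```python
import re

# Pattern and replacement copied from the find/replace note in this log;
# only the regex engine changes (Eclipse -> Python's re module).
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def uscript_enum_to_java(c_line: str) -> str:
    """Rewrite one uscript.h enum line as a UScript.java int constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, c_line)


# Hypothetical input: the constant name, value, and comment are made up.
sample = "USCRIPT_EXAMPLE_SCRIPT = 999,/* Exmp */"
print(uscript_enum_to_java(sample))
```

  The converted lines still need manual review, as with the Eclipse replacement.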

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
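
  The LayoutEngine step above asks to fix mixed line endings after copying the generated
  files. The following is a rough sketch of one way to detect and normalize them in Python.
  Both function names are hypothetical, and normalizing to LF is an assumption
  (the command lines in this log are as used on Linux); this log does not say which tool was used.

```python
def has_mixed_line_endings(data: bytes) -> bool:
    """Report whether more than one line-ending style (CRLF, CR, LF) occurs."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf   # CR not followed by LF
    lone_lf = data.count(b"\n") - crlf   # LF not preceded by CR
    return sum(1 for n in (crlf, lone_cr, lone_lf) if n > 0) > 1


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF (assumed target style)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up sample content standing in for a generated header file.
sample = b"line one\r\nline two\nline three\n"
print(has_mixed_line_endings(sample))
print(has_mixed_line_endings(normalize_line_endings(sample)))
```

  Work on bytes rather than text so that no other characters in the file are re-encoded.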

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
    bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
    bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
    bpt; c ; Close
    bpt; n ; None
    bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
    bc ; FSI ; First_Strong_Isolate
    bc ; LRI ; Left_To_Right_Isolate
    bc ; RLI ; Right_To_Left_Isolate
    bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
    WB ; HL ; Hebrew_Letter
    WB ; SQ ; Single_Quote
    WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
    Aghb 239 Caucasian Albanian
    Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
    lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
    WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
    GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
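
The four gennorm2 invocations in this section differ only in the output file
and in the list of input files, each of which layers on top of nfc.txt.
A minimal Python sketch that builds the same command lines from a table;
the function name gennorm2_commands and the table layout are illustrative
and not part of the ICU tools, and the two directories are passed in
rather than assumed:

```python
# Hypothetical helper: builds the four gennorm2 command lines from this log.
# The output/input pairings are copied from the commands above;
# nothing is executed here.
NORM2_FILES = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata):
    """Return one argv list per .nrm file, in the order used in this log."""
    commands = []
    for output, inputs in NORM2_FILES:
        argv = ["bin/gennorm2", "-o", f"{src_data_in}/{output}",
                "-s", f"{unidata}/norm2"] + inputs
        commands.append(argv)
    return commands
```

Each returned list could be handed to subprocess.run() from the ICU build
directory, which is where the log runs bin/gennorm2.
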
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
4027 Unicode .icu files built to ./out/build/icudt50l 4028 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b 4029 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b 4030 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 4031 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b 4032 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b" 4033 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/ 4034 mkdir -p /tmp/icu4j/main/shared/data 4035 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 4036 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/ 4037 mkdir -p /tmp/icu4j/main/shared/data 4038 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 4039 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data' 4040- copy the big-endian Unicode data files to another location, 4041 separate from the other data files 4042 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 4043 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr 4044 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b 4045 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu 4046 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b 4047 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 4048 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp 
com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr 4049- refresh ICU4J 4050 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b 4051 4052* refresh Java test .txt files 4053- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 4054 4055* UCA 4056 4057- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ 4058- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that 4059- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 4060- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 4061 (note removing the underscore before "Rules") 4062- update (ICU4C)/source/test/testdata/CollationTest_*.txt 4063 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 4064 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 4065- check test file diffs for previously commented-out, known-failing data lines; 4066 probably need to keep those commented out 4067- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 4068- run genuca, see command line above 4069- rebuild ICU4C 4070- refresh ICU4J collation data: 4071 (subset of instructions above for properties data refresh, except copies all coll/*) 4072 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4073 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 4074 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 4075 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b 4076- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 4077- note on intltest: if collate/UCAConformanceTest fails, then 4078 
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- https://icu.unicode.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
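
The bookkeeping for ISO-only script codes amounts to set arithmetic.
The following is a hypothetical Python helper, not code from preparseucd.py;
the real _scripts_only_in_iso15924 variable may be structured differently,
and the function and parameter names are made up for illustration:

```python
def update_iso_only_scripts(iso_only, iso_codes, unicode_codes):
    """Return the new ISO-only set plus what was added and removed.

    iso_only:      codes currently listed as ISO-15924-only
    iso_codes:     all codes currently registered in ISO 15924
    unicode_codes: codes that Unicode now defines as Script values
    """
    # A code is ISO-only if ISO 15924 has it and Unicode does not.
    new_iso_only = set(iso_codes) - set(unicode_codes)
    # New ISO registrations that need to be added to the list.
    added = new_iso_only - set(iso_only)
    # Codes that Unicode has picked up and that must leave the list.
    removed = set(iso_only) - new_iso_only
    return new_iso_only, added, removed
```

For example, if the list holds Cakm and Afak, ISO 15924 adds Khoj, and
Unicode adds Cakm, then Khoj is reported as added and Cakm as removed.
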

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
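
One way to spot previously commented-out data lines that a refreshed test
file would silently re-enable is to compare the old and new versions.
A minimal Python sketch; it assumes that known-failing lines were disabled
by prefixing them with '#', and the function name is hypothetical:

```python
def reenabled_lines(old_lines, new_lines, marker="#"):
    """Return new-file lines that the old file had commented out.

    Assumes a disabled line is the original data line with `marker`
    (plus optional spaces) in front of it.
    """
    disabled = set()
    for line in old_lines:
        stripped = line.strip()
        if stripped.startswith(marker):
            body = stripped[len(marker):].strip()
            if body:
                disabled.add(body)
    # A new line is suspicious if its text matches a disabled old line.
    return [line for line in new_lines if line.strip() in disabled]
```

Ordinary comments can produce false positives if their text happens to
equal a data line, so the result is a list to review, not a verdict.
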

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
#  Arabic Extended-A:       U+08A0 -  U+08FF (was default-R)
#  Arabic Mathematical Alphabetic Symbols:
#                          U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
#  Meroitic Hieroglyphs:
#                          U+10980 - U+1099F
#  Meroitic Cursive:       U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
  # Each line has two fields
  # First field:  Code point
  # Second field: Alias
- to
  # Each line has three fields, as described here:
  #
  # First field:  Code point
  # Second field: Alias
  # Third field:  Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
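
The NameAliases.txt change in this section means that a parser has to read
the third field and filter on it. A minimal Python sketch of picking up only
"correction" aliases; this is not the gennames implementation (gennames is a
C tool), and the function name is illustrative:

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type is "correction"."""
    result = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3:
            continue  # old two-field format carries no type; skipped here
        code, alias, alias_type = fields[0], fields[1], fields[2]
        if alias_type == "correction":
            result.setdefault(int(code, 16), []).append(alias)
    return result
```

With the three FEFF lines quoted above as input, the result is empty,
because none of them has the type "correction".
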
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 4318- rebuild ICU & tools 4319 4320* update Java data files 4321- refresh just the UCD-related files, just to be safe 4322- see (ICU4C)/source/data/icu4j-readme.txt 4323- mkdir /tmp/icu4j 4324- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4325 output: 4326 ... 4327 Unicode .icu files built to ./out/build/icudt49l 4328 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b 4329 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b 4330 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 4331 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b 4332 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b" 4333 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/ 4334 mkdir -p /tmp/icu4j/main/shared/data 4335 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 4336 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/ 4337 mkdir -p /tmp/icu4j/main/shared/data 4338 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 4339 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data' 4340- copy the big-endian Unicode data files to another location, 4341 separate from the other data files 4342 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 4343 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr 4344 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b 4345 
~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu 4346 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b 4347 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 4348 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr 4349- refresh ICU4J 4350 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b 4351 4352* refresh Java test .txt files 4353- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 4354 4355* test ICU so far, fix test code where necessary 4356- temporarily ignore collation issues that look like UCA/UCD mismatches, 4357 until UCA data is updated 4358 4359* UCA 4360 4361- get output from Mark's tools; look in 4362 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. 
version>.txt 4363- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 4364- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 4365 (note removing the underscore before "Rules") 4366- update (ICU)/source/test/testdata/CollationTest_*.txt 4367 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 4368 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 4369- check test file diffs for previously commented-out, known-failing data lines; 4370 probably need to keep those commented out 4371- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 4372- run makeuca.sh: 4373 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 4374- rebuild ICU4C 4375- refresh ICU4J collation data: 4376 (subset of instructions above for properties data refresh, except copies all coll/*) 4377 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4378 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 4379 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll 4380 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b 4381- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 4382- note on intltest: if collate/UCAConformanceTest fails, then 4383 utility/MultithreadTest/TestCollators will fail as well; 4384 fix the conformance test before looking into the multi-thread test 4385 4386* When refreshing all of ICU4J data from ICU4C 4387- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4388- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 4389or 4390- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 4391 4392*** LayoutEngine script information 4393 
4394(For details see the Unicode 5.2 change log below.) 4395 4396* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 4397 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 4398 in the working directory. 4399 (It also generates ScriptRunData.cpp, which is no longer needed.) 4400 4401 The generated files have a current copyright date and "@draft" statement. 4402 4403- diff current <icu>/source/layout files vs. generated ones 4404 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 4405 review and manually merge desired changes; 4406 fix gratuitous changes, incorrect @draft and missing aliases; 4407 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 4408- if you just copy the above files, then 4409 fix mixed line endings, review the diffs as above and restore changes to API tags etc.; 4410 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h 4411 4412*** merge the Unicode update branches back onto the trunk 4413- do not merge the icudata.jar and testdata.jar, 4414 instead rebuild them from merged & tested ICU4C 4415 4416---------------------------------------------------------------------------- *** 4417 4418ICU 4.8 (no Unicode update, just new script codes) 4419 4420* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 4421 (added 2010-12-21) 4422 Afak 439 Afaka 4423 Jurc 510 Jurchen 4424 Mroo 199 Mro, Mru 4425 Nshu 499 Nüshu 4426 Shrd 319 Sharada, Śāradā 4427 Sora 398 Sora Sompeng 4428 Takr 321 Takri, Ṭākrī, Ṭāṅkrī 4429 Tang 520 Tangut 4430 Wole 480 Woleai 4431 -> uscript.h 4432 -> com.ibm.icu.lang.UScript 4433 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 4434 replace public static final int \1 = \2;\3 4435 -> genpname/SyntheticPropertyValueAliases.txt 4436 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() 4437 and in 
com.ibm.icu.dev.test.lang.TestUScript.java 4438 4439* run genpname/preparse.pl (on Linux) 4440 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname 4441 + make sure that data.h is writable 4442 + perl preparse.pl ~/svn.icu/trunk/src > out.txt 4443 + preparse.pl shows no errors, out.txt Info and Warning lines look ok 4444 4445* rebuild Unicode tools (at least genpname) using make 4446- You might first need to "make install" ICU so that the tools build can pick 4447 up the new definitions from the installed header files. 4448 4449* run genpname 4450 (builds both pnames.icu and propname_data.h) 4451- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in 4452- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource 4453- rebuild ICU & tools 4454 4455* run genprops 4456- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 4457- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 4458- rebuild ICU & tools 4459 4460* update Java data files 4461- refresh just the UCD-related files, just to be safe 4462- see (ICU4C)/source/data/icu4j-readme.txt 4463- mkdir /tmp/icu4j 4464- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 4465- copy the big-endian Unicode data files to another location, 4466 separate from the other data files 4467 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b 4468 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b 4469 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b 4470- refresh ICU4J 4471 ~/svn.icu/trunk/dbg/data/out/icu4j$ 
jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b 4472 4473* should have updated the layout engine script codes but forgot 4474 4475---------------------------------------------------------------------------- *** 4476 4477Unicode 6.0 update 4478 4479*** related ICU Trac tickets 4480 44817264 Unicode 6.0 Update 4482 4483*** Unicode version numbers 4484- makedata.mak 4485- uchar.h 4486 (configure.in & configure: have been modified to extract the version from uchar.h) 4487- com.ibm.icu.util.VersionInfo 4488 4489*** data files & enums & parser code 4490 4491* file preparation 4492 4493~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed 4494- This now prepares both unidata and testdata files in respective output subfolders. 4495 4496* PropertyAliases.txt changes 4497- new Script_Extensions property defined in the new ScriptExtensions.txt file 4498 but not listed in PropertyAliases.txt; reported to unicode.org; 4499 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt 4500 scx; Script_Extensions 4501 -> uchar.h with new UProperty section 4502 -> com.ibm.icu.lang.UProperty, parallel with uchar.h 4503 4504* PropertyValueAliases.txt changes 4505- 12 new block names: 4506 Alchemical_Symbols 4507 Bamum_Supplement 4508 Batak 4509 Brahmi 4510 CJK_Unified_Ideographs_Extension_D 4511 Emoticons 4512 Ethiopic_Extended_A 4513 Kana_Supplement 4514 Mandaic 4515 Miscellaneous_Symbols_And_Pictographs 4516 Playing_Cards 4517 Transport_And_Map_Symbols 4518 -> add to uchar.h 4519 -> add to UCharacter.UnicodeBlock 4520 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 4521 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 4522- Joining_Group (jg) values: 4523 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias 4524 -> uchar.h & UCharacter.JoiningGroup 4525- 3 new scripts: 4526 sc ; Batk ; 
Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to
"make install" ICU so that the tools build can pick 4576 up the new definitions from the installed header files. 4577 4578* run genpname 4579- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in 4580- rebuild ICU & tools 4581 4582* update source/data/unidata/norm2/nfkc_cf.txt 4583- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt 4584 4585* update source/data/unidata/norm2/uts46.txt 4586- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt 4587 to ~/svn.icu/tools/trunk/src/unicode/py 4588- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values 4589- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py 4590- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 4591 4592* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 4593 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 4594- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 4595- Unicode 6.0: U+2260, U+226E, U+226F 4596 4597* generate core properties data files 4598- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 4599- rebuild ICU & tools 4600- run makeuca.sh so that genuca picks up the new nfc.nrm: 4601 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 4602- rebuild ICU & tools 4603 4604* implement new Script_Extensions property (provisional) 4605- parser & generator: genprops & uprops.icu 4606- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp 4607- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java 4608 4609* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2 4610- (one-time change) 4611- genbidi/gencase/genprops tools changes 
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
  copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
  to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
    http://www.macchiato.com/unicode/utc/additional-uca-files
    http://www.unicode.org/Public/UCA/6.0.0/
    http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C, do either
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties:
    Cased ; Cased
    CI ; Case_Ignorable
    CWCF ; Changes_When_Casefolded
    CWCM ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL ; Changes_When_Lowercased
    CWT ; Changes_When_Titlecased
    CWU ; Changes_When_Uppercased
  new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
  new block names
  new scripts
  one script code change:
    sc ; Qaai ; Inherited
    ->
    sc ; Zinh ; Inherited ; Qaai
  new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis
  new Joining_Group (jg) values: Farsi_Yeh, Nya
  other new values:
    ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
  new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
  all of the ISO comments are gone
  new CJK block end:
    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
  new CJK block:
    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
    find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
    replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
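
Sketch for the hardcoded Unihan range search described above (regex 9FC[34],
case-insensitive). This is only an illustration of the search step, not a tool
that ships with ICU: the function name, the default suffix list and the choice
of Python are assumptions. Every hit still needs a manual review, since some
occurrences must stay unchanged (see the Unicode 4.1 notes on BOCU and GB 18030).

```python
import re
from pathlib import Path

# Matches the old range end (9FC3) and the old limit (9FC4) in any letter case.
OLD_UNIHAN_END = re.compile(r"9FC[34]", re.IGNORECASE)

# Hypothetical default: file types that are worth scanning in a source tree.
DEFAULT_SUFFIXES = (".c", ".cpp", ".h", ".java", ".txt")


def find_hardcoded_range(root, pattern=OLD_UNIHAN_END, suffixes=DEFAULT_SUFFIXES):
    """Yield (path, line_number, line) for each matching line under root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as source:
            for number, line in enumerate(source, start=1):
                if pattern.search(line):
                    yield path, number, line.rstrip("\n")
```

For a later update, pass a different pattern, for example
re.compile(r"9FB[BC]", re.IGNORECASE) for the 9FBB/9FBC pair.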

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
  and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
  - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
    on Windows:
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice and
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
     find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
     replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
   "I think the tool has been modified to update @draft to @stable for
   older scripts and to add @draft for new scripts.
   (I worked with an intern on this last year.)
   You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
   "This is just a matter of making sure that all the per-script tables have
   entries for any new scripts that were added.
   If any new Indic characters were added, then the class tables in
   IndicClassTables.cpp should be updated to reflect this.
   John Emmons should know how to do this if it's required."
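
Side note on the Visual Studio regex in the Java step further up
(UCharacter.UnicodeBlock constants): for anyone without that editor, the same
find/replace can be approximated with Python's re module. This is a sketch
under assumptions, not part of the ICU tools: the function name is made up,
and the sample input line in the comment is only illustrative.

```python
import re

# Python spelling of the Visual Studio pattern
#   UBLOCK_{[^ ]+} = [0-9]+, {/.+}
# where {...} tagged groups become (...) capture groups.
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")

# Python spelling of the replacement; \g<1> avoids ambiguity before "_ID".
JAVA_TEMPLATE = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def ublock_to_java(line):
    """Rewrite one uchar.h UBLOCK_ enum line as a Java UnicodeBlock constant.

    Lines that do not match the pattern are returned unchanged.
    Illustrative input: "    UBLOCK_SAMARITAN = 172, /*[0800]*/"
    """
    return UBLOCK_LINE.sub(JAVA_TEMPLATE, line)
```

The _ID integer constants and COUNT from step a) still have to be added by hand.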

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  No script code above 127 is assigned to any Unicode character yet,
  so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
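
Sketch of what "round-trips" means for the u_charMirror() check above, shown at
the data level rather than through the C API. It assumes input lines in the
BidiMirroring.txt layout ("0028; 0029 # LEFT PARENTHESIS"); the function names
are hypothetical, and the real ICU test calls u_charMirror() itself.

```python
def parse_bidi_mirroring(lines):
    """Map each code point to its mirror, from BidiMirroring.txt-style lines."""
    mirrors = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        source, target = (int(field.strip(), 16) for field in data.split(";"))
        mirrors[source] = target
    return mirrors


def non_round_tripping(mirrors):
    """Return the sorted code points c for which mirror(mirror(c)) != c.

    A code point without an entry is treated as having no mirror, so a
    mapping whose target has no entry of its own is reported as a failure.
    """
    return sorted(c for c, m in mirrors.items() if mirrors.get(m) != c)
```

An empty result from non_round_tripping() means every mapping has its inverse.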

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
    + do not modify BOCU/BOCSU code because that would change the encoding
      and break binary compatibility!
    + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
      NamePrepProfile.txt
    + ignore trietest.c: test data is arbitrary
    + ignore tstnorm.cpp: test optimization, not important
    + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
    + do change line_th.txt and word_th.txt
      by replacing hardcoded ranges with the new property values
    + do change gennames.c

source\data\brkitr\line_th.txt(229):    \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23):    \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971):    0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
    + put short name comment only on line with new constant
      for genpname perl script parser
- new binary properties
    + STerm
    + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
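
The DerivedNormalizationProps.txt format change tracked in gennorm.c above can be illustrated
with a small parser that accepts both spellings. This is a Python sketch, not the gennorm.c code.
The function name parse_line is hypothetical, and the handling of "_MAYBE" names as "_QC; M"
is an assumption made by analogy with the "NFD_NO" -> "NFD_QC; N" case named in this log.

```python
def parse_line(line):
    """Parse one data line into (start, end, prop, value).

    Returns None for blank lines and comment-only lines.
    Old-style names are normalized to the newer two-field form.
    """
    # Drop trailing comments and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None

    fields = [f.strip() for f in line.split(";")]

    # First field: a single code point or a start..end range, in hex.
    start_text, _, end_text = fields[0].partition("..")
    start = int(start_text, 16)
    end = int(end_text, 16) if end_text else start

    prop = fields[1]
    value = fields[2] if len(fields) > 2 else None

    if prop == "FNC":
        # Renamed property; the mapping stays in the value field.
        prop = "FC_NFKC"
    elif value is None and prop.endswith("_NO"):
        # Old single field "NFD_NO" becomes two fields "NFD_QC; N".
        prop, value = prop[:-len("_NO")] + "_QC", "N"
    elif value is None and prop.endswith("_MAYBE"):
        # Assumption: "_MAYBE" names map to "_QC; M" in the same way.
        prop, value = prop[:-len("_MAYBE")] + "_QC", "M"

    return start, end, prop, value


old_style = parse_line("00C0..00C5    ; NFD_NO # sample comment")
new_style = parse_line("00C0..00C5    ; NFD_QC; N # sample comment")
```

With this normalization, old_style and new_style compare equal, so the code downstream of
the parser only has to deal with the newer property names.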

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
    + assume that the type is always numeric for Han characters,
      and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ