* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others.  All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer

* change log for Unicode updates

For an overview, see https://unicode-org.github.io/icu/processes/unicode-update

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Normally, add new script codes as part of a Unicode update.
See https://unicode-org.github.io/icu/processes/release/tasks/standards#update-script-code-enums
and see the change logs below.

---------------------------------------------------------------------------- ***

Unicode 14.0 update for ICU 70

https://www.unicode.org/versions/Unicode14.0.0/
https://www.unicode.org/versions/beta-14.0.0.html
https://www.unicode.org/Public/14.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-27.html

https://unicode-org.atlassian.net/browse/CLDR-14801
https://unicode-org.atlassian.net/browse/ICU-21635

* Command-line environment setup

export UNICODE_DATA=~/unidata/uni14/20210903
export CLDR_SRC=~/cldr/uni/src
export ICU_ROOT=~/icu/uni
export ICU_SRC=$ICU_ROOT/src
export ICUDT=icudt70b
export ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
export ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- same as for the early Unicode Tools setup and data refresh:
  https://github.com/unicode-org/unicodetools/blob/main/docs/index.md
  https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/mine/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the UCD/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata
    or
      cp ~/unitools/mine/Generated/UCD/d19/cldr/GraphemeBreakTest-cldr-14.0.0d19.txt icu4c/source/test/testdata/GraphemeBreakTest.txt

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://github.com/unicode-org/unicodetools/blob/main/docs/inputdata.md)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Toto', u'Tangsa', u'Cypro_Minoan', u'Arabic_Ext_B', u'Vithkuqi', u'Old_Uyghur', u'Latin_Ext_F', u'UCAS_Ext_A', u'Kana_Ext_B', u'Ethiopic_Ext_B', u'Latin_Ext_G', u'Znamenny_Music'])),
       (u'jg', set([u'Vertical_Tail', u'Thin_Yeh'])),
       (u'sc', set([u'Toto', u'Ougr', u'Vith', u'Tnsa', u'Cpmn']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    ~/unidata$ diff -u uni13/20200304/ucd/PropertyValueAliases.txt uni14/20210609/ucd/PropertyValueAliases.txt | egrep '^[-+][a-zA-Z]'
    +age; 14.0 ; V14_0
    +blk; Arabic_Ext_B ; Arabic_Extended_B
    +blk; Cypro_Minoan ; Cypro_Minoan
    +blk; Ethiopic_Ext_B ; Ethiopic_Extended_B
    +blk; Kana_Ext_B ; Kana_Extended_B
    +blk; Latin_Ext_F ; Latin_Extended_F
    +blk; Latin_Ext_G ; Latin_Extended_G
    +blk; Old_Uyghur ; Old_Uyghur
    +blk; Tangsa ; Tangsa
    +blk; Toto ; Toto
    +blk; UCAS_Ext_A ; Unified_Canadian_Aboriginal_Syllabics_Extended_A
    +blk; Vithkuqi ; Vithkuqi
    +blk; Znamenny_Music ; Znamenny_Musical_Notation
    +jg ; Thin_Yeh ; Thin_Yeh
    +jg ; Vertical_Tail ; Vertical_Tail
    +sc ; Cpmn ; Cypro_Minoan
    +sc ; Ougr ; Old_Uyghur
    +sc ; Tnsa ; Tangsa
    +sc ; Toto ; Toto
    +sc ; Vith ; Vithkuqi
  -> add new blocks to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
    ~/unidata$ diff -u uni13/20200304/ucd/Blocks.txt uni14/20210609/ucd/Blocks.txt | egrep '^[-+][0-9A-Z]'
    +0870..089F; Arabic Extended-B
    +10570..105BF; Vithkuqi
    +10780..107BF; Latin Extended-F
    +10F70..10FAF; Old Uyghur
    -11700..1173F; Ahom
    +11700..1174F; Ahom
    +11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
    +12F90..12FFF; Cypro-Minoan
    +16A70..16ACF; Tangsa
    -18D00..18D8F; Tangut Supplement
    +18D00..18D7F; Tangut Supplement
    +1AFF0..1AFFF; Kana Extended-B
    +1CF00..1CFCF; Znamenny Musical Notation
    +1DF00..1DFFF; Latin Extended-G
    +1E290..1E2BF; Toto
    +1E7E0..1E7FF; Ethiopic Extended-B
    (ignore blocks whose end code point changed)
  -> add new blocks to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add new blocks to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
  -> add new scripts to uscript.h & com.ibm.icu.lang.UScript
     Eclipse find     USCRIPT_([^ ]+) *= ([0-9]+),(/.+)
             replace  public static final int \1 = \2; \3
  -> for new scripts: fix expectedLong names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
  -> add new joining groups to uchar.h & UCharacter.JoiningGroup

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU
  to make sure that there are no syntax errors

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 tests &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* Bazel build process

See https://unicode-org.github.io/icu/processes/unicode-update#bazel-build-process
for an overview and for setup instructions.

Consider running `bazelisk --version` outside of the $ICU_SRC folder
to find out the latest `bazel` version, and
copying that version number into the $ICU_SRC/.bazeliskrc config file.
(Revert if you find incompatibilities, or, better, update our build & config files.)

* generate data files

- remember to define the environment variables
  (see the start of the section for this Unicode version)
- cd $ICU_SRC
- optional but not necessary:
    bazelisk clean
- build/bootstrap/generate new files:
    icu4c/source/data/unidata/generate.sh

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..14.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- update CLDR GraphemeBreakTest.txt
    cd ~/unitools/mine/Generated
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.txt $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    cp UCD/d22d/cldr/GraphemeBreakTest-cldr.html $CLDR_SRC/common/properties/segments/GraphemeBreakTest.html
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt $ICU_SRC/icu4c/source/test/testdata
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools,
  and a tool-tailored version goes into CLDR, see
  https://github.com/unicode-org/unicodetools/blob/main/docs/uca/index.md

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- generate data files, as above (generate.sh), now to pick up new collation data
- update CollationFCD.java:
  copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
- rebuild ICU4C (make clean, make check, as usual)

* Unihan collators
    https://github.com/unicode-org/unicodetools/blob/main/docs/unihan.md
- run Unicode Tools GenerateUnihanCollators & GenerateUnihanCollatorFiles,
  check CLDR diffs, copy to CLDR, test CLDR, ... as documented there
- generate ICU zh collation data
  instructions inspired by
  https://github.com/unicode-org/icu/blob/main/tools/cldr/cldr-to-icu/README.txt and
  https://github.com/unicode-org/icu/blob/main/icu4c/source/data/cldr-icu-readme.txt
  + setup:
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    (didn't work without setting JAVA_HOME,
    nor with the Google default of /usr/local/buildtools/java/jdk
    [Google security limitations in the XML parser])
    export TOOLS_ROOT=~/icu/uni/src/tools
    export CLDR_DIR=~/cldr/uni/src
    export CLDR_DATA_DIR=~/cldr/uni/src
    (pointing to the "raw" data, not cldr-staging/.../production should be ok for the relevant files)
    cd "$TOOLS_ROOT/cldr/lib"
    ./install-cldr-jars.sh "$CLDR_DIR"
  + generate the files we need
    cd "$TOOLS_ROOT/cldr/cldr-to-icu"
    ant -f build-icu-data.xml -DoutDir=/tmp/icu -DoutputTypes=coll,transforms -DlocaleIdFilter='zh.*'
  + diff
    cd $ICU_SRC
    meld icu4c/source/data/coll/zh.txt /tmp/icu/coll/zh.txt
    meld icu4c/source/data/translit/Hani_Latn.txt /tmp/icu/translit/Hani_Latn.txt
  + copy into the source tree
    cd $ICU_SRC
    cp /tmp/icu/coll/zh.txt icu4c/source/data/coll/zh.txt
    cp /tmp/icu/translit/Hani_Latn.txt icu4c/source/data/translit/Hani_Latn.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  NOTE: If you get the error "No rule to make target 'out/build/icudt70l/uprops.icu'",
  you need to reconfigure with unicore data; see the "configure" line above.
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt70b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt70l.dat ./out/icu4j/icudt70b.dat -s ./out/build/icudt70l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt70b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt70b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt70b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt70b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt70b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
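
  A hedged sketch, not part of the ICU tools: the enum constants to announce
  correspond to the property values that are new in PropertyValueAliases.txt,
  which the diff/egrep step earlier in this section lists by hand.
  Assuming the old and the new file are available as text, a small Python helper
  could list the additions per property. The names parse_values and new_values
  are hypothetical, and the parsing only assumes the "prop ; short ; long" layout
  visible in the diff output above.

```python
def parse_values(text):
    """Map each property alias to the set of short value names found in the text."""
    values = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Assumed layout: property ; short value name ; long value name
        if len(fields) >= 3:
            values.setdefault(fields[0], set()).add(fields[1])
    return values


def new_values(old_text, new_text):
    """Return {property: sorted short names} for values present only in new_text."""
    old = parse_values(old_text)
    new = parse_values(new_text)
    added = {}
    for prop, names in new.items():
        extra = names - old.get(prop, set())
        if extra:
            added[prop] = sorted(extra)
    return added


if __name__ == "__main__":
    # Tiny inline samples; real use would read the two UCD files from disk.
    old_sample = "sc ; Yezi ; Yezidi\n"
    new_sample = "sc ; Yezi ; Yezidi\nsc ; Vith ; Vithkuqi\njg ; Thin_Yeh ; Thin_Yeh\n"
    for prop, names in new_values(old_sample, new_sample).items():
        print(prop, names)
```

  The result is only a starting list for the icu-design notice; the actual API names
  still come from uchar.h, uscript.h and the Java classes.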

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example:
    ~/icu/mine/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-13.txt
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt > /tmp/icu/nv4-14.txt
    ~/icu/uni/src$ diff -u /tmp/icu/nv4-13.txt /tmp/icu/nv4-14.txt
    -->
    +cp;16AC4;-Alpha;gc=Nd;-IDS;lb=NU;na=TANGSA DIGIT FOUR;nt=De;nv=4;SB=NU;WB=NU;-XIDS
  Unicode 14:
    tnsa 16AC0..16AC9 Tangsa
    https://github.com/unicode-org/cldr/pull/1326

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  https://github.com/unicode-org/unicodetools

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                     u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
       (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
       (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chorasmian ; Chorasmian
    blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G
    blk; Dives_Akuru ; Dives_Akuru
    blk; Khitan_Small_Script ; Khitan_Small_Script
    blk; Lisu_Sup ; Lisu_Supplement
    blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
    blk; Tangut_Sup ; Tangut_Supplement
    blk; Yezidi ; Yezidi
  -> add to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Chrs ; Chorasmian
    sc ; Diak ; Dives_Akuru
    sc ; Kits ; Khitan_Small_Script
    sc ; Yezi ; Yezidi
  -> uscript.h & com.ibm.icu.lang.UScript
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

    InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
  -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
    # Location of the ICU4C source tree.
    set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format :
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
    $ICU_ROOT/dbg/tools/unicode/c$
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
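
  A hedged sketch, not part of the ICU tools: the Java block constants among these
  API additions come from the C enum lines via the Eclipse find/replace patterns
  listed earlier in this section. The same substitutions can be tried out with
  Python's re module. The helper name convert is hypothetical, and the sample line
  uses a placeholder enum value (999), not the real one from uchar.h.

```python
import re

# (find, replace) pairs copied from the Eclipse notes for UCharacter.UnicodeBlock.
BLOCK_ID = (
    re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"),
    r"public static final int \1_ID = \2; \3",
)
BLOCK_OBJECT = (
    re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"),
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
)


def convert(line, rule):
    """Apply one find/replace rule to a single uchar.h enum line."""
    pattern, replacement = rule
    return pattern.sub(replacement, line.strip())


if __name__ == "__main__":
    # Placeholder value 999; the trailing comment holds the block start code point.
    sample = "    UBLOCK_YEZIDI = 999, /*[10E80]*/"
    print(convert(sample, BLOCK_ID))
    print(convert(sample, BLOCK_OBJECT))
```

  Lines that do not match a pattern pass through unchanged (apart from stripped
  whitespace), so the output still needs a manual review before pasting into UCharacter.java.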

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml
common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
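
Before re-running "configure", it can help to confirm that uchar.h really
carries the new version string. The following Python sketch is only an
illustration: the function names are hypothetical, and it assumes that uchar.h
defines the version as a string macro of the form
#define U_UNICODE_VERSION "12.0".

```python
import re

# Assumption: uchar.h defines the version as a string macro,
# e.g.  #define U_UNICODE_VERSION "12.0"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_in_header(header_text):
    """Return the U_UNICODE_VERSION string found in uchar.h text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


def header_matches(header_text, expected):
    """True if the header text declares exactly the expected Unicode version."""
    return unicode_version_in_header(header_text) == expected
```

A caller would read $ICU_SRC/icu4c/source/common/unicode/uchar.h into a string
and pass it in together with the version being integrated.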

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
      u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
      u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
      (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic                            ; Elymaic
    blk; Nandinagari                        ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong             ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers              ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext                     ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A      ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup                          ; Tamil_Supplement
    blk; Wancho                             ; Wancho
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \3

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
  -> uscript.h & com.ibm.icu.lang.UScript
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- update test of default bidi classes:
  Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
  see diffs in DerivedBidiClass.txt
  + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
  + UCharacterTest.java TestIteration() defaultBidi[]
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
    ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update
source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
    and cache its values.
    Works as long as the script metadata is updated before the collation data.
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread
  test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt63l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
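
The decimal-digit search used in the CLDR numbering systems step of this change
log can also be scripted so that it prints the full 0..9 range for each hit.
This is a rough Python sketch, not an ICU tool: the function name is
hypothetical, and it assumes that ppucd.txt lists single code points as
semicolon-separated lines starting with "cp;<hex>;" and that each set of
decimal digits is contiguous.

```python
import re

# Same idea as the egrep pattern ';gc=Nd.+;nv=4', but anchored so that
# values such as nv=40 would not match.
_DIGIT_FOUR_RE = re.compile(r';gc=Nd.+;nv=4(;|$)')


def digit_ranges(ppucd_lines):
    """Yield (zero, nine) code point pairs for each 'digit four' line found."""
    for raw in ppucd_lines:
        line = raw.strip()
        fields = line.split(';')
        # Assumption: single code points appear as  cp;<hex>;...
        if len(fields) < 2 or fields[0] != 'cp' or '..' in fields[1]:
            continue
        if _DIGIT_FOUR_RE.search(line):
            four = int(fields[1], 16)
            # Assumption: digits 0..9 are contiguous around the digit four.
            yield four - 4, four + 5
```

Each resulting pair still has to be compared by hand against the new blocks in
Blocks.txt and the numbering systems that CLDR already has.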

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

ICU 63 addition of ICU support of text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
  + uchar.h: update UCHAR_INT_LIMIT
  + uchar.h: add the enum U<long prop name>
    with constants U_<short prop name>_<long value name>
  +
UProperty.java: add the constant <long prop name>
  + UProperty.java: update INT_LIMIT
  + UCharacter.java: add the interface <long prop name>
    with constants <long value name>

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
    names and aliases.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

* preparseucd.py changes
- add new property short names (uppercase) to _prop_and_value_re
  so that ParseUCharHeader() parses the new enum constants

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* write data for runtime, hardcoded for now
- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
- generate new icu4c/source/common/ulayout_props_data.h
- for each of the three new enumerated properties
  + int property max value
  + small, 8-bit UCPTrie
    (A small 16-bit trie with bit fields for these three properties
    is very nearly the same size as the sum of the three.)

* wire into C++
- uprops.cpp: #include ulayout_props_data.h
- uprops.cpp: add getInPC() etc. functions
- uprops.cpp: add lines to intProps[], include max values
- uprops.h: add UPropertySource constants
- uprops.cpp: add uprops_addPropertyStarts(src)
- uniset_props.cpp: add to UnicodeSet_initInclusion()
- intltest/ucdtest.cpp: write unit tests

* update Java data files
- refresh just the pnames.icu file with the new property [value] names, just to be safe
- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* wire into Java
- UCharacterProperty.java: add new SRC_INPC etc.
constants as in C++ 1376- UCharacterProperty.java: for each new property 1377 + create a nested class to hold its CodePointTrie 1378 + initialize it from a string literal 1379 + paste in the initializer printed by genprops 1380 + add a new IntProperty object to the intProps[] array 1381 + use the correct max int value for each property, also printed by genprops 1382- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) 1383- UnicodeSet.java: add to getInclusions() 1384- UCharacterTest.java: write unit tests 1385 1386---------------------------------------------------------------------------- *** 1387 1388Unicode 11.0 update for ICU 62 1389 1390http://www.unicode.org/versions/Unicode11.0.0/ 1391http://unicode.org/versions/beta-11.0.0.html 1392https://www.unicode.org/review/pri372/ 1393http://www.unicode.org/reports/uax-proposed-updates.html 1394http://www.unicode.org/reports/tr44/tr44-21.html 1395 1396* Command-line environment setup 1397 1398UNICODE_DATA=~/unidata/uni11/20180521 1399CLDR_SRC=~/svn.cldr/uni 1400ICU_ROOT=~/svn.icu/uni 1401ICU_SRC=$ICU_ROOT/src 1402ICUDT=icudt61b 1403ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 1404ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 1405export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 1406 1407*** ICU Trac 1408 1409- ticket:13630: Unicode 11 1410- ^/branches/markus/uni11 1411 1412*** CLDR Trac 1413 1414- cldrbug 10978: Unicode 11 1415- ^/branches/markus/uni11 1416 1417*** Unicode version numbers 1418- makedata.mak 1419- uchar.h 1420- com.ibm.icu.util.VersionInfo 1421- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 1422 1423- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 1424 so that the makefiles see the new version number. 
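
The version number that the makefiles will see can be read back from the header before re-running "configure". This is only a minimal sketch: it assumes that uchar.h defines U_UNICODE_VERSION as a quoted string on a single #define line, and the function name unicode_version_in_header is hypothetical.

```shell
# Print the value of U_UNICODE_VERSION found in a uchar.h-style header.
# Assumption: the header contains a line like
#   #define U_UNICODE_VERSION "11.0"
unicode_version_in_header() {
    sed -n 's/^#define[[:space:]]\{1,\}U_UNICODE_VERSION[[:space:]]\{1,\}"\([^"]*\)".*/\1/p' "$1"
}

# Example call, using the environment variables set up above:
#   unicode_version_in_header "$ICU_SRC/icu4c/source/common/unicode/uchar.h"
```

If the printed value is still the old version, update uchar.h first and only then run "configure".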
1425 1426*** data files & enums & parser code 1427 1428* download files 1429- mkdir -p $UNICODE_DATA 1430- download Unicode files into $UNICODE_DATA 1431 + subfolders: emoji, idna, security, ucd, uca 1432 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 1433 1434* for manual diffs and for Unicode Tools input data updates: 1435 remove version suffixes from the file names 1436 ~$ unidata/desuffixucd.py $UNICODE_DATA 1437 (see https://sites.google.com/site/unicodetools/inputdata) 1438 1439* process and/or copy files 1440- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1441 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1442 + For debugging, and tweaking how ppucd.txt is written, 1443 the tool has an --only_ppucd option: 1444 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1445 1446- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 1447 1448* build ICU (make install) 1449 so that the tools build can pick up the new definitions from the installed header files. 
1450 1451 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1452 1453* preparseucd.py changes 1454- fix other errors 1455 NameError: unknown property Extended_Pictographic 1456 -> add Extended_Pictographic binary property 1457 -> add new short names for all Emoji properties 1458 1459* new constants for new property values 1460- preparseucd.py error: 1461 ValueError: missing uchar.h enum constants for some property values: 1462 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar', 1463 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals', 1464 u'Indic_Siyaq_Numbers'])), 1465 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])), 1466 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])), 1467 (u'GCB', set([u'LinkC', u'Virama'])), 1468 (u'WB', set([u'WSegSpace']))] 1469 = PropertyValueAliases.txt new property values (diff old & new .txt files) 1470 blk; Chess_Symbols ; Chess_Symbols 1471 blk; Dogra ; Dogra 1472 blk; Georgian_Ext ; Georgian_Extended 1473 blk; Gunjala_Gondi ; Gunjala_Gondi 1474 blk; Hanifi_Rohingya ; Hanifi_Rohingya 1475 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers 1476 blk; Makasar ; Makasar 1477 blk; Mayan_Numerals ; Mayan_Numerals 1478 blk; Medefaidrin ; Medefaidrin 1479 blk; Old_Sogdian ; Old_Sogdian 1480 blk; Sogdian ; Sogdian 1481 -> add to uchar.h 1482 use long property names for enum constants, 1483 for the trailing comment get the block start code point: diff old & new Blocks.txt 1484 -> add to UCharacter.UnicodeBlock IDs 1485 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 1486 replace public static final int \1_ID = \2; \3 1487 -> add to UCharacter.UnicodeBlock objects 1488 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 1489 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 1490 1491 GCB; LinkC ; LinkingConsonant 1492 GCB; Virama ; Virama 1493 -> uchar.h & 
UCharacter.GraphemeClusterBreak 1494 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76 1495 1496 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed 1497 -> ignore: ICU does not yet support this property 1498 1499 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya 1500 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa 1501 -> uchar.h & UCharacter.JoiningGroup 1502 1503 sc ; Dogr ; Dogra 1504 sc ; Gong ; Gunjala_Gondi 1505 sc ; Maka ; Makasar 1506 sc ; Medf ; Medefaidrin 1507 sc ; Rohg ; Hanifi_Rohingya 1508 sc ; Sogd ; Sogdian 1509 sc ; Sogo ; Old_Sogdian 1510 -> uscript.h & com.ibm.icu.lang.UScript 1511 -> Nushu had been added already 1512 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 1513 and in com.ibm.icu.dev.test.lang.TestUScript.java 1514 1515 WB ; WSegSpace ; WSegSpace 1516 -> uchar.h & UCharacter.WordBreak 1517 1518* New short names for emoji properties 1519- see UTS #51 1520- short names set in preparseucd.py 1521 1522* New properties 1523- boolean emoji property Extended_Pictographic 1524 -> added in preparseucd.py 1525 -> uchar.h & UProperty.java 1526- misc. 
property Equivalent_Unified_Ideograph (EqUIdeo) 1527 as shown in PropertyValueAliases.txt 1528 -> ignore for now 1529 1530* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 1531 (not strictly necessary for NOT_ENCODED scripts) 1532 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 1533 1534* update spoof checker UnicodeSet initializers: 1535 inclusionPat & recommendedPat in uspoof.cpp 1536 INCLUSION & RECOMMENDED in SpoofChecker.java 1537- make sure that the Unicode Tools tree contains the latest security data files 1538- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 1539- update the hardcoded version number there in the DIRECTORY path 1540- run the tool (no special environment variables needed) 1541- copy & paste from the Console output into the .cpp & .java files 1542 1543* generate normalization data files 1544 cd $ICU_ROOT/dbg/icu4c 1545 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1546 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1547 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1548 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1549 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1550 1551* build ICU (make install) 1552 so that the tools build can pick up the new definitions from the installed header files. 1553 1554 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1555 1556* build Unicode tools using CMake+make 1557 1558$ICU_SRC/tools/unicode/c/icudefs.txt: 1559 1560# Location (--prefix) of where ICU was installed. 1561set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 1562# Location of the ICU4C source tree. 
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* Fix case props
    genprops error: casepropsbuilder: too many exceptions words
    genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
- With the addition of Georgian Mtavruli capital letters,
  there are now too many simple case mappings with big mapping deltas
  that yield uncompressible exceptions.
- Fix: change the data structure (now formatVersion 4) by
  adding one bit for no-simple-case-folding (for Cherokee), and
  one optional slot for a big delta (for most faraway mappings),
  together with another bit for whether that delta is negative.
  This makes most Cherokee & Georgian etc. case mappings compressible,
  reducing the number of exceptions words.
- Further changes gain one more bit for the exceptions index,
  for future growth. For details, see casepropsbuilder.cpp.
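
To illustrate what a "big mapping delta" means here, the distance between a character and its simple uppercase mapping can be computed directly. The sketch below uses two such mappings (U+10D0 to U+1C90 for Georgian, U+AB70 to U+13A0 for Cherokee). The 9-bit signed field width in the usage note is an assumption for illustration only, not taken from this log, and the function names are hypothetical.

```shell
# Signed distance from a code point to its simple case mapping.
# Arguments are hex code points without a prefix, e.g. 10D0 1C90.
case_delta() {
    echo $(( 0x$2 - 0x$1 ))
}

# Succeeds if the delta ($1) fits into a signed field of $2 bits.
# The field width is a caller-supplied assumption, not an ICU constant.
fits_signed() {
    limit=$(( 1 << ($2 - 1) ))
    [ "$1" -ge $(( -limit )) ] && [ "$1" -lt "$limit" ]
}

case_delta 10D0 1C90    # Georgian an -> Mtavruli capital an; prints 3008
case_delta AB70 13A0    # Cherokee small a -> Cherokee capital a; prints -38864
```

With an assumed 9-bit signed delta field, "fits_signed 3008 9" fails, which is the kind of mapping that needs an exceptions entry or the optional big-delta slot described above.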
1594 1595* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1596 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1597- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1598- Unicode 6.0..11.0: U+2260, U+226E, U+226F 1599- nothing new in this Unicode version, no test file to update 1600 1601* run & fix ICU4C tests 1602- Andy handles RBBI & spoof check test failures 1603 1604- Errors in char.txt, word.txt, word_POSIX.txt like 1605 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16 1606 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty. 1607 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them 1608 not empty, just to get ICU building. 1609 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables 1610 and properties together with the rules that used them (GB 10, WB 14). 1611 -> Andy adjusts the rule sets further to sync with 1612 Unicode 11 grapheme, word, and line break spec changes. 
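
The uts46 check described above (grep for "disallowed_STD3_valid" on non-ASCII characters) can be sketched as a small filter. This is a hedged example: it assumes that each data line starts with a hex code point or range of at least four digits, so that ASCII entries begin with 0000..007F, and the function name non_ascii_std3_valid is hypothetical.

```shell
# List lines with status disallowed_STD3_valid whose first code point
# is above U+007F.
# Assumption: lines look like
#   2260          ; disallowed_STD3_valid   # 1.1  NOT EQUAL TO
non_ascii_std3_valid() {
    grep 'disallowed_STD3_valid' "$1" |
        grep -Ev '^00[0-7][0-9A-Fa-f]([^0-9A-Fa-f]|$)'
}

# Example call, using the download folder set up above:
#   non_ascii_std3_valid "$UNICODE_DATA/idna/IdnaMappingTable.txt"
```

Comparing this output between the old and new data files shows whether any characters beyond U+2260, U+226E, U+226F have appeared.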
1613 1614* collation: CLDR collation root, UCA DUCET 1615 1616- UCA DUCET goes into Mark's Unicode tools, see 1617 https://sites.google.com/site/unicodetools/home#TOC-UCA 1618 diff the main mapping file, look for bad changes 1619 (for example, more bytes per weight for common characters) 1620 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt 1621 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt 1622 1623- CLDR root data files are checked into $CLDR_SRC/common/uca/ 1624 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 1625 1626- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1627 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 1628- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1629 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 1630 (note removing the underscore before "Rules") 1631 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 1632- restore TODO diffs in UCARules.txt 1633 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 1634- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1635 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1636 from the CLDR root files (..._CLDR_..._SHORT.txt) 1637 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 1638 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 1639 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 1640- if CLDR common/uca/unihan-index.txt changes, then update 1641 CLDR common/collation/root.xml <collation type="private-unihan"> 1642 and regenerate (or update in parallel) 
$ICU_SRC/icu4c/source/data/coll/root.txt 1643 1644- run genuca, see command line above; 1645 deal with 1646 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 1647 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible) 1648 (add the character to genuca.cpp sampleCharsToScripts[]) 1649 + look up the USCRIPT_ code for the new sample characters 1650 (should be obvious from the comment in the error output) 1651 + *add* mappings to sampleCharsToScripts[], do not replace them 1652 (in case the script sample characters flip-flop) 1653 + insert new scripts in DUCET script order, see the top_byte table 1654 at the beginning of FractionalUCA.txt 1655- rebuild ICU4C 1656 1657* Unihan collators 1658 https://sites.google.com/site/unicodetools/unihan 1659- run Unicode Tools 1660 org.unicode.draft.GenerateUnihanCollators 1661 with VM arguments 1662 -ea 1663 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 1664 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 1665 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 1666 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1667 -DUVERSION=11.0.0 1668- run Unicode Tools 1669 org.unicode.draft.GenerateUnihanCollatorFiles 1670 with the same arguments 1671- check CLDR diffs 1672 cd $CLDR_SRC 1673 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 1674 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 1675- copy to CLDR 1676 cd $CLDR_SRC 1677 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 1678 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 1679- run CLDR unit tests, commit to CLDR 1680- generate ICU zh collation data: run CLDR 1681 org.unicode.cldr.icu.NewLdml2IcuConverter 1682 with program arguments 1683 -t collation 1684 -s 
/usr/local/google/home/mscherer/svn.cldr/uni/common/collation 1685 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 1686 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll 1687 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation 1688 zh 1689 and VM arguments 1690 -ea 1691 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 1692- rebuild ICU4C 1693 1694* run & fix ICU4C tests, now with new CLDR collation root data 1695- run all tests with the collation test data *_SHORT.txt or the full files 1696 (the full ones have comments, useful for debugging) 1697- note on intltest: if collate/UCAConformanceTest fails, then 1698 utility/MultithreadTest/TestCollators will fail as well; 1699 fix the conformance test before looking into the multi-thread test 1700 1701* update Java data files 1702- refresh just the UCD/UCA-related/derived files, just to be safe 1703- see (ICU4C)/source/data/icu4j-readme.txt 1704- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1705- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1706 output: 1707 ... 
1708 Unicode .icu files built to ./out/build/icudt61l 1709 echo timestamp > uni-core-data 1710 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b 1711 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b 1712 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 1713 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b 1714 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b" 1715 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/ 1716 mkdir -p /tmp/icu4j/main/shared/data 1717 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 1718 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/ 1719 mkdir -p /tmp/icu4j/main/shared/data 1720 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 1721 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data' 1722- copy the big-endian Unicode data files to another location, 1723 separate from the other data files, 1724 and then refresh ICU4J 1725 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 1726 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1727 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1728 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1729 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1730 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 1731 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 1732 cp com/ibm/icu/impl/data/$ICUDT/coll/* 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 1733 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 1734 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 1735 1736* When refreshing all of ICU4J data from ICU4C 1737- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1738- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 1739or 1740- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 1741 1742* update CollationFCD.java 1743 + copy & paste the initializers of lcccIndex[] etc. from 1744 ICU4C/source/i18n/collationfcd.cpp to 1745 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 1746 1747* refresh Java test .txt files 1748- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1749 cd $ICU_SRC/icu4c/source/data/unidata 1750 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1751 cd ../../test/testdata 1752 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1753 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 1754 1755* run & fix ICU4J tests 1756 1757*** API additions 1758- send notice to icu-design about new born-@stable API (enum constants etc.) 1759 1760*** CLDR numbering systems 1761- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR 1762 Unicode 11: using Unicode 11 CLDR ticket #10978 1763 rohg 10D30..10D39 Hanifi_Rohingya 1764 gong 11DA0..11DA9 Gunjala_Gondi 1765 Earlier: CLDR tickets specific to adding new numbering systems. 
1766 Unicode 10: http://unicode.org/cldr/trac/ticket/10219 1767 Unicode 9: http://unicode.org/cldr/trac/ticket/9692 1768 1769*** merge the Unicode update branches back onto the trunk 1770- do not merge the icudata.jar and testdata.jar, 1771 instead rebuild them from merged & tested ICU4C 1772- make sure that changes to Unicode tools are checked in: 1773 http://www.unicode.org/utility/trac/log/trunk/unicodetools 1774 1775---------------------------------------------------------------------------- *** 1776 1777Unicode 10.0 update for ICU 60 1778 1779http://www.unicode.org/versions/Unicode10.0.0/ 1780http://www.unicode.org/versions/beta-10.0.0.html 1781http://blog.unicode.org/2017/03/unicode-100-beta-review.html 1782http://www.unicode.org/review/pri350/ 1783http://www.unicode.org/reports/uax-proposed-updates.html 1784http://www.unicode.org/reports/tr44/tr44-19.html 1785 1786* Command-line environment setup 1787 1788UNICODE_DATA=~/unidata/uni10/20170605 1789CLDR_SRC=~/svn.cldr/uni10 1790ICU_ROOT=~/svn.icu/uni10 1791ICU_SRC=$ICU_ROOT/src 1792ICUDT=icudt60b 1793ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 1794ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 1795export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 1796 1797*** ICU Trac 1798 1799- ticket:12985: Unicode 10 1800- ticket:13061: undo hacks from emoji 5.0 update 1801- ticket:13062: add Emoji_Component property 1802- ^/branches/markus/uni10 1803 1804*** CLDR Trac 1805 1806- cldrbug 10055: Unicode 10 1807- cldrbug 9882: Unicode 10 script metadata 1808- cldrbug 10219: numbering systems for Unicode 10 1809 1810*** Unicode version numbers 1811- makedata.mak 1812- uchar.h 1813- com.ibm.icu.util.VersionInfo 1814- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 1815 1816- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 1817 so that the makefiles see the new version number. 
1818 1819*** data files & enums & parser code 1820 1821* download files 1822- mkdir -p $UNICODE_DATA 1823- download Unicode 10.0 files into $UNICODE_DATA 1824 + subfolders: ucd, uca, idna, security 1825 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 1826- download emoji 5.0 files into $UNICODE_DATA/emoji 1827 1828* for manual diffs: remove version suffixes from the file names 1829 ~$ unidata/desuffixucd.py $UNICODE_DATA 1830 (see https://sites.google.com/site/unicodetools/inputdata) 1831 1832* process and/or copy files 1833- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1834 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1835 + For debugging, and tweaking how ppucd.txt is written, 1836 the tool has an --only_ppucd option: 1837 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1838 1839- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 1840 1841* build ICU (make install) 1842 so that the tools build can pick up the new definitions from the installed header files. 
1843 1844 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1845 1846* preparseucd.py changes 1847- remove or add new Unicode scripts from/to the 1848 only-in-ISO-15924 list according to the error messages: 1849 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924 1850 -> adjust _scripts_only_in_iso15924 as indicated 1851- fix other errors 1852 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo'] 1853 -> add vo=Vertical_Orientation to _ignored_properties 1854 -> later removed again, parsing the file, even though we do not yet store data for runtime use 1855 1856* new constants for new property values 1857- preparseucd.py error: 1858 ValueError: missing uchar.h enum constants for some property values: 1859 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F', 1860 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])), 1861 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla', 1862 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra', 1863 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])), 1864 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))] 1865 = PropertyValueAliases.txt new property values (diff old & new .txt files) 1866 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F 1867 blk; Kana_Ext_A ; Kana_Extended_A 1868 blk; Masaram_Gondi ; Masaram_Gondi 1869 blk; Nushu ; Nushu 1870 blk; Soyombo ; Soyombo 1871 blk; Syriac_Sup ; Syriac_Supplement 1872 blk; Zanabazar_Square ; Zanabazar_Square 1873 -> add to uchar.h 1874 use long property names for enum constants, 1875 for the trailing comment get the block start code point: diff old & new Blocks.txt 1876 -> add to UCharacter.UnicodeBlock IDs 1877 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 1878 replace public static final int \1_ID = \2; \3 1879 -> add to UCharacter.UnicodeBlock objects 1880 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 1881 replace public static 
final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 1882 1883 jg ; Malayalam_Bha ; Malayalam_Bha 1884 jg ; Malayalam_Ja ; Malayalam_Ja 1885 jg ; Malayalam_Lla ; Malayalam_Lla 1886 jg ; Malayalam_Llla ; Malayalam_Llla 1887 jg ; Malayalam_Nga ; Malayalam_Nga 1888 jg ; Malayalam_Nna ; Malayalam_Nna 1889 jg ; Malayalam_Nnna ; Malayalam_Nnna 1890 jg ; Malayalam_Nya ; Malayalam_Nya 1891 jg ; Malayalam_Ra ; Malayalam_Ra 1892 jg ; Malayalam_Ssa ; Malayalam_Ssa 1893 jg ; Malayalam_Tta ; Malayalam_Tta 1894 -> uchar.h & UCharacter.JoiningGroup 1895 1896 sc ; Gonm ; Masaram_Gondi 1897 sc ; Nshu ; Nushu 1898 sc ; Soyo ; Soyombo 1899 sc ; Zanb ; Zanabazar_Square 1900 -> uscript.h & com.ibm.icu.lang.UScript 1901 -> Nushu had been added already 1902 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 1903 and in com.ibm.icu.dev.test.lang.TestUScript.java 1904 1905* New properties as shown in PropertyValueAliases.txt changes 1906- boolean Emoji_Component from emoji 5 1907 -> uchar.h & UProperty.java 1908- boolean 1909 # Regional_Indicator (RI) 1910 1911 RI ; N ; No ; F ; False 1912 RI ; Y ; Yes ; T ; True 1913 -> uchar.h & UProperty.java 1914 -> single immutable range, to be hardcoded 1915- boolean 1916 # Prepended_Concatenation_Mark (PCM) 1917 1918 PCM; N ; No ; F ; False 1919 PCM; Y ; Yes ; T ; True 1920 -> was new in Unicode 9 1921 -> uchar.h & UProperty.java 1922- enumerated 1923 # Vertical_Orientation (vo) 1924 1925 vo ; R ; Rotated 1926 vo ; Tr ; Transformed_Rotated 1927 vo ; Tu ; Transformed_Upright 1928 vo ; U ; Upright 1929 -> only pre-parsed for now, but not yet stored for runtime use 1930 1931* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 1932 (not strictly necessary for NOT_ENCODED scripts) 1933 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 1934 1935* generate normalization data files 1936 cd $ICU_ROOT/dbg/icu4c 1937 
bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1938 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1939 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1940 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1941 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1942 1943* build ICU (make install) 1944 so that the tools build can pick up the new definitions from the installed header files. 1945 1946 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1947 1948* build Unicode tools using CMake+make 1949 1950$ICU_SRC/tools/unicode/c/icudefs.txt: 1951 1952# Location (--prefix) of where ICU was installed. 1953set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 1954# Location of the ICU4C source tree. 1955set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c) 1956 1957 $ICU_ROOT/dbg/tools/unicode/c$ 1958 cmake ../../../../src/tools/unicode/c 1959 make 1960 1961* generate core properties data files 1962 $ICU_ROOT/dbg/tools/unicode/c$ 1963 genprops/genprops $ICU_SRC/icu4c 1964 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c 1965 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 1966- rebuild ICU (make install) & tools 1967 1968* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1969 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1970- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1971- Unicode 6.0..10.0: U+2260, U+226E, U+226F 1972- nothing new in this Unicode version, no test file to update 1973 1974* run & fix ICU4C tests 1975- Andy handles RBBI & spoof check test failures 1976 1977* collation: CLDR collation root, UCA DUCET 1978 1979- UCA DUCET goes into Mark's Unicode 
tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
        FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
    -DUVERSION=10.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt60l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
  Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Emoji 5.0 update for ICU 59
- ICU 59 mostly remains on Unicode 9.0
- except updates bidi and segmentation data to Unicode 10 beta

First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
ICUDT=icudt59b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

*** ICU Trac

- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
- changes directly on trunk

*** data files & enums & parser code

* download files

- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
- download emoji 5.0 beta files into the same uni90e50 folder
- download Unicode 10.0 beta files: ucd
  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
    BidiBrackets.txt
    BidiCharacterTest.txt
    BidiMirroring.txt
    BidiTest.txt
    extracted/DerivedBidiClass.txt
  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
    LineBreak.txt
    auxiliary/*

* preparseucd.py changes
- adjust for combined trunks
- write new copyright lines
- ignore new Emoji_Component property for now

* process and/or copy files
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)

    ~/svn.icu/trunk/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    ~/svn.icu/trunk/dbg/tools/unicode/c$
    genprops/genprops $ICU4C_SRC_DIR
- rebuild ICU (make install) & tools

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt59l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
or
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU4C_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

---------------------------------------------------------------------------- ***

Unicode 9.0 update for ICU 58

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
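
Sketch of the "remove version suffixes from the file names" step above, for
readers who want to see what it amounts to. This is not the real
desuffixucd.py: the function name and the assumed suffix shape
(-major.minor.update, optionally followed by a draft marker such as d3,
directly before the file extension) are illustrative assumptions.

```python
import re

# Assumption: a version suffix looks like "-9.0.0" or "-9.0.0d1" and sits
# directly before the final file extension.
_VERSION_SUFFIX = re.compile(
    r"-[0-9]+\.[0-9]+\.[0-9]+(?:d[0-9]+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a trailing Unicode version suffix.

    Names without such a suffix are returned unchanged.
    """
    return _VERSION_SUFFIX.sub("", file_name)


if __name__ == "__main__":
    # Example names only; check them against the actual download folder.
    for name in ("UnicodeData-9.0.0d1.txt", "confusables.txt"):
        print(name, "->", strip_version_suffix(name))
```

Renaming the files on disk (for example with os.rename while walking the
data folder) is deliberately left out; the sketch covers only the name
mapping.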

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
  in ppucd.txt:
    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
  ->
    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.
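
Sketch for the DerivedNumericValues.txt entries quoted under
"preparseucd.py changes" above: it parses the five quoted lines and
cross-checks the decimal field against the rational field. The function name
is hypothetical and this is not how preparseucd.py or genprops handle the
data. It assumes single code points (no ranges) and decimals that are exact,
which holds for these five lines but may not hold for the whole file.

```python
from fractions import Fraction

# The five new lines as quoted in this log.
NEW_LINES = """\
0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
"""


def parse_numeric_values(text):
    """Map code point -> Fraction for 'cp ; decimal ; ; rational' lines."""
    values = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_point = int(fields[0], 16)
        decimal = Fraction(fields[1])   # e.g. "0.00625"
        rational = Fraction(fields[3])  # e.g. "1/160"
        if decimal != rational:
            raise ValueError("U+%04X: %s != %s"
                             % (code_point, fields[1], fields[3]))
        values[code_point] = rational
    return values


if __name__ == "__main__":
    for code_point, value in sorted(parse_numeric_values(NEW_LINES).items()):
        print("U+%04X %s" % (code_point, value))
```

Keeping the values as exact fractions makes it easy to see which
denominators a new series of values has to cover.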

* PropertyValueAliases.txt new property values
  blk; Adlam ; Adlam
  blk; Bhaiksuki ; Bhaiksuki
  blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
  blk; Glagolitic_Sup ; Glagolitic_Supplement
  blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
  blk; Marchen ; Marchen
  blk; Mongolian_Sup ; Mongolian_Supplement
  blk; Newa ; Newa
  blk; Osage ; Osage
  blk; Tangut ; Tangut
  blk; Tangut_Components ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

  GCB; EB ; E_Base
  GCB; EBG ; E_Base_GAZ
  GCB; EM ; E_Modifier
  GCB; GAZ ; Glue_After_Zwj
  GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

  jg ; African_Feh ; African_Feh
  jg ; African_Noon ; African_Noon
  jg ; African_Qaf ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

  lb ; EB ; E_Base
  lb ; EM ; E_Modifier
  lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

  sc ; Adlm ; Adlam
  sc ; Bhks ; Bhaiksuki
  sc ; Marc ; Marchen
  sc ; Newa ; Newa
  sc ; Osge ; Osage
  sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

  WB ; EB ; E_Base
  WB ; EBG ; E_Base_GAZ
  WB ; EM ; E_Modifier
  WB ; GAZ ; Glue_After_Zwj
  WB ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
    cmake ../../../src/unicode/c
    make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
    genprops/genprops $ICU_SRC_DIR
    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
        FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
  http://www.unicode.org/utility/trac/log/trunk/unicodetools
  http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: https://unicode-org.atlassian.net/browse/ICU-11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
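
Sketch of the USCRIPT_ find/replace noted above for com.ibm.icu.lang.UScript,
expressed with Python's re module instead of an editor. The pattern and the
replacement are the ones given in this log; the function name and the sample
enum line (name, number and comment) are made up for illustration and are
not taken from uscript.h.

```python
import re

# Pattern and replacement as given in the log above.
_USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
_JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def to_java_constant(c_line):
    """Rewrite one C enum line into a Java int constant declaration.

    Lines that do not match the pattern are returned unchanged.
    """
    return _USCRIPT_ENUM.sub(_JAVA_CONSTANT, c_line)


if __name__ == "__main__":
    # Hypothetical input line, for illustration only.
    print(to_java_constant("    USCRIPT_EXAMPLE = 999,/* Xmpl */"))
```

Only the matched part of a line is rewritten, so leading indentation
carries over to the Java side.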
2635 2636ICU_ROOT=~/svn.icu/trunk 2637ICU_SRC_DIR=$ICU_ROOT/src 2638ICUDT=icudt57b 2639export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib 2640SRC_DATA_IN=$ICU_SRC_DIR/source/data/in 2641UNIDATA=$ICU_SRC_DIR/source/data/unidata 2642 2643Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files, 2644see https://unicode-org.atlassian.net/browse/ICU-12141 2645 2646make install, then icutools cmake & make, then 2647~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR 2648 2649Generate Java data as usual, only update pnames.icu & uprops.icu. 2650 2651*** LayoutEngine script information 2652 2653* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. 2654 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp 2655 in the working directory. 2656 2657 (It also generates ScriptRunData.cpp, which is no longer needed.) 2658 2659 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages 2660 (a plain text file) 2661 which maps ICU versions to the numbers of script/language constants 2662 that were added then. 2663 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) 2664 2665 The generated files have a current copyright date and "@deprecated" statement. 2666 2667* Review changes, fix Java tool if necessary, and copy to ICU4C 2668 cd ~/svn.icu4j/trunk/src 2669 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 2670 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout 2671 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout 2672 2673---------------------------------------------------------------------------- *** 2674 2675Emoji properties added in ICU 57: https://unicode-org.atlassian.net/browse/ICU-11802 2676 2677Edit preparseucd.py to add & parse new properties. 
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.
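
A rough sketch of the kind of parsing that the emoji-data.txt step calls for.
It is not the code in preparseucd.py. It assumes the usual UCD-style layout
"<code point or range> ; <property> # comment"; the function name and the sample lines
are made up for illustration, so check them against the real file before relying on this:

```python
import re

# Assumed line layout: "0023 ; Emoji # comment" or "1F600..1F64F ; Emoji # comment".
DATA_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)\s*$")


def parse_emoji_data(lines):
    """Return {property name: [(start, end), ...]} with inclusive ranges."""
    props = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        m = DATA_LINE.match(line)
        if m is None:
            raise ValueError("unexpected line: %r" % raw)
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        props.setdefault(m.group(3), []).append((start, end))
    return props


# Hypothetical sample input, not real data.
sample = [
    "# illustrative sample",
    "",
    "0023 ; Emoji # NUMBER SIGN",
    "1F600..1F64F ; Emoji_Presentation # sample range",
]
print(parse_emoji_data(sample))
```

Keeping the ranges grouped per property name mirrors the note that these properties share
the UCD property namespace even though PropertyAliases.txt does not list them.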

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
  IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (improper fractions)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to proper fractions (e.g., 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
    blk; Ahom ; Ahom
    blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
    blk; Cherokee_Sup ; Cherokee_Supplement
    blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
    blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
    blk; Hatran ; Hatran
    blk; Multani ; Multani
    blk; Old_Hungarian ; Old_Hungarian
    blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
    blk; Sutton_SignWriting ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
    sc ; Ahom ; Ahom
    sc ; Hatr ; Hatran
    sc ; Hluw ; Anatolian_Hieroglyphs
    sc ; Hung ; Old_Hungarian
    sc ; Mult ; Multani
    sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
    ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
- nothing new in 8.0, no test file to update

* run & fix ICU4C tests
- bad Cherokee case folding due to difference in fallbacks:
  UCD case folding falls back to no mapping,
  ICU runtime case folding falls back to lowercasing;
  fixed casepropsbuilder.cpp to generate scf mappings to self
  when there is an slc mapping but no scf
- Andy handles RBBI & spoof check test failures

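The Cherokee case folding fix above can be modeled in a few lines of Python.
This is only a sketch of the idea, not the code in casepropsbuilder.cpp; the dict-based
tables and both function names are hypothetical, and the code point pair
U+13A0/U+AB70 is assumed here as a representative Cherokee upper/lowercase pair:

```python
def runtime_fold(c, slc, scf):
    """Model of the runtime fallback: use scf if present, else lowercase."""
    if c in scf:
        return scf[c]
    return slc.get(c, c)


def add_self_folds(slc, scf):
    """Builder-side fix: add scf c -> c where slc maps c but scf does not."""
    fixed = dict(scf)
    for c in slc:
        fixed.setdefault(c, c)
    return fixed


# Assumed sample data: U+13A0 lowercases to U+AB70, and U+AB70 folds to U+13A0.
slc = {0x13A0: 0xAB70}
scf = {0xAB70: 0x13A0}

# Without the fix, the lowercasing fallback folds U+13A0 to U+AB70,
# which disagrees with the UCD (no scf mapping means "folds to itself").
print(hex(runtime_fold(0x13A0, slc, scf)))

# With explicit self-mappings the fallback is never reached for U+13A0.
fixed = add_self_folds(slc, scf)
print(hex(runtime_fold(0x13A0, slc, fixed)))
```

Writing the self-mapping into the data keeps the runtime fallback unchanged for all other
characters, which is why the fix sits in the builder rather than in the lookup code.
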
* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the script for the new sample characters
    (e.g., in FractionalUCA.txt)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- fixed bug in CollationWeights::getWeightRanges()
  exposed by new data and CollationTest::TestRootElements

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt56l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
  because the layout engine was deprecated in ICU 54.
  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
  to write lines that we used to add manually.

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
  http://www.unicode.org/utility/trac/log/trunk/unicodetools
  http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ; Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
      find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
      replace public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
    (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
      find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
      replace public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
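
When updating the expectedShort and expectedLong test names, it can help to pull the
short/long Script name pairs out of PropertyValueAliases.txt mechanically.
A minimal Python sketch, assuming the "sc ; Short ; Long" line shape quoted above;
the function name is hypothetical and this is not how preparseucd.py does it:

```python
def parse_script_aliases(lines):
    """Collect {short name: long name} from 'sc ; Xxxx ; Long_Name' lines."""
    names = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # ignore comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            names[fields[1]] = fields[2]
    return names


# Lines quoted from the lists above; the jg line shows that other properties are skipped.
sample = [
    "sc ; Aghb ; Caucasian_Albanian",
    "sc ; Sind ; Khudawadi",
    "jg ; Straight_Waw ; Straight_Waw",
]
for short, long_name in sorted(parse_script_aliases(sample).items()):
    print(short, long_name)
```

Only the first long name is kept; any further aliases on the same line are ignored in this sketch.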

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
    (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
3366 3367*** merge the Unicode update branches back onto the trunk 3368- do not merge the icudata.jar and testdata.jar, 3369 instead rebuild them from merged & tested ICU4C 3370 3371---------------------------------------------------------------------------- *** 3372 3373Unicode 6.3 update 3374 3375http://www.unicode.org/review/pri249/ -- beta review 3376http://www.unicode.org/reports/uax-proposed-updates.html 3377http://www.unicode.org/versions/beta-6.3.0.html#notable_issues 3378http://www.unicode.org/reports/tr44/tr44-11.html 3379 3380*** ICU Trac 3381 3382- ticket 10128: update ICU to Unicode 6.3 beta 3383- ticket 10168: update ICU to Unicode 6.3 final 3384- C++ branches/markus/uni63 at r33552 from trunk at r33551 3385- Java branches/markus/uni63 at r33550 from trunk at r33553 3386 3387- ticket 10142: implement Unicode 6.3 bidi algorithm additions 3388 3389*** Unicode version numbers 3390- makedata.mak 3391- uchar.h 3392 (configure.in & configure: have been modified to extract the version from uchar.h) 3393- com.ibm.icu.util.VersionInfo 3394- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 3395 3396- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 3397 so that the makefiles see the new version number. 3398 3399*** data files & enums & parser code 3400 3401* file preparation 3402 3403- download UCD, UCA & IDNA files 3404- make sure that the Unicode data folder passed into preparseucd.py 3405 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) 3406- modify preparseucd.py: 3407 parse new file BidiBrackets.txt 3408 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type 3409- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src 3410- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 
3411- Check test file diffs for previously commented-out, known-failing data lines; 3412 probably need to keep those commented out. 3413 3414* PropertyAliases.txt changes 3415- 1 new Enumerated Property 3416 bpt ; Bidi_Paired_Bracket_Type 3417 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType 3418 -> ubidi_props.h & .c & UBiDiProps.java 3419 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX 3420 -> uprops.cpp 3421 -> change ubidi.icu format version from 2.0 to 2.1 3422- 1 new Miscellaneous Property 3423 bpb ; Bidi_Paired_Bracket 3424 -> uchar.h & UProperty.java 3425 -> ppucd.h & .cpp 3426 3427* PropertyValueAliases.txt changes 3428- 3 Bidi_Paired_Bracket_Type (bpt) values: 3429 bpt; c ; Close 3430 bpt; n ; None 3431 bpt; o ; Open 3432 -> uchar.h & UCharacter.BidiPairedBracketType 3433 -> ubidi_props.h & .c & UBiDiProps.java 3434 -> change ubidi.icu format version from 2.0 to 2.1 3435- 4 new Bidi_Class (bc) values: 3436 bc ; FSI ; First_Strong_Isolate 3437 bc ; LRI ; Left_To_Right_Isolate 3438 bc ; RLI ; Right_To_Left_Isolate 3439 bc ; PDI ; Pop_Directional_Isolate 3440 -> uchar.h & UCharacterEnums.ECharacterDirection 3441 -> until the bidi code gets updated, 3442 Roozbeh suggests mapping the new bc values to ON (Other_Neutral) 3443- 3 new Word_Break (WB) values: 3444 WB ; HL ; Hebrew_Letter 3445 WB ; SQ ; Single_Quote 3446 WB ; DQ ; Double_Quote 3447 -> uchar.h & UCharacter.WordBreak 3448 -> first time Word_Break numeric constants exceed 4 bits (now 17 values) 3449- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 3450 (added 2012-10-16) 3451 Aghb 239 Caucasian Albanian 3452 Mahj 314 Mahajani 3453 -> uscript.h 3454 -> com.ibm.icu.lang.UScript 3455 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 3456 replace public static final int \1 = \2;\3 3457 -> preparseucd.py _scripts_only_in_iso15924 3458 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() 3459 and in 
com.ibm.icu.dev.test.lang.TestUScript.java 3460 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 3461 (not strictly necessary for NOT_ENCODED scripts) 3462 3463* generate normalization data files 3464- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib 3465- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in 3466- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata 3467- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 3468- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 3469- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 3470- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 3471 3472* build ICU (make install) 3473 so that the tools build can pick up the new definitions from the installed header files. 3474 3475~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt 3476 3477* build Unicode tools using CMake+make 3478 3479~/svn.icutools/trunk/src/unicode/c/icudefs.txt: 3480 3481# Location (--prefix) of where ICU was installed. 3482set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst) 3483# Location of the ICU source tree. 
3484set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src) 3485 3486~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c 3487~/svn.icutools/trunk/dbg/unicode/c$ make 3488 3489* generate core properties data files 3490- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src 3491- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src 3492- rebuild ICU (make install) & tools 3493- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm 3494- rebuild ICU (make install) & tools 3495 3496* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 3497 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 3498- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 3499- Unicode 6.0..6.3: U+2260, U+226E, U+226F 3500- nothing new in 6.3, no test file to update 3501 3502* update Java data files 3503- refresh just the UCD-related files, just to be safe 3504- see (ICU4C)/source/data/icu4j-readme.txt 3505- mkdir /tmp/icu4j 3506- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3507 output: 3508 ... 
3509 Unicode .icu files built to ./out/build/icudt52l 3510 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b 3511 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b 3512 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 3513 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b 3514 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b" 3515 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/ 3516 mkdir -p /tmp/icu4j/main/shared/data 3517 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 3518 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/ 3519 mkdir -p /tmp/icu4j/main/shared/data 3520 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 3521 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data' 3522- copy the big-endian Unicode data files to another location, 3523 separate from the other data files 3524 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 3525 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr 3526 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b 3527 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu 3528 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b 3529 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 3530 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp 
com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr 3531- refresh ICU4J 3532 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b 3533 3534* refresh Java test .txt files 3535- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 3536 3537* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files 3538 3539- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ 3540- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that 3541- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 3542- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 3543 (note removing the underscore before "Rules") 3544- update (ICU4C)/source/test/testdata/CollationTest_*.txt 3545 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 3546 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 3547- check test file diffs for previously commented-out, known-failing data lines; 3548 probably need to keep those commented out 3549- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 3550- run genuca, see command line above 3551- rebuild ICU4C 3552- refresh ICU4J collation data: 3553 (subset of instructions above for properties data refresh, except copies all coll/*) 3554 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3555 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 3556 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 3557 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b 3558- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 3559- 
note on intltest: if collate/UCAConformanceTest fails, then 3560 utility/MultithreadTest/TestCollators will fail as well; 3561 fix the conformance test before looking into the multi-thread test 3562 3563* test ICU, fix test code where necessary 3564 3565* When refreshing all of ICU4J data from ICU4C 3566- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3567- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 3568or 3569- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 3570 3571*** LayoutEngine script information 3572- skipped for Unicode 6.3: no new scripts 3573 3574*** merge the Unicode update branches back onto the trunk 3575- do not merge the icudata.jar and testdata.jar, 3576 instead rebuild them from merged & tested ICU4C 3577 3578---------------------------------------------------------------------------- *** 3579 3580Unicode 6.2 update 3581 3582http://www.unicode.org/review/pri230/ 3583http://www.unicode.org/versions/beta-6.2.0.html 3584http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0 3585http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values 3586http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol 3587http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols 3588http://www.unicode.org/reports/tr46/tr46-8.html IDNA 3589http://unicode.org/Public/idna/6.2.0/ 3590 3591*** ICU Trac 3592 3593- ticket 9515: Unicode 6.2: final ICU update 3594 3595- ticket 9514: UCA 6.2: fix UCARules.txt 3596 3597- ticket 9437: update ICU to Unicode 6.2 3598- C++ branches/markus/uni62 at r32050 from trunk at r32041 3599- Java branches/markus/uni62 at r32068 from trunk at r32066 3600 3601*** Unicode version numbers 3602- makedata.mak 3603- uchar.h 3604 (configure.in & configure: have been modified to extract the version from uchar.h) 3605- com.ibm.icu.util.VersionInfo 3606- 
com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 3607 3608*** data files & enums & parser code 3609 3610* file preparation 3611 3612- download UCD, UCA & IDNA files 3613- make sure that the Unicode data folder passed into preparseucd.py 3614 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) 3615- modify preparseucd.py: NamesList.txt is now in UTF-8 3616- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src 3617- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 3618- Check test file diffs for previously commented-out, known-failing data lines; 3619 probably need to keep those commented out. 3620 3621* PropertyValueAliases.txt changes 3622- 1 new Line_Break (lb) value: 3623 lb ; RI ; Regional_Indicator 3624 -> uchar.h & UCharacter.LineBreak 3625- 1 new Word_Break (WB) value: 3626 WB ; RI ; Regional_Indicator 3627 -> uchar.h & UCharacter.WordBreak 3628- 1 new Grapheme_Cluster_Break (GCB) value: 3629 GCB; RI ; Regional_Indicator 3630 -> uchar.h & UCharacter.GraphemeClusterBreak 3631 3632* 3 new numeric values 3633 The new value -1, which was really supposed to be NaN but that would have required 3634 new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1, 3635 but encodeNumericValue() in corepropsbuilder.cpp had to be fixed. 3636 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1 3637 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1 3638 The two new values 216000 and 432000 require an addition to the encoding of numeric values. 
3639 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000 3640 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000 3641 -> uprops.h, uchar.c & UCharacterProperty.java 3642 -> cucdtst.c & UCharacterTest.java 3643 3644* generate normalization data files 3645- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib 3646- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in 3647- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata 3648- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 3649- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 3650- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 3651- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 3652 3653* build ICU (make install) 3654 so that the tools build can pick up the new definitions from the installed header files. 
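The four gennorm2 invocations above follow one pattern: each .nrm file is built from nfc.txt plus zero or more further source files. A minimal sketch of that pattern, assuming the same SRC_DATA_IN and UNIDATA locations as above. The names NORM2_INPUTS and gennorm2_commands are illustrative only and not part of the ICU tools; the code just builds argument lists and does not run anything.

```python
# Sketch only: rebuild the gennorm2 command lines listed above.
# NORM2_INPUTS and gennorm2_commands() are hypothetical names for illustration.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Return one argv list per .nrm file, using only the -o and -s options shown above."""
    commands = []
    for output, inputs in NORM2_INPUTS.items():
        commands.append(
            [tool, "-o", f"{src_data_in}/{output}", "-s", f"{unidata}/norm2", *inputs]
        )
    return commands


if __name__ == "__main__":
    # Print the commands with the shell variables left unexpanded.
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

Keeping the output-to-inputs mapping in one table makes it easy to see that nfkc_cf.nrm depends on all three source files while uts46.nrm only layers uts46.txt on top of nfc.txt.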
3655* build Unicode tools using CMake+make 3656 3657* generate core properties data files 3658- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src 3659- in initial bootstrapping, change the UCA version 3660 in source/data/unidata/FractionalUCA.txt to match the new Unicode version 3661- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src 3662- rebuild ICU (make install) & tools 3663 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR, 3664 check if the UCA version in FractionalUCA.txt matches the new Unicode version 3665 (see step above) 3666- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm 3667- rebuild ICU (make install) & tools 3668 3669* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 3670 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 3671- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 3672- Unicode 6.0..6.2: U+2260, U+226E, U+226F 3673- nothing new in 6.2, no test file to update 3674 3675* update Java data files 3676- refresh just the UCD-related files, just to be safe 3677- see (ICU4C)/source/data/icu4j-readme.txt 3678- mkdir /tmp/icu4j 3679- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3680 output: 3681 ... 
3682 Unicode .icu files built to ./out/build/icudt50l 3683 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b 3684 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b 3685 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt 3686 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b 3687 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b" 3688 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/ 3689 mkdir -p /tmp/icu4j/main/shared/data 3690 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 3691 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/ 3692 mkdir -p /tmp/icu4j/main/shared/data 3693 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 3694 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data' 3695- copy the big-endian Unicode data files to another location, 3696 separate from the other data files 3697 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 3698 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr 3699 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b 3700 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu 3701 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b 3702 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 3703 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp 
com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr 3704- refresh ICU4J 3705 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b 3706 3707* refresh Java test .txt files 3708- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 3709 3710* UCA 3711 3712- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ 3713- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that 3714- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 3715- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 3716 (note removing the underscore before "Rules") 3717- update (ICU4C)/source/test/testdata/CollationTest_*.txt 3718 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 3719 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 3720- check test file diffs for previously commented-out, known-failing data lines; 3721 probably need to keep those commented out 3722- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 3723- run genuca, see command line above 3724- rebuild ICU4C 3725- refresh ICU4J collation data: 3726 (subset of instructions above for properties data refresh, except copies all coll/*) 3727 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3728 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 3729 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll 3730 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b 3731- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 3732- note on intltest: if collate/UCAConformanceTest fails, then 3733 
utility/MultithreadTest/TestCollators will fail as well; 3734 fix the conformance test before looking into the multi-thread test 3735 3736* test ICU, fix test code where necessary 3737 3738* When refreshing all of ICU4J data from ICU4C 3739- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 3740- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 3741or 3742- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 3743 3744*** LayoutEngine script information 3745- skipped for Unicode 6.2: no new scripts 3746 3747*** merge the Unicode update branches back onto the trunk 3748- do not merge the icudata.jar and testdata.jar, 3749 instead rebuild them from merged & tested ICU4C 3750 3751---------------------------------------------------------------------------- *** 3752 3753Future Unicode update 3754 3755Tools simplified since the Unicode 6.1 update. See 3756- https://icu.unicode.org/design/props/ppucd 3757- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972 3758 3759* Unicode version numbers 3760- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates 3761 3762* file preparation 3763- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py: 3764- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src 3765- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 3766- Check test file diffs for previously commented-out, known-failing data lines; 3767 probably need to keep those commented out. 3768 3769* PropertyValueAliases.txt changes 3770- Script codes that are in ISO 15924 but not in Unicode are now listed in 3771 preparseucd.py, in the _scripts_only_in_iso15924 variable. 3772 If there are new ISO codes, then add them. 3773 If Unicode adds some of them, then remove them from the .py variable. 
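The maintenance rule for ISO-only script codes can be pictured as a set operation: add new ISO 15924 codes, then drop any that Unicode has since encoded. A minimal sketch under that reading. The function name is hypothetical, the sample codes are taken from elsewhere in this change log, and the real _scripts_only_in_iso15924 variable in preparseucd.py may be structured differently.

```python
def update_iso_only_scripts(iso_only, new_iso_codes, unicode_scripts):
    """Add new ISO 15924 codes, then remove codes that Unicode now encodes.

    All three arguments are iterables of four-letter script codes.
    Hypothetical helper for illustration; not taken from preparseucd.py.
    """
    return (set(iso_only) | set(new_iso_codes)) - set(unicode_scripts)


if __name__ == "__main__":
    # Sample data only: codes mentioned in this change log.
    current = {"Khoj", "Tirh", "Hluw"}
    updated = update_iso_only_scripts(current, {"Aghb", "Mahj"}, {"Cakm", "Takr"})
    print(sorted(updated))
```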
3774 3775* UnicodeData.txt changes 3776- No more manual changes for CJK ranges for algorithmic names; 3777 those are now written to ppucd.txt and genprops reads them from there. 3778 3779* generate core properties data files (makeprops.sh was deleted) 3780- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src 3781 3782* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt 3783- it is now generated by preparseucd.py 3784 3785* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt 3786- it is now generated by preparseucd.py 3787- make sure that the Unicode data folder passed into preparseucd.py 3788 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt 3789 (can be in some subfolder) 3790 3791* generate normalization data files 3792- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib 3793- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in 3794- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata 3795- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt 3796- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt 3797- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 3798- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt 3799 3800* build ICU (make install) 3801* build Unicode tools using CMake+make 3802 3803* new way to call genuca (makeuca.sh was deleted) 3804- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src 3805 3806---------------------------------------------------------------------------- *** 3807 3808Unicode 6.1 update 3809 3810*** ICU Trac 3811 3812- ticket 8995 final update to Unicode 6.1 3813- ticket 8994 regenerate source/layout/CanonData.cpp 3814 3815- ticket 
8961 support Unicode "Age" value *names* 3816- ticket 8963 support multiple character name aliases & types 3817 3818- ticket 8827 "update ICU to Unicode 6.1" 3819- C++ branches/markus/uni61 at r30864 from trunk at r30843 3820- Java branches/markus/uni61 at r30865 from trunk at r30863 3821 3822*** Unicode version numbers 3823- makedata.mak 3824- uchar.h 3825 (configure.in & configure: have been modified to extract the version from uchar.h) 3826- com.ibm.icu.util.VersionInfo 3827- icutools/unicode/makedefs.sh 3828 + also review & update other definitions in that file, 3829 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l 3830 3831*** data files & enums & parser code 3832 3833* file preparation 3834 3835~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed 3836- This prepares both unidata and testdata files in respective output subfolders. 3837- Check test file diffs for previously commented-out, known-failing data lines; 3838 probably need to keep those commented out. 
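One way to check for previously commented-out, known-failing data lines is to compare the old test file with the freshly copied one. A rough sketch, assuming such lines were disabled by prefixing them with "#". The function name is hypothetical, and the actual test files may mark these lines differently.

```python
def reenabled_lines(old_lines, new_lines):
    """Return data lines that were commented out in the old file but are active in the new one.

    Assumption: a disabled data line is the original line prefixed with '#'.
    """
    # Collect every active (non-blank, non-comment) line of the new file.
    active_new = {
        line.strip()
        for line in new_lines
        if line.strip() and not line.lstrip().startswith("#")
    }
    found = []
    for line in old_lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            candidate = stripped.lstrip("#").strip()
            # Ordinary comments will not match any active data line.
            if candidate and candidate in active_new:
                found.append(candidate)
    return found


if __name__ == "__main__":
    # Made-up sample lines, not real test data.
    old = ["# 0041 0300;known failure", "0042;ok", "# plain comment"]
    new = ["0041 0300;known failure", "0042;ok"]
    for line in reenabled_lines(old, new):
        print("needs to be commented out again:", line)
```

Lines reported this way are candidates for being commented out again in the new copy.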
3839 3840* PropertyValueAliases.txt changes 3841- 11 new block names: 3842 Arabic_Extended_A 3843 Arabic_Mathematical_Alphabetic_Symbols 3844 Chakma 3845 Meetei_Mayek_Extensions 3846 Meroitic_Cursive 3847 Meroitic_Hieroglyphs 3848 Miao 3849 Sharada 3850 Sora_Sompeng 3851 Sundanese_Supplement 3852 Takri 3853 -> add to uchar.h 3854 -> add to UCharacter.UnicodeBlock IDs 3855 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 3856 replace public static final int \1_ID = \2; \3 3857 -> add to UCharacter.UnicodeBlock objects 3858 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 3859 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 3860- 1 new Joining_Group (jg) value: 3861 Rohingya_Yeh 3862 -> uchar.h & UCharacter.JoiningGroup 3863- 2 new Line_Break (lb) values: 3864 CJ=Conditional_Japanese_Starter 3865 HL=Hebrew_Letter 3866 -> uchar.h & UCharacter.LineBreak 3867- 7 new scripts: 3868 sc ; Cakm ; Chakma 3869 sc ; Merc ; Meroitic_Cursive 3870 sc ; Mero ; Meroitic_Hieroglyphs 3871 sc ; Plrd ; Miao 3872 sc ; Shrd ; Sharada 3873 sc ; Sora ; Sora_Sompeng 3874 sc ; Takr ; Takri 3875 -> remove these from SyntheticPropertyValueAliases.txt 3876 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 3877 and in com.ibm.icu.dev.test.lang.TestUScript.java 3878- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html 3879 (added 2011-06-21) 3880 Khoj 322 Khojki 3881 Tirh 326 Tirhuta 3882 and another one added 2011-12-09 3883 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs) 3884 -> uscript.h 3885 -> com.ibm.icu.lang.UScript 3886 find USCRIPT_([^ ]+) *= ([0-9]+),(.+) 3887 replace public static final int \1 = \2;\3 3888 -> SyntheticPropertyValueAliases.txt 3889 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() 3890 and in com.ibm.icu.dev.test.lang.TestUScript.java 3891 3892* UnicodeData.txt changes 3893- the last Unihan code point changes from U+9FCB to U+9FCC 3894 search 
for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive) 3895 + do change gennames.c 3896 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java 3897 3898* DerivedBidiClass.txt changes 3899- 2 new default-AL blocks: 3900# Arabic Extended-A: U+08A0 - U+08FF (was default-R) 3901# Arabic Mathematical Alphabetic Symbols: 3902# U+1EE00 - U+1EEFF (was default-R) 3903- 2 new default-R blocks: 3904# Meroitic Hieroglyphs: 3905# U+10980 - U+1099F 3906# Meroitic Cursive: U+109A0 - U+109FF 3907 -> should be picked up by the explicit data in the file 3908 3909* NameAliases.txt changes 3910- from 3911 # Each line has two fields 3912 # First field: Code point 3913 # Second field: Alias 3914- to 3915 # Each line has three fields, as described here: 3916 # 3917 # First field: Code point 3918 # Second field: Alias 3919 # Third field: Type 3920- Also, the file previously allowed multiple aliases but only now does it 3921 actually provide multiple, even multiple of the same type. For example, 3922 FEFF;BYTE ORDER MARK;alternate 3923 FEFF;BOM;abbreviation 3924 FEFF;ZWNBSP;abbreviation 3925- This breaks our gennames parser, unames.icu data structure, and API. 3926 Fix gennames to only pick up "correction" aliases. 3927 New ticket #8963 for further changes. 3928 3929* run genpname/preparse.pl (on Linux) 3930 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname 3931 + make sure that data.h is writable 3932 + perl preparse.pl ~/svn.icu/trunk/src > out.txt 3933 + preparse.pl shows no errors, out.txt Info and Warning lines look ok 3934 3935* build ICU (make install) 3936 so that the tools build can pick up the new definitions from the installed header files. 
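For the NameAliases.txt format change described above, picking up only "correction" aliases amounts to filtering on the new third field. A minimal sketch of that filter. This is not the gennames code (which is written in C), the function name is hypothetical, and the sample correction line is illustrative.

```python
def correction_aliases(lines):
    """Yield (code_point, alias) for three-field lines whose type is 'correction'."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) == 3 and fields[2] == "correction":
            yield int(fields[0], 16), fields[1]


if __name__ == "__main__":
    sample = [
        "# First field: Code point",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
        "01A2;LATIN CAPITAL LETTER GHA;correction",  # illustrative correction entry
    ]
    for code_point, alias in correction_aliases(sample):
        print(f"U+{code_point:04X} {alias}")
```

Because the filter keys on the type field, lines with several aliases for one code point, as in the FEFF example, are skipped unless one of them is a correction.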
3937* build Unicode tools (at least genpname) using CMake+make 3938 3939* run genpname 3940 (builds both pnames.icu and propname_data.h) 3941- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in 3942- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource 3943 3944* build ICU (make install) 3945* build Unicode tools using CMake+make 3946 3947* update source/data/unidata/norm2/nfkc_cf.txt 3948- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt 3949 3950* update source/data/unidata/norm2/uts46.txt 3951- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt 3952 to ~/svn.icu/tools/trunk/src/unicode/py 3953- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008". 3954- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py 3955- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 3956 3957* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 3958 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 3959- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 3960- Unicode 6.0..6.1: U+2260, U+226E, U+226F 3961- nothing new in 6.1, no test file to update 3962 3963* generate core properties data files 3964- in initial bootstrapping, change the UCA version 3965 in source/data/unidata/FractionalUCA.txt to match the new Unicode version 3966- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld 3967- rebuild ICU & tools 3968 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR, 3969 check if the UCA version in FractionalUCA.txt matches the new Unicode version 3970 (see step above) 3971- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm: 3972 
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
       find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
       replace public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
    + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
    + make sure that data.h is writable
    + perl preparse.pl ~/svn.icu/trunk/src > out.txt
    + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
       scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
       Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
       replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
  Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
       find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
       replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
    + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
    + make sure that data.h is writable
    + perl preparse.pl ~/svn.icu/trunk/src > out.txt
    + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
   to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
    http://www.macchiato.com/unicode/utc/additional-uca-files
    http://www.unicode.org/Public/UCA/6.0.0/
    http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
    ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties:
    Cased ; Cased
    CI ; Case_Ignorable
    CWCF ; Changes_When_Casefolded
    CWCM ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL ; Changes_When_Lowercased
    CWT ; Changes_When_Titlecased
    CWU ; Changes_When_Uppercased
  new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
  new block names
  new scripts
  one script code change:
    sc ; Qaai ; Inherited
    ->
    sc ; Zinh ; Inherited ; Qaai
  new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis
  new Joining_Group (jg) values: Farsi_Yeh, Nya
  other new values:
    ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
  new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
  all of the ISO comments are gone
  new CJK block end:
    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
  new CJK block:
    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
    + cd \svn\icuproj\icu\trunk\source\tools\genpname
    + make sure that data.h is writable
    + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
    + preparse.pl complains with errors like the following:
        Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
      This is because ICU 4.0 had scripts from ISO 15924 which are now
      added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
      and PropertyValueAliases.txt.
      -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
         Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
    + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
    + 26 new blocks
      copy new blocks from Blocks.txt
      MS VC++ 2008 regular expression:
        find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
        replace with " UBLOCK_\3 = 172, /*[\1]*/"
    + several new script values already added in ICU 4.0 for ISO 15924 coverage
      (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
    + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
    + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
      (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
    + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
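
The Blocks.txt find/replace step noted above (under "26 new blocks") could
also be scripted. The sketch below is only an illustration, not an ICU tool:
the function name ublock_lines is made up, the naming rule (uppercase, runs of
non-alphanumerics turned into "_") and the incrementing values are
assumptions, and the numbers it prints are placeholders that still need to be
checked against uchar.h.

```python
import re

# Hypothetical helper: turn Blocks.txt lines into uchar.h-style enum lines.
# Same pattern as the find expression above, in Python regex syntax.
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def ublock_lines(block_lines, first_value):
    """Return enum lines, numbering blocks from first_value (placeholder)."""
    out = []
    value = first_value
    for line in block_lines:
        m = BLOCK_LINE.match(line.strip())
        if m is None:
            continue  # comments and blank lines
        start, _end, name = m.groups()
        # Assumption: constant names are uppercase with "_" separators.
        constant = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
        out.append("    UBLOCK_%s = %d, /*[%s]*/" % (constant, value, start))
        value += 1
    return out


sample = [
    "# sample Blocks.txt lines",
    "A960..A97F; Hangul Jamo Extended-A",
    "D7B0..D7FF; Hangul Jamo Extended-B",
]
for line in ublock_lines(sample, 172):
    print(line)
```

Unlike the editor replace pattern, which writes the same number on every line,
this numbers the lines consecutively; either way the result needs a manual
review before it goes into uchar.h.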

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
    A960..A97F; Hangul Jamo Extended-A
  and in
    D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
  - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
    on Windows:
      python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
      python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice and
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
       find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
       replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
   "I think the tool has been modified to update @draft to @stable for
   older scripts and to add @draft for new scripts.
   (I worked with an intern on this last year.)
   You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
   "This is just a matter of making sure that all the per-script tables have
   entries for any new scripts that were added.
   If any new Indic characters were added, then the class tables in
   IndicClassTables.cpp should be updated to reflect this.
   John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
      Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
      N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  No code above 127 is yet the script code of an
  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
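
The round-trip item in the tests above can be sketched at the data level. This is only an illustration in Python, not the ICU test code: it assumes the "code; mirrored code # comment" layout of BidiMirroring.txt, uses four sample entries, and the helper names parse_mirroring and non_round_tripping are hypothetical.

```python
# Sample lines in the BidiMirroring.txt layout: "code; mirrored code # comment".
SAMPLE = """\
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
003C; 003E # LESS-THAN SIGN
003E; 003C # GREATER-THAN SIGN
"""


def parse_mirroring(text):
    """Return {code point: mirrored code point} from BidiMirroring.txt-style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        src, dst = (int(field.strip(), 16) for field in line.split(";"))
        mapping[src] = dst
    return mapping


def non_round_tripping(mapping):
    """Return the code points c for which mirror(mirror(c)) != c."""

    def mirror(c):
        # A code point without an entry maps to itself, which is how
        # u_charMirror() treats characters that have no mirror image.
        return mapping.get(c, c)

    return sorted(c for c in mapping if mirror(mirror(c)) != c)


mapping = parse_mirroring(SAMPLE)
failures = non_round_tripping(mapping)
```

An empty failures list means every listed pair maps back to its starting point; a one-directional entry such as {0x28: 0x29} alone would be reported.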

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
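
The two DerivedNormalizationProps.txt changes above can be illustrated with a small line parser that accepts both layouts. This is a sketch in Python, not the gennorm.c code. Only "FNC" -> "FC_NFKC" and "NFD_NO" -> "NFD_QC; N" come from the notes above; the other old quick-check names in the table are recalled from older UCD files and should be treated as assumptions, and parse_line is a hypothetical helper.

```python
# Old single-field quick-check names mapped to (property, value).
# Only the NFD_NO entry is confirmed by the notes above; the rest are assumptions.
OLD_QC = {
    "NFD_NO": ("NFD_QC", "N"),
    "NFC_NO": ("NFC_QC", "N"),
    "NFC_MAYBE": ("NFC_QC", "M"),
    "NFKD_NO": ("NFKD_QC", "N"),
    "NFKC_NO": ("NFKC_QC", "N"),
    "NFKC_MAYBE": ("NFKC_QC", "M"),
}

# Property names that changed without a change in field count.
RENAMED = {"FNC": "FC_NFKC"}


def parse_line(line):
    """Return (range, property, value), or None for comments and other lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) == 2 and fields[1] in OLD_QC:
        # Old layout: one combined field such as "NFD_NO".
        return (fields[0],) + OLD_QC[fields[1]]
    if len(fields) == 3:
        # New layout: property and value in separate fields.
        return fields[0], RENAMED.get(fields[1], fields[1]), fields[2]
    return None  # two-field binary properties are outside this sketch


old_style = parse_line("00C0..00C5 ; NFD_NO # L&")
new_style = parse_line("00C0..00C5 ; NFD_QC; N # L&")
```

Both sample lines normalize to the same tuple, so a parser written this way can read files from before and after the format change.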

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ