* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates
*
* For each new Unicode version, during the beta period,
* I copy the change log for the previous version to the top of this file.
* I adjust the versions, tickets, URLs, and paths.
* I work my way through the steps listed in the log, top to bottom,
* adjusting the log as necessary.
* I report problems to the UTC and/or CLDR and/or ICU.
* Before the data is final, I "turn the crank" several more times,
* using appropriate subsets of the steps.

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
until they are encoded in Unicode,
or can be assumed to be encoded in the next Unicode version.
Script enum constant names should follow the Unicode script property value aliases,
which are assigned only when the scripts are encoded.
When we add constants early and guess the names wrong, then we have confusing enum constants
and have sometimes added aliases.

Variant script codes like Latf and Aran that are not subject to separate encoding
can be added at any time.
(For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)

We add script codes used in CLDR or in the spoof checker.
This includes combination/alias codes like Hanb and Jamo.
See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html

We add special Z* script codes like Zsye.

For new script codes see http://www.unicode.org/iso15924/codechanges.html

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules.
    cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                     u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
       (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
       (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chorasmian                   ; Chorasmian
    blk; CJK_Ext_G                    ; CJK_Unified_Ideographs_Extension_G
    blk; Dives_Akuru                  ; Dives_Akuru
    blk; Khitan_Small_Script          ; Khitan_Small_Script
    blk; Lisu_Sup                     ; Lisu_Supplement
    blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
    blk; Tangut_Sup                   ; Tangut_Supplement
    blk; Yezidi                       ; Yezidi
    -> add to uchar.h before UBLOCK_COUNT
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Chrs                         ; Chorasmian
    sc ; Diak                         ; Dives_Akuru
    sc ; Kits                         ; Khitan_Small_Script
    sc ; Yezi                         ; Yezidi
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

    InPC; Top_And_Bottom_And_Left     ; Top_And_Bottom_And_Left
    -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format:
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
  ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
  ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
  cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
  $ICU_ROOT/dbg/tools/unicode/c$
  genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
  genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)
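
  Sketch for compiling the list of new constants for the notice (not part of the ICU tools):
  the new enum constants correspond to property values that are new in PropertyValueAliases.txt,
  so diffing the old and new files should yield the list. This Python sketch assumes the
  three-field "property; short name; long name" lines quoted in this log and skips any
  other line shape in the real file. The function names parse_aliases and new_values
  and the sample strings are hypothetical.

```python
def parse_aliases(text):
    """Map (property, short name) to long name for 'prop; short; long' lines."""
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) == 3:  # assumption: only the three-field form is handled
            prop, short_name, long_name = fields
            values[(prop, short_name)] = long_name
    return values


def new_values(old_text, new_text):
    """Return sorted (property, short, long) tuples present only in new_text."""
    old = parse_aliases(old_text)
    new = parse_aliases(new_text)
    return sorted(
        (prop, short_name, long_name)
        for (prop, short_name), long_name in new.items()
        if (prop, short_name) not in old
    )


# Hypothetical excerpts; real input would be the old and new PropertyValueAliases.txt.
OLD_SAMPLE = """\
blk; Wancho ; Wancho
sc ; Wcho ; Wancho
"""
NEW_SAMPLE = OLD_SAMPLE + """\
blk; Lisu_Sup ; Lisu_Supplement
sc ; Chrs ; Chorasmian
"""

if __name__ == "__main__":
    for prop, short_name, long_name in new_values(OLD_SAMPLE, NEW_SAMPLE):
        print(prop, short_name, long_name)
```

  In practice, read the two files from the previous and the current $UNICODE_DATA/ucd folders
  instead of using the inline samples.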

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

  $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

  $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
  ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
  ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
  cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
  cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
  (note removing the underscore before "Rules")
  cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
  meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
  cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
  cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9:  http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
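
  Sketch of a pre-configure check (not part of the ICU tools): before running "configure",
  it can help to confirm that uchar.h really carries the new version. This Python sketch
  assumes the version is defined by a line of the form #define U_UNICODE_VERSION "...";
  the helper names unicode_version_in_header and check_version and the sample header
  text are hypothetical.

```python
import re

# Assumption: uchar.h defines the version as  #define U_UNICODE_VERSION "12.0"
_VERSION_DEFINE = re.compile(
    r'^[ \t]*#[ \t]*define[ \t]+U_UNICODE_VERSION[ \t]+"([^"]+)"',
    re.MULTILINE,
)


def unicode_version_in_header(header_text):
    """Return the version string from the #define, or None if it is missing."""
    match = _VERSION_DEFINE.search(header_text)
    return match.group(1) if match else None


def check_version(header_text, expected):
    """Raise ValueError unless the header declares the expected Unicode version."""
    found = unicode_version_in_header(header_text)
    if found != expected:
        raise ValueError(
            "uchar.h declares Unicode version %r, expected %r" % (found, expected))
    return found


# Hypothetical stand-in for the contents of uchar.h.
SAMPLE_HEADER = '/* sample header text */\n#define U_UNICODE_VERSION "12.0"\n'

if __name__ == "__main__":
    print(check_version(SAMPLE_HEADER, "12.0"))
```

  For a real check, pass the text of $ICU_SRC/icu4c/source/common/unicode/uchar.h
  instead of the sample string.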
680 681*** data files & enums & parser code 682 683* download files 684- mkdir -p $UNICODE_DATA 685- download Unicode files into $UNICODE_DATA 686 + subfolders: emoji, idna, security, ucd, uca 687 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 688 689* for manual diffs and for Unicode Tools input data updates: 690 remove version suffixes from the file names 691 ~$ unidata/desuffixucd.py $UNICODE_DATA 692 (see https://sites.google.com/site/unicodetools/inputdata) 693 694* process and/or copy files 695- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 696 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 697 + For debugging, and tweaking how ppucd.txt is written, 698 the tool has an --only_ppucd option: 699 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 700 701- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 702 703* build ICU (make install) 704 so that the tools build can pick up the new definitions from the installed header files. 

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
                     u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
                     u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
       (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic                            ; Elymaic
    blk; Nandinagari                        ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong             ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers              ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext                     ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A      ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup                          ; Tamil_Supplement
    blk; Wancho                             ; Wancho
    -> add to uchar.h
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$
py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 746 747* update spoof checker UnicodeSet initializers: 748 inclusionPat & recommendedPat in uspoof.cpp 749 INCLUSION & RECOMMENDED in SpoofChecker.java 750- make sure that the Unicode Tools tree contains the latest security data files 751- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 752- update the hardcoded version number there in the DIRECTORY path 753- run the tool (no special environment variables needed) 754- copy & paste from the Console output into the .cpp & .java files 755 756* generate normalization data files 757 cd $ICU_ROOT/dbg/icu4c 758 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 759 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 760 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 761 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 762 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 763 764* build ICU (make install) 765 so that the tools build can pick up the new definitions from the installed header files. 766 767 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date 768 769* build Unicode tools using CMake+make 770 771$ICU_SRC/tools/unicode/c/icudefs.txt: 772 773# Location (--prefix) of where ICU was installed. 774set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 775# Location of the ICU4C source tree. 
776set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) 777 778 $ICU_ROOT/dbg$ 779 mkdir -p tools/unicode/c 780 cd tools/unicode/c 781 782 $ICU_ROOT/dbg/tools/unicode/c$ 783 cmake ../../../../src/tools/unicode/c 784 make 785 786* generate core properties data files 787 $ICU_ROOT/dbg/tools/unicode/c$ 788 genprops/genprops $ICU_SRC/icu4c 789 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ 790 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 791- rebuild ICU (make install) & tools 792 793* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 794 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 795- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 796- Unicode 6.0..12.0: U+2260, U+226E, U+226F 797- nothing new in this Unicode version, no test file to update 798 799* run & fix ICU4C tests 800- update test of default bidi classes: 801 Bidi range \U0001ED00-\U0001ED4F changes default from R to AL, 802 see diffs in DerivedBidiClass.txt 803 + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[] 804 + UCharacterTest.java TestIteration() defaultBidi[] 805- Andy handles RBBI & spoof check test failures 806 807* collation: CLDR collation root, UCA DUCET 808 809- UCA DUCET goes into Mark's Unicode tools, see 810 https://sites.google.com/site/unicodetools/home#TOC-UCA 811 diff the main mapping file, look for bad changes 812 (for example, more bytes per weight for common characters) 813 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt 814 ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt 815 816- CLDR root data files are checked into $CLDR_SRC/common/uca/ 817 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ 818 819- update source/data/unidata/FractionalUCA.txt with 
FractionalUCA_SHORT.txt 820 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt 821- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 822 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt 823 (note removing the underscore before "Rules") 824 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt 825- restore TODO diffs in UCARules.txt 826 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt 827- update (ICU4C)/source/test/testdata/CollationTest_*.txt 828 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 829 from the CLDR root files (..._CLDR_..._SHORT.txt) 830 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 831 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 832 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data 833- if CLDR common/uca/unihan-index.txt changes, then update 834 CLDR common/collation/root.xml <collation type="private-unihan"> 835 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt 836 837- run genuca, see command line above; 838 deal with 839 Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: 840 FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible) 841 (add the character to genuca.cpp sampleCharsToScripts[]) 842 + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script) 843 and cache its values. 844 Works as long as the script metadata is updated before the collation data. 
845- rebuild ICU4C 846 847* Unihan collators 848 https://sites.google.com/site/unicodetools/unihan 849- run Unicode Tools 850 org.unicode.draft.GenerateUnihanCollators 851 with VM arguments 852 -ea 853 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk 854 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools 855 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data 856 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 857 -DUVERSION=12.0.0 858- run Unicode Tools 859 org.unicode.draft.GenerateUnihanCollatorFiles 860 with the same arguments 861- check CLDR diffs 862 cd $CLDR_SRC 863 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml 864 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml 865- copy to CLDR 866 cd $CLDR_SRC 867 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml 868 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml 869- run CLDR unit tests, commit to CLDR 870- generate ICU zh collation data: run CLDR 871 org.unicode.cldr.icu.NewLdml2IcuConverter 872 with program arguments 873 -t collation 874 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation 875 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental 876 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll 877 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation 878 zh 879 and VM arguments 880 -ea 881 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni 882- rebuild ICU4C 883 884* run & fix ICU4C tests, now with new CLDR collation root data 885- run all tests with the collation test data *_SHORT.txt or the full files 886 (the full ones have comments, useful for debugging) 887- note on intltest: if collate/UCAConformanceTest fails, then 888 utility/MultithreadTest/TestCollators will fail as well; 889 fix the conformance test before looking into the multi-thread test 890 891* update Java data files 892- 
refresh just the UCD/UCA-related/derived files, just to be safe 893- see (ICU4C)/source/data/icu4j-readme.txt 894- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 895- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 896 output: 897 ... 898 Unicode .icu files built to ./out/build/icudt63l 899 echo timestamp > uni-core-data 900 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b 901 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b 902 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt 903 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b 904 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b" 905 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/ 906 mkdir -p /tmp/icu4j/main/shared/data 907 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data 908 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/ 909 mkdir -p /tmp/icu4j/main/shared/data 910 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data 911 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' 912- copy the big-endian Unicode data files to another location, 913 separate from the other data files, 914 and then refresh ICU4J 915 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j 916 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 917 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 918 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 919 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu 
/tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 920 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu 921 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT 922 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll 923 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 924 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT 925 926* When refreshing all of ICU4J data from ICU4C 927- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 928- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data 929or 930- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install 931 932* update CollationFCD.java 933 + copy & paste the initializers of lcccIndex[] etc. from 934 ICU4C/source/i18n/collationfcd.cpp to 935 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java 936 937* refresh Java test .txt files 938- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 939 cd $ICU_SRC/icu4c/source/data/unidata 940 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 941 cd ../../test/testdata 942 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 943 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode 944 945* run & fix ICU4J tests 946 947*** API additions 948- send notice to icu-design about new born-@stable API (enum constants etc.) 
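The "refresh Java test .txt files" step above is a plain copy, but it can help to see which files actually changed. The following is a minimal sketch of that idea using only the Python standard library; refresh_files() is a hypothetical helper, not part of the ICU tools, and the directories and file names are whatever the caller passes in.

```python
import filecmp
import shutil
from pathlib import Path


def refresh_files(src_dir, dst_dir, names):
    """Copy each named file from src_dir to dst_dir.

    Returns the names that were new or whose content differed,
    so that unchanged files are easy to tell apart.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    changed = []
    for name in names:
        src, dst = src_dir / name, dst_dir / name
        # shallow=False compares file contents, not just size and mtime.
        if not dst.exists() or not filecmp.cmp(src, dst, shallow=False):
            shutil.copyfile(src, dst)
            changed.append(name)
    return changed
```

For example, calling it with the unidata folder, the ICU4J data/unicode test folder, and the list of file names from the cp command above would report only the files that differ from the previous Unicode version.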

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
    Unicode 12: using Unicode 12 CLDR ticket #11478
        hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
        wcho 1E2F0..1E2F9 Wancho
    Unicode 11: using Unicode 11 CLDR ticket #10978
        rohg 10D30..10D39 Hanifi_Rohingya
        gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
        Unicode 10: http://unicode.org/cldr/trac/ticket/10219
        Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

ICU 63 addition of ICU support of text layout properties InPC, InSC, vo

* Command-line environment setup

UNICODE_DATA=~/unidata/uni11/20180609
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/mine
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt62b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Links

https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
https://unicode-org.atlassian.net/browse/ICU-12850 vo

*** data files & enums & parser code

* API additions
- for each of the three new enumerated properties
  + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
  + uchar.h: update UCHAR_INT_LIMIT
  + uchar.h: add the enum U<long prop name>
    with constants U_<short prop name>_<long value name>
  + UProperty.java: add the constant <long prop name>
  +
UProperty.java: update INT_LIMIT 1001 + UCharacter.java: add the interface <long prop name> 1002 with constants <long value name> 1003 1004* process and/or copy files 1005- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1006 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1007 + It also writes tools/unicode/c/genprops/pnames_data.h with property and value 1008 names and aliases. 1009 + For debugging, and tweaking how ppucd.txt is written, 1010 the tool has an --only_ppucd option: 1011 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1012 1013* preparseucd.py changes 1014- add new property short names (uppercase) to _prop_and_value_re 1015 so that ParseUCharHeader() parses the new enum constants 1016 1017* build ICU (make install) 1018 so that the tools build can pick up the new definitions from the installed header files. 1019 1020 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1021 1022* build Unicode tools using CMake+make 1023 1024$ICU_SRC/tools/unicode/c/icudefs.txt: 1025 1026# Location (--prefix) of where ICU was installed. 1027set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) 1028# Location of the ICU4C source tree. 
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* write data for runtime, hardcoded for now
- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
- generate new icu4c/source/common/ulayout_props_data.h
- for each of the three new enumerated properties
  + int property max value
  + small, 8-bit UCPTrie
    (A small 16-bit trie with bit fields for these three properties
    is very nearly the same size as the sum of the three.)

* wire into C++
- uprops.cpp: #include ulayout_props_data.h
- uprops.cpp: add getInPC() etc. functions
- uprops.cpp: add lines to intProps[], include max values
- uprops.h: add UPropertySource constants
- uprops.cpp: add uprops_addPropertyStarts(src)
- uniset_props.cpp: add to UnicodeSet_initInclusion()
- intltest/ucdtest.cpp: write unit tests

* update Java data files
- refresh just the pnames.icu file with the new property [value] names, just to be safe
- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* wire into Java
- UCharacterProperty.java: add new SRC_INPC etc.
constants as in C++ 1076- UCharacterProperty.java: for each new property 1077 + create a nested class to hold its CodePointTrie 1078 + initialize it from a string literal 1079 + paste in the initializer printed by genprops 1080 + add a new IntProperty object to the intProps[] array 1081 + use the correct max int value for each property, also printed by genprops 1082- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) 1083- UnicodeSet.java: add to getInclusions() 1084- UCharacterTest.java: write unit tests 1085 1086---------------------------------------------------------------------------- *** 1087 1088Unicode 11.0 update for ICU 62 1089 1090http://www.unicode.org/versions/Unicode11.0.0/ 1091http://unicode.org/versions/beta-11.0.0.html 1092https://www.unicode.org/review/pri372/ 1093http://www.unicode.org/reports/uax-proposed-updates.html 1094http://www.unicode.org/reports/tr44/tr44-21.html 1095 1096* Command-line environment setup 1097 1098UNICODE_DATA=~/unidata/uni11/20180521 1099CLDR_SRC=~/svn.cldr/uni 1100ICU_ROOT=~/svn.icu/uni 1101ICU_SRC=$ICU_ROOT/src 1102ICUDT=icudt61b 1103ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in 1104ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata 1105export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib 1106 1107*** ICU Trac 1108 1109- ticket:13630: Unicode 11 1110- ^/branches/markus/uni11 1111 1112*** CLDR Trac 1113 1114- cldrbug 10978: Unicode 11 1115- ^/branches/markus/uni11 1116 1117*** Unicode version numbers 1118- makedata.mak 1119- uchar.h 1120- com.ibm.icu.util.VersionInfo 1121- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 1122 1123- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h 1124 so that the makefiles see the new version number. 
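Most variables in the command-line environment setup above follow from ICU_ROOT and from the data version that is currently in the tree. The following is a minimal sketch of that derivation; derive_paths() and icudt_name() are hypothetical helpers for illustration, and the home directory in the usage example is made up. The data folder name is "icudt", then the ICU major version, then "b" for big-endian or "l" for little-endian, as in the icudt61b and icudt61l names in the make output in this log.

```python
import posixpath


def icudt_name(icu_major, big_endian=True):
    """Build a data folder name such as icudt61b (big-endian) or icudt61l."""
    return "icudt%d%s" % (icu_major, "b" if big_endian else "l")


def derive_paths(icu_root, icu_major):
    """Derive the setup variables that depend only on ICU_ROOT and the version.

    icu_major is the data version currently in the tree, which during a
    Unicode update can still be the previous ICU version.
    """
    icu_src = posixpath.join(icu_root, "src")
    data = posixpath.join(icu_src, "icu4c", "source", "data")
    return {
        "ICU_SRC": icu_src,
        "ICUDT": icudt_name(icu_major),
        "ICU4C_DATA_IN": posixpath.join(data, "in"),
        "ICU4C_UNIDATA": posixpath.join(data, "unidata"),
    }


if __name__ == "__main__":
    # Hypothetical root directory, for illustration only.
    for key, value in sorted(derive_paths("/home/me/svn.icu/uni", 61).items()):
        print("%s=%s" % (key, value))
```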
1125 1126*** data files & enums & parser code 1127 1128* download files 1129- mkdir -p $UNICODE_DATA 1130- download Unicode files into $UNICODE_DATA 1131 + subfolders: emoji, idna, security, ucd, uca 1132 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip 1133 1134* for manual diffs and for Unicode Tools input data updates: 1135 remove version suffixes from the file names 1136 ~$ unidata/desuffixucd.py $UNICODE_DATA 1137 (see https://sites.google.com/site/unicodetools/inputdata) 1138 1139* process and/or copy files 1140- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC 1141 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1142 + For debugging, and tweaking how ppucd.txt is written, 1143 the tool has an --only_ppucd option: 1144 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile 1145 1146- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA 1147 1148* build ICU (make install) 1149 so that the tools build can pick up the new definitions from the installed header files. 
1150 1151 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1152 1153* preparseucd.py changes 1154- fix other errors 1155 NameError: unknown property Extended_Pictographic 1156 -> add Extended_Pictographic binary property 1157 -> add new short names for all Emoji properties 1158 1159* new constants for new property values 1160- preparseucd.py error: 1161 ValueError: missing uchar.h enum constants for some property values: 1162 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar', 1163 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals', 1164 u'Indic_Siyaq_Numbers'])), 1165 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])), 1166 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])), 1167 (u'GCB', set([u'LinkC', u'Virama'])), 1168 (u'WB', set([u'WSegSpace']))] 1169 = PropertyValueAliases.txt new property values (diff old & new .txt files) 1170 blk; Chess_Symbols ; Chess_Symbols 1171 blk; Dogra ; Dogra 1172 blk; Georgian_Ext ; Georgian_Extended 1173 blk; Gunjala_Gondi ; Gunjala_Gondi 1174 blk; Hanifi_Rohingya ; Hanifi_Rohingya 1175 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers 1176 blk; Makasar ; Makasar 1177 blk; Mayan_Numerals ; Mayan_Numerals 1178 blk; Medefaidrin ; Medefaidrin 1179 blk; Old_Sogdian ; Old_Sogdian 1180 blk; Sogdian ; Sogdian 1181 -> add to uchar.h 1182 use long property names for enum constants, 1183 for the trailing comment get the block start code point: diff old & new Blocks.txt 1184 -> add to UCharacter.UnicodeBlock IDs 1185 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) 1186 replace public static final int \1_ID = \2; \3 1187 -> add to UCharacter.UnicodeBlock objects 1188 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) 1189 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 1190 1191 GCB; LinkC ; LinkingConsonant 1192 GCB; Virama ; Virama 1193 -> uchar.h & 
UCharacter.GraphemeClusterBreak 1194 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76 1195 1196 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed 1197 -> ignore: ICU does not yet support this property 1198 1199 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya 1200 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa 1201 -> uchar.h & UCharacter.JoiningGroup 1202 1203 sc ; Dogr ; Dogra 1204 sc ; Gong ; Gunjala_Gondi 1205 sc ; Maka ; Makasar 1206 sc ; Medf ; Medefaidrin 1207 sc ; Rohg ; Hanifi_Rohingya 1208 sc ; Sogd ; Sogdian 1209 sc ; Sogo ; Old_Sogdian 1210 -> uscript.h & com.ibm.icu.lang.UScript 1211 -> Nushu had been added already 1212 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() 1213 and in com.ibm.icu.dev.test.lang.TestUScript.java 1214 1215 WB ; WSegSpace ; WSegSpace 1216 -> uchar.h & UCharacter.WordBreak 1217 1218* New short names for emoji properties 1219- see UTS #51 1220- short names set in preparseucd.py 1221 1222* New properties 1223- boolean emoji property Extended_Pictographic 1224 -> added in preparseucd.py 1225 -> uchar.h & UProperty.java 1226- misc. 
property Equivalent_Unified_Ideograph (EqUIdeo) 1227 as shown in PropertyValueAliases.txt 1228 -> ignore for now 1229 1230* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata 1231 (not strictly necessary for NOT_ENCODED scripts) 1232 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt 1233 1234* update spoof checker UnicodeSet initializers: 1235 inclusionPat & recommendedPat in uspoof.cpp 1236 INCLUSION & RECOMMENDED in SpoofChecker.java 1237- make sure that the Unicode Tools tree contains the latest security data files 1238- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator 1239- update the hardcoded version number there in the DIRECTORY path 1240- run the tool (no special environment variables needed) 1241- copy & paste from the Console output into the .cpp & .java files 1242 1243* generate normalization data files 1244 cd $ICU_ROOT/dbg/icu4c 1245 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource 1246 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt 1247 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt 1248 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt 1249 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt 1250 1251* build ICU (make install) 1252 so that the tools build can pick up the new definitions from the installed header files. 1253 1254 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date 1255 1256* build Unicode tools using CMake+make 1257 1258$ICU_SRC/tools/unicode/c/icudefs.txt: 1259 1260# Location (--prefix) of where ICU was installed. 1261set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) 1262# Location of the ICU4C source tree. 
1263set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c) 1264 1265 $ICU_ROOT/dbg$ 1266 mkdir -p tools/unicode/c 1267 cd tools/unicode/c 1268 1269 $ICU_ROOT/dbg/tools/unicode/c$ 1270 cmake ../../../../src/tools/unicode/c 1271 make 1272 1273* generate core properties data files 1274 $ICU_ROOT/dbg/tools/unicode/c$ 1275 genprops/genprops $ICU_SRC/icu4c 1276 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c 1277 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c 1278- rebuild ICU (make install) & tools 1279 1280* Fix case props 1281 genprops error: casepropsbuilder: too many exceptions words 1282 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR 1283- With the addition of Georgian Mtavruli capital letters, 1284 there are now too many simple case mappings with big mapping deltas 1285 that yield uncompressible exceptions. 1286- Changing the data structure (now formatVersion 4), 1287 adding one bit for no-simple-case-folding (for Cherokee), and 1288 one optional slot for a big delta (for most faraway mappings), 1289 together with another bit for whether that is negative. 1290 This makes most Cherokee & Georgian etc. case mappings compressible, 1291 reducing the number of exceptions words. 1292- Further changes to gain one more bit for the exceptions index, 1293 for future growth. Details see casepropsbuilder.cpp. 
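The idea behind the big-delta slot described above can be sketched in a few lines. This is only an illustration of the technique (small deltas stay inline, faraway mappings store a magnitude plus a sign bit), not the actual formatVersion 4 layout; the inline range below is an assumed value, and both function names are hypothetical. See casepropsbuilder.cpp for the real data structure.

```python
# Assumption for illustration: inline deltas fit a 9-bit signed field.
INLINE_MIN, INLINE_MAX = -0x100, 0xFF


def encode_simple_mapping(cp, mapped):
    """Encode a simple case mapping as an inline delta or a big-delta slot."""
    delta = mapped - cp
    if INLINE_MIN <= delta <= INLINE_MAX:
        return ("inline", delta)
    # Big delta: store the magnitude, plus one bit for whether it is negative.
    return ("exception", abs(delta), delta < 0)


def decode_simple_mapping(cp, encoded):
    """Recover the mapped code point from either encoding."""
    if encoded[0] == "inline":
        return cp + encoded[1]
    _, magnitude, negative = encoded
    return cp - magnitude if negative else cp + magnitude


if __name__ == "__main__":
    # Georgian Mkhedruli U+10D0 uppercases to Mtavruli U+1C90 (delta 0xBC0).
    print(encode_simple_mapping(0x10D0, 0x1C90))
    # Cherokee small letter U+AB70 uppercases to U+13A0 (large negative delta).
    print(encode_simple_mapping(0xAB70, 0x13A0))
```

Both sample mappings are too far away for an inline delta, but each needs only one slot this way.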
1294 1295* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 1296 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 1297- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 1298- Unicode 6.0..11.0: U+2260, U+226E, U+226F 1299- nothing new in this Unicode version, no test file to update 1300 1301* run & fix ICU4C tests 1302- Andy handles RBBI & spoof check test failures 1303 1304- Errors in char.txt, word.txt, word_POSIX.txt like 1305 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16 1306 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty. 1307 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them 1308 not empty, just to get ICU building. 1309 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables 1310 and properties together with the rules that used them (GB 10, WB 14). 1311 -> Andy adjusts the rule sets further to sync with 1312 Unicode 11 grapheme, word, and line break spec changes. 
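For the uts46 step above (grep for "disallowed_STD3_valid" on non-ASCII characters), the following is a minimal sketch of the same check in Python. It assumes the usual semicolon-separated layout of IdnaMappingTable.txt (code point or range, then status, with "#" comments); the function name is hypothetical and the sample lines in the usage example are illustrative, not copied from the data file.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start_cp = int(start, 16)
        end_cp = int(end, 16) if end else start_cp
        if end_cp > 0x7F:
            # Clip ranges that start in ASCII to their non-ASCII part.
            found.append((max(start_cp, 0x80), end_cp))
    return found


if __name__ == "__main__":
    sample = [
        "003D ; disallowed_STD3_valid # EQUALS SIGN",
        "2260 ; disallowed_STD3_valid # NOT EQUAL TO",
        "226E..226F ; disallowed_STD3_valid # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print("U+%04X..U+%04X" % (start, end))
```

Any range reported beyond the ones already listed above would be a candidate for new test cases in uts46test.cpp and UTS46Test.java.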

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
    ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
    FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
  (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=11.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt61l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 10.0 update for ICU 60

http://www.unicode.org/versions/Unicode10.0.0/
http://www.unicode.org/versions/beta-10.0.0.html
http://blog.unicode.org/2017/03/unicode-100-beta-review.html
http://www.unicode.org/review/pri350/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-19.html

* Command-line environment setup

UNICODE_DATA=~/unidata/uni10/20170605
CLDR_SRC=~/svn.cldr/uni10
ICU_ROOT=~/svn.icu/uni10
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt60b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** ICU Trac

- ticket:12985: Unicode 10
- ticket:13061: undo hacks from emoji 5.0 update
- ticket:13062: add Emoji_Component property
- ^/branches/markus/uni10

*** CLDR Trac

- cldrbug 10055: Unicode 10
- cldrbug 9882: Unicode 10 script metadata
- cldrbug 10219: numbering systems for Unicode 10

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
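
A quick sanity check before re-running "configure" could look like the following.
This is only a minimal sketch, not part of the ICU tools: it assumes that uchar.h
carries the version as a line of the form  #define U_UNICODE_VERSION "10.0",
and the function names are made up for illustration.

```python
import re

# Assumption: uchar.h contains a line like  #define U_UNICODE_VERSION "10.0"
_UNICODE_VERSION_RE = re.compile(
    r'^[ \t]*#[ \t]*define[ \t]+U_UNICODE_VERSION[ \t]+"([0-9.]+)"',
    re.MULTILINE)


def unicode_version_in_header(header_text):
    """Return the version string from the #define, or None if it is absent."""
    match = _UNICODE_VERSION_RE.search(header_text)
    return match.group(1) if match else None


def header_is_updated(header_text, expected_version):
    """True if the header text already names the Unicode version being integrated."""
    return unicode_version_in_header(header_text) == expected_version
```

Possible use: read $ICU_SRC/icu4c/source/common/unicode/uchar.h into a string and
call header_is_updated(text, "10.0") before running configure.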

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode 10.0 files into $UNICODE_DATA
  + subfolders: ucd, uca, idna, security
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- download emoji 5.0 files into $UNICODE_DATA/emoji

* for manual diffs: remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
    (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
  -> adjust _scripts_only_in_iso15924 as indicated
- fix other errors
    Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
  -> add vo=Vertical_Orientation to _ignored_properties
  -> later removed again, parsing the file, even though we do not yet store data for runtime use

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
                   u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
     (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
                  u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
                  u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
     (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; CJK_Ext_F        ; CJK_Unified_Ideographs_Extension_F
    blk; Kana_Ext_A       ; Kana_Extended_A
    blk; Masaram_Gondi    ; Masaram_Gondi
    blk; Nushu            ; Nushu
    blk; Soyombo          ; Soyombo
    blk; Syriac_Sup       ; Syriac_Supplement
    blk; Zanabazar_Square ; Zanabazar_Square
  -> add to uchar.h
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    jg ; Malayalam_Bha  ; Malayalam_Bha
    jg ; Malayalam_Ja   ; Malayalam_Ja
    jg ; Malayalam_Lla  ; Malayalam_Lla
    jg ; Malayalam_Llla ; Malayalam_Llla
    jg ; Malayalam_Nga  ; Malayalam_Nga
    jg ; Malayalam_Nna  ; Malayalam_Nna
    jg ; Malayalam_Nnna ; Malayalam_Nnna
    jg ; Malayalam_Nya  ; Malayalam_Nya
    jg ; Malayalam_Ra   ; Malayalam_Ra
    jg ; Malayalam_Ssa  ; Malayalam_Ssa
    jg ; Malayalam_Tta  ; Malayalam_Tta
  -> uchar.h & UCharacter.JoiningGroup

    sc ; Gonm ; Masaram_Gondi
    sc ; Nshu ; Nushu
    sc ; Soyo ; Soyombo
    sc ; Zanb ; Zanabazar_Square
  -> uscript.h & com.ibm.icu.lang.UScript
  -> Nushu had been added already
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* New properties as shown in PropertyValueAliases.txt changes
- boolean Emoji_Component from emoji 5
  -> uchar.h & UProperty.java
- boolean
    # Regional_Indicator (RI)

    RI ; N ; No  ; F ; False
    RI ; Y ; Yes ; T ; True
  -> uchar.h & UProperty.java
  -> single immutable range, to be hardcoded
- boolean
    # Prepended_Concatenation_Mark (PCM)

    PCM; N ; No  ; F ; False
    PCM; Y ; Yes ; T ; True
  -> was new in Unicode 9
  -> uchar.h & UProperty.java
- enumerated
    # Vertical_Orientation (vo)

    vo ; R  ; Rotated
    vo ; Tr ; Transformed_Rotated
    vo ; Tu ; Transformed_Upright
    vo ; U  ; Upright
  -> only pre-parsed for now, but not yet stored for runtime use

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..10.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
    FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
  (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
    -DUVERSION=10.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
    -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt60l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
  Unicode 10: http://unicode.org/cldr/trac/ticket/10219
  Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Emoji 5.0 update for ICU 59
- ICU 59 mostly remains on Unicode 9.0
- except updates bidi and segmentation data to Unicode 10 beta

First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
ICUDT=icudt59b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

*** ICU Trac

- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
- changes directly on trunk

*** data files & enums & parser code

* download files

- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
- download emoji 5.0 beta files into the same uni90e50 folder
- download Unicode 10.0 beta files: ucd
  + copy Unicode 10 bidi files to the uni90e50/ucd folder:
      BidiBrackets.txt
      BidiCharacterTest.txt
      BidiMirroring.txt
      BidiTest.txt
      extracted/DerivedBidiClass.txt
  + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
      LineBreak.txt
      auxiliary/*

* preparseucd.py changes
- adjust for combined trunks
- write new copyright lines
- ignore new Emoji_Component property for now

* process and/or copy files
- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)

    ~/svn.icu/trunk/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    ~/svn.icu/trunk/dbg/tools/unicode/c$
    genprops/genprops $ICU4C_SRC_DIR
- rebuild ICU (make install) & tools

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt59l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
or
- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU4C_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

---------------------------------------------------------------------------- ***

Unicode 9.0 update for ICU 58

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
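
For the manual-diff step listed above (removing version suffixes from the file names),
the renaming can be sketched as follows. This is not the desuffixucd.py tool itself;
it is a minimal illustration which assumes that the suffix looks like -9.0.0 or
-9.0.0d3 directly before the file extension. The function name is hypothetical.

```python
import re

# Assumption: versioned names look like UnicodeData-9.0.0.txt or UnicodeData-9.0.0d3.txt.
# The lookahead keeps the file extension in place.
_VERSION_SUFFIX_RE = re.compile(r'-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)')


def strip_version_suffix(file_name):
    """Return file_name without a -x.y.z or -x.y.zdN version suffix, if it has one."""
    return _VERSION_SUFFIX_RE.sub('', file_name)
```

Names without such a suffix come back unchanged, so the function can be applied to
every file in the folder before renaming.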

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025   ; ; 1/40  # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375  ; ; 3/80  # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05    ; ; 1/20  # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15    ; ; 3/20  # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
  in ppucd.txt:
    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
  ->
    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.
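
New aliases like the one above are found by diffing the old and new alias files.
A minimal sketch of such a diff, assuming only the usual UCD conventions
(semicolon-separated fields, "#" starts a comment); the function names are made up
here and are not part of preparseucd.py:

```python
def parse_alias_lines(lines):
    """Return a set of field tuples, ignoring comments and blank lines."""
    entries = set()
    for line in lines:
        data = line.split('#', 1)[0].strip()  # drop trailing or whole-line comments
        if not data:
            continue
        entries.add(tuple(field.strip() for field in data.split(';')))
    return entries


def new_alias_entries(old_lines, new_lines):
    """Return the sorted entries that are in new_lines but not in old_lines."""
    return sorted(parse_alias_lines(new_lines) - parse_alias_lines(old_lines))
```

The same helper works for PropertyAliases.txt and PropertyValueAliases.txt, since
it compares whole field tuples and does not interpret them.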

* PropertyValueAliases.txt new property values
  blk; Adlam ; Adlam
  blk; Bhaiksuki ; Bhaiksuki
  blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
  blk; Glagolitic_Sup ; Glagolitic_Supplement
  blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
  blk; Marchen ; Marchen
  blk; Mongolian_Sup ; Mongolian_Supplement
  blk; Newa ; Newa
  blk; Osage ; Osage
  blk; Tangut ; Tangut
  blk; Tangut_Components ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

  GCB; EB ; E_Base
  GCB; EBG ; E_Base_GAZ
  GCB; EM ; E_Modifier
  GCB; GAZ ; Glue_After_Zwj
  GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

  jg ; African_Feh ; African_Feh
  jg ; African_Noon ; African_Noon
  jg ; African_Qaf ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

  lb ; EB ; E_Base
  lb ; EM ; E_Modifier
  lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

  sc ; Adlm ; Adlam
  sc ; Bhks ; Bhaiksuki
  sc ; Marc ; Marchen
  sc ; Newa ; Newa
  sc ; Osge ; Osage
  sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

  WB ; EB ; E_Base
  WB ; EBG ; E_Base_GAZ
  WB ; EM ; E_Modifier
  WB ; GAZ ; Glue_After_Zwj
  WB ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
    cmake ../../../src/unicode/c
    make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
    genprops/genprops $ICU_SRC_DIR
    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
    FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
  (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
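
The find/replace pair for com.ibm.icu.lang.UScript quoted in this entry
can also be applied outside an editor. The following Python sketch uses
the same two patterns with the standard re module; the function name
c_enum_to_java and the sample input line are hypothetical and only
illustrate the transformation.

```python
import re

# The search and replacement patterns quoted in the log entry above.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def c_enum_to_java(line):
    """Rewrite one uscript.h enum line as a UScript.java constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


# Hypothetical input line; the name and value are made up for illustration.
sample = "USCRIPT_EXAMPLE_SCRIPT = 999,/* Exmp */"
print(c_enum_to_java(sample))
```

Because the pattern requires text after the comma, an enum line without
a trailing comment is left as it is and needs manual editing.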

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt57b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
see http://bugs.icu-project.org/trac/ticket/12141

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

---------------------------------------------------------------------------- ***

Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802

Edit preparseucd.py to add & parse new properties.
They share the UCD property namespace but are not listed in PropertyAliases.txt.

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.
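
As a rough illustration of what parsing the emoji properties involves,
here is a minimal Python sketch. It is not the code added to
preparseucd.py. It assumes that emoji-data.txt uses the usual UCD-style
layout (code point or range, semicolon, property name, optional
# comment); the function name parse_emoji_data and the sample lines are
hypothetical.

```python
from collections import defaultdict


def parse_emoji_data(lines):
    """Map each binary property name to the set of code points that have it."""
    props = defaultdict(set)
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not data:
            continue  # blank or comment-only line
        cp_range, prop = (field.strip() for field in data.split(";"))
        start, _, end = cp_range.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        props[prop].update(range(first, last + 1))
    return props


# Illustrative lines in the assumed layout, not copied from a real data file.
sample = [
    "# comment line",
    "0023          ; Emoji              # NUMBER SIGN",
    "1F3FB..1F3FF  ; Emoji_Modifier     # skin tone modifiers",
]
result = parse_emoji_data(sample)
print(sorted(result))
```

Keeping one set per property name mirrors the note above that these
properties share the UCD property namespace even though
PropertyAliases.txt does not list them.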

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
    IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (unreduced fractions)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to reduced fractions (e.g., 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
    blk; Ahom ; Ahom
    blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
    blk; Cherokee_Sup ; Cherokee_Supplement
    blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
    blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
    blk; Hatran ; Hatran
    blk; Multani ; Multani
    blk; Old_Hungarian ; Old_Hungarian
    blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
    blk; Sutton_SignWriting ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
    sc ; Ahom ; Ahom
    sc ; Hatr ; Hatran
    sc ; Hluw ; Anatolian_Hieroglyphs
    sc ; Hung ; Old_Hungarian
    sc ; Mult ; Multani
    sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
    ~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..8.0: U+2260, U+226E, U+226F
- nothing new in 8.0, no test file to update

* run & fix ICU4C tests
- bad Cherokee case folding due to difference in fallbacks:
  UCD case folding falls back to no mapping,
  ICU runtime case folding falls back to lowercasing;
  fixed casepropsbuilder.cpp to generate scf mappings to self
  when there is an slc mapping but no scf
- Andy handles RBBI & spoof check test failures

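The Cherokee case folding problem described under "run & fix ICU4C tests"
can be modeled in a few lines of Python. This is a simplified sketch,
not the casepropsbuilder.cpp code: plain dicts stand in for the scf
(simple case folding) and slc (simple lowercase) mappings, and the
function names runtime_fold and add_explicit_self_foldings are
hypothetical. The fallback order in runtime_fold is an assumption taken
from the description in the log entry.

```python
def runtime_fold(cp, scf, slc):
    """Model of the runtime fallback: use scf if present, else slc, else identity."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_explicit_self_foldings(scf, slc):
    """Return a copy of scf with cp -> cp added wherever slc maps cp but scf does not."""
    fixed = dict(scf)
    for cp in slc:
        fixed.setdefault(cp, cp)
    return fixed


# Cherokee in Unicode 8: U+13A0 lowercases to U+AB70,
# but case folding maps U+AB70 to U+13A0 and leaves U+13A0 unchanged.
slc = {0x13A0: 0xAB70}
scf = {0xAB70: 0x13A0}

before = runtime_fold(0x13A0, scf, slc)  # falls back to lowercasing
after = runtime_fold(0x13A0, add_explicit_self_foldings(scf, slc), slc)
print("%04X %04X" % (before, after))
```

In this model the explicit self-mapping stops the lowercase fallback
from being used, so U+13A0 folds to itself as the UCD data intends.
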
* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
  (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the script for the new sample characters
    (e.g., in FractionalUCA.txt)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- fixed bug in CollationWeights::getWeightRanges()
  exposed by new data and CollationTest::TestRootElements

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt56l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
  because the layout engine was deprecated in ICU 54.
  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
  to write lines that we used to add manually.

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
2678 2679 (It also generates ScriptRunData.cpp, which is no longer needed.) 2680 2681 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages 2682 (a plain text file) 2683 which maps ICU versions to the numbers of script/language constants 2684 that were added then. 2685 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) 2686 2687 The generated files have a current copyright date and "@deprecated" statement. 2688 2689* Review changes, fix Java tool if necessary, and copy to ICU4C 2690 cd ~/svn.icu4j/trunk/src 2691 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout 2692 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout 2693 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout 2694 2695*** API additions 2696- send notice to icu-design about new born-@stable API (enum constants etc.) 2697 2698*** merge the Unicode update branches back onto the trunk 2699- do not merge the icudata.jar and testdata.jar, 2700 instead rebuild them from merged & tested ICU4C 2701- make sure that changes to Unicode tools & ICU tools are checked in 2702 http://www.unicode.org/utility/trac/log/trunk/unicodetools 2703 http://bugs.icu-project.org/trac/log/tools/trunk 2704 2705---------------------------------------------------------------------------- *** 2706 2707Unicode 7.0 update for ICU 54 2708 2709http://www.unicode.org/review/pri271/ -- beta review 2710http://www.unicode.org/reports/uax-proposed-updates.html 2711http://www.unicode.org/versions/beta-7.0.0.html#notable_issues 2712http://www.unicode.org/reports/tr44/tr44-13.html 2713 2714*** ICU Trac 2715 2716- ticket 10821: Unicode 7.0, UCA 7.0 2717- C++ branches/markus/uni70 at r35584 from trunk at r35580 2718- Java branches/markus/uni70 at r35587 from trunk at r35545 2719 2720*** CLDR Trac 2721 2722- ticket 7195: UCA 7.0 CLDR root collation 2723- branches/markus/uni70 at 
r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ;
Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
  (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
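
The find/replace pair used above for com.ibm.icu.lang.UScript can also be applied outside of Eclipse. The following is only a minimal sketch in Python, assuming that the uscript.h enum lines have the shape "USCRIPT_NAME = number,/* comment */". The function name and the sample constant are hypothetical and not part of the ICU tools.

```python
import re

# Same pattern as the "find" expression in the log above.
_ENUM_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")


def uscript_enum_to_java(line):
    """Rewrite one uscript.h enum line as a UScript.java int constant.

    Lines that do not match the pattern (for example aliases whose
    value is another constant) are returned unchanged.
    """
    return _ENUM_LINE.sub(r"public static final int \1 = \2; \3", line)


if __name__ == "__main__":
    # Made-up constant for illustration only; not a real script code.
    sample = "      USCRIPT_EXAMPLE_SCRIPT        = 999,/* Xmpl */"
    print(uscript_enum_to_java(sample))
```

Alias lines such as the one for USCRIPT_SINDHI have no numeric value and therefore pass through untouched; they still need to be edited by hand.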

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
  (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp
com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update
source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp
Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
    bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
     -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
    bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
    bpt; c ; Close
    bpt; n ; None
    bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
    bc ; FSI ; First_Strong_Isolate
    bc ; LRI ; Left_To_Right_Isolate
    bc ; RLI ; Right_To_Left_Isolate
    bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
    WB ; HL ; Hebrew_Letter
    WB ; SQ ; Single_Quote
    WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2012-10-16)
    Aghb 239 Caucasian Albanian
    Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in
com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp
com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
- refresh ICU4J
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
-
note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.3: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.2 update

http://www.unicode.org/review/pri230/
http://www.unicode.org/versions/beta-6.2.0.html
http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
http://www.unicode.org/reports/tr46/tr46-8.html IDNA
http://unicode.org/Public/idna/6.2.0/

*** ICU Trac

- ticket 9515: Unicode 6.2: final ICU update

- ticket 9514: UCA 6.2: fix UCARules.txt

- ticket 9437: update ICU to Unicode 6.2
- C++ branches/markus/uni62 at r32050 from trunk at r32041
- Java branches/markus/uni62 at r32068 from trunk at r32066

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
-
com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py: NamesList.txt is now in UTF-8
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 1 new Line_Break (lb) value:
    lb ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.LineBreak
- 1 new Word_Break (WB) value:
    WB ; RI ; Regional_Indicator
  -> uchar.h & UCharacter.WordBreak
- 1 new Grapheme_Cluster_Break (GCB) value:
    GCB; RI ; Regional_Indicator
  -> uchar.h & UCharacter.GraphemeClusterBreak

* 3 new numeric values
  The new value -1, which was really supposed to be NaN but that would have required
  new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1,
  but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
    cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
    cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  The two new values 216000 and 432000 require an addition to the encoding of numeric values.
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
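
The four gennorm2 invocations in the normalization step above all follow one pattern: an output .nrm file, the norm2 source folder, and a growing list of input files. As a rough sketch only, that table could be written down in Python as follows. The helper name and its parameters are hypothetical; the -o and -s options and the file names are the ones from the commands in this log, and nothing is executed.

```python
import posixpath

# Output file -> input files, as listed in the gennorm2 steps of this log.
NORM2_INPUTS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata, gennorm2="bin/gennorm2"):
    """Return one argv list per .nrm file; the caller decides how to run them."""
    commands = []
    for output, inputs in NORM2_INPUTS:
        argv = [gennorm2,
                "-o", posixpath.join(src_data_in, output),
                "-s", posixpath.join(unidata, "norm2")]
        commands.append(argv + inputs)
    return commands


if __name__ == "__main__":
    # Print the command lines with shell variables left unexpanded.
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

Keeping the output-to-inputs table in one place makes it easier to see that every data file starts from nfc.txt.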
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
  ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp
com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
  ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- http://site.icu-project.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
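
The bookkeeping for the ISO-only script codes amounts to a set comparison. The snippet below is a hypothetical helper, not code from preparseucd.py: it assumes that the variable holds four-letter script codes and that the Unicode script aliases are read from "sc ; Code ; Name" lines of PropertyValueAliases.txt. The sample codes are taken from other entries in this log and are used only for illustration.

```python
# Sketch only: decides which codes stay in an ISO-only list.
# Function names and sample data are assumptions for illustration.

def unicode_script_codes(pva_lines):
    """Collect short script codes from 'sc ; Xxxx ; Name' lines."""
    codes = set()
    for line in pva_lines:
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = [f.strip() for f in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def iso_only_update(iso_only, pva_lines, new_iso_codes=()):
    """Return (codes to keep in the ISO-only list, codes to remove)."""
    encoded = unicode_script_codes(pva_lines)
    keep = (set(iso_only) | set(new_iso_codes)) - encoded
    removed = set(iso_only) & encoded
    return sorted(keep), sorted(removed)


if __name__ == "__main__":
    keep, removed = iso_only_update(
        iso_only=["Cakm", "Shrd", "Khoj"],
        pva_lines=["sc ; Cakm ; Chakma", "sc ; Shrd ; Sharada"],
        new_iso_codes=["Tirh", "Hluw"],
    )
    print("keep:", keep)
    print("remove:", removed)
```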

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket
8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
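
One way to run the check for previously commented-out lines is to compare the old test file with the freshly generated one. The sketch below is an assumption about how that comparison could be automated, not a tool that exists in the ICU tree: it presumes that known-failing data lines were disabled by prefixing them with '#', and the function name and sample lines are made up for illustration.

```python
# Sketch only: lists data lines that are active in the new file but
# were commented out (with a leading '#') in the old file.
# The '#' convention and all names here are assumptions.

def reenabled_lines(old_lines, new_lines):
    """Return new-file data lines that the old file had commented out."""
    commented = set()
    for line in old_lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            body = stripped.lstrip("#").strip()
            if body:
                commented.add(body)
    result = []
    for line in new_lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and stripped in commented:
            result.append(stripped)
    return result


if __name__ == "__main__":
    old = ["# header comment", "data line A", "#data line B"]
    new = ["# header comment", "data line A", "data line B"]
    for line in reenabled_lines(old, new):
        print("probably keep commented out:", line)
```

Ordinary comment text is also collected from the old file, but it only matters if the identical text shows up as an active data line in the new file.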

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search
for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
# Arabic Mathematical Alphabetic Symbols:
# U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
# Meroitic Hieroglyphs:
# U+10980 - U+1099F
# Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
  # Each line has two fields
  # First field: Code point
  # Second field: Alias
- to
  # Each line has three fields, as described here:
  #
  # First field: Code point
  # Second field: Alias
  # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
  FEFF;BYTE ORDER MARK;alternate
  FEFF;BOM;abbreviation
  FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
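
For the NameAliases.txt format change described earlier in this section, the "only pick up correction aliases" rule can be sketched in a few lines of Python. This is not the gennames implementation (gennames is a C tool); the function names are hypothetical. The three FEFF lines come from the example above, and the 01A2 line is added here purely to illustrate a "correction" entry.

```python
# Sketch only: parses three-field NameAliases.txt lines and keeps
# just the "correction" aliases. Not the gennames code.

def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) for each data line."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # strip comments and blanks
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected three fields: %r" % line)
        yield int(fields[0], 16), fields[1], fields[2]


def correction_aliases(lines):
    """Return (code_point, alias) pairs whose type is 'correction'."""
    return [(cp, alias)
            for cp, alias, alias_type in parse_name_aliases(lines)
            if alias_type == "correction"]


SAMPLE = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",  # illustrative line, not from the log
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]

if __name__ == "__main__":
    for cp, alias in correction_aliases(SAMPLE):
        print("%04X %s" % (cp, alias))
```

Returning a list of pairs rather than a dictionary keeps the sketch valid even if a code point has more than one alias of the same type.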
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev.
version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
  ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in
com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
  ~/svn.icu/trunk/dbg/data/out/icu4j$
jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
     scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ;
Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to
"make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
   to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/~book/incoming/mark/uca6.0.0/
  http://www.macchiato.com/unicode/utc/additional-uca-files
  http://www.unicode.org/Public/UCA/6.0.0/
  http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084    Unicode 5.2

7167    verify collation bytes
7235    Java test NAME_ALIAS
7236    Java DerivedCoreProperties.txt test
7237    Java BidiTest.txt
7238    UTrie2 in core unidata
7239    test for tailoring gaps
7240    Java fix CollationMiscTest
7243    update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties:
    Cased ; Cased
    CI ; Case_Ignorable
    CWCF ; Changes_When_Casefolded
    CWCM ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL ; Changes_When_Lowercased
    CWT ; Changes_When_Titlecased
    CWU ; Changes_When_Uppercased
  new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
  new block names
  new scripts
  one script code change:
    sc ; Qaai ; Inherited
  ->
    sc ; Zinh ; Inherited ; Qaai
  new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis
  new Joining_Group (jg) values: Farsi_Yeh, Nya
  other new values:
    ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
  new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
  all of the ISO comments are gone
  new CJK block end:
    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
  new CJK block:
    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
    find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
    replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
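
Note: the search for hardcoded Unihan range ends described above (regex 9FC[34],
case-insensitive) could be scripted. The following Python sketch is not part of the
ICU tools. The function name find_range_hits and the list of file suffixes are
assumptions made for illustration; adjust both to the source tree being searched.

```python
import re
from pathlib import Path

# Matches the old range end (9FC3) and the old limit (9FC4), upper or lower case.
OLD_RANGE = re.compile(r"9FC[34]", re.IGNORECASE)


def find_range_hits(root, suffixes=(".c", ".cpp", ".h", ".java", ".txt")):
    """Return (path, line number, stripped line) for each line that mentions
    the old Unihan range end or limit in files below root.

    The suffix filter is a hypothetical default, not an ICU convention.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if OLD_RANGE.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Each hit still needs a manual decision (change it, or leave it alone for
compatibility), as in the per-file notes of the Unicode 4.1 update.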

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
  A960..A97F; Hangul Jamo Extended-A
  and in
  D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
  - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
    on Windows:
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice and
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
     find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
     replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
   "I think the tool has been modified to update @draft to @stable for
   older scripts and to add @draft for new scripts.
   (I worked with an intern on this last year.)
   You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
   "This is just a matter of making sure that all the per-script tables have
   entries for any new scripts that were added.
   If any new Indic characters were added, then the class tables in
   IndicClassTables.cpp should be updated to reflect this.
   John Emmons should know how to do this if it's required."

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696    Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  There is none above 127 yet which is the script code for an
  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.
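
Note on the "Test suites" item above: all eight boolean value aliases
(N/Y, No/Yes, F/T, False/True) must be accepted. The Python sketch below only
illustrates that requirement and is not ICU code. The name parse_binary_value is
hypothetical, and the loose matching (ignoring case, spaces, hyphens and
underscores) is an assumption modeled on Unicode loose matching of property names.

```python
# Aliases for binary property values as listed in PropertyValueAliases.txt.
_TRUE_ALIASES = {"y", "yes", "t", "true"}
_FALSE_ALIASES = {"n", "no", "f", "false"}


def parse_binary_value(alias):
    """Map a boolean property value alias to True or False.

    Comparison ignores case, spaces, hyphens and underscores
    (an assumption for this sketch).
    """
    key = "".join(ch for ch in alias.lower() if ch not in " -_")
    if key in _TRUE_ALIASES:
        return True
    if key in _FALSE_ALIASES:
        return False
    raise ValueError("not a boolean property value alias: %r" % alias)
```

A test in the spirit of TestBinaryValues() would loop over all eight spellings
and compare the resulting sets for each binary property.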

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084    RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332    RFE: Update to Unicode 4.1
4157    RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
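
Note: the u_charMirror() round-trip check listed under "tests" can be illustrated
outside of ICU. The Python sketch below works on lines in the BidiMirroring.txt
layout (code point; mirrored code point # comment). The sample lines and the
helper names are chosen for illustration, and the sketch does not call
u_charMirror() itself; it assumes that a code point without a mapping mirrors
to itself.

```python
# A few lines in the layout of BidiMirroring.txt, for illustration only.
SAMPLE = """\
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
003C; 003E # LESS-THAN SIGN
003E; 003C # GREATER-THAN SIGN
005B; 005D # LEFT SQUARE BRACKET
005D; 005B # RIGHT SQUARE BRACKET
"""


def parse_mirroring(text):
    """Parse 'XXXX; YYYY # comment' lines into a {code point: mirror} dict."""
    mapping = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        source, target = (int(field.strip(), 16) for field in data.split(";"))
        mapping[source] = target
    return mapping


def round_trip_failures(mapping):
    """Return the code points c for which mirror(mirror(c)) != c."""
    return [c for c, m in sorted(mapping.items()) if mapping.get(m, m) != c]
```

An empty result from round_trip_failures() means that every mapped code point
returns to itself after mirroring twice.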

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170    RFE: Update to Unicode 4.0.1
3171    Add new Unicode 4.0.1 properties
3520    use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
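
Note: the gennorm.c change above amounts to reading one more semicolon-separated
field per data line. The Python sketch below parses the newer two-field layout
("NFD_QC; N"); it is not the gennorm.c code. The function name parse_props_line is
hypothetical, and any sample lines fed to it should be treated as written in the
style of DerivedNormalizationProps.txt, not as quotations from a specific version.

```python
def parse_props_line(line):
    """Split one data line into (start, end, property, value).

    Returns None for blank and comment-only lines. The value is None when
    the line carries only a property name (binary property, no value field).
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # The first field is a single code point or a range "XXXX..YYYY".
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    prop = fields[1]
    value = fields[2] if len(fields) > 2 else None
    return start, end, prop, value
```

A caller that still has to read the older single-field layout would need to map
names such as "NFD_NO" to a (property, value) pair before using the result.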

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ