* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
until they are encoded in Unicode,
or can be assumed to be encoded in the next Unicode version.
Script enum constant names should follow the Unicode script property value aliases,
which are assigned only when the scripts are encoded.
When we add constants early and guess the names wrong, we end up with confusing enum constants
and have sometimes had to add aliases.

Variant script codes like Latf and Aran that are not subject to separate encoding
can be added at any time.
(For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)

We add script codes used in CLDR or in the spoof checker.
This includes combination/alias codes like Hanb and Jamo.
See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html

We add special Z* script codes like Zsye.

For new script codes see http://www.unicode.org/iso15924/codechanges.html

---------------------------------------------------------------------------- ***

Unicode 9.0 update for ICU 58

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri323/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-9.0.0.html
http://www.unicode.org/versions/Unicode9.0.0/
http://www.unicode.org/reports/tr44/tr44-17.html

*** ICU Trac

- ticket:12526: integrate Unicode 9
- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

*** CLDR Trac

- cldrbug 9414: UCA 9
- ^/branches/markus/uni90 at r11518 from trunk at r11517

- cldrbug 8745: Unicode 9.0 script metadata

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
  and copy to $UNIDATA
    cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA

* preparseucd.py changes
- remove or add new Unicode scripts from/to the
  only-in-ISO-15924 list according to the error messages:
    ValueError: remove ['Tang'] from _scripts_only_in_iso15924
    ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
    ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
    ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- DerivedNumericValues.txt new numeric values
    0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
    0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
    0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
    0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
    0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
  -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
     uchar.c, UCharacterProperty.java
     to support a new series of values
- adjust preparseucd.py for Tangut algorithmic names
  in ppucd.txt:
    algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
    ->
    algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
- avoid block-compressing most String/Miscellaneous property values,
  triggered by genprops not coping with a multi-code point Case_Folding on
    block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
  keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors

* PropertyAliases.txt changes
- 1 new property PCM=Prepended_Concatenation_Mark
  Ignore: Only useful for layout engines.
  Ok to list in ppucd.txt.

* PropertyValueAliases.txt new property values
  blk; Adlam               ; Adlam
  blk; Bhaiksuki           ; Bhaiksuki
  blk; Cyrillic_Ext_C      ; Cyrillic_Extended_C
  blk; Glagolitic_Sup      ; Glagolitic_Supplement
  blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
  blk; Marchen             ; Marchen
  blk; Mongolian_Sup       ; Mongolian_Supplement
  blk; Newa                ; Newa
  blk; Osage               ; Osage
  blk; Tangut              ; Tangut
  blk; Tangut_Components   ; Tangut_Components
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

  GCB; EB  ; E_Base
  GCB; EBG ; E_Base_GAZ
  GCB; EM  ; E_Modifier
  GCB; GAZ ; Glue_After_Zwj
  GCB; ZWJ ; ZWJ
  -> uchar.h & UCharacter.GraphemeClusterBreak

  jg ; African_Feh  ; African_Feh
  jg ; African_Noon ; African_Noon
  jg ; African_Qaf  ; African_Qaf
  -> uchar.h & UCharacter.JoiningGroup

  lb ; EB  ; E_Base
  lb ; EM  ; E_Modifier
  lb ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.LineBreak

  sc ; Adlm ; Adlam
  sc ; Bhks ; Bhaiksuki
  sc ; Marc ; Marchen
  sc ; Newa ; Newa
  sc ; Osge ; Osage
  sc ; Tang ; Tangut
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

  WB ; EB  ; E_Base
  WB ; EBG ; E_Base_GAZ
  WB ; EM  ; E_Modifier
  WB ; GAZ ; Glue_After_Zwj
  WB ; ZWJ ; ZWJ
  -> uchar.h & UCharacter.WordBreak

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

    # Location (--prefix) of where ICU was installed.
    set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
    # Location of the ICU source tree.
    set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

    ~/svn.icutools/trunk/dbg/unicode/c$
    cmake ../../../src/unicode/c
    make

* generate core properties data files
    ~/svn.icutools/trunk/dbg/unicode/c$
    genprops/genprops $ICU_SRC_DIR
    genuca/genuca --hanOrder implicit $ICU_SRC_DIR
    genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..9.0: U+2260, U+226E, U+226F
- nothing new in 9.0, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
    cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

- cd (CLDR UCA branch)/common/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

- run genuca, see command line above;
  deal with
    Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
        FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
  + look up the USCRIPT_ code for the new sample characters
    (should be obvious from the comment in the error output)
  + *add* mappings to sampleCharsToScripts[], do not replace them
    (in case the script sample characters flip-flop)
  + insert new scripts in DUCET script order, see the top_byte table
    at the beginning of FractionalUCA.txt
- rebuild ICU4C

* Unihan collators
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
    -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
    -DUVERSION=9.0.0
    -ea
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd ~/svn.cldr/trunk
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd ~/svn.cldr/trunk
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /home/mscherer/svn.cldr/trunk/common/collation
    -m /home/mscherer/svn.cldr/trunk/common/supplemental
    -d /home/mscherer/svn.icu/trunk/src/source/data/coll
    -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
    zh
  and VM arguments
    -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt58l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764

Adding
- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
- new combination/alias codes: Hanb, Jamo
  - used in CLDR 29 and in spoof checker
- new Z* code: Zsye

Add new codes to uscript.h & UScript.java, see Unicode update logs.
  -> com.ibm.icu.lang.UScript
     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace  public static final int \1 = \2; \3

Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
add new script codes.
"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

Note: If we have to run preparseucd.py again before the Unicode 9 update,
then we need to manually keep/restore the new script codes.
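
The find/replace pair given above for com.ibm.icu.lang.UScript can also be
applied outside an editor. The following is only a sketch, not part of the
ICU tools: the function name c_enum_to_java and the sample enum line
(USCRIPT_EXAMPLE = 999) are made up for illustration, and the regular
expression is copied unchanged from the log entry.

```python
import re

# Pattern and replacement exactly as in the log entry above.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2; \3"


def c_enum_to_java(line: str) -> str:
    """Rewrite one uscript.h enum line as a UScript.java constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


if __name__ == "__main__":
    # Hypothetical input line; real names and values come from uscript.h.
    sample = "    USCRIPT_EXAMPLE                       = 999,/* Exmp */"
    print(c_enum_to_java(sample))
```

Like the editor pattern, this needs some text (normally the four-letter code
comment) after the comma; a line that ends at the comma is left as it is.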

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt57b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
see http://bugs.icu-project.org/trac/ticket/12141

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

*** LayoutEngine script information

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

---------------------------------------------------------------------------- ***

Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802

Edit preparseucd.py to add & parse new properties.
They share the UCD property namespace but are not listed in PropertyAliases.txt.
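
For orientation, here is a rough sketch of what parsing such properties
involves. It is not the code in preparseucd.py. It assumes that the emoji data
file uses the usual UCD line layout ("code point or range ; property name #
comment"); the function name parse_binary_props and the sample lines are
illustrative only.

```python
import re
from collections import defaultdict

# Assumed layout: "XXXX ; Name # comment" or "XXXX..YYYY ; Name # comment".
_DATA_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)$"
)


def parse_binary_props(lines):
    """Return {property name: [(start, end), ...]} from UCD-style lines."""
    props = defaultdict(list)
    for raw in lines:
        # Drop the trailing comment, then skip blank and comment-only lines.
        text = raw.split("#", 1)[0].strip()
        if not text:
            continue
        match = _DATA_LINE.match(text)
        if match is None:
            raise ValueError("unexpected data line: %r" % raw)
        start = int(match.group(1), 16)
        # A single code point is stored as a one-element range.
        end = int(match.group(2) or match.group(1), 16)
        props[match.group(3)].append((start, end))
    return dict(props)


if __name__ == "__main__":
    # Illustrative lines in the assumed format, not a copy of the real file.
    sample = [
        "# emoji-data sample",
        "0023          ; Emoji           # NUMBER SIGN",
        "0030..0039    ; Emoji           # DIGIT ZERO..DIGIT NINE",
        "1F3FB..1F3FF  ; Emoji_Modifier  # skin tone modifiers",
    ]
    for name, ranges in parse_binary_props(sample).items():
        print(name, ["%04X..%04X" % r for r in ranges])
```

Keeping ranges as (start, end) pairs avoids expanding large ranges into
individual code points before the values are written out.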

Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
Initial data from emoji/2.0/

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

Add binary-property constants to uchar.h enum UProperty & UProperty.java.

~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

make install, then icutools cmake & make, then
~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

Generate Java data as usual, only update pnames.icu & uprops.icu.

---------------------------------------------------------------------------- ***

Unicode 8.0 update for ICU 56

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt56b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

http://www.unicode.org/review/pri297/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://unicode.org/versions/beta-8.0.0.html
http://www.unicode.org/versions/Unicode8.0.0/
http://www.unicode.org/reports/tr44/tr44-15.html

*** ICU Trac

- ticket:11574: Unicode 8
- C++ branches/markus/uni80 at r37351 from trunk at r37343
- Java branches/markus/uni80 at r37352 from trunk at r37338

*** CLDR Trac

- cldrbug 8311: UCA 8
- branches/markus/uni80 at r11518 from trunk at r11517

- cldrbug 8109: Unicode 8.0 script metadata
- cldrbug 8418: Updated segmentation for Unicode 8.0

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
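
  The "remove version suffixes" step above can be sketched as follows.
  This is not desuffixucd.py itself: the suffix shape ("-8.0.0" or a draft
  form like "-8.0.0d5" directly before the extension) is an assumption, and
  the names strip_version_suffix and desuffix_tree are hypothetical.

```python
import re
from pathlib import Path

# Assumed suffix: "-<major>.<minor>.<update>", optionally followed by a draft
# marker such as "d5", immediately before the final file extension.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name: str) -> str:
    """Return file_name without a version suffix, or unchanged if none."""
    return _VERSION_SUFFIX.sub("", file_name, count=1)


def desuffix_tree(root: str) -> None:
    """Rename all suffixed files below root in place."""
    # Collect the paths first so that renaming does not disturb the walk.
    for path in list(Path(root).rglob("*")):
        if not path.is_file():
            continue
        new_name = strip_version_suffix(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))


if __name__ == "__main__":
    for name in ("UnicodeData-8.0.0d5.txt", "DerivedAge-8.0.0.txt", "ReadMe.txt"):
        print(name, "->", strip_version_suffix(name))
```

  Separating the pure name function from the directory walk makes it possible
  to check the pattern on a few file names before renaming anything.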

- also: from http://unicode.org/Public/security/8.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
                from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- property and file name change:
  IndicMatraCategory -> IndicPositionalCategory
- UnicodeData.txt unusual numeric values (improper fractions)
    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
  -> change preparseucd.py to map them to proper fractions (e.g., 1/6)
     which are listed in DerivedNumericValues.txt;
     keeps storage in data file simple

* PropertyValueAliases.txt changes
- 10 new Block (blk) values:
  blk; Ahom                        ; Ahom
  blk; Anatolian_Hieroglyphs       ; Anatolian_Hieroglyphs
  blk; Cherokee_Sup                ; Cherokee_Supplement
  blk; CJK_Ext_E                   ; CJK_Unified_Ideographs_Extension_E
  blk; Early_Dynastic_Cuneiform    ; Early_Dynastic_Cuneiform
  blk; Hatran                      ; Hatran
  blk; Multani                     ; Multani
  blk; Old_Hungarian               ; Old_Hungarian
  blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
  blk; Sutton_SignWriting          ; Sutton_SignWriting
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 6 new Script (sc) values:
  sc ; Ahom ; Ahom
  sc ; Hatr ; Hatran
  sc ; Hluw ; Anatolian_Hieroglyphs
  sc ; Hung ; Old_Hungarian
  sc ; Mult ; Multani
  sc ; Sgnw ; SignWriting
  -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
618 619 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt 620 621* build Unicode tools using CMake+make 622 623~/svn.icutools/trunk/src/unicode/c/icudefs.txt: 624 625 # Location (--prefix) of where ICU was installed. 626 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) 627 # Location of the ICU source tree. 628 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) 629 630 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c 631 ~/svn.icutools/trunk/dbg/unicode/c$ make 632 633* generate core properties data files 634- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR 635- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR 636- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR 637- rebuild ICU (make install) & tools 638- run genuca again (see step above) so that it picks up the new nfc.nrm 639- rebuild ICU (make install) & tools 640 641* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to 642 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) 643- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters 644- Unicode 6.0..8.0: U+2260, U+226E, U+226F 645- nothing new in 8.0, no test file to update 646 647* run & fix ICU4C tests 648- bad Cherokee case folding due to difference in fallbacks: 649 UCD case folding falls back to no mapping, 650 ICU runtime case folding falls back to lowercasing; 651 fixed casepropsbuilder.cpp to generate scf mappings to self 652 when there is an slc mapping but no scf 653- Andy handles RBBI & spoof check test failures 654 655* collation: CLDR collation root, UCA DUCET 656 657- UCA DUCET goes into Mark's Unicode tools, see 658 https://sites.google.com/site/unicodetools/home#TOC-UCA 659- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ 660- cd (CLDR UCA branch)/common/uca/ 661- update 
source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 662 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt 663- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 664 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt 665 (note removing the underscore before "Rules") 666 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 667- restore TODO diffs in UCARules.txt 668 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt 669- update (ICU4C)/source/test/testdata/CollationTest_*.txt 670 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 671 from the CLDR root files (..._CLDR_..._SHORT.txt) 672 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt 673 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt 674 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data 675- if CLDR common/uca/unihan-index.txt changes, then update 676 CLDR common/collation/root.xml <collation type="private-unihan"> 677 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt 678- run genuca, see command line above; 679 deal with 680 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt 681 (add the character to genuca.cpp sampleCharsToScripts[]) 682 + look up the script for the new sample characters 683 (e.g., in FractionalUCA.txt) 684 + *add* mappings to sampleCharsToScripts[], do not replace them 685 (in case the script sample characters flip-flop) 686 + insert new scripts in DUCET script order, see the top_byte table 687 at the beginning of FractionalUCA.txt 688- rebuild ICU4C 689 690* run & fix ICU4C tests, now with new CLDR collation root data 691- run all 
tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- fixed bug in CollationWeights::getWeightRanges()
  exposed by new data and CollationTest::TestRootElements

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt56l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
- copy the big-endian
Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd ~/svn.icu/trunk/dbg/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc.
from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** LayoutEngine script information

* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
  because the layout engine was deprecated in ICU 54.
  Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
  to write lines that we used to add manually.

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.

  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file)
  which maps ICU versions to the numbers of script/language constants
  that were added then.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

* Review changes, fix Java tool if necessary, and copy to ICU4C
    cd ~/svn.icu4j/trunk/src
    meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
    cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools & ICU tools are checked in
    http://www.unicode.org/utility/trac/log/trunk/unicodetools
    http://bugs.icu-project.org/trac/log/tools/trunk

---------------------------------------------------------------------------- ***

Unicode 7.0 update for ICU 54

http://www.unicode.org/review/pri271/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-13.html

*** ICU Trac

- ticket 10821: Unicode 7.0, UCA 7.0
- C++ branches/markus/uni70 at r35584 from trunk at r35580
- Java branches/markus/uni70 at r35587 from trunk at r35545

*** CLDR Trac

- ticket 7195: UCA 7.0 CLDR root collation
- branches/markus/uni70 at r10062 from trunk at r10061

- ticket 6762: script metadata for Unicode 7.0 new scripts

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
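
The version string in uchar.h can be checked with a small script before
running "configure". The following is a minimal sketch, assuming that uchar.h
defines the version as a quoted string macro named U_UNICODE_VERSION
(for example "7.0"). The function names and the sample header text are
hypothetical and only serve the illustration; this is not part of the ICU tools.

```python
import re

# Assumption: uchar.h carries a line of the form
#   #define U_UNICODE_VERSION "7.0"
_VERSION_RE = re.compile(
    r'^[ \t]*#[ \t]*define[ \t]+U_UNICODE_VERSION[ \t]+"([^"]+)"',
    re.MULTILINE,
)


def unicode_version_from_header(header_text):
    """Return the version string found in uchar.h-style text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


def header_matches(header_text, expected_version):
    """Hypothetical helper: True if the header carries the expected version."""
    return unicode_version_from_header(header_text) == expected_version


if __name__ == "__main__":
    # Hypothetical sample text standing in for the real uchar.h contents.
    sample = '/* sample header */\n#define U_UNICODE_VERSION "7.0"\n'
    print(unicode_version_from_header(sample))
```

In practice one would pass the contents of
$ICU_SRC_DIR/source/common/unicode/uchar.h instead of the sample text and
compare the result with the version being integrated.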

*** data files & enums & parser code

* file preparation

- download UCD & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- only for manual diffs: remove version suffixes from the file names
    ~/unidata/uni70/20140403$ ../../desuffixucd.py .
    (see https://sites.google.com/site/unicodetools/inputdata)
- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Restore TODO diffs in source/data/unidata/UCARules.txt
    cd $ICU_SRC_DIR
    meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

- also: from http://unicode.org/Public/security/7.0.0/ download new
  confusables.txt & confusablesWholeScript.txt
  and copy to $ICU_ROOT/src/source/data/unidata/

* initial preparseucd.py changes
- remove new Unicode scripts from the
  only-in-ISO-15924 list according to the error message:
    ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
    'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
    'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
    from _scripts_only_in_iso15924
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- NamesList.txt now has a heading with a non-ASCII character
  + keep ppucd.txt in platform charset, rather than changing tool/test parsers
  + escape non-ASCII characters in heading comments
- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
  + get the copyright from the
first file whose copyright line contains the current year

* PropertyValueAliases.txt changes
- 32 new Block (blk) values:
    blk; Bassa_Vah ; Bassa_Vah
    blk; Caucasian_Albanian ; Caucasian_Albanian
    blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
    blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
    blk; Duployan ; Duployan
    blk; Elbasan ; Elbasan
    blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
    blk; Grantha ; Grantha
    blk; Khojki ; Khojki
    blk; Khudawadi ; Khudawadi
    blk; Latin_Ext_E ; Latin_Extended_E
    blk; Linear_A ; Linear_A
    blk; Mahajani ; Mahajani
    blk; Manichaean ; Manichaean
    blk; Mende_Kikakui ; Mende_Kikakui
    blk; Modi ; Modi
    blk; Mro ; Mro
    blk; Myanmar_Ext_B ; Myanmar_Extended_B
    blk; Nabataean ; Nabataean
    blk; Old_North_Arabian ; Old_North_Arabian
    blk; Old_Permic ; Old_Permic
    blk; Ornamental_Dingbats ; Ornamental_Dingbats
    blk; Pahawh_Hmong ; Pahawh_Hmong
    blk; Palmyrene ; Palmyrene
    blk; Pau_Cin_Hau ; Pau_Cin_Hau
    blk; Psalter_Pahlavi ; Psalter_Pahlavi
    blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
    blk; Siddham ; Siddham
    blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
    blk; Sup_Arrows_C ; Supplemental_Arrows_C
    blk; Tirhuta ; Tirhuta
    blk; Warang_Citi ; Warang_Citi
  -> add to uchar.h
     use long property names for enum constants
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 28 new Joining_Group (jg) values:
    jg ; Manichaean_Aleph ; Manichaean_Aleph
    jg ; Manichaean_Ayin ; Manichaean_Ayin
    jg ; Manichaean_Beth ; Manichaean_Beth
    jg ; Manichaean_Daleth ; Manichaean_Daleth
    jg ; Manichaean_Dhamedh ;
Manichaean_Dhamedh
    jg ; Manichaean_Five ; Manichaean_Five
    jg ; Manichaean_Gimel ; Manichaean_Gimel
    jg ; Manichaean_Heth ; Manichaean_Heth
    jg ; Manichaean_Hundred ; Manichaean_Hundred
    jg ; Manichaean_Kaph ; Manichaean_Kaph
    jg ; Manichaean_Lamedh ; Manichaean_Lamedh
    jg ; Manichaean_Mem ; Manichaean_Mem
    jg ; Manichaean_Nun ; Manichaean_Nun
    jg ; Manichaean_One ; Manichaean_One
    jg ; Manichaean_Pe ; Manichaean_Pe
    jg ; Manichaean_Qoph ; Manichaean_Qoph
    jg ; Manichaean_Resh ; Manichaean_Resh
    jg ; Manichaean_Sadhe ; Manichaean_Sadhe
    jg ; Manichaean_Samekh ; Manichaean_Samekh
    jg ; Manichaean_Taw ; Manichaean_Taw
    jg ; Manichaean_Ten ; Manichaean_Ten
    jg ; Manichaean_Teth ; Manichaean_Teth
    jg ; Manichaean_Thamedh ; Manichaean_Thamedh
    jg ; Manichaean_Twenty ; Manichaean_Twenty
    jg ; Manichaean_Waw ; Manichaean_Waw
    jg ; Manichaean_Yodh ; Manichaean_Yodh
    jg ; Manichaean_Zayin ; Manichaean_Zayin
    jg ; Straight_Waw ; Straight_Waw
  -> uchar.h & UCharacter.JoiningGroup
- 23 new Script (sc) values:
    sc ; Aghb ; Caucasian_Albanian
    sc ; Bass ; Bassa_Vah
    sc ; Dupl ; Duployan
    sc ; Elba ; Elbasan
    sc ; Gran ; Grantha
    sc ; Hmng ; Pahawh_Hmong
    sc ; Khoj ; Khojki
    sc ; Lina ; Linear_A
    sc ; Mahj ; Mahajani
    sc ; Mani ; Manichaean
    sc ; Mend ; Mende_Kikakui
    sc ; Modi ; Modi
    sc ; Mroo ; Mro
    sc ; Narb ; Old_North_Arabian
    sc ; Nbat ; Nabataean
    sc ; Palm ; Palmyrene
    sc ; Pauc ; Pau_Cin_Hau
    sc ; Perm ; Old_Permic
    sc ; Phlp ; Psalter_Pahlavi
    sc ; Sidd ; Siddham
    sc ; Sind ; Khudawadi
    sc ; Tirh ; Tirhuta
    sc ; Wara ; Warang_Citi
  -> uscript.h (many were added before)
     comment "Mende Kikakui" for USCRIPT_MENDE
     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
- 6 new script codes from ISO
15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-11-01)
    Ahom 338 Ahom
    Hatr 127 Hatran
    Mult 323 Multani
    (added 2013-10-12)
    Modi 324 Modi
    Pauc 263 Pau Cin Hau
    Sidd 302 Siddham
  -> uscript.h (some overlap with additions from Unicode)
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2; \3
  -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

* generate normalization data files
- cd $ICU_ROOT/dbg
- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
- UNIDATA=$ICU_SRC_DIR/source/data/unidata
- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* genprops work
- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
  + add second array of Joining_Group values for at most 10800..10FFF
    icutools: unicode/c/genprops/bidipropsbuilder.cpp
    icu: source/common/ubidi_props.h/.c/_data.h
    icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..7.0: U+2260, U+226E, U+226F
- nothing new in 7.0, no test file to update

* run & fix ICU4C tests

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt53l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    ICUDT=icudt54b
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cd ~/svn.icu/uni70/dbg/data/out/icu4j
    cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp
com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
- refresh ICU4J
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC_DIR/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
- output files are in ~/svn.unitools/Generated/uca/7.0.0/
- review data; compare files, use blankweights.sed or similar
    ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
- cd ~/svn.unitools/Generated/uca/7.0.0/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
- update
source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    (note removing the underscore before "Rules")
    cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
    cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ICUDT=icudt54b
    ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test
- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
    ~/svn.unitools$ cp
Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

* run & fix ICU4J tests

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@stable" statement.
  ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
  for "born stable" Unicode API constants, and to stop parsing ICU version numbers
  which may not contain dots any more.

- diff current <icu>/source/layout files vs. generated ones
    ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft/@stable and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Unicode 6.3 update

http://www.unicode.org/review/pri249/ -- beta review
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
http://www.unicode.org/reports/tr44/tr44-11.html

*** ICU Trac

- ticket 10128: update ICU to Unicode 6.3 beta
- ticket 10168: update ICU to Unicode 6.3 final
- C++ branches/markus/uni63 at r33552 from trunk at r33551
- Java branches/markus/uni63 at r33550 from trunk at r33553

- ticket 10142: implement Unicode 6.3 bidi algorithm additions

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.

*** data files & enums & parser code

* file preparation

- download UCD, UCA & IDNA files
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
- modify preparseucd.py:
  parse new file BidiBrackets.txt
  with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyAliases.txt changes
- 1 new Enumerated Property
    bpt ; Bidi_Paired_Bracket_Type
  -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
     -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
  -> uprops.cpp
  -> change ubidi.icu format version from 2.0 to 2.1
- 1 new Miscellaneous Property
    bpb ; Bidi_Paired_Bracket
  -> uchar.h & UProperty.java
  -> ppucd.h & .cpp

* PropertyValueAliases.txt changes
- 3 Bidi_Paired_Bracket_Type (bpt) values:
    bpt; c ; Close
    bpt; n ; None
    bpt; o ; Open
  -> uchar.h & UCharacter.BidiPairedBracketType
  -> ubidi_props.h & .c & UBiDiProps.java
  -> change ubidi.icu format version from 2.0 to 2.1
- 4 new Bidi_Class (bc) values:
    bc ; FSI ; First_Strong_Isolate
    bc ; LRI ; Left_To_Right_Isolate
    bc ; RLI ; Right_To_Left_Isolate
    bc ; PDI ; Pop_Directional_Isolate
  -> uchar.h & UCharacterEnums.ECharacterDirection
  -> until the bidi code gets updated,
     Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
- 3 new Word_Break (WB) values:
    WB ; HL ; Hebrew_Letter
    WB ; SQ ; Single_Quote
    WB ; DQ ; Double_Quote
  -> uchar.h & UCharacter.WordBreak
  -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
    (added 2012-10-16)
    Aghb 239 Caucasian Albanian
    Mahj 314 Mahajani
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> preparseucd.py _scripts_only_in_iso15924
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in
com.ibm.icu.dev.test.lang.TestUScript.java
  -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
     (not strictly necessary for NOT_ENCODED scripts)

* generate normalization data files
- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

* build Unicode tools using CMake+make

~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
~/svn.icutools/trunk/dbg/unicode/c$ make

* generate core properties data files
- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
- rebuild ICU (make install) & tools
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.3: U+2260, U+226E, U+226F
- nothing new in 6.3, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt52l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
    ~/svn.icu/uni63/dbg/data/out/icu4j$ cp
com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr 1318- refresh ICU4J 1319 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b 1320 1321* refresh Java test .txt files 1322- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode 1323 1324* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files 1325 1326- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ 1327- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that 1328- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt 1329- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt 1330 (note removing the underscore before "Rules") 1331- update (ICU4C)/source/test/testdata/CollationTest_*.txt 1332 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt 1333 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) 1334- check test file diffs for previously commented-out, known-failing data lines; 1335 probably need to keep those commented out 1336- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani 1337- run genuca, see command line above 1338- rebuild ICU4C 1339- refresh ICU4J collation data: 1340 (subset of instructions above for properties data refresh, except copies all coll/*) 1341 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1342 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 1343 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll 1344 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b 1345- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) 1346- 
note on intltest: if collate/UCAConformanceTest fails, then 1347 utility/MultithreadTest/TestCollators will fail as well; 1348 fix the conformance test before looking into the multi-thread test 1349 1350* test ICU, fix test code where necessary 1351 1352* When refreshing all of ICU4J data from ICU4C 1353- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install 1354- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data 1355or 1356- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install 1357 1358*** LayoutEngine script information 1359- skipped for Unicode 6.3: no new scripts 1360 1361*** merge the Unicode update branches back onto the trunk 1362- do not merge the icudata.jar and testdata.jar, 1363 instead rebuild them from merged & tested ICU4C 1364 1365---------------------------------------------------------------------------- *** 1366 1367Unicode 6.2 update 1368 1369http://www.unicode.org/review/pri230/ 1370http://www.unicode.org/versions/beta-6.2.0.html 1371http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0 1372http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values 1373http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol 1374http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols 1375http://www.unicode.org/reports/tr46/tr46-8.html IDNA 1376http://unicode.org/Public/idna/6.2.0/ 1377 1378*** ICU Trac 1379 1380- ticket 9515: Unicode 6.2: final ICU update 1381 1382- ticket 9514: UCA 6.2: fix UCARules.txt 1383 1384- ticket 9437: update ICU to Unicode 6.2 1385- C++ branches/markus/uni62 at r32050 from trunk at r32041 1386- Java branches/markus/uni62 at r32068 from trunk at r32066 1387 1388*** Unicode version numbers 1389- makedata.mak 1390- uchar.h 1391 (configure.in & configure: have been modified to extract the version from uchar.h) 1392- com.ibm.icu.util.VersionInfo 1393- 
com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ 1394 1395*** data files & enums & parser code 1396 1397* file preparation 1398 1399- download UCD, UCA & IDNA files 1400- make sure that the Unicode data folder passed into preparseucd.py 1401 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) 1402- modify preparseucd.py: NamesList.txt is now in UTF-8 1403- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src 1404- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 1405- Check test file diffs for previously commented-out, known-failing data lines; 1406 probably need to keep those commented out. 1407 1408* PropertyValueAliases.txt changes 1409- 1 new Line_Break (lb) value: 1410 lb ; RI ; Regional_Indicator 1411 -> uchar.h & UCharacter.LineBreak 1412- 1 new Word_Break (WB) value: 1413 WB ; RI ; Regional_Indicator 1414 -> uchar.h & UCharacter.WordBreak 1415- 1 new Grapheme_Cluster_Break (GCB) value: 1416 GCB; RI ; Regional_Indicator 1417 -> uchar.h & UCharacter.GraphemeClusterBreak 1418 1419* 3 new numeric values 1420 The new value -1, which was really supposed to be NaN but that would have required 1421 new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1, 1422 but encodeNumericValue() in corepropsbuilder.cpp had to be fixed. 1423 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1 1424 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1 1425 The two new values 216000 and 432000 require an addition to the encoding of numeric values. 
    cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
    cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  -> uprops.h, uchar.c & UCharacterProperty.java
  -> cucdtst.c & UCharacterTest.java

* generate normalization data files
- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
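
The four gennorm2 invocations listed under "generate normalization data files" follow one
pattern: each .nrm output is built from a cumulative list of norm2 input files. The sketch
below only assembles the command lines and does not run them. The helper name
gennorm2_commands and the NORM2_INPUTS table are hypothetical, introduced for illustration;
only the -o and -s options and the file names are taken from this log.

```python
# Hypothetical helper (not part of the ICU tools): assembles the gennorm2
# command lines from this log so the input-file pattern is visible in one place.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Return one argv list per .nrm file, in the order used in the log."""
    commands = []
    for output, inputs in NORM2_INPUTS.items():
        argv = [tool, "-o", f"{src_data_in}/{output}", "-s", f"{unidata}/norm2"]
        argv.extend(inputs)
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Print the commands with the same shell variables the log uses.
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed lines can
be compared against the commands above before a real run.
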
* build Unicode tools using CMake+make

* generate core properties data files
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
- rebuild ICU (make install) & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.2: U+2260, U+226E, U+226F
- nothing new in 6.2, no test file to update

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt50l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
  ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
- refresh ICU4J
  ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* UCA

- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run genuca, see command line above
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
  ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* test ICU, fix test code where necessary

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information
- skipped for Unicode 6.2: no new scripts

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- http://site.icu-project.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.
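
The add/remove rule for the ISO-only script list can be checked mechanically. The following
is a minimal sketch, assuming the list can be treated as a plain set of four-letter codes;
the actual shape of _scripts_only_in_iso15924 in preparseucd.py may differ, and the function
names unicode_script_codes and reconcile are hypothetical.

```python
def unicode_script_codes(pva_lines):
    """Collect short script codes from 'sc ; Code ; Long_Name' lines."""
    codes = set()
    for line in pva_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def reconcile(iso_only, iso_codes, unicode_codes):
    """Return (to_add, to_remove) for the ISO-only list.

    to_add:    ISO 15924 codes that Unicode lacks and the list does not have yet.
    to_remove: listed codes that Unicode has since encoded.
    """
    iso_only, iso_codes, unicode_codes = set(iso_only), set(iso_codes), set(unicode_codes)
    to_add = sorted(iso_codes - unicode_codes - iso_only)
    to_remove = sorted(iso_only & unicode_codes)
    return to_add, to_remove


if __name__ == "__main__":
    # Sample data only, using codes that appear elsewhere in this log.
    pva = ["sc ; Shrd ; Sharada", "sc ; Takr ; Takri"]
    print(reconcile({"Afak", "Shrd", "Takr"},
                    {"Afak", "Khoj", "Shrd", "Takr"},
                    unicode_script_codes(pva)))
```

With the sample data, Khoj is reported as a code to add, and Shrd and Takr as codes to
remove because PropertyValueAliases.txt now defines them.
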

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.
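
The check for previously commented-out data lines can be done by eye in a diff, or roughly
automated. A minimal sketch, assuming that known-failing lines were disabled by prefixing
them with '#' and that the refreshed file contains the same line text without the prefix;
the function name reactivated_lines is hypothetical and the sample lines are made up.

```python
def reactivated_lines(old_lines, new_lines):
    """Return data lines that were '#'-commented in the old file
    but appear as active (uncommented) lines in the new file."""
    def is_comment(line):
        return line.strip().startswith("#")

    def body(line):
        # Text of a commented line without its leading '#' characters.
        return line.strip().lstrip("#").strip()

    commented = {body(line) for line in old_lines if is_comment(line) and body(line)}
    return [line.strip() for line in new_lines
            if line.strip() and not is_comment(line) and line.strip() in commented]


if __name__ == "__main__":
    old = ["0041;A", "#0042;B", "# plain remark"]   # made-up test data
    new = ["0041;A", "0042;B", "0043;C"]
    for line in reactivated_lines(old, new):
        print("re-check and probably comment out again:", line)
```

Ordinary remarks in the old file only match if the new file contains the identical text as
a data line, so the report stays short; every hit still needs a manual look.
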

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
     replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
# Arabic Mathematical Alphabetic Symbols:
# U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
# Meroitic Hieroglyphs:
# U+10980 - U+1099F
# Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file

* NameAliases.txt changes
- from
  # Each line has two fields
  # First field: Code point
  # Second field: Alias
- to
  # Each line has three fields, as described here:
  #
  # First field: Code point
  # Second field: Alias
  # Third field: Type
- Also, the file previously allowed multiple aliases but only now does it
  actually provide multiple, even multiple of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
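
The three-field NameAliases.txt format described under "NameAliases.txt changes", and the
decision to keep only "correction" aliases, can be illustrated with a short parser. This is
a sketch, not the gennames code (gennames is written in C); the function name
correction_aliases is hypothetical. The FEFF lines are the examples quoted above, and the
01A2 line is added as sample input for the "correction" type.

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type is 'correction'.

    Each data line has three ';'-separated fields: code point, alias, type.
    '#' starts a comment; blank lines are ignored.
    """
    result = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields: %r" % line)
        code_point, alias, alias_type = fields
        if alias_type == "correction":
            result.setdefault(int(code_point, 16), []).append(alias)
    return result


if __name__ == "__main__":
    sample = [
        "# sample input, not the full file",
        "01A2;LATIN CAPITAL LETTER GHA;correction",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
    ]
    for cp, names in sorted(correction_aliases(sample).items()):
        print("%04X" % cp, names)
```

Collecting aliases into a list per code point keeps the sketch correct even if a code point
ever carries more than one alias of the same type, which the new file format allows.
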
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
  ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
  ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
     scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
  Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
  ...
  Unicode .icu files built to ./out/build/icudt45l
  mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
  echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
  LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
  jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
  mkdir -p /tmp/icu4j/main/shared/data
  cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
  copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
  to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/~book/incoming/mark/uca6.0.0/
  http://www.macchiato.com/unicode/utc/additional-uca-files
  http://www.unicode.org/Public/UCA/6.0.0/
  http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties:
    Cased ; Cased
    CI ; Case_Ignorable
    CWCF ; Changes_When_Casefolded
    CWCM ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL ; Changes_When_Lowercased
    CWT ; Changes_When_Titlecased
    CWU ; Changes_When_Uppercased
  new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
  new block names
  new scripts
  one script code change:
    sc ; Qaai ; Inherited
    ->
    sc ; Zinh ; Inherited ; Qaai
  new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis
  new Joining_Group (jg) values: Farsi_Yeh, Nya
  other new values:
    ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
  new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
  all of the ISO comments are gone
  new CJK block end:
    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
  new CJK block:
    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 26 new blocks
    copy new blocks from Blocks.txt
    MS VC++ 2008 regular expression:
    find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
    replace with " UBLOCK_\3 = 172, /*[\1]*/"
  + several new script values already added in ICU 4.0 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
  + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
  + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
    (added to SyntheticPropertyValueAliases.txt)
- new Joining Group (JG) values: Farsi_Yeh, Nya
- new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FC3 to 9FCB
  search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
  + do change gennames.c

* Compare definitions of new binary properties with what we used to use
  in algorithms, to see if the definitions changed.
- Verified that definitions for Cased and Case_Ignorable are unchanged.
  The gencase tool now parses the newly public Case_Ignorable values
  in case the definition changes in the future.
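
A minimal Python sketch of the Blocks.txt find/replace noted above under "26 new blocks".
It is not part of the ICU tools. The function names and the starting value are assumptions
for illustration only. Unlike the editor regex, it also upper-cases the block name and
increments the number, which the notes above leave as manual edits.

```python
import re

# Matches Blocks.txt data lines such as "A960..A97F; Hangul Jamo Extended-A".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_enum_name(block_name):
    """Hypothetical helper: upper-case the name, turn other characters into '_'."""
    return "UBLOCK_" + re.sub(r"[^A-Za-z0-9]+", "_", block_name).upper()


def blocks_to_enum(lines, first_value):
    """Return C enum lines for Blocks.txt lines, numbering from first_value."""
    out = []
    value = first_value
    for line in lines:
        m = BLOCK_LINE.match(line.strip())
        if m is None:
            continue  # skip comments and blank lines
        start, _end, name = m.groups()
        out.append("    %s = %d, /*[%s]*/" % (block_enum_name(name), value, start))
        value += 1
    return out


if __name__ == "__main__":
    sample = [
        "# comment line",
        "A960..A97F; Hangul Jamo Extended-A",
        "D7B0..D7FF; Hangul Jamo Extended-B",
    ]
    # 172 is only the placeholder number from the regex above, not a verified value.
    for enum_line in blocks_to_enum(sample, 172):
        print(enum_line)
```

The generated lines still need to be reviewed against uchar.h before they are pasted in.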

* uchar.c & uprops.h & uprops.c & genprops
- new numeric values that didn't exist in Unicode data before:
  1/7, 1/9, 1/10, 3/10, 1/16, 3/16
  the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
  therefore redesign the encoding of numeric types and values for formatVersion 6;
  design for simple numbers up to at least 144 ("one gross"),
  large values up to at least 10^20,
  and fractions with numerators -1..17 and denominators 1..16
  to cover current and expected future values
  (e.g., more Han numeric values, Meroitic twelfths)

* reimplement Hangul_Syllable_Type for new Jamo characters
- the old code assumed that all Jamo characters are in the 11xx block
- Unicode 5.2 fills holes there and adds new Jamo characters in
  A960..A97F; Hangul Jamo Extended-A
  and in
  D7B0..D7FF; Hangul Jamo Extended-B
- Hangul_Syllable_Type can be trivially derived from a subset of
  Grapheme_Cluster_Break values

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data

ICU data make path is \svn\icuproj\icu\trunk\source\data\
ICU root path is \svn\icuproj\icu\trunk
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
Creating data file for Unicode Property Names
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
  and rebuild the common library

*** UCA

- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
[ Begin obsolete instructions:
  Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
  - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
    on Windows:
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
    python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
  End obsolete instructions]
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
  not just the *_STUB.txt files
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

*** Implement Cased & Case_Ignorable properties
- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
- Problem: These properties should be disjoint, but aren't
- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
- change ucase.icu to be able to store any combination of Cased and Case_Ignorable

*** Implement Changes_When_Xyz properties
- without stored data

*** Implement Name_Alias property
- add it as another name field in unames.icu
- make it available via u_charName() and UCharNameChoice and
- consider it in u_charFromName()

*** Break iterators

* Update break iterator rules to new UAX versions and new property values
* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary

*** new BidiTest file
- review format and data
- copy BidiTest.txt to source/test/testdata
- write test code using this data
- fix ICU code where it fails the conformance test

*** Java
- generally, find and update code corresponding to C/C++
- UCharacter.UnicodeBlock constants:
  a) add an _ID integer per new block, update COUNT
  b) add a class instance per new block
     Visual Studio regex:
     find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
     replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()

- port test changes to Java

*** LayoutEngine script information

(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

-> Eric Mader wrote in email on 20090930:
   "I think the tool has been modified to update @draft to @stable for
   older scripts and to add @draft for new scripts.
   (I worked with an intern on this last year.)
   You should check the output after you run it."

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

-> Eric Mader wrote in email on 20090930:
   "This is just a matter of making sure that all the per-script tables have
   entries for any new scripts that were added.
   If any new Indic characters were added, then the class tables in
   IndicClassTables.cpp should be updated to reflect this.
   John Emmons should know how to do this if it's required."
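
The Visual Studio find/replace listed in the Java notes above (UCharacter.UnicodeBlock
constants) can be sketched in Python as follows. This is an illustration only, assuming
input lines in the uchar.h enum style; the function name is hypothetical and the sample
constant is a placeholder, not copied from uchar.h.

```python
import re

# Same pattern as the editor regex above, written with Python groups:
# UBLOCK_<name> = <number>, <comment>
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_java_constant(c_line):
    """Turn one uchar.h enum line into a UnicodeBlock constant, or None if no match."""
    m = UBLOCK_LINE.search(c_line)
    if m is None:
        return None
    name, comment = m.groups()
    return 'public static final UnicodeBlock %s = new UnicodeBlock("%s", %s_ID); %s' % (
        name, name, name, comment)


if __name__ == "__main__":
    # Placeholder input line for illustration.
    print(to_java_constant("    UBLOCK_EXAMPLE_BLOCK = 172, /*[0800]*/"))
```

Step a) above (adding the _ID integers and updating COUNT) is not covered by this sketch.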

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
    N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
  None of the codes above 127 is yet the script code of an
  assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR ; CR
    SB ; EX ; Extend
    SB ; LF ; LF
    SB ; SC ; SContinue
- new Word_Break (WB) values:
    WB ; CR ; CR
    WB ; Extend ; Extend
    WB ; LF ; LF
    WB ; MB ; MidNumLet

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.
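
The script-code overflow described in this Unicode 5.1 section (USCRIPT_CODE_LIMIT=130,
maximum value 129, old field width 7 bits) is plain bit-width arithmetic. The sketch below
only reproduces that arithmetic; the helper names are made up for illustration and nothing
here reflects the actual uprops.h layout.

```python
def bits_needed(max_value):
    """Return the number of bits needed to store unsigned values 0..max_value."""
    return max(1, max_value.bit_length())


def fits(max_value, bits):
    """Return True if max_value fits into an unsigned field that is `bits` wide."""
    return max_value < (1 << bits)


USCRIPT_CODE_LIMIT = 130             # value quoted in the notes above (ICU 4.0)
max_script = USCRIPT_CODE_LIMIT - 1  # 129, the maximum script value to store

if __name__ == "__main__":
    print(fits(max_script, 7))       # False: 129 overflows the old 7-bit field
    print(fits(max_script, 8))       # True: the widened 8-bit field holds it
    print(bits_needed(max_script))   # 8
```

The same check explains why the other two fields were widened ahead of need: a block count
of 172 still fits in 8 bits, so the move to 9 bits is headroom rather than a fix.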

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332    RFE: Update to Unicode 4.1
4157    RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
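
The round-trip check on mirroring can also be approximated at the data level, before any ICU code is built. The following is only a sketch under stated assumptions: it does not call u_charMirror(), it reads lines in the "code; mirror # comment" layout of BidiMirroring.txt, and the helper names parse_mirroring and round_trip_failures as well as the four sample lines are made up for illustration.

```python
def parse_mirroring(lines):
    """Parse 'code; mirror # comment' lines into a dict of int -> int."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code, mirror = (field.strip() for field in data.split(";"))
        mapping[int(code, 16)] = int(mirror, 16)
    return mapping


def round_trip_failures(mapping):
    """Return the code points c for which mirror(mirror(c)) != c."""
    def mirror(c):
        # Code points without an entry are assumed to map to themselves.
        return mapping.get(c, c)
    return sorted(c for c in mapping if mirror(mirror(c)) != c)


# Hypothetical sample input; a real run would read the whole data file.
SAMPLE = """\
# sample lines in BidiMirroring.txt format
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
003C; 003E # LESS-THAN SIGN
003E; 003C # GREATER-THAN SIGN
""".splitlines()

mapping = parse_mirroring(SAMPLE)
failures = round_trip_failures(mapping)
print("round-trip failures:", ["%04X" % c for c in failures])
```

An empty failure list means every listed code point maps back to itself after two steps. The test in the ICU test suite still has to exercise u_charMirror() itself.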

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170    RFE: Update to Unicode 4.0.1
3171    Add new Unicode 4.0.1 properties
3520    use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
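
To illustrate the two format changes that gennorm.c had to track, here is a small Python sketch that accepts either style of line and reports it in the newer style. It is not the gennorm.c parser. The function name normalize_fields is hypothetical, and extending the "NFD_NO" pattern to the other forms (NFC, NFKC, NFKD, and "MAYBE" -> "M") is an assumption based on the "etc." above.

```python
import re

# Old-style single-field quick-check names such as NFD_NO or NFKC_MAYBE.
# Assumption: all four forms follow the NFD_NO pattern named in the notes.
_OLD_QC = re.compile(r"^(NFK?[CD])_(NO|MAYBE)$")


def normalize_fields(line):
    """Return (range, property, value) in the newer style.

    Returns None for blank and comment-only lines. The value is None
    for properties that carry no value field.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    code_range, prop = fields[0], fields[1]
    value = fields[2] if len(fields) > 2 else None
    match = _OLD_QC.match(prop)
    if match:
        # "NFD_NO" becomes the two fields "NFD_QC" and "N".
        prop, value = match.group(1) + "_QC", match.group(2)[0]
    elif prop == "FNC":
        prop = "FC_NFKC"
    return code_range, prop, value


# Illustrative lines only, not copied from a particular UCD version.
for sample in ("00C0..00C5 ; NFD_NO # old style",
               "00C0..00C5 ; NFD_QC; N # new style",
               "037A ; FNC; 0020 03B9"):
    print(normalize_fields(sample))
```

Mapping both styles to one tuple shape keeps the rest of a parser independent of the file version, which is one possible way to handle such a transition.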

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
