---
title: Unihan Data
---

# Unihan Data

## Background

In CLDR, we use this data for sorting and romanization of Chinese data. Both of these need to be weighted for proper names, since those are the items most commonly needed (contact names, map locations, etc.).

1. Sorting
    1. A major collation for simplified Chinese compares characters first by pinyin, then (if the pinyin is the same) by total strokes. It thus needs the most common (simplified) pinyin values, and total (simplified) strokes.
    2. A major collation for traditional Chinese compares characters by total (traditional) strokes. It needs reliable total (traditional) strokes.
    3. For both of these, we use the Unicode radical-stroke (kRSUnicode) as a tie-breaker. The pinyin values need to be the best single-character readings (without context).
2. Romanization
    1. We need to have the most common pinyin values. These can have contextual readings (e.g., spanning more than one character).

## Tool

There is a file called **GenerateUnihanCollators.java** which is currently used to generate the CLDR data, making use of Unihan data plus some special data files. The code is pretty crufty, since it was mostly designed to synthesize data from different sources before kMandarin and kTotalStrokes were expanded in Unihan. It is currently in the unicodetools project since it needs to be run against draft versions of the UCD.
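The two collation orders described under Background amount to multi-level sort keys. The following is a minimal sketch in Python, not the code in GenerateUnihanCollators.java. The `HanInfo` record, the `SAMPLE` table, and the function names are hypothetical, and representing a pinyin reading as a (toneless syllable, tone number) pair is a simplifying assumption. A real implementation would also carry separate simplified and traditional stroke counts.

```python
from typing import Dict, NamedTuple, Tuple


class HanInfo(NamedTuple):
    """Hypothetical per-character record used only for this sketch."""
    pinyin: Tuple[str, int]      # (toneless syllable, tone number): an assumption
    total_strokes: int           # one count here; real data would split Hans/Hant
    rs_unicode: Tuple[int, int]  # kRSUnicode as (radical number, residual strokes)


# Small illustrative sample, not a substitute for the Unihan data files.
SAMPLE: Dict[str, HanInfo] = {
    "一": HanInfo(("yi", 1), 1, (1, 0)),
    "二": HanInfo(("er", 4), 2, (7, 0)),
    "三": HanInfo(("san", 1), 3, (1, 2)),
    "十": HanInfo(("shi", 2), 2, (24, 0)),
    "人": HanInfo(("ren", 2), 2, (9, 0)),
    "石": HanInfo(("shi", 2), 5, (112, 0)),
}


def pinyin_key(ch: str):
    """Pinyin first, then total strokes, then kRSUnicode as the tie-breaker."""
    info = SAMPLE[ch]
    return (info.pinyin, info.total_strokes, info.rs_unicode)


def stroke_key(ch: str):
    """Total strokes first, then kRSUnicode as the tie-breaker."""
    info = SAMPLE[ch]
    return (info.total_strokes, info.rs_unicode)


def sort_by_pinyin(chars: str) -> str:
    return "".join(sorted(chars, key=pinyin_key))


def sort_by_strokes(chars: str) -> str:
    return "".join(sorted(chars, key=stroke_key))


print(sort_by_pinyin("一二三十人石"))   # 二人三十石一
print(sort_by_strokes("一二三十人石"))  # 一二人十三石
```

In the pinyin order, 十 and 石 share a reading in this sample, so the stroke count decides between them. In the stroke order, 二, 人, and 十 all have two strokes, so the radical number decides.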

As input, it uses the Unicode properties, plus the following:

- [bihua-chinese-sorting.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/bihua-chinese-sorting.txt)
- [CJK\_Radicals.csv](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/CJK_Radicals.csv)
- [patchPinyin.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/patchPinyin.txt)
- [patchStroke.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/patchStroke.txt)
- [patchStrokeT.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/patchStrokeT.txt)
- [pinyinHeader.txt](http://unicode.org/repos/unicodetools/trunk/unicodetools/org/unicode/draft/pinyinHeader.txt)

It creates a number of files in {Generated}/cldr/han/kMandarin.txt

1. Take Han-Latin.txt, and insert into /cldr/common/transforms/Han-Latin.txt, replacing the lines between
    - \# START AUTOGENERATED Han-Latin.xml
    - \# END AUTOGENERATED Han-Latin.xml
2. Diff to sanity check. Run the Transform tests (or just all of them), then check in.
3. Take the strokeT.\*\\.txt files, and pinyin.\*\\.txt and insert them into the appropriate slots:
    1. pinyin.txt → \# START AUTOGENERATED PINYIN LONG (sort by pinyin then kTotalStrokes then kRSUnicode)
    2. pinyin\_short.txt → \# START AUTOGENERATED PINYIN SHORT (sort by pinyin then kTotalStrokes then kRSUnicode)
    3. strokeT.txt → \# START AUTOGENERATED STROKE LONG
    4. strokeT\_short.txt → \# START AUTOGENERATED STROKE SHORT
4. Diff to sanity check.
5. Run tests, check in.

The tool also generates some files that we should take back to the Unihan people. Either changes should be made in Unihan, or we should drop the items from our patch files. Examples:

1. [kTotalStrokesReplacements.txt](http://unicode.org/repos/cldr-tmp/trunk/dropbox/han/kTotalStrokesReplacements.txt)

    It shows the cases where the bihua values differ from those in Unihan.

2. [imputedStrokes.txt](http://unicode.org/repos/cldr-tmp/trunk/dropbox/han/imputedStrokes.txt)

    It shows the cases where a stroke count is synthesized from radical/stroke information. This is only approximate, but better than sorting them all at the bottom. It is only used if there is no Unihan or bihua information.

### Stopgap

As a proxy for the best pinyin, we pick from the many pinyin choices in Unihan using an algorithm that Richard supplied. There is a small patch file based on having native Chinese speakers look over the data. Any patches should be pulled back into Unihan. The algorithm is:

Take the first pinyin from the following. Where there are multiple choices in a field, use the first.

1. patchFile
2. kMandarin (moved up in CLDR 30)
3. kHanyuPinlu
4. kXHC1983
5. kHanyuPinyin
6. bihua

Then, if it is still missing, try to map to a character that does have a pinyin. If we find one, stop and use it.

1. Radical => Unified
2. kTraditionalVariant
3. kSimplifiedVariant
4. NFKD

## OLD

~~**DRAFT!!**~~

~~In 1.9, we converted to using Unihan data for CLDR collation and transliteration. We've run into some problems (pedberg - see for example~~ [#3428](http://unicode.org/cldr/trac/ticket/3428)~~), and this is a draft proposal for how to resolve them.~~

### Longer Term

~~The following are (draft) recommendations for the UTC.~~

1. ~~Define the kMandarin field to contain one or two values. If there are two values, then the first is preferred for zh-Hans (CN) and the second is preferred for zh-Hant (TW). If the values would be the same, there is only one value. (pedberg - it is already defined that way)~~
2. ~~The preferred value should be the one that is most commonly used, with a focus on proper names (persons or places). For example, if reading X has 30% of the frequency of Y, but X is used with proper names but Y is not, X would be preferred.~~
3. ~~Define the kTotalStrokes field to be what is most appropriate for use with zh-Hant, and add a new field, kTotalSimplifiedStrokes, to be what is most appropriate for use with zh-Hans. pedberg - The kTotalStrokes field is already defined to be the value "for the character as drawn in the Unicode charts" which may not match the value for zh-Hant; we may need to add 2 stroke count fields.~~
4. ~~Get a commitment from the IRG to supply these values for all new characters. Set in place a program to add/fix values for existing characters.~~

~~Once this is in place, remove the now-superfluous patch files in the CLDR collation/transliteration generation.~~

### Short Term (1.9.1)

1. ~~Modify the pinyin to choose the 1.8 CLDR transliteration value first, then fall back to the others.~~
2. ~~Have two transliteration pinyin variants: Names and General. Make the default for pinyin be "Names". (There are only currently 2 differences.) (pedberg - Yes, but there is a ticket to add more, see~~ [~~#3381~~](http://unicode.org/cldr/trac/ticket/3381)~~, which covers some of the problems from #3428 above)~~
3. ~~Use the default pinyin for collation.~~
4. ~~Add two total-strokes patch files for the collation generator, one for simplified and one for traditional.~~
5. ~~In the generator, have two different total-strokes used for simplified vs traditional.~~

~~pedberg comments:~~

1. ~~We need to ensure that the transliteration value is consistent with the pinyin collator.~~
2. ~~The 1.8 transliterator had many errors, I don't think a wholesale fallback to that is a good idea.~~
3. ~~Using the name reading rather than the general reading for standard pinyin collation might produce unexpected results.~~
4. ~~Why not just specify the name reading when that is desired? No need to make it the default if it is the less common reading.~~
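
For reference, the pinyin selection order listed under Stopgap (which is current, unlike the struck-through proposals above) can be sketched as follows. This is a hedged illustration in Python, not the logic of GenerateUnihanCollators.java. The dictionary layout, the function names, the `radicalToUnified` mapping name, and the toy readings are all assumptions made for the example. Real variant fields can also hold more than one target character, which this sketch ignores.

```python
import unicodedata
from typing import Dict, List, Optional

# Source priority as listed under Stopgap (kMandarin moved up in CLDR 30).
SOURCE_ORDER = ["patchFile", "kMandarin", "kHanyuPinlu",
                "kXHC1983", "kHanyuPinyin", "bihua"]

# Fallback mappings tried in order when no source has a reading.
# "radicalToUnified" is a hypothetical name for the Radical => Unified step.
FALLBACK_ORDER = ["radicalToUnified", "kTraditionalVariant",
                  "kSimplifiedVariant", "NFKD"]

Readings = Dict[str, Dict[str, List[str]]]  # source -> char -> readings
Mappings = Dict[str, Dict[str, str]]        # mapping name -> char -> char


def direct_pinyin(ch: str, readings: Readings) -> Optional[str]:
    """Return the first reading from the highest-priority source that has one."""
    for source in SOURCE_ORDER:
        values = readings.get(source, {}).get(ch)
        if values:
            return values[0]  # multiple choices in a field: use the first
    return None


def best_pinyin(ch: str, readings: Readings, mappings: Mappings) -> Optional[str]:
    """Try the sources, then map to a related character that has a reading."""
    found = direct_pinyin(ch, readings)
    if found is not None:
        return found
    for name in FALLBACK_ORDER:
        if name == "NFKD":
            target = unicodedata.normalize("NFKD", ch)
        else:
            target = mappings.get(name, {}).get(ch)
        if target and target != ch:
            found = direct_pinyin(target, readings)
            if found is not None:
                return found  # stop at the first mapping that yields a reading
    return None


# Toy data for illustration only; not taken from the Unihan files.
readings = {
    "kMandarin": {"一": ["yī"], "行": ["xíng"]},
    "kHanyuPinyin": {"行": ["háng", "xíng"], "長": ["cháng", "zhǎng"]},
}
mappings = {"kTraditionalVariant": {"长": "長"}}

print(best_pinyin("行", readings, mappings))       # kMandarin outranks kHanyuPinyin
print(best_pinyin("长", readings, mappings))       # found via kTraditionalVariant
print(best_pinyin("\u2f00", readings, mappings))  # Kangxi radical, found via NFKD
```

Keeping the two priority lists as data makes a reordering, such as the CLDR 30 change that moved kMandarin up, a one-line edit.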