<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta http-equiv="Content-Language" content="en-us">
  <meta name="VI60_defaultClientScript" content="JavaScript">
  <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
  <meta name="keywords" content="Unicode, common locale data repository">
  <meta name="ProgId" content="FrontPage.Editor.Document">
  <title>Unicode CLDR Bug Reports</title>
  <link rel="stylesheet" type="text/css" href="http://www.unicode.org/webscripts/standard_styles.css">
  <style type="text/css">
<!--
.e{margin-left:1em;text-indent:-1em;margin-right:1em}
.tx{font-weight:bold}
-->
  </style>
</head>
<body text="#330000">
<table border="0" cellpadding="0" cellspacing="0" width="100%">
  <tbody>
    <tr>
      <td colspan="2">
      <table border="0" cellpadding="0" cellspacing="0" width="100%">
        <tbody>
          <tr>
            <td class="icon"><a href="http://www.unicode.org/">
            <img src="http://www.unicode.org/webscripts/logo60s2.gif" alt="[Unicode]" align="middle" border="0" height="33" width="34"></a>
            <a class="bar" href="index.html"><font size="3">Common Locale Data Repository</font></a></td>
            <td class="bar"><a href="http://www.unicode.org" class="bar">Home</a> | <a href="http://www.unicode.org/sitemap/" class="bar">Site Map</a> |
            <a href="http://www.unicode.org/search/" class="bar">Search</a></td>
          </tr>
        </tbody>
      </table>
      </td>
    </tr>
    <tr>
      <td colspan="2" class="gray">&nbsp;</td>
    </tr>
    <tr>
      <td class="navCol" valign="top" width="25%">
      <table class="navColTable" border="0" cellpadding="0" cellspacing="4" width="100%">
        <tbody>
          <tr>
            <td class="navColTitle">Contents</td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="#Collation_Bugs">Collation Bugs</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="#Possible_Comparison_Sources">Sources</a></td>
          </tr>
          <tr>
            <td class="navColTitle">Unicode CLDR</td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="index.html">CLDR Project</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="repository_access.html">CLDR Releases (Downloads)</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="survey_tool.html">CLDR Survey Tool</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="filing_bug_reports.html">CLDR Bug Reports</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="comparison_charts.html">CLDR Charts</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="process.html">CLDR Process</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/reports/tr35/">UTS #35: Locale Data Markup Language (LDML)</a></td>
          </tr>
          <tr>
            <td class="navColTitle">Related Links</td>
          </tr>
          <tr>
            <td class="navColCell" valign="top">Join the <a href="http://www.unicode.org/consortium/consort.html">Unicode Consortium</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/reports/">Unicode Technical Reports</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/faq/reports_process.html">Technical Reports Development and Maintenance Process</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/consortium/utc.html">Unicode Technical Committee</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/versions/">Versions of the Unicode Standard</a></td>
          </tr>
          <tr>
            <td class="navColTitle">Other Publications</td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/standard/standard.html">The Unicode Standard</a></td>
          </tr>
          <tr>
            <td class="navColCell" valign="top"><a href="http://www.unicode.org/notes/">Unicode Technical Notes</a></td>
          </tr>
        </tbody>
      </table>
      <!-- BEGIN CONTENTS --></td>
      <td>
      <table>
        <tbody>
          <tr>
            <td class="contents" valign="top">
            <div class="body">
            <h1>Unicode CLDR Bug Reports</h1>
            <p><span class="changed">Most proposed data (new or corrections) should be entered via the </span><a href="survey_tool.html">CLDR Survey Tool</a><span class="changed">.
            </span></p>
            <p>Bugs may be filed for defects in the survey tool, for
adding or changing non-language data (such as currency usage), for
additions or changes to data that is not yet handled by the survey tool
(collation, segmentation, and transliteration), and for feature
requests in CLDR or <a href="http://www.unicode.org/reports/tr35/">UTS #35: Locale Data Markup Language (LDML)</a>.</p>
            <p>To file such a bug, go to <a href="http://www.unicode.org/cldr/bugs/locale-bugs">Locale Bugs</a>.
Try to give as much information as possible to help address the issue,
and please group related bugs (such as a list of problems with the LDML
specification) into a single bug report. Some specific cases are
covered below.</p>
            <h2><a name="Collation_Bugs">Collation Bugs</a></h2>
            <p>The exact collation sequence for a given language may be
difficult to determine. The base ordering of characters can be fairly
straightforward, but there are quite a few other complications
involved.
</p>
            <p><span>Most standards that specify collation, such as DIN
or CS, are not targeted at algorithmic sorting and are not complete
algorithmic specifications. For example, CSN 97 6030 requires
transliteration of foreign scripts, but there are many choices as to
how to transliterate, and the exact mechanism is not specified. It also
specifies that geometric shapes are sorted by the number of vertices
and edges, which is, at a minimum, difficult to determine and is
subject to variation in glyphs. </span><span>The CLDR goals are to match the sorting of exemplar letters
            and common punctuation, and to
            leave everything else to the standard UCA ordering. </span>For more information, see
            <a href="http://www.unicode.org/reports/tr10/#Introduction">UTS #10: Unicode Collation Algorithm</a> (UCA).</p>
            <p>For readability, the rules are presented here in
Java/ICU rule format, rather than XML; for the same reason, we prefer
that bug reports also use that format, even though the end result
will be in XML. For more information, see <a href="http://icu.sourceforge.net/userguide/Collate_Customization.html">ICU Collation Customization</a>.</p>
            <p>Please supply some short test cases that illustrate the
correct sorting behavior as a list of lines in sorted order.
Try to
include cases that show the boundary behavior by including high
suffixes, such as the following:</p>
            <ul>
              <li><i>Rules:</i>
              <ul>
                <li><i>&amp; c &lt; cs</i></li>
                <li>&amp; cs &lt;&lt;&lt; ccs / cs</li>
              </ul>
              </li>
              <li><i>Test Data:</i>
              <ul>
                <li><i>c<br>
                cy<br>
                cs<br>
                cscs<br>
                ccs<br>
                cscsy<br>
                ccsy<br>
                csy<br>
                d</i></li>
              </ul>
              </li>
            </ul>
            <p>Please test any suggested rules before filing a bug, using the Locale Explorer:</p>
            <ol>
              <li>Go to the <a href="http://ibm.com/software/globalization/icu/demo/locales">ICU Locale Explorer</a>.</li>
              <li>Pick the appropriate locale.</li>
              <li>Follow the instructions at the bottom to use your suggested rules on your suggested test data.</li>
              <li>Verify that the proper order results.</li>
            </ol>
            <h3>Pitfalls</h3>
            <p>There are a number of pitfalls with collation, so be
careful. In some cases, such as Hungarian or Japanese, the rules can be
fairly complicated (reflecting, of course, that the sorting sequence for
those languages is complicated).</p>
            <ol>
              <li><b>Only tailor expected data. </b>We focus on the required collation sequence for a given language with normal data. So we don't include
              full-width characters for a European collation sequence, such as:
              <ul>
                <li>... CSCS &lt;&lt;&lt; ＣＳＣＳ ...</li>
                <li>... CSCS &lt;&lt;&lt; \uFF23\uFF33\uFF23\uFF33 ... (equivalently)</li>
              </ul>
              </li>
              <li><b>Tailor trailing contractions. </b>If a sequence of characters is treated as a unit for collation, it should be entered as a contraction.
              <p>&amp; c &lt; ch</p>
              <p>One might think that a sequence like "dz" doesn't
require that, since it would always come after "d" followed by any
other letter; it is a "trailing contraction". But in unusual cases,
that wouldn't be true; if "dz" is a unit sorted as if it were a
distinct letter after "d", one should get the ordering "d<font size="3">α" &lt; "dz". This will only happen if "dz" is a contraction, such as:</font></p>
              <p><font size="3">&amp; d &lt; dz</font></p>
              </li>
              <li><b>Watch out for Expansions.</b> If you have a rule like &amp;cs &lt; d, and "cs" has not occurred in a previous rule as a contraction, then
              this is automatically considered to be the same as &amp;c &lt; d / s; that is, the d <i>expands</i> as if it were a "cs" (actually, primary greater
              than a "cs", since we wrote "&lt;"). This expansion takes effect until the next primary difference.
              <p>So suppose that "ccs" is to behave as if it were
"cscs", and take case differences into account. You might try to do
this with the rules on the left:</p>
              <table id="table3" border="1" cellpadding="4" cellspacing="0">
                <tbody>
                  <tr>
                    <th align="left" width="50%">Rules (Wrong)</th>
                    <th align="left" width="50%">Actual Effect</th>
                  </tr>
                  <tr>
                    <td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
                    &amp; cscs &lt;&lt;&lt; ccs<br>
                    &lt;&lt;&lt; Cscs &lt;&lt;&lt; Ccs<br>
                    &lt;&lt;&lt; CSCS &lt;&lt;&lt; CCS</td>
                    <td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
                    &amp; cs &lt;&lt;&lt; ccs / cs<br>
                    &lt;&lt;&lt; Cscs / cs &lt;&lt;&lt; Ccs / cs<br>
                    &lt;&lt;&lt; CSCS / cs &lt;&lt;&lt; CCS / cs</td>
                  </tr>
                </tbody>
              </table>
              <p>But since the <u>CSCS</u> has not been made a contraction in previous rules, this produces an automatic expansion, one that continues
              through the entire sequence of non-primary differences, as shown on the right. This is <i>not</i> what is wanted: each item acts like it
              expands compared to the previous item.
So CCS, for example, will act like it expands to CSCScs!</p>
              <p>What you actually want is the following:</p>
              <table id="table4" border="1" cellpadding="4" cellspacing="0">
                <tbody>
                  <tr>
                    <th align="left" width="50%">Rules (Right)</th>
                    <th align="left" width="50%">Actual Effect</th>
                  </tr>
                  <tr>
                    <td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
                    &amp; cscs &lt;&lt;&lt; ccs<br>
                    &amp; Cscs &lt;&lt;&lt; Ccs<br>
                    &amp; CSCS &lt;&lt;&lt; CCS</td>
                    <td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
                    &amp; cs &lt;&lt;&lt; ccs / cs<br>
                    &amp; Cs &lt;&lt;&lt; Ccs / cs<br>
                    &amp; CS &lt;&lt;&lt; CCS / CS</td>
                  </tr>
                </tbody>
              </table>
              <p>In short, when you have expansions, it is always
safer and clearer to express them with separate resets. There are only
a few exceptions to this, notably when CJK characters are interleaved
with Hangul Syllables.</p>
              </li>
              <li><b>Don't tailor what you don't have to. </b>Example: Maltese was sorting character sequences <i>before</i> a base character using the
              following style:
              <p>&amp; B<br>
              &lt; ċ<br>
              &lt;&lt;&lt; Ċ<br>
              &lt; c<br>
              &lt;&lt;&lt; C</p>
              <p>This works, but is suboptimal for two reasons:</p>
              <ol>
                <li>It tailors c/C when it doesn't need to; any extra tailoring generally makes for longer sort keys.</li>
                <li>By tailoring c/C, it puts the other things that sort after b/B after c/C instead. See
                <a href="http://www.unicode.org/charts/collation/">http://www.unicode.org/charts/collation/</a> for examples.</li>
              </ol>
              <p>The correct rules should be:</p>
              <p>&amp; [before 1] c &lt; ċ &lt;&lt;&lt; Ċ</p>
              <p>This finds the highest primary (that's what the 1 is
for) character less than c, and uses that as the reset point.
For
Maltese, the same technique needs to be used for ġ and ż.</p>
              </li>
              <li>Contractions can be blocked with CGJ (U+034F COMBINING GRAPHEME JOINER), as described in the Unicode Standard and in the
              <a href="http://www.unicode.org/faq/char_combmark.html">Characters and Combining Marks FAQ</a>.</li>
              <li>Normally, all combinations of case need to be supplied for contractions. That is, if <i>ch</i>
is a contraction, then you would have the rules ... ch &lt;&lt;&lt; cH &lt;&lt;&lt; Ch
&lt;&lt;&lt; CH. The reason for this is so that all case variants sort at the
same primary level; thus lowercasing a string will not affect its
primary order. Cases such as <i>McHugh</i> are handled like other instances where contractions should be blocked.</li>
            </ol>
            <h2><a name="Possible_Comparison_Sources">Possible Comparison Sources</a></h2>
            <p>Sources and references may be standards, dictionaries, journal style guides (such as <i>The Economist Style Guide for English</i>),
and other available sources that provide guidance on common
practice. Online sources are preferred where available, since they can
be more easily checked.</p>
            <p>The goal is to follow common, customary practice. For
example, language or territory display names should use the most
recognizable name in common usage. This is generally not the official
name. For example, one would use "Switzerland", not "Swiss
Confederation".</p>
            <p>Here are some possible resources for comparison of locale data. <i>This is <b>not</b> an endorsement of the sources, merely a collection of
            possibly useful links.
</i><font color="black" face="Arial" size="3"><span style="font-size: 12pt;">To suggest additions, </span></font>
            file a <a href="filing_bug_reports.html">Bug Report</a>.</p>
            <h3>Territory names; Language names; Gregorian/non-Gregorian month names; Day names; Exemplar characters; Collation</h3>
            <ul>
              <li><a href="http://www.geonames.de/">http://www.geonames.de/</a></li>
            </ul>
            <h3><i>The Economist Style Guide</i> (unfortunately available only in hard copy): Currencies, Display Names, Formatting for English</h3>
            <ul>
              <li><a href="http://www.amazon.com/exec/obidos/tg/detail/-/186197535X">http://www.amazon.com/exec/obidos/tg/detail/-/186197535X</a> </li>
            </ul>
            <h3><a name="Exemplar_Characters">Exemplar Characters</a></h3>
            <ul>
              <li><a href="http://www.eki.ee/letter/">http://www.eki.ee/letter/</a> </li>
              <li><a href="http://europa.eu.int/comm/eurostat/research/index.htm?http://europa.eu.int/en/comm/eurostat/research/isi/special/&1">http://europa.eu.int/comm/eurostat/research/index.htm</a></li>
              <li><a href="http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin"><span>http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin</span></a><span>
              </span></li>
              <li><a href="http://www.omniglot.com/writing/alphabets.htm">
              http://www.omniglot.com/writing/alphabets.htm</a> </li>
              <li><a href="http://www.geonames.de/">http://www.geonames.de/</a></li>
            </ul>
            <h3>Territory Names</h3>
            <ul>
              <li><a href="http://www.world-gazetteer.com/pronun.htm">http://www.world-gazetteer.com/pronun.htm</a></li>
              <li><a href="http://www.eki.ee/knn/lingid2.htm#WRLD">http://www.eki.ee/knn/lingid2.htm#WRLD</a> </li>
              <li><a href="http://www.p.lodz.pl/I35/personal/jw37/EUROPE/europe.html">http://www.p.lodz.pl/I35/personal/jw37/EUROPE/europe.html</a>
              </li>
            </ul>
            <h3>Currency names; Territory names (replace es with the desired language code)</h3>
            <ul>
              <li><a href="http://publications.eu.int/code/es/es-5000500.htm">http://publications.eu.int/code/es/es-5000500.htm</a> <br>
              <a href="http://publications.eu.int/code/es/es-5000700.htm">http://publications.eu.int/code/es/es-5000700.htm</a> <br>
              <a href="http://publications.eu.int/">http://publications.eu.int/</a> </li>
            </ul>
            <h3>Territory &amp; Region names (use the links at the top to switch languages)</h3>
            <ul>
              <li><a href="http://www.worldlanguage.com/Arabic/Countries/">http://www.worldlanguage.com/Arabic/Countries/</a> </li>
            </ul>
            <h3>Exemplar/collation information</h3>
            <ul>
              <li><a href="http://www.omniglot.com/writing/">http://www.omniglot.com/writing/</a><br>
              <a href="http://www.alphabets-world.com/">http://www.alphabets-world.com/</a> <br>
              <a href="http://developer.mimer.com/collations/charts/">http://developer.mimer.com/collations/charts/</a> </li>
            </ul>
            <h3>Simple Translations</h3>
            <ul>
              <li><a href="http://world.altavista.com/">http://world.altavista.com/</a></li>
              <li><a href="http://www.google.com/language_tools">http://www.google.com/language_tools</a> </li>
            </ul>
            <h3>List of date/time formatting for Windows</h3>
            <ul>
              <li><a href="http://www.microsoft.com/globaldev/nlsweb/">http://www.microsoft.com/globaldev/nlsweb/</a> </li>
            </ul>
            <h3>Exemplar Characters; Transliteration</h3>
            <ul>
              <li><a href="http://www.eki.ee/wgrs/">UNGEGN: Working Group on Romanization Systems</a> </li>
              <li><a href="http://ee.www.ee/transliteration/">Transliteration of Non-Roman Alphabets and Scripts (Søren Binks)</a> </li>
              <li><a href="http://www.archivists.org/catalog/stds99/chapter8.html">Standards for Archival
Description: Romanization</a> </li>
              <li><a href="http://ee.www.ee/transliteration/pdf/Hindi-Marathi-Nepali.pdf">ISO-15915 (Hindi)</a> </li>
              <li><a href="http://ee.www.ee/transliteration/pdf/Gujarati.pdf">ISO-15915 (Gujarati) </a></li>
              <li><a href="http://ee.www.ee/transliteration/pdf/Kannada.pdf">ISO-15915 (Kannada) </a></li>
              <li><a href="http://www.cdacindia.com/html/gist/down/iscii_d.asp">ISCII-91</a> </li>
            </ul>
            <h3>Geographical Names</h3>
            <ul>
              <li><a href="http://unstats.un.org/unsd/geoinfo/">http://unstats.un.org/unsd/geoinfo/</a> </li>
            </ul>
            <h3><span>Currencies</span></h3>
            <ul>
              <li><a href="http://www.globalfindata.com/gh/index.html"><span>http://www.globalfindata.com/gh/index.html</span></a><span> </span></li>
            </ul>
            <h3>General</h3>
            <ul>
              <li><a href="http://www.cia.gov/cia/publications/factbook/">http://www.cia.gov/cia/publications/factbook/</a> </li>
              <li><a href="http://www.microsoft.com/mspress/books/5717.asp">http://www.microsoft.com/mspress/books/5717.asp</a>: a very complete set of information,
              such as postal information, currency symbols, date/time formats, calendars, ...</li>
            </ul>
            <p>&nbsp;</p>
            <blockquote>
            </blockquote>
            </div>
            </td>
          </tr>
          <tr>
            <td class="contents" valign="top">&nbsp;</td>
          </tr>
        </tbody>
      </table>
      <hr width="50%">
      <div align="center">
      <center>
      <table border="0" cellpadding="0" cellspacing="0">
        <tbody>
          <tr>
            <td><a href="http://www.unicode.org/copyright.html">
            <img src="http://www.unicode.org/img/hb_notice.gif" alt="Access to Copyright and terms of use" border="0" height="50" width="216"></a></td>
          </tr>
        </tbody>
      </table>
      <script language="Javascript" type="text/javascript"
src="http://www.unicode.org/webscripts/lastModified.js"> 770 </script> 771 </center> 772 </div> 773 774 </td> 775 776 </tr> 777 778 </tbody> 779</table> 780 781 782</body> 783</html> 784