1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<html>
3<head>
4
5  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
6
7  <meta http-equiv="Content-Language" content="en-us">
8
9  <meta name="VI60_defaultClientScript" content="JavaScript">
10
11  <meta name="GENERATOR" content="Microsoft FrontPage 6.0">
12
13  <meta name="keywords" content="Unicode, common locale data repository">
14
15  <meta name="ProgId" content="FrontPage.Editor.Document">
16
17
18  <title>Unicode CLDR Bug Reports</title>
19  <link rel="stylesheet" type="text/css" href="http://www.unicode.org/webscripts/standard_styles.css">
20
21  <style type="text/css">
22<!--
23.e{margin-left:1em;text-indent:-1em;margin-right:1em}
24.tx{font-weight:bold}
25-->
26  </style>
27</head>
28
29
30
31<body text="#330000">
32
33
34<table border="0" cellpadding="0" cellspacing="0" width="100%">
35
36	<tbody>
37    <tr>
38
39		<td colspan="2">
40
41      <table border="0" cellpadding="0" cellspacing="0" width="100%">
42
43			<tbody>
44          <tr>
45
46				<td class="icon"><a href="http://www.unicode.org/">
47				<img src="http://www.unicode.org/webscripts/logo60s2.gif" alt="[Unicode]" align="middle" border="0" height="33" width="34"></a>&nbsp;&nbsp;
48				<a class="bar" href="index.html"><font size="3">Common Locale Data Repository</font></a></td>
49
50				<td class="bar"><a href="http://www.unicode.org" class="bar">Home</a> | <a href="http://www.unicode.org/sitemap/" class="bar">Site Map</a> |
51				<a href="http://www.unicode.org/search/" class="bar">Search</a></td>
52
53			</tr>
54
55
56        </tbody>
57      </table>
58
59		</td>
60
61	</tr>
62
63	<tr>
64
65		<td colspan="2" class="gray">&nbsp;</td>
66
67	</tr>
68
69	<tr>
70
71		<td class="navCol" valign="top" width="25%">
72
73      <table class="navColTable" border="0" cellpadding="0" cellspacing="4" width="100%">
74
75			<tbody>
76          <tr>
77
78				<td class="navColTitle">Contents</td>
79
80			</tr>
81
82			<tr>
83
84				<td class="navColCell" valign="top"><a href="#Collation_Bugs">Collation Bugs</a></td>
85
86			</tr>
87
88			<tr>
89
90				<td class="navColCell" valign="top"><a href="#Possible_Comparison_Sources">Sources</a></td>
91
92			</tr>
93
94			<tr>
95
96				<td class="navColTitle">Unicode CLDR</td>
97
98			</tr>
99
100			<tr>
101
102				<td class="navColCell" valign="top"><a href="index.html">CLDR Project</a></td>
103
104			</tr>
105
106			<tr>
107
108				<td class="navColCell" valign="top"><a href="repository_access.html">CLDR Releases (Downloads)</a></td>
109
110			</tr>
111
112			<tr>
113
114				<td class="navColCell" valign="top"><a href="survey_tool.html">CLDR Survey Tool</a></td>
115
116			</tr>
117
118			<tr>
119
120				<td class="navColCell" valign="top"><a href="filing_bug_reports.html">CLDR Bug Reports</a></td>
121
122			</tr>
123
124			<tr>
125
126				<td class="navColCell" valign="top"><a href="comparison_charts.html">CLDR Charts</a></td>
127
128			</tr>
129
130			<tr>
131
132				<td class="navColCell" valign="top"><a href="process.html">CLDR Process</a></td>
133
134			</tr>
135
136			<tr>
137
138				<td class="navColCell" valign="top"><a href="http://www.unicode.org/reports/tr35/">UTS #35: Locale Data Markup Language (LDML)</a></td>
139
140			</tr>
141
142			<tr>
143
144				<td class="navColTitle">Related Links</td>
145
146			</tr>
147
148			<tr>
149
150				<td class="navColCell" valign="top">Join the <a href="http://www.unicode.org/consortium/consort.html">Unicode Consortium</a></td>
151
152			</tr>
153
154			<tr>
155
156				<td class="navColCell" valign="top"><a href="http://www.unicode.org/reports/">Unicode Technical Reports</a></td>
157
158			</tr>
159
160			<tr>
161
162				<td class="navColCell" valign="top"><a href="http://www.unicode.org/faq/reports_process.html">Technical Reports Development and Maintenance Process</a></td>
163
164			</tr>
165
166			<tr>
167
168				<td class="navColCell" valign="top"><a href="http://www.unicode.org/consortium/utc.html">Unicode Technical Committee</a></td>
169
170			</tr>
171
172			<tr>
173
174				<td class="navColCell" valign="top"><a href="http://www.unicode.org/versions/">Versions of the Unicode Standard</a></td>
175
176			</tr>
177
178			<tr>
179
180				<td class="navColTitle">Other Publications</td>
181
182			</tr>
183
184			<tr>
185
186				<td class="navColCell" valign="top"><a href="http://www.unicode.org/standard/standard.html">The Unicode Standard</a></td>
187
188			</tr>
189
190			<tr>
191
192				<td class="navColCell" valign="top"><a href="http://www.unicode.org/notes/">Unicode Technical Notes</a></td>
193
194			</tr>
195
196
197        </tbody>
198      </table>
199
200		<!-- BEGIN CONTENTS --></td>
201
202		<td>
203
204      <table>
205
206			<tbody>
207          <tr>
208
209				<td class="contents" valign="top">
210
211            <div class="body">
212
213            <h1>Unicode CLDR Bug Reports</h1>
214
215
            <p><span class="changed">Most proposed data (new data or corrections) should be entered via the </span><a href="survey_tool.html">CLDR Survey Tool</a><span class="changed">.</span></p>
218
219
            <p>Bugs may be filed for defects in the survey tool, for
adding or changing non-language data (such as currency usage), for
additions or changes to data that is not yet handled by the survey tool
(collation, segmentation, and transliteration), and for feature
requests in CLDR or <a href="http://www.unicode.org/reports/tr35/">UTS #35: Locale Data Markup Language (LDML)</a>.</p>
225
226
            <p>To file such a bug, go to <a href="http://www.unicode.org/cldr/bugs/locale-bugs">Locale Bugs</a>.
Try to give as much information as possible to help address the issue,
and please group related bugs (such as a list of problems with the LDML
specification) into a single bug report. Some specific cases are
covered below.</p>
232
233
234            <h2><a name="Collation_Bugs">Collation Bugs</a></h2>
235
236
            <p>The exact collation sequence for a given language may be
difficult to determine. The base ordering of characters can be fairly
straightforward, but there are quite a few other complications
involved.</p>
241
242
            <p><span>Most standards that specify collation, such as DIN
or CSN, are not targeted at algorithmic sorting and are not complete
algorithmic specifications. For example, CSN 97 6030 requires
transliteration of foreign scripts, but there are many ways to
transliterate, and the exact mechanism is not specified. It also
specifies that geometric shapes are sorted by their number of vertices
and edges, which is difficult to determine at best and varies with
the glyphs used. </span><span>The CLDR goals are to match the sorting of exemplar letters
and common punctuation, and to leave everything else to the standard UCA ordering. </span>For more information, see
<a href="http://www.unicode.org/reports/tr10/#Introduction">UTS #10: Unicode Collation Algorithm</a> (UCA).</p>
254
255
            <p>For readability, the rules are presented here in
Java/ICU rule format rather than XML; for the same reason, we prefer
that bug reports also use that format, even though the end result
will be in XML. For more information, see <a href="http://icu.sourceforge.net/userguide/Collate_Customization.html">ICU Collation Customization</a>.</p>
260
261
            <p>Please supply some short test cases that illustrate the
correct sorting behavior, as a list of lines in sorted order. Try to
include cases that show the boundary behavior by adding suffixes that
sort high (such as <i>y</i>), as in the following:</p>
266
267
268            <ul>
269
270						<li><i>Rules:</i>
271
272                <ul>
273
							<li><i>&amp; c &lt; cs</i></li>
							<li><i>&amp; cs &lt;&lt;&lt; ccs / cs</i></li>
277
278
279                </ul>
280
281						</li>
282
283						<li><i>Test Data:</i>
284
285                <ul>
286
							<li><i>c<br>
							cy<br>
							cs<br>
							cscs<br>
							ccs<br>
							cscsy<br>
							ccsy<br>
							csy<br>
							d</i></li>
304
305
306                </ul>
307
308						</li>
309
310
311            </ul>
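            <p>As a rough illustration of why the test data above is in the right order, the following Python sketch models the two sample rules by hand. It is only a toy under stated assumptions: the names <i>PRIMARY</i>, <i>ELEMENTS</i>, <i>sort_key</i> and <i>EXPECTED</i> and all weights are invented for this example, only lowercase a to z is covered, and it is not the UCA or ICU implementation. It treats "cs" as a letter of its own after "c", and "ccs" as "cs" followed by "cs" with a tertiary difference.</p>

<pre>
```python
# Toy model of the two sample rules: "cs" is a contraction that sorts as its
# own letter after "c", and "ccs" sorts like "cscs" with a tertiary difference.
# All weights are invented for this sketch; they are not real UCA weights.
import string

# Even primary weights for a..z leave odd slots free for tailored letters.
PRIMARY = {letter: 2 * i for i, letter in enumerate(string.ascii_lowercase)}
CS = PRIMARY["c"] + 1  # primary weight of the contraction "cs"

# Each unit maps to a list of (primary, tertiary) collation elements.
ELEMENTS = {letter: [(weight, 0)] for letter, weight in PRIMARY.items()}
ELEMENTS["cs"] = [(CS, 0)]
ELEMENTS["ccs"] = [(CS, 1), (CS, 0)]  # expands to "cs" + "cs", tertiary-greater


def sort_key(text):
    """Return (primaries, tertiaries), matching the longest unit first."""
    elements = []
    i = 0
    while i != len(text):
        for size in (3, 2, 1):
            unit = text[i:i + size]
            if unit in ELEMENTS:
                elements.extend(ELEMENTS[unit])
                i += len(unit)
                break
        else:
            raise ValueError("unsupported character: " + text[i])
    primaries = tuple(p for p, _ in elements)
    tertiaries = tuple(t for _, t in elements)
    return primaries, tertiaries


EXPECTED = ["c", "cy", "cs", "cscs", "ccs", "cscsy", "ccsy", "csy", "d"]

# Sorting the reversed list must give back the order listed in the test data.
print(sorted(reversed(EXPECTED), key=sort_key) == EXPECTED)  # prints True
```
</pre>

            <p>Comparing all primary weights before any tertiary weight is what places "ccs" after "cscs" but before "cscsy". Real rules should still be checked with an actual collator, as described next.</p>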
312
313
314            <p>Please test out any suggested rules before filing a bug, using Locale Explorer:</p>
315
316
317            <ol>
318
319						<li>Go to the <a href="http://ibm.com/software/globalization/icu/demo/locales">ICU Locale Explorer</a></li>
320
321						<li>Pick the appropriate locale</li>
322
323						<li>Follow the instructions at the bottom to use your suggested rules on your suggested test data.</li>
324
325						<li>Verify that the proper order results.</li>
326
327
328            </ol>
329
330
331            <h3>Pitfalls</h3>
332
333
            <p>There are a number of pitfalls with collation, so be
careful. In some cases, such as Hungarian or Japanese, the rules can be
fairly complicated (of course, reflecting that the sorting sequence for
those languages is complicated).</p>
338
339
340            <ol>
341
						<li><b>Only tailor expected data. </b>We focus on the required collation sequence for a given language with normal data. So, for example, we don't include
						full-width characters in a European collation sequence, as in

                <ul>

							<li>... CSCS &lt;&lt;&lt; ＣＳＣＳ ...</li>

							<li>... CSCS &lt;&lt;&lt; \uFF23\uFF33\uFF23\uFF33 ... (equivalently)</li>
350
351
352                </ul>
353
354						</li>
355
356						<li><b>Tailor trailing contractions. </b>If a sequence of characters is treated as a unit for collation, it should be entered as a contraction.
357
358                <p>&amp; c &lt; ch</p>
359
360
                <p>One might think that a sequence like "dz" doesn't
require that, since it would always come after "d" followed by any
other letter; it is a "trailing contraction". But in unusual cases,
that wouldn't be true: if "dz" is a unit sorted as if it were a
distinct letter after "d", one should get the ordering "dα" &lt; "dz". This will only happen if "dz" is a contraction, such as</p>
366
367
368                <p><font size="3">&amp; d &lt; dz</font></p>
369
370						</li>
371
						<li><b>Watch out for Expansions.</b> If you have a rule like &amp;cs &lt; d, and "cs" has not occurred in a previous rule as a contraction, then
						this is automatically considered to be the same as &amp;c &lt; d / s; that is, the d <i>expands</i> as if it were a "cs" (actually, primary greater
						than a "cs", since we wrote "&lt;"). This expansion takes effect until the next primary difference.

                <p>So suppose that "ccs" is to behave as if it were
"cscs", and take case differences into account. You might try to do
this with the rules on the left:</p>
379
380
381                <table id="table3" border="1" cellpadding="4" cellspacing="0">
382
383							<tbody>
384                    <tr>
385
386								<th align="left" width="50%">Rules (Wrong)</th>
387
388								<th align="left" width="50%">Actual Effect</th>
389
390							</tr>
391
392							<tr>
393
								<td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
								&amp; cscs &lt;&lt;&lt; ccs<br>
								&lt;&lt;&lt; Cscs &lt;&lt;&lt; Ccs<br>
								&lt;&lt;&lt; CSCS &lt;&lt;&lt; CCS</td>

								<td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
								&amp; cs &lt;&lt;&lt; ccs / cs<br>
								&lt;&lt;&lt; Cscs / cs &lt;&lt;&lt; Ccs / cs<br>
								&lt;&lt;&lt; CSCS / cs &lt;&lt;&lt; CCS / cs</td>
409
410							</tr>
411
412
413                  </tbody>
414                </table>
415
416
                <p>But since the <u>CSCS</u> has not been made a contraction in previous rules, this produces an automatic expansion, one that continues
through the entire sequence of non-primary differences, as shown on the right. This is <i>not</i> what is wanted: each item acts like it
expands compared to the previous item. So CCS, for example, will act like it expands to CSCScs!</p>
420
421
422                <p>What you actually want is the following:</p>
423
424
425                <table id="table4" border="1" cellpadding="4" cellspacing="0">
426
427							<tbody>
428                    <tr>
429
430								<th align="left" width="50%">Rules (Right)</th>
431
432								<th align="left" width="50%">Actual Effect</th>
433
434							</tr>
435
436							<tr>
437
								<td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
								&amp; cscs &lt;&lt;&lt; ccs<br>
								&amp; Cscs &lt;&lt;&lt; Ccs<br>
								&amp; CSCS &lt;&lt;&lt; CCS</td>

								<td width="50%">&amp; C &lt; cs &lt;&lt;&lt; Cs &lt;&lt;&lt; CS<br>
								&amp; cs &lt;&lt;&lt; ccs / cs<br>
								&amp; Cs &lt;&lt;&lt; Ccs / cs<br>
								&amp; CS &lt;&lt;&lt; CCS / CS</td>
453
454							</tr>
455
456
457                  </tbody>
458                </table>
459
460
                <p>In short, when you have expansions, it is always
safer and clearer to express them with separate resets. There are only
a few exceptions to this, notably when CJK characters are interleaved
with Hangul Syllables.</p>
465
466						</li>
467
468						<li><b>Don't tailor what you don't have to. </b>Example: Maltese was sorting character sequences <i>before</i> a base character using the
469						following style:
470
                <p>&amp; B<br>
						&lt; ċ<br>
						&lt;&lt;&lt; Ċ<br>
						&lt; c<br>
						&lt;&lt;&lt; C</p>
480
481
482                <p>This works, but is sub-optimal for two reasons. </p>
483
484
485                <ol>
486
							<li>It tailors c/C when it doesn't need to; any extra tailoring generally makes for longer sort keys.</li>

							<li>By tailoring c/C, it moves the other characters that sort after b/B so that they sort after c/C instead. See
							<a href="http://www.unicode.org/charts/collation/">http://www.unicode.org/charts/collation/</a> for examples.</li>
491
492
493                </ol>
494
495
496                <p>The correct rules should be:</p>
497
498
499                <p>&amp; [before 1] c &lt; ċ &lt;&lt;&lt; Ċ</p>
500
501
                <p>This finds the highest primary (that's what the 1 is
for) character less than c, and uses that as the reset point. For
Maltese, the same technique needs to be used for ġ and ż.</p>
505
506						</li>
507
508						<li>Contractions can be blocked with CGJ, as described in the Unicode Standard and in the
509						<a href="http://www.unicode.org/faq/char_combmark.html">Characters and Combining Marks FAQ</a>.</li>
510
						<li>Normally all combinations of case need to be supplied for contractions. That is, if <i>ch</i>
is a contraction, then you would have the rules ... ch &lt;&lt;&lt; cH &lt;&lt;&lt; Ch
&lt;&lt;&lt; CH. The reason for this is so that all case variants sort at the
same primary level: thus lowercasing a string will not affect its
primary order. Cases such as <i>McHugh</i> are handled like other instances where contractions should be blocked.</li>
516
517
518            </ol>
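            <p>The trailing-contraction pitfall in the list above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ICU behavior: <i>ALPHABET</i> and <i>make_key</i> are hypothetical names, only primary weights are modeled, and α is simply placed after z to stand for "any character that sorts after the Latin letters".</p>

<pre>
```python
# Toy primary-only ordering: a..z followed by Greek alpha. The weights are
# invented for this sketch and are not real UCA weights.
ALPHABET = "abcdefghijklmnopqrstuvwxyz" + "α"


def make_key(contractions):
    """Build a sort-key function in which each two-letter contraction
    sorts as its own letter, directly after its first letter."""
    order = []
    for letter in ALPHABET:
        order.append(letter)
        order.extend(sorted(c for c in contractions if c[0] == letter))
    weight = {unit: i for i, unit in enumerate(order)}

    def key(text):
        weights = []
        i = 0
        while i != len(text):
            for size in (2, 1):  # try a contraction before a single letter
                unit = text[i:i + size]
                if unit in weight:
                    weights.append(weight[unit])
                    i += len(unit)
                    break
            else:
                raise ValueError("unsupported character: " + text[i])
        return tuple(weights)

    return key


plain = make_key(set())      # "dz" is just "d" followed by "z"
tailored = make_key({"dz"})  # "dz" is one unit sorted after "d"

print(sorted(["dα", "dz"], key=plain))     # prints ['dz', 'dα']
print(sorted(["dα", "dz"], key=tailored))  # prints ['dα', 'dz']
```
</pre>

            <p>Without the contraction, "dz" and "dα" differ only in their second letter, so "dz" comes first. With it, "dz" carries a single weight higher than "d", so every string that starts with a plain "d" sorts before it, whatever follows.</p>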
519
520
521            <h2><a name="Possible_Comparison_Sources">Possible Comparison Sources</a></h2>
522
523
            <p>Sources and references may be standards, dictionaries, journal style guides (such as <i>The Economist Style Guide for English</i>),
or other available sources that provide guidance as to common
practice. Online sources are preferred where available, since they can
be more easily checked.</p>
528
529
            <p>The goal is to follow common, customary practice. For
example, language or territory display names should use the most
recognizable name in common usage. This is generally not the official
name. For example, one would use "Switzerland" not "Swiss
Confederation".</p>
535
536
            <p>Here are some possible resources for comparison of locale data. <i>This is <b>not</b> an endorsement of the sources, merely a collection of
possibly useful links. </i>To suggest additions,
file a <a href="filing_bug_reports.html">Bug Report</a>.</p>
540
541
542            <h3>Territory names; Language names; Gregorian/non-Gregorian month names; Day names; Exemplar characters, and Collation</h3>
543
544
545            <ul>
546
547						<li><a href="http://www.geonames.de/">http://www.geonames.de/</a></li>
548
549
550            </ul>
551
552
553            <h3><i>The Economist Style Guide</i> (unfortunately only hard copy): Currencies, Display Names, Formatting for English:</h3>
554
555
556            <ul>
557
558						<li><a href="http://www.amazon.com/exec/obidos/tg/detail/-/186197535X">http://www.amazon.com/exec/obidos/tg/detail/-/186197535X</a> </li>
559
560
561            </ul>
562
563
564            <h3><a name="Exemplar_Characters">Exemplar Characters</a></h3>
565
566
567            <ul>
568
569						<li><a href="http://www.eki.ee/letter/">http://www.eki.ee/letter/</a> </li>
570
571						<li><a href="http://europa.eu.int/comm/eurostat/research/index.htm?http://europa.eu.int/en/comm/eurostat/research/isi/special/&amp;1">http://europa.eu.int/comm/eurostat/research/index.htm</a></li>
572
573						<li><a href="http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin"><span>http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin</span></a><span>
574						</span></li>
575
576						<li><a href="http://www.omniglot.com/writing/alphabets.htm">
577						http://www.omniglot.com/writing/alphabets.htm</a> </li>
578
579						<li><a href="http://www.geonames.de/">http://www.geonames.de/</a></li>
580
581
582            </ul>
583
584
585            <h3>Territory Names</h3>
586
587
588            <ul>
589
590						<li><a href="http://www.world-gazetteer.com/pronun.htm">http://www.world-gazetteer.com/pronun.htm</a></li>
591
592						<li><a href="http://www.eki.ee/knn/lingid2.htm#WRLD">http://www.eki.ee/knn/lingid2.htm#WRLD</a> </li>
593
594						<li><a href="http://www.p.lodz.pl/I35/personal/jw37/EUROPE/europe.html">http://www.p.lodz.pl/I35/personal/jw37/EUROPE/europe.html</a>
595						</li>
596
597
598            </ul>
599
600
601            <h3>Currency names; Territory names (Replace es with desired language code) </h3>
602
603
604            <ul>
605
						<li><a href="http://publications.eu.int/code/es/es-5000500.htm">http://publications.eu.int/code/es/es-5000500.htm</a><br>
						<a href="http://publications.eu.int/code/es/es-5000700.htm">http://publications.eu.int/code/es/es-5000700.htm</a><br>
						<a href="http://publications.eu.int/">http://publications.eu.int/</a></li>
611
612
613            </ul>
614
615
            <h3>Territory &amp; Region names (Use the links at the top to switch languages)</h3>
617
618
619            <ul>
620
621						<li><a href="http://www.worldlanguage.com/Arabic/Countries/">http://www.worldlanguage.com/Arabic/Countries/</a> </li>
622
623
624            </ul>
625
626
627            <h3>Exemplar/collation information</h3>
628
629
630            <ul>
631
						<li><a href="http://www.omniglot.com/writing/">http://www.omniglot.com/writing/</a><br>
						<a href="http://www.alphabets-world.com/">http://www.alphabets-world.com/</a><br>
						<a href="http://developer.mimer.com/collations/charts/">http://developer.mimer.com/collations/charts/</a></li>
637
638
639            </ul>
640
641
642            <h3>Simple Translations</h3>
643
644
645            <ul>
646
647						<li><a href="http://world.altavista.com/">http://world.altavista.com/</a></li>
648
649						<li><a href="http://www.google.com/language_tools">http://www.google.com/language_tools</a> </li>
650
651
652            </ul>
653
654
655            <h3>List of date/time formatting for Windows</h3>
656
657
658            <ul>
659
660						<li><a href="http://www.microsoft.com/globaldev/nlsweb/">http://www.microsoft.com/globaldev/nlsweb/</a> </li>
661
662
663            </ul>
664
665
666            <h3>Exemplar Characters; Transliteration</h3>
667
668
669            <ul>
670
671						<li><a href="http://www.eki.ee/wgrs/">UNGEGN: Working Group on Romanization Systems</a> </li>
672
673						<li><a href="http://ee.www.ee/transliteration/">Transliteration of Non-Roman Alphabets and Scripts (Søren Binks)</a> </li>
674
675						<li><a href="http://www.archivists.org/catalog/stds99/chapter8.html">Standards for Archival Description: Romanization</a> </li>
676
						<li><a href="http://ee.www.ee/transliteration/pdf/Hindi-Marathi-Nepali.pdf">ISO 15919 (Hindi)</a> </li>

						<li><a href="http://ee.www.ee/transliteration/pdf/Gujarati.pdf">ISO 15919 (Gujarati)</a></li>

						<li><a href="http://ee.www.ee/transliteration/pdf/Kannada.pdf">ISO 15919 (Kannada)</a></li>
682
683						<li><a href="http://www.cdacindia.com/html/gist/down/iscii_d.asp">ISCII-91</a> </li>
684
685
686            </ul>
687
688
689            <h3>Geographical Names</h3>
690
691
692            <ul>
693
694						<li><a href="http://unstats.un.org/unsd/geoinfo/">http://unstats.un.org/unsd/geoinfo/</a> </li>
695
696
697            </ul>
698
699
700            <h3><span>Currencies</span></h3>
701
702
703            <ul>
704
705						<li><a href="http://www.globalfindata.com/gh/index.html"><span>http://www.globalfindata.com/gh/index.html</span></a><span> </span></li>
706
707
708            </ul>
709
710
711            <h3>General</h3>
712
713
714            <ul>
715
716						<li><a href="http://www.cia.gov/cia/publications/factbook/">http://www.cia.gov/cia/publications/factbook/</a> </li>
717
						<li><a href="http://www.microsoft.com/mspress/books/5717.asp">http://www.microsoft.com/mspress/books/5717.asp</a>: a very complete set of information,
						such as postal information, currency symbols, date/time formats, calendars, and so on.</li>
720
721
722            </ul>
723
724
725            <p>&nbsp;</p>
726
727
728            <blockquote>
729					</blockquote>
730
731				</div>
732
733				</td>
734
735			</tr>
736
737			<tr>
738
739				<td class="contents" valign="top">&nbsp;</td>
740
741			</tr>
742
743
744        </tbody>
745      </table>
746
747
748      <hr width="50%">
749
750      <div align="center">
751
752      <center>
753
754      <table border="0" cellpadding="0" cellspacing="0">
755
756				<tbody>
757          <tr>
758
759					<td><a href="http://www.unicode.org/copyright.html">
760					<img src="http://www.unicode.org/img/hb_notice.gif" alt="Access to Copyright and terms of use" border="0" height="50" width="216"></a></td>
761
762				</tr>
763
764
765        </tbody>
766      </table>
767
768
769      <script language="Javascript" type="text/javascript" src="http://www.unicode.org/webscripts/lastModified.js">
770      </script>
771			</center>
772      </div>
773
774		</td>
775
776	</tr>
777
778  </tbody>
779</table>
780
781
782</body>
783</html>
784