1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2"http://www.w3.org/TR/html4/loose.dtd">
3<html>
4
5<head>
6<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
7<meta http-equiv="Content-Language" content="en-us">
8<link rel="stylesheet" href="http://www.unicode.org/reports/reports.css"
9	type="text/css">
10<title>UTS #35: Unicode LDML: Collation</title>
11<style type="text/css">
12<!--
13.dtd {
14	font-family: monospace;
15	font-size: 90%;
16	background-color: #CCCCFF;
17	border-style: dotted;
18	border-width: 1px;
19}
20
21.xmlExample {
22	font-family: monospace;
23	font-size: 80%
24}
25
26.blockedInherited {
27	font-style: italic;
28	font-weight: bold;
29	border-style: dashed;
30	border-width: 1px;
31	background-color: #FF0000
32}
33
34.inherited {
35	font-weight: bold;
36	border-style: dashed;
37	border-width: 1px;
38	background-color: #00FF00
39}
40
41.element {
42	font-weight: bold;
43	color: red;
44}
45
46.attribute {
47	font-weight: bold;
48	color: maroon;
49}
50
51.attributeValue {
52	font-weight: bold;
53	color: blue;
54}
55
56li, p {
57	margin-top: 0.5em;
58	margin-bottom: 0.5em
59}
60
61h2, h3, h4, table {
62	margin-top: 1.5em;
63	margin-bottom: 0.5em;
64}
65-->
66</style>
67</head>
68
69<body>
70
71	<table class="header" width="100%">
72		<tr>
73			<td class="icon"><a href="http://unicode.org"> <img
74					alt="[Unicode]" src="http://unicode.org/webscripts/logo60s2.gif"
75					width="34" height="33"
76					style="vertical-align: middle; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px; border-top-width: 0px;"></a>&nbsp;
77				<a class="bar" href="http://www.unicode.org/reports/">Technical
78					Reports</a></td>
79		</tr>
80		<tr>
81			<td class="gray">&nbsp;</td>
82		</tr>
83	</table>
84	<div class="body">
85		<h2 style="text-align: center">
86			Unicode Technical
87			Standard #35
88		</h2>
89		<h1>
90			Unicode Locale Data Markup Language (LDML)<br>Part 5: Collation
91		</h1>
92
93		<!-- At least the first row of this header table should be identical across the parts of this UTS. -->
94		<table border="1" cellpadding="2" cellspacing="0" class="wide">
95			<tr>
96				<td>Version</td>
97				<td>34</td>
98			</tr>
99			<tr>
100				<td>Editors</td>
101				<td><a
102					href="https://plus.google.com/117587389715494866571?rel=author">
103						Markus Scherer</a> (<a href="mailto:markus.icu@gmail.com">markus.icu@gmail.com</a>)
104					and <a href="tr35.html#Acknowledgments">other CLDR committee
105						members</a></td>
106			</tr>
107		</table>
108
109		<p>
110			For the full header, summary, and status, see <a href="tr35.html">
111				Part 1: Core</a>
112		</p>
113
114		<h3>
115			<i>Summary</i>
116		</h3>
117		<p>
118			This document describes parts of an XML format (<i>vocabulary</i>)
119			for the exchange of structured locale data. This format is used in
120			the <a href="http://cldr.unicode.org/">Unicode Common Locale Data
121				Repository</a>.
122		</p>
123
124		<p>
125			This is a partial document, describing only those parts of the LDML
126			that are relevant for collation (sorting, searching &amp; grouping).
127			For the other parts of the LDML see the <a href="tr35.html">main
128				LDML document</a> and the links above.
129		</p>
130
131		<h3>
132			<i>Status</i>
133		</h3>
134
135		<!-- NOT YET APPROVED
136		<p>
137				<i class="changed">This is a<b><font color="#ff3333">
138				draft </font></b>document which may be updated, replaced, or superseded by
139				other documents at any time. Publication does not imply endorsement
140				by the Unicode Consortium. This is not a stable document; it is
141				inappropriate to cite this document as other than a work in
142				progress.
143			</i>
144		</p>
145		 END NOT YET APPROVED -->
146		<!-- APPROVED -->
147		<p>
148			<i>This document has been reviewed by Unicode members and other
149				interested parties, and has been approved for publication by the
150				Unicode Consortium. This is a stable document and may be used as
151				reference material or cited as a normative reference by other
152				specifications.</i>
153		</p>
154		<!-- END APPROVED -->
155
156
157		<blockquote>
158			<p>
159				<i><b>A Unicode Technical Standard (UTS)</b> is an independent
160					specification. Conformance to the Unicode Standard does not imply
161					conformance to any UTS.</i>
162			</p>
163		</blockquote>
164		<p>
165			<i>Please submit corrigenda and other comments with the CLDR bug
166				reporting form [<a href="tr35.html#Bugs">Bugs</a>]. Related
167				information that is useful in understanding this document is found
168				in the <a href="tr35.html#References">References</a>. For the latest
169				version of the Unicode Standard see [<a href="tr35.html#Unicode">Unicode</a>].
170				For a list of current Unicode Technical Reports see [<a
171				href="tr35.html#Reports">Reports</a>]. For more information about
172				versions of the Unicode Standard, see [<a href="tr35.html#Versions">Versions</a>].
173			</i>
174		</p>
175		<h2>
176			<a name="Parts" href="#Parts">Parts</a>
177		</h2>
178
179		<!-- This section of Parts should be identical in all of the parts of this UTS. -->
180		<p>The LDML specification is divided into the following parts:</p>
181		<ul class="toc">
182			<li>Part 1: <a href="tr35.html#Contents">Core</a> (languages,
183				locales, basic structure)
184			</li>
185			<li>Part 2: <a href="tr35-general.html#Contents">General</a>
186				(display names &amp; transforms, etc.)
187			</li>
188			<li>Part 3: <a href="tr35-numbers.html#Contents">Numbers</a>
189				(number &amp; currency formatting)
190			</li>
191			<li>Part 4: <a href="tr35-dates.html#Contents">Dates</a> (date,
192				time, time zone formatting)
193			</li>
194			<li>Part 5: <a href="tr35-collation.html#Contents">Collation</a>
195				(sorting, searching, grouping)
196			</li>
197			<li>Part 6: <a href="tr35-info.html#Contents">Supplemental</a>
198				(supplemental data)
199			</li>
200			<li>Part 7: <a href="tr35-keyboards.html#Contents">Keyboards</a>
201				(keyboard mappings)
202			</li>
203		</ul>
204
205		<h2>
206			<a name="Contents" href="#Contents">Contents of Part 5, Collation</a>
207		</h2>
208		<!-- START Generated TOC: CheckHtmlFiles -->
209		<ul class="toc">
210			<li>1 <a href="#CLDR_Collation">CLDR Collation</a>
211				<ul class="toc">
212					<li>1.1 <a href="#CLDR_Collation_Algorithm">CLDR Collation
213							Algorithm</a>
214						<ul class="toc">
215							<li>1.1.1 <a href="#Algorithm_FFFE">U+FFFE</a></li>
216							<li>1.1.2 <a href="#Context_Sensitive_Mappings">Context-Sensitive
217									Mappings</a></li>
218							<li>1.1.3 <a href="#Algorithm_Case">Case Handling</a></li>
219							<li>1.1.4 <a href="#Algorithm_Reordering_Groups">Reordering
220									Groups</a></li>
221							<li>1.1.5 <a href="#Combining_Rules">Combining Rules</a></li>
222						</ul>
223					</li>
224				</ul>
225			</li>
226			<li>2 <a href="#Root_Collation">Root Collation</a>
227				<ul class="toc">
228					<li>2.1 <a href="#grouping_classes_of_characters">Grouping
229							classes of characters</a></li>
230					<li>2.2 <a href="#non_variable_symbols">Non-variable
231							symbols</a></li>
232					<li>2.3 <a href="#tibetan_contractions">Additional
233							contractions for Tibetan</a></li>
234					<li>2.4 <a href="#tailored_noncharacter_weights">Tailored
235							noncharacter weights</a></li>
236					<li>2.5 <a href="#Root_Data_Files">Root Collation Data
237							Files</a></li>
238					<li>2.6 <a href="#Root_Data_File_Formats">Root Collation
239							Data File Formats</a>
240						<ul class="toc">
241							<li>2.6.1 <a href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></li>
242							<li>2.6.2 <a href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></li>
243							<li>2.6.3 <a href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a></li>
244						</ul>
245					</li>
246				</ul>
247			</li>
248			<li>3 <a href="#Collation_Tailorings">Collation Tailorings</a>
249				<ul class="toc">
250					<li>3.1 <a href="#Collation_Types">Collation Types</a>
251						<ul class="toc">
252							<li>3.1.1 <a href="#Collation_Type_Fallback">Collation
253									Type Fallback</a>
254								<ul class="toc">
255									<li>Table: <a
256										href="#Sample_requested_and_actual_collation_locales_and_types">Sample
257											requested and actual collation locales and types</a></li>
258								</ul>
259							</li>
260						</ul>
261					</li>
262					<li>3.2 <a href="#Collation_Version">Version</a></li>
263					<li>3.3 <a href="#Collation_Element">Collation Element</a></li>
264					<li>3.4 <a href="#Setting_Options">Setting Options</a>
265						<ul class="toc">
266							<li>Table: <a href="#Collation_Settings">Collation
267									Settings</a></li>
268							<li>3.4.1 <a href="#Common_Settings">Common settings
269									combinations</a></li>
270							<li>3.4.2 <a href="#Normalization_Setting">Notes on the
271									normalization setting</a></li>
272							<li>3.4.3 <a href="#Variable_Top_Settings">Notes on
273									variable top settings</a></li>
274						</ul>
275					</li>
276					<li>3.5 <a href="#Rules">Collation Rule Syntax</a></li>
277					<li>3.6 <a href="#Orderings">Orderings</a>
278						<ul class="toc">
279							<li>Table: <a href="#Specifying_Collation_Ordering">Specifying
280									Collation Ordering</a></li>
281							<li>Table: <a href="#Abbreviating_Ordering_Specifications">Abbreviating
282									Ordering Specifications</a></li>
283						</ul>
284					</li>
285					<li>3.7 <a href="#Contractions">Contractions</a>
286						<ul class="toc">
287							<li>Table: <a href="#Specifying_Contractions">Specifying
288									Contractions</a></li>
289						</ul>
290					</li>
291					<li>3.8 <a href="#Expansions">Expansions</a></li>
292					<li>3.9 <a href="#Context_Before">Context Before</a>
293						<ul class="toc">
294							<li>Table: <a href="#Specifying_Previous_Context">Specifying
295									Previous Context</a></li>
296						</ul>
297					</li>
298					<li>3.10 <a href="#Placing_Characters_Before_Others">Placing
299							Characters Before Others</a></li>
300					<li>3.11 <a href="#Logical_Reset_Positions">Logical Reset
301							Positions</a>
302						<ul class="toc">
303							<li>Table: <a href="#Specifying_Logical_Positions">Specifying
304									Logical Positions</a></li>
305						</ul>
306					</li>
307					<li>3.12 <a href="#Special_Purpose_Commands">Special-Purpose
308							Commands</a>
309						<ul class="toc">
310							<li>Table: <a href="#Special_Purpose_Elements">Special-Purpose
311									Elements</a></li>
312						</ul>
313					</li>
314					<li>3.13 <a href="#Script_Reordering">Collation Reordering</a>
315						<ul class="toc">
316							<li>3.13.1 <a href="#Interpretation_reordering">Interpretation
317									of a reordering list</a></li>
318							<li>3.13.2 <a href="#Reordering_Groups_allkeys">Reordering
319									Groups for allkeys.txt</a></li>
320						</ul>
321					</li>
322					<li>3.14 <a href="#Case_Parameters">Case Parameters</a>
323						<ul class="toc">
324							<li>3.14.1 <a href="#Case_Untailored">Untailored
325									Characters</a></li>
326							<li>3.14.2 <a href="#Case_Weights">Compute Modified
327									Collation Elements</a></li>
328							<li>3.14.3 <a href="#Case_Tailored">Tailored Strings</a></li>
329						</ul>
330					</li>
331					<li>3.15 <a href="#Visibility">Visibility</a></li>
332					<li>3.16 <a href="#Collation_Indexes">Collation Indexes</a>
333						<ul class="toc">
334							<li>3.16.1 <a href="#Index_Characters">Index Characters</a></li>
335							<li>3.16.2 <a href="#CJK_Index_Markers">CJK Index
336									Markers</a></li>
337						</ul>
338					</li>
339				</ul>
340			</li>
341		</ul>
342		<!-- END Generated TOC: CheckHtmlFiles -->
343
344		<h2>
345			1 <a name="CLDR_Collation" href="#CLDR_Collation">CLDR Collation</a>
346		</h2>
347		<p>Collation is the general term for the process and function of
348			determining the sorting order of strings of characters, for example
349			for lists of strings presented to users, or in databases for sorting
350			and selecting records.</p>
351
		<p>Collation varies by language, by application (some languages
			use special phonebook sorting), and by other criteria (for example,
			phonetic vs. visual).</p>
355
356		<p>
357			CLDR provides collation data for many languages and styles. The data
358			supports not only sorting but also language-sensitive searching and
359			grouping under index headers. All CLDR collations are based on the [<a
360				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] default
361			order, with common modifications applied in the CLDR root collation,
362			and further tailored for language and style as needed.
363		</p>
364
365		<h3>
366			1.1 <a name="CLDR_Collation_Algorithm"
367				href="#CLDR_Collation_Algorithm">CLDR Collation Algorithm</a>
368		</h3>
369
370		<p>
371			The CLDR collation algorithm is an extension of the <a
372				href="http://www.unicode.org/reports/tr10/#Main_Algorithm">Unicode
373				Collation Algorithm</a>.
374		</p>
375
376		<h4>
377			1.1.1 <a name="Algorithm_FFFE" href="#Algorithm_FFFE">U+FFFE</a>
378		</h4>
379
380		<p>
381			U+FFFE maps to a CE with a minimal, unique primary weight. Its
382			primary weight is not "variable": U+FFFE must not become ignorable in
383			alternate handling. On the identical level, a minimal, unique
384			“weight” must be emitted for U+FFFE as well. This allows for <a
385				href="http://www.unicode.org/reports/tr10/#Merging_Sort_Keys">Merging
386				Sort Keys</a> within code point space.
387		</p>
388		<p>
389			For example, when sorting names in a database, a sortable string can
390			be formed with <em>last_name</em> + '\uFFFE' + <em>first_name</em>.
391			These strings would sort properly, without ever comparing the last
392			part of a last name with the first part of another first name.
393		</p>
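		<p>As a rough illustration of the name example above, the following
			Python sketch sorts hypothetical name records three ways. It does not
			use real CLDR collation elements: the toy weights (U+FFFE gets
			weight 0, every other character gets its code point plus one) and all
			function and variable names are assumptions made for this
			illustration only.</p>

```python
SEP = "\uFFFE"

def toy_primary_key(text):
    """Toy weights (not CLDR data): U+FFFE sorts below every other character."""
    return [0 if ch == SEP else ord(ch) + 1 for ch in text]

# Hypothetical (last_name, first_name) records.
names = [("smith", "zoe"), ("smithe", "al"), ("smi", "the"), ("smith", "anna")]

# One combined string per record, joined with U+FFFE.
merged = sorted(names, key=lambda n: toy_primary_key(n[0] + SEP + n[1]))

# Reference order: compare last names first, then first names.
fieldwise = sorted(names, key=lambda n: (toy_primary_key(n[0]), toy_primary_key(n[1])))

# Plain concatenation lets the end of a last name run into the first name.
naive = sorted(names, key=lambda n: toy_primary_key(n[0] + n[1]))
```

		<p>In this sketch the U+FFFE-joined strings sort in the same order as
			the field-by-field comparison, while plain concatenation does not,
			because "smi" + "the" and "smithe" + "al" start to overlap.</p>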
394
395		<p>
396			For backwards secondary level sorting, text <i>segments</i> separated
397			by U+FFFE are processed in forward segment order, and <i>within</i>
398			each segment the secondary weights are compared backwards. This is so
399			that such combined strings are processed consistently with merging
400			their sort keys (for example, by concatenating them level by level
401			with a low separator).
402		</p>
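		<p>The segment-wise handling of backwards secondary weights described
			above can be sketched as follows. This is a minimal Python
			illustration on plain integer lists, not an excerpt from any
			implementation; the function name and the use of 0 as the low
			separator weight are assumptions.</p>

```python
def backwards_secondary_order(secondary_weights, separator):
    """Reverse the weights inside each segment; keep segments in forward order."""
    result, segment = [], []
    for weight in secondary_weights:
        if weight == separator:
            result.extend(reversed(segment))
            result.append(separator)
            segment = []
        else:
            segment.append(weight)
    result.extend(reversed(segment))
    return result

SEPARATOR = 0  # hypothetical low secondary weight standing in for U+FFFE

# Secondary weights of a combined string: two segments separated by U+FFFE.
combined = backwards_secondary_order([5, 6, 7, SEPARATOR, 8, 9], SEPARATOR)

# Processing each string separately and merging with the same low separator.
merged = (backwards_secondary_order([5, 6, 7], SEPARATOR)
          + [SEPARATOR]
          + backwards_secondary_order([8, 9], SEPARATOR))
```

		<p>Both paths yield the same sequence, which is the consistency
			between combined strings and merged sort keys that the paragraph
			above asks for.</p>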
403
404		<p class="note">
405			Note: With unique, low weights on <i>all</i> levels it is possible to
406			achieve
407			<code>sortkey(str1 + "\uFFFE" + str2) ==
408				mergeSortkeys(sortkey(str1), sortkey(str2))</code>
409			. When that is not necessary, then code can be a little simpler (no
410			special handling for U+FFFE except for backwards-secondary), sort
411			keys can be a little shorter (when using compressible common
412			non-primary weights for U+FFFE), and another low weight can be used
413			in tailorings.
414		</p>
415
416		<h4>
417			1.1.2 <a name="Context_Sensitive_Mappings"
418				href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a>
419		</h4>
420
421		<p>Contraction matching, as in the UCA, starts from the first
422			character of the contraction string. It slows down processing of that
423			first character even when none of its contractions matches. In some
			cases, it is preferable to change such contractions to mappings with
425			a prefix (context before a character), so that complex processing is
426			done only when the less-frequently occurring trailing character is
427			encountered.</p>
428
429		<p>For example, the DUCET contains contractions for several
430			variants of L· (L followed by middle dot). Collating ASCII text is
431			slowed down by contraction matching starting with L/l. In the CLDR
432			root collation, these contractions are replaced by prefix mappings
433			(L|·) which are triggered only when the middle dot is encountered.
434			CLDR also uses prefix rules in the Japanese tailoring, for processing
435			of Hiragana/Katakana length and iteration marks.</p>
436
437		<p>The mapping is conditional on the prefix match but does not
438			change the mappings for the preceding text. As a result, a
439			contraction mapping for "px" can be replaced by a prefix rule "p|x"
440			only if px maps to the collation elements for p followed by the
441			collation elements for "x if after p". In the DUCET, L· maps to CE(L)
442			followed by a special secondary CE (which differs from CE(·) when ·
443			is not preceded by L). In the CLDR root collation, L has no
444			context-sensitive mappings, but · maps to that special secondary CE
445			if preceded by L.</p>
446
447		<p>A prefix mapping for p|x behaves mostly like the contraction
448			px, except when there is a contraction that overlaps with the prefix,
449			for example one for "op". A contraction matches only new text (and
450			consumes it), while a prefix matches only already-consumed text.</p>
451		<ul>
452			<li>With mappings for "op" and "px", only the first contraction
453				matches in text "opx". (It consumes the "op" characters, and there
454				is no context-sensitive mapping for x.)</li>
455			<li>With mappings for "op" and "p|x", both the contraction and
456				the prefix rule match in text "opx". (The prefix always matches
457				already-consumed characters, regardless of whether they mapped as
458				part of contractions.)</li>
459		</ul>
460
461		<p class="note">
462			Note: Matching of discontiguous contractions should be implemented
463			without rewriting the text (unlike in the [<a
464				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] algorithm
465			specification), so that prefix matching is predictable. (It should
466			also help with contraction matching performance.) An implementation
467			that does rewrite the text, as in the UCA, will get different results
468			for some (unusual) combinations of contractions, prefix rules, and
469			input text.
470		</p>
471
472		<p>Prefix matching uses a simple longest-match algorithm (op|c
473			wins over p|c). It is recommended that prefix rules be limited to
474			mappings where both the prefix string and the mapped string begin
475			with an NFC boundary (that is, with a normalization starter that does
476			not combine backwards). (In op|ch both o and c should be starters
477			(ccc=0) and NFC_QC=Yes.) Otherwise, prefix matching would be affected
478			by canonical reordering and discontiguous matching, like
479			contractions. Prefix matching is thus always contiguous.</p>
480
481		<p>A character can have mappings with both prefixes (context
482			before) and contraction suffixes. Prefixes are matched first. This is
483			to keep them reasonably implementable: When there is a mapping with
484			both a prefix and a contraction suffix (like in Japanese: ぐ|ゞ), then
485			the matching needs to go in both directions. The contraction might
486			involve discontiguous matching, which needs complex text iteration
487			and handling of skipped combining marks, and will consume the
488			matching suffix. Prefix matching should be first because, regardless
489			of whether there is a match, the implementation will always return to
490			the original text index (right after the prefix) from where it will
491			start to look at all of the contractions for that prefix.</p>
492
493		<p>If there is a match for a prefix but no match for any of the
494			suffixes for that prefix, then fall back to mappings with the
495			next-longest matching prefix, and so on, ultimately to mappings with
496			no prefix. (Otherwise mappings with longer prefixes would “hide”
497			mappings with shorter prefixes.)</p>
498
499		<p>Consider the following mappings.</p>
500		<ol>
501			<li>p → CE(p)</li>
502			<li>h → CE(h)</li>
503			<li>c → CE(c)</li>
504			<li>ch → CE(d)</li>
505			<li>p|c → CE(u)</li>
506			<li>p|ci → CE(v)</li>
507			<li>p|ĉ → CE(w)</li>
508			<li>op|ck → CE(x)</li>
509		</ol>
510
511		<p>With these, text collates like this:</p>
512		<ul>
513			<li>pc → CE(p)CE(u)</li>
514			<li>pci → CE(p)CE(v)</li>
515			<li>pch → CE(p)CE(u)CE(h)</li>
516			<li>pĉ → CE(p)CE(w)</li>
517			<li>pĉ̣ → CE(p)CE(w)CE(U+0323) // discontiguous</li>
518			<li>opck → CE(o)CE(p)CE(x)</li>
519			<li>opch → CE(o)CE(p)CE(u)CE(h)</li>
520		</ul>
521
522		<p>
523			However, if the mapping p|c → CE(u) is missing, then text "pch" maps
524			to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and "pĉ̣" maps to
525			CE(p)CE(c)CE(U+0323)CE(U+0302) (because discontiguous contraction
526			matching extends <i>an existing match</i> by one non-starter at a
527			time).
528		</p>
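		<p>The prefix matching and fallback behavior described in this section
			can be approximated with the following Python sketch. It is a
			simplified illustration, not the algorithm of any particular
			implementation: matching is contiguous only, there is no
			normalization or discontiguous contraction handling (so the
			mappings involving ĉ are left out), CEs are represented by string
			labels, and the mapping for "o" as well as all function and variable
			names are assumptions.</p>

```python
def map_text(text, mappings):
    """Map text to CE labels using (prefix, string) -> label mappings.

    Sketch only: contiguous matching, no normalization.
    """
    labels = []
    i = 0
    while i < len(text):
        # Prefixes of mappings that start with the current character and match
        # the already-consumed text, longest first ("" means no prefix).
        prefixes = sorted(
            {p for (p, s) in mappings if s[0] == text[i] and text.endswith(p, 0, i)},
            key=len,
            reverse=True,
        )
        for prefix in prefixes:
            # Longest contraction under this prefix that matches new text.
            candidates = [s for (p, s) in mappings if p == prefix and text.startswith(s, i)]
            if candidates:
                best = max(candidates, key=len)
                labels.append(mappings[(prefix, best)])
                i += len(best)  # a contraction consumes the text it matches
                break
            # No suffix matched: fall back to the next-longest prefix.
        else:
            raise ValueError(f"no mapping for {text[i]!r}")
    return labels

# The mappings listed above, keyed by (prefix, string); "o" is assumed.
MAPPINGS = {
    ("", "o"): "o", ("", "p"): "p", ("", "h"): "h", ("", "c"): "c",
    ("", "ch"): "d", ("p", "c"): "u", ("p", "ci"): "v", ("op", "ck"): "x",
}
WITHOUT_PC = {k: v for k, v in MAPPINGS.items() if k != ("p", "c")}

# Overlap case: contraction "op" combined with contraction "px" or prefix rule "p|x".
BASE = {("", "o"): "o", ("", "p"): "p", ("", "x"): "x", ("", "op"): "op"}
WITH_CONTRACTION = {**BASE, ("", "px"): "px"}
WITH_PREFIX_RULE = {**BASE, ("p", "x"): "p|x"}
```

		<p>For example, <code>map_text("opch", MAPPINGS)</code> first tries the
			prefix "op", finds no matching suffix, and falls back to "p", while
			<code>map_text("opx", WITH_PREFIX_RULE)</code> matches both the
			contraction and the prefix rule.</p>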
529
530		<h4>
531			1.1.3 <a name="Algorithm_Case" href="#Algorithm_Case">Case
532				Handling</a>
533		</h4>
534		<p>
535			CLDR specifies how to sort lowercase or uppercase first, as a
536			stronger distinction than other tertiary variants (<strong>caseFirst</strong>)
537			or while completely ignoring all other tertiary distinctions (<strong>caseLevel</strong>).
			See <i>Section 3.4 <a href="#Setting_Options">Setting Options</a></i>
			and <i>Section 3.14 <a href="#Case_Parameters">Case
					Parameters</a></i>.
541		</p>
542
543		<h4>
544			1.1.4 <a name="Algorithm_Reordering_Groups"
545				href="#Algorithm_Reordering_Groups">Reordering Groups</a>
546		</h4>
547		<p>CLDR specifies how to do parametric reordering of groups of
548			scripts (e.g., “native script first”) as well as special groups
549			(e.g., “digits after letters”), and provides data for the effective
550			implementation of such reordering.</p>
551
552		<h4>
553			1.1.5 <a name="Combining_Rules"
554				href="#Combining_Rules">Combining Rules</a>
555		</h4>
556		<p>Rules from different sources can be combined, with the later rules overriding the earlier ones. The following is an example of how this can be useful.</p>
557		<p>There is a root collation for &quot;emoji&quot; in CLDR. So use of &quot;-u-co-emoji&quot; in a Unicode locale identifier will access that ordering. </p>
558		<p>Example, using ICU:</p>
559		<blockquote>
560		  <p>collator = Collator.getInstance(ULocale.forLanguageTag(&quot;en-u-co-emoji&quot;));  </p>
561	  </blockquote>
		<p>However, the emoji collation type supplants the language's customizations, so the above is equivalent to:</p>
563		<blockquote>
564		  <p>collator = Collator.getInstance(ULocale.forLanguageTag(&quot;und-u-co-emoji&quot;));  </p>
565	  </blockquote>
		<p>The same structure will not work for a language that does require customization, like Danish. That is, the following fails to preserve the Danish customizations.</p>
567		<blockquote>
568		  <p> collator = Collator.getInstance(ULocale.forLanguageTag(&quot;da-u-co-emoji&quot;));  </p>
569	  </blockquote>
570		<p>For that, a slightly more cumbersome method needs to be employed, which is to take the rules for Danish, and explicitly add the rules for emoji. </p>
571		<blockquote>
572		  <p>RuleBasedCollator collator = new RuleBasedCollator(<br>
573		    ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag(&quot;da&quot;))).getRules() +<br>
574		    ((RuleBasedCollator) Collator.getInstance(ULocale.forLanguageTag(&quot;und-u-co-emoji&quot;)))<br>
575	      .getRules());</p>
576	  </blockquote>
577		<p>The following table shows the differences. When emoji ordering is supported, the two faces will be adjacent. When Danish ordering is supported, the ü is after the y.</p>
578		<table class='simple'>
579		  <tbody>
580		    <tr>
581		      <td>code point order</td>
582		      <td>,</td>
583		      <td></td>
584		      <td></td>
585		      <td>Z</td>
586		      <td>a</td>
587		      <td>y</td>
588		      <td>ü</td>
589		      <td>☹️</td>
590		      <td>✈️️</td>
591		      <td>글</td>
		      <td>&#x1F600;</td>
593	        </tr>
594		    <tr>
595		      <td>en</td>
596		      <td>,</td>
597		      <td>☹️</td>
598		      <td>✈️️</td>
		      <td>&#x1F600;</td>
600		      <td>a</td>
601		      <td>ü</td>
602		      <td>y</td>
603		      <td>Z</td>
604		      <td>글</td>
605	        </tr>
606		    <tr>
607		      <td>en-u-co-emoji</td>
608		      <td>,</td>
		      <td>&#x1F600;</td>
610		      <td>☹️</td>
611		      <td>✈️️</td>
612		      <td>a</td>
613		      <td>ü</td>
614		      <td>y</td>
615		      <td>Z</td>
616		      <td>글</td>
617	        </tr>
618		    <tr>
619		      <td>da</td>
620		      <td>,</td>
621		      <td>☹️</td>
622		      <td>✈️️</td>
		      <td>&#x1F600;</td>
624		      <td>a</td>
625		      <td>y</td>
626		      <td><strong><u>ü</u></strong></td>
627		      <td>Z</td>
628		      <td>글</td>
629	        </tr>
630		    <tr>
631		      <td>da-u-co-emoji</td>
632		      <td>,</td>
		      <td>&#x1F600;</td>
634		      <td>☹️</td>
635		      <td>✈️️</td>
636		      <td>a</td>
637		      <td><strong><u>ü</u></strong></td>
638		      <td>y</td>
639		      <td>Z</td>
640		      <td>글</td>
641	        </tr>
642		    <tr>
643		      <td>combined rules</td>
644		      <td>,</td>
		      <td>&#x1F600;</td>
646		      <td>☹️</td>
647		      <td>✈️️</td>
648		      <td>a</td>
649		      <td>y</td>
650		      <td><strong><u>ü</u></strong></td>
651		      <td>Z</td>
652		      <td>글</td>
653	        </tr>
654	      </tbody>
655	  </table>
656
660
661		<h2>
662			2 <a name="Root_Collation" href="#Root_Collation">Root Collation</a>
663		</h2>
664		<p>
665			The CLDR root collation order is based on the <a
666				href="http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table">Default
667				Unicode Collation Element Table (DUCET)</a> defined in <em>UTS #10:
668				Unicode Collation Algorithm</em> [<a
669				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. It is
670			used by all other locales by default, or as the base for their
671			tailorings. (For a chart view of the UCA, see Collation Chart [<a
672				href="tr35.html#UCAChart">UCAChart</a>].)
673		</p>
674		<p>Starting with CLDR 1.9, CLDR uses modified tables for the root
675			collation order. The root locale ordering is tailored in the
676			following ways:</p>
677
678		<h3>
679			2.1 <a name="grouping_classes_of_characters"
680				href="#grouping_classes_of_characters">Grouping classes of
681				characters</a>
682		</h3>
683		<p>As of Version 6.1.0, the DUCET puts characters into the
684			following ordering:</p>
685		<ul>
686			<li>First &quot;common characters&quot;: whitespace,
687				punctuation, general symbols, some numbers, currency symbols, and
688				other numbers.</li>
689			<li>Then &quot;script characters&quot;: Latin, Greek, and the
690				rest of the scripts.</li>
691		</ul>
692		<p>(There are a few exceptions to this general ordering.)</p>
693		<p>The CLDR root locale modifies the DUCET tailoring by ordering
694			the common characters more strictly by category:</p>
695		<ul>
696			<li>whitespace, punctuation, general symbols, currency symbols,
697				and numbers.</li>
698		</ul>
		<p>The regrouping allows users to parametrically reorder the
			groups. For example, users can reorder numbers after all scripts,
			or reorder Greek before Latin.</p>
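		<p>The effect of parametric group reordering can be illustrated with a
			toy Python sketch. It does not use CLDR data or the actual
			reordering algorithm: the three groups, the character classifier,
			and the function names are all assumptions, and code points stand in
			for primary weights within a group.</p>

```python
def group_of(ch):
    """Toy classifier (assumption): digits, the Greek block, everything else Latin."""
    if ch.isdigit():
        return "digit"
    if "\u0370" <= ch <= "\u03ff":
        return "Grek"
    return "Latn"

def sort_with_group_order(words, group_order):
    """Sort by (group rank, code point) per character, a stand-in for primary weights."""
    rank = {group: i for i, group in enumerate(group_order)}
    return sorted(words, key=lambda w: [(rank[group_of(ch)], ch) for ch in w])

words = ["beta", "βήτα", "42", "alpha", "άλφα"]

default_order = sort_with_group_order(words, ["digit", "Latn", "Grek"])
greek_first = sort_with_group_order(words, ["digit", "Grek", "Latn"])
digits_last = sort_with_group_order(words, ["Latn", "Grek", "digit"])
```

		<p>Only the rank of each group changes between the three calls; the
			relative order of characters inside a group stays the same, which
			mirrors the statement that the order within each group still
			matches the DUCET.</p>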
702		<p>The relative order within each of these groups still matches
703			the DUCET. Symbols, punctuation, and numbers that are grouped with a
704			particular script stay with that script. The differences between CLDR
705			and the DUCET order are:</p>
706		<ol>
707			<li>CLDR groups the numbers together after currency symbols,
708				instead of splitting them with some before and some after. Thus the
709				following are put <em>after</em> currencies and just before all the
710				other numbers.
711				<blockquote>
712					<p>
713						U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE<br> ...<br>
						U+1D371 ( &#x1D371; ) [No] COUNTING ROD TENS DIGIT NINE
715					</p>
716				</blockquote>
717			</li>
718			<li>CLDR handles a few other characters differently
719				<ol>
					<li>U+10A7F ( &#x10A7F; ) [Po] OLD SOUTH ARABIAN NUMERIC INDICATOR is
721						put with punctuation, not symbols</li>
722					<li>U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc] RIAL
723						SIGN are put with currency signs, not with R and REH.</li>
724				</ol>
725			</li>
726		</ol>
727
728		<h3>
729			2.2 <a name="non_variable_symbols" href="#non_variable_symbols">Non-variable
730				symbols</a>
731		</h3>
732		<p>
733			There are multiple <a
734				href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable-Weighting</a>
735			options in the UCA for symbols and punctuation, including <em>non-ignorable</em>
736			and <em>shifted</em>. With the <em>shifted</em> option, almost all
737			symbols and punctuation are ignored—except at a fourth level. The
738			CLDR root locale ordering is modified so that symbols are not
739			affected by the <em>shifted</em> option. That is, by default, symbols
740			are not “variable” in CLDR. So <em>shifted</em> only causes
741			whitespace and punctuation to be ignored, but not symbols (like ♥).
742			The DUCET behavior can be specified with a locale ID using the
743			&quot;kv&quot; keyword, to set the Variable section to include all of
744			the symbols below it, or be set parametrically where implementations
745			allow access.
746		</p>
747		<p>See also:</p>
748		<ul>
			<li><i>Section 3.4, <a href="#Setting_Options">Setting
						Options</a></i></li>
751			<li><a href="http://www.unicode.org/charts/collation/">http://www.unicode.org/charts/collation/</a></li>
752		</ul>
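		<p>The difference between treating symbols as variable or not under
			the <em>shifted</em> option can be sketched in Python as follows.
			This is only an approximation for illustration: it classifies
			characters by General_Category instead of using collation data, it
			looks at the primary level only (no fourth level), and the function
			names and the <code>symbols_variable</code> flag are assumptions.</p>

```python
import unicodedata

def is_variable(ch, symbols_variable):
    """Toy classification by General_Category (an approximation, not CLDR data)."""
    category = unicodedata.category(ch)
    if category[0] in "ZP":  # separators (whitespace) and punctuation
        return True
    # So/Sm/Sk stand in for general symbols; currency (Sc) is never variable here.
    return symbols_variable and category in {"So", "Sm", "Sk"}

def shifted_primary(text, symbols_variable=False):
    """Primary-level view under 'shifted': variable characters are ignored."""
    return [ch for ch in text if not is_variable(ch, symbols_variable)]

# CLDR-like default: only whitespace and punctuation are variable.
cldr_heart = shifted_primary("I♥NY")
# DUCET-like behavior: symbols are variable as well.
ducet_heart = shifted_primary("I♥NY", symbols_variable=True)
```

		<p>With the default flag the heart symbol still contributes to the
			comparison, while punctuation such as the hyphen in "black-bird" is
			ignored in both modes.</p>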
753
754		<h3>
755			2.3 <a name="tibetan_contractions" href="#tibetan_contractions">Additional
756				contractions for Tibetan</a>
757		</h3>
758		<p>
759			Ten contractions are added for Tibetan: Two to fulfill <a
760				href="http://www.unicode.org/reports/tr10/#WF5">well-formedness
761				condition 5</a>, and eight more to preserve the default order for
762			Tibetan. For details see <i>UTS #10, Section 3.8.2, <a
763				href="http://www.unicode.org/reports/tr10/#Well_Formed_DUCET">Well-Formedness
764					of the DUCET</a></i>.
765		</p>
766
767		<h3>
768			2.4 <a name="tailored_noncharacter_weights"
769				href="#tailored_noncharacter_weights">Tailored noncharacter
770				weights</a>
771		</h3>
772		<p>U+FFFE and U+FFFF have special tailorings:</p>
773		<blockquote>
774			<p>
775				<strong>U+FFFF: </strong>This code point is tailored to have a
776				primary weight higher than all other characters. This allows the
777				reliable specification of a range, such as &ldquo;Sch&rdquo; ≤ X ≤
778				&ldquo;Sch\uFFFF&rdquo;, to include all strings starting with
779				&quot;sch&quot; or equivalent.
780			</p>
781			<p>
782				<strong>U+FFFE: </strong>This code point produces a CE with minimal,
783				unique weights on primary and identical levels. For details see the
784				<i><a href="#Algorithm_FFFE">CLDR Collation Algorithm</a></i> above.
785			</p>
786		</blockquote>
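		<p>The range specification with U+FFFF can be illustrated with a small
			Python sketch. It is not based on a collator: case-folded code point
			order stands in for collation order (an assumption that only holds
			here because the sample data has no characters above U+FFFF), and
			the sample names and variable names are hypothetical.</p>

```python
import bisect

# Hypothetical index, sorted by a toy key (case-folded code point order).
names = sorted(["Scott", "Schulz", "Sanchez", "Sch", "Schäfer", "Schmidt"], key=str.casefold)
keys = [name.casefold() for name in names]

# All entries X with "sch" <= X <= "sch\uFFFF" under the toy key.
low = bisect.bisect_left(keys, "sch")
high = bisect.bisect_right(keys, "sch" + "\uFFFF")
matches = names[low:high]
```

		<p>Because U+FFFF sorts after every other character used here, the
			upper bound lies after every string that starts with "sch" and
			before the first string that does not.</p>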
787		<p>
788			UCA (beginning with version 6.3) also maps <strong>U+FFFD</strong> to
789			a special collation element with a very high primary weight, so that
790			it is reliably non-<a
791				href="http://www.unicode.org/reports/tr10/#Variable_Weighting">variable</a>,
792			for use with <a
793				href="http://www.unicode.org/reports/tr10/#Handling_Illformed">ill-formed
794				code unit sequences</a>.
795		</p>
796		<p>
			In CLDR, so as to maintain the special collation elements, <strong>U+FFFD..U+FFFF</strong>
			are not further tailorable, and nothing can tailor to them. That is,
			none of these code points can occur in a collation rule. For example,
			the following rules are illegal:
801		</p>
802		<p>
803			<code>&amp;\uFFFF &lt; x</code>
804		</p>
805		<p>
806			<code>&amp;x &lt;\uFFFF</code>
807			<br>
808		</p>
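		<p>A rule checker could reject such rules before parsing them. The
			following Python sketch is only an assumption about how such a check
			might look; the function name is hypothetical, and it recognizes
			only literal code points and <code>\uXXXX</code> escapes, not every
			escape form a rule parser may accept.</p>

```python
import re

UNTAILORABLE = "\uFFFD\uFFFE\uFFFF"
# Escaped spellings such as \uFFFF written out in the rule text.
ESCAPED = re.compile(r"\\uFFF[DEF]", re.IGNORECASE)

def uses_untailorable(rule):
    """Return True if the rule mentions U+FFFD..U+FFFF literally or as \\uXXXX."""
    if ESCAPED.search(rule):
        return True
    return any(ch in UNTAILORABLE for ch in rule)

escaped_reset = uses_untailorable(r"&\uFFFF < x")   # escape sequence in the rule text
literal_target = uses_untailorable("&x < \uFFFF")   # literal U+FFFF in the rule text
ordinary_rule = uses_untailorable("&a < b")
```

		<p>Both illegal rules shown above are flagged, while an ordinary rule
			such as <code>&amp;a &lt; b</code> passes.</p>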
809
810		<p class="note">
811			<b>Note:</b>
812		</p>
813		<ul>
			<li class="note">Java uses an early version of this collation
				syntax, but has not been updated recently. It does not support any
				of the syntax marked with [...], and its default table is neither
				the DUCET nor the CLDR root collation.</li>
818		</ul>
819
820		<h3>
821			2.5 <a name="Root_Data_Files" href="#Root_Data_Files">Root
822				Collation Data Files</a>
823		</h3>
824		<p>
825			The CLDR root collation data files are in the CLDR repository and
826			release, under the path <a
827				href="http://unicode.org/repos/cldr/tags/latest/common/uca/">common/uca/</a>.
828		</p>
829
830		<p>
831			For most data files there are <strong>_SHORT</strong> versions
832			available. They contain the same data but only minimal comments, to
833			reduce the file sizes.
834		</p>
835
836		<p>Comments with DUCET-style weights in files other than
837			allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined in
838			allkeys_CLDR.txt.</p>
839		<ul>
840			<li><strong>allkeys_CLDR</strong> - A file that provides a
841				remapping of UCA DUCET weights for use with CLDR.</li>
842			<li><strong>allkeys_DUCET</strong> - The same as DUCET
843				allkeys.txt, but in alternate=non-ignorable sort order, for easier
844				comparison with allkeys_CLDR.txt.</li>
845			<li><strong>FractionalUCA</strong> - A file that provides a
846				remapping of UCA DUCET weights for use with CLDR. The weight values
847				are modified:
848				<ul>
849					<li>The weights have variable length, with 1..4 bytes each.
850						Each secondary or tertiary weight currently uses at most 2 bytes.</li>
851					<li>There are tailoring gaps between adjacent weights, so that
852						a number of characters can be tailored to sort between any two
853						root collation elements.</li>
854					<li>There are collation elements with primary weights at the
855						boundaries between reordering groups and Unicode scripts, so that
856						tailoring around the first or last primary of a group/script
857						results in new collation elements that sort and reorder together
858						with that group or script. These boundary weights also define the
859						primary weight ranges for parametric group and script reordering.
860					</li>
861				</ul> An implementation may modify the weights further to fit the needs
862				of its data structures.</li>
863			<li><strong>UCA_Rules</strong> - A file that specifies the root
864				collation order in the form of <a href="#Collation_Tailorings">tailoring
865					rules</a>. This is only an approximation of the FractionalUCA data,
866				since the rule syntax cannot express every detail of the collation
867				elements. For example, in the DUCET and in FractionalUCA, tertiary
868				differences are usually expressed with special tertiary weights on
869				all collation elements of an expansion, while a typical from-rules
870				builder will modify the tertiary weight of only one of the collation
871				elements.</li>
872			<li><strong>CollationTest_CLDR</strong> - The CLDR versions of
873				the CollationTest files, which use the tailorings for CLDR. For
874				information on the format, see <a
875				href="http://www.unicode.org/Public/UCA/latest/CollationTest.html">CollationTest.html</a>
876				in the <a href="http://www.unicode.org/reports/tr10/#Data10">UCA
877					data directory</a>.
878				<ul>
879					<li>CollationTest_CLDR_NON_IGNORABLE.txt</li>
880					<li>CollationTest_CLDR_SHIFTED.txt</li>
881				</ul></li>
882		</ul>
883
884		<h3>
885			2.6 <a name="Root_Data_File_Formats" href="#Root_Data_File_Formats">Root
886				Collation Data File Formats</a>
887		</h3>
888
889		<p>The file formats may change between versions of CLDR. The
890			formats for CLDR 23 and beyond are as follows. As usual, text after a
891			# is a comment.</p>
892
893		<h4>
894			2.6.1 <a name="File_Format_allkeys_CLDR_txt"
895				href="#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a>
896		</h4>
897		<p>
898			This file defines CLDR’s tailoring of the DUCET, as described in <i>Section
899				2, <a href="#Root_Collation">Root Collation</a>
900			</i>.
901		</p>
902		<p>
903			The format is similar to that of <a
904				href="http://www.unicode.org/reports/tr10/#File_Format">allkeys.txt</a>,
905			although there may be some differences in whitespace.
906		</p>
907
908		<h4>
909			2.6.2 <a name="File_Format_FractionalUCA_txt"
910				href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a>
911		</h4>
912		<p>The format is illustrated by the following sample lines, with
913			commentary afterwards.</p>
914		<pre>[UCA version = 6.0.0]</pre>
915		<blockquote>
916			<p>Provides the version number of the UCA table.</p>
917		</blockquote>
918
919		<pre>[Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]</pre>
920		<blockquote>
921			<p>
922				Lists the ranges of Unified_Ideograph characters in collation order.
923				(New in CLDR 24.) They map to collation elements with <a
924					href="http://www.unicode.org/reports/tr10/#Implicit_Weights">implicit
925					(constructed) primary weights</a>.
926			</p>
927		</blockquote>
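		<p>The following Python sketch shows one way such a line could be
			read. The function names are hypothetical. It assumes that the
			position of a code point within the listed ranges gives its
			relative order among Unified_Ideograph characters, as the
			description above states.</p>

```python
def parse_unified_ideograph(line: str) -> list[tuple[int, int]]:
    """Parse a [Unified_Ideograph ...] line into (first, last) code point pairs."""
    body = line.strip().removeprefix("[Unified_Ideograph").removesuffix("]")
    ranges = []
    for item in body.split():
        first, _, last = item.partition("..")
        # A single code point such as FA11 is a range of length one.
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def ideograph_rank(cp: int, ranges: list[tuple[int, int]]):
    """Return the position of cp in the listed order, or None if not listed."""
    offset = 0
    for first, last in ranges:
        if first <= cp <= last:
            return offset + (cp - first)
        offset += last - first + 1
    return None


LINE = ("[Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 "
        "FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]")
RANGES = parse_unified_ideograph(LINE)
```

		<p>With the sample line, U+3400 ranks after U+FA11 even though its
			code point is smaller, because the ranges are listed in collation
			order rather than code point order.</p>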
928
		<pre>[radical 6=⼅亅:亅��了��-��亇��予㐧��-��争����亊��-����事㐨��-��������]
[radical 210=⿑齊:齊����齋䶒䶓��齌������齍��-��齎����齏��-��]
[radical 210'=⻬齐:齐齑]
[radical end]</pre>
933		<blockquote>
934			<p>
935				Data for Unihan radical-stroke order. (New in CLDR 26.) Following
936				the [Unified_Ideograph] line, a section of
937				<code>[radical ...]</code>
938				lines defines a radical-stroke order of the Unified_Ideograph
939				characters.
940			</p>
941
942			<p>
943				For Han characters, an implementation may choose either to implement
944				the order defined in the UCA and the [Unified_Ideograph] data, or to
945				implement the order defined by the
946				<code>[radical ...]</code>
947				lines. Beginning with CLDR 26, the CJK type="unihan" tailorings
948				assume that the root collation order sorts Han characters in Unihan
949				radical-stroke order according to the
950				<code>[radical ...]</code>
951				data. The CollationTest_CLDR files only contain Han characters that
952				are in the same relative order using implicit weights or the
953				radical-stroke order.
954			</p>
955
956			<p>
957				The root collation radical-stroke order is derived from the first
958				(normative) values of the <a
959					href="http://www.unicode.org/reports/tr38/#kRSUnicode">Unihan
960					kRSUnicode</a> field for each Han character. Han characters are ordered
961				by radical, with traditional forms sorting before simplified ones.
962				Characters with the same radical are ordered by residual stroke
963				count. Characters with the same radical-stroke values are ordered by
964				block and code point, as for <a
965					href="http://www.unicode.org/reports/tr10/#Implicit_Weights">UCA
966					implicit weights</a>.
967			</p>
968
969			<p>
970				There is one
971				<code>[radical ...]</code>
972				line per radical, in the order of radical numbers. Each line shows
973				the radical number and the representative characters from the <a
974					href="http://www.unicode.org/reports/tr44/#UCD_Files_Table">UCD
975					file CJKRadicals.txt</a>, followed by a colon (“:”) and the Han
976				characters with that radical in the order as described above. A
977				range like
978				<code>万-丌</code>
979				indicates that the code points in that range sort in code point
980				order.
981			</p>
982
983			<p>
984				The radical number and characters are informational. The sort order
985				is established only by the order of the
986				<code>[radical ...]</code>
987				lines, and within each line by the characters and ranges between the
988				colon (“:”) and the bracket (“]”).
989			</p>
990
991			<p>
992				Each Unified_Ideograph occurs exactly once. Only Unified_Ideograph
993				characters are listed on
994				<code>[radical ...]</code>
995				lines.
996			</p>
997
998			<p>
999				This section is terminated with one
1000				<code>[radical end]</code>
1001				line.
1002			</p>
1003		</blockquote>
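		<p>A minimal Python sketch of reading one <code>[radical ...]</code>
			line follows. The names and the regular expression are
			illustrative assumptions based only on the description above; a
			hyphen is treated as a range separator because only
			Unified_Ideograph characters are listed after the colon.</p>

```python
import re

RADICAL_LINE = re.compile(r"\[radical (\d+'?)=([^:]+):(.*)\]")


def expand_ranges(chars: str) -> list[str]:
    """Expand 'X-Y' ranges into the code points from X through Y, in order."""
    out = []
    i = 0
    while i < len(chars):
        if chars[i + 1 : i + 2] == "-" and i + 2 < len(chars):
            out.extend(chr(c) for c in range(ord(chars[i]), ord(chars[i + 2]) + 1))
            i += 3
        else:
            out.append(chars[i])
            i += 1
    return out


def parse_radical_line(line: str):
    """Return (radical number, representative characters, ordered members)."""
    m = RADICAL_LINE.fullmatch(line.strip())
    if m is None:
        return None  # for example the terminating "[radical end]" line
    number, representative, members = m.groups()
    return number, representative, expand_ranges(members)
```

		<p>Only the order of the lines and of the expanded members matters for
			sorting; the radical number and representative characters are kept
			here purely for reference.</p>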
1004
1005		<pre>0000; [,,]     # Zyyy Cc       [0000.0000.0000]        * &lt;NULL&gt;</pre>
1006		<blockquote>
1007			<p>
1008				Provides a weight line. The first element (before the &quot;;&quot;)
1009				is a hex codepoint sequence. The second field is a sequence of
1010				collation elements. Each collation element has 3 parts separated by
1011				commas: the primary weight, secondary weight, and tertiary weight.
				The tertiary field actually consists of two components: the top two
				bits (0xC0) are used for the <em>case level</em>, and the remaining
				bits hold the tertiary weight proper. The case bits should be
				masked off where a case level is not used.
1015			</p>
1016			<p>A weight is either empty (meaning a zero or ignorable weight)
1017				or is a sequence of one or more bytes. The bytes are interpreted as
1018				a &quot;fraction&quot;, meaning that the ordering is 04 &lt; 05 05
1019				&lt; 06. The weights are constructed so that no weight is an initial
1020				subsequence of another: that is, having both the weights 05 and 05
1021				05 is illegal. The above line consists of all ignorable weights.</p>
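			<p>As an illustration, Python <code>bytes</code> objects compare
				lexicographically, which matches the &quot;fraction&quot;
				ordering described here. The helper below is a hypothetical
				sketch of a check that no non-empty weight is an initial
				subsequence of another.</p>

```python
def is_prefix_free(weights) -> bool:
    """True if no non-empty weight is an initial subsequence of another."""
    ordered = sorted({w for w in weights if w})
    # In sorted order, a weight that is a prefix of any other weight is also
    # a prefix of its immediate successor, so checking neighbors is enough.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


GOOD = [bytes.fromhex(h) for h in ("06", "05 05", "04")]
BAD = [bytes.fromhex(h) for h in ("05", "05 05")]
```

			<p>Sorting <code>GOOD</code> yields 04, 05 05, 06, and
				<code>BAD</code> fails the check because 05 is an initial
				subsequence of 05 05.</p>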
1022			<p>The vertical bar (“|”) character is used to indicate context,
1023				as in:</p>
1024		</blockquote>
1025		<pre>006C | 00B7; [, DB A9, 05]</pre>
1026		<blockquote>
1027			This example indicates that if U+00B7 appears immediately after
1028			U+006C, it is given the corresponding collation element instead. This
1029			syntax is roughly equivalent to the following contraction, but is
1030			more efficient. For details see the specification of <i><a
1031				href="#Context_Sensitive_Mappings">Context-Sensitive Mappings</a></i>
1032			above.
1033		</blockquote>
1034		<pre>006C 00B7; <em>CE(006C)</em> [, DB A9, 05]</pre>
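		<p>The weight-line syntax described so far can be read with a small
			parser such as the Python sketch below. The function name is
			hypothetical, and the sketch handles only explicit byte weights; it
			does not handle the <i>CE(...)</i> notation used for illustration
			above, nor collation elements that refer to other mappings.</p>

```python
import re


def parse_weight_line(line: str):
    """Split a weight line into (context, code points, collation elements)."""
    data = line.split("#", 1)[0]           # drop the trailing comment
    lhs, rhs = data.split(";", 1)
    prefix, _, sequence = lhs.rpartition("|")
    context = [int(cp, 16) for cp in prefix.split()]
    code_points = [int(cp, 16) for cp in sequence.split()]
    elements = []
    for body in re.findall(r"\[([^\]]*)\]", rhs):
        # An empty field is an ignorable (zero) weight.
        elements.append(tuple(bytes.fromhex(part) for part in body.split(",")))
    return context, code_points, elements
```

		<p>Each collation element comes back as a (primary, secondary,
			tertiary) tuple of byte strings, so the weights can be compared
			directly as fractions.</p>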
1035		<blockquote>
			<p>Single-byte primary weights are given to particularly frequent
				characters, such as space, digits, and a-z. Other common characters
				are given two-byte weights, while relatively infrequent characters
				are given three-byte weights. For example:</p>
1040		</blockquote>
		<pre>...
0009; [03 05, 05, 05] # Zyyy Cc       [0100.0020.0002]        * &lt;CHARACTER TABULATION&gt;
...
1B60; [06 14 0C, 05, 05]    # Bali Po       [0111.0020.0002]        * BALINESE PAMENENG
...
0031; [14, 05, 05]    # Zyyy Nd       [149B.0020.0002]        * DIGIT ONE</pre>
1047		<blockquote>
1048			<p>The assignment of 2 vs 3 bytes does not reflect importance, or
1049				exact frequency.</p>
1050		</blockquote>
1051
		<pre>
3041; [76 06, 05, 03]	# Hira Lo	[3888.0020.000D]	* HIRAGANA LETTER SMALL A
3042; [76 06, 05, 85]	# Hira Lo	[3888.0020.000E]	* HIRAGANA LETTER A
30A1; [76 06, 05, 10]	# Kana Lo	[3888.0020.000F]	* KATAKANA LETTER SMALL A
30A2; [76 06, 05, 9E]	# Kana Lo	[3888.0020.0011]	* KATAKANA LETTER A</pre>
1057		<blockquote>
1058			<p>
1059				Beginning with CLDR 27, some primary or secondary collation elements
1060				may have below-common tertiary weights (e.g.,
1061				<code>03</code>
1062				), in particular to allow normal Hiragana letters to have common
1063				tertiary weights.
1064			</p>
1065		</blockquote>
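		<p>The tertiary fields in the sample lines above can be split into
			case bits and tertiary weight as in the following Python sketch.
			The function name is hypothetical. Placing the case bits in the
			first byte of a multi-byte tertiary weight is an assumption of this
			sketch; the text above only specifies the 0xC0 mask.</p>

```python
CASE_MASK = 0xC0  # top two bits, used for the case level


def split_tertiary(tertiary: bytes) -> tuple[int, bytes]:
    """Return (case bits, tertiary weight with the case bits masked off)."""
    if not tertiary:
        return 0, b""
    case_bits = tertiary[0] & CASE_MASK
    masked = bytes([tertiary[0] & ~CASE_MASK & 0xFF]) + tertiary[1:]
    return case_bits, masked
```

		<p>For example, the tertiary field 85 of HIRAGANA LETTER A splits into
			case bits 0x80 and the common tertiary weight 05, while the field
			03 of HIRAGANA LETTER SMALL A has no case bits set and remains the
			below-common weight 03.</p>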
1066
		<pre># SPECIAL MAX/MIN COLLATION ELEMENTS
FFFE; [02, 05, 05]     # Special LOWEST primary, for merge/interleaving
FFFF; [EF FE, 05, 05]  # Special HIGHEST primary, for ranges</pre>
1070		<blockquote>
1071			<p>The two tailored noncharacters have their own primary weights.
1072			</p>
1073		</blockquote>
1074
		<pre>
F967; [U+4E0D]  # Hani Lo       [FB40.0020.0002][CE0D.0000.0000]        * CJK COMPATIBILITY IDEOGRAPH-F967
2F02; [U+4E36, 10]      # Hani So       [FB40.0020.0004][CE36.0000.0000]        * KANGXI RADICAL DOT
2E80; [U+4E36, 70, 20]  # Hani So       [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004]        * CJK RADICAL REPEAT</pre>
1079		<blockquote>
1080			<p>Some collation elements are specified by reference to other
1081				mappings. This is particularly useful for Han characters which are
1082				given implicit/constructed primary weights; the reference to a
1083				Unified_Ideograph makes these mappings independent of implementation
1084				details. This technique may also be used in other mappings to show
1085				the relationship of character variants.</p>
1086			<p>The referenced character must have a mapping listed earlier in
1087				the file, or the mapping must have been defined via the
1088				[Unified_Ideograph] data line. The referenced character must map to
1089				exactly one collation element.</p>
1090			<p>
1091				<code>[U+4E0D]</code>
1092				copies U+4E0D’s entire collation element.
1093				<code>[U+4E36, 10]</code>
1094				copies U+4E36’s primary and secondary weights and specifies a
1095				different tertiary weight.
1096				<code>[U+4E36, 70, 20]</code>
1097				only copies U+4E36’s primary weight and specifies other secondary
1098				and tertiary weights.
1099			</p>
1100			<p>FractionalUCA.txt does not have any explicit mappings for
1101				implicit weights. Therefore, an implementation is free to choose an
1102				algorithm for computing implicit weights according to the principles
1103				specified in the UCA.</p>
1104		</blockquote>
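		<p>The three reference forms can be resolved as in the Python sketch
			below. The function name and the lookup table are hypothetical.
			Because implicit primary weights are implementation-defined, the
			primary weight used for U+4E36 in the usage note is a made-up
			placeholder, not a value from FractionalUCA.txt.</p>

```python
Weights = tuple[bytes, bytes, bytes]  # (primary, secondary, tertiary)


def resolve_reference(element: str, single_ce: dict[int, Weights]) -> Weights:
    """Resolve '[U+XXXX]', '[U+XXXX, tt]' or '[U+XXXX, ss, tt]' to weights."""
    fields = [f.strip() for f in element.strip().strip("[]").split(",")]
    # The referenced character must map to exactly one collation element.
    primary, secondary, tertiary = single_ce[int(fields[0].removeprefix("U+"), 16)]
    overrides = [bytes.fromhex(f) for f in fields[1:]]
    if len(overrides) == 1:      # copy primary and secondary, new tertiary
        tertiary = overrides[0]
    elif len(overrides) == 2:    # copy primary only
        secondary, tertiary = overrides
    return primary, secondary, tertiary


# Placeholder primary for U+4E36; a real implementation computes it.
TABLE = {0x4E36: (b"\xe0\x30\x10", b"\x05", b"\x05")}
```

		<p>With this table, <code>resolve_reference("[U+4E36, 70, 20]",
			TABLE)</code> keeps the placeholder primary and returns secondary
			70 and tertiary 20.</p>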
1105
		<pre>
FDD1 20AC;	[0D 20 02, 05, 05]	# CURRENCY first primary
FDD1 0034;	[0E 02 02, 05, 05]	# DIGIT first primary starts new lead byte
FDD0 FF21;	[26 02 02, 05, 05]	# REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
FDD1 004C;	[28 02 02, 05, 05]	# LATIN first primary starts new lead byte
FDD0 FF3A;	[5D 02 02, 05, 05]	# REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte
FDD1 03A9;	[5F 04 02, 05, 05]	# GREEK first primary starts new lead byte (compressible)
FDD1 03E2;	[5F 60 02, 05, 05]	# COPTIC first primary (compressible)</pre>
1114		<blockquote>
1115			<p>
1116				These are special mappings with primaries at the boundaries of
1117				scripts and reordering groups. They serve as tailoring boundaries,
1118				so that tailoring near the first or last character of a script or
1119				group places the tailored item into the same group. Beginning with
1120				CLDR 24, each of these is a contraction of U+FDD1 with
1121				a character of the corresponding script
1122				(or of the General_Category [Z, P, S, Sc, Nd]
1123				corresponding to a special reordering group),
1124				mapping to the first possible primary weight per
1125				script or group. They can be enumerated for implementations of <a
1126					href="#Collation_Indexes">Collation Indexes</a>. (Earlier versions
1127				mapped contractions with U+FDD0 to the last primary weights of each
1128				group but not each script.)
1129			</p>
1130			<p>Beginning with CLDR 27, these mappings alone define the
1131				boundaries for reordering single scripts. (There are no mappings for
1132				Hrkt, Hans, or Hant because they are not fully distinct scripts;
1133				they share primary weights with other scripts: Hrkt=Hira=Kana &amp;
1134				Hans=Hant=Hani.) There are some reserved ranges, beginning at
1135				boundaries marked with U+FDD0 plus following characters as shown
1136				above. The reserved ranges are not used for collation elements and
1137				are not available for tailoring.</p>
1138			<p>Some primary lead bytes must be reserved so that reordering of
1139				scripts along partial-lead-byte boundaries can “split” the primary
1140				lead byte and use up a reserved byte. This is for implementations
1141				that write sort keys, which must reorder primary weights by
1142				offsetting them by whole lead bytes. There are reorder-reserved
1143				ranges before and after Latin, so that reordering scripts with few
1144				primary lead bytes relative to Latin can move those scripts into the
1145				reserved ranges without changing the primary weights of any other
1146				script. Each of these boundaries begins with a new two-byte primary;
1147				that is, no two groups/scripts/ranges share the top 16 bits of their
1148				primary weights.</p>
1149		</blockquote>
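		<p>The following Python sketch enumerates the U+FDD1 boundary mappings
			from a few of the sample lines above and checks the stated property
			that no two boundaries share the top 16 bits of their primary
			weights. The function name is hypothetical, and whitespace in the
			sample has been normalized to spaces.</p>

```python
SAMPLE = """\
FDD1 20AC; [0D 20 02, 05, 05] # CURRENCY first primary
FDD0 FF21; [26 02 02, 05, 05] # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
FDD1 004C; [28 02 02, 05, 05] # LATIN first primary starts new lead byte
FDD1 03A9; [5F 04 02, 05, 05] # GREEK first primary starts new lead byte (compressible)
FDD1 03E2; [5F 60 02, 05, 05] # COPTIC first primary (compressible)
"""


def first_primaries(text: str) -> dict[str, bytes]:
    """Map each U+FDD1 boundary's sample character to its first primary."""
    result = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0]
        lhs, _, rhs = data.partition(";")
        code_points = [int(cp, 16) for cp in lhs.split()]
        # Reserved-range boundaries use U+FDD0 and are skipped here.
        if len(code_points) == 2 and code_points[0] == 0xFDD1:
            primary_field = rhs.strip().strip("[]").split(",")[0]
            result[chr(code_points[1])] = bytes.fromhex(primary_field)
    return result


BOUNDARIES = first_primaries(SAMPLE)
TOP_16_BITS = {primary[:2] for primary in BOUNDARIES.values()}
```

		<p>Sorting the sample characters by their primaries gives the order of
			the corresponding scripts and groups, which is the kind of
			enumeration a collation index implementation might start from.</p>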
1150
		<pre>
FDD0 0034;      [11, 05, 05]    # lead byte for numeric sorting</pre>
1153		<blockquote>
			<p>This mapping specifies the lead byte for numeric sorting. It
				must be different from the lead byte of any other primary weight;
				otherwise, numeric sorting would generate ill-formed collation
				elements. Therefore, this mapping itself must be excluded from the
				set of regular mappings. This value can be ignored by
				implementations that do not support numeric sorting. (Other
				contractions with U+FDD0 can normally be ignored altogether.)</p>
1161		</blockquote>
1162
		<pre>
# HOMELESS COLLATION ELEMENTS
FDD0 0063; [, 97, 3D]       # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F]    * U+01C6 LATIN SMALL LETTER DZ WITH CARON
FDD0 0064; [, A7, 09]       # [15D1.0020.0004] [0000.0056.0004]     * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
FDD0 0065; [, B1, 09]       # [1644.0020.0004] [0000.0061.0004]     * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE</pre>
1168		<blockquote>
			<p>The DUCET has some weights that do not correspond directly to a
				character. So that implementations can have a mapping for each
				collation element (necessary for certain implementations of
				tailoring), special sequences are constructed for those weights.
				These collation elements can normally be ignored.</p>
1174		</blockquote>
1175
1176		<p>Next, a number of tables are defined. The function of each of
1177			the tables is summarized afterwards.</p>
1178
		<pre># VALUES BASED ON UCA
...
[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
[first trailing [E5, 05, 05]] # CONSTRUCTED
[last trailing [E5, 05, 05]] # CONSTRUCTED
...</pre>
1188		<blockquote>
1189			<p>This table summarizes ranges of important groups of characters
1190				for implementations.</p>
1191		</blockquote>
		<pre># Top Byte =&gt; Reordering Tokens
[top_byte     00      TERMINATOR ]    #       [0]     TERMINATOR=1
[top_byte     01      LEVEL-SEPARATOR ]       #       [0]     LEVEL-SEPARATOR=1
[top_byte     02      FIELD-SEPARATOR ]       #       [0]     FIELD-SEPARATOR=1
[top_byte     03      SPACE ] #       [9]     SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
...</pre>
1198		<blockquote>
1199			<p>This table defines the reordering groups, for script
1200				reordering. The table maps from the first bytes of the fractional
1201				weights to a reordering token. The format is &quot;[top_byte &quot;
1202				byte-value reordering-token &quot;COMPRESS&quot;? &quot;]&quot;. The
1203				&quot;COMPRESS&quot; value is present when there is only one byte in
1204				the reordering token, and primary-weight compression can be applied.
1205				Most reordering tokens are script values; others are special-purpose
1206				values, such as PUNCTUATION. Beginning with CLDR 24, this table
1207				precedes the regular mappings, so that parsers can use this
1208				information while processing and optimizing mappings. Beginning with
1209				CLDR 27, most of this data is irrelevant because single scripts can
1210				be reordered. Only the "COMPRESS" data is still useful.</p>
1211		</blockquote>
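		<p>A line in the format given above could be read with the following
			Python sketch. The function name and regular expression are
			hypothetical, and the sketch accepts several whitespace-separated
			tokens before the optional &quot;COMPRESS&quot; marker as a
			precaution. The second line in the usage note is invented purely to
			exercise the &quot;COMPRESS&quot; case.</p>

```python
import re

TOP_BYTE = re.compile(r"\[top_byte\s+([0-9A-F]{2})\s+(.*?)\s*\]")


def parse_top_byte(line: str):
    """Return (byte value, reordering tokens, compressible flag) or None."""
    m = TOP_BYTE.match(line)
    if m is None:
        return None
    tokens = m.group(2).split()
    compress = bool(tokens) and tokens[-1] == "COMPRESS"
    if compress:
        tokens = tokens[:-1]
    return int(m.group(1), 16), tokens, compress
```

		<p>For the sample line for byte 03 this returns the byte value 3, the
			token list <code>["SPACE"]</code>, and a false compress flag. For
			the invented line <code>[top_byte 7F EXAMPLE COMPRESS ]</code> it
			returns 0x7F, <code>["EXAMPLE"]</code>, and a true flag.</p>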
		<pre># Reordering Tokens =&gt; Top Bytes
[reorderingTokens     Arab    61=910 62=910 ]
[reorderingTokens     Armi    7A=22 ]
[reorderingTokens     Armn    5F=82 ]
[reorderingTokens     Avst    7A=54 ]
...</pre>
1218		<blockquote>
1219			<p>This table is an inverse mapping from reordering token to top
1220				byte(s). In terms like &quot;61=910&quot;, the first value is the
1221				top byte, while the second is informational, indicating the number
1222				of primaries assigned with that top byte.</p>
1223		</blockquote>
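		<p>The terms of this inverse table can be split as in the Python
			sketch below; the function name is hypothetical.</p>

```python
def parse_reordering_tokens(line: str) -> tuple[str, dict[int, int]]:
    """Return (token, {top byte: number of primaries}) for one table line."""
    fields = line.split("#", 1)[0].strip().strip("[]").split()
    if not fields or fields[0] != "reorderingTokens":
        raise ValueError("not a reorderingTokens line")
    token, pairs = fields[1], fields[2:]
    counts = {}
    for pair in pairs:
        top_byte, _, count = pair.partition("=")
        # The top byte is hexadecimal; the informational count is decimal.
        counts[int(top_byte, 16)] = int(count)
    return token, counts
```

		<p>For the Arab line this yields the token <code>Arab</code> with top
			bytes 0x61 and 0x62, each with 910 primaries.</p>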
		<pre># General Categories =&gt; Top Byte
[categories   Cc      03{SPACE}=6 ]
[categories   Cf      77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
[categories   Lm      0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...</pre>
1228		<blockquote>
1229			<p>This table is informational, providing the top bytes, scripts,
1230				and primaries associated with each general category value.</p>
1231		</blockquote>
		<pre># FIXED VALUES
[fixed first implicit byte E0]
[fixed last implicit byte E4]
[fixed first trail byte E5]
[fixed last trail byte EF]
[fixed first special byte F0]
[fixed last special byte FF]

[fixed secondary common byte 05]
[fixed last secondary common byte 45]
[fixed first ignorable secondary byte 80]

[fixed tertiary common byte 05]
[fixed first ignorable tertiary byte 3C]
		</pre>
1247		<blockquote>
1248			<p>The final table gives certain hard-coded byte values. The
1249				&quot;trail&quot; area is provided for implementation of the
1250				&quot;trailing weights&quot; as described in the UCA.</p>
1251		</blockquote>
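		<p>Using the fixed byte values from the sample above, a primary lead
			byte can be classified as in this Python sketch. The names are
			hypothetical, and the ranges are only those shown in the sample; an
			implementation that moves these areas would use its own values.</p>

```python
# Inclusive lead-byte ranges taken from the sample "FIXED VALUES" table.
FIXED_RANGES = {
    "implicit": (0xE0, 0xE4),
    "trailing": (0xE5, 0xEF),
    "special": (0xF0, 0xFF),
}


def classify_lead_byte(lead: int) -> str:
    """Name the fixed area that a primary lead byte falls into."""
    for name, (first, last) in FIXED_RANGES.items():
        if first <= lead <= last:
            return name
    return "other"
```

		<p>For example, the lead byte EF of the sample U+FFFF mapping falls in
			the trailing area, and the Latin lead byte 28 is outside all three
			fixed areas.</p>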
1252
1253		<p class="note">Note: The particular primary lead bytes for Hani
1254			vs. IMPLICIT vs. TRAILING are only an example. An implementation is
1255			free to move them if it also moves the explicit TRAILING weights.
1256			This affects only a small number of explicit mappings in
1257			FractionalUCA.txt, such as for U+FFFD, U+FFFF, and the “unassigned
1258			first primary”. It is possible to use no SPECIAL bytes at all, and to
1259			use only the one primary lead byte FF for TRAILING weights.</p>
1260
1261		<h4>
1262			2.6.3 <a name="File_Format_UCA_Rules_txt"
1263				href="#File_Format_UCA_Rules_txt">UCA_Rules.txt</a>
1264		</h4>
1265		<p>
			The format for this file uses the CLDR collation syntax; see <i>Section
1267				3, <a href="#Collation_Tailorings">Collation Tailorings</a>
1268			</i>.
1269		</p>
1270
1271
1272		<h2>
1273			3 <a name="Collation_Tailorings" href="#Collation_Tailorings">Collation
1274				Tailorings</a>
1275		</h2>
1276		<p class="dtd">&lt;!ELEMENT collations (alias |
1277			(defaultCollation?, collation*, special*)) &gt;</p>
1278		<p class="dtd">&lt;!ELEMENT defaultCollation ( #PCDATA ) &gt;</p>
1279		<p>
1280			This element of the LDML format contains one or more <span
1281				class="element">collation</span> elements, distinguished by type.
1282			Each <span class="element">collation</span> contains elements with
1283			parametric settings, or rules that specify a certain sort order, as a
1284			tailoring of the root order, or both.
1285		</p>
1286		<p class="note">
1287			Note: CLDR collation tailoring data should follow the <a
1288				href="http://cldr.unicode.org/index/cldr-spec/collation-guidelines">CLDR
1289				Collation Guidelines</a>.
1290		</p>
1291
1292		<h3>
1293			3.1 <a name="Collation_Types" href="#Collation_Types">Collation
1294				Types</a>
1295		</h3>
1296		<p>
1297			Each locale may have multiple sort orders (types). The <span
1298				class="element">defaultCollation</span> element defines the default
1299			tailoring for a locale and its sublocales. For example:
1300		</p>
1301		<ul>
1302			<li>root.xml: <code>&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;</code></li>
1303			<li>zh.xml: <code>&lt;defaultCollation&gt;pinyin&lt;/defaultCollation&gt;</code></li>
1304			<li>zh_Hant.xml: <code>&lt;defaultCollation&gt;stroke&lt;/defaultCollation&gt;</code></li>
1305		</ul>
1306
1307		<p>
			To allow implementations in reduced memory environments to use CJK
			sorting, there are also short forms of each of these collation
			sequences. These cover only the most commonly used characters,
			and are marked with <span class="attribute">alt</span>=&quot;<span
				class="attributeValue">short</span>&quot;.
1313		</p>
1314
1315		<p>A collation type name that starts with "private-", for example,
1316			"private-kana", indicates an incomplete tailoring that is only
1317			intended for import into one or more other tailorings (usually for
1318			sharing common rules). It does not establish a complete sort order.
1319			An implementation should not build data tables for a private
1320			collation type, and should not include a private collation type in a
1321			list of available types.</p>
1322
1323		<p class="note">
1324			<b>Note:</b>
1325		</p>
1326		<ul>
1327			<li>There is an on-line demonstration of collation at [<a
1328				href="tr35.html#LocaleExplorer">LocaleExplorer</a>] that uses the
1329				same rule syntax. (Pick the locale and scroll to &quot;Collation
1330				Rules&quot;, near the end.)
1331			</li>
1332			<li class="note">In CLDR 23 and before, LDML collation files
1333				used an XML format. Starting with CLDR 24, the XML collation syntax
1334				is deprecated and no longer used. See the <i><a
1335					href="http://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings">CLDR
1336						23 version of this document</a></i> for details about the XML collation
1337				syntax.
1338			</li>
1339		</ul>
1340
1341		<h4>
1342			3.1.1 <a name="Collation_Type_Fallback"
1343				href="#Collation_Type_Fallback">Collation Type Fallback</a>
1344		</h4>
1345		<p>When loading a requested tailoring from its data file and the
1346			parent file chain, use the following type fallback to find the
1347			tailoring.</p>
1348		<ol>
1349			<li>Determine the default type from the &lt;defaultCollation&gt;
1350				element; map the default type to its alias if one is defined. If
1351				there is no &lt;defaultCollation&gt; element, then use "standard" as
1352				the default type.</li>
1353			<li>If the request language tag specifies the collation type
1354				(keyword "co"), then map it to its alias if one is defined (e.g.,
1355				"-co-phonebk" → "phonebook"). If the language tag does not specify
1356				the type, then use the default type.</li>
1357			<li>Use the &lt;collation&gt; element with this type.</li>
1358			<li>If it does not exist, and the type starts with "search" but
1359				is longer, then set the type to "search" and use that
1360				&lt;collation&gt; element. (For example, "searchjl" → "search".)</li>
1361			<li>If it does not exist, and the type is not the default type,
1362				then set the type to the default type and use that &lt;collation&gt;
1363				element.</li>
1364			<li>If it does not exist, and the type is not "standard", then
1365				set the type to "standard" and use that &lt;collation&gt; element.</li>
1366			<li>If it does not exist, then use the CLDR root collation.</li>
1367		</ol>
1368		<p class="note">Note that the CLDR collation/root.xml contains
1369			&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;,
1370			&lt;collation type="standard"&gt; (with an empty tailoring, so this
1371			is the same as the CLDR root collation), and &lt;collation
1372			type="search"&gt;.</p>
1373
1374		<p>For example, assume that we have collation data for the
1375			following tailorings. ("da/search" is shorthand for
1376			"da-u-co-search".)</p>
1377		<ul>
1378			<li>root/defaultCollation=standard</li>
1379			<li>root/standard (this is the same as “the CLDR root collator”)</li>
1380			<li>root/search</li>
1381			<li>da/standard</li>
1382			<li>da/search</li>
1383			<li>el/standard</li>
1384			<li>ko/standard</li>
1385			<li>ko/search</li>
1386			<li>ko/searchjl</li>
1387			<li>zh/defaultCollation=pinyin</li>
1388			<li>zh/pinyin</li>
1389			<li>zh/stroke</li>
1390			<li>zh-Hant/defaultCollation=stroke</li>
1391		</ul>
1392		<table>
1393			<caption>
1394				<a name="Sample_requested_and_actual_collation_locales_and_types"
1395					href="#Sample_requested_and_actual_collation_locales_and_types">Sample
1396					requested and actual collation locales and types</a>
1397			</caption>
1398			<tr>
1399				<th>requested</th>
1400				<th>actual</th>
1401				<th>comment</th>
1402			</tr>
1403			<tr>
1404				<td>da/phonebook</td>
1405				<td>da/standard</td>
1406				<td>default type for Danish</td>
1407			</tr>
1408			<tr>
1409				<td>zh</td>
1410				<td>zh/pinyin</td>
1411				<td>default type for zh</td>
1412			</tr>
1413			<tr>
1414				<td>zh/standard</td>
1415				<td>root/standard</td>
1416				<td>no "standard" tailoring for zh, falls back to root</td>
1417			</tr>
1418			<tr>
1419				<td>zh/phonebook</td>
1420				<td>zh/pinyin</td>
1421				<td>default type for zh</td>
1422			</tr>
1423			<tr>
1424				<td>zh-Hant/phonebook</td>
1425				<td>zh/stroke</td>
1426				<td>default type for zh-Hant is "stroke"</td>
1427			</tr>
1428			<tr>
1429				<td>da/searchjl</td>
1430				<td>da/search</td>
1431				<td>"search.+" falls back to "search"</td>
1432			</tr>
1433			<tr>
1434				<td>el/search</td>
1435				<td>root/search</td>
1436				<td>no "search" tailoring for Greek</td>
1437			</tr>
1438			<tr>
1439				<td>el/searchjl</td>
1440				<td>root/search</td>
1441				<td>"search.+" falls back to "search", found in root</td>
1442			</tr>
1443			<tr>
1444				<td>ko/searchjl</td>
1445				<td>ko/searchjl</td>
1446				<td>requested data is actually available</td>
1447			</tr>
1448		</table>
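		<p>The fallback steps and the sample table can be reproduced with the
			following Python sketch. The data structure and function names are
			hypothetical, and the parent chain is assumed to be obtained by
			truncating the language tag (zh-Hant, then zh, then root), which is
			what the sample table implies.</p>

```python
# Sample data from the list above: locale -> (defaultCollation, types).
DATA = {
    "root": ("standard", {"standard", "search"}),
    "da": (None, {"standard", "search"}),
    "el": (None, {"standard"}),
    "ko": (None, {"standard", "search", "searchjl"}),
    "zh": ("pinyin", {"pinyin", "stroke"}),
    "zh-Hant": ("stroke", set()),
}
ALIASES = {"phonebk": "phonebook"}


def chain(locale):
    """Yield the locale and its parents, ending with root."""
    while locale != "root":
        yield locale
        locale = locale.rpartition("-")[0] or "root"
    yield "root"


def find(locale, ctype):
    """Return 'locale/type' for the first file in the chain with that type."""
    for loc in chain(locale):
        if loc in DATA and ctype in DATA[loc][1]:
            return f"{loc}/{ctype}"
    return None


def resolve(locale, requested=None):
    defaults = (DATA[loc][0] for loc in chain(locale) if loc in DATA)
    default = next((d for d in defaults if d), "standard")      # step 1
    default = ALIASES.get(default, default)
    ctype = ALIASES.get(requested, requested) or default        # step 2
    found = find(locale, ctype)                                 # step 3
    if not found and ctype.startswith("search") and ctype != "search":
        ctype = "search"                                        # step 4
        found = find(locale, ctype)
    if not found and ctype != default:
        ctype = default                                         # step 5
        found = find(locale, ctype)
    if not found and ctype != "standard":
        ctype = "standard"                                      # step 6
        found = find(locale, ctype)
    return found or "root collation"                            # step 7
```

		<p>For instance, <code>resolve("zh-Hant", "phonebook")</code> returns
			<code>zh/stroke</code> and <code>resolve("el", "searchjl")</code>
			returns <code>root/search</code>, matching the table.</p>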
1449
1450		<h3>
1451			3.2 <a name="Collation_Version" href="#Collation_Version">Version</a>
1452		</h3>
		<p>The version attribute specifies a particular version of the
			UCA. It is optional, and is supplied when the results must be
			identical on different systems. If it is not supplied, then the
			version is assumed to be the same as the Unicode version for the
			system as a whole.</p>
1458		<blockquote>
1459			<p class="note">
1460				<b>Note: </b>For version 3.1.1 of the UCA, the version of Unicode
1461				must also be specified with any versioning information; an example
1462				would be &quot;3.1.1/3.2&quot; for version 3.1.1 of the UCA, for
1463				version 3.2 of Unicode. This was changed by decision of the UTC, so
1464				that dual versions were no longer necessary. So for UCA 4.0 and
1465				beyond, the version just has a single number.
1466			</p>
1467		</blockquote>
1468
1469		<h3>
1470			3.3 <a name="Collation_Element" href="#Collation_Element">Collation
1471				Element</a>
1472		</h3>
1473		<p class="dtd">&lt;!ELEMENT collation (alias | (cr*, special*))
1474			&gt;</p>
1475		<p>
1476			The tailoring syntax is designed to be independent of the actual
1477			weights used in any particular UCA table. That way the same rules can
1478			be applied to UCA versions over time, even if the underlying weights
1479			change. The following illustrates the overall structure of a <span
1480				class="element">collation</span>:
1481		</p>
		<pre>&lt;collation type="phonebook"&gt;
  &lt;cr&gt;&lt;![CDATA[
    [caseLevel on]
    &amp;c &lt; k
  ]]&gt;&lt;/cr&gt;
&lt;/collation&gt;</pre>
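		<p>As an illustration, the rules can be extracted from such an element
			with Python's standard <code>xml.etree.ElementTree</code> module.
			The function name is hypothetical, and the sketch assumes a single
			<code>cr</code> child with one setting or rule per line.</p>

```python
import xml.etree.ElementTree as ET

XML = """<collation type="phonebook">
  <cr><![CDATA[
    [caseLevel on]
    &c < k
  ]]></cr>
</collation>"""


def read_rules(xml_text: str) -> tuple[str, list[str]]:
    """Return the collation type and the non-empty lines of its rules."""
    collation = ET.fromstring(xml_text)
    text = collation.findtext("cr", default="")
    rules = [line.strip() for line in text.splitlines() if line.strip()]
    return collation.get("type"), rules
```

		<p>Because the rules sit in a CDATA section, the rule characters
			<code>&amp;</code> and <code>&lt;</code> arrive unescaped in the
			extracted text.</p>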
1488
1489		<h3>
1490			3.4 <a name="Setting_Options" href="#Setting_Options">Setting
1491				Options</a>
1492		</h3>
1493		<p>
1494			Parametric settings can be specified in language tags or in rule
1495			syntax (in the form
1496			<code>[keyword value]</code>
1497			). For example,
1498			<code>-ks-level2</code>
1499			or
1500			<code>[strength 2]</code>
1501			will only compare strings based on their primary and secondary
1502			weights.
1503		</p>
1504		<p>
1505			If a setting is not present, the CLDR default (or the default for the
1506			locale, if there is one) is used. That default is listed in bold
1507			italics. Where there is a UCA default that is different, it is listed
1508			in bold with (<strong>UCA default</strong>). Note that the default
1509			value for a locale may be different than the normal default value for
1510			the setting.
1511		</p>
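		<p>The correspondence between the two notations can be sketched in
			Python as follows. The dictionaries transcribe the
			<code>ks</code> and <code>ka</code> rows of the Collation Settings
			table; the function name is hypothetical, and each key is assumed
			to be followed by exactly one value subtag.</p>

```python
# BCP 47 value -> rule syntax, for the "ks" and "ka" keys only.
KS_RULES = {
    "level1": "[strength 1]",
    "level2": "[strength 2]",
    "level3": "[strength 3]",  # CLDR default
    "level4": "[strength 4]",
    "identic": "[strength I]",
}
KA_RULES = {
    "noignore": "[alternate non-ignorable]",  # CLDR default
    "shifted": "[alternate shifted]",         # UCA default
}
TABLES = {"ks": KS_RULES, "ka": KA_RULES}


def settings_to_rules(subtags: str) -> list[str]:
    """Translate '-ks-level2-ka-shifted'-style subtags into rule settings."""
    parts = subtags.strip("-").split("-")
    return [TABLES[key][value] for key, value in zip(parts[::2], parts[1::2])]
```

		<p>For example, <code>settings_to_rules("-ks-level2")</code> returns
			<code>["[strength 2]"]</code>.</p>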
1512
1513		<table>
1514			<caption>
1515				<a name="Collation_Settings" href="#Collation_Settings">Collation
1516					Settings</a>
1517			</caption>
1518			<tr>
1519				<th>BCP47 Key</th>
1520				<th>BCP47 Value</th>
1521				<th>Rule Syntax</th>
1522				<th>Description</th>
1523			</tr>
1524			<tr>
1525				<td rowspan="5">ks</td>
1526				<td>level1</td>
1527				<td><code>[strength 1]</code><br>(primary)</td>
1528				<td rowspan="5">Sets the default strength for comparison, as
1529					described in the [<a
1530					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].<em>
						Note that a strength setting of greater than 3 may have the same
1532						effect as <strong>identical</strong>, depending on the locale and
1533						implementation.
1534				</em>
1535				</td>
1536			</tr>
1537			<tr>
1538				<td>level2</td>
1539				<td><code>[strength 2]</code><br>(secondary)</td>
1540			</tr>
1541			<tr>
1542				<td>level3</td>
1543				<td><em><strong><code>[strength 3]</code><br>(tertiary)</strong></em></td>
1544			</tr>
1545			<tr>
1546				<td>level4</td>
1547				<td><code>[strength 4]</code><br>(quaternary)</td>
1548			</tr>
1549			<tr>
1550				<td>identic</td>
1551				<td><code>[strength I]</code><br>(identical)</td>
1552			</tr>
1553			<tr>
1554				<td rowspan="3">ka</td>
1555				<td>noignore</td>
1556				<td><i><strong><code>[alternate
1557								non-ignorable]</code></strong></i><br></td>
1558				<td rowspan="3">Sets alternate handling for variable weights,
1559					as described in [<a
1560					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], where
1561					&quot;shifted&quot; causes certain characters to be ignored in
1562					comparison. <em>The default for LDML is different than it is
1563						in the UCA. In LDML, the default for alternate handling is <strong>non-ignorable</strong>,
1564						while in UCA it is <strong>shifted</strong>. In addition, in LDML
1565						only whitespace and punctuation are variable by default.
1566				</em>
1567				</td>
1568			</tr>
1569			<tr>
1570				<td>shifted</td>
1571				<td><strong><code>[alternate shifted]</code><br>(UCA
1572						default)</strong></td>
1573			</tr>
1574			<tr>
1575				<td><em>n/a</em></td>
1576				<td><i>n/a</i><br>(blanked)</td>
1577			</tr>
1578			<tr>
1579				<td rowspan="2">kb</td>
1580				<td>true</td>
1581				<td><code>[backwards 2]</code></td>
1582				<td rowspan="2">Sets the comparison for the second level to be
1583					<strong>backwards</strong>, as described in [<a
1584					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
1585				</td>
1586			</tr>
1587			<tr>
1588				<td>false</td>
1589				<td><i><strong>n/a</strong></i></td>
1590			</tr>
1591			<tr>
1592				<td rowspan="2">kk</td>
1593				<td>true</td>
1594				<td><strong><code>[normalization on]</code><br>(UCA
1595						default)</strong></td>
1596				<td rowspan="2">If <strong>on</strong>, then the normal [<a
1597					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
1598					algorithm is used. If <strong>off</strong>, then most strings
1599					should still sort correctly despite not normalizing to NFD first.<br>
1600					<em>Note that the default for CLDR locales may be different
1601						than in the UCA. The rules for particular locales have it set to <strong>on</strong>:
1602						those locales whose exemplar characters (in forms commonly
1603						interchanged) would be affected by normalization.
1604				</em>
1605				</td>
1606			</tr>
1607			<tr>
1608				<td>false</td>
1609				<td><i><strong><code>[normalization off]</code></strong></i></td>
1610			</tr>
1611			<tr>
1612				<td rowspan="2">kc</td>
1613				<td>true</td>
1614				<td><code>[caseLevel on]</code></td>
				<td rowspan="2">If set to <strong>on</strong>, a level
					consisting only of case characteristics will be inserted in front
					of the tertiary level, as a &quot;Level 2.5&quot;. To ignore accents
1618					but take case into account, set strength to <strong>primary</strong>
1619					and case level to <strong>on</strong>. For details, see <em>Section
1620						3.14, <a href="#Case_Parameters">Case Parameters</a>
1621				</em>.
1622				</td>
1623			</tr>
1624			<tr>
1625				<td>false</td>
1626				<td><i><strong><code>[caseLevel off]</code></strong></i></td>
1627			</tr>
1628			<tr>
1629				<td rowspan="3">kf</td>
1630				<td>upper</td>
1631				<td><code>[caseFirst upper]</code></td>
1632				<td rowspan="3">If set to <strong>upper</strong>, causes upper
1633					case to sort before lower case. If set to <strong>lower</strong>,
					causes lower case to sort before upper case. Useful for locales
					whose ordering is already supported but which require a different
					order of cases. Affects case and tertiary levels. For details, see <em>Section
1637						3.14, <a href="#Case_Parameters">Case Parameters</a>
1638				</em>.
1639				</td>
1640			</tr>
1641			<tr>
1642				<td>lower</td>
1643				<td><code>[caseFirst lower]</code></td>
1644			</tr>
1645			<tr>
1646				<td>false</td>
1647				<td><i><strong><code>[caseFirst off]</code></strong></i></td>
1648			</tr>
1649			<tr>
1650				<td rowspan="2">kh</td>
1651				<td>true<br> <i><strong>Deprecated:</strong></i> Use rules
1652					with quater&shy;nary relations instead.
1653				</td>
1654				<td><code>[hiraganaQ on]</code></td>
				<td rowspan="2">Controls special treatment of Hiragana code
					points on the quaternary level. If turned <strong>on</strong>, Hiragana
					code points will get lower values than all the other non-variable
					code points in <strong>shifted</strong>. That is, the normal Level
1659					4 value for a regular collation element is FFFF, as described in [<a
1660					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], <em>Section
1661						3.6, <a
1662						href="http://www.unicode.org/reports/tr10/#Variable_Weighting">Variable
1663							Weighting</a>
				</em>. This is changed to FFFE for [:script=Hiragana:] characters. The
					strength must be greater than or equal to quaternary if this attribute
					is to have any effect.
1667				</td>
1668			</tr>
1669			<tr>
1670				<td>false</td>
1671				<td><i><strong><code>[hiraganaQ off]</code></strong></i></td>
1672			</tr>
1673			<tr>
1674				<td rowspan="2">kn</td>
1675				<td>true</td>
1676				<td><code>[numericOrdering on]</code></td>
1677				<td rowspan="2">If set to <strong>on</strong>, any sequence of
1678					Decimal Digits (General_Category = Nd in the [<a
1679					href="http://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is
1680					sorted at a primary level with its numeric value. For example,
1681					&quot;A-21&quot; &lt; &quot;A-123&quot;. The computed primary
1682					weights are all at the start of the <strong>digit</strong>
1683					reordering group. Thus with an untailored UCA table, &quot;a$&quot;
1684					&lt; &quot;a0&quot; &lt; &quot;a2&quot; &lt; &quot;a12&quot; &lt;
1685					&quot;a⓪&quot; &lt; &quot;aa&quot;.
1686				</td>
1687			</tr>
1688			<tr>
1689				<td>false</td>
1690				<td><i><strong><code>[numericOrdering off]</code></strong></i></td>
1691			</tr>
1692			<tr>
1693				<td>kr</td>
1694				<td>a sequence of one or more reorder codes: <strong>space,
1695						punct, symbol, currency, digit</strong>, or any BCP47 script ID
1696				</td>
1697				<td><code>[reorder Grek digit]</code></td>
1698				<td>Specifies a reordering of scripts or other significant
1699					blocks of characters such as symbols, punctuation, and digits. For
1700					the precise meaning and usage of the reorder codes, see <em>Section
1701						3.13, <a href="#Script_Reordering">Collation Reordering</a>.
1702				</em>
1703				</td>
1704			</tr>
1705			<tr>
1706				<td rowspan="4">kv</td>
1707				<td>space</td>
1708				<td><code>[maxVariable space]</code></td>
1709				<td rowspan="4">Sets the variable top to the top of the
1710					specified reordering group. All code points with primary weights
1711					less than or equal to the variable top will be considered variable,
1712					and thus affected by the alternate handling. Variables are
1713					ignorable by default in [<a
1714					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but not
1715					in CLDR.
1716				</td>
1717			</tr>
1718			<tr>
1719				<td>punct</td>
1720				<td><i><strong><code>[maxVariable punct]</code></strong></i></td>
1721			</tr>
1722			<tr>
1723				<td>symbol</td>
1724				<td><strong><code>[maxVariable symbol]</code><br>(UCA
1725						default)</strong></td>
1726			</tr>
1727			<tr>
1728				<td>currency</td>
1729				<td><code>[maxVariable currency]</code></td>
1730			</tr>
1731			<tr>
1732				<td>vt</td>
1733				<td>See <i>Part 1 Section 3.6.4, <a
1734						href="tr35.html#Unicode_Locale_Extension_Data_Files">U
1735							Extension Data Files</a></i>.<br> <i><strong>Deprecated:</strong></i>
1736					Use maxVariable instead.
1737				</td>
1738				<td><code>&amp;\u00XX\uYYYY &lt; [variable top]</code><br>
1739					<br> (the default is set to the highest punctuation, thus
1740					including spaces and punctuation, but not symbols)</td>
1741				<td>
1742					<p>
1743						The BCP47 value is described in <i>Appendix Q: <a
1744							href="tr35.html#Locale_Extension_Key_and_Type_Data">Locale
1745								Extension Keys and Types</a>.
1746						</i>
1747					</p>
1748					<p>
1749						Sets the string value for the variable top. All the code points
1750						with primary weights less than or equal to the variable top will
1751						be considered variable, and thus affected by the alternate
1752						handling.<br> An implementation that supports the variableTop
1753						setting should also support the maxVariable setting, and it should
1754						"pin" ("round up") the variableTop to the top of the containing
1755						reordering group.<br> Variables are ignorable by default in [<a
1756							href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but
1757						not in CLDR. See below for more information.
1758					</p>
1759				</td>
1760			</tr>
1761			<tr>
1762				<td><em>n/a</em></td>
1763				<td><em>n/a</em></td>
1764				<td><em>n/a</em></td>
1765				<td>match-boundaries: <em><strong>none</strong></em> |
1766					whole-character | whole-word <br> Defined by <em>Section
1767						8, <a href="http://www.unicode.org/reports/tr10/#Searching">Searching
1768							and Matching</a>
1769				</em> of [<a href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
1770				</td>
1771			</tr>
1772			<tr>
1773				<td><em>n/a</em></td>
1774				<td><em>n/a</em></td>
1775				<td><em>n/a</em></td>
1776				<td>match-style: <em><strong>minimal</strong></em> | medial |
1777					maximal <br> Defined by <em>Section 8, <a
1778						href="http://www.unicode.org/reports/tr10/#Searching">Searching
1779							and Matching</a></em> of [<a
1780					href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
1781				</td>
1782			</tr>
1783		</table>
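		<p>As a rough illustration of the <strong>numericOrdering</strong> (kn)
			row in the table above, the following Python sketch compares runs of
			decimal digits by numeric value. The name <code>numeric_key</code> and
			the tuple-based key are assumptions made for this example only; a real
			implementation computes primary weights in the <strong>digit</strong>
			reordering group rather than Python tuples.</p>

```python
import unicodedata


def numeric_key(text):
    """Toy sort key: runs of decimal digits (General_Category = Nd) compare by value."""
    key = []
    run = ""
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            run += ch
            continue
        if run:
            key.append((0, int(run), ""))
            run = ""
        # Assumption of this sketch: digit runs (tag 0) sort before other characters (tag 1).
        key.append((1, 0, ch))
    if run:
        key.append((0, int(run), ""))
    return key


words = ["A-123", "A-21"]
plain = sorted(words)                      # code point order
numeric = sorted(words, key=numeric_key)   # digit runs compared as numbers
```

		<p>With this toy key, &quot;A-21&quot; sorts before &quot;A-123&quot;,
			whereas plain code point order puts &quot;A-123&quot; first.</p>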
1784
1785		<h4>
1786			3.4.1 <a name="Common_Settings" href="#Common_Settings">Common
1787				settings combinations</a>
1788		</h4>
1789		<p>Some commonly used parametric collation settings are available
1790			via combinations of LDML settings attributes:</p>
1791		<ul>
1792			<li>“Ignore accents”: <strong>strength=primary</strong></li>
1793			<li>“Ignore accents” but take case into account: <strong>strength=primary
1794					caseLevel=on</strong></li>
1795			<li>“Ignore case”: <strong>strength=secondary</strong></li>
1796			<li>“Ignore punctuation” (completely): <strong>strength=tertiary
1797					alternate=shifted</strong></li>
1798			<li>“Ignore punctuation” but distinguish among punctuation
1799				marks: <strong>strength=quaternary alternate=shifted</strong>
1800			</li>
1801		</ul>
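		<p>The first three combinations can be imitated with a toy multi-level
			key in Python. This is only a sketch and not the UCA: the function
			<code>toy_key</code> is hypothetical, it treats base letters as the
			primary level, combining marks as the secondary level, and
			upper/lower case as both the case level and the tertiary level, and
			it does not model <strong>alternate=shifted</strong>.</p>

```python
import unicodedata


def toy_key(text, strength=3, case_level=False):
    """Toy multi-level key (NOT the UCA): base letters, then accents, then case."""
    nfd = unicodedata.normalize("NFD", text)
    base = [ch for ch in nfd if not unicodedata.combining(ch)]
    primary = "".join(ch.casefold() for ch in base)
    secondary = nfd.casefold()                 # base letters plus combining marks
    case = tuple(ch.isupper() for ch in base)
    key = [primary]
    if strength >= 2:
        key.append(secondary)
    if case_level:
        key.append(case)                       # the "Level 2.5" case level
    if strength >= 3:
        key.append(case)                       # toy tertiary level: case only
    return tuple(key)


# "Ignore accents": strength=primary
ignore_accents = toy_key("r\u00e9sum\u00e9", strength=1) == toy_key("Resume", strength=1)
# "Ignore accents" but take case into account: strength=primary caseLevel=on
accents_only = toy_key("R\u00e9sum\u00e9", 1, True) == toy_key("Resume", 1, True)
case_matters = toy_key("R\u00e9sum\u00e9", 1, True) != toy_key("resume", 1, True)
# "Ignore case": strength=secondary
ignore_case = toy_key("Resume", strength=2) == toy_key("resume", strength=2)
```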
1802
1803		<h4>
1804			3.4.2 <a name="Normalization_Setting" href="#Normalization_Setting">Notes
1805				on the normalization setting</a>
1806		</h4>
1807		<p>The UCA always normalizes input strings into NFD form before
1808			the rest of the algorithm. However, this results in poor performance.</p>
1809		<p>
1810			With <strong>normalization=off</strong>, strings that are in [<a
1811				href="tr35.html#FCD">FCD</a>] and do not contain Tibetan precomposed
1812			vowels (U+0F73, U+0F75, U+0F81) should sort correctly. With <strong>normalization=on</strong>,
1813			an implementation that does not normalize to NFD must at least
1814			perform an incremental FCD check and normalize substrings as
1815			necessary. It should also always decompose the Tibetan precomposed
1816			vowels. (Otherwise discontiguous contractions across their leading
1817			components cannot be handled correctly.)
1818		</p>
1819		<p>Another complication for an implementation that does not always
1820			use NFD arises when contraction mappings overlap with canonical
1821			Decomposition_Mapping strings. For example, the Danish contraction
1822			“aa” overlaps with the decompositions of ‘ä’, ‘å’, and other
1823			characters. In the root collation (and in the DUCET), Cyrillic ‘ӛ’
1824			maps to a single collation element, which means that its
1825			decomposition “ә+&#x25CC;&#x0308;” forms a contraction, and its
1826			second character (U+0308) is the same as the first character in the
1827			Decomposition_Mapping of U+0344
1828			‘&#x25CC;&#x0344;’=“&#x25CC;&#x0308;+&#x25CC;&#x0301;”.</p>
1829		<p>In order to handle strings with these characters (e.g., “aä”
1830			and “ә&#x0344;” [which are in FCD]) exactly as with prior NFD
1831			normalization, an implementation needs to either add overlap
1832			contractions to its data (e.g., “a+ä” and “ә+&#x25CC;&#x0344;”), or
1833			it needs to decompose the relevant composites (e.g., ‘ä’ and
1834			‘&#x25CC;&#x0344;’) as soon as they are encountered.</p>
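		<p>The decompositions mentioned above can be inspected with Python's
			standard <code>unicodedata</code> module. The helper name
			<code>nfd_code_points</code> is an assumption of this sketch; it only
			displays NFD forms and does not implement an FCD check.</p>

```python
import unicodedata


def nfd_code_points(text):
    """Return the NFD form of text as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


a_diaeresis = nfd_code_points("\u00E4")       # Latin a with diaeresis
schwa_diaeresis = nfd_code_points("\u04DB")   # Cyrillic schwa with diaeresis
dialytika_tonos = nfd_code_points("\u0344")   # combining Greek dialytika tonos
tibetan_ii = nfd_code_points("\u0F73")        # one of the Tibetan precomposed vowels

# The NFD form of "a" + U+00E4 begins with the Danish contraction "aa".
overlaps_aa = unicodedata.normalize("NFD", "a\u00E4").startswith("aa")
# The NFD form of schwa + U+0344 begins with the decomposition of U+04DB.
overlaps_schwa = unicodedata.normalize("NFD", "\u04D9\u0344").startswith(
    unicodedata.normalize("NFD", "\u04DB"))
```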
1835
1836		<h4>
1837			3.4.3 <a name="Variable_Top_Settings" href="#Variable_Top_Settings">Notes
1838				on variable top settings</a>
1839		</h4>
1840		<p>
1841			Users may want to include more or fewer characters as Variable. For
1842			example, someone could want to restrict the Variable characters to
1843			just include space marks. In that case, maxVariable would be set to
1844			"space". (In CLDR 24 and earlier, the now-deprecated variableTop
1845			would be set to U+1680, see the “Whitespace” <a
1846				href="http://unicode.org/charts/collation/">UCA collation chart</a>).
			Alternatively, someone could want more of the Common characters to be
			Variable, including characters up to (but not including) '0', by
1849			setting maxVariable to "currency". (In CLDR 24 and earlier, the
1850			now-deprecated variableTop would be set to U+20BA, see the
1851			“Currency-Symbol” collation chart).
1852		</p>
		<p>The effect of these settings is to customize which
			sets of characters are ignored when comparing strings. For example, the
			locale identifier "de-u-ka-shifted-kv-currency" requests
			settings appropriate for German, including German sorting
			conventions, with currency symbols and characters sorting below
			them ignored in sorting.</p>
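		<p>The correspondence between BCP47 keys and rule syntax can be sketched
			as a lookup table in Python. The mapping below is transcribed from the
			Collation Settings table (omitting <code>kr</code>, <code>vt</code>,
			and the deprecated <code>kh</code>); the function
			<code>settings_from_locale</code> is hypothetical and is a simplified
			reading of the <code>-u-</code> extension, not a full BCP47 parser.</p>

```python
# (BCP47 key, BCP47 value) -> rule syntax, transcribed from the Collation Settings table.
RULE_SYNTAX = {
    ("ks", "level1"): "[strength 1]",
    ("ks", "level2"): "[strength 2]",
    ("ks", "level3"): "[strength 3]",
    ("ks", "level4"): "[strength 4]",
    ("ks", "identic"): "[strength I]",
    ("ka", "noignore"): "[alternate non-ignorable]",
    ("ka", "shifted"): "[alternate shifted]",
    ("kb", "true"): "[backwards 2]",
    ("kk", "true"): "[normalization on]",
    ("kk", "false"): "[normalization off]",
    ("kc", "true"): "[caseLevel on]",
    ("kc", "false"): "[caseLevel off]",
    ("kf", "upper"): "[caseFirst upper]",
    ("kf", "lower"): "[caseFirst lower]",
    ("kf", "false"): "[caseFirst off]",
    ("kn", "true"): "[numericOrdering on]",
    ("kn", "false"): "[numericOrdering off]",
    ("kv", "space"): "[maxVariable space]",
    ("kv", "punct"): "[maxVariable punct]",
    ("kv", "symbol"): "[maxVariable symbol]",
    ("kv", "currency"): "[maxVariable currency]",
}


def settings_from_locale(tag):
    """Return rule-syntax settings for the collation keys in a locale tag's -u- extension."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return []
    ext = subtags[subtags.index("u") + 1:]
    settings = []
    i = 0
    while i < len(ext):
        key = ext[i]
        if len(key) == 1:          # start of another singleton extension
            break
        if len(key) != 2:          # not a key (for example an attribute): skip it
            i += 1
            continue
        if i + 1 < len(ext) and len(ext[i + 1]) > 2:
            value = ext[i + 1]
            i += 2
        else:
            value = "true"         # assumption: an omitted value means "true"
            i += 1
        rule = RULE_SYNTAX.get((key, value))
        if rule is not None:
            settings.append(rule)
    return settings


german = settings_from_locale("de-u-ka-shifted-kv-currency")
```

		<p>For the locale identifier discussed above, the sketch yields
			<code>[alternate shifted]</code> and <code>[maxVariable currency]</code>.</p>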
1859
1860		<h3>
1861			3.5 <a name="Rules" href="#Rules">Collation Rule Syntax</a>
1862		</h3>
1863		<p class="dtd">&lt;!ELEMENT cr #PCDATA &gt;</p>
1864		<p>
1865			The goal for the collation rule syntax is to have clearly expressed
1866			rules with a concise format. The CLDR rule syntax is a subset of the
1867			[<a href="tr35.html#ICUCollation">ICUCollation</a>] syntax.
1868		</p>
1869
1870		<p>
1871			For the CLDR root collation, the FractionalUCA.txt file defines all
1872			mappings for all of Unicode directly, and it also provides
1873			information about script boundaries, reordering groups, and other
1874			details. For tailorings, this is neither necessary nor practical. In
1875			particular, while the root collation sort order rarely changes for
1876			existing characters, their numeric collation weights change with
1877			every version. If tailorings also specified numeric weights directly,
1878			then they would have to change with every version, parallel with the
1879			root collation. Instead, for tailorings, mappings are added and
1880			modified relative to the root collation. (There is no syntax to <i>remove</i>
1881			mappings, except via <a href="#Special_Purpose_Commands">special
1882				[suppressContractions [...]] </a>.)
1883		</p>
1884
1885		<p>
1886			The ASCII [:P:] and [:S:] characters are reserved for collation
1887			syntax:
1888			<code>[\u0021-\u002F \u003A-\u0040 \u005B-\u0060
1889				\u007B-\u007E]</code>
1890		</p>
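		<p>As a quick cross-check, the four ranges listed above can be compared
			in Python with the ASCII characters whose General_Category starts with
			P or S. The names <code>RESERVED_RANGES</code> and
			<code>is_syntax_char</code> are assumptions made for this sketch.</p>

```python
import unicodedata

# The ranges quoted in the text above.
RESERVED_RANGES = [(0x21, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7E)]

from_ranges = {chr(cp) for lo, hi in RESERVED_RANGES for cp in range(lo, hi + 1)}
# ASCII characters whose General_Category is some kind of punctuation (P*) or symbol (S*).
from_properties = {
    chr(cp) for cp in range(0x80) if unicodedata.category(chr(cp))[0] in "PS"
}


def is_syntax_char(ch):
    """True if ch is one of the reserved ASCII syntax characters."""
    return ch in from_ranges
```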
1891
1892		<p>Unicode Pattern_White_Space characters between tokens are
1893			ignored. Unquoted white space terminates reset and relation strings.</p>
1894
1895		<p>A pair of ASCII apostrophes encloses quoted literal text. They
1896			are normally used to enclose a syntax character or white space, or a
1897			whole reset/relation string containing one or more such characters,
1898			so that those are parsed as part of the reset/relation strings rather
1899			than treated as syntax. A pair of immediately adjacent apostrophes is
1900			used to encode one apostrophe.</p>
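		<p>A minimal Python sketch of the apostrophe convention follows. The
			function <code>unquote</code> is hypothetical; it removes the quoting
			from a single reset or relation string and treats two adjacent
			apostrophes as one literal apostrophe wherever they occur, which is
			an assumption about edge cases the text does not spell out.</p>

```python
def unquote(text):
    """Strip apostrophe quoting; two adjacent apostrophes encode one apostrophe."""
    out = []
    in_quote = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "'":
            if i + 1 < len(text) and text[i + 1] == "'":
                out.append("'")          # doubled apostrophe: literal apostrophe
                i += 2
                continue
            in_quote = not in_quote      # single apostrophe: open or close a quote
            i += 1
            continue
        out.append(ch)
        i += 1
    if in_quote:
        raise ValueError("unterminated quoted text")
    return "".join(out)
```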
1901
1902		<p>
1903			Code points can be escaped with
1904			<code>\uhhhh</code>
1905			and
1906			<code>\U00hhhhhh</code>
1907			escapes, as well as common escapes like
1908			<code>\t</code>
1909			and
1910			<code>\n</code>
1911			. (For details see the documentation of ICU
1912			UnicodeString::unescape().) This is particularly useful for
1913			default-ignorable code points, combining marks, visually indistinct
1914			variants, hard-to-type characters, etc. These sequences are unescaped
1915			before the rules are parsed; this means that even escaped syntax and
1916			white space characters need to be enclosed in apostrophes. For
1917			example:
1918			<code>&amp;'\u0020'='\u3000'</code>.
1919			Note: The unescaping is done by ICU tools (genrb) and demos before passing
1920			rule strings into the ICU library code.
1921			The ICU collation API does not unescape rule strings.
1922		</p>
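		<p>Because unescaping happens before parsing, it can be sketched as a
			separate preprocessing step. The Python function <code>unescape</code>
			below is a hypothetical stand-in that handles only the escapes named
			above (<code>\uhhhh</code>, <code>\U00hhhhhh</code>, <code>\t</code>,
			<code>\n</code>); it does not reproduce the full behavior of ICU
			UnicodeString::unescape().</p>

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})|\\([tn])")
_SIMPLE = {"t": "\t", "n": "\n"}


def unescape(rules):
    """Replace \\uhhhh, \\U00hhhhhh, \\t and \\n escapes; other text is left as-is."""
    def replace(match):
        hex4, hex8, simple = match.groups()
        if hex4 is not None:
            # Note: a surrogate pair written as two \u escapes is not combined here.
            return chr(int(hex4, 16))
        if hex8 is not None:
            return chr(int(hex8, 16))
        return _SIMPLE[simple]
    return _ESCAPE.sub(replace, rules)


# The escaped space is still enclosed in apostrophes after unescaping,
# so the parser will treat it as literal text rather than as white space.
rule = unescape(r"&'\u0020'='\u3000'")
```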
1923
1924		<p>
1925			The ASCII double quote must be both escaped (so that the collation
1926			syntax can be enclosed in pairs of double quotes in programming
1927			environments such as ICU resource bundle .txt files)
1928			and quoted. For example:
1929			<code>&amp;'\u0022'&lt;&lt;&lt;x</code>
1930		</p>
1931
1932		<p>
1933			Comments are allowed at the beginning, and after any complete reset,
1934			relation, setting, or command. A comment begins with a
1935			<code>#</code>
1936			and extends to the end of the line (according to the Unicode Newline
1937			Guidelines).
1938		</p>
1939
1940		<p>The collation syntax is case-sensitive.</p>
1941
1942		<h3>
1943			3.6 <a name="Orderings" href="#Orderings">Orderings</a>
1944		</h3>
1945
1946		<p>The root collation mappings form the initial state. Mappings
1947			are added and removed via a sequence of rule chains. Each tailoring
1948			rule builds on the current state after all of the preceding rules
1949			(and is not affected by any following rules). Rule chains may
1950			alternate with comments, settings, and special commands.</p>
1951
1952		<p>A rule chain consists of a reset followed by one or more
1953			relations. The reset position is a string which maps to one or more
1954			collation elements according to the current state. A relation
1955			consists of an operator and a string; it maps the string to the
1956			current collation elements, modified according to the operator.</p>
1957
1958		<table>
1959			<caption>
1960				<a name="Specifying_Collation_Ordering"
1961					href="#Specifying_Collation_Ordering">Specifying Collation
1962					Ordering</a>
1963
1964			</caption>
1965			<tr>
1966				<th>Relation Operator</th>
1967				<th>&nbsp;Example</th>
1968				<th>Description</th>
1969			</tr>
1970			<tr>
1971				<td><code>&amp;</code></td>
1972				<td><code>&amp; Z</code></td>
1973				<td>Map Z to collation elements according to the current state.
1974					These will be modified according to the following relation
1975					operators and then assigned to the corresponding relation strings.</td>
1976			</tr>
1977			<tr>
1978				<td><code>&lt;</code></td>
1979				<td><code>
1980						&amp; a<br> &lt; b
1981					</code></td>
1982				<td>Make &#39;b&#39; sort after &#39;a&#39;, as a <i>primary</i>
1983					(base-character) difference
1984				</td>
1985			</tr>
1986			<tr>
1987				<td><code>&lt;&lt;</code></td>
1988				<td><code>
1989						&amp; a<br> &lt;&lt; ä
1990					</code></td>
1991				<td>Make &#39;ä&#39; sort after &#39;a&#39; as a <i>secondary</i>
1992					(accent) difference
1993				</td>
1994			</tr>
1995			<tr>
1996				<td><code>&lt;&lt;&lt;</code></td>
1997				<td><code>
1998						&amp; a<br> &lt;&lt;&lt; A
1999					</code></td>
2000				<td>Make &#39;A&#39; sort after &#39;a&#39; as a <i>tertiary</i>
2001					(case/variant) difference
2002				</td>
2003			</tr>
2004			<tr>
2005				<td><code>&lt;&lt;&lt;&lt;</code></td>
2006				<td><code>
2007						&amp; か<br> &lt;&lt;&lt;&lt; カ
2008					</code></td>
2009				<td>Make &#39;カ&#39; (Katakana Ka) sort after &#39;か&#39;
2010					(Hiragana Ka) as a <i>quaternary</i> difference
2011				</td>
2012			</tr>
2013			<tr>
2014				<td><code>=&nbsp; </code></td>
2015				<td><code>
2016						&amp; v<br> = w&nbsp;
2017					</code></td>
2018				<td>Make &#39;w&#39; sort <i>identically</i> to &#39;v&#39;
2019				</td>
2020			</tr>
2021		</table>
2022		<p>The following shows the result of serially applying three
2023			rules.</p>
2024		<table>
2025			<tr>
2026				<th>&nbsp;</th>
2027				<th>Rules</th>
2028				<th>Result</th>
2029				<th>Comment</th>
2030			</tr>
2031			<tr>
2032				<td>1</td>
2033				<td>&amp; a &lt; g</td>
2034				<td>... a<font color="red"> &lt;<sub>1</sub> g
2035				</font> ...
2036				</td>
2037				<td>Put g after a.</td>
2038			</tr>
2039			<tr>
2040				<td>2</td>
2041				<td>&amp; a &lt; h &lt; k</td>
2042				<td>... a<font color="red"> &lt;<sub>1</sub> h &lt;<sub>1</sub>
2043						k
2044				</font> &lt;<sub>1</sub> g ...
2045				</td>
2046				<td>Now put h and k after a (inserting before the g).</td>
2047			</tr>
2048			<tr>
2049				<td>3</td>
2050				<td>&amp; h &lt;&lt; g</td>
				<td>... a &lt;<sub>1</sub> h<font color="red"> &lt;<sub>2</sub>
						g
				</font> &lt;<sub>1</sub> k ...
2054				</td>
2055				<td>Now put g after h (inserting before k).</td>
2056			</tr>
2057		</table>
2058		<p>Notice that relation strings can occur multiple times, and thus
2059			override previous rules.</p>
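		<p>The serial application of the three rules can be imitated with an
			ordered list in Python. This is a toy model under stated assumptions:
			each entry records only a string and the level of its difference from
			the preceding entry, and the function <code>apply_chain</code> is
			hypothetical. Real implementations allocate collation weights
			instead.</p>

```python
def apply_chain(order, reset, relations):
    """Apply one rule chain to a list of (string, level-of-difference-from-previous)."""
    order = list(order)
    anchor = reset
    for level, string in relations:
        names = [s for s, _ in order]
        if string in names:
            # A repeated relation string overrides its earlier position.
            i = names.index(string)
            _, removed_level = order.pop(i)
            if i < len(order):
                follower, follower_level = order[i]
                order[i] = (follower, min(follower_level, removed_level))
        pos = [s for s, _ in order].index(anchor) + 1
        # Skip entries that differ from the anchor only at a weaker level.
        while pos < len(order) and order[pos][1] > level:
            pos += 1
        order.insert(pos, (string, level))
        anchor = string
    return order


start = [("a", 1)]
step1 = apply_chain(start, "a", [(1, "g")])             # & a < g
step2 = apply_chain(step1, "a", [(1, "h"), (1, "k")])   # & a < h < k
step3 = apply_chain(step2, "h", [(2, "g")])             # & h << g
```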
2060
2061		<p>Each relation uses and modifies the collation elements of the
2062			immediately preceding reset position or relation. A rule chain with
2063			two or more relations is equivalent to a sequence of “atomic rules”
2064			where each rule chain has exactly one relation, and each relation is
2065			followed by a reset to this same relation string.</p>
2066
2067		<p>
2068			<i>Example:</i>
2069		</p>
2070		<table>
2071			<tr>
2072				<th>Rules</th>
2073				<th>Equivalent Atomic Rules</th>
2074			</tr>
2075			<tr>
2076				<td>&amp; b &lt; q &lt;&lt;&lt; Q<br> &amp; a &lt; x
2077					&lt;&lt;&lt; X &lt;&lt; q &lt;&lt;&lt; Q &lt; z
2078				</td>
2079				<td>&amp; b &lt; q<br> &amp; q &lt;&lt;&lt; Q<br>
2080					&amp; a &lt; x<br> &amp; x &lt;&lt;&lt; X<br> &amp; X
2081					&lt;&lt; q<br> &amp; q &lt;&lt;&lt; Q<br> &amp; Q &lt; z
2082				</td>
2083			</tr>
2084		</table>
2085		<p>This is not always possible because prefix and extension
2086			strings can occur in a relation but not in a reset (see below).</p>
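		<p>For chains without prefix or extension strings, the rewriting into
			atomic rules is mechanical. The Python sketch below assumes that
			tokens are separated by white space and that no quoting is used; the
			function name <code>to_atomic</code> is hypothetical.</p>

```python
OPERATORS = {"<", "<<", "<<<", "<<<<", "="}


def to_atomic(rules):
    """Split whitespace-separated rule chains into one-relation 'atomic' rules."""
    tokens = rules.split()
    atomic = []
    current = None
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if i + 1 >= len(tokens):
            raise ValueError(f"missing string after {token!r}")
        string = tokens[i + 1]
        if token == "&":
            current = string                      # new reset position
        elif token in OPERATORS and current is not None:
            atomic.append(f"& {current} {token} {string}")
            current = string                      # the next relation resets to this string
        else:
            raise ValueError(f"unexpected token {token!r}")
        i += 2
    return atomic


example = to_atomic("& b < q <<< Q & a < x <<< X << q <<< Q < z")
```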
2087
2088		<p>
2089			The relation operator
2090			<code>=</code>
2091			maps its relation string to the current collation elements. Any other
2092			relation operator modifies the current collation elements as follows.
2093		</p>
2094		<ul>
2095			<li>Find the <i>last</i> collation element whose strength is at
2096				least as great as the strength of the operator. For example, for <code>&lt;&lt;</code>
2097				find the last primary or secondary CE. This CE will be modified; all
2098				following CEs should be removed. If there is no such CE, then reset
2099				the collation elements to a single completely-ignorable CE.
2100			</li>
2101			<li>Increment the collation element weight corresponding to the
2102				strength of the operator. For example, for <code>&lt;&lt;</code>
2103				increment the secondary weight.
2104			</li>
2105			<li>The new weight must be less than the next weight for the
2106				same combination of higher-level weights of any collation element
2107				according to the current state.</li>
2108			<li>Weights must be allocated in accordance with the <a
2109				href="http://www.unicode.org/reports/tr10/#Well-Formed">UCA
2110					well-formedness conditions</a>.
2111			</li>
2112			<li>When incrementing any weight, lower-level weights should be
2113				reset to the “common” values, to help with sort key compression.</li>
2114		</ul>
2115
2116		<p>
2117			In all cases, even for
2118			<code>=</code>
			, the case bits are recomputed according to <i>Section 3.14, <a
2120				href="#Case_Parameters">Case Parameters</a></i>. (This can be skipped if
2121			an implementation does not support the caseLevel or caseFirst
2122			settings.)
2123		</p>
2124
2125		<p>
2126			For example,
2127			<code>&amp;ae&lt;x</code>
2128			maps ‘x’ to two collation elements. The first one is the same as for
2129			‘a’, and the second one has a primary weight between those for ‘e’
			and ‘f’. As a result, ‘x’ sorts between “ae” and “af”. (If the
			primary of the first collation element were incremented instead, then
			‘x’ would sort after “az”. While this would also sort ‘x’ primary-after
			“ae”, it would be surprising and sub-optimal.)
2134		</p>
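		<p>The steps for modifying the current collation elements can be
			sketched in Python as follows. Everything numeric here is an
			assumption for illustration: CEs are modeled as (primary, secondary,
			tertiary) tuples, <code>COMMON</code> is a made-up &quot;common&quot;
			weight, and incrementing by 1 stands in for real weight allocation,
			which must also respect the next existing weight and the UCA
			well-formedness conditions. Case bits are not modeled.</p>

```python
COMMON = 5  # hypothetical "common" secondary/tertiary weight


def ce_strength(ce):
    """1 for a primary CE, 2 for secondary, 3 for tertiary, None if completely ignorable."""
    for level, weight in enumerate(ce, start=1):
        if weight:
            return level
    return None


def apply_relation(ces, op_level):
    """Modify the current CEs for an operator of the given level (1 = <, 2 = <<, 3 = <<<)."""
    ces = list(ces)
    index = None
    # Find the last CE whose strength is at least as great as that of the operator.
    for i in range(len(ces) - 1, -1, -1):
        level = ce_strength(ces[i])
        if level is not None and level <= op_level:
            index = i
            break
    if index is None:
        ces, index = [(0, 0, 0)], 0      # reset to a single completely-ignorable CE
    else:
        del ces[index + 1:]              # all following CEs are removed
    weights = list(ces[index])
    weights[op_level - 1] += 1           # increment the weight at the operator's level
    for lower in range(op_level, len(weights)):
        weights[lower] = COMMON          # lower-level weights go back to "common"
    ces[index] = tuple(weights)
    return ces


# Hypothetical root weights with gaps between neighboring primaries.
ROOT = {"a": (100, COMMON, COMMON), "e": (140, COMMON, COMMON), "f": (150, COMMON, COMMON)}

# &ae<x : 'x' keeps the CE of 'a' and gets a second CE between those of 'e' and 'f'.
x_ces = apply_relation([ROOT["a"], ROOT["e"]], 1)
```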
2135
2136		<p>Some additional operators are provided to save space with large
			tailorings. The addition of a * to the relation operator indicates
			that each of the following single characters is to be handled as if
			it were a separate relation with the corresponding strength. Each of
2140			the following single characters must be NFD-inert, that is, it does
2141			not have a canonical decomposition and it does not reorder (ccc=0).
2142			This keeps abbreviated rules unambiguous.</p>
2143		<p>
2144			A starred relation operator is followed by a sequence of characters
2145			with the same quoting/escaping rules as normal relation strings. Such
2146			a sequence can also be followed by one or more pairs of ‘-’ and
2147			another sequence of characters. The single characters adjacent to the
2148			‘-’ establish a code point order range. The same character cannot be
2149			both the end of a range and the start of another range. (For example,
			<code>&lt;*a-d-g</code>
2151			is not allowed.)
2152		</p>
2153		<table>
2154			<caption>
2155				<a name="Abbreviating_Ordering_Specifications"
2156					href="#Abbreviating_Ordering_Specifications">Abbreviating
2157					Ordering Specifications</a>
2158			</caption>
2159			<tr>
2160				<th>Relation Operator</th>
2161				<th>Example</th>
2162				<th>Equivalent</th>
2163			</tr>
2164			<tr>
2165				<td><code>&lt;*</code></td>
2166				<td><code>
2167						&amp; <span style="color: blue">a</span><br> &lt;* <span
2168							style="color: blue">bcd-gp-s</span>&nbsp;
2169					</code></td>
2170				<td><code>
2171						&amp; <span style="color: blue">a</span><br> &lt; <span
2172							style="color: blue">b </span>&lt;<span style="color: blue">
2173							c </span>&lt;<span style="color: blue"> d</span> &lt; <span
2174							style="color: blue">e</span> &lt; <span style="color: blue">f</span>
2175						&lt; <span style="color: blue">g</span> &lt; <span
2176							style="color: blue">p</span> &lt; <span style="color: blue">q</span>
2177						&lt; <span style="color: blue">r</span> &lt; <span
2178							style="color: blue">s</span>
2179					</code></td>
2180			</tr>
2181			<tr>
2182				<td><code>&lt;&lt;*</code></td>
2183				<td><code>
2184						&amp;<span style="color: blue"> a</span><br> &lt;&lt;*<span
2185							style="color: blue"> æᶏɐ</span>
2186					</code></td>
2187				<td><code>
2188						&amp;<span style="color: blue"> a</span><br> &lt;&lt;<span
2189							style="color: blue"> æ </span>&lt;&lt; <span style="color: blue">ᶏ
2190						</span>&lt;&lt; <span style="color: blue">ɐ</span>
2191					</code></td>
2192			</tr>
2193			<tr>
2194				<td><code>&lt;&lt;&lt;*</code></td>
				<td><code>
						&amp;<span style="color: blue"> p</span><br> &lt;&lt;&lt;* <span
							style="color: blue">P&#xFF50;&#xFF30;</span>
					</code></td>
				<td><code>
						&amp;<span style="color: blue"> p</span><br> &lt;&lt;&lt; <span
							style="color: blue">P</span> &lt;&lt;&lt; <span
							style="color: blue">&#xFF50;</span> &lt;&lt;&lt; <span
							style="color: blue">&#xFF30;</span>
					</code></td>
2205			</tr>
2206			<tr>
2207				<td><code>&lt;&lt;&lt;&lt;*</code></td>
2208				<td><code>
2209						&amp;<span style="color: blue"> k</span><br>
2210						&lt;&lt;&lt;&lt;* <span style="color: blue">qQ</span>
2211					</code></td>
2212				<td><code>
2213						&amp;<span style="color: blue"> k</span><br> &lt;&lt;&lt;&lt;
2214						<span style="color: blue">q</span> &lt;&lt;&lt;&lt; <span
2215							style="color: blue">Q</span>
2216					</code></td>
2217			</tr>
2218			<tr>
2219				<td><code>=*</code></td>
2220				<td><code>
2221						&amp;<span style="color: blue"> v</span><br> =* <span
2222							style="color: blue">VwW</span>
2223					</code></td>
2224				<td><code>
2225						&amp;<span style="color: blue"> v</span><br> = <span
2226							style="color: blue">V </span>= <span style="color: blue">w
2227						</span>= <span style="color: blue">W</span>
2228					</code></td>
2229			</tr>
2230		</table>
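		<p>The expansion of a starred relation can be sketched in Python. The
			function <code>expand_starred</code> is hypothetical; it takes the
			already unquoted character sequence, expands each ‘-’ into a code
			point order range, and rejects a character that is both the end of
			one range and the start of another. Requiring ranges to ascend is an
			assumption of this sketch, and the NFD-inert requirement is not
			checked.</p>

```python
def expand_starred(sequence):
    """Expand the characters after a starred operator, e.g. 'bcd-gp-s'."""
    chars = []
    previous_ended_range = False
    i = 0
    while i < len(sequence):
        ch = sequence[i]
        if ch != "-":
            chars.append(ch)
            previous_ended_range = False
            i += 1
            continue
        if not chars or i + 1 >= len(sequence):
            raise ValueError("a range needs a character on both sides of '-'")
        if previous_ended_range:
            raise ValueError("a character cannot end one range and start another")
        start, end = chars[-1], sequence[i + 1]
        if ord(end) <= ord(start):
            raise ValueError("range must be in ascending code point order")
        chars.extend(chr(cp) for cp in range(ord(start) + 1, ord(end) + 1))
        previous_ended_range = True
        i += 2
    return chars


def to_relations(operator, sequence):
    """Render the equivalent unabbreviated relations as one string."""
    return " ".join(f"{operator} {ch}" for ch in expand_starred(sequence))


expanded = to_relations("<", "bcd-gp-s")
```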
2231		<h3>
2232			3.7 <a name="Contractions" href="#Contractions">Contractions</a>
2233		</h3>
2234
2235		<p>A multi-character relation string defines a contraction.</p>
2236
2237		<table>
2238			<caption>
2239				<a name="Specifying_Contractions" href="#Specifying_Contractions">Specifying
2240					Contractions</a>
2241			</caption>
2242			<tr>
2243				<th>Example</th>
2244				<th>Description</th>
2245			</tr>
2246			<tr>
2247				<td><code>
2248						&amp; k<br> &lt; ch
2249					</code></td>
2250				<td>Make the sequence &#39;ch&#39; sort after &#39;k&#39;, as a
2251					primary (base-character) difference</td>
2252			</tr>
2253		</table>
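		<p>A contraction can be pictured as a mapping whose key is longer than
			one character, looked up with longest-match-first. The Python sketch
			below uses made-up primary weights for a to z and a hypothetical
			<code>primary_key</code> function; it ignores secondary and tertiary
			levels entirely.</p>

```python
import string

# Hypothetical primary weights: a=10, b=20, ..., z=260 (gaps leave room for tailoring).
WEIGHTS = {ch: 10 * (i + 1) for i, ch in enumerate(string.ascii_lowercase)}
# Tailoring "& k < ch": the contraction "ch" gets a primary weight just after 'k'.
WEIGHTS["ch"] = WEIGHTS["k"] + 1


def primary_key(text):
    """Map text to primary weights, preferring the longest matching mapping."""
    key = []
    i = 0
    while i < len(text):
        for length in (2, 1):
            chunk = text[i:i + length]
            if chunk in WEIGHTS:
                key.append(WEIGHTS[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(f"no mapping for {text[i]!r}")
    return key


ordered = sorted(["l", "ch", "k", "cz", "c"], key=primary_key)
```

		<p>In this sketch &quot;ch&quot; sorts between &quot;k&quot; and
			&quot;l&quot;, while &quot;cz&quot; still sorts under &quot;c&quot;.</p>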
2254
2255		<h3>
2256			3.8 <a name="Expansions" href="#Expansions">Expansions</a>
2257		</h3>
2258		<p>
2259			A mapping to multiple collation elements defines an expansion. This
2260			is normally the result of a reset position (and/or preceding
2261			relation) that yields multiple collation elements, for example
2262			<code>&amp;ae&lt;x</code>
2263			or
2264			<code>&amp;æ&lt;y</code>
2265			.
2266		</p>
2267
2268		<p>
2269			A relation string can also be followed by
2270			<code>/</code>
2271			and an <i>extension string</i>. The extension string is mapped to
2272			collation elements according to the current state, and the relation
2273			string is mapped to the concatenation of the regular CEs and the
2274			extension CEs. The extension CEs are not modified, not even their
2275			case bits. The extension CEs are <i>not</i> retained for following
2276			relations.
2277		</p>
2278
2279		<p>
2280			For example,
2281			<code>&amp;a&lt;z/e</code>
2282			maps ‘z’ to an expansion similar to
2283			<code>&amp;ae&lt;x</code>
2284			. However, the first CE of ‘z’ is primary-after that of ‘a’, and the
2285			second CE is exactly that of ‘e’, which yields the order ae &lt; x
2286			&lt; af &lt; ag &lt; ... &lt; az &lt; z &lt; b.
2287		</p>
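		<p>The difference between the two forms can be checked with made-up
			primary weights in Python. The weights and the
			<code>primary_key</code> helper are assumptions of this sketch, and
			only a few of the strings from the order above are compared.</p>

```python
import string

# Hypothetical primary weights: a=10, b=20, ..., z=260; each letter maps to one CE.
WEIGHTS = {ch: (10 * (i + 1),) for i, ch in enumerate(string.ascii_lowercase)}
a, e = WEIGHTS["a"][0], WEIGHTS["e"][0]

# "&ae<x": keep the CE of 'a', increment the primary of the last CE (that of 'e').
WEIGHTS["x"] = (a, e + 1)
# "&a<z/e": increment the primary of 'a', then append the CE of 'e' unchanged.
WEIGHTS["z"] = (a + 1, e)


def primary_key(text):
    """Concatenate the primary weights of each character's collation elements."""
    key = []
    for ch in text:
        key.extend(WEIGHTS[ch])
    return key


ordered = sorted(["b", "z", "af", "x", "ae"], key=primary_key)
```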
2288
2289		<p>
2290			The choice of reset-to-expansion vs. use of an extension string can
2291			be exploited to affect contextual mappings. For example,
2292			<code>&amp;L·=x</code>
2293			yields a second CE for ‘x’ equal to the context-sensitive
2294			middle-dot-after-L (which is a secondary CE in the root collation).
2295			On the other hand,
2296			<code>&amp;L=x/·</code>
2297			yields a second CE of the middle dot by itself (which is a primary
2298			CE).
2299		</p>
2300
2301		<p>
2302			The two ways of specifying expansions also differ in how case bits
2303			are computed. When some of the CEs are copied verbatim from an
2304			extension string, then the relation string’s case bits are
2305			distributed over a smaller number of normal CEs. For example,
2306			<code>&amp;aE=Ch</code>
2307			yields an uppercase CE and a lowercase CE, but
2308			<code>&amp;a=Ch/E</code>
2309			yields a mixed-case CE (for ‘C’ and ‘h’ together) followed by an
2310			uppercase CE (copied from ‘E’).
2311		</p>
2312
2313		<p>In summary, there are two ways of specifying expansions which
2314			produce subtly different mappings. The use of extension strings is
2315			unusual but sometimes necessary.</p>
2316
2317
2318		<h3>
2319			3.9 <a name="Context_Before" href="#Context_Before">Context
2320				Before</a>
2321		</h3>
2322		<p>
2323			A relation string can have a prefix (context before) which makes the
2324			mapping from the relation string to its tailored position conditional
2325			on the string occurring after that prefix. For details see the
2326			specification of <i><a href="#Context_Sensitive_Mappings">Context-Sensitive
2327					Mappings</a></i>.
2328		</p>
		<p>For example, suppose that &quot;-&quot; is sorted like the
			previous vowel. One could write contraction rules for &quot;a-&quot;,
			&quot;e-&quot;, and so on. However, very common characters (a, e,
			...) would then start contractions, and a system would slow down
			every time it encounters one of them and has to look for a possible
			contraction. An alternative is to indicate that when &quot;-&quot;
			is encountered after an &#39;a&#39;, it sorts like an &#39;a&#39;,
			and so on.</p>
2336		<table>
2337			<caption>
2338				<a name="Specifying_Previous_Context"
2339					href="#Specifying_Previous_Context">Specifying Previous Context</a>
2340			</caption>
2341			<tr>
2342				<th>Rules</th>
2343			</tr>
2344			<tr>
2345				<td><code>
2346						&amp; a &lt;&lt;&lt; a | '-'<br> &amp; e &lt;&lt;&lt; e | '-'<br>
2347						...
2348					</code></td>
2349			</tr>
2350		</table>
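		<p>A minimal sketch of the prefix approach, assuming made-up primary weights and a hypothetical <code>PREFIX_MAP</code> table (neither reflects actual CLDR or ICU data structures): only the hyphen carries context-sensitive data, so common letters such as ‘a’ and ‘e’ need no contraction lookup. The rules use a tertiary relation, so the hyphen ends up primary-equal to the preceding vowel.</p>

```python
# Toy primary weights (hypothetical values chosen only for illustration).
PRIMARY = {"-": 5, "a": 10, "b": 11, "e": 20}

# Context-before data in the spirit of "& a <<< a | '-'": keyed by the
# character itself, then by the character that precedes it.
PREFIX_MAP = {"-": {"a": "a", "e": "e"}}

def primary_key(text):
    """Return one primary weight per character, honoring prefix mappings."""
    key = []
    prev = None
    for ch in text:
        target = ch
        by_prefix = PREFIX_MAP.get(ch)   # only '-' has context-sensitive data
        if by_prefix is not None:
            target = by_prefix.get(prev, ch)
        key.append(PRIMARY[target])
        prev = ch
    return key
```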
2351		<p>Both the prefix and extension strings can occur in a relation.
2352			For example, the following are allowed:</p>
2353		<ul>
2354			<li><code>&lt; abc | def / ghi</code></li>
2355			<li><code>&lt; def / ghi</code></li>
2356			<li><code>&lt; abc | def</code></li>
2357		</ul>
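		<p>The three parts of such a relation can be separated as in the following sketch. It is a simplification: the helper name <code>split_relation</code> is hypothetical, the input is the text after the relation operator, and quoting and escaping of <code>|</code> and <code>/</code> are not handled.</p>

```python
def split_relation(text):
    """Split 'prefix | string / extension' into (prefix, string, extension).

    Parts that are absent are returned as empty strings.
    """
    prefix, bar, rest = text.partition("|")
    if not bar:                      # no '|' present, so there is no prefix
        prefix, rest = "", text
    string, _, extension = rest.partition("/")
    return prefix.strip(), string.strip(), extension.strip()
```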
2358		<h3>
2359			3.10 <a name="Placing_Characters_Before_Others"
2360				href="#Placing_Characters_Before_Others">Placing Characters
2361				Before Others</a>
2362		</h3>
2363		<p>There are certain circumstances where characters need to be
2364			placed before a given character, rather than after. This is the case
2365			with Pinyin, for example, where certain accented letters are
2366			positioned before the base letter. That is accomplished with the
2367			following syntax.</p>
2368		<pre>&amp;[before 2] a &lt;&lt; à</pre>
2369		<p>The before-strength can be 1 (primary), 2 (secondary), or 3
2370			(tertiary).</p>
2371		<p>It is an error if the strength of the reset-before differs from
2372			the strength of the immediately following relation. Thus the
2373			following are errors.</p>
2374		<ul>
2375			<li><code>&amp;[before 2] a &lt; à # error</code></li>
2376			<li><code>&amp;[before 2] a &lt;&lt;&lt; à # error</code></li>
2377		</ul>
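		<p>The strength check can be sketched as follows. The regular expression and the function name <code>before_strength_ok</code> are assumptions made for illustration; the pattern only recognizes simple rules of the shape shown above and is not a full rule parser.</p>

```python
import re

# Relation operators and the strength each one expresses.
STRENGTH = {"<": 1, "<<": 2, "<<<": 3}

# '&[before N]', a reset string, then the first relation operator.
RULE = re.compile(r"&\s*\[before\s+([123])\]\s*[^<#]+?\s*(<{1,3})(?!<)")

def before_strength_ok(rule):
    """Return True if the first relation after '&[before N]' has strength N."""
    match = RULE.match(rule.strip())
    if match is None:
        raise ValueError("not a reset-before rule: " + rule)
    return STRENGTH[match.group(2)] == int(match.group(1))
```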
2378
2379		<h3>
2380			3.11 <a name="Logical_Reset_Positions"
2381				href="#Logical_Reset_Positions">Logical Reset Positions</a>
2382		</h3>
2383
2384		<p>The CLDR table (based on UCA) has the following overall
2385			structure for weights, going from low to high.</p>
2386		<table>
2387			<caption>
2388				<a name="Specifying_Logical_Positions"
2389					href="#Specifying_Logical_Positions">Specifying Logical
2390					Positions</a>
2391			</caption>
2392			<tr>
2393				<th>Name</th>
2394				<th>Description</th>
2395				<th>UCA Examples</th>
2396			</tr>
2397			<tr>
2398				<td>first tertiary ignorable<br> ...<br> last
2399					tertiary ignorable
2400				</td>
2401				<td>p, s, t = ignore</td>
2402				<td>Control Codes<br> Format Characters<br> Hebrew
2403					Points<br> Tibetan Signs<br> ...
2404				</td>
2405			</tr>
2406			<tr>
2407				<td>first secondary ignorable<br> ...<br> last
2408					secondary ignorable
2409				</td>
2410				<td>p, s = ignore</td>
2411				<td>None in UCA</td>
2412			</tr>
2413			<tr>
2414				<td>first primary ignorable<br> ...<br> last primary
2415					ignorable
2416				</td>
2417				<td>p = ignore</td>
2418				<td>Most combining marks</td>
2419			</tr>
2420			<tr>
2421				<td>first variable<br> ...<br> last variable
2422				</td>
2423				<td><i><b>if</b> alternate = non-ignorable<br> </i>p !=
2424					ignore,<br> <i><b>if</b> alternate = shifted</i><br> p,
2425					s, t = ignore</td>
2426				<td>Whitespace,<br> Punctuation
2427				</td>
2428			</tr>
2429			<tr>
2430				<td>first regular<br> ...<br> last regular
2431				</td>
2432				<td>p != ignore</td>
2433				<td>General Symbols<br> Currency Symbols<br> Numbers<br>
2434					Latin<br> Greek<br> ...
2435				</td>
2436			</tr>
2437			<tr>
2438				<td>first implicit<br>...<br>last implicit
2439				</td>
2440				<td>p != ignore, assigned automatically</td>
2441				<td>CJK, CJK compatibility (those that are not decomposed)<br>
2442					CJK Extension A, B, C, ...<br> Unassigned
2443				</td>
2444			</tr>
2445			<tr>
2446				<td>first trailing<br> ...<br> last trailing
2447				</td>
2448				<td>p != ignore,<br> used for trailing syllable components
2449				</td>
2450				<td>Jamo Trailing<br> Jamo Leading<br>U+FFFD<br>U+FFFF
2451				</td>
2452			</tr>
2453		</table>
2454		<p>
2455			Each of the above Names can be used with a reset to position
2456			characters relative to that logical position. That allows characters
2457			to be ordered before or after a <i>logical</i> position rather than a
2458			specific character.
2459		</p>
2460		<blockquote>
2461			<p class="note">
				<b>Note: </b>Logical positions make tailorings more stable. A
				future version of the UCA might add characters at any point in the
				above list. Suppose that you set character X to be after Y. It
				could be that you want X to come after Y, no matter what future
				characters are added; or it could be that you just want X to come
				after a given logical position, for example, after the last
				primary ignorable.
2469			</p>
2470		</blockquote>
2471
2472		<p>Each of these special reset positions always maps to a single
2473			collation element.</p>
2474
2475		<p>Here is an example of the syntax:</p>
2476		<pre>&amp; [first tertiary ignorable] &lt;&lt; à </pre>
2477		<p>For example, to make a character be a secondary ignorable, one
2478			can make it be immediately after (at a secondary level) a specific
2479			character (like a combining diaeresis), or one can make it be
2480			immediately after the last secondary ignorable.</p>
2481
2482		<p>
2483			Each special reset position adjusts to the effects of preceding
2484			rules, just like normal reset position strings. For example, if a
2485			tailoring rule creates a new collation element after
2486			<code>&amp;[last variable]</code>
2487			(via explicit tailoring after that, or via tailoring after the
2488			relevant character), then this new CE becomes the new <i>last
2489				variable</i> CE, and is used in following resets to
2490			<code>[last variable]</code>
2491			.
2492		</p>
2493
		<p><code>[first variable]</code>, <code>[first regular]</code>, and
			<code>[first trailing]</code> should each be the first real CE of
			that kind (e.g., CE(U+0060 &#x0060;)), as adjusted according to the
			tailoring, not the boundary CEs (see the FractionalUCA.txt “first
			primary” mappings starting with U+FDD1).</p>
2498
2499		<p>
2500			<code>[last regular]</code>
2501			is not actually the last normal CE with a primary weight before
2502			implicit primaries. It is used to tailor large numbers of characters,
2503			usually CJK, into the script=Hani range between the last regular
2504			script and the first implicit CE. (The first group of implicit CEs is
2505			for Han characters.) Therefore,
2506			<code>[last regular]</code>
2507			is set to the first Hani CE, the artificial script boundary CE at the
2508			beginning of this range. For example:
2509			<code>&amp;[last regular]&lt;*亜唖娃阿...</code>
2510		</p>
2511
		<p><code>[last trailing]</code> is the CE of U+FFFF. Tailoring
			relative to it is not allowed.</p>
2514
2515		<p>
2516			The
2517			<code>[last variable]</code>
2518			indicates the &quot;highest&quot; character that is treated as
2519			punctuation with alternate handling.
2520		</p>
2521		<p>
2522			The value can be changed by using the maxVariable setting. This takes
2523			effect, however, after the rules have been built, and does not affect
2524			any characters that are reset relative to the
2525			<code>[last variable]</code>
2526			value when the rules are being built. The maxVariable setting might
2527			also be changed via a runtime parameter. That also does not affect
2528			the rules.<br> (In CLDR 24 and earlier, the variable top could
2529			also be set by using a tailoring rule with
2530			<code>[variable top]</code>
2531			in the place of a relation string.)
2532		</p>
2533
2534		<h3>
2535			3.12 <a name="Special_Purpose_Commands"
2536				href="#Special_Purpose_Commands">Special-Purpose Commands</a>
2537		</h3>
2538		<p>The import command imports rules from another collation. This
2539			allows for better maintenance and smaller rule sizes. The source is a
2540			BCP 47 language tag with an optional collation type but without other
2541			extensions. The collation type is the BCP 47 form of the collation
2542			type in the source; it defaults to "standard".</p>
2543		<p>
2544			<em>Examples: </em>
2545		</p>
2546		<ul>
2547			<li><code>[import de-u-co-phonebk]</code> &nbsp; (not
2548				"...-co-phonebook")</li>
2549			<li><code>[import und-u-co-search]</code> &nbsp; (not
2550				"root-...")</li>
2551			<li><code>[import ja-u-co-private-kana]</code> &nbsp; (language
2552				"ja" required even when this import itself is in another "ja"
2553				tailoring.)</li>
2554		</ul>
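		<p>As a rough sketch, an import source can be split into its language tag and collation type as shown here. The helper name <code>parse_import_source</code> is hypothetical, and the code recognizes only the <code>-u-co-</code> keyword, relying on the statement above that no other extensions are present.</p>

```python
def parse_import_source(tag):
    """Split an import source into (language tag, collation type).

    The collation type defaults to "standard" when no '-u-co-' keyword is given.
    """
    language, separator, collation_type = tag.partition("-u-co-")
    return language, (collation_type if separator else "standard")
```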
2555
2556		<table>
2557			<caption>
2558				<a name="Special_Purpose_Elements" href="#Special_Purpose_Elements">Special-Purpose
2559					Elements</a>
2560			</caption>
2561			<tr>
2562				<th>Rule Syntax</th>
2563			</tr>
2564			<tr>
2565				<td>[suppressContractions [Љ-ґ]]</td>
2566			</tr>
2567			<tr>
2568				<td>[optimize [Ά-ώ]]</td>
2569			</tr>
2570		</table>
2571		<p>
2572			The <i>suppress contractions</i> tailoring command turns off any
2573			existing contractions that begin with those characters, as well as
2574			any prefixes for those characters. It is typically used to turn off
2575			the Cyrillic contractions in the UCA, since they are not used in many
2576			languages and have a considerable performance penalty. The argument
2577			is a <a href="tr35.html#Unicode_Sets">Unicode Set</a>.
2578		</p>
2579
2580		<p>
2581			The <i>suppress contractions</i> command has immediate effect on the
2582			current set of mappings, including mappings added by preceding rules.
2583			Following rules are processed after removing any context-sensitive
2584			mappings originating from any of the characters in the set.
2585		</p>
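		<p>The effect on the current mappings can be sketched like this. The mapping table, its <code>(prefix, string)</code> keys, and the function name are assumptions for illustration, and the sketch reads “context-sensitive mappings originating from a character” as contractions and prefix mappings whose string starts with that character.</p>

```python
# Hypothetical mapping table: (prefix, string) -> opaque CE data.
MAPPINGS = {
    ("", "и"): "CE(и)",
    ("", "и\u0306"): "CE(й)",   # contraction that starts with a Cyrillic letter
    ("", "ch"): "CE(ch)",       # contraction outside the suppressed set
}

# Characters of the set [Љ-ґ], expanded by code point range.
CYRILLIC = {chr(cp) for cp in range(ord("Љ"), ord("ґ") + 1)}

def suppress_contractions(mappings, chars):
    """Return a copy of `mappings` without context-sensitive entries for `chars`."""
    kept = {}
    for (prefix, string), ces in mappings.items():
        context_sensitive = len(string) > 1 or prefix != ""
        if context_sensitive and string[0] in chars:
            continue                # contraction or prefix mapping is turned off
        kept[(prefix, string)] = ces
    return kept

result = suppress_contractions(MAPPINGS, CYRILLIC)
```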
2586
2587		<p>
2588			The <i>optimize</i> tailoring command is purely for performance. It
2589			indicates that those characters are sufficiently common in the target
2590			language for the tailoring that their performance should be enhanced.
2591		</p>
		<p>These are commands rather than settings so that their
			arguments can contain arbitrary characters.</p>
2594
2595		<hr width="50%">
2596		<p>
2597			<i>Example:</i>
2598		</p>
2599		<p>
2600			The following is a simple example that combines portions of different
2601			tailorings for illustration. For more complete examples, see the
2602			actual locale data: <a
2603				href="http://unicode.org/repos/cldr/tags/latest/common/collation/ja.xml">Japanese</a>,
2604			<a
2605				href="http://unicode.org/repos/cldr/tags/latest/common/collation/zh.xml">Chinese</a>,
2606			<a
2607				href="http://unicode.org/repos/cldr/tags/latest/common/collation/sv.xml">Swedish</a>,
2608			and <a
2609				href="http://unicode.org/repos/cldr/tags/latest/common/collation/de.xml">German</a>
2610			(type=&quot;phonebook&quot;) are particularly illustrative.
2611		</p>
2612		<pre>&lt;collation&gt;
2613  &lt;cr&gt;&lt;![CDATA[
2614    [caseLevel on]
2615    &amp;Z
2616    &lt; æ &lt;&lt;&lt; Æ
2617    &lt; å &lt;&lt;&lt; Å &lt;&lt;&lt; aa &lt;&lt;&lt; aA &lt;&lt;&lt; Aa &lt;&lt;&lt; AA
2618    &lt; ä &lt;&lt;&lt; Ä
2619    &lt; ö &lt;&lt;&lt; Ö &lt;&lt; ű &lt;&lt;&lt; Ű
2620    &lt; ő &lt;&lt;&lt; Ő &lt;&lt; ø &lt;&lt;&lt; Ø
2621    &amp;V &lt;&lt;&lt;* wW
2622    &amp;Y &lt;&lt;&lt;* üÜ
    &amp;[last regular]
2624    <span style="color: green"># The following is equivalent to &lt;亜&lt;唖&lt;娃...</span>
2625    &lt;* 亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦
2626    &lt;* 鯵梓圧斡扱
2627  ]]&gt;&lt;/cr&gt;
2628&lt;/collation&gt;</pre>
2629
2630		<h3>
2631			3.13 <a name="Script_Reordering" href="#Script_Reordering">Collation
2632				Reordering</a>
2633		</h3>
2634		<p>Collation reordering allows scripts and certain other defined
2635			blocks of characters to be moved relative to each other
2636			parametrically, without changing the detailed rules for all the
2637			characters involved. This reordering is done on top of any specific
2638			ordering rules within the script or block currently in effect.
2639			Reordering can specify groups to be placed at the start and/or the
2640			end of the collation order. For example, to reorder Greek characters
2641			before Latin characters, and digits afterwards (but before other
2642			scripts), the following can be used:</p>
2643		<table>
2644			<tr>
2645				<th>Rule Syntax</th>
2646				<th>Locale Identifier</th>
2647			</tr>
2648			<tr>
2649				<td><code>[reorder Grek Latn digit]</code></td>
2650				<td><code>en-u-kr-grek-latn-digit</code></td>
2651			</tr>
2652		</table>
2653		<p>
2654			In each case, a sequence of <em><strong>reorder_codes</strong></em>
2655			is used, separated by spaces in the settings attribute and in rule
2656			syntax, and by hyphens in locale identifiers.
2657		</p>
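		<p>The two notations carry the same list of codes, as this small sketch shows. The function names are hypothetical, and lowercasing the codes in the locale identifier simply follows the example above.</p>

```python
def reorder_rule(codes):
    """Rule syntax: codes separated by spaces inside [reorder ...]."""
    return "[reorder " + " ".join(codes) + "]"

def reorder_locale_extension(codes):
    """Locale identifier form: the 'kr' key followed by hyphen-separated codes."""
    return "u-kr-" + "-".join(code.lower() for code in codes)

codes = ["Grek", "Latn", "digit"]
print(reorder_rule(codes))
print("en-" + reorder_locale_extension(codes))
```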
2658		<p>
2659			A <strong><em>reorder_code</em></strong> is any of the following
2660			special codes:
2661		</p>
2662		<ol>
			<li><strong>space, punct, symbol, currency, digit</strong> -
				core groups of characters that sort below 'a'</li>
2665			<li><strong>any script code</strong> except <strong>Common</strong>
2666				and <strong>Inherited</strong>.
2667				<ul>
2668					<li>Some pairs of scripts sort primary-equal and always
						reorder together. For example, Katakana characters are always
2670						reordered with Hiragana.</li>
2671				</ul></li>
2672			<li><strong>others</strong> - where all codes not explicitly
2673				mentioned should be ordered. The script code <strong>Zzzz</strong>
2674				(Unknown Script) is a synonym for <strong>others</strong>.</li>
2675		</ol>
2676		<p>It is an error if a code occurs multiple times.</p>
2677
2678		<p>
2679			It is an error if the sequence of reorder codes is empty in the XML
2680			attribute or in the locale identifier. Some implementations may
2681			interpret an empty sequence in the
2682			<code>[reorder]</code>
2683			rule syntax as a reset to the DUCET ordering, synonymous with
2684			<code>[reorder others]</code>
2685			; other implementations may forbid an empty sequence in the rule
2686			syntax as well.
2687		</p>
2688
2689		<p>
2690			Interaction with <strong>alternate=shifted</strong>: Whether a
2691			primary weight is “variable” is determined according to the “variable
2692			top”, before applying script reordering. Once that is determined,
2693			script reordering is applied to the primary weight regardless of
2694			whether it is “regular” (used in the primary level) or “shifted”
2695			(used in the quaternary level).
2696		</p>
2697
2698		<h4>
2699			3.13.1 <a name="Interpretation_reordering"
2700				href="#Interpretation_reordering">Interpretation of a reordering
2701				list</a>
2702		</h4>
2703		<p>The reordering list is interpreted as if it were processed in
2704			the following way.</p>
2705		<ol>
2706			<li>If any core code is not present, then it is inserted at the
2707				front of the list in the order given above.</li>
2708			<li>If the <strong>others</strong> code is not present, then it
2709				is inserted at the end of the list.
2710			</li>
2711			<li>The <strong>others</strong> code is replaced by the list of
2712				all script codes not explicitly mentioned, in DUCET order.
2713			</li>
2714			<li>The reordering list is now complete, and used to reorder
2715				characters in collation accordingly.</li>
2716		</ol>
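		<p>The steps above can be sketched as follows. The short <code>SCRIPTS_IN_DUCET_ORDER</code> list is a hypothetical stand-in for the full list of script codes, which a real implementation would derive from the collation data. Treating <strong>Zzzz</strong> together with <strong>others</strong> as a duplicate is an assumption of this sketch.</p>

```python
CORE = ["space", "punct", "symbol", "currency", "digit"]

# Hypothetical subset standing in for "all script codes, in DUCET order".
SCRIPTS_IN_DUCET_ORDER = ["Latn", "Grek", "Cyrl", "Arab", "Hani"]

def interpret_reordering(codes):
    """Expand a reordering list into the complete order of groups."""
    codes = ["others" if code == "Zzzz" else code for code in codes]
    if len(set(codes)) != len(codes):
        raise ValueError("a reorder code occurs multiple times")
    # Step 1: missing core codes are inserted at the front, in their default order.
    codes = [code for code in CORE if code not in codes] + codes
    # Step 2: 'others' is appended if it is not present.
    if "others" not in codes:
        codes.append("others")
    # Step 3: 'others' is replaced by all scripts not explicitly mentioned.
    rest = [script for script in SCRIPTS_IN_DUCET_ORDER if script not in codes]
    index = codes.index("others")
    return codes[:index] + rest + codes[index + 1:]
```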
2717		<p>
2718			The locale data may have a particular ordering. For example, the
2719			Czech locale data could put digits after all letters, with
2720			<code>[reorder others digit]</code>
			. Any reordering codes specified on top of that (such as with a
			BCP 47 locale identifier) completely replace what was there. To
			specify a version of collation that completely resets any existing
			reordering to the DUCET ordering, the single code <strong>Zzzz</strong>
			or <strong>others</strong> can be used, as in the last example
			below.
2726		</p>
2727		<p>
2728			<em>Examples: </em>
2729		</p>
2730		<table cellpadding="0" cellspacing="0">
2731			<tbody>
2732				<tr>
2733					<th>Locale Identifier</th>
2734					<th>Effect</th>
2735				</tr>
2736				<tr>
2737					<td><code>en-u-kr-latn-digit</code></td>
2738					<td>Reorder digits after Latin characters (but before other
2739						scripts like Cyrillic).</td>
2740				</tr>
2741				<tr>
2742					<td><code>en-u-kr-others-digit</code></td>
2743					<td>Reorder digits after all other characters.</td>
2744				</tr>
2745				<tr>
2746					<td><code>en-u-kr-arab-cyrl-others-symbol</code></td>
2747					<td>Reorder Arabic characters first, then Cyrillic, and put
2748						symbols at the end—after all other characters.</td>
2749				</tr>
2750				<tr>
2751					<td><code>en-u-kr-others</code></td>
2752					<td>Remove any locale-specific reordering, and use DUCET order
2753						for reordering blocks.</td>
2754				</tr>
2755			</tbody>
2756		</table>
2757		<p>
2758			The default reordering groups are defined by the FractionalUCA.txt
2759			file, based on the primary weights of associated collation elements.
2760			The file contains special mappings for the start of each group,
			script, and reorder-reserved range; see <i>Section 2.6.2, <a
2762				href="#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></i>.
2763		</p>
2764
2765		<p>There are some special cases:</p>
2766		<ul>
2767			<li>The <strong>Hani</strong> group includes implicit weights
2768				for <em>Han characters</em> according to the UCA as well as any
2769				characters tailored relative to a Han character, or after <code>&amp;[first
2770					Hani]</code>.
2771			</li>
2772			<li>Implicit weights for <em>unassigned code points</em>
2773				according to the UCA reorder as the last weights in the <strong>others</strong>
2774				(<strong>Zzzz</strong>) group.<br> There is no script code to
2775				explicitly reorder the unassigned-implicit weights into a particular
2776				position. (Unassigned-implicit weights are used for non-Hani code
2777				points without any mappings. For a given Unicode version they are
2778				the code points with General_Category values Cn, Co, Cs.)
2779			</li>
2780			<li>The TRAILING group, the FIELD-SEPARATOR (associated with
2781				U+FFFE), and collation elements with only zero primary weights are
2782				not reordered.</li>
2783			<li>The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are
2784				never associated with characters.</li>
2785		</ul>
2786		<p>
2787			For example,
2788			<code>reorder="Hani Zzzz Grek"</code>
			sorts the core groups (space, punct, symbol, currency, digit), then
			Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned,
			Greek, TRAILING.
2791		</p>
2792
2793		<p>Notes for implementations that write sort keys:</p>
2794		<ul>
2795			<li>Primaries must always be offset by one or more whole primary
2796				lead bytes. (Otherwise the number of bytes in a fractional weight
2797				may change, compressible scripts may span multiple lead bytes, or
2798				trailing primary bytes may collide with separators and
2799				primary-compression terminators.)</li>
2800			<li>When a script is reordered that does not start and end on
2801				whole-primary-lead-byte boundaries, then the lead byte needs to be
2802				“split”, and a reserved byte is used up. The data supports this via
2803				reorder-reserved ranges of primary weights that are not used for
2804				collation elements.</li>
2805			<li>Primary weights from different original lead bytes can be
2806				reordered to a shared lead byte, as long as they do not overlap.
2807				Primary compression ends when the target lead byte differs or when
2808				the original lead byte of the next primary is not compressible.</li>
2809			<li>Non-compressible groups and scripts begin or end on
2810				whole-primary-lead-byte boundaries (or both), so that reordering
2811				cannot surround a non-compressible script by two compressible ones
2812				within the same target lead byte. This is so that primary
2813				compression can be terminated reliably (choosing the low or high
2814				terminator byte) simply by comparing the previous and current
2815				primary weights. Otherwise it would have to also check for another
2816				condition (e.g., equal scripts).</li>
2817		</ul>
2818
2819		<h4>
2820			3.13.2 <a name="Reordering_Groups_allkeys"
2821				href="#Reordering_Groups_allkeys">Reordering Groups for
2822				allkeys.txt</a>
2823		</h4>
2824		<p>
2825			For allkeys_CLDR.txt, the start of each reordering group can be
2826			determined from FractionalUCA.txt, by finding the first real mapping
2827			(after “xyz first primary”) of that group (e.g.,
2828			<code>0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE
2829				ACCENT</code>
2830			), and looking for that mapping's character sequence (
2831			<code>0060</code>
2832			) in allkeys_CLDR.txt. The comment in FractionalUCA.txt (
2833			<code>[0312.0020.0002]</code>
2834			) also shows the allkeys_CLDR.txt collation elements.
2835		</p>
2836
2837		<p>The DUCET ordering of some characters is slightly different
2838			from the CLDR root collation order. The reordering groups for the
2839			DUCET are not specified. The following describes how reordering
2840			groups for the DUCET can be derived.</p>
2841		<p>
2842			For allkeys_DUCET.txt, the start of each reordering group is normally
2843			the primary weight corresponding to the same character sequence as
2844			for allkeys_CLDR.txt. In a few cases this requires adjustment,
2845			especially for the special reordering groups, due to CLDR’s ordering
2846			the common characters more strictly by category than the DUCET (as
2847			described in <i>Section 2, <a href="#Root_Collation">Root
2848					Collation</a></i>). The necessary adjustment would set the start of each
2849			allkeys_DUCET.txt reordering group to the primary weight of the first
2850			mapping for the relevant General_Category for a special reordering
2851			group (for characters that sort before ‘a’), or the primary weight of
2852			the first mapping for the first script (e.g., sc=Grek) of an
2853			“alphabetic” group (for characters that sort at or after ‘a’).
2854		</p>
2855		<p>Note that the following only applies to primary weights greater
2856			than the one for U+FFFE and less than "trailing" weights.</p>
2857		<p>The special reordering groups correspond to General_Category
2858			values as follows:</p>
2859		<ul>
2860			<li>punct: P</li>
2861			<li>symbol: Sk, Sm, So</li>
2862			<li>space: Z, Cc</li>
2863			<li>currency: Sc</li>
2864			<li>digit: Nd</li>
2865		</ul>
2866		<p>In the DUCET, some characters that sort below ‘a’ and have
2867			other General_Category values not mentioned above (e.g., gc=Lm) are
2868			also grouped with symbols. Variants of numbers (gc=No or Nl) can be
2869			found among punctuation, symbols, and digits.</p>
2870		<p>Each collation element of an expansion may be in a different
2871			reordering group, for example for parenthesized characters.</p>
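		<p>The General_Category correspondence can be sketched with Python's standard <code>unicodedata</code> module. The function name <code>special_group</code> is hypothetical. The sketch looks only at the General_Category, so it ignores the caveats above: it does not check that a character sorts below ‘a’, and it does not model the DUCET exceptions or expansions.</p>

```python
import unicodedata

def special_group(ch):
    """Return the special reordering group suggested by General_Category, or None."""
    gc = unicodedata.category(ch)
    if gc.startswith("P"):
        return "punct"
    if gc in ("Sk", "Sm", "So"):
        return "symbol"
    if gc.startswith("Z") or gc == "Cc":
        return "space"
    if gc == "Sc":
        return "currency"
    if gc == "Nd":
        return "digit"
    return None
```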
2872
2873		<h3>
2874			3.14 <a name="Case_Parameters" href="#Case_Parameters">Case
2875				Parameters</a>
2876		</h3>
2877		<p>
2878			The <strong>case level</strong> is an <em>optional</em> intermediate
2879			level (&quot;2.5&quot;) between Level 2 and Level 3 (or after Level
2880			1, if there is no Level 2 due to strength settings). The case level
2881			is used to support two parametric features: ignoring non-case
2882			variants (Level 3 differences) except for case, and giving case
2883			differences a higher-level priority than other tertiary differences.
2884			Distinctions between small and large Kana characters are also
2885			included as case differences, to support Japanese collation.
2886		</p>
2887		<p>
2888			The <strong>case first</strong> parameter controls whether to swap
2889			the order of upper and lowercase. It can be used with or without the
2890			case level.
2891		</p>
2892		<p>
2893			Importantly, the case parameters have no effect in many instances.
2894			For example, they have no effect on the comparison of two
2895			non-ignorable characters with different primary weights, or with
2896			different secondary weights if the strength = <strong>secondary
2897				(or higher).</strong>
2898		</p>
2899		<p>
2900			When either the <strong>case level</strong> or <strong>case
2901				first</strong> parameters are set, the following describes the derivation of
2902			the modified collation elements. It assumes the original levels for
2903			the code point are [p.s.t] (primary, secondary, tertiary). This
2904			derivation may change in future versions of LDML, to track the case
2905			characteristics more closely.
2906		</p>
2907
2908		<h4>
2909			3.14.1 <a name="Case_Untailored" href="#Case_Untailored">Untailored
2910				Characters</a>
2911		</h4>
2912		<p>For untailored characters and strings, that is, for mappings in
2913			the root collation, the case value for each collation element is
2914			computed from the tertiary weight listed in allkeys_CLDR.txt. This is
2915			used to modify the collation element.</p>
2916		<p>Look up a case value for the tertiary weight x of each
2917			collation element:</p>
2918		<ol>
2919			<li>UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}</li>
2920			<li>UNCASED otherwise</li>
2921			<li>FractionalUCA.txt encodes the case information in bits 6 and
2922				7 of the first byte in each tertiary weight. The case bits are set
2923				to 00 for UNCASED and LOWERCASE, and 10 for UPPER. There is no MIXED
2924				case value (01) in the root collation.</li>
2925		</ol>
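		<p>A minimal sketch of this lookup, assuming that the tertiary weights listed above are hexadecimal values as written in allkeys_CLDR.txt. The names <code>case_value</code> and <code>case_bits</code> are hypothetical.</p>

```python
# Tertiary weights that indicate uppercase: 08-0C, 0E, 11, 12, 1D.
UPPER_TERTIARIES = set(range(0x08, 0x0C + 1)) | {0x0E, 0x11, 0x12, 0x1D}

def case_value(tertiary):
    """Case value for a root collation element with the given tertiary weight."""
    return "UPPER" if tertiary in UPPER_TERTIARIES else "UNCASED"

def case_bits(value):
    """Two-bit case field: 00 for uncased or lowercase, 01 for mixed, 10 for upper."""
    return {"UNCASED": 0b00, "LOWERCASE": 0b00, "MIXED": 0b01, "UPPER": 0b10}[value]
```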
2926
2927		<h4>
2928			3.14.2 <a name="Case_Weights" href="#Case_Weights">Compute
2929				Modified Collation Elements</a>
2930		</h4>
2931		<p>
2932			From a computed case value, set a weight <strong>c</strong> according
2933			to the following.
2934		</p>
2935		<ol>
2936			<li>If <strong>CaseFirst=UpperFirst</strong>, set <strong>c</strong>
2937				= UPPER ? <strong>1</strong> : MIXED ? 2 : <strong>3</strong></li>
2938			<li>Otherwise set <strong>c</strong> = UPPER ? <strong>3</strong>
2939				: MIXED ? 2 : <strong>1</strong></li>
2940		</ol>
2941		<p>
2942			Compute a new collation element according to the following table. The
2943			notation <em>xt</em> means that the values are numerically combined
2944			into a single level, such that xt &lt; yu whenever x &lt; y. The
2945			fourth level (if it exists) is unaffected. Note that a secondary CE
2946			must have a secondary weight S which is greater than the secondary
2947			weight s of any primary CE; and a tertiary CE must have a tertiary
2948			weight T which is greater than the tertiary weight t of any primary
2949			or secondary CE ([<a
2950				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a
2951				href="http://www.unicode.org/reports/tr10/#WF2">WF2</a>).
2952		</p>
2953
2954		<div align="center">
2955			<table>
2956				<tbody>
2957					<tr>
2958						<th>Case Level</th>
2959						<th>Strength</th>
2960						<th>Original CE</th>
2961						<th>Modified CE</th>
2962						<th>Comment</th>
2963					</tr>
2964					<tr>
2965						<td rowspan="5"><strong>on</strong></td>
2966						<td rowspan="2"><strong>primary</strong></td>
2967						<td><code>0.S.t</code></td>
2968						<td><code>0.0</code></td>
2969						<td rowspan="2">ignore case level weights of
2970							primary-ignorable CEs</td>
2971					</tr>
2972					<tr>
2973						<td><code>p.s.t</code></td>
2974						<td><code>p.c</code></td>
2975					</tr>
2976					<tr>
2977						<td rowspan="3"><strong>secondary<br>
2978						</strong>or higher</td>
2979						<td><code>0.0.T</code></td>
2980						<td><code>0.0.0.T</code></td>
2981						<td rowspan="3">ignore case level weights of
2982							secondary-ignorable CEs</td>
2983					</tr>
2984					<tr>
2985						<td><code>0.S.t</code></td>
2986						<td><code>0.S.c.t</code></td>
2987					</tr>
2988					<tr>
2989						<td><code>p.s.t</code></td>
2990						<td><code>p.s.c.t</code></td>
2991					</tr>
2992					<tr>
2993						<td rowspan="4"><strong>off</strong></td>
2994						<td rowspan="4">any</td>
2995						<td><code>0.0.0</code></td>
2996						<td><code>0.0.00</code></td>
2997						<td rowspan="4">ignore case level weights of
2998							tertiary-ignorable CEs</td>
2999					</tr>
3000					<tr>
3001						<td><code>0.0.T</code></td>
3002						<td><code> 0.0.3T </code></td>
3003					</tr>
3004					<tr>
3005						<td><code>0.S.t</code></td>
3006						<td><code>0.S.ct</code></td>
3007					</tr>
3008					<tr>
3009						<td><code>p.s.t</code></td>
3010						<td><code>p.s.ct</code></td>
3011					</tr>
3012				</tbody>
3013			</table>
3014		</div>
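		<p>The table can be read as the following sketch. It represents a collation element as a tuple of levels and the combined weight <em>ct</em> as the pair <code>(c, t)</code>, which compares the way the text requires. The function names and string constants are hypothetical, and rows that the table does not list (such as <code>0.0.0</code> with the case level on) are handled by analogy, which is an assumption.</p>

```python
def case_weight(case, upper_first=False):
    """Weight c: UPPER and non-UPPER swap places when caseFirst is upper-first."""
    if case == "MIXED":
        return 2
    if upper_first:
        return 1 if case == "UPPER" else 3
    return 3 if case == "UPPER" else 1

def modified_ce(ce, case, case_level, strength="tertiary", upper_first=False):
    """Return the modified collation element for an original CE (p, s, t)."""
    p, s, t = ce
    c = case_weight(case, upper_first)
    if case_level:
        if strength == "primary":
            # Primary-ignorable CEs get no case level weight.
            return (p, c) if p != 0 else (0, 0)
        if p == 0 and s == 0:
            # Secondary-ignorable CEs get no case level weight.
            return (0, 0, 0, t)
        return (p, s, c, t)
    # Case level off: case and tertiary weights share one level.
    if p == 0 and s == 0:
        # Tertiary CEs always use the constant 3; fully ignorable CEs use 0.
        return (0, 0, (0, 0)) if t == 0 else (0, 0, (3, t))
    return (p, s, (c, t))
```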
3015
3016		<p>For primary+case, which is used for “ignore accents but not
3017			case” collation, primary ignorables are ignored so that a = ä. For
3018			secondary+case, which would by analogy mean “ignore variants but not
3019			case”, secondary ignorables are ignored for equivalent behavior.</p>
3020		<p>
3021			When using <strong>caseFirst</strong> but not <strong>caseLevel</strong>,
3022			the combined case+tertiary weight of a tertiary CE must be greater
3023			than the combined case+tertiary weight of any primary or secondary CE
3024			so that [<a href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
3025			<a href="http://www.unicode.org/reports/tr10/#WF2">well-formedness
3026				condition 2</a> is fulfilled. Since the tertiary CE’s tertiary weight T
3027			is already greater than any t of primary or secondary CEs, it is
3028			sufficient to set its case weight to UPPER=3. It must not be affected
3029			by <strong>caseFirst=upper</strong>. (The table uses the constant 3
3030			in this case rather than the computed c.)
3031		</p>
3032		<p>
3033			The case weight of a tertiary-ignorable CE must be 0 so that [<a
3034				href="http://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a
3035				href="http://www.unicode.org/reports/tr10/#WF1">well-formedness
3036				condition 1</a> is fulfilled.
3037		</p>
3038
3039		<h4>
3040			3.14.3 <a name="Case_Tailored" href="#Case_Tailored">Tailored
3041				Strings</a>
3042		</h4>
3043		<p>Characters and strings that are tailored have case values
3044			computed from their root collation case bits.</p>
3045
3046		<ol>
3047			<li>Look up the tailored string’s root CEs. (Ignore any prefix
3048				or extension strings.) N=number of primary root CEs.</li>
3049			<li>Determine the number and type (primary vs. weaker) of CEs a
3050				tailored string maps to. M=number of primary tailored CEs.</li>
3051			<li>If N&lt;=M (no more root than tailoring primary CEs): Copy
3052				the root case bits for primary CEs 0..N-1.
3053				<ul>
3054					<li>If N&lt;M (fewer root primary CEs): Clear the case bits of
3055						the remaining tailored primary CEs. (uncased/lowercase/small Kana)</li>
3056				</ul>
3057			</li>
3058			<li>If N&gt;M (more root primary CEs): Copy the root case bits
3059				for primary CEs 0..M-2. Set the case bits for tailored primary CE
3060				M-1 according to the remaining root primary CEs M-1..N-1:
3061				<ul>
3062					<li>Set to uncased/lower if all remaining root primary CEs
3063						have uncased/lower.</li>
3064					<li>Set to uppercase if all remaining root primary CEs have
3065						uppercase.</li>
3066					<li>Otherwise, set to mixed.</li>
3067				</ul>
3068			</li>
3069			<li>Clear the case bits for secondary CEs 0.s.t.</li>
3070			<li>Tertiary CEs 0.0.t must get uppercase bits.</li>
3071			<li>Tertiary-ignorable CEs 0.0.0 must get
3072				ignorable-case=lowercase bits.</li>
3073		</ol>
3074		<p class="note">Note: Almost all Cased characters have primary
3075			(non-ignorable) root collation CEs, except for U+0345 Combining
3076			Ypogegrammeni which is Lowercase. All Uppercase characters have
3077			primary root collation CEs.</p>
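		<p>The steps above can be sketched in code. This is a minimal
			illustration, not a reference implementation. The names
			<code>tailored_case_bits</code> and <code>merge_case</code>, the
			symbolic case values, and the string labels for CE levels are all
			assumptions made for this example. The input is the list of case
			values of the root primary CEs, plus the level of each tailored CE
			in order.</p>

```python
# Symbolic case values; real implementations store these as bits in the CE.
LOWER, MIXED, UPPER = "lower", "mixed", "upper"


def merge_case(cases):
    """Combine the case values of several root primary CEs into one value."""
    if all(c == LOWER for c in cases):
        return LOWER
    if all(c == UPPER for c in cases):
        return UPPER
    return MIXED


def tailored_case_bits(root_primary_cases, tailored_ce_levels):
    """Return one case value per tailored CE.

    root_primary_cases: case values of the N root primary CEs.
    tailored_ce_levels: level of each tailored CE, one of
        "primary", "secondary", "tertiary", "ignorable".
    """
    n = len(root_primary_cases)
    m = sum(1 for level in tailored_ce_levels if level == "primary")

    if m == 0:
        primary_cases = []
    elif n <= m:
        # Copy root case bits 0..N-1, clear the remaining tailored primaries.
        primary_cases = list(root_primary_cases) + [LOWER] * (m - n)
    else:
        # Copy root case bits 0..M-2, merge root CEs M-1..N-1 into CE M-1.
        primary_cases = list(root_primary_cases[:m - 1])
        primary_cases.append(merge_case(root_primary_cases[m - 1:]))

    remaining = iter(primary_cases)
    result = []
    for level in tailored_ce_levels:
        if level == "primary":
            result.append(next(remaining))
        elif level == "tertiary":
            result.append(UPPER)   # tertiary CEs 0.0.t get uppercase bits
        else:
            result.append(LOWER)   # secondary and tertiary-ignorable CEs
    return result
```

		<p>For instance, with root case values upper and lower (two root
			primary CEs) and a tailoring that maps to a single primary CE, the
			sketch yields mixed for that CE.</p>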
3078
3079
3080		<h3>
3081			3.15 <a name="Visibility" href="#Visibility">Visibility</a>
3082		</h3>
3083		<p>
3084			Collations have external visibility by default, meaning that they can
3085			be displayed in a list of collation options for users to choose from.
3086			A collation whose type name starts with "private-" is internal and
3087			should not be shown in such a list. Collations are typically internal
3088			when they are partial sequences included in other collations. See <i>Section
3089				3.1, <a href="#Collation_Types">Collation Types</a>
3090			</i>.
3091		</p>
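		<p>A user interface could apply this rule with a simple prefix
			test. The following is a minimal sketch; the function names and the
			sample type names in the usage note are illustrative assumptions.</p>

```python
def is_externally_visible(collation_type):
    """A collation is internal if its type name starts with "private-"."""
    return not collation_type.startswith("private-")


def visible_types(collation_types):
    """Keep only the collation types that may be offered to users."""
    return [t for t in collation_types if is_externally_visible(t)]
```

		<p>Given the hypothetical type names "standard", "private-example",
			and "search", <code>visible_types</code> would keep "standard" and
			"search".</p>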
3092
3093		<h3>
3094			3.16 <a name="Collation_Indexes" href="#Collation_Indexes">Collation
3095				Indexes</a>
3096		</h3>
3097		<h4>
3098			3.16.1 <a name="Index_Characters" href="#Index_Characters">Index
3099				Characters</a>
3100		</h4>
3101		<p>
3102			The main data includes &lt;exemplarCharacters&gt; for collation
3103			indexes. See <i>Part 2 General, Section 3, <a
3104				href="tr35-general.html#Character_Elements">Character Elements</a></i>,
3105			for general information about exemplar characters.
3106		</p>
3107		<p>The index characters are a set of characters for use as a UI
3108			"index", that is, a list of clickable characters (or character
3109			sequences) that allow the user to see a segment of a larger "target"
			list. Each character corresponds to a bucket in the target list.
			There are two kinds of index lists: one that is relatively static,
			and one that produces roughly equally sized buckets. While CLDR is
			mostly focused on the first, there is provision for supporting the
			second as well.</p>
3115		<p>The index characters need to be used in conjunction with a
3116			collation for the locale, which will determine the order of the
3117			characters. It will also determine which index characters show up.</p>
3118		<p>The static list would be presented as something like the
3119			following (either vertically or horizontally):</p>
3120		<p align="center">… A B C D E F G H CH I J K L M N O P Q R S T U V
3121			W X Y Z …</p>
3122		<p>In the "A" bucket, you would find all items that are primary
3123			greater than or equal to "A" in collation order, and primary less
3124			than "B". The use of the list requires that the target list be sorted
3125			according to the locale that is used to create that list. Although we
3126			say "character" above, the index character could be a sequence, like
3127			"CH" above. The index exemplar characters must always be used with a
3128			collation appropriate for the locale. Any characters that do not have
3129			primary differences from others in the set should be removed.</p>
3130		<p>Details:</p>
3131		<ol>
			<li>The primary weight (according to the collation) is used to
				determine which bucket a string is in. There are special buckets
				for strings that sort before the first index character, for strings
				that sort between the buckets of different scripts, and for strings
				that sort after the last bucket (and belong to a different script).</li>
3136			<li>Characters in the <em>index characters</em> do not need to
3137				have distinct primary weights. That is, the <em>index
3138					characters</em> are adapted to the underlying collation: normally Ё is
3139				in the Е bucket for Russian, but if someone used a variant of
3140				Russian collation that distinguished them on a primary level, then Ё
3141				would show up as its own bucket.
3142			</li>
3143			<li>If an <em>index character</em> string ends with a single "*"
3144				(U+002A), for example "Sch*" and "St*" in German, then there will be
3145				a separate bucket for the string minus the "*", for example "Sch"
3146				and "St", even if that string does not sort distinctly.
3147			</li>
3148			<li>An <em>index character</em> can have multiple primary
3149				weights, for example "Æ" and "Sch". Names that have the same initial
3150				primary weights sort into this <em>index character</em>’s bucket.
3151				This can be achieved by using an upper-boundary string that is the
3152				concatenation of the <em>index character</em> and U+FFFF, for
3153				example "Æ\uFFFF" and "Sch\uFFFF". Names that sort greater than this
3154				upper boundary but less than the next index character are redirected
3155				to the last preceding single-primary index character (A and S for
3156				the examples here).
3157			</li>
3158		</ol>
3159		<p>
3160			For example, for index characters
3161			<code>[A Æ B R S {Sch*} {St*} T]</code>
3162			the following sample names are sorted into an index as shown.
3163		</p>
3164		<ul>
3165			<li>A &mdash; Adelbert, Afrika</li>
3166			<li>Æ &mdash; Æsculap, Aesthet</li>
3167			<li>B &mdash; Berlin</li>
3168			<li>R &mdash; Rilke</li>
3169			<li>S &mdash; Sacher, Seiler, Sultan</li>
3170			<li>Sch &mdash; Schiller</li>
3171			<li>St &mdash; Steiff</li>
3172			<li>T &mdash; Thomas</li>
3173		</ul>
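		<p>The bucketing in this example, including the U+FFFF upper
			boundary and the redirection to the preceding single-primary index
			character, can be sketched as follows. This is only an illustration
			under strong assumptions: <code>primary_key</code> is a toy stand-in
			for a real primary-strength sort key (it case-folds, drops combining
			marks, and expands "æ" to "ae"), it treats each character of the key
			as one primary weight, and it is adequate only for the Latin sample
			names here. Script boundaries and overflow buckets are not modeled.
			All function names are hypothetical.</p>

```python
import bisect
import unicodedata


def primary_key(text):
    """Toy primary-strength key (an assumption, not real collation data)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return letters.replace("æ", "ae")


def build_index(index_characters):
    """Return (key, label) pairs sorted by key; a trailing "*" is removed."""
    labels = [s[:-1] if s.endswith("*") else s for s in index_characters]
    return sorted((primary_key(label), label) for label in labels)


def bucket_for(name, entries):
    """Return the label of the bucket that the name falls into."""
    keys = [key for key, _ in entries]
    name_key = primary_key(name)
    pos = bisect.bisect_right(keys, name_key) - 1
    if pos < 0:
        return "…"  # sorts before the first index character
    key, label = entries[pos]
    # A multi-primary label only owns names below its upper boundary,
    # which is the label followed by U+FFFF.
    if len(key) == 1 or name_key < primary_key(label + "\uFFFF"):
        return label
    # Otherwise redirect to the last preceding single-primary label.
    while pos >= 0 and len(entries[pos][0]) > 1:
        pos -= 1
    return entries[pos][1] if pos >= 0 else "…"


entries = build_index(["A", "Æ", "B", "R", "S", "Sch*", "St*", "T"])
```

		<p>With these entries, "Aesthet" has the same initial primary
			weights as "Æ" and lands in the Æ bucket, while "Afrika" sorts above
			"Æ\uFFFF" but below "B" and is redirected to A. A real implementation
			would compare sort keys produced by the locale's collator at primary
			strength instead of the toy key.</p>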
3174		<p>
3175			The … items are special: each is a bucket for everything else, either
3176			less or greater. They are inserted at the start and end of the index
3177			list, <em>and</em> on script boundaries. Each script has its own
3178			range, except where scripts sort primary-equal (e.g., Hira &amp;
3179			Kana). All characters that sort in one of the low reordering groups
3180			(whitespace, punctuation, symbols, currency symbols, digits) are
3181			treated as a single script for this purpose.
3182		</p>
3183		<p>If you tailor a Greek character into the Cyrillic script, that
3184			Greek character will be bucketed (and sorted) among the Cyrillic
3185			ones.</p>
3186
3187		<p>
3188			Even in an implementation that reorders groups of scripts rather than
3189			single scripts, for example Hebrew together with Phoenician and
3190			Samaritan, the index boundaries are really script boundaries, <em>not</em>
3191			multi-script-group boundaries. So if you had a collation that
3192			reordered Hebrew after Ethiopic, you would still get index boundaries
3193			between the following (and in that order):
3194		</p>
3195		<ol>
3196			<li>Ethiopic</li>
3197			<li>Hebrew</li>
3198			<li>Phoenician<em> // included in the Hebrew reordering
3199					group</em></li>
3200			<li>Samaritan<em> // included in the Hebrew reordering
3201					group</em></li>
3202			<li>Devanagari</li>
3203		</ol>
3204		<p>(Beginning with CLDR 27, single scripts can be reordered.)</p>
3205		<p>In the UI, an index character could also be omitted or grayed
3206			out if its bucket is empty. For example, if there is nothing in the
3207			bucket for Q, then Q could be omitted. That would be up to the
3208			implementation. Additional buckets could be added if other characters
3209			are present. For example, we might see something like the following:</p>
3210		<table border="1" cellspacing="0">
3211			<tbody>
3212				<tr align="center">
3213					<td><div align="center">
3214							<strong>Sample Greek Index<br>
3215							</strong>
3216						</div></td>
3217					<td><strong>Contents<br>
3218					</strong></td>
3219				</tr>
3220				<tr align="center">
3221					<td><div align="center"> Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π
3222							Ρ Σ Τ Υ Φ Χ Ψ Ω</div></td>
3223					<td>With only content beginning with Greek letters <br>
3224					</td>
3225				</tr>
3226				<tr align="center">
3227					<td><div align="center"> … Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
3228							Π Ρ Σ Τ Υ Φ Χ Ψ Ω …</div></td>
3229					<td>With some content before or after</td>
3230				</tr>
3231				<tr align="center">
3232					<td><div align="center"> … 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ
3233							Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …</div></td>
3234					<td>With numbers, and nothing between 9 and Alpha</td>
3235				</tr>
3236				<tr align="center">
3237					<td><div align="center">
3238							  … 9 <em>A-Z</em> Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ
3239							Ω …
3240						</div></td>
3241					<td>With numbers, some Latin</td>
3242				</tr>
3243			</tbody>
3244		</table>
3245		<p>Here is a sample of the XML structure:</p>
3246		<pre>&lt;exemplarCharacters type=&quot;index&quot;&gt;[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]&lt;/exemplarCharacters&gt;</pre>
3247		<p>
			The display of the index characters can be modified with the index
			label elements, discussed in <i>Part 2 General, Section 3.3,
				<a href="tr35-general.html#IndexLabels">Index Labels</a>
			</i>.
3252		</p>
3253
3254		<h4>
3255			3.16.2 <a name="CJK_Index_Markers" href="#CJK_Index_Markers">CJK
3256				Index Markers</a>
3257		</h4>
3258		<p>Special index markers have been added to the CJK collations for
3259			stroke, pinyin, zhuyin, and unihan. These markers allow for effective
3260			and robust use of indexes for these collations.</p>
3261		<p>The per-language index exemplar characters are not useful for
3262			collation indexes for CJK because for each such language there are
3263			multiple sort orders in use (for example, Chinese pinyin vs. stroke
3264			vs. unihan vs. zhuyin), and these sort orders use very different
3265			index characters. In addition, sometimes the boundary strings are
3266			different from the bucket label strings. For collations that contain
3267			index markers, the boundary strings and bucket labels should be
3268			derived from those index markers, ignoring the index exemplar
3269			characters.</p>
3270		<p>For example, near the start of the pinyin tailoring there is
3271			the following:</p>
3272		<p>
3273			&lt;p&gt; A&lt;/p&gt;&lt;!-- INDEX A --&gt;<br>
			&lt;pc&gt;阿呵𥥩锕𠼞𨉚&lt;/pc&gt;&lt;!-- ā --&gt;
3275		</p>
3276		<p>…</p>
3277		<p>
3278			&lt;pc&gt;翶&lt;/pc&gt;&lt;!-- ao --&gt;<br> &lt;p&gt;
3279			B&lt;/p&gt;&lt;!-- INDEX B --&gt;
3280		</p>
		<p>These indicate the boundaries of &quot;buckets&quot; that can
			be used for indexing. Each boundary string is always two characters
			long and starts with the noncharacter U+FDD0, and thus will not occur
			in normal text. For pinyin the second character is A-Z; for unihan it
			is one of the radicals; and for stroke it is the character at U+2800
			plus the number of strokes, such as ⠁ (U+2801) for one stroke. For
			zhuyin the second character is one of the standard Bopomofo
			characters in the range U+3105 through U+3129.</p>
3289
3290		<p>The corresponding bucket label strings are the boundary strings
3291			with the leading U+FDD0 removed. For example, the Pinyin boundary
3292			string "\uFDD0A" yields the label string "A".</p>
3293
3294		<p>However, for stroke order, the label string is the stroke count
3295			(second character minus U+2800) as a decimal-digit number followed by
3296			&#x5283; (U+5283). For example, the stroke order boundary string
3297			"\uFDD0\u2805" yields the label string "5&#x5283;".</p>
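		<p>The label derivation described above can be sketched as a small
			function. The function name <code>bucket_label</code> and the use of
			the collation type string to select the stroke rule are assumptions
			made for illustration.</p>

```python
INDEX_MARKER_LEAD = "\uFDD0"   # noncharacter that starts every boundary string
STROKE_BASE = 0x2800           # stroke count is the second character minus this
STROKE_SUFFIX = "\u5283"       # U+5283, appended to the decimal stroke count


def bucket_label(boundary, collation_type):
    """Derive the bucket label from a two-character CJK index boundary string."""
    if len(boundary) != 2 or boundary[0] != INDEX_MARKER_LEAD:
        raise ValueError("not an index marker boundary string")
    second = boundary[1]
    if collation_type == "stroke":
        return f"{ord(second) - STROKE_BASE}{STROKE_SUFFIX}"
    # pinyin, zhuyin, unihan: the label is the boundary minus the lead U+FDD0.
    return second
```

		<p>For example, <code>bucket_label("\uFDD0A", "pinyin")</code> gives
			"A", and <code>bucket_label("\uFDD0\u2805", "stroke")</code> gives
			"5&#x5283;".</p>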
3298
3299		<hr>
3300		<p class="copyright">
3301			Copyright © 2001–2018 Unicode, Inc. All
3302			Rights Reserved. The Unicode Consortium makes no expressed or implied
3303			warranty of any kind, and assumes no liability for errors or
3304			omissions. No liability is assumed for incidental and consequential
3305			damages in connection with or arising out of the use of the
3306			information or programs contained or accompanying this technical
3307			report. The Unicode <a href="http://unicode.org/copyright.html">Terms
3308				of Use</a> apply.
3309		</p>
3310		<p class="copyright">Unicode and the Unicode logo are trademarks
3311			of Unicode, Inc., and are registered in some jurisdictions.</p>
3312	</div>
3313
3314</body>
3315
3316</html>
3317