## Unicode Technical Standard #35

# Unicode Locale Data Markup Language (LDML)

|Version|42        |
|-------|----------|
|Editors|Mark Davis (<a href="mailto:markdavis@google.com">markdavis@google.com</a>) and <a href="tr35.md#Acknowledgments">other CLDR committee members</a>|
|Date|2022-10-17|
|This Version|<a href="https://www.unicode.org/reports/tr35/tr35-67/tr35.html">https://www.unicode.org/reports/tr35/tr35-67/tr35.html</a>|
|Previous Version|<a href="https://www.unicode.org/reports/tr35/tr35-66/tr35.html">https://www.unicode.org/reports/tr35/tr35-66/tr35.html</a>|
|Latest Version|<a href="https://www.unicode.org/reports/tr35/">https://www.unicode.org/reports/tr35/</a>|
|Corrigenda|<a href="https://cldr.unicode.org/index/corrigenda">https://cldr.unicode.org/index/corrigenda</a>|
|Latest Proposed Update|<a href="https://www.unicode.org/reports/tr35/proposed.html">https://www.unicode.org/reports/tr35/proposed.html</a>|
|Namespace|<a href="https://www.unicode.org/cldr/">https://www.unicode.org/cldr/</a>|
|DTDs|<a href="https://www.unicode.org/cldr/dtd/42/">https://www.unicode.org/cldr/dtd/42/</a>|
|Revision|<a href="#Modifications">67</a>|

### _Summary_

This document describes an XML format (_vocabulary_) for the exchange of structured locale data. This format is used in the [Unicode Common Locale Data Repository](https://www.unicode.org/cldr/).

_Note:_
Some links may lead to in-development or older
versions of the data files.
See <https://cldr.unicode.org> for up-to-date CLDR release data.

### _Status_

_This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications._

> _**A Unicode Technical Standard (UTS)** is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS._

_Please submit corrigenda and other comments with the CLDR bug reporting form [[Bugs](https://cldr.unicode.org/index/bug-reports)]. Related information that is useful in understanding this document is found in the [References](#References). For the latest version of the Unicode Standard see [[Unicode](https://www.unicode.org/versions/latest/)]. For a list of current Unicode Technical Reports see [[Reports](https://www.unicode.org/reports/)]. For more information about versions of the Unicode Standard, see [[Versions](https://www.unicode.org/versions/)]._

>**_NOTE: The source for the LDML specification has been converted to GitHub Markdown (GFM) instead of HTML. The formatting is now simpler, but some features, such as formatting for table captions, may not be complete by the release date. Improvements in the formatting for the specification may be done after the release, but no substantive changes will be made to the content._**

## <a name="Parts" href="#Parts">Parts</a>

The LDML specification is divided into the following parts:

*   Part 1: [Core](tr35.md#Contents) (languages, locales, basic structure)
*   Part 2: [General](tr35-general.md#Contents) (display names & transforms, etc.)
*   Part 3: [Numbers](tr35-numbers.md#Contents) (number & currency formatting)
*   Part 4: [Dates](tr35-dates.md#Contents) (date, time, time zone formatting)
*   Part 5: [Collation](tr35-collation.md#Contents) (sorting, searching, grouping)
*   Part 6: [Supplemental](tr35-info.md#Contents) (supplemental data)
*   Part 7: [Keyboards](tr35-keyboards.md#Contents) (keyboard mappings)
*   Part 8: [Person Names](tr35-personNames.md#Contents) (person names)

## <a name="Contents" href="#Contents">Contents of Part 1, Core</a>

* 1 [Introduction](#Introduction)
  * 1.1 [Conformance](#Conformance)
* 2 [What is a Locale?](#Locale)
* 3 [Unicode Language and Locale Identifiers](#Unicode_Language_and_Locale_Identifiers)
  * _[3.1 Unicode Language Identifier](#Unicode_language_identifier)_
  * _[3.2 Unicode Locale Identifier](#Unicode_locale_identifier)_
    * 3.2.1 [Canonical Unicode Locale Identifiers](#Canonical_Unicode_Locale_Identifiers)
  * 3.3 [BCP 47 Conformance](#BCP_47_Conformance)
    * 3.3.1 [BCP 47 Language Tag Conversion](#BCP_47_Language_Tag_Conversion)
      * Table: [BCP 47 Language Tag to Unicode BCP 47 Locale Identifier](#Language_Tag_to_Locale_Identifier) Examples
      * [Unicode Locale Identifier: CLDR to BCP 47](#Unicode_Locale_Identifier_CLDR_to_BCP_47)
      * [Unicode Locale Identifier: BCP 47 to CLDR](#Unicode_Locale_Identifier_BCP_47_to_CLDR)
      * [Truncation](#truncation)
  * 3.4 [Language Identifier Field Definitions](#Field_Definitions)
    * [`unicode_language_subtag`](#unicode_language_subtag_validity) (also known as a _Unicode base language code_)
    * [`unicode_script_subtag`](#unicode_script_subtag_validity) (also known as a _Unicode script code_)
    * [`unicode_region_subtag`](#unicode_region_subtag_validity) (also known as a _Unicode region code,_ or a _Unicode territory code_)
    * [`unicode_variant_subtag`](#unicode_variant_subtag_validity) (also known as a _Unicode language variant code_)
  * 3.5 [Special Codes](#Special_Codes)
    * 3.5.1 [Unknown or Invalid Identifiers](#Unknown_or_Invalid_Identifiers)
    * 3.5.2 [Numeric Codes](#Numeric_Codes)
    * 3.5.3 [Private Use Codes](#Private_Use_Codes)
      * Table: [Private Use Codes in CLDR](#Private_Use_CLDR)
  * 3.6 [Unicode BCP 47 U Extension](#u_Extension)
    * 3.6.1 [Key And Type Definitions](#Key_And_Type_Definitions_)
      * Table: [Key/Type Definitions](#Key_Type_Definitions)
    * 3.6.2 [Numbering System Data](#Numbering%20System%20Data)
    * 3.6.3 [Time Zone Identifiers](#Time_Zone_Identifiers)
    * 3.6.4 [U Extension Data Files](#Unicode_Locale_Extension_Data_Files)
    * 3.6.5 [Subdivision Codes](#Unicode_Subdivision_Codes)
      * 3.6.5.1 [Validity](#Validity)
  * 3.7 [Unicode BCP 47 T Extension](#BCP47_T_Extension)
    * 3.7.1 [T Extension Data Files](#Transformed_Content_Data_File)
  * 3.8 [Compatibility with Older Identifiers](#Compatibility_with_Older_Identifiers)
    * 3.8.1 [Old Locale Extension Syntax](#Old_Locale_Extension_Syntax)
      * Table: [Locale Extension Mappings](#Locale_Extension_Mappings)
    * 3.8.2 [Legacy Variants](#Legacy_Variants)
      * Table: [Legacy Variant Mappings](#Legacy_Variant_Mappings)
    * 3.8.3 [Relation to OpenI18n](#Relation_to_OpenI18n)
  * 3.9 [Transmitting Locale Information](#Transmitting_Locale_Information)
    * 3.9.1 [Message Formatting and Exceptions](#Message_Formatting_and_Exceptions)
  * 3.10 [Unicode Language and Locale IDs](#Language_and_Locale_IDs)
    * 3.10.1 [Written Language](#Written_Language)
    * 3.10.2 [Hybrid Locale Identifiers](#Hybrid_Locale)
  * 3.11 [Validity Data](#Validity_Data)
* 4 [Locale Inheritance and Matching](#Locale_Inheritance)
  * 4.1 [Lookup](#Lookup)
    * 4.1.1 [Bundle vs Item Lookup](#Bundle_vs_Item_Lookup)
      * Table: [Lookup Differences](#Lookup-Differences)
    * 4.1.2 [Lateral Inheritance](#Lateral_Inheritance)
      * Table: [Count Fallback: normal](#Count_Fallback_normal)
      * Table: [Count Fallback: currency](#Count_Fallback_currency)
    * 4.1.3 [Parent Locales](#Parent_Locales)
  * 4.2 [Inheritance and Validity](#Inheritance_and_Validity)
    * 4.2.1 [Definitions](#Definitions)
    * 4.2.2 [Resolved Data File](#Resolved_Data_File)
    * 4.2.3 [Valid Data](#Valid_Data)
    * 4.2.4 [Checking for Draft Status](#Checking_for_Draft_Status)
    * 4.2.5 [Keyword and Default Resolution](#Keyword_and_Default_Resolution)
    * 4.2.6 [Inheritance vs Related Information](#Inheritance_vs_Related)
  * 4.3 [Likely Subtags](#Likely_Subtags)
  * 4.4 [Language Matching](#LanguageMatching)
    * 4.4.1 [Enhanced Language Matching](#EnhancedLanguageMatching)
* 5 [XML Format](#XML_Format)
  * 5.1 [Common Elements](#Common_Elements)
    * 5.1.1 [Element special](#special)
      * 5.1.1.1 [Sample Special Elements](#Sample_Special_Elements)
    * 5.1.2 [Element alias](#Alias_Elements)
      * Table: [Inheritance with `source="locale"`](#Inheritance_with_source_locale_)
    * 5.1.3 [Element displayName](#Element_displayName)
    * 5.1.4 [Escaping Characters](#Escaping_Characters)
  * 5.2 [Common Attributes](#Common_Attributes)
    * 5.2.1 [Attribute type](#Attribute_type)
    * 5.2.2 [Attribute draft](#Attribute_draft)
    * 5.2.3 [Attribute alt](#alt_attribute)
    * 5.2.4 [Attribute references](#references_attribute)
  * 5.3 [Common Structures](#Common_Structures)
    * 5.3.1 [Date and Date Ranges](#Date_Ranges)
    * 5.3.2 [Text Directionality](#Text_Directionality)
    * 5.3.3 [Unicode Sets](#Unicode_Sets)
      * 5.3.3.1 [Lists of Code Points](#Lists_of_Code_Points)
      * 5.3.3.2 [Unicode Properties](#Unicode_Properties)
      * 5.3.3.3 [Boolean Operations](#Boolean_Operations)
      * 5.3.3.4 [UnicodeSet Examples](#UnicodeSet_Examples)
    * 5.3.4 [String Range](#String_Range)
  * 5.4 [Identity Elements](#Identity_Elements)
  * 5.5 [Valid Attribute Values](#Valid_Attribute_Values)
  * 5.6 [Canonical Form](#Canonical_Form)
    * 5.6.1 [Content](#Content)
    * 5.6.2 [Ordering](#Ordering)
    * 5.6.3 [Comments](#Comments)
  * 5.7 [DTD Annotations](#DTD_Annotations)
    * 5.7.1 [Attribute Value Constraints](#match_expressions)
* 6 [Property Data](#Property_Data)
  * 6.1 [Script Metadata](#Script_Metadata)
  * 6.2 [Extended Pictographic](#Extended_Pictographic)
  * 6.3 [Labels.txt](#Labels.txt)
  * 6.4 [Segmentation Tests](#Segmentation_Tests)
* 7 [Issues in Formatting and Parsing](#Format_Parse_Issues)
  * 7.1 [Lenient Parsing](#Lenient_Parsing)
    * 7.1.1 [Motivation](#Motivation)
    * 7.1.2 [Loose Matching](#Loose_Matching)
  * 7.2 [Handling Invalid Patterns](#Invalid_Patterns)
* [Annex A Deprecated Structure](#Deprecated_Structure)
  * [A.1 Element fallback](#Fallback_Elements)
  * [A.2 BCP 47 Keyword Mapping](#BCP47_Keyword_Mapping)
  * [A.3 Choice Patterns](#Choice_Patterns)
  * [A.4 Element default](#Element_default)
  * [A.5 Deprecated Common Attributes](#Deprecated_Common_Attributes)
    * [A.5.1 Attribute standard](#Attribute_standard)
    * [A.5.2 Attribute draft in non-leaf elements](#Attribute_draft_nonLeaf)
  * [A.6 Element base](#Element_base)
  * [A.7 Element rules](#Element_rules)
  * [A.8 Deprecated subelements of `<dates>`](#Deprecated_subelements_of_dates)
  * [A.9 Deprecated subelements of `<calendars>`](#Deprecated_subelements_of_calendars)
  * [A.10 Deprecated subelements of `<timeZoneNames>`](#Deprecated_subelements_of_timeZoneNames)
  * [A.11 Deprecated subelements of `<zone>` and `<metazone>`](#Deprecated_subelements_of_zone_metazone)
  * [A.12 Renamed attribute values for `<contextTransformUsage>` element](#Renamed_attribute_values_for_contextTransformUsage)
  * [A.13 Deprecated subelements of `<segmentations>`](#Deprecated_subelements_of_segmentations)
  * [A.14 Element cp](#Element_cp)
  * [A.15 Attribute validSubLocales](#validSubLocales)
  * [A.16 Elements postalCodeData, postCodeRegex](#postCodeElements)
  * [A.17 Element telephoneCodeData](#telephoneCodeData)
* [Annex B Links to Other Parts](#Links_to_Other_Parts)
  * Table: [Part 2 Links](#Part_2_Links): [General](tr35-general.md) (display names & transforms, etc.)
  * Table: [Part 3 Links](#Part_3_Links): [Numbers](tr35-numbers.md) (number & currency formatting)
  * Table: [Part 4 Links](#Part_4_Links): [Dates](tr35-dates.md) (date, time, time zone formatting)
  * Table: [Part 5 Links](#Part_5_Links): [Collation](tr35-collation.md) (sorting, searching, grouping)
  * Table: [Part 6 Links](#Part_6_Links): [Supplemental](tr35-info.md) (supplemental data)
  * Table: [Part 7 Links](#Part_7_Links): [Keyboards](tr35-keyboards.md) (keyboard mappings)
* [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization)
  * [LocaleId Definitions](#LocaleId_Definitions)
    * [1. Multimap interpretation](#1.-multimap-interpretation)
    * [2. Alias elements](#2.-alias-elements)
    * [3. Matches](#3.-matches)
    * [4. Replacement](#4.-replacement)
      * [Territory Exception](#territory-exception)
    * [5. Canonicalizing Syntax](#5.-canonicalizing-syntax)
  * [Preprocessing](#preprocessing)
  * [Processing LanguageIds](#processing-languageids)
  * [Processing LocaleIds](#processing-localeids)
  * [Optimizations](#optimizations)
* [References](#References)
* [Acknowledgments](#Acknowledgments)
* [Modifications](#Modifications)

## 1 <a name="Introduction" href="#Introduction">Introduction</a>

Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. However, there remain differences in the locale data used by different systems.

The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.

But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all within acceptable limits for human beings, but yielding different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [[Comparisons](#Comparisons)].)

> **Note:** There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.

This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain, or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.

For more information, see the Common Locale Data Repository project page [[LocaleProject](#localeProject)].

As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.

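As a rough illustration of such a conversion, the following Python sketch flattens the language display names from an LDML fragment into a dictionary for fast lookup. The embedded fragment is an abbreviated, illustrative sample rather than an actual CLDR file, and `flatten_language_names` is a hypothetical helper name. A real converter would also need to handle inheritance, `alt` variants, and `draft` status, all of which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative LDML fragment (assumption: not a real CLDR file).
SAMPLE_LDML = """\
<ldml>
  <identity>
    <language type="de"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="en">Englisch</language>
      <language type="fr">Französisch</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""


def flatten_language_names(ldml_text):
    """Return a {language code: display name} lookup table."""
    root = ET.fromstring(ldml_text)
    names = {}
    # Walk only the language display names; other elements are skipped.
    for element in root.findall("./localeDisplayNames/languages/language"):
        names[element.get("type")] = element.text
    return names


LANGUAGE_NAMES = flatten_language_names(SAMPLE_LDML)
print(LANGUAGE_NAMES)
```

The resulting dictionary could then be serialized into whatever compact run-time format an implementation prefers.
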
### 1.1 <a name="Conformance" href="#Conformance">Conformance</a>

There are many ways to use the Unicode LDML format and the data in CLDR, and the Unicode Consortium does not restrict the ways in which the format or data are used. However, an implementation may also claim conformance to LDML or to CLDR, as follows:

_**UAX35-C1.**_ An implementation that claims conformance to this specification shall:

1. Identify the sections of the specification that it conforms to.
   * For example, an implementation might claim conformance to all LDML features except for _transforms_ and _segments_.
2. Interpret the relevant elements and attributes of LDML documents in accordance with the descriptions in those sections.
   * For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to [Date Field Symbol Table](tr35-dates.md#Date_Field_Symbol_Table).
3. Declare which types of CLDR data it uses.
   * For example, an implementation might declare that it only uses language names, and those with a _draft_ status of _contributed_ or _approved_.

_**UAX35-C2.**_ An implementation that claims conformance to Unicode locale or language identifiers shall:

1. Specify whether Unicode locale extensions are allowed.
2. Specify the canonical form used for identifiers in terms of casing and field separator characters.

External specifications may also reference particular components of Unicode locale or language identifiers, such as:

> _Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes._

## 2 <a name="Locale" href="#Locale">What is a Locale?</a>

Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.

The first issue is basic: _what is a locale?_ In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time zones, languages, countries, and scripts. The data can also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.

Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, and so on), music preference, religion, party affiliation, favorite charity, and so on.

Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.

In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, and so on). The format in this document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between _locales_ and _languages_, as commonly used in the industry, is rather fuzzy. Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more information, see _[Section 3.10 Language and Locale IDs](#Language_and_Locale_IDs)_.

We will speak of data as being "in locale X". That does not imply that a locale _is_ a collection of data; it is simply shorthand for "the set of data associated with the locale id X". Each individual piece of data is called a _resource_ or _field_, and a tag indicating the key of the resource is called a _resource tag._

<a name="Identifiers"></a>
## 3 <a name="Unicode_Language_and_Locale_Identifiers" href="#Unicode_Language_and_Locale_Identifiers">Unicode Language and Locale Identifiers</a>

Unicode LDML uses stable identifiers based on [[BCP47](#BCP47)] for distinguishing among languages, locales, regions, currencies, time zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used on a particular target system. If so, some process of identifier translation may be required when using LDML data.

The BCP 47 extensions (-u- and -t-) are described in _Section 3.6 [Unicode BCP 47 U Extension](#u_Extension)_ and _Section 3.7 [Unicode BCP 47 T Extension](#BCP47_T_Extension)_.

### _<a name="Unicode_language_identifier" href="#Unicode_language_identifier">3.1 Unicode Language Identifier</a>_

A _Unicode language identifier_ has the following structure (provided in EBNF (Perl-based)). The following table defines syntactically well-formed identifiers: they are not necessarily valid identifiers. For additional validity criteria, see the links on the right.

<table>
<tbody>
   <tr><th></th><th>EBNF</th><th>Validity / Comments</th></tr>
<tr>
    <td><a name="unicode_language_id" href="#unicode_language_id"><code>unicode_language_id</code></a></td>
    <td><pre><code>= "root"
| (unicode_language_subtag
    (sep unicode_script_subtag)?
  | unicode_script_subtag)
  (sep unicode_region_subtag)?
  (sep unicode_variant_subtag)* ;</code></pre></td>
    <td>"root" is treated as a special <code>unicode_language_subtag</code></td>
</tr>
<tr>
    <td><a name="unicode_language_subtag" href="#unicode_language_subtag"><code>unicode_language_subtag</code></a></td>
    <td><pre>= alpha{2,3} | alpha{5,8};</pre></td>
    <td><a href="#unicode_language_subtag_validity">validity</a><br/>
        <a href="https://github.com/unicode-org/cldr/blob/maint/maint-41/common/validity/language.xml">latest-data</a></td>
</tr>
<tr>
    <td><a name="unicode_script_subtag" href="#unicode_script_subtag"><code>unicode_script_subtag</code></a></td>
    <td><pre>= alpha{4} ;</pre></td>
    <td><a href="#unicode_script_subtag_validity">validity</a><br/>
        <a href="https://github.com/unicode-org/cldr/blob/maint/maint-41/common/validity/script.xml">latest-data</a></td>
</tr>
<tr>
    <td><a name="unicode_region_subtag" href="#unicode_region_subtag"><code>unicode_region_subtag</code></a></td>
    <td><pre>= (alpha{2} | digit{3}) ;</pre></td>
    <td><a href="#unicode_region_subtag_validity">validity</a><br/>
        <a href="https://github.com/unicode-org/cldr/blob/maint/maint-41/common/validity/region.xml">latest-data</a></td>
</tr>
<tr>
    <td><a name="unicode_variant_subtag" href="#unicode_variant_subtag"><code>unicode_variant_subtag</code></a></td>
    <td><pre>= (alphanum{5,8}<br/>| digit alphanum{3}) ;</pre></td>
    <td><a href="#unicode_variant_subtag_validity">validity</a><br/>
        <a href="https://github.com/unicode-org/cldr/blob/maint/maint-41/common/validity/variant.xml">latest-data</a></td>
</tr>
   <tr><td><code>sep</code></td>     <td><pre>= [-_] ;</pre></td></tr>
<tr><td><code>digit</code></td>   <td><pre>= [0-9] ;</pre></td></tr>
<tr><td><code>alpha</code></td>   <td><pre>= [A-Z a-z] ;</pre></td></tr>
<tr><td><code>alphanum</code></td><td><pre>= [0-9 A-Z a-z] ;</pre></td></tr>
</tbody></table>

The semantics of the various subtags are explained in _Section 3.4 [Language Identifier Field Definitions](#Field_Definitions)_; there are also direct links from [`unicode_language_subtag`](#unicode_language_subtag), etc. While theoretically the [`unicode_language_subtag`](#unicode_language_subtag) may have more than 3 letters through the IANA registration process, in practice that has not occurred. The [`unicode_language_subtag`](#unicode_language_subtag) "und" may be omitted when there is a [`unicode_script_subtag`](#unicode_script_subtag); for that reason [`unicode_language_subtag`](#unicode_language_subtag) values with 4 letters are not permitted. However, such [`unicode_language_id`](#unicode_language_id) values are not intended for general interchange, because they are not valid BCP 47 tags. Instead, they are intended for certain protocols such as the identification of transliterators or font ScriptLangTag values. For more information on language subtags with 4 letters, see [BCP 47 Language Tag to Unicode BCP 47 Locale Identifier](#Language_Tag_to_Locale_Identifier).

For example, "en-US" (American English), "en_GB" (British English), "es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all valid Unicode language identifiers.

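The grammar above can be transcribed into a regular expression. The following Python sketch is one such transcription; the names `LANGUAGE_ID` and `is_well_formed_language_id` are hypothetical. It checks only syntactic well-formedness: it does not consult the CLDR validity data, so a well-formed identifier may still be invalid.

```python
import re

# Building blocks transcribed from the EBNF table (character classes are ASCII only).
SEP = r"[-_]"
LANGUAGE = r"(?:[A-Za-z]{2,3}|[A-Za-z]{5,8})"
SCRIPT = r"[A-Za-z]{4}"
REGION = r"(?:[A-Za-z]{2}|[0-9]{3})"
VARIANT = r"(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})"

LANGUAGE_ID = re.compile(
    rf"(?:root|(?:{LANGUAGE}(?:{SEP}{SCRIPT})?|{SCRIPT})"
    rf"(?:{SEP}{REGION})?"
    rf"(?:{SEP}{VARIANT})*)"
)


def is_well_formed_language_id(identifier):
    """True if the whole string matches the unicode_language_id grammar."""
    return LANGUAGE_ID.fullmatch(identifier) is not None


for sample in ["en-US", "en_GB", "es-419", "uz-Cyrl", "en--US"]:
    print(sample, is_well_formed_language_id(sample))
```

A stricter check would additionally look up each subtag in the validity files linked from the table.
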
### _<a name="Unicode_locale_identifier" href="#Unicode_locale_identifier">3.2 Unicode Locale Identifier</a>_

A _Unicode locale identifier_ is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in _Section 3.6 [Unicode BCP 47 U Extension](#u_Extension)_ and _Section 3.7 [Unicode BCP 47 T Extension](#BCP47_T_Extension)_. Other extensions and private use extensions are supported for pass-through. The following table defines syntactically _well-formed_ identifiers: they are not necessarily _valid_ identifiers. For additional validity criteria, see the links on the right.

As is often the case, the complete syntactic constraints are not easily captured by ABNF, so there is a further condition: There cannot be more than one extension with the same singleton (-a-, …, -t-, -u-, …). Note that the private use extension (-x-) must come after all other extensions.

|                                                                                                       | EBNF                                            | Validity / Comments |
| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------- | ------------------- |
| <a name="unicode_locale_id" href="#unicode_locale_id">`unicode_locale_id`</a>                         | `= unicode_language_id`<br/>  `extensions*`<br/>  `pu_extensions? ;` |
| <a name="extensions" href="#extensions">`extensions`</a>                                              | `= unicode_locale_extensions`<br/>`\| transformed_extensions`<br/>` \| other_extensions ;` |
| <a name="unicode_locale_extensions" href="#unicode_locale_extensions">`unicode_locale_extensions`</a> | `= sep [uU]`<br/>  `((sep keyword)+`<br/>  `\|(sep attribute)+ (sep keyword)*) ;` |
| <a name="transformed_extensions" href="#transformed_extensions">`transformed_extensions`</a>          | `= sep [tT]`<br/>  `((sep tlang (sep tfield)*)`<br/>  `\| (sep tfield)+) ;` |
| <a name="pu_extensions" href="#pu_extensions">`pu_extensions`</a>                                     | `= sep [xX]`<br/>`  (sep alphanum{1,8})+ ;` |
| <a name="other_extensions" href="#other_extensions">`other_extensions`</a>                            | `= sep [alphanum-[tTuUxX]]`<br/>`  (sep alphanum{2,8})+ ;` |
| `keyword`<br/>(Also known as `ufield`)                                                                | `= key (sep type)? ;` |
| `key`<br/>(Also known as `ukey`)                                                                      | `= alphanum alpha ;`<br/>(Note that this is narrower than in [[RFC6067](https://www.ietf.org/rfc/rfc6067.txt)], so that it is disjoint with tkey.) | [`validity`](#Key_Type_Definitions)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47) |
| `type`<br/>(Also known as `uvalue`)                                                                   | `= alphanum{3,8}`<br/>`  (sep alphanum{3,8})* ;` | [`validity`](#Key_Type_Definitions)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47) |
| `attribute`                                                                                           | `= alphanum{3,8} ;` |
| <a name="unicode_subdivision_id" href="#unicode_subdivision_id">`unicode_subdivision_id`</a>          | `= `[`unicode_region_subtag`](#unicode_region_subtag)` unicode_subdivision_suffix ;` | [`validity`](#unicode_subdivision_subtag_validity)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/validity/subdivision.xml) |
| `unicode_subdivision_suffix`                                                                          | `= alphanum{1,4} ;` |
| <a name="unicode_measure_unit" href="#unicode_measure_unit">`unicode_measure_unit`</a>                | `= alphanum{3,8}`<br/>`  (sep alphanum{3,8})* ;` | [`validity`](#Validity_Data)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/validity/unit.xml) |
| `tlang`                                                                                               | `= unicode_language_subtag`<br/>`  (sep unicode_script_subtag)?`<br/>`  (sep unicode_region_subtag)?`<br/>`  (sep unicode_variant_subtag)* ;` | same as in unicode_language_id |
| `tfield`                                                                                              | `= tkey tvalue;` | [`validity`](#BCP47_T_Extension)<br/>[`latest-data`](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47) |
| `tkey`                                                                                                | `= alpha digit ;` |
| `tvalue`                                                                                              | `= (sep alphanum{3,8})+ ;` |

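The additional condition on singletons, which the grammar does not express, can be checked separately. The Python sketch below assumes the identifier already matches the grammar in the table, so that any one-character subtag before `-x-` is an extension singleton; the function names are hypothetical. Subtags after `-x-` are private use content, so a one-character subtag there is not counted as a singleton.

```python
def extension_singletons(locale_id):
    """Return the extension singletons in order (lowercased), ending at 'x'."""
    subtags = locale_id.replace("_", "-").lower().split("-")
    singletons = []
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            singletons.append(subtag)
            if subtag == "x":
                break  # everything after -x- is private use content
    return singletons


def has_unique_singletons(locale_id):
    """True if no extension singleton occurs more than once."""
    singletons = extension_singletons(locale_id)
    return len(singletons) == len(set(singletons))


print(has_unique_singletons("en-u-ca-gregory-t-hi"))
print(has_unique_singletons("en-u-ca-gregory-u-nu-thai"))
```
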
For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass both types of data: for more information, see _[Section 3.10 Language and Locale IDs](#Language_and_Locale_IDs)_.

As of the release of this specification, there were no other_extensions defined. The other_extensions are present in the syntax to allow implementations to preserve that information.

As for terminology, the term _code_ may also be used instead of "subtag", and "territory" instead of "region". The primary language subtag is also called the _base language code_. For example, the base language code for "en-US" (American English) is "en" (English). The _type_ may also be referred to as a _value_ or _key-value_.

The identifiers can vary in case and in the separator characters. The "-" and "\_" separators are treated as equivalent, although "-" is preferred.

All identifier field values are case-insensitive. Although case distinctions do not carry any special meaning, an implementation of LDML should use the casing recommendations in [[BCP47](#BCP47)], especially when a Unicode locale identifier is used for locale data exchange in software protocols.

347#### 3.2.1 <a name="Canonical_Unicode_Locale_Identifiers" href="#Canonical_Unicode_Locale_Identifiers">Canonical Unicode Locale Identifiers</a>
348
349A [`unicode_locale_id`](#unicode_locale_id) has _canonical syntax_ when:
350
351* It starts with a language subtag (those beginning with a script subtag are only for specialized use)
352* Casing
353  * Any script subtag inside unicode_language_id is in title case (eg, Hant)
354  * Any region subtag inside unicode_language_id is in uppercase (eg, DE)
355  * All other subtags are in lowercase (eg, en, fonipa)
356* Order
357  * Any variants are in alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)
358  * Any extensions are in alphabetical order by their singleton (eg, en-t-xxx-u-yyy, not en-u-yyy-t-xxx)
359  * All attributes are sorted in alphabetical order.
360  * All keywords and tfields are sorted by alphabetical order of their keys, within their respective extensions.
361  * Any type or tfield value "true" is removed.
362
363For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes `"foo"` and `"bar"` in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification.
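
The casing and ordering rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `canonical_syntax` is hypothetical, the input is assumed to be well-formed and to start with a language subtag, and only the language identifier plus an optional -u- extension are handled (no -t-, other extensions, or private use). It does not replace aliased subtags, so it produces canonical syntax, not canonical form.

```python
import re


def canonical_syntax(locale_id: str) -> str:
    """Sketch of the casing and ordering rules (language id and -u- only)."""
    subtags = locale_id.replace("_", "-").lower().split("-")

    # Split off the -u- extension, if present.
    if "u" in subtags[1:]:
        cut = subtags.index("u", 1)
        lang_part, u_part = subtags[:cut], subtags[cut + 1:]
    else:
        lang_part, u_part = subtags, []

    # Language id: language, optional script, optional region, variants.
    language, rest = lang_part[0], lang_part[1:]
    script = region = None
    if rest and re.fullmatch(r"[a-z]{4}", rest[0]):
        script = rest.pop(0).title()        # title case, e.g. Hant
    if rest and re.fullmatch(r"[a-z]{2}|[0-9]{3}", rest[0]):
        region = rest.pop(0).upper()        # uppercase, e.g. DE
    variants = sorted(rest)                 # alphabetical order

    # -u- extension: attributes come before the first two-character key.
    attributes, keywords, key = [], {}, None
    for subtag in u_part:
        if len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif key is None:
            attributes.append(subtag)
        else:
            keywords[key].append(subtag)

    out = [language] + [s for s in (script, region) if s] + variants
    if attributes or keywords:
        out.append("u")
        out += sorted(attributes)
        for key in sorted(keywords):
            out.append(key)
            if keywords[key] != ["true"]:   # the type "true" is removed
                out += keywords[key]
    return "-".join(out)
```

With these assumptions, `canonical_syntax("en-u-foo-bar-nu-thai-ca-buddhist-kk-true")` gives the canonical form quoted above.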
364
NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in [Section 4.1](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1) of BCP 47. Here are the considerations that led to that decision:
366  * The ordering in Section 4.1 is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required.
367  * Moreover, [Section 4.5](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.5) states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.”
368  * The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback.
369  * Robust implementations will accept the variants in any order, just as they accept extensions in any order.
370  * A canonical order allows for determination of identity of identifiers via string comparison.
  * The ordering in Section 4.1 does not result in a deterministic order for canonicalization, since the mechanism for determining “importance” is not specified: ca-valencia-fonipa and ca-fonipa-valencia could both be ‘canonical’ variants of one another.
  * Pure alphabetical order is deterministic and simple to implement, while the ordering in Section 4.1 is nondeterministic, more complex, and provides no significant benefit in modern applications.
373
374**Note:** The current version of CLDR data uses some non-preferred _syntax_ for backward compatibility. This might be changed in future CLDR releases.
375
376  * It uses uppercase letters for variant subtags, while the preferred forms are all lowercase.
377  * It uses "\_" as the separator, while the preferred form of the separator is "-".
378  * It uses "root", while the preferred form is "und".
379
380A [`unicode_locale_id`](#unicode_locale_id) is in _canonical form_ when it has canonical syntax and contains no aliased subtags. A [`unicode_locale_id`](#unicode_locale_id) can be transformed into canonical form according to [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization).
381
382A [`unicode_locale_id`](#unicode_locale_id) is _maximal_ when the [`unicode_language_id`](#unicode_language_id) and tlang (if any) have been transformed by the Add Likely Subtags operation in _Section 4.3 [Likely Subtags](#Likely_Subtags)_, excluding "und".
383
> _Example:_ the maximal form of ja-Kana-t-it is ja-Kana-JP-t-it-latn-it
385
386Note that the _latn_ and final _it_ don't use any uppercase characters, since they are not inside unicode_language_id.
387
388Two [`unicode_locale_ids`](#unicode_locale_id) are _equivalent_ when their maximal canonical forms are identical.
389
390> _Example:_ "IW-HEBR-u-ms-imperial" ~ "he-u-ms-uksystem"
391
392The equivalence relationship may change over time, such as when subtags are deprecated or likely subtag mappings change. For example, if two countries were to merge, then various subtags would become deprecated. These kinds of changes are generally very infrequent.
393
394
395### 3.3 <a name="BCP_47_Conformance" href="#BCP_47_Conformance">BCP 47 Conformance</a>
396
397Unicode language and locale identifiers inherit the design and the repertoire of subtags from [[BCP47](#BCP47)] Language Tags. There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR:
398
399* It does not allow for the full syntax of [[BCP47](#BCP47)]:
400  * No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47 [Section 4.5](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.5) and [Section 3.1.7](https://www.rfc-editor.org/rfc/rfc5646.html#section-3.1.7))
401  * No irregular BCP 47 legacy language tags (marked as “Type: grandfathered” in BCP 47) are allowed (these are all deprecated in BCP 47)
402  * A tag must not start with the subtag "x": thus a _privateuse_ (eg x-abc) can only be after a language subtag, like "und"
403* It allows for certain semantic additions and constraints:
404  * Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML
405  * Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see [Section 4.1.2](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1.2))
406* It allows certain syntax for backwards compatibility (not BCP 47-compatible):
407  * The "\_" character for field separator characters, as well as the "-" used in [[BCP47](#BCP47)] (however, the canonical form is with "-")
408  * The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead)
409  * The language tag may begin with a script subtag rather than a language subtag. This is specialized use only, and not required for CLDR conformance.
410
411There are thus two subtypes of Unicode locale identifiers:
412
413* the term _Unicode CLDR locale identifier_ applies where the backwards compatibility syntax is used.
414* the term _Unicode BCP 47 locale identifier_ applies otherwise. A _Unicode BCP 47 locale identifier_ is also a valid BCP 47 language tag.
415
416#### 3.3.1 <a name="BCP_47_Language_Tag_Conversion" href="#BCP_47_Language_Tag_Conversion">BCP 47 Language Tag Conversion</a>
417
418The different identifiers can be converted to one another as described in this section.
419
420A valid [[BCP47](#BCP47)] language tag can be converted to a valid Unicode BCP 47 locale identifier according to [Annex C. LocaleId Canonicalization](#LocaleId_Canonicalization).
421
422The result is a Unicode BCP 47 locale identifier, in canonical form. It is both a BCP 47 language tag and a Unicode locale identifier. Because the process maps from all BCP 47 language tags into a subset of BCP 47 language tags, the format changes are not reversible, much as a lowercase transformation of the string “McGowan” is not reversible.
423
424###### Table: <a name="Language_Tag_to_Locale_Identifier" href="#Language_Tag_to_Locale_Identifier">BCP 47 Language Tag to Unicode BCP 47 Locale Identifier</a> Examples
425
426| BCP 47 language tag | Unicode BCP 47 locale identifier | Comments |
427| ------------------- | -------------------------------- | -------- |
428| `en-US`             | `en-US`                          | no changes |
429| `iw-FX`             | `he-FR`                          | BCP 47 canonicalization  |
430| `cmn-TW`            | `zh-TW`                          | language alias  |
431| `zh-cmn-TW`         | `zh-TW`                          | BCP 47 canonicalization, then language alias  |
432| `sr-CS`             | `sr-RS`                          | territory alias  |
433| `sh`                | `sr-Latn`                        | multiple replacement subtags  |
| `sh-Cyrl`           | `sr-Cyrl`                        | multiple replacement subtags, but the existing script subtag is kept, so `Latn` is not added |
435| `hy-SU`             | `hy-AM`                          | multiple territory values <br/>`<territoryAlias type="SU" replacement="RU AM AZ BY EE GE KZ KG LV LT MD TJ TM UA UZ" …/>` |
436| `i-enochian`        | `und-x-i-enochian`               | prefix any legacy language tags (marked as “Type: grandfathered” in BCP 47) with "und-x-"  |
437| `x-abc`             | `und-x-abc`                      | prefix with "und-", so that there is always a base language subtag  |
438
439##### <a name="Unicode_Locale_Identifier_CLDR_to_BCP_47" href="#Unicode_Locale_Identifier_CLDR_to_BCP_47">Unicode Locale Identifier: CLDR to BCP 47</a>
440
441A Unicode CLDR locale identifier can be converted to a valid [[BCP47](#BCP47)] language tag (which is also a Unicode BCP 47 locale identifier) by performing the following transformation.
442
4431.  Replace the "\_" separators with "-"
4442.  Replace the special language identifier "root" with the BCP 47 primary language tag "und"
4453.  Add an initial "und" primary language subtag if the first subtag is a script.
446
447_Examples:_
448
449| Unicode CLDR locale identifier | BCP 47 language tag  | Comments               |
450| ------------------------------ | -------------------- | ---------------------- |
451| `en_US`                        | `en-US`              | change separator       |
452| `de_DE_u_co_phonebk`           | `de-DE-u-co-phonebk` | change separator       |
453| `root`                         | `und`                | change to "und"        |
454| `root_u_cu_usd`                | `und-u-cu-usd`       | change to "und"        |
455| `Latn_DE`                      | `und-Latn-DE`        | add "und"              |
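
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `cldr_to_bcp47` is hypothetical, the input is assumed to be a well-formed Unicode CLDR locale identifier, and a leading script subtag is recognized simply as four ASCII letters other than "root".

```python
import re


def cldr_to_bcp47(cldr_id: str) -> str:
    """Convert a Unicode CLDR locale identifier to a BCP 47 language tag."""
    # Step 1: replace the "_" separators with "-".
    subtags = cldr_id.replace("_", "-").split("-")
    if subtags[0].lower() == "root":
        # Step 2: replace "root" with "und".
        subtags[0] = "und"
    elif re.fullmatch(r"[A-Za-z]{4}", subtags[0]):
        # Step 3: the first subtag is a script, so add an initial "und".
        subtags.insert(0, "und")
    return "-".join(subtags)
```

Applied to the rows of the table, this sketch returns the BCP 47 column, for example `cldr_to_bcp47("Latn_DE")` gives `und-Latn-DE`.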
456
457##### <a name="Unicode_Locale_Identifier_BCP_47_to_CLDR" href="#Unicode_Locale_Identifier_BCP_47_to_CLDR">Unicode Locale Identifier: BCP 47 to CLDR</a>
458
459A Unicode BCP 47 locale identifier can be transformed into a Unicode CLDR locale identifier by performing the following transformation.
460
1.  Change the separator to "\_".
2.  Replace the primary language subtag "und" with "root" if no script, region, or variant subtags are present.
463
464_Examples:_
465
466| BCP 47 language tag | Unicode CLDR locale identifier | Comments |
467| ------------------- | ------------------------------ | -------- |
468| `en-US`             | `en_US`                        | changes separator |
| `und`               | `root`                         | changes to "root", because no script, region, or variant subtag is present |
| `und-US`            | `und_US`                       | no change to "und", because a region subtag is present |
| `und-u-cu-USD`      | `root_u_cu_usd`                | changes to "root", because no script, region, or variant subtag is present |
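
The reverse transformation can be sketched in Python in the same way. The function name `bcp47_to_cldr` is hypothetical, and the sketch assumes a well-formed input whose language identifier ends at the first single-character subtag. It changes only the separator and "und"; it performs no case normalization.

```python
def bcp47_to_cldr(tag: str) -> str:
    """Convert a Unicode BCP 47 locale identifier to the CLDR form."""
    subtags = tag.split("-")

    # Collect the language identifier: everything before the first singleton.
    language_id = []
    for subtag in subtags:
        if len(subtag) == 1:
            break
        language_id.append(subtag)

    # "und" becomes "root" only when no script, region, or variant follows it.
    if [s.lower() for s in language_id] == ["und"]:
        subtags[0] = "root"

    return "_".join(subtags)
```

For instance, `bcp47_to_cldr("und-US")` keeps "und" because a region subtag is present, while `bcp47_to_cldr("und-u-cu-usd")` gives `root_u_cu_usd`.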
472
473##### Truncation
474
BCP 47 requires that implementations allow for language tags of at least 35 characters, in [Section 4.4.1](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.4.1).
476To allow for use of extensions, CLDR extends that minimum to 255 for Unicode locale identifiers.
477Theoretically, a language tag could be far longer, due to the possibility of a large number of variants and extensions.
478In practice, the typical size of a locale or language identifier will be much smaller, so implementations can optimize for smaller sizes, as long as there is an escape mechanism allowing for up to 255.
479
480### 3.4 <a name="Field_Definitions" href="#Field_Definitions">Language Identifier Field Definitions</a>
481
482Unicode language and locale identifier field values are provided in the following table. Note that some private-use BCP 47 field values are given specific meanings in CLDR. While field values are based on [[BCP47](#BCP47)] subtag values, their validity status in CLDR is specified by means of machine-readable files in the [common/validity/](https://github.com/unicode-org/cldr-staging/tree/main/production/common/validity) subdirectory, such as language.xml. For the format of those files and more information, see _[Section 3.11 Validity Data](#Validity_Data)_.
483
484#### <a name="unicode_language_subtag_validity" href="#unicode_language_subtag_validity">`unicode_language_subtag`</a> (also known as a _Unicode base language code_)
485
486Subtags in the language.xml file (see _Section 3.11 [Validity Data](#Validity_Data)_ ). These are based on [[BCP47](#BCP47)] subtag values marked as **Type: language**
487
488ISO 639-3 introduces the notion of "macrolanguages", where certain ISO 639-1 or ISO 639-2 codes are given broad semantics, and additional codes are given for the narrower semantics. For backwards compatibility, Unicode language identifiers retain use of the narrower semantics for these codes. For example:
489
490| For                         | Use   | _Not_ |
491| --------------------------- | ----- | ----- |
492| Standard Chinese (Mandarin) | `zh`  | `cmn` |
493| Standard Arabic             | `ar`  | `arb` |
494| Standard Malay              | `ms`  | `zsm` |
495| Standard Swahili            | `sw`  | `swh` |
496| Standard Uzbek              | `uz`  | `uzn` |
497| Standard Konkani            | `kok` | `knn` |
498| Northern Kurdish            | `ku`  | `kmr` |
499
500If a language subtag matches the `type` attribute of a `languageAlias` element, then the replacement value is used instead. For example, because "swh" occurs in `<languageAlias type="swh" replacement="sw" />` , "sw" must be used instead of "swh". Thus Unicode language identifiers use "ar-EG" for Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW" for Mandarin Chinese (Taiwan), not "cmn-TW".
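
A simplified Python sketch of this replacement is shown below. The mapping is a small hand-written sample taken from the table above, not the full CLDR alias data, and the names `LANGUAGE_ALIASES` and `replace_language_alias` are hypothetical. It assumes "-" separators and one-to-one replacements of the language subtag only; aliases whose replacement contains several subtags need additional handling.

```python
# Sample only: the real data comes from the languageAlias elements in CLDR.
LANGUAGE_ALIASES = {
    "cmn": "zh",
    "arb": "ar",
    "zsm": "ms",
    "swh": "sw",
    "uzn": "uz",
    "knn": "kok",
    "kmr": "ku",
}


def replace_language_alias(tag: str) -> str:
    """Replace an aliased language subtag, leaving the other subtags as-is."""
    language, sep, rest = tag.partition("-")
    replacement = LANGUAGE_ALIASES.get(language.lower(), language)
    return replacement + sep + rest
```

With this sample data, `replace_language_alias("arb-EG")` gives `ar-EG` and `replace_language_alias("cmn-TW")` gives `zh-TW`.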
501
502The private use codes listed as **excluded** in _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)_ will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.
503
504The CLDR provides data for normalizing language/locale codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US"; see the **[Aliases](https://unicode-org.github.io/cldr-staging/charts/38/supplemental/aliases.html)** Chart.
505
506The following are special language subtags:
507
508|       | Name                  | Comment |
509| ----- | --------------------- | ------- |
510| `mis` | Uncoded languages     | The content is in a language that doesn't yet have an ISO 639 code. |
511| `mul` | Multiple languages    | The content contains more than one language or text that is simultaneously in multiple languages (such as brand names). |
| `zxx` | No linguistic content | The content is not in any particular language (such as images, symbols, etc.). |
513
514#### <a name="unicode_script_subtag_validity" href="#unicode_script_subtag_validity">`unicode_script_subtag`</a> (also known as a _Unicode script code_)
515
516Subtags in the script.xml file (see _Section 3.11 [Validity Data](#Validity_Data)_). These are based on [[BCP47](#BCP47)] subtag values marked as **Type: script**
517
In most cases the script is not necessary, since the language is customarily written in only a single script. Examples of cases where it is used are:
519
520| Subtag    | Description |
521| --------- | ----------- |
522| `az_Arab` | Azerbaijani in Arabic script |
523| `az_Cyrl` | Azerbaijani in Cyrillic script |
524| `az_Latn` | Azerbaijani in Latin script |
525| `zh_Hans` | Chinese, in simplified script (=zh, zh-Hans, zh-CN, zh-Hans-CN) |
526| `zh_Hant` | Chinese, in traditional script |
527
528Unicode identifiers give specific semantics to certain Unicode Script values. For more information, see also [[UAX24](https://www.unicode.org/reports/tr41/#UAX24)]:
529
<!-- HTML: rowspan, colspan -->
531<table><tbody>
532<tr><td><code>Qaag</code></td>
533    <td>Zawgyi</td>
534    <td colspan="2">Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration.</td></tr>
535<tr><td><code>Qaai</code></td>
536    <td>Inherited</td>
537    <td colspan="2"><b>deprecated</b>: the <i>canonicalized</i> form is Zinh</td></tr>
538<tr><td><code>Zinh</code></td>
539    <td>Inherited</td>
540    <td colspan="2">&nbsp;</td></tr>
541<tr><td><code>Zsye</code></td>
542    <td>Emoji Style</td>
543    <td colspan="2">Prefer emoji style for characters that have both text and emoji styles available.</td></tr>
544<tr><td><code>Zsym</code></td>
545    <td>Text Style</td>
546    <td colspan="2">Prefer text style for characters that have both text and emoji styles available.</td></tr>
547<tr><td rowspan="7"><code>Zxxx</code></td>
548    <td rowspan="7">Unwritten</td>
549    <td colspan="2">Indicates spoken or otherwise unwritten content. For example:</td></tr>
550
551<tr><th>Sample(s)</th><th>Description</th></tr>
552<tr><td>uz</td><td>either written or spoken content</td></tr>
553<tr><td>uz-Latn <i>or</i> uz-Arab</td><td>written-only content (particular script)</td></tr>
554<tr><td>uz-Zyyy</td><td>written-only content (unspecified script)</td></tr>
555<tr><td>uz-Zxxx</td><td>spoken-only content</td></tr>
556<tr><td>uz-Latn, uz-Zxxx</td><td>both specific written and spoken content (using a <i>language list</i>)</td></tr>
557
558<tr><td><code>Zyyy</code></td>
559    <td>Common</td>
560    <td colspan="2">&nbsp;</td></tr>
561<tr><td><code>Zzzz</code></td>
562    <td>Unknown</td>
563<td colspan="2">&nbsp;</td></tr>
564</tbody></table>
565
566The private use subtags listed as **excluded** in _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)_ will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.
567
568#### <a name="unicode_region_subtag_validity" href="#unicode_region_subtag_validity">`unicode_region_subtag`</a> (also known as a _Unicode region code,_ or a _Unicode territory code_)
569
570Subtags in the region.xml file (see _Section 3.11 [Validity Data](#Validity_Data)_). These are based on [[BCP47](#BCP47)] subtag values marked as **Type: region**
571
572Unicode identifiers give specific semantics to the following subtags.
573(The alpha2 codes are used as Unicode region subtags. The alpha3 and numeric codes are derived according to _Section 3.5.2 [Numeric Codes](#Numeric_Codes)_ and listed here for additional documentation.)
574
575| alpha2 | alpha3 | num | Name                         | Comment | ISO 3166-1 status |
576| ------ | ------ | --- | ---------------------------- | ------- | ----------------- |
577| `QO`   | `QOO`  | 961 | Outlying Oceania             | countries in Oceania [009] that do not have a [subcontinent](https://unicode-org.github.io/cldr-staging/charts/38/supplemental/territory_containment_un_m_49.html). | private use |
578| `QU`   | `QUU`  | 967 | European Union               | **deprecated**: the _canonicalized_ form is EU | private use |
579| `UK`   | -      | -   | United Kingdom               | **deprecated**: the _canonicalized_ form is GB | exceptionally reserved |
| `XA`   | `XAA`  | 973 | Pseudo-Accents               | special code indicating a derived testing locale with English text that has added accents and is lengthened | private use |
581| `XB`   | `XBB`  | 974 | Pseudo-Bidi                  | special code indicating derived testing locale with forced RTL English | private use |
582| `XK`   | `XKK`  | 983 | Kosovo                       | industry practice | private use |
583| `ZZ`   | `ZZZ`  | 999 | Unknown or Invalid Territory | used in APIs or as replacement for invalid code | private use |
584
585
586The private use subtags listed as **excluded** in _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)_ will normally never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications. However, LDML may follow widespread industry practice in the use of some of these codes, such as for XK.
587
588The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".
589
590Special Codes:
591
592* The territory code 'UK' has a special status in ISO, and is used for the domain name instead of GB. It is thus recognized by CLDR as being an alternate (unnormalized) form of 'GB'.
593* The territory code '001' (the World) is used to indicate a standardized form, such as "ar-001" for Modern Standard Arabic.
594
595#### <a name="unicode_variant_subtag_validity" href="#unicode_variant_subtag_validity">`unicode_variant_subtag`</a> (also known as a _Unicode language variant code_)
596
597Subtags in the variant.xml file (see _Section 3.11 [Validity Data](#Validity_Data)_). These are based on [[BCP47](#BCP47)] subtag values marked as **Type: variant**. The sequence of variant tags must not have any duplicates: thus de-1996-fonipa-1996 is invalid, while de-1996-fonipa and de-fonipa-1996 are both valid.
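
The duplicate check can be sketched in Python as follows. The function name `variants_are_unique` is hypothetical, and the sketch assumes that the variant subtags have already been extracted from the identifier. It checks only for duplicates, not whether each variant is listed in variant.xml.

```python
def variants_are_unique(variants: list[str]) -> bool:
    """Return True if no variant subtag occurs more than once (ignoring case)."""
    lowered = [variant.lower() for variant in variants]
    return len(set(lowered)) == len(lowered)
```

For example, the variants of de-1996-fonipa-1996 fail this check, while those of de-1996-fonipa and de-fonipa-1996 pass.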
598
CLDR provides data for normalizing variant codes. For the handling of the "POSIX" variant, see _Section 3.8.2, [Legacy Variants](#Legacy_Variants)_.
600
601_Examples:_
602
603```
604en
605fr_BE
606zh-Hant-HK
607```
608
609_Deprecated_ codes—such as QU above—are valid, but strongly discouraged.
610
A locale that only has a language subtag (and optionally a script subtag) is called a _language locale_; one with both language and territory subtags is called a _territory locale_ (or _country locale_).
612
613### 3.5 <a name="Special_Codes" href="#Special_Codes">Special Codes</a>
614
615#### 3.5.1 <a name="Unknown_or_Invalid_Identifiers" href="#Unknown_or_Invalid_Identifiers">Unknown or Invalid Identifiers</a>
616
617The following identifiers are used to indicate an unknown or invalid code in Unicode language and locale identifiers. For Unicode identifiers, the region code uses a private use ISO 3166 code, and Time Zone code uses an additional code; the others are defined by the relevant standards. When these codes are used in APIs connected with Unicode identifiers, the meaning is that either there was no identifier available, or that at some point an input identifier value was determined to be invalid or ill-formed.
618
619| Code Type   | Value  | Description in Referenced Standards |
620| ----------- | ------ | ----------------------------------- |
621| Language    | `und`  | Undetermined language, also used for “root” |
622| Script      | `Zzzz` | Code for uncoded script, Unknown [[UAX24](https://www.unicode.org/reports/tr41/#UAX24)] |
623| Region      | `ZZ`   | Unknown or Invalid Territory |
624| Currency    | `XXX`  | The codes assigned for transactions where no currency is involved |
625| Time Zone   | `unk`  | Unknown or Invalid Time Zone |
626| Subdivision | _\<region>zzzz_ | Unknown or Invalid Subdivision |
627
When only the script or region is known, a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek script; "und_US" represents the US territory.
629
630#### 3.5.2 <a name="Numeric_Codes" href="#Numeric_Codes">Numeric Codes</a>
631
For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which are the codes 900-999 (total: 100), and AAA-AAZ, QMA-QZZ, XAA-XZZ, and ZZA-ZZZ (total: 1092). Unicode identifiers supply a standard mapping to these: for the numeric codes, they use the top of the numeric private use range; for the 3-letter codes, they double the final letter. These are the resulting mappings for all of the private use region codes:
633
634| Region   | UN/ISO Numeric | ISO 3-Letter |
635| -------- | -------------- | ------------ |
636| `AA`     | `958`          | `AAA`        |
637| `QM..QZ` | `959..972`     | `QMM..QZZ`   |
638| `XA..XZ` | `973..998`     | `XAA..XZZ`   |
639| `ZZ`     | `999`          | `ZZZ`        |
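
The mapping in this table can be sketched in Python as follows. The function name `private_use_region_codes` is hypothetical; the sketch covers only the private use region codes listed in the table and raises an error for any other input.

```python
def private_use_region_codes(alpha2: str) -> tuple[int, str]:
    """Return (numeric code, 3-letter code) for a private use region code."""
    code = alpha2.upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a two-letter region code: {alpha2!r}")
    first, second = code

    if code == "AA":
        numeric = 958
    elif first == "Q" and "M" <= second <= "Z":
        numeric = 959 + ord(second) - ord("M")   # QM..QZ -> 959..972
    elif first == "X":
        numeric = 973 + ord(second) - ord("A")   # XA..XZ -> 973..998
    elif code == "ZZ":
        numeric = 999
    else:
        raise ValueError(f"not a private use region code: {alpha2!r}")

    # The 3-letter code doubles the final letter.
    return numeric, code + second
```

For example, `private_use_region_codes("XK")` gives `(983, "XKK")`, which agrees with the entry for Kosovo in the region subtag table.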
640
641For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):
642
643| Script       | Numeric    |
644| ------------ | ---------- |
645| `Qaaa..Qabx` | `900..949` |
646
647#### 3.5.3 <a name="Private_Use_Codes" href="#Private_Use_Codes">Private Use Codes</a>
648
649Private use codes fall into three groups.
650
651*   **defined:** those that are given particular semantics currently in CLDR
652*   **reserved:** those that may be given particular semantics in future versions of CLDR
653*   **excluded:** those that will never be given particular CLDR semantics in the future, and thus can normally be used by applications without worrying about collisions. However, CLDR may follow widespread industry practice in the use of some of these codes, such as for XA, XB, and XK.
654
655###### Table: <a name="Private_Use_CLDR" href="#Private_Use_CLDR">Private Use Codes in CLDR</a>
656
657| category      | status   | codes |
658| ------------- | -------- | ----- |
659| base language | defined  | none  |
660|               | reserved | qaa..qfy |
661|               | excluded | qfz..qtz |
662| script        | defined  | Qaai (obsolete), Qaag |
663|               | reserved | Qaaa..Qaaf Qaah Qaaj..Qaap |
664|               | excluded | Qaaq..Qabx |
665| region        | defined  | QO, QU, UK, XA, XB, XK, ZZ |
666|               | reserved | AA QM..QN QP..QT QV..QZ |
667|               | excluded | XC..XJ, XL..XZ |
668| timezone      | defined  | IANA: Etc/Unknown<br/>bcp47: as listed in bcp47/timezone.xml |
669|               | reserved | bcp47: all non-5 letter codes not starting with x |
670|               | excluded | bcp47: all non-5 letter codes starting with x |
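
As an illustration, the base language and region rows of this table can be expressed as lookup functions. This is a Python sketch with hypothetical names; it hard-codes the ranges from the table for this version of the specification and returns `None` for codes that the table does not list.

```python
from typing import Optional

# Region codes that currently have CLDR semantics, copied from the table.
DEFINED_REGIONS = {"QO", "QU", "UK", "XA", "XB", "XK", "ZZ"}


def region_private_use_status(code: str) -> Optional[str]:
    """Classify a region code as 'defined', 'reserved', or 'excluded'."""
    code = code.upper()
    if code in DEFINED_REGIONS:
        return "defined"
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        return None
    if code == "AA" or (code[0] == "Q" and "M" <= code[1] <= "Z"):
        return "reserved"    # AA, QM..QZ except the defined QO and QU
    if code[0] == "X":
        return "excluded"    # XA..XZ except the defined XA, XB, and XK
    return None


def language_private_use_status(code: str) -> Optional[str]:
    """Classify a base language code as 'reserved' or 'excluded'."""
    code = code.lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        return None
    if "qaa" <= code <= "qfy":
        return "reserved"
    if "qfz" <= code <= "qtz":
        return "excluded"
    return None
```

The string comparisons rely on all codes being three lowercase ASCII letters, so that lexicographic order matches the order of the ranges in the table.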
671
672See also _Section 3.5.1 [Unknown or Invalid Identifiers](#Unknown_or_Invalid_Identifiers)_.
673
674<a name="Locale_Extension_Key_and_Type_Data"></a>
675### 3.6 <a name="u_Extension" href="#u_Extension">Unicode BCP 47 U Extension</a>
676
677[[BCP47](#BCP47)] Language Tags provides a mechanism for extending language tags for use in various applications by extension subtags. Each extension subtag is identified by a single alphanumeric character subtag assigned by IANA.
678
679The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [[RFC6067](#RFC6067)] and extension 't' for transformed content [[RFC6497](#RFC6497)]. The Unicode BCP 47 extension data defines the complete list of valid subtags.
680
These subtags are all in lowercase, which is their canonical casing; however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions consist of two to eight alphanumeric characters and meet the rule `extension` in [[BCP47](#BCP47)].
682
**The -u- Extension.** The syntax of 'u' extension subtags is defined by the rule `unicode_locale_extensions` in [Section 3.2 Unicode locale identifier](#Unicode_locale_identifier), except that the subtag separator `sep` must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag.
684
A 'u' extension may contain multiple `attribute`s or `keyword`s as defined in [Section 3.2 Unicode locale identifier](#Unicode_locale_identifier). The canonical syntax is defined as in [Canonical Unicode Locale Identifiers](#Canonical_Unicode_Locale_Identifiers).
686
687_See also [Unicode Extensions for BCP 47](https://cldr.unicode.org/index/bcp47-extension) on the CLDR site._
688
689#### 3.6.1 <a name="Key_And_Type_Definitions_" href="#Key_And_Type_Definitions_">Key And Type Definitions</a>
690
691The following chart contains a set of U extension key values that are currently available, with a description or sampling of the U extension type values. Each category is associated with an XML file in the bcp47 directory.
692
693For the complete list of valid keys and types defined for Unicode locale extensions, see [Section 3.6.4 U Extension Data Files](#Unicode_Locale_Extension_Data_Files). For information on the process for adding new _key_/_type_, see [[LocaleProject](#localeProject)].
694
695Most type values are represented by a single subtag in the current version of CLDR. There are exceptions, such as types used for key "ca" (calendar) and "kr" (collation reordering). If the type is not included, then the type value "true" is assumed. Note that the default for key with a possible "true" value is often "false", but may not always be. Note also that "true"/"True" is not a valid script code, since [the ISO 15924 Registration Authority has exceptionally reserved it](https://www.unicode.org/iso15924/codelists.html), which means that it will not be assigned for any purpose.
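
The defaulting rule can be sketched in Python as follows. The function name `parse_u_extension` is hypothetical, and the sketch assumes a well-formed BCP 47 tag with "-" separators and no private use section before the -u- extension. It reads attributes and keywords, joins multi-subtag types with "-", and supplies "true" when a key has no type. It does not check that the keys or types are valid.

```python
def parse_u_extension(tag: str) -> tuple[list[str], dict[str, str]]:
    """Return (attributes, keywords) found in the -u- extension of a tag."""
    subtags = tag.lower().split("-")
    attributes: list[str] = []
    types: dict[str, list[str]] = {}
    if "u" not in subtags[1:]:
        return attributes, {}

    key = None
    for subtag in subtags[subtags.index("u", 1) + 1:]:
        if len(subtag) == 1:
            break                       # the next singleton ends the extension
        if len(subtag) == 2:
            key = subtag                # a two-character subtag starts a keyword
            types[key] = []
        elif key is None:
            attributes.append(subtag)   # attributes precede the first key
        else:
            types[key].append(subtag)

    # A key without a type has the assumed type "true".
    keywords = {k: "-".join(v) or "true" for k, v in types.items()}
    return attributes, keywords
```

For example, `parse_u_extension("en-u-kn-co-phonebk")` reports the type "true" for "kn", and `parse_u_extension("ar-u-ca-islamic-civil")` reports the two-subtag type "islamic-civil" for "ca".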
696
Note that canonicalization does not change invalid locales to valid locales. For example, und-u-ka-true canonicalizes to und-u-ka, but:
698
1. "und-u-ka-true" is invalid, since ‘true’ is not a valid value for ka
2. "und-u-ka" is invalid, since the value “true” is assumed whenever there is no value, and ‘true’ is not a valid value for ka
701
702The BCP 47 form for keys and types is the canonical form, and recommended. Other aliases are included for backwards compatibility.
703
704###### Table: <a name="Key_Type_Definitions" href="#Key_Type_Definitions">Key/Type Definitions</a>
705
706<!-- HTML: rowspan, colspan -->
707<table><tbody>
708<tr><th>key<br>(old key name)</th><th>key description</th><th>example type<br>(old type name)</th><th>type description</th></tr>
709
710<tr><td colspan="4"><b>A <a name="UnicodeCalendarIdentifier" id="UnicodeCalendarIdentifier" href="#UnicodeCalendarIdentifier">Unicode Calendar Identifier</a> defines a type of calendar. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="ca" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/calendar.xml" target="_blank">calendar.xml</a></b>.</td></tr>
711<tr><td rowspan="10">"ca"<br>(calendar)</td>
712    <td rowspan="10">Calendar algorithm<br><br><i>(For information on the calendar algorithms associated with the data used with these, see [<a href="#Calendars">Calendars</a>].)</i></td>
713            <td>"buddhist"</td>
714            <td>Thai Buddhist calendar (same as Gregorian except for the year)</td></tr>
715        <tr><td>"chinese"</td>
716            <td>Traditional Chinese calendar</td></tr>
717        <tr><td colspan="2">…</td></tr>
718        <tr><td>"gregory"<br>(gregorian)</td>
719            <td>Gregorian calendar</td></tr>
720        <tr><td colspan="2">…</td></tr>
721        <tr><td>"islamic"</td>
722            <td>Islamic calendar</td></tr>
723        <tr><td>"islamic-civil"</td>
724            <td>Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - civil epoch)</td></tr>
725        <tr><td>"islamic-umalqura"</td>
726            <td>Islamic calendar, Umm al-Qura</td></tr>
727        <tr><td colspan="2">…</td></tr>
728        <tr><td colspan="2"><b>Note:</b> <i>Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura".</i></td></tr>
729
730<tr><td colspan="4"><b>A <a name="UnicodeCurrencyFormatIdentifier" id="UnicodeCurrencyFormatIdentifier" href="#UnicodeCurrencyFormatIdentifier">Unicode Currency Format Identifier</a> defines a style for currency formatting. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="cf" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/currency.xml" target="_blank">currency.xml</a></b>.</td></tr>
731<tr><td rowspan="2">"cf"</td>
732    <td rowspan="2">Currency Format style</td>
733        <td>"standard"</td><td>Negative numbers use the minusSign symbol (the default).</td></tr>
734        <tr><td>"account"</td><td>Negative numbers use parentheses or equivalent.</td></tr>
735
736<tr><td colspan="4"><b>A <a name="UnicodeCollationIdentifier" id="UnicodeCollationIdentifier" href="#UnicodeCollationIdentifier">Unicode Collation Identifier</a> defines a type of collation (sort order). The valid values are those <i>name</i> attribute values in the <i>type</i> elements of bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/collation.xml" target="_blank">collation.xml</a></b>.</td></tr>
737<tr><td colspan="4"><i>For information on each collation setting parameter, from <b>ka</b> to <b>vt</b>, see <a href="tr35-collation.md#Setting_Options">Setting Options</a></i></td></tr>
738<tr><td rowspan="9">"co"<br>(collation)</td>
739    <td rowspan="9">Collation type</td>
740            <td>"standard"</td>
741            <td>The default ordering for each language. For root it is based on the [<a href="#DUCET">DUCET</a>] (Default Unicode Collation Element Table): see <i><a href="tr35-collation.md#Root_Collation">Root Collation</a></i>. Each other locale is based on that, except for appropriate modifications to certain characters for that language.</td></tr>
742        <tr><td>"search"</td>
743            <td>A special collation type dedicated for string search—it is not used to determine the relative order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between ‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa. A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric” search as described in the [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] section Asymmetric Search). The search collator in root supplies matching rules that are appropriate for most languages (and which are different than the root collation behavior); language-specific search collators may be provided to override the matching rules for a given language as necessary.</td></tr>
744        <tr><td colspan="2"><p>Other keywords provide additional choices for certain locales; <i>they only have effect in certain locales.</i></p></td></tr>
745        <tr><td colspan="2">…</td></tr>
746        <tr><td>"phonetic"</td>
747            <td>Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use.</td></tr>
748        <tr><td>"pinyin"</td>
            <td>Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into pinyin. (used in Chinese)</td></tr>
750        <tr><td>"reformed"</td><td>Reformed collation (such as in Swedish)</td></tr>
751        <tr><td>"searchjl"</td>
752            <td>Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search as described in the [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] section Asymmetric Search and obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant.</td></tr>
753        <tr><td colspan="2">…</td></tr>
754
755<tr><td colspan="4"><b>A <a name="UnicodeCurrencyIdentifier" id="UnicodeCurrencyIdentifier" href="#UnicodeCurrencyIdentifier">Unicode Currency Identifier</a> defines a type of currency. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="cu" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/currency.xml" target="_blank">currency.xml</a>.</b></td></tr>
756<tr><td>"cu"<br>(currency)</td>
757    <td>Currency type</td>
758    <td><i>ISO 4217 code,</i><p><i>plus others in common use</i></p></td>
759    <td><p>Codes consisting of 3 ASCII letters that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The list of countries and time periods associated with each currency value is available in <a href="tr35-numbers.md#Supplemental_Currency_Data">Supplemental Currency Data</a>, plus the default number of decimals.</p><p>The XXX code is given a broader interpretation as <i>Unknown or Invalid Currency</i>.</p></td></tr>
760
<tr><td colspan="4"><b>A <a name="UnicodeDictionaryBreakExclusionIdentifier" id="UnicodeDictionaryBreakExclusionIdentifier" href="#UnicodeDictionaryBreakExclusionIdentifier">Unicode Dictionary Break Exclusion Identifier</a> specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values consist of one or more items of type SCRIPT_CODE as specified in the <i>name</i> attribute value in the <i>type</i> element of key name="dx" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/segmentation.xml" target="_blank">segmentation.xml</a>.</b></td></tr>
762<tr><td>"dx"</td>
763    <td>Dictionary break script exclusions</td>
764    <td><i><code><a href="#unicode_script_subtag">unicode_script_subtag</a></code> values</i></td>
765    <td><p>One or more items of type SCRIPT_CODE, which are valid <code><a href="#unicode_script_subtag">unicode_script_subtag</a></code> values.</p>
766        <p>The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified.</p></td></tr>
767
768<tr><td colspan="4"><b>A <a name="UnicodeEmojiPresentationStyleIdentifier" id="UnicodeEmojiPresentationStyleIdentifier" href="#UnicodeEmojiPresentationStyleIdentifier">Unicode Emoji Presentation Style Identifier</a> specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <code>&lt;html lang="sr-Latn-u-em-emoji"&gt;</code>. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="em" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/variant.xml" target="_blank">variant.xml</a></b>.</td></tr>
769<tr><td rowspan="3">"em"</td>
770    <td rowspan="3">Emoji presentation style</td>
771            <td>"emoji"</td>
772            <td>Use an emoji presentation for emoji characters if possible.</td></tr>
773        <tr><td>"text"</td>
774            <td>Use a text presentation for emoji characters if possible.</td></tr>
775        <tr><td>"default"</td><td>Use the default presentation for emoji characters as specified in UTR #51 Section 4, <a href="https://www.unicode.org/reports/tr51/#Presentation_Style">Presentation Style</a>.</td></tr>
776
777<tr><td colspan="4"><b>A <a name="UnicodeFirstDayIdentifier" id="UnicodeFirstDayIdentifier" href="#UnicodeFirstDayIdentifier">Unicode First Day Identifier</a> defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data (see Part 4 Dates, section 4.3 <a href="tr35-dates.md#Week_Data">Week Data</a>). The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="fw" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/calendar.xml" target="_blank">calendar.xml</a></b>.</td></tr>
778<tr><td rowspan="4">"fw"</td>
779    <td rowspan="4">First day of week</td>
780            <td>"sun"</td>
781            <td>Sunday</td></tr>
782        <tr><td>"mon"</td>
783            <td>Monday</td></tr>
784        <tr><td colspan="2">…</td></tr>
785        <tr><td>"sat"</td>
786            <td>Saturday</td></tr>
787
788<tr><td colspan="4"><b>A <a name="UnicodeHourCycleIdentifier" id="UnicodeHourCycleIdentifier" href="#UnicodeHourCycleIdentifier">Unicode Hour Cycle Identifier</a> defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data (see Part 4 Dates, section 4.4 <a href="tr35-dates.md#Time_Data">Time Data</a>). The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="hc" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/calendar.xml" target="_blank">calendar.xml</a></b>.</td></tr>
789<tr><td rowspan="4">"hc"</td>
790    <td rowspan="4">Hour cycle</td>
791            <td>"h12"</td>
792            <td>Hour system using 1–12; corresponds to 'h' in patterns</td></tr>
793        <tr><td>"h23"</td>
794            <td>Hour system using 0–23; corresponds to 'H' in patterns</td></tr>
795        <tr><td>"h11"</td>
796            <td>Hour system using 0–11; corresponds to 'K' in patterns</td></tr>
797        <tr><td>"h24"</td>
            <td>Hour system using 1–24; corresponds to 'k' in patterns</td></tr>
799
800<tr><td colspan="4"><b>A <a name="UnicodeLineBreakStyleIdentifier" id="UnicodeLineBreakStyleIdentifier" href="#UnicodeLineBreakStyleIdentifier">Unicode Line Break Style Identifier</a> defines a preferred line break style corresponding to the CSS level 3 <a href="https://drafts.csswg.org/css-text/#line-break-property">line-break option</a>. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="lb" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/segmentation.xml" target="_blank">segmentation.xml</a></b>.</td></tr>
801<tr><td rowspan="3">"lb"</td>
802    <td rowspan="3">Line break style</td>
803            <td>"strict"</td>
804            <td>CSS level 3 line-break=strict, e.g. treat CJ as NS</td></tr>
805        <tr><td>"normal"</td>
806            <td>CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh</td></tr>
807        <tr><td>"loose"</td>
            <td>CSS level 3 line-break=loose</td></tr>
809
810<tr><td colspan="4"><b>A <a name="UnicodeLineBreakWordIdentifier" id="UnicodeLineBreakWordIdentifier" href="#UnicodeLineBreakWordIdentifier">Unicode Line Break Word Identifier</a> defines preferred line break word handling behavior corresponding to the CSS level 3 <a href="https://drafts.csswg.org/css-text/#word-break-property">word-break option</a>. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="lw" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/segmentation.xml" target="_blank">segmentation.xml</a></b>.</td></tr>
811<tr><td rowspan="4">"lw"</td>
812    <td rowspan="4">Line break word handling</td>
813            <td>"normal"</td>
814            <td>CSS level 3 word-break=normal, normal script/language behavior for midword breaks</td></tr>
815        <tr><td>"breakall"</td>
816            <td>CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting</td></tr>
817        <tr><td>"keepall"</td>
818            <td>CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks</td></tr>
819	<tr><td>"phrase"</td>
820	    <td>Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline</td></tr>
821
822<tr><td colspan="4"><b>A <a name="UnicodeMeasurementSystemIdentifier" id="UnicodeMeasurementSystemIdentifier" href="#UnicodeMeasurementSystemIdentifier">Unicode Measurement System Identifier</a> defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data (see Part 2 General, section 5 <a href="tr35-general.md#Measurement_System_Data">Measurement System Data</a>). The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="ms" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/measure.xml" target="_blank">measure.xml</a></b>.
The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences.
824<i>For information about preferred units and unit conversion, see <a href="tr35-info.md#Unit_Conversion">Unit Conversion</a> and <a href="tr35-info.md#Unit_Preferences">Unit Preferences</a>.</i>
825</td></tr>
826<tr><td rowspan="3">"ms"</td>
827    <td rowspan="3">Measurement system</td>
828            <td>"metric"</td>
829            <td>Metric System</td></tr>
830        <tr><td>"ussystem"</td>
831            <td>US System of measurement: feet, pints, etc.; pints are 16oz</td></tr>
832        <tr><td>"uksystem"</td>
833            <td>UK System of measurement: feet, pints, etc.; pints are 20oz</td></tr>
834
835<tr><td colspan="4"><b>A <a name="MeasurementUnitPreferenceOverride" id="MeasurementUnitPreferenceOverride" href="#MeasurementUnitPreferenceOverride">Measurement Unit Preference Override</a> defines an override for measurement unit preference. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="mu" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/measure.xml" target="_blank">measure.xml</a></b>.
<i>For information about preferred units and unit conversion, see <a href="tr35-info.md#Unit_Conversion">Unit Conversion</a> and <a href="tr35-info.md#Unit_Preferences">Unit Preferences</a>.</i>
</td></tr>
837<tr><td rowspan="3">"mu"</td>
838    <td rowspan="3">Measurement unit override</td>
839            <td>"celsius"</td>
840            <td>Celsius as temperature unit</td></tr>
841        <tr><td>"kelvin"</td>
842            <td>Kelvin as temperature unit</td></tr>
843        <tr><td>"fahrenhe"</td>
844            <td>Fahrenheit as temperature unit</td></tr>
845
846<tr><td colspan="4"><b>A <a name="UnicodeNumberSystemIdentifier" id="UnicodeNumberSystemIdentifier" href="#UnicodeNumberSystemIdentifier">Unicode Number System Identifier</a> defines a type of number system. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/number.xml" target="_blank">number.xml</a>.</b></td></tr>
847<tr><td rowspan="7">"nu"<br>(numbers)</td>
848    <td rowspan="7">Numbering system</td>
849            <td><i>Unicode script subtag</i></td>
            <td><p>Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".)</p>
851                <p class="note">For more information, see <a href="tr35-numbers.md#Numbering_Systems">Numbering Systems</a>.</p></td></tr>
852        <tr><td>"arabext"</td>
853            <td>Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits)</td></tr>
854        <tr><td>"armnlow"</td>
855            <td>Armenian lowercase numerals</td></tr>
856        <tr><td colspan="2">…</td></tr>
857        <tr><td>"roman"</td>
858            <td>Roman numerals</td></tr>
859        <tr><td>"romanlow"</td>
860            <td>Roman lowercase numerals</td></tr>
861        <tr><td>"tamldec"</td>
862            <td>Modern Tamil decimal digits</td></tr>
863
864<tr><td colspan="4"><b>A <a name="RegionOverride" id="RegionOverride" href="#RegionOverride">Region Override</a> specifies an alternate region to use for obtaining certain region-specific default values (those specified by the <a href="tr35-info.md#rgScope">&lt;rgScope&gt;</a> element), instead of using the region specified by the <a href="#unicode_region_subtag">unicode_region_subtag</a> in the Unicode Language Identifier (or inferred from the <a href="#unicode_language_subtag">unicode_language_subtag</a>).</b></td></tr>
865<tr><td rowspan="2">"rg"</td>
866    <td rowspan="2">Region Override</td><td>"uszzzz"<br><br></td><td rowspan="2">The value is a <a href="#unicode_subdivision_id">unicode_subdivision_id</a> of type “unknown” or “regular”; this consists of a <a href="#unicode_region_subtag">unicode_region_subtag</a> for a regular region (not a macroregion), suffixed either by “zzzz” (case is not significant) to designate the region as a whole, or by a unicode_subdivision_suffix to provide more specificity. For example, “en-GB-u-rg-uszzzz” represents a locale for British English but with region-specific defaults set to US for items such as default currency, default calendar and week data, default time cycle, and default measurement system and unit preferences.
	The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences.
868<i>For information about preferred units and unit conversion, see <a href="tr35-info.md#Unit_Conversion">Unit Conversion</a> and <a href="tr35-info.md#Unit_Preferences">Unit Preferences</a>.</i>
869	</td></tr>
870        <tr><td>…</td></tr>
871
872<tr><td colspan="4"><b>A <a name="unicode_subdivision_subtag_validity"></a><a name="UnicodeSubdivisionIdentifier" id="UnicodeSubdivisionIdentifier" href="#UnicodeSubdivisionIdentifier">Unicode Subdivision Identifier</a> defines a regional subdivision used for locales. The valid values are based on the <i>subdivisionContainment</i> element as described in <i>Section <a href="#Unicode_Subdivision_Codes">3.6.5 Subdivision Codes</a></i>.</b></td></tr>
873<tr><td rowspan="2">"sd"</td>
874    <td rowspan="2">Regional Subdivision</td>
875            <td>"gbsct"</td>
876            <td rowspan="2">A <a href="#unicode_subdivision_id">unicode_subdivision_id</a>, which is a <a href="#unicode_region_subtag">unicode_region_subtag</a> concatenated with a unicode_subdivision_suffix.<br>For example, <i>gbsct</i> is “gb”+“sct” (where sct represents the subdivision code for Scotland). Thus “en-GB-u-sd-gbsct” represents the language variant “English as used in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca” represent “English as used in California”. See <b><i><a href="#Unicode_Subdivision_Codes">3.6.5 Subdivision Codes</a></i></b>.</td></tr>
877        <tr><td>…</td></tr>
878
<tr><td colspan="4"><b>A <a name="UnicodeSentenceBreakSuppressionsIdentifier" id="UnicodeSentenceBreakSuppressionsIdentifier" href="#UnicodeSentenceBreakSuppressionsIdentifier">Unicode Sentence Break Suppressions Identifier</a> defines a set of data to be used for suppressing certain sentence breaks that would otherwise be found by UAX #29 rules. The valid values are those <i>name</i> attribute values in the <i>type</i> elements of key name="ss" in bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/segmentation.xml" target="_blank">segmentation.xml</a></b>.</td></tr>
880<tr><td rowspan="2">"ss"</td>
881    <td rowspan="2">Sentence break suppressions</td>
882            <td>"none"</td>
883            <td>Don’t use sentence break suppressions data (the default).</td></tr>
884        <tr><td>"standard"</td>
885            <td>Use sentence break suppressions data of type "standard"</td></tr>
886
887<tr><td colspan="4"><b>A <a name="UnicodeTimezoneIdentifier" id="UnicodeTimezoneIdentifier" href="#UnicodeTimezoneIdentifier">Unicode Timezone Identifier</a> defines a timezone. The valid values are those name attribute values in the <i>type</i> elements of bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/timezone.xml" target="_blank">timezone.xml</a>.</b></td></tr>
888<tr><td>"tz"<br>(timezone)</td>
889    <td>Time zone</td>
890    <td><i>Unicode short time zone IDs</i></td>
891    <td><p>Short identifiers defined in terms of a TZ time zone database [<a href="#Olson">Olson</a>] identifier in the common/bcp47/timezone.xml file, plus a few extra values.</p>
892        <p>For more information, see <a href="#Time_Zone_Identifiers">Section 3.6.3 Time Zone Identifiers</a>.</p>
893        <p>CLDR provides data for normalizing timezone codes.</p></td></tr>
894
895<tr><td colspan="4"><b>A <a name="UnicodeVariantIdentifier" id="UnicodeVariantIdentifier" href="#UnicodeVariantIdentifier">Unicode Variant Identifier</a> defines a special variant used for locales. The valid values are those name attribute values in the <i>type</i> elements of bcp47/<a href="https://github.com/unicode-org/cldr/blob/main/common/bcp47/variant.xml" target="_blank">variant.xml</a>.</b></td></tr>
896<tr><td>"va"</td>
897    <td>Common variant type</td>
898    <td>"posix"</td>
    <td>POSIX style locale variant. For handling of the "POSIX" variant, see <i>Section 3.8.2, <a href="#Legacy_Variants">Legacy Variants</a></i>.</td></tr>
900
901</tbody></table>
902
903For more information on the allowed keys and types, see the specific elements below, and [Section 3.6.4 U Extension Data Files](#Unicode_Locale_Extension_Data_Files).
904
Additional keys or types might be added in future versions. Implementations of LDML should be robust enough to handle any syntactically valid key or type values.
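
The following is a minimal sketch of how an implementation might split the 'u' extension of a locale identifier into key/type pairs without rejecting keys or types it does not recognize. The function name `parse_u_extension` is hypothetical, and the sketch rests on assumptions not stated in this section: keys are two characters long, a key with no type is treated as "true", and any other single-character subtag ends the extension. Attributes (subtags that precede the first key) are ignored.

```python
def parse_u_extension(locale_id: str) -> dict[str, str]:
    """Return {key: type} for the 'u' extension of a locale identifier.

    Illustrative only: unknown but syntactically plausible keys and types
    are passed through unchanged rather than rejected.
    """
    subtags = locale_id.lower().replace("_", "-").split("-")
    keywords: dict[str, str] = {}
    try:
        start = subtags.index("u") + 1
    except ValueError:
        return keywords  # no 'u' extension present

    key = None
    values: list[str] = []
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break  # assumption: another singleton (such as "t" or "x") ends the extension
        if len(subtag) == 2:
            # assumption: a two-character subtag starts a new key
            if key is not None:
                keywords[key] = "-".join(values) or "true"
            key, values = subtag, []
        elif key is not None:
            values.append(subtag)  # a type may span several subtags
        # subtags before the first key are attributes and are skipped here
    if key is not None:
        keywords[key] = "-".join(values) or "true"
    return keywords
```

For example, `parse_u_extension("en-GB-u-rg-uszzzz")` yields a single entry mapping "rg" to "uszzzz", and a two-subtag calendar type such as "islamic-civil" is kept together as one value.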
906
907#### 3.6.2 <a name="Numbering%20System%20Data" href="#Numbering%20System%20Data">Numbering System Data</a>
908
909LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file **bcp47/number.xml**. For example, for the latest version of the data see [bcp47/number.xml](https://github.com/unicode-org/cldr/blob/main/common/bcp47/number.xml).
910
911Details about those numbering systems are defined in **supplemental/numberingSystems.xml**. For example, for the latest version of the data see [supplemental/numberingSystems.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/numberingSystems.xml).
912
913LDML makes certain stability guarantees on this data:
914
1.  Like other BCP 47 identifiers, once a numeric identifier is added to **bcp47/number.xml** or **numberingSystems.xml**, it will never be removed from either of those files.
2.  If an identifier has type="numeric" in numberingSystems.xml, then
    1.  It is a decimal, positional numbering system with an attribute `digits=X`, where `X` is a string with the 10 digits in order used by the numbering system.
    2.  The values of the type and digits will never change.
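
Because a numeric numbering system is decimal and positional, its `digits` string is enough to render a non-negative integer. The sketch below illustrates that idea; the function `format_decimal` and the constants `LATN` and `ARAB` are hypothetical names, and the digit strings are written out here for illustration rather than read from numberingSystems.xml.

```python
def format_decimal(value: int, digits: str) -> str:
    """Render a non-negative integer using a 10-character digits string."""
    if len(digits) != 10:
        raise ValueError("digits must contain exactly 10 characters")
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # Each ASCII digit of the decimal expansion indexes into the digits string.
    return "".join(digits[int(ch)] for ch in str(value))


# Illustrative digit strings (assumed, not loaded from CLDR data):
LATN = "0123456789"
ARAB = "".join(chr(0x0660 + i) for i in range(10))  # Arabic-Indic digits U+0660..U+0669
```

Grouping separators, decimal points, and signs are locale data outside the numbering system definition and are deliberately left out.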
919
920#### 3.6.3 <a name="Time_Zone_Identifiers" href="#Time_Zone_Identifiers">Time Zone Identifiers</a>
921
LDML inherits time zone IDs from the tz database [[Olson](#Olson)]. Because these IDs from the tz database do not satisfy the BCP 47 language subtag syntax requirements, CLDR defines short identifiers for use in the Unicode locale extension. The short identifiers are defined in the file **common/bcp47/timezone.xml**.
923
924The short identifiers use UN/LOCODE [[LOCODE](#LOCODE)] (excluding a space character) codes where possible. For example, the short identifier for "America/Los_Angeles" is "uslax" (the LOCODE for Los Angeles, US is "US LAX"). Identifiers of length not equal to 5 are used where there is no corresponding UN/LOCODE, such as "usnavajo" for "America/Shiprock", or "utcw01" for "Etc/GMT+1", so that they do not overlap with future UN/LOCODE.
925
926Although the first two letters of a short identifier may match an ISO 3166 two-letter country code, a user should not assume that the time zone belongs to the country. The first two letters in an identifier of length not equal to 5 have no meaning. Also, the identifiers are stabilized, meaning that they will not change no matter what changes happen in the base standard. So if Hawaii leaves the US and joins Canada as a new province, the short time zone identifier "ushnl" would not change in CLDR even if the UN/LOCODE changes to "cahnl" or something else.
927
928There is a special code "unk" for an Unknown or Invalid time zone. This can be expressed in the tz database style ID "Etc/Unknown", although it is not defined in the tz database.
929
930**Stability of Time Zone Identifiers**
931
Although the short time zone identifiers are guaranteed to be stable, the preferred IDs in the tz database (such as those found in the **zone.tab** file) might change from time to time. For example, "Asia/Calcutta" was replaced with "Asia/Kolkata" and moved to the **backward** file in the tz database. Because CLDR contains locale data that uses a time zone ID from the tz database as the key, stability of the IDs is critical.
933
To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule is applied to the `alias` attribute in the `<type>` element for "tz": the first "long" ID is the CLDR canonical "long" time zone ID.
935
936For example:
937
```xml
<type name="inccu" alias="Asia/Calcutta Asia/Kolkata" description="Kolkata, India"/>
```
941
The `<type>` element above defines the short time zone ID "inccu" (for use in the Unicode locale extension), the corresponding _CLDR canonical "long" ID_ "Asia/Calcutta", and an alias "Asia/Kolkata".
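
The "first long ID is canonical" rule can be sketched in a few lines. This is an illustration using Python's standard `xml.etree.ElementTree`; the function name `long_ids` is hypothetical, and the sketch assumes the `alias` attribute is a whitespace-separated list, as in the example element.

```python
import xml.etree.ElementTree as ET


def long_ids(type_xml: str) -> tuple[str, str, list[str]]:
    """Return (short ID, canonical long ID, other long IDs) for one <type> element."""
    elem = ET.fromstring(type_xml)
    aliases = elem.get("alias", "").split()
    if not aliases:
        raise ValueError("the <type> element has no alias attribute")
    # By the rule for "tz", the first alias is the CLDR canonical long ID.
    return elem.get("name"), aliases[0], aliases[1:]
```

Applied to the example element, this returns the short ID "inccu", the canonical long ID "Asia/Calcutta", and the remaining alias "Asia/Kolkata".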
943
944**Links in the tz database**
945
946Not all TZDB links are in CLDR aliases.
947CLDR purposefully does not exactly match the Link structure in the TZDB.
948
1. The links are maintained in the TZDB; copying them into CLDR would duplicate information that could fall out of sync (especially because the TZDB can be updated many times in a single month).
2. The TZDB went through a change a few years ago where it dropped the mappings to countries, whereas CLDR still maintains that distinction.
3. Several different time zones link together, so mirroring the links would make a single long ID an alias for several different short identifiers.
952
953CLDR doesn't alias across country boundaries because countries are useful for timezone selection.
954Even if, for example, Serbia and Croatia share the same rules, CLDR maintains the difference so that the user can either pick "Serbia time" or "Croatia time".
955The Croat is not forced to pick "Serbia time" (Europe/Belgrade) nor the Serb forced to pick “Croatia time” (Europe/Zagreb).
956
957#### 3.6.4 <a name="Unicode_Locale_Extension_Data_Files" href="#Unicode_Locale_Extension_Data_Files">U Extension Data Files</a>
958
The 'u' extension data is stored in multiple XML files located under the common/bcp47 directory in CLDR. Each file contains the locale extension key/type values and their backward compatibility mappings appropriate for a particular domain. [common/bcp47/collation.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/collation.xml) contains key/type values for collation, including optional collation parameters and valid type values for each key.
960
961The 't' extension data is stored in [common/bcp47/transform.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform.xml).
962
```xml
<!ELEMENT keyword ( key* )>

<!ELEMENT key ( type* )>
<!ATTLIST key extension NMTOKEN #IMPLIED>
<!ATTLIST key name NMTOKEN #REQUIRED>
<!ATTLIST key description CDATA #IMPLIED>
<!ATTLIST key deprecated ( true | false ) "false">
<!ATTLIST key preferred NMTOKEN #IMPLIED>
<!ATTLIST key alias NMTOKEN #IMPLIED>
<!ATTLIST key valueType (single | multiple | incremental | any) #IMPLIED >
<!ATTLIST key since CDATA #IMPLIED>

<!ELEMENT type EMPTY>
<!ATTLIST type name NMTOKEN #REQUIRED>
<!ATTLIST type description CDATA #IMPLIED>
<!ATTLIST type deprecated ( true | false ) "false">
<!ATTLIST type preferred NMTOKEN #IMPLIED>
<!ATTLIST type alias CDATA #IMPLIED>
<!ATTLIST type since CDATA #IMPLIED>

<!ELEMENT attribute EMPTY>
<!ATTLIST attribute name NMTOKEN #REQUIRED>
<!ATTLIST attribute description CDATA #IMPLIED>
<!ATTLIST attribute deprecated ( true | false ) "false">
<!ATTLIST attribute preferred NMTOKEN #IMPLIED>
<!ATTLIST attribute since CDATA #IMPLIED>
```
991
The extension attribute in the `<key>` element specifies the BCP 47 language tag extension type. The default value of the extension attribute is "u" (Unicode locale extension). The `<type>` element is only applicable to the enclosing `<key>`.
993
994In the Unicode locale extension 'u' and 't' data files, the common attributes for the `<key>`, `<type>` and `<attribute>` elements are as follows:
995
996**name**
997
998> The key or type name used by Unicode locale extension with ['u' extension syntax](#Unicode_locale_identifier) or the 't' extensions syntax. When _alias_ below is absent, this name can be also used with the old style ["@key=type" syntax](#Old_Locale_Extension_Syntax).
999>
1000> Most type names are **literal type names**, which match exactly the same value. All of these have at least one lowercase letter, such as "buddhist". There are a small number of **indirect type names**, such as "RG_KEY_VALUE". These have no lowercase letters. The interpretation of each one is listed below.
1001>
1002> ##### <a name="CODEPOINTS" href="#CODEPOINTS">CODEPOINTS</a>
1003>
1004> The type name **"CODEPOINTS"** is reserved for a variable representing Unicode code point(s). The syntax is:
1005>
1006> |            | EBNF |
1007> | ---------- | ---- |
1008> | codepoints | `= codepoint (sep codepoint)?` |
1009> | codepoint  | `= [0-9 A-F a-f]{4,6}` |
1010>
1011> In addition, no codepoint may exceed 10FFFF. For example, "00A0", "300b", "10D40C" and "00C1-00E1" are valid, but "A0", "U060C" and "110000" are not.
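
A minimal sketch of a validator for this syntax follows, covering both the EBNF and the upper bound of 10FFFF. The name `is_valid_codepoints` is hypothetical, and the sketch assumes that `sep` is either "-" or "_", which is not restated in this section.

```python
import re

_CODEPOINT = r"[0-9A-Fa-f]{4,6}"
# codepoints = codepoint (sep codepoint)?   with sep assumed to be "-" or "_"
_CODEPOINTS = re.compile(rf"{_CODEPOINT}(?:[-_]{_CODEPOINT})?")


def is_valid_codepoints(value: str) -> bool:
    """Check a CODEPOINTS value against the EBNF and the 10FFFF limit."""
    if not _CODEPOINTS.fullmatch(value):
        return False
    # The grammar alone admits six-digit values above 10FFFF, so check the range too.
    return all(int(cp, 16) <= 0x10FFFF for cp in re.split(r"[-_]", value))
```

The range check is kept separate from the pattern because expressing "no greater than 10FFFF" in a regular expression would be considerably harder to read.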
1012>
> In the current version of CLDR, the type "CODEPOINTS" is only used for the deprecated locale extension key "vt" (variableTop). The subtags forming the type for "vt" represent an arbitrary string of characters. There is no formal limit in the number of characters, although practically anything above 1 will be rare, and anything longer than 4 might be useless. Repetition is allowed, for example, 0061-0061 ("aa") is a valid type value for "vt", since the sequence may be a collating element. Order is vital: 0061-0062 ("ab") is different than 0062-0061 ("ba"). Note that for variableTop any character sequence must be a contraction which yields exactly one primary weight.
1014>
1015> For example,
1016>
1017> > **en-u-vt-00A4** : this indicates English, with any characters sorting at or below " ¤" (at a primary level) considered Variable.
1018>
> By default in UCA, variable characters are ignored in sorting at a primary, secondary, and tertiary level. But in CLDR, they are not ignorable by default. For more information, see [Collation: Section 3.3 _Setting Options_](tr35-collation.md#Setting_Options).
1020>
1021> ##### <a name="REORDER_CODE" href="#REORDER_CODE">REORDER_CODE</a>
1022>
> The type name **"REORDER_CODE"** is reserved for reordering block names (e.g. "latn", "digit" and "others") defined in the _[Root Collation](tr35-collation.md#Root_Collation)_. The type "REORDER_CODE" is used for locale extension key "kr" (colReorder). The value of type for "kr" is represented by one or more reordering block names such as "latn-digit". For more information, see [Collation: Section 3.12 _Collation Reordering_](tr35-collation.md#Script_Reordering).
1024>
1025> ##### <a name="RG_KEY_VALUE" href="#RG_KEY_VALUE">RG_KEY_VALUE</a>
1026>
1027> The type name **"RG_KEY_VALUE"** is reserved for region codes in the format required by the "rg" key; this is a subdivision code with idStatus='unknown' or 'regular' from the idValidity data in common/validity/subdivision.xml.
1028>
1029> ##### <a name="SCRIPT_CODE" href="#SCRIPT_CODE">SCRIPT_CODE</a>
1030>
1031> The type name **"SCRIPT_CODE"** is reserved for [`unicode_script_subtag`](#unicode_script_subtag) values (e.g. "thai", "laoo"). The type "SCRIPT_CODE" is used for locale extension key "dx". The value of type for "dx" is represented by one or more SCRIPT_CODEs, such as "thai-laoo".
1032>
1033> ##### <a name="SUBDIVISION_CODE" href="#SUBDIVISION_CODE">SUBDIVISION_CODE</a>
1034>
1035> The type name **"SUBDIVISION_CODE"** is reserved for subdivision codes in the format required by the "sd" key; this is a subdivision code from the idValidity data in common/validity/subdivision.xml, excluding those with idStatus='unknown'. Codes with idStatus='deprecated' should not be generated, and those with idStatus='private_use' are only to be used with prior agreement.
1036>
1037> ##### <a name="PRIVATE_USE" href="#PRIVATE_USE">PRIVATE_USE</a>
1038>
1039> The type name **"PRIVATE_USE"** is reserved for private use types. A valid type value is composed of one or more subtags separated by hyphens and each subtag consists of three to eight ASCII alphanumeric characters. In the current version of CLDR, **"PRIVATE_USE"** is only used for transform extension "x0".
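
As an illustration of the CODEPOINTS and PRIVATE_USE rules above, the following Python sketch checks candidate type values. It is not part of the specification: the function names are hypothetical, `sep` is assumed to be a hyphen, and, following the prose about "vt", any number of codepoints may be joined.

```python
import re

# One codepoint: 4 to 6 hexadecimal digits (upper or lower case).
_CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")
# One private use subtag: 3 to 8 ASCII alphanumeric characters.
_PRIVATE_SUBTAG = re.compile(r"[0-9A-Za-z]{3,8}")


def is_valid_codepoints(value: str) -> bool:
    """Check a CODEPOINTS type value such as "00A0" or "00C1-00E1"."""
    return all(
        _CODEPOINT.fullmatch(part) is not None and int(part, 16) <= 0x10FFFF
        for part in value.split("-")
    )


def is_valid_private_use(value: str) -> bool:
    """Check a PRIVATE_USE type value: hyphen-separated 3-8 character subtags."""
    return all(
        _PRIVATE_SUBTAG.fullmatch(part) is not None for part in value.split("-")
    )
```

With these definitions, the valid examples listed for CODEPOINTS ("00A0", "300b", "10D40C", "00C1-00E1") are accepted, while "A0", "U060C" and "110000" are rejected.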
1040
1041**valueType**
1042
1043> The `valueType` attribute indicates how many subtags are valid for a given key:
1044>
1045> | Value         | Description |
1046> | ------------- | ----------- |
1047> | `single`      | Either exactly one type value, or no type value (but only if the value of "true" would be valid). This is the default if no valueType attribute is present. |
1048> | `incremental` | Multiple type values are allowed, but only if a prefix is also present, and the sequence is explicitly listed. Each successive type value indicates a refinement of its prefix. For example:<br/>`<key name="ca" description="Calendar algorithm key" valueType="incremental">`<br/>`    <type name="islamic" description="Islamic calendar"/>`<br/>`    <type name="islamic-umalqura" description="Islamic calendar, Umm al-Qura"/>`<br/>Thus _ca-islamic-umalqura_ is valid. However, _ca-gregory-japanese_ is not valid, because "gregory-japanese" is not listed as a type. |
1049> | `multiple`    | Multiple type values are allowed, but each may only occur once. For example:<br/>`<key name="kr" description="Collation reorder codes" valueType="multiple">`<br/>`    <type name="REORDER_CODE" …/>` |
1050> | `any`         | Any number of type values are allowed, with none of the above restrictions. For example:<br/>`<key extension="t" name="x0" description="Private use transform type key." valueType="any">`<br/>`    <type name="PRIVATE_USE" …/>` |
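
The four `valueType` settings can be read as four different checks on the sequence of type subtags that follows a key. The sketch below is one possible reading, under stated assumptions: the function name is hypothetical, `listed` holds only literal type names (indirect names such as REORDER_CODE would need their own checks), and the sample sets are illustrative rather than complete.

```python
def is_valid_type_sequence(value_type: str, listed: set, subtags: list) -> bool:
    """Check the type subtags of one key against its listed type names."""
    if value_type == "single":
        # No type value is allowed only if "true" would be valid.
        if not subtags:
            return "true" in listed
        return len(subtags) == 1 and subtags[0] in listed
    if value_type == "incremental":
        # Every prefix, and the full sequence itself, must be explicitly listed.
        return bool(subtags) and all(
            "-".join(subtags[:n]) in listed for n in range(1, len(subtags) + 1)
        )
    if value_type == "multiple":
        # Each subtag must be listed and may occur only once.
        return len(set(subtags)) == len(subtags) and all(
            s in listed for s in subtags
        )
    if value_type == "any":
        # Each subtag must be listed; repetition is not restricted.
        return all(s in listed for s in subtags)
    raise ValueError(f"unknown valueType: {value_type}")
```

For example, with `{"islamic", "islamic-umalqura", "gregory", "japanese"}` as the listed types for "ca", the sequence `["islamic", "umalqura"]` passes the `incremental` check and `["gregory", "japanese"]` does not.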
1051
1052**description**
1053
> The description of the `key`, `type` or `attribute` element. There is also some informative text about certain keys and types in Section 3.5 [Key And Type Definitions](#Key_And_Type_Definitions_).
1055
1056**deprecated**
1057
> The deprecation status of the `key`, `type` or `attribute` element. The value `"true"` indicates the element is deprecated and no longer used in this version of CLDR. The default value is `"false"`.
1059
1060**preferred**
1061
1062> The preferred value of the deprecated `key`, `type` or `attribute` element. When a `key`, `type` or `attribute` element is deprecated, this attribute is used for specifying a new canonical form if available.
1063
1064**alias** (Not applicable to `<attribute>`)
1065
1066> The BCP 47 form is the canonical form, and recommended. Other aliases are included only for backwards compatibility.
1067>
1068> _Example:_
1069>
1070> ```xml
1071> <type name="phonebk" alias="phonebook" description="Phonebook style ordering (such as in German)"/>
1072> ```
1073>
1074> The preferred term, and the only one to be used in BCP 47, is the name: in this example, "phonebk".
1075>
1076> The alias is a key or type name used by Unicode locale extensions with the old ["@key=type" syntax](#Old_Locale_Extension_Syntax). The attribute value for type may contain multiple names delimited by ASCII space characters. Of those aliases, the first name is the preferred value.
1077
1078**since**
1079
1080> The version of CLDR in which this key or type was introduced. Absence of this attribute value implies the key or type was available in CLDR 1.7.2.
1081
1082_Note: There are no values defined for the locale extension attribute in the current CLDR release._
1083
1084For example,
1085
1086```xml
1087<key name="co" alias="collation" description="Collation type key">
1088  <type name="pinyin" description="Pinyin ordering for Latin and for CJK characters (used in Chinese)"/>
1089</key>
1090
1091<key name="ka" alias="colAlternate" description="Collation parameter key for alternate handling">
1092  <type name="noignore" alias="non-ignorable" description="Variable collation elements are not reset to ignorable"/>
1093  <type name="shifted" description="Variable collation elements are reset to zero at levels one through three"/>
1094</key>
1095
1096<key name="tz" alias="timezone">
1097  ...
1098  <type name="aumel" alias="Australia/Melbourne Australia/Victoria" description="Melbourne, Australia"/>
1099  <type name="aumqi" alias="Antarctica/Macquarie" description="Macquarie Island Station, Macquarie Island" since="1.8.1"/>
1100  ...
1101</key>
1102```
1103
1104The data above indicates:
1105
1106* type "pinyin" is valid for key "co", thus "u-co-pinyin" is a valid Unicode locale extension.
1107* type "pinyin" is not valid for key "ka", thus "u-ka-pinyin" is not a valid Unicode locale extension.
1108* type "pinyin" has no _alias_, so "zh@collation=pinyin" is a valid Unicode locale identifier according to the old syntax.
1109* type "noignore" has an alias attribute, so "en@colAlternate=noignore" is not a valid Unicode locale identifier according to the old syntax.
1110* type "aumel" is valid for key "tz", supported by CLDR 1.7.2 (default value) or later versions.
1111* type "aumqi" is valid for key "tz", supported by CLDR 1.8.1 or later versions.
1112
1113It is strongly recommended that all API methods accept all possible aliases for keywords and types, but generate the canonical form. For example, "ar-u-ca-islamicc" would be equivalent to "ar-u-ca-islamic-civil" on input, but the latter should be output. The one exception is where an alias would only be well-formed with the old syntax, such as "gregorian" (for "gregory").
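
The recommendation to accept aliases but generate the canonical form can be sketched with a lookup table built from the key and type data. The Python below is a minimal illustration, not a reference implementation: the function names are hypothetical, the `<keyword>` wrapper element and the shortened descriptions exist only to make the sample well-formed, and the data is an excerpt of the example above.

```python
import xml.etree.ElementTree as ET

# Excerpt of the example data above, wrapped in one root element (an assumption).
SAMPLE = """
<keyword>
  <key name="co" alias="collation" description="Collation type key">
    <type name="pinyin" description="Pinyin ordering"/>
  </key>
  <key name="ka" alias="colAlternate" description="Alternate handling">
    <type name="noignore" alias="non-ignorable" description="Not ignorable"/>
    <type name="shifted" description="Shifted"/>
  </key>
  <key name="tz" alias="timezone">
    <type name="aumel" alias="Australia/Melbourne Australia/Victoria"
          description="Melbourne, Australia"/>
  </key>
</keyword>
"""


def load_types(xml_text: str) -> dict:
    """Return {key name: {type name or alias: canonical type name}}."""
    table = {}
    for key in ET.fromstring(xml_text).iter("key"):
        types = {}
        for type_el in key.iter("type"):
            name = type_el.get("name")
            types[name] = name
            # The alias attribute may hold several space-delimited names.
            for alias in (type_el.get("alias") or "").split():
                types[alias] = name
        table[key.get("name")] = types
    return table


def canonical_type(table: dict, key: str, value: str):
    """Map a type name or alias to its canonical name; None if not valid for key."""
    return table.get(key, {}).get(value)
```

Here `canonical_type(load_types(SAMPLE), "ka", "non-ignorable")` yields "noignore", and a lookup of "pinyin" under "ka" yields `None`, matching the validity statements above.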
1114
1115#### 3.6.5 <a name="Unicode_Subdivision_Codes" href="#Unicode_Subdivision_Codes">Subdivision Codes</a>
1116
The subdivision codes designate a subdivision of a country or region. Subdivisions go by various names, such as a _state_ in the United States, or a _province_ in Canada. The codes in CLDR are based on ISO 3166-2 subdivision codes. The ISO codes have a region code followed by a hyphen, then a suffix consisting of 1 to 3 ASCII letters or digits.
1118
1119The CLDR codes are designed to work in a [unicode_locale_id](#unicode_locale_id) (BCP 47), and are thus all lowercase, with no hyphen. For example, the following are valid, and mean “English as used in California, USA”.
1120
1121* en-u-sd-**usca**
1122* en-US-u-sd-**usca**
1123
1124CLDR has additional subdivision codes. These may start with a 3-digit region code or use a suffix of 4 ASCII letters or digits, so they will not collide with the ISO codes. Subdivision codes for unknown values are the region code plus "zzzz", such as "uszzzz" for an unknown subdivision of the US. Other codes may be added for stability.
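
The relationship between the ISO form and the CLDR form is mechanical, as the short Python sketch below shows. The function names are hypothetical, and the conversion says nothing about whether the resulting code is valid; that still depends on the validity data.

```python
def iso_to_cldr_subdivision(iso_code: str) -> str:
    """Convert an ISO 3166-2 style code ("US-CA") to the CLDR form ("usca")."""
    return iso_code.replace("-", "").lower()


def unknown_subdivision(region: str) -> str:
    """Build the code for an unknown subdivision of a region ("US" -> "uszzzz")."""
    return region.lower() + "zzzz"
```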
1125
Like BCP 47, CLDR requires stable codes, which are not guaranteed for ISO 3166-2 (nor have the ISO 3166-2 codes been stable in the past). If an ISO 3166-2 code is removed, it remains valid (though marked as deprecated) in CLDR. If an ISO 3166-2 code is reused (for the same region), then CLDR will define a new equivalent code using these as 4-character suffixes.
1127
1128##### 3.6.5.1 <a name="Validity" href="#Validity">Validity</a>
1129
1130A [unicode_subdivision_id](#unicode_subdivision_id) is only valid when it is present in the subdivision.xml file as described in _Section 3.11 [Validity Data](#Validity_Data)_. The data is in a compressed form, and thus needs to be expanded before such a test is made.
1131
1132_Examples:_
1133
1134* **usca** is valid — there is an `id` element `<id type="subdivision"…>… usca …</id>`
1135* **ussct** is invalid — there is no `id` element `<id type="subdivision"…>… ussct …</id>`
1136
1137If a [unicode_locale_id](#unicode_locale_id) contains both a [unicode_region_subtag](#unicode_region_subtag) and a [unicode_subdivision_id](#unicode_subdivision_id), it is only valid if the [unicode_subdivision_id](#unicode_subdivision_id) starts with the [unicode_region_subtag](#unicode_region_subtag) (case-insensitively).
1138
1139It is recommended that a [unicode_locale_id](#unicode_locale_id) contain a [unicode_region_subtag](#unicode_region_subtag) if it contains a [unicode_subdivision_id](#unicode_subdivision_id) and the region would not be added by adding likely subtags. That produces better behavior if the [unicode_subdivision_id](#unicode_subdivision_id) is ignored by an implementation or if the language tag is truncated.
1140
1141Examples:
1142
1143* en-**US**-u-sd-**us**ca is valid — the region "US" matches the first part of "usca"
1144* en-u-sd-**us**ca is valid — it still works after adding likely subtags.
1145* en-**CA**-u-sd-**gb**sct is invalid — the region "CA" does not match the first part of "gbsct". An implementation should disregard the subdivision id (or return an error).
1146* en-u-sd-**gb**sct is valid but not recommended — an implementation that ignores the [unicode_subdivision_id](#unicode_subdivision_id) can get the wrong fallback behavior, or could add likely subtags and get the invalid en-**Latn-US**-u-sd-**gb**sct
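
The region-matching rule in the examples above can be sketched as follows. This Python is a simplified illustration with a hypothetical function name: it only compares the region subtag with the start of the "sd" value, does not consult subdivision.xml, and assumes a region is a 2-letter or 3-digit subtag after the language subtag.

```python
def subdivision_matches_region(locale_id: str) -> bool:
    """Return False only if a region and an "sd" value are present and disagree."""
    subtags = locale_id.replace("_", "-").lower().split("-")
    # The language part ends at the first singleton (for example "u" or "t").
    end = next((i for i, s in enumerate(subtags) if len(s) == 1), len(subtags))
    region = next(
        (
            s
            for s in subtags[1:end]
            if (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())
        ),
        None,
    )
    sd = None
    if "u" in subtags[end:]:
        start = subtags.index("u", end) + 1
        for i in range(start, len(subtags) - 1):
            if len(subtags[i]) == 1:  # next extension begins
                break
            if subtags[i] == "sd":
                sd = subtags[i + 1]
                break
    if region is None or sd is None:
        return True  # nothing to compare
    return sd.startswith(region)
```

Applied to the examples, "en-US-u-sd-usca" and "en-u-sd-usca" pass, while "en-CA-u-sd-gbsct" and "en-Latn-US-u-sd-gbsct" fail.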
1147
1148In version 28.0, the subdivisions in the validity files used the ISO format, uppercase with a hyphen separating two components, instead of the BCP 47 format.
1149
1150<a name="t_Extension"></a>
1151### 3.7 <a name="BCP47_T_Extension" href="#BCP47_T_Extension">Unicode BCP 47 T Extension</a>
1152
1153The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [[RFC6067](#RFC6067)] and extension 't' for transformed content [[RFC6497](#RFC6497)]. The Unicode BCP 47 extension data defines the complete list of valid subtags. While the title of the RFC is “Transformed Content”, the abstract makes it clear that the scope is broader than the term "transformed" might indicate to a casual reader: “including content that has been transliterated, transcribed, or translated, or _in some other way influenced by the source. It also provides for additional information used for identification._”
1154
**The -t- Extension.** The syntax of 't' extension subtags is defined by the rule `unicode_locale_extensions` in [_Section 3.2 Unicode locale identifier_](#Unicode_locale_identifier), except that the separator of subtags `sep` must always be a hyphen '-' when the extension is used as part of a BCP 47 language tag. For information about the registration process, meaning, and usage of the 't' extension, see [[RFC6497](#RFC6497)].
1156
These subtags are all in lowercase (that is the canonical casing for these subtags); however, subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions are sequences of two to eight alphanumeric characters that meet the rule `extension` in [[BCP47](#BCP47)].
1158
1159The following keys are defined for the -t- extension:
1160
1161| Keys   | Description | Values in latest release |
1162| ------ | ----------- | ------------------------ |
1163| m0     | **Transform extension mechanism:** to reference an authority or rules for a type of transformation | [​transform.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform.xml) |
1164| s0, d0 | **Transform source/destination:** for non-languages/scripts, such as fullwidth-halfwidth conversion. | [​transform-destination.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform-destination.xml) |
1165| i0     | **Input Method Engine transform:** Used to indicate an input method transformation, such as one used by a client-side input method. The first subfield in a sequence would typically be a 'platform' or vendor designation. | [​transform_ime.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform_ime.xml) |
1166| k0     | **Keyboard transform:** Used to indicate a keyboard transformation, such as one used by a client-side virtual keyboard. The first subfield in a sequence would typically be a 'platform' designation, representing the platform that the keyboard is intended for. The keyboard might or might not correspond to a keyboard mapping shipped by the vendor for the platform. One or more subsequent fields may occur, but are only added where needed to distinguish from others. | [​transform_keyboard.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform_keyboard.xml) |
1167| t0     | **Machine Translation:** Used to indicate content that has been machine translated, or a request for a particular type of machine translation of content. The first subfield in a sequence would typically be a 'platform' or vendor designation. | [​transform_mt.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform_mt.xml) |
1168| h0     | **Hybrid Locale Identifiers:** h0 with the value 'hybrid' indicates that the -t- value is a language that is mixed into the main language tag to form a hybrid. For more information, and examples, see _Section 3.10.2 [Hybrid Locale Identifiers](#Hybrid_Locale)._ | [​transform_hybrid.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform_hybrid.xml) |
1169| x0     | **Private use transform** | [​transform_private_use.xml](https://github.com/unicode-org/cldr/blob/maint/maint-41/common/bcp47/transform_private_use.xml) |
1170
1171#### 3.7.1 <a name="Transformed_Content_Data_File" href="#Transformed_Content_Data_File">T Extension Data Files</a>
1172
The overall structure of the data files is similar to that of the U Extension, with the following exceptions.
1174
1175In the transformed content 't' data file, the `name` attribute in a `<key>` element defines a valid field separator subtag. The `name` attribute in an enclosed `<type>` element defines a valid field subtag for the field separator subtag. For example:
1176
1177```xml
1178<key extension="t" name="m0" description="Transform extension mechanism">
1179    <type name="ungegn" description="United Nations Group of Experts on Geographical Names" since="21"/>
1180</key>
1181```
1182
1183The data above indicates:
1184
1185* "m0" is a valid field separator for the transformed content extension 't'.
1186* field subtag "ungegn" is valid for field separator "m0".
1187* field subtag "ungegn" was introduced in CLDR 21.
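
To show how field separator subtags and field subtags appear in an identifier, the following Python sketch splits a 't' extension into its source language part and its fields. It is an illustration only: the function name is hypothetical, the field separator is assumed to be a letter followed by a digit (as in "m0"), and no check against the data files is made.

```python
import re

# Assumed shape of a field separator subtag: one letter followed by one digit.
_FIELD_SEPARATOR = re.compile(r"[a-z][0-9]")


def parse_t_extension(tag: str):
    """Return (source language part, {field separator: [field subtags]}) or None."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return None
    source, fields, key = [], {}, None
    for subtag in subtags[subtags.index("t") + 1:]:
        if len(subtag) == 1:  # another extension begins
            break
        if _FIELD_SEPARATOR.fullmatch(subtag):
            key = subtag
            fields[key] = []
        elif key is None:
            source.append(subtag)  # still inside the source language part
        else:
            fields[key].append(subtag)
    return "-".join(source), fields
```

For instance, `parse_t_extension("und-Cyrl-t-und-latn-m0-ungegn-2007")` separates the source "und-latn" from the field "m0" with the subtags "ungegn" and "2007".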
1188
1189The attributes are:
1190
1191**name**
1192
1193> The name of the mechanism, limited to 3-8 characters (or sequences of them). Any indirect type names are listed in 3.6.4 [U Extension Data Files](#Unicode_Locale_Extension_Data_Files).
1194
1195**description**
1196
1197> A description of the name, with all and only that information necessary to distinguish one name from others with which it might be confused. Descriptions are not intended to provide general background information.
1198
1199**since**
1200
1201> Indicates the first version of CLDR where the name appears. (Required for new items.)
1202
1203**alias**
1204
1205> Alternative name, not limited in number of characters. Aliases are intended for compatibility, not to provide all possible alternate names or designations. _(Optional)_
1206
1207For information about the registration process, meaning, and usage of the 't' extension, see [[RFC6497](#RFC6497)].
1208
1209### 3.8 <a name="Compatibility_with_Older_Identifiers" href="#Compatibility_with_Older_Identifiers">Compatibility with Older Identifiers</a>
1210
LDML versions before 1.7.2 used a slightly different syntax for variant subtags and locale extensions. Implementations of LDML may provide backward compatible identifier support as described in the following sections.
1212
1213#### 3.8.1 <a name="Old_Locale_Extension_Syntax" href="#Old_Locale_Extension_Syntax">Old Locale Extension Syntax</a>
1214
The LDML 1.7 and older specifications used a different syntax for representing Unicode locale extensions. The previous definition of Unicode locale extensions had the following structure:
1216
1217|                               | EBNF |
1218| ----------------------------- | ---- |
1219| `old_unicode_locale_extensions` | `= "@" old_key "=" old_type`<br/>`(";" old_key "=" old_type)*` |
1220
The new specification mandates keys to be two alphanumeric characters and types to be three to eight alphanumeric characters. As a result, new codes were assigned to all existing keys and some types. For example, a new key "co" replaced the previous key "collation", and a new type "phonebk" replaced the previous type "phonebook". However, the existing collation type "big5han" already satisfied the new requirement, so no new type code was assigned to the type. All new keys and types introduced after LDML 1.7 satisfy the new requirement, so they do not have aliases dedicated for the old syntax, except time zone types. The conversion between old types and new types can be done regardless of key, with one known exception (old type "traditional" is mapped to new type "trad" for collation and "traditio" for numbering system), and this relationship will be maintained in the future versions unless otherwise noted.
1222
The new specification introduced a new field `attribute` in addition to key/type pairs in the Unicode locale extension. When it is necessary to map a new Unicode locale identifier with `attribute` field to a well-formed old locale identifier, a special key name _attribute_ with the value of entire `attribute` subtags in the new identifier is used. For example, a new identifier `ja-u-xxx-yyy-ca-japanese` is mapped to an old identifier `ja@attribute=xxx-yyy;calendar=japanese`.
1224
1225The chart below shows some example mappings between the new syntax and the old syntax.
1226
1227###### Table: <a name="Locale_Extension_Mappings" href="#Locale_Extension_Mappings">Locale Extension Mappings</a>
1228
1229| Old (LDML 1.7 or older)                    | New                          |
1230| ------------------------------------------ | ---------------------------- |
1231| `de_DE@collation=phonebook`                | `de_DE_u_co_phonebk`         |
1232| `zh_Hant_TW@collation=big5han`             | `zh_Hant_TW_u_co_big5han`    |
1233| `th_TH@calendar=gregorian;numbers=thai`    | `th_TH_u_ca_gregory_nu_thai` |
1234| `en_US_POSIX@timezone=America/Los_Angeles` | `en_US_u_tz_uslax_va_posix`  |
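
The first three rows of the table can be reproduced with a small conversion routine. The Python below is a sketch under stated assumptions: the alias tables are a hypothetical excerpt holding only the entries used in the table, the output uses hyphens as separators, keys are emitted in alphabetical order, and the POSIX variant and the "traditional" exception are not handled.

```python
# Hypothetical excerpts of the alias data; real data comes from the bcp47 files.
KEY_ALIASES = {"collation": "co", "calendar": "ca", "numbers": "nu"}
TYPE_ALIASES = {"phonebook": "phonebk", "gregorian": "gregory"}


def old_to_new(old_id: str) -> str:
    """Convert "lang_REGION@key=type;key=type" to a "-u-" extension form."""
    base, _, keywords = old_id.partition("@")
    pairs = []
    if keywords:
        for item in keywords.split(";"):
            old_key, _, old_type = item.partition("=")
            # Names without an alias entry are assumed to be valid as they are.
            pairs.append(
                (KEY_ALIASES.get(old_key, old_key), TYPE_ALIASES.get(old_type, old_type))
            )
    tag = base.replace("_", "-")
    if pairs:
        tag += "-u-" + "-".join(f"{key}-{value}" for key, value in sorted(pairs))
    return tag
```

For example, `old_to_new("th_TH@calendar=gregorian;numbers=thai")` gives "th-TH-u-ca-gregory-nu-thai".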
1235
Where the old API is supplied the BCP 47 language code, or vice versa, the recommendation is to:
1237
12381. Have all methods that take the old syntax also take the new syntax, interpreted correctly. For example, "zh-TW-u-co-pinyin" and "zh_TW@collation=pinyin" would both be interpreted as meaning the same.
12392. Have all methods (both for old and new syntax) accept all possible aliases for keywords and types. For example, "ar-u-ca-islamicc" would be equivalent to "ar-u-ca-islamic-civil".
1240   * The one exception is where an alias would only be well-formed with the old syntax, such as "gregorian" (for "gregory").
12413. Where an API cannot successfully accept the alternate syntax, throw an exception (or otherwise indicate an error) so that people can detect that they are using the wrong method (or wrong input).
12424. Provide a method that tests a purported locale ID string to determine its status:
1243   1. **well-formed** - syntactically correct
1244   2. **valid** - well-formed and only uses registered language subtags, extensions, keywords, types...
1245   3. **canonical** - valid and no deprecated codes or structure.
1246
1247#### 3.8.2 <a name="Legacy_Variants" href="#Legacy_Variants">Legacy Variants</a>
1248
Old LDML specifications allowed codes other than registered [[BCP47](#BCP47)] variant subtags to be used in Unicode language and locale identifiers for representing variations of locale data. Unicode locale identifiers including such variant codes can be converted to the new [[BCP47](#BCP47)] compatible identifiers by following the descriptions below:
1250
1251###### Table: <a name="Legacy_Variant_Mappings" href="#Legacy_Variant_Mappings">Legacy Variant Mappings</a>
1252
1253| Variant Code | Description |
1254| ------------ | ----------- |
1255| `AALAND`     | Åland, variant of "`sv`" Swedish used in Finland. Use `sv_AX` to indicate this. |
1256| `BOKMAL`     | Bokmål, variant of "`no`" Norwegian. Use primary language subtag "`nb`" to indicate this. |
1257| `NYNORSK`    | Nynorsk, variant of "`no`" Norwegian. Use primary language subtag "`nn`" to indicate this. |
1258| `POSIX`      | POSIX variation of locale data. Use Unicode locale extension `-u-va-posix` to indicate this. |
1259| `POLYTONI`   | Polytonic, variant of "`el`" Greek. Use [[BCP47](#BCP47)] variant subtag `polyton` to indicate this. |
1260| `SAAHO`      | The Saaho variant of Afar. Use primary language subtag "`ssy`" to indicate this. |
1261
1262When converting to old syntax, the Unicode locale extension "`-u-va-posix`" should be converted to the "`POSIX`" variant, _not_ to old extension syntax like "`@va=posix`". This is an exception: The other mappings above should not be reversed.
1263
1264Examples:
1265
1266* `en_US_POSIX` ↔ `en-US-u-va-posix`
1267* `en_US_POSIX@colNumeric=yes` ↔ `en-US-u-kn-va-posix`
1268* `en-US-POSIX-u-kn-true` → `en-US-u-kn-va-posix`
1269* `en-US-POSIX-u-kn-va-posix` → `en-US-u-kn-va-posix`
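
The conversion of the `POSIX` variant in the hyphenated examples above can be sketched in Python as follows. The function name is hypothetical, and the sketch makes simplifying assumptions: the only extension present is a lowercase `-u-` extension made of key/type pairs (no attributes), keys are written in alphabetical order, and a type value of "true" is dropped, as in the third example.

```python
def posix_variant_to_extension(tag: str) -> str:
    """Replace a POSIX variant subtag with the "va-posix" keyword."""
    lang_part, sep, ext = tag.replace("_", "-").partition("-u-")
    lang_subtags = lang_part.split("-")
    kept = [s for s in lang_subtags if s.upper() != "POSIX"]

    # Collect existing keywords: a 2-character subtag starts a new key.
    keywords, key = {}, None
    for subtag in (ext.lower().split("-") if sep else []):
        if len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)

    if len(kept) != len(lang_subtags):  # a POSIX variant was removed
        keywords["va"] = ["posix"]

    parts = []
    for key in sorted(keywords):
        values = [v for v in keywords[key] if v != "true"]
        parts.append("-".join([key] + values))
    if not parts:
        return "-".join(kept)
    return "-".join(kept + ["u"] + parts)
```

Both `en-US-POSIX-u-kn-true` and `en-US-POSIX-u-kn-va-posix` come out as `en-US-u-kn-va-posix`.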
1270
> Note that the mapping between `en_US_POSIX` and `en-US-u-va-posix` is a conversion process, not a canonicalization process.
1272
1273#### 3.8.3 <a name="Relation_to_OpenI18n" href="#Relation_to_OpenI18n">Relation to OpenI18n</a>
1274
1275The locale id format generally follows the description in the _OpenI18N Locale Naming Guideline_ [[NamingGuideline](#NamingGuideline)], with some enhancements. The main differences from those guidelines are that the locale id:
1276
12771. does not include a charset (since the data in LDML format always provides a representation of all Unicode characters. The repository is stored in UTF-8, although that can be transcoded to other encodings as well.)
12782. adds the ability to have a variant, as in Java
12793. adds the ability to discriminate the written language by script (or script variant).
12804. is a superset of [[BCP47](#BCP47)] codes.
1281
1282### 3.9 <a name="Transmitting_Locale_Information" href="#Transmitting_Locale_Information">Transmitting Locale Information</a>
1283
1284In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing the so-called _JIT localization_ is made up of two parts:
1285
12861. Store and transmit _neutral-format_ data wherever possible.
1287   * Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely) called _binary data_, even though it actually could be represented in many different ways, including a textual representation such as in XML.
1288   * Such data should use accepted standards where possible, such as for currency codes.
1289   * Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
12902. Localize that data as "_close_" to the end-user as possible.
1291
1292There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.
1293
1294Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original translated message text is available (which it may not be).
1295
Moreover, the closer we are to the end-user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then it can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end user, the less we need to ship all of the user's preferences around to all the places that localization could possibly need to be done.
1297
1298Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are appropriate for doing the localization. Thus information such as a locale code or time zone needs to be communicated between different components.
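
As a rough illustration of neutral-format transmission, the Python sketch below sends an amount, a currency code and a timestamp without any localized formatting. The payload layout and function names are hypothetical; the point is only that the receiver gets values it can format for the user, not strings it would have to parse.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def encode_payment(amount: Decimal, currency: str, when: datetime) -> str:
    """Serialize neutral-format data: plain decimal, currency code, UTC timestamp."""
    return json.dumps(
        {
            "amount": str(amount),
            "currency": currency,
            "timestamp": when.astimezone(timezone.utc).isoformat(),
        }
    )


def decode_payment(payload: str):
    """Recover the typed values; localization happens after this step."""
    data = json.loads(payload)
    return (
        Decimal(data["amount"]),
        data["currency"],
        datetime.fromisoformat(data["timestamp"]),
    )
```

The component closest to the end-user would then format the decoded values according to that user's locale and preferences.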
1299
1300#### 3.9.1 <a name="Message_Formatting_and_Exceptions" href="#Message_Formatting_and_Exceptions">Message Formatting and Exceptions</a>
1301
1302Windows ([FormatMessage](https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-formatmessage), [String.Format](https://learn.microsoft.com/en-us/dotnet/api/system.string.format?view=net-6.0)), Java ([MessageFormat](https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html)) and ICU ([MessageFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classMessageFormat.html), [umsg](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/umsg_8h.html)) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues.
1303
There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.
1305
1306More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (for example, datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT localization.
1307
In addition, exceptions are often caught at a higher level; they do not end up being displayed to any end-user at all. Avoiding the localization at the throw site saves the cost of doing formatting when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.
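
The approach described in this section, where an exception carries a language-neutral message ID plus its arguments and is localized only when displayed, might look like the following Python sketch. All names and the message catalog are hypothetical, and `str.format` merely stands in for a real message formatter such as MessageFormat.

```python
class LocalizableError(Exception):
    """An exception that carries a message ID and arguments, not localized text."""

    def __init__(self, message_id: str, **params):
        super().__init__(message_id)
        self.message_id = message_id
        self.params = params


# Hypothetical message catalog, keyed by locale and then by message ID.
CATALOG = {
    "en": {"files_missing": "{count} files are missing."},
    "fr": {"files_missing": "{count} fichiers sont manquants."},
}


def check_files(expected: int, found: int) -> None:
    """Low-level code: raise without doing any localization at the throw site."""
    if found < expected:
        raise LocalizableError("files_missing", count=expected - found)


def display(error: LocalizableError, locale: str) -> str:
    """Top-level code: localize only when the message is shown to a user."""
    return CATALOG[locale][error.message_id].format(**error.params)
```

If the exception is caught and handled before reaching `display`, no formatting work is done at all.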
1309
1310### 3.10 <a name="Language_and_Locale_IDs" href="#Language_and_Locale_IDs">Unicode Language and Locale IDs</a>
1311
1312People have very slippery notions of what distinguishes a language code versus a locale code. The problem is that both are somewhat nebulous concepts.
1313
1314In practice, many people use [[BCP47](#BCP47)] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [[BCP47](#BCP47)] includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives a [[BCP47](#BCP47)] code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" versus "\_" (for example, _zh-TW_ for language code, _zh_TW_ for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "\_" as equivalent when interpreting either one on input.
1315
1316Another reason for the conflation of these codes is that _very_ little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really does not make much sense. If people see the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important — but those are different in kind than other language differences between regions.
1317
1318As far as we are concerned — _as a completely practical matter_ — two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this is not the principle used in [[ISO639](#ISO639)], which has the fairly unproductive notion (for data interchange) that only spoken language matters (it is also not completely consistent about this, however).
1319
[[BCP47](#BCP47)] _**can**_ express a difference if the use of written languages happens to correspond to region boundaries expressed as [[ISO3166](#ISO3166)] region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [[ISO3166](#ISO3166)] codes. These written languages include simplified and traditional Chinese (both used in Hong Kong S.A.R.), Serbian in Latin script, Azerbaijani in Arabic script, and so on.
1321
1322Notice also that _currency codes_ are different than _currency localizations_. The currency localizations should largely be in the language-based resource bundles, not in the territory-based resource bundles. Thus, the resource bundle _en_ contains the localized mappings in English for a range of different currency codes: USD → US$, RUR → Rub, AUD → $A and so on. Of course, some currency symbols are used for more than one currency, and in such cases specializations appear in the territory-based bundles. Continuing the example, _en_US_ would have USD → $, while _en_AU_ would have AUD → $. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's territory to guess at the currency. For some informal discussion of this, see [JIT Localization](https://unicode-org.github.io/icu-docs/design/jit_localization.html).)
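The split between language-based and territory-based bundles can be illustrated with a small lookup. The table below only restates the mappings from the paragraph above and is not real CLDR data; `currency_symbol` is a hypothetical helper, and returning the bare code when nothing is found is an assumption made for the sketch.

```python
# Toy bundles restating the example in the text; real CLDR data differs.
BUNDLES = {
    "en": {"USD": "US$", "RUR": "Rub", "AUD": "$A"},
    "en_US": {"USD": "$"},
    "en_AU": {"AUD": "$"},
}


def currency_symbol(locale_id, currency_code):
    """Look in the territory bundle first, then in the language bundle."""
    current = locale_id
    while current:
        symbol = BUNDLES.get(current, {}).get(currency_code)
        if symbol is not None:
            return symbol
        current = current.rpartition("_")[0]  # "en_US" -> "en" -> ""
    return currency_code  # assumption: fall back to the code itself


print(currency_symbol("en_US", "USD"))  # $
print(currency_symbol("en_US", "AUD"))  # $A
```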
1323
1324#### 3.10.1 <a name="Written_Language" href="#Written_Language">Written Language</a>
1325
Criteria for what makes a written language should be purely pragmatic; _what would copy-editors say?_ If one gave them text like the following, they would respond that it is far from acceptable English for publication, and ask for it to be redone:
1327
13281. "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."
1329
So one would change it to either 2 or 3 below, depending on which orthographic variant of English was the target for the publication:
1331
13322. "Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
13333. "Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
1334
1335Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first versus last name sorting in the list, but clearly the first list was _not_ acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for example, is _not_.
1336
1337Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing unfamiliar date or number formats on the user as well.
1338
1339#### 3.10.2 <a name="Hybrid_Locale" href="#Hybrid_Locale">Hybrid Locale Identifiers</a>
1340
Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. These are commonly referred to with portmanteau words such as _Franglais, [Spanglish](https://en.wikipedia.org/wiki/Spanglish)_ or _Denglish_. Hybrid locales do _not_ refer to text that simply contains two languages: a book of parallel text containing English and French, such as the following, is not Franglais:
1342
1343<!-- HTML: no header -->
1344<table><tbody><tr>
1345    <td>On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg…</td>
1346    <td>Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock, revint précipitamment vers sa petite maison située au numéro 19 de Königstrasse, l’une des plus anciennes rues du vieux quartier de Hambourg…</td>
1347</tr></tbody></table>
1348
While text in a document can be tagged as partly in one language and partly in another, that is not the same as having a hybrid locale. There is a difference between having a Spanglish document, and a Spanish document that has some passages quoted in English. Fine-grained tagging doesn't handle grammatical combinations like Tanglish “Enna matteru?” (_What’s the matter?_), which is neither standard Tamil nor standard English. More importantly, it doesn’t work for the very common use case for a [unicode_locale_id](#unicode_locale_id): _locale selection_.
1350
1351To communicate requests for localized content and internationalization services, locales are used. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc.). To allow an application to support Spanglish or Hinglish locale selection, [unicode_locale_id](#unicode_locale_id)s can represent hybrid locales using the T Extension key-value 'h0-hybrid'. (For more information on the T extension, see _Section 3.7 [Unicode BCP 47 T Extension](#t_Extension)._)
1352
_However, if users typically expect their language in a non-default script to contain a significant amount of text from another language due to lexical borrowing, then the -t- and hybrid subtags may be omitted. An example of this is Hindi written in Latin script: since Romanized Hindi typically contains a significant amount of English text, ‘hi-Latn’ can be used instead of ‘hi-Latn-t-en-h0-hybrid’._
1354This tends to work better in implementations that don't yet handle the -t- extension.
1355
1356Examples:
1357
1358|Locale ID			| Base script	| Hybrid name	| Description									|
1359|-------------------------------|---------------|---------------|-------------------------------------------------------------------------------|
|hi-t-***en-h0-hybrid***	| Devanagari	| Hinglish	| Hindi-English hybrid where the script is Devanagari\*				|
1361|hi-Latn-t-***en-h0-hybrid***	| Latin		| Hinglish	| Hindi-English hybrid where the script is Latin\*				|
1362|hi-Latn			| Latin		| Hinglish	| Hindi written in Latin script; in practice usually a hybrid with English	|
1363|ta-t-***en-h0-hybrid***  	| Tamil		| Tanglish	| Tamil-English hybrid where the script is Tamil\*				|
1364|...																		||
1365|en-t-***hi-h0-hybrid***	| Latin		| Hinglish	| English-Hindi hybrid	where the script is Latin\*				|
1366|en-t-***zh-h0-hybrid***	| Latin		| Chinglish	| English-Chinese hybrid where the script is Latin\*				|
1367|...																		||
1368
1369\* When used as a request for international services (such as date formatting), the request is for everything to be in the base script if possible. When used to tag arbitrary content on a coarse level, the expectation is that it be the predominant script — that is, there may be certain passages or phrases that are in the other script but are not tagged on a fine-grained level.
1370
1371> _Note: The [unicode_language_id](#unicode_language_id) should be the language used as the ‘scaffold’: for the fallback locale for internationalization services, typically used for more of the core vocabulary/structure in the content. Thus where Hindi is the scaffold, Hinglish should be represented as hi-t-en-h0-hybrid (when written in Devanagari script) or hi-Latn-t-en-h0-hybrid (when written in Latin characters). Where English is the scaffold, Hinglish should be represented as en-t-hi-h0-hybrid (or possibly en-Deva-t-hi-h0-hybrid)._
1372
1373The value of -t- is a full _[unicode_language_id](#unicode_language_id)_, and can contain a subtag for the region where it is important to include it, as in the following. The value can also include the script, although that is not normally included: the only instance where it should be is where the content of the source text varies by script. So because zh-Hant has different vocabulary and expressions, it could make sense to have en-t-zh-hant to make that distinction.
1374
1375> Note: The default script for the language is computed without reference to the hybrid subtags. Thus the default script for 'ru' is “Cyrl”, no matter what the source is in the -t- tag.
1376
1377|Locale ID			| Base script	| Hybrid name	| Description									|
1378|-------------------------------|---------------|---------------|-------------------------------------------------------------------------------|
1379|ru-t-***en***-h0-hybrid	| Cyrillic	| Runglish	| Russian with an admixture of ***American English***				|
1380|ru-t-***en-gb***-h0-hybrid	| Cyrillic	| Runglish	| Russian with an admixture of ***British English***				|
1381|ru-***Latn***-t-en-gb-h0-hybrid| Latin		| Runglish	| Russian with an admixture of British English					|
1382|en-t-***zh-h0-hybrid***	| Latin		| Chinglish	| American English with an admixture of ***Chinese (Simplified Mandarin Chinese)***|
1383|en-t-***zh-hant-h0-hybrid***	| Latin		| Chinglish	| American English with an admixture of ***Chinese (Traditional Mandarin Chinese)***|
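
The identifiers in the tables above can be taken apart with a short routine. The following is a minimal sketch, not a full BCP 47 parser: `parse_hybrid` is a hypothetical name, the result is lowercased for simplicity, and the code assumes that a field key in the T extension is one letter followed by one digit (as in `h0`) and that any other single-character subtag starts a different extension.

```python
import re

TFIELD_KEY = re.compile(r"^[a-z][0-9]$")  # assumed shape of keys such as "h0"


def parse_hybrid(locale_id):
    """Split a locale ID into (base, source, is_hybrid).

    base      -- subtags before the "-t-" singleton
    source    -- the language ID given as the -t- value ("" if absent)
    is_hybrid -- True when the field "h0" has the single value "hybrid"
    """
    subtags = locale_id.replace("_", "-").lower().split("-")
    if "t" not in subtags:
        return "-".join(subtags), "", False
    t_index = subtags.index("t")
    base, rest = subtags[:t_index], subtags[t_index + 1:]

    source, fields, key = [], {}, None
    for subtag in rest:
        if len(subtag) == 1:          # another singleton ends the T extension
            break
        if TFIELD_KEY.match(subtag):  # start of a new field
            key = subtag
            fields[key] = []
        elif key is None:             # still inside the source language ID
            source.append(subtag)
        else:                         # value subtag of the current field
            fields[key].append(subtag)

    return "-".join(base), "-".join(source), fields.get("h0") == ["hybrid"]


print(parse_hybrid("ru-Latn-t-en-gb-h0-hybrid"))  # ('ru-latn', 'en-gb', True)
```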
1384
1385Should there ever be strong need for hybrids of more than two languages or for other purposes such as hybrid languages as the source of translated content, additional structure could be added.
1386
1387### 3.11 <a name="Validity_Data" href="#Validity_Data">Validity Data</a>
1388
1389```xml
1390<!ELEMENT idValidity (id*) >
1391<!ELEMENT id ( #PCDATA ) >
1392<!ATTLIST id type NMTOKEN #REQUIRED >
1393<!ATTLIST id idStatus NMTOKEN #REQUIRED >
1394```
1395
1396The directory [common/validity](https://github.com/unicode-org/cldr/blob/main/common/validity/) contains machine-readable data for validating the language, region, script, and variant subtags, as well as currency, subdivisions and measure units. Each file contains a number of subtags with the following **idStatus** values:
1397
1398* **regular** — the standard codes used for the specific type of subtag
1399* **special** — certain exceptional language codes like 'mul' _(languages only)_
1400* **unknown** — the code used to indicate the "unknown", "undetermined" or "invalid" values. For more information, see _Section 3.5.1 [Unknown or Invalid Identifiers](#Unknown_or_Invalid_Identifiers)_.
1401* **macroregion** — the standard codes that are macroregions _(for regions only)._
1402  * Note that some two-letter region codes are macroregions, and (in the future) some three-digit codes may be regular codes.
1403  * For details as to which regions are contained within which macroregions, see the `<containment>` element of the supplemental data.
1404* **deprecated** — codes that should not be used. The `<alias>` element in the supplementalMeta file contains more information about these codes, and which codes should be used instead.
1405* **private_use** — codes that, for CLDR, are considered private use. Note that some private-use codes in a source standard such as BCP 47 have defined CLDR semantics, and are considered regular codes. For more information, see _Section 3.5.3 [Private Use Codes](#Private_Use_Codes)._
1406* **reserved** — codes that are private use in a source standard, but are reserved for future use as regular codes by CLDR.
1407
The list of subtags for each idStatus uses a compact format: a space-delimited list of StringRanges, as defined in _Section [5.3.4 String Range](#String_Range)._ The separator for each StringRange is a "~".
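
Expanding the compact format might look like the sketch below. It rests on an assumption about the String Range section that is not restated here: in `X~Y`, the end string `Y` replaces an equally long suffix of `X`, and each position of that suffix ranges independently from its start code point to its end code point. The function names are hypothetical.

```python
from itertools import product


def expand_string_range(token, separator="~"):
    """Expand one StringRange such as "ab~d" into a list of strings."""
    if separator not in token:
        return [token]                      # a plain string stands for itself
    start, end = token.split(separator, 1)
    if not 0 < len(end) <= len(start):
        raise ValueError(f"invalid string range: {token!r}")
    cut = len(start) - len(end)
    prefix, suffix = start[:cut], start[cut:]
    # One column of candidate characters per suffix position.
    columns = [
        [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]
        for lo, hi in zip(suffix, end)
    ]
    return [prefix + "".join(chars) for chars in product(*columns)]


def expand_id_list(text):
    """Expand the space-delimited content of one <id> element."""
    return [s for token in text.split() for s in expand_string_range(token)]


print(expand_string_range("ab~d"))  # ['ab', 'ac', 'ad']
```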
1409
1410Each measure unit is a sequence of subtags, such as “angle-arc-minute”. The first subtag provides a general “category” of the unit.
1411
1412In version 28.0, the subdivisions in the validity files used the ISO format, uppercase with a hyphen separating two components, instead of the BCP 47 format.
1413
1414
1415
1416## 4 <a name="Locale_Inheritance" href="#Locale_Inheritance">Locale Inheritance and Matching</a>
1417
The XML format relies on an inheritance model, whereby the resources are collected into _bundles_, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as _root_. Wherever possible, the resources in the root are language & territory neutral. For example, the collation (sorting) order in the root is based on the [[DUCET](#DUCET)] (see _[Root Collation](tr35-collation.md#Root_Collation)_). Since English language collation has the same ordering as the root locale, the 'en' locale data does not need to supply any collation data, nor do the 'en_US', 'en_GB' or any of the various other locales that use English.
1419
1420Given a particular locale id "en_US_someVariant", the default search chain for a particular resource is the following.
1421
1422```
1423en_US_someVariant
1424en_US
1425en
1426root
1427```
1428
1429_The inheritance is often not simple truncation, as will be seen later in this section._
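
Where simple truncation does apply, the default chain above can be produced as in this minimal sketch. The function name is hypothetical, and the code deliberately ignores every case in which the parent is not the truncated identifier.

```python
def truncation_chain(locale_id):
    """Return the default search chain obtained by simple truncation."""
    chain = []
    current = locale_id
    while current:
        chain.append(current)
        current = current.rpartition("_")[0]  # drop the last subtag
    chain.append("root")
    return chain


print(truncation_chain("en_US_someVariant"))
# ['en_US_someVariant', 'en_US', 'en', 'root']
```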
1430
The default search chain is slightly different for multiple variants.
In that case, the inheritance chain covers all combinations of variants, with the largest number of variants first, and otherwise in alphabetical order.
For example, where the requested locale ID is en_GB_fonipa_scouse, the inheritance chain is as follows:
1434
1435```
1436en_GB_fonipa_scouse
1437en_GB_scouse_fonipa // extra step, only needed if not canonical
1438en_GB_fonipa
1439en_GB_scouse // extra step
1440en_GB
1441en
1442```
1443
1444
1445If the data for the implementation performing the inheritance doesn't require canonical locale identifiers, then extra locale IDs need to be inserted in the chain.
1446That is indicated in the example above, marked with "only needed if not canonical".
These would include all combinations of variants that are not in canonical order, inserted in alphabetical order.
1448Note that the order of multiple variants in canonical locale identifiers is alphabetical, as per [5. Canonicalizing Syntax](#5-canonicalizing-syntax) in [Annex C. LocaleId Canonicalization](#annex-c-localeid-canonicalization).
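
The chain for multiple variants can be sketched as follows. The helper takes the base locale and the variants separately, which is a simplification, and the relative order of the non-canonical entries within one size is an assumption (plain lexicographic order). Like the listing above, the result stops at the language and does not append root.

```python
from itertools import combinations, permutations


def variant_chain(base, variants, canonical_only=True):
    """Build the inheritance chain for a base locale plus several variants."""
    chain = []
    ordered = sorted(variants)                    # canonical order is alphabetical
    for size in range(len(ordered), 0, -1):       # most variants first
        if canonical_only:
            candidates = combinations(ordered, size)
        else:
            # Sorted input makes permutations come out in lexicographic order.
            candidates = permutations(ordered, size)
        for ordering in candidates:
            chain.append("_".join((base,) + ordering))
    current = base
    while current:                                # then plain truncation
        chain.append(current)
        current = current.rpartition("_")[0]
    return chain


print(variant_chain("en_GB", ["fonipa", "scouse"], canonical_only=False))
```

With `canonical_only=False` this reproduces the six entries of the example, including the two marked as extra steps.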
1449
1450If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.
1451
Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their data from the language locale: "en" will contain the bulk of the data, while "en_IE" will only contain a few items like currency. All data that is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less error-prone) maintenance. At the script or region level, the "primary" child locale will be empty, since its parent will contain all of the appropriate resources for it. For more information see _CLDR Information: Section 9.3 [Default Content](tr35-info.md#Default_Content)._
1453
1454Certain data items depend only on the region specified in a locale id (by a [unicode_region_subtag](#unicode_region_subtag_validity) or an “rg” [Region Override](#RegionOverride) key), and are obtained from supplemental data rather than through locale resources. For example:
1455
1456* The currency for the specified region (see [Supplemental Currency Data](tr35-numbers.md#Supplemental_Currency_Data))
1457* The measurement system for the specified region (see [Measurement System Data](tr35-general.md#Measurement_System_Data))
1458* The week conventions for the specified region (see [Week Data](tr35-dates.md#Week_Data))
1459
1460(For more information on the specific items handled this way, see [Territory-Based Preferences](tr35-info.md#Territory_Based_Preferences).) These items will be correct for the specified region regardless of whether a locale bundle actually exists with the same combination of language and region as in the locale id. For example, suppose data is requested for the locale id "fr_US" and there is no bundle for that combination. Data obtained via locale inheritance, such as currency patterns and currency symbols, will be obtained from the parent locale "fr". However, currency amounts would be formatted by default using US dollars, just displayed in the manner governed by the locale "fr". When a locale id does not specify a region, the region-specific items such as those above are obtained from the likely region for the locale (obtained via [Likely Subtags](#Likely_Subtags)).
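
The "fr_US" case can be made concrete with a small sketch. All three tables are illustrative stand-ins for supplemental data, likely-subtags data and the set of available bundles; the function names are hypothetical, and the "rg" override is not handled.

```python
# Illustrative stand-ins, not the real CLDR tables.
CURRENCY_BY_REGION = {"US": "USD", "FR": "EUR"}
LIKELY_REGION = {"fr": "FR", "en": "US"}
AVAILABLE_BUNDLES = {"fr", "en", "en_US"}


def region_of(locale_id):
    """Region from the ID if present, else the likely region for the language."""
    parts = locale_id.split("_")
    for part in parts[1:]:
        two_letters = len(part) == 2 and part.isalpha()
        three_digits = len(part) == 3 and part.isdigit()
        if two_letters or three_digits:
            return part
    return LIKELY_REGION[parts[0]]


def formatting_sources(locale_id):
    """Return (bundle supplying patterns and symbols, default currency)."""
    bundle = locale_id
    while bundle and bundle not in AVAILABLE_BUNDLES:
        bundle = bundle.rpartition("_")[0]
    return bundle or "root", CURRENCY_BY_REGION[region_of(locale_id)]


print(formatting_sources("fr_US"))  # ('fr', 'USD')
```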
1461
1462For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see Section 4.2.6 [Inheritance vs Related Information](tr35.md#Inheritance_vs_Related).
1463
1464### 4.1 <a name="Lookup" href="#Lookup">Lookup</a>
1465
If a language has more than one script in customary modern use, then the CLDR file structure in common/main uses the following model:
1467
1468```
1469lang
1470lang_script
1471lang_script_region
1472lang_region (aliases to lang_script_region)
1473```
1474
1475#### 4.1.1 <a name="Bundle_vs_Item_Lookup" href="#Bundle_vs_Item_Lookup">Bundle vs Item Lookup</a>
1476
There are actually two different kinds of inheritance fallback: _resource bundle lookup_ and _resource item lookup_. For the former, a process is looking to find the first, best resource bundle it can; for the latter, it is fallback within bundles on individual items, like the translated name for the region "CN" in Breton.
1478
1479These are closely related, but distinct, processes. They are illustrated in the table [Lookup Differences](#Lookup-Differences), where "key" stands for zero or more key/type pairs. Logically speaking, when looking up an item for a given locale, you first do a resource bundle lookup to find the best bundle for the locale, then you do an inherited item lookup starting with that resource bundle.
1480
1481The table [Lookup Differences](#Lookup-Differences) uses the naïve resource bundle lookup for illustration. More sophisticated systems will get far better results for resource bundle lookup if they use the algorithm described in _Section 4.4 [Language Matching](#LanguageMatching)_. That algorithm takes into account both the user’s desired locale(s) and the application’s supported locales, in order to get the best match.
1482
1483If the naïve resource bundle lookup is used, the desired locale needs to be canonicalized using 4.3 [Likely Subtags](#Likely_Subtags) and the supplemental alias information, so that locales that CLDR considers identical are treated as such. Thus eng-Latn-GB should be mapped to en-GB, and cmn-TW mapped to zh-Hant-TW.
1484
For the purposes of CLDR, everything with the `<ldml>` dtd is treated logically as if it is one resource bundle, even if the implementation separates data into separate physical resource bundles. For example, suppose that there is a main XML file for Nama (naq), but there are no `<unit>` elements for it because the units are all inherited from root. If the `<unit>` elements are separated into a separate data tree for modularity in the implementation, the Nama `<unit>` resource bundle would be empty. However, resource bundle lookup still stops at naq.xml.
1486
1487###### Table: <a name="Lookup-Differences" href="#Lookup-Differences">Lookup Differences</a>
1488
1489
1490<!-- HTML: readability -->
1491<table><thead>
1492<tr>
1493    <th>Lookup Type</th>
1494    <th>Example</th>
1495    <th>Comments</th>
1496</tr>
1497</thead><tbody>
1498<tr>
1499    <td><b>Resource bundle</b> lookup</td>
1500    <td>
1501        se-FI →                 <br/>
1502        se →                    <br/>
1503        <i>default‑locale* →</i><br/>
1504        root
1505    </td>
    <td><p>* The default-locale may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting that chain, resulting in:</p>
1507        <p>
1508            se-FI →             <br/>
1509            se →                <br/>
1510            fi →                <br/>
1511            <i>en-GB →</i>      <br/>
1512            <i>en →</i>         <br/>
1513            root
1514        </p>
    </td>
</tr>
<tr>
1517    <td><b>Inherited item</b> lookup</td>
1518    <td>
1519        se-FI+key →             <br/>
1520        se+key →                <br/>
1521        <i>root_alias*+key</i>  <br/>
1522        → root+key
1523    </td>
1524    <td><p>* If there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. This can happen multiple times.</p>
1525        <p>
1526            se-FI+key →         <br/>
1527            se+key →            <br/>
1528            root_alias*+key →   <br/>
1529            <i>se-FI+key2 →</i> <br/>
1530            <i>se+key2 →</i>    <br/>
1531            root_alias*+key2 →  <br/>
1532            root+key2
1533        </p>
1534    </td>
1535</tr>
1536</tbody></table>
1537
1538_Both the resource bundle inheritance and the inherited item inheritance use the parentLocale data, where available, instead of simple truncation._
1539
The fallback is a bit different for these two cases; internal aliases and keys are not involved in the bundle lookup, and the default locale is not involved in the item lookup. If the default-locale were used in the resource-item lookup, then strange results would occur. For example, suppose that the default locale is Swedish, and there is a Nama locale but no specific inherited item for collation. If the default-locale were used in resource-item lookup, it would produce odd and unexpected results for Nama sorting.
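
The difference between the two lookups can be sketched with naïve truncation. This is an illustration only: the function names and the sample data are hypothetical, and root aliases and any non-truncation parents are ignored.

```python
def truncate(locale_id):
    """Chain from a locale ID up to, but not including, root."""
    chain = []
    while locale_id:
        chain.append(locale_id)
        locale_id = locale_id.rpartition("-")[0]
    return chain


def bundle_chain(requested, default_locale):
    """Naive resource bundle lookup: the default locale IS consulted."""
    return truncate(requested) + truncate(default_locale) + ["root"]


def item_chain(bundle_id):
    """Inherited item lookup: the default locale is NOT consulted."""
    return truncate(bundle_id) + ["root"]


def find_item(bundle_id, key, data):
    """Return the first value for key along the item chain, or None."""
    for candidate in item_chain(bundle_id):
        if key in data.get(candidate, {}):
            return data[candidate][key]
    return None


# Nama has a bundle but no collation item; the default locale is Swedish.
data = {"naq": {}, "sv": {"collation": "Swedish rules"}, "root": {"collation": "root rules"}}
print(bundle_chain("se-FI", "en-GB"))       # ['se-FI', 'se', 'en-GB', 'en', 'root']
print(find_item("naq", "collation", data))  # root rules
```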
1541
The default locale is not even always used in resource bundle inheritance. For the following services, the fallback is always directly to the root locale rather than through the default locale.
1543
1544*   collation
1545*   break iteration
1546*   case mapping
1547*   transliteration
1548    *   The lookup for transliteration is yet more complicated because of the interplay of source and target locales: see _Part 2 General, Section 10.1 [Inheritance.](tr35-general.md#Inheritance)_
1549
1550Thus if there is no Akan locale, for example, asking for a collation for Akan should produce the root collation, _not the Swedish collation._
1551
1552The inherited item lookup must remain stable, because the resources are built with a certain fallback in mind; changing the core fallback order can render the bundle structure incoherent.
1553
Resource bundle lookup, on the other hand, is more flexible; changes in the view of the "best" match between the input request and the output bundle are more tolerable when they represent overall improvements for users. For more information, see _[A.1 Element fallback](#Fallback_Elements)_.
1555
1556Where the LDML inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding _all_ inherited data to each locale data set.
1557
For a more complete description of how inheritance applies to data, and the use of keywords, see _[Section 4.2 Inheritance](#Inheritance_and_Validity)_.
1559
1560The locale data does not contain general character properties that are derived from the _Unicode Character Database_ [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition, POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.
1561
1562**Warning:** If a locale has a different script than its parent (for example, sr_Latn), then special attention must be paid to make sure that all inheritance is covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.
1563
1564**Empty Override:** There is one special value reserved in LDML to indicate that a child locale is to have no value for a path, even if the parent locale has a value for that path. That value is "∅∅∅". For example, if there is no phrase for "two days ago" in a language, that can be indicated with:
1565
```xml
<field type="day">
  <relative type="-2">∅∅∅</relative>
</field>
```
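
An item lookup that honors the empty override might look like the following sketch. The function name, the path string and the locale "xx" are hypothetical; the only element taken from the text is the reserved value itself.

```python
NO_VALUE = "\u2205\u2205\u2205"  # the reserved value "∅∅∅"


def resolve(path, locale_chain, data):
    """Return the inherited value for path, or None if blocked or absent."""
    for locale_id in locale_chain:
        value = data.get(locale_id, {}).get(path)
        if value == NO_VALUE:
            return None          # the child explicitly has no value; stop here
        if value is not None:
            return value
    return None


data = {
    "root": {"day/relative-2": "2 days ago"},
    "xx": {"day/relative-2": NO_VALUE},   # hypothetical locale with no such phrase
}
print(resolve("day/relative-2", ["xx", "root"], data))  # None
print(resolve("day/relative-2", ["yy", "root"], data))  # 2 days ago
```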
1570
1571<a name="Multiple_Inheritance"></a>
1572#### 4.1.2 <a name="Lateral_Inheritance" href="#Lateral_Inheritance">Lateral Inheritance</a>
1573
1574__Lateral Inheritance__ is where resources are inherited from within the same locale, _before inheriting from the parent_. This is used for the following element@attribute instances:
1575
1576| Element @Attribute          | Source | Context |
1577| ---------------- | ------ | ------- |
1578| currency @pattern | currencyFormat   | numberSystem = defaultNumberingSystem, unless otherwise specified*<br/>currencyFormatLength type=none, unless otherwise specified<br/>currencyFormat type="standard", unless otherwise specified |
1579| currency @decimal | symbols @decimal  | numberSystem = defaultNumberingSystem, unless otherwise specified |
1580| currency @group   | symbols @group    | numberSystem = defaultNumberingSystem, unless otherwise specified |
1581
1582>\* The "unless otherwise specified" clause is for when an API or other context indicates a different choice, such as currencyFormat type="accounting".
1583
1584For example, with /currency [@type="CVE"], the decimal symbol for almost all locales is the value from symbols/decimal, but for pt_CV it is explicitly `<decimal>$</decimal>`.
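
As a minimal sketch of the decimal-symbol case, the lookup below checks the currency element first and then falls back laterally to the general symbols of the same locale. The dictionary layout and function name are hypothetical, the data is a toy restatement of the pt_CV example, and inheritance from the parent locale (which would follow the lateral step) is omitted.

```python
# Toy data for a single locale, following the pt_CV example in the text.
PT_CV = {
    "symbols": {"decimal": ","},
    "currencies": {"CVE": {"decimal": "$"}},
}


def currency_decimal(locale_data, currency_code):
    """Currency-specific decimal symbol first, then symbols/decimal."""
    specific = locale_data.get("currencies", {}).get(currency_code, {})
    if "decimal" in specific:
        return specific["decimal"]
    return locale_data["symbols"]["decimal"]  # lateral fallback, same locale


print(currency_decimal(PT_CV, "CVE"))  # $
print(currency_decimal(PT_CV, "EUR"))  # ,
```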
1585
1586The following attributes use lateral inheritance for **all elements** with the DTD root = ldml, except where otherwise noted. The process is applied recursively.
1587
1588| Attribute  | Fallback                               | Exception Elements          |
1589| ---------- | -------------------------------------- | --------------------------- |
1590| alt        | __no alt attribute__                   | _none_                      |
1591| case       | "nominative" → ∅                       | caseMinimalPairs            |
1592| gender     | default_gender(locale) → ∅             | genderMinimalPairs          |
1593| count      | plural_rules(locale, x) → "other" → ∅  | minDays, pluralMinimalPairs |
1594| ordinal    | plural_rules(locale, x) → "other" → ∅  | ordinalMinimalPairs         |
1595
1596The gender fallback is to neuter if the locale has a neuter gender, otherwise masculine. This may be extended in the future if necessary. See also [Part 2, Section 15, Grammatical Features](tr35-general.md#Grammatical_Features).
1597
1598For example, if there is no value for a path, and that path has a [@count="x"] attribute and value, then:
1599
1. If "x" is numeric, the path falls back to the path with [@count=«the plural rules category for x for that locale»], within the same locale.
   1. For example, [@count="0"] for English falls back to [@count="other"], while for French it falls back to [@count="one"].
2. If "x" is anything but "other", it falls back to a path [@count="other"], within the same locale.
3. If "x" is "other", it falls back to the path that is completely missing the count item, within the same locale.
4. If there is no value for that path in the same locale, the same process is used for the **original path** in the parent locale.
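
The four steps can be sketched as follows. The plural-category function is a toy covering only integer counts in English and French, consistent with the example in step 1; a real implementation would consult the locale's plural rules. All names are hypothetical.

```python
def toy_plural_category(locale_id, number):
    """Toy integer plural rules for English and French only (assumption)."""
    language = locale_id.split("-")[0]
    if language == "fr":
        return "one" if number in (0, 1) else "other"
    return "one" if number == 1 else "other"


def count_fallbacks(locale_id, count):
    """Counts to try within one locale; None means 'no count attribute'."""
    chain = [count]
    if count.isdigit():                                   # step 1
        count = toy_plural_category(locale_id, int(count))
        chain.append(count)
    if count != "other":                                  # step 2
        chain.append("other")
    chain.append(None)                                    # step 3
    return chain


def lookup_order(locale_chain, count):
    """Step 4: restart with the original count in each parent locale."""
    return [(loc, c) for loc in locale_chain for c in count_fallbacks(loc, count)]


print(count_fallbacks("en", "0"))  # ['0', 'other', None]
print(count_fallbacks("fr", "0"))  # ['0', 'one', 'other', None]
```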
1605
1606A path may have multiple attributes with lateral inheritance. In such a case, all of the combinations are tried, and in the order supplied above. For example (this is an extreme case):
1607
```
/compoundUnitPattern1[@count="few"][@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@count="few"][@gender="feminine"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@count="few"][@gender="neuter"] →
/compoundUnitPattern1[@count="few"][@case="accusative"] →
/compoundUnitPattern1[@count="few"][@case="nominative"] →
/compoundUnitPattern1[@count="few"] →

/compoundUnitPattern1[@count="other"][@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@count="other"][@gender="feminine"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@count="other"][@gender="neuter"] →
/compoundUnitPattern1[@count="other"][@case="accusative"] →
/compoundUnitPattern1[@count="other"][@case="nominative"] →
/compoundUnitPattern1[@count="other"] →

/compoundUnitPattern1[@gender="feminine"][@case="accusative"] →
/compoundUnitPattern1[@gender="feminine"][@case="nominative"] →
/compoundUnitPattern1[@gender="feminine"] →
/compoundUnitPattern1[@gender="neuter"][@case="accusative"] →
/compoundUnitPattern1[@gender="neuter"][@case="nominative"] →
/compoundUnitPattern1[@gender="neuter"] →
/compoundUnitPattern1[@case="accusative"] →
/compoundUnitPattern1[@case="nominative"] →
/compoundUnitPattern1
```
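
A listing like the one above can be generated mechanically. The sketch below is one way to do it, assuming a default gender of neuter and a default case of nominative; `lateral_paths` is a hypothetical helper, and the nesting order (count outermost, case innermost) is read off the listing rather than taken from a normative rule.

```python
from itertools import product


def lateral_paths(element, count, gender, case,
                  default_gender="neuter", default_case="nominative"):
    """List the paths tried within one locale, in order."""
    def options(name, requested, fallback):
        values = [requested]
        if fallback != requested:       # avoid listing the fallback twice
            values.append(fallback)
        # The final "" stands for the path without this attribute.
        return [f'[@{name}="{v}"]' for v in values] + [""]

    counts = options("count", count, "other")
    genders = options("gender", gender, default_gender)
    cases = options("case", case, default_case)
    # product varies its last argument fastest: case, then gender, then count.
    return [f"/{element}{c}{g}{k}" for c, g, k in product(counts, genders, cases)]


paths = lateral_paths("compoundUnitPattern1", "few", "feminine", "accusative")
print(len(paths))  # 27
```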
1639
1640_Examples:_
1641
1642###### Table: <a name="Count_Fallback_normal" href="#Count_Fallback_normal">Count Fallback: normal</a>
1643
1644| Locale | Path |
1645| ------ | ---- |
1646| fr-CA  | `//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]`     |
1647| fr-CA  | `//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]` |
1648| fr     | `//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]`     |
1649| fr     | `//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]` |
1650| root   | `//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="x"]`     |
1651| root   | `//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern[@count="other"]` |
1652
1653> Note that there may also be an alias in root that changes the path and starts again from the requested locale, such as:
1654
1655```xml
1656<unitLength type="narrow">
1657   <alias source="locale" path="../unitLength[@type='short']"/>
1658</unitLength>
1659```
1660
1661###### Table: <a name="Count_Fallback_currency" href="#Count_Fallback_currency">Count Fallback: currency</a>
1662
1663| Locale | Path |
1664| ------ | ---- |
1665| fr-CA | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]`     |
1666| fr-CA | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]` |
1667| fr-CA | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName`                 |
1668| fr    | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]`     |
1669| fr    | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]` |
1670| fr    | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName`                 |
1671| root  | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="x"]`     |
1672| root  | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName[@count="other"]` |
1673| root  | `//ldml/numbers/currencies/currency[@type="CAD"]/displayName`                 |
1674
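The lookup order in the two tables above can be generated mechanically. The following Python sketch is illustrative only: the function name `count_fallback_paths` and the `allow_no_count` flag are hypothetical, the locale chain is passed in rather than computed, and aliases in root (as in the note above) are not handled.

```python
def count_fallback_paths(locale_chain, base_path, count, allow_no_count=False):
    """Return (locale, path) pairs in the order the tables above list them.

    locale_chain: locales from most specific to root, e.g. ["fr-CA", "fr", "root"].
    base_path: the element path without any count attribute.
    allow_no_count: also try the path with no count attribute (currency case).
    """
    suffixes = [f'[@count="{count}"]']
    if count != "other":
        suffixes.append('[@count="other"]')
    if allow_no_count:
        suffixes.append("")
    # Within each locale, try every count variant before moving to the parent.
    return [(loc, base_path + s) for loc in locale_chain for s in suffixes]


chain = ["fr-CA", "fr", "root"]
unit = '//ldml/units/unitLength[@type="narrow"]/unit[@type="mass-gram"]/unitPattern'
for locale, path in count_fallback_paths(chain, unit, "x"):
    print(locale, path)
```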
1675
1676#### 4.1.3 <a name="Parent_Locales" href="#Parent_Locales">Parent Locales</a>
1677
1678```xml
1679<!ELEMENT parentLocales ( parentLocale* ) >
1680<!ELEMENT parentLocale EMPTY >
1681<!ATTLIST parentLocale parent NMTOKEN #REQUIRED >
1682<!ATTLIST parentLocale locales NMTOKENS #REQUIRED >
1683```
1684
1685In some cases, the normal truncation inheritance does not function well. This happens when:
1686
16871.  The child locale is of a different script. In this case, mixing elements from the parent into the child data results in a mishmash.
16882.  A large number of child locales behave similarly, and differently from the truncation parent.
1689
1690The `parentLocale` element is used to override the normal inheritance when accessing CLDR data.
1691
1692For case 1, the children are script locales, and the parent is "root". For example:
1693
1694```xml
1695<parentLocale parent="root" locales="az_Cyrl ha_Arab … zh_Hant"/>
1696```
1697
1698For case 2, the children and parent share the same primary language, but the region is changed. For example:
1699
1700```xml
1701<parentLocale parent="es_419" locales="es_AR es_BO … es_UY es_VE"/>
1702```
1703
1704Collation data, however, is an exception. Since collation rules do not truly inherit data from the parent, the `parentLocale` element is not necessary and not used for collation. Thus, for a locale like zh_Hant in the example above, the `parentLocale` element would dictate the parent as "root" when referring to main locale data, but for collation data, the parent locale would still be "zh", even though the `parentLocale` element is present for that locale.
1705
Since parentLocale information is not localizable on a per-locale basis, it is contained in CLDR’s [supplemental data](tr35-info.md).
1707
1708When a `parentLocale` element is used to override normal inheritance, the following guidelines apply in most cases:
1709
17101.  If X is the parentLocale of Y, then either X is the root locale, or X has the same base language code as Y. For example, the parent of `en` cannot be `fr`, and the parent of `en_YY` cannot be `fr` or `fr_XX`.
17112.  If X is the parentLocale of Y, Y must not be a base language locale. For example, the parent of `en` cannot be `en_XX`.
1712
1713There may be specific exceptions to these for certain closely-related languages or language-script combinations, for example:
1714* `no` may be the parent of `nb` and `nn`.
1715* `en_IN` may be the parent of `hi_Latn` (the parent is one of the languages for a child that is effectively a hybrid of two languages in `Latn` script)
1716
1717There are certain invariants that must always be true:
1718
17193. The parent must either be the root locale or have the same script as the child.
17204. There must never be cycles, such as: X parent of Y ... parent of X.
17215. Following the inheritance path, using parentLocale where available and otherwise truncating the locale, must always lead eventually to the root locale.
1722
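As a rough illustration of how `parentLocale` overrides combine with truncation, the Python sketch below builds an inheritance path for main locale data and checks the no-cycle invariant. The `PARENT_LOCALES` table is only a small subset taken from the examples above, and the function names are hypothetical. As noted above, collation lookup would not use the overrides.

```python
# Illustrative subset of <parentLocale> data, taken from the examples above.
PARENT_LOCALES = {
    "az_Cyrl": "root",
    "ha_Arab": "root",
    "zh_Hant": "root",
    "es_AR": "es_419",
    "es_BO": "es_419",
    "es_UY": "es_419",
    "es_VE": "es_419",
}


def parent(locale):
    """Return the parent of a locale for main locale data, or None for root."""
    if locale == "root":
        return None
    if locale in PARENT_LOCALES:          # an explicit override wins
        return PARENT_LOCALES[locale]
    if "_" in locale:                     # otherwise truncate the last subtag
        return locale.rsplit("_", 1)[0]
    return "root"                         # a bare language inherits from root


def locale_chain(locale):
    """Follow parents up to root, raising if a cycle is detected."""
    chain, seen = [], set()
    while locale is not None:
        if locale in seen:
            raise ValueError(f"cycle in parent data at {locale}")
        seen.add(locale)
        chain.append(locale)
        locale = parent(locale)
    return chain


print(locale_chain("es_AR"))       # ['es_AR', 'es_419', 'es', 'root']
print(locale_chain("zh_Hant_HK"))  # ['zh_Hant_HK', 'zh_Hant', 'root']
```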
1723### 4.2 <a name="Inheritance_and_Validity" href="#Inheritance_and_Validity">Inheritance and Validity</a>
1724
1725The following describes in more detail how to determine the exact inheritance of elements, and the validity of a given element in LDML.
1726
1727#### 4.2.1 <a name="Definitions" href="#Definitions">Definitions</a>
1728
1729_Blocking_ elements are those whose subelements do not inherit from parent locales. For example, a `<collation>` element is a blocking element: everything in a `<collation>` element is treated as a single lump of data, as far as inheritance is concerned. For more information, see [Section 5.5 Valid Attribute Values](#Valid_Attribute_Values).
1730
1731Attributes that serve to distinguish multiple elements at the same level are called _distinguishing_ attributes. For example, the `type` attribute distinguishes different elements in lists of translations, such as:
1732
1733```xml
1734<language type="aa">Afar</language>
1735<language type="ab">Abkhazian</language>
1736```
1737
1738Distinguishing attributes affect inheritance; two elements with different distinguishing attributes are treated as different for purposes of inheritance. For more information, see [Section 5.5 Valid Attribute Values](#Valid_Attribute_Values). Other attributes are called value attributes. Value attributes do not affect inheritance, and elements with value attributes may not have child elements (see [XML Format](#XML_Format)).
1739
1740Non-distinguishing attributes are identified by [DTD Annotations](#DTD_Annotations) such as `@VALUE`.
1741
1742For any element in an XML file, _an element chain_ is a resolved [[XPath](#XPath)] leading from the root to an element, with attributes on each element in alphabetical order. So in, say, [https://github.com/unicode-org/cldr/blob/main/common/main/el.xml](https://github.com/unicode-org/cldr/blob/main/common/main/el.xml) we may have:
1743
1744```xml
1745<ldml>
1746    <identity>
1747        <version number="1.1" />
1748        <language type="el" />
1749    </identity>
1750    <localeDisplayNames>
1751        <languages>
1752            <language type="ar">Αραβικά</language>
1753...
1754```
1755
This gives the following element chains (among others):
1757
1758* `//ldml/identity/version[@number="1.1"]`
1759* `//ldml/localeDisplayNames/languages/language[@type="ar"]`
1760
1761An element chain A is an _extension_ of an element chain B if B is equivalent to an initial portion of A. For example, #2 below is an extension of #1. (Equivalent, depending on the tree, may not be "identical to". See below for an example.)
1762
17631. `//ldml/localeDisplayNames`
17642. `//ldml/localeDisplayNames/languages/language[@type="ar"]`
1765
1766An LDML file can be thought of as an ordered list of _element pairs_: <element chain, data>, where the element chains are all the chains for the end-nodes. (This works because of restrictions on the structure of LDML, including that it does not allow mixed content.) The ordering is the ordering that the element chains are found in the file, and thus determined by the DTD.
1767
1768For example, some of those pairs would be the following. Notice that the first has the null string as element contents.
1769
* <`//ldml/identity/version[@number="1.1"]`, `""`>
1771* <`//ldml/localeDisplayNames/languages/language[@type="ar"]`, `"Αραβικά"`>
1772
1773> Note: There are two exceptions to this:
1774>
1775> 1. Blocking nodes and their contents are treated as a single end node.
1776> 2. In terms of computing inheritance, the element pair consists of the element chain plus all distinguishing attributes; the value consists of the value (if any) plus any nondistinguishing attributes.
1777>
> > Thus instead of the element pair being (1) below, it is (2):
1779> >
1780> > 1. <`//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00']`,`""`>
1781> > 2. <`//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart`,`[@day='sun'][@time='00:00']`>
1782
1783Two LDML element chains are _equivalent_ when they would be identical if all attributes and their values were removed — except for distinguishing attributes. Thus the following are equivalent:
1784
1785* `//ldml/localeDisplayNames/languages/language[@type="ar"]`
1786* `//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]`
1787
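The equivalence test can be sketched as stripping non-distinguishing attributes and comparing what remains. In the Python sketch below, the set `NON_DISTINGUISHING` is an assumption that contains only `draft`; a real implementation would derive the set from the DTD annotations. The chains are assumed to use the same quoting style and alphabetical attribute order.

```python
import re

# Assumption: only "draft" is treated as non-distinguishing here; real code
# would take this set from the DTD annotations.
NON_DISTINGUISHING = {"draft"}

# Matches one attribute predicate such as [@type="ar"] or [@type='ar'].
ATTR = re.compile(r"""\[@(\w+)=(["'])(.*?)\2\]""")


def distinguishing_form(chain):
    """Drop non-distinguishing attributes from an element chain string."""
    def keep(match):
        return "" if match.group(1) in NON_DISTINGUISHING else match.group(0)
    return ATTR.sub(keep, chain)


def equivalent(chain_a, chain_b):
    """Two chains are equivalent if they differ only in dropped attributes."""
    return distinguishing_form(chain_a) == distinguishing_form(chain_b)


a = '//ldml/localeDisplayNames/languages/language[@type="ar"]'
b = '//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]'
print(equivalent(a, b))  # True
```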
1788For any locale ID, a _locale chain_ is an ordered list starting with the root and leading down to the ID. For example:
1789
1790> <root, de, de_DE, de_DE_xxx>
1791
1792#### 4.2.2 <a name="Resolved_Data_File" href="#Resolved_Data_File">Resolved Data File</a>
1793
To produce a fully resolved locale data file from CLDR for a locale ID L, start with L and successively add unique items from the parent locales until you reach root. More formally, this can be expressed as the following procedure.
1795
17961. Let Result be initially L.
17972. For each Li in the locale chain for L, starting at L and going up to root:
1798   1. Let Temp be a copy of the pairs in the LDML file for Li
1799   2. Replace each alias in Temp by the resolved list of pairs it points to.
1800      1. The resolved list of pairs is obtained by recursively applying this procedure.
1801      2. That alias now blocks any inheritance from the parent. (See _[Section 5.1 Common Elements](#Common_Elements)_ for an example.)
1802   3. For each element pair P in Temp:
1803      1. If P does not contain a blocking element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.
1804
1805**Notes:**
1806
1807* When adding an element pair to a result, it has to go in the right order for it to be valid according to the DTD.
1808* The identity element and its children are unaffected by resolution.
1809* The LDML data must be constructed so as to avoid circularity in step 2.2.
1810
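The core of the procedure above (step 2.3) can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation: aliases and blocking elements are not modelled, element chains are assumed to already be reduced to their distinguishing form, and the paths and values in `files` are made-up placeholders.

```python
def resolve(locale_chain, files):
    """Merge element pairs along a locale chain, most specific locale first.

    locale_chain: e.g. ["de_DE", "de", "root"] (requested locale first).
    files: dict mapping locale ID -> dict of {element chain: value}.
    Aliases and blocking elements are deliberately not modelled here.
    """
    result = {}
    for locale in locale_chain:
        for chain, value in files.get(locale, {}).items():
            # Only add a pair if no more specific locale already supplied it.
            result.setdefault(chain, value)
    return result


# Placeholder data, for illustration only.
files = {
    "root": {"//ldml/a": "root-a", "//ldml/b": "root-b"},
    "de": {"//ldml/a": "de-a"},
    "de_DE": {"//ldml/c": "de_DE-c"},
}
resolved = resolve(["de_DE", "de", "root"], files)
print(resolved)
```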
1811#### 4.2.3 <a name="Valid_Data" href="#Valid_Data">Valid Data</a>
1812
The attribute `draft="x"` in LDML means that the data has not been approved by the subcommittee. (For more information, see [Process](https://cldr.unicode.org/index/process).) However, some data that is not explicitly marked as `draft` may be implicitly `draft`, because it inherits that status either from a parent or from an enclosing element.
1814
1815**Example 2.** Suppose that new locale data is added for af (Afrikaans). To indicate that all of the data is _unconfirmed_, the attribute can be added to the top level.
1816
1817```xml
1818<ldml version="1.1" draft="unconfirmed">
1819    <identity>
1820        <version number="1.1" />
1821        <language type="af" />
1822    </identity>
1823    <characters>...</characters>
1824    <localeDisplayNames>...</localeDisplayNames>
1825</ldml>
1826```
1827
Any data can be added to that file, and the status will all be `draft="unconfirmed"`. Once an item is vetted (_whether it is inherited or explicitly in the file_), its status can be changed to _approved_. This can be done by leaving `draft="unconfirmed"` on the enclosing element and marking the child with `draft="approved"`, such as:
1829
1830```xml
1831<ldml version="1.1" draft="unconfirmed">
1832    <identity>
1833        <version number="1.1" />
1834        <language type="af" />
1835    </identity>
1836    <characters draft="approved">...</characters>
1837    <localeDisplayNames>...</localeDisplayNames>
1838    <dates />
1839    <numbers />
1840    <collations />
1841</ldml>
1842```
1843
1844However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in _[Section 5.6 Canonical Form](#Canonical_Form)_. If an LDML file does have draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file.
1845
More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Items 1, 2, and 4 are simply formalizations of what is in LDML already. Item 3 adds the new element.
1847
1848#### 4.2.4 <a name="Checking_for_Draft_Status" href="#Checking_for_Draft_Status">Checking for Draft Status</a>
1849
18501. **Parent Locale Inheritance**
1851   1. Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
1852   2. Produce the fully resolved data file D' for D.
1853   3. In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
1854   4. If there is no such E', return _true_
1855   5. If E' is not equivalent to E, truncate E' to the length of E.
18562. **Enclosing Element Inheritance**
1857   1. Walk through the elements in E', from back to front.
1858      1. If you ever encounter draft=_x_, return _x_
1859   2. If L' = L, return _false_
18603. **Missing File Inheritance**
1861   1. Otherwise, walk again through the elements in E', from back to front.
1862      1. If you encounter a `validSubLocales` attribute (deprecated):
1863         1. If L is in the attribute value, return _false_
1864         2. Otherwise return _true_
18654. **Otherwise**
1866   1.  Return _true_
1867
1868The `validSubLocales` in the most specific (farthest from root file) locale file "wins" through the full resolution step (data from more specific files replacing data from less specific ones).
1869
1870#### 4.2.5 <a name="Keyword_and_Default_Resolution" href="#Keyword_and_Default_Resolution">Keyword and Default Resolution</a>
1871
1872When accessing data based on keywords, the following process is used. Consider the following example:
1873
1874* The locale 'de' has collation types A, B, C, and no `<default>` element
1875* The locale 'de_CH' has `<default type='B'>`
1876
1877Here are the searches for various combinations.
1878
1879<!-- HTML: rowspan -->
1880<table><thead>
1881<tr><th>User Input</th>                                 <th>Lookup in Locale</th>   <th>For</th>                        <th>Comment</th></tr>
1882</thead><tbody>
1883<tr><td rowspan="3">de_CH<br/><i>no keyword</i></td>    <td>de_CH</td>              <td>default collation type</td>     <td>finds "B"</td></tr>
1884<tr>                                                    <td>de_CH</td>              <td>collation type=B</td>           <td>not found</td></tr>
1885<tr>                                                    <td>de</td>                 <td>collation type=B</td>           <td><i>found</i></td></tr>
1886<tr><td rowspan="4">de<br/><i>no keyword</i></td>       <td>de</td>                 <td>default collation type</td>     <td>not found</td></tr>
1887<tr>                                                    <td>root</td>               <td>default collation type</td>	    <td>finds "standard"</td></tr>
1888<tr>                                                    <td>de</td>                 <td>collation type=standard</td>    <td>not found</td></tr>
1889<tr>                                                    <td>root</td>               <td>collation type=standard</td>    <td><i>found</i></td></tr>
1890<tr><td>de_u_co_A</td>                                  <td>de</td>                 <td>collation type=A</td>           <td><i>found</i></td></tr>
1891<tr><td rowspan="2">de_u_co_standard</td>	            <td>de</td>	                <td>collation type=standard</td>    <td>not found</td></tr>
1892<tr>                                                    <td>root</td>               <td>collation type=standard</td>    <td><i>found</i></td></tr>
1893<tr><td rowspan="6">de_u_co_foobar</td>	                <td>de</td>	                <td>collation type=foobar</td>      <td>not found</td></tr>
1894<tr>                                                    <td>root</td>               <td>collation type=foobar</td>      <td>not found, starts looking for default</td></tr>
1895<tr>                                                    <td>de</td>	                <td>default collation type</td>     <td>not found</td></tr>
1896<tr>                                                    <td>root</td>               <td>default collation type</td>     <td>finds "standard"</td></tr>
1897<tr>                                                    <td>de</td>	                <td>collation type=standard</td>    <td>not found</td></tr>
1898<tr>                                                    <td>root</td>               <td>collation type=standard</td>    <td><i>found</i></td></tr>
1899</tbody></table>
1900
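The search order in the table above can be reproduced with a short Python sketch. The function names and the `types`/`defaults` dictionaries are hypothetical stand-ins for the collation data. The keyword value is passed separately instead of being parsed from the locale ID, and the locale chain uses truncation only, since collation does not use `parentLocale`.

```python
def truncation_chain(locale):
    """Locale chain by truncation only, ending in root."""
    parts = locale.split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)] + ["root"]


def find_collation(locale, requested, types, defaults):
    """Return ((locale, type), trace) for a collation lookup.

    types: dict locale -> set of collation types defined in that locale.
    defaults: dict locale -> value of its <default> element, if any.
    Assumes root has a default that exists in root (see the note below).
    """
    chain = truncation_chain(locale)
    trace = []

    def search(wanted):
        for loc in chain:
            trace.append((loc, f"collation type={wanted}"))
            if wanted in types.get(loc, set()):
                return loc
        return None

    if requested is not None:
        found = search(requested)
        if found is not None:
            return (found, requested), trace
    # No keyword, or the requested type does not exist: resolve the default.
    default = None
    for loc in chain:
        trace.append((loc, "default collation type"))
        if loc in defaults:
            default = defaults[loc]
            break
    return (search(default), default), trace


types = {"de": {"A", "B", "C"}, "root": {"standard"}}
defaults = {"de_CH": "B", "root": "standard"}

print(find_collation("de_CH", None, types, defaults)[0])   # ('de', 'B')
print(find_collation("de", "foobar", types, defaults)[0])  # ('root', 'standard')
```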
1901Examples of "search" collator lookup; 'de' has a language-specific version, but 'en' does not:
1902
1903<!-- HTML: rowspan -->
1904<table><thead>
1905<tr><th>User Input</th>                                 <th>Lookup in Locale</th>   <th>For</th>                        <th>Comment</th></tr>
1906</thead><tbody>
1907<tr><td rowspan="2">de_CH_u_co_search</td>              <td>de_CH</td>              <td>collation type=search</td>      <td>not found</td></tr>
1908<tr>                                                    <td>de</td>                 <td>collation type=search</td>      <td><i>found</i></td></tr>
1909<tr><td rowspan="3">en_US_u_co_search</td>              <td>en_US</td>              <td>collation type=search</td>      <td>not found</td></tr>
1910<tr>                                                    <td>en</td>                 <td>collation type=search</td>      <td>not found</td></tr>
1911<tr>                                                    <td>root</td>               <td>collation type=search</td>      <td><i>found</i></td></tr>
1912</tbody></table>
1913
1914Examples of lookup for Chinese collation types. Note:
1915
1916* All of the Chinese-specific collation types are provided in the 'zh' locale
1917* For 'zh' the `<default>` element specifies "pinyin"; for 'zh_Hant' the `<default>` element specifies "stroke". However any of the available Chinese collation types can be explicitly requested for any Chinese locale.
1918
1919<!-- HTML: rowspan -->
1920<table><thead>
1921<tr><th>User Input</th>                                 <th>Lookup in Locale</th>   <th>For</th>                        <th>Comment</th></tr>
1922</thead><tbody>
1923<tr><td rowspan="3">zh_Hant<br/><i>no keyword</i></td>  <td>zh_Hant</td>            <td>default collation type</td>     <td>finds "stroke"</td></tr>
1924<tr>                                                    <td>zh_Hant</td>            <td>collation type=stroke</td>      <td>not found</td></tr>
1925<tr>                                                    <td>zh</td>                 <td>collation type=stroke</td>      <td><i>found</i></td></tr>
1926<tr><td rowspan="3">zh_Hant_HK_u_co_pinyin</td>         <td>zh_Hant_HK</td>         <td>collation type=pinyin</td>      <td>not found</td></tr>
1927<tr>                                                    <td>zh_Hant</td>            <td>collation type=pinyin</td>      <td>not found</td></tr>
1928<tr>                                                    <td>zh</td>                 <td>collation type=pinyin</td>      <td><i>found</i></td></tr>
1929<tr><td rowspan="2">zh<br/><i>no keyword</i></td>       <td>zh</td>                 <td>default collation type</td>     <td>finds "pinyin"</td></tr>
1930<tr>                                                    <td>zh</td>                 <td>collation type=pinyin</td>      <td><i>found</i></td></tr>
1931</tbody></table>
1932
> **Note:** It is an invariant that the default in root for a given element must
> always be a value that exists in root. So you cannot have the following in root:
1935
1936```
1937<someElements>
1938    <default type='a'/>
1939    <someElement type='b'>...</someElement>
1940    <someElement type='c'>...</someElement>
1941    <!-- no 'a' -->
1942</someElements>
1943```
1944
1945For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default value is the identifier itself whenever no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display name is simply 'QA'.
1946
1947#### 4.2.6 <a name="Inheritance_vs_Related" href="#Inheritance_vs_Related">Inheritance vs Related Information</a>
1948
1949There are related types of data and processing that are easy to confuse:
1950
1951<!-- HTML: rowspan, colspan, col th -->
1952<table class="simple"><tbody>
1953<tr><th rowspan="4">Inheritance</th>
1954        <td colspan="2">Part of the internal mechanism used by CLDR to organize and manage locale data. This is used to share common resources, and ease maintenance, and provide the best fallback behavior in the absence of data. <i>Should not be used for locale matching or likely subtags.</i></td></tr>
1955        <tr><td><i>Example:</i></td>
1956            <td>parent(en_AU) ⇒ en_001<br/>
1957                parent(en_001) ⇒ en<br/>
1958                parent(en) ⇒ root</td></tr>
1959        <tr><td><i>Data:</i></td>
1960            <td>supplementalData.xml &lt;parentLocale&gt;</td></tr>
1961        <tr><td><i>Spec:</i></td>
1962            <td><b>Section <a href="#Inheritance_and_Validity">4.2 Inheritance and Validity</a></b></td></tr>
1963
1964<tr><th rowspan="4">DefaultContent</th>
1965    <td colspan="2">Part of the internal mechanism used by CLDR to manage locale data. A particular sublocale is designated the defaultContent for a parent, so that the parent exhibits consistent behavior. <i>Should not be used for locale matching or likely subtags.</i></td></tr>
1966        <tr><td><i>Example:</i></td>
1967            <td>addLikelySubtags(sr-ME) ⇒ sr-Latn-ME, minimize(de-Latn-DE) ⇒ de</td></tr>
1968        <tr><td><i>Data:</i></td>
1969            <td>supplementalMetadata.xml &lt;defaultContent&gt;</td></tr>
        <tr><td><i>Spec:</i></td>
            <td><b>Part 6: Section 9.3&nbsp;<a href="tr35-info.md#Default_Content">Default Content</a></b></td></tr>
1972
1973<tr><th rowspan="4">LikelySubtags</th>
1974    <td colspan="2">Provides most likely full subtag (script and region) in the absence of other information. A core component of LocaleMatching.</td></tr>
1975        <tr><td><i>Example:</i></td>
            <td>addLikelySubtags(zh) ⇒ zh-Hans-CN<br/>addLikelySubtags(zh-TW) ⇒ zh-Hant-TW<br/>minimize(zh-Hant, favorRegion) ⇒ zh-TW</td></tr>
1977        <tr><td><i>Data:</i></td>
1978            <td>likelySubtags.xml &lt;likelySubtags&gt;</td></tr>
1979        <tr><td><i>Spec:</i></td>
1980            <td><b>Section <a href="#Likely_Subtags">4.3 Likely Subtags</a></b></td></tr>
1981
1982<tr><th rowspan="4">LocaleMatching</th>
1983    <td colspan="2">Provides the best match for the user’s language(s) among an application’s supported languages.</td></tr>
1984        <tr><td><i>Example:</i></td>
1985            <td>bestLocale(userLangs=&lt;en, fr&gt;, appLangs=&lt;fr-CA, ru&gt;) ⇒ fr-CA</td></tr>
1986        <tr><td><i>Data:</i></td>
1987            <td>languageInfo.xml &lt;languageMatching&gt;</td></tr>
1988        <tr><td><i>Spec:</i></td>
1989            <td><b>Section <a href="#LanguageMatching">4.4 Language Matching</a></b></td></tr>
1990
1991</tbody></table>
1992
1993### 4.3 <a name="Likely_Subtags" href="#Likely_Subtags">Likely Subtags</a>
1994
1995```xml
1996<!ELEMENT likelySubtag EMPTY >
1997<!ATTLIST likelySubtag from NMTOKEN #REQUIRED>
1998<!ATTLIST likelySubtag to NMTOKEN #REQUIRED>
1999```
2000
2001There are a number of situations where it is useful to be able to find the most likely language, script, or region. For example, given the language "zh" and the region "TW", what is the most likely script? Given the script "Thai" what is the most likely language or region? Given the region TW, what is the most likely language and script?
2002
2003Conversely, given a locale, it is useful to find out which fields (language, script, or region) may be superfluous, in the sense that they contain the likely tags. For example, "en_Latn" can be simplified down to "en" since "Latn" is the likely script for "en"; "ja_Jpan_JP" can be simplified down to "ja".
2004
2005The _likelySubtag_ supplemental data provides default information for computing these values. This data is based on the default content data, the population data, and the suppress-script data in [[BCP47](#BCP47)]. It is heuristically derived, and may change over time.
2006
2007For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see **_Section 4.2.6 [Inheritance vs Related Information](tr35.md#Inheritance_vs_Related)_**.
2008
2009To look up data in the table, see if a locale matches one of the `from` attribute values. If so, fetch the corresponding `to` attribute value. For example, the Chinese data looks like the following:
2010
2011```xml
2012<likelySubtag from="zh" to="zh_Hans_CN" />
2013<likelySubtag from="zh_HK" to="zh_Hant_HK" />
2014<likelySubtag from="zh_Hani" to="zh_Hani_CN" />
2015<likelySubtag from="zh_Hant" to="zh_Hant_TW" />
2016<likelySubtag from="zh_MO" to="zh_Hant_MO" />
2017<likelySubtag from="zh_TW" to="zh_Hant_TW" />
2018```
2019
2020So looking up "zh_TW" returns "zh_Hant_TW", while looking up "zh" returns "zh_Hans_CN".
2021
2022In more detail, the data is designed to be used in the following operations.
2023
Note that as of CLDR v24, any field present in the 'from' field is also present in the 'to' field, so an input field will not change in the "Add Likely Subtags" operation. The data and operations can also be used with language tags using [[BCP47](#BCP47)] syntax, with the appropriate changes. In addition, certain common 'denormalized' language subtags such as 'iw' (for 'he') may occur in both the 'from' and 'to' fields. This allows implementations that use those denormalized subtags to use the data with only minor changes to the operations.
2025
2026An implementation may choose to exclude language tags with the language subtag "und" from the following operation. In such a case, only the canonicalization is done. An implementation can declare that it is doing the exclusion, or can take a parameter that controls whether or not to do it.
2027
2028_**Add Likely Subtags:**_ _Given a source locale X, to return a locale Y where the empty subtags have been filled in by the most likely subtags._ This is written as X ⇒ Y ("X maximizes to Y").
2029
A subtag is called _empty_ if it is a missing script or region subtag, or it is a base language subtag with the value "und". In the description below, a subscript on a subtag _x_ indicates which tag it is from: _x<sub>s</sub>_ is in the source, _x<sub>m</sub>_ is in a match, and _x<sub>r</sub>_ is in the final result.
2031
2032This operation is performed in the following way.
2033
20341. **Canonicalize.**
2035   1. Make sure the input locale is in canonical form: uses the right separator, and has the right casing.
2036   2. Replace any deprecated subtags with their canonical values using the `<alias>` data in supplemental metadata. Use the first value in the replacement list, if it exists. Language tag replacements may have multiple parts, such as "sh" ➞ "sr_Latn" or "mo" ➞ "ro_MD". In such a case, the original script and/or region are retained if there is one. Thus "sh_Arab_AQ" ➞ "sr_Arab_AQ", not "sr_Latn_AQ".
2037   3. If the tag is a legacy language tag (marked as “Type: grandfathered” in BCP 47; see `<variable id="$grandfathered" type="choice">` in the supplemental data), then return it.
2038   4. Remove the script code 'Zzzz' and the region code 'ZZ' if they occur.
   5. Get the components of the cleaned-up source tag (_language<sub>s</sub>_, _script<sub>s</sub>_, and _region<sub>s</sub>_), plus any variants and extensions.
2. **Lookup.** Look up each of the following in order, and stop on the first match:
   1. _language<sub>s</sub>\_script<sub>s</sub>\_region<sub>s</sub>_
   2. _language<sub>s</sub>\_region<sub>s</sub>_
   3. _language<sub>s</sub>\_script<sub>s</sub>_
   4. _language<sub>s</sub>_
   5. und\__script<sub>s</sub>_
20463. **Return**
2047   1. If there is no match, either return
2048      1.  an error value, or
2049      2.  the match for "und" (in APIs where a valid language tag is required).
   2. Otherwise there is a match = _language<sub>m</sub>\_script<sub>m</sub>\_region<sub>m</sub>_
   3. Let _x<sub>r</sub>_ = _x<sub>s</sub>_ if _x<sub>s</sub>_ is not empty, and _x<sub>m</sub>_ otherwise.
   4. Return the language tag composed of _language<sub>r</sub>\_script<sub>r</sub>\_region<sub>r</sub>_ + variants + extensions.
2053
2054The lookup can be optimized. For example, if any of the tags in Step 2 are the same as previous ones in that list, they do not need to be tested.
2055
_Example:_
2057
2058* Input is ZH-ZZZZ-SG.
2059* Normalize to zh_SG.
2060* Look up in table. No match.
2061* Look up zh, and get the match (zh_Hans_CN). Substitute SG, and return zh_Hans_SG.
2062
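The Add Likely Subtags steps can be sketched in Python as follows. This is a simplified illustration under several assumptions: the `LIKELY` table is just the Chinese subset shown earlier, replacement of deprecated subtags and the legacy-tag check (steps 1.2 and 1.3) are omitted, variants and extensions are dropped, and the helper names are hypothetical.

```python
# Illustrative subset of likelySubtags data, taken from the examples above.
LIKELY = {
    "zh": "zh_Hans_CN",
    "zh_HK": "zh_Hant_HK",
    "zh_Hani": "zh_Hani_CN",
    "zh_Hant": "zh_Hant_TW",
    "zh_MO": "zh_Hant_MO",
    "zh_TW": "zh_Hant_TW",
}


def parse(tag):
    """Split a tag into (language, script, region); other subtags are ignored."""
    language, script, region = "und", "", ""
    for i, part in enumerate(tag.replace("-", "_").split("_")):
        if i == 0:
            language = part.lower()
        elif len(part) == 4 and part.isalpha():
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()
    # Step 1.4: drop the "unknown" script and region codes.
    if script == "Zzzz":
        script = ""
    if region == "ZZ":
        region = ""
    return language, script, region


def add_likely_subtags(tag, table=LIKELY):
    """Return the maximized tag, or None when no entry matches."""
    lang_s, script_s, region_s = parse(tag)
    candidates = [
        (lang_s, script_s, region_s),
        (lang_s, "", region_s),
        (lang_s, script_s, ""),
        (lang_s, "", ""),
        ("und", script_s, ""),
    ]
    for candidate in candidates:
        key = "_".join(p for p in candidate if p)
        if key in table:
            lang_m, script_m, region_m = table[key].split("_")
            # Keep each non-empty source subtag; fill the rest from the match.
            language = lang_s if lang_s != "und" else lang_m
            return "_".join([language, script_s or script_m, region_s or region_m])
    return None  # no match: signal an error


print(add_likely_subtags("ZH-ZZZZ-SG"))  # zh_Hans_SG
```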
2063To find the most likely language for a country, or language for a script, use "und" as the language subtag. For example, looking up "und_TW" returns zh_Hant_TW.
2064
2065A goal of the algorithm is that if X ⇒ Y, and X' results from replacing an empty subtag in X by the corresponding subtag in Y, then X' ⇒ Y. For example, if und_AF ⇒ fa_Arab_AF, then:
2066
2067* fa_Arab_AF ⇒ fa_Arab_AF
2068* und_Arab_AF ⇒ fa_Arab_AF
2069* fa_AF ⇒ fa_Arab_AF
2070
2071There are a small number of exceptions to this goal in the current data, where X ∈ {und_Bopo, und_Brai, und_Cakm, und_Limb, und_Shaw}.
2072
_**Remove Likely Subtags:**_ _Given a locale, remove any fields that Add Likely Subtags would add._
2074
2075The reverse operation removes fields that would be added by the first operation.
2076
20771. First get max = AddLikelySubtags(inputLocale). If an error is signaled, return it.
20782. Remove the variants from max.
3. Get the components of the max (_language<sub>max</sub>_, _script<sub>max</sub>_, _region<sub>max</sub>_).
4. Then for _trial_ in {_language<sub>max</sub>_, _language<sub>max</sub>\_region<sub>max</sub>_, _language<sub>max</sub>\_script<sub>max</sub>_}
2081   * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants.
20825. If you do not get a match, return max + variants.
2083
2084Example:
2085
2086* Input is zh_Hant. Maximize to get zh_Hant_TW.
* zh ⇒ zh_Hans_CN. No match, so continue.
* zh_TW ⇒ zh_Hant_TW. Matches, so return zh_TW.
2089
2090A variant of this favors the script over the region, thus using {language, language_script, language_region} in the above. If that variant is used, then the result in this example would be zh_Hant instead of zh_TW.
2091
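The Remove Likely Subtags steps, including the script-favoring variant, might be sketched in Python as below. The `maximize` parameter stands for any implementation of Add Likely Subtags. Here it is replaced by a lookup in `MAXIMIZED`, a hypothetical table of precomputed results for the tags in the example. Variants are not handled, and the `favor_script` flag is an illustrative name.

```python
def remove_likely_subtags(tag, maximize, favor_script=False):
    """Return the shortest tag that maximizes to the same result as `tag`.

    maximize: a function implementing Add Likely Subtags on
    language_script_region strings; it returns None on error.
    Variants are not handled in this sketch.
    """
    maximal = maximize(tag)
    if maximal is None:
        return None
    language, script, region = maximal.split("_")
    trials = [language, f"{language}_{region}", f"{language}_{script}"]
    if favor_script:
        trials = [language, f"{language}_{script}", f"{language}_{region}"]
    for trial in trials:
        if maximize(trial) == maximal:
            return trial
    return maximal


# Stand-in for Add Likely Subtags: precomputed results for the tags used in
# the example above (hypothetical helper data).
MAXIMIZED = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "zh_Hant_TW": "zh_Hant_TW",
}

print(remove_likely_subtags("zh_Hant", MAXIMIZED.get))                     # zh_TW
print(remove_likely_subtags("zh_Hant", MAXIMIZED.get, favor_script=True))  # zh_Hant
```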
2092### 4.4 <a name="LanguageMatching" href="#LanguageMatching">Language Matching</a>
2093
2094```xml
2095<!ELEMENT languageMatching ( languageMatches* ) >
2096<!ELEMENT languageMatches ( paradigmLocales*, matchVariable*, languageMatch* ) >
2097<!ATTLIST languageMatches type NMTOKEN #REQUIRED >
2098
2099<!ELEMENT languageMatch EMPTY >
2100<!ATTLIST languageMatch desired CDATA #REQUIRED >
2101<!ATTLIST languageMatch supported CDATA #REQUIRED >
2102<!ATTLIST languageMatch percent NMTOKEN #REQUIRED >
2103<!ATTLIST languageMatch distance NMTOKEN #IMPLIED >
2104<!ATTLIST languageMatch oneway ( true | false ) #IMPLIED >
2105
2109<!ELEMENT paradigmLocales EMPTY >
2110<!ATTLIST paradigmLocales locales NMTOKENS #REQUIRED >
2111```
2112
Implementers are often faced with the issue of how to match the user's requested languages with their product's supported languages. For example, suppose that a product supports {ja-JP, de, zh-TW}. If the user understands written American English, German, French, Swiss German, and Italian, then **de** would be the best match; if the user understands only Chinese (zh), then zh-TW would be the best match.
2114
2115The standard truncation-fallback algorithm does not work well when faced with the complexities of natural language. The language matching data is designed to fill that gap. Stated in those terms, language matching can have the effect of a more complex fallback, such as:
2116
2117```
2118sr-Cyrl-RS
2119sr-Cyrl
2120sr-Latn-RS
2121sr-Latn
2122sr
2123hr-Latn
2124hr
2125```
2126
2127Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native tongue is English, I can understand Swiss German and German, my French is rusty but usable, and Italian basic, ideally an implementation would allow me to select {gsw, de, fr} as my preferred list of languages, skipping Italian because my comprehension is not good enough for arbitrary content.
2128
2129Language Matching can also be used to get fallback data elements. In many cases, there may not be full data for a particular locale. For example, for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle, but it does not contain translation for the key "CN" (for the country China). It is best to return "chine", rather than falling back to the value default language such as Russian and getting "Китай".  The language matching data can be used to get the closest fallback locales (of those supported) to a given language.
2130
2131For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see **_Section 4.2.6 [Inheritance vs Related Information](tr35.md#Inheritance_vs_Related)_**.
2132
When such fallback is used for inherited item lookup, the normal order of inheritance is followed, except that before using any data from **root**, the data for the fallback locales would be used if available. Language matching does not interact with the fallback of resources _within the locale-parent chain_. For example, suppose that we are looking for the value for a particular path **P** in **nb-NO**. In the absence of aliases, normally the following lookup is used.

> **nb-NO** → **nb** → **root**

That is, we first look in **nb-NO**. If there is no value for **P** there, then we look in **nb**. If there is no value for **P** there, we return the value for **P** in root (or a code value, if there is nothing there). Remember that if there is an `alias` element along this path, then the lookup may restart with a different path in **nb-NO** (or another locale).

However, suppose that **nb-NO** has the fallback values **[nn da sv en]**, derived from language matching. In that case, an implementation _may_ progressively look up each of the listed locales, with the appropriate substitutions, returning the first value that is not found in **root**. This follows roughly the following pseudocode:

```c
value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;
```

The locales in the fallback list are not used recursively. For example, for the lookup of a path in nb-NO, if **fr** were a fallback value for **da**, it would not matter for the above process. Only the original language matters.

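The pseudocode above can be made concrete with a small sketch. It assumes that locale data is held in a plain dictionary mapping locale IDs to `{path: value}` dictionaries, that parents are found by simple truncation (aliases and explicit parent overrides are ignored), and that the fallback list has already been derived from language matching. The names `parent_chain`, `lookup`, and `lookup_with_fallback` are illustrative, not part of any CLDR API.

```python
ROOT = "root"


def parent_chain(locale):
    """nb-NO -> ['nb-NO', 'nb', 'root'], by truncating subtags from the right."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + [ROOT]


def lookup(data, path, locale):
    """Return (value, location_found) by walking the locale-parent chain."""
    for loc in parent_chain(locale):
        if path in data.get(loc, {}):
            return data[loc][path], loc
    return None, None


def lookup_with_fallback(data, path, locale, fallback_languages):
    """Try the locale, then each fallback language with the same trailing subtags."""
    rest = locale.split("-", 1)[1] if "-" in locale else None
    candidates = [locale]
    for language in fallback_languages:
        candidates.append(f"{language}-{rest}" if rest else language)
    value = None
    for candidate in candidates:
        value, found = lookup(data, path, candidate)
        if found is not None and found != ROOT:
            return value  # first value that does not come from root
    return value  # result of the last lookup, possibly the root value
```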
The language matching data is intended to be used according to the following algorithm. This is a logical description, and can be optimized for production in many ways. In this algorithm, the languageMatching data is interpreted as an ordered list.

Distances between a given pair of subtags can be larger or smaller than the typical distances. For example, the distance between en and en-GB can be greater than that between en-GB and en-IE. In some cases, language and/or script differences can be as small as the typical region difference. (Example: sr-Latn vs. sr-Cyrl).

The distances resulting from the table are not linear, but are rather chosen to produce expected results. So a distance of 10 is not necessarily twice as "bad" as a distance of 5. Implementations may want to have a mode where script distances should swamp language distances. The tables are built such that this can be accomplished by multiplying the language distance by 0.25.

The language matching algorithm takes a list of a user’s desired languages, and a list of the application’s supported languages.

* Set the best weighted distance BWD to ∞
* Set the best desired language BD to null
* Set the best supported language BS to null
* For each desired language D
  * Compute a demotion value F, based on the position in the list.
    * This demotion value is up to the implementation, but is typically a positive value that increases according to how far D is from the start of the desired language list.
  * For each supported language S
    * Find the matching distance MD as described below.
    * Compute the weighted distance WD as F + MD
    * If WD < BWD
      * BWD = WD
      * BD = D
      * BS = S
* If the BWD is less than a threshold, return <BD, BS>
  * The threshold is implementation-defined, typically set to greater than a default region difference, and less than a default script difference.
* Otherwise BD = the default supported language (like English); return <BD, null>

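The loop above can be sketched as follows. This is only an illustration under stated assumptions: the matching distance is passed in as a function, the demotion value is taken to be `position * demotion_step`, and the values `threshold=50` and `demotion_step=5` are placeholders chosen for the toy distance function rather than values prescribed by this specification. The final branch mirrors the `<BD, null>` return described in the last bullet.

```python
import math


def best_match(desired, supported, distance, default_supported,
               threshold=50, demotion_step=5):
    """Return (best desired, best supported) following the weighted-distance loop."""
    best_wd = math.inf
    best_desired = None
    best_supported = None
    for position, d in enumerate(desired):
        demotion = position * demotion_step  # F: grows with list position
        for s in supported:
            wd = demotion + distance(d, s)   # WD = F + MD
            if wd < best_wd:
                best_wd, best_desired, best_supported = wd, d, s
    if best_wd < threshold:
        return best_desired, best_supported
    return default_supported, None


def toy_distance(d, s):
    """Hypothetical stand-in for MD: 0 if equal, 4 if same base language, else 80."""
    if d == s:
        return 0
    if d.split("-")[0] == s.split("-")[0]:
        return 4
    return 80


print(best_match(["de-AT", "fr"], ["de", "fr", "ja"], toy_distance, "en"))
# ('de-AT', 'de')
```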
To find the matching distance MD between any two languages, perform the following steps.

1. Maximize each language using Section 4.3 [Likely Subtags](#Likely_Subtags).
   * und is a special case: see below.
2. Set the match-distance MD to 0
3. For each subtag in {language, script, region}
   1. If respective subtags in each language tag are identical, remove the subtag from each (logically) and continue.
   2. Traverse the languageMatching data until a match is found.
      * \* matches any field.
      * If the oneway flag is false, then the match is symmetric; otherwise only match one direction.
      * For region matching, use the mechanisms in **Section 4.4.1 [Enhanced Language Matching](#EnhancedLanguageMatching)**.
   3. Add the `distance` attribute value to MD.
      * This used to be a `percent` attribute value, which was 100 - the `distance` attribute value.
   4. Remove the subtag from each (logically)
4. Return MD

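A minimal sketch of these steps follows. It assumes both tags are already maximized to language-script-region form, and it processes the region first, then the script, then the language, which is the order followed in the nn-DE / nb-FR walk-through in this section. The `RULES` list is a hypothetical excerpt: the three wildcard defaults use distances derived from the percent values shown in this section (100 - 1, 100 - 20, 100 - 96), and the nn/nb rule uses the 96% figure from the walk-through. Variables and the enhanced region matching of Section 4.4.1 are not handled.

```python
# (desired pattern, supported pattern, distance, oneway); hypothetical excerpt.
RULES = [
    ("nn", "nb", 4, False),
    ("*", "*", 99, False),          # default language difference
    ("*-*", "*-*", 80, False),      # default script difference
    ("*-*-*", "*-*-*", 4, False),   # default region difference
]


def pattern_matches(pattern, subtags):
    """A pattern matches when it has the same number of fields and each fits."""
    fields = pattern.split("-")
    return len(fields) == len(subtags) and all(
        field == "*" or field == value for field, value in zip(fields, subtags)
    )


def rule_distance(desired, supported):
    """Traverse the ordered rules and return the distance of the first match."""
    for rule_desired, rule_supported, distance, oneway in RULES:
        if pattern_matches(rule_desired, desired) and pattern_matches(rule_supported, supported):
            return distance
        if (not oneway and pattern_matches(rule_desired, supported)
                and pattern_matches(rule_supported, desired)):
            return distance  # symmetric rule matched in the other direction
    raise LookupError("no rule matched; wildcard defaults should prevent this")


def matching_distance(desired_max, supported_max):
    """Sum rule distances while truncating both maximized tags from the right."""
    d = desired_max.split("-")
    s = supported_max.split("-")
    total = 0
    while d and s:
        if d[-1] != s[-1]:          # identical subtags add nothing
            total += rule_distance(d, s)
        d, s = d[:-1], s[:-1]       # remove the subtag from each (logically)
    return total


print(matching_distance("nn-Latn-DE", "nb-Latn-FR"))  # 8
```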
It is typically useful to set the increase in the demotion value F between successive elements of the desired languages list to be slightly greater than the default region difference. That avoids the following problem:

_Supported languages:_ "de, fr, ja"

_User's desired languages:_ "de-AT, fr"

This user would expect to get "de", not "fr". In practice, when users select a list of preferred languages, they do not include all the regional variants ahead of their second base language. While the list of desired languages does not tell us the exact priority ranking among the languages, the fall-off between successive languages is normally substantially greater than the difference between regional variants. Unless F is greater than the distance between de-AT and de-DE, the user’s second-choice language would be returned.

The base language subtag "und" is a special case. Suppose we have the following situation:

* desired languages: \{und, it}
* supported languages: \{en, it}
* resulting language: en

The result is en rather than it because applying likely subtags to und yields an English tag, which then matches the supported en exactly. This happens because 'und' has a special function in BCP 47; it stands in for 'no supplied base language'. To prevent this from happening, if the desired base language is und, the language matcher should not apply likely subtags to it.

Example:

Suppose that nn-DE and nb-FR are being compared. They are first maximized to nn-Latn-DE and nb-Latn-FR, respectively. The list is searched. The first match is with "\*-\*-\*", for a match of 96%. The languages are truncated to nn-Latn and nb-Latn (the scripts are identical, so nothing is added), then to nn and nb. The first match there is also for a value of 96%, so the result is 92% (a total distance of 4 + 4 = 8).

Note that language matching is orthogonal to how closely two languages are related linguistically. For example, Breton is more closely related to Welsh than to French, but French is the better match (because it is more likely that a Breton reader will understand French than Welsh). This also illustrates that the matches are often asymmetric: it is not likely that a French reader will understand Breton.

The "\*" acts as a wild card, as shown in the following example:

```xml
<languageMatch desired="es-*-ES" supported="es-*-ES" percent="100" />
<!-- Latin American Spanishes are closer to each other. Approximate by having es-ES be further from everything else. -->

<languageMatch desired="es-*-ES" supported="es-*-*" percent="93" />

<languageMatch desired="*" supported="*" percent="1" />
<!-- [Default value - must be at end!] Normally there is no comprehension of different languages. -->

<languageMatch desired="*-*" supported="*-*" percent="20" />
<!-- [Default value - must be at end!] Normally there is little comprehension of different scripts. -->

<languageMatch desired="*-*-*" supported="*-*-*" percent="96" />
<!-- [Default value - must be at end!] Normally there are small differences across regions. -->
```

When the language+region is not matched, and there is otherwise no reason to pick among the supported regions for that language, then some measure of geographic "closeness" can be used. The results may be more understandable by users. Looking for en-SK, for example, should fall back to something within Europe (e.g. en-GB) in preference to something far away and unrelated (e.g. en-SG). Such a closeness metric does not need to be exact; a small amount of data can be used to give an approximate distance between any two regions. However, any such data must be used carefully; although Hong Kong is closer to India than to the UK, it is unlikely that en-IN would be a better match to en-HK than en-GB would.

#### 4.4.1 <a name="EnhancedLanguageMatching" href="#EnhancedLanguageMatching">Enhanced Language Matching</a>

The enhanced format for language matching adds structure to enable better matching of languages. It is distinguished by having a suffix "\_new" on the type, as in the example below. The extended structure allows matching to take into account broad similarities that would give better results. For example, for English the regions that are or inherit from US (AS|GU|MH|MP|PR|UM|VI|US) form a “cluster”. Regions in that cluster should be closer to one another than to any other region. And a region outside the cluster should be closer to another region outside that cluster than to one inside. We get this issue with the “world languages” like English, Spanish, Portuguese, Arabic, etc.

_Example:_

```xml
<languageMatches type="written_new">
    <paradigmLocales locales="en en-GB es es-419 pt-BR pt-PT" />
    <matchVariable id="$enUS" value="AS+GU+MH+MP+PR+UM+US+VI" />
    <matchVariable id="$cnsar" value="HK+MO" />
    <matchVariable id="$americas" value="019" />
    <matchVariable id="$maghreb" value="MA+DZ+TN+LY+MR+EH" />
    <languageMatch desired="no" supported="nb" distance="1" /><!-- no ⇒ nb -->
    …
    <languageMatch desired="ar_*_$maghreb" supported="ar_*_$maghreb" distance="4" />
    <!-- ar; *; $maghreb ⇒ ar; *; $maghreb -->
    <languageMatch desired="ar_*_$!maghreb" supported="ar_*_$!maghreb" distance="4" />
    <!-- ar; *; $!maghreb ⇒ ar; *; $!maghreb -->
    …
```

The **matchVariable** allows for a rule to match to multiple regions, as illustrated by **\$maghreb**. The syntax is simple: it allows for + for _union_ and - for _set difference_, but no precedence. So A+B-A+D is interpreted as (((A+B)-A)+D), not as (A+B)-(A+D). The variable **id** has a value of the form [$][a-zA-Z0-9]+. If $X is defined, then $!X automatically means all those regions that are not in $X.

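The left-to-right evaluation can be illustrated with a short sketch. It assumes the value contains only region codes joined by `+` and `-` (macroregion expansion is ignored here), and that the complement for `$!X` is taken against a caller-supplied set of all regions. The function names are illustrative only.

```python
import re


def evaluate_match_variable(value):
    """Evaluate 'A+B-C' strictly left to right, with no operator precedence."""
    result = set()
    operator = "+"  # the first operand is implicitly added
    for token in re.findall(r"[+-]|[^+-]+", value):
        if token in ("+", "-"):
            operator = token
        elif operator == "+":
            result.add(token)       # union
        else:
            result.discard(token)   # set difference
    return result


def complement(regions, all_regions):
    """$!X: every known region that is not in $X."""
    return set(all_regions) - set(regions)


print(sorted(evaluate_match_variable("A+B-A+D")))  # ['B', 'D']
```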
When the set is interpreted, then macroregions are (logically) transformed into a list of their contents, so “053+GB” → “AU+GB+NF+NZ”. This is done recursively, so 009 → “053+054+057+061+QO” → “AU+NF+NZ+FJ+NC+PG+SB+VU...”. Note that we use 019 for all of the Americas in the variables above, because en-US should be in the same cluster as es-419 and its contents.

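The recursive expansion can be sketched as below. The `CONTAINS` table is a hypothetical excerpt covering only the codes mentioned above; codes without an entry in this excerpt (such as 057, 061, and QO) are returned unchanged, whereas a real implementation would use the full territory containment data.

```python
# Hypothetical excerpt of territory containment, limited to the example above.
CONTAINS = {
    "009": ["053", "054", "057", "061", "QO"],
    "053": ["AU", "NF", "NZ"],
    "054": ["FJ", "NC", "PG", "SB", "VU"],
}


def expand(code):
    """Recursively replace a macroregion by the regions it contains."""
    children = CONTAINS.get(code)
    if children is None:
        return [code]  # a plain region, or a code missing from this excerpt
    result = []
    for child in children:
        result.extend(expand(child))
    return result


def expand_set(value):
    """Expand every '+'-separated code, dropping duplicates, sorted for display."""
    regions = set()
    for code in value.split("+"):
        regions.update(expand(code))
    return sorted(regions)


print("+".join(expand_set("053+GB")))  # AU+GB+NF+NZ
```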
In the rules, the percent value (100..0) is replaced by a **distance** value, which is its complement (0..100): distance = 100 - percent.

These new variables and rules divide up the world into clusters, where items in the same clusters (for specific languages) get the normal regional difference, and items in different clusters get different weights.

Each cluster can have one or more associated **paradigmLocales**. These are locales that are preferred within a cluster. So when matching desired=[en-SA] against [en-GU en en-IN en-GB], the value en-GB is returned. Both of \{en-GU en} are in a different cluster. While \{en-IN en-GB} are in the same cluster, and the same distance from en-SA, the preference is given to en-GB because it is in the paradigm locales. It would be possible to express this in rules, but using this mechanism handles these very common cases without bulking up the tables.

The **paradigmLocales** also allow matching to macroregions. For example, desired=[es-419] should match to \{es-MX} more closely than to \{es}, and vice versa: \{es-MX} should match more closely to \{es-419} than to \{es}. But es-MX should match more closely to es-419 than to any of the other es-419 sublocales. In general, in the absence of other distance data, there is a ‘paradigm’ in each cluster that the others should match more closely to: en(-US), en-GB, es(-ES), es-419, ru(-RU)...

## 5 <a name="XML_Format" href="#XML_Format">XML Format</a>

There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple files, which can be in multiple directory trees.

For example, the language-dependent data for Japanese in CLDR is present in the following files:

* common/collation/ja.xml
* common/main/ja.xml
* common/rbnf/ja.xml
* common/segmentations/ja.xml

Data for cased languages such as French are in files like:

* common/casing/fr.xml

The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml files is treated as if it were in a single file. These files have the `<ldml>` root element and use ldml.dtd. The file name must match the identity element. For example, the `<ldml>` file pa_Arab_PK.xml must contain the following elements:

```xml
<ldml>
    <identity>
        …
        <language type="pa" />
        <script type="Arab" />
        <territory type="PK" />
    </identity>
    …
```

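A validation tool could check the file-name rule along the following lines. This is a sketch under stated assumptions: the locale ID is rebuilt by joining the `type` attributes of `language`, `script`, `territory`, and `variant` with underscores, and the `SAMPLE` document is a completed version of the fragment above. The helper names are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ldml>
    <identity>
        <language type="pa" />
        <script type="Arab" />
        <territory type="PK" />
    </identity>
</ldml>"""


def identity_locale(xml_text):
    """Rebuild the locale ID from the <identity> element of an <ldml> document."""
    identity = ET.fromstring(xml_text).find("identity")
    parts = []
    for name in ("language", "script", "territory", "variant"):
        element = identity.find(name)
        if element is not None:
            parts.append(element.get("type"))
    return "_".join(parts)


def file_name_matches(file_name, xml_text):
    """True when the file name is exactly '<identity locale>.xml'."""
    return file_name == identity_locale(xml_text) + ".xml"


print(file_name_matches("pa_Arab_PK.xml", SAMPLE))  # True
```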
Supplemental data can have different root elements, currently: `ldmlBCP47`, `supplementalData`, `keyboard`, and `platform`. Keyboard and platform files are considered distinct. The ldmlBCP47 files and supplementalData files that have the same root are all logically part of the same file; they are simply split into separate files for convenience. Implementations may split the files in different ways, also for their convenience. The files in /properties are also supplemental data files, but are structured like UCD properties.

For example, supplemental data relating to Japan or the Japanese writing are in:

* common/supplemental/ (in many files, such as supplementalData.xml)
* common/transforms/Hiragana-Katakana.xml
* common/transforms/Hiragana-Latin.xml
* common/properties/scriptMetadata.txt
* common/bcp47/calendar.xml
* uca/allkeys_CLDR.txt (sorting)
* /keyboards/chromeos/ja-t-k0-chromeos.xml
* ...

Like the `<ldml>` files, the keyboard file names must match internal data: in particular, the `locale` attribute on the keyboard element must have a value that corresponds to the file name, such as `<keyboard locale="af-t-k0-android">` for the file af-t-k0-android.xml.

The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the ldml.dtd file; _however, the DTD does not describe all the constraints on the structure._

To start with, the root element is `<ldml>`, with the following DTD entry:

```xml
<!ELEMENT ldml (identity,(alias|(fallback*,localeDisplayNames?,layout?,contextTransforms?,characters?,
delimiters?,measurement?,dates?,numbers?,units?,listPatterns?,collations?,posix?,
segmentations?,rbnf?,annotations?,metadata?,references?,special*)))>
```

The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged. In most cases, an alternate structure is provided for expressing the information. There is only one exception: newer DTDs cannot be used with version 1.1 files, without some modification.

In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.

There are two kinds of elements in LDML: _rule_ elements and _structure_ elements.

For structure elements, there are restrictions to allow for effective inheritance and processing:

1.  There is no ["mixed" content](https://www.w3.org/TR/xml/#sec-mixed-content): if an element has textual content, then it cannot contain any elements.
2.  The [[XPath](#XPath)] leading to the content is unique; no two different pieces of textual content have the same [[XPath](#XPath)].
3.  An element that has [value attributes](#Definitions) MUST NOT also have child elements.

To illustrate these restrictions, consider the following chunk of XML:

```xml
<!-- Not correct LDML -->
<unit type="duration-day"
      displayName="days"> <!-- #3: @VALUE attribute AND children -->
  {0} per day <!-- #1: Mixed content -->
  <unitPattern>{0} day</unitPattern>  <!-- #2 same XPath /unit[@type="duration-day"]/unitPattern -->
  <unitPattern>{0} days</unitPattern> <!-- #2 same XPath /unit[@type="duration-day"]/unitPattern -->
</unit>
```

LDML is actually structured as below (from `en.xml`):

```xml
<unit type="duration-day">  <!-- OK: "type" is distinguishing -->
  <displayName>days</displayName>
  <unitPattern count="one">{0} day</unitPattern> <!-- "count" is distinguishing -->
  <unitPattern count="other">{0} days</unitPattern>
  <perUnitPattern>{0} per day</perUnitPattern> <!-- text is in its own element, so no mixed content -->
</unit>
```

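Restrictions 1 and 2 lend themselves to a mechanical check. The sketch below is a simplification, not a CLDR tool: it treats every attribute as distinguishing when building a path (real LDML also has value attributes, so restriction 3 is not checked), and the two sample documents are the examples above with comments removed. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

BAD = ('<unit type="duration-day" displayName="days"> {0} per day '
       '<unitPattern>{0} day</unitPattern>'
       '<unitPattern>{0} days</unitPattern></unit>')

GOOD = """<unit type="duration-day">
  <displayName>days</displayName>
  <unitPattern count="one">{0} day</unitPattern>
  <unitPattern count="other">{0} days</unitPattern>
  <perUnitPattern>{0} per day</perUnitPattern>
</unit>"""


def has_mixed_content(element):
    """True when an element has both child elements and non-whitespace text."""
    children = list(element)
    if not children:
        return False
    texts = [element.text] + [child.tail for child in children]
    return any(text and text.strip() for text in texts)


def duplicate_paths(element, prefix=""):
    """Yield each child path (tag plus attributes) that repeats under one parent."""
    seen = set()
    for child in element:
        attrs = "".join(f'[@{k}="{v}"]' for k, v in sorted(child.attrib.items()))
        path = f"{prefix}/{child.tag}{attrs}"
        if path in seen:
            yield path
        seen.add(path)
        yield from duplicate_paths(child, path)


bad, good = ET.fromstring(BAD), ET.fromstring(GOOD)
print(has_mixed_content(bad), list(duplicate_paths(bad)))    # True ['/unitPattern']
print(has_mixed_content(good), list(duplicate_paths(good)))  # False []
```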
Rule elements do not have these restrictions, but also do not inherit, except as an entire block. Items which are ordered have the DTD Annotation `@ORDERED`. See [_DTD Annotations_](#DTD_Annotations) and _[Section 4.2 Inheritance and Validity](#Inheritance_and_Validity)_. For more technical details, see [Updating-DTDs](https://cldr.unicode.org/development/updating-dtds).

Note that the data in examples given below is purely illustrative, and does not match any particular language. For a more detailed example of this format, see [[Example](#LDML)]. There is also a DTD for this format, but _remember that the DTD alone is not sufficient to understand the semantics, the constraints, or the interrelationships between the different elements and attributes_. You may wish to have copies of each of these to hand as you proceed through the rest of this document.

In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple instances. However, unless an element is annotated as `@ORDERED`, or has a distinguishing attribute, it can only occur once as a subelement of a given element. Thus, for example, the following is illegal even though allowed by the DTD:

```xml
<languages>
    <language type="aa">...</language>
    <language type="aa">..</language>
```

There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an `alt` attribute).

In general, LDML data should be in NFC format. Normalization forms are defined by [[UAX15](https://www.unicode.org/reports/tr41/#UAX15)]. However, certain elements may need to contain characters that are not in NFC, including exemplars, transforms, segmentations, and p/s/t/i/pc/sc/tc/ic rules in collation. These elements must not be normalized (either to NFC or NFD), or their meaning may be changed. Thus LDML documents must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).

Lists, such as singleCountries, are space-delimited. That means that their items are separated by one or more XML whitespace characters:

* singleCountries
* preferenceOrdering
* references

### 5.1 <a name="Common_Elements" href="#Common_Elements">Common Elements</a>

At any level in any element, two special elements are allowed.

#### 5.1.1 <a name="special" href="#special">Element special</a>

This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute `xmlns`, which specifies the XML [namespace](https://www.w3.org/TR/REC-xml-names/) of the special data. For example, the following uses the version 1.0 POSIX special element.

```xml
<!DOCTYPE ldml SYSTEM "https://www.unicode.org/cldr/dtd/1.0/ldml.dtd" [
    <!ENTITY % posix SYSTEM "https://www.unicode.org/cldr/dtd/1.0/ldmlPOSIX.dtd">
%posix;
]>
<ldml>
...
    <special xmlns:posix="https://www.opengroup.org/regproducts/xu.htm">
        <!-- old abbreviations for pre-GUI days -->
        <posix:messages>
            <posix:yesstr>Yes</posix:yesstr>
            <posix:nostr>No</posix:nostr>
            <posix:yesexpr>^[Yy].*</posix:yesexpr>
            <posix:noexpr>^[Nn].*</posix:noexpr>
        </posix:messages>
    </special>
</ldml>
```

##### 5.1.1.1 <a name="Sample_Special_Elements" href="#Sample_Special_Elements">Sample Special Elements</a>

The elements in this section are _**not**_ part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be stored in the Common Locale Repository. They may change or be removed in future versions of this document, and are present here more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.)

* [https://www.unicode.org/cldr/dtd/1.1/ldmlICU.dtd](https://www.unicode.org/cldr/dtd/1.1/ldmlICU.dtd)
* [https://www.unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd](https://www.unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd)

The above examples are old versions: consult the documentation for the specific application to see which should be used.

These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
    <!ENTITY % openOffice SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldmlOpenOffice.dtd">
%icu;
%openOffice; ]>
```

Thus to include just the ICU DTD, one uses:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ldml SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldml.dtd" [
    <!ENTITY % icu SYSTEM "https://www.unicode.org/cldr/dtd/1.1/ldmlICU.dtd">
%icu; ]>
```

> **Note:** A previous version of this document contained a special element for [ISO TR 14652](https://www.open-std.org/jtc1/sc22/wg20/docs/n897-14652w25.pdf) compatibility data. That element has been withdrawn, pending further investigation, since 14652 is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard, despite repeated effort". See the ballot comments on [14652 Comments](https://www.open-std.org/jtc1/sc22/wg20/docs/n948-J1N6769-14652.pdf) for details on the 14652 defects. For example, most of these patterns make little provision for substantial changes in format when elements are empty, so are not particularly useful in practice. Compare, for example, the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.
>
> **Note:** While the CLDR specification guarantees backwards compatibility, the definition of specials is up to other organizations. Any assurance of backwards compatibility is up to those organizations.

A number of the elements above can have extra information for <a name="OpenOffice" href="#OpenOffice">openoffice.org</a>, such as the following example:

```xml
<special xmlns:openOffice="https://www.openoffice.org">
    <openOffice:search>
        <openOffice:searchOptions>
            <openOffice:transliterationModules>IGNORE_CASE</openOffice:transliterationModules>
        </openOffice:searchOptions>
    </openOffice:search>
</special>
```

#### 5.1.2 <a name="Alias_Elements" href="#Alias_Elements">Element alias</a>

```xml
<!ELEMENT alias (special*) >
<!ATTLIST alias source NMTOKEN #REQUIRED >
<!ATTLIST alias path CDATA #IMPLIED>
```

The contents of any element in root can be replaced by an alias, which points to the path where the data can be found.

Aliases will only ever appear in root with the form `//ldml/.../alias[@source="locale"][@path="..."]`.

Consider the following example in root:

```xml
<calendar type="gregorian">
    <months>
        <default choice="format" />
        <monthContext type="format">
            <default choice="wide" />
            <monthWidth type="abbreviated">
                <alias source="locale" path="../monthWidth[@type='wide']"/>
            </monthWidth>
```

If the locale "de_DE" is being accessed for a month name for format/abbreviated, then a resource bundle at "de_DE" will be searched for a resource element at that path. If not found there, then the resource bundle at "de" will be searched, and so on. When the alias is found in root, then the search is restarted, but searching for the format/**wide** element instead of format/abbreviated.

If the `path` attribute is present, then its value is an [[XPath](#XPath)] that points to a different node in the tree. For example:

```xml
<alias source="locale" path="../monthWidth[@type='wide']"/>
```

The default value if the path is not present is the same position in the tree. All of the attributes in the [[XPath](#XPath)] must be _distinguishing_ attributes. For more details, see [Section 4.2 Inheritance and Validity](#Inheritance_and_Validity).

There is a special value for the source attribute, the constant `source="locale"`. This special value is equivalent to the locale being resolved. Consider the following example, where locale data for 'de' is being resolved:

###### Table: <a name="Inheritance_with_source_locale_" href="#Inheritance_with_source_locale_">Inheritance with `source="locale"`</a>

<!-- HTML: multiline, readability -->
<table><thead>
<tr><th>Root</th><th>de</th><th>Resolved</th></tr>
</thead><tbody>
<tr>
<td>

```xml
<x>
  <a>1</a>
  <b>2</b>
  <c>3</c>

</x>
```
</td><td>

```xml
<x>
 <a>11</a>
 <b>12</b>

 <d>14</d>
</x>
```
</td><td>

```xml
<x>
 <a>11</a>
 <b>12</b>
 <c>3</c>
 <d>14</d>
</x>
```
</td></tr>
<tr><td>

```xml
<y>
 <alias source="locale" path="../x"/>
</y>





```
</td><td>

```xml
<y>

 <b>22</b>


 <e>25</e>
</y>
```
</td><td>

```xml
<y>
 <a>11</a>
 <b>22</b>
 <c>3</c>
 <d>14</d>
 <e>25</e>
</y>
```
</td></tr>
</tbody></table>

The first row shows the inheritance within the `<x>` element, whereby `<c>` is inherited from root. The second shows the inheritance within the `<y>` element, whereby `<a>`, `<c>`, and `<d>` are inherited also from root, but from an alias there. The alias in root is logically replaced not by the elements in root itself, but by elements in the 'target' locale.

For more details on data resolution, see [Section 4.2 Inheritance and Validity](#Inheritance_and_Validity).

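The table can be reproduced with a short sketch of alias-aware lookup. The representation is an assumption made for illustration: each locale is a flat dictionary keyed by slash-separated paths, the only parent of `de` is `root`, and the root alias `<y>` → `../x` is stored as a prefix rewrite in `ROOT_ALIASES`. The `seen` set is one simple way to detect a chain that revisits a path.

```python
LOCALES = {
    "root": {"x/a": "1", "x/b": "2", "x/c": "3"},
    "de": {"x/a": "11", "x/b": "12", "x/d": "14", "y/b": "22", "y/e": "25"},
}
# <y><alias source="locale" path="../x"/></y> in root, as a prefix rewrite.
ROOT_ALIASES = {"y": "x"}


def resolve(path, locale, seen=None):
    """Look up `path` for `locale`, restarting in the same locale after an alias."""
    seen = set() if seen is None else seen
    if path in seen:
        raise ValueError(f"circular alias chain at {path}")
    seen.add(path)
    chain = [locale, "root"] if locale != "root" else ["root"]
    for loc in chain:
        if path in LOCALES[loc]:
            return LOCALES[loc][path]
    for prefix, target in ROOT_ALIASES.items():
        if path == prefix or path.startswith(prefix + "/"):
            # source="locale": restart with the locale being resolved, not root.
            return resolve(target + path[len(prefix):], locale, seen)
    return None


print({key: resolve(f"y/{key}", "de") for key in "abcde"})
# {'a': '11', 'b': '22', 'c': '3', 'd': '14', 'e': '25'}
```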
Aliases must be resolved recursively. An alias may point to another path that results in another alias being found, and so on. For example, looking up Thai Buddhist abbreviated months for the locale **xx-YY** may result in the following chain of aliases being followed:

> `../../calendar[@type="buddhist"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]`
>
> xx-YY → xx → root // finds alias that changes path to:
>
> `../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]`
>
> xx-YY → xx → root // finds alias that changes path to:
>
> `../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="wide"]`
>
> xx-YY → xx // finds value here


It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups (including inheritance and lateral inheritance) can be followed indefinitely without terminating.

#### 5.1.3 <a name="Element_displayName" href="#Element_displayName">Element displayName</a>

Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.

```xml
<numberFormat>
    <displayName>Prozentformat</displayName>
    ...
</numberFormat>
```

Where present, the display names must be unique; that is, two distinct codes would not get the same display name. (There is one exception to this: in time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different time zone IDs.) Any translations should follow customary practice for the locale in question. For more information, see [[Data Formats](#DataFormats)].

#### 5.1.4 <a name="Escaping_Characters" href="#Escaping_Characters">Escaping Characters</a>

Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent those code points that cannot be otherwise represented in element content. The escaping syntax is only defined on a few types of elements, such as in collation or exemplar sets, and uses the appropriate syntax for that type.

The element `<cp>`, which was formerly used for this purpose, has been deprecated.

2604### 5.2 <a name="Common_Attributes" href="#Common_Attributes">Common Attributes</a>
2605
2606#### 5.2.1 <a name="Attribute_type" href="#Attribute_type">Attribute type</a>
2607
2608The attribute `type` is also used to indicate an alternate resource that can be selected with a matching `type=option` in the locale id modifiers, or be referenced by a default element. For example:
2609
2610```xml
2611<ldml>
2612    ...
2613    <currencies>
2614        <currency>...</currency>
2615        <currency type="preEuro">...</currency>
2616    </currencies>
2617</ldml>
2618```
2619
2620#### 5.2.2 <a name="Attribute_draft" href="#Attribute_draft">Attribute draft</a>
2621
2622If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary `draft` value), as per the following:
2623
2624* `approved`: fully approved by the technical committee (equals the CLDR 1.3 value of `false`, or an absent `draft` attribute). This does not mean that the data is guaranteed to be error-free—this is the best judgment of the committee.
2625* `contributed`: partially approved by the technical committee.
2626* `provisional`: partially confirmed. Implementations may choose to accept the provisional data, especially if there is no translated alternative.
2627* `unconfirmed`: no confirmation available.
2628
2629For more information on precisely how these values are computed for any given release, see [Data Submission and Vetting Process](https://cldr.unicode.org/index/process#h.krygv7y7jkk9) on the CLDR website.
2630
2631The `draft` attribute should only occur on "leaf" elements, and is deprecated elsewhere. For a more formal description of how elements are inherited, and what their draft status is, see _[Section 4.2 Inheritance and Validity](#Inheritance_and_Validity)_.
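
The way a `draft` value applies to an element and its subelements can be illustrated with a minimal sketch. It assumes a single XML document (locale inheritance is ignored) and treats an absent attribute as `approved`, as stated in the list above. The function name `effective_draft` and the sample markup are hypothetical, and the sample places `draft` on a non-leaf element only to show the propagation.

```python
import xml.etree.ElementTree as ET

def effective_draft(elem, inherited="approved"):
    """Yield (element, status) pairs for elem and all of its descendants."""
    # An explicit draft value overrides whatever the ancestors indicated.
    status = elem.get("draft", inherited)
    yield elem, status
    for child in elem:
        yield from effective_draft(child, status)

# Hypothetical sample data, for illustration only.
SAMPLE = """
<months draft="provisional">
    <month type="1">...</month>
    <month type="2" draft="unconfirmed">...</month>
</months>
""".strip()

statuses = [
    (elem.tag, elem.get("type"), status)
    for elem, status in effective_draft(ET.fromstring(SAMPLE))
]
```

Here the first `month` picks up `provisional` from its parent, while the second keeps its own contrary value.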
2632
2633#### 5.2.3 <a name="alt_attribute" href="#alt_attribute">Attribute alt</a>
2634
This attribute labels an alternative value for an element. The value is a _descriptor_ that indicates what kind of alternative it is, and takes one of the following forms:
2636
2637* `variantname` means that the value is a variant of the normal value, and may be used in its place in certain circumstances. If a variant value is absent for a particular locale, the normal value is used. The variant mechanism should only be used when such a fallback is acceptable.
2638* `proposed`, optionally followed by a number, indicating that the value is a proposed replacement for an existing value.
2639* `variantname-proposed`, optionally followed by a number, indicating that the value is a proposed replacement variant value.
2640
`proposed` should only be present if the draft status is not `approved`. It indicates that the data is proposed replacement data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September for some language is "Settembru", and a bug report is filed stating that it should be "Settembro". The new data can be entered, but marked as `alt="proposed"` until it is vetted.
2642
2643```xml
2644...
2645<month type="9">Settembru</month>
2646<month type="9" draft="unconfirmed" alt="proposed">Settembro</month>
2647<month type="10">...
2648```
2649
2650Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:
2651
2652```xml
2653...
2654<month type="9" draft="unconfirmed" alt="proposed2">Settembre</month>
2655...
2656```
2657
2658The values for _variantname_ at this time include "variant", "list", "email", "www", "short", and "secondary".
2659
2660For a more complete description of how draft applies to data, see _[Section 4.2 Inheritance and Validity](#Inheritance_and_Validity)_.
2661
2662#### 5.2.4 <a name="references_attribute" href="#references_attribute">Attribute references</a>
2663
The value of this attribute is a token representing a reference for the information in the element, including standards that it may conform to. The token identifies a `reference` element defined under `<references>`. (In older versions of CLDR, the value of the attribute was freeform text. That format is deprecated.)
2665
2666_Example:_
2667
2668```xml
2669<territory type="UM" references="R222">USAs yttre öar</territory>
2670```
2671
2672The `reference` element may be inherited. Thus, for example, R222 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.
2673
2674```xml
2675<... allow="verbatim" ...> (deprecated)
2676```
2677
The `allow` attribute was originally intended for marking display names whose capitalization differed from what was indicated by the now-deprecated `<inText>` element (perhaps, for example, because the names included a proper noun). It was never supported in the DTD and is not needed with the new `<contextTransforms>` element.
2679
2680### 5.3 <a name="Common_Structures" href="#Common_Structures">Common Structures</a>
2681
2682#### 5.3.1 <a name="Date_Ranges" href="#Date_Ranges">Date and Date Ranges</a>
2683
When attributes specify date ranges, this is usually done with the attributes `from` and `to`. The `from` attribute specifies the starting point, and the `to` attribute specifies the end point. The deprecated `time` attribute was formerly used to specify time with the deprecated `weekEndStart` and `weekEndEnd` elements, which were themselves inherently `from` or `to`.
2685
The date format is a restricted ISO 8601 format, limited to the fields `year`, `month`, `day`, `hour`, `minute`, and `second` in that order, with `-` used as the separator between date fields, a space used as the separator between the date and the time fields, and `:` used as the separator between the time fields. If the `minute`, or the `minute` and `second`, are absent, they are interpreted as zero. If the `hour` is also missing, then it is interpreted based on whether the attribute is `from` or `to`.
2687
2688* `from` defaults to "00:00:00" (midnight at the start of the day).
2689* `to` defaults to "24:00:00" (midnight at the end of the day).
2690
2691That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00. Thus when the `hour` is missing, the `from` and `to` are interpreted inclusively: the range includes all of the day mentioned.
2692
2693For example, the following are equivalent:
2694
2695```xml
2696<usesMetazone from="1991-10-27" to="2006-04-02" .../>
2697<usesMetazone from="1991-10-27 00:00:00" to="2006-04-02 24:00:00" .../>
2698<usesMetazone from="1991-10-26 24:00:00" to="2006-04-03 00:00:00" .../>
2699```
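
The defaulting rules can be checked with a small sketch. It assumes that the date portion always has all three fields and that the values are well-formed; the function name `parse_range_bound` is hypothetical and not part of LDML.

```python
from datetime import datetime, timedelta

def parse_range_bound(value, is_to):
    """Parse a `from` or `to` attribute value into a datetime.

    Assumption: the date part is always a full year-month-day.
    """
    date_part, _, time_part = value.partition(" ")
    year, month, day = (int(field) for field in date_part.split("-"))
    start_of_day = datetime(year, month, day)
    if not time_part:
        # Missing hour: `from` means 00:00:00, `to` means 24:00:00.
        return start_of_day + timedelta(days=1) if is_to else start_of_day
    fields = [int(field) for field in time_part.split(":")]
    fields += [0] * (3 - len(fields))  # absent minute/second count as zero
    hour, minute, second = fields
    # Adding a timedelta lets hour == 24 roll over to the next day.
    return start_of_day + timedelta(hours=hour, minutes=minute, seconds=second)

def parse_range(from_value, to_value):
    return (parse_range_bound(from_value, False), parse_range_bound(to_value, True))

first = parse_range("1991-10-27", "2006-04-02")
second = parse_range("1991-10-27 00:00:00", "2006-04-02 24:00:00")
third = parse_range("1991-10-26 24:00:00", "2006-04-03 00:00:00")
```

Under these assumptions, all three `usesMetazone` ranges above resolve to the same pair of instants.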
2700
If the `from` attribute is missing, the range is assumed to extend as far backwards in time as there is data for; if the `to` attribute is missing, then the range extends from the starting point onwards, with no known end point.

The dates and times are specified in local time, unless otherwise noted. (In particular, the metazone values are in UTC, also known as GMT.)
2704
2705#### 5.3.2 <a name="Text_Directionality" href="#Text_Directionality">Text Directionality</a>
2706
2707The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded.
2708
2709For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.
2710
2711Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.
2712
2713#### 5.3.3 <a name="Unicode_Sets" href="#Unicode_Sets">Unicode Sets</a>
2714
2715Some attribute values or element contents use _UnicodeSet_ notation. A UnicodeSet represents a finite set of Unicode code points and strings, and is defined by lists of code points and strings, Unicode property sets, and set operators, all bounded by square brackets. In this context, a code point means a string consisting of exactly one code point.
2716
2717A UnicodeSet implements the semantics in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)] Levels 1 & 2 that are relevant to determining sets of characters. Note however that it may deviate from the syntax provided in [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)], which is illustrative rather than a requirement. There is one exception to the supported semantics, Section [RL2.6](https://www.unicode.org/reports/tr18/#RL2.6) _Wildcards in Property Values_. That feature can be supported in clients such as ICU by implementing a “hook” as is done in the [online UnicodeSet utilities](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bname%3D%2FAPPLE%2F%7D).
2718
2719A UnicodeSet may be cited in specifications outside of the domain of LDML. In such a case, the specification may specify a subset of the syntax provided here.
2720
2721The following provides EBNF syntax for a UnicodeSet:
2722
2723| Symbol         | Expression                                                     | Examples                                |
2724| -------------- | -------------------------------------------------------------- | --------------------------------------- |
2725| `root`         | <pre>= prop<br/>\| '[-]'<br/>\| '[' [\\-\\^]? s seq+ ']'</pre> | \\p{x=y},<br/>[abc]                     |
2726| `seq`          | <pre>= root (s [\\&\\-] s root)* s<br/>\| range s</pre>        | [abc]-[cde], a                          |
2727| `range`        | <pre>= char ('-' char)?<br/>\| '{' (s char)+ s '}'</pre>       | a, a-c, \{abc}                          |
2728| `prop`         | <pre>= '\\' [pP] '{' propName ([≠=] s value1+)? '}'<br/>\| '[:' '^'? propName ([≠=] s value2+)? ':]'</pre> | \\p\{x=y}, [:x=y:]<br/> |
2729| `propName`     | <pre>= s [A-Za-z0-9] [A-Za-z0-9_\\x20]* s</pre>                | General_Category,<br/>General Category  |
2730| `value1`       | <pre>= [^\\}]<br/>\| '\\' quoted</pre>                         | Lm,<br/>\\n,<br/>\\}                    |
2731| `value2`       | <pre>= [^:]<br/>\| '\\' quoted</pre>                           | Lm,<br/>\\n,<br/>\\:                    |
2732| `char`         | <pre>= [^\\& \\- \\[ \\[ \\] \\\\ \\} \\{ [:Pat_WS:]]<br/>\| '\\' quoted</pre> | a, b, c, \\n                            |
2733| `quoted`       | <pre>= 'u' (hex{4} \| bracketedHex)<br/>\| 'x' (hex{2} \| bracketedHex)<br/>\| 'U00' ('0' hex{5} \| '10' hex{4})<br/>\| 'N{' propName '}'<br/>\| [[\u0000-\U00010FFFF]-[uxUN]]</pre> | _**error** if lengths not exact_ |
2734| `charName`     | <pre>= s [A-Za-z0-9] [-A-Za-z0-9_\x20]* s</pre>                | TIBETAN LETTER -A                       |
2735| `bracketedHex` | <pre>= '{' s hexCodePoint (s hexCodePoint)* s '}'</pre>        | \{61 2019 62}                           |
2736| `hexCodePoint` | <pre>= hex{1,5} \| '10' hex{4}</pre>                           |                                         |
2737| `hex`          | <pre>= [0-9A-Fa-f]</pre>                                       |                                         |
2738| `s`            | <pre>= [:Pattern_White_Space:]*</pre>                          | optional whitespace                     |
2739
2740Some constraints on UnicodeSet syntax are not captured by this EBNF. Notably, property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. In addition, quoted values that resolve to more than one code point are disallowed in ranges of the form `char '-' char`.
2741
2742The syntax characters are listed in the table below:
2743
2744| Char | Hex    | Name                 | Usage                                      |
2745| ---- | ------ | -------------------- | ------------------------------------------ |
2746|  $   | U+0024 | DOLLAR SIGN          | Equivalent of \\uFFFF (This is for implementations that return \\uFFFF when accessing before the first or after the last character) |
2747|  &   | U+0026 | AMPERSAND            | Intersecting UnicodeSets                   |
2748|  -  | U+002D | HYPHEN-MINUS         | Ranges of characters; also set difference. |
2749|  :   | U+003A | COLON                | POSIX-style property syntax                |
2750|  [  | U+005B | LEFT SQUARE BRACKET  | Grouping; POSIX property syntax            |
2751|  ]  | U+005D | RIGHT SQUARE BRACKET | Grouping; POSIX property syntax            |
2752|  \\  | U+005C | REVERSE SOLIDUS      | Escaping                                   |
|  ^   | U+005E | CIRCUMFLEX ACCENT    | POSIX negation syntax                      |
2754|  {   | U+007B | LEFT CURLY BRACKET   | Strings in set; Perl property syntax       |
2755|  }   | U+007D | RIGHT CURLY BRACKET  | Strings in set; Perl property syntax       |
2756|      | U+0020 U+0009..U+000D U+0085<br/>U+200E U+200F<br/>U+2028 U+2029 | ASCII whitespace,<br/>LRM, RLM,<br/>LINE/PARAGRAPH SEPARATOR | Ignored except when escaped |
2757
2758
2759##### 5.3.3.1 <a name="Lists_of_Code_Points" href="#Lists_of_Code_Points">Lists of Code Points</a>
2760
2761Lists are a sequence of strings that may include ranges, which are indicated by a '-' between two code points, as in "a-z". The sequence _start-end_ specifies the range of all code points from the start to end, inclusive, in Unicode order. For example, **[a c d-f m]** is equivalent to **[a c d e f m]**. Whitespace can be freely used for clarity, as **[a c d-f m]** means the same as **[acd-fm]**.
2762
A string with multiple code points is represented in a list by being surrounded by curly braces, such as in **[a-z \{ch}]**. It can be used with the range notation, as described in _Section [5.3.4 String Range](#String_Range)_. There is an additional restriction on string ranges in a UnicodeSet: the number of code points in the first string of the range must be identical to the number in the second. Thus [\{ab}-\{c}] and [\{ab}-c] are invalid.
2764
2765In UnicodeSets, there are two ways to quote syntax code points:
2766
2767<a name="Backslash_Escapes"></a>
2768Outside of single quotes, certain backslashed code point sequences can be used to quote code points:
2769
2770| Sequence        | Code point                           |
2771| --------------- | ------------------------------------ |
2772| \\x\{h...h}<br/>\\u\{h...h} | list of 1-6 hex digits ([0-9A-Fa-f]), separated by spaces |
2773| \\xhh           | 2 hex digits                         |
2774| \\uhhhh         | Exactly 4 hex digits                 |
2775| \\Uhhhhhhhh     | Exactly 8 hex digits                 |
2776| \\a             | U+0007 (BEL / ALERT)                 |
2777| \\b             | U+0008 (BACKSPACE)                   |
2778| \\t             | U+0009 (TAB / CHARACTER TABULATION)  |
2779| \\n             | U+000A (LINE FEED)                   |
2780| \\v             | U+000B (LINE TABULATION)             |
2781| \\f             | U+000C (FORM FEED)                   |
2782| \\r             | U+000D (CARRIAGE RETURN)             |
2783| \\\\            | U+005C (BACKSLASH / REVERSE SOLIDUS) |
2784| \\N\{name}      | The Unicode code point named "name". |
2785| \\p\{…},\\P\{…} | Unicode property (see below)         |
2786
2787Anything else following a backslash is mapped to itself, except the property syntax described below, or in an environment where it is defined to have some special meaning.
2788
2789Any code point formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \\x, \\u and \\U escapes create literal code points. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary code points in an ASCII source file, and any resulting code points are _**not**_ tagged as literals.)
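
As a rough illustration of the escape table, the following sketch resolves backslash sequences in a string. It is deliberately simplified and makes several assumptions: `\N{name}` and the property syntax are not handled and must not appear in the input, digit counts inside `{...}` are not validated, and malformed sequences such as `\x` followed by a non-hex character are mapped to themselves rather than reported as errors. It also returns a plain string, so the "literal" status of the resulting code points is not tracked. The name `unescape` is hypothetical.

```python
import re

# Single-letter escapes from the table above.
SIMPLE = {"a": "\a", "b": "\b", "t": "\t", "n": "\n", "v": "\v", "f": "\f", "r": "\r"}

ESCAPE = re.compile(
    r"\\(?:[xu]\{([0-9A-Fa-f ]+)\}"   # \x{h...h} or \u{h...h}, space-separated list
    r"|x([0-9A-Fa-f]{2})"             # \xhh
    r"|u([0-9A-Fa-f]{4})"             # \uhhhh
    r"|U([0-9A-Fa-f]{8})"             # \Uhhhhhhhh
    r"|(.))",                         # anything else after a backslash
    re.DOTALL,
)

def unescape(text):
    def replace(match):
        bracketed, x2, u4, u8, other = match.groups()
        if bracketed is not None:
            return "".join(chr(int(digits, 16)) for digits in bracketed.split())
        for digits in (x2, u4, u8):
            if digits is not None:
                return chr(int(digits, 16))
        # Known letter escapes map to control characters; the rest map to themselves.
        return SIMPLE.get(other, other)
    return ESCAPE.sub(replace, text)
```

For example, `unescape(r"x\u{61 2019 62}y")` yields the five code points of "xa’by". A real UnicodeSet parser would additionally need to remember which code points came from escapes.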
2790
2791Unicode property sets are defined as described in _UTS #18: Unicode Regular Expressions_ [[UTS18](https://www.unicode.org/reports/tr41/#UTS18)], Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [[ICUUnicodeSet](#ICUUnicodeSet)].
2792
2793##### 5.3.3.2 <a name="Unicode_Properties" href="#Unicode_Properties">Unicode Properties</a>
2794
2795Briefly, Unicode property sets are specified by any Unicode property and a value of that property, such as **[:General_Category=Letter:]** for Unicode letters or **\\p\{uppercase}** for the set of upper case letters in Unicode. The property names are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)]. The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of `"=<value>"`. For example, you can match letters by using the POSIX-style syntax:
2796
2797**[:General_Category=Letter:]**
2798
2799or by using the Perl-style syntax
2800
2801**\\p\{General_Category=Letter}**.
2802
2803Property names and values are case-insensitive, and whitespace, "-", and "\_" are ignored. The property name can be omitted for the **General_Category** and **Script** properties, but is required for other properties. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Thus **[:Letter:]** is equivalent to **[:General_Category=Letter:]**, and **[:Wh-ite-s pa_ce:]** is equivalent to **[:Whitespace=true:]**.
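
The loose matching and defaulting rules in this paragraph can be sketched as follows. The value tables are tiny hypothetical stand-ins for the data in PropertyValueAliases.txt, only the `=` form is handled, and the function names are illustrative rather than part of any library.

```python
import re

def normalize(token):
    """Ignore case, whitespace, '-' and '_' when comparing names and values."""
    return re.sub(r"[\s\-_]", "", token).lower()

# Hypothetical, heavily abbreviated stand-ins for the real alias data.
GENERAL_CATEGORY_VALUES = {"letter", "l", "uppercaseletter", "lu"}
SCRIPT_VALUES = {"latin", "latn", "greek", "grek"}

def resolve_property(body):
    """Turn the text between '[:' and ':]' into a (name, value) pair."""
    name, separator, value = body.partition("=")
    if separator:
        return normalize(name), normalize(value)
    token = normalize(name)
    # The property name may be omitted for General_Category and Script.
    if token in GENERAL_CATEGORY_VALUES:
        return "generalcategory", token
    if token in SCRIPT_VALUES:
        return "script", token
    # Otherwise the token is taken to be a boolean property that is true.
    return token, "true"
```

With these assumptions, `resolve_property("Letter")` and `resolve_property("General_Category=Letter")` give the same pair, as do `"Wh-ite-s pa_ce"` and `"Whitespace=true"`.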
2804
2805The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative" version, which is a property that excludes all code points of a given kind. For example, **[:^Letter:]** matches all code points that are not **[:Letter:]**.
2806
2807|                    | Positive         | Negative          |
2808| ------------------ | ---------------- | ----------------- |
2809| POSIX-style Syntax | [:type=value:]   | [:^type=value:]   |
2810| Perl-style Syntax  | \\p\{type=value} | \\P\{type=value}  |
2811
2812##### 5.3.3.3 <a name="Boolean_Operations" href="#Boolean_Operations">Boolean Operations</a>
2813
2814The low-level lists or properties then can be freely combined with the normal set operations (union, inverse, difference, and intersection):
2815
2816* To union two sets, simply concatenate them. For example, **[[:letter:] [:number:]]**
2817* To intersect two sets, use the '&' operator. For example, **[[:letter:] & [a-z]]**
2818* To take the set-difference of two sets, use the '-' operator. For example, **[[:letter:] - [a-z]]**
2819* To invert a set, place a '\^' immediately after the opening '['. For example, **[\^a-z]**. In any other location, the '\^' does not have a special meaning. The inversion [\^X] is equivalent to [[\\x{0}-\\x{10FFFF}]-[X]]. Thus multi-code point strings are discarded.
2820* Symmetric difference (~) is not supported.
2821
2822The binary operators '&', '-', and the implicit union have equal precedence and bind left-to-right. Thus **[[:letter:]-[a-z]-[\\u0100-\\u01FF]]** is equal to **[[[:letter:]-[a-z]]-[\\u0100-\\u01FF]]**. Another example is the set **[[ace][bdf] - [abc][def]]**, which is not the empty set, but instead equal to **[[[[ace] [bdf]] - [abc]] [def]]**, which equals **[[[abcdef] - [abc]] [def]]**, which equals **[[def] [def]]**, which equals **[def]**.
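
The left-to-right evaluation can be reproduced with ordinary Python sets. This is only a sketch of the precedence rule, not a UnicodeSet parser; the helper `combine` and the use of `"|"` to stand for the implicit union are assumptions made for illustration.

```python
def combine(first, *steps):
    """Apply (operator, operand) steps strictly left to right."""
    result = set(first)
    for operator, operand in steps:
        if operator == "|":      # implicit union (concatenation)
            result |= operand
        elif operator == "&":    # intersection
            result &= operand
        elif operator == "-":    # set difference
            result -= operand
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
    return result

# [[ace][bdf] - [abc][def]] evaluated with equal precedence, left to right.
left_to_right = combine(set("ace"), ("|", set("bdf")), ("-", set("abc")), ("|", set("def")))

# What the result would be if the unions were grouped before the difference.
grouped = (set("ace") | set("bdf")) - (set("abc") | set("def"))
```

`left_to_right` is the set of `d`, `e`, and `f`, matching the text, whereas `grouped` is empty.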
2823
2824**One caution:** the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern **[[:Lu:]-A]** is illegal, since it is interpreted as the set **[:Lu:]** followed by the incomplete range **-A**. To specify the set of upper case letters except for 'A', enclose the 'A' in brackets: **[[:Lu:]-[A]]**.
2825
2826##### 5.3.3.4 <a name="UnicodeSet_Examples" href="#UnicodeSet_Examples">UnicodeSet Examples</a>
2827
2828The following table summarizes the syntax that can be used.
2829
2830| Example              | Description |
2831| -------------------- | ----------- |
2832| [a]                  | The set containing 'a' alone |
2833| [a-z]                | The set containing 'a' through 'z' and all letters in between, in Unicode order.<br/>Thus it is the same as [\\u0061-\\u007A]. |
2834| [^a-z]               | The set containing all code points but 'a' through 'z'.<br/>Thus it is the same as [\\u0000-\\u0060 \\u007B-\\x{10FFFF}]. |
2835| [[pat1][pat2]]       | The union of sets specified by pat1 and pat2 |
2836| [[pat1]&[pat2]]      | The intersection of sets specified by pat1 and pat2 |
2837| [[pat1]-[pat2]]      | The asymmetric difference of sets specified by pat1 and pat2 |
2838| [a \{ab} \{ac}]      | The code point 'a' and the multi-code point strings "ab" and "ac" |
2839| [x\\u\{61 2019 62}y] | Equivalent to [x\\u0061\\u2019\\u0062y] (= [xa’by]) |
2840| [\{ax}-\{bz}]        | The set containing [\{ax} \{ay} \{az} \{bx} \{by} \{bz}], using the range syntax to get all the strings from \{ax} to \{bz} as described in _Section [5.3.4 String Range](#String_Range)_. |
2841| [:Lu:]               | The set of code points with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode upper case letters. The long form for this is **[:General_Category=Uppercase_Letter:]**. |
2842| [:L:]                | The set of code points belonging to all Unicode categories starting with 'L', that is, **[[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]**. The long form for this is **[:General_Category=Letter:]**. |
2843
2844#### 5.3.4 <a name="String_Range" href="#String_Range">String Range</a>
2845
2846A String Range is a compact format for specifying a list of strings.
2847
2848**Syntax:**
2849
2850> X _sep_ Y
2851
2852The separator and the format of strings X, Y may vary depending on the domain. For example,
2853
2854* for the validity files the separator is ~,
2855* for UnicodeSet the separator is -, and any multi-codepoint string is enclosed in {…}.
2856
2857**Validity:**
2858
2859> A string range X _sep_ Y is valid iff len(X) ≥ len(Y) > 0, where len(X) is the length of X in code points.
2860>
2861> _There may be additional, domain-specific requirements for validity of the expansion of the string range._
2862
2863**Interpretation:**
2864
28651. Break X into P and S, where len(S) = len(Y)
2866   * Note that P will be an empty string if the lengths of X and Y are equal.
28672. Form the combinations of all P+(s₀..y₀)+(s₁..y₁)+...(sₙ..yₙ)
2868   * s₀ is the first code point in S, etc.
2869
2870**Examples:**
2871
2872<!-- HTML: no th -->
2873<table><tbody>
2874<tr><td>ab-ad</td><td>→</td><td>ab ac ad</td></tr>
2875<tr><td>ab-d</td><td>→</td><td>ab ac ad</td></tr>
2876<tr><td>ab-cd</td><td>→</td><td>ab ac ad bb bc bd cb cc cd</td></tr>
<tr><td>👦🏻-👦🏿</td><td>→</td><td>👦🏻 👦🏼 👦🏽 👦🏾 👦🏿</td></tr>
<tr><td>👦🏻-🏿</td><td>→</td><td>👦🏻 👦🏼 👦🏽 👦🏾 👦🏿</td></tr>
2879</tbody></table>
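
The interpretation steps can be sketched in a few lines. The sketch takes X and Y as already-separated strings (the separator and any `{…}` wrapping are domain-specific) and relies on Python strings being sequences of code points. Rejecting a position where the start code point is greater than the end is an assumption; the text above does not say how that case is handled. The UnicodeSet-specific requirement of equal lengths is not enforced. The name `expand_string_range` is hypothetical.

```python
from itertools import product

def expand_string_range(x, y):
    """Expand the string range X..Y into a list of strings."""
    if not (len(x) >= len(y) > 0):
        raise ValueError("a string range requires len(X) >= len(Y) > 0")
    # Step 1: break X into prefix P and suffix S, where len(S) == len(Y).
    split = len(x) - len(y)
    prefix, suffix = x[:split], x[split:]
    # Step 2: build one code point range per position of S and Y.
    position_ranges = []
    for start, end in zip(suffix, y):
        if ord(start) > ord(end):
            # Assumption: treated as an error here.
            raise ValueError("range start is greater than range end")
        position_ranges.append([chr(cp) for cp in range(ord(start), ord(end) + 1)])
    # Combine every choice at every position, keeping the prefix fixed.
    return [prefix + "".join(choice) for choice in product(*position_ranges)]
```

For instance, `expand_string_range("ab", "cd")` produces the nine strings from "ab" to "cd" in the order listed in the table, and the function works equally for supplementary code points.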
2880
2881### 5.4 <a name="Identity_Elements" href="#Identity_Elements">Identity Elements</a>
2882
2883```xml
2884<!ELEMENT identity (alias | (version, generation?, language, script?, territory?, variant?, special*) ) >
2885```
2886
2887The `identity` element contains information identifying the target locale for this data, and general information about the version of this data.
2888
2889```xml
2890<version number="$Revision: 1.227 $">
2891```
2892
2893The `version` element provides, in an attribute, the version of this file.  The contents of the element can contain textual notes about the changes between this version and the last. For example:
2894
2895> ```xml
2896> <version number="1.1">Various notes and changes in version 1.1</version>
2897> ```
2898>
2899> This is not to be confused with the `version` attribute on the `ldml` element, which tracks the dtd version.
2900
2901```xml
2902<generation date="$Date: 2007/07/17 23:41:16 $" />
2903```
2904
2905The `generation` element is now deprecated. It was used to contain the last modified date for the data. This could be in two formats: ISO 8601 format, or CVS format (illustrated by the example above).
2906
2907```xml
2908<language type="en" />
2909```
2910
2911The language code is the primary part of the specification of the locale id, with values as described above.
2912
2913```xml
2914<script type="Latn" />
2915```
2916
2917The script code may be used in the identification of written languages, with values described above.
2918
2919```xml
2920<territory type="US" />
2921```
2922
2923The territory code is a common part of the specification of the locale id, with values as described above.
2924
2925```xml
2926<variant type="NYNORSK" />
2927```
2928
2929The variant code is the tertiary part of the specification of the locale id, with values as described above.
2930
2931When combined according to the rules described in _[Section 3, Unicode Language and Locale Identifiers](#Unicode_Language_and_Locale_Identifiers)_, the `language` element, along with any of the optional `script`, `territory`, and `variant` elements, must identify a known, stable locale identifier. Otherwise, it is an error.
2932
2933### 5.5 <a name="Valid_Attribute_Values" href="#Valid_Attribute_Values">Valid Attribute Values</a>
2934
2935The [DTD Annotations](#DTD_Annotations) in Section 5.7 are used to determine whether elements, attributes, or attribute values are valid (or deprecated).
2936
2937### 5.6 <a name="Canonical_Form" href="#Canonical_Form">Canonical Form</a>
2938
2939The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files.
2940
2941Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element `foo`:
2942
2943```xml
2944<foo>
2945    <pattern>
2946    <somethingElse>
2947</foo>
2948```
2949
2950It can never require the reverse order in a different element `bar`.
2951
2952```xml
2953<bar>
2954    <somethingElse>
2955    <pattern>
2956</bar>
2957```
2958
2959Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency:
2960
2961```xml
2962<!ELEMENT currency (alias | (pattern*, displayName?, symbol?, pattern*, decimal?, group?, special*)) >
2963```
2964
[XML](https://www.w3.org/TR/REC-xml/) files can vary widely in textual form while representing precisely the same data. Putting the LDML files in the repository into a canonical form allows the use of simple, widely used diff tools (including those in CVS) to detect differences when vetting changes, without those tools being confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily.
2966
2967#### 5.6.1 <a name="Content" href="#Content">Content</a>
2968
29691.  All start elements are on their own line, indented by _depth_ tabs.
29702.  All end elements (except for leaf nodes) are on their own line, indented by _depth_ tabs.
29713.  Any leaf node with empty content is in the form `<foo/>`.
29724.  There are no blank lines except within comments or content.
29735.  Spaces are used within a start element. There are no extra spaces within elements.
2974    * `<version number="1.2"/>`, not `<version  number = "1.2" />`
2975    * `</identity>`, not `</identity >`
29766.  All attribute values use double quote ("), not single (').
29777.  There are no CDATA sections, and no escapes except those absolutely required.
2978    * no `&apos;` since it is not necessary
2979    * no `'&#x61;'`, it would be just `'a'`
29808.  All attributes with defaulted values are suppressed.
29819.  The draft and `alt="proposed.*"` attributes are only on leaf elements.
298210. The tzid are canonicalized in the following way:
2983    * All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical.
2984    * After that point, the first time a tzid is introduced, that is the canonical form.
2985
2986    That is, new IDs are added, but existing ones keep the original form. The _TZ_ timezone database keeps a set of equivalences in the "backward" file. These are used to map other tzids to the canonical form. For example, when `America/Argentina/Catamarca` was introduced as the new name for the previous `America/Catamarca` , a link was added in the backward file.
2987
2988    `Link America/Argentina/Catamarca America/Catamarca`
2989
2990_Example:_
2991
```xml
<ldml draft="unconfirmed">
    <identity>
        <version number="1.2"/>
        <language type="en"/>
        <territory type="AS"/>
    </identity>
    <numbers>
        <currencyFormats>
            <currencyFormatLength>
                <currencyFormat>
                    <pattern>¤#,##0.00;(¤#,##0.00)</pattern>
                </currencyFormat>
            </currencyFormatLength>
        </currencyFormats>
    </numbers>
</ldml>
```
3010
3011#### 5.6.2 <a name="Ordering" href="#Ordering">Ordering</a>
3012
3013An element is ordered first by the element name, and then if the element names are identical, by the sorted set of attribute-value pairs. For the latter, compare the first pair in each (in sorted order by attribute pair). If not identical, go to the second pair, and so on.
3014
3015Elements and attributes are ordered according to their order in the respective DTDs. Attribute value comparison is a bit more complicated, and may depend on the attribute and type. This is currently done with specific ordering tables.
3016
3017Any future additions to the DTD must be structured so as to allow compatibility with this ordering. See also [Section 5.5 Valid Attribute Values.](#Valid_Attribute_Values)
3018
3019#### 5.6.3 <a name="Comments" href="#Comments">Comments</a>
3020
30211. Comments are of the form `<!-- stuff -->`.
30222. They are logically attached to a node. There are 4 kinds:
   1. Inline comments always appear after a leaf node, at the end of the same line. They occupy a single line.
   2. Preblock comments always precede the attachment node, and are indented on the same level.
   3. Postblock comments always follow the attachment node, and are indented on the same level.
   4. The final comment appears after `</ldml>`.
30273. Multiline comments (except the final comment) have each line after the first indented to one deeper level.
3028
3029**Examples:**
3030
3031```xml
3032<eraAbbr>
3033    <era type="0">BC</era> <!-- might add alternate BDE in the future -->
3034...
3035<timeZoneNames>
3036    <!-- Note: zones that do not use daylight time need further work -->
3037    <zone type="America/Los_Angeles">
3038    ...
3039    <!-- Note: the following is known to be sparse,
3040            and needs to be improved in the future -->
3041    <zone type="Asia/Jerusalem">
3042```
3043
3044### 5.7 <a name="DTD_Annotations" href="#DTD_Annotations">DTD Annotations</a>
3045
3046The information in a standard DTD is insufficient for use in CLDR. To make up for that, DTD annotations are added. These are of the form
3047
3048```xml
3049<!--@...-->
3050```
3051
3052and are included below the !ELEMENT or !ATTLIST line that they apply to. The current annotations are:
3053
3054| Type                 | Description |
3055| ---------------------| ----------- |
3056| `<!--@VALUE-->`      | The attribute is not distinguishing, and is treated like an element value |
3057| `<!--@METADATA-->`   | The attribute is a “comment” on the data, like the draft status. It is not typically used in implementations. |
3058| `<!--@ORDERED-->`    | The element's children are ordered, and do not inherit. |
3059| `<!--@DEPRECATED-->` | The element or attribute is deprecated, and should not be used. |
3060| `<!--@DEPRECATED: attribute-value1, attribute-value2-->` | The attribute values are deprecated, and should not be used. Spaces between tokens are not significant. |
3061| `<!--@MATCH:{attribute value constraint}-->` | Requires the attribute value to match the constraint. |
3062| `<!--@TECHPREVIEW-->` | The element is a technical preview of a feature and may be changed or removed at any time. |
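
As an informal illustration of how such annotations attach to declarations, the following Python sketch scans DTD text and associates each `<!--@...-->` comment with the closest preceding `!ELEMENT` or `!ATTLIST` line. The function name `collect_annotations`, the two regular expressions, and the sample DTD fragment are hypothetical and chosen only for this example; this is not how the CLDR tooling is implemented.

```python
import re

# Matches the start of an element or attribute-list declaration.
DECL = re.compile(r"<!(ELEMENT|ATTLIST)\s+(\S+)(?:\s+(\S+))?")
# Matches an annotation comment such as <!--@VALUE--> or <!--@MATCH:any-->.
ANNOTATION = re.compile(r"<!--@([A-Z]+)(?::\s*(.*?))?\s*-->")


def collect_annotations(dtd_text):
    """Map (element, attribute-or-None) to a list of (annotation, argument) pairs."""
    result = {}
    current = None
    for line in dtd_text.splitlines():
        decl = DECL.search(line)
        if decl:
            kind, element, attribute = decl.groups()
            # For !ELEMENT lines the third token is the content model, not an attribute.
            current = (element, attribute if kind == "ATTLIST" else None)
            continue
        note = ANNOTATION.search(line)
        if note and current is not None:
            result.setdefault(current, []).append((note.group(1), note.group(2)))
    return result


# Hypothetical DTD fragment, not taken from the CLDR DTDs.
SAMPLE = """\
<!ELEMENT example ( item* ) >
    <!--@ORDERED-->
<!ATTLIST example kind NMTOKEN #IMPLIED >
    <!--@MATCH:literal/a, b-->
    <!--@VALUE-->
"""

annotations = collect_annotations(SAMPLE)
```

Here `annotations[("example", "kind")]` holds both the `MATCH` constraint and the `VALUE` marker for the hypothetical `kind` attribute, while `annotations[("example", None)]` holds the element-level `ORDERED` marker.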
3063
There is additional information in the attributeValueValidity.xml file that is used internally for testing. For example, the following line indicates that the `type` attribute of the `currency` element in the ldml DTD must have values from the BCP 47 `cu` type.
3065
```xml
<attributeValues dtds='ldml' elements='currency' attributes='type'>$_bcp47_cu</attributeValues>
```
3069
3070The element values may be literals, regular expressions, or variables (some of which are set programmatically according to other CLDR data, such as the above). However, the information at this point does not cover all attribute values, is used only for testing, and should not be used in implementations since the structure may change without notice.
3071
3072#### 5.7.1 <a name="match_expressions" href="#match_expressions">Attribute Value Constraints</a>
3073
3074The following are constraints on the attribute values. Note: in future versions, the format may change, and/or the constraints may be tightened.
3075
3076| Constraint                | Comments |
3077| ------------------------- | -------- |
3078| any                       | any string value |
3079| any/TODO                  | placeholder for future constraints |
3080| bcp47/anykey              | any bcp47 key or tkey |
3081| bcp47/anyvalue            | any bcp47 value (type) or tvalue |
3082| literal/\{literal values} | comma separated |
3083| regex/\{regex expression} | valid regex expression |
3084| bcp47/\{key or tkey}      | matches possible values for that key or tkey |
3085| metazone                  | valid metazone |
| range/\{start_number}~\{end_number} | number between start and end (inclusive) |
3087| time/\{time or date or date-time pattern} | eg HH:mm |
3088| unicodeset/\{unicodeset pattern} | valid unicodeset |
3089| validity/\{field}         | currency, language, locale, region, script, subdivision, short-unit, unit, variant<br/>The field can be qualified by particular enums, such as:<br/>`validity/unit/regular deprecated`: matches only _deprecated_ and _regular_<br/>`validity/unit/!deprecated`: matches all but _deprecated_ |
| version                   | version with 1 to 4 numeric fields, such as 35.3.9 |
3091| set/\{match}              | set of elements that match \{match} |
3092| or/\{match1}XX\{match2}…  | matches at least one of \{match1}, etc |
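
To make a few of these constraints concrete, here is a minimal Python sketch of a checker for the `any`, `literal`, `regex`, `range`, and `version` forms only. The function name `matches` is hypothetical, and several details are assumptions rather than statements of the specification: literal values are trimmed of surrounding spaces, a regex must match the whole value, and a range is written as two numbers separated by `~`. Constraints that depend on CLDR data (such as `bcp47/...`, `metazone`, or `validity/...`) are deliberately left out.

```python
import re


def matches(constraint, value):
    """Check a value against a small subset of the constraint forms (sketch only)."""
    kind, _, arg = constraint.partition("/")
    if kind == "any":
        # Covers both "any" and the placeholder "any/TODO".
        return True
    if kind == "literal":
        # Assumption: comma-separated values, surrounding spaces not significant.
        return value in {item.strip() for item in arg.split(",")}
    if kind == "regex":
        # Assumption: the expression must match the entire value.
        return re.fullmatch(arg, value) is not None
    if kind == "range":
        # Assumption: written as start~end, both ends inclusive.
        start, end = (float(part) for part in arg.split("~"))
        try:
            number = float(value)
        except ValueError:
            return False
        return start <= number <= end
    if kind == "version":
        # One to four dot-separated numeric fields.
        return re.fullmatch(r"\d+(\.\d+){0,3}", value) is not None
    raise NotImplementedError(f"constraint kind not covered by this sketch: {kind}")
```

For example, `matches("version", "35.3.9")` is true, while `matches("range/1~5", "6")` is false.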
3093
3094
3095
3096## 6 <a name="Property_Data" href="#Property_Data">Property Data</a>
3097
3098Some data in CLDR does not use an XML format, but rather a semicolon-delimited format derived from that of the Unicode Character Database. That is because the data is more likely to be parsed by implementations that already parse UCD data. Those files are present in the common/properties directory.
3099
3100Each file has a header that explains the format and usage of the data.
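
As a rough illustration of reading this kind of file, the following Python sketch splits semicolon-delimited lines into fields. It assumes, as in UCD files, that `#` begins a comment and that blank lines carry no data; the function name `parse_property_lines` and the sample lines are hypothetical, and the header of each data file remains the authority on what its fields mean.

```python
def parse_property_lines(lines):
    """Yield the list of trimmed fields for each data line (sketch only)."""
    for raw in lines:
        # Assumption: '#' starts a comment that runs to the end of the line.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        yield [field.strip() for field in data.split(";")]


# Hypothetical sample lines, not copied from any CLDR file.
sample = [
    "# field1; field2; field3",
    "",
    "alpha; 1; yes  # trailing comment",
    "beta ; 2 ; no",
]
rows = list(parse_property_lines(sample))
```

With the sample above, `rows` contains two entries, one per data line, each with three fields.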
3101
3102### 6.1 <a name="Script_Metadata" href="#Script_Metadata">Script Metadata</a>
3103
3104`scriptMetadata.txt`
3105
This file provides general information about scripts that may be useful to implementations processing text. The information is the best currently available, and may change between versions of CLDR. The format is similar to that of a Unicode Character Database property file, and is documented in the header of the data file.
3107
3108### 6.2 <a name="Extended_Pictographic" href="#Extended_Pictographic">Extended Pictographic</a>
3109
3110`ExtendedPictographic.txt`
3111
3112This file was used to define the ExtendedPictographic data used for “future-proofing” emoji behavior, especially in segmentation. As of Emoji version 11.0, the set of Extended_Pictographic is incorporated into the emoji data files found at [unicode.org/Public/emoji/](https://www.unicode.org/Public/emoji/).
3113
3114### 6.3 <a name="Labels.txt" href="#Labels.txt">Labels.txt</a>
3115
3116`labels.txt`
3117
This file provides general information about associations of labels to characters that may be useful to implementations of character-picking applications. The information is the best currently available, and may change between versions of CLDR. The format is similar to that of a Unicode Character Database property file, and is documented in the header of the data file.
3119
3120Initially, the contents are focused on emoji, but may be expanded in the future to other types of characters. Note that a character may have multiple labels.
3121
3122### 6.4 <a name="Segmentation_Tests" href="#Segmentation_Tests">Segmentation Tests</a>
3123
3124CLDR provides a tailoring to the [Grapheme Cluster Break (gcb)](https://www.unicode.org/reports/tr29/) algorithm to avoid splitting Indic aksaras. The corresponding test files for that are located in common/properties/segments/, along with a readme.txt that provides more details. There are also specific test files for the supported Indic scripts in the unittest directory.
3125
3126
3127
3128## 7 <a name="Format_Parse_Issues" href="#Format_Parse_Issues">Issues in Formatting and Parsing</a>
3129
3130### 7.1 <a name="Lenient_Parsing" href="#Lenient_Parsing">Lenient Parsing</a>
3131
3132#### 7.1.1 <a name="Motivation" href="#Motivation">Motivation</a>
3133
3134User input is frequently messy. Attempting to parse it by matching it exactly against a pattern is likely to be unsuccessful, even when the meaning of the input is clear to a human being. For example, for a date pattern of "MM/dd/yy", the input "June 1, 2006" will fail.
3135
3136The goal of lenient parsing is to accept user input whenever it is possible to decipher what the user intended. Doing so requires using patterns as data to guide the parsing process, rather than an exact template that must be matched. This informative section suggests some heuristics that may be useful for lenient parsing of dates, times, and numbers.
3137
3138#### 7.1.2 <a name="Loose_Matching" href="#Loose_Matching">Loose Matching</a>
3139
3140Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:
3141
* Remove "." from currency symbols and other fields used for matching, and also from the input string unless:
  * "." is in the decimal set, and
  * its position in the input string is immediately before a decimal digit.
* Ignore all format characters: in particular, ignore any RLM, LRM or ALM used to control BIDI formatting.
* Ignore all characters in [:Zs:] unless they occur between letters. (In the heuristics below, even those between letters are ignored except to delimit fields.)
* Map all characters in [:Dash:] to U+002D HYPHEN-MINUS.
* Use the data in the `<character-fallback>` element to map equivalent characters (for example, curly to straight apostrophes). Other apostrophe-like characters should also be treated as equivalent, especially if the character actually used in a format may be unavailable on some keyboards. For example:
  * U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as U+2018 LEFT SINGLE QUOTATION MARK (‘).
  * U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
  * U+05F3 HEBREW PUNCTUATION GERESH (‎׳) might be typed instead as U+0027 APOSTROPHE.
* Apply mappings particular to the domain (i.e., for dates or for numbers, discussed in more detail below).
* Apply case folding (possibly including language-specific mappings such as Turkish i).
* Normalize to NFKC; thus _no-break space_ will map to _space_; half-width _katakana_ will map to full-width.
3155
3156Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the process; actual implementations can optimize, such as by applying the transform incrementally during matching.
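
The transform described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the name `loosen` is hypothetical, `DASHES` is a hand-picked subset of [:Dash:] (the Python standard library does not expose that property), `APOSTROPHES` is a small stand-in for the `<character-fallback>` data, domain-specific and language-specific mappings are omitted, and all [:Zs:] characters are dropped, following the parenthetical note about the heuristics. The steps are applied in the order listed, so the "." rule sees the original input.

```python
import unicodedata

# Assumption: a hand-picked subset of [:Dash:], mapped to U+002D HYPHEN-MINUS.
DASHES = {"\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2015",
          "\u2212", "\uFE63", "\uFF0D"}

# Assumption: stand-in for <character-fallback> data; apostrophe-like
# characters are all treated as U+0027 APOSTROPHE.
APOSTROPHES = {"\u2018": "'", "\u2019": "'", "\u02BB": "'",
               "\u02BC": "'", "\u05F3": "'"}


def loosen(text, decimal_set=frozenset(".")):
    """Apply a simplified version of the loose-matching transform."""
    kept = []
    for index, char in enumerate(text):
        if char == ".":
            following = text[index + 1] if index + 1 < len(text) else ""
            # Keep "." only if it is a decimal separator directly before a digit.
            if not (char in decimal_set and following.isdecimal()):
                continue
        # Drop format characters (Cf, e.g. RLM, LRM, ALM) and space separators (Zs).
        if unicodedata.category(char) in ("Cf", "Zs"):
            continue
        if char in DASHES:
            char = "-"
        kept.append(APOSTROPHES.get(char, char))
    # Case folding, then NFKC normalization.
    return unicodedata.normalize("NFKC", "".join(kept).casefold())


print(loosen(" - NA f. 1,000.00"))  # -naf1,000.00
print(loosen("NA f."))              # naf
```

In a matcher, the same function would be applied both to the input text and to each candidate field (such as a currency symbol) before comparing them.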
3157
3158### 7.2 <a name="Invalid_Patterns" href="#Invalid_Patterns">Handling Invalid Patterns</a>
3159
3160Processes sometimes encounter invalid number or date patterns, such as a number pattern with “¤¤¤¤¤” (valid pattern character but invalid length in current CLDR), a date pattern with “nn” (invalid pattern character in current CLDR), or a date pattern with “MMMMMM” (invalid length in current CLDR). The recommended behavior for handling such an invalid pattern field is:
3161
3162* For a field using a currently-invalid length for a valid pattern character:
3163  * In **formatting,** emit U+FFFD REPLACEMENT CHARACTER for the invalid field.
3164  * In **parsing,** the field may be parsed as if it had a valid length.
3165* For a pattern that contains a currently-invalid pattern character (applies only to date patterns, for which A-Za-z are reserved as pattern characters but not all defined as valid):
3166  * Produce an error (set an error code or throw an exception) when an attempt is made to create a formatter with such a pattern or to apply such a pattern to an existing formatter.
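
The recommended formatting behavior can be illustrated with a small Python sketch. It is intentionally simplified and rests on assumptions: `MAX_LENGTH` is an illustrative subset of date pattern characters and field lengths rather than the full CLDR table, quoted literal text is not handled, and the names `make_formatter` and `render` are hypothetical. The lenient parsing case is not shown.

```python
import re

# Assumption: illustrative subset of date pattern characters with their
# maximum field lengths (None means no limit); not the full CLDR table.
MAX_LENGTH = {"y": None, "M": 5, "d": 2, "H": 2, "m": 2, "s": 2}

# A run of one repeated ASCII letter is a field; anything else is literal text.
# Simplification: quoted literals ('...') are not handled in this sketch.
TOKEN = re.compile(r"([A-Za-z])\1*|[^A-Za-z]+")


def make_formatter(pattern):
    """Return a formatting function; raise ValueError for an invalid pattern character."""
    tokens = [(m.group(0), m.group(1)) for m in TOKEN.finditer(pattern)]
    for _, letter in tokens:
        if letter is not None and letter not in MAX_LENGTH:
            # Invalid pattern character: fail when the formatter is created.
            raise ValueError(f"invalid pattern character {letter!r} in {pattern!r}")

    def format_fields(render):
        # render(letter, length) is a caller-supplied function that formats one field.
        out = []
        for text, letter in tokens:
            if letter is None:
                out.append(text)
            elif MAX_LENGTH[letter] is not None and len(text) > MAX_LENGTH[letter]:
                out.append("\uFFFD")  # invalid length: emit REPLACEMENT CHARACTER
            else:
                out.append(render(letter, len(text)))
        return "".join(out)

    return format_fields


# Hypothetical field values standing in for a real date.
sample = {"y": "2006", "M": "06", "d": "01"}
result = make_formatter("MMMMMM/dd/y")(lambda letter, length: sample[letter])
```

Here `result` begins with U+FFFD in place of the over-long month field, followed by `/01/2006`, while `make_formatter("nn")` raises an error because `n` is not a valid pattern character in this table.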
3167
3168* * *
3169
3170## <a name="Deprecated_Structure" href="#Deprecated_Structure">Annex A Deprecated Structure</a>
3171
3172The [DTD Annotations](#DTD_Annotations) in Section 5.7 are used to determine whether DTD items such as elements, attributes, or attribute values are deprecated.
3173
3174Though such deprecated items are still valid LDML, they are strongly discouraged, and are no longer used in CLDR.
3175
3176The CLDR [DTD Deltas](https://unicode-org.github.io/cldr-staging/charts/latest/supplemental/dtd_deltas.html) chart shows which DTD items have been deprecated in which version of CLDR.
3177
3178The remainder of this section describes selected cases of deprecated structure, and what (if any) should be used instead.
3179
3180### <a name="Fallback_Elements" href="#Fallback_Elements">A.1 Element fallback</a>
3181
Implementations should instead use the information in [Section 4.4 Language Matching](#LanguageMatching) for doing language fallback.
3183
3184### <a name="BCP47_Keyword_Mapping" href="#BCP47_Keyword_Mapping">A.2 BCP 47 Keyword Mapping</a>
3185
Instead use the mechanisms described in [Section 3.6.4 U Extension Data Files](#Unicode_Locale_Extension_Data_Files).
3187
3188### <a name="Choice_Patterns" href="#Choice_Patterns">A.3 Choice Patterns</a>
3189
3190Instead use `count` attributes.
3191
3192### <a name="Element_default" href="#Element_default">A.4 Element default</a>
3193
3194Instead use replacement structure, for example:
3195
3196* For `<collations>`, now use the `<defaultCollation>` element.
3197* For `<calendars>`, the default calendar type for a locale is now specified by _[Calendar Preference Data](tr35-dates.md#Calendar_Preference_Data)_.
3198
3199### <a name="Deprecated_Common_Attributes" href="#Deprecated_Common_Attributes">A.5 Deprecated Common Attributes</a>
3200
3201#### <a name="Attribute_standard" href="#Attribute_standard">A.5.1 Attribute standard</a>
3202
3203Instead, use a `reference` element with the attribute `standard="true"`.
3204
3205#### <a name="Attribute_draft_nonLeaf" href="#Attribute_draft_nonLeaf">A.5.2 Attribute draft in non-leaf elements</a>
3206
The `draft` attribute is deprecated except in leaf elements (elements that do not have any subelements).
3208
3209### <a name="Element_base" href="#Element_base">A.6 Element base</a>
3210
3211Instead use the collation `<import>` element.
3212
3213### <a name="Element_rules" href="#Element_rules">A.7 Element rules</a>
3214
3215Instead use the basic collation syntax with the [`<cr>` element](tr35-collation.md#Rules).
3216
3217### <a name="Deprecated_subelements_of_dates" href="#Deprecated_subelements_of_dates">A.8 Deprecated subelements of `<dates>`</a>
3218
3219* `<localizedPatternChars>`
3220* `<dateRangePattern>`, replaced by `<intervalFormats>`.
3221
3222### <a name="Deprecated_subelements_of_calendars" href="#Deprecated_subelements_of_calendars">A.9 Deprecated subelements of `<calendars>`</a>
3223
* The deprecated `<monthNames>` and `<monthAbbr>` are replaced by the `months` element with the context `type="format"` and the width `type="wide"` (for ...Names) and `type="abbreviated"` (for ...Abbr), respectively.
* The deprecated `<dayNames>` and `<dayAbbr>` are replaced by the `days` element with the context `type="format"` and the width `type="wide"` (for ...Names) and `type="abbreviated"` (for ...Abbr), respectively.
* <a name="week" href="#week">`<week>`</a> is deprecated in the main LDML files, because the data is more appropriately organized as connected to territories, not to linguistic data. Use the supplemental `<weekData>` element instead.
* The standalone `<am>` and `<pm>` are deprecated, and the data are instead included as part of the `<dayPeriods>` element.
* `<fields>` is deprecated as a subelement of `<calendars>`; instead, a `<fields>` element should be located just under a `<dates>` element. See [Calendar Fields](tr35-dates.md#Calendar_Fields).
3229
3230### <a name="Deprecated_subelements_of_timeZoneNames" href="#Deprecated_subelements_of_timeZoneNames">A.10 Deprecated subelements of `<timeZoneNames>`</a>
3231
3232* `<preferenceOrdering>`: use metazones instead.
* `<singleCountries>`: use [Primary Zones](tr35-dates.md#Primary_Zones) instead.
3234* `<hoursFormat>`, <a name="fallbackRegionFormat" href="#fallbackRegionFormat">`<fallbackRegionFormat>`</a>, `<abbreviationFallback>`
3235
3236### <a name="Deprecated_subelements_of_zone_metazone" href="#Deprecated_subelements_of_zone_metazone">A.11 Deprecated subelements of `<zone>` and `<metazone>`</a>
3237
3238* `<commonlyUsed>`, formerly used to indicate whether a zone was commonly used in the locale.
3239
3240### <a name="Renamed_attribute_values_for_contextTransformUsage" href="#Renamed_attribute_values_for_contextTransformUsage">A.12 Renamed attribute values for `<contextTransformUsage>` element</a>
3241
3242The `<contextTransformUsage>` element was introduced in CLDR 21. The values for its `type` attribute are documented in [`<contextTransformUsage>` type attribute values](tr35-general.md#contextTransformUsage_type_attribute_values). In CLDR 25, some of these values were renamed from their previous values for improved clarity:
3243
3244* `type` was renamed to `keyValue`
3245* `displayName` was renamed to `currencyName`
3246* `displayName-count` was renamed to `currencyName-count`
3247* `tense` was renamed to `relative`
3248
3249### <a name="Deprecated_subelements_of_segmentations" href="#Deprecated_subelements_of_segmentations">A.13 Deprecated subelements of `<segmentations>`</a>
3250
3251* `<exceptions>` and `<exception>`: Replaced with `<suppressions>` and `<suppression>`.
3252
3253### <a name="Element_cp" href="#Element_cp">A.14 Element cp</a>
3254
The `cp` element was used in certain elements to escape characters that cannot be represented in XML, even with NCRs. This mechanism has been replaced by specialized syntax. The deprecated form looked like this:
3256
3257| Code Point | XML Example    |
3258| ---------- | -------------- |
3259| `U+0000`   | `<cp hex="0">` |
3260
3261### <a name="validSubLocales" href="#validSubLocales">A.15 Attribute validSubLocales</a>
3262
Instead of using `validSubLocales`, it is recommended to simply add empty files to specify which sublocales are valid. This convention is used throughout CLDR.
3264
3265### <a name="postCodeElements" href="#postCodeElements">A.16 Elements postalCodeData, postCodeRegex</a>
3266
Instead, please see other services that are kept up to date, such as:
3268
3269* [https://i18napis.appspot.com/address/data/US](https://i18napis.appspot.com/address/data/US)
3270* [https://i18napis.appspot.com/address/data/CH](https://i18napis.appspot.com/address/data/CH)
3271* ...
3272
3273### <a name="telephoneCodeData" href="#telephoneCodeData">A.17 Element telephoneCodeData</a>
3274
3275The element `<telephoneCodeData>` and its subelements have been deprecated and the data removed.
3276
3277* * *
3278
3279## <a name="Links_to_Other_Parts" href="#Links_to_Other_Parts">Annex B Links to Other Parts</a>
3280
3281The LDML specification is split into several [parts](#Parts) by topic, with one HTML document per part. The following tables provide redirects for links to specific topics. Please update your links and bookmarks.
3282
3283Part 1 Links: Core (this document): No redirects needed.
3284
3285###### Table: <a name="Part_2_Links" href="#Part_2_Links">Part 2 Links</a>: [General](tr35-general.md) (display names & transforms, etc.)
3286
3287| Old section                                                                                                 | Section in new part |
3288| ----------------------------------------------------------------------------------------------------------- | ------------------- |
3289| 5.4 <a name="Display_Name_Elements" href="#Display_Name_Elements">Display Name Elements</a>                 | 1 [Display Name Elements](tr35-general.md#Display_Name_Elements) |
3290| 5.5 <a name="Layout_Elements" href="#Layout_Elements">Layout Elements</a>                                   | 2 [Layout Elements](tr35-general.md#Layout_Elements) |
3291| 5.6 <a name="Character_Elements" href="#Character_Elements">Character Elements</a>                          | 3 [Character Elements](tr35-general.md#Character_Elements) |
3292| 5.6.1 <a name="ExemplarSyntax" href="#ExemplarSyntax">Exemplar Syntax</a>                                   | 3.1 [Exemplar Syntax](tr35-general.md#ExemplarSyntax) |
3293| 5.6.2 Restrictions                                                                                          | 3.1 [Exemplar Syntax](tr35-general.md#ExemplarSyntax) |
3294| 5.6.3 Mapping                                                                                               | 3.2 [Mapping](tr35-general.md#Character_Mapping) |
3295| 5.6.4 <a name="IndexLabels" href="#IndexLabels">Index Labels</a>                                            | 3.3 [Index Labels](tr35-general.md#IndexLabels) |
3296| 5.6.5 Ellipsis                                                                                              | 3.4 [Ellipsis](tr35-general.md#Ellipsis) |
3297| 5.6.6 More Information                                                                                      | 3.5 [More Information](tr35-general.md#Character_More_Info) |
3298| 5.7 <a name="Delimiter_Elements" href="#Delimiter_Elements">Delimiter Elements</a>                          | 4 [Delimiter Elements](tr35-general.md#Delimiter_Elements) |
3299| C.6 <a name="Measurement_System_Data" href="#Measurement_System_Data">Measurement System Data</a>           | 5 [Measurement System Data](tr35-general.md#Measurement_System_Data) |
3300| 5.8 <a name="Measurement_Elements" href="#Measurement_Elements">Measurement Elements (deprecated)</a>       | 5.1 [Measurement Elements (deprecated)](tr35-general.md#Measurement_Elements) |
3301| 5.11 <a name="Unit_Elements" href="#Unit_Elements">Unit Elements</a>                                        | 6 [Unit Elements](tr35-general.md#Unit_Elements) |
3302| 5.12 <a name="POSIX_Elements" href="#POSIX_Elements">POSIX Elements</a>                                     | 7 [POSIX Elements](tr35-general.md#POSIX_Elements) |
3303| 5.13 <a name="Reference_Elements" href="#Reference_Elements">Reference Element</a>                          | 8 [Reference Element](tr35-general.md#Reference_Elements) |
3304| 5.15 <a name="Segmentations" href="#Segmentations">Segmentations</a>                                        | 9 [Segmentations](tr35-general.md#Segmentations) |
3305| 5.15.1 <a name="Segmentation_Inheritance" href="#Segmentation_Inheritance">Segmentation Inheritance</a>     | 9.1 [Segmentation Inheritance](tr35-general.md#Segmentation_Inheritance) |
3306| 5.16 <a name="Transforms" href="#Transforms">Transforms</a>                                                 | 10 [Transforms](tr35-general.md#Transforms) |
3307| N <a name="Transform_Rules" href="#Transform_Rules">Transform Rules</a>                                     | 10.3 [Transform Rules Syntax](tr35-general.md#Transform_Rules_Syntax) |
3308| 5.18 <a name="ListPatterns" href="#ListPatterns">List Patterns</a>                                          | 11 [List Patterns](tr35-general.md#ListPatterns) |
3309| C.20 <a name="List_Gender" href="#List_Gender">Gender of Lists</a>                                          | 11.1 [Gender of Lists](tr35-general.md#List_Gender) |
3310| 5.19 <a name="Context_Transform_Elements" href="#Context_Transform_Elements">ContextTransform Elements</a>  | 12 [ContextTransform Elements](tr35-general.md#Context_Transform_Elements) |
3311
3312###### Table: <a name="Part_3_Links" href="#Part_3_Links">Part 3 Links</a>: [Numbers](tr35-numbers.md) (number & currency formatting)
3313
3314| Old section                                                                                                       | Section in new part |
3315| ----------------------------------------------------------------------------------------------------------------- | ------------------- |
3316| C.13 <a name="Numbering_Systems" href="#Numbering_Systems">Numbering Systems</a>                                  | 1 [Numbering Systems](tr35-numbers.md#Numbering_Systems) |
3317| 5.10 <a name="Number_Elements" href="#Number_Elements">Number Elements</a>                                        | 2 [Number Elements](tr35-numbers.md#Number_Elements) |
3318| 5.10.1 <a name="Number_Symbols" href="#Number_Symbols">Number Symbols</a>                                         | 2.3 [Number Symbols](tr35-numbers.md#Number_Symbols) |
3319| G <a name="Number_Format_Patterns" href="#Number_Format_Patterns">Number Format Patterns</a>                      | 3 [Number Format Patterns](tr35-numbers.md#Number_Format_Patterns) |
3320| 5.10.2 <a name="Currencies" href="#Currencies">Currencies</a>                                                     | 4 [Currencies](tr35-numbers.md#Currencies) |
3321| C.1 <a name="Supplemental_Currency_Data" href="#Supplemental_Currency_Data">Supplemental Currency Data</a>        | 4.1 [Supplemental Currency Data](tr35-numbers.md#Supplemental_Currency_Data) |
3322| C.11 <a name="Language_Plural_Rules" href="#Language_Plural_Rules">Language Plural Rules</a>                      | 5 [Language Plural Rules](tr35-numbers.md#Language_Plural_Rules) |
3323| 5.17 <a name="Rule-Based_Number_Formatting" href="#Rule-Based_Number_Formatting">Rule-Based Number Formatting</a> | 6 [Rule-Based Number Formatting](tr35-numbers.md#Rule-Based_Number_Formatting) |
3324
3325###### Table: <a name="Part_4_Links" href="#Part_4_Links">Part 4 Links</a>: [Dates](tr35-dates.md) (date, time, time zone formatting)
3326
3327| Old section                                                                                                                   | Section in new part |
3328| ----------------------------------------------------------------------------------------------------------------------------- | ------------------- |
3329| <a name="Date_Elements" href="#Date_Elements">5.9 Date Elements</a>                                                           | 1 [Overview: Dates Element, Supplemental Date and Calendar Information](tr35-dates.md#Overview_Dates_Element_Supplemental) |
3330| <a name="Calendar_Elements" href="#Calendar_Elements">5.9.1 Calendar Elements</a>                                             | 2 [Calendar Elements](tr35-dates.md#Calendar_Elements) |
3331| <a name="months_days_quarters_eras" href="#months_days_quarters_eras">Elements months, days, quarters, eras</a>               | 2.1 [Elements months, days, quarters, eras](tr35-dates.md#months_days_quarters_eras) |
3332| <a name="monthPatterns_cyclicNameSets" href="#monthPatterns_cyclicNameSets">Elements monthPatterns, cyclicNameSets</a>        | 2.2 [Elements monthPatterns, cyclicNameSets](tr35-dates.md#monthPatterns_cyclicNameSets) |
3333| <a name="dayPeriods" href="#dayPeriods">Element dayPeriods</a>                                                                | 2.3 [Element dayPeriods](tr35-dates.md#dayPeriods) |
3334| <a name="dateFormats" href="#dateFormats">Element dateFormats</a>                                                             | 2.4 [Element dateFormats](tr35-dates.md#dateFormats) |
3335| <a name="timeFormats" href="#timeFormats">Element timeFormats</a>                                                             | 2.5 [Element timeFormats](tr35-dates.md#timeFormats) |
3336| <a name="dateTimeFormats" href="#dateTimeFormats">Element dateTimeFormats</a>                                                 | 2.6 [Element dateTimeFormats](tr35-dates.md#dateTimeFormats) |
3337| <a name="Calendar_Fields" href="#Calendar_Fields">5.9.2 Calendar Fields</a>                                                   | 3 [Calendar Fields](tr35-dates.md#Calendar_Fields) |
3338| 5.9.3 <a name="Timezone_Names" href="#Timezone_Names">Time Zone Names</a>                                                     | 5 [Time Zone Names](tr35-dates.md#Time_Zone_Names) |
3339| <a name="Supplemental_Calendar_Data" href="#Supplemental_Calendar_Data">C.5 Supplemental Calendar Data</a>                    | 4 [Supplemental Calendar Data](tr35-dates.md#Supplemental_Calendar_Data) |
3340| <a name="Supplemental_Timezone_Data" href="#Supplemental_Timezone_Data">C.7 Supplemental Time Zone Data</a>                   | 6 [Supplemental Time Zone Data](tr35-dates.md#Supplemental_Time_Zone_Data) |
3341| <a name="Calendar_Preference_Data" href="#Calendar_Preference_Data">C.15 Calendar Preference Data</a>                         | 4.2 [Calendar Preference Data](tr35-dates.md#Calendar_Preference_Data) |
3342| <a name="DayPeriodRules" href="#DayPeriodRules">C.17 DayPeriod Rules</a>                                                      | 4.5 [Day Period Rules](tr35-dates.md#Day_Period_Rules) |
3343| <a name="Date_Format_Patterns" href="#Date_Format_Patterns">Appendix F: Date Format Patterns</a>                              | 8 [Date Format Patterns](tr35-dates.md#Date_Format_Patterns) |
3344| <a name="Date_Field_Symbol_Table" href="#Date_Field_Symbol_Table">Date Field Symbol Table</a>                                 | [Date Field Symbol Table](tr35-dates.md#Date_Field_Symbol_Table) |
3345| <a name="Localized_Pattern_Characters" href="#Localized_Pattern_Characters">F.1 Localized Pattern Characters (deprecated)</a> | 8.1 [Localized Pattern Characters (deprecated)](tr35-dates.md#Localized_Pattern_Characters) |
3346| <a name="Time_Zone_Fallback" href="#Time_Zone_Fallback">Appendix J: Time Zone Display Names</a>                               | 7 [Using Time Zone Names](tr35-dates.md#Using_Time_Zone_Names) |
3347| <a name="fallbackFormat" href="#fallbackFormat">**fallbackFormat**:</a>                                                       | [**fallbackFormat**:](tr35-dates.md#fallbackFormat) |
3348| O.4 Parsing Dates and Times                                                                                                   | 9 [Parsing Dates and Times](tr35-dates.md#Parsing_Dates_Times) |
3349
3350###### Table: <a name="Part_5_Links" href="#Part_5_Links">Part 5 Links</a>: [Collation](tr35-collation.md) (sorting, searching, grouping)
3351
3352| Old section                                                                                                                     | Section in new part |
3353| ------------------------------------------------------------------------------------------------------------------------------- | ------------------- |
3354| 5.14 <a name="Collation_Elements" href="#Collation_Elements">Collation Elements</a>                                             | 3 [Collation Tailorings](tr35-collation.md#Collation_Tailorings) |
3355| 5.14.1 <a name="Collation_Version" href="#Collation_Version">Version</a>                                                        | 3.1 [Version](tr35-collation.md#Collation_Version) |
3356| 5.14.2 <a name="Collation_Element" href="#Collation_Element">Collation Element</a>                                              | 3.2 [Collation Element](tr35-collation.md#Collation_Element) |
3357| 5.14.3 <a name="Setting_Options" href="#Setting_Options">Setting Options</a>                                                    | 3.3 [Setting Options](tr35-collation.md#Setting_Options) |
3358| Table <a name="Collation_Settings" href="#Collation_Settings">Collation Settings</a>                                            | Table [Collation Settings](tr35-collation.md#Collation_Settings) |
3359| 5.14.4 <a name="Rules" href="#Rules">Collation Rule Syntax</a>                                                                  | 3.4 [Collation Rule Syntax](tr35-collation.md#Rules) |
3360| 5.14.5 <a name="Orderings" href="#Orderings">Orderings</a>                                                                      | 3.5 [Orderings](tr35-collation.md#Orderings) |
3361| 5.14.6 <a name="Contractions" href="#Contractions">Contractions</a>                                                             | 3.6 [Contractions](tr35-collation.md#Contractions) |
3362| 5.14.7 <a name="Expansions" href="#Expansions">Expansions</a>                                                                   | 3.7 [Expansions](tr35-collation.md#Expansions) |
3363| 5.14.8 <a name="Context_Before" href="#Context_Before">Context Before</a>                                                       | 3.8 [Context Before](tr35-collation.md#Context_Before) |
3364| 5.14.9 <a name="Placing_Characters_Before_Others" href="#Placing_Characters_Before_Others">Placing Characters Before Others</a> | 3.9 [Placing Characters Before Others](tr35-collation.md#Placing_Characters_Before_Others) |
3365| 5.14.10 <a name="Logical_Reset_Positions" href="#Logical_Reset_Positions">Logical Reset Positions</a>                           | 3.10 [Logical Reset Positions](tr35-collation.md#Logical_Reset_Positions) |
3366| 5.14.11 <a name="Special_Purpose_Commands" href="#Special_Purpose_Commands">Special-Purpose Commands</a>                        | 3.11 [Special-Purpose Commands](tr35-collation.md#Special_Purpose_Commands) |
3367| 5.14.12 <a name="Script_Reordering" href="#Script_Reordering">Collation Reordering</a>                                          | 3.12 [Collation Reordering](tr35-collation.md#Script_Reordering) |
3368| 5.14.13 <a name="Case_Parameters" href="#Case_Parameters">Case Parameters</a>                                                   | 3.13 [Case Parameters](tr35-collation.md#Case_Parameters) |
3369| Definition: <a name="UncasedExceptions" href="#UncasedExceptions">UncasedExceptions</a>                                         | removed: see 3.13 [Case Parameters](tr35-collation.md#Case_Parameters) |
3370| Definition: <a name="LowerExceptions" href="#LowerExceptions">LowerExceptions</a>                                               | removed: see 3.13 [Case Parameters](tr35-collation.md#Case_Parameters) |
3371| Definition: <a name="UpperExceptions" href="#UpperExceptions">UpperExceptions</a>                                               | removed: see 3.13 [Case Parameters](tr35-collation.md#Case_Parameters) |
3372| 5.14.14 <a name="Visibility" href="#Visibility">Visibility</a>                                                                  | 3.14 [Visibility](tr35-collation.md#Visibility) |
3373
3374###### Table: <a name="Part_6_Links" href="#Part_6_Links">Part 6 Links</a>: [Supplemental](tr35-info.md) (supplemental data)
3375
3376| Old section                                                                                                                              | Section in new part |
3377| ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------- |
3378| C <a name="Supplemental_Data" href="#Supplemental_Data">Supplemental Data</a>                                                            | Introduction [Supplemental Data](tr35-info.md#Supplemental_Data) |
3379| C.2 <a name="Supplemental_Territory_Containment" href="#Supplemental_Territory_Containment">Supplemental Territory Containment</a>       | 1.1 [Supplemental Territory Containment](tr35-info.md#Supplemental_Territory_Containment) |
3380| C.4 <a name="Supplemental_Territory_Information" href="#Supplemental_Territory_Information">Supplemental Territory Information</a>       | 1.2 [Supplemental Territory Information](tr35-info.md#Supplemental_Territory_Information) |
3381| C.3 <a name="Supplemental_Language_Data" href="#Supplemental_Language_Data">Supplemental Language Data</a>                               | 2 [Supplemental Language Data](tr35-info.md#Supplemental_Language_Data) |
3382| C.9 <a name="Supplemental_Code_Mapping" href="#Supplemental_Code_Mapping">Supplemental Code Mapping</a>                                  | 4 [Supplemental Code Mapping](tr35-info.md#Supplemental_Code_Mapping) |
3383| C.12 <a name="Telephone_Code_Data" href="#Telephone_Code_Data">Telephone Code Data</a>                                                   | 5 [Telephone Code Data](tr35-info.md#Telephone_Code_Data) |
3384| C.14 <a name="Postal_Code_Validation" href="#Postal_Code_Validation">Postal Code Validation</a>                                          | 6 [Postal Code Validation](tr35-info.md#Postal_Code_Validation) |
3385| C.8 <a name="Supplemental_Character_Fallback_Data" href="#Supplemental_Character_Fallback_Data">Supplemental Character Fallback Data</a> | 7 [Supplemental Character Fallback Data](tr35-info.md#Supplemental_Character_Fallback_Data) |
3386| M <a name="Coverage_Levels" href="#Coverage_Levels">Coverage Levels</a>                                                                  | 8 [Coverage Levels](tr35-info.md#Coverage_Levels) |
3387| 5.20 [Metadata Elements](tr35-info.md#Metadata_Elements)                                                                                 | 10 [Locale Metadata Element](tr35-info.md#Metadata_Elements) |
3388| P [Supplemental Metadata](tr35-info.md#Appendix_Supplemental_Metadata)                                                                   | 9 [Supplemental Metadata](tr35-info.md#Appendix_Supplemental_Metadata)
3389| P.1 [Supplemental Alias Information](tr35-info.md#Supplemental_Alias_Information)                                                        | 9.1 [Supplemental Alias Information](tr35-info.md#Supplemental_Alias_Information)
3390| P.2 [Supplemental Deprecated Information](tr35-info.md#Supplemental_Deprecated_Information)                                              | 9.2 [Supplemental Deprecated Information](tr35-info.md#Supplemental_Deprecated_Information)
3391| P.3 [Default Content](tr35-info.md#Default_Content)                                                                                      | 9.3 [Default Content](tr35-info.md#Default_Content) |
3392
3393###### Table: <a name="Part_7_Links" href="#Part_7_Links">Part 7 Links</a>: [Keyboards](tr35-keyboards.md) (keyboard mappings)
3394
3395| Old section                                                                                                                | Section in new part |
3396| -------------------------------------------------------------------------------------------------------------------------- | ------------------- |
3397| S <a name="Keyboards" href="#Keyboards">Keyboards</a>                                                                      | 1 [Introduction](tr35-keyboards.md#Introduction) |
3398| S <a name="Goals_and_Nongoals" href="#Goals_and_Nongoals">Goals and Nongoals</a>                                           | [Goals and Nongoals](tr35-keyboards.md#Goals_and_Nongoals) |
3399| S <a name="File_and_Dir_Structure" href="#File_and_Dir_Structure">File and Directory Structure</a>                         | [File and Directory Structure](tr35-keyboards.md#File_and_Dir_Structure) |
3400| S <a name="Element_Heirarchy_Layout_File" href="#Element_Heirarchy_Layout_File">Element Hierarchy - Layout File</a>        | [Element Hierarchy - Layout File](tr35-keyboards.md#Element_Heirarchy_Layout_File) |
3401| S <a name="Element_Heirarchy_Platform_File" href="#Element_Heirarchy_Platform_File">Element Hierarchy - Platform File</a>  | [Element Hierarchy - Platform File](tr35-keyboards.md#Element_Heirarchy_Platform_File) |
3402| S <a name="Invariants" href="#Invariants">Invariants</a>                                                                   | [Invariants](tr35-keyboards.md#Invariants) |
3403| S <a name="Data_Sources" href="#Data_Sources">Data Sources</a>                                                             | [Data Sources](tr35-keyboards.md#Data_Sources) |
3404| S <a name="Keyboard_IDs" href="#Keyboard_IDs">Keyboard IDs</a>                                                             | [Keyboard IDs](tr35-keyboards.md#Keyboard_IDs) |
3405| S <a name="Platform_Behaviors_in_Edge_Cases" href="#Platform_Behaviors_in_Edge_Cases">Platform Behaviors in Edge Cases</a> | [Platform Behaviors in Edge Cases](tr35-keyboards.md#Platform_Behaviors_in_Edge_Cases) |
3406| S <a name="Element_Keyboard" href="#Element_Keyboard">Element: keyboard</a>                                                | [Element: keyboard](tr35-keyboards.md#Element_Keyboard) |
3407| S <a name="Element_version" href="#Element_version">Element: version</a>                                                   | [Element: version](tr35-keyboards.md#Element_version) |
3408| S <a name="Element_generation" href="#Element_generation">Element: generation</a>                                          | [Element: generation](tr35-keyboards.md#Element_generation) |
3409| S <a name="Element_names" href="#Element_names">Element: names</a>                                                         | [Element: names](tr35-keyboards.md#Element_names) |
3410| S <a name="Element_name" href="#Element_name">Element: name</a>                                                            | [Element: name](tr35-keyboards.md#Element_name) |
3411| S <a name="Element_settings" href="#Element_settings">Element: settings</a>                                                | [Element: settings](tr35-keyboards.md#Element_settings) |
3412| S <a name="Element_keyMap" href="#Element_keyMap">Element: keyMap</a>                                                      | [Element: keyMap](tr35-keyboards.md#Element_keyMap) |
3413| S <a name="Element_map" href="#Element_map">Element: map</a>                                                               | [Element: map](tr35-keyboards.md#Element_map) |
3414| S <a name="Element_transforms" href="#Element_transforms">Element: transforms</a>                                          | [Element: transforms](tr35-keyboards.md#Element_transforms) |
3415| S <a name="Element_transform" href="#Element_transform">Element: transform</a>                                             | [Element: transform](tr35-keyboards.md#Element_transform) |
3416| S <a name="Element_platform" href="#Element_platform">Element: platform</a>                                                | [Element: platform](tr35-keyboards.md#Element_platform) |
3417| S <a name="Element_hardwareMap" href="#Element_hardwareMap">Element: hardwareMap</a>                                       | [Element: hardwareMap](tr35-keyboards.md#Element_hardwareMap) |
3418| S <a name="Principles_for_Keyboard_Ids" href="#Principles_for_Keyboard_Ids">Principles for Keyboard Ids</a>                | [Principles for Keyboard Ids](tr35-keyboards.md#Principles_for_Keyboard_Ids) |
3419
3420* * *
3421
3422## <a name="LocaleId_Canonicalization" href="#LocaleId_Canonicalization">Annex C. LocaleId Canonicalization</a>
3423
3424The `languageAlias`, `scriptAlias`, `territoryAlias`, and `variantAlias` elements are used as rules to transform an input _source localeId_. The first step is to transform the _languageId_ portion of the localeId.
3425
3426> Note: in the following discussion, the separator '-' is used. That is also used in examples of XML alias data, even though for compatibility reasons that alias data actually uses '\_' as a separator. The processing can also be applied to syntax while maintaining the separator '\_', _mutatis mutandis_. CLDR also uses “territory” and “region” interchangeably.
3427
> Also note that the discussion of canonicalization assumes BCP 47
> input data. If the input is a CLDR or ICU locale ID such
> as `en_US_POSIX`, it must be converted before
> canonicalization.
> See §3.8.2 [Legacy Variants](#Legacy_Variants).
3433
3434### <a name="LocaleId_Definitions">LocaleId Definitions</a>
3435
3436#### <a name="1.-multimap-interpretation" href="#1.-multimap-interpretation">1. Multimap interpretation</a>
3437
3438Interpret each languageId as a multimap from a _fieldId_ (language, script, region, variants) to a **sorted set** of field values.
3439
3440_Examples:_
3441
3442| Source                    | Language | Script | Region | Variants          |
3443|---------------------------|----------|--------|--------|-------------------|
3444| en-GB                     | {en}     | {}     | {GB}   | {}                |
3445| und-GB                    | {}       | {}     | {GB}   | {}                |
3446| ja-Latn-YU-hepburn-heploc | {ja}     | {Latn} | {YU}   | {hepburn, heploc} |
3447
3448* This can be represented as an abbreviated format: \{L=\{ja}, S=\{Latn}, R=\{YU}, V=\{hepburn, heploc}}, skipping empty sets.
3449* “und” is a special language code that is treated as an empty set.
3450* Of course, only the Variants can contain more than one item: the others are either empty or contain exactly 1 item.
3451
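The multimap interpretation can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the function names `to_multimap` and `abbreviated` are hypothetical, and the input is assumed to be a syntactically canonical languageId (no extensions, '-' as separator). Python sets are unordered, so sorting is applied only when the values are printed.

```python
import re

FIELDS = ("L", "S", "R", "V")  # language, script, region, variants

def to_multimap(language_id):
    """Split a languageId into one set of values per field."""
    fields = {field: set() for field in FIELDS}
    language, *rest = language_id.split("-")
    if language != "und":  # "und" is treated as the empty set
        fields["L"].add(language)
    if rest and re.fullmatch(r"[A-Za-z]{4}", rest[0]):
        fields["S"].add(rest.pop(0))
    if rest and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", rest[0]):
        fields["R"].add(rest.pop(0))
    fields["V"].update(rest)  # everything left over is a variant
    return {field: frozenset(values) for field, values in fields.items()}

def abbreviated(multimap):
    """Abbreviated form: skip empty sets, list values in sorted order."""
    return {field: sorted(values) for field, values in multimap.items() if values}

print(abbreviated(to_multimap("ja-Latn-YU-hepburn-heploc")))
print(abbreviated(to_multimap("und-GB")))
```
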
3452#### <a name="2.-alias-elements" href="#2.-alias-elements">2. Alias elements</a>
3453
3454For the `languageAlias` elements, the _type_ and _replacements_ are languageIds.
3455
3456For the script-, territory- (aka region), and variant- Alias elements, the type and replacements are interpreted as a languageId, _after_ prefixing with “und-”. Thus
3457
3458```xml
3459<territoryAlias type="AN" replacement="CW SX BQ" reason="deprecated" />
3460```
3461
3462is interpreted as:
3463
3464```xml
3465<territoryAlias type="und-AN" replacement="und-CW und-SX und-BQ" reason="deprecated" />
3466```
3467
3468Note that for the case of territoryAlias, there may be multiple replacement values separated by spaces in the text (such as replacement="und-CW und-SX und-BQ"); other rules only ever have a single replacement value.
3469
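As a rough sketch of this interpretation, the hypothetical helper below (not a CLDR API) takes the element name and the raw attribute values and returns the languageId form of the type and the list of replacements. It also converts '\_' to '-', as the note at the start of this annex describes.

```python
def normalize_alias(element, alias_type, replacement):
    """Return (type, replacements) with both interpreted as languageIds."""
    alias_type = alias_type.replace("_", "-")
    replacements = [value.replace("_", "-") for value in replacement.split()]
    if element != "languageAlias":
        # script, territory and variant aliases are prefixed with "und-"
        alias_type = "und-" + alias_type
        replacements = ["und-" + value for value in replacements]
    return alias_type, replacements

print(normalize_alias("territoryAlias", "AN", "CW SX BQ"))
```
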
3470#### <a name="3.-matches" href="#3.-matches">3. Matches</a>
3471
A rule matches a source if and only if, for all fields, each _source_ field ⊇ _type_ field.
3473
3474_Examples:_
3475
3476`source="ja-heploc-hepburn"` and `type="und-hepburn"`
3477
3478<table class="simple"><tbody>
3479<tr><td>{ja} ⊇ {}</td><td>success, und = {}</td></tr>
3480<tr><td>{hepburn, heploc} ⊇ {hepburn}</td><td><b>success</b></td></tr>
3481</tbody></table>
3482
so the rule matches the source. (Note that the order of variants is immaterial to matching.)
3484
3485`source="ja-hepburn"` and `type="und-hepburn-heploc"`
3486
3487<table class="simple"><tbody>
3488<tr><td>{ja} ⊇ {}</td><td>success, und = {}</td></tr>
3489<tr><td>{hepburn} ⊉ {hepburn, heploc}</td><td><b>failure</b></td></tr>
3490</tbody></table>
3491
3492so the rule does not match the source.
3493
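The matching condition is a per-field superset test. A minimal sketch, assuming each languageId is already held as a dictionary from field letter (L, S, R, V) to a set of values, with empty fields omitted; the name `matches` is illustrative only:

```python
FIELDS = ("L", "S", "R", "V")

def matches(source, rule_type):
    """True if every source field is a superset of the rule's type field."""
    return all(
        set(source.get(field, ())) >= set(rule_type.get(field, ()))
        for field in FIELDS
    )

# source="ja-heploc-hepburn", type="und-hepburn"
print(matches({"L": {"ja"}, "V": {"hepburn", "heploc"}}, {"V": {"hepburn"}}))
# source="ja-hepburn", type="und-hepburn-heploc"
print(matches({"L": {"ja"}, "V": {"hepburn"}}, {"V": {"hepburn", "heploc"}}))
```
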
3494#### <a name="4.-replacement" href="#4.-replacement">4. Replacement</a>
3495
A matching rule can be used to transform the source fields as follows:
3497
3498* if type.field ≠ \{}
3499  * source.field = (source.field - type.field) ∪ replacement.field
3500* else if source.field = \{} and replacement.field ≠ \{}
3501  * source.field = replacement.field
3502
3503_Example:_
3504
> source=`ja-Latn-fonipa-hepburn-heploc`
>
> rule=`<languageAlias type="und-hepburn-heploc" replacement="und-alalc97">`
>
> result=`ja-Latn-alalc97-fonipa` // note that the CLDR canonical order of variants is alphabetical
3512
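The replacement step can be sketched as set arithmetic on each field. The code below is an illustration under the same assumption as before (a languageId is a dictionary from field letter to a set of values); `apply_rule` and `format_language_id` are hypothetical names, and the sketch deliberately leaves out the handling of rules with several replacement territories.

```python
FIELDS = ("L", "S", "R", "V")

def apply_rule(source, rule_type, replacement):
    """Transform the source fields with a rule that is assumed to match."""
    result = {}
    for field in FIELDS:
        src = set(source.get(field, ()))
        typ = set(rule_type.get(field, ()))
        rep = set(replacement.get(field, ()))
        if typ:
            src = (src - typ) | rep
        elif not src and rep:
            src = rep
        if src:
            result[field] = src
    return result

def format_language_id(fields):
    """Serialize the fields, with variants in alphabetical order."""
    subtags = [] if "L" in fields else ["und"]
    for field in FIELDS:
        subtags.extend(sorted(fields.get(field, ())))
    return "-".join(subtags)

source = {"L": {"ja"}, "S": {"Latn"}, "V": {"fonipa", "hepburn", "heploc"}}
result = apply_rule(source, {"V": {"hepburn", "heploc"}}, {"V": {"alalc97"}})
print(format_language_id(result))  # ja-Latn-alalc97-fonipa
```
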
3513##### <a name="territory-exception" href="#territory-exception">Territory Exception</a>
3514
3515If the field = territory, and the replacement.field has more than one value, then look up the most likely territory for the base language code (and script, if there is one). If that likely territory is in the list of replacements, use it. Otherwise, use the first territory in the list.
3516
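A minimal sketch of this choice follows. The likely-territory lookup itself is out of scope here: the caller is assumed to have already obtained `likely_territory` from the likely subtags data (or `None` if there is none), and the function name is hypothetical.

```python
def pick_territory(replacements, likely_territory):
    """Choose one territory from a multi-valued territoryAlias replacement."""
    if likely_territory in replacements:
        return likely_territory
    return replacements[0]  # fall back to the first territory in the list

# Replacement list taken from the territoryAlias example for "AN";
# the likely territories passed in are placeholders for a real lookup.
print(pick_territory(["CW", "SX", "BQ"], "SX"))
print(pick_territory(["CW", "SX", "BQ"], "NL"))
```
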
3517#### <a name="5.-canonicalizing-syntax" href="#5.-canonicalizing-syntax">5. Canonicalizing Syntax</a>
3518
3519To canonicalize the syntax of _source_:
3520
3521* Initial Script Subtag
3522  * If the first subtag has 4 letters, prepend the source with "und-"
3523  * Note: These are only for specialized use.
3524* Casing
3525  * Put any script subtag inside unicode_language_id into title case (eg, Hant)
3526  * Put any region subtag inside unicode_language_id into uppercase (eg, DE)
3527  * Put all other subtags into lowercase (eg, en, fonipa)
3528* Order
3529  * Put any variants into alphabetical order (eg, en-fonipa-scouse, not en-scouse-fonipa)
3530  * Put any extensions into alphabetical order by their singleton (eg, en-t-xxx-u-yyy, not en-u-yyy-t-xxx)
3531  * Put all attributes into alphabetical order.
3532  * Put all ufields (<ukey, uvalue>) and tfields (<tkey, tvalue>) into alphabetical order according to their keys (ukey or tkey), within their respective extensions.
3533  * Remove any uvalue (aka type) equal to "true". Note that "true" values cannot be removed from tvalues.
3534* Separator
3535  * Replace '\_' by '-'
3536
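The steps that apply to the unicode_language_id portion can be sketched as below. This is a simplified illustration, not a full implementation: it assumes ASCII input with no extensions, so the ordering of extensions, attributes, ufields and tfields and the removal of "true" uvalues are not covered. The function name is hypothetical.

```python
def canonicalize_language_id_syntax(source):
    """Apply separator, initial-script, casing and variant-order rules."""
    subtags = source.replace("_", "-").split("-")
    if len(subtags[0]) == 4 and subtags[0].isalpha():
        subtags.insert(0, "und")  # initial script subtag
    language, rest = subtags[0].lower(), subtags[1:]
    script = region = None
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()  # e.g. Hant
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0).upper()  # e.g. DE
    variants = sorted(subtag.lower() for subtag in rest)
    parts = [language] + [s for s in (script, region) if s] + variants
    return "-".join(parts)

print(canonicalize_language_id_syntax("EN_scouse_FONIPA"))  # en-fonipa-scouse
print(canonicalize_language_id_syntax("hant-de"))           # und-Hant-DE
```
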
3537### <a name="preprocessing" href="#preprocessing">Preprocessing</a>
3538
3539The data from supplementalMetadata is (logically) preprocessed as follows.
3540
35411. Load the rules from supplementalMetadata.xml, replacing '\_' by '-', and adding “und-” as described in _Definition 2. Alias Elements_.
35422. Capture all languageAlias rules where the _type_ is an invalid languageId into a set of **BCP47 LegacyRules**. Example:
3543   1. `<languageAlias type="i-mingo" replacement="see-x-i-mingo" reason="legacy" />`
35443. Discard all rules where the _type_ is an invalid languageId. Examples are
3545   1. `<languageAlias type="i-mingo" replacement="see-x-i-mingo" reason="legacy" />`
3546   2. `<territoryAlias type="und-AAA" replacement="und-AA" reason="overlong" />`
35474. Change the _type_ and _replacement_ values in the remaining rules into multimap rules, as per _Definition 1. Multimap Interpretation_.
3548   1. Note that the “und” value disappears.
35495. Order the set of rules by the following levels
3550   1. First order by the size of the union of all field value sets, with larger sizes before smaller sizes.
     * So {V={hepburn, heploc}} is before {R={CA}}
     * {V={hepburn, heploc}} and {L={en}, R={GB}} are not ordered at this level
   2. And then order by field, where L < S < R < V. Thus L is first and V is last.
     * So {L={fr}, R={CA}} is before {V={fonipa, heploc}}.
     * {V={fonipa, heploc}} and {V={hepburn, heploc}} are not ordered at this level
3556     * After this point we are guaranteed to have the same set of fields, with possibly different field value sets.
3557   3. And then order by field value sets, traversing also in the order of their fields L < S < R < V.
3558     * To determine the ordering between a field value set A and B, traverse each in parallel
3559     * If the corresponding field value sets for A and B are identical, then the next pair of field value sets is processed
3560     * Otherwise at the first pair of differing field values, A is before B if its field value is alphabetically less, otherwise B is before.
35616. The result is the set of **Alias Rules**
3562
3563So using the examples above, we get the following order:
3564
3565| languageId            | i. size of union | ii. field order | iii. field value sets |
3566| --------------------- | ---------------- | --------------- | --------------------- |
3567| {L={en}, R={GB}}      | 2                | n/a             |                       |
3568| {L={fr}, R={CA}}      | 2                | n/a             | en < fr               |
3569| {V={fonipa, heploc}}  | 2                | L < V           |                       |
3570| {V={hepburn, heploc}} | 2                | n/a             | fonipa < hepburn      |
3571| {R={CA}}              | 1                | n/a             |                       |
3572
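The three ordering levels can be expressed as a single sort key. The sketch below assumes each rule type is a dictionary from field letter to a set of values, with empty fields omitted; `rule_sort_key` is a hypothetical name, and comparing tuples of field positions is one possible reading of "order by field".

```python
FIELDS = ("L", "S", "R", "V")

def rule_sort_key(rule_type):
    present = [field for field in FIELDS if rule_type.get(field)]
    # i. size of the union of all field value sets, larger first
    size = len(set().union(*(rule_type[field] for field in present)))
    # ii. which fields are present, with L < S < R < V
    field_order = tuple(FIELDS.index(field) for field in present)
    # iii. the field values themselves, traversed in field order
    values = tuple(tuple(sorted(rule_type[field])) for field in present)
    return (-size, field_order, values)

rules = [
    {"R": {"CA"}},
    {"V": {"hepburn", "heploc"}},
    {"L": {"fr"}, "R": {"CA"}},
    {"V": {"fonipa", "heploc"}},
    {"L": {"en"}, "R": {"GB"}},
]
for rule in sorted(rules, key=rule_sort_key):
    print(rule)
```

Sorting the five example rules with this key reproduces the order shown in the table above.
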
3573
3574### <a name="processing-languageids" href="#processing-languageids">Processing LanguageIds</a>
3575
3576To canonicalize a given _source_:
3577
35781. Canonicalize the syntax of _source_ as per _Definition 5. Canonicalizing Syntax_.
35792. Where the _source_ could be an arbitrary BCP 47 language tag, first process as follows:
3580   1. If the source is identical to one of the types in the BCP47 LegacyRules, replace the entire source by the replacement value.
3581   2. Else if there is an extlang subtag, then apply Step 3 of BCP 47 [Section 4.5](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.5) to remove the extlang subtag (possibly adjusting the language subtag).
3582      1. Don’t apply any of the other canonicalization steps in that section, however.
3583   3. Else if the first subtag is "x", prefix by "und-".
3584   4. **Note:** there are currently no valid 4-letter primary language subtags. While it is extremely unlikely that BCP 47 would ever register them, if so then _languageAlias_ mappings will be supplied for them, mapping to defined CLDR language subtags (from the `idStatus="reserved"` set).
35853. Find the first matching rule in **Alias Rules** (from **Preprocessing**)
3586   1. If there are none, return _source_
35874. Transform _source_ according to that rule
5. Repeat from step 3 with the transformed _source_.
3589
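Steps 3 to 5 amount to a loop that applies the first matching rule until none matches. The sketch below is illustrative only: languageIds are dictionaries from field letter to a set of values, `alias_rules` is assumed to be already ordered as described under Preprocessing, the two sample rules are stand-ins for real alias data, and the `max_steps` guard is an addition of this sketch rather than part of the algorithm.

```python
FIELDS = ("L", "S", "R", "V")

def matches(source, rule_type):
    return all(set(source.get(f, ())) >= set(rule_type.get(f, ())) for f in FIELDS)

def transform(source, rule_type, replacement):
    result = {}
    for f in FIELDS:
        src, typ, rep = (set(m.get(f, ())) for m in (source, rule_type, replacement))
        if typ:
            src = (src - typ) | rep
        elif not src and rep:
            src = rep
        if src:
            result[f] = src
    return result

def canonicalize(source, alias_rules, max_steps=100):
    for _ in range(max_steps):
        # Step 3: find the first matching rule; return if there is none.
        rule = next(((t, r) for t, r in alias_rules if matches(source, t)), None)
        if rule is None:
            return source
        # Step 4: transform the source, then loop (step 5).
        source = transform(source, *rule)
    raise RuntimeError("alias rules did not converge")

sample_rules = [({"L": {"iw"}}, {"L": {"he"}}), ({"R": {"BU"}}, {"R": {"MM"}})]
print(canonicalize({"L": {"iw"}, "R": {"BU"}}, sample_rules))
```
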
3590### <a name="processing-localeids" href="#processing-localeids">Processing LocaleIds</a>
3591
3592The canonicalization of localeIds is done by first canonicalizing the languageId portion, then handling extensions in the following way:
3593
35941. Replace any _tlang_ languageId value by its canonicalization.
35952. Use the bcp47 data to replace keys, types, tfields, and tvalues by their canonical forms. See **Section 3.6.4 U Extension Data Files** and **Section 3.7.1 T Extension Data Files**. The matches are in the `alias` attribute value, while the canonical replacement is in the `name` attribute value. For example:
3596   1. Because of the following bcp47 data:
3597      `<key name="ms"…>…<type name="uksystem" … alias="imperial" … />…</key>`
3598   2. We get the following transformation:
3599      `en-u-ms-imperial ⇒ en-u-ms-uksystem`
36003. Replace any unicode_subdivision_id that is a subdivision alias by its replacement value in the same way, using subdivisionAlias data. This applies, for example, to the values for the 'sd' and 'rg' keys. However, where the replacement value is a two-letter region code, also append zzzz so that the result is syntactically correct. For example:
   1. Because of the following subdivisionAlias data:
3602      `<subdivisionAlias type="fi01" replacement="AX"…`
3603   2. We get the following transformation:
3604      `en-u-rg-fi01 ⇒ en-u-rg-axzzzz`
3605
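The subdivision step can be sketched as a table lookup followed by padding. In this illustration the alias table is passed in as a plain dictionary built from subdivisionAlias data (only the `fi01` entry from the example above is shown), and the function name is hypothetical.

```python
def canonical_subdivision(value, subdivision_aliases):
    """Replace an aliased unicode_subdivision_id, padding region codes."""
    replacement = subdivision_aliases.get(value.lower())
    if replacement is None:
        return value  # not an alias: leave unchanged
    replacement = replacement.lower()
    if len(replacement) == 2 and replacement.isalpha():
        replacement += "zzzz"  # keep the result syntactically valid
    return replacement

aliases = {"fi01": "AX"}
print(canonical_subdivision("fi01", aliases))  # axzzzz
```
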
3606### <a name="optimizations" href="#optimizations">Optimizations</a>
3607
3608The above algorithm is a logical statement of the process, but would obviously not be directly suited to production code. Production-level code can use many optimizations for efficiency while achieving the same result. For example, the Alias Rules can be further preprocessed to avoid indefinite looping, instead doing a rule lookup once per subtag. As another example, the small number of **Territory Exceptions** can be preprocessed to avoid the likely subtags processing.
3609
3610* * *
3611
3612## <a name="References" href="#References">References</a>
3613
3614| Ancillary Information                                    | To properly localize, parse, and format data requires ancillary information, which is not expressed in Locale Data Markup Language. Some of the formats for values used in Locale Data Markup Language are constructed according to external specifications. The sources for this data and/or formats include the following:  |
3615| -------------------------------------------------------- | --- |
3616| [<a name="Bugs" href="#Bugs">Bugs</a>]                   | CLDR Bug Reporting form<br/>[https://cldr.unicode.org/index/bug-reports](https://cldr.unicode.org/index/bug-reports) |
3617| [<a name="Charts" href="#Charts">Charts</a>]             | The online code charts can be found at [https://www.unicode.org/charts/](https://www.unicode.org/charts/) An index to character names with links to the corresponding chart is found at [https://www.unicode.org/charts/charindex.html](https://www.unicode.org/charts/charindex.html) |
3618| [<a name="DUCET" href="#DUCET">DUCET</a>]                | The Default Unicode Collation Element Table (DUCET)<br/>For the base-level collation, of which all the collation tables in this document are tailorings.<br/>[https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table](https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) |
| [<a name="FAQ" href="#FAQ">FAQ</a>]                      | Unicode Frequently Asked Questions<br/>[https://www.unicode.org/faq/](https://www.unicode.org/faq/)<br/>_For answers to common questions on technical issues._ |
3620| [<a name="FCD" href="#FCD">FCD</a>]                      | As defined in UTN #5 Canonical Equivalences in Applications<br/>[https://www.unicode.org/notes/tn5/](https://www.unicode.org/notes/tn5/) |
| [<a name="Glossary" href="#Glossary">Glossary</a>]       | Unicode Glossary<br/>[https://www.unicode.org/glossary/](https://www.unicode.org/glossary/)<br/>_For explanations of terminology used in this and other documents._ |
3622| [<a name="JavaChoice" href="#JavaChoice">JavaChoice</a>] | Java ChoiceFormat<br/>[https://docs.oracle.com/javase/7/docs/api/java/text/ChoiceFormat.html](https://docs.oracle.com/javase/7/docs/api/java/text/ChoiceFormat.html) |
3623| [<a name="Olson" href="#Olson">Olson</a>]                | The TZID Database (aka Olson timezone database)<br/>Time zone and daylight savings information.<br/>[https://www.iana.org/time-zones](https://www.iana.org/time-zones)<br/>For archived data, see <br/>[ftp://ftp.iana.org/tz/releases/](ftp://ftp.iana.org/tz/releases/) |
| [<a name="Reports" href="#Reports">Reports</a>]          | Unicode Technical Reports<br/>[https://www.unicode.org/reports/](https://www.unicode.org/reports/)<br/>_For information on the status and development process for technical reports, and for a list of technical reports._ |
3625| [<a name="Unicode" href="#Unicode">Unicode</a>]          | The Unicode Consortium, _The Unicode Standard, Version 13.0.0_<br/>(Mountain View, CA: The Unicode Consortium, 2020. ISBN 978-1-936213-26-9)<br/>[https://www.unicode.org/versions/Unicode13.0.0/](https://www.unicode.org/versions/Unicode13.0.0/) |
3626| [<a name="Versions" href="#Versions">Versions</a>]       | Versions of the Unicode Standard<br/>[https://www.unicode.org/versions/](https://www.unicode.org/versions/)<br/>_For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports._ |
3627| [<a name="XPath" href="#XPath">XPath</a>]                | [https://www.w3.org/TR/xpath/](https://www.w3.org/TR/xpath/) |
3628| Other Standards                                          | _Various standards define codes that are used as keys or values in Locale Data Markup Language. These include:_ |
3629| [<a name="BCP47" href="#BCP47">BCP47</a>]                | [https://www.rfc-editor.org/rfc/bcp/bcp47.txt](https://www.rfc-editor.org/rfc/bcp/bcp47.txt)<br/>The Registry<br/>[https://www.iana.org/assignments/language-subtag-registry](https://www.iana.org/assignments/language-subtag-registry) |
3630| [<a name="ISO639" href="#ISO639">ISO639</a>]             | ISO Language Codes<br/>[https://www.loc.gov/standards/iso639-2/](https://www.loc.gov/standards/iso639-2/)<br/>Actual List<br/>[https://www.loc.gov/standards/iso639-2/langcodes.html](https://www.loc.gov/standards/iso639-2/langcodes.html) |
3631| [<a name="ISO1000" href="#ISO1000">ISO1000</a>]          | ISO 1000: SI units and recommendations for the use of their multiples and of certain other units, International Organization for Standardization, 1992.<br/>[https://www.iso.org/iso/catalogue_detail?csnumber=5448](https://www.iso.org/iso/catalogue_detail?csnumber=5448) |
3632| [<a name="ISO3166" href="#ISO3166">ISO3166</a>]          | ISO Region Codes<br/>[https://www.iso.org/iso-3166-country-codes.html](https://www.iso.org/iso-3166-country-codes.html)<br/>Actual List<br/>[https://www.iso.org/obp/ui/#search](https://www.iso.org/obp/ui/#search) |
3633| [<a name="ISO4217" href="#ISO4217">ISO4217</a>]          | ISO Currency Codes<br/>[https://www.iso.org/iso-4217-currency-codes.html](https://www.iso.org/iso-4217-currency-codes.html)<br/>_(Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency information available.)_ |
3634| [<a name="ISO8601" href="#ISO8601">ISO8601</a>]          | ISO Date and Time Format<br/>[https://www.iso.org/iso-8601-date-and-time-format.html](https://www.iso.org/iso-8601-date-and-time-format.html) |
3635| [<a name="ISO15924" href="#ISO15924">ISO15924</a>]       | ISO Script Codes<br/>[https://www.unicode.org/iso15924/index.html](https://www.unicode.org/iso15924/index.html)<br/>Actual List<br/>[https://www.unicode.org/iso15924/codelists.html](https://www.unicode.org/iso15924/codelists.html) |
3636| [<a name="LOCODE" href="#LOCODE">LOCODE</a>]             | United Nations Code for Trade and Transport Locations, commonly known as "UN/LOCODE"<br/>[https://unece.org/trade/uncefact/unlocode](https://unece.org/trade/uncefact/unlocode)<br/>Download at:  [https://unece.org/trade/cefact/UNLOCODE-Download](https://unece.org/trade/cefact/UNLOCODE-Download) |
3637| [<a name="RFC6067" href="#RFC6067">RFC6067</a>]          | BCP 47 Extension U<br/>[https://www.ietf.org/rfc/rfc6067.txt](https://www.ietf.org/rfc/rfc6067.txt) |
3638| [<a name="RFC6497" href="#RFC6497">RFC6497</a>]          | BCP 47 Extension T - Transformed Content<br/>[https://www.ietf.org/rfc/rfc6497.txt](https://www.ietf.org/rfc/rfc6497.txt) |
3639| [<a name="UNM49" href="#UNM49">UNM49</a>]                | UN M.49: UN Statistics Division<br/>Country or area & region codes<br/>[https://unstats.un.org/unsd/methods/m49/m49.htm](https://unstats.un.org/unsd/methods/m49/m49.htm)<br/>Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings<br/>[https://unstats.un.org/unsd/methods/m49/m49regin.htm](https://unstats.un.org/unsd/methods/m49/m49regin.htm) |
3640| [<a name="XMLSchema" href="#XMLSchema">XML Schema</a>]   | W3C XML Schema<br/>[https://www.w3.org/XML/Schema](https://www.w3.org/XML/Schema) |
3641| General                                                  | _The following are general references from the text:_ |
3642| [<a name="ByType" href="#ByType">ByType</a>]             | CLDR Comparison Charts<br/>[https://cldr.unicode.org/index/charts](https://cldr.unicode.org/index/charts) |
3643| [<a name="Calendars" href="#Calendars">Calendars</a>]    | Calendrical Calculations: The Millennium Edition by Edward M. Reingold, Nachum Dershowitz; Cambridge University Press; Book and CD-ROM edition (July 1, 2001); ISBN: 0521777526. Note that the algorithms given in this book are copyrighted. |
3644| [<a name="Comparisons" href="#Comparisons">Comparisons</a>]             | Comparisons between locale data from different sources<br/>[https://unicode-org.github.io/cldr-staging/charts/latest/by_type/index.html](https://unicode-org.github.io/cldr-staging/charts/latest/by_type/index.html) |
3645| [<a name="CurrencyInfo" href="#CurrencyInfo">CurrencyInfo</a>]          | UNECE Currency Data<br/>[https://www.iso.org/iso-4217-currency-codes.html](https://www.iso.org/iso-4217-currency-codes.html) |
3646| [<a name="DataFormats" href="#DataFormats">DataFormats</a>]             | CLDR Translation Guidelines<br/>[https://cldr.unicode.org/translation](https://cldr.unicode.org/translation) |
3647| [<a name="LDML" href="#LDML">Example</a>]                               | A sample in Locale Data Markup Language<br/>[https://www.unicode.org/cldr/dtd/1.1/ldml-example.xml](https://www.unicode.org/cldr/dtd/1.1/ldml-example.xml) |
3648| [<a name="ICUCollation" href="#ICUCollation">ICUCollation</a>]          | ICU rule syntax<br/>[https://unicode-org.github.io/icu/userguide/collation/customization/](https://unicode-org.github.io/icu/userguide/collation/customization/) |
3649| [<a name="ICUTransforms" href="#ICUTransforms">ICUTransforms</a>]       | Transforms<br/>[https://unicode-org.github.io/icu/userguide/transforms/](https://unicode-org.github.io/icu/userguide/transforms/)<br/>Transforms Demo<br/>[https://icu4c-demos.unicode.org/icu-bin/translit](https://icu4c-demos.unicode.org/icu-bin/translit) |
| [<a name="ICUUnicodeSet" href="#ICUUnicodeSet">ICUUnicodeSet</a>]       | ICU UnicodeSet<br/>[https://unicode-org.github.io/icu/userguide/strings/unicodeset.html](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html)<br/>API<br/>[https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.html](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/UnicodeSet.html) |
3651| [<a name="ITUE164" href="#ITUE164">ITUE164</a>]                         | International Telecommunication Union: List Of ITU Recommendation E.164 Assigned Country Codes<br/>available at [https://www.itu.int/opb/publications.aspx?parent=T-SP&view=T-SP2](https://www.itu.int/opb/publications.aspx?parent=T-SP&view=T-SP2) |
3652| [<a name="LocaleExplorer" href="#LocaleExplorer">LocaleExplorer</a>]    | ICU Locale Explorer<br/>[https://icu4c-demos.unicode.org/icu-bin/locexp](https://icu4c-demos.unicode.org/icu-bin/locexp) |
3653| [<a name="localeProject" href="#localeProject">LocaleProject</a>]       | Common Locale Data Repository Project<br/>[https://cldr.unicode.org](https://cldr.unicode.org) |
3654| [<a name="NamingGuideline" href="#NamingGuideline">NamingGuideline</a>] | OpenI18N Locale Naming Guideline<br/>formerly at https://www.openi18n.org/docs/text/LocNameGuide-V10.txt |
3655| [<a name="RBNF" href="#RBNF">RBNF</a>]                                  | Rule-Based Number Format<br/>[https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1RuleBasedNumberFormat.html](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1RuleBasedNumberFormat.html) |
3656| [<a name="RBBI" href="#RBBI">RBBI</a>]                                  | Rule-Based Break Iterator<br/>[https://unicode-org.github.io/icu/userguide/boundaryanalysis/](https://unicode-org.github.io/icu/userguide/boundaryanalysis/) |
| [<a name="UCAChart" href="#UCAChart">UCAChart</a>]                      | Collation Chart<br/>[https://www.unicode.org/charts/collation/](https://www.unicode.org/charts/collation/) |
| [<a name="UTCInfo" href="#UTCInfo">UTCInfo</a>]                         | NIST Time and Frequency Division Home Page<br/>[https://www.nist.gov/pml/time-and-frequency-division](https://www.nist.gov/pml/time-and-frequency-division)<br/>U.S. Naval Observatory: What is Universal Time?<br/><https://www.cnmoc.usff.navy.mil/Our-Commands/United-States-Naval-Observatory/Precise-Time-Department/The-USNO-Master-Clock/Definitions-of-Systems-of-Time/> |
3659| [<a name="WindowsCulture" href="#WindowsCulture">WindowsCulture</a>]    | Windows Culture Info (with mappings from [[BCP47](#BCP47)]-style codes to LCIDs)<br/>[https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo?view=net-6.0](https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo?view=net-6.0) |
3660
3661
3662## <a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a>

Special thanks to the following people for their continuing overall contributions to the CLDR project, and for their specific contributions in the following areas. These descriptions only touch on the many contributions that they have made.

* Mark Davis for creating the initial version of LDML, and adding to and maintaining this specification, and for his work on the LDML code and tests, much of the supplemental data and overall structure, and transforms and keyboards.
* John Emmons for the POSIX conversion tool and metazones.
* Deborah Goldsmith for her contributions to LDML architecture and this specification.
* Chris Hansten for coordinating and managing data submissions and vetting.
* Erkki Kolehmainen and his team for their work on Finnish.
* Steven R. Loomis for development of the survey tool and database management.
* Peter Nugent for his contributions to the POSIX tool and contributions from Open Office, and for coordinating and managing data submissions and vetting.
* George Rhoten for his work on currencies.
* Roozbeh Pournader (روزبه پورنادر) for his work on South Asian countries.
* Ram Viswanadha (రఘురామ్ విశ్వనాధ) for all of his work on LDML code and data integration, and for coordinating and managing data submissions and vetting.
* Vladimir Weinstein (Владимир Вајнштајн) for his work on collation.
* Yoshito Umaoka (馬岡 由人) for his work on the timezone architecture.
* Rick McGowan for his work gathering language, script and region data.
* Xiaomei Ji (吉晓梅) for her work on time intervals and plural formatting.
* David Bertoni for his contributions to the conversion tools.
* Mike Tardif for reviewing this specification and for coordinating and vetting data submissions.
* Peter Edberg for work on this specification, monthPatterns, cyclicNameSets, contextTransforms and other items.
* Raymond Wainman and Cibu Johny for their work on keyboards.
* Jennifer Chye for her contributions to the conversion tools.
* Markus Scherer for a major rewrite of Part 5, Collation.
* [Shane Carr](https://www.sffc.xyz/) for his work on numbers and measurement units.
* Robin Leroy for his work on compact plurals: Part 3, Section 5, [Language Plural Rules](tr35-numbers.md#Language_Plural_Rules).
* Rich Gillam for work on Person Names.
* Alex Kolisnychenko for work on Person Names.
* Mike McKenna for work on Person Names.


Other contributors to CLDR are listed on the [CLDR Project Page](https://www.unicode.org/cldr/).

## <a name="Modifications" href="#Modifications">Modifications</a>

**Revision 67**

* [Parent Locales](#Parent_Locales)
    * Updated the description of guidelines and invariants for `parentLocale` data.
* [Hybrid Locale Identifiers](#Hybrid_Locale)
    * Expanded the discussion of combinations such as Hinglish.
* [Currency Formats](tr35-numbers.md#Currency_Formats) and [Currencies](tr35-numbers.md#Currencies)
    * Described the new `alt="alphaNextToNumber"` and `alt="noCurrency"` variants for `pattern` elements used with `currencyFormat` elements.
    * Described the new `currencyPatternAppendISO` element under `currencyFormats`.
    * Discouraged the use of the old `currencySpacing` element (and its subelements) in favor of the `alt="alphaNextToNumber"` variant.
* [Element dateTimeFormat](tr35-dates.md#dateTimeFormat)
    * Described the new `dateTimeFormat type="atTime"` pattern and when to use it versus the standard `dateTimeFormat` pattern.
* [Matching Skeletons](tr35-dates.md#Matching_Skeletons)
    * Provided more detailed recommendations on matching pattern field length to field length in the requested skeleton.
* Plurals
    * In [Plural rules syntax](tr35-numbers.md#Plural_rules_syntax), allowed sample values to have positive and negative signs.
* Units of measurement
    * [Unit Preferences](tr35-info.md#Unit_Preferences)
        * Added a new subsection to specify the interaction of the unit preferences data with the locale keys `mu`, `ms`, and `rg`, and the base locale.
    * [Unit Elements](tr35-general.md#Unit_Elements), [Unit Conversion](tr35-info.md#Unit_Conversion)
        * For simpler and cleaner parsing, added a new element (`unitIdComponent`) and restructured the EBNF for parsing unit identifiers.
        * As part of this work, the identifier `metric-ton` was deprecated in favor of `tonne`. As usual, the older identifier remains for compatibility, and is aliased to the new one.
* [Person Names](tr35-personNames.md#Contents)
    * Added a new Part 8, Person Names.

Note that small changes such as typos and link fixes are not listed above. Modifications in previous versions are listed in those respective versions. Click on **Previous Version** in the header until you get to the desired version.

* * *

Copyright © 2001–2022 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode [Terms of Use](https://www.unicode.org/copyright.html) apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.