Unicode Technical Standard #35

Unicode Locale Data Markup Language (LDML)
Part 6: Supplemental

Version	34
Editors	Steven Loomis (srl@icu-project.org) and other CLDR committee members

For the full header, summary, and status, see Part 1: Core.

Summary

This document describes parts of an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.

This is a partial document, describing only those parts of the LDML that are relevant for supplemental data. For the other parts of the LDML see the main LDML document and the links above.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Parts

The LDML specification is divided into the following parts:

Part 1: Core (languages, locales, basic structure)
Part 2: General (display names & transforms, etc.)
Part 3: Numbers (number & currency formatting)
Part 4: Dates (date, time, time zone formatting)
Part 5: Collation (sorting, searching, grouping)
Part 6: Supplemental (supplemental data)
Part 7: Keyboards (keyboard mappings)

Contents of Part 6, Supplemental

1 Introduction Supplemental Data
2 Territory Data
- 2.1 Supplemental Territory Containment
- 2.2 Subdivision Containment
- 2.3 Supplemental Territory Information
- 2.4 Territory-Based Preferences
  - 2.4.1 Preferred Units for Specific Usages
    - Table: Unit Preference Categories
- 2.5 <rgScope>: Scope of the “rg” Locale Key
3 Supplemental Language Data
- 3.1 Supplemental Language Grouping
4 Supplemental Code Mapping
5 Telephone Code Data (Deprecated)
6 Postal Code Validation (Deprecated)
7 Supplemental Character Fallback Data
8 Coverage Levels
- 8.1 Definitions
- 8.2 Data Requirements
- 8.3 Default Values
9 Supplemental Metadata
- 9.1 Supplemental Alias Information
  - Table: Alias Attribute Values
- 9.2 Supplemental Deprecated Information (Deprecated)
- 9.3 Default Content
10 Locale Metadata Elements
11 Version Information
12 Parent Locales

1 Introduction Supplemental Data

The following represents the format for additional supplemental information. This is information that is important for internationalization and proper use of CLDR, but is not contained in the locale hierarchy. It is not localizable, nor is it overridden by locale data. The current CLDR data can be viewed in the Supplemental Charts.

<!ELEMENT supplementalData (version, generation?, cldrVersion?, currencyData?, territoryContainment?, subdivisionContainment?, languageData?, territoryInfo?, postalCodeData?, calendarData?, calendarPreferenceData?, weekData?, timeData?, measurementData?, unitPreferenceData?, timezoneData?, characters?, transforms?, metadata?, codeMappings?, parentLocales?, likelySubtags?, metazoneInfo?, plurals?, telephoneCodeData?, numberingSystems?, bcp47KeywordMappings?, gender?, references?, languageMatching?, dayPeriodRuleSet*, metaZones?, primaryZones?, windowsZones?, coverageLevels?, idValidity?, rgScope?) >

The data in CLDR is presently split into multiple files: supplementalData.xml, supplementalMetadata.xml, characters.xml, likelySubtags.xml, ordinals.xml, plurals.xml, telephoneCodeData.xml, genderList.xml, plus transforms (see Part 2 Section 10 Transforms and Part 2 Section 10.3 Transform Rule Syntax). The split is just for convenience: logically, they are treated as though they were a single file. Future versions of CLDR may split the data in a different fashion. Do not depend on any specific XML filename or path for supplemental data.

Note that Chapter 10 presents information about metadata that is maintained on a per-locale basis. It is included in this section because it is not intended to be used as part of the locale itself.

2 Territory Data

2.1 Supplemental Territory Containment

<!ELEMENT territoryContainment ( group* ) >
<!ELEMENT group EMPTY >
<!ATTLIST group type NMTOKEN #REQUIRED >
<!ATTLIST group contains NMTOKENS #IMPLIED >
<!ATTLIST group grouping ( true | false ) #IMPLIED >
<!ATTLIST group status ( deprecated | grouping ) #IMPLIED >

The following data provides information that shows groupings of countries (regions). The data is based on [UNM49]. There is one special code, QO, which is used for outlying areas of Oceania that are typically uninhabited. The territory containment forms a tree with the following levels:

World

Continent

Subcontinent

Country

Excluding groupings, in this tree:

All non-overlapping regions form a strict tree rooted at World
All leaf-nodes (country) are always at depth 4. Some of these “country” regions are actually parts of other countries, such as Hong Kong (part of China). Such relationships are not part of the containment data.

For a chart showing the relationships (plus the included timezones), see the Territory Containment Chart. The XML structure has the following form.

<territoryContainment>

<group type="001" contains="002 009 019 142 150"/> <!--World -->
<group type="011" contains="BF BJ CI CV GH GM GN GW LR ML MR NE NG SH SL SN TG"/> <!--Western Africa -->
<group type="013" contains="BZ CR GT HN MX NI PA SV"/> <!--Central America -->
<group type="014" contains="BI DJ ER ET KE KM MG MU MW MZ RE RW SC SO TZ UG YT ZM ZW"/> <!--Eastern Africa -->
<group type="142" contains="030 035 062 145"/> <!--Asia -->
<group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/> <!--Western Asia -->
<group type="015" contains="DZ EG EH LY MA SD TN"/> <!--Northern Africa -->
...

There are groupings that don't follow this regular structure, such as:

<group type="003" contains="013 021 029" grouping="true"/> <!--North America -->

These are marked with the attribute grouping="true".

When groupings have been deprecated but kept around for backwards compatibility, they are marked with the attribute status="deprecated", like this:

<group type="029" contains="AN" status="deprecated"/> <!--Caribbean -->

When the containment relationship itself is a grouping, it is marked with the attribute status="grouping", like this:

<group type="150" contains="EU" status="grouping"/> <!--Europe -->

That is, the type value isn’t a grouping, but if you filter out groupings you can drop this containment. In the example above, EU is a grouping, and contained in 150.
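The filtering described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the sample is a small subset of the groups shown in this section, the function names strict_parents and ancestors are hypothetical, and the sketch assumes that any group carrying a status attribute or grouping="true" is dropped when the strict tree is wanted.

```python
import xml.etree.ElementTree as ET

# Subset of the <group> elements shown in this section.
SAMPLE = """
<territoryContainment>
  <group type="001" contains="002 009 019 142 150"/>
  <group type="142" contains="030 035 062 145"/>
  <group type="145" contains="AE AM AZ BH CY GE IL IQ JO KW LB OM PS QA SA SY TR YE"/>
  <group type="003" contains="013 021 029" grouping="true"/>
  <group type="029" contains="AN" status="deprecated"/>
  <group type="150" contains="EU" status="grouping"/>
</territoryContainment>
"""

def strict_parents(xml_text):
    """Map each contained code to its parent, skipping grouping and deprecated rows."""
    parents = {}
    for group in ET.fromstring(xml_text).iter("group"):
        # Assumption: grouping="true", status="deprecated" and status="grouping"
        # are all excluded from the strict tree.
        if group.get("grouping") == "true" or group.get("status") is not None:
            continue
        for child in group.get("contains", "").split():
            parents[child] = group.get("type")
    return parents

def ancestors(code, parents):
    """Return the chain of containing regions, nearest first."""
    chain = []
    while code in parents:
        code = parents[code]
        chain.append(code)
    return chain

PARENTS = strict_parents(SAMPLE)
TR_CHAIN = ancestors("TR", PARENTS)  # subcontinent, continent, world
```

With this sample, TR resolves to 145, then 142, then 001, which places the country at depth 4, while EU and AN have no parent in the strict tree.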

2.2 Subdivision Containment

<!ELEMENT subdivisionContainment ( subgroup* ) >

<!ELEMENT subgroup EMPTY >
<!ATTLIST subgroup type NMTOKEN #REQUIRED >
<!ATTLIST subgroup contains NMTOKENS #IMPLIED >

The subdivision containment data is similar to the territory containment. It is based on ISO 3166-2 data, but may diverge from it in the future.

The type is a unicode_region_subtag (territory) identifier for the top level of containment, or a unicode_subdivision_id for lower levels of containment when there are multiple levels. The contains value is a space-delimited list of one or more unicode_subdivision_id values. For example, <subgroup type="bda" contains="bd02 bd06 bd07 bd25 bd50 bd51"/> indicates that subdivision bda contains the subdivisions bd02, bd06, bd07, bd25, bd50, and bd51.

Note: Formerly (in CLDR 28 through 30):

The type attribute could only contain a unicode_region_subtag;
The contains attribute contained unicode_subdivision_suffix values, which are not unique across multiple territories;
For lower containment levels, a now-deprecated subtype attribute was therefore used to specify the parent unicode_subdivision_suffix.

2.3 Supplemental Territory Information

<!ELEMENT territory ( languagePopulation* ) >
<!ATTLIST territory type NMTOKEN #REQUIRED >
<!ATTLIST territory gdp NMTOKEN #REQUIRED >
<!ATTLIST territory literacyPercent NMTOKEN #REQUIRED >
<!ATTLIST territory population NMTOKEN #REQUIRED >

<!ELEMENT languagePopulation EMPTY >
<!ATTLIST languagePopulation type NMTOKEN #REQUIRED >
<!ATTLIST languagePopulation literacyPercent NMTOKEN #IMPLIED >
<!ATTLIST languagePopulation writingPercent NMTOKEN #IMPLIED >
<!ATTLIST languagePopulation populationPercent NMTOKEN #REQUIRED >
<!ATTLIST languagePopulation officialStatus (de_facto_official | official | official_regional | official_minority) #IMPLIED >

This data provides testing information for language and territory populations. The main goal is to provide approximate figures for the literate, functional population for each language in each territory: that is, the population that is able to read and write each language, and is comfortable enough to use it with computers. For a chart of this data, see Territory-Language Information.

Example

<territory type="AO" gdp="175500000000" literacyPercent="70.4" population="19088100"> <!--Angola-->
 <languagePopulation type="pt" populationPercent="67" officialStatus="official"/> <!--Portuguese-->
 <languagePopulation type="umb" populationPercent="29"/> <!--Umbundu-->
 <languagePopulation type="kmb" writingPercent="10" populationPercent="25" references="R1034"/> <!--Kimbundu-->
 <languagePopulation type="ln" populationPercent="0.67" references="R1010"/> <!--Lingala-->
</territory>

Note that reliable information is difficult to obtain; the information in CLDR is an estimate culled from different sources, including the World Bank, CIA Factbook, and others. The GDP and country literacy figures are taken from the World Bank where available, otherwise supplemented by Factbook data and other sources. The GDP figures are “PPP (constant 2000 international $)”. Much of the per-language data is taken from the Ethnologue, but is supplemented and processed using many other sources, including per-country census data. (The focus of the Ethnologue is native speakers, which includes people who are not literate, and excludes people who are functional second-language users.) Some references are marked in the XML files, with attributes such as references="R1010".

The percentages may add up to more than 100% due to multilingual populations, or may be less than 100% due to illiteracy or because the data has not yet been gathered or processed. Languages with smaller populations might not be included.

The following describes the meaning of some of these terms—as used in CLDR—in more detail.

literacy percent for the territory — an estimate of the percentage of the country’s population that is functionally literate.

language population percent — an estimate of the number of people who are functional in that language in that country, including both first and second language speakers. The level of fluency is that necessary to use a UI on a computer, smartphone, or similar devices, rather than complete fluency.

literacy percent for language population — Within the set of people who are functional in the corresponding language (as specified by language population percent), this is an estimate of the percentage of those people who are functionally literate in that language, that is, who are capable of reading or writing in that language, even if they do not regularly use it for reading or writing. If not specified, this defaults to the literacy percent for the territory.

writing percent — Within the set of people who are functional in the corresponding language (as specified by language population percent), this is an estimate of the percentage of those people who regularly read or write a significant amount in that language. Ideally, the regularity would be measured as “7-day actives”. If it is known that the language is not widely or commonly written, but there are no solid figures, the value is typically given 1%-5%.

Consider a language such as Swiss German, which is typically not written even though nearly the whole native Germanophone population could write in it. For such a language, the literacy percent for language population is high, but the writing percent is low.

official language — as used in CLDR, a language that can generally be used in all communications with a central government. That is, people can expect that essentially all communication from the government is available in that language (ballots, information pamphlets, legal documents, …) and that they can use that language in any communication to the central government (petitions, forms, filing lawsuits,…).

Official languages for a country in this sense are not necessarily the same as those with official legal status in the country. For example, Irish is declared to be an official language in Ireland, but English has no such formal status in the United States. Languages such as the latter are called de facto official languages. As another example, German has legal status in Italy, but cannot be used in all communications with the central government, and is thus not an official language of Italy for CLDR purposes. It is, however, an official regional language. Other languages are declared to be official, but can’t actually be used for all communication with any major governmental entity in the country. There is no intention to mark such nominally official languages as “official” in the CLDR data.

official regional language — a language that is official (de jure or de facto) in a major region within a country, but does not qualify as an official language of the country as a whole. For example, it can be used in an official petition to a provincial government, but not the central government. The term “major” is meant to distinguish from smaller-scale usage, such as for a town or village.
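To make the percentage definitions above concrete, the following Python sketch derives approximate head counts from the Angola example in this section. It is an illustration only: the function name language_estimates and the result keys are hypothetical, and the sketch assumes that the percentages simply multiply (population, then language population percent, then literacy or writing percent).

```python
import xml.etree.ElementTree as ET

# The Angola example from this section, without the XML comments.
SAMPLE = """
<territory type="AO" gdp="175500000000" literacyPercent="70.4" population="19088100">
  <languagePopulation type="pt" populationPercent="67" officialStatus="official"/>
  <languagePopulation type="umb" populationPercent="29"/>
  <languagePopulation type="kmb" writingPercent="10" populationPercent="25" references="R1034"/>
  <languagePopulation type="ln" populationPercent="0.67" references="R1010"/>
</territory>
"""

def language_estimates(xml_text):
    """Return estimated head counts per language for one <territory> element."""
    territory = ET.fromstring(xml_text)
    population = float(territory.get("population"))
    territory_literacy = float(territory.get("literacyPercent"))
    result = {}
    for lp in territory.iter("languagePopulation"):
        functional = population * float(lp.get("populationPercent")) / 100
        # If literacyPercent is absent on the language, it defaults to the
        # literacy percent for the territory.
        literacy = float(lp.get("literacyPercent", territory_literacy))
        entry = {"functional": functional, "literate": functional * literacy / 100}
        if lp.get("writingPercent") is not None:
            entry["writing"] = functional * float(lp.get("writingPercent")) / 100
        result[lp.get("type")] = entry
    return result

ESTIMATES = language_estimates(SAMPLE)
```

Under these assumptions, Portuguese in Angola comes out at roughly 12.8 million functional users, of whom about 9.0 million are estimated to be literate in it. Kimbundu has about 4.8 million functional users, of whom about 0.48 million regularly write it.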

2.4 Territory-Based Preferences

The default preference for several locale items is based solely on a unicode_region_subtag, which may either be specified as part of a unicode_language_id, inferred from other locale ID elements using the Likely Subtags mechanism, or provided explicitly using an “rg” Region Override locale key. For more information on this process see Locale Inheritance and Matching. The specific items that are handled in this way are:

Default calendar (see Calendar Preference Data)
Default week conventions (first day of week and weekend days; see Week Data)
Default hour cycle (see Time Data)
Default currency (see Supplemental Currency Data)
Default measurement system and paper size (see Measurement System Data)
Default units for specific usage (see Preferred Units for Specific Usages, below)

2.4.1 Preferred Units for Specific Usages

This data is intended to map from a particular usage — e.g. measuring the height of a person or the fuel consumption of an automobile — to the unit or combination of units typically used for that usage in a given region. Considerations for such a mapping include:

The list of possible usages is large and open-ended. The intent here is to start with a small set for which there is an urgent need, and expand as necessary.
Even for a given usage such as measuring a road distance, there are multiple ranges in use. For example, one set of units may be used for indicating the distance to the next city (kilometers or miles), while another may be used for indicating the distance to the next exit (meters, yards, or feet).
There are also differences between more formal usage (official signage, medical records) and more informal usage (conversation, texting).
For some usages, the measurement may be expressed using a sequence of units, such as “1 meter, 78 centimeters” or “12 stone, 2 pounds”.

The DTD structure is as follows:

<!ELEMENT unitPreferenceData ( unitPreferences* ) >

<!ELEMENT unitPreferences ( unitPreference* ) >
<!ATTLIST unitPreferences category NMTOKEN #REQUIRED >
<!ATTLIST unitPreferences usage NMTOKENS #REQUIRED >
<!ATTLIST unitPreferences scope (small) #IMPLIED >

<!ELEMENT unitPreference ( #PCDATA ) >
<!ATTLIST unitPreference regions NMTOKENS #REQUIRED >

An example of data using this structure is as follows:

   <unitPreferenceData>
      ...
      <unitPreferences category="length" usage="person">
           <unitPreference regions="001">centimeter</unitPreference>
           <unitPreference regions="BR CN DE DK MX NL NO PL PT RU" alt="informal">meter centimeter</unitPreference>
           <unitPreference regions="AT BE DZ EG ES FR HK ID IL IT JO MY SA SE TR VN">meter centimeter</unitPreference>
           <unitPreference regions="CA GB IN US" alt="informal">foot inch</unitPreference>
           <unitPreference regions="US">inch</unitPreference>
      </unitPreferences>
      <unitPreferences category="length" usage="person" scope="small">
           <unitPreference regions="001">centimeter</unitPreference>
           <unitPreference regions="CA GB IN" alt="informal">inch</unitPreference>
           <unitPreference regions="US">inch</unitPreference>
      </unitPreferences>
      ...
   </unitPreferenceData>

There are several things to note:

The <unitPreferences> category attribute values match a <unit> element type attribute value, as listed in Unit Elements.
The <unitPreferences> usage attribute values are specific to this data; current values are listed in a table at the end of this section.
The <unitPreferences> element may have a scope="small" attribute to indicate that it is intended for the smaller range of values for that usage, such as measuring the height or weight of an infant versus that of an adult, or measuring the road distance to the next exit versus that to the next city.
Each <unitPreferences> element must contain one <unitPreference> element with attribute regions="001"; this specifies the worldwide default unit or unit sequence for the usage and scope specified by the <unitPreferences> element. There may be additional <unitPreference> elements which specify a different unit or unit sequence for specific regions and possibly for a different degree of formality.
The <unitPreference> element may have an alt="informal" attribute to indicate that the specified unit or unit sequence is preferred in more informal usage.
The value of the <unitPreference> element is a sequence of one or more space-separated unit names taken from the <unit> element unit attribute values for the relevant type, as listed in Unit Elements.

For a given combination of category, usage, scope and formality, the intended procedure for looking up the unit or unit combination to use for a given region is as follows:

Get the appropriate <unitPreferences> element for the desired category and usage: If scope="small" is desired and a <unitPreferences> element with scope="small" exists for the desired category and usage, use it. Otherwise, use a <unitPreferences> element for the desired category and usage that has no scope attribute.
In the selected <unitPreferences> element, pick a <unitPreference> element using the following steps:
If informal usage is preferred, look for a <unitPreference> element with alt="informal" whose regions attribute includes the given region. If found, use the specified unit [sequence].
Look for a <unitPreference> element whose regions attribute includes the given region. If found, use the specified unit [sequence].
Look for a <unitPreference> element with alt="informal" whose regions attribute is "001". If found, use the specified unit [sequence].
Look for a <unitPreference> element whose regions attribute is "001". If found, use the specified unit [sequence].
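The lookup procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a normative algorithm: the function name preferred_units and its parameters are hypothetical, the sample is the example data from this section, and the steps that do not mention alt="informal" are assumed to match only elements without an alt attribute.

```python
import xml.etree.ElementTree as ET

# The example data from this section, without the elided entries.
SAMPLE = """
<unitPreferenceData>
  <unitPreferences category="length" usage="person">
    <unitPreference regions="001">centimeter</unitPreference>
    <unitPreference regions="BR CN DE DK MX NL NO PL PT RU" alt="informal">meter centimeter</unitPreference>
    <unitPreference regions="AT BE DZ EG ES FR HK ID IL IT JO MY SA SE TR VN">meter centimeter</unitPreference>
    <unitPreference regions="CA GB IN US" alt="informal">foot inch</unitPreference>
    <unitPreference regions="US">inch</unitPreference>
  </unitPreferences>
  <unitPreferences category="length" usage="person" scope="small">
    <unitPreference regions="001">centimeter</unitPreference>
    <unitPreference regions="CA GB IN" alt="informal">inch</unitPreference>
    <unitPreference regions="US">inch</unitPreference>
  </unitPreferences>
</unitPreferenceData>
"""

ROOT = ET.fromstring(SAMPLE)

def preferred_units(root, category, usage, region, small=False, informal=False):
    """Return the preferred unit sequence as a list of unit names, or None."""
    candidates = [e for e in root.iter("unitPreferences")
                  if e.get("category") == category
                  and usage in e.get("usage", "").split()]
    scoped = [e for e in candidates if e.get("scope") == "small"]
    unscoped = [e for e in candidates if e.get("scope") is None]
    # Prefer the scope="small" element only when it is wanted and exists.
    chosen = scoped[0] if (small and scoped) else (unscoped[0] if unscoped else None)
    if chosen is None:
        return None

    def find(region_code, alt):
        for pref in chosen.iter("unitPreference"):
            if pref.get("alt") == alt and region_code in pref.get("regions", "").split():
                return pref.text.split()
        return None

    # The steps are tried in the order given in the text. The "001" informal
    # step is not conditioned on informal usage there, so it is not here either.
    steps = [(region, "informal")] if informal else []
    steps += [(region, None), ("001", "informal"), ("001", None)]
    for region_code, alt in steps:
        units = find(region_code, alt)
        if units is not None:
            return units
    return None
```

For the sample data, a formal lookup for US yields inch and an informal one yields foot inch. A formal lookup for DE falls through to the worldwide default, centimeter, because DE appears only in an informal entry.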

CLDR 29 contains usage mapping data for the following combinations of category, usage, and scope:

Unit Preference Categories
Category	Usage	Sample Value
area	land-agricult	hectare
area	land-commercl	hectare
area	land-residntl	hectare
concentr	blood-glucose	milligram-per-deciliter
consumption	vehicle-fuel	liter-per-100kilometers
duration	music-track	minute second
duration	person-age	year-person month-person
duration	tv-program	minute second
energy	food	foodcalorie
energy	person-usage	kilocalorie
length	person	centimeter
length	person, scope=small	centimeter
length	rainfall	millimeter
length	road	kilometer
length	road, scope=small	meter
length	snowfall	centimeter
length	vehicle	meter
length	visiblty	kilometer
length	visiblty, scope=small	meter
mass	person	kilogram
mass	person, scope=small	gram
pressure	baromtrc	hectopascal
speed	road-travel	kilometer-per-hour
speed	wind	kilometer-per-hour
temperature	person	celsius
temperature	weather	celsius
volume	vehicle-fuel	liter

2.5 <rgScope>: Scope of the “rg” Locale Key

The supplemental <rgScope> element specifies the data paths for which the region used for data lookup is determined by the value of any “rg” key present in the locale identifier (see Region Override). If no “rg” key is present, the region used for lookup is determined as usual: from the unicode_region_subtag if present, else inferred from the unicode_language_subtag. The DTD structure is as follows:

<!ELEMENT rgScope ( rgPath* ) >

<!ELEMENT rgPath EMPTY >
<!ATTLIST rgPath path CDATA #REQUIRED >

The <rgScope> element contains a list of <rgPath> elements, each of which specifies a datapath for which any “rg” key determines the region for lookup. For example:

   <rgScope>
      <rgPath path="//supplementalData/currencyData/fractions/info[@iso4217='#'][@digits='*'][@rounding='*'][@cashDigits='*'][@cashRounding='*']" draft="provisional" />
      <rgPath path="//supplementalData/currencyData/fractions/info[@iso4217='#'][@digits='*'][@rounding='*'][@cashRounding='*']" draft="provisional" />
      <rgPath path="//supplementalData/currencyData/fractions/info[@iso4217='#'][@digits='*'][@rounding='*']" draft="provisional" />
      <rgPath path="//supplementalData/calendarPreferenceData/calendarPreference[@territories='#'][@ordering='*']" draft="provisional" />
      ...
      <rgPath path="//supplementalData/unitPreferenceData/unitPreferences[@category='*'][@usage='*'][@scope='*']/unitPreference[@regions='#'][@alt='*']" draft="provisional" />
      <rgPath path="//supplementalData/unitPreferenceData/unitPreferences[@category='*'][@usage='*'][@scope='*']/unitPreference[@regions='#']" draft="provisional" />
      <rgPath path="//supplementalData/unitPreferenceData/unitPreferences[@category='*'][@usage='*']/unitPreference[@regions='#'][@alt='*']" draft="provisional" />
      <rgPath path="//supplementalData/unitPreferenceData/unitPreferences[@category='*'][@usage='*']/unitPreference[@regions='#']" draft="provisional" />
   </rgScope>

The exact format of the path is provisional in CLDR 29, but as currently shown:

An attribute value of '*' indicates that the path applies regardless of the value of the attribute.
Each path must have exactly one attribute whose value is marked here as '#'; in actual data items with this path, the corresponding value is a list of region codes. It is the region codes in this list that are compared with the region specified by the “rg” key to determine which data item to use for this path.
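One way to apply these two rules is sketched below in Python. Because the path format is provisional, this is only an illustration. The helper names compile_rg_path and applies_to_region are hypothetical, the data path is an invented instance of the calendarPreference pattern rather than actual CLDR data, and the caller is assumed to have already extracted an uppercase region code from the “rg” key.

```python
import re

def compile_rg_path(pattern):
    """Turn an rgPath pattern into a regex; the '#' value captures the region list."""
    # Splitting on the quoted markers yields literal text at even indexes
    # and the marker character ('#' or '*') at odd indexes.
    parts = re.split(r"'([#*])'", pattern)
    regex = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex.append(re.escape(part))
        elif part == "#":
            regex.append(r"'(?P<regions>[^']*)'")
        else:
            regex.append(r"'[^']*'")
    return re.compile("".join(regex))

def applies_to_region(pattern, data_path, rg_region):
    """True if data_path matches the pattern and lists rg_region in the '#' attribute."""
    match = compile_rg_path(pattern).fullmatch(data_path)
    return bool(match) and rg_region in match.group("regions").split()

PATTERN = ("//supplementalData/calendarPreferenceData/"
           "calendarPreference[@territories='#'][@ordering='*']")

# Hypothetical data path, for illustration only.
DATA_PATH = ("//supplementalData/calendarPreferenceData/"
             "calendarPreference[@territories='AA BB'][@ordering='gregorian']")
```

With these values, applies_to_region(PATTERN, DATA_PATH, "AA") is true, while a region that is not in the list, or a data path with a different element structure, gives false.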

3 Supplemental Language Data

<!ELEMENT languageData ( language* ) >
<!ELEMENT language EMPTY >
<!ATTLIST language type NMTOKEN #REQUIRED >
<!ATTLIST language scripts NMTOKENS #IMPLIED >
<!ATTLIST language territories NMTOKENS #IMPLIED >
<!ATTLIST language variants NMTOKENS #IMPLIED >
<!ATTLIST language alt NMTOKENS #IMPLIED >

The language data is used for consistency checking and testing. It provides a list of which languages are used with which scripts and in which countries. To a large extent, however, the territory list has been superseded by the data in Section 2.3 Supplemental Territory Information.

	<languageData>
		<language type="af" scripts="Latn" territories="ZA"/>
		<language type="am" scripts="Ethi" territories="ET"/>
		<language type="ar" scripts="Arab" territories="AE BH DZ EG IN IQ JO KW LB LY MA OM PS QA SA SD SY TN YE"/>
                     ...

If the language is not a modern language, or the script is not a modern script, or the language is not a major language of the territory, then the alt attribute is set to secondary.

		<language type="fr" scripts="Latn" territories="IT US" alt="secondary" />
                     ...

3.1 Supplemental Language Grouping

<!ELEMENT languageGroups ( languageGroup* ) >
<!ELEMENT languageGroup ( #PCDATA ) >
<!ATTLIST languageGroup parent NMTOKEN #REQUIRED >

The language groups supply language containment. For example, the following indicates that fiu is the Unicode language code for a language group that contains chm, et, etc.

<languageGroup parent="fiu">chm et fi fit fkv hu izh kca koi krl kv liv mdf mns mrj myv smi udm vep vot vro</languageGroup>

The vast majority of the languageGroup data is extracted from Wikidata, but may be overridden in some cases. The Wikidata information is more fine-grained, but makes use of language groups that don't have ISO or Unicode language codes. Those language groups are omitted from the data. For example, Wikidata has the following child-parent chain, of which only the first and last elements are present in the language groups.

Name	Wikidata Code	Language Code
Finnish	Q1412	fi
Finnic languages	Q33328
Finno-Samic languages	Q163652
Finno-Volgaic languages	Q161236
Finno-Permic languages	Q161240
Finno-Ugric languages	Q79890	fiu
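As a small illustration of how the containment can be read, the following Python sketch builds a child-to-parent map from the languageGroup element shown above. The wrapping languageGroups element follows the DTD in this section, and the function name parent_map is hypothetical.

```python
import xml.etree.ElementTree as ET

# The languageGroup example from this section, wrapped in its container element.
SAMPLE = """
<languageGroups>
  <languageGroup parent="fiu">chm et fi fit fkv hu izh kca koi krl kv liv mdf mns mrj myv smi udm vep vot vro</languageGroup>
</languageGroups>
"""

def parent_map(xml_text):
    """Map each contained language code to the code of its language group."""
    parents = {}
    for group in ET.fromstring(xml_text).iter("languageGroup"):
        for child in (group.text or "").split():
            parents[child] = group.get("parent")
    return parents

PARENTS = parent_map(SAMPLE)
```

Here Finnish (fi) maps directly to fiu, reflecting that the intermediate Wikidata groups without language codes are not present in the data.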

4 Supplemental Code Mapping

<!ELEMENT codeMappings (languageCodes*, territoryCodes*, currencyCodes*) >

<!ELEMENT languageCodes EMPTY >
<!ATTLIST languageCodes type NMTOKEN #REQUIRED>
<!ATTLIST languageCodes alpha3 NMTOKEN #REQUIRED>

<!ELEMENT territoryCodes EMPTY >
<!ATTLIST territoryCodes type NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes numeric NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes alpha3 NMTOKEN #REQUIRED>
<!ATTLIST territoryCodes fips10 NMTOKEN #IMPLIED>
<!ATTLIST territoryCodes internet NMTOKENS #IMPLIED> [deprecated]

<!ELEMENT currencyCodes EMPTY >
<!ATTLIST currencyCodes type NMTOKEN #REQUIRED>
<!ATTLIST currencyCodes numeric NMTOKEN #REQUIRED>

The code mapping information provides mappings between the subtags used in the CLDR locale IDs (from BCP 47) and other coding systems or related information. The language code mappings are provided only for codes that have two letters in BCP 47, and map them to their ISO three-letter equivalents. The territory codes provide mappings to numeric codes (UN M.49 [UNM49] codes, equivalent to ISO numeric codes), ISO three-letter codes, FIPS 10 codes, and the internet top-level domain codes.

The alphabetic codes are only provided where different from the type. For example:

<territoryCodes type="AA" numeric="958" alpha3="AAA"/>
<territoryCodes type="AD" numeric="020" alpha3="AND" fips10="AN"/>
<territoryCodes type="AE" numeric="784" alpha3="ARE"/>
...
<territoryCodes type="GB" numeric="826" alpha3="GBR" fips10="UK"/>
...
<territoryCodes type="QU" numeric="967" alpha3="QUU" internet="EU"/>
...
<territoryCodes type="XK" numeric="983" alpha3="XKK"/>
...

Where there is no corresponding code, sometimes private use codes are used, such as the numeric code for XK.

The currencyCodes are mappings from three letter currency codes to numeric values (ISO 4217 Current currency & funds code list.) The mapping currently covers only current codes and does not include historic currencies. For example:

<currencyCodes type="AED" numeric="784"/>
<currencyCodes type="AFN" numeric="971"/>
...
<currencyCodes type="EUR" numeric="978"/>
...
<currencyCodes type="ZAR" numeric="710"/>
<currencyCodes type="ZMW" numeric="967"/>

5 Telephone Code Data (Deprecated)

Deprecated in CLDR v34, and data removed.

<!ELEMENT telephoneCodeData ( codesByTerritory* ) >

<!ELEMENT codesByTerritory ( telephoneCountryCode+ ) >
<!ATTLIST codesByTerritory territory NMTOKEN #REQUIRED >

<!ELEMENT telephoneCountryCode EMPTY >
<!ATTLIST telephoneCountryCode code NMTOKEN #REQUIRED >
<!ATTLIST telephoneCountryCode from NMTOKEN #IMPLIED >
<!ATTLIST telephoneCountryCode to NMTOKEN #IMPLIED >

This data specifies the mapping between ITU telephone country codes [ITUE164] and CLDR-style territory codes (ISO 3166 2-letter codes or non-corresponding UN M.49 [UNM49] 3-digit codes). There are several things to note:

A given telephone country code may map to multiple CLDR territory codes; +1 (North America Numbering Plan) covers the US and Canada, as well as many islands in the Caribbean and some in the Pacific.
Some telephone country codes are for global services (for example, some satellite services), and thus correspond to territory code 001.
The mappings change over time (territories move from one telephone code to another). These changes are usually planned several years in advance, and there may be a period during which either telephone code can be used to reach the territory. While the CLDR telephone code data is not intended to include past changes, it is intended to incorporate known information on planned future changes, using "from" and "to" date attributes to indicate when mappings are valid.

A subset of the telephone code data might look like the following (showing a past mapping change to illustrate the from and to attributes):

<codesByTerritory territory="001">
	<telephoneCountryCode code="800"/> <!-- International Freephone Service -->
	<telephoneCountryCode code="808"/> <!-- International Shared Cost Services (ISCS) -->
	<telephoneCountryCode code="870"/> <!-- Inmarsat Single Number Access Service (SNAC) -->
</codesByTerritory>
<codesByTerritory territory="AS"> <!-- American Samoa -->
	<telephoneCountryCode code="1" from="2004-10-02"/> <!-- +1 684 in North America Numbering Plan -->
	<telephoneCountryCode code="684" to="2005-04-02"/> <!-- +684 now a spare code -->
</codesByTerritory>
<codesByTerritory territory="CA">
	<telephoneCountryCode code="1"/> <!-- North America Numbering Plan -->
</codesByTerritory>
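The use of the from and to attributes can be illustrated with a short Python sketch. Since this data is deprecated, the sketch is purely illustrative. The function name valid_codes is hypothetical, the wrapping telephoneCodeData element follows the DTD above, and both date bounds are assumed to be inclusive.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The example from this section, wrapped in its container element.
SAMPLE = """
<telephoneCodeData>
  <codesByTerritory territory="001">
    <telephoneCountryCode code="800"/>
    <telephoneCountryCode code="808"/>
    <telephoneCountryCode code="870"/>
  </codesByTerritory>
  <codesByTerritory territory="AS">
    <telephoneCountryCode code="1" from="2004-10-02"/>
    <telephoneCountryCode code="684" to="2005-04-02"/>
  </codesByTerritory>
  <codesByTerritory territory="CA">
    <telephoneCountryCode code="1"/>
  </codesByTerritory>
</telephoneCodeData>
"""

ROOT = ET.fromstring(SAMPLE)

def valid_codes(root, territory, on_date):
    """Return the telephone country codes valid for a territory on a given date."""
    codes = []
    for entry in root.iter("codesByTerritory"):
        if entry.get("territory") != territory:
            continue
        for code in entry.iter("telephoneCountryCode"):
            start = code.get("from")
            end = code.get("to")
            # Assumption: a missing bound means the mapping is open-ended.
            if start and on_date < date.fromisoformat(start):
                continue
            if end and on_date > date.fromisoformat(end):
                continue
            codes.append(code.get("code"))
    return codes
```

For American Samoa, a date inside the overlap period (for example 2005-01-01) returns both 1 and 684, which matches the transition period described above.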

6 Postal Code Validation (Deprecated)

Deprecated in v27. Please use other services that are kept up to date.

<!ELEMENT postalCodeData (postCodeRegex*) >
<!ELEMENT postCodeRegex (#PCDATA) >
<!ATTLIST postCodeRegex territoryId NMTOKEN #REQUIRED>

The Postal Code regex information can be used to validate postal codes used in different countries. In some cases, the regex is quite simple, such as for Germany:

<postCodeRegex territoryId="DE" >\d{5}</postCodeRegex>

The US code is slightly more complicated, since there is an optional portion:

<postCodeRegex territoryId="US" >\d{5}([ \-]\d{4})?</postCodeRegex>

The most complicated currently is the UK.
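For illustration, the two patterns quoted above for Germany and the US can be applied as follows in Python. This is a sketch only: the data is deprecated, the table POSTCODE_REGEX and the function is_valid_postcode are hypothetical names, and the sketch assumes that the whole input must match and that \d means ASCII digits only.

```python
import re

# The two patterns quoted in this section.
POSTCODE_REGEX = {
    "DE": r"\d{5}",
    "US": r"\d{5}([ \-]\d{4})?",
}

def is_valid_postcode(territory, text):
    """Return True or False for a known territory, or None if no pattern is known."""
    pattern = POSTCODE_REGEX.get(territory)
    if pattern is None:
        return None
    # fullmatch anchors the pattern at both ends; re.ASCII restricts \d to 0-9.
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None
```

With this sketch, "94043-1351" and "94043 1351" are accepted for US because of the optional portion, while "940431351" is rejected.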

7 Supplemental Character Fallback Data

<!ELEMENT characters ( character-fallback*) >

<!ELEMENT character-fallback ( character* ) >
<!ELEMENT character (substitute*) >
<!ATTLIST character value CDATA #REQUIRED >

<!ELEMENT substitute (#PCDATA) >

The characters element provides a way for non-Unicode systems, or systems that only support a subset of Unicode characters, to transform CLDR data. It gives a list of characters with alternative values that can be used if the main value is not available. For example:

<characters>
       <character-fallback>
	<character value = "ß">
		<substitute>ss</substitute>
	</character>
	<character value = "Ø">
		<substitute>Ö</substitute>
		<substitute>O</substitute>
	</character>
	<character value = "₧">
		<substitute>Pts</substitute>
	</character>
	<character value = "₣">
		<substitute>Fr.</substitute>
	</character>
       </character-fallback>
</characters>

The ordering of the substitute elements indicates the preference among them.

That is, this data provides recommended fallbacks for use when a charset or supported repertoire does not contain a desired character. There may be more than one possible fallback. When a character value is not in the desired repertoire, the recommended process is to try the following candidates in order and use the first one that is wholly in the desired repertoire.

toNFC(value)
other canonically equivalent sequences, if there are any
the explicit substitute values (in order)
toNFKC(value)
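
The process above might be sketched as follows. This is an illustrative sketch, not a reference implementation: the names SUBSTITUTES, in_repertoire, and fallback are hypothetical, the repertoire is assumed to be a Python codec name, and NFD is used as the only "other canonically equivalent sequence" that is tried.

```python
import unicodedata

# Substitutes from the <character-fallback> example above, in preference order.
SUBSTITUTES = {
    "ß": ["ss"],
    "Ø": ["Ö", "O"],
    "₧": ["Pts"],
    "₣": ["Fr."],
}


def in_repertoire(text, charset):
    """True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def fallback(value, charset):
    """Return the first candidate wholly in the repertoire, or None."""
    candidates = [unicodedata.normalize("NFC", value)]
    # Assumption: NFD stands in for "other canonically equivalent sequences".
    candidates.append(unicodedata.normalize("NFD", value))
    candidates.extend(SUBSTITUTES.get(value, []))
    candidates.append(unicodedata.normalize("NFKC", value))
    for candidate in candidates:
        if in_repertoire(candidate, charset):
            return candidate
    return None
```

For example, fallback("ß", "ascii") yields "ss" from the explicit substitutes, while a character with no substitute entry, such as the ligature U+FB01, only reaches a usable value at the final toNFKC step.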

8 Coverage Levels

The following describes the coverage levels used for the current version of CLDR. This list will change between releases of CLDR. Each level adds to what is in the lower level.

Level	Name	Description
0	undetermined	Does not meet any of the following levels.
10	core	The CLDR "core" data, which is defined as the basic information about the language and writing system that is required before other information can be added using the CLDR survey tool. See http://cldr.unicode.org/index/cldr-spec/minimaldata
40	basic	The minimum amount of locale data deemed necessary to create a "viable" locale in CLDR. Contains names for the languages, scripts, and territories associated with the language, numbering systems used in those languages, date and number formats, plus a few key values such as the values in Section 3.1 Unknown or Invalid Identifiers. Also contains data associated with the most prominent languages and countries.
60	moderate	Contains more types of data and more language and territory names than the basic level. If the language is associated with an EU country, then the moderate level attempts to complete the data as it pertains to all EU member countries.
80	modern	Contains all fields in normal modern use, including all country names, and currencies in use.
100	comprehensive	Contains complete localizations (or valid inheritance) for every possible field.

Levels 40 through 80 are based on the definitions and specifications listed in 8.1-8.4. However, these principles are continually being refined by the CLDR technical committee, and so do not completely reflect the data that is actually used for coverage determination, which is under the XPath //supplementalData/coverageLevels. For a view of the trunk version of this data, see coverageLevels.xml. (As described in the introduction to Supplemental Data, the specific XML filename may change.)

<!ELEMENT coverageLevels ( approvalRequirements, coverageVariable*, coverageLevel* ) >
<!ELEMENT coverageLevel EMPTY >
<!ATTLIST coverageLevel inLanguage CDATA #IMPLIED >
<!ATTLIST coverageLevel inScript CDATA #IMPLIED >
<!ATTLIST coverageLevel inTerritory CDATA #IMPLIED >
<!ATTLIST coverageLevel value CDATA #REQUIRED >
<!ATTLIST coverageLevel match CDATA #REQUIRED >

For example, here is a coverageLevel line.

<coverageLevel
    value="30"
    inLanguage="(de|fi)"
    match="localeDisplayNames/types/type[@type='phonebook'][@key='collation']"/>

The coverageLevel elements are read in order, and the first match results in a coverage level value. The element matches based on the inLanguage, inScript, inTerritory, and match attribute values, which are regular expressions. In the example above, a match occurs if the language is de or fi, and if the path is a locale display name for collation=phonebook.

The match attribute value logically has "//ldml/" prefixed before it is applied. In addition, the "[@" is automatically quoted. Otherwise standard Perl/Java style regular expression syntax is used.
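
The matching rules above might be sketched as follows. The names COVERAGE_LEVELS, to_regex, attribute_matches, and coverage_level are hypothetical, and two details are assumptions rather than statements of this specification: an absent inLanguage, inScript, or inTerritory attribute is treated as matching anything, and each regular expression must match the whole value.

```python
import re

# The example coverageLevel element above, as a plain dictionary.
COVERAGE_LEVELS = [
    {
        "value": 30,
        "inLanguage": "(de|fi)",
        "match": "localeDisplayNames/types/type[@type='phonebook'][@key='collation']",
    },
]


def to_regex(match):
    """Prefix //ldml/ and quote each "[@" so that it is taken literally."""
    return re.compile("//ldml/" + match.replace("[@", r"\[@"))


def attribute_matches(pattern, actual):
    # Assumption: a missing attribute places no restriction on the locale.
    return pattern is None or re.fullmatch(pattern, actual or "") is not None


def coverage_level(path, language, script=None, territory=None,
                   levels=COVERAGE_LEVELS):
    """Return the value of the first matching coverageLevel, or None."""
    for level in levels:
        if (attribute_matches(level.get("inLanguage"), language)
                and attribute_matches(level.get("inScript"), script)
                and attribute_matches(level.get("inTerritory"), territory)
                and to_regex(level["match"]).fullmatch(path)):
            return level["value"]
    return None
```

Because the elements are read in order, the list order in this sketch is significant: an earlier, more specific entry takes precedence over a later, more general one.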

<!ELEMENT coverageVariable EMPTY >
<!ATTLIST coverageVariable key CDATA #REQUIRED >
<!ATTLIST coverageVariable value CDATA #REQUIRED >

The coverageVariable element allows us to create variables for certain regular expressions that are used frequently in the coverageLevel definitions above. Each coverage variable must contain a key / value pair of attributes, which can then be substituted into a coverageLevel definition.

Here is an example coverageLevel line using coverageVariable substitution.

<coverageVariable key="%dayTypes" value="(sun|mon|tue|wed|thu|fri|sat)"/>

<coverageVariable key="%wideAbbr" value="(wide|abbreviated)"/>

<coverageLevel value="20" match="dates/calendars/calendar[@type='gregorian']/days/dayContext[@type='format']/dayWidth[@type='%wideAbbr']/day[@type='%dayTypes']"/>

In this example, the coverage variables %dayTypes and %wideAbbr are used to substitute their respective values into the match expression. This allows us to reuse the same variable for other coverageLevel matches that use the same regular expression fragment.
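
A minimal sketch of the substitution, using the variables and match value from the example. The names VARIABLES, MATCH, expand_variables, and compile_match are hypothetical; substituting longer keys first is a precaution of this sketch, not a rule of this specification.

```python
import re

VARIABLES = {
    "%dayTypes": "(sun|mon|tue|wed|thu|fri|sat)",
    "%wideAbbr": "(wide|abbreviated)",
}

MATCH = ("dates/calendars/calendar[@type='gregorian']/days/"
         "dayContext[@type='format']/dayWidth[@type='%wideAbbr']"
         "/day[@type='%dayTypes']")


def expand_variables(match, variables):
    """Replace each variable key in the match value with its regex fragment."""
    # Longest keys first, so a key that is a prefix of another cannot clash.
    for key in sorted(variables, key=len, reverse=True):
        match = match.replace(key, variables[key])
    return match


def compile_match(match, variables):
    """Expand variables, prefix //ldml/, and quote "[@" before compiling."""
    expanded = expand_variables(match, variables)
    return re.compile("//ldml/" + expanded.replace("[@", r"\[@"))
```

The compiled pattern accepts the wide and abbreviated day names for any of the seven day types, and rejects a path whose dayWidth type is narrow.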

<!ELEMENT approvalRequirements ( approvalRequirement* ) >
<!ELEMENT approvalRequirement EMPTY >
<!ATTLIST approvalRequirement votes CDATA #REQUIRED>
<!ATTLIST approvalRequirement locales CDATA #REQUIRED>
<!ATTLIST approvalRequirement paths CDATA #REQUIRED>

The approvalRequirements element specifies the number of Survey Tool votes required for approval, based on locale, path, or both. Certain locales require a higher voting threshold (usually 8 votes instead of 4) in order to promote greater stability in the data. Furthermore, certain very high visibility fields, such as number formats, require a CLDR TC member's vote for approval.

Here is an example of the approvalRequirements section.

<approvalRequirements>
	<!--  "high bar" items -->
	<approvalRequirement votes="20" locales="*" paths="//ldml/numbers/symbols[^/]++/(decimal|group)"/>
	<!--  established locales - http://cldr.unicode.org/index/process#TOC-Draft-Status-of-Optimal-Field-Value -->
	<approvalRequirement votes="8" locales="ar ca cs da de el es fi fr he hi hr hu it ja ko nb nl pl pt pt_PT ro ru sk sl sr sv th tr uk vi zh zh_Hant" paths=""/>
	<!--  all other items -->
	<approvalRequirement votes="4" locales="*" paths=""/>
</approvalRequirements>

This section specifies that a TC vote (20 votes) is required for decimal and grouping separators. Furthermore it specifies that any field in the established locales list (i.e. ar, ca, cs, etc.) requires 8 votes, and that all other locales require 4 votes only.
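
The lookup described above might be sketched as follows. The names APPROVAL_REQUIREMENTS and required_votes are hypothetical. The sketch assumes that the entries are tried in order and the first match wins, that "*" matches any locale, and that an empty paths value matches any path. The possessive quantifier "++" in the source data is written as "+" here so that the pattern also compiles on Python versions before 3.11.

```python
import re

# (votes, locales, paths) taken from the example approvalRequirements section.
APPROVAL_REQUIREMENTS = [
    (20, "*", r"//ldml/numbers/symbols[^/]+/(decimal|group)"),
    (8, "ar ca cs da de el es fi fr he hi hr hu it ja ko nb nl pl pt pt_PT "
        "ro ru sk sl sr sv th tr uk vi zh zh_Hant", ""),
    (4, "*", ""),
]


def required_votes(locale, path, requirements=APPROVAL_REQUIREMENTS):
    """Return the votes needed for a path in a locale, or None if no entry applies."""
    for votes, locales, paths in requirements:
        # Assumption: "*" means every locale; otherwise a space-delimited list.
        locale_ok = locales == "*" or locale in locales.split()
        # Assumption: an empty paths attribute applies to every path.
        path_ok = paths == "" or re.fullmatch(paths, path) is not None
        if locale_ok and path_ok:
            return votes
    return None
```

Under these assumptions, a decimal separator path requires 20 votes in every locale, any other path requires 8 votes in de, and any other path in a locale outside the established list requires 4.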

For more information on the CLDR voting process, see http://cldr.unicode.org/index/process

8.1 Definitions

Target-Language is the language under consideration.
Target-Territories is the list of territories found by looking up Target-Language in the <languageData> elements in Supplemental Language Data.
Language-List is Target-Language, plus
- basic: Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, Unknown (de, en, es, fr, it, ja, pt, ru, zh, und)
- moderate: basic + Arabic, Hindi, Korean, Indonesian, Dutch, Bengali, Turkish, Thai, Polish (ar, hi, ko, in, nl, bn, tr, th, pl). If an EU language, add the remaining official EU languages, currently: Danish, Greek, Finnish, Swedish, Czech, Estonian, Latvian, Lithuanian, Hungarian, Maltese, Slovak, Slovene (da, el, fi, sv, cs, et, lv, lt, hu, mt, sk, sl)
- modern: all languages that are official or major commercial languages of modern territories
Target-Scripts is the list of scripts in which Target-Language can be customarily written (found by looking up Target-Language in the <languageData> elements in Supplemental Language Data), plus Unknown (Zzzz).
Script-List is the Target-Scripts plus the major scripts used for multiple languages
- Latin, Simplified Chinese, Traditional Chinese, Cyrillic, Arabic (Latn, Hans, Hant, Cyrl, Arab)
Territory-List is the list of territories formed by taking the Target-Territories and adding:
- basic: Brazil, China, France, Germany, India, Italy, Japan, Russia, United Kingdom, United States, Unknown (BR, CN, DE, GB, FR, IN, IT, JP, RU, US, ZZ)
- moderate: basic + Spain, Canada, Korea, Mexico, Australia, Netherlands, Switzerland, Belgium, Sweden, Turkey, Austria, Indonesia, Saudi Arabia, Norway, Denmark, Poland, South Africa, Greece, Finland, Ireland, Portugal, Thailand, Hong Kong SAR China, Taiwan (ES, CA, KR, MX, AU, NL, CH, BE, SE, TR, AT, ID, SA, NO, DK, PL, ZA, GR, FI, IE, PT, TH, HK, TW). If an EU language, add the remaining member EU countries: Luxembourg, Czech Republic, Hungary, Estonia, Lithuania, Latvia, Slovenia, Slovakia, Malta (LU, CZ, HU, EE, LT, LV, SI, SK, MT).
- modern: all current ISO 3166 territories, plus the UN M.49 [UNM49] regions in Supplemental Territory Containment.
Currency-List is the list of current official currencies used in any of the territories in Territory-List, found by looking at the region elements in Supplemental Territory Containment, plus Unknown (XXX).
Calendar-List is the set of calendars in customary use in any of Target-Territories, plus Gregorian.
Number-System-List is the set of number systems in customary use in the language.

8.2 Data Requirements

The required data to qualify for the level is then the following.

localeDisplayNames
1. languages: localized names for all languages in Language-List.
2. scripts: localized names for all scripts in Script-List.
3. territories: localized names for all territories in Territory-List.
4. variants, keys, types: localized names for any in use in Target-Territories; for example, a translation for PHONEBOOK in a German locale.
dates: all of the following for each calendar in Calendar-List.
1. calendars: localized names
2. month names, day names, era names, and quarter names
  - context=format and width=narrow, wide, & abbreviated
  - plus context=standAlone and width=narrow, wide, & abbreviated, if the grammatical forms of these are different than for context=format.
3. week: minDays, firstDay, weekendStart, weekendEnd
  - if some of these vary in territories in Territory-List, include territory locales for those that do.
4. am, pm, eraNames, eraAbbr
5. dateFormat, timeFormat: full, long, medium, short
6. intervalFormatFallback
numbers: symbols, decimalFormats, scientificFormats, percentFormats, currencyFormats for each number system in Number-System-List.
currencies: displayNames and symbol for all currencies in Currency-List, for all plural forms
transforms: (moderate and above) transliteration between Latin and each other script in Target-Scripts.

8.3 Default Values

Items should only be included if they are not the same as the default, which is:

what is in root, if there is something defined there.
for timezone IDs: the name computed according to Appendix J: Time Zone Display Names
for collation sequence, the UCA DUCET (Default Unicode Collation Element Table), as modified by CLDR.
- however, in that case the locale must be added to the validSubLocale list in collation/root.xml.
for currency symbol, language, territory, script names, variants, keys, types, the internal code identifiers, for example,
- currencies: EUR, USD, JPY, ...
- languages: en, ja, ru, ...
- territories: GB, JP, FR, ...
- scripts: Latn, Thai, ...
- variants: PHONEBOOK,...

9 Supplemental Metadata

Note that this section discusses the <metadata> element within the <supplementalData> element. For the per-locale metadata used in tests and the Survey Tool, see 10: Locale Metadata Element.

The supplemental metadata contains information about the CLDR file itself, used to test validity and provide information for locale inheritance. A number of these elements are described in:

Appendix I: Inheritance and Validity
Appendix K: Valid Attribute Values
Appendix L: Canonical Form
Appendix M: Coverage Levels

9.1 Supplemental Alias Information

<!ELEMENT alias (languageAlias*,scriptAlias*,territoryAlias*,subdivisionAlias*,variantAlias*,zoneAlias*) >

The following are common attributes for subelements of <alias>:
<!ELEMENT *Alias EMPTY >
<!ATTLIST *Alias type NMTOKEN #IMPLIED >
<!ATTLIST *Alias replacement NMTOKEN #IMPLIED >
<!ATTLIST *Alias reason ( deprecated | overlong ) #IMPLIED>

The languageAlias has additional reasons
<!ATTLIST languageAlias reason ( deprecated | overlong | macrolanguage | legacy | bibliographic ) #IMPLIED>

This element provides information as to parts of locale IDs that should be substituted when accessing CLDR data. This logical substitution should be done to both the locale id, and to any lookup for display names of languages, territories, and so on. The replacement for the language and territory types is more complicated: see Part 1: Core, Section 3.3.1 BCP 47 Language Tag Conversion for details.

<alias>
  <languageAlias type="in" replacement="id"/>
  <languageAlias type="sh" replacement="sr"/>
  <languageAlias type="sh_YU" replacement="sr_Latn_YU"/>
...
  <territoryAlias type="BU" replacement="MM"/>
...
</alias>

Attribute values for the *Alias values include the following:

Alias Attribute Values
Attribute	Value	Description
type	NMTOKEN	The code to be replaced
replacement	NMTOKEN	The code(s) to replace it, space-delimited.
reason	deprecated	The code in type is deprecated, such as 'iw' by 'he', or 'CS' by 'RS ME'.
	overlong	The code in type is too long, such as 'eng' by 'en' or 'USA' or '840' by 'US'.
	macrolanguage	The code in type is an encompassed language that is replaced by a macrolanguage, such as 'arb' by 'ar'.
	legacy	The code in type is a legacy code that is replaced by another code for compatibility with established legacy usage, such as 'sh' by 'sr_Latn'.
	bibliographic	The code in type is a bibliographic code, which is replaced by a terminology code, such as 'alb' by 'sq'.

9.2 Supplemental Deprecated Information (Deprecated)

<!ELEMENT deprecated ( deprecatedItems* ) >
<!ATTLIST deprecated draft ( approved | contributed | provisional | unconfirmed | true | false ) #IMPLIED > <!-- true and false are deprecated. -->

<!ELEMENT deprecatedItems EMPTY >
<!ATTLIST deprecatedItems type ( standard | supplemental | ldml | supplementalData | ldmlBCP47 ) #IMPLIED > <!-- standard | supplemental are deprecated -->
<!ATTLIST deprecatedItems elements NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems attributes NMTOKENS #IMPLIED >
<!ATTLIST deprecatedItems values CDATA #IMPLIED >

The deprecated items element was used to indicate elements, attributes, and attribute values that are deprecated. This means that the items are valid, but that their usage is strongly discouraged. This element and its subelements have been deprecated in favor of DTD Annotations.

Where particular values are deprecated (such as territory codes like SU for Soviet Union), the names for such codes may be removed from the common/main translated data after some period of time. However, typically supplemental information for deprecated codes is retained, such as containment, likely subtags, older currency codes usage, etc. The English name may also be retained, for debugging purposes.

9.3 Default Content

<!ELEMENT defaultContent EMPTY >
<!ATTLIST defaultContent locales NMTOKENS #IMPLIED >

In CLDR, locales without territory information (or where needed, script information) provide data appropriate for what is called the default content locale. For example, the en locale contains data appropriate for en-US, while the zh locale contains content for zh-Hans-CN, and the zh-Hant locale contains content for zh-Hant-TW. The default content locales themselves thus inherit all of their contents, and are empty.
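
The relationship can be sketched as a lookup that finds the locale whose file actually holds the data. The names DEFAULT_CONTENT and data_locale are hypothetical, the sample set is not the real CLDR list, and the sketch assumes underscore-separated locale IDs and that zh_Hans is itself listed as a default content locale.

```python
# Hypothetical sample of the defaultContent locales attribute.
DEFAULT_CONTENT = {"en_US", "zh_Hans", "zh_Hans_CN", "zh_Hant_TW"}


def data_locale(locale, default_content=DEFAULT_CONTENT):
    """Return the locale that carries the data for the given locale."""
    # A default content locale is empty, so its data lives in its parent.
    while locale in default_content and "_" in locale:
        locale = locale.rsplit("_", 1)[0]
    return locale
```

With this sample, en_US resolves to en and zh_Hans_CN resolves to zh, while en_CA is not a default content locale and resolves to itself.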

The choice of content is typically based on the largest literate population of the possible choices. Thus if an implementation only provides the base language (such as en), it will still get a complete and consistent set of data appropriate for a locale which is reasonably likely to be the one meant. Where other information is available, such as independent country information, that information can always be used to pick a different locale (such as en-CA for a website targeted at Canadian users).

If an implementation is to use a different default locale, then the data needs to be pivoted: all of the CLDR data for the current default locale is pushed out to the locales that inherit from it, and then the new default content locale's data is moved into the base. There are tools in CLDR to perform this operation.

For the relationship between Inheritance, DefaultContent, LikelySubtags, and LocaleMatching, see Section 4.2.6 Inheritance vs Related Information.

10 Locale Metadata Elements

Note: This section refers to the per-locale <metadata> element, containing metadata about a particular locale. This is in contrast to the Supplemental Metadata, which is in the supplemental tree and is not specific to a locale.

<!ELEMENT metadata ( alias | ( casingData?, special* ) ) >
<!ELEMENT casingData ( alias | ( casingItem*, special* ) ) >
<!ELEMENT casingItem ( #PCDATA ) >
<!ATTLIST casingItem type CDATA #REQUIRED >
<!ATTLIST casingItem override (true | false) #IMPLIED >
<!ATTLIST casingItem forceError (true | false) #IMPLIED >

The <metadata> element contains metadata about the locale for use by the Survey Tool or other tools in checking locale data; this data is not intended for export as part of the locale itself.

The <casingItem> element specifies the capitalization intended for the majority of the data in a given category within the locale. The purpose is so that warnings can be issued to translators that anything deviating from that capitalization should be carefully reviewed. Its type attribute has one of the values used for the <contextTransformUsage> element above, with the exception of the special value "all"; its value is one of the following:

lowercase
titlecase

The <casingItem> data is generated by a tool based on the data available in CLDR. In cases where the generated casing information is incorrect and needs to be manually edited, the override attribute is set to "true" so that the tool will not override the manual edits. When the casing information is known to be both correct and something that should apply to all elements of the specified type in a given locale, the forceError attribute may be set to "true" to force an error instead of a warning for items that do not match the casing information.

11 Version Information

<!ELEMENT version EMPTY >
<!ATTLIST version cldrVersion CDATA #FIXED "27" >
<!ATTLIST version unicodeVersion CDATA #FIXED "7.0.0" >

The cldrVersion attribute defines the CLDR version for this data, as published on CLDR Releases/Downloads.

The unicodeVersion attribute defines the version of the Unicode Standard that is used to interpret data. Specifically, some data elements such as exemplar characters are expressed in terms of UnicodeSets. Since UnicodeSets can be expressed in terms of Unicode properties, their meaning depends on the Unicode version from which property values are derived.

12 Parent Locales

The parentLocales data is supplemental data, but is described in detail in the core specification section 4.1.3.

Copyright © 2001–2018 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.