BCP 47 Extension U Google
mark@macchiato.com
Lab126
addison@inter-locale.com
IBM
yoshito_umaoka@us.ibm.com
General Internet Engineering Task Force locale, BCP 47

This document specifies an extension to BCP 47 that provides subtags that specify language and/or locale-based behavior or refinements to language tags, according to work done by the Unicode Consortium.
[BCP47] permits the definition and registration of language tag extensions "that contain a language component and are compatible with applications that understand language tags". This document defines an extension for identifying Unicode locale-based variations using language tags. The "singleton" identifier for this extension is 'u'.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Language tags, as defined by [BCP47], are useful for identifying the language of content. They are also used as locale identifiers (or can be mapped to locales) in many operating environments and APIs. However, most such locale identifiers also provide additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.

The maintaining authority for this extension's registry is the Unicode Consortium. Unicode defines common locale data and identifiers for this data:

   Item                    Value
   ----------------------  ------------------------------------------------
   Name                    Unicode Consortium
   Contact Email           cldr@unicode.org
   Discussion List Email   cldr-users@unicode.org
   URL Location            cldr.unicode.org
   Specification           Unicode Technical Standard #35 Unicode Locale
                           Data Markup Language (LDML),
                           http://unicode.org/reports/tr35/
   Section                 Section 3.2 BCP 47 Tag Conversion

The specification of extension subtags is provided by Section 3 of Unicode Technical Standard #35 Unicode Locale Data Markup Language. As required by BCP 47, subtags follow the language tag ABNF and other rules for the formation of language tags and subtags, are restricted to the ASCII letters and digits, are not case sensitive, and do not exceed eight characters in length. This document also specifies a canonical representation.

LDML is available over the Internet and at no cost, and is available via a royalty-free license at http://unicode.org/copyright.html. LDML is versioned, and each version of LDML is numbered, dated, and stable. Extension subtags, once defined by LDML, are never retracted and never change in meaning in a substantial way.
The subtags available for use in the 'u' extension consist of a set of attributes, keys, and types. Attributes, keys, types, and their respective meanings are defined in Section 3 (Unicode Language and Locale Identifiers) of [UTS35]. The following is a summary of that definition (for details see Section 3):

- An 'attribute' is a subtag with a length of three or more characters following the singleton and preceding any 'keyword' sequences. No attributes were defined at the time of this document's publication.
- A 'keyword' is a sequence of subtags consisting of a 'key' subtag, followed by zero or more 'type' subtags. Each 'key' MUST be unique within the extension. The order of the 'type' subtags within a 'keyword' is sometimes significant to their interpretation. Note that 'keys' can appear without a subsequent 'type' subtag.
- A 'key' is a subtag with a length of exactly two characters. Each 'key' is followed by zero or more 'type' subtags.
- A 'type' is a subtag with a length of three or more characters following a key. 'Type' subtags are specific to a particular 'key' and the order of the 'type' subtags MAY be significant to the interpretation of the 'keyword'.

For example, the language tag "de-DE-u-attr-co-phonebk" consists of:

- The base language tag "de-DE" (German as used in Germany), exactly as defined by [BCP47] using subtags from the IANA Language Subtag Registry.
- The singleton 'u', identifying this extension.
- The attribute 'attr', which is an example for illustration (no attributes were defined at the time this document was published).
- The keyword 'co-phonebk', consisting of the key 'co' (Collation) and the type 'phonebk' (Phonebook collation order).

With successive versions of [UTS35], additional attributes, keys, and types MAY be defined. Once defined, attributes, keys, and types will never be removed. Machine-readable files listing the valid attributes, keys, and types are available in the CLDR repository for each version.
For example, for version 1.7.2, the files are located at http://unicode.org/repos/cldr/tags/release-1-7-2/common/bcp47/. These files can also contain aliases that were used in previous versions of [UTS35].
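
The attribute, key, and type structure summarized above can be sketched in Python. This is a non-normative illustration under stated assumptions: the function name parse_u_extension is hypothetical, the code checks only the length and character rules stated in this document, and it does not validate subtags against the CLDR data files.

```python
def parse_u_extension(tag):
    """Split the 'u' extension of a language tag into (attributes, keywords).

    Hypothetical helper for illustration only. Returns a list of attributes
    and a dict mapping each key to its ordered list of types.
    """
    subtags = tag.lower().split("-")  # case is not significant

    # Find the 'u' singleton after the primary language subtag.
    # An 'x' singleton starts private use, so nothing after it is an extension.
    start = None
    for i, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":
            break
        if subtag == "u":
            start = i
            break
    if start is None:
        return [], {}

    attributes, keywords, key = [], {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # the next singleton ends the 'u' extension
        if not (subtag.isascii() and subtag.isalnum() and len(subtag) <= 8):
            raise ValueError("malformed subtag: %r" % subtag)
        if len(subtag) == 2:
            # A two-character subtag is a key; each key must be unique.
            if subtag in keywords:
                raise ValueError("duplicate key: %r" % subtag)
            key = subtag
            keywords[key] = []
        elif key is None:
            attributes.append(subtag)      # precedes any keyword
        else:
            keywords[key].append(subtag)   # type order is preserved
    return attributes, keywords


print(parse_u_extension("de-DE-u-attr-co-phonebk"))
```

For the example tag discussed above, the sketch yields the attribute list ['attr'] and the keyword mapping {'co': ['phonebk']}.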
As required by [BCP47], case is not significant. The canonical form for all subtags in the extension is lowercase. The canonical order of attributes is in [US-ASCII] order (that is, numbers before letters, with letters sorted as lowercase US-ASCII code points). The canonical order of keywords is in [US-ASCII] order by key. The order of subtags within a keyword is significant; the meaning of this extension is altered if those subtags are rearranged. Thus, the canonical form of the extension never reorders the subtags within a keyword.
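
The canonicalization rules above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name canonicalize_u_extension is hypothetical, the attributes 'zeta' and '1abc' and the key 'zz' with its types are invented for the example, and the code leaves the base language tag and any other extensions untouched.

```python
def canonicalize_u_extension(tag):
    """Return tag with its 'u' extension in canonical form (illustrative)."""
    subtags = tag.split("-")

    # Locate the 'u' singleton and the singleton (if any) that ends it.
    start = end = None
    for i, subtag in enumerate(subtags):
        if i == 0 or len(subtag) != 1:
            continue
        if start is not None:
            end = i            # next singleton closes the 'u' extension
            break
        if subtag.lower() == "x":
            break              # private use: no 'u' extension follows
        if subtag.lower() == "u":
            start = i
    if start is None:
        return tag
    if end is None:
        end = len(subtags)

    # Lowercase every subtag in the extension.
    body = [s.lower() for s in subtags[start + 1:end]]

    # Group into attributes and keywords ([key, type, type, ...]).
    attributes, keywords = [], []
    for subtag in body:
        if len(subtag) == 2:
            keywords.append([subtag])
        elif keywords:
            keywords[-1].append(subtag)
        else:
            attributes.append(subtag)

    # Sorting lowercase ASCII strings by code point puts digits before letters.
    attributes.sort()
    # Keywords are ordered by key; types inside a keyword are never reordered.
    keywords.sort(key=lambda keyword: keyword[0])

    canonical = ["u"] + attributes + [s for kw in keywords for s in kw]
    return "-".join(subtags[:start] + canonical + subtags[end:])


print(canonicalize_u_extension("de-DE-U-Zeta-1abc-ZZ-bbb-aaa-CO-phonebk"))
# prints: de-DE-u-1abc-zeta-co-phonebk-zz-bbb-aaa
```

Note that the types 'bbb' and 'aaa' keep their original order, because only attributes and whole keywords are sorted.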
Per [BCP47], Section 3.7:
%%
Identifier: u
Description: Unicode Locale
Comments: Subtags for the identification of language and cultural variations. Used to set behavior in locale APIs.
Added: 2009-mm-dd
RFC: [TBD]
Authority: Unicode Consortium
Contact_Email: cldr@unicode.org
Mailing_List: cldr-users@unicode.org
URL: http://cldr.unicode.org
%%
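
A record of this shape can be read into field/value pairs with a few lines of Python. This is only a sketch: the function name parse_extension_record is hypothetical, and the code assumes one field per line with "%%" delimiter lines, so it does not handle folded (continued) field bodies.

```python
def parse_extension_record(text):
    """Parse a single '%%'-delimited registry record into a dict (illustrative)."""
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line == "%%":
            continue  # skip blank lines and record delimiters
        # Split on the first colon only, so URL values keep their "://".
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


record = """%%
Identifier: u
Description: Unicode Locale
URL: http://cldr.unicode.org
%%"""
print(parse_extension_record(record)["Identifier"])  # prints: u
```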
Thanks to John Emmons and the rest of the Unicode CLDR Technical Committee for their work in developing the BCP 47 subtags for LDML.
This document will require IANA to insert the registration record given in this document into the Language Extensions Registry, according to Section 3.7 (Extensions and the Extensions Registry) of "Tags for Identifying Languages" in [BCP47]. There might be occasional maintenance of this record. This document does not require IANA to create or maintain a new registry or otherwise impact IANA.
The security considerations for this extension are the same as those for [BCP47] (or its successors). See Section 6 (Security Considerations) of [BCP47].
[RFC5646] Phillips, A., Ed., and M. Davis, Ed., "Tags for Identifying Languages", BCP 47, RFC 5646, September 2009.

[UTS35] Unicode Consortium, "Unicode Technical Standard #35: Locale Data Markup Language (LDML)".

[BCP47] "Tags for the Identification of Language (BCP47)".

[US-ASCII] International Organization for Standardization, "ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange", ISO/IEC 646, JTC 1/SC 2. This standard defines an International Reference Version (IRV) which corresponds exactly to what is widely known as ASCII or US-ASCII. ISO/IEC 646 was based on the earlier standard ECMA-6. ECMA has maintained its standard up to date with respect to ISO/IEC 646 and makes an electronic copy available at http://www.ecma-international.org/publications/standards/Ecma-006.htm.

[ldml-registry] "Registry for Common Locale Data Repository tag elements".