---
title: BCP47 Validation and Canonicalization
---

# BCP47 Validation and Canonicalization

The proposal is to add two tables of precomputed values to CLDR for each release, plus a table of language code mappings.

## Validation Data

**Language subtag.** These can be 2-letter, 3-letter, or registered (more than 3 letters). We were looking at validation of 7,000 base language entries, and Markus had an idea: algorithmically map the two-letter codes onto the values 0..675, and the three-letter codes onto 676..18251 (just over 14 bits). The set of all valid language subtags can then be put into a bit set of 2,282 bytes, which allows fast validation with a small table. Registered codes would just use an exception table.

An alternative mapping would use 26\*26\*27 values, e.g.

- "xy" => (x-0x61)\*26\*27 + (y-0x61)\*27
- "xyz" => (x-0x61)\*26\*27 + (y-0x61)\*27 + (z-0x60)

However, it is better to have the two-letter codes as the smaller numbers, for compression, since they occur far more often.

**Region subtag.** One could do the same for region codes, with two-letter codes from 0..675 and 3-digit codes from 676..1675 (about 10.7 bits). A bit set that covers all of these values is a 210-byte table.

**Script subtag.** John Cowan suggested that, except for Teng/Tfng, the second letter of the script code is redundant, so you can special-case those two, remove the second letter, and use the same algorithm as for ISO 639-2. However, we can't expect that the JAC would follow any particular restrictions, and the set of scripts is still relatively small, so this probably isn't worth it.

Note: The generation of a table is simply a convenience, since it can be computed from the IANA registry. It may therefore not be worth doing as a part of CLDR, but we can suggest it as an implementation technique.

## Canonicalization Data

We also provide data for validation and canonicalization. The basic canonicalization is as per BCP47, with the following additions:

1. We canonicalize the case, with variants getting uppercase, so en\_foobar => en\_FOOBAR
2. We alphabetize the variants so that irrelevant differences in order don't cause problems, so en-FOOBAR-ABCDE => en\_ABCDE\_FOOBAR

- Note: the uppercasing of variants is for compatibility, since the basis for the CLDR work predated BCP47.

Data for doing the preferred value mapping is in the supplemental data, extracted from the IANA registry.

We also provide data for a lenient canonicalization, which involves the following additional steps:

1. Map the 3-letter language subtags that have 2-letter equivalents into those 2-letter equivalents; so eng-US => en\_US
2. Map the 3-digit region codes that have 2-letter equivalents into those 2-letter equivalents; so eng-840 => en\_US
3. Combine identical extensions; so en-a-foo-a-fii => en\_a\_foo\_fii

The data for #2 are in http://unicode.org/cldr/data/common/supplemental/supplementalData.xml, in codeMappings/territoryCodes. However, we need an extra table for doing #1, the language code mappings. Suggest adding:

\  \ ...

## Sample Structure

 \  \   -2122061011208687, e00d48015863b67, 15fb9fb2095c00, 340400f7818068d,
  -2b07ebe0bd4e300, 100086b25d7fffc, 43fff001538f3c40, -4044cc58020eaf00,
  4085570410419a, 18ffffffc04002, 2eea2e908400418, 6260008c6,
  -33d4000000000000, 10000, 0, 0,
  0, 80
 \
 \   91019c747263433, 1c68108800045364, 4443028094090c84, -7ffffbe3baa63970,
  -3ff1e7af28980bf0, 61204489a16d0e6d, 10000003024040, -648bb808222ebe40,
  1001044202044053, 4100000020000400, -1220200fbffdfc00, -5010004244000101,
  -78c6890a8c3e0081, -fc408f0000001, -200169dfafa301, -880800009,
  -8171c0000001, -4187fd4fbb1, -2000000800011, 9fe75970f1b42bd,
  1490f9feddf20051, -114007e, -2800000008080001, -80280000001,
  -40180000001, -400000000010003, -c0000200000001, -1000000d0041,
  -20000080000041, -1200000208000001, -42000002000005, 7fffffffffefefff,
  -3bfc2522b0640841, 4124082843c19cf, -d00447fffbbb00, 488349bd64542b49,
  -3f182aaabe898841, -7a20060100c8ec8f, -400043effff79ae, -3878c1e88b08201,
  8005b0008100ffd, 2000040030000000, -301210082002fde7, -3eee729ffdfffbc,
  665df000000227bd, -4200010e261ad97, c100860c01149fc, 75689565b65c5500,
  20003efedb, -3fe966da82589400, 7f7ffff07a540, 460801000,
  a12510714b, 600000490100000, 4440000100000001, 4000010048010000,
  5100000042880000, 4f553243564102dd, 800001cdc2bd5, -6e8fbffffffffffc,
  a798218157d9013, 4000000824000000, 4001020000000, 39abfeb000004,
  40000000, 4400020000, 5a88110000000020, 6042000000000000,
  500108000408a, 400631080, 4081003f50300400, 13b33be00000000,
  1100800, -5000000000000000, 283cedffffffffff, 3c51fcf24dfffc0a,
  -4daba593a4fdff00, 409403cff84f039, -774ab6e1cfff5fef, -20000069208824fd,
  -c3fddefefff7ff9, 40444f850ffd4, 7f41be8d6fffdff0, 2397b2000000da23,
  fe7ffff00084050, 800000008104180, 11c941, 5feb408040040100,
  -3fbfbeeffffbff36, -7198804000f9fffa, 1100036edb9f051, -77d596afbc000000,
  -2001daefbfffbe, 40050809053, 10000, -7ebb71a93fd5f000,
  78047c0208, 844785244050cc0, 1885000000000204, 11d1350ee8cd1001,
  833eb5906691, 4100000001040052, 74481a71dd649964, 800008001000fb9,
  10002010045400, -60a140ff7f9ff7fc, 1040c00000698843, 1a2dd20200,
  -20feffbffffbfee0, 400000000013959, 2486290f00000401, -7fffbb9fff5fefd3,
  68000c02000000, -370ffffffffff80, 10000080215e, 1000011000000,
  -21c2801000, -110008000001, -1, -224000004003,
  -1801, -200002000000001, -231, -401001,
  -2000100001, -10000000001, -4000000000000001, -7fa6743002322029,
  -7ff8000000ff800e, 72e6061b3dc3000, -460ffddbbc083001, -c0000da80802f1a,
  204100000004, -1010248847ff5e, -1000000000001, -80001,
  7dffffff7fffffff, -100000008000009, -40000421, -80000080002803,
  -10000400001, 7efbfffdf7ffffff, -1, -800000011,
  -1000080020003001, 27ef77ffffeffbe9, -2800000020fb7e0, -98611d000800002,
  -13642a901081, -820a02001000001, -1aa3ff8bbfbae5e0, -2a4a3bc0002225d6,
  1170004203fffbff, -1db71a9a7df, 400220005450540d, 6010041040810696,
  5605100000000000, 200106000001040, 64b67f9b19201180, -462b32512a8ff6c0,
  421500406204837, -3efa4067cac00000, 3113f7df46c98, 900000000,
  400100100a000, -3045824002201, -5a2004f9e7bebbd5, 7251425410002047,
  200003fefbff1edf, -482004483e11f390, -20a0300038a02081, 7c00100107ff445e,
  -479faa0062208009, 30000003dbd77f5e, 3010000400042704, -fff7ffe70,
  -1, -1, -1, -1,
  -1, -1, -1, 7bdeffffffffffff,
  140c1085d13ee57e, 800117fa2, bffffef00, 3b53000000800002,
  16241100000010e, 5001a3831002010, 81183010, -780efd5ed8010201,
  20800052d, 2000009020010400, 319c5f61004, 254010,
  -2040fffffffebe0, -230a120c00000001, 400b7fffffa7bfb, -400000a1250228ef,
  -800e840e9000201, -4004010081100801, 757bf7efdffdfffd, -20823e08f7e981,
  -a8e000010400001, -19de7efa010f, 17d467d07159f159, -40000081080e7d6,
  -90240402202001, -3e00fd67bfe000b, 7fdfffdffef6f1f, -412043c4fae3b,
  -1000080080000081, -7fd010022808241, -860000001003a5, -f98420400200001,
  67dc75ddbf8ff531, 201ca06beab91, 4480404023, 4000400000406d44,
  80050060130000, -3effffbffdc00000, 6f14f040f49c5588, 400a0a51641,
  -40823fffffff7ffc, 4a0044108581fe, 1000224012300, 1000000000,
  936fbf1010800, 0, 2d80c, 2c286c00000000a0,
  18df010000000, 25003f79fff0120, a00000, 90003100040000,
  8300000100002, 100000, -1000000000000000, 1011ae7ecffffff,
  5ef18d040092000, 1105404141000010, 4057a3ff6040, 19f7d9755450080b,
  46757f0435578cc7, 72c0000000000011, 5480360501fdae, 10000001ba388b1,
  2100300000044240, 157ff95f00000010, 40117a78d630944, 100000041e984a40,
  -56fcf17ff8bfdf7e, -1000ffffffede, -7cf0000100bac7a9, 45bd55042f54c019,
  -460e973bb3f7ffff, 1ad243d7fed7d37f, 80248550118440, 242000008000281,
  -7f9fe01900020010, -1d4eafefee9fff58, 2442980000000013, 3bfbfe100020,
  -bb61abaf5a2bf00, 3d42051b3668ffdd, -3ffffff7fc2cf27c, 200506e80c110b44,
  200007dbbf7f002, 5b2801, -2000fffeefdf00, -7ffffff7eedfdfad,
  20c10000, 310084280230030, -53b3cffffffdbdcc, 67ffffff004c8023,
  -3f8af7bfefef, 10138104000010ff, 3081676e140121c0, 1000000100,
  80400a242200000
 \
\
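As an illustration of how tables like the ones above might be consumed, here is a minimal Python sketch. It makes several assumptions that the proposal does not state: that each value is a signed hexadecimal 64-bit word, that bit `i` of the set is stored in word `i // 64` with the least significant bit first, and that the first, shorter list is the region set. The function names (`parse_words`, `language_index`, `region_index`, `contains`) are hypothetical. The index mappings follow the ranges given under Validation Data.

```python
MASK64 = (1 << 64) - 1


def parse_words(text):
    """Parse comma-separated signed hex values into unsigned 64-bit words."""
    return [int(tok, 16) & MASK64 for tok in text.split(",") if tok.strip()]


def language_index(subtag):
    """Map 2-letter codes to 0..675 and 3-letter codes to 676..18251."""
    s = subtag.lower()
    if not (s.isascii() and s.isalpha() and len(s) in (2, 3)):
        return None  # registered (longer) subtags would need an exception table
    value = 0
    for ch in s:
        value = value * 26 + (ord(ch) - 0x61)
    return value if len(s) == 2 else 676 + value


def region_index(subtag):
    """Map 2-letter codes to 0..675 and 3-digit codes to 676..1675."""
    if not subtag.isascii():
        return None
    if len(subtag) == 2 and subtag.isalpha():
        s = subtag.lower()
        return (ord(s[0]) - 0x61) * 26 + (ord(s[1]) - 0x61)
    if len(subtag) == 3 and subtag.isdigit():
        return 676 + int(subtag)
    return None


def contains(words, index):
    """Test one bit, assuming bit i lives in word i // 64, LSB first."""
    if index is None or index // 64 >= len(words):
        return False  # trailing all-zero words may have been trimmed
    return (words[index // 64] >> (index % 64)) & 1 == 1


# Only the first word of the first list is used here, as a spot check.
first_word = parse_words("-2122061011208687")
print(contains(first_word, region_index("AD")))
```

Under these assumptions, a lookup costs one index computation plus one shift and mask, and subtags longer than three characters fall through to whatever exception table the implementation keeps.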
Here is the data that they replace: \ \