---
title: Language Data Consistency
---

# Language Data Consistency

We have a set of tests for consistency in the data for language, script, and country. The following is a draft description of what those consistency checks should aim for.

## Default script, language

1. For each script encoded in Unicode, the default\* language is present in the script metadata.
2. For each language used in CLDR, there is a default\* script.

\* default = most used in writing: currently, if modern; otherwise historically.

## Implications for Language-Country population data (LCPD)

1. If a base-language has a CLDR locale, then it is in the LCPD for at least one country.
2. If there is a CLDR country locale for a language, then that language+country is in the LCPD.
    1. For each country, get the language most widely used as a written language in that country. That language+country combination is in the LCPD.
    2. When a significant proportion of the language use in a country is in a non-default script, that script is marked in the LCPD.
    3. When a script is not EXCLUDED in UAX#31, then there is at least one language-country pair for it in the LCPD.
3. If a language has a significant literate population in a country, the pair is in the LCPD. This target is fuzzier, but it definitely includes:
    1. anything \>1M, or
    2. \>100K and either official (real, not honorary) or 1/3 of the population.

## Implications for Likely Subtags

Likely Subtags are built from the language-country population data, plus the script metadata, plus an exception list.
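As an illustration only, the "definitely significant" thresholds from the LCPD section (more than 1M literate users, or more than 100K combined with official status or one third of the population) could be sketched as a small predicate. This is not the actual CLDR test code. The `LanguageUse` record, its field names, and the function name are hypothetical, and the sketch assumes that "1/3 of the population" means at least one third of the country's total population.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageUse:
    """Hypothetical record for one language+country pair (not a CLDR type)."""
    language: str
    country: str
    literate_population: int   # literate users of the language in the country
    country_population: int    # total population of the country
    is_official: bool          # real official status, not honorary


def definitely_significant(use: LanguageUse) -> bool:
    """Return True if the pair clearly meets the draft thresholds.

    Assumption: "1/3 of the population" is read as at least one third
    of the country's total population.
    """
    # Rule 1: anything above 1M literate users qualifies outright.
    if use.literate_population > 1_000_000:
        return True
    # Rule 2: above 100K, and either official or at least 1/3 of the population.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    at_least_one_third = 3 * use.literate_population >= use.country_population
    return use.literate_population > 100_000 and (use.is_official or at_least_one_third)


# Made-up pairs using placeholder codes; the numbers are not real data.
large = LanguageUse("qaa", "ZZ", 2_000_000, 50_000_000, False)
small_unofficial = LanguageUse("qab", "ZZ", 150_000, 10_000_000, False)
```

A pair that fails this predicate is not necessarily excluded from the LCPD. The text describes the target as fuzzy, so the predicate only identifies pairs that should definitely be present.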