---
title: Language Distance Data
---

# Language Distance Data

The purpose is to provide a way to match languages/locales according to "closeness" rather than just by a truncation algorithm, and to allow the user to specify multiple acceptable languages. The data is designed to allow for an algorithm that can account for the closeness of the relations between, say, tl and fil, or en-US and en-CA. This is based on code that we already use, but expanded to be more data-driven.

For example, if I understand written American English, German, French, Swiss German, and Italian, and the product has {ja-JP, de, zh-TW}, then de would be the best match; if I understand only {zh}, then zh-TW would be the best match. This represents a superset of the capabilities of locale fallback. Stated in those terms, it can have the effect of a more complex fallback, such as:

sr-Cyrl-RS

sr-Cyrl

sr-Latn-RS

sr-Latn

sr

hr-Latn

hr

Note that the goal, as with the rest of CLDR, is to match written languages. Should we find in the future that it is also important to support spoken language matching in the same way, variant weights could be supplied.

This is related to the current aliasing mechanism, which is used to equate he and iw, for example. It is used to find the best locale ID for a given request, but does not interact with the fallback of resources *within the locale-parent chain.* It subsumes and replaces the current \<fallback> element (we'd take the current information in those elements and apply it).

## Expected Input

1. a weighted list of desired languages D (like AcceptLanguage)
2. a weighted list of available languages A (e.g. supported languages)

In the examples, the weights are given in AcceptLanguage syntax, e.g. ";q=" followed by a number in the range 0.0 to 1.0. The weight 0.0 means don't match at all. Unlike AcceptLanguage, however, the relations among variants like "en" and "en-CA" are taken into account.
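
The weighted-list syntax described above can be read with a few lines of code. The following is a minimal sketch, not part of the proposal: the function name `parse_weighted_list` is hypothetical, and treating a missing weight as 1.0 is an assumption (it mirrors entries such as "de" that carry no explicit weight).

```python
def parse_weighted_list(items):
    """Parse entries such as "en-US;q=0.9" into (tag, q) pairs.

    Assumption: an entry without an explicit q value gets q = 1.0.
    """
    parsed = []
    for item in items:
        # Split the language tag from any ";name=value" parameters.
        tag, _, params = item.partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q" and value.strip():
                q = float(value)
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"q out of range in {item!r}")
        parsed.append((tag.strip(), q))
    return parsed


desired = parse_weighted_list(["en-CA;q=1", "fr;q=0.9", "de"])
print(desired)  # [('en-CA', 1.0), ('fr', 0.9), ('de', 1.0)]
```

The sketch only reads the weights; it does not normalize them or relate variants such as "en" and "en-CA" to each other.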

In very many cases, the weights will all be identical (1.0). Some exceptions might be:

- For desired languages, to indicate a preference. For example, I happen to prefer English to German to French to Swiss German to Italian. So the desired list for me might be {"en-US;q=1", "de;q=0.9", "fr;q=0.85", "gsw;q=0.8", "it;q=0.6"}
- For available languages, to indicate the "quality" of the user experience. Thus if it is known that the German version of a product or site is quite good, but the Danish is substandard, that could be reflected in the weightings. In most cases, however, the available language weights would be the same.

## Expected Output

1. A "best fit" language from A
2. A measure of how good the fit is

## Examples

Input:

desired: {"en-CA;q=1", "fr;q=1"}

available: {"en-GB;q=1", "en-US;q=1"}

threshold: script

Output:

en-US

good

Input:

desired: {"en-ZA;q=1", "fr;q=1"}

available: {"en-GB;q=1", "en-US;q=1", "fr-CA;q=0.9"}

threshold: script

Output:

en-GB

good

Input:

desired: {"de"}

available: {"en-GB;q=1", "en-US;q=1", "fr-CA;q=0.9"}

threshold: script

Output:

en-GB

bad

## Internals

The following is a logical expression of how this data can be used.

The lists are processed, with each Q value being inverted (x = 1/x) to derive a weight. There is a small progressive cost as well, so {x;q=1 y;q=1} turns into x;w=0 y;w=0.0001. Because AcceptLanguage is fatally underspecified, we also have to normalize the Q values.

For each pair (d,a) in D and A:

The base distance between d and a is computed by canonicalizing both languages, maximizing them using likely subtags, and then computing the following:

baseDistance = diff(d.language, a.language) + diff(d.script, a.script) + diff(d.region, a.region) + diff(d.variants, a.variants)

There is also a small distance allotted for the maximization.
That is, "en-Latn-CA" vs "en-Latn-CA", where the second "Latn" was added by maximization, will have a non-zero distance. Variants are handled as a sorted set, and the distance is variantDistance \* (count(variants1-variants2) + count(variants2-variants1)). As yet, there is no distance for extensions, but that may come in the future.

We then compute:

weight(d,a) = weight(d) \* weight(a) \* baseDistance(d,a)

The weight of each a is then computed as min(weight(d,a)) over all d. The a with the smallest such weight is the winner. The "goodness" of the match is given on a scale from 0.0 (perfect) to 1.0 (awful). Constants are provided for a Script-only difference and a Region-only difference, for comparison.

If, however, the winning language is not a good enough match to meet the threshold, then the default locale (first in the available languages list) is returned.

Note that the distance metric is *not* symmetric: the distance from zh to yue may be different from the distance from yue to zh. That happens when it is more likely that a reader of yue would understand zh than the reverse.

Note that this doesn't have to be an N x M algorithm. Because there is a minimum threshold (otherwise returning the default locale), we can precompute the possible base language subtags that could be returned; anything else can be discarded.

## Data Sample

The data is designed to be relatively simple to understand. It would typically be processed into an internal format for fast processing. The data does not need to be exact; only the relative computed values are important. However, to keep the types of fields apart, they are given very different values. TODO: add values for [ISO 636 Deprecation Requests - DRAFT](https://cldr.unicode.org/development/development-process/design-proposals/iso-636-deprecation-requests-draft)

\<languageDistances>

\<!-- Essentially synonyms. Note that true synonyms like he/iw are handled by default below.
-->

\<distance desired="tl" available="fil">8\</distance>

\<distance desired="no" available="nb">1\</distance>

\<distance desired="ro-MO" available="mo">1\</distance>

\<!-- Scandinavian. Remember that we focus on written form -->

\<distance desired="nn" available="no">64\</distance>

\<distance desired="nn" available="nb">64\</distance>

\<distance desired="da" available="no">96\</distance>

\<distance desired="da" available="nb">96\</distance>

\<distance desired="da" available="nn">128\</distance>

\<!-- All the Serbo-Croatian variants are like regional variants -->

\<distance desired="hr" available="bs">64\</distance>

\<distance desired="sh" available="bs">64\</distance>

\<distance desired="sr" available="bs">64\</distance>

\<distance desired="sh" available="hr">64\</distance>

\<distance desired="sr" available="hr">64\</distance>

\<distance desired="sh" available="sr">64\</distance>

\<!-- Chinese scripts -->

\<distance desired="und-Hant" available="und-Hans">128\</distance>

\<!-- English: US and Canada are close; everything else closer to GB -->

\<distance desired="en-Zzzz-155" available="en-Zzzz-155">8\</distance> \<!-- Expand to cover the Americas -->

\<distance desired="en-Zzzz-155" available="en-Zzzz-ZZ">64\</distance> \<!-- They aren't close to GB -->

\<distance desired="en-Zzzz-ZZ" available="en-Zzzz-ZZ">8\</distance> \<!-- All others are closer to GB, and each other -->

\<!-- default distances. Must be last! Note that deprecated differences in the alias file are given a weight of 1, and are placed before this point.
-->

\<distance desired="und" available="\*">1024\</distance> \<!-- default language distance -->

\<distance desired="und-Zzzz" available="\*">256\</distance> \<!-- default script distance -->

\<distance desired="und-Zzzz-ZZ" available="\*">64\</distance> \<!-- default region distance -->

\<distance desired="und-Zzzz-ZZ-UNKNOWN" available="\*">16\</distance> \<!-- default variant distance -->

\</languageDistances>

## Interpreting the Format

1. The list is ordered, so the first match for a given type wins. That is, logically, you walk through the list looking for language matches. At the first one, you record the distance. Then you walk through for script differences, and so on.
2. The attributes desired and available both take language tags, and are assumed to be maximized for matching.
3. The Unknown subtags (und, Zzzz, ZZ, UNKNOWN) match any subtag of the same type. Trailing unknown values can be omitted. "\*" is a special value, used for the default distances. The macro regions (e.g. 019 = Americas) match any region in them. So und-155 matches any language in Western Europe (155).
   1. As we expand, we may find out that we want more expressive power, like regex.
4. The attribute oneWay="true" indicates that the distance applies in only one direction.

## Issues

- Should we have the values be symbolic rather than literal numbers? e.g. L, S, R, ... instead of 1024, 256, 64, ...
- The "\*" is a bit of a hack. Other thoughts for syntax?
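
## Illustrative Sketch

To make the lookup described under Internals and Interpreting the Format more concrete, here is a rough sketch. It is not the proposed implementation. It uses a subset of the Scandinavian rules and the default distances from the data sample; all function names are hypothetical. It assumes that tags are already maximized into (language, script, region) tuples, that all weights are equal, that a rule applies in both directions unless marked one-way, and that a match is rejected when its distance is greater than the threshold. Variants, macro regions, and the small maximization cost are left out.

```python
# Default distances taken from the data sample; the rest is illustrative.
DEFAULT_LANGUAGE, DEFAULT_SCRIPT, DEFAULT_REGION = 1024, 256, 64

# (desired, available, distance, one_way), in list order: first match wins.
LANGUAGE_RULES = [
    ("no", "nb", 1, False),
    ("nn", "nb", 64, False),
    ("da", "nb", 96, False),
    ("da", "nn", 128, False),
]


def language_distance(desired, available):
    """Return the distance for the language subtag only."""
    if desired == available:
        return 0
    for rule_desired, rule_available, distance, one_way in LANGUAGE_RULES:
        if (desired, available) == (rule_desired, rule_available):
            return distance
        # Assumption: rules without one_way also apply in reverse.
        if not one_way and (available, desired) == (rule_desired, rule_available):
            return distance
    return DEFAULT_LANGUAGE


def base_distance(desired, available):
    """Sum per-field distances for two maximized (language, script, region) tags."""
    d_lang, d_script, d_region = desired
    a_lang, a_script, a_region = available
    return (
        language_distance(d_lang, a_lang)
        + (0 if d_script == a_script else DEFAULT_SCRIPT)
        + (0 if d_region == a_region else DEFAULT_REGION)
    )


def best_match(desired_list, available_list, threshold=DEFAULT_SCRIPT):
    """Pick the available tag closest to any desired tag.

    Falls back to the first available tag (the default) when the best
    distance is greater than the threshold.
    """
    scored = [
        (min(base_distance(d, a) for d in desired_list), a)
        for a in available_list
    ]
    distance, winner = min(scored, key=lambda pair: pair[0])
    if distance > threshold:
        # Report the default tag together with its own distance.
        return available_list[0], scored[0][0]
    return winner, distance


available = [("en", "Latn", "US"), ("da", "Latn", "DK"), ("nb", "Latn", "NO")]
print(best_match([("nn", "Latn", "NO")], available))
# (('nb', 'Latn', 'NO'), 64)
```

With a desired list of only de-Latn-DE and the available list en-Latn-GB, en-Latn-US, fr-Latn-CA, every candidate differs in language and region (1024 + 64 = 1088), which is above the script threshold of 256, so the sketch returns the default en-Latn-GB, in line with the third example above.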