# README for configuration files used by org.unicode.icu.tool.cldrtoicu.regex.RegexTransformer.
#
# © 2019 and later: Unicode, Inc. and others.
#
# CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/)
# For terms of use, see http://www.unicode.org/copyright.html

======
Basics
======

The RegexTransformer class converts CLDR paths and values to ICU Resource Bundle paths
and values, based on a set of transformation rules typically loaded from a text file
(e.g. ldml2icu_locale.txt).

The basic format of transformation rules is:
  <path-specification> ; <resource-bundle-specification> [; <instruction>=<argument>]*

A simple example of a transformation rule is:

  //ldml/localeDisplayNames/keys/key[@type="(%A)"] ; /Keys/$1

which transforms CLDR values whose path matches the path specification, and emits:
* A resource bundle path "/Keys/xx", where 'xx' is the captured type attribute.
* A resource bundle value, which is just the CLDR value's base value.

A path specification can be thought of as a regular expression which matches the CLDR
path and can capture some element names or attribute values; however unlike a regular
expression, the '[',']' characters are treated as literals, similar to XPath expressions.

If a single CLDR value should produce more than one resource bundle path/value, then
it should be written:

  <path-specification>
     ; <resource-bundle-1-specification> [; <instruction> ]*
     ; <resource-bundle-2-specification> [; <instruction> ]*

=====================
Argument Substitution
=====================

Before a rule can be matched, any %-variables must be substituted.
These are defined in the same configuration file as the rules, and look something like:
  %W=[\w\-]++
or:
  %D=//ldml/numbers/defaultNumberingSystem

The first case can be thought of as just a snippet of regular expression (in this case
something that matches hyphen-separated words) and, importantly, here '[' and ']' are
treated as regular expression metacharacters. These arguments are static and will be
substituted exactly as-is into the regular expression to be used for matching.

The second case (used exactly once) is a dynamic argument which references a CLDR value
in the set of data being transformed. This is simply indicated by the fact that it starts
with '//'. This path is resolved and the value is substituted just prior to matching.

Variable names are limited to a single upper-case letter (A-Z).

===========================
Implicit Argument Splitting
===========================

This is a (somewhat non-obvious) mechanism which allows for a single rule to generate
multiple results from a single input path when an argument is a list of tokens.

Consider the rule:

//supplementalData/timeData/hours[@allowed="(%W)"][@preferred="(%W)"][@regions="(%W)"]
   ; /timeData/$3/allowed ; values=$1
   ; /timeData/$3/preferred ; values=$2

where the "regions" attribute (which is captured as '$3') contains a whitespace-separated
list of region codes (e.g. "US GB AU NZ"). In this case the rule is applied once for each
region, producing paths such as "/timeData/US/allowed" or "/timeData/NZ/preferred". Note
that there is no explicit instruction to do this, it just happens.

The rule is that the first unquoted argument in the resource bundle path is always treated
as splittable.

To suppress this behaviour, the argument must be quoted (e.g. /timeData/"$3"/allowed). Now,
if there were another following unquoted argument, that would become implicitly splittable
(but only one argument is ever splittable).
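
The splitting rule described above can be sketched in a few lines of Python. This is
only an illustration of the documented behaviour, not the RegexTransformer code itself:
the function name expand_paths, the dictionary of captured arguments, and the choice to
drop the quotes from quoted arguments in the output are all assumptions made for the
example.

```python
import re

# Matches a quoted argument ("$N") or an unquoted one ($N).
ARG = re.compile(r'"\$(\d)"|\$(\d)')


def expand_paths(template, captured):
    """Expand a resource bundle path template into one path per token.

    `captured` maps argument numbers to captured strings (hypothetical input
    shape). Only the first unquoted argument is split on whitespace.
    """
    # Locate the first unquoted argument; it is the only splittable one.
    split_index = None
    for match in ARG.finditer(template):
        if match.group(2) is not None:
            split_index = int(match.group(2))
            break

    def substitute(token):
        def repl(match):
            if match.group(1) is not None:
                # Quoted argument: substitute the whole value, never split
                # (assumption: the quotes themselves are not emitted).
                return captured[int(match.group(1))]
            number = int(match.group(2))
            return token if number == split_index else captured[number]
        return ARG.sub(repl, template)

    if split_index is None:
        return [substitute(None)]
    return [substitute(token) for token in captured[split_index].split()]


captured = {1: "h H", 2: "h", 3: "US GB"}
print(expand_paths("/timeData/$3/allowed", captured))
# ['/timeData/US/allowed', '/timeData/GB/allowed']
print(expand_paths('/timeData/"$3"/allowed', captured))
# ['/timeData/US GB/allowed']
```

Quoting "$3" leaves no unquoted argument in the path, so the second call yields a single
path containing the unsplit value.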

============
Instructions
============

Additional instructions can be supplied to control value transformation and specify fallback
values. The set of instructions is:
* values: The most common instruction which defines how values are transformed.
* fallback: Defines a fallback value to be used if this rule was not matched.

There are two other special case instructions which should (if at all possible) not be used,
and might be removed at some point:
* group: Causes values to be grouped as sub-arrays for very specific use cases
  (prefer using "Hidden Labels" where possible).
* base_xpath: Allows deduplication of results between multiple different rules (this is a
  hack to work around limitations in how matching is performed).

-------------------
values=<expression>
-------------------

The "values" instruction defines an expression whose evaluated result becomes the output
resource bundle value(s). Unless quoting is present, this evaluated expression is split
on whitespace and can become multiple values in the resulting resource bundle.

Examples:

* values=$1 $2 $3

  Produces three separate values in the resource bundle for the first three captured
  arguments.

* values="$1 $2" $3

  Produces two values in the resource bundle, the first of which is two captured values
  separated by a space character.

* values={value}

  Substitutes the CLDR value, but then performs whitespace splitting on the result. This
  differs from the behaviour when no "values" instruction is present (which does not
  split the results).

* values="{value}" $1

  Produces two values, the first of which is the unsplit CLDR value, and the second is a
  captured argument.

* values=&func($1, {value})

  Invokes a transformation function, passing in a captured argument and the CLDR value,
  and the result is then split.
  The set of functions available to a transformer is configured when it is created.

Note that in the above examples, it is assumed that the $N arguments do not contain spaces.
If they did, it would result in more output values. To be strict about things, every value
which should not be split must be quoted (e.g. values="$1" "$2" "$3") but since captured
values are often IDs or other tokens, this is not what is seen in practice, so it is not
reflected in these examples.

---------------------
fallback=<expression>
---------------------

The fallback instruction provides a way for default values to be emitted for a path that
was not matched. Fallbacks are useful when several different rules produce values for the
same resource bundle. In this case the output path produced by one rule can be used as
the "key" for any unmatched rules with fallback values (to "fill in the gaps").

Consider the two rules which can emit the same resource bundle path:

//ldml/numbers/currencies/currency[@type="(%W)"]/symbol
   ; /Currencies/$1 ; fallback=$1
//ldml/numbers/currencies/currency[@type="(%W)"]/displayName
   ; /Currencies/$1 ; fallback=$1

These rules, if both matched, will produce two values for the same resource bundle path.
Consider the CLDR values:

//ldml/numbers/currencies/currency[@type="USD"]/symbol ==> "$"
//ldml/numbers/currencies/currency[@type="USD"]/displayName ==> "US Dollar"

After matching both of these paths, the values for the resource bundle "/Currencies/USD"
will be the array { "$", "US Dollar" }.

However, if only one value were present to be converted, the converter could use the
matched path "/Currencies/XXX" and infer the missing fallback value, ensuring that the
output array (if it was emitted at all) was always two values.

Note that in order for this to work, the fallback value must be derivable only from the
matched path. E.g.
it cannot contain arguments that are not also present in the matched
path, and obviously cannot reference the "{value}" at all. Thus the following would not
be permitted:

//ldml/foo/bar[@type="(%W)"][@region=(%A)] ; /Foo/$1 ; fallback=$2

However the fallback value can reference existing CLDR or resource bundle paths (expected
to be present from other rules). For example:
  fallback=/weekData/001:intvector[0]
or:
  fallback=//ldml/numbers/symbols[@numberSystem="%D"]/decimal

The latter case is especially complex because it also uses the "dynamic" argument:
  %D=//ldml/numbers/defaultNumberingSystem

So determining the resulting value will require:
1) resolving "//ldml/numbers/defaultNumberingSystem" to, for example, "arab"
2) looking up the value of "//ldml/numbers/symbols[@numberSystem="arab"]/decimal"

-----------------
base_xpath=<path>
-----------------

The base_xpath instruction allows a rule to specify a proxy path which is used in place of
the originally matched path in the returned result. This is a useful hack for cases where
values are derived from information in a path prefix.

Because path matching for transformation happens only on full paths, it is possible that
several distinct CLDR paths might effectively generate the same result if they share the
same prefix (i.e. paths in the same "sub hierarchy" of the CLDR data).

If this happens, then you end up generating "the same" result from different paths. To
fix this, a "surrogate" CLDR path can be specified as a proxy for the source path,
allowing several results to appear to have come from the same source, which results in
deduplication of the final value.

For example, the two rules:

//supplementalData/territoryInfo/territory[...][@writingPercent="(%N)"][@populationPercent="(%N)"][@officialStatus="(%W)"](?:[@references="%W"])?
   ; /territoryInfo/$1/territoryF:intvector ; values=&exp($2) &exp($3,-2) &exp($4) ; base_xpath=//supplementalData/territoryInfo/territory[@type="$1"]

//supplementalData/territoryInfo/territory[...][@writingPercent="(%N)"][@populationPercent="(%N)"](?:[@references="%W"])?
   ; /territoryInfo/$1/territoryF:intvector ; values=&exp($2) &exp($3,-2) &exp($4) ; base_xpath=//supplementalData/territoryInfo/territory[@type="$1"]

produce the same results for different paths (with or without the "officialStatus"
attribute) but only one such result is desired. By specifying the same base_xpath on
both rules, the conversion logic can deduplicate these to produce only one result.

When using base_xpath, it is worth noting that:
1) Base xpaths must be valid "distinguishing" paths (but are never matched to any rule).
2) Base xpaths can use arguments to achieve the necessary level of uniqueness.
3) Rules which share the same base xpath must always produce the same values.

Note however that this is still very much a hack: since two rules are responsible
for generating the same result, there is no well defined "line number" to use for ordering
of values. Thus this mechanism should only be used for rules which produce "single"
values, and must not be used in cases where the ordering of values in arrays is important.

This mechanism only exists because there is currently no mechanism for partial matching
or a way to match one path against multiple rules.

-----
group
-----

The "group" instruction should be considered a "last resort" hack for controlling value
grouping, in cases where "hidden labels" are not suitable (see below).

==============================
Value Arrays and Hidden Labels
==============================

In the simplest case, one rule produces one or more output path/values per matched CLDR
value (i.e. one-to-one or one-to-many).
If that happens, then output ordering of the
resource bundle paths is just the natural resource bundle path ordering.

However it is also possible for several rules to produce values for a single output path
(i.e. many-to-one). When this happens there are some important details about how results
are grouped and ordered.

------------
Value Arrays
------------

If several rules produce results for the same resource bundle path, the values produced
by the rules are always ordered according to the order of the rules in the configuration
file (and it is best practice to group any such rules together for clarity).

If each rule produces multiple values, then depending on grouping, those values can either
be concatenated together in a single array or grouped individually to create an array
of arrays.

In the example below, there are four rules producing values for the same path:

//.../firstDay[@day="(%W)"][@territories="(%W)"]     ; /weekData/$2:intvector ; values=&day_number($1)
//.../minDays[@count="(%N)"][@territories="(%W)"]    ; /weekData/$2:intvector ; values=$1
//.../weekendStart[@day="(%W)"][@territories="(%W)"] ; /weekData/$2:intvector ; values=&day_number($1) 0
//.../weekendEnd[@day="(%W)"][@territories="(%W)"]   ; /weekData/$2:intvector ; values=&day_number($1) 86400000

The first two rules produce one value each, and the last two produce two values each. This
results in the resource bundle "/weekData/xxx:intvector" having a single array consisting
of six values. In the real configuration, these rules also use fallback instructions to
ensure that the resulting array of values is always six values, even if some CLDR paths are
not present.

-------------
Hidden Labels
-------------

Sometimes rules should produce separate "sub-arrays" of values, rather than having all the
values appended to a single array.
Consider the following path/value pairs:

x/y: a
x/y: b
x/y: c

Which produce the resource bundle "x/y" with three values:

x{
    y{
        "a",
        "b",
        "c"
    }
}

Now suppose we want to make a resource bundle where the values are grouped into their
own sub-array:

x{
    y{
        { "a", "b", "c" }
    }
}

We can think of this as coming from the path/value pairs:

x/y/-: a
x/y/-: b
x/y/-: c

where to represent the sub-array we introduce the idea of an empty path element '-'.

In a transformation rule, these "empty elements" are represented as "hidden labels", and look
like "<some-label>". They are treated as "normal" path elements for purposes of ordering and
grouping, but are treated as empty when the paths are written to the ICU data files.

For example the rule:

//.../currencyCodes[@type="(%W)"][@numeric="(%N)"].* ; /codeMappingsCurrency/<$1> ; values=$1 $2

Generates a series of grouped, 2-element sub-arrays split by the captured type attribute.

  codeMappingsCurrency{
    { type-1, numeric-1 }
    { type-2, numeric-2 }
    { type-3, numeric-3 }
  }

<FIFO> is a special hidden label which is replaced by an incrementing counter when
sorting paths. It ensures that values in the same array are sorted in the order that they
were encountered. However this mechanism imposes a strict requirement that the ordering
of CLDR values to be transformed matches the expected ICU value order, so it should be
avoided where possible to avoid this implicit, subtle dependency. Note that this mechanism
is currently only enabled for the transformation of "supplemental data" and may eventually
be removed.

Hidden labels are a neat solution which permits the generation of sub-array values, but they
don't quite work in every case.
For example if you need to produce a resource bundle with a
mix of values and sub-arrays, like:

x{
    y{
        "a",
        { "b", "c" }
        "d"
    }
}

which can be thought of as coming from the path/value pairs:

x/y: a
x/y/<z>: b
x/y/<z>: c
x/y: d

we find that, after sorting the resource bundle paths, we end up with:

x/y: a
x/y: d
x/y/<z>: b
x/y/<z>: c

which produces the wrong result. This happens because values with different paths are
sorted primarily by their path. In cases like this, where a mix of values and sub-arrays
are required, the "group" instruction can be used instead.

For example:

//ldml/numbers/currencies/currency[@type="(%W)"]/symbol      ; /Currencies/$1
//ldml/numbers/currencies/currency[@type="(%W)"]/displayName ; /Currencies/$1
//ldml/numbers/currencies/currency[@type="(%W)"]/pattern     ; /Currencies/$1 ; group
//ldml/numbers/currencies/currency[@type="(%W)"]/decimal     ; /Currencies/$1 ; group
//ldml/numbers/currencies/currency[@type="(%W)"]/group       ; /Currencies/$1 ; group

Produces resource bundles which look like:

Currencies{
    xxx{
        "<symbol>",
        "<display name>",
        { "<pattern>", "<decimal>", "<group>" }
    }
}
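
The ordering problem that motivates "group" (the x/y and x/y/<z> example earlier in this
section) can be reproduced with a small Python sketch. It is a toy model only: the
function names are hypothetical, paths are assumed to compare element by element, and
all pairs are assumed to share the same parent bundle path.

```python
from itertools import groupby


def is_hidden(element):
    """True for hidden labels such as <z> (assumed syntax: angle brackets)."""
    return element.startswith("<") and element.endswith(">")


def arrange(pairs):
    """Sort path/value pairs by path, then build the value list for one bundle.

    Values whose path ends in a hidden label are collected into a sub-array;
    all other values are appended directly.
    """
    # sorted() is stable, so values with equal paths keep their input order.
    ordered = sorted(pairs, key=lambda pair: pair[0].split("/"))
    result = []
    for path, group in groupby(ordered, key=lambda pair: pair[0]):
        values = [value for _, value in group]
        if is_hidden(path.split("/")[-1]):
            result.append(values)   # hidden label: emit as a sub-array
        else:
            result.extend(values)   # plain path: emit values inline
    return result


pairs = [("x/y", "a"), ("x/y/<z>", "b"), ("x/y/<z>", "c"), ("x/y", "d")]
print(arrange(pairs))
# ['a', 'd', ['b', 'c']]
```

Because "x/y" sorts before "x/y/<z>", the value "d" moves ahead of the sub-array, which
is the wrong result described above and the reason the "group" instruction exists.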