# README for configuration files used by org.unicode.icu.tool.cldrtoicu.regex.RegexTransformer.
#
# © 2019 and later: Unicode, Inc. and others.
#
# CLDR data files are interpreted according to the LDML specification (http://unicode.org/reports/tr35/)
# For terms of use, see http://www.unicode.org/copyright.html

======
Basics
======

The RegexTransformer class converts CLDR paths and values to ICU Resource Bundle paths
and values, based on a set of transformation rules typically loaded from a text file
(e.g. ldml2icu_locale.txt).

The basic format of transformation rules is:
  <path-specification> ; <resource-bundle-specification> [; <instruction>=<argument>]*

A simple example of a transformation rule is:

  //ldml/localeDisplayNames/keys/key[@type="(%A)"] ; /Keys/$1

which transforms CLDR values whose path matches the path specification, and emits:
* A resource bundle path "/Keys/xx", where 'xx' is the captured type attribute.
* A resource bundle value, which is just the CLDR value's base value.

A path specification can be thought of as a regular expression which matches the CLDR
path and can capture some element names or attribute values; however unlike a regular
expression, the '[',']' characters are treated as literals, similar to XPath expressions.

If a single CLDR value should produce more than one resource bundle path/value, then
it should be written:

  <path-specification>
     ; <resource-bundle-1-specification> [; <instruction> ]*
     ; <resource-bundle-2-specification> [; <instruction> ]*
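
The rule layout can be read mechanically. The Python sketch below is illustrative only and is
not the parser used by RegexTransformer. The helper name parse_rule and the returned data
shape are hypothetical, and it assumes that ';' never occurs inside a field and that every
resource bundle specification starts with '/'.

```python
def parse_rule(text):
    """Split one rule into (path_spec, results).

    Hypothetical helper: assumes ';' never occurs inside a field and that
    a field starting with '/' is a resource bundle specification.
    """
    fields = [field.strip() for field in text.split(";")]
    path_spec, results = fields[0], []
    for field in fields[1:]:
        if field.startswith("/"):
            # A new resource bundle specification starts a new result.
            results.append({"rb_path": field, "instructions": {}})
        else:
            # Anything else is an instruction for the most recent result;
            # an instruction without '=' gets an empty argument.
            name, _, argument = field.partition("=")
            results[-1]["instructions"][name.strip()] = argument.strip()
    return path_spec, results


# A multi-result rule taken from this README.
rule = """//supplementalData/timeData/hours[@allowed="(%W)"][@preferred="(%W)"][@regions="(%W)"]
  ; /timeData/$3/allowed   ; values=$1
  ; /timeData/$3/preferred ; values=$2"""
path_spec, results = parse_rule(rule)
```

Here "results" holds one entry per resource bundle specification, each with its own
instructions, which mirrors the one-to-many form shown above.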

=====================
Argument Substitution
=====================

Before a rule can be matched, any %-variables must be substituted. These are defined
in the same configuration file as the rules, and look something like:
  %W=[\w\-]++
or:
  %D=//ldml/numbers/defaultNumberingSystem

The first case can be thought of as just a snippet of regular expression (in this case
something that matches hyphen separated words) and, importantly, here '[' and ']' are
treated as regular expression metacharacters. These arguments are static and will be
substituted exactly as-is into the regular expression to be used for matching.

The second case (used exactly once) is a dynamic argument which references a CLDR value
in the set of data being transformed. This is simply indicated by the fact that it starts
with '//'. This path is resolved and the value is substituted just prior to matching.

Variable names are limited to a single upper-case letter (A-Z).
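
To make the difference between literal and metacharacter brackets concrete, here is a minimal
Python sketch of static substitution. It is an approximation under stated assumptions, not
the RegexTransformer implementation: compile_path_spec is a hypothetical name, dynamic ('//')
arguments are not handled, and %W is defined with a plain '+' instead of the possessive '++'
because Python only accepts possessive quantifiers from version 3.11.

```python
import re


def compile_path_spec(spec, variables):
    """Turn a path specification into a compiled regular expression."""
    # Brackets written in the specification are literals, as in XPath.
    pattern = spec.replace("[", r"\[").replace("]", r"\]")
    # Static %X variables are inserted as-is afterwards, so any brackets
    # they contain keep their regular expression meaning.
    pattern = re.sub(r"%([A-Z])", lambda m: variables[m.group(1)], pattern)
    return re.compile(pattern)


variables = {"W": r"[\w\-]+"}  # assumed stand-in for %W=[\w\-]++
regex = compile_path_spec(
    '//ldml/numbers/currencies/currency[@type="(%W)"]/symbol', variables
)
match = regex.fullmatch('//ldml/numbers/currencies/currency[@type="USD"]/symbol')
captured_type = match.group(1)  # the value later referenced as $1
```

Escaping the specification before substituting is what keeps the two bracket roles apart.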

===========================
Implicit Argument Splitting
===========================

This is a (somewhat non-obvious) mechanism which allows for a single rule to generate
multiple results from a single input path when an argument is a list of tokens.

Consider the rule:

//supplementalData/timeData/hours[@allowed="(%W)"][@preferred="(%W)"][@regions="(%W)"]
  ; /timeData/$3/allowed   ; values=$1
  ; /timeData/$3/preferred ; values=$2

where the "regions" attribute (which is captured as '$3') contains a whitespace separated
list of region codes (e.g. "US GB AU NZ"). In this case the rule is applied once for each
region, producing paths such as "/timeData/US/allowed" or "/timeData/NZ/preferred". Note
that there is no explicit instruction to do this; it just happens.

The rule is that the first unquoted argument in the resource bundle path is always treated
as splittable.

To suppress this behaviour, the argument must be quoted (e.g. /timeData/"$3"/allowed). Now,
if there were another following unquoted argument, that would become implicitly splittable
(but only one argument is ever splittable).
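
The splitting rule can be sketched in a few lines of Python. This is an illustration of the
behaviour described above, not the actual implementation: expand_rb_path is a hypothetical
helper, it assumes captured values contain no '/' characters, and it assumes the quotes
around a quoted argument are dropped from the emitted path.

```python
import re

ARGUMENT = re.compile(r"\$(\d+)")


def substitute(text, args):
    """Replace $N with the N-th captured value (args[0] is $1)."""
    return ARGUMENT.sub(lambda m: args[int(m.group(1)) - 1], text)


def expand_rb_path(rb_path, args):
    """Return one resource bundle path per token of the splittable argument."""
    elements = rb_path.split("/")
    # Only the first unquoted element containing an argument is splittable.
    split_at = next(
        (i for i, e in enumerate(elements)
         if ARGUMENT.search(e) and not e.startswith('"')),
        None,
    )
    if split_at is None:
        tokens = [None]  # nothing to split: exactly one path is produced
    else:
        tokens = substitute(elements[split_at], args).split()
    paths = []
    for token in tokens:
        parts = [
            token if i == split_at else substitute(e, args).strip('"')
            for i, e in enumerate(elements)
        ]
        paths.append("/".join(parts))
    return paths


captured = ["h", "h", "US GB AU NZ"]  # illustrative values for $1, $2, $3
split_paths = expand_rb_path("/timeData/$3/allowed", captured)
quoted_paths = expand_rb_path('/timeData/"$3"/allowed', captured)
```

With these inputs, "split_paths" contains one path per region, while "quoted_paths" contains
a single path whose third element is the whole unsplit list.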

============
Instructions
============

Additional instructions can be supplied to control value transformation and specify fallback
values. The set of instructions is:
* values:     The most common instruction which defines how values are transformed.
* fallback:   Defines a fallback value to be used if this rule was not matched.

There are two other special case instructions which should (if at all possible) not be used,
and might be removed at some point:
* group:      Causes values to be grouped as sub-arrays for very specific use cases
              (prefer using "Hidden Labels" where possible).
* base_xpath: Allows deduplication of results between multiple different rules (this is a
              hack to work around limitations in how matching is performed).

-------------------
values=<expression>
-------------------

The "values" instruction defines an expression whose evaluated result becomes the output
resource bundle value(s). Unless quoting is present, this evaluated expression is split
on whitespace and can become multiple values in the resulting resource bundle.

Examples:

* values=$1 $2 $3

  Produces three separate values in the resource bundle for the first three captured
  arguments.

* values="$1 $2" $3

  Produces two values in the resource bundle, the first of which is two captured values
  separated by a space character.

* values={value}

  Substitutes the CLDR value, but then performs whitespace splitting on the result. This
  differs from the behaviour when no "values" instruction is present (which does not
  split the results).

* values="{value}" $1

  Produces two values, the first of which is the unsplit CLDR value, and the second is a
  captured argument.

* values=&func($1, {value})

  Invokes a transformation function, passing in a captured argument and the CLDR value,
  and the result is then split. The set of functions available to a transformer is
  configured when it is created.

Note that in the above examples, it is assumed that the $N arguments do not contain spaces.
If they did, it would result in more output values. To be strict about things, every value
which should not be split must be quoted (e.g. values="$1" "$2" "$3") but since captured
values are often IDs or other tokens, this is not what is seen in practice, so it is not
reflected in these examples.
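
The quoting and splitting behaviour in the examples above can be approximated with the Python
sketch below. It is a simplified model, not the real evaluator: evaluate_values is a
hypothetical name, &func(...) calls are not handled, and it assumes substituted values never
contain double quotes.

```python
import re

ARGUMENT = re.compile(r"\$(\d+)")
# A token is either a double-quoted run (kept whole) or a run of non-spaces.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def evaluate_values(expression, args, cldr_value):
    """Substitute $N and {value}, then split on unquoted whitespace."""
    text = ARGUMENT.sub(lambda m: args[int(m.group(1)) - 1], expression)
    text = text.replace("{value}", cldr_value)
    values = []
    for match in TOKEN.finditer(text):
        quoted, bare = match.groups()
        values.append(quoted if quoted is not None else bare)
    return values


three = evaluate_values("$1 $2 $3", ["a", "b", "c"], "")
two = evaluate_values('"$1 $2" $3', ["a", "b", "c"], "")
split_value = evaluate_values("{value}", [], "US Dollar")
unsplit_value = evaluate_values('"{value}" $1', ["USD"], "US Dollar")
```

Because substitution happens before splitting, an unquoted $N that contains spaces yields
several values, which is the caveat noted above.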

---------------------
fallback=<expression>
---------------------

The fallback instruction provides a way for default values to be emitted for a path that
was not matched. Fallbacks are useful when several different rules produce values for the
same resource bundle. In this case the output path produced by one rule can be used as
the "key" for any unmatched rules with fallback values (to "fill in the gaps").

Consider the two rules which can emit the same resource bundle path:

//ldml/numbers/currencies/currency[@type="(%W)"]/symbol
     ; /Currencies/$1 ; fallback=$1
//ldml/numbers/currencies/currency[@type="(%W)"]/displayName
     ; /Currencies/$1 ; fallback=$1

These rules, if both matched, will produce two values for the same resource bundle path.
Consider the CLDR values:

//ldml/numbers/currencies/currency[@type="USD"]/symbol      ==> "$"
//ldml/numbers/currencies/currency[@type="USD"]/displayName ==> "US Dollar"

After matching both of these paths, the values for the resource bundle "/Currencies/USD"
will be the array { "$", "US Dollar" }.

However, if only one value were present to be converted, the converter could use the
matched path "/Currencies/XXX" and infer the missing fallback value, ensuring that the
output array (if it was emitted at all) was always two values.
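
The "fill in the gaps" idea can be sketched as follows. This Python fragment is only a model
of the behaviour described above: fill_fallbacks is a hypothetical helper, rules are
identified by their position among the rules for one resource bundle path, and the fallback
expressions are assumed to have been evaluated already.

```python
def fill_fallbacks(fallbacks, matched):
    """Build the output array for one resource bundle path.

    fallbacks: evaluated fallback value for each rule, in rule order.
    matched:   {rule_index: value} for the rules that actually matched.
    """
    if not matched:
        # No rule matched, so this path is not emitted at all.
        return []
    return [matched.get(index, fallback) for index, fallback in enumerate(fallbacks)]


# Both currency rules use fallback=$1, which evaluates to "USD" here.
fallbacks = ["USD", "USD"]
both = fill_fallbacks(fallbacks, {0: "$", 1: "US Dollar"})
name_only = fill_fallbacks(fallbacks, {1: "US Dollar"})
```

In the "name_only" case the missing symbol is replaced by its fallback, so the array still
has two values.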

Note that in order for this to work, the fallback value must be derivable only from the
matched path. E.g. it cannot contain arguments that are not also present in the matched
path, and obviously cannot reference the "{value}" at all. Thus the following would not
be permitted:

//ldml/foo/bar[@type="(%W)"][@region="(%A)"] ; /Foo/$1 ; fallback=$2

However the fallback value can reference existing CLDR or resource bundle paths (expected
to be present from other rules). For example:
  fallback=/weekData/001:intvector[0]
or:
  fallback=//ldml/numbers/symbols[@numberSystem="%D"]/decimal

The latter case is especially complex because it also uses the "dynamic" argument:
  %D=//ldml/numbers/defaultNumberingSystem

So determining the resulting value will require:
1) resolving "//ldml/numbers/defaultNumberingSystem" to, for example, "arab"
2) looking up the value of "//ldml/numbers/symbols[@numberSystem="arab"]/decimal"
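
The two steps above can be sketched in Python as shown below. This is a hedged illustration
rather than the converter's logic: resolve_fallback is a hypothetical name, CLDR data is
modelled as a plain dictionary from path to value, and the sample decimal value is
illustrative data only.

```python
def resolve_fallback(fallback_path, variables, cldr_values):
    """Resolve a fallback that references a CLDR path."""
    path = fallback_path
    for name, definition in variables.items():
        if definition.startswith("//"):
            # Step 1: a dynamic argument is itself a CLDR path; look it up
            # and substitute its value for the %-variable.
            path = path.replace("%" + name, cldr_values[definition])
    # Step 2: look up the fully resolved path.
    return cldr_values[path]


variables = {"D": "//ldml/numbers/defaultNumberingSystem"}
cldr_values = {  # sample data for illustration only
    "//ldml/numbers/defaultNumberingSystem": "arab",
    '//ldml/numbers/symbols[@numberSystem="arab"]/decimal': "\u066b",
}
decimal = resolve_fallback(
    '//ldml/numbers/symbols[@numberSystem="%D"]/decimal', variables, cldr_values
)
```

A missing path raises KeyError in this sketch; how the real converter treats a missing
referenced path is not specified here.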

-----------------
base_xpath=<path>
-----------------

The base_xpath instruction allows a rule to specify a proxy path which is used in place of
the originally matched path in the returned result. This is a useful hack for cases where
values are derived from information in a path prefix.

Because path matching for transformation happens only on full paths, it is possible that
several distinct CLDR paths might effectively generate the same result if they share the
same prefix (i.e. paths in the same "sub hierarchy" of the CLDR data).

If this happens, then you end up generating "the same" result from different paths. To
fix this, a "surrogate" CLDR path can be specified as a proxy for the source path,
allowing several results to appear to have come from the same source, which results in
deduplication of the final value.

For example, the two rules:

//supplementalData/territoryInfo/territory[...][@writingPercent="(%N)"][@populationPercent="(%N)"][@officialStatus="(%W)"](?:[@references="%W"])?
    ; /territoryInfo/$1/territoryF:intvector ; values=&exp($2) &exp($3,-2) &exp($4) ; base_xpath=//supplementalData/territoryInfo/territory[@type="$1"]

//supplementalData/territoryInfo/territory[...][@writingPercent="(%N)"][@populationPercent="(%N)"](?:[@references="%W"])?
    ; /territoryInfo/$1/territoryF:intvector ; values=&exp($2) &exp($3,-2) &exp($4) ; base_xpath=//supplementalData/territoryInfo/territory[@type="$1"]

produce the same results for different paths (with or without the "officialStatus"
attribute) but only one such result is desired. By specifying the same base_xpath on
both rules, the conversion logic can deduplicate these to produce only one result.

When using base_xpath, it is worth noting that:
1) Base xpaths must be valid "distinguishing" paths (but are never matched to any rule).
2) Base xpaths can use arguments to achieve the necessary level of uniqueness.
3) Rules which share the same base xpath must always produce the same values.

Note however that this is still very much a hack: since two rules are responsible
for generating the same result, there is no well defined "line number" to use for ordering
of values. Thus this mechanism should only be used for rules which produce "single"
values, and must not be used in cases where the ordering of values in arrays is important.

This mechanism only exists because there is currently no mechanism for partial matching
or a way to match one path against multiple rules.

-----
group
-----

The "group" instruction should be considered a "last resort" hack for controlling value
grouping, in cases where "hidden labels" are not suitable (see below).
==============================
Value Arrays and Hidden Labels
==============================

In the simplest case, one rule produces one or more output path/values per matched CLDR
value (i.e. one-to-one or one-to-many). If that happens, then output ordering of the
resource bundle paths is just the natural resource bundle path ordering.

However it is also possible for several rules to produce values for a single output path
(i.e. many-to-one). When this happens there are some important details about how results
are grouped and ordered.

------------
Value Arrays
------------

If several rules produce results for the same resource bundle path, the values produced
by the rules are always ordered according to the order of the rules in the configuration
file (and it is best practice to group any such rules together for clarity).

If each rule produces multiple values, then depending on grouping, those values can either
be concatenated together in a single array or grouped individually to create an array
of arrays.

In the example below, there are four rules producing values for the same path:

//.../firstDay[@day="(%W)"][@territories="(%W)"]     ; /weekData/$2:intvector ; values=&day_number($1)
//.../minDays[@count="(%N)"][@territories="(%W)"]    ; /weekData/$2:intvector ; values=$1
//.../weekendStart[@day="(%W)"][@territories="(%W)"] ; /weekData/$2:intvector ; values=&day_number($1) 0
//.../weekendEnd[@day="(%W)"][@territories="(%W)"]   ; /weekData/$2:intvector ; values=&day_number($1) 86400000

The first two rules produce one value each, and the last two produce two values each. This
results in the resource bundle "/weekData/xxx:intvector" having a single array consisting
of six values. In the real configuration, these rules also use fallback instructions to
ensure that the resulting array of values is always six values, even if some CLDR paths are
not present.
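
The ordering described above can be modelled with a short Python sketch. It is illustrative
only: build_value_array is a hypothetical helper, each result is tagged with the position of
the rule that produced it, and the day numbers below are made-up sample values rather than
output of the real &day_number function.

```python
def build_value_array(results):
    """Concatenate values for one resource bundle path in rule order.

    results: list of (rule_index, values) pairs, in the order in which the
    CLDR paths happened to be matched.
    """
    array = []
    # Sort by the rule's position in the configuration file, not by match order.
    for _, values in sorted(results, key=lambda result: result[0]):
        array.extend(values)
    return array


# Results for one territory, deliberately listed out of rule order.
results = [
    (3, ["1", "86400000"]),  # weekendEnd:   day number and a constant
    (0, ["2"]),              # firstDay:     day number
    (2, ["7", "0"]),         # weekendStart: day number and a constant
    (1, ["1"]),              # minDays:      count
]
week_data = build_value_array(results)
```

The resulting list has six entries, ordered by rule regardless of the order in which the
CLDR values were encountered.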

-------------
Hidden Labels
-------------

Sometimes rules should produce separate "sub-arrays" of values, rather than having all the
values appended to a single array. Consider the following path/value pairs:

x/y: a
x/y: b
x/y: c

Which produce the resource bundle "x/y" with three values:

x{
  y{
    "a",
    "b",
    "c"
  }
}

Now suppose we want to make a resource bundle where the values are grouped into their
own sub-array:

x{
  y{
    { "a", "b", "c" }
  }
}

We can think of this as coming from the path/value pairs:

x/y/-: a
x/y/-: b
x/y/-: c

where to represent the sub-array we introduce the idea of an empty path element '-'.

In a transformation rule, these "empty elements" are represented as "hidden labels", and look
like "<some-label>". They are treated as "normal" path elements for purposes of ordering and
grouping, but are treated as empty when the paths are written to the ICU data files.

For example the rule:

//.../currencyCodes[@type="(%W)"][@numeric="(%N)"].* ; /codeMappingsCurrency/<$1> ; values=$1 $2

Generates a series of grouped, 2-element sub-arrays split by the captured type attribute.

  codeMappingsCurrency{
    { type-1, numeric-1 }
    { type-2, numeric-2 }
    { type-3, numeric-3 }
  }
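
The effect of a hidden label can be modelled with the Python sketch below. It is a
simplification, not the converter's code: build_arrays is a hypothetical helper, paths are
sorted as plain strings (the real resource bundle path ordering may differ), and only a
hidden label in the final path element is handled.

```python
from itertools import groupby


def build_arrays(pairs):
    """Group (path, value) pairs into arrays, honouring hidden labels.

    A final path element written as <label> takes part in sorting and
    grouping, then is dropped, so its values form one sub-array.
    """
    bundle = {}
    # sorted() is stable, so values with the same path keep their order.
    by_path = sorted(pairs, key=lambda pair: pair[0])
    for path, group in groupby(by_path, key=lambda pair: pair[0]):
        values = [value for _, value in group]
        parent, _, last = path.rpartition("/")
        if last.startswith("<") and last.endswith(">"):
            bundle.setdefault(parent, []).append(values)  # one sub-array
        else:
            bundle.setdefault(path, []).extend(values)  # plain values
    return bundle


pairs = [
    ("/codeMappingsCurrency/<USD>", "USD"),
    ("/codeMappingsCurrency/<USD>", "840"),
    ("/codeMappingsCurrency/<EUR>", "EUR"),
    ("/codeMappingsCurrency/<EUR>", "978"),
]
bundle = build_arrays(pairs)
```

Each distinct hidden label yields one sub-array under the visible path, and the sub-arrays
are ordered by the label even though the label itself is never written out.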

<FIFO> is a special hidden label which is replaced by an incrementing counter when
sorting paths. It ensures that values in the same array are sorted in the order that they
were encountered. However this mechanism imposes a strict requirement that the ordering
of CLDR values to be transformed matches the expected ICU value order, so it should be
avoided where possible to avoid this implicit, subtle dependency. Note that this mechanism
is currently only enabled for the transformation of "supplemental data" and may eventually
be removed.

Hidden labels are a neat solution which permits the generation of sub-array values, but they
don't quite work in every case. For example if you need to produce a resource bundle with a
mix of values and sub-arrays, like:

x{
  y{
    "a",
    { "b", "c" }
    "d"
  }
}

which can be thought of as coming from the path/value pairs:

x/y: a
x/y/<z>: b
x/y/<z>: c
x/y: d

we find that, after sorting the resource bundle paths, we end up with:

x/y: a
x/y: d
x/y/<z>: b
x/y/<z>: c

which produces the wrong result. This happens because values with different paths are
sorted primarily by their path. In cases like this, where a mix of values and sub-arrays
is required, the "group" instruction can be used instead.

For example:

//ldml/numbers/currencies/currency[@type="(%W)"]/symbol      ; /Currencies/$1
//ldml/numbers/currencies/currency[@type="(%W)"]/displayName ; /Currencies/$1
//ldml/numbers/currencies/currency[@type="(%W)"]/pattern     ; /Currencies/$1 ; group
//ldml/numbers/currencies/currency[@type="(%W)"]/decimal     ; /Currencies/$1 ; group
//ldml/numbers/currencies/currency[@type="(%W)"]/group       ; /Currencies/$1 ; group

Produces resource bundles which look like:

Currencies{
  xxx{
     "<symbol>",
     "<display name>",
     { "<pattern>", "<decimal>", "<group>" }
  }
}