index.md - OpenGrok cross reference for /third_party/icu/docs/userguide/collation/customization/index.md

Lines Matching +full:rules +full:- +full:anchors
1 ---
6 ---
7 <!--
10 -->
16 {: .no_toc .text-delta }
21 ---
26 order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
35 Therefore, ICU provides a data-driven, flexible, and run-time-customizable
46 A tailoring is specified via a string containing a set of rules. ICU implements
48 syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules). For more
57 ------------ | ---------
70 This is a simple example of a tailoring rule. Tailoring rules consist of
71 zero or more rules and zero or more options. There must be at least one rule or
75 Note that the tailoring rules override the UCA ordering. In addition, if a
85 ------ | ------------- | ----------------------------------
91 `&`    | `&Z`          | Instructs ICU to reset at this letter. These rules will be relative to thi…
100 > Rules that use the `;` and `,` notations are still processed by ICU for compatibility;
105 …See the [LDML collation rule syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules)
110 `&a <* bcd-gp-s` for `&a < b < c < d < e < f < g < p < q < r < s`.
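The `<*` shorthand above can be sketched as a plain string expansion. The snippet below is only an illustration of how `bcd-gp-s` unfolds into the long form; `StarListSketch`, `expandList`, and `expandRule` are hypothetical names, not ICU APIs, and the sketch assumes BMP characters with `-` appearing only between two range endpoints. ICU's own rule parser handles this internally and covers cases (escaping, supplementary code points) that this sketch does not.

```java
import java.util.ArrayList;
import java.util.List;

public class StarListSketch {
    /**
     * Expands the character list that follows "<*", e.g. "bcd-gp-s",
     * into single characters. Hypothetical helper, not part of ICU.
     * Assumes BMP characters; '-' between two characters denotes a range.
     */
    static List<String> expandList(String list) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (c == '-' && !out.isEmpty() && i + 1 < list.length()) {
                // The range starts at the previously emitted character
                // and ends at the character after the '-'.
                char start = out.get(out.size() - 1).charAt(0);
                char end = list.charAt(++i);
                for (int x = start + 1; x <= end; x++) {
                    out.add(String.valueOf((char) x));
                }
            } else {
                out.add(String.valueOf(c));
            }
        }
        return out;
    }

    /** Builds the long-form primary-difference rule for a reset and a star list. */
    static String expandRule(String reset, String list) {
        return "&" + reset + " < " + String.join(" < ", expandList(list));
    }

    public static void main(String[] args) {
        // Prints the long form of "&a <* bcd-gp-s".
        System.out.println(expandRule("a", "bcd-gp-s"));
    }
}
```

Running `main` prints `&a < b < c < d < e < f < g < p < q < r < s`, matching the equivalence stated above.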
113 Orderings](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).
115 ### Escaping Rules
117 Most of the characters can be used as parts of rules. However, whitespace
120 rules, they need to be escaped. Escaping can be done in several ways:
128     **''**, both inside and outside single-quote-escaped strings.
139 --------------- | --------------
158 -------- | ---------
174 --- | ----------------
197 options/settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options).
205 - **`[alternate non-ignorable]`**
206 - `[alternate shifted]`
213 - **`[maxVariable punct]`**
214 - `[maxVariable space]`
223 - `& X < [variable top]`
231 - **`[normalization off]`**
232 - `[normalization on]`
238 - `[strength 1]`
239 - `[strength 2]`
240 - **`[strength 3]`**
241 - `[strength 4]`
242 - `[strength I]`
247 - `[backwards 2]`
253 - **`[caseLevel off]`**
254 - `[caseLevel on]`
262 - **`[caseFirst off]`**
263 - `[caseFirst upper]`
264 - `[caseFirst lower]`
272 - **`[numericOrdering off]`**
273 - `[numericOrdering on]`
280 - **`[hiraganaQ off]`**
281 - `[hiraganaQ on]`
285 all the other non-variable code points. Strength must be greater than or equal to
293 - `[suppressContractions [Љ-ґ]]`
295 Removes context-sensitive mappings (contractions and prefix/context-before mappings)
297 current set of rules: It removes mappings from the root collation as well as
298 from previous rules.
306 - `[optimize [Ά-ώ]]`
317 - `[reorder Grek Hani space]` 
320 non-script blocks (space, punctuation, symbol, currency, and digit). The default
323 ----
333 backwards-secondary ordering:
351 ---------------------------------------- | ----------
359 In addition, ISO **4-letter script codes** can be used. Codes for scripts that
363 Limitations of ICU 4.8-52: (Except `Kore` is still not usable because it refers
364 to multiple scripts that do not sort primary-equal.)
381 In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU
388 DUCET-following) group in the `[reorder]` list, moving the whole set of scripts
392 Beginning with ICU 55, scripts only reorder together if they are primary-equal,
413 was created from resource data or from rules. The DEFAULT code must be the sole
417 spec](http://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering).
425 - Syntax: `[before 1|2|3]`
426 - Example: `&[before 2]a<ā<á<ǎ<à`
431 used to indicate the intonations in Pinyin. In this case, the non-accented
436 - Syntax: `/`
437 - Example: `æ/e`
445 - Syntax: `|`
446 - Example: `a|b`
458 - Syntax: `[top]`
459 - Example: `&[top] < a < b < c …`
472 (CE). Similar to the reset anchor `top`, these reset anchors allow for positioning of the
477 ------------------------- | ----------------- | ------------
482 first primary ignorable   | `[, 87, 05]`      | Mostly for non-spacing combining marks.
483 last primary ignorable    | `[, E1 B1, 05]`   | Currently this value points to a non-existing code …
484 first variable            | `[05 07, 05, 05]` | The lowest CE that is not primary-ignorable. (see b…
510 makes them sort after all other non-tailored code points except for U+FFFD and U+FFFF.
517 <http://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights>
522 Not all of the indirect-positioning anchors are useful. Most of the 'first'
540 letter that has primary-level sorting after 'z'. However, in Swedish and some
542 tertiary-level difference from the letters "th" and "TH" respectively. This is
546 --- | --------------------
563 large number of contractions is a performance burden on the commonly-used base
624 from the specified rules:
627 ------- | ------------------ | -------------- | ------
650 Redundant tailoring rules are removed, with later rules "winning". The strengths
651 around the removed rules are also fixed.
655 The following table summarizes the effects of different redundant rules.
658 ------ | --------------------------------------------------------- | ----------
679 tailoring (if the letter occurs in one - see 5) or in the UCA. The N determines
680 the 'distance' before, based on the strength of the difference (see 6-8).
711 after the reset is decomposable. Since all rules are converted to NFD before
712 they are interpreted, this can result in contractions that the rule-writer might
756 Given the rules
768 Rules         | Desired Order | Current Order
769 ------------- | ------------- | -------------
776 ICU currently always pre-compiles the expansion into an internal format (a list
782 Rules           | Desired Order | Current Order
783 --------------- | ------------- | -------------
791 Since the pre-compiled expansions are a huge performance gain, we will probably
796 pre-compile time, these need to be done at runtime.
809 Suppose that someone had the rules:
822 pre-processed so that there is no need to perform normalization on strings that
826 [`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5…
835 Rules                | Desired Order | Current Order
836 -------------------- | ------------- | -------------
851 text is represented with the pre-composed character à, but **will** match if
855 Rules        | Desired Order | Current Order
856 ------------ | ------------- | -------------
878 characters in the pre-compiled UCA table. Normally there is a gap of one. There
887 Rules              | Desired Order SHIFTED = ON | Current Order
888 ------------------ | -------------------------- | -------------
889 `& \u2010`         | -                          | -
894 &nbsp;             | -b                         | b
895 &nbsp;             | xb                         | -b
904 > exclusive in the rules.
911 tertiary differences (such as between circled and non-circled letters, or
912 between half-width and full-width katakana). The case values are derived
913 directly from the Unicode character properties, and not set by the rules.
918 only have three special case values: upper, lower, and mixed. All mixed-case
922 Rules      | Desired Order UPPER_FIRST | Current Order
923 ---------- | ------------------------- | -------------
932 All of the collation rules are additive; that is, they override what any
933 previous rule expressed. That means that you can build on existing rules for
934 given locales. Here is an example of this, which fetches the rules for a
943     String rules = col.getRules();
945     RuleBasedCollator col2 = new RuleBasedCollator(rules + myRules);
953     assertEquals("Customized rules with %", expected, actual);
956     throw new IllegalArgumentException("Failed to create customized rules", e);
960 The root collator has an empty rules string (`getRules()` returns `""`): Any
961 collator's tailoring rules string defines how a collator *differs* from the root
962 collator, and the tailoring rules string was the input for building the
968 CLDR](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation).
971 `delta=UCOL_FULL_RULES` (C/C++) or `fullrules=true` (Java), return "full rules"
972 which are a concatenation of the "UCA rules" and the collator's tailoring. The
973 "UCA rules" are published as UCA_Rules.txt in every [UCA
976 *   "UCA rules" is a historical misnomer. The UCA specifies an Algorithm which
978 *   ICU's root collator implements the CLDR-modified collation element table.
979     The "UCA rules" returned from ICU functions are equivalently modified rules
982 The "UCA rules" are an *approximation* of the root collator's sort order, but
996 The "full rules" are almost never used, or useful, at runtime. They are included
1004 the `--omitCollationRules` option to the relevant `genrb` invocations
1008 …nicode.org/processes/release/tasks/integration#TOC-Verify-that-ICU4C-tests-pass-without-collation-…
1010 If the tailoring rules are needed but the 150kB or so of "UCA rules" are not,
1018 [source/data/coll/root.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/roo…
1030 Rules     | Order | Comment
1031 --------- | ----- | -------
1032 `& a < b` | a     | The rules mean: put **b** after **a**, then put **c** after **a** (inserting **…
1061 **superscript-a** ("ª") to be after "**e**", it is necessary to specify the rule
1072 ICU will automatically form expansions whenever a reset is to a multi-character
1075 obvious that the reset is to a multi-character value. For example, `& à<<< d` is