index.md - OpenGrok cross reference for /third_party/icu/docs/userguide/collation/customization/index.md

Lines Matching +full:before +full:- +full:script
1 ---
6 ---
7 <!--
10 -->
16 {: .no_toc .text-delta }
21 ---
26 order](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation)
32 all the languages. In particular, languages that share a script may sort the
35 Therefore, ICU provides a data-driven, flexible, and run-time-customizable
48 syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules). For more
53 after "a" and before "b", and the "a" does not change place. This rule has the
57 ------------ | ---------
70 This is a simple example of a tailoring rule. Tailoring rules consist of
85 ------ | ------------- | ----------------------------------
105 …See the [LDML collation rule syntax](http://www.unicode.org/reports/tr35/tr35-collation.html#Rules)
110 `&a <* bcd-gp-s` for `&a < b < c < d < e < f < g < p < q < r < s`.
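
The range shorthand above can be illustrated with a small sketch. This is not ICU code: `expand_star_list` and `expand_star_rule` are hypothetical helper names, and the sketch assumes a single reset, a single starred operator, and no quoting or escaping. It only shows how a list such as `bcd-gp-s` unfolds into individual relations.

```python
import re


def expand_star_list(chars: str) -> list[str]:
    """Expand a starred list such as 'bcd-gp-s' into single characters.

    A '-' between two characters denotes the code point range from the
    character before it to the character after it (toy model only).
    """
    out: list[str] = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch == "-" and out and i + 1 < len(chars):
            start = ord(out[-1]) + 1      # range start is already in `out`
            end = ord(chars[i + 1])
            if end < start:
                raise ValueError(f"empty or descending range before {chars[i + 1]!r}")
            out.extend(chr(cp) for cp in range(start, end + 1))
            i += 2                        # skip '-' and the range end
        else:
            out.append(ch)
            i += 1
    return out


def expand_star_rule(rule: str) -> str:
    """Rewrite '&a <* bcd-gp-s' as the equivalent chain of plain relations."""
    m = re.fullmatch(r"&\s*(\S+)\s*(<{1,4}|=)\*\s*(\S+)", rule.strip())
    if m is None:
        raise ValueError("expected a rule of the form '&X <* list'")
    reset, op, star_list = m.groups()
    relations = " ".join(f"{op} {c}" for c in expand_star_list(star_list))
    return f"&{reset} {relations}"


print(expand_star_rule("&a <* bcd-gp-s"))
```

The printed rule is the long form quoted in the line above.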
113 Orderings](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings).
128     **''**, both inside and outside single-quote-escaped strings.
139 --------------- | --------------
152 letter that gets sorted after "Đ" and before "E" ("Đ" sorts as a base letter
158 -------- | ---------
174 --- | ----------------
197 options/settings](http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options).
205 - **`[alternate non-ignorable]`**
206 - `[alternate shifted]`
213 - **`[maxVariable punct]`**
214 - `[maxVariable space]`
223 - `& X < [variable top]`
231 - **`[normalization off]`**
232 - `[normalization on]`
238 - `[strength 1]`
239 - `[strength 2]`
240 - **`[strength 3]`**
241 - `[strength 4]`
242 - `[strength I]`
247 - `[backwards 2]`
253 - **`[caseLevel off]`**
254 - `[caseLevel on]`
262 - **`[caseFirst off]`**
263 - `[caseFirst upper]`
264 - `[caseFirst lower]`
267 upper, causes upper case to sort before lower case. If set to lower, lower case
268 will sort before upper case. Useful for locales that have an already supported
272 - **`[numericOrdering off]`**
273 - `[numericOrdering on]`
280 - **`[hiraganaQ off]`**
281 - `[hiraganaQ on]`
285 all the other non-variable code points. Strength must be greater than or equal to
293 - `[suppressContractions [Љ-ґ]]`
295 Removes context-sensitive mappings (contractions and prefix/context-before mappings)
306 - `[optimize [Ά-ώ]]`
317 - `[reorder Grek Hani space]` 
320 non-script blocks (space, punctuation, symbol, currency, and digit). The default
323 ----
333 backwards-secondary ordering:
351 ---------------------------------------- | ----------
357 Unicode scripts not mentioned ("others") |Zzzz (= Unknown script)
359 In addition, ISO **4-letter script codes** can be used. Codes for scripts that
360 do not have Unicode characters (according to the Unicode Script property values)
363 Limitations of ICU 4.8-52: (Except `Kore` is still not usable because it refers
364 to multiple scripts that do not sort primary-equal.)
366 *   For Chinese, use script code `Hani`, *not* `Hans` or `Hant`.
375 For an introduction and examples see the section “Script Reordering” in the
381 In ICU 4.8-54, not every script could be reordered independently. CLDR and ICU
385 script that is not Recommended always moved together with the Recommended Script
387 with Greek, etc.) ICU allowed any one script of a (Recommended Script +
388 DUCET-following) group in the `[reorder]` list, moving the whole set of scripts
392 Beginning with ICU 55, scripts only reorder together if they are primary-equal,
397 The special code Zzzz (= Unknown script = `UScript.UNKNOWN` =
398 `Collator.ReorderCodes.OTHERS` = "others") stands for any script that is not
417 spec](http://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering).
423 #### Order before
425 - Syntax: `[before 1|2|3]`
426 - Example: `&[before 2]a<ā<á<ǎ<à`
428 Enables users to order characters **before** a given character. In UCA 3.0, the
430 for hour nine) and makes accented 'a' letters sort before 'a'. Accents are often
431 used to indicate the intonations in Pinyin. In this case, the non-accented
436 - Syntax: `/`
437 - Example: `æ/e`
445 - Syntax: `|`
446 - Example: `a|b`
458 - Syntax: `[top]`
459 - Example: `&[top] < a < b < c …`
474 `[before]` reset option to position before these sections.
477 ------------------------- | ----------------- | ------------
482 first primary ignorable   | `[, 87, 05]`      | Mostly for non-spacing combining marks.
483 last primary ignorable    | `[, E1 B1, 05]`   | Currently this value points to a non-existing code …
484 first variable            | `[05 07, 05, 05]` | The lowest CE that is not primary-ignorable. (see b…
499 The script reordering implementation assumes that CEs in this section
500 are for "Hani" script characters.
510 makes them sort after all other non-tailored code points except for U+FFFD and U+FFFF.
515 UCA 6.3/CLDR 24/ICU 52 and later map U+FFFD to just before U+FFFF.
517 <http://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights>
519 Before ICU 4.6, U+FFFF mapped to a completely ignorable CE, and `[last trailing]`
522 Not all of the indirect-positioning anchors are useful. Most of the 'first'
523 elements should be used with the `[before]` directive, in order to make sure
524 that your tailoring will sort before an interesting section.
540 letter that has primary-level sorting after 'z'. However, in Swedish and some
542 tertiary-level difference from the letters "th" and "TH" respectively. This is
546 --- | --------------------
563 large number of contractions is a performance burden on the commonly-used base
575 &[before 3]ァ <<< ァ|ー = ｧ|ー = ぁ|ー
581 &[before 3]ァー <<< ァー = ｧー = ぁー
627 ------- | ------------------ | -------------- | ------
658 ------ | --------------------------------------------------------- | ----------
663 5      | `& a < b < c < d` `& [before 1] c < m`                    | `& a < b < m < c < d`
664 6      | `& a < b <<< c << d <<< e` `& [before 3] e <<< x`         | `& a < b <<< c << d <<< x <<< …
665 7      | `& a < b <<< c << d <<< e` `& [before 2] e <<< x`         | `& a < b <<< c <<< x << d <<< …
666 8      | `& a < b <<< c << d <<< e` `& [before 1] e <<< x`         | `& a <<< x < b <<< c << d <<< …
667 9      | `& a < b <<< c << d <<< e <<< f < g` `& [before 1] g < x` | `& a < b <<< c << d <<< e <<< …
677 If there is a `[before N]` on the reset, then the reset character is
678 effectively replaced by the item that would be before it, either in a previous
679 tailoring (if the letter occurs in one - see 5) or in the UCA. The N determines
680 the 'distance' before, based on the strength of the difference (see 6-8).
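
The effect of `[before N]` described here can be sketched with a toy list model. This is an illustration only, not ICU's implementation: the names `parse_chain`, `insert_before`, and `format_chain` are hypothetical, the model handles one flat chain of single-character items, and it assumes that a new item lands just before the next item of the same or greater strength. Under those assumptions it reproduces row 5 and the visible parts of rows 6 to 8 of the table above.

```python
# Each entry is (strength of the relation to the previous item, item).
# The first entry is the chain's anchor and carries strength 0.
Chain = list[tuple[int, str]]


def parse_chain(rule: str) -> Chain:
    """Parse '& a < b <<< c' into [(0, 'a'), (1, 'b'), (3, 'c')]."""
    tokens = rule.replace("&", "").split()
    chain: Chain = [(0, tokens[0])]
    for op, item in zip(tokens[1::2], tokens[2::2]):
        chain.append((len(op), item))     # '<' -> 1, '<<' -> 2, '<<<' -> 3
    return chain


def format_chain(chain: Chain) -> str:
    return "& " + chain[0][1] + "".join(f" {'<' * s} {item}" for s, item in chain[1:])


def insert_before(chain: Chain, n: int, target: str, strength: int, new_item: str) -> Chain:
    """Model '& [before n] target <op> new_item' on an existing chain."""
    pos = [item for _, item in chain].index(target)
    # Walk back to the first item that differs from its predecessor at level n or stronger.
    while pos > 0 and chain[pos][0] > n:
        pos -= 1
    if pos == 0:
        raise ValueError("nothing precedes the target in this toy chain")
    reset = pos - 1                       # the item "that would be before it"
    # Insert after the reset item, before the next relation of the same or greater strength.
    ins = reset + 1
    while ins < len(chain) and chain[ins][0] > strength:
        ins += 1
    return chain[:ins] + [(strength, new_item)] + chain[ins:]


base = parse_chain("& a < b <<< c << d <<< e")
for n in (3, 2, 1):
    print(n, format_chain(insert_before(base, n, "e", 3, "x")))
```

A larger N keeps the new item closer to the reset character; a smaller N moves it in front of a whole group of items that are equal to the target at level N.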
711 after the reset is decomposable. Since all rules are converted to NFD before
712 they are interpreted, this can result in contractions that the rule-writer might
769 ------------- | ------------- | -------------
776 ICU currently always pre-compiles the expansion into an internal format (a list
783 --------------- | ------------- | -------------
791 Since the pre-compiled expansions are a huge performance gain, we will probably
796 pre-compile time, these need to be done at runtime.
822 pre-processed so that there is no need to perform normalization on strings that
826 [`[[:^lccc=0:]&[:toNFD=/../:]]`](http://www.unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3A%5…
836 -------------------- | ------------- | -------------
851 text is represented with the pre-composed character à, but **will** match if
856 ------------ | ------------- | -------------
878 characters in the pre-compiled UCA table. Normally there is a gap of one. There
888 ------------------ | -------------------------- | -------------
889 `& \u2010`         | -                          | -
894 &nbsp;             | -b                         | b
895 &nbsp;             | xb                         | -b
911 tertiary differences (such as between circled and non-circled letters, or
912 between half-width and full-width katakana). The case values are derived
918 only have three special case values: upper, lower, and mixed. All mixed-case
923 ---------- | ------------------------- | -------------
968 CLDR](http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation).
978 *   ICU's root collator implements the CLDR-modified collation element table.
991 *   script reordering will not work
1004 the `--omitCollationRules` option to the relevant `genrb` invocations
1008 …nicode.org/processes/release/tasks/integration#TOC-Verify-that-ICU4C-tests-pass-without-collation-…
1018 [source/data/coll/root.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/coll/roo…
1031 --------- | ----- | -------
1032 …The rules mean: put **b** after **a**, then put **c** after **a** (inserting **before** the **b**).
1039 a value Y, it actually is inserted just before the next item of the same
1061 **superscript-a** ("ª") to be after "**e**", it is necessary to specify the rule
1072 ICU will automatically form expansions whenever a reset is to a multi-character
1075 obvious that the reset is to a multi-character value. For example, `& à<<< d` is