---
title: Proposed Collation Additions
---

# Proposed Collation Additions

|   |   |
|---|---|
| Author | Mark Davis, Markus Scherer, Michael Fairley |
| Date | 2009-06-23 |
| Status | Proposal |
| Bugs | *insert linked bug numbers here* |

## Script Reordering

We would like to add script reordering as a new collation setting. This will allow, for example, sorting Greek before Latin, and digits after all letters, without listing all affected characters in the rules. Since this is a parameter, it can also be changed at runtime without changing any rules.

This will be implemented via a permutation table for primary collation weights. See the original (somewhat outdated) ICU collation design doc for reference:

http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU\_collation\_design.htm#Script\_Order
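
The permutation idea can be illustrated with a minimal sketch. It assumes that each reordering group owns a contiguous set of primary lead bytes and that only the lead byte of a 32-bit primary weight is permuted. The byte values, group names, and function names below are hypothetical and are not taken from ICU.

```python
# Hypothetical lead-byte assignments for three groups; real values would come
# from the collation root data, not from this sketch.
GROUP_LEAD_BYTES = {
    "digits": [0x10, 0x11],
    "latn": [0x20, 0x21, 0x22],
    "grek": [0x30, 0x31],
}
DEFAULT_ORDER = ["digits", "latn", "grek"]


def build_permutation(new_order):
    """Return a 256-entry table mapping old primary lead bytes to new ones.

    Assumes new_order lists every group in DEFAULT_ORDER exactly once.
    """
    table = list(range(256))  # identity for bytes outside the reordered groups
    # The reordered groups share the same pool of lead bytes, in ascending order.
    pool = sorted(b for g in DEFAULT_ORDER for b in GROUP_LEAD_BYTES[g])
    slot = 0
    for group in new_order:
        for old_byte in GROUP_LEAD_BYTES[group]:
            table[old_byte] = pool[slot]
            slot += 1
    return table


def reorder_primary(primary, table):
    """Permute the lead byte of a 32-bit primary weight; keep the lower bytes."""
    lead = (primary >> 24) & 0xFF
    return (table[lead] << 24) | (primary & 0x00FFFFFF)
```

With the new order `["grek", "latn", "digits"]`, a primary weight whose lead byte belongs to Greek maps below every Latin weight, and digit weights map above both, without any change to the rules themselves.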

### Proposed LDML syntax:

Add the '**kr**' key, with an ordered list of script names as its types, in the order they should be sorted. For example, to specify an ordering of Greek, followed by Latin, followed by everything else (Zzzz = unknown), with digits (Zyyy = Common) last, the following would be used: **el-u-kr-grek-latn-zzzz-zyyy**. That would modify the ordering found on [http://unicode.org/charts/collation/](http://unicode.org/charts/collation/) in the following way:

- OLD
    - [Null](http://unicode.org/charts/collation/chart_Null.html) [Ignorable](http://unicode.org/charts/collation/chart_Ignorable.html) [Variable](http://unicode.org/charts/collation/chart_Variable.html) [Common](http://unicode.org/charts/collation/chart_Common.html) [Latin](http://unicode.org/charts/collation/chart_Latin.html) [Greek](http://unicode.org/charts/collation/chart_Greek.html) [Coptic](http://unicode.org/charts/collation/chart_Coptic.html) ... [CJK](http://unicode.org/charts/collation/chart_CJK.html) [CJK-Extensions](http://unicode.org/charts/collation/chart_CJK-Extensions.html) [Unsupported](http://unicode.org/charts/collation/chart_Unsupported.html)
- NEW
    - [Null](http://unicode.org/charts/collation/chart_Null.html) [Ignorable](http://unicode.org/charts/collation/chart_Ignorable.html) [Variable](http://unicode.org/charts/collation/chart_Variable.html) [Greek](http://unicode.org/charts/collation/chart_Greek.html) [Latin](http://unicode.org/charts/collation/chart_Latin.html) [Coptic](http://unicode.org/charts/collation/chart_Coptic.html) ... [CJK](http://unicode.org/charts/collation/chart_CJK.html) [CJK-Extensions](http://unicode.org/charts/collation/chart_CJK-Extensions.html) [Unsupported](http://unicode.org/charts/collation/chart_Unsupported.html) [Common](http://unicode.org/charts/collation/chart_Common.html)
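
The OLD-to-NEW change above can be reproduced with a short sketch. The group list is abbreviated (the groups elided in the chart listing are left out), and the code-to-group mapping, the fixed leading groups, and the handling of a list without `zzzz` are assumptions made for illustration, not part of the proposal.

```python
# Abbreviated default order from the OLD list; the elided groups are omitted.
DEFAULT_GROUPS = ["Null", "Ignorable", "Variable", "Common", "Latin", "Greek",
                  "Coptic", "CJK", "CJK-Extensions", "Unsupported"]
# Groups this sketch never moves (an assumption based on the example above).
FIXED_PREFIX = ["Null", "Ignorable", "Variable"]
# Hypothetical mapping from script codes to chart group names.
CODE_TO_GROUP = {"grek": "Greek", "latn": "Latin", "zyyy": "Common"}


def apply_script_reorder(codes):
    """Return the group order produced by an ordered list of script codes."""
    named = [CODE_TO_GROUP[c] for c in codes if c != "zzzz"]
    # "zzzz" stands for every reorderable group that is not named explicitly.
    others = [g for g in DEFAULT_GROUPS
              if g not in FIXED_PREFIX and g not in named]
    result = list(FIXED_PREFIX)
    for code in codes:
        if code == "zzzz":
            result.extend(others)
        else:
            result.append(CODE_TO_GROUP[code])
    if "zzzz" not in codes:
        # Assumption: unnamed groups follow the named ones in default order.
        result.extend(others)
    return result
```

Calling `apply_script_reorder(["grek", "latn", "zzzz", "zyyy"])` yields the NEW order shown above, with Unsupported still ahead of Common; whether Unsupported should instead stay last is the open issue noted in this section.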

***Issue:*** *Do we still want Unsupported at the very end?*

The 'digitaft' type for the 'co' key is no longer needed, and can be deprecated (with some minor changes to data).

Add an additional attribute, **scriptReorder**, to **\<settings>**. Its value will be the script names separated by spaces, in the order they should be sorted. The script code **Zzzz** stands for "any other script", and the script code **Zyyy** stands for Common.

Example:

\<settings scriptReorder="grek latn zzzz zyyy">

Note: after looking at the data, I think we might want to change the above:

- Allow codes that are not script codes, in particular Sc and Nd.
- Note that implicit is always at the end, so there would be no code to specify it; this prevents anyone from trying to put something after it.
- If the same script is specified twice in the list, the second occurrence wins.
- Warn people that, depending on the implementation, specifying a script may drag along others. In particular, historic scripts may be grouped together.
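
As a rough sketch of how a client might read the attribute and apply the "second wins" rule, the following parses a `<settings>` element with Python's standard `xml.etree.ElementTree`. The function name is hypothetical, and reading "second wins" as "the code keeps the position of its last occurrence" is an assumption.

```python
import xml.etree.ElementTree as ET


def parse_script_reorder(settings_xml):
    """Return the ordered script codes from a <settings> element string."""
    element = ET.fromstring(settings_xml)
    codes = element.get("scriptReorder", "").lower().split()
    result = []
    for i, code in enumerate(codes):
        # Drop a code if it appears again later: the later occurrence wins.
        if code not in codes[i + 1:]:
            result.append(code)
    return result
```

For example, `parse_script_reorder('<settings scriptReorder="latn grek latn zzzz"/>')` returns `['grek', 'latn', 'zzzz']` under this interpretation.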

See http://site.icu-project.org/design/collation/script-reordering

### Proposed LDML BCP47 subtag syntax changes:

To allow a key to have multiple types (for listing multiple script codes), change:

extension = key "-" type

to

extension = key ("-" type)+
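
A minimal parsing sketch for the revised grammar follows. It assumes that keys are two-character subtags and types are longer subtags, as in the examples in this document, and it handles only the `-u-` extension. The function name is hypothetical, and the sketch does not validate that each key has at least one type.

```python
def parse_u_extension(tag):
    """Split the -u- extension of a locale tag into {key: [type, ...]}."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return {}
    result = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension
        if len(subtag) == 2:
            key = subtag  # a two-character subtag starts a new key
            result[key] = []
        elif key is not None:
            result[key].append(subtag)  # longer subtags are types of that key
    return result
```

With the example from the script reordering section, `parse_u_extension("el-u-kr-grek-latn-zzzz-zyyy")` gives `{'kr': ['grek', 'latn', 'zzzz', 'zyyy']}`.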

## Collation Import

We want to add the ability for collation to "import" rules from another collator. This provides two useful features:

- Many European languages can import a common collation for the [European Ordering Rules](http://anubis.dkuug.dk/CEN/TC304/EOR/eorhome.html) and then add language-specific rules on top of that.
- For CJK Unihan variant collation orderings, the large common suffix with the Unihan ordering can be shared.

This should reduce the maintenance burden and make total storage of the collation rule strings significantly smaller.

### Proposed LDML syntax:

Add an **\<import>** tag within collation **\<rules>** with two attributes: **source**, which identifies the locale to import from (mirroring \<alias>'s source), and **type**, which identifies which collator within that locale to include.

Examples:

\<import source="und\_hani">

\<import source="de" type="phonebk">
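
One way a tool might expand imports is sketched below, assuming that an `<import>` is replaced in place by the imported collator's rules and that a missing `type` means `standard`. The in-memory `COLLATIONS` table, the `<placeholder>` elements, the locale `xx`, and the function name are all hypothetical stand-ins for real rule data.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule data keyed by (locale, collation type). The <placeholder>
# elements stand in for real rule elements.
COLLATIONS = {
    ("de", "phonebk"):
        "<rules><placeholder>rule-de-phonebk</placeholder></rules>",
    ("xx", "standard"):
        '<rules><import source="de" type="phonebk"/>'
        "<placeholder>rule-xx</placeholder></rules>",
}


def resolve_rules(source, ctype="standard", seen=()):
    """Return the flat rule list for a collator, expanding <import> in place."""
    key = (source, ctype)
    if key in seen:
        raise ValueError(f"circular import: {key}")
    resolved = []
    for child in ET.fromstring(COLLATIONS[key]):
        if child.tag == "import":
            # Assumption: an omitted type attribute means the standard collator.
            resolved += resolve_rules(child.get("source"),
                                      child.get("type", "standard"),
                                      seen + (key,))
        else:
            resolved.append(child.text)
    return resolved
```

Here `resolve_rules("xx")` returns the imported rule followed by the locale's own rule, which matches the "import a common collation, then add language-specific rules on top" use case.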

Add **private** as an additional attribute for \<settings>:

\<settings private="true"> // mirroring \<transform>'s private attribute

This attribute indicates to clients that the collation is intended only for \<import>, and should not be available as a stand-alone collator or listed in available collator APIs.

**Update CLDR 26 (2014)**: A collation type is marked "private" via a type naming convention, rather than an attribute, so that it is easy for an implementation to omit such a type from a list of available types without reading its data. See [CLDR ticket #3949 comment:18](http://unicode.org/cldr/trac/ticket/3949#comment:18).
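
The benefit of a naming convention is that filtering needs only the type names. The sketch below assumes the convention is a `private-` name prefix, which this document does not spell out; the function name and the sample type names are illustrative.

```python
def public_collation_types(type_names, private_prefix="private-"):
    """Return the type names that should be listed as available collators."""
    # Only the names are inspected; no collation data has to be loaded.
    return [name for name in type_names
            if not name.startswith(private_prefix)]
```

For instance, `public_collation_types(["standard", "private-kana", "phonebk"])` keeps `standard` and `phonebk` and omits the import-only type.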