---
layout: default
title: Ignore Punctuation Options
nav_order: 8
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# “Ignore Punctuation” Options
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

By default, spaces and punctuation characters add primary (base character)
differences. Such characters sort less-than digits and letters. For example, the
default collation yields “De Anza” < “de-luge” < “deanza”.

UCA/CLDR/ICU provide several options for “ignore punctuation” collation
settings, also known as Variable Weighting or Alternate Handling. These options
change the sorting behavior of “variable” characters algorithmically. “Variable”
characters are those with low (but non-zero) primary weights up to a threshold,
the “variable top”. By default, CLDR and ICU treat spaces and punctuation as
variable. (This can be changed via API.) In the DUCET, most symbols are
variable as well.
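
The “variable top” threshold can be pictured with a small sketch. The primary weights and the two threshold values below are invented for illustration and are not real CLDR or DUCET data; `is_variable` is a hypothetical helper, not an ICU function.

```python
# Toy primary weights (hypothetical values, not real CLDR or DUCET data).
# Order: space < punctuation < symbol < digit < letter.
TOY_PRIMARY = {" ": 2, "-": 3, "+": 6, "1": 8, "a": 10}

def is_variable(ch, variable_top):
    """A character is variable if its primary weight is non-zero
    and does not exceed the variable top."""
    p = TOY_PRIMARY[ch]
    return 0 < p <= variable_top

PUNCTUATION_TOP = 4  # assumed threshold: spaces and punctuation only
SYMBOL_TOP = 7       # assumed threshold: spaces, punctuation and symbols

print([ch for ch in TOY_PRIMARY if is_variable(ch, PUNCTUATION_TOP)])
# [' ', '-']
print([ch for ch in TOY_PRIMARY if is_variable(ch, SYMBOL_TOP)])
# [' ', '-', '+']
```

Raising the threshold is what moves symbols into the variable set; digits and letters stay above it in both cases.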

## Non-Ignorable

The default behavior in CLDR & ICU, shown above, is to not ignore punctuation
(alternate=non-ignorable) but to map variable characters to their normal primary
collation elements.
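
As a rough illustration of non-ignorable behavior, the following sketch builds two-level sort keys from invented weights. The weight values and the helper names (`primary`, `tertiary`, `non_ignorable_key`) are assumptions made for this example; real ICU collation data is far richer.

```python
# Toy collation data (hypothetical weights, not real CLDR/ICU values).
VARIABLE = {" ": 2, "-": 3}  # variable characters get low primary weights

def primary(ch):
    """Toy primary weight: variables sort below letters; case is ignored."""
    if ch in VARIABLE:
        return VARIABLE[ch]
    return 10 + ord(ch.lower()) - ord("a")

def tertiary(ch):
    """Toy tertiary weight: uppercase sorts after lowercase."""
    return 1 if ch.isupper() else 0

def non_ignorable_key(s):
    # Every character, variable or not, contributes a primary weight.
    return ([primary(c) for c in s], [tertiary(c) for c in s])

print(sorted(["deanza", "de-luge", "De Anza"], key=non_ignorable_key))
# ['De Anza', 'de-luge', 'deanza']
```

The space and the hyphen decide the order at the third character, before any letter after them is even looked at.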

All of the following options cause variable characters to be ignored on levels
1..3. Only when strings compare equal up to the tertiary level may variable
characters make a difference, depending on the options.

See also

*   [UCA: Variable
    Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting)
*   [LDML: Setting
    Options](https://htmlpreview.github.io/?https://github.com/unicode-org/cldr/blob/master/docs/ldml/tr35-collation.html#Setting_Options)

Here is an overview of the sorting results with these options.

Non-ignorable | Blanked      | Shifted  | Shift-Trimmed | Variable-After
------------- | ------------ | -------- | ------------- | --------------
de-luge       | delug        | delug    | delug         | delug
delu-ge       | de-luge      | de-luge  | *deluge*      | *deluge*
delug         | delu-ge (*)  | delu-ge  | de-luge       | deluge-
*deluge*      | *deluge* (*) | *deluge* | delu-ge       | delu-ge
Deluge        | deluge- (*)  | deluge-  | deluge-       | de-luge
deluge-       | Deluge       | Deluge   | Deluge        | Deluge

Items with (*) compare equal to the preceding ones, and their relative order
is arbitrary. These only occur in the Blanked column. This table shows the
results of a stable sort algorithm with the non-ignorable column as input.

## Blanked

The simplest option is to “ignore punctuation” completely, as if all variable
characters (and following combining marks) had been removed from the input
strings before comparing them.

For example: “De Anza” = “De-Anza” = “DeAnza”.

In ICU, this option is selected with alternate=shifted and
strength=primary|secondary|tertiary. (ICU does not support Blanked combined with
strength=identical.)

The implementation “blanks” out all weights of the variable characters’
collation elements.
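
A minimal sketch of the Blanked behavior, using invented weights rather than real collation data, simply drops variable characters before building the key. It ignores the combining-marks detail mentioned above, and the helper names are assumptions for this example.

```python
# Toy model (hypothetical weights, not real CLDR/ICU values).
VARIABLE = {" ", "-"}  # characters treated as variable in this sketch

def primary(ch):
    """Toy primary weight for letters; case is ignored."""
    return 10 + ord(ch.lower()) - ord("a")

def tertiary(ch):
    """Toy tertiary weight: uppercase sorts after lowercase."""
    return 1 if ch.isupper() else 0

def blanked_key(s):
    # Variable characters contribute nothing on any level.
    kept = [c for c in s if c not in VARIABLE]
    return ([primary(c) for c in kept], [tertiary(c) for c in kept])

print(blanked_key("De Anza") == blanked_key("De-Anza") == blanked_key("DeAnza"))
# True

# sorted() is stable, so strings with equal keys keep their input order.
words = ["de-luge", "delu-ge", "delug", "deluge", "Deluge", "deluge-"]
print(sorted(words, key=blanked_key))
# ['delug', 'de-luge', 'delu-ge', 'deluge', 'deluge-', 'Deluge']
```

The four strings that differ only in hyphen placement get identical keys, so their relative order in the output is just the order they had in the input.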

*With all of the following options, variable characters are ignored on levels
1..3 but add distinctions on level 4 (quaternary level).*

## Shifted

Among strings that compare tertiary-equal, that is, they contain the same
letters, accents and casing:

*   Sorts all variable characters less-than (before) regular characters.
*   *Appending* a variable character makes a string sort *greater-than* the string
    without it.
*   *Inserting* a variable character (anywhere before the end) makes a string
    sort *less-than* the string without it.
*   Inserting a variable character *earlier* in a string makes it sort
    *less-than* inserting the variable character *later* in the string.

The result is similar to [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys) (with shorter
prefixes sorting less-than longer ones), like in last-name+first-name sorting,
except only among tertiary-equal strings.

For example: “de-luge” < “delu-ge” < “deluge” < “deluge-”.

In ICU, this option is selected with alternate=shifted and
strength=quaternary|identical.

The implementation “shifts” the primary weight p of the collation element \[p,
s, t, q\] of each variable character down three levels: \[0, 0, 0, p\]. Regular
characters with primary collation elements get a high quaternary weight, higher
than that of any variable character.
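
The shifting step can be sketched with toy sort keys. The weights, the `HIGH` constant and the helper names are invented for this example, and the secondary level is left out because the sample strings carry no accents.

```python
# Toy model (hypothetical weights, not real CLDR/ICU values).
VARIABLE = {" ": 2, "-": 3}  # primary weights of variable characters
HIGH = 0xFFFF                # assumed high quaternary weight for regular characters

def primary(ch):
    """Toy primary weight for letters; case is ignored."""
    return 10 + ord(ch.lower()) - ord("a")

def tertiary(ch):
    """Toy tertiary weight: uppercase sorts after lowercase."""
    return 1 if ch.isupper() else 0

def shifted_key(s):
    regular = [c for c in s if c not in VARIABLE]
    # Variable: [p, s, t, q] becomes [0, 0, 0, p]; regular characters get HIGH.
    quaternary = [VARIABLE[c] if c in VARIABLE else HIGH for c in s]
    return ([primary(c) for c in regular],
            [tertiary(c) for c in regular],
            quaternary)

print(sorted(["deluge-", "deluge", "delu-ge", "de-luge"], key=shifted_key))
# ['de-luge', 'delu-ge', 'deluge', 'deluge-']
```

Because regular characters carry the high weight, a hyphen in the middle of the quaternary level pulls a string forward, while a trailing hyphen only makes the level longer and so pushes the string back.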

Note that this behavior is different from collation on the secondary and
tertiary levels, because normal collation elements get low secondary & tertiary
weights but high quaternary weights. Adding an accent difference anywhere makes a
string sort greater-than the string without it, and adding an accent difference
earlier makes it sort greater-than adding it later. For example, “deanza” <
“deanzä” < “deänza” < “dëanza”. (Compare the ‘ä’/‘ë’ positions here with the ‘-’
positions above.)
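
For contrast, here is a toy view of the secondary level alone. It assumes that each base letter contributes a low “common” secondary weight and that each combining accent contributes its own, higher weight; the two constants and the `secondary_level` helper are invented for this sketch.

```python
import unicodedata

COMMON, ACCENT = 5, 6  # assumed toy secondary weights: base letters low, accents higher

def secondary_level(s):
    """Secondary weights of a string, one per decomposed code point."""
    return [ACCENT if unicodedata.combining(ch) else COMMON
            for ch in unicodedata.normalize("NFD", s)]

print(sorted(["dëanza", "deänza", "deanzä", "deanza"], key=secondary_level))
# ['deanza', 'deanzä', 'deänza', 'dëanza']
```

Here the regular characters carry the low weight, so the earlier the accent appears, the later the string sorts. That is the mirror image of the hyphen positions under Shifted.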

## Shift-Trimmed

*Note: This method is not currently implemented in ICU.*

Among strings that compare tertiary-equal:

*   Sorts variable characters sometimes less-than, sometimes greater-than
    regular characters.
*   Inserting a variable character anywhere makes a string sort *greater-than*
    the string without it. (The string without variable characters gets an empty
    quaternary level.)
*   Inserting a variable character *earlier* in a string makes it sort
    *less-than* inserting the variable character *later* in the string.

For example: “deluge” < “de-luge” < “delu-ge” < “deluge-”.

The Shift-Trimmed method works like Shifted, except that *trailing*
high-quaternary weights (from regular characters) are removed (trimmed).
Compared with Shifted, the Shift-Trimmed method sorts strings without variable
characters before ones with variable characters added, rather than producing the
equivalent of [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys).

Shift-Trimmed is more complicated to implement than any of the other options.
When comparing strings, a lookahead (or equivalent) is needed to determine
whether a non-variable character gets a zero quaternary weight (if no variables
follow) or a high quaternary weight (if at least one variable follows). When
building sort keys, trailing high/common quaternary weights are trimmed (backed
out) at the end of the quaternary level.
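
The sort-key variant of the trimming step can be sketched as follows. Since ICU does not implement this method, everything here is a toy model: the weights, the `HIGH` constant and the helper names are assumptions for illustration only.

```python
# Toy model (hypothetical weights; Shift-Trimmed is not implemented in ICU).
VARIABLE = {" ": 2, "-": 3}  # primary weights of variable characters
HIGH = 0xFFFF                # assumed high quaternary weight for regular characters

def primary(ch):
    """Toy primary weight for letters; case is ignored."""
    return 10 + ord(ch.lower()) - ord("a")

def tertiary(ch):
    """Toy tertiary weight: uppercase sorts after lowercase."""
    return 1 if ch.isupper() else 0

def shift_trimmed_key(s):
    regular = [c for c in s if c not in VARIABLE]
    quaternary = [VARIABLE[c] if c in VARIABLE else HIGH for c in s]
    # Trim (back out) trailing high weights at the end of the quaternary level.
    while quaternary and quaternary[-1] == HIGH:
        quaternary.pop()
    return ([primary(c) for c in regular],
            [tertiary(c) for c in regular],
            quaternary)

print(shift_trimmed_key("deluge")[2])
# []
print(sorted(["deluge-", "delu-ge", "de-luge", "deluge"], key=shift_trimmed_key))
# ['deluge', 'de-luge', 'delu-ge', 'deluge-']
```

A string without variable characters ends up with an empty quaternary level, which is why it sorts before every variant that contains a hyphen.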

## Variable-After

*Note: This method is not currently implemented in ICU.*

Among strings that compare tertiary-equal:

*   Sorts all variable characters greater-than (after) regular characters.
*   Inserting a variable character anywhere makes a string sort *greater-than*
    the string without it. (Like Shift-Trimmed.)
*   Inserting a variable character *earlier* in a string makes it sort
    *greater-than* inserting the variable character *later* in the string. (Like
    accent differences.)

For example: “deluge” < “deluge-” < “delu-ge” < “de-luge”.

The implementation “shifts” the primary weight p of the collation element \[p,
s, t, q\] of each variable character down three levels: \[0, 0, 0, p\]. Regular
characters with primary collation elements get a *low* quaternary weight,
*lower* than that of any variable character. This is consistent with collation
on secondary and tertiary levels but unlike [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys).
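
A toy sketch of Variable-After differs from a Shifted-style key only in the quaternary weight given to regular characters. As before, the weights, the `LOW` constant and the helper names are invented for illustration; ICU does not implement this method.

```python
# Toy model (hypothetical weights; Variable-After is not implemented in ICU).
VARIABLE = {" ": 2, "-": 3}  # primary weights of variable characters
LOW = 1                      # assumed low quaternary weight, below every variable weight

def primary(ch):
    """Toy primary weight for letters; case is ignored."""
    return 10 + ord(ch.lower()) - ord("a")

def tertiary(ch):
    """Toy tertiary weight: uppercase sorts after lowercase."""
    return 1 if ch.isupper() else 0

def variable_after_key(s):
    regular = [c for c in s if c not in VARIABLE]
    # Variable: [p, s, t, q] becomes [0, 0, 0, p]; regular characters get LOW.
    quaternary = [VARIABLE[c] if c in VARIABLE else LOW for c in s]
    return ([primary(c) for c in regular],
            [tertiary(c) for c in regular],
            quaternary)

print(sorted(["de-luge", "delu-ge", "deluge-", "deluge"], key=variable_after_key))
# ['deluge', 'deluge-', 'delu-ge', 'de-luge']
```

With regular characters at the low weight, the quaternary level behaves like the secondary level does for accents: an earlier hyphen sorts later.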

This method extends the [UCA well-formedness condition
2](http://www.unicode.org/reports/tr10/#WF2) to apply to quaternary weights.
(UCA versions before UCA 6.2 did not limit WF2 to secondary & tertiary weights,
which meant that several of the Variable Weighting options technically created
ill-formed quaternary weights.)