---
layout: default
title: Ignore Punctuation Options
nav_order: 8
parent: Collation
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# “Ignore Punctuation” Options
{: .no_toc }

## Contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Overview

By default, spaces and punctuation characters add primary (base character)
differences. Such characters sort less-than digits and letters. For example, the
default collation yields “De Anza” < “de-luge” < “deanza”.

UCA/CLDR/ICU provide several options for “ignore punctuation” collation
settings, also known as Variable Weighting or Alternate Handling. These options
change the sorting behavior of “variable” characters algorithmically. “Variable”
characters are those with low (but non-zero) primary weights up to a threshold,
the “variable top”. By default, CLDR and ICU treat spaces and punctuation as
variable. (This can be changed via API.) The DUCET also includes most symbols.

## Non-Ignorable

The default behavior in CLDR & ICU, shown above, is to not ignore punctuation
(alternate=non-ignorable) but to map variable characters to their normal primary
collation elements.

All of the following options cause variable characters to be ignored on levels
1..3. Only when strings compare equal up to the tertiary level may variable
characters make a difference, depending on the options.

See also

* [UCA: Variable
  Weighting](http://www.unicode.org/reports/tr10/#Variable_Weighting)
* [LDML: Setting
  Options](https://htmlpreview.github.io/?https://github.com/unicode-org/cldr/blob/master/docs/ldml/tr35-collation.html#Setting_Options)

Here is an overview of the sorting results with these options.

Non-ignorable | Blanked      | Shifted  | Shift-Trimmed | Variable-After
------------- | ------------ | -------- | ------------- | --------------
delug         | delug        | delug    | delug         | delug
de-luge       | de-luge      | de-luge  | *deluge*      | *deluge*
delu-ge       | delu-ge (*)  | delu-ge  | de-luge       | deluge-
*deluge*      | *deluge* (*) | *deluge* | delu-ge       | delu-ge
Deluge        | deluge- (*)  | deluge-  | deluge-       | de-luge
deluge-       | Deluge       | Deluge   | Deluge        | Deluge

Items with (*) compare equal to the preceding ones, and their relative order
is arbitrary. These only occur in the Blanked column. This table shows the
results of a stable sort algorithm with the non-ignorable column as input.

## Blanked

The simplest option is to “ignore punctuation” completely, as if all variable
characters (and following combining marks) had been removed from the input
strings before comparing them.

For example: “De Anza” = “De-Anza” = “DeAnza”.

In ICU, this option is selected with alternate=shifted and
strength=primary|secondary|tertiary. (ICU does not support Blanked combined with
strength=identical.)

The implementation “blanks” out all weights of the variable characters’
collation elements.

*With all of the following options, variable characters are ignored on levels
1..3 but add distinctions on level 4 (quaternary level).*

## Shifted

Among strings that compare tertiary-equal, that is, they contain the same
letters, accents and casing:

* Sorts all variable characters less-than (before) regular characters.
* Appending a variable character makes a string sort *greater-than* the string
  without it.
* *Inserting* a variable character makes a string sort *less-than* the string
  without it.
* Inserting a variable character *earlier* in a string makes it sort
  *less-than* inserting the variable character *later* in the string.
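
The rules above can be reproduced with a toy model of the quaternary level.
This is only a sketch, not ICU's implementation: the function name
`shifted_quaternary` and the weight values are hypothetical, and it assumes
that the strings are already tertiary-equal, that each variable character
contributes a low quaternary weight, and that each regular character
contributes one high quaternary weight.

```python
# Toy model of the Shifted quaternary level (illustrative only, not ICU code).
# Assumption: every regular character gets the same high weight, and each
# variable character gets a lower weight. The numeric values are made up.
HIGH = 0xFF
VARIABLE = {" ": 0x01, "-": 0x02}


def shifted_quaternary(text):
    """Return the toy quaternary weights of a string under Shifted."""
    return [VARIABLE.get(ch, HIGH) for ch in text]


# Python compares lists element by element, and a list that is a proper
# prefix of another sorts first, which mirrors sort key comparison here.
words = ["deluge-", "deluge", "delu-ge", "de-luge"]
print(sorted(words, key=shifted_quaternary))
# ['de-luge', 'delu-ge', 'deluge', 'deluge-']
```

An inserted hyphen puts a low weight where the other string has a high one, so
the string sorts earlier; an appended hyphen only makes the weight list longer,
so the string sorts later.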

The result is similar to [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys) (with shorter
prefixes sorting less-than longer ones), like in last-name+first-name sorting,
except only among tertiary-equal strings.

For example: “de-luge” < “delu-ge” < “deluge” < “deluge-”.

In ICU, this option is selected with alternate=shifted and
strength=quaternary|identical.

The implementation “shifts” the primary weight p of the collation element \[p,
s, t, q\] of each variable character down three levels: \[0, 0, 0, p\]. Regular
characters with primary collation elements get a high quaternary weight, higher
than that of any variable character.

Note that this behavior is different from collation on the secondary and
tertiary levels, because normal collation elements get low secondary & tertiary
weights but high quaternary weights. Adding an accent difference anywhere makes
a string sort greater-than the string without it, and adding an accent
difference earlier makes it sort greater-than adding it later. For example,
“deanza” < “deanzä” < “deänza” < “dëanza”. (Compare the ‘ä’/‘ë’ positions here
with the ‘-’ positions above.)

## Shift-Trimmed

*Note: This method is not currently implemented in ICU.*

Among strings that compare tertiary-equal:

* Sorts variable characters sometimes less-than, sometimes greater-than
  regular characters.
* Inserting a variable character anywhere makes a string sort *greater-than*
  the string without it. (The string without variable characters gets an empty
  quaternary level.)
* Inserting a variable character *earlier* in a string makes it sort
  *less-than* inserting the variable character *later* in the string.

For example: “deluge” < “de-luge” < “delu-ge” < “deluge-”.
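
The example ordering above can be reproduced with a toy model of the quaternary
level. This is a sketch under stated assumptions, not an implementation from
any library: the name `shift_trimmed_quaternary` and the weight values are
hypothetical, the strings are assumed to be tertiary-equal, and regular
characters are assumed to share one high quaternary weight that is dropped
wherever it trails the last variable character.

```python
# Toy model of the Shift-Trimmed quaternary level (illustrative only).
# Assumption: regular characters share one high weight, variable characters
# get lower weights. The numeric values are made up.
HIGH = 0xFF
VARIABLE = {" ": 0x01, "-": 0x02}


def shift_trimmed_quaternary(text):
    """Return toy quaternary weights with trailing high weights removed."""
    weights = [VARIABLE.get(ch, HIGH) for ch in text]
    # Trim the high weights that follow the last variable character.
    while weights and weights[-1] == HIGH:
        weights.pop()
    return weights


words = ["deluge-", "delu-ge", "de-luge", "deluge"]
print(sorted(words, key=shift_trimmed_quaternary))
# ['deluge', 'de-luge', 'delu-ge', 'deluge-']
```

A string without variable characters ends up with an empty weight list, which
sorts before every non-empty list.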

The Shift-Trimmed method works like Shifted, except that *trailing*
high-quaternary weights (from regular characters) are removed (trimmed).
Compared with Shifted, the Shift-Trimmed method sorts strings without variable
characters before ones with variable characters added, rather than producing the
equivalent of [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys).

Shift-Trimmed is more complicated to implement than all of the other options:
When comparing strings, a lookahead (or equivalent) is needed to determine
whether a non-variable character gets a zero quaternary weight (if no variables
follow) or a high quaternary weight (if at least one variable follows). When
building sort keys, trailing high/common quaternary weights are trimmed (backed
out) at the end of the quaternary level.

## Variable-After

*Note: This method is not currently implemented in ICU.*

Among strings that compare tertiary-equal:

* Sorts all variable characters greater-than (after) regular characters.
* Inserting a variable character anywhere makes a string sort *greater-than*
  the string without it. (Like Shift-Trimmed.)
* Inserting a variable character *earlier* in a string makes it sort
  *greater-than* inserting the variable character *later* in the string. (Like
  accent differences.)

For example: “deluge” < “deluge-” < “delu-ge” < “de-luge”.

The implementation “shifts” the primary weight p of the collation element \[p,
s, t, q\] of each variable character down three levels: \[0, 0, 0, p\]. Regular
characters with primary collation elements get a *low* quaternary weight,
*lower* than that of any variable character. This is consistent with collation
on secondary and tertiary levels but unlike [Merging Sort
Keys](http://www.unicode.org/reports/tr10/#Merging_Sort_Keys).
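
The weight assignment described above can be sketched with a toy model of the
quaternary level. The name `variable_after_quaternary` and the numeric weights
are hypothetical, and the model assumes that the strings are already
tertiary-equal; it is meant only to illustrate the ordering, not to show how a
collation library would implement it.

```python
# Toy model of the Variable-After quaternary level (illustrative only).
# Assumption: regular characters share one low weight, and every variable
# character gets a higher weight. The numeric values are made up.
LOW = 0x01
VARIABLE = {" ": 0x02, "-": 0x03}


def variable_after_quaternary(text):
    """Return the toy quaternary weights of a string under Variable-After."""
    return [VARIABLE.get(ch, LOW) for ch in text]


words = ["de-luge", "delu-ge", "deluge-", "deluge"]
print(sorted(words, key=variable_after_quaternary))
# ['deluge', 'deluge-', 'delu-ge', 'de-luge']
```

Because the variable character now carries the higher weight, an earlier hyphen
makes a string sort later, just as an earlier accent does on the secondary
level.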

This method extends the [UCA well-formedness condition
2](http://www.unicode.org/reports/tr10/#WF2) to apply to quaternary weights.
(UCA versions before UCA 6.2 did not limit WF2 to secondary & tertiary weights,
which meant that several of the Variable Weighting options technically created
ill-formed quaternary weights.)