---
title: Grapheme Usage
---

# Grapheme Usage

*Draft*

The goal is to allow the use of the appropriate grapheme clusters for given tasks, for a given language. See http://unicode.org/cldr/trac/ticket/2142. *Please leave any feedback as comments on that ticket.*

The idea is that we have explicit boundaries that represent certain common behaviors (codepoint breaks, or legacy grapheme cluster breaks), and we also have associations, for a given language, between a particular *function* and the explicit boundaries that should be used in that language for that function.

Here is a proposal for the structure in LDML:

\<characters>

...

 \<grapheme-usage type="count">**extended**\</grapheme-usage> \<!-- when counting 'user characters' -->

 \<grapheme-usage type="drop-cap">**legacy**\</grapheme-usage> \<!-- paragraph drop-caps -->

 \<grapheme-usage type="selection">**aksara**\</grapheme-usage> \<!-- selection boundaries: highlighting, keyboard arrows, cut&paste -->

 \<grapheme-usage type="backspace">**codepoint**\</grapheme-usage> \<!-- delete previous character -->

 \<grapheme-usage type="delete">**extended**\</grapheme-usage> \<!-- delete next character -->

...

\</characters>

*The above would be tailorable per locale.*

In segments/root.xml we have GraphemeClusterBreak. We interpret that as extended grapheme clusters for compatibility. We then add rules for:

- LegacyGraphemeClusterBreak // as per UAX#29
- AksaraGraphemeClusterBreak // the virama character connects extended clusters
- CodepointGraphemeClusterBreak // constant, trivial, probably usually implemented in code
- ExemplarGraphemeClusterBreak // uses the CLDR exemplar set in addition to extended clusters.

*These would also be tailorable per locale (except CodePoint), but that should be done more rarely.*

Clients like ICU would add new constants for getting BreakIterators (or equivalents). These would correspond both to the new explicit rules:

- legacy
- extended = 'user-character'
- aksara
- codepoint
- exemplar

And to the new 'function-based' breaks:

- character\_count
- character\_drop\_cap
- character\_selection
- character\_backspace
- character\_delete

## Related bugs

- #[2142](http://unicode.org/cldr/trac/ticket/2142), Alternate Grapheme Clusters (pedberg, 2.0)
- #[2975](http://unicode.org/cldr/trac/ticket/2975), Support legacy grapheme break (pedberg, 2.0)
- #[2825](http://unicode.org/cldr/trac/ticket/2825), Add aksha grapheme break (pedberg, 2.0)
- #[2992](http://unicode.org/cldr/trac/ticket/2992), Grapheme Clusters or a new break type - TR29 vs TR18? [about language-specific treatment of digraphs as clusters - ]
- #[2406](http://unicode.org/cldr/trac/ticket/2406), Add locale keywords to specify the type (or variant) of word & grapheme break (pedberg, 2.0)
- There is also the suggestion to add another type which is beyond the scope of CLDR - a cluster type that treats ligatures as single clusters. This depends on font behavior.
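
As a rough illustration of the two-level lookup proposed in this draft (a function such as `backspace` maps to an explicit boundary type, which in turn selects a set of break rules), here is a minimal Python sketch. The root table copies the values from the LDML example in this draft; the locale `xx`, its tailoring, and all function names are hypothetical. The cluster rule is a deliberate simplification that only keeps combining marks attached to their base character; it is not the UAX #29 algorithm and does not model the legacy, aksara, or exemplar variants.

```python
import unicodedata

# Root defaults, taken from the <grapheme-usage> example in this draft.
ROOT_USAGE = {
    "count": "extended",
    "drop-cap": "legacy",
    "selection": "aksara",
    "backspace": "codepoint",
    "delete": "extended",
}

# Hypothetical per-locale tailoring; the locale "xx" and its value are invented.
LOCALE_USAGE = {
    "xx": {"backspace": "extended"},
}


def resolve_usage(locale, function):
    """Return the explicit boundary type for a function, falling back to root."""
    tailored = LOCALE_USAGE.get(locale, {})
    return tailored.get(function, ROOT_USAGE[function])


def codepoint_breaks(text):
    """Boundaries between every pair of code points (the trivial 'codepoint' type)."""
    return list(range(len(text) + 1))


def simplified_cluster_breaks(text):
    """Simplified stand-in for cluster rules: never break before a mark (Mn/Mc/Me)."""
    if not text:
        return [0]
    breaks = [0]
    for i in range(1, len(text)):
        if not unicodedata.category(text[i]).startswith("M"):
            breaks.append(i)
    breaks.append(len(text))
    return breaks


def boundaries(locale, function, text):
    """Pick the break rules for a function in a locale and apply them to text."""
    kind = resolve_usage(locale, function)
    if kind == "codepoint":
        return codepoint_breaks(text)
    # Assumption: every non-codepoint type shares the simplified rule in this sketch.
    return simplified_cluster_breaks(text)
```

For the string `"e\u0301a"` (e, combining acute accent, a), `boundaries("en", "backspace", ...)` yields a boundary after every code point, while `boundaries("en", "count", ...)` keeps the accent with its base letter. A real client would replace the simplified rule with separate rule sets per boundary type.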