---
title: Grapheme Usage
---

# Grapheme Usage

*Draft*

The goal is to let each task use the grapheme cluster boundaries appropriate to it in a given language. See http://unicode.org/cldr/trac/ticket/2142. *Please leave any feedback as comments on that ticket.*

The idea has two parts. First, we define explicit boundaries that represent certain common behaviors (such as codepoint breaks or legacy grapheme cluster breaks). Second, for each language we associate a particular *function* with the explicit boundaries that should be used for that function in that language.

Here is a proposal for the structure in LDML:

\<characters>

...

 \<grapheme-usage type="count">**extended**\</grapheme-usage> \<!-- when counting 'user characters' -->

 \<grapheme-usage type="drop-cap">**legacy**\</grapheme-usage> \<!-- paragraph drop-caps -->

 \<grapheme-usage type="selection">**aksara**\</grapheme-usage> \<!-- selection boundaries: highlighting, keyboard arrows, cut&paste -->

 \<grapheme-usage type="backspace">**codepoint**\</grapheme-usage> \<!-- delete previous character -->

 \<grapheme-usage type="delete">**extended**\</grapheme-usage> \<!-- delete next character -->

...

\</characters>

*The above would be tailorable per locale.*
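
As a rough illustration of how such data might be consumed, the following Python sketch parses the proposed elements and layers a locale tailoring over the root values. It assumes the draft element name `grapheme-usage` exactly as written above; the tailoring data and the helper names `parse_usage` and `resolve_usage` are hypothetical and not part of CLDR or any library.

```python
import xml.etree.ElementTree as ET

# Root data following the proposed (draft) structure shown above.
ROOT_XML = """
<characters>
  <grapheme-usage type="count">extended</grapheme-usage>
  <grapheme-usage type="drop-cap">legacy</grapheme-usage>
  <grapheme-usage type="selection">aksara</grapheme-usage>
  <grapheme-usage type="backspace">codepoint</grapheme-usage>
  <grapheme-usage type="delete">extended</grapheme-usage>
</characters>
"""

# A made-up locale tailoring that overrides a single function (assumption).
TAILORED_XML = """
<characters>
  <grapheme-usage type="backspace">extended</grapheme-usage>
</characters>
"""


def parse_usage(xml_text):
    """Return a {function: boundary type} mapping from one <characters> element."""
    root = ET.fromstring(xml_text)
    return {el.get("type"): el.text.strip() for el in root.iter("grapheme-usage")}


def resolve_usage(root_xml, *tailorings):
    """Start from the root values and let each tailoring override them in order."""
    usage = parse_usage(root_xml)
    for tailoring in tailorings:
        usage.update(parse_usage(tailoring))
    return usage


usage = resolve_usage(ROOT_XML, TAILORED_XML)
print(usage["selection"])  # aksara (inherited from root)
print(usage["backspace"])  # extended (overridden by the tailoring)
```

Simple override-on-top-of-root is only one possible inheritance model; the proposal itself does not specify how tailorings would be resolved.
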

In segments/root.xml we already have GraphemeClusterBreak. For compatibility, we interpret it as extended grapheme clusters. We then add rules for:

- LegacyGraphemeClusterBreak // as per UAX #29
- AksaraGraphemeClusterBreak // the virama character connects extended clusters
- CodepointGraphemeClusterBreak // constant and trivial; probably usually implemented in code
- ExemplarGraphemeClusterBreak // uses the CLDR exemplar set in addition to extended clusters

*These would also be tailorable per locale (except Codepoint), but such tailoring should be rarer.*
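
To make the difference between these boundary types concrete, here is a deliberately simplified Python sketch. It is not the UAX #29 algorithm: it only attaches combining marks by general category and, in the aksara mode, joins whatever follows a virama (canonical combining class 9). Hangul, ZWJ sequences, CR LF, regional indicators, and the exemplar-based rules are all ignored. The function name `segment` is hypothetical.

```python
import unicodedata


def segment(text, mode):
    """Very rough approximation of the 'codepoint', 'legacy', 'extended', and 'aksara' boundaries."""
    if mode == "codepoint":
        return list(text)

    # Assumption: legacy clusters absorb nonspacing/enclosing marks only;
    # extended (and aksara) clusters also absorb spacing marks.
    attach = {"Mn", "Me"}
    if mode in ("extended", "aksara"):
        attach.add("Mc")

    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        joins_as_mark = unicodedata.category(ch) in attach
        # In aksara mode a preceding virama (combining class 9) connects the next character.
        joins_after_virama = (
            mode == "aksara" and prev != "" and unicodedata.combining(prev[-1]) == 9
        )
        if prev and (joins_as_mark or joins_after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Devanagari KA + VIRAMA + SSA + VOWEL SIGN I
sample = "\u0915\u094d\u0937\u093f"
for mode in ("codepoint", "legacy", "extended", "aksara"):
    print(mode, len(segment(sample, mode)))
# codepoint 4, legacy 3, extended 2, aksara 1
```

The same four-codepoint string therefore counts as four, three, two, or one "character" depending on which boundary type a function selects.
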

Clients like ICU would add new constants for getting BreakIterators (or equivalents). Some constants would correspond to the new explicit rules:

- legacy
- extended = 'user-character'
- aksara
- codepoint
- exemplar

Others would correspond to the new 'function-based' breaks:

- character\_count
- character\_drop\_cap
- character\_selection
- character\_backspace
- character\_delete
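
One way a client might expose both kinds of constants is sketched below in Python. This does not reflect ICU's actual API; the enum names, the `resolve` helper, and the sample table (copied from the draft values earlier on this page) are all illustrative assumptions.

```python
from enum import Enum


class ExplicitBreak(Enum):
    """Constants for the explicit rules (hypothetical names)."""
    LEGACY = "legacy"
    EXTENDED = "extended"  # the 'user-character' boundary
    AKSARA = "aksara"
    CODEPOINT = "codepoint"
    EXEMPLAR = "exemplar"


class FunctionBreak(Enum):
    """Constants for the function-based breaks (hypothetical names)."""
    CHARACTER_COUNT = "count"
    CHARACTER_DROP_CAP = "drop-cap"
    CHARACTER_SELECTION = "selection"
    CHARACTER_BACKSPACE = "backspace"
    CHARACTER_DELETE = "delete"


# Sample locale data, using the draft values from the proposed LDML structure.
SAMPLE_USAGE = {
    "count": "extended",
    "drop-cap": "legacy",
    "selection": "aksara",
    "backspace": "codepoint",
    "delete": "extended",
}


def resolve(kind, usage=SAMPLE_USAGE):
    """Map either kind of constant to the explicit rule that should be applied."""
    if isinstance(kind, ExplicitBreak):
        return kind  # explicit constants name their rule directly
    return ExplicitBreak(usage[kind.value])  # function constants go through locale data


print(resolve(FunctionBreak.CHARACTER_BACKSPACE))  # ExplicitBreak.CODEPOINT
print(resolve(ExplicitBreak.LEGACY))               # ExplicitBreak.LEGACY
```

Keeping the two sets of constants separate makes it clear that only the function-based ones depend on locale data.
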

## Related bugs

- #[2142](http://unicode.org/cldr/trac/ticket/2142), Alternate Grapheme Clusters (pedberg, 2.0)
- #[2975](http://unicode.org/cldr/trac/ticket/2975), Support legacy grapheme break (pedberg, 2.0)
- #[2825](http://unicode.org/cldr/trac/ticket/2825), Add aksha grapheme break (pedberg, 2.0)
- #[2992](http://unicode.org/cldr/trac/ticket/2992), Grapheme Clusters or a new break type - TR29 vs TR18? [about language-specific treatment of digraphs as clusters - ]
- #[2406](http://unicode.org/cldr/trac/ticket/2406), Add locale keywords to specify the type (or variant) of word & grapheme break (pedberg, 2.0)
- There is also a suggestion to add another type that is beyond the scope of CLDR: a cluster type that treats ligatures as single clusters. This depends on font behavior.