The graphemeCluster directory contains files used to modify the default Grapheme Cluster Break (GCB)
algorithm (https://unicode.org/reports/tr29/) so that Indic aksaras are not split.

The modifications are:

1. Adding 3 new character categories to https://unicode.org/reports/tr29/#Grapheme_Cluster_Break_Property_Values

  Virama=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Virama}]

  LinkingConsonant=[\p{Gujr}\p{sc=Telu}\p{sc=Mlym}\p{sc=Orya}\p{sc=Beng}\p{sc=Deva}&\p{Indic_Syllabic_Category=Consonant}]

  ExtCccZwj=[[\p{gcb=Extend}-\p{ccc=0}] \p{gcb=ZWJ}]

Note that these categories are not GCB property values; in fact, they overlap the GCB property values.
It is not necessary for the rules to have disjoint categories.
The list of scripts can be extended over time, as test files for them become available.
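
The overlap between categories can be illustrated with a small Python sketch. It is an
approximation under stated assumptions: the Python standard library does not expose
Indic_Syllabic_Category or Grapheme_Cluster_Break, so VIRAMAS is a hand-written stand-in listing
one virama per script, and "nonzero canonical combining class" stands in for
[\p{gcb=Extend}-\p{ccc=0}]. The names VIRAMAS, is_virama and is_ext_ccc_zwj are illustrative only.

```python
import unicodedata

# Assumption: hand-written stand-in for the Virama category, one code point
# per script listed above. The real set is defined by the UnicodeSet expression.
VIRAMAS = {
    0x094D,  # DEVANAGARI SIGN VIRAMA
    0x09CD,  # BENGALI SIGN VIRAMA
    0x0ACD,  # GUJARATI SIGN VIRAMA
    0x0B4D,  # ORIYA SIGN VIRAMA
    0x0C4D,  # TELUGU SIGN VIRAMA
    0x0D4D,  # MALAYALAM SIGN VIRAMA
}
ZWJ = 0x200D  # ZERO WIDTH JOINER


def is_virama(ch):
    return ord(ch) in VIRAMAS


def is_ext_ccc_zwj(ch):
    # Approximation: a nonzero combining class stands in for
    # "gcb=Extend and ccc != 0"; ZWJ is added explicitly.
    return ord(ch) == ZWJ or unicodedata.combining(ch) != 0


# Each virama also falls in ExtCccZwj, so the two categories overlap.
for cp in sorted(VIRAMAS):
    ch = chr(cp)
    print(f"U+{cp:04X} virama={is_virama(ch)} ext_ccc_zwj={is_ext_ccc_zwj(ch)}")
```

Because a virama belongs to both categories, a rule written over them has to tolerate a
character matching more than one category name.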

2. Adding a rule to https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

  9.3) LinkingConsonant ExtCccZwj* Virama ExtCccZwj* × LinkingConsonant
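
As a rough illustration of rule 9.3, the following Python sketch checks whether the rule forbids
a break at a given position. It is not the implementation used to generate the files here. It
assumes Devanagari only, approximates LinkingConsonant by the range U+0915..U+0939 (KA..HA), and
approximates ExtCccZwj by "ZWJ or nonzero canonical combining class". The function names are
hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def is_linking_consonant(ch):
    # Assumption: Devanagari KA..HA only, as a stand-in for LinkingConsonant.
    return "\u0915" <= ch <= "\u0939"


def is_ext_ccc_zwj(ch):
    # Approximation of ExtCccZwj: ZWJ, or any character with ccc != 0.
    return ch == ZWJ or unicodedata.combining(ch) != 0


def rule_9_3_forbids_break(text, i):
    """Return True if rule 9.3 forbids a break between text[i-1] and text[i]."""
    if not (0 < i < len(text)) or not is_linking_consonant(text[i]):
        return False
    # Walk back over the run of ExtCccZwj characters. The virama itself has
    # ccc=9, so it is part of that run; remember whether one was seen.
    j = i - 1
    seen_virama = False
    while j >= 0 and is_ext_ccc_zwj(text[j]):
        seen_virama = seen_virama or text[j] == VIRAMA
        j -= 1
    # The run must contain a virama and be preceded by a linking consonant.
    return seen_virama and j >= 0 and is_linking_consonant(text[j])


ksha = "\u0915\u094D\u0937"  # KA, VIRAMA, SSA
print(rule_9_3_forbids_break(ksha, 2))            # no break before SSA
print(rule_9_3_forbids_break("\u0915\u0937", 1))  # no virama, rule does not apply
```

The backward scan relies on the virama being a member of ExtCccZwj, so "ExtCccZwj* Virama
ExtCccZwj*" reduces to a run of ExtCccZwj characters that contains at least one virama.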

3. Adding test files supplied by India to org.unicode.cldr.unittest.data.graphemeCluster/*

  TestSegmenter-Bengali.txt
  TestSegmenter-Devanagari.txt
  TestSegmenter-Gujarati.txt
  TestSegmenter-Malayalam.txt
  TestSegmenter-Odia.txt
  TestSegmenter-Telugu.txt

4. Adding modified files in this directory, which can be used in place of the default files from
   https://unicode.org/Public/12.0.0/ucd/auxiliary/

  GraphemeBreakTest.html
  GraphemeBreakTest.txt

Note: The GraphemeBreakProperty.txt file is unmodified, as those properties don't change.
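
For readers who want to consume GraphemeBreakTest.txt, here is a minimal Python sketch of a line
parser. It assumes the standard UCD test-file layout: hex code points separated by ÷ (break) and
× (no break), with a trailing comment after #. The sample line is hypothetical and written for
illustration, not copied from the files in this directory. Whether the TestSegmenter-*.txt files
use the same layout is not stated here.

```python
def parse_break_test_line(line):
    """Parse one test line into (text, break_positions), or None if empty.

    Assumes the layout "÷ 0915 × 094D × 0937 ÷  # comment".
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None  # blank line or comment-only line
    tokens = data.split()
    marks = tokens[0::2]        # ÷ or ×; mark k precedes code point k
    code_points = tokens[1::2]  # hex code points
    text = "".join(chr(int(cp, 16)) for cp in code_points)
    breaks = [k for k, mark in enumerate(marks) if mark == "÷"]
    return text, breaks


# Hypothetical sample line: KA, VIRAMA, SSA kept together as one cluster.
sample = "÷ 0915 × 094D × 0937 ÷\t# illustrative comment"
text, breaks = parse_break_test_line(sample)
print([f"U+{ord(c):04X}" for c in text], breaks)
```

The returned positions are offsets in code points, so 0 and len(text) mark the boundaries at the
start and end of the string.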