<chapter id="clusters">
<sect1 id="clusters">
  <title>Clusters</title>
  <para>
    In shaping text, a <emphasis>cluster</emphasis> is a sequence of
    code points that needs to be treated as a single, indivisible unit.
  </para>
  <para>
    When you add text to a HarfBuzz buffer, each character is associated
    with a <emphasis>cluster value</emphasis>. As far as HarfBuzz is
    concerned, this is an arbitrary number.
  </para>
  <para>
    Most clients use UTF-8, UTF-16, or UTF-32 indices as cluster values,
    but the actual numbers do not matter. Cluster values are not even
    required to be monotonically increasing, and nothing in the code
    assumes that they are. Note, however, that nearly all of HarfBuzz's
    tests use monotonically increasing cluster values. With that in
    mind, let's examine what happens to cluster values during shaping
    under each cluster level.
  </para>
  <para>
    HarfBuzz provides three <emphasis>levels</emphasis> of clustering
    support. Level 0 is the default behavior and reproduces the behavior
    of the old HarfBuzz library. Level 1 tweaks this behavior slightly
    to produce better results, so level 1 clustering is recommended for
    code that is not required to implement backward compatibility with
    the old HarfBuzz.
  </para>
  <para>
    Level 2 differs significantly in how it treats cluster values.
    Levels 0 and 1 both process ligatures and glyph decomposition by
    merging clusters; level 2 does not.
  </para>
  <para>
    The conceptual model for what the cluster values mean, in levels 0
    and 1, is this:
  </para>
  <itemizedlist spacing="compact">
    <listitem>
      <para>
        the sequence of cluster values will always remain monotone
      </para>
    </listitem>
    <listitem>
      <para>
        each value represents a single cluster
      </para>
    </listitem>
    <listitem>
      <para>
        each cluster contains one or more glyphs and one or more
        characters
      </para>
    </listitem>
  </itemizedlist>
  <para>
    Assuming that the initial cluster numbers were monotonically
    increasing and distinct, all adjacent glyphs that have the same
    cluster number belong to the same cluster, and each character
    belongs to the cluster with the highest number that is not larger
    than the character's initial cluster number. This will become
    clearer with an example.
  </para>
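  <para>
    The character-to-cluster rule can also be sketched in a few lines
    of Python. This only illustrates the rule stated above and is not
    HarfBuzz code: the function name
    <literal>assign_characters</literal> is hypothetical, and the
    sketch assumes that the initial cluster values were distinct and
    monotonically increasing. The glyph cluster values used here are
    the ones that the example in the next section arrives at.
  </para>
  <programlisting language="python"><![CDATA[
```python
from bisect import bisect_right


def assign_characters(char_clusters, glyph_clusters):
    """Map each character's initial cluster value to its final cluster."""
    # Distinct cluster values present in the shaped output, ascending.
    final = sorted(set(glyph_clusters))
    result = {}
    for value in char_clusters:
        # Highest final cluster value that is not larger than the
        # character's initial cluster value.
        index = bisect_right(final, value) - 1
        if index < 0:
            raise ValueError("no cluster at or below %d" % value)
        result[value] = final[index]
    return result


# Characters A..E start with cluster values 0..4; after shaping, the
# glyphs carry the values 0,1,1,1,1,4.
mapping = assign_characters([0, 1, 2, 3, 4], [0, 1, 1, 1, 1, 4])
print(mapping)  # {0: 0, 1: 1, 2: 1, 3: 1, 4: 4}
```
]]></programlisting>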
</sect1>
<sect1 id="a-clustering-example-for-levels-0-and-1">
  <title>A clustering example for levels 0 and 1</title>
  <para>
    Let's say we start with the following character sequence and cluster
    values:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    We then map the characters to glyphs. For simplicity, let's assume
    that each character maps to the corresponding, identical-looking
    glyph:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    Now if, for example, <literal>B</literal> and <literal>C</literal>
    ligate, then the clusters to which they belong &quot;merge&quot;.
    This merged cluster takes for its cluster number the minimum of all
    the cluster numbers of the clusters that went in. In this case, we
    get:
  </para>
  <programlisting>
   A,BC,D,E
   0,1 ,3,4
</programlisting>
  <para>
    Now let's assume that the <literal>BC</literal> glyph decomposes
    into three components, and <literal>D</literal> also decomposes into
    two. The components each inherit the cluster value of their parent:
  </para>
  <programlisting>
   A,BC0,BC1,BC2,D0,D1,E
   0,1  ,1  ,1  ,3 ,3 ,4
</programlisting>
  <para>
    Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
    their clusters (numbers 1 and 3) merge into
    <literal>min(1,3) = 1</literal>:
  </para>
  <programlisting>
   A,BC0,BC1,BC2D0,D1,E
   0,1  ,1  ,1    ,1 ,4
</programlisting>
  <para>
    At this point, cluster 1 means: the character sequence
    <literal>BCD</literal> is represented by glyphs
    <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
    further.
  </para>
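  <para>
    The steps of this example can be replayed with a small Python
    model. It is a sketch of the merging rule as described here, not of
    HarfBuzz's internals: the helpers
    <literal>merge_clusters</literal>, <literal>ligate</literal>, and
    <literal>decompose</literal> are hypothetical names, and glyphs are
    modeled as plain <literal>(name, cluster)</literal> pairs.
  </para>
  <programlisting language="python"><![CDATA[
```python
def merge_clusters(glyphs, start, end):
    """Merge every cluster touched by glyphs[start:end] into the minimum."""
    touched = {cluster for _, cluster in glyphs[start:end]}
    lowest = min(touched)
    return [(name, lowest if cluster in touched else cluster)
            for name, cluster in glyphs]


def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with one glyph, merging their clusters."""
    merged = merge_clusters(glyphs, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(glyphs, index, parts):
    """Replace one glyph with components that inherit its cluster value."""
    cluster = glyphs[index][1]
    return (glyphs[:index] + [(part, cluster) for part in parts]
            + glyphs[index + 1:])


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = ligate(glyphs, 1, 3, "BC")                    # B and C ligate
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])   # BC -> 3 components
glyphs = decompose(glyphs, 4, ["D0", "D1"])            # D -> 2 components
glyphs = ligate(glyphs, 3, 5, "BC2D0")                 # BC2 and D0 ligate

print([name for name, _ in glyphs])
# ['A', 'BC0', 'BC1', 'BC2D0', 'D1', 'E']
print([cluster for _, cluster in glyphs])
# [0, 1, 1, 1, 1, 4]
```
]]></programlisting>
  <para>
    Note that <literal>D1</literal> ends up with cluster value 1 even
    though it did not take part in the ligature: it belonged to
    cluster 3, and that whole cluster was merged.
  </para>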
</sect1>
<sect1 id="reordering-in-levels-0-and-1">
  <title>Reordering in levels 0 and 1</title>
  <para>
    Another common operation in the more complex shapers is glyph
    reordering. To keep cluster values monotone in those cases,
    HarfBuzz merges the clusters of everything in the reordered
    sequence. For example, let's again start with the character
    sequence:
  </para>
  <programlisting>
   A,B,C,D,E
   0,1,2,3,4
</programlisting>
  <para>
    If <literal>D</literal> is reordered before <literal>B</literal>,
    then the <literal>B</literal>, <literal>C</literal>, and
    <literal>D</literal> clusters merge, and we get:
  </para>
  <programlisting>
   A,D,B,C,E
   0,1,1,1,4
</programlisting>
  <para>
    This is clearly not ideal, but it is the only sensible way to
    maintain monotone indices and retain the true relationship between
    glyphs and characters.
  </para>
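  <para>
    A minimal Python sketch of this rule follows. The function
    <literal>reorder</literal> is a hypothetical helper, not a HarfBuzz
    API, and it only covers the case shown here, where a glyph moves to
    an earlier position.
  </para>
  <programlisting language="python"><![CDATA[
```python
def reorder(glyphs, src, dst):
    """Move glyphs[src] to the earlier position dst (levels 0 and 1).

    Every glyph in the reordered span receives the minimum cluster
    value found in that span, which keeps the sequence monotone.
    """
    span = glyphs[dst:src + 1]
    lowest = min(cluster for _, cluster in span)
    merged = [(name, lowest) for name, _ in span]
    # Rotate the span so that its last glyph comes first.
    merged = merged[-1:] + merged[:-1]
    return glyphs[:dst] + merged + glyphs[src + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
result = reorder(glyphs, 3, 1)  # D moves before B
print(result)
# [('A', 0), ('D', 1), ('B', 1), ('C', 1), ('E', 4)]
```
]]></programlisting>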
</sect1>
<sect1 id="the-distinction-between-levels-0-and-1">
  <title>The distinction between levels 0 and 1</title>
  <para>
    The above describes what both cluster levels 0 and 1 do. The only
    difference between the two is this: in level 0, at the very
    beginning of the shaping process, we also merge the cluster of each
    base character with the clusters of all Unicode marks (combining or
    not) that follow it. For example:
  </para>
  <programlisting>
  A,acute,B
  0,1    ,2
</programlisting>
  <para>
    will become:
  </para>
  <programlisting>
  A,acute,B
  0,0    ,2
</programlisting>
  <para>
    This is the default behavior. It remained the default because both
    Windows and the old HarfBuzz did it. But this behavior makes it
    impossible to color diacritic marks differently from their base
    characters. That's why in level 1 we do not perform this initial
    merging step.
  </para>
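  <para>
    The initial step can be modeled roughly as follows. This is a
    simplified sketch, not HarfBuzz's actual implementation: the
    function <literal>initial_clusters</literal> is hypothetical, it
    uses code point indices as cluster values, and it treats any
    character whose Unicode general category starts with
    <literal>M</literal> as a mark, which is an assumption about what
    &quot;Unicode marks&quot; covers.
  </para>
  <programlisting language="python"><![CDATA[
```python
import unicodedata


def initial_clusters(text, level):
    """Return the cluster values in effect when shaping starts."""
    clusters = list(range(len(text)))  # one value per code point index
    if level == 0:
        for i in range(1, len(text)):
            # Assumption: "mark" means general category Mn, Mc, or Me.
            if unicodedata.category(text[i]).startswith("M"):
                clusters[i] = clusters[i - 1]
    return clusters


text = "A\u0301B"  # A, combining acute accent, B
print(initial_clusters(text, 0))  # [0, 0, 2]
print(initial_clusters(text, 1))  # [0, 1, 2]
```
]]></programlisting>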
  <para>
    For clients, level 0 is more convenient if they rely on HarfBuzz
    clusters for cursor positioning. But that's wrong anyway: cursor
    positions should be determined based on Unicode grapheme boundaries,
    not shaping clusters. As such, level 1 clusters are preferred.
  </para>
  <para>
    One last note about levels 0 and 1. We currently don't allow a
    <literal>MultipleSubst</literal> lookup to replace a glyph with zero
    glyphs (i.e., to delete a glyph). But in some other situations,
    glyphs can be deleted. In those cases, if the glyph being deleted is
    the last glyph of its cluster, we make sure to merge the cluster
    with a neighboring cluster.
  </para>
  <para>
    This is primarily to make sure that the first cluster of a run
    always has a cluster value that points to the start of the run's
    text; more than one client currently relies on this guarantee.
  </para>
  <para>
    Incidentally, Apple's CoreText does something else to maintain the
    same promise: it inserts a glyph with id 65535 at the beginning of
    the glyph string if the glyph corresponding to the first character
    in the run was deleted. HarfBuzz might do something similar in the
    future.
  </para>
</sect1>
<sect1 id="level-2">
  <title>Level 2</title>
  <para>
    Level 2 is a different beast from levels 0 and 1. It is simple to
    describe, but hard to make sense of: it doesn't do any cluster
    merging whatsoever. When glyphs ligate, or multiple glyphs otherwise
    turn into one, the cluster value of the first glyph is retained.
  </para>
  <para>
    Here are a few examples of why processing cluster values produced at
    this level might be tricky:
  </para>
  <sect2 id="ligatures-with-combining-marks">
    <title>Ligatures with combining marks</title>
    <para>
      Imagine capital letters are bases and lower case letters are
      combining marks. With an input sequence like this:
    </para>
    <programlisting>
  A,a,B,b,C,c
  0,1,2,3,4,5
</programlisting>
    <para>
      if <literal>A,B,C</literal> ligate, then here are the cluster
      values one would get under the various levels:
    </para>
    <para>
      level 0:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,0,0,0
</programlisting>
    <para>
      level 1:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,0,0,5
</programlisting>
    <para>
      level 2:
    </para>
    <programlisting>
  ABC,a,b,c
  0  ,1,3,5
</programlisting>
    <para>
      Making sense of the last example is the hardest for a client,
      because there is nothing in the cluster values to suggest that
      <literal>B</literal> and <literal>C</literal> ligated with
      <literal>A</literal>.
    </para>
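    <para>
      The three results can be reproduced with a small Python model of
      the rules described in this chapter. It is a sketch under stated
      assumptions, not HarfBuzz code: the function
      <literal>ligate_bases</literal> is hypothetical, lower-case names
      stand in for marks, and the marks are assumed to stay, in order,
      after the ligature glyph.
    </para>
    <programlisting language="python"><![CDATA[
```python
def ligate_bases(level):
    """Ligate the bases A, B, C and return (glyph, cluster) pairs."""
    names = ["A", "a", "B", "b", "C", "c"]
    clusters = [0, 1, 2, 3, 4, 5]

    if level == 0:
        # Initial step: each mark joins the cluster of the glyph before it.
        for i, name in enumerate(names):
            if name.islower():
                clusters[i] = clusters[i - 1]

    # The ligature spans the first base (index 0) to the last base (index 4).
    first, last = 0, 4
    if level in (0, 1):
        # Merge every cluster touched by the span into the minimum value.
        touched = set(clusters[first:last + 1])
        lowest = min(touched)
        clusters = [lowest if c in touched else c for c in clusters]

    # The ligature keeps the (possibly merged) value of its first glyph.
    marks = [(n, c) for n, c in zip(names, clusters) if n.islower()]
    return [("ABC", clusters[first])] + marks


for level in (0, 1, 2):
    print(level, [cluster for _, cluster in ligate_bases(level)])
# 0 [0, 0, 0, 0]
# 1 [0, 0, 0, 5]
# 2 [0, 1, 3, 5]
```
]]></programlisting>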
  </sect2>
  <sect2 id="reordering">
    <title>Reordering</title>
    <para>
      Another tricky case is reordering. Under level 2, start with:
    </para>
    <programlisting>
  A,B,C,D,E
  0,1,2,3,4
</programlisting>
    <para>
      Now imagine <literal>D</literal> moves before
      <literal>B</literal>:
    </para>
    <programlisting>
  A,D,B,C,E
  0,3,1,2,4
</programlisting>
    <para>
      Now, if <literal>D</literal> ligates with <literal>B</literal>, we
      get:
    </para>
    <programlisting>
  A,DB,C,E
  0,3 ,2,4
</programlisting>
    <para>
      In a different scenario, <literal>A</literal> and
      <literal>B</literal> could have ligated
      <emphasis>before</emphasis> <literal>D</literal> reordered; that
      would have resulted in:
    </para>
    <programlisting>
  AB,D,C,E
  0 ,3,2,4
</programlisting>
    <para>
      There's no way to differentiate between these two scenarios based
      on the cluster numbers alone.
    </para>
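    <para>
      The ambiguity can be checked with a short Python model of the
      level 2 rules. The helpers <literal>ligate</literal> and
      <literal>move_before</literal> are hypothetical and only model the
      behavior described in this section: a ligature keeps the cluster
      value of its first glyph, and reordering leaves cluster values
      untouched.
    </para>
    <programlisting language="python"><![CDATA[
```python
def ligate(glyphs, index, name):
    """Level 2: fuse glyphs[index] and glyphs[index + 1] into one glyph
    that keeps the cluster value of the first of the two."""
    return glyphs[:index] + [(name, glyphs[index][1])] + glyphs[index + 2:]


def move_before(glyphs, src, dst):
    """Level 2: move glyphs[src] to the earlier position dst without
    changing any cluster value."""
    moved = glyphs[:src] + glyphs[src + 1:]
    moved.insert(dst, glyphs[src])
    return moved


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Scenario 1: D moves before B, then D and B ligate.
first = ligate(move_before(start, 3, 1), 1, "DB")

# Scenario 2: A and B ligate, then D moves before C.
second = move_before(ligate(start, 0, "AB"), 2, 1)

print(first)   # [('A', 0), ('DB', 3), ('C', 2), ('E', 4)]
print(second)  # [('AB', 0), ('D', 3), ('C', 2), ('E', 4)]

# The glyphs differ, but the cluster values are identical.
print([c for _, c in first] == [c for _, c in second])  # True
```
]]></programlisting>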
    <para>
      Another problem happens with ligatures under level 2 if the
      direction of the text is forced to the opposite of its natural
      direction (e.g. left-to-right Arabic). But that's too much of a
      corner case to worry about.
    </para>
  </sect2>
</sect1>
</chapter>