1<?xml version="1.0"?> 2<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" 3 "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ 4 <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> 5 <!ENTITY version SYSTEM "version.xml"> 6]> 7<chapter id="clusters"> 8 <title>Clusters</title> 9 <section id="clusters"> 10 <title>Clusters</title> 11 <para> 12 In text shaping, a <emphasis>cluster</emphasis> is a sequence of 13 characters that needs to be treated as a single, indivisible 14 unit. 15 </para> 16 <para> 17 A cluster is distinct from a <emphasis>grapheme</emphasis>, 18 which is the smallest unit of a writing system or script, 19 because clusters are only relevant for script shaping and the 20 layout of glyphs. 21 </para> 22 <para> 23 For example, a grapheme may be a letter, a number, a logogram, 24 or a symbol. When two letters form a ligature, however, they 25 combine into a single glyph. They are therefore part of the same 26 cluster and are treated as a unit — even though the two 27 original, underlying letters are separate graphemes. 28 </para> 29 <para> 30 During the shaping process, there are several shaping operations 31 that may merge adjacent characters (for example, when two code 32 points form a ligature or a conjunct form and are replaced by a 33 single glyph) or split one character into several (for example, 34 when decomposing a code point through the 35 <literal>ccmp</literal> feature). 36 </para> 37 <para> 38 HarfBuzz tracks clusters independently from how these 39 shaping operations affect the individual glyphs that comprise the 40 output HarfBuzz returns in a buffer. Consequently, 41 a client program using HarfBuzz can utilize the cluster 42 information to implement features such as: 43 </para> 44 <itemizedlist> 45 <listitem> 46 <para> 47 Correctly positioning the cursor within a shaped text run, 48 even when characters have formed ligatures, composed or 49 decomposed, reordered, or undergone other shaping operations. 50 </para> 51 </listitem> 52 <listitem> 53 <para> 54 Correctly highlighting a text selection that includes some, 55 but not all, of the characters in a word. 56 </para> 57 </listitem> 58 <listitem> 59 <para> 60 Applying text attributes (such as color or underlining) to 61 part, but not all, of a word. 62 </para> 63 </listitem> 64 <listitem> 65 <para> 66 Generating output document formats (such as PDF) with 67 embedded text that can be fully extracted. 68 </para> 69 </listitem> 70 <listitem> 71 <para> 72 Determining the mapping between input characters and output 73 glyphs, such as which glyphs are ligatures. 74 </para> 75 </listitem> 76 <listitem> 77 <para> 78 Performing line-breaking, justification, and other 79 line-level or paragraph-level operations that must be done 80 after shaping is complete, but which require character-level 81 properties. 82 </para> 83 </listitem> 84 </itemizedlist> 85 <para> 86 When you add text to a HarfBuzz buffer, each code point must be 87 assigned a <emphasis>cluster value</emphasis>. 88 </para> 89 <para> 90 This cluster value is an arbitrary number; HarfBuzz uses it only 91 to distinguish between clusters. Many client programs will use 92 the index of each code point in the input text stream as the 93 cluster value. This is for the sake of convenience; the actual 94 value does not matter. 95 </para> 96 <para> 97 Client programs can choose how HarfBuzz handles clusters during 98 shaping by setting the 99 <literal>cluster_level</literal> of the 100 buffer. HarfBuzz offers three <emphasis>levels</emphasis> of 101 clustering support for this property: 102 </para> 103 <itemizedlist> 104 <listitem> 105 <para><emphasis>Level 0</emphasis> is the default and 106 reproduces the behavior of the old HarfBuzz library. 107 </para> 108 <para> 109 The distinguishing feature of level 0 behavior is that, at 110 the beginning of processing the buffer, all code points that 111 are categorized as <emphasis>marks</emphasis>, 112 <emphasis>modifier symbols</emphasis>, or 113 <emphasis>Emoji extended pictographic</emphasis> modifiers, 114 as well as the <emphasis>Zero Width Joiner</emphasis> and 115 <emphasis>Zero Width Non-Joiner</emphasis> code points, are 116 assigned the cluster value of the closest preceding code 117 point from <emphasis>different</emphasis> category. 118 </para> 119 <para> 120 In essence, whenever a base character is followed by a mark 121 character or a sequence of mark characters, those marks are 122 reassigned to the same initial cluster value as the base 123 character. This reassignment is referred to as 124 "merging" the affected clusters. This behavior is based on 125 the Grapheme Cluster Boundary specification in <ulink 126 url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode 127 Technical Report 29</ulink>. 128 </para> 129 <para> 130 Client programs can specify level 0 behavior for a buffer by 131 setting its <literal>cluster_level</literal> to 132 <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. 133 </para> 134 </listitem> 135 <listitem> 136 <para> 137 <emphasis>Level 1</emphasis> tweaks the old behavior 138 slightly to produce better results. Therefore, level 1 139 clustering is recommended for code that is not required to 140 implement backward compatibility with the old HarfBuzz. 141 </para> 142 <para> 143 Level 1 differs from level 0 by not merging the 144 clusters of marks and other modifier code points with the 145 preceding "base" code point's cluster. By preserving the 146 separate cluster values of these marks and modifier code 147 points, script shapers can perform additional operations 148 that might lead to improved results (for example, reordering 149 a sequence of marks). 150 </para> 151 <para> 152 Client programs can specify level 1 behavior for a buffer by 153 setting its <literal>cluster_level</literal> to 154 <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. 155 </para> 156 </listitem> 157 <listitem> 158 <para> 159 <emphasis>Level 2</emphasis> differs significantly in how it 160 treats cluster values. In level 2, HarfBuzz never merges 161 clusters. 162 </para> 163 <para> 164 This difference can be seen most clearly when HarfBuzz processes 165 ligature substitutions and glyph decompositions. In level 0 166 and level 1, ligatures and glyph decomposition both involve 167 merging clusters; in level 2, neither of these operations 168 triggers a merge. 169 </para> 170 <para> 171 Client programs can specify level 2 behavior for a buffer by 172 setting its <literal>cluster_level</literal> to 173 <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. 174 </para> 175 </listitem> 176 </itemizedlist> 177 <para> 178 As mentioned earlier, client programs using HarfBuzz often 179 assign initial cluster values in a buffer by reusing the indices 180 of the code points in the input text. This gives a sequence of 181 cluster values that is monotonically increasing (for example, 182 0,1,2,3,4,5). 183 </para> 184 <para> 185 It is not <emphasis>required</emphasis> that the cluster values 186 in a buffer be monotonically increasing. However, if the initial 187 cluster values in a buffer are monotonic and the buffer is 188 configured to use cluster level 0 or 1, then HarfBuzz 189 guarantees that the final cluster values in the shaped buffer 190 will also be monotonic. No such guarantee is made for cluster 191 level 2. 192 </para> 193 <para> 194 In levels 0 and 1, HarfBuzz implements the following conceptual 195 model for cluster values: 196 </para> 197 <itemizedlist spacing="compact"> 198 <listitem> 199 <para> 200 If the sequence of input cluster values is monotonic, the 201 sequence of cluster values will remain monotonic. 202 </para> 203 </listitem> 204 <listitem> 205 <para> 206 Each cluster value represents a single cluster. 207 </para> 208 </listitem> 209 <listitem> 210 <para> 211 Each cluster contains one or more glyphs and one or more 212 characters. 213 </para> 214 </listitem> 215 </itemizedlist> 216 <para> 217 In practice, this model offers several benefits. Assuming that 218 the initial cluster values were monotonically increasing 219 and distinct before shaping began, then, in the final output: 220 </para> 221 <itemizedlist spacing="compact"> 222 <listitem> 223 <para> 224 All adjacent glyphs having the same final cluster 225 value belong to the same cluster. 226 </para> 227 </listitem> 228 <listitem> 229 <para> 230 Each character belongs to the cluster that has the highest 231 cluster value <emphasis>not larger than</emphasis> its 232 initial cluster value. 233 </para> 234 </listitem> 235 </itemizedlist> 236 237 </section> 238 <section id="a-clustering-example-for-levels-0-and-1"> 239 <title>A clustering example for levels 0 and 1</title> 240 <para> 241 The guarantees and benefits of level 0 and level 1 can be seen 242 with some examples. First, let us examine what happens with cluster 243 values when shaping involves cluster merging with ligatures and 244 decomposition. 245 </para> 246 <para> 247 Let's say we start with the following character sequence (top row) and 248 initial cluster values (bottom row): 249 </para> 250 <programlisting> 251 A,B,C,D,E 252 0,1,2,3,4 253 </programlisting> 254 <para> 255 During shaping, HarfBuzz maps these characters to glyphs from 256 the font. For simplicity, let us assume that each character maps 257 to the corresponding, identical-looking glyph: 258 </para> 259 <programlisting> 260 A,B,C,D,E 261 0,1,2,3,4 262 </programlisting> 263 <para> 264 Now if, for example, <literal>B</literal> and <literal>C</literal> 265 form a ligature, then the clusters to which they belong 266 "merge". This merged cluster takes for its cluster 267 value the minimum of all the cluster values of the clusters that 268 went in to the ligature. In this case, we get: 269 </para> 270 <programlisting> 271 A,BC,D,E 272 0,1 ,3,4 273 </programlisting> 274 <para> 275 because 1 is the minimum of the set {1,2}, which were the 276 cluster values of <literal>B</literal> and 277 <literal>C</literal>. 278 </para> 279 <para> 280 Next, let us say that the <literal>BC</literal> ligature glyph 281 decomposes into three components, and <literal>D</literal> also 282 decomposes into two components. These components each inherit the 283 cluster value of their parent: 284 </para> 285 <programlisting> 286 A,BC0,BC1,BC2,D0,D1,E 287 0,1 ,1 ,1 ,3 ,3 ,4 288 </programlisting> 289 <para> 290 Next, if <literal>BC2</literal> and <literal>D0</literal> form a 291 ligature, then their clusters (cluster values 1 and 3) merge into 292 <literal>min(1,3) = 1</literal>: 293 </para> 294 <programlisting> 295 A,BC0,BC1,BC2D0,D1,E 296 0,1 ,1 ,1 ,1 ,4 297 </programlisting> 298 <para> 299 At this point, cluster 1 means: the character sequence 300 <literal>BCD</literal> is represented by glyphs 301 <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any 302 further. 303 </para> 304 </section> 305 <section id="reordering-in-levels-0-and-1"> 306 <title>Reordering in levels 0 and 1</title> 307 <para> 308 Another common operation in the more complex shapers is glyph 309 reordering. In order to maintain a monotonic cluster sequence 310 when glyph reordering takes place, HarfBuzz merges the clusters 311 of everything in the reordering sequence. 312 </para> 313 <para> 314 For example, let us again start with the character sequence (top 315 row) and initial cluster values (bottom row): 316 </para> 317 <programlisting> 318 A,B,C,D,E 319 0,1,2,3,4 320 </programlisting> 321 <para> 322 If <literal>D</literal> is reordered to before <literal>B</literal>, 323 then HarfBuzz merges the <literal>B</literal>, 324 <literal>C</literal>, and <literal>D</literal> clusters, and we 325 get: 326 </para> 327 <programlisting> 328 A,D,B,C,E 329 0,1,1,1,4 330 </programlisting> 331 <para> 332 This is clearly not ideal, but it is the only sensible way to 333 maintain a monotonic sequence of cluster values and retain the 334 true relationship between glyphs and characters. 335 </para> 336 </section> 337 <section id="the-distinction-between-levels-0-and-1"> 338 <title>The distinction between levels 0 and 1</title> 339 <para> 340 The preceding examples demonstrate the main effects of using 341 cluster levels 0 and 1. The only difference between the two 342 levels is this: in level 0, at the very beginning of the shaping 343 process, HarfBuzz also merges clusters between any base character 344 and all Unicode marks (combining or not) that follow it. 345 </para> 346 <para> 347 For example, let us start with the following character sequence 348 (top row) and accompanying initial cluster values (bottom row): 349 </para> 350 <programlisting> 351 A,acute,B 352 0,1 ,2 353 </programlisting> 354 <para> 355 The <literal>acute</literal> is a Unicode mark. If HarfBuzz is 356 using cluster level 0 on this sequence, then the 357 <literal>A</literal> and <literal>acute</literal> clusters will 358 merge, and the result will become: 359 </para> 360 <programlisting> 361 A,acute,B 362 0,0 ,2 363 </programlisting> 364 <para> 365 This initial cluster merging is the default behavior of the 366 Windows shaping engine, and the old HarfBuzz codebase copied 367 that behavior to maintain compatibility. Consequently, it has 368 remained the default behavior in the new HarfBuzz codebase. 369 </para> 370 <para> 371 But this initial cluster-merging behavior makes it impossible to 372 color diacritic marks differently from their base 373 characters. That is why, in level 1, HarfBuzz does not perform 374 the initial merging step. 375 </para> 376 <para> 377 For client programs that rely on HarfBuzz cluster values to 378 perform cursor positioning, level 0 is more convenient. But 379 relying on cluster boundaries for cursor positioning is wrong: cursor 380 positions should be determined based on Unicode grapheme 381 boundaries, not on shaping-cluster boundaries. As such, level 1 382 clusters are preferred. 383 </para> 384 <para> 385 One last note about levels 0 and 1. HarfBuzz currently does not allow a 386 <literal>MultipleSubst</literal> lookup to replace a glyph with zero 387 glyphs (in other words, to delete a glyph). But, in some other situations, 388 glyphs can be deleted. In those cases, if the glyph being deleted is 389 the last glyph of its cluster, HarfBuzz makes sure to merge the cluster 390 with a neighboring cluster. 391 </para> 392 <para> 393 This is done primarily to make sure that the starting cluster of the 394 text always has the cluster index pointing to the start of the text 395 for the run; more than one client currently relies on this 396 guarantee. 397 </para> 398 <para> 399 Incidentally, Apple's CoreText does something else to maintain the 400 same promise: it inserts a glyph with id 65535 at the beginning of 401 the glyph string if the glyph corresponding to the first character 402 in the run was deleted. HarfBuzz might do something similar in the 403 future. 404 </para> 405 </section> 406 <section id="level-2"> 407 <title>Level 2</title> 408 <para> 409 HarfBuzz's level 2 cluster behavior uses a significantly 410 different model than that of level 0 and level 1. 411 </para> 412 <para> 413 The level 2 behavior is easy to describe, but it may be 414 difficult to understand in practical terms. In brief, level 2 415 performs no merging of clusters whatsoever. 416 </para> 417 <para> 418 When glyphs form a ligature (or when some other feature 419 substitutes multiple glyphs with one glyph), the cluster value 420 of the first glyph is retained as the cluster value for the 421 ligature. However, no subsequent clusters — including 422 marks and modifiers — are affected. 423 </para> 424 <para> 425 Level 2 cluster behavior is less complex than level 0 or level 426 1, but there are a few cases in which processing cluster values 427 produced at level 2 may be tricky. 428 </para> 429 <section id="ligatures-with-combining-marks-in-level-2"> 430 <title>Ligatures with combining marks in level 2</title> 431 <para> 432 The first example of how HarfBuzz's level 2 cluster behavior 433 can be tricky is when the text to be shaped includes combining 434 marks attached to ligatures. 435 </para> 436 <para> 437 Let us start with an input sequence with the following 438 characters (top row) and initial cluster values (bottom row): 439 </para> 440 <programlisting> 441 A,acute,B,breve,C,circumflex 442 0,1 ,2,3 ,4,5 443 </programlisting> 444 <para> 445 If the sequence <literal>A,B,C</literal> forms a ligature, 446 then these are the cluster values HarfBuzz will return under 447 the various cluster levels: 448 </para> 449 <para> 450 Level 0: 451 </para> 452 <programlisting> 453 ABC,acute,breve,circumflex 454 0 ,0 ,0 ,0 455 </programlisting> 456 <para> 457 Level 1: 458 </para> 459 <programlisting> 460 ABC,acute,breve,circumflex 461 0 ,0 ,0 ,5 462 </programlisting> 463 <para> 464 Level 2: 465 </para> 466 <programlisting> 467 ABC,acute,breve,circumflex 468 0 ,1 ,3 ,5 469 </programlisting> 470 <para> 471 Making sense of the level 2 result is the hardest for a client 472 program, because there is nothing in the cluster values that 473 indicates that <literal>B</literal> and <literal>C</literal> 474 formed a ligature with <literal>A</literal>. 475 </para> 476 <para> 477 In contrast, the "merged" cluster values of the mark glyphs 478 that are seen in the level 0 and level 1 output are evidence 479 that a ligature substitution took place. 480 </para> 481 </section> 482 <section id="reordering-in-level-2"> 483 <title>Reordering in level 2</title> 484 <para> 485 Another example of how HarfBuzz's level 2 cluster behavior 486 can be tricky is when glyphs reorder. Consider an input sequence 487 with the following characters (top row) and initial cluster 488 values (bottom row): 489 </para> 490 <programlisting> 491 A,B,C,D,E 492 0,1,2,3,4 493 </programlisting> 494 <para> 495 Now imagine <literal>D</literal> moves before 496 <literal>B</literal> in a reordering operation. The cluster 497 values will then be: 498 </para> 499 <programlisting> 500 A,D,B,C,E 501 0,3,1,2,4 502 </programlisting> 503 <para> 504 Next, if <literal>D</literal> forms a ligature with 505 <literal>B</literal>, the output is: 506 </para> 507 <programlisting> 508 A,DB,C,E 509 0,3 ,2,4 510 </programlisting> 511 <para> 512 However, in a different scenario, in which the shaping rules 513 of the script instead caused <literal>A</literal> and 514 <literal>B</literal> to form a ligature 515 <emphasis>before</emphasis> the <literal>D</literal> reordered, the 516 result would be: 517 </para> 518 <programlisting> 519 AB,D,C,E 520 0 ,3,2,4 521 </programlisting> 522 <para> 523 There is no way for a client program to differentiate between 524 these two scenarios based on the cluster values 525 alone. Consequently, client programs that use level 2 might 526 need to undertake additional work in order to manage cursor 527 positioning, text attributes, or other desired features. 528 </para> 529 </section> 530 <section id="other-considerations-in-level-2"> 531 <title>Other considerations in level 2</title> 532 <para> 533 There may be other problems encountered with ligatures under 534 level 2, such as if the direction of the text is forced to 535 opposite of its natural direction (for example, left-to-right 536 Arabic). But, generally speaking, these other scenarios are 537 minor corner cases that are too obscure for most client 538 programs to need to worry about. 539 </para> 540 </section> 541 </section> 542</chapter> 543