<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
    <para>
      For <emphasis>complex scripts</emphasis>, however, any combination
      of several shaping operations may be required, and the rules for
      how and when they are applied vary from script to script. HarfBuzz
      and other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations, such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given complex script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given complex script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal and/or vertical position of a
          glyph. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some complex
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks.
      The UIPC property sub-categorizes Mark codepoints by the relative
      visual position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          non-complex scripts, and may also be used as a fallback for
          handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, Telugu, and Sinhala.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports complex scripts not covered by one of
          the above, script-specific shaping models, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines to match glyph
      sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
  </section>
</chapter>