<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
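    <para>
      As a rough illustration of the simple case, the following Python
      sketch accumulates per-glyph advances into horizontal
      positions. The <literal>ADVANCES</literal> table and the
      <literal>layout_simple</literal> function are hypothetical; a
      real shaper reads advances from the font and works on glyph IDs
      rather than characters.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units (assumed values, not from a real font).
ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(text, advances):
    """Place each glyph at the current pen position, then advance the pen."""
    positions = []
    pen_x = 0
    for ch in text:
        positions.append((ch, pen_x))
        pen_x += advances[ch]
    return positions, pen_x


positions, total_advance = layout_simple("Hi!", ADVANCES)
```
]]></programlisting>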
    <para>
      For <emphasis>complex scripts</emphasis>, however, any combination
      of several shaping operations may be required, and the rules for
      how and when they are applied vary from script to script. HarfBuzz
      and other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are also available, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="complex-scripts">
    <title>Complex scripts</title>
    <para>
      In text-shaping terminology, scripts are generally classified as
      either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>.
    </para>
    <para>
      Complex scripts are those for which transforming the input
      sequence into the final layout requires some combination of
      operations such as context-dependent substitutions,
      context-dependent mark positioning, glyph-to-glyph joining,
      glyph reordering, or glyph stacking.
    </para>
    <para>
      In some complex scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other complex scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Non-complex scripts, by definition, do not require these
      operations. However, correctly shaping a text run in a
      non-complex script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features. The key difference is that a text run in a
      non-complex script can be processed sequentially and in the same
      order as the input sequence of Unicode codepoints, without
      requiring an analysis stage.
    </para>
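    <para>
      The Unicode normalization step mentioned above can be
      illustrated with Python's standard
      <literal>unicodedata</literal> module. This is only a sketch of
      what normalization does to a codepoint sequence; which form a
      given shaper prefers, and when, is not shown here.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT (two codepoints).
decomposed = "e\u0301"

# NFC composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD splits it back into base letter plus combining mark.
roundtrip = unicodedata.normalize("NFD", composed)
```
]]></programlisting>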
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a complex-script text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the complex-script shaping models.
    </para>

    <itemizedlist>
      <listitem>
	<para>
	  A <emphasis>reordering</emphasis> operation moves a glyph
	  from its original ("logical") position in the sequence to
	  some other ("visual") position.
	</para>
	<para>
	  The shaping model for a given complex script might involve
	  more than one reordering step.
	</para>
      </listitem>

      <listitem>
	<para>
	  A <emphasis>joining</emphasis> operation replaces a glyph
	  with an alternate form that is designed to connect with one
	  or more of the adjacent glyphs in the sequence.
	</para>
      </listitem>

      <listitem>
	<para>
	  A contextual <emphasis>substitution</emphasis> operation
	  replaces either a single glyph or a subsequence of several
	  glyphs with an alternate glyph. This substitution is
	  performed when the original glyph or subsequence of glyphs
	  occurs in a specified position with respect to the
	  surrounding sequence. For example, one substitution might be
	  performed only when the target glyph is the first glyph in
	  the sequence, while another substitution is performed only
	  when a different target glyph occurs immediately after a
	  particular string pattern.
	</para>
	<para>
	  The shaping model for a given complex script might involve
	  multiple contextual-substitution operations, each applying
	  to different target glyphs and patterns and each performed
	  in a separate step.
	</para>
      </listitem>

      <listitem>
	<para>
	  A contextual <emphasis>positioning</emphasis> operation
	  adjusts the horizontal and/or vertical position of a
	  glyph. This adjustment is performed when the glyph
	  occurs in a specified position with respect to the
	  surrounding sequence.
	</para>
	<para>
	  Many contextual positioning operations are used to place
	  <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
	  signs, and tone markers) with respect to
	  <emphasis>base</emphasis> glyphs. However, some complex
	  scripts may use contextual positioning operations to
	  correctly place base glyphs as well, such as
	  when the script uses <emphasis>stacking</emphasis> characters.
	</para>
      </listitem>

    </itemizedlist>
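    <para>
      The reordering operation described above can be sketched in
      Python using the Devanagari vowel sign I (U+093F), which follows
      its consonant in logical order but is drawn to its left. The
      function name <literal>reorder_prebase</literal> is
      hypothetical, and the sketch is deliberately simplified: it
      works on codepoints as stand-ins for glyphs and moves the sign
      past only one preceding character, whereas a real shaping model
      moves it in front of the whole consonant cluster.
    </para>
    <programlisting language="python"><![CDATA[
```python
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn left of its base


def reorder_prebase(logical):
    """Return the sequence with each VOWEL SIGN I moved before the item preceding it."""
    visual = []
    for item in logical:
        if item == VOWEL_SIGN_I and visual:
            # Insert the sign just before the most recently placed item.
            visual.insert(len(visual) - 1, item)
        else:
            visual.append(item)
    return visual


logical = ["\u0915", "\u093F"]      # KA, VOWEL SIGN I (logical order)
visual = reorder_prebase(logical)   # VOWEL SIGN I, KA (visual order)
```
]]></programlisting>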
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
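    <para>
      These categories can be inspected with Python's standard
      <literal>unicodedata</literal> module, whose
      <literal>category()</literal> function returns the two-letter
      General Category code. The sample codepoints and the
      <literal>describe</literal> helper below are chosen only for
      illustration.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# One sample each for Lu, Lm, Mn, Mc, and Me.
SAMPLES = ["A", "\u02B0", "\u0301", "\u093F", "\u20DD"]


def describe(ch):
    """Return (codepoint label, General Category code, is-mark flag)."""
    category = unicodedata.category(ch)
    # The first letter of the code is the Major category; "M" means Mark.
    return (f"U+{ord(ch):04X}", category, category.startswith("M"))


report = [describe(ch) for ch in SAMPLES]
```
]]></programlisting>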
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some complex scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
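    <para>
      The idea of a syllable regular expression can be sketched as
      follows. Everything here is an assumption made for illustration:
      the <literal>CLASSES</literal> table is hand-assigned for a few
      Devanagari codepoints rather than derived from the UISC and UIPC
      data, and the pattern is far simpler than the expressions used
      by any real shaping model.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re

# Hand-assigned one-letter classes (illustrative, not real UISC values).
CLASSES = {
    "\u0905": "V",  # DEVANAGARI LETTER A (independent vowel)
    "\u0924": "C",  # DEVANAGARI LETTER TA
    "\u0928": "C",  # DEVANAGARI LETTER NA
    "\u092E": "C",  # DEVANAGARI LETTER MA
    "\u0938": "C",  # DEVANAGARI LETTER SA
    "\u093E": "M",  # DEVANAGARI VOWEL SIGN AA
    "\u0947": "M",  # DEVANAGARI VOWEL SIGN E
    "\u094D": "H",  # DEVANAGARI SIGN VIRAMA (halant)
}

# Zero or more consonant+halant pairs, a consonant, an optional vowel
# sign; or a lone independent vowel.
SYLLABLE = re.compile(r"(?:CH)*CM?|V")


def split_syllables(text):
    """Split text into syllables by matching the pattern over class letters."""
    classes = "".join(CLASSES[ch] for ch in text)  # KeyError on unknown input
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# NA, MA, SA + VIRAMA + TA + VOWEL SIGN E
syllables = split_syllables("\u0928\u092E\u0938\u094D\u0924\u0947")
```
]]></programlisting>
    <para>
      Because each codepoint maps to exactly one class letter, match
      offsets in the class string can be reused directly as offsets
      into the original text.
    </para>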
  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
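    <para>
      A minimal sketch of script-based segmentation is shown
      below. The <literal>toy_script</literal> lookup is a
      hypothetical stand-in that knows only two codepoint ranges and
      treats every other alphabetic character as Latin; a real
      implementation would use the Unicode script property and would
      also split on direction, language, and font changes, none of
      which is handled here.
    </para>
    <programlisting language="python"><![CDATA[
```python
from itertools import groupby


def toy_script(ch):
    """Very rough script lookup by codepoint range (illustrative only)."""
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:
        return "Hebrew"
    if 0x0600 <= cp <= 0x06FF:
        return "Arabic"
    if ch.isalpha():
        return "Latin"  # gross simplification
    return None  # spaces, digits, punctuation: no script of their own


def split_runs(text):
    """Group text into (script, substring) runs of uniform script."""
    resolved = []
    current = None
    for ch in text:
        # Characters without a script join the run that precedes them.
        current = toy_script(ch) or current
        resolved.append((current, ch))
    return [(script, "".join(ch for _, ch in group))
            for script, group in groupby(resolved, key=lambda pair: pair[0])]


runs = split_runs("abc \u05E9\u05DC\u05D5\u05DD def")
```
]]></programlisting>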
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>

    <itemizedlist>
      <listitem>
	<para>
	  The <emphasis>default</emphasis> shaping model handles all
	  non-complex scripts, and may also be used as a fallback for
	  handling unrecognized scripts.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Indic</emphasis> shaping model handles the Indic
	  scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
	  Malayalam, Oriya, Tamil, Telugu, and Sinhala.
	</para>
	<para>
	  The Indic shaping model was revised significantly in
	  2005. To denote the change, a new set of <emphasis>script
	  tags</emphasis> was assigned for Bengali, Devanagari,
	  Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
	  Telugu. For the sake of clarity, the term "Indic2" is
	  sometimes used to refer to the current, revised shaping
	  model.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Arabic</emphasis> shaping model supports
	  Arabic, Mongolian, N'Ko, Syriac, and several other connected
	  or cursive scripts.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Thai/Lao</emphasis> shaping model supports
	  the Thai and Lao scripts.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Khmer</emphasis> shaping model supports the
	  Khmer script.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Myanmar</emphasis> shaping model supports the
	  Myanmar (or Burmese) script.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Tibetan</emphasis> shaping model supports the
	  Tibetan script.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Hangul</emphasis> shaping model supports the
	  Hangul script.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Hebrew</emphasis> shaping model supports the
	  Hebrew script.
	</para>
      </listitem>

      <listitem>
	<para>
	  The <emphasis>Universal Shaping Engine</emphasis> (USE)
	  shaping model supports complex scripts not covered by one of
	  the script-specific shaping models above, including
	  Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
	  Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
	  Viet, and many others.
	</para>
      </listitem>

      <listitem>
	<para>
	  Text runs that do not fall under one of the above shaping
	  models may still require processing by a shaping engine. Of
	  particular note is <emphasis>Emoji</emphasis> shaping, which
	  may involve variation-selector sequences and glyph
	  substitution. Emoji shaping is handled by the default
	  shaping model.
	</para>
      </listitem>

    </itemizedlist>
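    <para>
      The list above amounts to a lookup from script to shaping model
      with the default model as a fallback. The Python sketch below
      encodes only the scripts named in this section; the table and
      the <literal>shaping_model</literal> function are hypothetical
      labels for illustration and do not correspond to identifiers in
      any shaping library.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Script names exactly as listed in this section, grouped by model.
SHAPING_MODELS = {
    "Indic": ["Bengali", "Devanagari", "Gujarati", "Gurmukhi", "Kannada",
              "Malayalam", "Oriya", "Tamil", "Telugu", "Sinhala"],
    "Arabic": ["Arabic", "Mongolian", "N'Ko", "Syriac"],
    "Thai/Lao": ["Thai", "Lao"],
    "Khmer": ["Khmer"],
    "Myanmar": ["Myanmar"],
    "Tibetan": ["Tibetan"],
    "Hangul": ["Hangul"],
    "Hebrew": ["Hebrew"],
    "USE": ["Javanese", "Balinese", "Buginese", "Batak", "Chakma", "Lepcha",
            "Modi", "Phags-pa", "Tagalog", "Siddham", "Sundanese", "Tai Le",
            "Tai Tham", "Tai Viet"],
}

# Invert the table so that each script maps to its model.
MODEL_FOR_SCRIPT = {script: model
                    for model, scripts in SHAPING_MODELS.items()
                    for script in scripts}


def shaping_model(script):
    """Return the model for a script, falling back to the default model."""
    return MODEL_FOR_SCRIPT.get(script, "default")
```
]]></programlisting>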
  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines that match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences and specify the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
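    <para>
      To give a feel for rules driven by a finite-state machine over
      glyphs, the following toy Python sketch forms a ligature when a
      particular glyph sequence is matched. The glyph names, state
      names, and table layout are all hypothetical and bear no
      relation to the actual AAT table formats.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical state transitions keyed by (state, glyph name).
TRANSITIONS = {
    ("start", "f"): "saw_f",
    ("saw_f", "f"): "saw_f",
}

# Hypothetical actions: in state "saw_f", glyph "i" completes a ligature.
LIGATURES = {("saw_f", "i"): "f_i"}


def apply_ligatures(glyphs):
    """Run the toy state machine over glyph names, substituting ligatures."""
    out = []
    state = "start"
    for glyph in glyphs:
        if (state, glyph) in LIGATURES:
            # Replace the pending "f" with the ligature glyph.
            out[-1] = LIGATURES[(state, glyph)]
            state = "start"
        else:
            out.append(glyph)
            state = TRANSITIONS.get((state, glyph), "start")
    return out


result = apply_ligatures(["o", "f", "f", "i", "c", "e"])
```
]]></programlisting>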
  </section>
</chapter>