<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
  <title>Buffers, language, script and direction</title>
  <para>
    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
    buffer. In this chapter, we'll look at how to set up a buffer with
    the text that we want and how to customize the properties of the
    buffer. We'll also look at a piece of lower-level machinery that
    you will need to understand before proceeding: the functions that
    HarfBuzz uses to retrieve Unicode information.
  </para>
  <para>
    After shaping is complete, HarfBuzz puts its output back
    into the buffer. But getting that output requires setting up a
    face and a font first, so we will look at that in the next chapter
    instead of here.
  </para>
  <section id="creating-and-destroying-buffers">
    <title>Creating and destroying buffers</title>
    <para>
      As we saw in our <emphasis>Getting Started</emphasis> example, a
      buffer is created and
      initialized with <function>hb_buffer_create()</function>. This
      produces a new, empty buffer object, instantiated with some
      default values and ready to accept your Unicode strings.
    </para>
    <para>
      HarfBuzz manages the memory of objects (such as buffers) that it
      creates, so you don't have to. When you have finished working on
      a buffer, you can call <function>hb_buffer_destroy()</function>:
    </para>
    <programlisting language="C">
      hb_buffer_t *buf = hb_buffer_create();
      ...
      hb_buffer_destroy(buf);
    </programlisting>
    <para>
      This will destroy the object and free its associated memory,
      unless some other part of the program holds a reference to this
      buffer. HarfBuzz objects are reference-counted, so if you acquire
      a HarfBuzz buffer from another subsystem and want to ensure that
      it is not freed when someone else destroys it, you should
      increase its reference count:
    </para>
    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
    </programlisting>
    <para>
      And then decrease it once you're done with it:
    </para>
    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
    </programlisting>
    <para>
      While we are on the subject of reference-counting buffers, it is
      worth noting that an individual buffer can only meaningfully be
      used by one thread at a time.
    </para>
    <para>
      To throw away all the data in your buffer and start from scratch,
      call <function>hb_buffer_reset(buf)</function>. If you want to
      throw away the string in the buffer but keep the options, you can
      instead call <function>hb_buffer_clear_contents(buf)</function>.
    </para>
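    <para>
      As a minimal sketch of the difference between these two calls,
      the hypothetical helper below
      (<function>buffer_reuse_demo()</function> is not part of
      HarfBuzz) changes the replacement code point, which the API
      reference lists among the settings that
      <function>hb_buffer_clear_contents()</function> leaves alone,
      and then checks what survives each call. Tracking the
      replacement code point is a choice made for illustration only.
    </para>
    <programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical demo: returns 1 if the buffer behaves as described. */
static int
buffer_reuse_demo (void)
{
  hb_buffer_t *buf = hb_buffer_create ();
  int ok = 1;

  /* Change one buffer-level setting and add some text. */
  hb_buffer_set_replacement_codepoint (buf, 0x003F); /* '?' */
  hb_buffer_add_utf8 (buf, "first run", -1, 0, -1);
  ok = ok && hb_buffer_get_length (buf) == 9;

  /* clear_contents drops the text but keeps the replacement code point. */
  hb_buffer_clear_contents (buf);
  ok = ok && hb_buffer_get_length (buf) == 0;
  ok = ok && hb_buffer_get_replacement_codepoint (buf) == 0x003F;

  /* reset drops the text and restores the default settings as well. */
  hb_buffer_add_utf8 (buf, "second run", -1, 0, -1);
  hb_buffer_reset (buf);
  ok = ok && hb_buffer_get_length (buf) == 0;
  ok = ok && hb_buffer_get_replacement_codepoint (buf)
             == HB_BUFFER_REPLACEMENT_CODEPOINT_DEFAULT;

  /* Release our reference; the buffer is freed if nobody else holds one. */
  hb_buffer_destroy (buf);
  return ok;
}
]]></programlisting>
    <para>
      Reusing one buffer across many runs in this way avoids
      repeatedly allocating and freeing buffer objects.
    </para>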
  </section>

  <section id="adding-text-to-the-buffer">
    <title>Adding text to the buffer</title>
    <para>
      Now we have a brand new HarfBuzz buffer. Let's start filling it
      with text! From HarfBuzz's perspective, a buffer is just a stream
      of Unicode code points, but your input string is probably in one of
      the standard Unicode character encodings (UTF-8, UTF-16, or
      UTF-32). HarfBuzz provides convenience functions that accept
      each of these encodings:
      <function>hb_buffer_add_utf8()</function>,
      <function>hb_buffer_add_utf16()</function>, and
      <function>hb_buffer_add_utf32()</function>. Other than the
      character encoding they accept, they function identically.
    </para>
    <para>
      You can add UTF-8 text to a buffer by passing in the text array,
      the array's length, an offset into the array for the first
      character to add, and the length of the segment to add:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8 (hb_buffer_t *buf,
                          const char *text,
                          int text_length,
                          unsigned int item_offset,
                          int item_length)
    </programlisting>
    <para>
      So, in practice, you can say:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    </programlisting>
    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is NUL-terminated. Similarly, you can use
      <literal>-1</literal> as the final argument if you want to add
      the full contents of the array from the offset onward.
    </para>
    <para>
      Whatever <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> you provide, HarfBuzz will also
      attempt to grab the five characters <emphasis>before</emphasis>
      the offset point and the five characters
      <emphasis>after</emphasis> the designated end. These are the
      before and after "context" segments, which HarfBuzz uses
      internally to make shaping decisions. They will not be part of
      the final output, but they ensure that HarfBuzz's
      script-specific shaping operations are correct. If there are
      fewer than five characters available for the before or after
      contexts, HarfBuzz will just grab what is there.
    </para>
    <para>
      For longer text runs, such as full paragraphs, it might be
      tempting to only add smaller sub-segments to a buffer and
      shape them in piecemeal fashion. This is generally not a good
      idea, because a lot of shaping decisions are
      dependent on this context information. For example, in Arabic
      and other connected scripts, HarfBuzz needs to know the code
      points before and after each character in order to correctly
      determine which glyph to return.
    </para>
    <para>
      The safest approach is to add all of the text available (even
      if your text contains a mix of scripts, directions, languages
      and fonts), then use <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> to indicate which characters you
      want shaped (which must all have the same script, direction,
      language and font), so that HarfBuzz has access to any context.
    </para>
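    <para>
      The approach described above might look like the following
      sketch. The helper name
      <function>add_run_with_context()</function> and the use of
      <function>strstr()</function> to locate the run are assumptions
      made for illustration; a real program would normally get its
      run boundaries from an itemization step. Note that, for the
      UTF-8 variant, the offset and length are counted in bytes.
    </para>
    <programlisting language="C"><![CDATA[
#include <hb.h>
#include <string.h>

/* Hypothetical helper: passes the whole paragraph to HarfBuzz but marks
 * only `word` as the item to shape, so the surrounding text is available
 * as context. Returns the number of code points in the buffer afterwards,
 * or 0 if `word` does not occur in `paragraph`. */
static unsigned int
add_run_with_context (hb_buffer_t *buf,
                      const char  *paragraph,
                      const char  *word)
{
  const char *start = strstr (paragraph, word);
  if (!start)
    return 0;

  hb_buffer_add_utf8 (buf,
                      paragraph, -1,                      /* full text, NUL-terminated */
                      (unsigned int) (start - paragraph), /* item_offset, in bytes */
                      (int) strlen (word));               /* item_length, in bytes */

  /* Only the item itself is counted; context is stored separately. */
  return hb_buffer_get_length (buf);
}
]]></programlisting>
    <para>
      For example, calling this on an empty buffer with the paragraph
      <literal>"Hello, world!"</literal> and the word
      <literal>"world"</literal> leaves five code points in the buffer.
    </para>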
    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      encodings. But it is particularly important to note that
      HarfBuzz does not do validity checking on code points added
      this way. The UTF functions replace invalid input with the
      replacement code point, but with
      <function>hb_buffer_add_codepoints()</function> it is up to
      you to do any sanity checking necessary.
    </para>

  </section>

  <section id="setting-buffer-properties">
    <title>Setting buffer properties</title>
    <para>
      Buffers containing input characters still need several
      properties set before HarfBuzz can shape their text correctly.
    </para>
    <para>
      Initially, all buffers are set to the
      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
      type. After adding text, the buffer should be set to
      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
      indicates that it contains un-shaped input
      characters. After shaping, the buffer will have the
      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
    </para>
    <para>
      <function>hb_buffer_add_utf8()</function> and the
      other UTF functions set the content type of their buffer
      automatically. But if you are reusing a buffer you may want to
      check its state with
      <function>hb_buffer_get_content_type(buffer)</function>. If
      necessary you can set the content type with
    </para>
    <programlisting language="C">
      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
    </programlisting>
    <para>
      to prepare for shaping.
    </para>
    <para>
      Buffers also need to carry information about the script,
      language, and text direction of their contents. You can set
      these properties individually:
    </para>
    <programlisting language="C">
      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
    </programlisting>
    <para>
      However, since these properties are often repeated for
      multiple text runs, you can also save them in a
      <literal>hb_segment_properties_t</literal> for reuse:
    </para>
    <programlisting language="C">
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
    </programlisting>
    <para>
      HarfBuzz also provides getter functions to retrieve a buffer's
      direction, script, and language properties individually.
    </para>
    <para>
      HarfBuzz recognizes four text directions in
      <type>hb_direction_t</type>: left-to-right
      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>).  For the
      script property, HarfBuzz uses identifiers based on the
      <ulink
      url="https://unicode.org/iso15924/">ISO 15924
      standard</ulink>. For languages, HarfBuzz uses tags based on the
      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
    </para>
    <para>
      Helper functions are provided to convert character strings into
      the necessary script and language tag types.
    </para>
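    <para>
      As an illustration of these helpers, the sketch below sets all
      three properties from strings, as a program might do when its
      run metadata arrives as text. The helper name
      <function>set_arabic_properties()</function> and the particular
      values are assumptions chosen for the example;
      <function>hb_direction_from_string()</function>,
      <function>hb_script_from_string()</function>, and
      <function>hb_language_from_string()</function> are the
      conversion functions themselves.
    </para>
    <programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: configures a buffer for a right-to-left Arabic run,
 * converting each property from its string form. */
static void
set_arabic_properties (hb_buffer_t *buf)
{
  /* Direction from a string such as "ltr" or "rtl". */
  hb_buffer_set_direction (buf, hb_direction_from_string ("rtl", -1));

  /* Script from its four-letter ISO 15924 tag. */
  hb_buffer_set_script (buf, hb_script_from_string ("Arab", -1));

  /* Language from a BCP 47 tag. */
  hb_buffer_set_language (buf, hb_language_from_string ("ar", -1));
}
]]></programlisting>
    <para>
      The values can then be read back with
      <function>hb_buffer_get_direction()</function>,
      <function>hb_buffer_get_script()</function>, and
      <function>hb_buffer_get_language()</function>.
    </para>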
    <para>
      Two additional buffer properties to be aware of are the
      "invisible glyph" and the replacement code point. The
      replacement code point is inserted into buffer output in place of
      any invalid code points encountered in the input. By default, it
      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
      point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
    </para>
    <programlisting language="C">
      hb_buffer_set_replacement_codepoint(buf, replacement);
    </programlisting>
    <para>
      passing in the replacement Unicode code point as the
      <parameter>replacement</parameter> parameter.
    </para>
    <para>
      The invisible glyph is used to replace all output glyphs that
      are invisible. By default, the standard space character
      <literal>U+0020</literal> is used; you can replace this (for
      example, when using a font that provides script-specific
      spaces) with
    </para>
    <programlisting language="C">
      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
    </programlisting>
    <para>
      Note that in the <parameter>replacement_glyph</parameter>
      parameter, you must provide the glyph ID of the replacement you
      wish to use, not the Unicode code point.
    </para>
    <para>
      HarfBuzz supports a few additional flags you might want to set
      on your buffer under certain circumstances. The
      <literal>HB_BUFFER_FLAG_BOT</literal> and
      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
      that the buffer represents the beginning or end (respectively)
      of a text element (such as a paragraph or other block). Knowing
      this allows HarfBuzz to apply certain contextual font features
      when shaping, such as initial or final variants in connected
      scripts.
    </para>
    <para>
      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
      tells HarfBuzz not to hide glyphs with the
      <literal>Default_Ignorable</literal> property in Unicode. This
      property designates control characters and other non-printing
      code points, such as joiners and variation selectors. Normally
      HarfBuzz replaces them in the output buffer with zero-width
      space glyphs (using the "invisible glyph" property discussed
      above); setting this flag causes them to be printed, which can
      be helpful for troubleshooting.
    </para>
    <para>
      Conversely, setting the
      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
      glyphs from the output buffer entirely. Finally, setting the
      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
      flag tells HarfBuzz not to insert the dotted-circle glyph
      (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
      inserted into buffer output when broken character sequences are
      encountered (such as combining marks that are not attached to a
      base character).
    </para>
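    <para>
      Flags are combined as a bit mask and applied with
      <function>hb_buffer_set_flags()</function>. The following is a
      minimal sketch; the helper name
      <function>set_paragraph_flags()</function> and its
      <parameter>show_ignorables</parameter> parameter are
      hypothetical and exist only to show how the flags described
      above can be combined.
    </para>
    <programlisting language="C"><![CDATA[
#include <hb.h>

/* Hypothetical helper: marks `buf` as holding a complete paragraph and,
 * optionally, keeps Default_Ignorable characters visible for debugging. */
static void
set_paragraph_flags (hb_buffer_t *buf, int show_ignorables)
{
  /* The buffer is both the beginning and the end of the text element. */
  unsigned int flags = HB_BUFFER_FLAG_BOT | HB_BUFFER_FLAG_EOT;

  if (show_ignorables)
    flags |= HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES;

  /* hb_buffer_set_flags() replaces the whole flag set, so combine first. */
  hb_buffer_set_flags (buf, (hb_buffer_flags_t) flags);
}
]]></programlisting>
    <para>
      The current flag set can be read back with
      <function>hb_buffer_get_flags()</function>.
    </para>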
  </section>

  <section id="customizing-unicode-functions">
    <title>Customizing Unicode functions</title>
    <para>
      HarfBuzz requires some simple functions for accessing
      information from the Unicode Character Database (such as the
      <literal>General_Category</literal> (gc) and
      <literal>Script</literal> (sc) properties) that is useful
      for shaping, as well as some useful operations like composing and
      decomposing code points.
    </para>
    <para>
      HarfBuzz includes its own internal, lightweight set of Unicode
      functions. At build time, it is also possible to compile support
      for some other options, such as the Unicode functions provided
      by GLib or the International Components for Unicode (ICU)
      library. Generally, this option is only of interest for client
      programs that have specific integration requirements or that do
      a significant amount of customization.
    </para>
    <para>
      If your program has access to other Unicode functions, however,
      such as through a system library or application framework, you
      might prefer to use those instead of the built-in
      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
    </para>
    <para>
      The Unicode functions are specified in a structure called
      <literal>unicode_funcs</literal> which is attached to each
      buffer. But even though <literal>unicode_funcs</literal> is
      associated with a <type>hb_buffer_t</type>, the functions
      themselves are called by other HarfBuzz APIs that access
      buffers, so it would be unwise for you to hook different
      functions into different buffers.
    </para>
    <para>
      In addition, you can mark your <literal>unicode_funcs</literal>
      as immutable by calling
      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
      This is especially useful if your code is a
      library or framework that will have its own client programs. By
      marking your Unicode function choices as immutable, you prevent
      your own client programs from changing the
      <literal>unicode_funcs</literal> configuration and introducing
      inconsistencies and errors downstream.
    </para>
    <para>
      You can retrieve the Unicode-functions configuration for
      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
    </para>
    <programlisting language="C">
      hb_unicode_funcs_t *ufunctions;
      ufunctions = hb_buffer_get_unicode_funcs(buf);
    </programlisting>
    <para>
      The current version of <literal>unicode_funcs</literal> uses six functions:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <function>hb_unicode_combining_class_func_t</function>:
          returns the Canonical Combining Class of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_general_category_func_t</function>:
          returns the General Category (gc) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_mirroring_func_t</function>: returns
          the Mirroring Glyph code point (for bi-directional
          replacement) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_script_func_t</function>: returns the
          Script (sc) property of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_compose_func_t</function>: returns the
          canonical composition of a sequence of two code points.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_decompose_func_t</function>: returns
          the canonical decomposition of a code point.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Note, however, that future HarfBuzz releases may alter this set.
    </para>
    <para>
      Each Unicode function has a corresponding setter, with which you
      can assign a callback to your replacement function. For example,
      to replace
      <function>hb_unicode_general_category_func_t</function>, you can call
    </para>
    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
    </programlisting>
    <para>
      Virtualizing this set of Unicode functions is primarily intended
      to improve portability. There is no need for every client
      program to make the effort to replace the default options, so if
      you are unsure, do not feel any pressure to customize
      <literal>unicode_funcs</literal>.
    </para>
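    <para>
      If you do decide to customize, the overall pattern might look
      like the sketch below. It creates a new
      <literal>unicode_funcs</literal> object whose parent is the
      default set, overrides only the General Category callback, and
      then marks the result immutable. The callback
      <function>counting_general_category()</function> and the
      factory <function>make_counting_funcs()</function> are
      hypothetical names, and counting calls is just a stand-in for
      whatever your own Unicode library would do.
    </para>
    <programlisting language="C"><![CDATA[
#include <stddef.h>
#include <hb.h>

/* Hypothetical callback: counts how often it is called (through user_data)
 * and then defers to the parent functions for the actual lookup. */
static hb_unicode_general_category_t
counting_general_category (hb_unicode_funcs_t *ufuncs,
                           hb_codepoint_t      unicode,
                           void               *user_data)
{
  unsigned int *calls = (unsigned int *) user_data;
  (*calls)++;
  return hb_unicode_general_category (hb_unicode_funcs_get_parent (ufuncs),
                                      unicode);
}

/* Hypothetical factory: builds a function set that inherits everything
 * from the defaults except the General Category lookup. */
static hb_unicode_funcs_t *
make_counting_funcs (unsigned int *calls)
{
  hb_unicode_funcs_t *ufuncs =
    hb_unicode_funcs_create (hb_unicode_funcs_get_default ());

  /* No destroy callback: `calls` is owned by the caller. */
  hb_unicode_funcs_set_general_category_func (ufuncs,
                                              counting_general_category,
                                              calls, NULL);

  /* Setters must be called before this point; later changes are refused. */
  hb_unicode_funcs_make_immutable (ufuncs);
  return ufuncs;
}
]]></programlisting>
    <para>
      The result can be attached to a buffer with
      <function>hb_buffer_set_unicode_funcs (buf, ufuncs)</function>.
      The buffer takes its own reference, so the caller can then
      release its reference with
      <function>hb_unicode_funcs_destroy (ufuncs)</function>.
    </para>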
  </section>

</chapter>
