<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
  <title>Buffers, language, script and direction</title>
  <para>
    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
    buffer. In this chapter, we'll look at how to set up a buffer with
    the text that we want and how to customize the properties of the
    buffer. We'll also look at a piece of lower-level machinery that
    you will need to understand before proceeding: the functions that
    HarfBuzz uses to retrieve Unicode information.
  </para>
  <para>
    After shaping is complete, HarfBuzz puts its output back
    into the buffer. But getting that output requires setting up a
    face and a font first, so we will look at that in the next chapter
    instead of here.
  </para>
  <section id="creating-and-destroying-buffers">
    <title>Creating and destroying buffers</title>
    <para>
      As we saw in our <emphasis>Getting Started</emphasis> example, a
      buffer is created and
      initialized with <function>hb_buffer_create()</function>. This
      produces a new, empty buffer object, instantiated with some
      default values and ready to accept your Unicode strings.
    </para>
    <para>
      HarfBuzz manages the memory of objects (such as buffers) that it
      creates, so you don't have to. When you have finished working
      with a buffer, you can call <function>hb_buffer_destroy()</function>:
    </para>
    <programlisting language="C">
      hb_buffer_t *buf = hb_buffer_create();
      ...
      hb_buffer_destroy(buf);
    </programlisting>
    <para>
      This will destroy the object and free its associated memory,
      unless some other part of the program holds a reference to this
      buffer.
      If you acquire a HarfBuzz buffer from another subsystem
      and want to ensure that it is not freed when some other code
      destroys it, you should increase its reference count:
    </para>
    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
    </programlisting>
    <para>
      And then decrease it once you're done with it:
    </para>
    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
    </programlisting>
    <para>
      While we are on the subject of reference-counting buffers, it is
      worth noting that an individual buffer can only meaningfully be
      used by one thread at a time.
    </para>
    <para>
      To throw away all the data in your buffer and start from scratch,
      call <function>hb_buffer_reset(buf)</function>. If you want to
      throw away the string in the buffer but keep the options, you can
      instead call <function>hb_buffer_clear_contents(buf)</function>.
    </para>
  </section>

  <section id="adding-text-to-the-buffer">
    <title>Adding text to the buffer</title>
    <para>
      Now we have a brand new HarfBuzz buffer. Let's start filling it
      with text! From HarfBuzz's perspective, a buffer is just a stream
      of Unicode code points, but your input string is probably in one of
      the standard Unicode character encodings (UTF-8, UTF-16, or
      UTF-32). HarfBuzz provides convenience functions that accept
      each of these encodings:
      <function>hb_buffer_add_utf8()</function>,
      <function>hb_buffer_add_utf16()</function>, and
      <function>hb_buffer_add_utf32()</function>. Other than the
      character encoding they accept, they function identically.
    </para>
    <para>
      You can add UTF-8 text to a buffer by passing in the text array,
      the array's length, an offset into the array for the first
      character to add, and the length of the segment to add:
    </para>
    <programlisting language="C">
    hb_buffer_add_utf8 (hb_buffer_t *buf,
                        const char *text,
                        int text_length,
                        unsigned int item_offset,
                        int item_length)
    </programlisting>
    <para>
      So, in practice, you can say:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    </programlisting>
    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is null-terminated. Similarly, you can use
      <literal>-1</literal> as the final argument if you want to add
      the full contents of the array from the offset onward.
    </para>
    <para>
      Whatever <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> you provide, HarfBuzz will also
      attempt to grab the five characters <emphasis>before</emphasis>
      the offset point and the five characters
      <emphasis>after</emphasis> the designated end. These are the
      before and after "context" segments, which HarfBuzz uses
      internally to make shaping decisions. They will not be part of
      the final output, but they ensure that HarfBuzz's
      script-specific shaping operations are correct. If there are
      fewer than five characters available for the before or after
      contexts, HarfBuzz will just grab what is there.
    </para>
    <para>
      For longer text runs, such as full paragraphs, it might be
      tempting to only add smaller sub-segments to a buffer and
      shape them in piecemeal fashion.
      This is generally not a good
      idea, however, because many shaping decisions
      depend on this context information. For example, in Arabic
      and other connected scripts, HarfBuzz needs to know the code
      points before and after each character in order to correctly
      determine which glyph to return.
    </para>
    <para>
      The safest approach is to add all of the text available (even
      if your text contains a mix of scripts, directions, languages
      and fonts), then use <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> to indicate which characters you
      want shaped (which must all have the same script, direction,
      language and font), so that HarfBuzz has access to any context.
    </para>
    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      encodings. Note, however, that
      HarfBuzz does not do validity checking on the text that is added
      to a buffer. Invalid code points will be replaced, but it is up
      to you to do any deep-sanity checking necessary.
    </para>

  </section>

  <section id="setting-buffer-properties">
    <title>Setting buffer properties</title>
    <para>
      Buffers containing input characters still need several
      properties set before HarfBuzz can shape their text correctly.
    </para>
    <para>
      Initially, all buffers are set to the
      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
      type. After adding text, the buffer should be set to
      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
      indicates that it contains un-shaped input
      characters. After shaping, the buffer will have the
      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
    </para>
    <para>
      <function>hb_buffer_add_utf8()</function> and the
      other UTF functions set the content type of their buffer
      automatically. But if you are reusing a buffer, you may want to
      check its state with
      <function>hb_buffer_get_content_type(buffer)</function>. If
      necessary, you can set the content type with
    </para>
    <programlisting language="C">
      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
    </programlisting>
    <para>
      to prepare for shaping.
    </para>
    <para>
      Buffers also need to carry information about the script,
      language, and text direction of their contents. You can set
      these properties individually:
    </para>
    <programlisting language="C">
      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
    </programlisting>
    <para>
      However, since these properties are often repeated for
      multiple text runs, you can also save them in a
      <literal>hb_segment_properties_t</literal> for reuse:
    </para>
    <programlisting language="C">
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
    </programlisting>
    <para>
      HarfBuzz also provides getter functions to retrieve a buffer's
      direction, script, and language properties individually.
    </para>
    <para>
      HarfBuzz recognizes four text directions in
      <type>hb_direction_t</type>: left-to-right
      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
      script property, HarfBuzz uses identifiers based on the
      <ulink
      url="https://unicode.org/iso15924/">ISO 15924
      standard</ulink>.
      For languages, HarfBuzz uses tags based on the
      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
    </para>
    <para>
      Helper functions are provided to convert character strings into
      the necessary script and language tag types.
    </para>
    <para>
      Two additional buffer properties to be aware of are the
      "invisible glyph" and the replacement code point. The
      replacement code point is inserted into buffer output in place of
      any invalid code points encountered in the input. By default, it
      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
      point, <literal>U+FFFD</literal> "�". You can change this with
    </para>
    <programlisting language="C">
      hb_buffer_set_replacement_codepoint(buf, replacement);
    </programlisting>
    <para>
      passing in the replacement Unicode code point as the
      <parameter>replacement</parameter> parameter.
    </para>
    <para>
      The invisible glyph is used to replace all output glyphs that
      are invisible. By default, the standard space character
      <literal>U+0020</literal> is used; you can replace this (for
      example, when using a font that provides script-specific
      spaces) with
    </para>
    <programlisting language="C">
      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
    </programlisting>
    <para>
      Note that the <parameter>replacement_glyph</parameter>
      parameter takes the glyph ID of the replacement you
      wish to use, not a Unicode code point.
    </para>
    <para>
      HarfBuzz supports a few additional flags you might want to set
      on your buffer under certain circumstances. The
      <literal>HB_BUFFER_FLAG_BOT</literal> and
      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
      that the buffer represents the beginning or end (respectively)
      of a text element (such as a paragraph or other block).
      Knowing
      this allows HarfBuzz to apply certain contextual font features
      when shaping, such as initial or final variants in connected
      scripts.
    </para>
    <para>
      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
      tells HarfBuzz not to hide glyphs with the
      <literal>Default_Ignorable</literal> property in Unicode. This
      property designates control characters and other non-printing
      code points, such as joiners and variation selectors. Normally
      HarfBuzz replaces them in the output buffer with zero-width
      space glyphs (using the "invisible glyph" property discussed
      above); setting this flag causes them to be printed, which can
      be helpful for troubleshooting.
    </para>
    <para>
      Conversely, setting the
      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
      glyphs from the output buffer entirely. Finally, setting the
      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
      flag tells HarfBuzz not to insert the dotted-circle glyph
      (<literal>U+25CC</literal>, "◌"), which is normally
      inserted into buffer output when broken character sequences are
      encountered (such as combining marks that are not attached to a
      base character).
    </para>
  </section>

  <section id="customizing-unicode-functions">
    <title>Customizing Unicode functions</title>
    <para>
      HarfBuzz requires a small set of functions for looking up
      shaping-relevant information in the Unicode Character Database
      (such as the <literal>General_Category</literal> (gc) and
      <literal>Script</literal> (sc) properties), as well as for
      operations like composing and
      decomposing code points.
    </para>
    <para>
      HarfBuzz includes its own internal, lightweight set of Unicode
      functions.
      At build time, it is also possible to compile support
      for some other options, such as the Unicode functions provided
      by GLib or the International Components for Unicode (ICU)
      library. Generally, this option is only of interest for client
      programs that have specific integration requirements or that do
      a significant amount of customization.
    </para>
    <para>
      If your program has access to other Unicode functions, however,
      such as through a system library or application framework, you
      might prefer to use those instead of the built-in
      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
    </para>
    <para>
      The Unicode functions are specified in a structure called
      <literal>unicode_funcs</literal>, which is attached to each
      buffer. But even though <literal>unicode_funcs</literal> is
      associated with a <type>hb_buffer_t</type>, the functions
      themselves are called by other HarfBuzz APIs that access
      buffers, so it would be unwise for you to hook different
      functions into different buffers.
    </para>
    <para>
      In addition, you can mark your <literal>unicode_funcs</literal>
      as immutable by calling
      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
      This is especially useful if your code is a
      library or framework that will have its own client programs. By
      marking your Unicode function choices as immutable, you prevent
      your own client programs from changing the
      <literal>unicode_funcs</literal> configuration and introducing
      inconsistencies and errors downstream.
    </para>
    <para>
      You can retrieve the Unicode-functions configuration for
      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
    </para>
    <programlisting language="C">
      hb_unicode_funcs_t *ufunctions;
      ufunctions = hb_buffer_get_unicode_funcs(buf);
    </programlisting>
    <para>
      The current version of <literal>unicode_funcs</literal> uses six functions:
    </para>
    <itemizedlist>
      <listitem>
	<para>
	  <function>hb_unicode_combining_class_func_t</function>:
	  returns the Canonical Combining Class of a code point.
      	</para>
      </listitem>
      <listitem>
	<para>
	  <function>hb_unicode_general_category_func_t</function>:
	  returns the General Category (gc) of a code point.
	</para>
      </listitem>
      <listitem>
	<para>
	  <function>hb_unicode_mirroring_func_t</function>: returns
	  the Mirroring Glyph code point (for bi-directional
	  replacement) of a code point.
	</para>
      </listitem>
      <listitem>
	<para>
	  <function>hb_unicode_script_func_t</function>: returns the
	  Script (sc) property of a code point.
	</para>
      </listitem>
      <listitem>
	<para>
	  <function>hb_unicode_compose_func_t</function>: returns the
	  canonical composition of a sequence of two code points.
	</para>
      </listitem>
      <listitem>
	<para>
	  <function>hb_unicode_decompose_func_t</function>: returns
	  the canonical decomposition of a code point.
	</para>
      </listitem>
    </itemizedlist>
    <para>
      Note, however, that future HarfBuzz releases may alter this set.
    </para>
    <para>
      Each Unicode function has a corresponding setter, with which you
      can assign a callback to your replacement function.
      For example,
      to replace
      <function>hb_unicode_general_category_func_t</function>, you can call
    </para>
    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
    </programlisting>
    <para>
      Virtualizing this set of Unicode functions is primarily intended
      to improve portability. There is no need for every client
      program to make the effort to replace the default options, so if
      you are unsure, do not feel any pressure to customize
      <literal>unicode_funcs</literal>.
    </para>
  </section>

</chapter>