---
layout: default
title: UText
nav_order: 4
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UText

## Overview

UText is a text abstraction facility for ICU.

The intent is to make it possible to extend ICU to work with text data that is
in formats above and beyond those that are native to ICU.

UText directly supports text in these formats:

1. UTF-8 (`char*`) strings
2. UTF-16 (`UChar*` or `UnicodeString`) strings
3. `Replaceable`

The ICU services that can accept UText based input are:

1. Regular Expressions
2. Break Iteration

Examples of text formats that UText could be extended to support:

1. UTF-32 format.
2. Text that is stored in discontiguous chunks in memory, or in application-specific representations.
3. Text that is in a non-Unicode code page.

If ICU does not directly support a desired text format, it is possible for
application developers themselves to extend UText, and in that way gain the
ability to use their text with ICU.

## Using UText

There are three fairly distinct classes of use of UText. These are:

1. **Simple wrapping of existing text.** Application text data exists in a
   format that is already supported by UText (such as UTF-8). The application
   opens a UText on the data, and then passes the UText to an ICU service for
   analysis/processing. Most use of UText from applications will follow this
   simple pattern. Only a very few UText APIs and only a few lines of code are
   required.

2. **Accessing the underlying text.** UText provides APIs for iterating over
   the text in various ways, and for fetching individual code points from the
   text. These functions will probably be used primarily from within ICU, in
   the implementation of services that can accept input in the form of a UText.
   While applications are certainly free to use these text access functions if
   necessary, there may often be no need.

3. **UText support for new text storage formats.** If an application has text
   data stored in a format that is not directly supported by ICU, extending
   UText to support that format will provide the ability to conveniently use
   those ICU services that support UText.

   Extending UText to a new format is accomplished by implementing a well
   defined set of *Text Provider Functions* for that format.

## UText compared with CharacterIterator

CharacterIterator is an abstract base class that defines a protocol for
accessing characters in a text-storage object. This class has methods for
iterating forward and backward over Unicode characters to return either the
individual Unicode characters or their corresponding index values.

UText and CharacterIterator both provide an abstraction for accessing text while
hiding details of the actual storage format. UText is the more flexible of the
two, however, with these advantages:

1. UText can conveniently operate on text stored in formats other than UTF-16.
2. UText includes functions for modifying or editing the text.
3. UText is more efficient. When iterating over a range of text using the
   CharacterIterator API, a function call is required for every character. With
   UText, iterating to the next character is usually done with a small amount of
   inline code.

At this time, more ICU services support CharacterIterator than UText. ICU
services that can operate on text represented by a CharacterIterator are:

1. Normalizer
2. Break Iteration
3. String Search
4. Collation Element Iteration

## Example: Counting the Words in a UTF-8 String

Here is a function that uses UText and an ICU break iterator to count the number
of words in a nul-terminated UTF-8 string.
The use of UText only adds two lines
of code over what a similar function operating on normal UTF-16 strings would
require.

```c
#include <assert.h>
#include "unicode/utypes.h"
#include "unicode/ubrk.h"
#include "unicode/utext.h"

int countWords(const char *utf8String) {
    UText          *ut        = NULL;
    UBreakIterator *bi        = NULL;
    int             wordCount = 0;
    UErrorCode      status    = U_ZERO_ERROR;

    ut = utext_openUTF8(ut, utf8String, -1, &status);
    bi = ubrk_open(UBRK_WORD, "en_US", NULL, 0, &status);

    ubrk_setUText(bi, ut, &status);
    /* Check for errors before the break iterator is used. */
    assert(U_SUCCESS(status));
    while (ubrk_next(bi) != UBRK_DONE) {
        if (ubrk_getRuleStatus(bi) != UBRK_WORD_NONE) {
            /* Count only words and numbers, not spaces or punctuation */
            wordCount++;
        }
    }
    utext_close(ut);
    ubrk_close(bi);
    return wordCount;
}
```

## UText API Functions

The UText API is declared in the ICU header file
[utext.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/utext.h).

### Opening and Closing

Normal usage of UText by an application consists of opening a UText to wrap some
existing text, then passing the UText to ICU functions for processing. For this
kind of usage, all that is needed is the appropriate UText open and close
functions.

| Function                       | Description |
|--------------------------------|-------------|
| `utext_openUChars`             | Open a UText over a standard ICU (`UChar *`) string. The string consists of a UTF-16 array in memory, either nul terminated or with an explicit length. |
| `utext_openUnicodeString`      | Open a UText over an instance of an ICU C++ `UnicodeString`. |
| `utext_openConstUnicodeString` | Open a UText over a read-only `UnicodeString`. Disallows UText APIs that modify the text. |
| `utext_openReplaceable`        | Open a UText over an instance of an ICU C++ `Replaceable`. |
| `utext_openUTF8`               | Open a UText over a UTF-8 encoded C string. The string may be either nul terminated or have an explicit length. |
| `utext_close`                  | Close an open UText. Frees any allocated memory; required to prevent memory leaks. |

Here are some suggestions and techniques for efficient use of UText.

#### Minimizing Heap Usage

UText's open functions include features to allow applications to minimize the
number of heap memory allocations that will be needed. Specifically,

1. UText structs may be declared as local variables, that is, they may be stack
   allocated rather than heap allocated.
2. Existing UText structs may be reused to refer to new text, avoiding the need
   to allocate and initialize a new UText instance.

Minimizing heap allocations is important in code that has critical performance
requirements, and is doubly important for code that must scale well in
multithreaded, multiprocessor environments.

#### Stack Allocation

Here is code for stack-allocating a UText:

```c
UText mytext = UTEXT_INITIALIZER;
utext_openUChars(&mytext, ...
```

The first parameter to all `utext_open` functions is a pointer to a UText. If it
is non-null, the supplied UText will be used; if it is null, a new UText will be
heap allocated.

Stack allocated UText objects *must* be initialized with `UTEXT_INITIALIZER`. An
uninitialized instance will fail to open.

#### Heap Allocation

Here is code for creating a heap allocated UText:

```c
UText *mytext = utext_openUChars(NULL, ...
```

This is slightly smaller and more convenient to write than the stack allocated
code, and there is no reason not to use heap allocated UText objects in the vast
majority of code that does not have extreme performance constraints.

#### Reuse

To reuse an existing UText, simply pass it as the first parameter to any of the
UText open functions. There is no need to close the UText first, and it may
actually be more efficient not to close it first.

Here is an example of a function that iterates over an array of UTF-8 strings,
wrapping each in a UText and passing it off to another function. On the first
time through the loop the UText open function will heap allocate a UText. On
each subsequent iteration the existing UText will be reused.

```c
#include <assert.h>
#include "unicode/utypes.h"
#include "unicode/utext.h"

/* Application-defined function that consumes the UText. */
void do_something(UText *ut);

void f(char **strings, int numStrings) {
    UText      *ut = NULL;
    UErrorCode  status;

    int i;
    for (i=0; i<numStrings; i++) {
        status = U_ZERO_ERROR;
        ut = utext_openUTF8(ut, strings[i], -1, &status);
        assert(U_SUCCESS(status));
        do_something(ut);
    }
    utext_close(ut);
}
```

#### Close

Closing a UText with `utext_close()` frees any storage associated with it, including the UText itself
for those that are heap allocated. Stack allocated UTexts should also be closed
because in some cases there may be additional heap allocated storage associated
with them, depending on the type of the underlying text storage.

## Accessing the Text

For accessing the underlying text, UText provides functions both for iterating
over the characters, and for direct random access by index. Here are the
conventions that apply for all of the access functions:

1. Access to individual characters is always by code points, that is, 32 bit
   Unicode values are always returned. UTF-16 surrogate values from a surrogate
   pair, like bytes from a UTF-8 sequence, are not separately visible.
2. Indexing always uses the index values from the original underlying text
   storage, in whatever form it has.
   If the underlying storage is UTF-8, the
   indexes will be UTF-8 byte indexes, not UTF-16 offsets.
3. Indexes always refer to the first position of a character. This is
   equivalent to saying that indexes always lie at the boundary between
   characters. If an index supplied to a UText function refers to the 2<sup>nd</sup>
   through the N<sup>th</sup> positions of a multi-byte or multi-code-unit character, the
   index will be normalized back to the first or lowest index.
4. An input index that is greater than the length of the text will be set to
   refer to the end of the string, and will not generate an out-of-bounds error.
   This is similar to the indexing behavior in the UnicodeString class.
5. Iteration uses post-increment and pre-decrement conventions. That is,
   `utext_next32()` fetches the code point at the current index, then leaves the
   index pointing at the next character.

Here are the functions for accessing the actual text data represented by a
UText. The primary use of these functions will be in the implementation of ICU
services that accept input in the form of a UText, although application code may
also use them if the need arises.

For more detailed descriptions of each, see the API reference.

| Function                  | Description |
|---------------------------|-------------|
| `utext_nativeLength`      | Get the length of the text string in terms of the underlying native storage (bytes for UTF-8, for example). |
| `utext_isLengthExpensive` | Indicate whether determining the length of the string would require scanning the string. |
| `utext_char32At`          | Get the code point at the specified index. |
| `utext_current32`         | Get the code point at the current iteration position. Does not advance the position. |
| `utext_next32`            | Get the next code point, iterating forwards. |
| `utext_previous32`        | Get the previous code point, iterating backwards. |
| `utext_next32From`        | Begin a forwards iteration at a specified index. |
| `utext_previous32From`    | Begin a reverse iteration at a specified index. |
| `utext_getNativeIndex`    | Get the current iteration index. |
| `utext_setNativeIndex`    | Set the iteration index. |
| `utext_moveIndex32`       | Move the current index forwards or backwards by the specified number of code points. |
| `utext_extract`           | Retrieve a range of text, placing it into a UTF-16 buffer. |
| `UTEXT_NEXT32`            | Inline (high performance) version of `utext_next32`. |
| `UTEXT_PREVIOUS32`        | Inline (high performance) version of `utext_previous32`. |

## Modifying the Text

UText provides API functions for modifying or editing the text.

| Function            | Description |
|---------------------|-------------|
| `utext_replace`     | Replace a range of the original text with a replacement string. |
| `utext_copy`        | Copy or move a range of the text to a new position. |
| `utext_isWritable`  | Test whether a UText supports writing operations. |
| `utext_hasMetaData` | Test whether the text includes metadata. See the class `Replaceable` for more information on metadata. |

Certain conventions must be followed when modifying text using these functions:

1. Not all types of UText can support modifying the data. Code working with
   UText instances of unknown origin should check `utext_isWritable()` first, and
   be prepared to deal with failures.
2. There must be only one UText open onto the underlying string that is being
   modified. (Strings that are not being modified can be the target of any
   number of UTexts at the same time.) The existence of a second UText that
   refers to a string that is being modified is not a situation that is
   detected by the implementation.
   The application code must be structured to
   avoid the situation.

#### Cloning

UText instances may be cloned. The clone function,

```c
UText * utext_clone(UText *dest,
                    const UText *src,
                    UBool deep,
                    UBool readOnly,
                    UErrorCode *status)
```

behaves very much like the UText open functions, with the source of the text being
another UText rather than some other form of a string.

A *shallow* clone creates a new UText that maintains its own iteration state,
but does not clone the underlying text itself.

A *deep* clone copies the underlying text in addition to the UText state. This
would be appropriate if you wished to modify the text without the changes being
reflected back to the original source string. Not all text providers support
deep clone, so checking for error status returns from `utext_clone()` is
important.

#### Thread Safety

UText follows the usual ICU conventions for thread safety: concurrent calls to
functions accessing the same non-const UText are not supported. If concurrent
access to the text is required, the UText can be cloned, allowing each thread
access via a separate UText. So long as the underlying text is not being
modified, a shallow clone is sufficient.

## Text Providers

A *text provider* is a set of functions that let UText support a specific text
storage format.

ICU includes several UText text provider implementations, and applications can
provide additional ones if needed.

To implement a new UText text provider, it is necessary to have an understanding
of how UText is designed.

Underneath the covers, UText is a struct that includes:

1. A pointer to a *Text Chunk*, which is a UTF-16 buffer containing a section
   (or all) of the text being referenced.

   For text sources whose native format
   is UTF-16, the chunk description can refer directly to the original text
   data.
   For non-UTF-16 sources, the chunk will refer to a side buffer
   containing some range of the text that has been converted to UTF-16 format.
2. The iteration position, as a UTF-16 offset within the chunk.

If a text access function (one of those described above, in the previous
section) can do its work based on the information maintained in the UText
struct, it will. If not, it will call out to one of the provider functions
(below) to do the work, or to update the UText.

The best way to really understand what is required of a UText provider is to
study the implementations that are included with ICU, and to borrow as much as
possible.

Here is the list of text provider functions.

| Function                     | Description |
|------------------------------|-------------|
| `UTextAccess`                | Set up the Text Chunk associated with this UText so that it includes a requested index position. |
| `UTextNativeLength`          | Return the full length of the text. |
| `UTextClone`                 | Clone the UText. |
| `UTextExtract`               | Extract a range of text into a caller-supplied buffer. |
| `UTextReplace`               | Replace a range of text with a caller-supplied replacement. May expand or shrink the overall text. |
| `UTextCopy`                  | Move or copy a range of text to a new position. |
| `UTextMapOffsetToNative`     | Within the current text chunk, translate a UTF-16 buffer offset to an absolute native index. |
| `UTextMapNativeIndexToUTF16` | Translate an absolute native index to a UTF-16 buffer offset within the current text. |
| `UTextClose`                 | Provider specific close. Free storage as required. |

Not every provider type requires all of the functions. If the text type is
read-only, no implementation for Replace or Copy is required. If the text is in
UTF-16 format, no implementation of the native to UTF-16 index conversions is
required.
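
To make the index-mapping idea more concrete, here is a simplified, standalone
sketch of what translating a UTF-16 chunk offset to a native index means when
the native storage is UTF-8. It is not ICU's implementation and does not use
the provider function signatures from `utext.h`; the names `utf8SeqLen` and
`mapUtf16OffsetToNative` are hypothetical. It assumes well-formed UTF-8 and an
offset that lies on a code point boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes in the UTF-8 sequence that starts with lead byte b.
 * Assumes well-formed input; no validation is attempted. */
static int utf8SeqLen(uint8_t b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical helper: translate a UTF-16 offset, counted from the start of a
 * chunk, to an absolute native (byte) index in the UTF-8 text.
 * chunkNativeStart is the byte index at which the chunk begins. */
static int64_t mapUtf16OffsetToNative(const uint8_t *utf8,
                                      int64_t byteLength,
                                      int64_t chunkNativeStart,
                                      int32_t utf16Offset) {
    int64_t i     = chunkNativeStart;
    int32_t units = 0;   /* UTF-16 code units consumed so far */

    while (i < byteLength && units < utf16Offset) {
        int len = utf8SeqLen(utf8[i]);
        /* A 4-byte UTF-8 sequence becomes a surrogate pair in UTF-16. */
        units += (len == 4) ? 2 : 1;
        i += len;
    }
    return i;
}
```

For the eight UTF-8 bytes `61 C3 A9 F0 9F 98 80 62` (the characters a, é, an
emoji outside the BMP, and b), a chunk that starts at native index 0 maps
UTF-16 offset 2 to native index 3 and UTF-16 offset 4 to native index 7. A
real provider would typically cache this mapping for the current chunk rather
than rescanning the bytes on every call.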

To fully understand what is required to support a new string type with UText, it
will be necessary to study both the provider function declarations from
[utext.h](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/utext.h)
and the existing text provider implementations in
[utext.cpp](https://github.com/unicode-org/icu/blob/main/icu4c/source/common/utext.cpp).