---
layout: default
title: Chars and Strings
nav_order: 3
has_children: true
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Strings

## Overview

This section explains how to handle Unicode strings with ICU in C and C++.

Sample code is available in the ICU source code library at
[icu/source/samples/ustring/ustring.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/samples/ustring/ustring.cpp).

## Text Access Overview

Strings are the most common and fundamental form of handling text in software.
Logically, and often physically, they contain contiguous arrays (vectors) of
basic units. Most of the ICU API functions work directly with simple strings,
and where possible, this is preferred.

Sometimes, text needs to be accessed via more powerful and complicated methods.
For example, text may be stored in discontiguous chunks in order to deal with
frequent modification (like typing) and large amounts, or it may not be stored
in the internal encoding, or it may have associated attributes like bold or
italic styles.

### Guidance

ICU provides multiple text access interfaces which were added over time. If
simple strings cannot be used, then consider the following:

1. [UText](utext.md): Added in ICU4C 3.4 as a technology preview. Intended to
   be the strategic text access API for use with ICU. C API, high performance,
   writable, supports native indexes for efficient non-UTF-16 text storage. So
   far (3.4) only supported in BreakIterator. Some API changes are anticipated
   for ICU 3.6.

2. Replaceable (Java & C++) and UReplaceable (C): Writable, designed for use
   with Transliterator.

3. CharacterIterator (Java JDK & C++): Read-only, used in many APIs. Large
   differences between the JDK and C++ versions.

4. UCharacterIterator (Java): Back-port of the C++ CharacterIterator to ICU4J
   for support of supplementary code points and post-increment iteration.

5. UCharIterator (C): Read-only, C interface used mostly in incremental
   normalization and collation.

The following provides some historical perspective and comparison between the
interfaces.

### CharacterIterator

ICU has long provided the CharacterIterator interface for some services. It
allows for abstract text access, but has limitations:

1. It has a per-character function call overhead.

2. Originally, it was designed for UCS-2 operation and did not support direct
   handling of supplementary Unicode code points. Such support was later added.

3. Its pre-increment iteration semantics are uncommon, and are inefficient when
   used with a variable-width encoding form (UTF-16). Functions for
   post-increment iteration were added later.

4. The C++ version added iteration start/limit boundaries only because the C++
   UnicodeString copies string contents during substringing; the Java
   CharacterIterator does not have these extra boundaries; substringing is
   more efficient in Java.

5. CharacterIterator is not available for use in C.

6. CharacterIterator is a read-only interface.

7. It uses UTF-16 indexes into the text, which is not efficient for other
   encoding forms.

8. With the additions to the API over time, the number of methods that have to
   be overridden by subclasses has become rather large.

The core Java adopted an early version of CharacterIterator; later
functionality, like support for supplementary code points, was back-ported from
ICU4C to ICU4J to form the UCharacterIterator class.

The UCharIterator C interface was added to allow for incremental normalization
and collation in C. It is entirely code unit (UChar)-oriented, uses only
post-increment iteration and has a smaller number of overridable methods.

### Replaceable

The Replaceable (Java & C++) and UReplaceable (C) interfaces are designed for,
and used in, Transliterator. They are random-access interfaces, not iterators.

### UText

The [UText](utext.md) text access interface was designed as a possible
replacement for all previous interfaces listed above, with additional
functionality. It allows for high-performance operation through the use of
storage-native indexes (for efficient use of non-UTF-16 text) and through
accessing multiple characters per function call. Code point iteration is
available with functions as well as with C macros, for maximum performance.
UText is also writable, mostly patterned after Replaceable. For details see the
UText chapter.

## Strings in ICU

### Strings in Java

In Java, ICU uses the standard String and StringBuffer classes, `char[]`, etc.
See the Java documentation for details.

### Strings in C/C++

Strings in C and C++ are, at the lowest level, arrays of some particular base
type. In most cases, the base type is a char, which is an 8-bit byte in modern
compilers. Some APIs use a "wide character" type wchar_t that is typically 8,
16, or 32 bits wide and upwards compatible with char. C code passes `char *` or
wchar_t pointers to the first element of an array. C++ enables you to create a
class for encapsulating these kinds of character arrays in handy and safe
objects.

The interpretation of the byte or wchar_t values depends on the platform, the
compiler, the signedness of both char and wchar_t, and the width of wchar_t.
These characteristics are not specified in the language standards. When using
internationalized text, the encoding often uses multiple chars for most
characters and a wchar_t that is wide enough to hold exactly one character code
point value each.
Some APIs, especially in the standard library (stdlib), assume
that wchar_t strings use a fixed-width encoding with exactly one character code
point per wchar_t.

### ICU: 16-bit Unicode strings

In order to take advantage of Unicode with its large character repertoire and
its well-defined properties, there must be types with consistent definitions and
semantics. The Unicode standard defines a default encoding based on 16-bit code
units. This is supported in ICU by the definition of the UChar to be an unsigned
16-bit integer type. This is the base type for character arrays for strings in
ICU.

> :point_right: **Note**: *Endianness is not an issue on this level because the interpretation of an
integer is fixed within any given platform.*

With the UTF-16 encoding form, a single Unicode code point is encoded with
either one or two 16-bit UChar code units (unambiguously). "Supplementary" code
points, which are encoded with pairs of code units, are rare in most texts. The
two code units are called "surrogates", and their unit value ranges are distinct
from each other and from single-unit value ranges. Code should be generally
optimized for the common, single-unit case.

16-bit Unicode strings in internal processing contain sequences of 16-bit code
units that may not always be well-formed UTF-16. ICU treats single, unpaired
surrogates as surrogate code points, i.e., they are returned in per-code point
iteration, they are included in the number of code points of a string, and they
are generally treated much like normal, unassigned code points in most APIs.
Surrogate code points have Unicode properties although they cannot be assigned
an actual character.

ICU string handling functions (including append, substring, etc.) do not
automatically protect against producing malformed UTF-16 strings.
Most of the
time, indexes into strings are naturally at code point boundaries because they
result from other functions that always produce such indexes. If necessary, the
user can test for proper boundaries by checking the code unit values, or adjust
arbitrary indexes to code point boundaries by using the C macros
U16_SET_CP_START() and U16_SET_CP_LIMIT() (see utf.h) and the UnicodeString
functions getChar32Start() and getChar32Limit().

UTF-8 and UTF-32 are supported with converters (ucnv.h), macros (utf.h), and
convenience functions (ustring.h), but only a subset of APIs works with UTF-8
directly as string encoding form.

**See the [UTF-8](utf-8.md) subpage for details about working with
UTF-8.** Some of the following sections apply to UTF-8 APIs as well; for example
sections about handling lengths and overflows.

### Separate type for single code points

A Unicode code point is an integer with a value from 0 to 0x10FFFF. ICU 2.4 and
later defines the UChar32 type for single code point values as a 32-bit
signed integer (int32_t). This allows the use of easily testable negative values
as sentinels, to indicate errors, exceptions or "done" conditions. All negative
values and positive values greater than 0x10FFFF are illegal as Unicode code
points.

ICU 2.2 and earlier defined UChar32 depending on the platform: If the compiler's
wchar_t was 32 bits wide, then UChar32 was defined to be the same as wchar_t.
Otherwise, it was defined to be an unsigned 32-bit integer. This means that
UChar32 was either a signed or unsigned integer type depending on the compiler.
This was meant for better interoperability with existing libraries, but was of
little use because ICU does not process 32-bit strings; UChar32 is only used
for single code points. The platform dependence of UChar32 could cause problems
with C++ function overloading.
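
The relationship between 16-bit code units, surrogate pairs, and a signed 32-bit
code point type can be illustrated with a small standalone sketch. It is plain
C++ and does not use ICU; `nextCodePoint()` and `countCodePoints()` are
hypothetical helpers written for this illustration, and `char16_t` stands in
for UChar under the assumption that UChar is a 16-bit code unit. In real code,
the U16_NEXT() macro from utf.h serves this purpose.

```cpp
#include <cstdint>

// Hypothetical helper, not an ICU API: reads one code point from a
// UTF-16 buffer and advances i past it. Returns -1 at the end of input.
int32_t nextCodePoint(const char16_t *s, int32_t &i, int32_t length) {
    if (i >= length) {
        return -1;  // negative sentinel, possible because the type is signed
    }
    int32_t c = s[i++];
    // A lead surrogate followed by a trail surrogate forms one
    // supplementary code point.
    if (c >= 0xD800 && c <= 0xDBFF && i < length) {
        int32_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            c = ((c - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
        }
    }
    return c;  // an unpaired surrogate is returned as its own code point
}

// Counts code points by iterating until the sentinel appears.
int32_t countCodePoints(const char16_t *s, int32_t length) {
    int32_t i = 0;
    int32_t count = 0;
    while (nextCodePoint(s, i, length) >= 0) {
        ++count;
    }
    return count;
}
```

For the four code units 0x0041, 0xD83D, 0xDE00, 0xD800, this sketch yields three
code points: U+0041, U+1F600, and the unpaired surrogate U+D800, which matches
the treatment of unpaired surrogates described above.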

### Compiler-dependent definitions

The compiler's and the runtime character set's codepage encodings are not
specified by the C/C++ language standards and are usually not a Unicode encoding
form. They typically depend on the settings of the individual system, process,
or thread. Therefore, it is not possible to instantiate a Unicode character or
string variable directly with C/C++ character or string literals. The only safe
way is to use numeric values. It is not an issue for User Interface (UI) strings
that are translated. These UI strings are loaded from a resource bundle, which
is generated from a text file that can be in Unicode or in any other
ICU-provided codepage. The binary form of the genrb tool generates UTF-16
strings that are ready for direct use.

There is a useful exception to this for program-internal strings and test
strings. Within each "family" of character encodings, there is a set of
characters that have the same numeric code values. Such characters include Latin
letters, the basic digits, the space, and some punctuation. Most of the ASCII
graphic characters are invariant characters. The same set, with different but
again consistent numeric values, is invariant among almost all EBCDIC codepages.
For details, see
[icu4c/source/common/unicode/utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html).
With strings that contain only these invariant characters, it is possible to
use efficient ICU constructs to write a C/C++ string literal and use it to
initialize Unicode strings.

In some APIs, ICU uses `char *` strings. This is either for file system paths or
for strings that contain invariant characters only (such as locale identifiers).
These strings are in the platform-specific encoding of either ASCII or EBCDIC.
All other codepage differences do not matter for invariant characters and are
manipulated by the C stdlib functions like strcpy().

In some APIs where identifiers are used, ICU uses `char *` strings with invariant
characters. Such strings do not require the full Unicode repertoire and are
easier to handle in C and C++ with `char *` string literals and standard C
library functions. Their useful character repertoire is actually smaller than
the set of graphic ASCII characters; for details, see
[utypes.h](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utypes_8h.html). Examples of
`char *` identifier uses are converter names, locale IDs, and resource bundle
table keys.

There is another, less efficient way to have human-readable Unicode string
literals in C and C++ code. ICU provides a small number of functions that allow
any Unicode characters to be inserted into a string with escape sequences
similar to the one that is used in the C and C++ language. In addition to the
familiar \\n and \\xhh etc., ICU also provides the \\uhhhh syntax with four hex
digits and the \\Uhhhhhhhh syntax with eight hex digits for hexadecimal Unicode
code point values. This is very similar to the newer escape sequences used in
Java and defined in the latest C and C++ standards. Since ICU is not a compiler
extension, the "unescaping" is done at runtime and the backslash itself must be
escaped (duplicated) so that the compiler does not attempt to "unescape" the
sequence itself.

## Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string are
always counted in terms of UChar code units, not in terms of UChar32 code
points. (This is the same as in common C library functions that use `char *`
strings with multi-byte encodings.)
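
Because indexes count code units, an arbitrary index can land in the middle of
a surrogate pair. The following standalone sketch shows the idea behind moving
such an index back to a code point boundary. It is plain C++ rather than ICU
code; `toCodePointStart()` is a hypothetical helper, and `std::u16string` is
used on the assumption that its 16-bit units behave like UChar. With ICU, the
U16_SET_CP_START() macro or UnicodeString::getChar32Start() would be used
instead.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not an ICU API: if index i points at the trail
// half of a surrogate pair, return the index of the lead half instead.
int32_t toCodePointStart(const std::u16string &s, int32_t i) {
    if (i > 0 && i < static_cast<int32_t>(s.size())) {
        char16_t unit = s[i];
        char16_t previous = s[i - 1];
        bool isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
        bool previousIsLead = previous >= 0xD800 && previous <= 0xDBFF;
        if (isTrail && previousIsLead) {
            return i - 1;
        }
    }
    return i;  // already on a code point boundary
}
```

For the string `u"a\U0001F600b"`, `size()` reports 4 code units although there
are only 3 code points, and index 2 (the trail surrogate) is adjusted to 1.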

Often, a user thinks of a "character" as a complete unit in a language, like an
'Ä', while it may be represented with multiple Unicode code points including a
base character and combining marks. (See the Unicode standard for details.) This
often requires users to index and pass strings (UnicodeString or `UChar *`) with
multiple code units or code points. It cannot be done with single-integer
character types. Indexing of such "characters" is done with the BreakIterator
class (in C: ubrk_ functions).

Even with such "higher-level" indexing functions, the actual index values will
be expressed in terms of UChar code units. When more than one code unit is used
at a time, the index value changes by more than one at a time.

ICU uses signed 32-bit integers (int32_t) for lengths and offsets. Because of
internal computations, strings (and arrays in general) are limited to 1G base
units or 2G bytes, whichever is smaller.

## Using C Strings: NUL-Terminated vs. Length Parameters

Strings are either terminated with a NUL character (code point 0, U+0000) or
their length is specified. In the latter case, it is possible to have one or
more NUL characters inside the string.

**Input string** arguments are typically passed with two parameters: The (const)
`UChar *` pointer and an int32_t length argument. If the length is -1 then the
string must be NUL-terminated and the ICU function will call the u_strlen()
function or treat it equivalently. If the input string contains embedded NUL
characters, then the length must be specified.

**Output string** arguments are typically passed with a destination `UChar *`
pointer and an int32_t capacity argument and the function returns the length of
the output as an int32_t. There is also almost always a UErrorCode argument.
Essentially, a `UChar[]` array is passed in with its start and the number of
available UChars.
The array is filled with the output and if space permits the
output will be NUL-terminated. The length of the output string is returned. In
all cases the length of the output string does not include the terminating NUL.
This is the same behavior found in most ICU and non-ICU string APIs, for example
u_strlen(). The output string may **contain** NUL characters as part of its
actual contents, depending on the input and the operation. Note that the
UErrorCode parameter is used to indicate both errors and warnings (non-errors).
The following describes some of the situations in which the UErrorCode will be
set to a non-zero value:

1. If the output length is greater than the output array capacity, then the
   UErrorCode will be set to U_BUFFER_OVERFLOW_ERROR and the contents of the
   output array are undefined.

2. If the output length is equal to the capacity, then the output has been
   completely written minus the terminating NUL. This is also indicated by
   setting the UErrorCode to U_STRING_NOT_TERMINATED_WARNING.
   Note that U_STRING_NOT_TERMINATED_WARNING does not indicate failure (it
   passes the U_SUCCESS() macro).
   Note also that it is more reliable to check the output length against the
   capacity, rather than checking for the warning code, because warning codes
   do not cause the early termination of a function and may subsequently be
   overwritten.

3. If neither of these two conditions applies, the error code will indicate
   success and not a U_STRING_NOT_TERMINATED_WARNING. (If a
   U_STRING_NOT_TERMINATED_WARNING code had been set in the UErrorCode
   parameter before the function call, then it is reset to a U_ZERO_ERROR.)

**Preflighting:** The returned length is always the full output length even if
the output buffer is too small.
It is possible to pass in a capacity of 0 (and
an output array pointer of NULL) for "pure preflighting" to determine the
necessary output buffer size. Add one to make the output string NUL-terminated.

Note that, whether the caller intends to "preflight" or not, if the output
length is equal to or greater than the capacity, then the UErrorCode is set to
U_STRING_NOT_TERMINATED_WARNING or U_BUFFER_OVERFLOW_ERROR respectively, as
described above.

However, "pure preflighting" is very expensive because the operation has to be
processed twice: once for calculating the output length, and a second time to
actually generate the output. It is much more efficient to always provide an
output buffer that is expected to be large enough for most cases, and to
reallocate and repeat the operation only when an overflow occurred. (Remember to
reset the UErrorCode to U_ZERO_ERROR before calling the function again.) In
C/C++, the initial output buffer can be a stack buffer. In case of a
reallocation, it may be possible and useful to cache and reuse the new, larger
buffer.

> :point_right: **Note**: *The exceptions to these rules are the ANSI-C-style functions like u_strcpy(),
which generally require NUL-terminated strings, forbid embedded NULs, and do not
take capacity arguments for buffer overflow checking.*

## Using Unicode Strings in C

In C, Unicode strings are similar to standard `char *` strings. Unicode strings
are arrays of UChar and most APIs take a `UChar *` pointer to the first element
and an input length and/or output capacity, see above. ICU has a number of
functions that provide the Unicode equivalent of the stdlib functions such as
strcpy(), strstr(), etc. Compared with their C standard counterparts, their
function names begin with u_. Otherwise, their semantics are equivalent. These
functions are defined in icu/source/common/unicode/ustring.h.
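
The output-buffer convention from "Using C Strings: NUL-Terminated vs. Length
Parameters" (full length always returned, overflow signalled separately, stack
buffer first, reallocation only on overflow) can be sketched without ICU as
follows. Everything here is hypothetical and for illustration only: the
`Status` enum merely stands in for UErrorCode, and `copyUnits()` and
`copyWithRetry()` are not ICU functions.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins for ICU's status values; hypothetical, for illustration only.
enum Status { kOk = 0, kNotTerminatedWarning = -1, kBufferOverflow = 1 };

// Hypothetical function following the output convention described above:
// writes at most capacity units, NUL-terminates if space permits, and
// always returns the full output length.
int32_t copyUnits(const char16_t *src, int32_t srcLength,
                  char16_t *dest, int32_t capacity, Status &status) {
    for (int32_t i = 0; i < srcLength && i < capacity; ++i) {
        dest[i] = src[i];
    }
    if (srcLength < capacity) {
        dest[srcLength] = 0;             // room for the terminating NUL
        status = kOk;
    } else if (srcLength == capacity) {
        status = kNotTerminatedWarning;  // complete, but not terminated
    } else {
        status = kBufferOverflow;        // output truncated
    }
    return srcLength;
}

// Caller: try a stack buffer first and fall back to the heap on overflow.
std::vector<char16_t> copyWithRetry(const char16_t *src, int32_t srcLength) {
    char16_t stackBuffer[8];
    Status status = kOk;
    int32_t length = copyUnits(src, srcLength, stackBuffer, 8, status);
    if (status != kBufferOverflow) {
        return std::vector<char16_t>(stackBuffer, stackBuffer + length);
    }
    std::vector<char16_t> heapBuffer(length + 1);  // add one for the NUL
    status = kOk;                                  // reset before retrying
    copyUnits(src, srcLength, heapBuffer.data(), length + 1, status);
    heapBuffer.resize(length);
    return heapBuffer;
}
```

Calling `copyUnits(text, 10, nullptr, 0, status)` corresponds to pure
preflighting: nothing is written, 10 is returned, and the status reports an
overflow. The retry path in `copyWithRetry()` runs the operation a second time
only when the stack buffer was too small.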

### Code Point Access

Sometimes, Unicode code points need to be accessed in C for iteration, movement
forward, or movement backward in a string. A string might also need to be
written from code point values. ICU provides a number of macros that are
defined in the icu/source/common/unicode/utf.h and utf8.h/utf16.h headers that
it includes (utf.h is in turn included with utypes.h).

Macros for 16-bit Unicode strings have a U16_ prefix. For example:

    U16_NEXT(s, i, length, c)
    U16_PREV(s, start, i, c)
    U16_APPEND(s, i, length, c, isError)

There are also macros with a U_ prefix for code point range checks (e.g., test
for non-character code point), and U8_ macros for 8-bit (UTF-8) strings. See the
header files and the API References for more details.

#### UTF Macros before ICU 2.4

In ICU 2.4, the utf\*.h macros were revamped, improved, simplified, and
renamed. The old macros continue to be available. They are in utf_old.h,
together with an explanation of the change. utf.h, utf8.h and utf16.h contain
the new macros instead. The new macros are intended to be more consistent, more
useful, and less confusing. Some macros were simply renamed for consistency with
a new naming scheme.

The documentation of the old macros has been removed. If you need it, see a User
Guide version from ICU 4.2 or earlier (see the [download
page](http://site.icu-project.org/download)).

### C Unicode String Literals

There is a pair of macros that together enable users to instantiate a Unicode
string in C (a `UChar []` array) from a C string literal:

    /*
     * In C, we need two macros: one to declare the UChar[] array, and
     * one to populate it; the second one is a noop on platforms where
     * wchar_t is compatible with UChar and ASCII-based.
     * The length of the string literal must be counted for both macros.
     */
    /* declare the invString array for the string */
    U_STRING_DECL(invString, "such characters are safe 123 %-.", 32);
    /* populate it with the characters */
    U_STRING_INIT(invString, "such characters are safe 123 %-.", 32);

With invariant characters, it is also possible to efficiently convert `char *`
strings to and from `UChar *` strings:

    static const char *cs1="such characters are safe 123 %-.";
    static UChar us1[40];
    static char cs2[40];
    u_charsToUChars(cs1, us1, 33); /* include the terminating NUL */
    u_UCharsToChars(us1, cs2, 33);

## Testing for well-formed UTF-16 strings

It is sometimes useful to test if a 16-bit Unicode string is well-formed UTF-16,
that is, that it does not contain unpaired surrogate code units. For a boolean
test, call a function like u_strToUTF8() which sets an error code if the input
string is malformed. (Provide a zero-capacity destination buffer and treat the
buffer overflow error as "is well-formed".) If you need to know the position of
the unpaired surrogate, you can iterate through the string with U16_NEXT() and
U_IS_SURROGATE().

## Using Unicode Strings in C++

[UnicodeString](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classUnicodeString.html) is
a C++ string class that wraps a UChar array and associated bookkeeping. It
provides a rich set of string handling functions.

UnicodeString combines elements of both the Java String and StringBuffer
classes. Many UnicodeString functions are named and work similarly to Java String
methods but modify the object (UnicodeString is "mutable").

UnicodeString provides functions for random access and use (insert/append/find
etc.) of both code units and code points. For each non-iterative string/code
point macro in utf.h there is at least one UnicodeString member function. The
names of most of these functions contain "32" to indicate the use of a UChar32.

Code point and code unit iteration is provided by the
[CharacterIterator](characteriterator.md) abstract class and its subclasses.
There are concrete iterator implementations for UnicodeString objects and plain
`UChar []` arrays.

Most UnicodeString constructors and functions do not have a UErrorCode
parameter. Instead, if the construction of a UnicodeString fails, for example
when it is constructed from a NULL `UChar *` pointer, then the UnicodeString
object becomes "bogus". This can be tested with the isBogus() function. A
UnicodeString can be put into the "bogus" state explicitly with the setToBogus()
function. This is different from an empty string (although a "bogus" string also
returns TRUE from isEmpty()) and may be used equivalently to NULL in `UChar *` C
APIs (or null references in Java, or NULL values in SQL). A string remains
"bogus" until a non-bogus string value is assigned to it. For complete details
of the behavior of "bogus" strings see the description of the setToBogus()
function.

Some APIs work with the
[Replaceable](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classReplaceable.html)
abstract class. It defines a simple interface for random access and text
modification and is useful for operations on text that may have associated
meta-data (e.g., styled text), especially in the Transliterator API.
UnicodeString implements Replaceable.

### C++ Unicode String Literals

Like in C, there are macros that enable users to instantiate a UnicodeString
from a C string literal. One macro requires the length of the string as in the C
macros, the other one implies a strlen().

    UnicodeString s1=UNICODE_STRING("such characters are safe 123 %-.", 32);
    UnicodeString s2=UNICODE_STRING_SIMPLE("such characters are safe 123 %-.");

It is possible to efficiently convert between invariant-character strings and
UnicodeStrings by using constructor, setTo() or extract() overloads that take
codepage data (`const char *`) and specifying an empty string ("") as the
codepage name.

## Using C++ Strings in C APIs

The internal buffer of UnicodeString objects is available for direct handling in
C (or C-style) APIs that take `UChar *` arguments. It is possible but usually not
necessary to copy the string contents with one of the extract functions. The
following describes several direct buffer access methods.

The UnicodeString function getBuffer() const returns a readonly const `UChar *`.
The length of the string is indicated by UnicodeString's length() function.
Generally, UnicodeString does not NUL-terminate the contents of its internal
buffer. However, it is possible to check for a NUL character if the length of
the string is less than the capacity of the buffer. The following expression is
an example of such a check, where buffer is the pointer returned by getBuffer():
`(s.length()<s.getCapacity() && buffer[s.length()]==0)`

An easier way to NUL-terminate the buffer and get a `const UChar *` pointer to it
is the getTerminatedBuffer() function. Unlike getBuffer() const,
getTerminatedBuffer() is not a const function because it may have to (reallocate
and) modify the buffer to append a terminating NUL. Therefore, use getBuffer()
const if you do not need a NUL-terminated buffer.

There is also a pair of functions that allow controlled write access to the
buffer of a UnicodeString: `UChar *getBuffer(int32_t minCapacity)` and
`releaseBuffer(int32_t newLength)`.
`UChar *getBuffer(int32_t minCapacity)`
provides a writeable buffer of at least the requested capacity and returns a
pointer to it. The actual capacity of the buffer after the
`getBuffer(minCapacity)` call may be larger than the requested capacity and can be
determined with `getCapacity()`.

Once the buffer contents are modified, the buffer must be released with the
`releaseBuffer(int32_t newLength)` function, which sets the new length of the
UnicodeString (newLength=-1 can be passed to determine the length of
NUL-terminated contents like `u_strlen()`).

Between the `getBuffer(minCapacity)` and `releaseBuffer(newLength)` function calls,
the contents of the UnicodeString are unknown and the object behaves like it
contains an empty string. A nested `getBuffer(minCapacity)`, `getBuffer() const` or
`getTerminatedBuffer()` will fail (return NULL) and modifications of the string
via UnicodeString member functions will have no effect. Copying a string with an
"open buffer" yields an empty copy. The move constructor, move assignment
operator and Return Value Optimization (RVO) transfer the state, including the
open buffer.

See the UnicodeString API documentation for more information.

## Using C Strings in C++ APIs

There are efficient ways to wrap C-style strings in C++ UnicodeString objects
without copying the string contents. In order to use C strings in C++ APIs, the
`UChar *` pointer and length need to be wrapped into a UnicodeString. This can be
done efficiently in two ways: with a readonly alias or with a writable alias. The
UnicodeString object that is constructed actually uses the `UChar *` pointer as
its internal buffer pointer instead of allocating a new buffer and copying the
string contents.

If the original string is a readonly `const UChar *`, then the UnicodeString must
be constructed with a readonly alias.
If the original string is a writable
(non-const) `UChar *` and is to be modified (e.g., if the `UChar *` buffer is an
output buffer) then the UnicodeString should be constructed with a writeable
alias. For more details see the section "Maximizing Performance with the
UnicodeString Storage Model" and search the unistr.h header file for "alias".

## Maximizing Performance with the UnicodeString Storage Model

UnicodeString uses four storage methods to maximize performance and minimize
memory consumption:

1. Short strings are normally stored inside the UnicodeString object. The
   object has fields for the "bookkeeping" and a small UChar array. When the
   object is copied, the internal characters are copied into the destination
   object.
2. Longer strings are normally stored in allocated memory. The allocated UChar
   array is preceded by a reference counter. When the string object is copied,
   the allocated buffer is shared by incrementing the reference counter. If any
   of the objects that share the same string buffer are modified, they receive
   their own copy of the buffer and decrement the reference counter of the
   previously co-used buffer.
3. A UnicodeString can be constructed (or set with a setTo() function) so that
   it aliases a readonly buffer instead of copying the characters. In this
   case, the string object uses this aliased buffer for as long as the object
   is not modified and it will never attempt to modify or release the buffer.
   This model has copy-on-write semantics. For example, when the string object
   is modified, the buffer contents are first copied into writable memory
   (inside the object for short strings or the allocated buffer for longer
   strings). When a UnicodeString with a readonly setting is copied to another
   UnicodeString using the fastCopyFrom() function, then both string objects
   share the same readonly setting and point to the same storage.
   Copying a
   string with the normal assignment operator or copy constructor will copy the
   buffer. This prevents accidental misuse of readonly-aliased strings. (This
   is new in ICU 2.4; earlier, the assignment operator and copy constructor
   behaved like the new fastCopyFrom() does now.)
   **Important:**
   1. The aliased buffer must remain valid for as long as any UnicodeString
      object aliases it. This includes unmodified fastCopyFrom() and
      `movedFrom()` copies of the object (including moves via the move
      constructor and move assignment operator), and when the compiler uses
      Return Value Optimization (RVO) where a function returns a UnicodeString
      by value.
   2. Be prepared that return-by-value may either make a copy (which does not
      preserve aliasing), or move the value or use RVO (which do preserve
      aliasing).
   3. It is an error to readonly-alias temporary buffers and then pass the
      resulting UnicodeString objects (or references/pointers to them) to APIs
      that store them for longer than the buffers are valid.
   4. If it is necessary to make sure that a string is not a readonly alias,
      then use any modifying function without actually changing the contents
      (for example, s.setCharAt(0, s.charAt(0))).
   5. In ICU 2.4 and later, a simple assignment or copy construction will also
      copy the buffer.
4. A UnicodeString can be constructed (or set with a setTo() function) so that
   it aliases a writable buffer instead of copying the characters. The
   difference from the above is that the string object writes through to this
   aliased buffer for write operations. A new buffer is allocated and the
   contents are copied only when the capacity of the buffer is not sufficient.
   An efficient way to get the string contents into the original buffer is to
   use the `extract(..., UChar *dst, ...)` function.
   The `extract(..., UChar *dst, ...)` function copies the string contents only
   if the dst buffer is
   different from the buffer of the string object itself. If a string grows and
   shrinks during a sequence of operations, then it will not use the same
   buffer, even if the string would fit. When a UnicodeString with a writeable
   alias is assigned to another UnicodeString, the contents are always copied.
   The destination string will not point to the buffer that the source string
   aliases. However, a move constructor, move assignment operator, and
   Return Value Optimization (RVO) do preserve aliasing.

In general, UnicodeString objects have "copy-on-write" semantics. Several
objects may share the same string buffer, but a modification only affects the
object that is modified itself. This is achieved by copying the string contents
if it is not owned exclusively by this one object. Only after that is the object
modified.

Even though it is fairly efficient to copy UnicodeString objects, it is even
more efficient, if possible, to work with references or pointers. Functions that
output strings can be faster by appending their results to a UnicodeString that
is passed in by reference, compared with returning a UnicodeString object or
just setting the local results alone into a string reference.

> :point_right: **Note**: *UnicodeStrings can be copied in a thread-safe manner by just using their
standard copy constructors and assignment operators. fastCopyFrom() is also
thread-safe, but if the original string is a readonly alias, then the copy
shares the same aliased buffer.*

## Using UTF-8 strings with ICU

As mentioned in the overview of this chapter, ICU and most other
Unicode-supporting software use 16-bit Unicode for internal processing.
However, there are circumstances where UTF-8 is used instead.
This is usually
the case for software that does little or no processing of non-ASCII characters,
and/or for APIs that predate Unicode, use byte-based strings, and cannot be
changed or replaced for various reasons.

A common perception is that UTF-8 has an advantage because it was designed for
compatibility with byte-based, ASCII-based systems, although it was designed for
string storage (of Unicode characters in Unix file names) rather than for
processing performance.

While ICU mostly does not natively use UTF-8 strings, there are many ways to
work with UTF-8 strings and ICU. For more information see the newer
[UTF-8](utf-8.md) subpage.

## Using UTF-32 strings with ICU

It is even rarer to use UTF-32 for string processing than UTF-8. While 32-bit
Unicode is convenient because it is the only fixed-width UTF, there are few or
no legacy systems with 32-bit string processing that would benefit from a
compatible format, and the memory bandwidth requirements of UTF-32 diminish the
performance and handling advantage of the fixed-width format.

Over time, the wchar_t type of some C/C++ compilers became a 32-bit integer, and
some C libraries do use it for Unicode processing. However, application software
with good Unicode support tends to have little use for the rudimentary Unicode
and Internationalization support of the standard C/C++ libraries and often uses
custom types (like ICU's) and UTF-16 or UTF-8.

For those systems where 32-bit Unicode strings are used, ICU offers some
convenience functions.

1. Conversion of whole strings: u_strFromUTF32() and u_strToUTF32() in
   ustring.h.

2. Access to code points is trivial and does not require any macros.

3. Using a UTF-32 converter with all of the ICU conversion APIs in ucnv.h,
   including ones with an "Algorithmic" suffix.

4. UnicodeString has `fromUTF32()` and `toUTF32()` methods.

5.
   For conversion directly between UTF-32 and another charset use
   ucnv_convertEx(). However, since ICU converters work with byte streams in
   external charsets on the non-"Unicode" side, the UTF-32 string will be
   treated as a byte stream (UTF-32 Character Encoding *Scheme*) rather than a
   sequence of 32-bit code units (UTF-32 Character Encoding *Form*). The
   correct converter must be used: UTF-32BE or UTF-32LE according to the
   platform endianness (U_IS_BIG_ENDIAN). Treating the string like a byte
   stream also makes a difference in data types (`char *`), lengths and indexes
   (counting bytes), and NUL-termination handling (input NUL-termination not
   possible, output writes only a NUL byte, not a NUL 32-bit code unit). For
   the difference between internal encoding forms and external encoding schemes
   see the Unicode Standard.

6. Some ICU APIs work with a CharacterIterator, a UText or a UCharIterator
   instead of directly with a C/C++ string parameter. There is currently no ICU
   instance of any of these interfaces that reads UTF-32, although an
   application could provide one.

## Changes in ICU 2.0

Beginning with ICU release 2.0, there are a few changes to the ICU string
facilities compared with earlier ICU releases.

Some of the NUL-termination behavior was inconsistent across the ICU API
functions. In particular, the following functions used to count the terminating
NUL character in their output length (counted one more before ICU 2.0 than now):
ucnv_toUChars, ucnv_fromUChars, uloc_getLanguage, uloc_getCountry,
uloc_getVariant, uloc_getName, uloc_getDisplayLanguage, uloc_getDisplayCountry,
uloc_getDisplayVariant, uloc_getDisplayName

Some functions used to set an overflow error code even when only the terminating
NUL did not fit into the output buffer. These functions now set UErrorCode to
U_STRING_NOT_TERMINATED_WARNING rather than to U_BUFFER_OVERFLOW_ERROR.
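
The termination rules described above can be illustrated with a small standalone model. This is only a sketch and not ICU code: the function `modelExtract` and the `ModelStatus` enum are hypothetical stand-ins (the real status constants are `UErrorCode` values), and the choice to copy as many units as fit on overflow is an assumption of this model. It shows the returned length excluding the NUL, and the warning being used when only the NUL does not fit.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the ICU status values named in the text.
enum ModelStatus {
    MODEL_ZERO_ERROR,
    MODEL_STRING_NOT_TERMINATED_WARNING,
    MODEL_BUFFER_OVERFLOW_ERROR
};

// Models the ICU 2.0+ output-string convention (not an ICU function):
// copies up to destCapacity units and returns the full output length,
// which never counts a terminating NUL.
int32_t modelExtract(const char16_t *src, int32_t srcLength,
                     char16_t *dest, int32_t destCapacity,
                     ModelStatus &status) {
    int32_t toCopy = srcLength < destCapacity ? srcLength : destCapacity;
    for (int32_t i = 0; i < toCopy; ++i) {
        dest[i] = src[i];
    }
    if (srcLength < destCapacity) {
        dest[srcLength] = 0;  // contents and NUL both fit
        status = MODEL_ZERO_ERROR;
    } else if (srcLength == destCapacity) {
        // contents fit, only the NUL does not: a warning, not an error
        status = MODEL_STRING_NOT_TERMINATED_WARNING;
    } else {
        status = MODEL_BUFFER_OVERFLOW_ERROR;  // contents were truncated
    }
    return srcLength;
}
```

With a four-unit buffer, a three-unit string is copied and NUL-terminated, a four-unit string is copied in full but only yields the warning, and a five-unit string reports an overflow; in all three cases the returned length is the string length without the NUL.
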

The aliasing UnicodeString constructors and most extract functions have existed
for several releases prior to ICU 2.0. There is now an additional extract
function with a UErrorCode parameter. Also, the getBuffer, releaseBuffer and
getCapacity functions are new to ICU 2.0.

For more information about these changes, please consult the old and new API
documentation.