---
layout: default
title: UTF-8
nav_order: 1
parent: Chars and Strings
---
<!--
© 2020 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# UTF-8

*Note: This page is only relevant for C/C++. In Java, all strings are encoded in
UTF-16, except for conversion from bytes to strings (via InputStreamReader or
similar) and from strings to bytes (OutputStreamWriter etc.).*

While most of ICU works with UTF-16 strings and uses data structures optimized
for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
for UTF-8, or work with Unicode code points (21-bit integer values) regardless
of string encoding. Some data structures are designed to work equally well with
UTF-16 and UTF-8.

For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1
means NUL-terminated, output is NUL-terminated if there is space, output
overflow is handled with preflighting; for details see the parent [Strings
page](index.md).) Some newer APIs take an `icu::StringPiece` argument and write
to an `icu::ByteSink` or to a string class object like `std::string`.

## Conversion Between UTF-8 and UTF-16

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
`icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and
`toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.

In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and
`u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and
`u_strFromUTF8Lenient()`.)

The conversion functions in `unicode/ucnv.h` are intended for very flexible
handling of conversion to/from external byte streams (with customizable error
handling and support for split buffers at arbitrary boundaries) which is
normally unnecessary for internal strings.

Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods
which take either a converter object or a charset name. These can be used for
UTF-8, but are not as efficient or convenient as the
`fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among
conversion methods, APIs with a charset name are more convenient but internally
open and close a converter; ones with a converter object parameter avoid this.)

## UTF-8 as Default Charset

ICU has many functions that take or return `char *` strings that are assumed to
be in the default charset which should match the system encoding. Since this
could be one of many charsets, and the charset can be different for different
processes on the same system, ICU uses its conversion framework for converting
to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform,
then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`.
(For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
as a compiler flag.) This will change most of the implementation code to use
dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
conversion framework. (Avoiding such dependencies helps with statically linked
libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even
`UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)

## Low-Level UTF-8 String Operations

`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
macros in `unicode/utf16.h`. The macros handle many cases inline, but call
internal functions for complicated parts of the UTF-8 encoding form.
For example, the following code snippet counts white space characters in a string:

```cpp
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

// Counts the code points in a UTF-8 string that have the White_Space property.
int32_t countWhiteSpace(icu::StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i<length;) {
        UChar32 c;
        // Decodes one code point and advances i; c is negative for ill-formed input.
        U8_NEXT(s, i, length, c);
        if(u_isUWhiteSpace(c)) {
            ++count;
        }
    }
    return count;
}
```

## Dedicated UTF-8 APIs

ICU has some APIs dedicated to UTF-8. They tend to have been added for "worker
functions" like comparing strings, to avoid the string conversion overhead,
rather than for "builder functions" like factory methods and attribute setters.

For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
incrementally, without converting all of the two strings to UTF-16 if there is
an early base letter difference.

`ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most
other charsets uses a dedicated, optimized code path, avoiding the pivot through
UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well,
but that has not been implemented yet as of ICU 4.4.)

Other examples: (This list may or may not be complete.)

* `ucasemap_utf8ToLower()`, `ucasemap_utf8ToUpper()`, `ucasemap_utf8ToTitle()`,
  `ucasemap_utf8FoldCase()`
* `ucnvsel_selectForUTF8()`
* `icu::UnicodeSet::spanUTF8()`, `spanBackUTF8()` and `uset_spanUTF8()`,
  `uset_spanBackUTF8()` (These are highly optimized for UTF-8 processing.)
* `ures_getUTF8String()`, `ures_getUTF8StringByIndex()`, `ures_getUTF8StringByKey()`
* `uspoof_checkUTF8()`, `uspoof_areConfusableUTF8()`, `uspoof_getSkeletonUTF8()`

## Abstract Text APIs

ICU offers several interfaces for text access, designed for different use cases.
(Some interfaces are simply newer and more modern than others.) Some ICU
services work with some of these interfaces, and for some of these interfaces
ICU offers UTF-8 implementations out of the box.

`UText` can be used with `BreakIterator` APIs (character/word/sentence/...
segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
string.

* *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
  other charset with non-1:1 index conversion to UTF-16) if no dictionary is
  supported. This excludes Thai word break. See [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).*
* *As a workaround for Thai word breaking, you can convert the string to
  UTF-16 and convert indexes to UTF-8 string indexes via
  `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index)`.*
* *ICU 4.4 has a technology preview for UText in the regular expression API,
  but some of the UText regex API and semantics are likely to change for ICU
  4.6. (Especially indexing semantics.)*

A `UCharIterator` can be used with several collation APIs (although there is
also the newer `icu::Collator::compareUTF8()`) and with `u_strCompareIter()`.
`uiter_setUTF8()` creates a UCharIterator for a UTF-8 string.

It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
but `CharacterIterator` has a lot of virtual methods and it requires UTF-16
string index semantics.