---
layout: default
title: UTF-8
nav_order: 1
parent: Chars and Strings
---

# UTF-8

*Note: This page is only relevant for C/C++. In Java, all strings are encoded in UTF-16, except for conversion from bytes to strings (via InputStreamReader or similar) and from strings to bytes (OutputStreamWriter etc.).*

While most of ICU works with UTF-16 strings and uses data structures optimized for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized for UTF-8, or work with Unicode code points (21-bit integer values) regardless of string encoding. Some data structures are designed to work equally well with UTF-16 and UTF-8.

For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t` lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1 means NUL-terminated, output is NUL-terminated if there is space, output overflow is handled with preflighting; for details see the parent [Strings page](index.md).) Some newer APIs take an `icu::StringPiece` argument and write to an `icu::ByteSink` or to a string class object like `std::string`.

## Conversion Between UTF-8 and UTF-16

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ `icu::UnicodeString` methods `fromUTF8(const StringPiece &utf8)` and `toUTF8String(StringClass &result)`. There is also `toUTF8(ByteSink &sink)`.

In C, `unicode/ustring.h` has functions like `u_strFromUTF8WithSub()` and `u_strToUTF8WithSub()`. (Also `u_strFromUTF8()`, `u_strToUTF8()` and `u_strFromUTF8Lenient()`.)

The conversion functions in `unicode/ucnv.h` are intended for very flexible handling of conversion to/from external byte streams (with customizable error handling and support for split buffers at arbitrary boundaries) which is normally unnecessary for internal strings.

Note: `icu::UnicodeString` has constructors, `setTo()` and `extract()` methods which take either a converter object or a charset name.
These can be used for UTF-8, but are not as efficient or convenient as the `fromUTF8()`/`toUTF8()`/`toUTF8String()` methods mentioned above. (Among conversion methods, APIs with a charset name are more convenient but internally open and close a converter; ones with a converter object parameter avoid this.)

## UTF-8 as Default Charset

ICU has many functions that take or return `char *` strings that are assumed to be in the default charset which should match the system encoding. Since this could be one of many charsets, and the charset can be different for different processes on the same system, ICU uses its conversion framework for converting to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform, then you should `#define U_CHARSET_IS_UTF8 1` in or before `unicode/utypes.h`. (For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1` as a compiler flag.) This will change most of the implementation code to use dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the conversion framework. (Avoiding such dependencies helps with statically linked libraries and may allow the use of `UCONFIG_NO_LEGACY_CONVERSION` or even `UCONFIG_NO_CONVERSION` \[see `unicode/uconfig.h`\].)

## Low-Level UTF-8 String Operations

`unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16 macros in `unicode/utf16.h`. The macros handle many cases inline, but call internal functions for complicated parts of the UTF-8 encoding form.

For example, the following code snippet counts white space characters in a string:

```c
#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

int32_t countWhiteSpace(StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i