utf-8.md - OpenGrok cross reference for /third_party/icu/docs/userguide/strings/utf-8.md

1 ---
3 title: UTF-8
6 ---
7 <!--
10 -->
12 # UTF-8  chapter
15 UTF-16, except for conversion from bytes to strings (via InputStreamReader or
18 While most of ICU works with UTF-16 strings and uses data structures optimized
19 for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized
20 for UTF-8, or work with Unicode code points (21-bit integer values) regardless
22 UTF-16 and UTF-8.
24 For UTF-8 strings, ICU normally uses `(const) char *` pointers and `int32_t`
25 lengths, normally with semantics parallel to UTF-16 handling. (Input length=-1
26 means NUL-terminated, output is NUL-terminated if there is space, output
31 ## Conversion Between UTF-8 and UTF-16
33 The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++
48 UTF-8, but are not as efficient or convenient as the
53 ## UTF-8 as Default Charset
59 to and from UTF-16.
61 If it is known that the default charset is always UTF-8 on the target platform,
63 (For example, modify the default value there or pass `-DU_CHARSET_IS_UTF8=1`
65 dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the
70 ## Low-Level UTF-8 String Operations
72 `unicode/utf8.h` defines macros for UTF-8 with semantics parallel to the UTF-16
74 internal functions for complicated parts of the UTF-8 encoding form. For
98 ## Dedicated UTF-8 APIs
100 ICU has some APIs dedicated to UTF-8. They tend to have been added for "worker
104 For example, `icu::Collator::compareUTF8()` compares two UTF-8 strings
105 incrementally, without converting all of the two strings to UTF-16 if there is
108 `ucnv_convertEx()` can convert between UTF-8 and another charset, if one of the
109 two `UConverter`s is a UTF-8 converter. The conversion *from UTF-8 to* most
111 UTF-16. (Conversion *from* other charsets *to UTF-8* could be optimized as well,
120     uset_spanBackUTF8() (These are highly optimized for UTF-8 processing.)
129 ICU offers UTF-8 implementations out of the box.
132 segmentation). `utext_openUTF8()` creates a read-only `UText` for a UTF-8
135 *   *Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any
136     other charset with non-1:1 index conversion to UTF-16) if no dictionary is
137 …s excludes Thai word break. See [ticket #5532](https://unicode-org.atlassian.net/browse/ICU-5532).*
139     UTF-16 and convert indexes to UTF-8 string indexes via
140     `u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index).`*
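The index conversion described in the note above can be illustrated without ICU. The following is a minimal sketch, not ICU code: `utf8IndexFromUtf16Index` is a hypothetical helper that counts how many UTF-8 bytes the first `utf16Index` UTF-16 code units would occupy, which is the value the note's preflight call to `u_strToUTF8()` reports through `*destLength`. It assumes the index falls on a code point boundary, and it counts an unpaired surrogate as 3 bytes (as if replaced by U+FFFD); that choice is an assumption of this sketch and not necessarily how ICU treats ill-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU API): returns the UTF-8 byte offset that
// corresponds to the UTF-16 code unit offset `utf16Index` in `s`.
// Assumption: `utf16Index` lies on a code point boundary.
// Assumption: an unpaired surrogate counts as 3 bytes (like U+FFFD).
std::size_t utf8IndexFromUtf16Index(const std::u16string &s,
                                    std::size_t utf16Index) {
    if (utf16Index > s.size()) {
        utf16Index = s.size();  // clamp to the end of the string
    }
    std::size_t utf8Index = 0;
    for (std::size_t i = 0; i < utf16Index; ++i) {
        char16_t c = s[i];
        if (c < 0x80) {
            utf8Index += 1;  // ASCII: one byte
        } else if (c < 0x800) {
            utf8Index += 2;  // U+0080..U+07FF: two bytes
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < utf16Index &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            utf8Index += 4;  // surrogate pair: supplementary code point
            ++i;             // skip the trail surrogate
        } else {
            utf8Index += 3;  // rest of the BMP, or an unpaired surrogate
        }
    }
    return utf8Index;
}
```

For the string `u"a\u00E9\u20AC\U0001F600b"`, the final `b` sits at UTF-16 index 5 and at UTF-8 index 10, because the preceding code points take 1, 2, 3, and 4 bytes in UTF-8.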
147 `uiter_setUTF8()` creates a `UCharIterator` for a UTF-8 string.
149 It is also possible to create a `CharacterIterator` subclass for UTF-8 strings,
150 but `CharacterIterator` has a lot of virtual methods and it requires UTF-16