Byte strings are just like standard Unicode strings with one very important
difference: byte strings are only *conventionally* UTF-8 while Rust's standard
Unicode strings are *guaranteed* to be valid UTF-8. The primary motivation for
byte strings is handling arbitrary bytes that are mostly UTF-8.
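For instance, here is a small sketch (not one of the crate's own examples) of
what "mostly UTF-8" handling looks like with `bstr`'s `ByteSlice` extension
trait:

```rust
use bstr::ByteSlice;

// A byte string that is mostly UTF-8, with one invalid byte in the middle.
let bytes = b"foo\xFFbar";

// String-like operations work on the raw bytes, invalid UTF-8 and all.
assert!(bytes.contains_str("bar"));
// Lossy conversion substitutes the Unicode replacement codepoint.
assert_eq!(bytes.to_str_lossy(), "foo\u{FFFD}bar");
```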
Substring search, for example, reports the byte offset of each match and is
likewise unaffected by invalid UTF-8. The haystack below is illustrative,
chosen so the asserted offsets line up:
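```rust
use bstr::ByteSlice;

// Note the stray \xFF byte: it is invalid UTF-8, yet search is unaffected.
let haystack = b"foo s\xFFy foo foo here foo";
let matches: Vec<usize> = haystack.find_iter("foo").collect();
assert_eq!(matches, [0, 8, 12, 21]);
```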
# When should I use byte strings?

This library reflects my belief that UTF-8 by convention is a better
trade-off in some circumstances than guaranteed UTF-8.
Very little of a search implementation actually cares about searching valid
UTF-8 encoded strings. Indeed, internally, a matching engine mostly just deals
with raw bytes. UTF-8 match boundaries are then guaranteed by the engine
itself rather than by the string type.
The same is true of I/O: the standard library's
[`Read`](https://doc.rust-lang.org/std/io/trait.Read.html) trait hands you raw
bytes with no guarantee that they are valid UTF-8. One option is to convert
those bytes to a `&str`, dropping any bytes that aren't valid UTF-8. (Or
otherwise determine the encoding---which is often impractical---and perform a
transcoding step.) The other option is to treat the contents of a file as if
they were mostly valid UTF-8 and pass invalid UTF-8 through untouched. This
may not be the most correct approach, though!
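A short sketch of that second option, assuming an arbitrary input file
(`/etc/hosts` here is just a placeholder):

```rust
use std::io::Read;

use bstr::ByteSlice;

fn main() -> std::io::Result<()> {
    // Read raw bytes; no UTF-8 validation happens anywhere.
    let mut data = Vec::new();
    std::fs::File::open("/etc/hosts")?.read_to_end(&mut data)?;

    // Search the bytes directly instead of converting to &str first.
    if data.contains_str("localhost") {
        println!("found a match");
    }
    Ok(())
}
```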
# Differences with standard strings

The primary difference between `[u8]` and `str` is that the former is
conventionally UTF-8 while the latter is guaranteed to be UTF-8. The phrase
"conventionally UTF-8" means that a `[u8]` may contain bytes that do not form
a valid UTF-8 sequence, but operations defined on the type in this crate are
generally most useful on valid UTF-8 sequences. For example, iterating over
the codepoints in a byte string is an operation that is only defined on valid
UTF-8. Therefore, when invalid UTF-8 is encountered, the Unicode replacement
codepoint is substituted. Thus, a byte string that is not UTF-8 at all is of
limited utility when using these methods.

However, not all operations on byte strings are Unicode aware. Substring
search, for example, has no Unicode semantics ascribed to it: it works just as
well for byte strings that are completely valid UTF-8 as for byte strings that
contain no valid UTF-8 at all. The same is true for replacements and
splitting.
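For instance, splitting needs no valid UTF-8 at all (a sketch, not from the
crate's docs):

```rust
use bstr::ByteSlice;

// Fields that are pure binary data split just as well as text does.
let fields: Vec<&[u8]> = b"\xFF\x00,\xFE\x01,ok".split_str(",").collect();
assert_eq!(fields, [&b"\xFF\x00"[..], &b"\xFE\x01"[..], &b"ok"[..]]);
```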
Aside from the difference in how UTF-8 is handled, the APIs between `[u8]`
and `str` (and `Vec<u8>` and `String`) are intentionally very similar. One
notable difference is that this crate does not define a generic
[`Pattern`](https://doc.rust-lang.org/std/str/pattern/trait.Pattern.html)
infrastructure like the standard library's; instead, it prefers adding
distinct methods for different argument types.
Additionally, the standard library permits slicing a `str` only on a UTF-8
code unit sequence boundary. Conversely, no such checking is performed when
slicing byte strings, as is consistent with treating them as a sequence of
bytes. This means callers are responsible for maintaining a UTF-8 invariant
themselves if that is important to their use case.
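A quick illustration of that contrast, using only the standard library:

```rust
// '☃' occupies three bytes, so index 2 is not a char boundary.
let s = "a☃z";
assert!(s.get(0..2).is_none()); // str slicing checks UTF-8 boundaries
assert_eq!(&s.as_bytes()[0..2], b"a\xE2"); // byte slicing does not
```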
# Handling of invalid UTF-8

Since byte strings are only *conventionally* UTF-8, there is no guarantee
that byte strings contain valid UTF-8. Indeed, it is perfectly legal for a
byte string to contain arbitrary bytes. However, since this library defines a
*string* type, many of its operations are specified by Unicode. Those
operations are typically only defined on codepoints, and thus have no real
meaning on bytes that are invalid UTF-8 because they do not map to a
particular codepoint.

For this reason, where appropriate, APIs in this library will automatically
convert invalid UTF-8 to the Unicode replacement codepoint, `U+FFFD`. For
example, an iterator over codepoints will yield the replacement codepoint
whenever it comes across bytes that are not valid UTF-8.
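A sketch of that behavior (the crate's actual doc example may differ):

```rust
use bstr::ByteSlice;

// \xFF can never begin a UTF-8 sequence, so each one becomes U+FFFD.
let chars: Vec<char> = b"a\xFF\xFFz".chars().collect();
assert_eq!(chars, ['a', '\u{FFFD}', '\u{FFFD}', 'z']);
```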
The strategy used for this substitution is known as "substitution of maximal
subparts": a replacement codepoint is inserted whenever a byte is found that
cannot possibly lead to a valid UTF-8 code unit sequence. If there were
previous bytes that represented a *prefix* of a well-formed UTF-8 code unit
sequence, then all of those bytes (up to three) are substituted with a single
replacement codepoint. For example:
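```rust
use bstr::ByteSlice;

// A sketch reconstructed around the original example's comment below.
let chars: Vec<char> = b"a\xF0\x9F\x87z".chars().collect();
// The bytes \xF0\x9F\x87 could lead to a valid UTF-8 sequence, but 3 of them
// on their own are invalid, so only a single replacement codepoint is used.
assert_eq!(chars, ['a', '\u{FFFD}', 'z']);
```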
Iterators like `char_indices` additionally report the byte offsets containing
the invalid UTF-8 bytes that were substituted with the replacement codepoint:
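```rust
use bstr::ByteSlice;

// A sketch: each item is (start, end, char), and the replacement
// codepoint's offsets point at the invalid byte it replaced.
let triples: Vec<(usize, usize, char)> = b"ab\xFFz".char_indices().collect();
assert_eq!(
    triples,
    [(0, 1, 'a'), (1, 2, 'b'), (2, 3, '\u{FFFD}'), (3, 4, 'z')],
);
```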
# File paths and OS strings

The key challenge one faces with file paths across platforms is derived from
the following observations:

* On most Unix-like systems, file paths are an arbitrary sequence of bytes.
* On Windows, file paths are an arbitrary sequence of 16-bit integers.
A byte string is therefore deceptively close to being a correct abstraction
for file paths. On Unix, nothing more than a byte string is needed, and if you
treat file paths as "mostly UTF-8," then things work out pretty well. On
Windows, however, the standard library must convert file paths between 16-bit
integers and something that looks like UTF-8 _without_ losing information.
The encoding it uses for this is
[WTF-8](https://simonsapin.github.io/wtf-8/). Its key properties are that it
permits losslessly roundtripping file paths on Windows by extending UTF-8 to
encode surrogate codepoints, while also supporting zero-cost conversion from
Rust's Unicode strings to file paths. (Since UTF-8 is a proper subset of
WTF-8.)
Unfortunately, this makes it difficult to write code that deals uniformly
with, for example, glob or regex matching on file paths. This is because WTF-8
is meant to remain an internal implementation detail. There are a few ways one
might cope:
1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8
   by accessing their underlying 16-bit integer representation. Unfortunately,
   this isn't zero cost (it introduces a second WTF-8 decoding step) and it's
   not clear this is a good thing to do, since WTF-8 should ideally remain an
   internal implementation detail.
2. One could instead declare that they will not handle file paths on Windows
   that are not valid UTF-16, and return an error when one is encountered.
3. Like (2), but instead of returning an error, lossily decode a file path on
   Windows that isn't valid UTF-16, replacing the offending sequences with the
   Unicode replacement codepoint.
Accordingly, this crate provides conversion routines between byte strings and
file paths (and OS strings), so byte strings can be used with normal
string-related functions. On Unix, these conversions are zero-cost. On
Windows, they perform a UTF-8 check and either return an error or lossily
decode the file path into valid UTF-8, depending on which function you use.
Code that instead demands `&str` file paths up front behaves as if file paths
are always valid UTF-8, and thus contains latent bugs on Unix, where paths
with invalid UTF-8 are possible.
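A sketch of those conversions via the `ByteVec` and `ByteSlice` extension
traits (the file name is illustrative):

```rust
use std::path::Path;

use bstr::{ByteSlice, ByteVec};

// Zero-cost on Unix; a UTF-8 check plus lossy decoding on Windows.
let path = Path::new("/tmp/report-2024.txt");
let bytes = Vec::from_path_lossy(path);

// Ordinary byte string operations now apply to the path.
assert!(bytes.ends_with_str(".txt"));
assert_eq!(bytes.rfind("-"), Some(11));
```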
# Cargo features

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, and implies the `alloc` feature.
* `alloc` - **Enabled** by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>` and `BString`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary, such as grapheme, word and sentence
  segmenters. When this is disabled, basic support such as UTF-8 decoding is
  still included. Note that currently, enabling this feature also requires
  enabling the `std` feature. (A comment in the feature definitions notes that
  declaring `unicode = [std, ...]` would be fine, but interactions with the
  `regex-automata` dependency prevent it for now.)
* `serde` - Enables implementations of serde traits for `BStr`, and also for
  `BString` when `alloc` is enabled.
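For example, to drop the default features and keep only allocation support,
one might write (version number illustrative):

```toml
[dependencies]
# No std, no large Unicode tables; UTF-8 decoding and search still work.
bstr = { version = "1", default-features = false, features = ["alloc"] }
```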