/*!
A byte string library.

Byte strings are just like standard Unicode strings with one very important
difference: byte strings are only *conventionally* UTF-8 while Rust's standard
Unicode strings are *guaranteed* to be valid UTF-8. The primary motivation for
byte strings is for handling arbitrary bytes that are mostly UTF-8.

# Overview

This crate provides two important traits that provide string oriented methods
on `&[u8]` and `Vec<u8>` types:

* [`ByteSlice`](trait.ByteSlice.html) extends the `[u8]` type with additional
  string oriented methods.
* [`ByteVec`](trait.ByteVec.html) extends the `Vec<u8>` type with additional
  string oriented methods.

Additionally, this crate provides two concrete byte string types that deref to
`[u8]` and `Vec<u8>`. These are useful for storing byte string types, and come
with convenient `std::fmt::Debug` implementations:

* [`BStr`](struct.BStr.html) is a byte string slice, analogous to `str`.
* [`BString`](struct.BString.html) is an owned growable byte string buffer,
  analogous to `String`.

Additionally, the free function [`B`](fn.B.html) serves as a convenient short
hand for writing byte string literals.

# Quick examples

Byte strings build on the existing APIs for `Vec<u8>` and `&[u8]`, with
additional string oriented methods. Operations such as iterating over
graphemes, searching for substrings, replacing substrings, trimming and case
conversion are examples of things not provided on the standard library `&[u8]`
APIs but are provided by this crate.
For example, this code iterates over all
occurrences of a substring:

```
use bstr::ByteSlice;

let s = b"foo bar foo foo quux foo";

let mut matches = vec![];
for start in s.find_iter("foo") {
    matches.push(start);
}
assert_eq!(matches, [0, 8, 12, 21]);
```

Here's another example showing how to do a search and replace (and also showing
use of the `B` function):

```
use bstr::{B, ByteSlice};

let old = B("foo ☃☃☃ foo foo quux foo");
let new = old.replace("foo", "hello");
assert_eq!(new, B("hello ☃☃☃ hello hello quux hello"));
```

And here's an example that shows case conversion, even in the presence of
invalid UTF-8:

```
use bstr::{ByteSlice, ByteVec};

let mut lower = Vec::from("hello β");
lower[0] = b'\xFF';
// lowercase β is uppercased to Β
assert_eq!(lower.to_uppercase(), b"\xFFELLO \xCE\x92");
```

# Convenient debug representation

When working with byte strings, it is often useful to be able to print them
as if they were byte strings and not sequences of integers. While this crate
cannot affect the `std::fmt::Debug` implementations for `[u8]` and `Vec<u8>`,
this crate does provide the `BStr` and `BString` types which have convenient
`std::fmt::Debug` implementations.

For example, this

```
use bstr::ByteSlice;

let mut bytes = Vec::from("hello β");
bytes[0] = b'\xFF';

println!("{:?}", bytes.as_bstr());
```

will output `"\xFFello β"`.

This example works because the
[`ByteSlice::as_bstr`](trait.ByteSlice.html#method.as_bstr)
method converts any `&[u8]` to a `&BStr`.

# When should I use byte strings?

This library reflects my hypothesis that UTF-8 by convention is a better trade
off in some circumstances than guaranteed UTF-8. It's possible, perhaps even
likely, that this is a niche concern for folks working closely with core text
primitives.

The first time this idea hit me was in the implementation of Rust's regex
engine. In particular, very little of the internal implementation cares at all
about searching valid UTF-8 encoded strings. Indeed, internally, the
implementation converts `&str` from the API to `&[u8]` fairly quickly and
just deals with raw bytes. UTF-8 match boundaries are then guaranteed by the
finite state machine itself rather than any specific string type. This makes it
possible to not only run regexes on `&str` values, but also on `&[u8]` values.

Why would you ever want to run a regex on a `&[u8]` though? Well, `&[u8]` is
the fundamental way in which one reads data from all sorts of streams, via the
standard library's [`Read`](https://doc.rust-lang.org/std/io/trait.Read.html)
trait. In particular, there is no platform independent way to determine whether
what you're reading from is some binary file or a human readable text file.
Therefore, if you're writing a program to search files, you probably need to
deal with `&[u8]` directly unless you're okay with first converting it to a
`&str` and dropping any bytes that aren't valid UTF-8. (Or otherwise determine
the encoding---which is often impractical---and perform a transcoding step.)
Often, the simplest and most robust way to approach this is to simply treat the
contents of a file as if it were mostly valid UTF-8 and pass through invalid
UTF-8 untouched. This may not be the most correct approach though!

One case in particular exacerbates these issues, and that's memory mapping
a file. When you memory map a file, that file may be gigabytes big, but all
you get is a `&[u8]`. Converting that to a `&str` all in one go is generally
not a good idea because of the costs associated with doing so, and also
because it generally causes one to do two passes over the data instead of
one, which is quite undesirable.
It is of course usually possible to do it in an
incremental way by only parsing chunks at a time, but this is often complex to
do or impractical. For example, many regex engines only accept one contiguous
sequence of bytes at a time with no way to perform incremental matching.

In summary, conventional UTF-8 byte strings provided by this library are
definitely useful in some limited circumstances, but how useful they are more
broadly isn't clear yet.

# `bstr` in public APIs

Since this library is not yet `1.0`, you should not use it in the public API of
your crates until it hits `1.0` (unless you're OK with tracking breaking
releases of `bstr`). It is expected that `bstr 1.0` will be released before
2022.

In general, it should be possible to avoid putting anything in this crate into
your public APIs. Namely, you should never need to use the `ByteSlice` or
`ByteVec` traits as bounds on public APIs, since their only purpose is to
extend the methods on the concrete types `[u8]` and `Vec<u8>`, respectively.
Similarly, it should not be necessary to put either the `BStr` or `BString`
types into public APIs. If you want to use them internally, then they can
be converted to/from `[u8]`/`Vec<u8>` as needed.

# Differences with standard strings

The primary difference between `[u8]` and `str` is that the former is
conventionally UTF-8 while the latter is guaranteed to be UTF-8. The phrase
"conventionally UTF-8" means that a `[u8]` may contain bytes that do not form
a valid UTF-8 sequence, but operations defined on the type in this crate are
generally most useful on valid UTF-8 sequences. For example, iterating over
Unicode codepoints or grapheme clusters is an operation that is only defined
on valid UTF-8. Therefore, when invalid UTF-8 is encountered, the Unicode
replacement codepoint is substituted.
Thus, a byte string that is not UTF-8 at
all is of limited utility when using this crate.

However, not all operations on byte strings are specifically Unicode aware. For
example, substring search has no specific Unicode semantics ascribed to it. It
works just as well for byte strings that are completely valid UTF-8 as for byte
strings that contain no valid UTF-8 at all. Similarly for replacements and
various other operations that do not need any Unicode specific tailoring.

Aside from the difference in how UTF-8 is handled, the APIs between `[u8]` and
`str` (and `Vec<u8>` and `String`) are intentionally very similar, including
maintaining the same behavior for corner cases in things like substring
splitting. There are, however, some differences:

* Substring search is not done with `matches`, but instead, `find_iter`.
  In general, this crate does not define any generic
  [`Pattern`](https://doc.rust-lang.org/std/str/pattern/trait.Pattern.html)
  infrastructure, and instead prefers adding new methods for different
  argument types. For example, `matches` can search by a `char` or a `&str`,
  whereas `find_iter` can only search by a byte string. `find_char` can be
  used for searching by a `char`.
* Since `SliceConcatExt` in the standard library is unstable, it is not
  possible to reuse that to implement `join` and `concat` methods. Instead,
  [`join`](fn.join.html) and [`concat`](fn.concat.html) are provided as free
  functions that perform a similar task.
* This library bundles in a few more Unicode operations, such as grapheme,
  word and sentence iterators. More operations, such as normalization and
  case folding, may be provided in the future.
* Some `String`/`str` APIs will panic if a particular index was not on a valid
  UTF-8 code unit sequence boundary. Conversely, no such checking is performed
  in this crate, as is consistent with treating byte strings as a sequence of
  bytes. This means callers are responsible for maintaining a UTF-8 invariant
  if that's important.
* Some routines provided by this crate, such as `starts_with_str`, have a
  `_str` suffix to differentiate them from similar routines already defined
  on the `[u8]` type. The difference is that `starts_with` requires its
  parameter to be a `&[u8]`, whereas `starts_with_str` permits its parameter
  to be anything that implements `AsRef<[u8]>`, which is more flexible. This
  means you can write `bytes.starts_with_str("☃")` instead of
  `bytes.starts_with("☃".as_bytes())`.

Otherwise, you should find most of the APIs between this crate and the standard
library string APIs to be very similar, if not identical.

# Handling of invalid UTF-8

Since byte strings are only *conventionally* UTF-8, there is no guarantee
that byte strings contain valid UTF-8. Indeed, it is perfectly legal for a
byte string to contain arbitrary bytes. However, since this library defines
a *string* type, it provides many operations specified by Unicode. These
operations are typically only defined over codepoints, and thus have no real
meaning on bytes that are invalid UTF-8 because they do not map to a particular
codepoint.

For this reason, whenever operations defined only on codepoints are used, this
library will automatically convert invalid UTF-8 to the Unicode replacement
codepoint, `U+FFFD`, which looks like this: `�`.
For example, an
[iterator over codepoints](struct.Chars.html) will yield a Unicode
replacement codepoint whenever it comes across bytes that are not valid UTF-8:

```
use bstr::ByteSlice;

let bs = b"a\xFF\xFFz";
let chars: Vec<char> = bs.chars().collect();
assert_eq!(vec!['a', '\u{FFFD}', '\u{FFFD}', 'z'], chars);
```

There are a few ways in which invalid bytes can be substituted with a Unicode
replacement codepoint. One way, not used by this crate, is to replace every
individual invalid byte with a single replacement codepoint. In contrast, the
approach this crate uses is called the "substitution of maximal subparts," as
specified by the Unicode Standard (Chapter 3, Section 9). (This approach is
also used by [W3C's Encoding Standard](https://www.w3.org/TR/encoding/).) In
this strategy, a replacement codepoint is inserted whenever a byte is found
that cannot possibly lead to a valid UTF-8 code unit sequence. If there were
previous bytes that represented a *prefix* of a well-formed UTF-8 code unit
sequence, then all of those bytes (up to 3) are substituted with a single
replacement codepoint. For example:

```
use bstr::ByteSlice;

let bs = b"a\xF0\x9F\x87z";
let chars: Vec<char> = bs.chars().collect();
// The bytes \xF0\x9F\x87 could lead to a valid UTF-8 sequence, but 3 of them
// on their own are invalid. Only one replacement codepoint is substituted,
// which demonstrates the "substitution of maximal subparts" strategy.
assert_eq!(vec!['a', '\u{FFFD}', 'z'], chars);
```

If you do need to access the raw bytes for some reason in an iterator like
`Chars`, then you should use the iterator's "indices" variant, which gives
the byte offsets containing the invalid UTF-8 bytes that were substituted with
the replacement codepoint.
For example:

```
use bstr::{B, ByteSlice};

let bs = b"a\xE2\x98z";
let chars: Vec<(usize, usize, char)> = bs.char_indices().collect();
// Even though the replacement codepoint is encoded as 3 bytes itself, the
// byte range given here is only two bytes, corresponding to the original
// raw bytes.
assert_eq!(vec![(0, 1, 'a'), (1, 3, '\u{FFFD}'), (3, 4, 'z')], chars);

// Thus, getting the original raw bytes is as simple as slicing the original
// byte string:
let chars: Vec<&[u8]> = bs.char_indices().map(|(s, e, _)| &bs[s..e]).collect();
assert_eq!(vec![B("a"), B(b"\xE2\x98"), B("z")], chars);
```

# File paths and OS strings

One of the premier features of Rust's standard library is how it handles file
paths. In particular, it makes it very hard to write incorrect code while
simultaneously providing a correct cross platform abstraction for manipulating
file paths. The key challenge that one faces with file paths across platforms
is derived from the following observations:

* On most Unix-like systems, file paths are an arbitrary sequence of bytes.
* On Windows, file paths are an arbitrary sequence of 16-bit integers.

(In both cases, certain sequences aren't allowed. For example a `NUL` byte is
not allowed in either case. But we can ignore this for the purposes of this
section.)

Byte strings, like the ones provided in this crate, line up really well with
file paths on Unix-like systems, which are themselves just arbitrary sequences
of bytes. It turns out that if you treat them as "mostly UTF-8," then things
work out pretty well.
In contrast, byte strings _don't_ really work
that well on Windows because it's not possible to correctly roundtrip file
paths between 16-bit integers and something that looks like UTF-8 _without_
explicitly defining an encoding to do this for you, which is anathema to byte
strings, which are just bytes.

Rust's standard library elegantly solves this problem by specifying an
internal encoding for file paths that's only used on Windows called
[WTF-8](https://simonsapin.github.io/wtf-8/). Its key properties are that it
permits losslessly roundtripping file paths on Windows by extending UTF-8 to
support an encoding of surrogate codepoints, while simultaneously supporting
zero-cost conversion from Rust's Unicode strings to file paths. (Since UTF-8 is
a proper subset of WTF-8.)

The fundamental point at which the above strategy fails is when you want to
treat file paths as things that look like strings in a zero cost way. In most
cases, this is actually the wrong thing to do, but some cases call for it,
for example, glob or regex matching on file paths. This is because WTF-8 is
treated as an internal implementation detail, and there is no way to access
those bytes via a public API. Therefore, such consumers are limited in what
they can do:

1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8
   by accessing their underlying 16-bit integer representation. Unfortunately,
   this isn't zero cost (it introduces a second WTF-8 decoding step) and it's
   not clear this is a good thing to do, since WTF-8 should ideally remain an
   internal implementation detail.
2. One could instead declare that they will not handle paths on Windows that
   are not valid UTF-16, and return an error when one is encountered.
3. Like (2), but instead of returning an error, lossily decode the file path
   on Windows that isn't valid UTF-16 into UTF-16 by replacing invalid bytes
   with the Unicode replacement codepoint.

While this library may provide facilities for (1) in the future, currently,
this library only provides facilities for (2) and (3). In particular, a suite
of conversion functions is provided for converting between byte strings, OS
strings and file paths. For owned byte strings, they are:

* [`ByteVec::from_os_string`](trait.ByteVec.html#method.from_os_string)
* [`ByteVec::from_os_str_lossy`](trait.ByteVec.html#method.from_os_str_lossy)
* [`ByteVec::from_path_buf`](trait.ByteVec.html#method.from_path_buf)
* [`ByteVec::from_path_lossy`](trait.ByteVec.html#method.from_path_lossy)
* [`ByteVec::into_os_string`](trait.ByteVec.html#method.into_os_string)
* [`ByteVec::into_os_string_lossy`](trait.ByteVec.html#method.into_os_string_lossy)
* [`ByteVec::into_path_buf`](trait.ByteVec.html#method.into_path_buf)
* [`ByteVec::into_path_buf_lossy`](trait.ByteVec.html#method.into_path_buf_lossy)

For byte string slices, they are:

* [`ByteSlice::from_os_str`](trait.ByteSlice.html#method.from_os_str)
* [`ByteSlice::from_path`](trait.ByteSlice.html#method.from_path)
* [`ByteSlice::to_os_str`](trait.ByteSlice.html#method.to_os_str)
* [`ByteSlice::to_os_str_lossy`](trait.ByteSlice.html#method.to_os_str_lossy)
* [`ByteSlice::to_path`](trait.ByteSlice.html#method.to_path)
* [`ByteSlice::to_path_lossy`](trait.ByteSlice.html#method.to_path_lossy)

On Unix, all of these conversions are rigorously zero cost, which gives one
a way to ergonomically deal with raw file paths exactly as they are using
normal string-related functions.
On Windows, these conversion routines perform
a UTF-8 check and either return an error or lossily decode the file path
into valid UTF-8, depending on which function you use. This means that you
cannot roundtrip all file paths on Windows correctly using these conversion
routines. However, this may be an acceptable downside since such file paths
are exceptionally rare. Moreover, roundtripping isn't always necessary, for
example, if all you're doing is filtering based on file paths.

The reason why using byte strings for this is potentially superior to the
standard library's approach is that a lot of Rust code is already lossily
converting file paths to Rust's Unicode strings, which are required to be valid
UTF-8, and thus contains latent bugs on Unix where paths with invalid UTF-8 are
not terribly uncommon. If you instead use byte strings, then you're guaranteed
to write correct code for Unix, at the cost of getting a corner case wrong on
Windows.
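
As a rough illustration of the latent bug described above, here is a minimal
sketch that uses only the standard library and assumes a Unix target. It does
not use this crate's conversion routines, and `raw_and_lossy` is a hypothetical
helper name chosen for this example. It builds a path from a file name and
returns both the path's raw bytes and its lossy `String` form:

```rust
// Illustrative sketch: std only, Unix only. `raw_and_lossy` is a hypothetical
// helper and is not part of this crate.
#[cfg(unix)]
fn raw_and_lossy(name: &[u8]) -> (Vec<u8>, String) {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    use std::path::Path;

    // On Unix, any byte sequence can be viewed as an `OsStr` without copying.
    let path = Path::new(OsStr::from_bytes(name));

    // Viewing the path as bytes again gives back exactly what went in.
    let raw = path.as_os_str().as_bytes().to_vec();
    // Lossy conversion replaces invalid UTF-8 with U+FFFD, so the resulting
    // string no longer names the same file.
    let lossy = path.to_string_lossy().into_owned();
    (raw, lossy)
}
```

Called with `b"foo\xFFbar.txt"`, the first element is the original bytes
unchanged, while the second is `"foo\u{FFFD}bar.txt"`, which refers to a
different file name than the one that was given.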
*/

#![cfg_attr(not(feature = "std"), no_std)]

pub use crate::bstr::BStr;
#[cfg(feature = "std")]
pub use crate::bstring::BString;
pub use crate::ext_slice::{
    ByteSlice, Bytes, Fields, FieldsWith, Find, FindReverse, Finder,
    FinderReverse, Lines, LinesWithTerminator, Split, SplitN, SplitNReverse,
    SplitReverse, B,
};
#[cfg(feature = "std")]
pub use crate::ext_vec::{concat, join, ByteVec, DrainBytes, FromUtf8Error};
#[cfg(feature = "unicode")]
pub use crate::unicode::{
    GraphemeIndices, Graphemes, SentenceIndices, Sentences, WordIndices,
    Words, WordsWithBreakIndices, WordsWithBreaks,
};
pub use crate::utf8::{
    decode as decode_utf8, decode_last as decode_last_utf8, CharIndices,
    Chars, Utf8Chunk, Utf8Chunks, Utf8Error,
};

mod ascii;
mod bstr;
#[cfg(feature = "std")]
mod bstring;
mod byteset;
mod ext_slice;
#[cfg(feature = "std")]
mod ext_vec;
mod impls;
#[cfg(feature = "std")]
pub mod io;
#[cfg(test)]
mod tests;
#[cfg(feature = "unicode")]
mod unicode;
mod utf8;

#[cfg(test)]
mod apitests {
    use crate::bstr::BStr;
    use crate::bstring::BString;
    use crate::ext_slice::{Finder, FinderReverse};

    // Check at compile time that the public types are Send, Sync and unwind
    // safe.
    #[test]
    fn oibits() {
        use std::panic::{RefUnwindSafe, UnwindSafe};

        fn assert_send<T: Send>() {}
        fn assert_sync<T: Sync>() {}
        fn assert_unwind_safe<T: RefUnwindSafe + UnwindSafe>() {}

        assert_send::<&BStr>();
        assert_sync::<&BStr>();
        assert_unwind_safe::<&BStr>();
        assert_send::<BString>();
        assert_sync::<BString>();
        assert_unwind_safe::<BString>();

        assert_send::<Finder<'_>>();
        assert_sync::<Finder<'_>>();
        assert_unwind_safe::<Finder<'_>>();
        assert_send::<FinderReverse<'_>>();
        assert_sync::<FinderReverse<'_>>();
        assert_unwind_safe::<FinderReverse<'_>>();
    }
}