bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that they are
not required to be valid UTF-8, but may be fully or partially valid UTF-8.

[Build status](https://github.com/BurntSushi/bstr/actions)
[crates.io](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
<https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings>.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.


### Usage

`cargo add bstr`

### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in stdin,
and print out lines containing a particular substring:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // The closure receives each line with its terminator still attached.
    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `Ok(true)` continues on to the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::{error::Error, io};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        // Add the number of words found on this line to the running total.
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to
uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard ASCII
text, this is quite a bit faster than what you can (easily) do with standard
library APIs. (N.B. Any invalid UTF-8 bytes are passed through unchanged.)

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reuse a single buffer across lines to amortize allocation.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // Byte offset just past the 10th grapheme, or the end of the line
        // if it has fewer than 10 graphemes.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde and
Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
  the `alloc` feature.
* `alloc` - **Enabled** by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>`.
* `unicode` - **Enabled** by default.
  This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included. Note that currently, enabling this
  feature also requires enabling the `std` feature. It is expected that this
  limitation will be lifted at some point.
* `serde` - Enables implementations of serde traits for `BStr`, and also
  `BString` when `alloc` is enabled.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.60.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since it is plausible that some of the types in this crate might end up in your
public API (e.g., `BStr` and `BString`), we will commit to being very
conservative with respect to new major version releases. It's difficult to say
precisely how conservative, but unless there is a major issue with the `1.0`
release, I wouldn't expect a `2.0` release to come out any sooner than some
period of years.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be on solid
ground. The main differences from the standard library are in how the various
substring search routines work. The standard library provides generic
infrastructure for supporting different types of searches with a single method,
whereas this library prefers to define new methods for each type of search and
drop the generic infrastructure.

Some _probable_ future considerations for APIs include, but are not limited to:

* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together
into a single crate while simultaneously attempting to keep the total number of
dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is
optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html) can be
  used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for fast substring
  searching on `&[u8]`.

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem.
For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. Namely, it is a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative. In service of
this philosophy, currently, the only required dependency of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.