bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that they are
not required to be valid UTF-8, but may be fully or partially valid UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![crates.io](https://img.shields.io/crates/v/bstr.svg)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
<https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings>.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "1"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in stdin,
and print out lines containing a particular substring:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // `line` is a `&[u8]` that still includes its line terminator.
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `Ok(true)` continues on to the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::{error::Error, io};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reuse one buffer across lines so allocation is amortized.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // Byte offset just past the 10th grapheme, or the whole line if it
        // has no graphemes at all.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde and
Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
  the `alloc` feature.
* `alloc` - **Enabled** by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
  Note that currently, enabling this feature also requires enabling the `std`
  feature. It is expected that this limitation will be lifted at some point.
* `serde` - Enables implementations of serde traits for `BStr`, and also
  `BString` when `alloc` is enabled.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.60.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since it is plausible that some of the types in this crate might end up in your
public API (e.g., `BStr` and `BString`), we will commit to being very
conservative with respect to new major version releases. It's difficult to say
precisely how conservative, but unless there is a major issue with the `1.0`
release, I wouldn't expect a `2.0` release to come out any sooner than some
period of years.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be on solid
ground. The main differences from the standard library are in how the various
substring search routines work. The standard library provides generic
infrastructure for supporting different types of searches with a single method,
whereas this library prefers to define new methods for each type of search and
drop the generic infrastructure.

Some _probable_ future considerations for APIs include, but are not limited to:

* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together
into a single crate while simultaneously attempting to keep the total number of
dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is
optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html) can be
  used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for fast substring
  searching on `&[u8]`.

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving.
Namely, it is a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative. In service of
this philosophy, currently, the only required dependency of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.