bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that byte
strings are not required to be valid UTF-8, but may be fully or partially valid
UTF-8.

[Build status](https://github.com/BurntSushi/bstr/actions)
[crates.io](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
<https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings>.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.
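
As a small illustration of the "inconvenient" case, the sketch below uses only
the standard library (not `bstr`) on a line containing a stray Latin-1 byte.
Converting the line to `&str` fails, yet searching the raw bytes for an ASCII
substring still works. The helper name `contains_bytes` is made up for this
example and does a naive scan; `bstr`'s `contains_str`, used in the examples in
this README, covers the same kind of search.

```rust
/// Hypothetical helper (not part of `bstr`): naive substring search on bytes.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` panics, so handle the empty needle up front.
    if needle.is_empty() {
        return true;
    }
    haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // 0xE9 is 'é' in Latin-1, but it is not valid UTF-8 in this position.
    let line: &[u8] = b"caf\xE9 Dimension\n";

    // Requiring valid UTF-8 rejects the whole line...
    assert!(std::str::from_utf8(line).is_err());

    // ...even though the bytes being searched for are plain ASCII.
    assert!(contains_bytes(line, b"Dimension"));
}
```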


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "1"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in stdin,
and print out lines containing a particular substring:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `Ok(true)` continues with the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::{error::Error, io};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reuse a single buffer across lines to amortize allocation.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // Byte offset just past the 10th grapheme, or the end of a shorter line.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde and
Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
  the `alloc` feature.
* `alloc` - **Enabled** by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included. Note that currently, enabling this
  feature also requires enabling the `std` feature. It is expected that this
  limitation will be lifted at some point.
* `serde` - Enables implementations of serde traits for `BStr`, and also
  `BString` when `alloc` is enabled.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.60.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since it is plausible that some of the types in this crate might end up in your
public API (e.g., `BStr` and `BString`), we will commit to being very
conservative with respect to new major version releases. It's difficult to say
precisely how conservative, but unless there is a major issue with the `1.0`
release, I wouldn't expect a `2.0` release to come out any sooner than some
period of years.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be on solid
ground. The main differences from the standard library are in how the various
substring search routines work. The standard library provides generic
infrastructure for supporting different types of searches with a single method,
whereas this library prefers to define new methods for each type of search and
drop the generic infrastructure.
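
To make the standard library side of that contrast concrete, here is a small
sketch that uses only `std`: a single `str::find` method accepts a substring, a
`char` or a predicate. The function name `std_find_variants` is invented for
this illustration. On the `bstr` side, each kind of search is a separately
named `ByteSlice` method (for example `find`, `find_char` and `find_byte`); see
the crate documentation for the authoritative list.

```rust
/// Illustrative helper: three different kinds of search, all via `str::find`.
fn std_find_variants(haystack: &str) -> [Option<usize>; 3] {
    [
        // Substring pattern.
        haystack.find("bar"),
        // Single-character pattern.
        haystack.find('b'),
        // Predicate pattern.
        haystack.find(|c: char| c.is_ascii_digit()),
    ]
}

fn main() {
    let found = std_find_variants("foo bar 123");
    println!("{:?}", found);
}
```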

Some _probable_ future considerations for APIs include, but are not limited to:

* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together
into a single crate while simultaneously attempting to keep the total number of
dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is
optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html) can be
  used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for fast substring
  searching on `&[u8]`.
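
As a rough sketch of the first bullet, incremental lossy decoding with
`Utf8Error` might look like the following. It relies only on
`std::str::from_utf8`, `Utf8Error::valid_up_to` and `Utf8Error::error_len`; the
function name `decode_lossy` is chosen for this example and is not a `bstr`
API.

```rust
use std::str;

/// Decode `bytes` lossily, substituting U+FFFD for each invalid sequence.
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to` is guaranteed valid UTF-8.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("prefix is valid"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    println!("{}", decode_lossy(b"foo\xFFbar"));
}
```

This is the kind of glue code that has to be written (and tested) before any
higher-level operation, such as grapheme iteration, can be layered on top.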

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.
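
To give a sense of that glue code, here is a minimal search-and-replace sketch
over bytes. It deliberately swaps in a naive standard-library scan rather than
`twoway`, so it shows the shape of the code and not a fast implementation. The
name `replace_all` and its handling of an empty needle are assumptions made for
this example. With `bstr`, the `ByteSlice` trait offers a `replace` method for
this purpose.

```rust
/// Hypothetical helper: replace every non-overlapping occurrence of `needle`
/// in `haystack` with `replacement`, scanning left to right.
fn replace_all(haystack: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // An empty needle would match everywhere; return the input unchanged.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut i = 0;
    while i < haystack.len() {
        if haystack[i..].starts_with(needle) {
            out.extend_from_slice(replacement);
            i += needle.len();
        } else {
            out.push(haystack[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // The invalid UTF-8 byte 0xFF is carried through untouched.
    let replaced = replace_all(b"foo\xFFbar foo", b"foo", b"quux");
    assert_eq!(replaced, b"quux\xFFbar quux".to_vec());
}
```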

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. Namely, it is a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is
included only as an optional dependency, and only because there is no feasible
alternative to depending on it. In service of this philosophy, currently, the
only required dependency of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.