bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. This
differs from the standard library's `String` and `str` types in that byte
strings are not required to be valid UTF-8, but may be fully or partially valid
UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![crates.io](https://img.shields.io/crates/v/bstr.svg)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
<https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings>.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.

25
26### Usage
27
28`cargo add bstr`
29
30### Examples
31
32The following two examples exhibit both the API features of byte strings and
33the I/O convenience functions provided for reading line-by-line quickly.
34
35This first example simply shows how to efficiently iterate over lines in stdin,
36and print out lines containing a particular substring:
37
```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Each `line` is a `&[u8]` that still includes its line terminator.
    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `Ok(true)` continues on to the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::{error::Error, io};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        // Count the Unicode words on this line.
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reuse the same buffer for every line to amortize allocation.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // `grapheme_indices` yields `(start, end, grapheme)` tuples. Keep the
        // end offset of the 10th grapheme, or of the last one on shorter lines.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde and
Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
  the `alloc` feature.
* `alloc` - **Enabled** by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included. Note that currently, enabling this
  feature also requires enabling the `std` feature. It is expected that this
  limitation will be lifted at some point.
* `serde` - Enables implementations of serde traits for `BStr`, and also
  `BString` when `alloc` is enabled.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.60.0`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since it is plausible that some of the types in this crate might end up in your
public API (e.g., `BStr` and `BString`), we will commit to being very
conservative with respect to new major version releases. It's difficult to say
precisely how conservative, but unless there is a major issue with the `1.0`
release, I wouldn't expect a `2.0` release to come out any sooner than some
period of years.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be on solid
ground. The main differences from the standard library are in how the various
substring search routines work. The standard library provides generic
infrastructure for supporting different types of searches with a single method,
whereas this library prefers to define new methods for each type of search and
drop the generic infrastructure.

Some _probable_ future considerations for APIs include, but are not limited to:

* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together
into a single crate while simultaneously attempting to keep the total number of
dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is
optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html) can be
  used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for fast substring
  searching on `&[u8]`.

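As a rough sketch of the first bullet, the following uses only the standard
library to decode a `&[u8]` lossily with `Utf8Error`. The function name
`decode_lossy` is made up for illustration, and the substitution policy shown
(one replacement codepoint per invalid sequence reported by `Utf8Error`) is an
assumption of this sketch rather than a statement about how `bstr` is
implemented:

```rust
use std::str;

/// Lossily decodes `bytes`, replacing each invalid sequence with U+FFFD.
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that remains is valid UTF-8.
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // `valid` was just checked by `from_utf8`, so this cannot fail.
                out.push_str(str::from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    assert_eq!(decode_lossy(b"foo\xFFbar"), "foo\u{FFFD}bar");
    assert_eq!(decode_lossy(b"caf\xC3"), "caf\u{FFFD}");
}
```
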
So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.

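To give a feel for that glue code, here is a minimal sketch of search and
replace over bytes using only the standard library. The `replace_all` helper is
hypothetical, and it uses a naive search where a real implementation would
likely reach for a dedicated substring searcher:

```rust
/// Replaces every non-overlapping occurrence of `needle` in `haystack`.
fn replace_all(haystack: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // An empty needle matches everywhere; this sketch returns the input as is.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut rest = haystack;
    // Naive search: compare the needle against every window of the same length.
    while let Some(pos) = rest.windows(needle.len()).position(|w| w == needle) {
        out.extend_from_slice(&rest[..pos]);
        out.extend_from_slice(replacement);
        rest = &rest[pos + needle.len()..];
    }
    // Copy whatever follows the final match.
    out.extend_from_slice(rest);
    out
}

fn main() {
    // The input need not be valid UTF-8.
    let input = b"foo \xFF foo".to_vec();
    let output = replace_all(&input, b"foo", b"quux");
    assert_eq!(output, b"quux \xFF quux".to_vec());
}
```

With `bstr`, this is roughly the job that the `ByteSlice::replace` method is
meant to take off your hands.
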
In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. Namely, it is a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative. In service of
this philosophy, currently, the only required dependency of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.
