/*!
A tutorial for handling CSV data in Rust.

This tutorial will cover basic CSV reading and writing, automatic
(de)serialization with Serde, CSV transformations and performance.

This tutorial is targeted at beginner Rust programmers. Experienced Rust
programmers may find this tutorial to be too verbose, but skimming may be
useful. There is also a
[cookbook](../cookbook/index.html)
of examples for those who prefer more information density.

For an introduction to Rust, please see the
[official book](https://doc.rust-lang.org/book/second-edition/).
If you haven't written any Rust code yet but have written code in another
language, then this tutorial might be accessible to you without needing to read
the book first.

# Table of contents

1. [Setup](#setup)
1. [Basic error handling](#basic-error-handling)
    * [Switch to recoverable errors](#switch-to-recoverable-errors)
1. [Reading CSV](#reading-csv)
    * [Reading headers](#reading-headers)
    * [Delimiters, quotes and variable length records](#delimiters-quotes-and-variable-length-records)
    * [Reading with Serde](#reading-with-serde)
    * [Handling invalid data with Serde](#handling-invalid-data-with-serde)
1. [Writing CSV](#writing-csv)
    * [Writing tab separated values](#writing-tab-separated-values)
    * [Writing with Serde](#writing-with-serde)
1. [Pipelining](#pipelining)
    * [Filter by search](#filter-by-search)
    * [Filter by population count](#filter-by-population-count)
1. [Performance](#performance)
    * [Amortizing allocations](#amortizing-allocations)
    * [Serde and zero allocation](#serde-and-zero-allocation)
    * [CSV parsing without the standard library](#csv-parsing-without-the-standard-library)
1. [Closing thoughts](#closing-thoughts)

# Setup

In this section, we'll get you set up with a simple program that reads CSV data
and prints a "debug" version of each record. This assumes that you have the
[Rust toolchain installed](https://www.rust-lang.org/install.html),
which includes both Rust and Cargo.

We'll start by creating a new Cargo project:

```text
$ cargo new --bin csvtutor
$ cd csvtutor
```

Once inside `csvtutor`, open `Cargo.toml` in your favorite text editor and add
`csv = "1.1"` to your `[dependencies]` section. At this point, your
`Cargo.toml` should look something like this:

```text
[package]
name = "csvtutor"
version = "0.1.0"
authors = ["Your Name"]

[dependencies]
csv = "1.1"
```

Next, let's build your project. Since you added the `csv` crate as a
dependency, Cargo will automatically download it and compile it for you. To
build your project, use Cargo:

```text
$ cargo build
```

This will produce a new binary, `csvtutor`, in your `target/debug` directory.
It won't do much at this point, but you can run it:

```text
$ ./target/debug/csvtutor
Hello, world!
```

Let's make our program do something useful. Our program will read CSV data on
stdin and print debug output for each record on stdout. To write this program,
open `src/main.rs` in your favorite text editor and replace its contents with
this:

```no_run
//tutorial-setup-01.rs
// Import the standard library's I/O module so we can read from stdin.
use std::io;

// The `main` function is where your program starts executing.
fn main() {
    // Create a CSV parser that reads data from stdin.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Loop over each record.
    for result in rdr.records() {
        // An error may occur, so abort the program in an unfriendly way.
        // We will make this more friendly later!
        let record = result.expect("a CSV record");
        // Print a debug version of the record.
        println!("{:?}", record);
    }
}
```

Don't worry too much about what this code means; we'll dissect it in the next
section. For now, try rebuilding your project:

```text
$ cargo build
```

Assuming that succeeds, let's try running our program. But first, we will need
some CSV data to play with! For that, we will use a random selection of 100
US cities, along with their population size and geographical coordinates. (We
will use this same CSV data throughout the entire tutorial.) To get the data,
download it from github:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop.csv'
```

And now finally, run your program on `uspop.csv`:

```text
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

# Basic error handling

Since reading CSV data can result in errors, error handling is pervasive
throughout the examples in this tutorial. Therefore, we're going to spend a
little bit of time going over basic error handling, and in particular, fix
our previous example to show errors in a more friendly way. **If you're already
comfortable with things like `Result` and `try!`/`?` in Rust, then you can
safely skip this section.**

Note that
[The Rust Programming Language Book](https://doc.rust-lang.org/book/second-edition/)
contains an
[introduction to general error handling](https://doc.rust-lang.org/book/second-edition/ch09-00-error-handling.html).
For a deeper dive, see
[my blog post on error handling in Rust](http://blog.burntsushi.net/rust-error-handling/).
The blog post is especially important if you plan on building Rust libraries.

With that out of the way, error handling in Rust comes in two different forms:
unrecoverable errors and recoverable errors.

Unrecoverable errors generally correspond to things like bugs in your program,
which might occur when an invariant or contract is broken. At that point, the
state of your program is unpredictable, and there's typically little recourse
other than *panicking*. In Rust, a panic is similar to simply aborting your
program, but it will unwind the stack and clean up resources before your
program exits.

On the other hand, recoverable errors generally correspond to predictable
errors. A non-existent file or invalid CSV data are examples of recoverable
errors. In Rust, recoverable errors are handled via `Result`. A `Result`
represents the state of a computation that has either succeeded or failed.
It is defined like so:

```
enum Result<T, E> {
    Ok(T),
    Err(E),
}
```

That is, a `Result` either contains a value of type `T` when the computation
succeeds, or it contains a value of type `E` when the computation fails.
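
For example, parsing a string into an integer yields a `Result`. Here's a
minimal standalone sketch (not CSV-specific) showing how each outcome can be
inspected with `match`:

```
fn main() {
    // Parsing "7" succeeds, so we get `Ok(7)`.
    match "7".parse::<i32>() {
        Ok(n) => println!("got the number {}", n),
        Err(err) => println!("could not parse: {}", err),
    }
    // Parsing "abc" fails, so we get an `Err` describing the problem.
    match "abc".parse::<i32>() {
        Ok(n) => println!("got the number {}", n),
        Err(err) => println!("could not parse: {}", err),
    }
}
```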

The relationship between unrecoverable errors and recoverable errors is
important. In particular, it is **strongly discouraged** to treat recoverable
errors as if they were unrecoverable. For example, panicking when a file could
not be found, or if some CSV data is invalid, is considered bad practice.
Instead, predictable errors should be handled using Rust's `Result` type.

With our newfound knowledge, let's re-examine our previous example and dissect
its error handling.

```no_run
//tutorial-error-01.rs
use std::io;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result.expect("a CSV record");
        println!("{:?}", record);
    }
}
```

There are two places where an error can occur in this program. The first is
if there is a problem reading a record from stdin. The second is if there is
a problem writing to stdout. In general, we will ignore the latter problem in
this tutorial, although robust command line applications should probably try
to handle it (e.g., when a broken pipe occurs). The former, however, is worth
looking into in more detail. For example, if a user of this program provides
invalid CSV data, then the program will panic:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord(["foo", "bar"])
thread 'main' panicked at 'a CSV record: UnequalLengths { pos: Some(Position { byte: 24, line: 3, record: 2 }), expected_len: 2, len: 3 }', /checkout/src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
```

What happened here? First and foremost, we should talk about why the CSV data
is invalid. The CSV data consists of three records: a header and two data
records. The header and first data record have two fields, but the second
data record has three fields. By default, the csv crate will treat inconsistent
record lengths as an error.
(This behavior can be toggled using the
[`ReaderBuilder::flexible`](../struct.ReaderBuilder.html#method.flexible)
config knob.) This explains why the first data record is printed in this
example, since it has the same number of fields as the header record. That is,
we don't actually hit an error until we parse the second data record.

(Note that the CSV reader automatically interprets the first record as a
header. This can be toggled with the
[`ReaderBuilder::has_headers`](../struct.ReaderBuilder.html#method.has_headers)
config knob.)

So what actually causes the panic to happen in our program? That would be the
first line in our loop:

```ignore
for result in rdr.records() {
    let record = result.expect("a CSV record"); // this panics
    println!("{:?}", record);
}
```

The key thing to understand here is that `rdr.records()` returns an iterator
that yields `Result` values. That is, instead of yielding records, it yields
a `Result` that contains either a record or an error. The `expect` method,
which is defined on `Result`, *unwraps* the success value inside the `Result`.
Since the `Result` might contain an error instead, `expect` will *panic* when
it does contain an error.

It might help to look at the implementation of `expect`:

```ignore
use std::fmt;

// This says, "for all types T and E, where E can be turned into a human
// readable debug message, define the `expect` method."
impl<T, E: fmt::Debug> Result<T, E> {
    fn expect(self, msg: &str) -> T {
        match self {
            Ok(t) => t,
            Err(e) => panic!("{}: {:?}", msg, e),
        }
    }
}
```

Since this causes a panic if the CSV data is invalid, and invalid CSV data is
a perfectly predictable error, we've turned what should be a *recoverable*
error into an *unrecoverable* error. We did this because it is expedient to
use unrecoverable errors. Since this is bad practice, we will endeavor to avoid
unrecoverable errors throughout the rest of the tutorial.

## Switch to recoverable errors

We'll convert our unrecoverable error to a recoverable error in 3 steps. First,
let's get rid of the panic and print an error message manually:

```no_run
//tutorial-error-02.rs
use std::{io, process};

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, print the error message and quit the program.
        match result {
            Ok(record) => println!("{:?}", record),
            Err(err) => {
                println!("error reading CSV from <stdin>: {}", err);
                process::exit(1);
            }
        }
    }
}
```

If we run our program again, we'll still see an error message, but it is no
longer a panic message:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord(["foo", "bar"])
error reading CSV from <stdin>: CSV error: record 2 (line: 3, byte: 24): found record with 3 fields, but the previous record has 2 fields
```

The second step for moving to recoverable errors is to put our CSV record loop
into a separate function. This function then has the option of *returning* an
error, which our `main` function can then inspect and decide what to do with.

```no_run
//tutorial-error-03.rs
use std::{error::Error, io, process};

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, convert our error to a Box<dyn Error> and return it.
        match result {
            Err(err) => return Err(From::from(err)),
            Ok(record) => {
                println!("{:?}", record);
            }
        }
    }
    Ok(())
}
```

Our new function, `run`, has a return type of `Result<(), Box<dyn Error>>`. In
simple terms, this says that `run` either returns nothing when successful, or
if an error occurred, it returns a `Box<dyn Error>`, which stands for "any kind
of error." A `Box<dyn Error>` is hard to inspect if we cared about the specific
error that occurred. But for our purposes, all we need to do is gracefully
print an error message and exit the program.

The third and final step is to replace our explicit `match` expression with a
special Rust language feature: the question mark.

```no_run
//tutorial-error-04.rs
use std::{error::Error, io, process};

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // This is effectively the same code as our `match` in the
        // previous example. In other words, `?` is syntactic sugar.
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```

This last step shows how we can use `?` to automatically forward errors
to our caller without having to do explicit case analysis with `match`
ourselves. We will use `?` heavily throughout this tutorial, and it's
important to note that it can **only be used in functions that return
`Result`.**
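
As an aside, `?` doesn't force the `main`/`run` split: `main` itself is
allowed to return a `Result`. Here's a minimal sketch of the same program in
that style. The tutorial sticks with a separate `run` function so that `main`
can control exactly how the error is printed and which exit code is used.

```no_run
// A sketch, not one of the numbered tutorial examples.
use std::{error::Error, io};

// Because `main` returns a `Result`, we can use `?` directly inside it.
fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```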

We'll end this section with a word of caution: using `Box<dyn Error>` as our
error type is the minimally acceptable thing we can do here. Namely, while it
allows our program to gracefully handle errors, it makes it hard for callers
to inspect the specific error condition that occurred. However, since this is
a tutorial on writing command line programs that do CSV parsing, we will
consider ourselves satisfied. If you'd like to know more, or are interested in
writing a library that handles CSV data, then you should check out my
[blog post on error handling](http://blog.burntsushi.net/rust-error-handling/).

With all that said, if all you're doing is writing a one-off program to do
CSV transformations, then using methods like `expect` and panicking when an
error occurs is a perfectly reasonable thing to do. Nevertheless, this tutorial
will endeavor to show idiomatic code.

# Reading CSV

Now that we've got you set up and covered basic error handling, it's time to
do what we came here to do: handle CSV data. We've already seen how to read
CSV data from `stdin`, but this section will cover how to read CSV data from
files and how to configure our CSV reader to read data formatted with
different delimiters and quoting strategies.

First up, let's adapt the example we've been working with to accept a file
path argument instead of stdin.

```no_run
//tutorial-read-01.rs
use std::{
    env,
    error::Error,
    ffi::OsString,
    fs::File,
    process,
};

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let file = File::open(file_path)?;
    let mut rdr = csv::Reader::from_reader(file);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If you replace the contents of your `src/main.rs` file with the above code,
then you should be able to rebuild your project and try it out:

```text
$ cargo build
$ ./target/debug/csvtutor uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

This example contains two new pieces of code:

1. Code for querying the positional arguments of your program. We put this
   code into its own function called `get_first_arg`. Our program expects a
   file path in the first position (which is indexed at `1`; the argument at
   index `0` is the executable name), so if one doesn't exist, then
   `get_first_arg` returns an error.
2. Code for opening a file. In `run`, we open a file using `File::open`. If
   there was a problem opening the file, we forward the error to the caller of
   `run` (which is `main` in this program). Note that we do *not* wrap the
   `File` in a buffer. The CSV reader does buffering internally, so there's
   no need for the caller to do it.

Now is a good time to introduce an alternate CSV reader constructor, which
makes it slightly more convenient to open CSV data from a file. That is,
instead of:

```ignore
let file_path = get_first_arg()?;
let file = File::open(file_path)?;
let mut rdr = csv::Reader::from_reader(file);
```

you can use:

```ignore
let file_path = get_first_arg()?;
let mut rdr = csv::Reader::from_path(file_path)?;
```

`csv::Reader::from_path` will open the file for you and return an error if
the file could not be opened.
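
Putting that together, our `run` function shrinks down to the following sketch
(`get_first_arg` and `main` are unchanged from above):

```ignore
fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    // `from_path` opens the file for us, so the explicit `File::open`
    // call (and the `fs::File` import) is no longer needed.
    let mut rdr = csv::Reader::from_path(file_path)?;
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```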

## Reading headers

If you had a chance to look at the data inside `uspop.csv`, you would notice
that there is a header record that looks like this:

```text
City,State,Population,Latitude,Longitude
```

Now, if you look back at the output of the commands you've run so far, you'll
notice that the header record is never printed. Why is that? By default, the
CSV reader will interpret the first record in CSV data as a header, which
is typically distinct from the actual data in the records that follow.
Therefore, the header record is always skipped whenever you try to read or
iterate over the records in CSV data.

The CSV reader does not try to be smart about the header record and does
**not** employ any heuristics for automatically detecting whether the first
record is a header or not. Instead, if you don't want to treat the first
record as a header, you'll need to tell the CSV reader that there are no
headers.

To configure a CSV reader to do this, we'll need to use a
[`ReaderBuilder`](../struct.ReaderBuilder.html)
to build a CSV reader with our desired configuration. Here's an example that
does just that. (Note that we've moved back to reading from `stdin`, since it
produces terser examples.)

```no_run
//tutorial-read-headers-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this program with our `uspop.csv` data, then you'll see
that the header record is now printed:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
```

If you ever need to access the header record directly, then you can use the
[`Reader::headers`](../struct.Reader.html#method.headers)
method like so:

```no_run
//tutorial-read-headers-02.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    {
        // We nest this call in its own scope because of lifetimes.
        let headers = rdr.headers()?;
        println!("{:?}", headers);
    }
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    // We can ask for the headers at any time. There's no need to nest this
    // call in its own scope because we never try to borrow the reader again.
    let headers = rdr.headers()?;
    println!("{:?}", headers);
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

One interesting thing to note in this example is that we put the call to
`rdr.headers()` in its own scope. We do this because `rdr.headers()` returns
a *borrow* of the reader's internal header state. The nested scope in this
code allows the borrow to end before we try to iterate over the records. If
we didn't nest the call to `rdr.headers()` in its own scope, then the code
wouldn't compile, because we cannot borrow the reader's headers at the same
time that we borrow the reader to iterate over its records.

Another way of solving this problem is to *clone* the header record:

```ignore
let headers = rdr.headers()?.clone();
```

This converts it from a borrow of the CSV reader to a new owned value. This
makes the code a bit easier to read, but at the cost of copying the header
record into a new allocation.
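
Here's a brief sketch of how the cloned headers might be used: since `headers`
is now an owned value, we can keep referring to it while iterating over the
records.

```ignore
let headers = rdr.headers()?.clone();
for result in rdr.records() {
    let record = result?;
    // `headers` is an owned `StringRecord`, so borrowing the reader
    // again for this iteration is no longer a problem.
    println!("{}: {}", &headers[0], &record[0]);
}
```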

## Delimiters, quotes and variable length records

In this section we'll temporarily depart from our `uspop.csv` data set and
show how to read some CSV data that is a little less clean. This CSV data
uses `;` as a delimiter, escapes quotes with `\"` (instead of `""`) and has
records of varying length. Here's the data, which contains a list of WWE
wrestlers and the year they started, if it's known:

```text
$ cat strange.csv
"\"Hacksaw\" Jim Duggan";1987
"Bret \"Hit Man\" Hart";1984
# We're not sure when Rafael started, so omit the year.
Rafael Halperin
"\"Big Cat\" Ernie Ladd";1964
"\"Macho Man\" Randy Savage";1985
"Jake \"The Snake\" Roberts";1986
```

To read this CSV data, we'll want to do the following:

1. Disable headers, since this data has none.
2. Change the delimiter from `,` to `;`.
3. Change the quote strategy from doubled (e.g., `""`) to escaped (e.g., `\"`).
4. Permit flexible length records, since some omit the year.
5. Ignore lines beginning with a `#`.

All of this (and more!) can be configured with a
[`ReaderBuilder`](../struct.ReaderBuilder.html),
as seen in the following example:

```no_run
//tutorial-read-delimiter-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b';')
        .double_quote(false)
        .escape(Some(b'\\'))
        .flexible(true)
        .comment(Some(b'#'))
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Now re-compile your project and try running the program on `strange.csv`:

```text
$ cargo build
$ ./target/debug/csvtutor < strange.csv
StringRecord(["\"Hacksaw\" Jim Duggan", "1987"])
StringRecord(["Bret \"Hit Man\" Hart", "1984"])
StringRecord(["Rafael Halperin"])
StringRecord(["\"Big Cat\" Ernie Ladd", "1964"])
StringRecord(["\"Macho Man\" Randy Savage", "1985"])
StringRecord(["Jake \"The Snake\" Roberts", "1986"])
```

You should feel encouraged to play around with the settings. Some interesting
things you might try:

1. If you remove the `escape` setting, notice that no CSV errors are reported.
   Instead, records are still parsed. This is a feature of the CSV parser.
   Even though it gets the data slightly wrong, it still provides a parse that
   you might be able to work with. This is a useful property given the
   messiness of real world CSV data.
2. If you remove the `delimiter` setting, parsing still succeeds, although
   every record has exactly one field.
3. If you remove the `flexible` setting, the reader will print the first two
   records (since they both have the same number of fields), but will return a
   parse error on the third record, since it has only one field.

This covers most of the things you might want to configure on your CSV reader,
although there are a few other knobs. For example, you can change the record
terminator from a new line to any other character. (By default, the terminator
is `CRLF`, which treats each of `\r\n`, `\r` and `\n` as single record
terminators.) For more details, see the documentation and examples for each of
the methods on
[`ReaderBuilder`](../struct.ReaderBuilder.html).
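
For instance, here's a brief sketch (with a made-up terminator byte) of what
such a configuration might look like, using the
[`Terminator`](../enum.Terminator.html) type:

```ignore
// A sketch: treat `$` (instead of a new line) as the record terminator.
// `Terminator::Any` accepts any single byte.
let mut rdr = csv::ReaderBuilder::new()
    .terminator(csv::Terminator::Any(b'$'))
    .from_reader(io::stdin());
```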

## Reading with Serde

One of the most convenient features of this crate is its support for
[Serde](https://serde.rs/).
Serde is a framework for automatically serializing and deserializing data into
Rust types. In simpler terms, that means instead of iterating over records
as an array of string fields, we can iterate over records of a specific type
of our choosing.

For example, let's take a look at some data from our `uspop.csv` file:

```text
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
```

While some of these fields make sense as strings (`City`, `State`), other
fields look more like numbers. For example, `Population` looks like it
contains integers while `Latitude` and `Longitude` appear to contain decimals.
If we wanted to convert these fields to their "proper" types, then we'd need
to do a lot of manual work. This next example shows how.

```no_run
//tutorial-read-serde-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;

        let city = &record[0];
        let state = &record[1];
        // Some records are missing population counts, so if we can't
        // parse a number, treat the population count as missing instead
        // of returning an error.
        let pop: Option<u64> = record[2].parse().ok();
        // Lucky us! Latitudes and longitudes are available for every record.
        // Therefore, if one couldn't be parsed, return an error.
        let latitude: f64 = record[3].parse()?;
        let longitude: f64 = record[4].parse()?;

        println!(
            "city: {:?}, state: {:?}, \
             pop: {:?}, latitude: {:?}, longitude: {:?}",
            city, state, pop, latitude, longitude);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

The problem here is that we need to parse each individual field manually,
which can be labor intensive and repetitive. Serde, however, makes this
process automatic. For example, we can ask to deserialize every record into a
tuple type: `(String, String, Option<u64>, f64, f64)`.

```no_run
//tutorial-read-serde-02.rs
# use std::{error::Error, io, process};
#
// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = (String, String, Option<u64>, f64, f64);

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Instead of creating an iterator with the `records` method, we create
    // an iterator with the `deserialize` method.
    for result in rdr.deserialize() {
        // We must tell Serde what type we want to deserialize into.
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this code should show similar output as previous examples:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
("Davidsons Landing", "AK", None, 65.2419444, -165.2716667)
("Kenai", "AK", Some(7610), 60.5544444, -151.2583333)
("Oakman", "AL", None, 33.7133333, -87.3886111)
# ... and much more
```

One of the downsides of using Serde this way is that the type you use must
match the order of fields as they appear in each record. This can be a pain
if your CSV data has a header record, since you might tend to think about each
field as a value of a particular named field rather than as a numbered field.
One way to access fields by their names instead of their positions is to
deserialize our record into a map type like
[`HashMap`](https://doc.rust-lang.org/std/collections/struct.HashMap.html)
or
[`BTreeMap`](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html).
The next example shows how. In particular, notice that the only things that
changed from the last example are the definition of the `Record` type alias
and a new `use` statement that imports `HashMap` from the standard library:

```no_run
//tutorial-read-serde-03.rs
use std::collections::HashMap;
# use std::{error::Error, io, process};

// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = HashMap<String, String>;

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this program shows similar results as before, but each record is
printed as a map:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
{"City": "Davidsons Landing", "Latitude": "65.2419444", "State": "AK", "Population": "", "Longitude": "-165.2716667"}
{"City": "Kenai", "Population": "7610", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"}
{"State": "AL", "City": "Oakman", "Longitude": "-87.3886111", "Population": "", "Latitude": "33.7133333"}
```

This method works especially well if you need to read CSV data with header
records, but whose exact structure isn't known until your program runs.
However, in our case, we know the structure of the data in `uspop.csv`. In
particular, with the `HashMap` approach, we've lost the specific types we had
for each field in the previous example when we deserialized each record into a
`(String, String, Option<u64>, f64, f64)`. Is there a way to identify fields
by their corresponding header name *and* assign each field its own unique
type? The answer is yes, but we'll need to bring in Serde's `derive` feature
first. You can do that by adding this to the `[dependencies]` section of your
`Cargo.toml` file:

```text
serde = { version = "1", features = ["derive"] }
```
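
Alternatively, if your Cargo version ships with the `cargo add` subcommand, it
can edit `Cargo.toml` for you:

```text
$ cargo add serde --features derive
```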

With this crate added to our project, we can now define our own custom struct
that represents our record. We then ask Serde to automatically write the glue
code required to populate our struct from a CSV record. The next example shows
how. Don't miss the new Serde imports!

```no_run
//tutorial-read-serde-04.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
        // Try this if you don't like each record smushed on one line:
        // println!("{:#?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

Compile and run this program to see similar output as before:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
```

Once again, we didn't need to change our `run` function at all: we're still
iterating over records using the `deserialize` iterator that we started with
in the beginning of this section. The only thing that changed in this example
was the definition of the `Record` type and a new `use` statement. Our
`Record` type is now a custom struct that we defined instead of a type alias,
and as a result, Serde doesn't know how to deserialize it by default. However,
Serde's derive macro will read your struct definition at compile time and
generate code that deserializes a CSV record into a `Record` value. To see
what happens if you leave out the automatic derive, change
`#[derive(Debug, Deserialize)]` to `#[derive(Debug)]`.

One other thing worth mentioning in this example is the use of
`#[serde(rename_all = "PascalCase")]`. This directive helps Serde map your
struct's field names to the header names in the CSV data. If you recall, our
header record is:

```text
City,State,Population,Latitude,Longitude
```

Notice that each name is capitalized, but the fields in our struct are not.
The `#[serde(rename_all = "PascalCase")]` directive fixes that by interpreting
each field name in `PascalCase`, where the first letter of the field is
capitalized. If we didn't tell Serde about the name remapping, then the
program would quit with an error:

```text
$ ./target/debug/csvtutor < uspop.csv
CSV deserialize error: record 1 (line: 2, byte: 41): missing field `latitude`
```

We could have fixed this through other means. For example, we could have used
capital letters in our field names:

```ignore
#[derive(Debug, Deserialize)]
struct Record {
    Latitude: f64,
    Longitude: f64,
    Population: Option<u64>,
    City: String,
    State: String,
}
```

However, this violates Rust naming style. (In fact, the Rust compiler
will even warn you that the names do not follow convention!)

Another way to fix this is to ask Serde to rename each field individually.
This is useful when there is no consistent name mapping from fields to header
names:

```ignore
#[derive(Debug, Deserialize)]
struct Record {
    #[serde(rename = "Latitude")]
    latitude: f64,
    #[serde(rename = "Longitude")]
    longitude: f64,
    #[serde(rename = "Population")]
    population: Option<u64>,
    #[serde(rename = "City")]
    city: String,
    #[serde(rename = "State")]
    state: String,
}
```

To read more about renaming fields and about other Serde directives, please
consult the
[Serde documentation on attributes](https://serde.rs/attributes.html).

## Handling invalid data with Serde

In this section we will see a brief example of how to deal with data that
isn't clean. To do this exercise, we'll work with a slightly tweaked version
of the US population data we've been using throughout this tutorial. This
version of the data is slightly messier than what we've been using. You can
get it like so:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv'
```

Let's start by running our program from the previous section:

```no_run
//tutorial-read-serde-invalid-01.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
#
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compile and run it on our messier data:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop-null.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
# ... more records
CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string
```

Oops! What happened? The program printed several records, but stopped when it
tripped over a deserialization problem. The error message says that it found
an invalid digit in the field at index `2` (which is the `Population` field)
on line 43. What does line 43 look like?

```text
$ head -n 43 uspop-null.csv | tail -n1
Flint Springs,KY,NULL,37.3433333,-86.7136111
```

Ah! The third field (index `2`) is supposed to either be empty or contain a
population count. However, in this data, it seems that `NULL` sometimes
appears as a value, presumably to indicate that there is no count available.

The problem with our current program is that it fails to read this record
because it doesn't know how to deserialize a `NULL` string into an
`Option<u64>`. That is, an `Option<u64>` corresponds to either an empty field
or an integer.

To fix this, we tell Serde to convert any deserialization errors on this field
to a `None` value, as shown in this next example:

```no_run
//tutorial-read-serde-invalid-02.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
#
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    #[serde(deserialize_with = "csv::invalid_option")]
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this example, then it should run to completion just
like the other examples:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop-null.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
# ... and more
```

The only change in this example was adding this attribute to the `population`
field in our `Record` type:

```ignore
#[serde(deserialize_with = "csv::invalid_option")]
```

The
[`invalid_option`](../fn.invalid_option.html)
function is a generic helper function that does one very simple thing: when
applied to `Option` fields, it will convert any deserialization error into a
`None` value. This is useful when you need to work with messy CSV data.
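
To make this less magical, here's a sketch of roughly how such a helper could
be written by hand using Serde's `deserialize_with` hook. (The name
`invalid_as_none` is made up for illustration; in practice, just use
`csv::invalid_option`.)

```ignore
use serde::{Deserialize, Deserializer};

// A hypothetical hand-rolled stand-in for `csv::invalid_option` on this
// field: attempt the normal `Option<u64>` deserialization, and turn any
// error (e.g., the field contains `NULL`) into `None` instead of failing.
fn invalid_as_none<'de, D>(deserializer: D) -> Result<Option<u64>, D::Error>
where
    D: Deserializer<'de>,
{
    Ok(Option::<u64>::deserialize(deserializer).unwrap_or(None))
}

// Used just like the built-in helper:
// #[serde(deserialize_with = "invalid_as_none")]
// population: Option<u64>,
```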

# Writing CSV

In this section we'll show a few examples that write CSV data. Writing CSV
data tends to be a bit more straightforward than reading CSV data, since you
get to control the output format.

Let's start with the most basic example: writing a few CSV records to
`stdout`.

```no_run
//tutorial-write-01.rs
use std::{error::Error, io, process};

fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    // Since we're writing records manually, we must explicitly write our
    // header record. A header record is written the same way that other
    // records are written.
    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    // A CSV writer maintains an internal buffer, so it's important
    // to flush the buffer when you're done.
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

Compiling and running this example results in CSV data being printed:

```text
$ cargo build
$ ./target/debug/csvtutor
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

Before moving on, it's worth taking a closer look at the `write_record`
method. In this example, it looks rather simple, but if you're new to Rust
then its type signature might look a little daunting:

```ignore
pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()>
    where I: IntoIterator<Item=T>, T: AsRef<[u8]>
{
    // implementation elided
}
```

To understand the type signature, we can break it down piece by piece.

1. The method takes two parameters: `self` and `record`.
2. `self` is a special parameter that corresponds to the `Writer` itself.
3. `record` is the CSV record we'd like to write. Its type is `I`, which is
   a generic type.
4. In the method's `where` clause, the `I` type is constrained by the
   `IntoIterator<Item=T>` bound. What that means is that `I` must satisfy the
   `IntoIterator` trait. If you look at the documentation of the
   [`IntoIterator` trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html),
   then we can see that it describes types that can build iterators. In this
   case, we want an iterator that yields *another* generic type `T`, where
   `T` is the type of each field we want to write.
5. `T` also appears in the method's `where` clause, but its constraint is the
   `AsRef<[u8]>` bound. The `AsRef` trait is a way to describe zero cost
   conversions between types in Rust. In this case, the `[u8]` in
   `AsRef<[u8]>` means that we want to be able to *borrow* a slice of bytes
   from `T`. The CSV writer will take these bytes and write them as a single
   field. The `AsRef<[u8]>` bound is useful because types like `String`,
   `&str`, `Vec<u8>` and `&[u8]` all satisfy it.
6. Finally, the method returns a `csv::Result<()>`, which is shorthand for
   `Result<(), csv::Error>`. That means `write_record` either returns nothing
   on success or returns a `csv::Error` on failure.

Now, let's apply our newfound understanding of the type signature of
`write_record`. If you recall, in our previous example, we used it like so:

```ignore
wtr.write_record(&["field 1", "field 2", "etc"])?;
```

So how do the types match up? Well, the type of each of our fields in this
code is `&'static str` (which is the type of a string literal in Rust). Since
we put them in a slice literal, the type of our parameter is
`&'static [&'static str]`, or more succinctly written as `&[&str]` without the
lifetime annotations. Since slices satisfy the `IntoIterator` bound and
strings satisfy the `AsRef<[u8]>` bound, this ends up being a legal call.

Here are a few more examples of ways you can call `write_record`:

```no_run
# use csv;
# let mut wtr = csv::Writer::from_writer(vec![]);
// A slice of byte strings.
wtr.write_record(&[b"a", b"b", b"c"]);
// A vector.
wtr.write_record(vec!["a", "b", "c"]);
// A string record.
wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"]));
// A byte record.
wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"]));
```

Finally, the example above can be easily adapted to write to a file instead
of `stdout`:

```no_run
//tutorial-write-02.rs
use std::{
    env,
    error::Error,
    ffi::OsString,
    process,
};

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let mut wtr = csv::Writer::from_path(file_path)?;

    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    wtr.flush()?;
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

## Writing tab separated values

In the previous section, we saw how to write some simple CSV data to `stdout`
that looked like this:

```text
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

You might wonder to yourself: what's the point of using a CSV writer if the
data is so simple? Well, the benefit of a CSV writer is that it can handle all
types of data without sacrificing the integrity of your data. That is, it
knows when to quote fields that contain special CSV characters (like commas or
new lines) or escape literal quotes that appear in your data. The CSV writer
can also be easily configured to use different delimiters or quoting
strategies.

In this section, we'll take a look at how to tweak some of the settings on a
CSV writer. In particular, we'll write TSV ("tab separated values") instead of
CSV, and we'll ask the CSV writer to quote all non-numeric fields. Here's an
example:

```no_run
//tutorial-write-delimiter-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::WriterBuilder::new()
        .delimiter(b'\t')
        .quote_style(csv::QuoteStyle::NonNumeric)
        .from_writer(io::stdout());

    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compiling and running this example gives:

```text
$ cargo build
$ ./target/debug/csvtutor
"City"	"State"	"Population"	"Latitude"	"Longitude"
"Davidsons Landing"	"AK"	""	65.2419444	-165.2716667
"Kenai"	"AK"	7610	60.5544444	-151.2583333
"Oakman"	"AL"	""	33.7133333	-87.3886111
```

In this example, we used a new type,
[`QuoteStyle`](../enum.QuoteStyle.html).
The `QuoteStyle` type represents the different quoting strategies available
to you. The default is to add quotes to fields only when necessary. This
probably works for most use cases, but you can also ask for quotes to always
be put around fields, to never be put around fields or to always be put around
non-numeric fields.
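
Switching strategies is a one-line change. For example, a sketch that forces
quotes around every field, numeric or not:

```ignore
let mut wtr = csv::WriterBuilder::new()
    .quote_style(csv::QuoteStyle::Always)
    .from_writer(io::stdout());
// Latitude would now be written as "65.2419444" rather than 65.2419444.
```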

## Writing with Serde

Just like the CSV reader supports automatic deserialization into Rust types
with Serde, the CSV writer supports automatic serialization from Rust types
into CSV records using Serde. In this section, we'll learn how to use it.

As with reading, let's start by seeing how we can serialize a Rust tuple.

```no_run
//tutorial-write-serde-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());

    // We still need to write headers manually.
    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;

    // But now we can write records by providing a normal Rust value.
    //
    // Note that the odd `None::<u64>` syntax is required because `None` on
    // its own doesn't have a concrete type, but Serde needs a concrete type
    // in order to serialize it. That is, `None` has type `Option<T>` but
    // `None::<u64>` has type `Option<u64>`.
    wtr.serialize(("Davidsons Landing", "AK", None::<u64>, 65.2419444, -165.2716667))?;
    wtr.serialize(("Kenai", "AK", Some(7610), 60.5544444, -151.2583333))?;
    wtr.serialize(("Oakman", "AL", None::<u64>, 33.7133333, -87.3886111))?;

    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compiling and running this program gives the expected output:

```text
$ cargo build
$ ./target/debug/csvtutor
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

The key thing to note in the above example is the use of `serialize` instead
of `write_record` to write our data. In particular, `write_record` is used
when writing a simple record that contains string-like data only. On the other
hand, `serialize` is used when your data consists of more complex values like
numbers, floats or optional values. Of course, you could always convert the
complex values to strings and then use `write_record`, but Serde can do it for
you automatically.

As with reading, we can also serialize custom structs as CSV records. As a
bonus, the fields in a struct will automatically be written as a header
record!

To write custom structs as CSV records, we'll need to make use of Serde's
automatic `derive` feature again. As in the
[previous section on reading with Serde](#reading-with-serde),
we'll need the `serde` crate in the `[dependencies]` section of our
`Cargo.toml` (if it isn't already there):

```text
serde = { version = "1", features = ["derive"] }
```

And we'll also need to add a new `use` statement to our code, for Serde, as
shown in the example:

```no_run
//tutorial-write-serde-02.rs
use std::{error::Error, io, process};

use serde::Serialize;

// Note that structs can derive both Serialize and Deserialize!
#[derive(Debug, Serialize)]
#[serde(rename_all = "PascalCase")]
struct Record<'a> {
    city: &'a str,
    state: &'a str,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());

    wtr.serialize(Record {
        city: "Davidsons Landing",
        state: "AK",
        population: None,
        latitude: 65.2419444,
        longitude: -165.2716667,
    })?;
    wtr.serialize(Record {
        city: "Kenai",
        state: "AK",
        population: Some(7610),
        latitude: 60.5544444,
        longitude: -151.2583333,
    })?;
    wtr.serialize(Record {
        city: "Oakman",
        state: "AL",
        population: None,
        latitude: 33.7133333,
        longitude: -87.3886111,
    })?;

    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

Compiling and running this example has the same output as last time, even
though we didn't explicitly write a header record:

```text
$ cargo build
$ ./target/debug/csvtutor
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

In this case, the `serialize` method noticed that we were writing a struct
with field names. When this happens, `serialize` will automatically write a
header record (but only if no other records have been written) that consists
of the fields in the struct in the order in which they are defined. Note that
this behavior can be disabled with the
[`WriterBuilder::has_headers`](../struct.WriterBuilder.html#method.has_headers)
method.

It's also worth pointing out the use of a *lifetime parameter* in our `Record`
struct:

```ignore
struct Record<'a> {
    city: &'a str,
    state: &'a str,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}
```

The `'a` lifetime parameter corresponds to the lifetime of the `city` and
`state` string slices. This says that the `Record` struct contains *borrowed*
data. We could have written our struct without borrowing any data, and
therefore, without any lifetime parameters:

```ignore
struct Record {
    city: String,
    state: String,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}
```

However, since we had to replace our borrowed `&str` types with owned `String`
types, we're now forced to allocate a new `String` value for both `city`
and `state` for every record that we write. There's no intrinsic problem with
doing that, but it might be a bit wasteful.

For more examples and more details on the rules for serialization, please see
the
[`Writer::serialize`](../struct.Writer.html#method.serialize)
method.

# Pipelining

In this section, we're going to cover a few examples that demonstrate programs
that take CSV data as input, and produce possibly transformed or filtered CSV
data as output. This shows how to write a complete program that efficiently
reads and writes CSV data. Rust is well positioned to perform this task, since
you'll get great performance with the convenience of a high level CSV library.
# Pipelining

In this section, we're going to cover a few examples that demonstrate programs
that take CSV data as input, and produce possibly transformed or filtered CSV
data as output. This shows how to write a complete program that efficiently
reads and writes CSV data. Rust is well positioned to perform this task, since
you'll get great performance with the convenience of a high level CSV library.

## Filter by search

The first example of CSV pipelining we'll look at is a simple filter. It takes
as input some CSV data on stdin and a single string query as its only
positional argument, and it will produce as output CSV data that only contains
rows with a field that matches the query.

```no_run
//tutorial-pipeline-search-01.rs
use std::{env, error::Error, io, process};

fn run() -> Result<(), Box<dyn Error>> {
    // Get the query from the positional arguments.
    // If one doesn't exist, return an error.
    let query = match env::args().nth(1) {
        None => return Err(From::from("expected 1 argument, but got none")),
        Some(query) => query,
    };

    // Build CSV readers and writers to stdin and stdout, respectively.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut wtr = csv::Writer::from_writer(io::stdout());

    // Before reading our data records, we should write the header record.
    wtr.write_record(rdr.headers()?)?;

    // Iterate over all the records in `rdr`, and write only records
    // containing `query` to `wtr`.
    for result in rdr.records() {
        let record = result?;
        if record.iter().any(|field| field == &query) {
            wtr.write_record(&record)?;
        }
    }

    // CSV writers use an internal buffer, so we should always flush when done.
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If we compile and run this program with a query of `MA` on `uspop.csv`, we'll
see that only one record matches:

```text
$ cargo build
$ ./target/debug/csvtutor MA < uspop.csv
City,State,Population,Latitude,Longitude
Reading,MA,23441,42.5255556,-71.0958333
```

This example doesn't actually introduce anything new. It merely combines what
you've already learned about CSV readers and writers from previous sections.

Let's add a twist to this example. In the real world, you're often faced with
messy CSV data that might not be encoded correctly. One example you might come
across is CSV data encoded in
[Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
Unfortunately, for the examples we've seen so far, our CSV reader assumes that
all of the data is UTF-8. Since all of the data we've worked on has been
ASCII---which is a subset of both Latin-1 and UTF-8---we haven't had any
problems. But let's introduce a slightly tweaked version of our `uspop.csv`
file that contains an encoding of a Latin-1 character that is invalid UTF-8.
You can get the data like so:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-latin1.csv'
```

Even though I've already given away the problem, let's see what happens when
we try to run our previous example on this new data:

```text
$ ./target/debug/csvtutor MA < uspop-latin1.csv
City,State,Population,Latitude,Longitude
CSV parse error: record 3 (line 4, field: 0, byte: 125): invalid utf-8: invalid UTF-8 in field 0 near byte index 0
```

The error message tells us exactly what's wrong. Let's take a look at line 4
to see what we're dealing with:

```text
$ head -n4 uspop-latin1.csv | tail -n1
Õakman,AL,,33.7133333,-87.3886111
```

In this case, the very first character is the Latin-1 `Õ`, which is encoded as
the byte `0xD5`, which is in turn invalid UTF-8. So what do we do now that our
CSV parser has choked on our data? You have two choices. The first is to go in
and fix up your CSV data so that it's valid UTF-8. This is probably a good
idea anyway, and tools like `iconv` can help with the task of transcoding.
But if you can't or don't want to do that, then you can instead read CSV data
in a way that is mostly encoding agnostic (so long as ASCII is still a valid
subset). The trick is to use *byte records* instead of *string records*.
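If you go the transcoding route, `iconv` can usually do the job in one step.
For example (the output file name is just an illustration):

```text
$ iconv -f latin1 -t utf-8 uspop-latin1.csv > uspop-utf8.csv
```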
Thus far, we haven't actually talked much about the type of a record in this
library, but now is a good time to introduce them. There are two of them,
[`StringRecord`](../struct.StringRecord.html)
and
[`ByteRecord`](../struct.ByteRecord.html).
Each of them represents a single record in CSV data, where a record is a
sequence of an arbitrary number of fields. The only difference between
`StringRecord` and `ByteRecord` is that `StringRecord` is guaranteed to be
valid UTF-8, whereas `ByteRecord` contains arbitrary bytes.

Armed with that knowledge, we can now begin to understand why we saw an error
when we ran the last example on data that wasn't UTF-8. Namely, when we call
`records`, we get back an iterator of `StringRecord`. Since `StringRecord` is
guaranteed to be valid UTF-8, trying to build a `StringRecord` with invalid
UTF-8 will result in the error that we see.

All we need to do to make our example work is to switch from a `StringRecord`
to a `ByteRecord`. This means using `byte_records` to create our iterator
instead of `records`, and similarly using `byte_headers` instead of `headers`
if we think our header data might contain invalid UTF-8 as well. Here's the
change:

```no_run
//tutorial-pipeline-search-02.rs
# use std::{env, error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let query = match env::args().nth(1) {
        None => return Err(From::from("expected 1 argument, but got none")),
        Some(query) => query,
    };

    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut wtr = csv::Writer::from_writer(io::stdout());

    wtr.write_record(rdr.byte_headers()?)?;

    for result in rdr.byte_records() {
        let record = result?;
        // `query` is a `String` while `field` is now a `&[u8]`, so we'll
        // need to convert `query` to `&[u8]` before doing a comparison.
        if record.iter().any(|field| field == query.as_bytes()) {
            wtr.write_record(&record)?;
        }
    }

    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compiling and running this now yields the same results as our first example,
but this time it works on data that isn't valid UTF-8.
```text
$ cargo build
$ ./target/debug/csvtutor MA < uspop-latin1.csv
City,State,Population,Latitude,Longitude
Reading,MA,23441,42.5255556,-71.0958333
```

## Filter by population count

In this section, we will show another example program that both reads and
writes CSV data, but instead of dealing with arbitrary records, we will use
Serde to deserialize and serialize records with specific types.

For this program, we'd like to be able to filter records in our population
data by population count. Specifically, we'd like to see which records meet a
certain population threshold. In addition to using a simple inequality, we
must also account for records that have a missing population count. This is
where types like `Option<T>` come in handy, because the compiler will force us
to consider the case when the population count is missing.

Since we're using Serde in this example, don't forget to add the Serde
dependencies to your `Cargo.toml` in your `[dependencies]` section if they
aren't already there:

```text
serde = { version = "1", features = ["derive"] }
```

Now here's the code:

```no_run
//tutorial-pipeline-pop-01.rs
# use std::{env, error::Error, io, process};

use serde::{Deserialize, Serialize};

// Unlike previous examples, we derive both Deserialize and Serialize. This
// means we'll be able to automatically deserialize and serialize this type.
#[derive(Debug, Deserialize, Serialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    city: String,
    state: String,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<(), Box<dyn Error>> {
    // Get the minimum population count from the positional arguments.
    // If one doesn't exist or isn't an integer, return an error.
    let minimum_pop: u64 = match env::args().nth(1) {
        None => return Err(From::from("expected 1 argument, but got none")),
        Some(arg) => arg.parse()?,
    };

    // Build CSV readers and writers to stdin and stdout, respectively.
    // Note that we don't need to write headers explicitly. Since we're
    // serializing a custom struct, that's done for us automatically.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut wtr = csv::Writer::from_writer(io::stdout());

    // Iterate over all the records in `rdr`, and write only records
    // containing a population that is greater than or equal to
    // `minimum_pop`.
    for result in rdr.deserialize() {
        // Remember that when deserializing, we must use a type hint to
        // indicate which type we want to deserialize our record into.
        let record: Record = result?;

        // `map_or` is a combinator on `Option`. It takes two parameters:
        // a value to use when the `Option` is `None` (i.e., the record has
        // no population count) and a closure that returns another value of
        // the same type when the `Option` is `Some`. In this case, we test
        // it against our minimum population count that we got from the
        // command line.
        if record.population.map_or(false, |pop| pop >= minimum_pop) {
            wtr.serialize(record)?;
        }
    }

    // CSV writers use an internal buffer, so we should always flush when done.
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If we compile and run our program with a minimum threshold of `100000`, we
should see three matching records. Notice that the headers were added even
though we never explicitly wrote them!

```text
$ cargo build
$ ./target/debug/csvtutor 100000 < uspop.csv
City,State,Population,Latitude,Longitude
Fontana,CA,169160,34.0922222,-117.4341667
Bridgeport,CT,139090,41.1669444,-73.2052778
Indianapolis,IN,773283,39.7683333,-86.1580556
```
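If the `map_or` dance is new to you, here is the same test in isolation,
using population counts taken from the output above:

```
let minimum_pop: u64 = 100_000;
// A missing population count can never meet the threshold.
assert!(!None::<u64>.map_or(false, |pop| pop >= minimum_pop));
// Fontana's population count of 169,160 does meet it.
assert!(Some(169_160_u64).map_or(false, |pop| pop >= minimum_pop));
```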
# Performance

In this section, we'll go over how to squeeze the most juice out of our CSV
reader. As it happens, most of the APIs we've seen so far were designed with
high level convenience in mind, and that often comes with some costs. For the
most part, those costs revolve around unnecessary allocations. Therefore, most
of the section will show how to do CSV parsing with as little allocation as
possible.

There are two critical preliminaries we must cover.

Firstly, when you care about performance, you should compile your code with
`cargo build --release` instead of `cargo build`. The `--release` flag
instructs the compiler to spend more time optimizing your code. When compiling
with the `--release` flag, you'll find your compiled program at
`target/release/csvtutor` instead of `target/debug/csvtutor`. Throughout this
tutorial, we've used `cargo build` because our dataset was small and we
weren't focused on speed. The downside of `cargo build --release` is that it
will take longer than `cargo build`.

Secondly, the dataset we've used throughout this tutorial only has 100
records. We'd have to try really hard to cause our program to run slowly on
100 records, even when we compile without the `--release` flag. Therefore, in
order to actually witness a performance difference, we need a bigger dataset.
To get such a dataset, we'll use the original source of `uspop.csv`.
**Warning: the download is 41MB compressed and decompresses to 145MB.**

```text
$ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv.gz
$ gunzip worldcitiespop.csv.gz
$ wc worldcitiespop.csv
  3173959  5681543 151492068 worldcitiespop.csv
$ md5sum worldcitiespop.csv
6198bd180b6d6586626ecbf044c1cca5  worldcitiespop.csv
```

Finally, it's worth pointing out that this section is not attempting to
present a rigorous set of benchmarks. We will stay away from rigorous analysis
and instead rely a bit more on wall clock times and intuition.

## Amortizing allocations

In order to measure performance, we must be careful about what it is we're
measuring. We must also be careful to not change the thing we're measuring as
we make improvements to the code. For this reason, we will focus on measuring
how long it takes to count the number of records corresponding to city
population counts in Massachusetts. This represents a very small amount of
work that requires us to visit every record, and therefore represents a
decent way to measure how long it takes to do CSV parsing.
Before diving into our first optimization, let's start with a baseline by
adapting a previous example to count the number of records in
`worldcitiespop.csv`:

```no_run
//tutorial-perf-alloc-01.rs
use std::{error::Error, io, process};

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());

    let mut count = 0;
    for result in rdr.records() {
        let record = result?;
        if &record[0] == "us" && &record[3] == "MA" {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    match run() {
        Ok(count) => {
            println!("{}", count);
        }
        Err(err) => {
            println!("{}", err);
            process::exit(1);
        }
    }
}
```

Now let's compile and run it and see what kind of timing we get. Don't forget
to compile with the `--release` flag. (For grins, try compiling without the
`--release` flag and see how long it takes to run the program!)

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.645s
user    0m0.627s
sys     0m0.017s
```

All right, so what's the first thing we can do to make this faster? This
section promised to speed things up by amortizing allocation, but we can do
something even simpler first: iterate over
[`ByteRecord`](../struct.ByteRecord.html)s
instead of
[`StringRecord`](../struct.StringRecord.html)s.
If you recall from a previous section, a `StringRecord` is guaranteed to be
valid UTF-8, and therefore must validate that its contents are actually valid
UTF-8. (If validation fails, then the CSV reader will return an error.) If we
remove that validation from our program, then we can realize a nice speed
boost as shown in the next example:

```no_run
//tutorial-perf-alloc-02.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());

    let mut count = 0;
    for result in rdr.byte_records() {
        let record = result?;
        if &record[0] == b"us" && &record[3] == b"MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```

And now compile and run:

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.429s
user    0m0.403s
sys     0m0.023s
```

Our program is now approximately 30% faster, all because we removed UTF-8
validation. But was it actually okay to remove UTF-8 validation? What have we
lost? In this case, it is perfectly acceptable to drop UTF-8 validation and
use `ByteRecord` instead because all we're doing with the data in the record
is comparing two of its fields to raw bytes:

```ignore
if &record[0] == b"us" && &record[3] == b"MA" {
    count += 1;
}
```

In particular, it doesn't matter whether `record` is valid UTF-8 or not, since
we're checking for equality on the raw bytes themselves.

UTF-8 validation via `StringRecord` is useful because it provides access to
fields as `&str` types, whereas `ByteRecord` provides fields as `&[u8]` types.
`&str` is the type of a borrowed string in Rust, which provides convenient
access to string APIs like substring search. Strings are also frequently used
in other areas, so they tend to be a useful thing to have. Therefore, sticking
with `StringRecord` is a good default, but if you need the extra speed and can
deal with arbitrary bytes, then switching to `ByteRecord` might be a good
idea.

Moving on, let's try to get another speed boost by amortizing allocation.
Amortizing allocation is the technique of creating an allocation once (or
very rarely), and then attempting to reuse it instead of creating additional
allocations. In the case of the previous examples, we used iterators created
by the `records` and `byte_records` methods on a CSV reader. These iterators
allocate a new record for each item they yield, which in turn corresponds to
a new allocation. They do this because iterators cannot yield items that
borrow from the iterator itself, and because creating new allocations tends
to be a lot more convenient.

If we're willing to forgo use of iterators, then we can amortize allocations
by creating a *single* `ByteRecord` and asking the CSV reader to read into it.
We do this by using the
[`Reader::read_byte_record`](../struct.Reader.html#method.read_byte_record)
method.

```no_run
//tutorial-perf-alloc-03.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut record = csv::ByteRecord::new();

    let mut count = 0;
    while rdr.read_byte_record(&mut record)? {
        if &record[0] == b"us" && &record[3] == b"MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```

Compile and run:

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.308s
user    0m0.283s
sys     0m0.023s
```

Woohoo! This represents *another* 30% boost over the previous example, which
is a 50% boost over the first example.

Let's dissect this code by taking a look at the type signature of the
`read_byte_record` method:

```ignore
fn read_byte_record(&mut self, record: &mut ByteRecord) -> csv::Result<bool>;
```

This method takes as input a CSV reader (the `self` parameter) and a *mutable
borrow* of a `ByteRecord`, and returns a `csv::Result<bool>`. (The
`csv::Result<bool>` is equivalent to `Result<bool, csv::Error>`.) The return
value is `true` if and only if a record was read. When it's `false`, that
means the reader has exhausted its input. This method works by copying the
contents of the next record into the provided `ByteRecord`. Since the same
`ByteRecord` is used to read every record, it will already have space
allocated for data. When `read_byte_record` runs, it will overwrite the
contents that were there with the new record, which means that it can reuse
the space that was allocated. Thus, we have *amortized allocation*.

An exercise you might consider doing is to use a `StringRecord` instead of a
`ByteRecord`, and therefore
[`Reader::read_record`](../struct.Reader.html#method.read_record)
instead of `read_byte_record`. This will give you easy access to Rust strings
at the cost of UTF-8 validation but *without* the cost of allocating a new
`StringRecord` for every record.
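Here is one way that exercise might look (a sketch, not the only solution):

```no_run
use std::{error::Error, io};

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // A single StringRecord is reused across the whole loop, so we still
    // pay for UTF-8 validation on every record, but not for a fresh
    // allocation.
    let mut record = csv::StringRecord::new();

    let mut count = 0;
    while rdr.read_record(&mut record)? {
        if &record[0] == "us" && &record[3] == "MA" {
            count += 1;
        }
    }
    Ok(count)
}
```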
## Serde and zero allocation

In this section, we are going to briefly examine how we use Serde and what we
can do to speed it up. The key optimization we'll want to make is to---you
guessed it---amortize allocation.

As with the previous section, let's start with a simple baseline based on an
example using Serde from a previous section:

```no_run
//tutorial-perf-serde-01.rs
# #![allow(dead_code)]
use std::{error::Error, io, process};

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    country: String,
    city: String,
    accent_city: String,
    region: String,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());

    let mut count = 0;
    for result in rdr.deserialize() {
        let record: Record = result?;
        if record.country == "us" && record.region == "MA" {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    match run() {
        Ok(count) => {
            println!("{}", count);
        }
        Err(err) => {
            println!("{}", err);
            process::exit(1);
        }
    }
}
```

Now compile and run this program:

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m1.381s
user    0m1.367s
sys     0m0.013s
```

The first thing you might notice is that this is quite a bit slower than our
programs in the previous section. This is because deserializing each record
has a certain amount of overhead to it. In particular, some of the fields need
to be parsed as integers or floating point numbers, which isn't free. However,
there is hope yet, because we can speed up this program!

Our first attempt to speed up the program will be to amortize allocation.
Doing this with Serde is a bit trickier than before, because we need to change
our `Record` type and use the manual deserialization API. Let's see what that
looks like:

```no_run
//tutorial-perf-serde-02.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record<'a> {
    country: &'a str,
    city: &'a str,
    accent_city: &'a str,
    region: &'a str,
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut raw_record = csv::StringRecord::new();
    let headers = rdr.headers()?.clone();

    let mut count = 0;
    while rdr.read_record(&mut raw_record)? {
        let record: Record = raw_record.deserialize(Some(&headers))?;
        if record.country == "us" && record.region == "MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```

Compile and run:

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m1.055s
user    0m1.040s
sys     0m0.013s
```

This corresponds to an approximately 24% increase in performance. To achieve
this, we had to make two important changes.

The first was to make our `Record` type contain `&str` fields instead of
`String` fields. If you recall from a previous section, `&str` is a *borrowed*
string while a `String` is an *owned* string. A borrowed string points to an
already existing allocation, whereas a `String` always implies a new
allocation. In this case, our `&str` is borrowing from the CSV record itself.

The second change we had to make was to stop using the
[`Reader::deserialize`](../struct.Reader.html#method.deserialize)
iterator, and instead read our record into a `StringRecord` explicitly and
then use the
[`StringRecord::deserialize`](../struct.StringRecord.html#method.deserialize)
method to deserialize a single record.

The second change is a bit tricky, because in order for it to work, our
`Record` type needs to borrow from the data inside the `StringRecord`. That
means that our `Record` value cannot outlive the `StringRecord` that it was
created from. Since we overwrite the same `StringRecord` on each iteration
(in order to amortize allocation), that means our `Record` value must
evaporate before the next iteration of the loop. Indeed, the compiler will
enforce this!
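If you're curious what that enforcement looks like, here is a sketch of code
that the compiler will reject (the error text is paraphrased):

```ignore
let record: Record = raw_record.deserialize(Some(&headers))?;
// error[E0502]: cannot borrow `raw_record` as mutable because it is also
// borrowed as immutable (the immutable borrow is still held by `record`).
rdr.read_record(&mut raw_record)?;
println!("{:?}", record);
```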
There is one more optimization we can make: remove UTF-8 validation. In
general, this means using `&[u8]` instead of `&str` and `ByteRecord` instead
of `StringRecord`:

```no_run
//tutorial-perf-serde-03.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
#
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record<'a> {
    country: &'a [u8],
    city: &'a [u8],
    accent_city: &'a [u8],
    region: &'a [u8],
    population: Option<u64>,
    latitude: f64,
    longitude: f64,
}

fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut raw_record = csv::ByteRecord::new();
    let headers = rdr.byte_headers()?.clone();

    let mut count = 0;
    while rdr.read_byte_record(&mut raw_record)? {
        let record: Record = raw_record.deserialize(Some(&headers))?;
        if record.country == b"us" && record.region == b"MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => {
#             println!("{}", count);
#         }
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```

Compile and run:

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.873s
user    0m0.850s
sys     0m0.023s
```

This corresponds to a 17% increase over the previous example and a 37%
increase over the first example.

In sum, Serde parsing is still quite fast, but will generally not be the
fastest way to parse CSV since it necessarily needs to do more work.

## CSV parsing without the standard library

In this section, we will explore a niche use case: parsing CSV without the
standard library. While the `csv` crate itself requires the standard library,
the underlying parser is actually part of the
[`csv-core`](https://docs.rs/csv-core)
crate, which does not depend on the standard library. The downside of not
depending on the standard library is that CSV parsing becomes a lot more
inconvenient.

The `csv-core` crate is structured similarly to the `csv` crate. There is a
[`Reader`](../../csv_core/struct.Reader.html)
and a
[`Writer`](../../csv_core/struct.Writer.html),
as well as corresponding builders
[`ReaderBuilder`](../../csv_core/struct.ReaderBuilder.html)
and
[`WriterBuilder`](../../csv_core/struct.WriterBuilder.html).
The `csv-core` crate has no record types or iterators. Instead, CSV data can
either be read one field at a time or one record at a time. In this section,
we'll focus on reading a field at a time since it is simpler, but it is
generally faster to read a record at a time since it does more work per
function call.

In keeping with this section on performance, let's write a program using only
`csv-core` that counts the number of records in the state of Massachusetts.

(Note that we unfortunately use the standard library in this example even
though `csv-core` doesn't technically require it. We do this for convenient
access to I/O, which would be harder without the standard library.)

```no_run
//tutorial-perf-core-01.rs
use std::io::{self, Read};
use std::process;

use csv_core::{Reader, ReadFieldResult};

fn run(mut data: &[u8]) -> Option<u64> {
    let mut rdr = Reader::new();

    // Count the number of records in Massachusetts.
    let mut count = 0;
    // Indicates the current field index. Reset to 0 at the start of each
    // record.
    let mut fieldidx = 0;
    // True when the current record is in the United States.
    let mut inus = false;
    // Buffer for field data. Must be big enough to hold the largest field.
    let mut field = [0; 1024];
    loop {
        // Attempt to incrementally read the next CSV field.
        let (result, nread, nwrite) = rdr.read_field(data, &mut field);
        // nread is the number of bytes read from our input. We should never
        // pass those bytes to read_field again.
        data = &data[nread..];
        // nwrite is the number of bytes written to the output buffer `field`.
        // The contents of the buffer beyond this point are unspecified.
        let field = &field[..nwrite];

        match result {
            // We don't need to handle this case because we read all of the
            // data up front. If we were reading data incrementally, then
            // this would be a signal to read more.
            ReadFieldResult::InputEmpty => {}
            // If we get this case, then we found a field that contains more
            // than 1024 bytes. We keep this example simple and just fail.
            ReadFieldResult::OutputFull => {
                return None;
            }
            // This case happens when we've successfully read a field. If
            // the field is the last field in a record, then `record_end` is
            // true.
            ReadFieldResult::Field { record_end } => {
                if fieldidx == 0 && field == b"us" {
                    inus = true;
                } else if inus && fieldidx == 3 && field == b"MA" {
                    count += 1;
                }
                if record_end {
                    fieldidx = 0;
                    inus = false;
                } else {
                    fieldidx += 1;
                }
            }
            // This case happens when the CSV reader has successfully
            // exhausted all input.
            ReadFieldResult::End => {
                break;
            }
        }
    }
    Some(count)
}

fn main() {
    // Read the entire contents of stdin up front.
    let mut data = vec![];
    if let Err(err) = io::stdin().read_to_end(&mut data) {
        println!("{}", err);
        process::exit(1);
    }
    match run(&data) {
        None => {
            println!("error: could not count records, buffer too small");
            process::exit(1);
        }
        Some(count) => {
            println!("{}", count);
        }
    }
}
```

And compile and run it:

```text
$ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2176

real    0m0.572s
user    0m0.513s
sys     0m0.057s
```

This isn't as fast as some of our previous examples where we used the `csv`
crate to read into a `StringRecord` or a `ByteRecord`. This is mostly because
this example reads a field at a time, which incurs more overhead than reading
a record at a time. To fix this, you would want to use the
[`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record)
method instead, which is defined on `csv_core::Reader`.
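For reference, here is a sketch of what reading a record at a time might look
like. It is not one of this tutorial's example programs, and it assumes that
a 4KB buffer for field data and room for 32 field end offsets are enough for
any record:

```no_run
use csv_core::{Reader, ReadRecordResult};

fn count_ma(mut data: &[u8]) -> Option<u64> {
    let mut rdr = Reader::new();
    let mut count = 0;
    // Buffers for one record's field data and for the end offset of each
    // field within that data.
    let mut out = [0; 4096];
    let mut ends = [0; 32];
    loop {
        // read_record reports how much input was consumed, how much output
        // was written and how many field end offsets were recorded.
        let (result, nin, _nout, nend) =
            rdr.read_record(data, &mut out, &mut ends);
        data = &data[nin..];
        match result {
            // As before, we read all of our data up front, so this is never
            // a signal to read more.
            ReadRecordResult::InputEmpty => {}
            // One of our fixed-size buffers was too small; keep it simple
            // and fail.
            ReadRecordResult::OutputFull
            | ReadRecordResult::OutputEndsFull => return None,
            // A complete record is available: `ends[i]` is the end offset
            // of field `i` within `out`.
            ReadRecordResult::Record => {
                if nend >= 4
                    && &out[..ends[0]] == b"us"
                    && &out[ends[2]..ends[3]] == b"MA"
                {
                    count += 1;
                }
            }
            ReadRecordResult::End => break,
        }
    }
    Some(count)
}
```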
The other thing to notice is that the field-at-a-time example is considerably
longer than the other examples. This is because we need to do more bookkeeping
to keep track of which field we're reading and how much data we've already fed
to the reader. There are basically two reasons to use the `csv-core` crate:

1. If you're in an environment where the standard library is not usable.
2. If you wanted to build your own csv-like library, you could build it on
   top of `csv-core`.

# Closing thoughts

Congratulations on making it to the end! It seems incredible that one could
write so many words on something as basic as CSV parsing. I wanted this guide
to be accessible not only to Rust beginners, but to inexperienced programmers
as well. My hope is that the large number of examples will help push you in
the right direction.

With that said, here are a few more things you might want to look at:

* The [API documentation for the `csv` crate](../index.html) documents all
  facets of the library, and is itself littered with even more examples.
* The [`csv-index` crate](https://docs.rs/csv-index) provides data structures
  for indexing CSV data that are amenable to being written to disk. (This
  library is still a work in progress.)
* The [`xsv` command line tool](https://github.com/BurntSushi/xsv) is a high
  performance CSV swiss army knife. It can slice, select, search, sort, join,
  concatenate, index, format and compute statistics on arbitrary CSV data.
  Give it a try!

*/