/*!
A tutorial for handling CSV data in Rust.

This tutorial will cover basic CSV reading and writing, automatic
(de)serialization with Serde, CSV transformations and performance.

This tutorial is targeted at beginner Rust programmers. Experienced Rust
programmers may find this tutorial to be too verbose, but skimming may be
useful. There is also a
[cookbook](../cookbook/index.html)
of examples for those that prefer more information density.

For an introduction to Rust, please see the
[official book](https://doc.rust-lang.org/book/second-edition/).
If you haven't written any Rust code yet but have written code in another
language, then this tutorial might be accessible to you without needing to read
the book first.

# Table of contents

1. [Setup](#setup)
1. [Basic error handling](#basic-error-handling)
    * [Switch to recoverable errors](#switch-to-recoverable-errors)
1. [Reading CSV](#reading-csv)
    * [Reading headers](#reading-headers)
    * [Delimiters, quotes and variable length records](#delimiters-quotes-and-variable-length-records)
    * [Reading with Serde](#reading-with-serde)
    * [Handling invalid data with Serde](#handling-invalid-data-with-serde)
1. [Writing CSV](#writing-csv)
    * [Writing tab separated values](#writing-tab-separated-values)
    * [Writing with Serde](#writing-with-serde)
1. [Pipelining](#pipelining)
    * [Filter by search](#filter-by-search)
    * [Filter by population count](#filter-by-population-count)
1. [Performance](#performance)
    * [Amortizing allocations](#amortizing-allocations)
    * [Serde and zero allocation](#serde-and-zero-allocation)
    * [CSV parsing without the standard library](#csv-parsing-without-the-standard-library)
1. [Closing thoughts](#closing-thoughts)

# Setup

In this section, we'll get you set up with a simple program that reads CSV data
and prints a "debug" version of each record. This assumes that you have the
[Rust toolchain installed](https://www.rust-lang.org/install.html),
which includes both Rust and Cargo.

We'll start by creating a new Cargo project:

```text
$ cargo new --bin csvtutor
$ cd csvtutor
```

Once inside `csvtutor`, open `Cargo.toml` in your favorite text editor and add
`csv = "1.1"` to your `[dependencies]` section. At this point, your
`Cargo.toml` should look something like this:

```text
[package]
name = "csvtutor"
version = "0.1.0"
authors = ["Your Name"]

[dependencies]
csv = "1.1"
```

Next, let's build your project. Since you added the `csv` crate as a
dependency, Cargo will automatically download it and compile it for you. To
build your project, use Cargo:

```text
$ cargo build
```

This will produce a new binary, `csvtutor`, in your `target/debug` directory.
It won't do much at this point, but you can run it:

```text
$ ./target/debug/csvtutor
Hello, world!
```

Let's make our program do something useful. Our program will read CSV data on
stdin and print debug output for each record on stdout. To write this program,
open `src/main.rs` in your favorite text editor and replace its contents with
this:

```no_run
//tutorial-setup-01.rs
// Import the standard library's I/O module so we can read from stdin.
use std::io;

// The `main` function is where your program starts executing.
fn main() {
    // Create a CSV parser that reads data from stdin.
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Loop over each record.
    for result in rdr.records() {
        // An error may occur, so abort the program in an unfriendly way.
        // We will make this more friendly later!
        let record = result.expect("a CSV record");
        // Print a debug version of the record.
        println!("{:?}", record);
    }
}
```

Don't worry too much about what this code means; we'll dissect it in the next
section. For now, try rebuilding your project:

```text
$ cargo build
```

Assuming that succeeds, let's try running our program. But first, we will need
some CSV data to play with! For that, we will use a random selection of 100
US cities, along with their population size and geographical coordinates. (We
will use this same CSV data throughout the entire tutorial.) To get the data,
download it from GitHub:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop.csv'
```

And now finally, run your program on `uspop.csv`:

```text
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

# Basic error handling

Since reading CSV data can result in errors, error handling is pervasive
throughout the examples in this tutorial. Therefore, we're going to spend a
little bit of time going over basic error handling, and in particular, fix
our previous example to show errors in a more friendly way. **If you're already
comfortable with things like `Result` and `try!`/`?` in Rust, then you can
safely skip this section.**

Note that
[The Rust Programming Language Book](https://doc.rust-lang.org/book/second-edition/)
contains an
[introduction to general error handling](https://doc.rust-lang.org/book/second-edition/ch09-00-error-handling.html).
For a deeper dive, see
[my blog post on error handling in Rust](http://blog.burntsushi.net/rust-error-handling/).
The blog post is especially important if you plan on building Rust libraries.

With that out of the way, error handling in Rust comes in two different forms:
unrecoverable errors and recoverable errors.

Unrecoverable errors generally correspond to things like bugs in your program,
which might occur when an invariant or contract is broken. At that point, the
state of your program is unpredictable, and there's typically little recourse
other than *panicking*. In Rust, a panic is similar to simply aborting your
program, but it will unwind the stack and clean up resources before your
program exits.

On the other hand, recoverable errors generally correspond to predictable
errors. A non-existent file or invalid CSV data are examples of recoverable
errors. In Rust, recoverable errors are handled via `Result`. A `Result`
represents the state of a computation that has either succeeded or failed.
It is defined like so:

```
enum Result<T, E> {
    Ok(T),
    Err(E),
}
```

That is, a `Result` either contains a value of type `T` when the computation
succeeds, or it contains a value of type `E` when the computation fails.
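
For example, here's a small runnable sketch of inspecting a `Result` (in this
case, the one returned by parsing a string into an integer) with `match`:

```
match "42".parse::<u64>() {
    Ok(n) => println!("got a number: {}", n),
    Err(err) => println!("could not parse a number: {}", err),
}
```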

The relationship between unrecoverable errors and recoverable errors is
important. In particular, it is **strongly discouraged** to treat recoverable
errors as if they were unrecoverable. For example, panicking when a file could
not be found, or if some CSV data is invalid, is considered bad practice.
Instead, predictable errors should be handled using Rust's `Result` type.

With our newfound knowledge, let's re-examine our previous example and dissect
its error handling.

```no_run
//tutorial-error-01.rs
use std::io;

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result.expect("a CSV record");
        println!("{:?}", record);
    }
}
```

There are two places where an error can occur in this program. The first is
if there is a problem reading a record from stdin. The second is if there is
a problem writing to stdout. In general, we will ignore the latter problem in
this tutorial, although robust command line applications should probably try
to handle it (e.g., when a broken pipe occurs). The former however is worth
looking into in more detail. For example, if a user of this program provides
invalid CSV data, then the program will panic:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] }
thread 'main' panicked at 'a CSV record: UnequalLengths { pos: Some(Position { byte: 24, line: 3, record: 2 }), expected_len: 2, len: 3 }', /checkout/src/libcore/result.rs:859
note: Run with `RUST_BACKTRACE=1` for a backtrace.
```

What happened here? First and foremost, we should talk about why the CSV data
is invalid. The CSV data consists of three records: a header and two data
records. The header and first data record have two fields, but the second
data record has three fields. By default, the csv crate will treat inconsistent
record lengths as an error.
(This behavior can be toggled using the
[`ReaderBuilder::flexible`](../struct.ReaderBuilder.html#method.flexible)
config knob.) This explains why the first data record is printed in this
example, since it has the same number of fields as the header record. That is,
we don't actually hit an error until we parse the second data record.

(Note that the CSV reader automatically interprets the first record as a
header. This can be toggled with the
[`ReaderBuilder::has_headers`](../struct.ReaderBuilder.html#method.has_headers)
config knob.)

So what actually causes the panic to happen in our program? That would be the
first line in our loop:

```ignore
for result in rdr.records() {
    let record = result.expect("a CSV record"); // this panics
    println!("{:?}", record);
}
```

The key thing to understand here is that `rdr.records()` returns an iterator
that yields `Result` values. That is, instead of yielding records, it yields
a `Result` that contains either a record or an error. The `expect` method,
which is defined on `Result`, *unwraps* the success value inside the `Result`.
Since the `Result` might contain an error instead, `expect` will *panic* when
it does contain an error.

It might help to look at the implementation of `expect`:

```ignore
use std::fmt;

// This says, "for all types T and E, where E can be turned into a human
// readable debug message, define the `expect` method."
impl<T, E: fmt::Debug> Result<T, E> {
    fn expect(self, msg: &str) -> T {
        match self {
            Ok(t) => t,
            Err(e) => panic!("{}: {:?}", msg, e),
        }
    }
}
```

Since this causes a panic if the CSV data is invalid, and invalid CSV data is
a perfectly predictable error, we've turned what should be a *recoverable*
error into an *unrecoverable* error. We did this because it is expedient to
use unrecoverable errors. Since this is bad practice, we will endeavor to avoid
unrecoverable errors throughout the rest of the tutorial.

## Switch to recoverable errors

We'll convert our unrecoverable error to a recoverable error in 3 steps. First,
let's get rid of the panic and print an error message manually:

```no_run
//tutorial-error-02.rs
use std::{io, process};

fn main() {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, print the error message and quit the program.
        match result {
            Ok(record) => println!("{:?}", record),
            Err(err) => {
                println!("error reading CSV from <stdin>: {}", err);
                process::exit(1);
            }
        }
    }
}
```

If we run our program again, we'll still see an error message, but it is no
longer a panic message:

```text
$ cat invalid
header1,header2
foo,bar
quux,baz,foobar
$ ./target/debug/csvtutor < invalid
StringRecord { position: Some(Position { byte: 16, line: 2, record: 1 }), fields: ["foo", "bar"] }
error reading CSV from <stdin>: CSV error: record 2 (line: 3, byte: 24): found record with 3 fields, but the previous record has 2 fields
```

The second step for moving to recoverable errors is to put our CSV record loop
into a separate function. This function then has the option of *returning* an
error, which our `main` function can then inspect and decide what to do with.

```no_run
//tutorial-error-03.rs
use std::{error::Error, io, process};

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // Examine our Result.
        // If there was no problem, print the record.
        // Otherwise, convert our error to a Box<dyn Error> and return it.
        match result {
            Err(err) => return Err(From::from(err)),
            Ok(record) => {
                println!("{:?}", record);
            }
        }
    }
    Ok(())
}
```

Our new function, `run`, has a return type of `Result<(), Box<dyn Error>>`. In
simple terms, this says that `run` either returns nothing when successful, or
if an error occurred, it returns a `Box<dyn Error>`, which stands for "any kind of
error." A `Box<dyn Error>` is hard to inspect if we cared about the specific error
that occurred. But for our purposes, all we need to do is gracefully print an
error message and exit the program.

The third and final step is to replace our explicit `match` expression with a
special Rust language feature: the question mark.

```no_run
//tutorial-error-04.rs
use std::{error::Error, io, process};

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // This is effectively the same code as our `match` in the
        // previous example. In other words, `?` is syntactic sugar.
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```

This last step shows how we can use the `?` to automatically forward errors
to our caller without having to do explicit case analysis with `match`
ourselves. We will use the `?` heavily throughout this tutorial, and it's
important to note that it can **only be used in functions that return
`Result`.**
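
As an aside: since Rust 1.26, `main` itself can return a `Result`, in which
case `?` works there directly. Here's a minimal sketch; the trade-off is that
the error gets printed with its `Debug` representation instead of a message of
our choosing:

```no_run
use std::{error::Error, io};

fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        // `?` forwards any error straight out of `main`.
        println!("{:?}", result?);
    }
    Ok(())
}
```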

We'll end this section with a word of caution: using `Box<dyn Error>` as our error
type is the minimally acceptable thing we can do here. Namely, while it allows
our program to gracefully handle errors, it makes it hard for callers to
inspect the specific error condition that occurred. However, since this is a
tutorial on writing command line programs that do CSV parsing, we will consider
ourselves satisfied. If you'd like to know more, or are interested in writing
a library that handles CSV data, then you should check out my
[blog post on error handling](http://blog.burntsushi.net/rust-error-handling/).

With all that said, if all you're doing is writing a one-off program to do
CSV transformations, then using methods like `expect` and panicking when an
error occurs is a perfectly reasonable thing to do. Nevertheless, this tutorial
will endeavor to show idiomatic code.

# Reading CSV

Now that we've got you set up and covered basic error handling, it's time to do
what we came here to do: handle CSV data. We've already seen how to read
CSV data from `stdin`, but this section will cover how to read CSV data from
files and how to configure our CSV reader to read data formatted with different
delimiters and quoting strategies.

First up, let's adapt the example we've been working with to accept a file
path argument instead of stdin.

```no_run
//tutorial-read-01.rs
use std::{
    env,
    error::Error,
    ffi::OsString,
    fs::File,
    process,
};

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let file = File::open(file_path)?;
    let mut rdr = csv::Reader::from_reader(file);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

If you replace the contents of your `src/main.rs` file with the above code,
then you should be able to rebuild your project and try it out:

```text
$ cargo build
$ ./target/debug/csvtutor uspop.csv
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
# ... and much more
```

This example contains two new pieces of code:

1. Code for querying the positional arguments of your program. We put this code
   into its own function called `get_first_arg`. Our program expects a file
   path in the first position (which is indexed at `1`; the argument at index
   `0` is the executable name), so if one doesn't exist, then `get_first_arg`
   returns an error.
2. Code for opening a file. In `run`, we open a file using `File::open`. If
   there was a problem opening the file, we forward the error to the caller of
   `run` (which is `main` in this program). Note that we do *not* wrap the
   `File` in a buffer. The CSV reader does buffering internally, so there's
   no need for the caller to do it.

Now is a good time to introduce an alternate CSV reader constructor, which
makes it slightly more convenient to open CSV data from a file. That is,
instead of:

```ignore
let file_path = get_first_arg()?;
let file = File::open(file_path)?;
let mut rdr = csv::Reader::from_reader(file);
```

you can use:

```ignore
let file_path = get_first_arg()?;
let mut rdr = csv::Reader::from_path(file_path)?;
```

`csv::Reader::from_path` will open the file for you and return an error if
the file could not be opened.

## Reading headers

If you had a chance to look at the data inside `uspop.csv`, you would notice
that there is a header record that looks like this:

```text
City,State,Population,Latitude,Longitude
```

Now, if you look back at the output of the commands you've run so far, you'll
notice that the header record is never printed. Why is that? By default, the
CSV reader will interpret the first record in CSV data as a header, which
is typically distinct from the actual data in the records that follow.
Therefore, the header record is always skipped whenever you try to read or
iterate over the records in CSV data.

The CSV reader does not try to be smart about the header record and does
**not** employ any heuristics for automatically detecting whether the first
record is a header or not. Instead, if you don't want to treat the first record
as a header, you'll need to tell the CSV reader that there are no headers.

To configure a CSV reader to do this, we'll need to use a
[`ReaderBuilder`](../struct.ReaderBuilder.html)
to build a CSV reader with our desired configuration. Here's an example that
does just that. (Note that we've moved back to reading from `stdin`, since it
produces terser examples.)

```no_run
//tutorial-read-headers-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this program with our `uspop.csv` data, then you'll see
that the header record is now printed:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
```

If you ever need to access the header record directly, then you can use the
[`Reader::headers`](../struct.Reader.html#method.headers)
method like so:

```no_run
//tutorial-read-headers-02.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    {
        // We nest this call in its own scope because of lifetimes.
        let headers = rdr.headers()?;
        println!("{:?}", headers);
    }
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    // We can ask for the headers at any time. There's no need to nest this
    // call in its own scope because we never try to borrow the reader again.
    let headers = rdr.headers()?;
    println!("{:?}", headers);
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

One interesting thing to note in this example is that we put the call to
`rdr.headers()` in its own scope. We do this because `rdr.headers()` returns
a *borrow* of the reader's internal header state. The nested scope in this
code allows the borrow to end before we try to iterate over the records. If
we didn't nest the call to `rdr.headers()` in its own scope, then the code
wouldn't compile because we cannot borrow the reader's headers at the same time
that we try to borrow the reader to iterate over its records.

Another way of solving this problem is to *clone* the header record:

```ignore
let headers = rdr.headers()?.clone();
```

This converts it from a borrow of the CSV reader to a new owned value. This
makes the code a bit easier to read, but at the cost of copying the header
record into a new allocation.
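
Here's a sketch of the example above rewritten with the clone approach; the
nested scope is no longer needed because `headers` is now an owned value:

```no_run
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Cloning the header record gives us an owned value, so the borrow of
    // the reader ends immediately and we can iterate over records freely.
    let headers = rdr.headers()?.clone();
    println!("{:?}", headers);
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```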

## Delimiters, quotes and variable length records

In this section we'll temporarily depart from our `uspop.csv` data set and
show how to read some CSV data that is a little less clean. This CSV data
uses `;` as a delimiter, escapes quotes with `\"` (instead of `""`) and has
records of varying length. Here's the data, which contains a list of WWE
wrestlers and the year they started, if it's known:

```text
$ cat strange.csv
"\"Hacksaw\" Jim Duggan";1987
"Bret \"Hit Man\" Hart";1984
# We're not sure when Rafael started, so omit the year.
Rafael Halperin
"\"Big Cat\" Ernie Ladd";1964
"\"Macho Man\" Randy Savage";1985
"Jake \"The Snake\" Roberts";1986
```

To read this CSV data, we'll want to do the following:

1. Disable headers, since this data has none.
2. Change the delimiter from `,` to `;`.
3. Change the quote strategy from doubled (e.g., `""`) to escaped (e.g., `\"`).
4. Permit flexible length records, since some omit the year.
5. Ignore lines beginning with a `#`.

All of this (and more!) can be configured with a
[`ReaderBuilder`](../struct.ReaderBuilder.html),
as seen in the following example:

```no_run
//tutorial-read-delimiter-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b';')
        .double_quote(false)
        .escape(Some(b'\\'))
        .flexible(true)
        .comment(Some(b'#'))
        .from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Now re-compile your project and try running the program on `strange.csv`:

```text
$ cargo build
$ ./target/debug/csvtutor < strange.csv
StringRecord(["\"Hacksaw\" Jim Duggan", "1987"])
StringRecord(["Bret \"Hit Man\" Hart", "1984"])
StringRecord(["Rafael Halperin"])
StringRecord(["\"Big Cat\" Ernie Ladd", "1964"])
StringRecord(["\"Macho Man\" Randy Savage", "1985"])
StringRecord(["Jake \"The Snake\" Roberts", "1986"])
```

You should feel encouraged to play around with the settings. Some interesting
things you might try:

1. If you remove the `escape` setting, notice that no CSV errors are reported.
   Instead, records are still parsed. This is a feature of the CSV parser. Even
   though it gets the data slightly wrong, it still provides a parse that you
   might be able to work with. This is a useful property given the messiness
   of real world CSV data.
2. If you remove the `delimiter` setting, parsing still succeeds, although
   every record has exactly one field.
3. If you remove the `flexible` setting, the reader will print the first two
   records (since they both have the same number of fields), but will return a
   parse error on the third record, since it has only one field.

This covers most of the things you might want to configure on your CSV reader,
although there are a few other knobs. For example, you can change the record
terminator from a new line to any other character. (By default, the terminator
is `CRLF`, which treats each of `\r\n`, `\r` and `\n` as single record
terminators.) For more details, see the documentation and examples for each of
the methods on
[`ReaderBuilder`](../struct.ReaderBuilder.html).
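
For instance, here's a sketch of a builder configured so that only `\n` is
treated as a record terminator:

```no_run
# use std::io;
// Accept only `\n` as the record terminator, instead of the default
// `CRLF` handling described above.
let rdr = csv::ReaderBuilder::new()
    .terminator(csv::Terminator::Any(b'\n'))
    .from_reader(io::stdin());
# drop(rdr);
```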

## Reading with Serde

One of the most convenient features of this crate is its support for
[Serde](https://serde.rs/).
Serde is a framework for automatically serializing and deserializing data into
Rust types. In simpler terms, that means instead of iterating over records
as an array of string fields, we can iterate over records of a specific type
of our choosing.

For example, let's take a look at some data from our `uspop.csv` file:

```text
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
```

While some of these fields make sense as strings (`City`, `State`), other
fields look more like numbers. For example, `Population` looks like it contains
integers while `Latitude` and `Longitude` appear to contain decimals. If we
wanted to convert these fields to their "proper" types, then we need to do
a lot of manual work. This next example shows how.

```no_run
//tutorial-read-serde-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.records() {
        let record = result?;

        let city = &record[0];
        let state = &record[1];
        // Some records are missing population counts, so if we can't
        // parse a number, treat the population count as missing instead
        // of returning an error.
        let pop: Option<u64> = record[2].parse().ok();
        // Lucky us! Latitudes and longitudes are available for every record.
        // Therefore, if one couldn't be parsed, return an error.
        let latitude: f64 = record[3].parse()?;
        let longitude: f64 = record[4].parse()?;

        println!(
            "city: {:?}, state: {:?}, \
             pop: {:?}, latitude: {:?}, longitude: {:?}",
            city, state, pop, latitude, longitude);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

The problem here is that we need to parse each individual field manually, which
can be labor intensive and repetitive. Serde, however, makes this process
automatic. For example, we can ask to deserialize every record into a tuple
type: `(String, String, Option<u64>, f64, f64)`.

```no_run
//tutorial-read-serde-02.rs
# use std::{error::Error, io, process};
#
// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = (String, String, Option<u64>, f64, f64);

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Instead of creating an iterator with the `records` method, we create
    // an iterator with the `deserialize` method.
    for result in rdr.deserialize() {
        // We must tell Serde what type we want to deserialize into.
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this code should show similar output as previous examples:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
("Davidsons Landing", "AK", None, 65.2419444, -165.2716667)
("Kenai", "AK", Some(7610), 60.5544444, -151.2583333)
("Oakman", "AL", None, 33.7133333, -87.3886111)
# ... and much more
```

One of the downsides of using Serde this way is that the type you use must
match the order of fields as they appear in each record. This can be a pain
if your CSV data has a header record, since you might tend to think about each
field as a value of a particular named field rather than as a numbered field.
One way to get access to fields by name is to deserialize each record into a map type like
[`HashMap`](https://doc.rust-lang.org/std/collections/struct.HashMap.html)
or
[`BTreeMap`](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html).
The next example shows how, and in particular, notice that the only thing that
changed from the last example is the definition of the `Record` type alias and
a new `use` statement that imports `HashMap` from the standard library:

```no_run
//tutorial-read-serde-03.rs
use std::collections::HashMap;
# use std::{error::Error, io, process};

// This introduces a type alias so that we can conveniently reference our
// record type.
type Record = HashMap<String, String>;

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Running this program shows similar results as before, but each record is
printed as a map:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
{"City": "Davidsons Landing", "Latitude": "65.2419444", "State": "AK", "Population": "", "Longitude": "-165.2716667"}
{"City": "Kenai", "Population": "7610", "State": "AK", "Longitude": "-151.2583333", "Latitude": "60.5544444"}
{"State": "AL", "City": "Oakman", "Longitude": "-87.3886111", "Population": "", "Latitude": "33.7133333"}
```

This method works especially well if you need to read CSV data with header
records, but whose exact structure isn't known until your program runs.
However, in our case, we know the structure of the data in `uspop.csv`. In
particular, with the `HashMap` approach, we've lost the specific types we had
for each field in the previous example when we deserialized each record into a
`(String, String, Option<u64>, f64, f64)`. Is there a way to identify fields
by their corresponding header name *and* assign each field its own unique
type? The answer is yes, but we'll need to bring in Serde's `derive` feature
first. You can do that by adding this to the `[dependencies]` section of your
`Cargo.toml` file:

```text
serde = { version = "1", features = ["derive"] }
```

With this dependency added to our project, we can now define our own custom struct
that represents our record. We then ask Serde to automatically write the glue
code required to populate our struct from a CSV record. The next example shows
how. Don't miss the new Serde imports!

```no_run
//tutorial-read-serde-04.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
        // Try this if you don't like each record smushed on one line:
        // println!("{:#?}", record);
    }
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

Compile and run this program to see similar output as before:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
```

Once again, we didn't need to change our `run` function at all: we're still
iterating over records using the `deserialize` iterator that we started with
in the beginning of this section. The only thing that changed in this example
was the definition of the `Record` type and a new `use` statement. Our `Record`
type is now a custom struct that we defined instead of a type alias, and as a
result, Serde doesn't know how to deserialize it by default. However, Serde
provides a procedural macro (via the `derive` feature we enabled) that reads
your struct definition at compile time and generates code that will
deserialize a CSV record
into a `Record` value. To see what happens if you leave out the automatic
derive, change `#[derive(Debug, Deserialize)]` to `#[derive(Debug)]`.

One other thing worth mentioning in this example is the use of
`#[serde(rename_all = "PascalCase")]`. This directive helps Serde map your
struct's field names to the header names in the CSV data. If you recall, our
header record is:

```text
City,State,Population,Latitude,Longitude
```

Notice that each name is capitalized, but the fields in our struct are not. The
`#[serde(rename_all = "PascalCase")]` directive fixes that by interpreting each
field in `PascalCase`, where the first letter of the field is capitalized. If
we didn't tell Serde about the name remapping, then the program would quit with
an error:

```text
$ ./target/debug/csvtutor < uspop.csv
CSV deserialize error: record 1 (line: 2, byte: 41): missing field `latitude`
```

We could have fixed this through other means. For example, we could have used
capital letters in our field names:

```ignore
#[derive(Debug, Deserialize)]
struct Record {
    Latitude: f64,
    Longitude: f64,
    Population: Option<u64>,
    City: String,
    State: String,
}
```

However, this violates Rust naming style. (In fact, the Rust compiler
will even warn you that the names do not follow convention!)

Another way to fix this is to ask Serde to rename each field individually. This
is useful when there is no consistent name mapping from fields to header names:

```ignore
#[derive(Debug, Deserialize)]
struct Record {
    #[serde(rename = "Latitude")]
    latitude: f64,
    #[serde(rename = "Longitude")]
    longitude: f64,
    #[serde(rename = "Population")]
    population: Option<u64>,
    #[serde(rename = "City")]
    city: String,
    #[serde(rename = "State")]
    state: String,
}
```

To read more about renaming fields and about other Serde directives, please
consult the
[Serde documentation on attributes](https://serde.rs/attributes.html).

## Handling invalid data with Serde

In this section we will see a brief example of how to deal with data that isn't
clean. To do this exercise, we'll work with a slightly tweaked version of the
US population data we've been using throughout this tutorial. This version of
the data is slightly messier than what we've been using. You can get it like
so:

```text
$ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-null.csv'
```

Let's start by running our program from the previous section:

```no_run
//tutorial-read-serde-invalid-01.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
#
# use serde::Deserialize;
#
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compile and run it on our messier data:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop-null.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
# ... more records
CSV deserialize error: record 42 (line: 43, byte: 1710): field 2: invalid digit found in string
```

Oops! What happened? The program printed several records, but stopped when it
tripped over a deserialization problem. The error message says that it found
an invalid digit in the field at index `2` (which is the `Population` field)
on line 43. What does line 43 look like?

```text
$ head -n 43 uspop-null.csv | tail -n1
Flint Springs,KY,NULL,37.3433333,-86.7136111
```

Ah! The third field (index `2`) is supposed to either be empty or contain a
population count. However, in this data, it seems that `NULL` sometimes appears
as a value, presumably to indicate that there is no count available.

The problem with our current program is that it fails to read this record
because it doesn't know how to deserialize a `NULL` string into an
`Option<u64>`. That is, an `Option<u64>` corresponds either to an empty field
or to an integer.

To fix this, we tell Serde to convert any deserialization errors on this field
to a `None` value, as shown in this next example:

```no_run
//tutorial-read-serde-invalid-02.rs
# #![allow(dead_code)]
# use std::{error::Error, io, process};
#
# use serde::Deserialize;
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Record {
    latitude: f64,
    longitude: f64,
    #[serde(deserialize_with = "csv::invalid_option")]
    population: Option<u64>,
    city: String,
    state: String,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let record: Record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

If you compile and run this example, then it should run to completion just
like the other examples:

```text
$ cargo build
$ ./target/debug/csvtutor < uspop-null.csv
Record { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
Record { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
Record { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
# ... and more
```

The only change in this example was adding this attribute to the `population`
field in our `Record` type:

```ignore
#[serde(deserialize_with = "csv::invalid_option")]
```

The
[`invalid_option`](../fn.invalid_option.html)
function is a generic helper function that does one very simple thing: when
applied to `Option` fields, it will convert any deserialization error into a
`None` value. This is useful when you need to work with messy CSV data.

# Writing CSV

In this section we'll show a few examples that write CSV data. Writing CSV data
tends to be a bit more straightforward than reading CSV data, since you get to
control the output format.

Let's start with the most basic example: writing a few CSV records to `stdout`.

```no_run
//tutorial-write-01.rs
use std::{error::Error, io, process};

fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    // Since we're writing records manually, we must explicitly write our
    // header record. A header record is written the same way that other
    // records are written.
    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    // A CSV writer maintains an internal buffer, so it's important
    // to flush the buffer when you're done.
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

Compiling and running this example results in CSV data being printed:

```text
$ cargo build
$ ./target/debug/csvtutor
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

Before moving on, it's worth taking a closer look at the `write_record`
method. In this example, it looks rather simple, but if you're new to Rust then
its type signature might look a little daunting:

```ignore
pub fn write_record<I, T>(&mut self, record: I) -> csv::Result<()>
    where I: IntoIterator<Item=T>, T: AsRef<[u8]>
{
    // implementation elided
}
```

To understand the type signature, we can break it down piece by piece.

1. The method takes two parameters: `self` and `record`.
2. `self` is a special parameter that corresponds to the `Writer` itself.
3. `record` is the CSV record we'd like to write. Its type is `I`, which is
   a generic type.
4. In the method's `where` clause, the `I` type is constrained by the
   `IntoIterator<Item=T>` bound. What that means is that `I` must satisfy the
   `IntoIterator` trait. If you look at the documentation of the
   [`IntoIterator` trait](https://doc.rust-lang.org/std/iter/trait.IntoIterator.html),
   then we can see that it describes types that can build iterators. In this
   case, we want an iterator that yields *another* generic type `T`, where
   `T` is the type of each field we want to write.
5. `T` also appears in the method's `where` clause, but its constraint is the
   `AsRef<[u8]>` bound. The `AsRef` trait is a way to describe zero cost
   conversions between types in Rust. In this case, the `[u8]` in `AsRef<[u8]>`
   means that we want to be able to *borrow* a slice of bytes from `T`.
   The CSV writer will take these bytes and write them as a single field.
   The `AsRef<[u8]>` bound is useful because types like `String`, `&str`,
   `Vec<u8>` and `&[u8]` all satisfy it.
6. Finally, the method returns a `csv::Result<()>`, which is short-hand for
   `Result<(), csv::Error>`. That means `write_record` either returns nothing
   on success or returns a `csv::Error` on failure.

Now, let's apply our newfound understanding of the type signature of
`write_record`. If you recall, in our previous example, we used it like so:

```ignore
wtr.write_record(&["field 1", "field 2", "etc"])?;
```

So how do the types match up? Well, the type of each of our fields in this
code is `&'static str` (which is the type of a string literal in Rust). Since
we put them in a slice literal, the type of our parameter is
`&'static [&'static str]`, or more succinctly written as `&[&str]` without the
lifetime annotations. Since slices satisfy the `IntoIterator` bound and
strings satisfy the `AsRef<[u8]>` bound, this ends up being a legal call.

Here are a few more examples of ways you can call `write_record`:

```no_run
# use csv;
# let mut wtr = csv::Writer::from_writer(vec![]);
// A slice of byte strings.
wtr.write_record(&[b"a", b"b", b"c"]);
// A vector.
wtr.write_record(vec!["a", "b", "c"]);
// A string record.
wtr.write_record(&csv::StringRecord::from(vec!["a", "b", "c"]));
// A byte record.
wtr.write_record(&csv::ByteRecord::from(vec!["a", "b", "c"]));
```
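
(These calls ignore the returned `Result` for brevity; in real code you would
use `?` as before.) And if you want the written data in memory rather than on
stdout, you can write to a `Vec<u8>` and extract it afterwards with
`into_inner`, which flushes the writer and returns the underlying buffer. A
small sketch:

```no_run
# use std::error::Error;
#
# fn main() -> Result<(), Box<dyn Error>> {
let mut wtr = csv::Writer::from_writer(vec![]);
wtr.write_record(&["a", "b", "c"])?;
// `into_inner` flushes any buffered data and returns the `Vec<u8>` that
// the writer was writing to.
let data = String::from_utf8(wtr.into_inner()?)?;
assert_eq!(data, "a,b,c\n");
# Ok(())
# }
```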

Finally, the example above can be easily adapted to write to a file instead
of `stdout`:

```no_run
//tutorial-write-02.rs
use std::{
    env,
    error::Error,
    ffi::OsString,
    process,
};

fn run() -> Result<(), Box<dyn Error>> {
    let file_path = get_first_arg()?;
    let mut wtr = csv::Writer::from_path(file_path)?;

    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    wtr.flush()?;
    Ok(())
}

/// Returns the first positional argument sent to this process. If there are no
/// positional arguments, then this returns an error.
fn get_first_arg() -> Result<OsString, Box<dyn Error>> {
    match env::args_os().nth(1) {
        None => Err(From::from("expected 1 argument, but got none")),
        Some(file_path) => Ok(file_path),
    }
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}
```

## Writing tab separated values

In the previous section, we saw how to write some simple CSV data to `stdout`
that looked like this:

```text
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

You might wonder to yourself: what's the point of using a CSV writer if the
data is so simple? Well, the benefit of a CSV writer is that it can handle all
types of data without sacrificing the integrity of your data. That is, it knows
when to quote fields that contain special CSV characters (like commas or new
lines) or escape literal quotes that appear in your data. The CSV writer can
also be easily configured to use different delimiters or quoting strategies.
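
For example, here's a small sketch that writes a field containing the
delimiter and a field containing literal quotes; with the default
configuration, it prints `plain,"has, comma","has ""quotes"""`:

```no_run
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    // The second field contains the delimiter and the third contains
    // quotes, so the writer quotes both fields automatically and doubles
    // the literal quotes.
    wtr.write_record(&["plain", "has, comma", "has \"quotes\""])?;
    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```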

In this section, we'll take a look at how to tweak some of the settings
on a CSV writer. In particular, we'll write TSV ("tab separated values")
instead of CSV, and we'll ask the CSV writer to quote all non-numeric fields.
Here's an example:

```no_run
//tutorial-write-delimiter-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::WriterBuilder::new()
        .delimiter(b'\t')
        .quote_style(csv::QuoteStyle::NonNumeric)
        .from_writer(io::stdout());

    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;
    wtr.write_record(&["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])?;
    wtr.write_record(&["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])?;
    wtr.write_record(&["Oakman", "AL", "", "33.7133333", "-87.3886111"])?;

    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compiling and running this example gives:

```text
$ cargo build
$ ./target/debug/csvtutor
"City"  "State" "Population"    "Latitude"      "Longitude"
"Davidsons Landing"     "AK"    ""      65.2419444      -165.2716667
"Kenai" "AK"    7610    60.5544444      -151.2583333
"Oakman"        "AL"    ""      33.7133333      -87.3886111
```

In this example, we used a new type
[`QuoteStyle`](../enum.QuoteStyle.html).
The `QuoteStyle` type represents the different quoting strategies available
to you. The default is to add quotes to fields only when necessary. This
probably works for most use cases, but you can also ask for quotes to always
be put around fields, to never be put around fields or to always be put around
non-numeric fields.
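
For example, a sketch of a builder that quotes every field unconditionally:

```no_run
# use std::io;
// `QuoteStyle::Always` puts quotes around every field, even when quoting
// isn't strictly necessary.
let wtr = csv::WriterBuilder::new()
    .quote_style(csv::QuoteStyle::Always)
    .from_writer(io::stdout());
# drop(wtr);
```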

## Writing with Serde

Just like the CSV reader supports automatic deserialization into Rust types
with Serde, the CSV writer supports automatic serialization from Rust types
into CSV records using Serde. In this section, we'll learn how to use it.

As with reading, let's start by seeing how we can serialize a Rust tuple.

```no_run
//tutorial-write-serde-01.rs
# use std::{error::Error, io, process};
#
fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());

    // We still need to write headers manually.
    wtr.write_record(&["City", "State", "Population", "Latitude", "Longitude"])?;

    // But now we can write records by providing a normal Rust value.
    //
    // Note that the odd `None::<u64>` syntax is required because `None` on
    // its own doesn't have a concrete type, but Serde needs a concrete type
    // in order to serialize it. That is, `None` has type `Option<T>` but
    // `None::<u64>` has type `Option<u64>`.
    wtr.serialize(("Davidsons Landing", "AK", None::<u64>, 65.2419444, -165.2716667))?;
    wtr.serialize(("Kenai", "AK", Some(7610), 60.5544444, -151.2583333))?;
    wtr.serialize(("Oakman", "AL", None::<u64>, 33.7133333, -87.3886111))?;

    wtr.flush()?;
    Ok(())
}
#
# fn main() {
#     if let Err(err) = run() {
#         println!("{}", err);
#         process::exit(1);
#     }
# }
```

Compiling and running this program gives the expected output:

```text
$ cargo build
$ ./target/debug/csvtutor
City,State,Population,Latitude,Longitude
Davidsons Landing,AK,,65.2419444,-165.2716667
Kenai,AK,7610,60.5544444,-151.2583333
Oakman,AL,,33.7133333,-87.3886111
```

The key thing to note in the above example is the use of `serialize` instead
of `write_record` to write our data. In particular, `write_record` is used
when writing a simple record that contains string-like data only. On the other
hand, `serialize` is used when your data consists of more complex values like
numbers, floats or optional values. Of course, you could always convert the
complex values to strings and then use `write_record`, but Serde can do it for
you automatically.

As with reading, we can also serialize custom structs as CSV records. As a
bonus, the fields in a struct will automatically be written as a header
record!

To write custom structs as CSV records, we'll need to make use of Serde's
automatic `derive` feature again. As in the
[previous section on reading with Serde](#reading-with-serde),
we'll need to add the `serde` dependency to the `[dependencies]` section of
our `Cargo.toml` (if it isn't already there):
1431 
1432 ```text
1433 serde = { version = "1", features = ["derive"] }
1434 ```
1435 
We'll also need to add a new `use` statement for Serde to our code, as
shown in the example:
1438 
1439 ```no_run
1440 //tutorial-write-serde-02.rs
1441 use std::{error::Error, io, process};
1442 
1443 use serde::Serialize;
1444 
1445 // Note that structs can derive both Serialize and Deserialize!
1446 #[derive(Debug, Serialize)]
1447 #[serde(rename_all = "PascalCase")]
1448 struct Record<'a> {
1449     city: &'a str,
1450     state: &'a str,
1451     population: Option<u64>,
1452     latitude: f64,
1453     longitude: f64,
1454 }
1455 
1456 fn run() -> Result<(), Box<dyn Error>> {
1457     let mut wtr = csv::Writer::from_writer(io::stdout());
1458 
1459     wtr.serialize(Record {
1460         city: "Davidsons Landing",
1461         state: "AK",
1462         population: None,
1463         latitude: 65.2419444,
1464         longitude: -165.2716667,
1465     })?;
1466     wtr.serialize(Record {
1467         city: "Kenai",
1468         state: "AK",
1469         population: Some(7610),
1470         latitude: 60.5544444,
1471         longitude: -151.2583333,
1472     })?;
1473     wtr.serialize(Record {
1474         city: "Oakman",
1475         state: "AL",
1476         population: None,
1477         latitude: 33.7133333,
1478         longitude: -87.3886111,
1479     })?;
1480 
1481     wtr.flush()?;
1482     Ok(())
1483 }
1484 
1485 fn main() {
1486     if let Err(err) = run() {
1487         println!("{}", err);
1488         process::exit(1);
1489     }
1490 }
1491 ```
1492 
Compiling and running this example produces the same output as last time, even
though we didn't explicitly write a header record:
1495 
1496 ```text
1497 $ cargo build
1498 $ ./target/debug/csvtutor
1499 City,State,Population,Latitude,Longitude
1500 Davidsons Landing,AK,,65.2419444,-165.2716667
1501 Kenai,AK,7610,60.5544444,-151.2583333
1502 Oakman,AL,,33.7133333,-87.3886111
1503 ```
1504 
1505 In this case, the `serialize` method noticed that we were writing a struct
1506 with field names. When this happens, `serialize` will automatically write a
1507 header record (only if no other records have been written) that consists of
1508 the fields in the struct in the order in which they are defined. Note that
1509 this behavior can be disabled with the
1510 [`WriterBuilder::has_headers`](../struct.WriterBuilder.html#method.has_headers)
1511 method.
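
For example, a sketch of disabling the automatic header record (so that only
the data rows are written) might look like this:

```ignore
// Don't write the struct's field names as a header record.
let mut wtr = csv::WriterBuilder::new()
    .has_headers(false)
    .from_writer(io::stdout());
```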
1512 
1513 It's also worth pointing out the use of a *lifetime parameter* in our `Record`
1514 struct:
1515 
1516 ```ignore
1517 struct Record<'a> {
1518     city: &'a str,
1519     state: &'a str,
1520     population: Option<u64>,
1521     latitude: f64,
1522     longitude: f64,
1523 }
1524 ```
1525 
1526 The `'a` lifetime parameter corresponds to the lifetime of the `city` and
1527 `state` string slices. This says that the `Record` struct contains *borrowed*
1528 data. We could have written our struct without borrowing any data, and
1529 therefore, without any lifetime parameters:
1530 
1531 ```ignore
1532 struct Record {
1533     city: String,
1534     state: String,
1535     population: Option<u64>,
1536     latitude: f64,
1537     longitude: f64,
1538 }
1539 ```
1540 
1541 However, since we had to replace our borrowed `&str` types with owned `String`
types, we're now forced to allocate a new `String` value for both `city`
and `state` for every record that we write. There's no intrinsic problem with
1544 doing that, but it might be a bit wasteful.
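
Concretely, with the owned variant, every `serialize` call would look
something like this sketch, costing two fresh `String` allocations per
record:

```ignore
wtr.serialize(Record {
    // Each `to_string` call allocates a new `String`.
    city: "Davidsons Landing".to_string(),
    state: "AK".to_string(),
    population: None,
    latitude: 65.2419444,
    longitude: -165.2716667,
})?;
```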
1545 
1546 For more examples and more details on the rules for serialization, please see
1547 the
1548 [`Writer::serialize`](../struct.Writer.html#method.serialize)
1549 method.
1550 
1551 # Pipelining
1552 
1553 In this section, we're going to cover a few examples that demonstrate programs
1554 that take CSV data as input, and produce possibly transformed or filtered CSV
1555 data as output. This shows how to write a complete program that efficiently
1556 reads and writes CSV data. Rust is well positioned to perform this task, since
1557 you'll get great performance with the convenience of a high level CSV library.
1558 
1559 ## Filter by search
1560 
1561 The first example of CSV pipelining we'll look at is a simple filter. It takes
1562 as input some CSV data on stdin and a single string query as its only
1563 positional argument, and it will produce as output CSV data that only contains
1564 rows with a field that matches the query.
1565 
1566 ```no_run
1567 //tutorial-pipeline-search-01.rs
1568 use std::{env, error::Error, io, process};
1569 
1570 fn run() -> Result<(), Box<dyn Error>> {
1571     // Get the query from the positional arguments.
1572     // If one doesn't exist, return an error.
1573     let query = match env::args().nth(1) {
1574         None => return Err(From::from("expected 1 argument, but got none")),
1575         Some(query) => query,
1576     };
1577 
1578     // Build CSV readers and writers to stdin and stdout, respectively.
1579     let mut rdr = csv::Reader::from_reader(io::stdin());
1580     let mut wtr = csv::Writer::from_writer(io::stdout());
1581 
1582     // Before reading our data records, we should write the header record.
1583     wtr.write_record(rdr.headers()?)?;
1584 
1585     // Iterate over all the records in `rdr`, and write only records containing
1586     // `query` to `wtr`.
1587     for result in rdr.records() {
1588         let record = result?;
1589         if record.iter().any(|field| field == &query) {
1590             wtr.write_record(&record)?;
1591         }
1592     }
1593 
1594     // CSV writers use an internal buffer, so we should always flush when done.
1595     wtr.flush()?;
1596     Ok(())
1597 }
1598 
1599 fn main() {
1600     if let Err(err) = run() {
1601         println!("{}", err);
1602         process::exit(1);
1603     }
1604 }
1605 ```
1606 
1607 If we compile and run this program with a query of `MA` on `uspop.csv`, we'll
1608 see that only one record matches:
1609 
1610 ```text
1611 $ cargo build
$ ./target/debug/csvtutor MA < uspop.csv
1613 City,State,Population,Latitude,Longitude
1614 Reading,MA,23441,42.5255556,-71.0958333
1615 ```
1616 
1617 This example doesn't actually introduce anything new. It merely combines what
1618 you've already learned about CSV readers and writers from previous sections.
1619 
1620 Let's add a twist to this example. In the real world, you're often faced with
1621 messy CSV data that might not be encoded correctly. One example you might come
1622 across is CSV data encoded in
1623 [Latin-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
1624 Unfortunately, for the examples we've seen so far, our CSV reader assumes that
all of the data is UTF-8. Since all of the data we've worked with has been
1626 ASCII---which is a subset of both Latin-1 and UTF-8---we haven't had any
1627 problems. But let's introduce a slightly tweaked version of our `uspop.csv`
1628 file that contains an encoding of a Latin-1 character that is invalid UTF-8.
1629 You can get the data like so:
1630 
1631 ```text
1632 $ curl -LO 'https://raw.githubusercontent.com/BurntSushi/rust-csv/master/examples/data/uspop-latin1.csv'
1633 ```
1634 
Even though I've already given away the problem, let's see what happens when
we try to run our previous example on this new data:
1637 
1638 ```text
$ ./target/debug/csvtutor MA < uspop-latin1.csv
1640 City,State,Population,Latitude,Longitude
1641 CSV parse error: record 3 (line 4, field: 0, byte: 125): invalid utf-8: invalid UTF-8 in field 0 near byte index 0
1642 ```
1643 
1644 The error message tells us exactly what's wrong. Let's take a look at line 4
1645 to see what we're dealing with:
1646 
1647 ```text
1648 $ head -n4 uspop-latin1.csv | tail -n1
Õakman,AL,,33.7133333,-87.3886111
1650 ```
1651 
1652 In this case, the very first character is the Latin-1 `Õ`, which is encoded as
1653 the byte `0xD5`, which is in turn invalid UTF-8. So what do we do now that our
1654 CSV parser has choked on our data? You have two choices. The first is to go in
1655 and fix up your CSV data so that it's valid UTF-8. This is probably a good
1656 idea anyway, and tools like `iconv` can help with the task of transcoding.
1657 But if you can't or don't want to do that, then you can instead read CSV data
1658 in a way that is mostly encoding agnostic (so long as ASCII is still a valid
1659 subset). The trick is to use *byte records* instead of *string records*.
1660 
Thus far, we haven't actually talked much about the record types in this
library, but now is a good time to introduce them. There are two of them,
1663 [`StringRecord`](../struct.StringRecord.html)
1664 and
1665 [`ByteRecord`](../struct.ByteRecord.html).
Each of them represents a single record in CSV data, where a record is a
sequence of an arbitrary number of fields. The only difference between
`StringRecord` and `ByteRecord` is that `StringRecord` is guaranteed to be
valid UTF-8, whereas `ByteRecord` contains arbitrary bytes.
1670 
1671 Armed with that knowledge, we can now begin to understand why we saw an error
1672 when we ran the last example on data that wasn't UTF-8. Namely, when we call
1673 `records`, we get back an iterator of `StringRecord`. Since `StringRecord` is
1674 guaranteed to be valid UTF-8, trying to build a `StringRecord` with invalid
1675 UTF-8 will result in the error that we see.
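
You can see the same failure without a CSV reader at all. Here's a small
sketch (not one of the numbered examples) that tries to convert a
`ByteRecord` containing invalid UTF-8 into a `StringRecord` with
[`StringRecord::from_byte_record`](../struct.StringRecord.html#method.from_byte_record):

```no_run
//tutorial-byte-record-sketch.rs (a hypothetical example)
use csv::{ByteRecord, StringRecord};

fn main() {
    // `0xD5` is the Latin-1 encoding of `Õ`, which is invalid UTF-8.
    let record = ByteRecord::from(vec![&b"\xD5akman"[..], &b"AL"[..]]);
    // The conversion fails because a `StringRecord` must be valid UTF-8.
    assert!(StringRecord::from_byte_record(record).is_err());
}
```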
1676 
1677 All we need to do to make our example work is to switch from a `StringRecord`
1678 to a `ByteRecord`. This means using `byte_records` to create our iterator
1679 instead of `records`, and similarly using `byte_headers` instead of `headers`
1680 if we think our header data might contain invalid UTF-8 as well. Here's the
1681 change:
1682 
1683 ```no_run
1684 //tutorial-pipeline-search-02.rs
1685 # use std::{env, error::Error, io, process};
1686 #
1687 fn run() -> Result<(), Box<dyn Error>> {
1688     let query = match env::args().nth(1) {
1689         None => return Err(From::from("expected 1 argument, but got none")),
1690         Some(query) => query,
1691     };
1692 
1693     let mut rdr = csv::Reader::from_reader(io::stdin());
1694     let mut wtr = csv::Writer::from_writer(io::stdout());
1695 
1696     wtr.write_record(rdr.byte_headers()?)?;
1697 
1698     for result in rdr.byte_records() {
1699         let record = result?;
1700         // `query` is a `String` while `field` is now a `&[u8]`, so we'll
1701         // need to convert `query` to `&[u8]` before doing a comparison.
1702         if record.iter().any(|field| field == query.as_bytes()) {
1703             wtr.write_record(&record)?;
1704         }
1705     }
1706 
1707     wtr.flush()?;
1708     Ok(())
1709 }
1710 #
1711 # fn main() {
1712 #     if let Err(err) = run() {
1713 #         println!("{}", err);
1714 #         process::exit(1);
1715 #     }
1716 # }
1717 ```
1718 
Compiling and running this now yields the same results as our first example,
but this time it works on data that isn't valid UTF-8:
1721 
1722 ```text
1723 $ cargo build
$ ./target/debug/csvtutor MA < uspop-latin1.csv
1725 City,State,Population,Latitude,Longitude
1726 Reading,MA,23441,42.5255556,-71.0958333
1727 ```
1728 
1729 ## Filter by population count
1730 
1731 In this section, we will show another example program that both reads and
1732 writes CSV data, but instead of dealing with arbitrary records, we will use
1733 Serde to deserialize and serialize records with specific types.
1734 
1735 For this program, we'd like to be able to filter records in our population data
1736 by population count. Specifically, we'd like to see which records meet a
1737 certain population threshold. In addition to using a simple inequality, we must
1738 also account for records that have a missing population count. This is where
1739 types like `Option<T>` come in handy, because the compiler will force us to
1740 consider the case when the population count is missing.
1741 
1742 Since we're using Serde in this example, don't forget to add the Serde
1743 dependencies to your `Cargo.toml` in your `[dependencies]` section if they
1744 aren't already there:
1745 
1746 ```text
1747 serde = { version = "1", features = ["derive"] }
1748 ```
1749 
1750 Now here's the code:
1751 
1752 ```no_run
1753 //tutorial-pipeline-pop-01.rs
1754 # use std::{env, error::Error, io, process};
1755 
1756 use serde::{Deserialize, Serialize};
1757 
1758 // Unlike previous examples, we derive both Deserialize and Serialize. This
1759 // means we'll be able to automatically deserialize and serialize this type.
1760 #[derive(Debug, Deserialize, Serialize)]
1761 #[serde(rename_all = "PascalCase")]
1762 struct Record {
1763     city: String,
1764     state: String,
1765     population: Option<u64>,
1766     latitude: f64,
1767     longitude: f64,
1768 }
1769 
1770 fn run() -> Result<(), Box<dyn Error>> {
1771     // Get the query from the positional arguments.
1772     // If one doesn't exist or isn't an integer, return an error.
1773     let minimum_pop: u64 = match env::args().nth(1) {
1774         None => return Err(From::from("expected 1 argument, but got none")),
1775         Some(arg) => arg.parse()?,
1776     };
1777 
1778     // Build CSV readers and writers to stdin and stdout, respectively.
1779     // Note that we don't need to write headers explicitly. Since we're
1780     // serializing a custom struct, that's done for us automatically.
1781     let mut rdr = csv::Reader::from_reader(io::stdin());
1782     let mut wtr = csv::Writer::from_writer(io::stdout());
1783 
1784     // Iterate over all the records in `rdr`, and write only records containing
1785     // a population that is greater than or equal to `minimum_pop`.
1786     for result in rdr.deserialize() {
1787         // Remember that when deserializing, we must use a type hint to
1788         // indicate which type we want to deserialize our record into.
1789         let record: Record = result?;
1790 
        // `map_or` is a combinator on `Option`. It takes two parameters:
1792         // a value to use when the `Option` is `None` (i.e., the record has
1793         // no population count) and a closure that returns another value of
1794         // the same type when the `Option` is `Some`. In this case, we test it
1795         // against our minimum population count that we got from the command
1796         // line.
1797         if record.population.map_or(false, |pop| pop >= minimum_pop) {
1798             wtr.serialize(record)?;
1799         }
1800     }
1801 
1802     // CSV writers use an internal buffer, so we should always flush when done.
1803     wtr.flush()?;
1804     Ok(())
1805 }
1806 
1807 fn main() {
1808     if let Err(err) = run() {
1809         println!("{}", err);
1810         process::exit(1);
1811     }
1812 }
1813 ```
1814 
1815 If we compile and run our program with a minimum threshold of `100000`, we
1816 should see three matching records. Notice that the headers were added even
1817 though we never explicitly wrote them!
1818 
1819 ```text
1820 $ cargo build
1821 $ ./target/debug/csvtutor 100000 < uspop.csv
1822 City,State,Population,Latitude,Longitude
1823 Fontana,CA,169160,34.0922222,-117.4341667
1824 Bridgeport,CT,139090,41.1669444,-73.2052778
1825 Indianapolis,IN,773283,39.7683333,-86.1580556
1826 ```
1827 
1828 # Performance
1829 
1830 In this section, we'll go over how to squeeze the most juice out of our CSV
1831 reader. As it happens, most of the APIs we've seen so far were designed with
1832 high level convenience in mind, and that often comes with some costs. For the
1833 most part, those costs revolve around unnecessary allocations. Therefore, most
1834 of the section will show how to do CSV parsing with as little allocation as
1835 possible.
1836 
1837 There are two critical preliminaries we must cover.
1838 
1839 Firstly, when you care about performance, you should compile your code
1840 with `cargo build --release` instead of `cargo build`. The `--release`
1841 flag instructs the compiler to spend more time optimizing your code. When
1842 compiling with the `--release` flag, you'll find your compiled program at
1843 `target/release/csvtutor` instead of `target/debug/csvtutor`. Throughout this
1844 tutorial, we've used `cargo build` because our dataset was small and we weren't
1845 focused on speed. The downside of `cargo build --release` is that it will take
1846 longer than `cargo build`.
1847 
1848 Secondly, the dataset we've used throughout this tutorial only has 100 records.
1849 We'd have to try really hard to cause our program to run slowly on 100 records,
1850 even when we compile without the `--release` flag. Therefore, in order to
1851 actually witness a performance difference, we need a bigger dataset. To get
1852 such a dataset, we'll use the original source of `uspop.csv`. **Warning: the
1853 download is 41MB compressed and decompresses to 145MB.**
1854 
1855 ```text
1856 $ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv.gz
1857 $ gunzip worldcitiespop.csv.gz
1858 $ wc worldcitiespop.csv
1859   3173959   5681543 151492068 worldcitiespop.csv
1860 $ md5sum worldcitiespop.csv
1861 6198bd180b6d6586626ecbf044c1cca5  worldcitiespop.csv
1862 ```
1863 
1864 Finally, it's worth pointing out that this section is not attempting to
1865 present a rigorous set of benchmarks. We will stay away from rigorous analysis
1866 and instead rely a bit more on wall clock times and intuition.
1867 
1868 ## Amortizing allocations
1869 
1870 In order to measure performance, we must be careful about what it is we're
1871 measuring. We must also be careful to not change the thing we're measuring as
1872 we make improvements to the code. For this reason, we will focus on measuring
1873 how long it takes to count the number of records corresponding to city
1874 population counts in Massachusetts. This represents a very small amount of work
1875 that requires us to visit every record, and therefore represents a decent way
1876 to measure how long it takes to do CSV parsing.
1877 
1878 Before diving into our first optimization, let's start with a baseline by
1879 adapting a previous example to count the number of records in
1880 `worldcitiespop.csv`:
1881 
1882 ```no_run
1883 //tutorial-perf-alloc-01.rs
1884 use std::{error::Error, io, process};
1885 
1886 fn run() -> Result<u64, Box<dyn Error>> {
1887     let mut rdr = csv::Reader::from_reader(io::stdin());
1888 
1889     let mut count = 0;
1890     for result in rdr.records() {
1891         let record = result?;
1892         if &record[0] == "us" && &record[3] == "MA" {
1893             count += 1;
1894         }
1895     }
1896     Ok(count)
1897 }
1898 
1899 fn main() {
1900     match run() {
1901         Ok(count) => {
1902             println!("{}", count);
1903         }
1904         Err(err) => {
1905             println!("{}", err);
1906             process::exit(1);
1907         }
1908     }
1909 }
1910 ```
1911 
1912 Now let's compile and run it and see what kind of timing we get. Don't forget
1913 to compile with the `--release` flag. (For grins, try compiling without the
1914 `--release` flag and see how long it takes to run the program!)
1915 
1916 ```text
1917 $ cargo build --release
1918 $ time ./target/release/csvtutor < worldcitiespop.csv
1919 2176
1920 
1921 real    0m0.645s
1922 user    0m0.627s
1923 sys     0m0.017s
1924 ```
1925 
1926 All right, so what's the first thing we can do to make this faster? This
1927 section promised to speed things up by amortizing allocation, but we can do
1928 something even simpler first: iterate over
1929 [`ByteRecord`](../struct.ByteRecord.html)s
1930 instead of
1931 [`StringRecord`](../struct.StringRecord.html)s.
1932 If you recall from a previous section, a `StringRecord` is guaranteed to be
valid UTF-8, and therefore must validate that its contents are actually UTF-8.
1934 (If validation fails, then the CSV reader will return an error.) If we remove
1935 that validation from our program, then we can realize a nice speed boost as
1936 shown in the next example:
1937 
1938 ```no_run
1939 //tutorial-perf-alloc-02.rs
1940 # use std::{error::Error, io, process};
1941 #
1942 fn run() -> Result<u64, Box<dyn Error>> {
1943     let mut rdr = csv::Reader::from_reader(io::stdin());
1944 
1945     let mut count = 0;
1946     for result in rdr.byte_records() {
1947         let record = result?;
1948         if &record[0] == b"us" && &record[3] == b"MA" {
1949             count += 1;
1950         }
1951     }
1952     Ok(count)
1953 }
1954 #
1955 # fn main() {
1956 #     match run() {
1957 #         Ok(count) => {
1958 #             println!("{}", count);
1959 #         }
1960 #         Err(err) => {
1961 #             println!("{}", err);
1962 #             process::exit(1);
1963 #         }
1964 #     }
1965 # }
1966 ```
1967 
1968 And now compile and run:
1969 
1970 ```text
1971 $ cargo build --release
1972 $ time ./target/release/csvtutor < worldcitiespop.csv
1973 2176
1974 
1975 real    0m0.429s
1976 user    0m0.403s
1977 sys     0m0.023s
1978 ```
1979 
1980 Our program is now approximately 30% faster, all because we removed UTF-8
1981 validation. But was it actually okay to remove UTF-8 validation? What have we
1982 lost? In this case, it is perfectly acceptable to drop UTF-8 validation and use
1983 `ByteRecord` instead because all we're doing with the data in the record is
1984 comparing two of its fields to raw bytes:
1985 
1986 ```ignore
1987 if &record[0] == b"us" && &record[3] == b"MA" {
1988     count += 1;
1989 }
1990 ```
1991 
1992 In particular, it doesn't matter whether `record` is valid UTF-8 or not, since
1993 we're checking for equality on the raw bytes themselves.
1994 
1995 UTF-8 validation via `StringRecord` is useful because it provides access to
fields as `&str` types, whereas `ByteRecord` provides fields as `&[u8]` types.
1997 `&str` is the type of a borrowed string in Rust, which provides convenient
1998 access to string APIs like substring search. Strings are also frequently used
1999 in other areas, so they tend to be a useful thing to have. Therefore, sticking
2000 with `StringRecord` is a good default, but if you need the extra speed and can
2001 deal with arbitrary bytes, then switching to `ByteRecord` might be a good idea.
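
To make the convenience difference concrete, here's a small sketch (not one
of the numbered examples) that does a substring search on a field from each
record type:

```no_run
//tutorial-str-vs-bytes-sketch.rs (a hypothetical example)
use csv::{ByteRecord, StringRecord};

fn main() {
    // With a `StringRecord`, fields are `&str`, so string APIs like
    // `contains` are directly available.
    let s = StringRecord::from(vec!["Davidsons Landing"]);
    assert!(s[0].contains("Landing"));

    // With a `ByteRecord`, fields are `&[u8]`, and the standard library has
    // no one-line substring search for byte slices. `windows` is one way to
    // spell it by hand.
    let b = ByteRecord::from(vec!["Davidsons Landing"]);
    assert!(b[0].windows(b"Landing".len()).any(|w| w == b"Landing"));
}
```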
2002 
2003 Moving on, let's try to get another speed boost by amortizing allocation.
Amortizing allocation is a technique that creates an allocation once (or
very rarely), and then reuses it instead of creating additional allocations.
In the case of the previous examples, we used iterators created by the
`records` and `byte_records` methods on a CSV reader. These iterators
allocate a new record for every item they yield, which in turn corresponds to
a new allocation. They do this because iterators cannot yield items that
borrow from the iterator itself, and because creating new allocations tends
to be a lot more convenient.
2012 
2013 If we're willing to forgo use of iterators, then we can amortize allocations
2014 by creating a *single* `ByteRecord` and asking the CSV reader to read into it.
2015 We do this by using the
2016 [`Reader::read_byte_record`](../struct.Reader.html#method.read_byte_record)
2017 method.
2018 
2019 ```no_run
2020 //tutorial-perf-alloc-03.rs
2021 # use std::{error::Error, io, process};
2022 #
2023 fn run() -> Result<u64, Box<dyn Error>> {
2024     let mut rdr = csv::Reader::from_reader(io::stdin());
2025     let mut record = csv::ByteRecord::new();
2026 
2027     let mut count = 0;
2028     while rdr.read_byte_record(&mut record)? {
2029         if &record[0] == b"us" && &record[3] == b"MA" {
2030             count += 1;
2031         }
2032     }
2033     Ok(count)
2034 }
2035 #
2036 # fn main() {
2037 #     match run() {
2038 #         Ok(count) => {
2039 #             println!("{}", count);
2040 #         }
2041 #         Err(err) => {
2042 #             println!("{}", err);
2043 #             process::exit(1);
2044 #         }
2045 #     }
2046 # }
2047 ```
2048 
2049 Compile and run:
2050 
2051 ```text
2052 $ cargo build --release
2053 $ time ./target/release/csvtutor < worldcitiespop.csv
2054 2176
2055 
2056 real    0m0.308s
2057 user    0m0.283s
2058 sys     0m0.023s
2059 ```
2060 
2061 Woohoo! This represents *another* 30% boost over the previous example, which is
2062 a 50% boost over the first example.
2063 
2064 Let's dissect this code by taking a look at the type signature of the
2065 `read_byte_record` method:
2066 
2067 ```ignore
2068 fn read_byte_record(&mut self, record: &mut ByteRecord) -> csv::Result<bool>;
2069 ```
2070 
2071 This method takes as input a CSV reader (the `self` parameter) and a *mutable
2072 borrow* of a `ByteRecord`, and returns a `csv::Result<bool>`. (The
2073 `csv::Result<bool>` is equivalent to `Result<bool, csv::Error>`.) The return
2074 value is `true` if and only if a record was read. When it's `false`, that means
2075 the reader has exhausted its input. This method works by copying the contents
2076 of the next record into the provided `ByteRecord`. Since the same `ByteRecord`
2077 is used to read every record, it will already have space allocated for data.
2078 When `read_byte_record` runs, it will overwrite the contents that were there
2079 with the new record, which means that it can reuse the space that was
2080 allocated. Thus, we have *amortized allocation*.
2081 
2082 An exercise you might consider doing is to use a `StringRecord` instead of a
2083 `ByteRecord`, and therefore
2084 [`Reader::read_record`](../struct.Reader.html#method.read_record)
2085 instead of `read_byte_record`. This will give you easy access to Rust strings
2086 at the cost of UTF-8 validation but *without* the cost of allocating a new
2087 `StringRecord` for every record.
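
Here's a sketch of what that exercise might look like, under the same
assumptions as the previous example:

```no_run
//tutorial-perf-alloc-exercise.rs (a hypothetical example)
# use std::{error::Error, io, process};
#
fn run() -> Result<u64, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // One reusable record: we still pay for UTF-8 validation, but not for a
    // fresh allocation on every row.
    let mut record = csv::StringRecord::new();

    let mut count = 0;
    while rdr.read_record(&mut record)? {
        if &record[0] == "us" && &record[3] == "MA" {
            count += 1;
        }
    }
    Ok(count)
}
#
# fn main() {
#     match run() {
#         Ok(count) => println!("{}", count),
#         Err(err) => {
#             println!("{}", err);
#             process::exit(1);
#         }
#     }
# }
```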
2088 
2089 ## Serde and zero allocation
2090 
2091 In this section, we are going to briefly examine how we use Serde and what we
2092 can do to speed it up. The key optimization we'll want to make is to---you
2093 guessed it---amortize allocation.
2094 
As with the previous section, let's start with a simple baseline based on an
example using Serde from a previous section:
2097 
2098 ```no_run
2099 //tutorial-perf-serde-01.rs
2100 # #![allow(dead_code)]
2101 use std::{error::Error, io, process};
2102 
2103 use serde::Deserialize;
2104 
2105 #[derive(Debug, Deserialize)]
2106 #[serde(rename_all = "PascalCase")]
2107 struct Record {
2108     country: String,
2109     city: String,
2110     accent_city: String,
2111     region: String,
2112     population: Option<u64>,
2113     latitude: f64,
2114     longitude: f64,
2115 }
2116 
2117 fn run() -> Result<u64, Box<dyn Error>> {
2118     let mut rdr = csv::Reader::from_reader(io::stdin());
2119 
2120     let mut count = 0;
2121     for result in rdr.deserialize() {
2122         let record: Record = result?;
2123         if record.country == "us" && record.region == "MA" {
2124             count += 1;
2125         }
2126     }
2127     Ok(count)
2128 }
2129 
2130 fn main() {
2131     match run() {
2132         Ok(count) => {
2133             println!("{}", count);
2134         }
2135         Err(err) => {
2136             println!("{}", err);
2137             process::exit(1);
2138         }
2139     }
2140 }
2141 ```
2142 
2143 Now compile and run this program:
2144 
2145 ```text
2146 $ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2148 2176
2149 
2150 real    0m1.381s
2151 user    0m1.367s
2152 sys     0m0.013s
2153 ```
2154 
2155 The first thing you might notice is that this is quite a bit slower than our
2156 programs in the previous section. This is because deserializing each record
2157 has a certain amount of overhead to it. In particular, some of the fields need
2158 to be parsed as integers or floating point numbers, which isn't free. However,
2159 there is hope yet, because we can speed up this program!
2160 
2161 Our first attempt to speed up the program will be to amortize allocation. Doing
2162 this with Serde is a bit trickier than before, because we need to change our
2163 `Record` type and use the manual deserialization API. Let's see what that looks
2164 like:
2165 
2166 ```no_run
2167 //tutorial-perf-serde-02.rs
2168 # #![allow(dead_code)]
2169 # use std::{error::Error, io, process};
2170 # use serde::Deserialize;
2171 #
2172 #[derive(Debug, Deserialize)]
2173 #[serde(rename_all = "PascalCase")]
2174 struct Record<'a> {
2175     country: &'a str,
2176     city: &'a str,
2177     accent_city: &'a str,
2178     region: &'a str,
2179     population: Option<u64>,
2180     latitude: f64,
2181     longitude: f64,
2182 }
2183 
2184 fn run() -> Result<u64, Box<dyn Error>> {
2185     let mut rdr = csv::Reader::from_reader(io::stdin());
2186     let mut raw_record = csv::StringRecord::new();
2187     let headers = rdr.headers()?.clone();
2188 
2189     let mut count = 0;
2190     while rdr.read_record(&mut raw_record)? {
2191         let record: Record = raw_record.deserialize(Some(&headers))?;
2192         if record.country == "us" && record.region == "MA" {
2193             count += 1;
2194         }
2195     }
2196     Ok(count)
2197 }
2198 #
2199 # fn main() {
2200 #     match run() {
2201 #         Ok(count) => {
2202 #             println!("{}", count);
2203 #         }
2204 #         Err(err) => {
2205 #             println!("{}", err);
2206 #             process::exit(1);
2207 #         }
2208 #     }
2209 # }
2210 ```
2211 
2212 Compile and run:
2213 
2214 ```text
2215 $ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2217 2176
2218 
2219 real    0m1.055s
2220 user    0m1.040s
2221 sys     0m0.013s
2222 ```
2223 
2224 This corresponds to an approximately 24% increase in performance. To achieve
2225 this, we had to make two important changes.
2226 
2227 The first was to make our `Record` type contain `&str` fields instead of
2228 `String` fields. If you recall from a previous section, `&str` is a *borrowed*
string, while a `String` is an *owned* string. A borrowed string points to an
already existing allocation, whereas a `String` always implies a new
allocation. In this case, our `&str` is borrowing from the CSV record itself.
2232 
2233 The second change we had to make was to stop using the
2234 [`Reader::deserialize`](../struct.Reader.html#method.deserialize)
iterator, and instead read each record into a `StringRecord` explicitly
2236 and then use the
2237 [`StringRecord::deserialize`](../struct.StringRecord.html#method.deserialize)
2238 method to deserialize a single record.
2239 
2240 The second change is a bit tricky, because in order for it to work, our
2241 `Record` type needs to borrow from the data inside the `StringRecord`. That
2242 means that our `Record` value cannot outlive the `StringRecord` that it was
2243 created from. Since we overwrite the same `StringRecord` on each iteration
2244 (in order to amortize allocation), that means our `Record` value must evaporate
2245 before the next iteration of the loop. Indeed, the compiler will enforce this!
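
For instance, trying to stash a borrowed `Record` away for use after the loop
(a sketch, not part of the example above) is rejected by the compiler:

```ignore
let mut last: Option<Record> = None;
while rdr.read_record(&mut raw_record)? {
    let record: Record = raw_record.deserialize(Some(&headers))?;
    // error[E0502]: cannot borrow `raw_record` as mutable because it is
    // also borrowed as immutable
    last = Some(record);
}
```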
2246 
2247 There is one more optimization we can make: remove UTF-8 validation. In
2248 general, this means using `&[u8]` instead of `&str` and `ByteRecord` instead
2249 of `StringRecord`:
2250 
2251 ```no_run
2252 //tutorial-perf-serde-03.rs
2253 # #![allow(dead_code)]
2254 # use std::{error::Error, io, process};
2255 #
2256 # use serde::Deserialize;
2257 #
2258 #[derive(Debug, Deserialize)]
2259 #[serde(rename_all = "PascalCase")]
2260 struct Record<'a> {
2261     country: &'a [u8],
2262     city: &'a [u8],
2263     accent_city: &'a [u8],
2264     region: &'a [u8],
2265     population: Option<u64>,
2266     latitude: f64,
2267     longitude: f64,
2268 }
2269 
2270 fn run() -> Result<u64, Box<dyn Error>> {
2271     let mut rdr = csv::Reader::from_reader(io::stdin());
2272     let mut raw_record = csv::ByteRecord::new();
2273     let headers = rdr.byte_headers()?.clone();
2274 
2275     let mut count = 0;
2276     while rdr.read_byte_record(&mut raw_record)? {
2277         let record: Record = raw_record.deserialize(Some(&headers))?;
2278         if record.country == b"us" && record.region == b"MA" {
2279             count += 1;
2280         }
2281     }
2282     Ok(count)
2283 }
2284 #
2285 # fn main() {
2286 #     match run() {
2287 #         Ok(count) => {
2288 #             println!("{}", count);
2289 #         }
2290 #         Err(err) => {
2291 #             println!("{}", err);
2292 #             process::exit(1);
2293 #         }
2294 #     }
2295 # }
2296 ```
2297 
2298 Compile and run:
2299 
2300 ```text
2301 $ cargo build --release
$ time ./target/release/csvtutor < worldcitiespop.csv
2303 2176
2304 
2305 real    0m0.873s
2306 user    0m0.850s
2307 sys     0m0.023s
2308 ```
2309 
2310 This corresponds to a 17% increase over the previous example and a 37% increase
2311 over the first example.
2312 
2313 In sum, Serde parsing is still quite fast, but will generally not be the
2314 fastest way to parse CSV since it necessarily needs to do more work.
2315 
2316 ## CSV parsing without the standard library
2317 
2318 In this section, we will explore a niche use case: parsing CSV without the
2319 standard library. While the `csv` crate itself requires the standard library,
2320 the underlying parser is actually part of the
2321 [`csv-core`](https://docs.rs/csv-core)
2322 crate, which does not depend on the standard library. The downside of not
2323 depending on the standard library is that CSV parsing becomes a lot more
2324 inconvenient.
2325 
2326 The `csv-core` crate is structured similarly to the `csv` crate. There is a
2327 [`Reader`](../../csv_core/struct.Reader.html)
2328 and a
2329 [`Writer`](../../csv_core/struct.Writer.html),
2330 as well as corresponding builders
2331 [`ReaderBuilder`](../../csv_core/struct.ReaderBuilder.html)
2332 and
2333 [`WriterBuilder`](../../csv_core/struct.WriterBuilder.html).
2334 The `csv-core` crate has no record types or iterators. Instead, CSV data
2335 can either be read one field at a time or one record at a time. In this
section, we'll focus on reading a field at a time since it is simpler, but it
is generally faster to read a record at a time, since each function call does
more work and the per-call overhead is amortized.
2339 
2340 In keeping with this section on performance, let's write a program using only
2341 `csv-core` that counts the number of records in the state of Massachusetts.
2342 
2343 (Note that we unfortunately use the standard library in this example even
2344 though `csv-core` doesn't technically require it. We do this for convenient
2345 access to I/O, which would be harder without the standard library.)
2346 
2347 ```no_run
2348 //tutorial-perf-core-01.rs
2349 use std::io::{self, Read};
2350 use std::process;
2351 
2352 use csv_core::{Reader, ReadFieldResult};
2353 
2354 fn run(mut data: &[u8]) -> Option<u64> {
2355     let mut rdr = Reader::new();
2356 
2357     // Count the number of records in Massachusetts.
2358     let mut count = 0;
2359     // Indicates the current field index. Reset to 0 at start of each record.
2360     let mut fieldidx = 0;
2361     // True when the current record is in the United States.
2362     let mut inus = false;
2363     // Buffer for field data. Must be big enough to hold the largest field.
2364     let mut field = [0; 1024];
2365     loop {
2366         // Attempt to incrementally read the next CSV field.
2367         let (result, nread, nwrite) = rdr.read_field(data, &mut field);
2368         // nread is the number of bytes read from our input. We should never
2369         // pass those bytes to read_field again.
2370         data = &data[nread..];
2371         // nwrite is the number of bytes written to the output buffer `field`.
        // The contents of the buffer beyond `nwrite` are unspecified.
2373         let field = &field[..nwrite];
2374 
2375         match result {
2376             // We don't need to handle this case because we read all of the
2377             // data up front. If we were reading data incrementally, then this
2378             // would be a signal to read more.
2379             ReadFieldResult::InputEmpty => {}
2380             // If we get this case, then we found a field that contains more
2381             // than 1024 bytes. We keep this example simple and just fail.
2382             ReadFieldResult::OutputFull => {
2383                 return None;
2384             }
2385             // This case happens when we've successfully read a field. If the
2386             // field is the last field in a record, then `record_end` is true.
2387             ReadFieldResult::Field { record_end } => {
2388                 if fieldidx == 0 && field == b"us" {
2389                     inus = true;
2390                 } else if inus && fieldidx == 3 && field == b"MA" {
2391                     count += 1;
2392                 }
2393                 if record_end {
2394                     fieldidx = 0;
2395                     inus = false;
2396                 } else {
2397                     fieldidx += 1;
2398                 }
2399             }
2400             // This case happens when the CSV reader has successfully exhausted
2401             // all input.
2402             ReadFieldResult::End => {
2403                 break;
2404             }
2405         }
2406     }
2407     Some(count)
2408 }
2409 
2410 fn main() {
2411     // Read the entire contents of stdin up front.
2412     let mut data = vec![];
2413     if let Err(err) = io::stdin().read_to_end(&mut data) {
2414         println!("{}", err);
2415         process::exit(1);
2416     }
2417     match run(&data) {
2418         None => {
2419             println!("error: could not count records, buffer too small");
2420             process::exit(1);
2421         }
2422         Some(count) => {
2423             println!("{}", count);
2424         }
2425     }
2426 }
2427 ```
2428 
2429 And compile and run it:
2430 
2431 ```text
2432 $ cargo build --release
2433 $ time ./target/release/csvtutor < worldcitiespop.csv
2434 2176
2435 
2436 real    0m0.572s
2437 user    0m0.513s
2438 sys     0m0.057s
2439 ```
2440 
2441 This isn't as fast as some of our previous examples where we used the `csv`
2442 crate to read into a `StringRecord` or a `ByteRecord`. This is mostly because
2443 this example reads a field at a time, which incurs more overhead than reading a
2444 record at a time. To fix this, you would want to use the
2445 [`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record)
2446 method instead, which is defined on `csv_core::Reader`.
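
A sketch of the record-at-a-time approach might look like the following. With
[`Reader::read_record`](../../csv_core/struct.Reader.html#method.read_record),
all of a record's fields are written into a single output buffer, and a
second buffer receives the end offset of each field. (This is a sketch rather
than a tuned program; the buffer sizes are arbitrary.)

```no_run
//tutorial-perf-core-sketch.rs (a hypothetical example)
use csv_core::{ReadRecordResult, Reader};

fn count_ma(mut data: &[u8]) -> Option<u64> {
    let mut rdr = Reader::new();
    let mut count = 0;
    // Buffers for one record's field data and for the field end offsets.
    // They must be big enough to hold the largest record.
    let mut output = [0; 4096];
    let mut ends = [0; 64];
    loop {
        let (result, nread, _nwrite, nends) =
            rdr.read_record(data, &mut output, &mut ends);
        data = &data[nread..];
        match result {
            // All input has been consumed; the next call sees an empty
            // slice, which tells the reader that the input is done.
            ReadRecordResult::InputEmpty => {}
            // A record didn't fit in one of our buffers. Keep it simple and
            // just fail.
            ReadRecordResult::OutputFull | ReadRecordResult::OutputEndsFull => {
                return None;
            }
            ReadRecordResult::Record => {
                // Field `i` occupies `output[ends[i-1]..ends[i]]`, with an
                // implicit `ends[-1]` of `0`.
                if nends >= 4
                    && &output[..ends[0]] == b"us"
                    && &output[ends[2]..ends[3]] == b"MA"
                {
                    count += 1;
                }
            }
            ReadRecordResult::End => return Some(count),
        }
    }
}

fn main() {
    // For brevity, this sketch hard-codes its input; the example above shows
    // how to read all of stdin into a `Vec<u8>` instead.
    let data = b"us,reading,Reading,MA,23441,42.5255556,-71.0958333\n";
    println!("{:?}", count_ma(data));
}
```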
2447 
The other thing to notice is that the field-at-a-time example above is
considerably longer than the other examples. This is because we need to do
more bookkeeping to keep track of which field we're reading and how much data
we've already fed to the reader. There are basically two reasons to use the
`csv_core` crate:
2452 
2453 1. If you're in an environment where the standard library is not usable.
2. If you want to build your own CSV-like library, you can build it on top
   of `csv-core`.
2456 
2457 # Closing thoughts
2458 
2459 Congratulations on making it to the end! It seems incredible that one could
2460 write so many words on something as basic as CSV parsing. I wanted this
2461 guide to be accessible not only to Rust beginners, but to inexperienced
2462 programmers as well. My hope is that the large number of examples will help
2463 push you in the right direction.
2464 
2465 With that said, here are a few more things you might want to look at:
2466 
2467 * The [API documentation for the `csv` crate](../index.html) documents all
2468   facets of the library, and is itself littered with even more examples.
* The [`csv-index` crate](https://docs.rs/csv-index) provides data structures
  for indexing CSV data that are amenable to writing to disk. (This library is
  still a work in progress.)
2472 * The [`xsv` command line tool](https://github.com/BurntSushi/xsv) is a high
2473   performance CSV swiss army knife. It can slice, select, search, sort, join,
2474   concatenate, index, format and compute statistics on arbitrary CSV data. Give
2475   it a try!
2476 
2477 */
2478