# textdistance.rs

[ [github.com](https://github.com/life4/textdistance.rs) ]
[ [docs.rs](https://docs.rs/textdistance/) ]
[ [crates.io](https://crates.io/crates/textdistance) ]

A Rust library with a wide collection of algorithms for measuring how similar two sequences are.

Features:

+ Based on the popular and battle-tested [textdistance](https://github.com/life4/textdistance) Python library (and written by the same author).
+ Contains 20+ algorithms for all purposes.
+ Includes state-of-the-art algorithms like `EntropyNCD` and `Sift4`.
+ Zero-dependency.
+ `#![no_std]` support (embedded systems).
+ Works with any iterators, including bytes, code points, Unicode grapheme clusters, words, and numbers.
+ Friendly and consistent API for all algorithms.
+ Optional normalization of the result to the 0.0-1.0 interval.
+ No unsafe code.
+ Pure Rust.

## Available algorithms

Edit-based:

1. `DamerauLevenshtein`, in both the unrestricted and the restricted (optimal string alignment) variants.
1. `Hamming`
1. `Jaro`
1. `JaroWinkler`
1. `Levenshtein`
1. `Sift4Common`
1. `Sift4Simple`
1. `SmithWaterman`

Token-based:

1. `Bag`
1. `Cosine` (aka Orchini, Tucker, Otsuka–Ochiai)
1. `EntropyNCD` (Entropy-based Normalized Compression Distance)
1. `Jaccard` (aka Tanimoto, Critical Success Index)
1. `Overlap` (aka Szymkiewicz–Simpson)
1. `Roberts`
1. `SorensenDice` (aka F1, Czekanowski, Zijdenbos)
1. `Tversky`

Sequence-based:

1. `LCSSeq` (Longest Common SubSequence)
1. `LCSStr` (Longest Common SubString)
1. `RatcliffObershelp` (aka Gestalt pattern matching)

Naive:

1. `Prefix`
1. `Suffix`
1. `Length`

Normalization for other metrics:

1. `LIG3` normalization for `Hamming` by `Levenshtein`
1. `MLIPNS` normalization for `Hamming`
1. `YujianBo` normalization for `Levenshtein`

## Installation

```shell
cargo add textdistance
```

Or if you're going to use it in a [no_std](https://docs.rust-embedded.org/book/intro/no-std.html) project:

```shell
cargo add --no-default-features textdistance
```

## Usage

The `textdistance::str` module provides a shortcut function for each algorithm for calculating the distance/similarity between two strings:

```rust
use textdistance::str::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2);
```

The `textdistance::nstr` module is the same, but all algorithms return a normalized value (between 0.0 and 1.0):

```rust
use textdistance::nstr::damerau_levenshtein;
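// 2 edits, normalized by the length of the longer string (4):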
assert!(damerau_levenshtein("abc", "acbd") == 2./4.);
```

For more advanced usage, each algorithm is provided as a struct implementing the `Algorithm` trait:

```rust
use textdistance::{Algorithm, DamerauLevenshtein};

let a = DamerauLevenshtein::default();
let r = a.for_str("abc", "acbd");
assert!(r.val() == 2);
assert!(r.nval() == 2./4.);
```

1. The `Algorithm` trait provides `for_str`, `for_vec`, and `for_iter` to calculate the result for two strings, vectors (slices), or iterators respectively. In addition, there are `for_words` and `for_bigrams` methods that split the text into words or bigrams respectively before calculating the distance.
1. Each method returns a `textdistance::Result` that provides methods to get the absolute (`val`) or normalized (`nval`) value of the metric, the distance (`dist` and `ndist`), or the similarity (`sim` and `nsim`).
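
For example, a token-based algorithm can compare whole words instead of characters. Here is a minimal sketch, assuming `Jaccard` treats the words of each string as set elements, so the two shared words out of six distinct ones give a similarity of 1/3:

```rust
use textdistance::{Algorithm, Jaccard};

let a = Jaccard::default();
// Split both strings into words, then compare the word sets:
// {"the", "quick", "brown", "fox"} vs {"the", "slow", "brown", "dog"}.
let r = a.for_words("the quick brown fox", "the slow brown dog");
// 2 shared words out of 6 distinct words in total.
assert!((r.nsim() - 2.0 / 6.0).abs() < 1e-9);
```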

## Unicode support

The `for_str` method (and so all functions in the `str` and `nstr` modules) uses `str::chars` to split the string and then runs it through the `for_iter` method. So, a decomposed `é` will be considered two distinct characters ("latin small letter e" and "combining acute accent"). Usually, that's OK, and this is also how Python works. You can read more in [the official Rust documentation](https://doc.rust-lang.org/std/primitive.char.html#representation).
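
To see this effect directly, here is a minimal sketch, assuming the `str` module exposes a `levenshtein` shortcut like the `damerau_levenshtein` one shown above:

```rust
use textdistance::str::levenshtein;

// "e" followed by a combining acute accent: two code points, one visible symbol.
let decomposed = "e\u{0301}";
// Precomposed "é": a single code point.
let precomposed = "\u{00E9}";

// Turning one into the other takes two code-point edits:
// substitute 'e' with 'é', then delete the combining accent.
assert!(levenshtein(decomposed, precomposed) == 2);
```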

If you want `é` to be considered a single symbol, use the [unicode-segmentation](https://crates.io/crates/unicode-segmentation) crate:

```rust
use textdistance::{Algorithm, DamerauLevenshtein};
use unicode_segmentation::UnicodeSegmentation;

let s1 = "a̐éö̲\r\n";
let s2 = "éa̐ö̲\r\n";
let g1 = s1.graphemes(true);
let g2 = s2.graphemes(true);
let a = DamerauLevenshtein::default();
let r = a.for_iter(g1, g2);
assert!(r.val() == 1);
```

## Choosing the algorithm

Which algorithm to use depends on your use case. First, you need to decide on the algorithm category:

1. Edit-based algorithms work well on short sequences for detecting typos and minor changes.
1. Token-based algorithms work well on longer sequences for comparing long texts with noticeable modifications.
1. Sequence-based algorithms work well for calculating the diff size between the original and the changed version of a sequence.

If you go with edit-based, the next thing is to decide what kinds of changes you need to detect:

+ Substitution. One character is replaced by another.
+ Addition. A new character is added.
+ Deletion. A character is removed.
+ Transposition. Two adjacent characters are swapped.

| alg                  | sub | add | del | trans |
| -------------------- | --- | --- | --- | ----- |
| `Hamming`            | ✅  | ❌  | ❌  | ❌    |
| `Jaro`               | ❌  | ❌  | ❌  | ✅    |
| `JaroWinkler`        | ❌  | ❌  | ❌  | ✅    |
| `Sift4`              | ❌  | ❌  | ❌  | ✅    |
| `Levenshtein`        | ✅  | ✅  | ✅  | ❌    |
| `DamerauLevenshtein` | ✅  | ✅  | ✅  | ✅    |

+ `Hamming` is the fastest but detects only substitutions.
+ `Sift4` is very fast but not as well-known and battle-tested as the other algorithms.
+ `Jaro` is slower than `Sift4` but well-known and battle-tested.
+ `JaroWinkler` is like `Jaro` but gives more weight to strings with a matching prefix.
+ `Levenshtein` detects everything but transpositions and is faster than `DamerauLevenshtein` (though slower than the algorithms above).
+ `DamerauLevenshtein` ticks all the boxes but is very slow.
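
To make the table concrete, here is a minimal sketch of how three of these algorithms see a single transposition (assuming the `hamming` and `levenshtein` shortcuts exist alongside `damerau_levenshtein`, matching the benchmark names below):

```rust
use textdistance::str::{damerau_levenshtein, hamming, levenshtein};

// "acb" is "abc" with the last two characters swapped.
// Hamming compares position by position: two mismatches.
assert!(hamming("abc", "acb") == 2);
// Levenshtein has no transposition operation: two substitutions.
assert!(levenshtein("abc", "acb") == 2);
// Damerau-Levenshtein counts the swap as a single edit.
assert!(damerau_levenshtein("abc", "acb") == 1);
```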

Here are some real-world use cases:

+ `DamerauLevenshtein` with some optimizations is [used in cargo](https://github.com/rust-lang/cargo/blob/master/src/cargo/util/edit_distance.rs) to correct typos in command names.
+ `Jaro` is included in the Elixir standard library ([String.jaro_distance](https://hexdocs.pm/elixir/1.12/String.html#jaro_distance/2)). It is used by the compiler and by mix (Elixir's counterpart to cargo) to provide "did you mean?" suggestions for typos in module or command names.
+ A variation of `RatcliffObershelp` is included in the Python standard library ([difflib.SequenceMatcher](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher)).

## Benchmarks

A rough legend for the timings below:

+ very slow: more than 5 ms
+ slow: 1 ms to 5 ms
+ fast: 500 µs to 1 ms
+ very fast: less than 500 µs

Edit-based (and their normalizations):

| algorithm                      | time      |
| ------------------------------ | --------- |
| hamming                        | 19.203 µs |
| mlipns                         | 20.625 µs |
| sift4_simple                   | 143.69 µs |
| sift4_common                   | 238.86 µs |
| jaro                           | 1.7148 ms |
| jaro_winkler                   | 1.7174 ms |
| levenshtein                    | 4.5999 ms |
| yujian_bo                      | 4.6044 ms |
| lig3                           | 6.5563 ms |
| smith_waterman                 | 9.5146 ms |
| damerau_levenshtein_restricted | 10.301 ms |
| damerau_levenshtein            | 41.938 ms |

Token-based:

| algorithm     | time      |
| ------------- | --------- |
| cosine        | 508.59 µs |
| sorensen_dice | 510.75 µs |
| tversky       | 512.41 µs |
| overlap       | 513.76 µs |
| bag           | 523.06 µs |
| jaccard       | 580.79 µs |
| roberts       | 714.79 µs |
| entropy_ncd   | 731.68 µs |

Sequence-based:

| algorithm          | time      |
| ------------------ | --------- |
| lcsstr             | 3.2658 ms |
| lcsseq             | 7.4349 ms |
| ratcliff_obershelp | 36.308 ms |

Naive:

| algorithm | time      |
| --------- | --------- |
| length    | 2.5300 µs |
| prefix    | 22.473 µs |
| suffix    | 38.821 µs |

The benchmarks are powered by [criterion](https://github.com/bheisler/criterion.rs) and live in the [benches](./benches/) directory. They are quite simple: grab 10 [open-source licenses](https://github.com/github/choosealicense.com/tree/gh-pages/_licenses), take a 200-character prefix of each, and cross-compare these prefixes. The numbers may be very different for another kind or length of input, when comparing words rather than characters, or when running the benchmarks on a different machine. The goal of these benchmarks is to provide basic guidance rather than a definitive answer. If performance is critical for your application, I encourage you to run your own benchmarks on your real data.

## Versioning

We stick to [SemVer](https://semver.org/):

1. The **patch** number is for bug fixes. The results of an algorithm may change in some corner cases if we find that the previous implementation doesn't match the algorithm described in the original paper.
1. The **minor** number is for new algorithms and features.
1. The **major** number is for big changes in the API. We try to avoid breaking things, but we prefer providing a friendly and convenient API over keeping backward compatibility.

## Limitations

+ In the original textdistance, most of the algorithms are adjusted to work on any number of input sequences. However, Rust doesn't support variadic arguments, so all algorithms are currently implemented for exactly two inputs.
+ All algorithms in the crate implement the same `Algorithm` trait. Hence metrics that place additional requirements on the input sequence elements beyond `Eq` (like Editex and MRA, which work only with ASCII letters) currently cannot be implemented.
+ Most of the implemented algorithms have certain properties (like the [commutative property](https://en.wikipedia.org/wiki/Commutative_property)) that make their behavior more predictable and their normalization simple. So, I haven't yet implemented Needleman-Wunsch and Gotoh, mostly because they are tricky to normalize and I'm still not 100% sure that I did it correctly in the original textdistance.

## Acknowledgments

These are the libraries that I used as reference implementations and sources of test cases:

+ Python: [textdistance](https://github.com/life4/textdistance), [abydos](https://github.com/chrislit/abydos), [jellyfish](https://github.com/jamesturk/jellyfish).
+ JS: [talisman](https://github.com/Yomguithereal/talisman).
+ Rust: [strsim](https://github.com/dguo/strsim-rs), [distance](https://github.com/mbrlabs/distance), [levenshtein-rs](https://github.com/wooorm/levenshtein-rs).

Special thanks to [Trevor Gross](https://github.com/tgross35) for transferring to me the ownership of the [textdistance](https://crates.io/crates/textdistance) name on crates.io.

## Testing locally

To run everything locally, all you need is Rust, Python, and [task](https://taskfile.dev/installation/). Execute `task all` to run all code formatters, linters, and tests.

Thank you ❤️