| Name | | Date | Size | #Lines | LOC |
| --- | --- | --- | --- | --- | --- |
| .. | | - | - | | |
| README.md | D | 12-May-2024 | 647 | 13 | 10 |
| en-huge.txt | D | 12-May-2024 | 599 KiB | 22,928 | 22,927 |
| en-medium.txt | D | 12-May-2024 | 60 KiB | 2,171 | 2,170 |
| en-small.txt | D | 12-May-2024 | 1,019 | 40 | 39 |
| en-teeny.txt | D | 12-May-2024 | 28 | 2 | 1 |
| en-tiny.txt | D | 12-May-2024 | 108 | 3 | 2 |
| ru-huge.txt | D | 12-May-2024 | 599 KiB | 12,686 | 12,685 |
| ru-medium.txt | D | 12-May-2024 | 60 KiB | 1,324 | 1,323 |
| ru-small.txt | D | 12-May-2024 | 1 KiB | 19 | 18 |
| ru-teeny.txt | D | 12-May-2024 | 42 | 2 | 1 |
| ru-tiny.txt | D | 12-May-2024 | 174 | 3 | 2 |
| zh-huge.txt | D | 12-May-2024 | 599 KiB | 22,001 | 22,000 |
| zh-medium.txt | D | 12-May-2024 | 60 KiB | 1,466 | 1,465 |
| zh-small.txt | D | 12-May-2024 | 1 KiB | 29 | 28 |
| zh-teeny.txt | D | 12-May-2024 | 31 | 2 | 1 |
| zh-tiny.txt | D | 12-May-2024 | 110 | 4 | 3 |
README.md
These were downloaded and derived from the Open Subtitles data set:
https://opus.nlpl.eu/OpenSubtitles-v2018.php

The specific way in which they were modified has been lost to time, but they
were most likely just truncated to target file sizes for various benchmarks.

The main reason we have them is that they give us a way to test similar
inputs on non-ASCII text. Normally this wouldn't matter for a substring search
implementation, but because the prefilter is built on "rare bytes" whose
rarity is determined a priori by a heuristic, it can perform worse on
non-ASCII text than one might expect.

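To make the concern concrete, here is a minimal sketch in Python. It is not the heuristic the benchmarked code actually uses. `byte_counts` and `pick_rare_byte` are hypothetical helpers, and the two sample sentences stand in for the haystack files. It assumes the "rare byte" is picked from a frequency table built on English text, then checks how often that byte really occurs in UTF-8 encoded Russian text.

```python
from collections import Counter


def byte_counts(text: str) -> Counter:
    """Count how often each byte value occurs in the UTF-8 encoding of text."""
    return Counter(text.encode("utf-8"))


def pick_rare_byte(needle: bytes, background: Counter) -> int:
    """Pick the needle byte that is least frequent in the background sample.

    This is an illustrative stand-in for an a priori frequency table.
    Ties are broken in favor of the earliest byte in the needle.
    """
    return min(needle, key=lambda b: background[b])


# Hypothetical samples: the background model is English, the haystack is Russian.
english = "the quick brown fox jumps over the lazy dog"
russian = "съешь же ещё этих мягких французских булок"

needle = "булок".encode("utf-8")

# No needle byte occurs in the ASCII-only English sample, so all of them look
# equally "rare" and the first one is chosen: the UTF-8 lead byte 0xD0.
guess = pick_rare_byte(needle, byte_counts(english))

# In the Russian haystack that same byte starts a large share of all
# characters, so a prefilter scanning for it would stop very often.
hits = byte_counts(russian)[guess]

print(f"rare byte guess: {guess:#04x}, occurrences in haystack: {hits}")
```

Under these assumptions, the byte that the English-derived table considers rare is one of the most common bytes in the Russian text, which is the kind of mismatch these non-ASCII inputs are meant to exercise.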