| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| README.md | 22-Mar-2025 | 647 | 13 | 10 |
| en-huge.txt | 22-Mar-2025 | 599 KiB | 22,928 | 22,927 |
| en-medium.txt | 22-Mar-2025 | 60 KiB | 2,171 | 2,170 |
| en-small.txt | 22-Mar-2025 | 1,019 | 40 | 39 |
| en-teeny.txt | 22-Mar-2025 | 28 | 2 | 1 |
| en-tiny.txt | 22-Mar-2025 | 108 | 3 | 2 |
| ru-huge.txt | 22-Mar-2025 | 599 KiB | 12,686 | 12,685 |
| ru-medium.txt | 22-Mar-2025 | 60 KiB | 1,324 | 1,323 |
| ru-small.txt | 22-Mar-2025 | 1 KiB | 19 | 18 |
| ru-teeny.txt | 22-Mar-2025 | 42 | 2 | 1 |
| ru-tiny.txt | 22-Mar-2025 | 174 | 3 | 2 |
| zh-huge.txt | 22-Mar-2025 | 599 KiB | 22,001 | 22,000 |
| zh-medium.txt | 22-Mar-2025 | 60 KiB | 1,466 | 1,465 |
| zh-small.txt | 22-Mar-2025 | 1 KiB | 29 | 28 |
| zh-teeny.txt | 22-Mar-2025 | 31 | 2 | 1 |
| zh-tiny.txt | 22-Mar-2025 | 110 | 4 | 3 |

README.md

These were downloaded and derived from the Open Subtitles data set:
https://opus.nlpl.eu/OpenSubtitles-v2018.php

The specific way in which they were modified has been lost to time, but they
were most likely just truncated to hit target file sizes for various
benchmarks.

The main reason we have them is that they give us a way to test similar
inputs on non-ASCII text. Normally this wouldn't matter for a substring search
implementation, but the prefilter is based on "rare bytes" picked by a
heuristic from an a priori ranking, so it's possible for that heuristic to
perform worse on non-ASCII text than one might expect.
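
To make the concern concrete, here is a minimal sketch of a rare-byte
prefilter. Everything in it is an assumption for illustration: the
`ASSUMED_RANK` table, the `pick_rare_byte` and `candidate_fraction` helpers,
and the sample sentences are hypothetical and are not taken from these files
or from any particular search implementation. The sketch picks the needle
byte that a fixed, ASCII-oriented ranking considers rarest, then measures how
often that byte actually occurs in the haystack.

```python
# Hypothetical a priori ranking: a larger value means "assumed more common".
# Assumption: the ranking only knows about common English letters and space,
# so every other byte (including all bytes >= 0x80) defaults to rank 0.
ASSUMED_RANK = {b: 255 for b in b" etaoinshrdlu"}


def pick_rare_byte(needle: bytes) -> int:
    """Return the needle byte the fixed ranking assumes is rarest."""
    return min(needle, key=lambda b: ASSUMED_RANK.get(b, 0))


def candidate_fraction(haystack: bytes, needle: bytes) -> float:
    """Fraction of haystack bytes equal to the chosen 'rare' byte.

    Each such byte is a candidate position that a prefilter would pass on
    to full verification, so lower is better.
    """
    rare = pick_rare_byte(needle)
    return haystack.count(bytes([rare])) / len(haystack)


english = b"the quick brown fox jumps over the lazy dog " * 100
russian = "съешь ещё этих мягких французских булок ".encode("utf-8") * 100

en = candidate_fraction(english, b"lazy dog")
ru = candidate_fraction(russian, "булок".encode("utf-8"))

print(f"English: rare byte hits {en:.1%} of the haystack")
print(f"Russian: rare byte hits {ru:.1%} of the haystack")
```

In UTF-8, every Cyrillic letter is two bytes and the first is always `0xD0`
or `0xD1`, so a byte that this toy ranking treats as rare turns out to be one
of the most frequent bytes in the Russian haystack. A real ranking may well
be better informed than this one; the sketch only shows the kind of mismatch
the non-ASCII inputs are meant to catch.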