These were downloaded and derived from the Open Subtitles data set:
https://opus.nlpl.eu/OpenSubtitles-v2018.php

The specific way in which they were modified has been lost to time, but it's
likely they were simply truncated based on target file sizes for various
benchmarks.

The main reason why we have them is that they give us a way to test similar
inputs on non-ASCII text. Normally this wouldn't matter for a substring search
implementation, but because a heuristic is used to pick a priori determined
"rare bytes" to base a prefilter on, it's possible for this heuristic to
perform worse on non-ASCII text than one might expect.
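
To make the "rare bytes" concern concrete, here is a minimal sketch in Python.
It is an illustration under stated assumptions, not the actual heuristic of
any particular implementation: the ranking in assumed_rank is made up (ASCII
letters and space are treated as common, everything else as rare), and the
names assumed_rank, pick_rare_byte and candidate_count are hypothetical. It
counts how many positions a prefilter would flag if it scanned for the byte
of the needle that the ranking considers rarest.

```python
# Hypothetical a priori ranking, for illustration only: a higher score means
# the byte is assumed to be more common in typical haystacks.
def assumed_rank(byte: int) -> int:
    if byte == 0x20:  # space
        return 255
    if 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A:  # ASCII letters
        return 200
    return 0  # everything else, including all UTF-8 lead and continuation bytes


def pick_rare_byte(needle: bytes) -> int:
    """Return the needle byte that the assumed ranking considers rarest."""
    return min(needle, key=assumed_rank)


def candidate_count(haystack: bytes, needle: bytes) -> int:
    """Count the positions a prefilter scanning for the rare byte would flag."""
    return haystack.count(bytes([pick_rare_byte(needle)]))


english = ("the quick brown fox jumps over the lazy dog. " * 100).encode("utf-8")
russian = ("привет мир, как дела? " * 100).encode("utf-8")

for haystack, needle in [
    (english, b"lazy dog."),
    (russian, "как дела".encode("utf-8")),
]:
    print(
        f"rare byte 0x{pick_rare_byte(needle):02X}: "
        f"{candidate_count(haystack, needle)} candidates, "
        f"{haystack.count(needle)} real matches"
    )
```

In the Cyrillic case the chosen byte is 0xD0, a UTF-8 lead byte shared by many
Cyrillic letters, so the byte that the ranking assumes to be rare is in fact
one of the most frequent bytes in the haystack and the prefilter flags far
more candidates than there are matches.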