| Name | Date | Size |
|---|---|---|
| .. | 25-Oct-2024 | 4 KiB |
| en-huge.txt | 25-Oct-2024 | 599 KiB |
| en-medium.txt | 25-Oct-2024 | 60 KiB |
| en-small.txt | 25-Oct-2024 | 1,019 |
| en-teeny.txt | 25-Oct-2024 | 28 |
| en-tiny.txt | 25-Oct-2024 | 108 |
| README.md | 25-Oct-2024 | 647 |
| ru-huge.txt | 25-Oct-2024 | 599 KiB |
| ru-medium.txt | 25-Oct-2024 | 60 KiB |
| ru-small.txt | 25-Oct-2024 | 1 KiB |
| ru-teeny.txt | 25-Oct-2024 | 42 |
| ru-tiny.txt | 25-Oct-2024 | 174 |
| zh-huge.txt | 25-Oct-2024 | 599 KiB |
| zh-medium.txt | 25-Oct-2024 | 60 KiB |
| zh-small.txt | 25-Oct-2024 | 1 KiB |
| zh-teeny.txt | 25-Oct-2024 | 31 |
| zh-tiny.txt | 25-Oct-2024 | 110 |

README.md

These were downloaded and derived from the OpenSubtitles data set:
https://opus.nlpl.eu/OpenSubtitles-v2018.php

The specific way in which they were modified has been lost to time, but they
were most likely just truncated to target file sizes for various benchmarks.

The main reason we have them is that they give us a way to test similar
inputs on non-ASCII text. Normally this wouldn't matter for a substring search
implementation, but because a heuristic picks "rare bytes", determined a
priori, to base a prefilter on, that heuristic can perform worse on non-ASCII
text than one might expect.
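
To make the "rare byte" concern concrete, here is a minimal sketch of a prefilter driven by a fixed frequency table. Everything in it is an assumption for illustration: the `rank` table, the `rarest_offset` and `find` helpers, and the sample strings are hypothetical and are not taken from the actual implementation these files are used to benchmark. The sketch only shows why a table tuned for English-like ASCII can pick a poor prefilter byte for UTF-8 text.

```rust
/// Hypothetical frequency rank: a higher value means "expected to be more common".
/// This toy table is an assumption; it is tuned for English-like ASCII text and
/// treats every non-ASCII byte as rare.
fn rank(b: u8) -> u8 {
    match b {
        b' ' => 255,
        b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 250,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' | b'0'..=b'9' => 150,
        _ => 50,
    }
}

/// Offset of the needle byte that the rank table considers rarest
/// (the first such byte wins on ties).
fn rarest_offset(needle: &[u8]) -> usize {
    let mut best = 0;
    for (i, &b) in needle.iter().enumerate() {
        if rank(b) < rank(needle[best]) {
            best = i;
        }
    }
    best
}

/// Scans for the "rare" byte and verifies the full needle at each hit.
/// Returns the first match position and the number of candidates verified.
fn find(haystack: &[u8], needle: &[u8]) -> (Option<usize>, usize) {
    if needle.is_empty() {
        return (Some(0), 0);
    }
    let off = rarest_offset(needle);
    let rare = needle[off];
    let mut candidates = 0;
    for (i, &b) in haystack.iter().enumerate() {
        if b != rare || i < off {
            continue;
        }
        let start = i - off;
        candidates += 1;
        if haystack[start..].starts_with(needle) {
            return (Some(start), candidates);
        }
    }
    (None, candidates)
}
```

With this toy table, searching an English sentence for `lazy` uses `l` as the prefilter byte, which appears rarely. Searching Russian text for `всем` uses the UTF-8 lead byte `0xD0`, which the table ranks as rare but which starts most Cyrillic characters, so the prefilter stops at nearly every character and the full verification runs far more often.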