Teddy is a SIMD accelerated multiple substring matching algorithm. The name
and the core ideas in the algorithm were learned from the [Hyperscan][1_u]
project. The implementation in this repository was mostly motivated for use in
accelerating regex searches by searching for small sets of required literals
extracted from the regex.


# Background

The key idea of Teddy is to do *packed* substring matching. In the literature,
packed substring matching is the idea of examining multiple bytes in a haystack
at a time to detect matches. Implementations of, for example, memchr (which
detects matches of a single byte) have been doing this for years. Only
recently, with the introduction of various SIMD instructions, has this been
extended to substring matching. The PCMPESTRI instruction (and its relatives),
for example, implements substring matching in hardware. It is, however, limited
to substrings of length 16 bytes or fewer, but this restriction is fine in a
regex engine, since we rarely care about the performance difference between
searching for a 16 byte literal and a 16 + N byte literal; 16 is already long
enough. The key downside of the PCMPESTRI instruction, on current (2016) CPUs
at least, is its high latency and low throughput. As a result, it is often
faster to do substring search with a Boyer-Moore (or Two-Way) variant and a
well placed memchr to quickly skip through the haystack.

There are fewer results from the literature on packed substring matching, and
even fewer for packed multiple substring matching. Ben-Kiki et al. [2a] [2b]
describe the use of PCMPESTRI for substring matching, but the work is mostly
theoretical and hand-waves performance. There is other theoretical work done
by Bille [3] as well.

The rest of the work in the field, as far as I'm aware, is by Faro and Kulekci
and is generally focused on multiple pattern search. Their first paper [4a]
introduces the concept of a fingerprint, which is computed for every block of
N bytes in every pattern. The haystack is then scanned N bytes at a time and
a fingerprint is computed in the same way it was computed for blocks in the
patterns. If the fingerprint corresponds to one that was found in a pattern,
then a verification step follows to confirm that one of the substrings with the
corresponding fingerprint actually matches at the current location. Various
implementation tricks are employed to make sure the fingerprint lookup is
fast, typically by truncating the fingerprint. (This may, of course, provoke
more steps in the verification process, so a balance must be struck.)

The main downside of [4a] is that the minimum substring length is 32 bytes,
presumably because of how the algorithm uses certain SIMD instructions. This
essentially makes it useless for general purpose regex matching, where a small
number of short patterns is far more likely.

Faro and Kulekci published another paper [4b] that is conceptually very similar
to [4a]. The key difference is that it uses the CRC32 instruction (introduced
as part of SSE 4.2) to compute fingerprint values. This also enables the
algorithm to work effectively on substrings as short as 7 bytes with 4 byte
windows. 7 bytes is unfortunately still too long. The window could technically
be shrunk to 2 bytes, thereby reducing the minimum length to 3, but the small
window size ends up negating most of the performance benefits, and patterns
that short are likely the common case in a general purpose regex engine.

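To make the scheme shared by [4a] and [4b] concrete, here is a minimal scalar
sketch of the fingerprint-and-verify structure. The names, the window size `W`
and the use of raw window bytes as the fingerprint are illustrative
assumptions, not details taken from either paper:

```rust
use std::collections::HashMap;

/// Window size for this sketch; [4b] effectively uses 4 byte windows.
const W: usize = 4;

/// Report (start offset, pattern index) for matches found by fingerprint
/// filtering followed by explicit verification.
fn fingerprint_search(patterns: &[&[u8]], haystack: &[u8]) -> Vec<(usize, usize)> {
    // Map every W-byte window of every pattern to (pattern index, window
    // offset). A real implementation would hash or truncate the window
    // (e.g. via CRC32 in [4b]) instead of using the raw bytes as the key.
    let mut table: HashMap<[u8; W], Vec<(usize, usize)>> = HashMap::new();
    for (pi, pat) in patterns.iter().enumerate() {
        // Patterns shorter than 2*W - 1 bytes can be skipped over entirely by
        // the W-byte stepping below, which is where the minimum length
        // restriction in these papers comes from.
        for off in 0..pat.len().saturating_sub(W - 1) {
            let fp: [u8; W] = pat[off..off + W].try_into().unwrap();
            table.entry(fp).or_default().push((pi, off));
        }
    }

    // Scan the haystack W bytes at a time; verify only when a window's
    // fingerprint was seen in some pattern.
    let mut matches = Vec::new();
    let mut i = 0;
    while i + W <= haystack.len() {
        let fp: [u8; W] = haystack[i..i + W].try_into().unwrap();
        if let Some(candidates) = table.get(&fp) {
            for &(pi, off) in candidates {
                if let Some(start) = i.checked_sub(off) {
                    if haystack[start..].starts_with(patterns[pi]) {
                        // Long patterns may be reported more than once here;
                        // a real implementation would deduplicate.
                        matches.push((start, pi));
                    }
                }
            }
        }
        i += W;
    }
    matches
}
```
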
Faro and Kulekci also published a paper [4c] that appears to be intended as a
replacement for PCMPESTRI. In particular, it is specifically motivated by the
high latency and low throughput of PCMPESTRI and therefore chooses other SIMD
instructions that are faster. While this approach works for short substrings,
I personally couldn't see a way to generalize it to multiple substring search.

Faro and Kulekci have another paper [4d] that I haven't been able to read
because it is behind a paywall.


# Teddy

Finally, we get to Teddy. If the above literature review is complete, then it
appears that Teddy is a novel algorithm. More than that, in my experience, it
completely blows away the competition for short substrings, which is exactly
what we want in a general purpose regex engine. Again, the algorithm appears
to have been developed by the authors of [Hyperscan][1_u]. Hyperscan was open
sourced late 2015, and no earlier history could be found. Therefore, tracking
the exact provenance of the algorithm with respect to the published literature
seems difficult.

At a high level, Teddy works somewhat similarly to the fingerprint algorithms
published by Faro and Kulekci, but Teddy does it in a way that scales a bit
better. Namely:

1. Teddy's core algorithm scans the haystack in 16 (for SSE, or 32 for AVX)
   byte chunks. 16 (or 32) is significant because it corresponds to the number
   of bytes in a SIMD vector.
2. Bitwise operations are performed on each chunk to discover if any region of
   it matches a set of precomputed fingerprints from the patterns. If there are
   matches, then a verification step is performed. In this implementation, our
   verification step is naive. This can be improved upon.

The details to make this work are quite clever. First, we must choose how to
pick our fingerprints. In Hyperscan's implementation, I *believe* they use the
last N bytes of each substring, where N can be no larger than the length of
the shortest substring in the set being searched. In this implementation, we
use the first N bytes of each substring. (The tradeoffs between these choices
aren't yet clear to me.) We then must figure out how to quickly test whether
an occurrence of any fingerprint from the set of patterns appears in a 16 byte
block from the haystack. To keep things simple, let's assume N = 1 and examine
some examples to motivate the approach. Here are our patterns:

```ignore
foo
bar
baz
```

The corresponding fingerprints, for N = 1, are `f`, `b` and `b`. Now let's set
our 16 byte block to:

```ignore
bat cat foo bump
xxxxxxxxxxxxxxxx
```

To cut to the chase, Teddy works by using bitsets. In particular, Teddy creates
a mask that allows us to quickly compute membership of a fingerprint in a 16
byte block that also tells which pattern the fingerprint corresponds to. In
this case, our fingerprint is a single byte, so an appropriate abstraction is
a map from a single byte to a list of patterns that contain that fingerprint:

```ignore
f |--> foo
b |--> bar, baz
```

Now, all we need to do is figure out how to represent this map in vector space
and use normal SIMD operations to perform a lookup. The first simplification
we can make is to represent our patterns as bit fields occupying a single
byte. This is important, because a single SIMD vector can store 16 bytes.

```ignore
f |--> 00000001
b |--> 00000010, 00000100
```

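To make the bucket assignment concrete, here is a small scalar sketch
(hypothetical helper, not this crate's actual API) that builds such a
byte-to-bitset table for N = 1 and the three patterns above, with one bit per
pattern:

```rust
/// Build a byte -> bucket-bitset table for single byte (N = 1) fingerprints.
/// This sketch assumes at most 8 patterns, so each pattern gets its own bit.
fn build_fingerprint_table(patterns: &[&[u8]]) -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, pat) in patterns.iter().enumerate() {
        // In this implementation the first byte of each pattern is its
        // fingerprint (Hyperscan reportedly uses the last N bytes instead).
        table[pat[0] as usize] |= 1 << i;
    }
    table
}

fn main() {
    let patterns: &[&[u8]] = &[b"foo", b"bar", b"baz"];
    let table = build_fingerprint_table(patterns);
    assert_eq!(table[b'f' as usize], 0b0000_0001); // foo
    assert_eq!(table[b'b' as usize], 0b0000_0110); // bar, baz
}
```
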
How do we perform lookup though? It turns out that SSSE3 introduced a very cool
instruction called PSHUFB. The instruction takes two SIMD vectors, `A` and `B`,
and returns a third vector `C`. All vectors are treated as 16 8-bit integers.
`C` is formed by `C[i] = A[B[i]]`. (This is a bit of a simplification, but true
for the purposes of this algorithm. For full details, see [Intel's Intrinsics
Guide][5_u].) This essentially lets us use the values in `B` to look up values
in `A`.

If we could somehow cause `B` to contain our 16 byte block from the haystack,
and if `A` could contain our bitmasks, then we'd end up with something like
this for `A`:

```ignore
     0x00  0x01  ...  0x62      ...  0x66      ...  0xFF
A =  0     0          00000110       00000001       0
```

And if `B` contains our window from our haystack, we could use shuffle to take
the values from `B` and use them to look up our bitsets in `A`. But of course,
we can't do this because `A` in the above example contains 256 bytes, which
is much larger than the size of a SIMD vector.

Nybbles to the rescue! A nybble is 4 bits. Instead of one mask to hold all of
our bitsets, we can use two masks, where one mask corresponds to the lower four
bits of our fingerprint and the other mask corresponds to the upper four bits.
So our map now looks like:

```ignore
'f' & 0xF = 0x6 |--> 00000001
'f' >> 4  = 0x6 |--> 00000111
'b' & 0xF = 0x2 |--> 00000110
'b' >> 4  = 0x6 |--> 00000111
```

Notice that the bitsets for each nybble correspond to the union of all
fingerprints that contain that nybble. For example, both `f` and `b` have the
same upper 4 bits but differ on the lower 4 bits. Putting this together, we
have `A0`, `A1` and `B`, where `A0` is our mask for the lower nybble, `A1` is
our mask for the upper nybble and `B` is our 16 byte block from the haystack:

```ignore
      0x00  0x01  0x02      0x03  ...  0x06      ...  0xF
A0 =  0     0     00000110  0          00000001       0
A1 =  0     0     0         0          00000111       0
B  =  b     a     t         _          t              p
B  =  0x62  0x61  0x74      0x20       0x74           0x70
```

But of course, we can't use `B` with `PSHUFB` yet, since its values are 8 bits,
and we need indexes that are at most 4 bits (corresponding to one of 16
values). We can apply the same transformation to split `B` into lower and upper
nybbles as we did `A`. As before, `B0` corresponds to the lower nybbles and
`B1` corresponds to the upper nybbles:

```ignore
      b    a    t    _    c    a    t    _    f    o    o    _    b    u    m    p
B0 =  0x2  0x1  0x4  0x0  0x3  0x1  0x4  0x0  0x6  0xF  0xF  0x0  0x2  0x5  0xD  0x0
B1 =  0x6  0x6  0x7  0x2  0x6  0x6  0x7  0x2  0x6  0x6  0x6  0x2  0x6  0x7  0x6  0x7
```

And now we have a nice correspondence. `B0` can index `A0` and `B1` can index
`A1`. Here's what we get when we apply `C0 = PSHUFB(A0, B0)`:

```ignore
      b         a         ...  f         o         ...  p
      A0[0x2]   A0[0x1]        A0[0x6]   A0[0xF]        A0[0x0]
C0 =  00000110  0              00000001  0              0
```

And `C1 = PSHUFB(A1, B1)`:

```ignore
      b         a         ...  f         o         ...  p
      A1[0x6]   A1[0x6]        A1[0x6]   A1[0x6]        A1[0x7]
C1 =  00000111  00000111       00000111  00000111       0
```

Notice how neither one of `C0` or `C1` is guaranteed to report fully correct
results all on its own. For example, `C1` claims that `b` is a fingerprint for
the pattern `foo` (since `A1[0x6] = 00000111`), and that `o` is a fingerprint
for all of our patterns. But if we combine `C0` and `C1` with an `AND`
operation:

```ignore
      b         a  ...  f         o  ...  p
C  =  00000110  0       00000001  0       0
```

Then we have that `C[i]` contains a bitset corresponding to the matching
fingerprints in a haystack's 16 byte block, where `i` is the `i`th byte in that
block.

Once we have that, we can look for the position of the least significant bit
in `C`. (Least significant because we only target `x86_64` here, which is
always little endian. Thus, the least significant bytes correspond to bytes
in our haystack at a lower address.) That position, modulo `8`, gives us
the pattern that the fingerprint matches. That position, integer divided by
`8`, also gives us the byte offset that the fingerprint occurs in inside the
16 byte haystack block. Using those two pieces of information, we can run a
verification procedure that tries to match all substrings containing that
fingerprint at that position in the haystack.

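Putting those pieces together, here is a scalar sketch of a single 16 byte
Teddy step for N = 1. This is a hypothetical helper, not this crate's real
code: PSHUFB is emulated with plain indexing, whereas the real implementation
does all of this with SIMD intrinsics. `a0[n]` is assumed to hold the union of
bucket bits for fingerprints whose lower nybble is `n`, and `a1[n]` the same
for upper nybbles:

```rust
/// Run one 16 byte Teddy step for N = 1 and return the first candidate as
/// (byte offset in the block, bucket), or None if no fingerprint matched.
fn teddy_step(a0: &[u8; 16], a1: &[u8; 16], block: &[u8; 16]) -> Option<(usize, usize)> {
    let mut c = [0u8; 16];
    for i in 0..16 {
        let c0 = a0[(block[i] & 0xF) as usize]; // emulates C0 = PSHUFB(A0, B0)
        let c1 = a1[(block[i] >> 4) as usize];  // emulates C1 = PSHUFB(A1, B1)
        c[i] = c0 & c1;                         // C = C0 AND C1
    }
    // Treat `c` as a little endian 128-bit integer and find its least
    // significant set bit: bit / 8 is the byte offset of the candidate in the
    // block and bit % 8 is the bucket whose patterns must be verified there.
    let lo = u64::from_le_bytes(c[0..8].try_into().unwrap());
    let hi = u64::from_le_bytes(c[8..16].try_into().unwrap());
    let bit = if lo != 0 {
        lo.trailing_zeros() as usize
    } else if hi != 0 {
        64 + hi.trailing_zeros() as usize
    } else {
        return None; // no fingerprint anywhere in this block
    };
    // A real search keeps iterating over the remaining set bits if
    // verification of this candidate fails.
    Some((bit / 8, bit % 8))
}
```
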
# Implementation notes

The problem with the algorithm as described above is that it uses a single byte
for a fingerprint. This will work well if the fingerprints are rare in the
haystack (e.g., capital letters or special characters in normal English text),
but if the fingerprints are common, you'll wind up spending too much time in
the verification step, which effectively negates the performance benefits of
scanning 16 bytes at a time. Remember, the key to the performance of this
algorithm is to do as little work as possible per 16 (or 32) bytes.

This algorithm can be extrapolated in a relatively straightforward way to use
larger fingerprints. That is, instead of a single byte prefix, we might use a
two or three byte prefix. The implementation here implements N = {1, 2, 3}
and always picks the largest N possible. The rationale is that the bigger the
fingerprint, the fewer verification steps we'll do. Of course, if N is too
large, then we'll end up doing too much work on each step.

The way to extend it is:

1. Add a mask for each byte in the fingerprint. (Remember that each mask is
   composed of two SIMD vectors.) This results in a value of `C` for each byte
   in the fingerprint while searching.
2. When testing each 16 (or 32) byte block, each value of `C` must be shifted
   so that they are aligned. Once aligned, they should all be `AND`'d together.
   This will give you only the bitsets corresponding to the full match of the
   fingerprint. To do this, one needs to save the last byte (for N=2) or last
   two bytes (for N=3) from the previous iteration, and then line them up with
   the first one or two bytes of the next iteration.

## Verification

Verification generally follows the procedure outlined above. The tricky parts
are in the right formulation of operations to get our bits out of our vectors.
We have a limited set of operations available to us on SIMD vectors as 128-bit
or 256-bit numbers, so we wind up needing to rip out 2 (or 4) 64-bit integers
from our vectors, and then run our verification step on each of those. The
verification step looks at the least significant set bit, and from its
position, we can derive the byte offset and bucket. (Again, as described
above.) Once we know the bucket, we do a fairly naive exhaustive search for
every literal in that bucket. (Hyperscan is a bit smarter here and uses a hash
table, but I haven't had time to thoroughly explore that. A few initial
half-hearted attempts resulted in worse performance.)

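As a rough illustration of that loop, here is a scalar sketch of verifying one
of those 64-bit integers. The data layout is an assumption made for the
sketch, not this crate's actual representation: `buckets[b]` is taken to be
the list of literals assigned to bucket `b`, and `lane_start` the haystack
offset of the first of the 8 block bytes that this integer covers:

```rust
/// Scan one 64-bit lane of `C` and return the first verified match as
/// (start, end) haystack offsets, or None if every candidate is a false
/// positive.
fn verify_lane(
    mut lane: u64,
    lane_start: usize,
    haystack: &[u8],
    buckets: &[Vec<&[u8]>],
) -> Option<(usize, usize)> {
    while lane != 0 {
        let bit = lane.trailing_zeros() as usize;
        let offset = lane_start + bit / 8; // haystack position of the candidate
        let bucket = bit % 8;              // which group of literals to try
        // Naive exhaustive check of every literal in the bucket.
        for &lit in &buckets[bucket] {
            if haystack[offset..].starts_with(lit) {
                return Some((offset, offset + lit.len()));
            }
        }
        lane &= lane - 1; // clear the bit we just examined and keep looking
    }
    None
}
```
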
## AVX

The AVX version of Teddy extrapolates almost perfectly from the SSE version.
The only hiccup is that PALIGNR is used to align chunks in the 128-bit SSE
version, and there is no equivalent instruction in AVX. AVX does have
VPALIGNR, but it only works within 128-bit lanes. So there's a bit of
tomfoolery to get around this by shuffling the vectors before calling
VPALIGNR.

The only other aspect to AVX is that since our masks are still fundamentally
16 bytes (0x0-0xF), they are duplicated to 32 bytes, so that they can apply to
32-byte chunks.

## Fat Teddy

In the version of Teddy described above, 8 buckets are used to group patterns
that we want to search for. However, when AVX is available, we can extend the
number of buckets to 16 by permitting each entry in our masks to use 16 bits
instead of 8 bits to represent the buckets it belongs to. (This variant is
also in Hyperscan.) However, what we give up is the ability to scan 32 bytes
at a time, even though we're using AVX. Instead, we have to scan 16 bytes at a
time. What we gain, though, is (hopefully) less work in our verification
routine. If patterns are spread out across more buckets, then there should
overall be fewer false positives. In general, Fat Teddy permits us to grow our
capacity a bit and search for more literals before Teddy gets overwhelmed.

The tricky part of Fat Teddy is in how we adjust our masks and our
verification procedure. For the masks, we simply represent the first 8 buckets
in each of the low 16 bytes, and then the second 8 buckets in each of the high
16 bytes. Then, in the search loop, instead of loading 32 bytes from the
haystack, we load the same 16 bytes from the haystack into both the low and
high 16 byte portions of our 256-bit vector. So for example, a mask might look
like this:

    bits:     00100001 00000000 ... 11000000 00000000 00000001 ... 00000000
    byte:        31       30          16       15       14           0
    offset:      15       14           0       15       14           0
    buckets:    8-15     8-15        8-15      0-7      0-7         0-7

where `byte` is the position in the vector (higher numbers corresponding to
more significant bits), `offset` is the corresponding position in the haystack
chunk, and `buckets` corresponds to the bucket assignments for that particular
byte.

In particular, notice that the bucket assignments for offset `0` are spread
out between bytes `0` and `16`. This works well for the chunk-by-chunk search
procedure, but verification really wants to process all bucket assignments for
each offset at once. Otherwise, we might wind up finding a match at offset `1`
in one of the first 8 buckets, when we really should have reported a match at
offset `0` in one of the second 8 buckets. (Because we want the leftmost
match.)

Thus, for verification, we rearrange the above vector such that it is a
sequence of 16-bit integers, where the least significant 16-bit integer
corresponds to all of the bucket assignments for offset `0`. So with the
above vector, the least significant 16-bit integer would be

    11000000 00000000

which was taken from bytes `16` and `0`. Then the verification step pretty
much runs as described, except with 16 buckets instead of 8.

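Here is a scalar sketch of that rearrangement, using the layout described
above. The real implementation stays in vector registers; this scalar version
only shows where each bit ends up. `c` is assumed to be the 32 candidate bytes
computed for one 16 byte chunk, with buckets 0-7 in the low half and buckets
8-15 in the high half:

```rust
/// Rearrange Fat Teddy's 32 candidate bytes into one 16-bit bucket bitset per
/// haystack offset. `c[i]` holds buckets 0-7 for offset `i` and `c[i + 16]`
/// holds buckets 8-15 for the same offset.
fn fat_teddy_rearrange(c: &[u8; 32]) -> [u16; 16] {
    let mut per_offset = [0u16; 16];
    for offset in 0..16 {
        let low = c[offset] as u16;       // buckets 0-7 for this offset
        let high = c[offset + 16] as u16; // buckets 8-15 for this offset
        per_offset[offset] = (high << 8) | low;
    }
    per_offset
}
```

With the candidates in this shape, verification can walk offsets from low to
high and consider all 16 buckets for an offset before moving on, which
preserves the leftmost match.
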
# References

- **[1]** [Hyperscan on GitHub](https://github.com/intel/hyperscan),
  [webpage](https://www.hyperscan.io/)
- **[2a]** Ben-Kiki, O., Bille, P., Breslauer, D., Gasieniec, L., Grossi, R.,
  & Weimann, O. (2011).
  _Optimal packed string matching_.
  In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 13).
  Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  DOI: 10.4230/LIPIcs.FSTTCS.2011.423.
  [PDF](https://drops.dagstuhl.de/opus/volltexte/2011/3355/pdf/37.pdf).
- **[2b]** Ben-Kiki, O., Bille, P., Breslauer, D., Ga̧sieniec, L., Grossi, R.,
  & Weimann, O. (2014).
  _Towards optimal packed string matching_.
  Theoretical Computer Science, 525, 111-129.
  DOI: 10.1016/j.tcs.2013.06.013.
  [PDF](https://www.cs.haifa.ac.il/~oren/Publications/bpsm.pdf).
- **[3]** Bille, P. (2011).
  _Fast searching in packed strings_.
  Journal of Discrete Algorithms, 9(1), 49-56.
  DOI: 10.1016/j.jda.2010.09.003.
  [PDF](https://www.sciencedirect.com/science/article/pii/S1570866710000353).
- **[4a]** Faro, S., & Külekci, M. O. (2012, October).
  _Fast multiple string matching using streaming SIMD extensions technology_.
  In String Processing and Information Retrieval (pp. 217-228).
  Springer Berlin Heidelberg.
  DOI: 10.1007/978-3-642-34109-0_23.
  [PDF](https://www.dmi.unict.it/faro/papers/conference/faro32.pdf).
- **[4b]** Faro, S., & Külekci, M. O. (2013, September).
  _Towards a Very Fast Multiple String Matching Algorithm for Short Patterns_.
  In Stringology (pp. 78-91).
  [PDF](https://www.dmi.unict.it/faro/papers/conference/faro36.pdf).
- **[4c]** Faro, S., & Külekci, M. O. (2013, January).
  _Fast packed string matching for short patterns_.
  In Proceedings of the Meeting on Algorithm Engineering & Experiments
  (pp. 113-121).
  Society for Industrial and Applied Mathematics.
  [PDF](https://arxiv.org/pdf/1209.6449.pdf).
- **[4d]** Faro, S., & Külekci, M. O. (2014).
  _Fast and flexible packed string matching_.
  Journal of Discrete Algorithms, 28, 61-72.
  DOI: 10.1016/j.jda.2014.07.003.

[1_u]: https://github.com/intel/hyperscan
[5_u]: https://software.intel.com/sites/landingpage/IntrinsicsGuide