Finding all regex matches has always been O(n²)

Most regex engines advertising linear-time matching actually fall into a quadratic performance trap when finding all matches. This article explores why O(n²) has been the norm since the 1970s and how the RE# project aims to fix it.

search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours. every regex engine, in every language, has had this problem since the 1970s, and nobody fixed it.

every regex engine that advertises linear-time matching - RE2, Go's regexp, rust's regex crate, .NET's NonBacktracking mode - means linear time for a single match. the moment you call find_iter or FindAll, that guarantee is gone. the rust regex crate docs are the only ones honest enough to say it outright:

the worst case time complexity for iterators is O(m * n²). [...] if both patterns and haystacks are untrusted and you're iterating over all matches, you're susceptible to worst case quadratic time complexity.

the mechanism is simple. take the pattern .*a|b and a haystack of n b's. at each position, the engine tries .*a first - scans the entire remaining haystack looking for an a, finds none, fails. then the b branch matches a single character. advance one position, repeat. that's n + (n-1) + (n-2) + ... = O(n²) work to report n single-character matches.

how did this go unnoticed for so long? almost all academic regex papers focus exclusively on the single-match problem and then handwave the rest away with "just iterate". part of the reason is that the theory of regexes boils everything down to a single yes/no question - does this string match or not? that's clean and great for proving theorems, but it throws away nearly everything that matters in practice: where the matches are, how long they are, and how many there are.

backtracking is worse, and still the default

before getting into the fix, it's worth putting the quadratic problem in context. with backtracking, a user-supplied pattern and a 50-character input can take longer than the heat death of the universe - it's exponential. Thompson published the NFA construction that avoids it back in 1968. that's nearly 60 years of a solved problem being actively unsolved at scale, because backtracking is still the default in most regex engines.

Aho-Corasick solved this for fixed strings in 1975

the problem we're talking about in this post - finding all longest matches without quadratic blowup - was actually solved decades ago, but only for fixed strings. Aho-Corasick (1975) is a classic algorithm that finds all occurrences of multiple fixed strings in a single O(n) pass. the reason Aho-Corasick avoids the quadratic blowup is simple: every pattern has a known length. when you find a match, you already know exactly how long it is - there's no ambiguity about where it ends, nothing to rescan.

Hyperscan/Vectorscan

Hyperscan is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one. for grep, editors, and search-and-replace, where users expect a+ to match the full run of a's, earliest semantics gives the wrong answer.

two passes instead of n

RE# is the first regex engine that can find all matches in two passes, regardless of the pattern or the input, without altering the semantics. the algorithm doesn't find matches one at a time. instead it does two passes over the entire input: a reverse DFA marks where matches could start, then a forward DFA resolves the longest match at each marked position. by the time we confirm a match, both directions have already been scanned - matches are reported retroactively rather than by restarting from each position.

Source: Hacker News