The quadratic problem nobody fixed

Most regex engines suffer from quadratic time complexity when finding all matches in a document. This article explores why this happens and how the RE# project solves it using a two-pass approach.

search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours. every regex engine, in every language, has had this problem since the 1970s, and nobody fixed it.

every regex engine that advertises linear-time matching - RE2, Go's regexp, rust's regex crate, .NET's NonBacktracking mode - means linear time for a single match. the moment you call find_iter or FindAll, that guarantee evaporates. the rust regex crate docs are the only ones honest enough to say it outright:

the worst case time complexity for iterators is O(m * n²). [...] if both patterns and haystacks are untrusted and you're iterating over all matches, you're susceptible to worst case quadratic time complexity.

the mechanism is simple. take the pattern .*a|b and a haystack of n b's. at each position, the engine tries .*a first - scans the entire remaining haystack looking for an a, finds none, fails. then the b branch matches a single character. advance one position, repeat. that's n + (n-1) + (n-2) + ... = O(n²) work to report n single-character matches. a textbook triangular sum.

how did this go unnoticed for so long? almost all academic regex papers focus exclusively on the single-match problem and then handwave the rest away with "just iterate". part of the reason is that the theory of regexes boils everything down to a single yes/no question - does this string match or not? that's clean and great for proving theorems, but it throws away nearly everything that matters in practice: where the matches are, how long they are, and how many there are.

backtracking is worse, and still the default

before getting into the fix, it's worth putting the quadratic problem in context. with backtracking, a user-supplied pattern and a 50-character input can take longer than the heat death of the universe - it's exponential. Thompson published the NFA construction that avoids it back in 1968. that's nearly 60 years of a solved problem being actively unsolved at scale, because backtracking is still the default in most regex engines.

Aho-Corasick solved this for fixed strings in 1975

the problem we're talking about in this post - finding all longest matches without quadratic blowup - was actually solved decades ago, but only for fixed strings. Aho-Corasick (1975) is a classic and very useful algorithm that finds all occurrences of multiple fixed strings in a single O(n) pass. the reason Aho-Corasick avoids the quadratic blowup is simple: every pattern has a known length, baked into the trie. when you find a match, you already know exactly how long it is.

Hyperscan/Vectorscan

Hyperscan is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one. this changes the results. for grep, editors, and search-and-replace, where users expect a+ to match the full run of a's, earliest semantics gives the wrong answer.

two passes instead of n

i've been working on RE#, and i want to show that this problem is actually possible to solve. to the best of my knowledge, RE# is the first regex engine that can find all matches in two passes, regardless of the pattern or the input, without altering the semantics.

the algorithm doesn't find matches one at a time. instead it does two passes over the entire input: a reverse DFA marks where matches could start, then a forward DFA resolves the longest match at each marked position. by the time we confirm a match, both directions have already been scanned - matches are reported retroactively rather than by restarting from each position.

Source: Hacker News