Performance of Greedy vs. Lazy Regex Quantifiers

A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, but with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the impreciseness of their patterns, often without realizing this is what they're doing. Hence, the previously mentioned misconception stems from that fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.

Consider the following simple example: When the regexes <.*?> and <[^>]*> are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around the ideas of reducing backtracking and the steps which are potentially required to match or fail at any given character position, and here we've reduced both cases to the absolute minimum.

So what exactly is different about how a backtracking, leftmost-first regex engine (your traditional NFA engine type, which describes most modern, Perl-derivative flavors) goes about matching our subject text when working with these two regexes? Let's first take a look at what happens with <.*?>. I'll use the RegexBuddy debugger style to illustrate each step.

Note: If you're reading this in a feed reader or aggregator which strips out inline styles, see the original post, which should make things much easier to follow.

Beginning match attempt at character 0
<
<ok
<backtrack
<0
<0backtrack
<01
<01backtrack
<012
<012backtrack
<0123
<0123backtrack
<01234
<01234backtrack
<012345
<012345backtrack
<0123456
<0123456backtrack
<01234567
<01234567backtrack
<012345678
<012345678backtrack
<0123456789
<0123456789>
Match found

The text ok means the engine completed a successful, zero-length match, and backtrack shows where the engine backtracks to the last choice point, which it does after trying and failing to match the next token (>). As you can see here, with lazy quantification the engine backtracks forward after each character which is not followed by ">".

Here's what happens with <[^>]*>:

Beginning match attempt at character 0
<
<0123456789
<0123456789>
Match found

As you can see, 20 fewer steps are required, and there is no backtracking. This is faster, and can also increase the potential for internal optimization of the matching process. As a simple example, when greedily quantifying an any-character token, the engine can simply move the read-head to the end of the string, rather than testing every character along the way (this is an oversimplification of the process taken by modern regex engines, but you get the idea). You can see this exact kind of internal optimization occurring in IE with the "trim8" pattern in my Faster JavaScript Trim post from a while back.

You might be wondering what happens if we use the regex <.*>. Take a look:

Beginning match attempt at character 0
<
<0123456789>
<0123456789>backtrack
<0123456789
<0123456789>
Match found

Although the output didn't change much when used with our subject string of "<0123456789>", you can see that the .* caused the engine to trek all the way to the end of the string, then backtrack from the right. If we'd used a 10,000-character string without newlines or other ">" characters, that would've resulted in nearly 10,000 backtracks.

In short:

Make your patterns more explicit where it makes sense, to better guide the regex engine and avoid backtracking.
Be careful about using the more appropriate type of quantification (to your best approximation, since this often depends on the subject string) even when it has no impact on the text matched by your regex.
If possible, avoid greedy quantification of the dot and other needlessly flexible tokens.

26 thoughts on “Performance of Greedy vs. Lazy Regex Quantifiers”

Sam says:

December 2, 2007 at 12:32 pm

I found this really interesting. Most of my focus with Regexps has been trying to make them work, not work quickly! I can see a situation in which this sort of optimization could make a program a lot better.

Wow, I just looked at some older pages on your site. You have written a lot about this stuff–a lot of stuff to read.
Mark says:

December 2, 2007 at 1:16 pm

Additional rule, that precedes the others:

→ Write a clear, easily understood regex unless your profiler shows a need to optimize it.
Steve says:

December 2, 2007 at 2:16 pm

@Mark, I apply that principal to other code I write, but not to regular expressions. My optimization priorities for regexes tend to go, in order, correctness, efficiency, then readability (although of course I’ll mix it up depending on the situation). If you get in the habit of writing regexes with general performance optimization principles in mind, it will soon come naturally with very little extra effort. N00bs can’t read a non-trivial regex anyway. A regex master with a comment or two to guide them will usually have no trouble.
Ash says:

December 2, 2007 at 5:49 pm

You might want to note that when there’s a constant string on the other side of your * quantifier, an optimized regex engine (like perl’s) will jump to an instance of that constant ‘>’ and attempt to match there, skipping the middle, unlike what your traditional regular expression engine debugger shows.
Steve says:

December 2, 2007 at 7:04 pm

@Ash, yes, the debugger output shows the traditional behavior without factoring in optimizations that certain libraries have which apply in certain circumstances. Otherwise, it would be very difficult to talk about this stuff in a consistent and easy-to-follow way. Perl has many internal optimizations that other regex libraries don’t have, and vice versa, but Perl in particular has optimized lazy quantification more than most other libraries. Nevertheless, the principles in this post still apply there.
Scofield says:

December 13, 2009 at 1:21 am

Quote:
with lazy quantification the engine backtracks forward after each character which is not followed by “>”.

I consider “forward” shuold be replaced by “backward”.
When the engine backtracks, it will RETURN to the last choice point.

English is not my native language. I feel it is really confused.
Thanks!
Steven Levithan says:

December 13, 2009 at 2:19 am

@Scofield, the use of “forward” was intentional. With lazy quantification, the last choice point is always at the current position in the match, so “returning” to the last choice point does not cause any retreat into the matched characters. The match advances via backtracking–hence it “backtracks forward.” This is indeed the opposite of how greedy quantification works and what is intuitively implied by the term “backtracking.” Hopefully the RegexBuddy-debugger-style match steps listed in this post help to make this clear.
Scofield says:

December 14, 2009 at 12:26 am

I use the pattern “123|789 ” to match the string “US123456789” (Just for understanding RegexBuddy debugger style), of course, “123” and “789” in the string both is matched.

However, I found a strange question in “Debug” pane,

Beginning match attempt at character 10
…
Beginning match attempt at character 12
…
Beginning match attempt at character 14
…
Beginning match attempt at character 16

“US123456789” ONLY has 11 letters, why could “Debug” pane display character 12, character 14 or character 16 ?

Of course, Character 11 is the digit ‘9’ in the string, however, what is character 12? character 12 doesn’t have any letter at all !
What does it mean? Thanks!
Steven Levithan says:

December 14, 2009 at 2:37 am

@Scofield, contact the maker of RegexBuddy–I have nothing to do with the software. In any case, I can’t reproduce the issue. Maybe you just need to upgrade to the latest version.
kawas says:

January 12, 2010 at 12:00 pm

Very interesting !!!
It saves me 760 secondes on a large file parsing.

Thx
ridgerunner says:

May 3, 2010 at 11:32 am

How did you get the RegexBuddy debug info into HTML?
eyelidlessness says:

May 18, 2010 at 3:07 am

I just encountered a claim that basically mirrors this article, and found the claim counter-intuitive. So I searched and found this article, and I’m frankly surprised. It seemed to me that a competent regex engine would automatically optimize a pattern like <.*?> to be <[^>]*> because it’s identical in practice.

Is there any reason, with a pattern that simple, that any engine would opt not to optimize it that way? Or is it just a matter of the work not having been done?
Steven Levithan says:

May 18, 2010 at 5:22 am

@ridgerunner, manual labor.

@eyelidlessness, it’s mostly a matter of the work not being done. Some optimized regex engines do make them equivalent in simple cases like this, but keep in mind that backtracking can make <.*?> and <[^>]*> not equivalent when they are part of a larger pattern–i.e., backtracking due to failure in a latter part of the pattern may force the dot from .*? to match a > character. Correctly handling this requires optimizations like the one you describe to be able to figure out special cases like this and properly handle them.
eyelidlessness says:

May 18, 2010 at 2:27 pm

“backtracking due to failure in a latter part of the pattern may force the dot from .*? to match a > character”

Huh. I’m having trouble imagining a pattern that would do this. If you have a moment, is there any chance you could provide an example?

Thanks for taking your time, for answering questions as well as maintaining this blog (which contains a wealth of information) as well as your XRegExp library. I don’t think I’ve ever commented before this post, but I’ve read quite a lot of the blog.
Chris says:

June 7, 2010 at 1:45 pm

Actually, you don’t even need a larger pattern. “<>>” is matched in its entirety by “<.*?>” but not by “<[^>]*>”.
Pingback: Regex: Cuidado Com os Quantificadores Gananciosos! | DevBlog – Fabiano Sobreira
dima says:

April 3, 2011 at 8:49 am

/(.*)\.(.*)/
/^(.*)\.([^.]+)$/ — it will be faster
Pingback: Regex greedy parse direction
Steven Levithan says:

September 1, 2012 at 1:24 am

@Chris, thanks for the reduced example. Another reason that the two regexes aren’t equivalent is because the dot doesn’t match newlines (unless you use flag s).
Pingback: Practical Tips on Writing an Effective Web Crawler | Guanlan Dai
ssiddique says:

June 2, 2015 at 11:50 am

Hi,

I got a query that do backtracking and search and replace in Perl.

s/(\])\||\|(\w)|\|$/$1\$+/g. When I use it in script, it works for 15 stings but on 16th it takes forever. Is there some kind of buffer problem that we are using to store $1, $2?
Tony Rudd says:

June 17, 2019 at 9:34 am

Thanks, a very interesting article. What I do not understand is why the lazy needs to backtrack? Why does it not suffice to stop when the first match occurs?
Praveen says:

May 6, 2020 at 12:32 am

Clear and precise. Thank you so much.
Pingback: Regex: matching up to the first occurrence of a character
Pingback: Regex for a username increases CPU consumption
André Antunes da Cunha says:

September 29, 2022 at 6:39 pm

Just wrote a silly browser performance test for negative vs lazy (non-greedy):
https://www.measurethat.net/Benchmarks/Show/21302/0/regexp-negative-vs-lazy-non-greedy

26 thoughts on “Performance of Greedy vs. Lazy Regex Quantifiers”

Leave a Reply