Performance of Greedy vs. Lazy Regex Quantifiers
A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, but with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the imprecision of their patterns, often without realizing this is what they're doing. Hence, the previously mentioned misconception stems from the fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.
Consider the following simple example: When the regexes `<.*?>` and `<[^>]*>` are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around the ideas of reducing backtracking and the steps which are potentially required to match or fail at any given character position, and here we've reduced both cases to the absolute minimum.
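As a quick sanity check, here's a short JavaScript sketch (any flavor would behave the same for this particular subject string) showing that equivalence:

```js
// Both patterns find the same match in this subject string; only the path
// the engine takes to get there differs.
const subject = "<0123456789>";

console.log(subject.match(/<.*?>/)[0]);   // "<0123456789>"
console.log(subject.match(/<[^>]*>/)[0]); // "<0123456789>"
```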
So what exactly is different about how a backtracking, leftmost-first regex engine (your traditional NFA engine type, which describes most modern, Perl-derivative flavors) goes about matching our subject text when working with these two regexes? Let's first take a look at what happens with `<.*?>`. I'll use the RegexBuddy debugger style to illustrate each step.
Note: If you're reading this in a feed reader or aggregator which strips out inline styles, see the original post, which should make things much easier to follow.
- Beginning match attempt at character 0
- <
- < (ok)
- < (backtrack)
- <0
- <0 (backtrack)
- <01
- <01 (backtrack)
- <012
- <012 (backtrack)
- <0123
- <0123 (backtrack)
- <01234
- <01234 (backtrack)
- <012345
- <012345 (backtrack)
- <0123456
- <0123456 (backtrack)
- <01234567
- <01234567 (backtrack)
- <012345678
- <012345678 (backtrack)
- <0123456789
- <0123456789>
- Match found
The annotation (ok) means the engine completed a successful, zero-length match, and (backtrack) shows where the engine backtracks to the last choice point. As you can see here, with lazy quantification the engine backtracks forward after each character which is not followed by ">".
Here's what happens with `<[^>]*>`:
- Beginning match attempt at character 0
- <
- <0123456789
- <0123456789>
- Match found
As you can see, 20 fewer steps are required, and there is no backtracking. This is faster, and can also increase the potential for internal optimization of the matching process. As a simple example, when greedily quantifying an any-character token, the engine can simply move the read-head to the end of the string, rather than testing every character along the way (this is an oversimplification of the process taken by modern regex engines, but you get the idea). You can see this exact kind of internal optimization occurring in IE with the "trim8" pattern in my Faster JavaScript Trim post from a while back.
You might be wondering what happens if we use the regex `<.*>`. Take a look:
- Beginning match attempt at character 0
- <
- <0123456789>
- <0123456789> (backtrack)
- <0123456789
- <0123456789>
- Match found
Although the output didn't change much when used with our subject string of "<0123456789>", you can see that the `.*` caused the engine to trek all the way to the end of the string, then backtrack from the right. If we'd used a 10,000-character string without newlines or other ">" characters, that would've resulted in nearly 10,000 backtracks.
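If you want to see the cost for yourself, a rough benchmark sketch along these lines will do (the subject string and iteration count here are arbitrary choices for illustration, and the absolute numbers will vary widely by engine, since some internally optimize one or more of these patterns):

```js
// Compare the three patterns against a long subject whose only ">" is at the end.
// The absolute numbers are meaningless; the relative cost of the backtracking-heavy
// patterns is the interesting part.
const longSubject = "<" + "a".repeat(10000) + ">";
const patterns = [/<.*?>/, /<[^>]*>/, /<.*>/];

for (const re of patterns) {
  const start = Date.now();
  for (let i = 0; i < 10000; i++) {
    re.exec(longSubject);
  }
  console.log(re.source + ": " + (Date.now() - start) + "ms");
}
```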
In short:
- Use the more appropriate type of quantification (to your best approximation, since this often depends on the subject string) even when it has no impact on the text matched by your regex.
- Try to make your patterns as explicit as possible, to better guide the regex engine and avoid backtracking.
- If possible, avoid greedy quantification of the dot and other needlessly flexible tokens (see the short example after this list).
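To make the last two tips concrete, here's a small JavaScript sketch using the classic quoted-string case (the sample text is arbitrary):

```js
// Extract a double-quoted token. The negated class says exactly what we mean
// ("anything that isn't a closing quote"), so the engine never needs to backtrack
// through the quoted content, whether the overall match succeeds or fails.
const text = 'before "quoted value" after';

console.log(text.match(/".*?"/)[0]);   // "quoted value" (works, but backtracks forward character by character)
console.log(text.match(/"[^"]*"/)[0]); // "quoted value" (no backtracking needed)
```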
Comment by Sam on 2 December 2007:
I found this really interesting. Most of my focus with Regexps has been trying to make them work, not work quickly! I can see a situation in which this sort of optimization could make a program a lot better.
Wow, I just looked at some older pages on your site. You have written a lot about this stuff–a lot of stuff to read.
Comment by Mark on 2 December 2007:
An additional rule, which precedes the others:
→ Write a clear, easily understood regex unless your profiler shows a need to optimize it.
Comment by Steve on 2 December 2007:
@Mark, I apply that principle to other code I write, but not to regular expressions. My optimization priorities for regexes tend to go, in order, correctness, efficiency, then readability (although of course I’ll mix it up depending on the situation). If you get in the habit of writing regexes with general performance optimization principles in mind, it will soon come naturally with very little extra effort. N00bs can’t read a non-trivial regex anyway. A regex master with a comment or two to guide them will usually have no trouble.
Comment by Ash on 2 December 2007:
You might want to note that when there’s a constant string on the other side of your * quantifier, an optimized regex engine (like perl’s) will jump to an instance of that constant ‘>’ and attempt to match there, skipping the middle, unlike what your traditional regular expression engine debugger shows.
Comment by Steve on 2 December 2007:
@Ash, yes, the debugger output shows the traditional behavior without factoring in optimizations that certain libraries have which apply in certain circumstances. Otherwise, it would be very difficult to talk about this stuff in a consistent and easy-to-follow way. Perl has many internal optimizations that other regex libraries don’t have, and vice versa, but Perl in particular has optimized lazy quantification more than most other libraries. Nevertheless, the principles in this post still apply there.
Comment by Scofield on 13 December 2009:
Quote:
with lazy quantification the engine backtracks forward after each character which is not followed by “>”.
I think “forward” should be replaced by “backward”.
When the engine backtracks, it will RETURN to the last choice point.
English is not my native language. I find it really confusing.
Thanks!
Comment by Steven Levithan on 13 December 2009:
@Scofield, the use of “forward” was intentional. With lazy quantification, the last choice point is always at the current position in the match, so “returning” to the last choice point does not cause any retreat into the matched characters. The match advances via backtracking–hence it “backtracks forward.” This is indeed the opposite of how greedy quantification works and what is intuitively implied by the term “backtracking.” Hopefully the RegexBuddy-debugger-style match steps listed in this post help to make this clear.
Comment by Scofield on 14 December 2009:
I used the pattern “123|789 ” to match the string “US123456789” (just to understand the RegexBuddy debugger style); of course, both “123” and “789” in the string are matched.
However, I found something strange in the “Debug” pane:
Beginning match attempt at character 10
…
Beginning match attempt at character 12
…
Beginning match attempt at character 14
…
Beginning match attempt at character 16
“US123456789” ONLY has 11 characters, so why does the “Debug” pane display character 12, character 14, or character 16?
Of course, character 11 is the digit ‘9’ in the string, but what is character 12? Character 12 doesn’t have any character at all!
What does it mean? Thanks!
Comment by Steven Levithan on 14 December 2009:
@Scofield, contact the maker of RegexBuddy–I have nothing to do with the software. In any case, I can’t reproduce the issue. Maybe you just need to upgrade to the latest version.
Comment by kawas on 12 January 2010:
Very interesting !!!
It saved me 760 seconds when parsing a large file.
Thx
Comment by ridgerunner on 3 May 2010:
How did you get the RegexBuddy debug info into HTML?
Comment by eyelidlessness on 18 May 2010:
I just encountered a claim that basically mirrors this article, and found the claim counter-intuitive. So I searched and found this article, and I’m frankly surprised. It seemed to me that a competent regex engine would automatically optimize a pattern like `<.*?>` to be `<[^>]*>` because it’s identical in practice. Is there any reason, with a pattern that simple, that any engine would opt not to optimize it that way? Or is it just a matter of the work not having been done?
Comment by Steven Levithan on 18 May 2010:
@ridgerunner, manual labor.
@eyelidlessness, it’s mostly a matter of the work not being done. Some optimized regex engines do make them equivalent in simple cases like this, but keep in mind that backtracking can make `<.*?>` and `<[^>]*>` not equivalent when they are part of a larger pattern–i.e., backtracking due to failure in a latter part of the pattern may force the dot from `.*?` to match a `>` character. Correctly handling this requires optimizations like the one you describe to be able to figure out special cases like this and properly handle them.
Comment by eyelidlessness on 18 May 2010:
“backtracking due to failure in a latter part of the pattern may force the dot from `.*?` to match a `>` character”
Huh. I’m having trouble imagining a pattern that would do this. If you have a moment, is there any chance you could provide an example?
Thanks for taking the time to answer questions and to maintain this blog (which contains a wealth of information), as well as your XRegExp library. I don’t think I’ve ever commented before this post, but I’ve read quite a lot of the blog.
Comment by Chris on 7 June 2010:
Actually, you don’t even need a larger pattern. “<>>” is matched in its entirety by “<.*?>” but not by “<[^>]*>”.
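Taking “matched in its entirety” to mean a full-string (anchored) match, a quick JavaScript check shows the difference Chris describes:

```js
// Anchored so that "matched in its entirety" means the whole string must match.
console.log(/^<.*?>$/.test("<>>"));   // true:  backtracking forces .*? to consume the first ">"
console.log(/^<[^>]*>$/.test("<>>")); // false: [^>]* can never include a ">", so the match fails
```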
Pingback by Regex: Cuidado Com os Quantificadores Gananciosos! | DevBlog – Fabiano Sobreira on 26 November 2010:
[…] This lets the engine keep moving forward as long as the second quote character has not been found. When it is found, the search ends without backtracking. The operation fails, returning our complete URL. This type of search offers excellent performance. More information here. […]
Comment by dima on 3 April 2011:
/(.*)\.(.*)/
/^(.*)\.([^.]+)$/ — it will be faster
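For anyone puzzling over dima's shorthand: both patterns split a file name from its extension and capture the same groups on a successful match, but the anchors and the explicit [^.]+ in the second one reduce backtracking and rule out retries at later start positions when no match is possible. A small JavaScript sketch (with an arbitrary example string):

```js
// Split a file name from its extension with each of dima's patterns.
const fileName = "archive.tar.gz";

console.log(fileName.match(/(.*)\.(.*)/).slice(0, 3));      // ["archive.tar.gz", "archive.tar", "gz"]
console.log(fileName.match(/^(.*)\.([^.]+)$/).slice(0, 3)); // ["archive.tar.gz", "archive.tar", "gz"]
```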
Pingback by Regex greedy parse direction on 15 June 2012:
[…] also see this article: Performance of Greedy vs. Lazy Regex Quantifiers […]
Comment by Steven Levithan on 1 September 2012:
@Chris, thanks for the reduced example. Another reason that the two regexes aren’t equivalent is because the dot doesn’t match newlines (unless you use flag `s`).
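In modern JavaScript, for instance (a small sketch; the `s` flag is the dotAll modifier):

```js
console.log(/<.*?>/.test("<\n>"));   // false: the dot does not match the newline
console.log(/<[^>]*>/.test("<\n>")); // true:  [^>] matches any character except ">", including \n
console.log(/<.*?>/s.test("<\n>"));  // true:  with the s (dotAll) flag, the dot matches newlines too
```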
Pingback by Practical Tips on Writing an Effective Web Crawler | Guanlan Dai on 18 February 2013:
[…] Performance of Greedy vs. Lazy Regex Quantifiers, Steven Levithan […]
Comment by ssiddique on 2 June 2015:
Hi,
I have a question about backtracking with a search and replace in Perl:
s/(\])\||\|(\w)|\|$/$1\$+/g. When I use it in a script, it works for 15 strings, but on the 16th it takes forever. Is there some kind of buffer problem with how $1 and $2 are stored?
Comment by Tony Rudd on 17 June 2019:
Thanks, a very interesting article. What I do not understand is why the lazy quantifier needs to backtrack. Why doesn’t it suffice to stop when the first match occurs?
Comment by Praveen on 6 May 2020:
Clear and precise. Thank you so much.
Pingback by FIXBBS on 15 December 2020:
[…] Performance of Greedy vs. Lazy Regex Quantifiers […]