Flagrant Badassery

A JavaScript and regular expression centric blog

Performance of Greedy vs. Lazy Regex Quantifiers

A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, but with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the impreciseness of their patterns, often without realizing this is what they're doing. Hence, the previously mentioned misconception stems from that fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.

Consider the following simple example: When the regexes <.*?> and <[^>]*> are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around the ideas of reducing backtracking and the steps which are potentially required to match or fail at any given character position, and here we've reduced both cases to the absolute minimum.

So what exactly is different about how a backtracking, leftmost-first regex engine (your traditional NFA engine type, which describes most modern, Perl-derivative flavors) goes about matching our subject text when working with these two regexes? Let's first take a look at what happens with <.*?>. I'll use the RegexBuddy debugger style to illustrate each step.

Note: If you're reading this in a feed reader or aggregator which strips out inline styles, see the original post, which should make things much easier to follow.

  1. Beginning match attempt at character 0
  2. <
  3. <ok
  4. <backtrack
  5. <0
  6. <0backtrack
  7. <01
  8. <01backtrack
  9. <012
  10. <012backtrack
  11. <0123
  12. <0123backtrack
  13. <01234
  14. <01234backtrack
  15. <012345
  16. <012345backtrack
  17. <0123456
  18. <0123456backtrack
  19. <01234567
  20. <01234567backtrack
  21. <012345678
  22. <012345678backtrack
  23. <0123456789
  24. <0123456789>
  25. Match found

The text ok means the engine completed a successful, zero-length match, and backtrack shows where the engine backtracks to the last choice point. As you can see here, with lazy quantification the engine backtracks forward after each character which is not followed by ">".

Here's what happens with <[^>]*>:

  1. Beginning match attempt at character 0
  2. <
  3. <0123456789
  4. <0123456789>
  5. Match found

As you can see, 20 fewer steps are required, and there is no backtracking. This is faster, and can also increase the potential for internal optimization of the matching process. As a simple example, when greedily quantifying an any-character token, the engine can simply move the read-head to the end of the string, rather than testing every character along the way (this is an oversimplification of the process taken by modern regex engines, but you get the idea). You can see this exact kind of internal optimization occurring in IE with the "trim8" pattern in my Faster JavaScript Trim post from a while back.

You might be wondering what happens if we use the regex <.*>. Take a look:

  1. Beginning match attempt at character 0
  2. <
  3. <0123456789>
  4. <0123456789>backtrack
  5. <0123456789
  6. <0123456789>
  7. Match found

Although the output didn't change much when used with our subject string of "<0123456789>", you can see that the .* caused the engine to trek all the way to the end of the string, then backtrack from the right. If we'd used a 10,000-character string without newlines or other ">" characters, that would've resulted in nearly 10,000 backtracks.

In short:

  • Be careful about using the more appropriate type of quantification (to your best approximation, since this often depends on the subject string) even when it has no impact on the text matched by your regex.
  • Try to make your patterns as explicit as possible, to better guide the regex engine and avoid backtracking.
  • If possible, avoid greedy quantification of the dot and other needlessly flexible tokens.

There Are 21 Responses So Far. »

  1. I found this really interesting. Most of my focus with Regexps has been trying to make them work, not work quickly! I can see a situation in which this sort of optimization could make a program a lot better.

    Wow, I just looked at some older pages on your site. You have written a lot about this stuff–a lot of stuff to read.

  2. Additional rule, that precedes the others:

    → Write a clear, easily understood regex unless your profiler shows a need to optimize it.

  3. @Mark, I apply that principal to other code I write, but not to regular expressions. My optimization priorities for regexes tend to go, in order, correctness, efficiency, then readability (although of course I’ll mix it up depending on the situation). If you get in the habit of writing regexes with general performance optimization principles in mind, it will soon come naturally with very little extra effort. N00bs can’t read a non-trivial regex anyway. A regex master with a comment or two to guide them will usually have no trouble.

  4. You might want to note that when there’s a constant string on the other side of your * quantifier, an optimized regex engine (like perl’s) will jump to an instance of that constant ‘>’ and attempt to match there, skipping the middle, unlike what your traditional regular expression engine debugger shows.

  5. @Ash, yes, the debugger output shows the traditional behavior without factoring in optimizations that certain libraries have which apply in certain circumstances. Otherwise, it would be very difficult to talk about this stuff in a consistent and easy-to-follow way. Perl has many internal optimizations that other regex libraries don’t have, and vice versa, but Perl in particular has optimized lazy quantification more than most other libraries. Nevertheless, the principles in this post still apply there.

  6. Quote:
    with lazy quantification the engine backtracks forward after each character which is not followed by “>”.

    I consider “forward” shuold be replaced by “backward”.
    When the engine backtracks, it will RETURN to the last choice point.

    English is not my native language. I feel it is really confused.
    Thanks!

  7. @Scofield, the use of “forward” was intentional. With lazy quantification, the last choice point is always at the current position in the match, so “returning” to the last choice point does not cause any retreat into the matched characters. The match advances via backtracking–hence it “backtracks forward.” This is indeed the opposite of how greedy quantification works and what is intuitively implied by the term “backtracking.” Hopefully the RegexBuddy-debugger-style match steps listed in this post help to make this clear.

  8. I use the pattern “123|789 ” to match the string “US123456789″ (Just for understanding RegexBuddy debugger style), of course, “123″ and “789″ in the string both is matched.

    However, I found a strange question in “Debug” pane,

    Beginning match attempt at character 10

    Beginning match attempt at character 12

    Beginning match attempt at character 14

    Beginning match attempt at character 16

    “US123456789″ ONLY has 11 letters, why could “Debug” pane display character 12, character 14 or character 16 ?

    Of course, Character 11 is the digit ’9′ in the string, however, what is character 12? character 12 doesn’t have any letter at all !
    What does it mean? Thanks!

  9. @Scofield, contact the maker of RegexBuddy–I have nothing to do with the software. In any case, I can’t reproduce the issue. Maybe you just need to upgrade to the latest version.

  10. Very interesting !!!
    It saves me 760 secondes on a large file parsing.

    Thx

  11. How did you get the RegexBuddy debug info into HTML?

  12. I just encountered a claim that basically mirrors this article, and found the claim counter-intuitive. So I searched and found this article, and I’m frankly surprised. It seemed to me that a competent regex engine would automatically optimize a pattern like <.*?> to be <[^>]*> because it’s identical in practice.

    Is there any reason, with a pattern that simple, that any engine would opt not to optimize it that way? Or is it just a matter of the work not having been done?

  13. @ridgerunner, manual labor.

    @eyelidlessness, it’s mostly a matter of the work not being done. Some optimized regex engines do make them equivalent in simple cases like this, but keep in mind that backtracking can make <.*?> and <[^>]*> not equivalent when they are part of a larger pattern–i.e., backtracking due to failure in a latter part of the pattern may force the dot from .*? to match a > character. Correctly handling this requires optimizations like the one you describe to be able to figure out special cases like this and properly handle them.

  14. “backtracking due to failure in a latter part of the pattern may force the dot from .*? to match a > character”

    Huh. I’m having trouble imagining a pattern that would do this. If you have a moment, is there any chance you could provide an example?

    Thanks for taking your time, for answering questions as well as maintaining this blog (which contains a wealth of information) as well as your XRegExp library. I don’t think I’ve ever commented before this post, but I’ve read quite a lot of the blog.

  15. Actually, you don’t even need a larger pattern. “<>>” is matched in its entirety by “<.*?>” but not by “<[^>]*>”.

  16. [...] Isto permite o motor seguir em frente enquanto o segundo caractere de apas não é encontrado. Quando encontrado a busca termina sem retornos. A operação falha retornando nossa URL completa. Este tipo de busca oferece excelente desempenho. Mais informações aqui. [...]

  17. /(.*)\.(.*)/
    /^(.*)\.([^.]+)$/ — it will be faster

  18. [...] ???http://blog.stevenlevithan.com/archives/greedy-lazy-performance [...]

  19. [...] also see this article: Performance of Greedy vs. Lazy Regex Quantifiers [...]

  20. @Chris, thanks for the reduced example. Another reason that the two regexes aren’t equivalent is because the dot doesn’t match newlines (unless you use flag s).

  21. [...] Performance of Greedy vs. Lazy Regex Quantifiers, Steven Levithan [...]

Post a Response

If you are about to post code, please escape your HTML entities (&amp;, &gt;, &lt;).