Flagrant Badassery

A JavaScript and regular expression centric blog

Non-Participating Groups: A Cross-Browser Mess

Cross-browser issues surrounding the handling of regular expression non-participating capturing groups (which I'll call NPCGs) present several challenges. The standard sucks to begin with, and the three biggest browsers (IE, Firefox, Safari) each disrespect the rules in their own unique ways.

First, I should explain what NPCGs are, since even some experienced regex users seem to be unaware of the concept or don't fully understand it. Assuming you're already familiar with capturing and non-capturing parentheses (see this page if you need a refresher), note that NPCGs are different from groups that capture a zero-length value (i.e., an empty string). This is probably easiest to explain with some examples…

The following regexes all potentially contain NPCGs (depending on the data they are run over), because the capturing groups are not required to participate:

  • /(x)?/
  • /(x)*/
  • /(x){0,2}/
  • /(x)|(y)/ — If this matches, it's guaranteed to contain exactly one NPCG.
  • /(?!(x))/ — If this matches (which, on its own, it will at least at the end of the string), it's guaranteed to contain an NPCG, because the pattern only succeeds if the match of "x" fails.
  • /()??/ — This is guaranteed to match within any string and contain an NPCG, because of the use of a lazy ?? quantifier on a capturing group for a zero-length value.

On the other hand, these will never contain an NPCG, because although they are allowed to match a zero-length value, the capturing groups are required to participate:

  • /(x?)/
  • /(x*)/
  • /(x{0,2})/
  • /((?:xx)?)/ –or– /(xx|)/ — These two are equivalent.
  • /()?/ –or– /(x?)?/ — These are not required to participate, but their greedy ? quantifiers ensure they will always succeed at capturing at least an empty string.

So, what's the difference between an NPCG and a group which captures an empty string? I guess that's up to the regex library, but typically, backreferences to NPCGs are assigned a special null or undefined value.
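To make the distinction concrete, here is a minimal sketch that runs one pattern of each kind against the same string. The variable names are only illustrative, and the comments describe what a spec-compliant engine reports (older browsers may differ, which is the subject of the rest of this post).

```javascript
// Group 1 is optional and gets skipped: it does not participate in the match.
const nonParticipating = /(x)?y/.exec("y");

// Group 1 always participates; here it captures a zero-length string.
const emptyCapture = /(x?)y/.exec("y");

// Both patterns match "y", but they report group 1 differently.
console.log(nonParticipating[1] === undefined); // true in a spec-compliant engine
console.log(emptyCapture[1] === "");            // true
```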

Following are the ECMA-262v3 rules (paraphrased) for how NPCGs should be handled in JavaScript:

  • Within a regex, backreferences to NPCGs match an empty string (i.e., the backreferences always succeed). This is unfortunate, since it prevents some fancy patterns which would otherwise be possible (e.g., see my method for mimicking conditionals), and it's atypical compared to many other regular expression engines including Perl 5 (which ECMA-standard regular expressions are supposedly based on), PCRE, .NET, Java, Python, Ruby, JGsoft, and others.
  • Within a replacement string, backreferences to NPCGs produce an empty string (i.e., nothing). Unlike the previous point, this is typical elsewhere, and allows you to use a regex like /a(b)|c(d)/ and replace it with "$1$2" without having to worry about null pointers or errors about non-participating groups.
  • In the result arrays from RegExp.prototype.exec, String.prototype.match (when used with a non-global regex), String.prototype.split, and the arguments available to callback functions with String.prototype.replace, NPCGs return undefined. This is a very logical approach.

References: ECMA-262v3 sections 15.5.4.11, 15.5.4.14, 15.10.2.1, 15.10.2.3, 15.10.2.8, 15.10.2.9.
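The three rules above can be sketched as follows. This assumes an engine that follows the specification. The bracketed replacement text and the variable names are illustrative choices rather than anything prescribed by the standard.

```javascript
// Rule 1: inside the regex, \1 refers to a group that did not participate,
// yet the backreference still succeeds by matching an empty string.
const backrefSucceeds = /(x)?\1y/.test("y");

// Rule 2: in a replacement string, a backreference to an NPCG produces nothing.
// Each match fills only one of the two groups; the other contributes "".
const replaced = "ab cd".replace(/a(b)|c(d)/g, "[$1$2]");

// Rule 3: in result arrays, an NPCG shows up as undefined.
const execResult = /a(b)|c(d)/.exec("cd");

console.log(backrefSucceeds); // true
console.log(replaced);        // "[b] [d]"
console.log(execResult[1]);   // undefined
console.log(execResult[2]);   // "d"
```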

Unfortunately, actual browser handling of NPCGs is all over the place, resulting in numerous cross-browser differences which can easily result in subtle (or not so subtle) bugs in your code if you don't know what you're doing. E.g., Firefox incorrectly uses an empty string with the replace() and split() methods, but correctly uses undefined with the exec() method. Conversely, IE correctly uses undefined with the replace() method, incorrectly uses an empty string with the exec() method, and incorrectly returns neither with the split() method since it doesn't splice backreferences into the resulting array. As for the handling of backreferences to non-participating groups within regexes (e.g., /(x)?\1y/.test("y")), Safari uses the more sensible, non-ECMA-compliant approach (returning false for the previous bit of code), while IE, Firefox, and Opera follow the standard. (If you use /(x?)\1y/.test("y") instead, all four browsers will correctly return true.)
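If you need to know which behavior you are dealing with, one option is to test for it at runtime rather than guess from the browser name. The following is only a sketch: detectNpcgSupport and its property names are hypothetical, and each check simply compares one expression against the result the standard calls for.

```javascript
// Hypothetical helper: reports, per feature, whether the engine follows
// the ECMA-262v3 rules for non-participating capturing groups.
function detectNpcgSupport() {
  return {
    // exec() should report undefined for a group that did not participate.
    execReturnsUndefined: /()??/.exec("")[1] === undefined,
    // A backreference to an NPCG should match the empty string.
    backrefMatchesEmpty: /(x)?\1y/.test("y"),
    // split() should splice captured groups (including undefined) into the result.
    splitIncludesGroups: "y".split(/(x)?y/).length === 3,
    // A replace() callback should receive undefined for an NPCG.
    replaceCallbackGetsUndefined:
      "y".replace(/(x)?y/, function ($0, $1) { return typeof $1; }) === "undefined"
  };
}

const npcgSupport = detectNpcgSupport();
console.log(npcgSupport);
```

Code that depends on one of these behaviors could check the matching flag and fall back to normalizing its results when the flag is false.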

Several times I've seen people encounter these differences and diagnose them incorrectly, not having understood the root cause. A recent instance is what prompted this writeup.

Here are cross-browser results from each of the regex and regex-using methods when NPCGs have an impact on the outcome:

| Code | ECMA-262v3 | IE 5.5 – 7 | Firefox 2.0.0.6 | Opera 9.23 | Safari 3.0.3 |
|---|---|---|---|---|---|
| /(x)?\1y/.test("y") | true | true | true | true | false |
| /(x)?\1y/.exec("y") | ["y", undefined] | ["y", ""] | ["y", undefined] | ["y", undefined] | null |
| /(x)?y/.exec("y") | ["y", undefined] | ["y", ""] | ["y", undefined] | ["y", undefined] | ["y", undefined] |
| "y".match(/(x)?\1y/) | ["y", undefined] | ["y", ""] | ["y", undefined] | ["y", undefined] | null |
| "y".match(/(x)?y/) | ["y", undefined] | ["y", ""] | ["y", undefined] | ["y", undefined] | ["y", undefined] |
| "y".match(/(x)?\1y/g) | ["y"] | ["y"] | ["y"] | ["y"] | null |
| "y".split(/(x)?\1y/) | ["", undefined, ""] | [ ] | ["", "", ""] | ["", undefined, ""] | ["y"] |
| "y".split(/(x)?y/) | ["", undefined, ""] | [ ] | ["", "", ""] | ["", undefined, ""] | ["", ""] |
| "y".search(/(x)?\1y/) | 0 | 0 | 0 | 0 | -1 |
| "y".replace(/(x)?\1y/, "z") | "z" | "z" | "z" | "z" | "y" |
| "y".replace(/(x)?y/, "$1") | "" | "" | "" | "" | "" |
| "y".replace(/(x)?\1y/, function($0, $1){ return String($1); }) | "undefined" | "undefined" | "" | "undefined" | "y" |
| "y".replace(/(x)?y/, function($0, $1){ return String($1); }) | "undefined" | "undefined" | "" | "undefined" | "" |
| "y".replace(/(x)?y/, function($0, $1){ return $1; }) | "undefined" | "" | "" | "undefined" | "" |

(Run the tests in your browser.)
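The linked test page is not reproduced here. As a rough stand-in, the sketch below evaluates a handful of the expressions from the table and compares each result with the ECMA-262v3 column. The show helper, the cases list, and the output format are all assumptions made for this illustration.

```javascript
// Format a value the way the table writes it, keeping undefined visible
// (JSON.stringify alone would turn undefined array items into null).
function show(value) {
  if (Array.isArray(value)) {
    return "[" + Array.from(value, (item) => show(item)).join(", ") + "]";
  }
  return typeof value === "string" ? JSON.stringify(value) : String(value);
}

// Each entry: a label, a function that runs the expression,
// and the value from the ECMA-262v3 column of the table.
const cases = [
  ['/(x)?\\1y/.test("y")', () => /(x)?\1y/.test("y"), "true"],
  ['/(x)?\\1y/.exec("y")', () => /(x)?\1y/.exec("y"), '["y", undefined]'],
  ['"y".match(/(x)?\\1y/g)', () => "y".match(/(x)?\1y/g), '["y"]'],
  ['"y".split(/(x)?y/)', () => "y".split(/(x)?y/), '["", undefined, ""]'],
  ['"y".search(/(x)?\\1y/)', () => "y".search(/(x)?\1y/), "0"],
  ['"y".replace(/(x)?y/, "$1")', () => "y".replace(/(x)?y/, "$1"), '""'],
  ['"y".replace(/(x)?y/, fn)',
    () => "y".replace(/(x)?y/, ($0, $1) => String($1)), '"undefined"']
];

const results = cases.map(([label, run, expected]) => {
  const actual = show(run());
  return { label, expected, actual, ok: actual === expected };
});

results.forEach((r) => {
  console.log((r.ok ? "ok   " : "DIFF ") + r.label + " -> " + r.actual);
});
```

Any line marked DIFF points to an expression where the engine running the script departs from the standard's column.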

The workaround for this mess is to avoid creating any potential for non-participating capturing groups, unless you know exactly what you're doing. Although that shouldn't be necessary, NPCGs are usually easy to avoid anyway. See the examples near the top of this post.
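As a sketch of that advice, there are two common ways to follow it: restructure the pattern so the group always participates, or normalize the captured values when the pattern cannot be restructured (alternation, for example). The captureOrEmpty helper below is hypothetical and only for illustration.

```javascript
// Before: group 1 may not participate, so results can vary across browsers.
const risky = /(x)?y/;

// After: the quantifier moves inside the group, so group 1 always participates.
const safe = /(x?)y/;
const safeResult = safe.exec("y");

// Hypothetical helper for patterns that cannot be restructured:
// treat a missing capture as an empty string, whatever the engine reported.
function captureOrEmpty(match, groupIndex) {
  const value = match[groupIndex];
  return value === undefined ? "" : value;
}

const altMatch = /(x)|(y)/.exec("y");
const first = captureOrEmpty(altMatch, 1);
const second = captureOrEmpty(altMatch, 2);

console.log(safeResult[1] === ""); // true: an empty capture, not an NPCG
console.log(first === "");         // true: NPCG normalized to ""
console.log(second === "y");       // true
```

Note that moving a quantifier inside the group can change what gets captured: /(x)*/ captures only the last "x" matched, while /(x*)/ captures the whole run, so check that the restructured pattern still gives you the value you need.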

Edit (2007-08-16): I've updated this post with data from the newest versions of the listed browsers. The original data contained a few false negatives for Opera and Safari which resulted from a faulty library used to generate the results.

There Are 8 Responses So Far. »

  1. I’ve posted an issue ticket to WebKit’s bug repository, summarily referencing this post. CC while it’s hot.

    http://bugs.webkit.org/show_bug.cgi?id=14931

  2. This Mozilla/Firefox bug falls in the same area https://bugzilla.mozilla.org/show_bug.cgi?id=369778

    I’ve posted an issue ticket with Mozilla’s bug repository, summarily referencing this post. CC while it’s hot.

    https://bugzilla.mozilla.org/show_bug.cgi?id=392378

  3. Thanks for opening those tickets, Kris. I’m glad to see the WebKit developers in particular moving so quickly on these issues.

  4. Hi Steve,

    If you take a look at my screen grab http://twitpic.com/1nmutl/full

    I’m confused on two points.

    1.) Why does the 2nd line xy not match completely?

    As far as I understand it – I’m sure I must be wrong in my understanding – the back reference \1 should essentially be saying (x)?xy (so first x is optional followed by an xy because \1 is a link to the first capture group)

    2.) When I do the replacement, the $1 shouldn’t replace the matches with the first capture group because it was optional, and so should be a non participating capture group, or have I mis-understood how this all works?

    If you could possibly help explain? Very confusing :)

    Thanks!

    M.

  5. Mark:

    Back references always refer to the captures in order, whether the captures are empty or not. If the capture isn’t matched, then the backreference is just empty.

    http://rubular.com/r/yxREG6K1FT illustrates this hopefully.

  6. @Caius: Cool, that helps answer both points! In point 1.) it doesn’t match xy because \1 is actually an empty string and in point 2.) when I check the regex replacements again (http://twitpic.com/1nmutl/full) I now understand what’s happening…

    - y is replaced with a . followed by nothing because the $1 capture group
    didn’t participate in the matching of y

    - xy is replaced with x. followed by nothing because again the $1 capture group didn’t participate in the match of y

    - xxy is replaced with .x because x IS the $1 captured value, as the capture group did participate in the match of xxy

    Excellent! Wow that was a bit of a head scramble but I’m glad I worked it out in the end.

    Thanks for your help Caius!

  7. @Mark McDonnell, yes, that’s right–the regex /(x)?\1y/ can only match "xxy" or "y". In the former case, "$1" in the replacement string is "x", and in the latter case, "$1" is "" (the empty string). If capturing group 1 doesn’t participate, backreference \1 matches the empty string (according to JavaScript), whereas if capturing group 1 participates, the non-optional backreference must match another "x" (since that’s what the group captured).

    Note that outside of ECMAScript and its variants (JavaScript, ActionScript, etc.), /(x)?\1y/ only matches "xxy" (no longer "y"), because if capturing group 1 doesn’t participate, the \1 backreference fails to match. However, if you changed the regex to /(x?)\1y/, now the regex matches both "xxy" and "y" in all flavors. This is because the "x" itself is now optional, rather than the capturing group wrapped around it. The capturing group is no longer optional (despite the fact that it’s satisfied by matching the empty string) and therefore participates even when the "x" is skipped. Consequently, if the "x" is skipped, the following backreference successfully matches the empty string according to normal backreferencing rules, whereas the /(x)?\1y/ regex could only do this according to funky JavaScript rules.

    If this still isn’t quite clear to you (or other readers), that just reinforces my argument that you should avoid all potential for non-participating capturing groups (as opposed to capturing groups that sometimes match the empty string) in JavaScript using the techniques shown in this post. The cross-browser bugs and lack of portability to other programming languages make them too troublesome, unless you know exactly what you’re doing.

  8. this is a good post and it did catch a problem i had. but it allowed me to see a newer problem.

    that is recursion and scope.

    function recurse(str){
        var re = /a(.)+/g;
        var s;
        var b = [];
        // run the regex over the argument, recursing on each captured value
        while(s = re.exec(str)){
            b.push(recurse(s[1]));
        }
        return b;
    }
    what happens is ie is alright. mozilla’s however do not create a new regexp in the recursed scope. the pointers are the same from the 1st scope. which means the 1st scope gets reset pointers as the 2nd scope parses the regexp….that is to say, when it returns it could be parsed out and start again. i just made up this example on the fly but it really says what is happening here from a script i use that has multiple recursions using exec parsing.

    i have not tried it with new RegExp as i rarely use it. i assume it could cure the mozilla scope issue tho.
