Flagrant Badassery

A JavaScript and regular expression centric blog

A JScript/VBScript Regex Lookahead Bug

Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via ?, *, {0,n}, or (.|), but not +, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript:

/(?=a?b)ab/.test("ab");
// Should return true, but IE 5.5 – 8b1 return false

/(?=a?b)ab/.test("abc");
// Correctly returns true (even in IE), although the
// added "c" does not take part in the match

I've been aware of this bug for a couple of years, thanks to a blog post by Michael Ash that describes the bug with a password-complexity regex. However, the bug description there is incomplete and subtly incorrect, as shown by the reduced test case above. To be honest, although the errant behavior is predictable, it's a bit tricky to describe because I haven't yet figured out exactly what's happening internally. I'd recommend playing with variations of the above code to get a better understanding of the problem.
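One way to play with variations is a small harness like the sketch below. The names `lookaheadCases` and `findFailingCases` are made up for illustration, and the list of patterns is only my guess at a useful starting set based on the constructs named above. Every case should match in a spec-compliant engine, so anything the function returns points at an engine problem.

```javascript
// Each case pairs a regex with a subject string. Per the ECMAScript spec,
// all of them should match. According to the description above, the first
// four forms (optional elision inside the lookahead) can trigger the IE bug,
// while the last two are expected to pass even in IE.
var lookaheadCases = [
  { label: "? quantifier",       regex: /(?=a?b)ab/,     subject: "ab" },
  { label: "* quantifier",       regex: /(?=a*b)ab/,     subject: "ab" },
  { label: "{0,n} quantifier",   regex: /(?=a{0,2}b)ab/, subject: "ab" },
  { label: "empty alternative",  regex: /(?=(a|)b)ab/,   subject: "ab" },
  { label: "+ quantifier",       regex: /(?=a+b)ab/,     subject: "ab" },
  { label: "trailing character", regex: /(?=a?b)ab/,     subject: "abc" }
];

// Returns the labels of all cases whose regex fails to match its subject.
function findFailingCases(cases) {
  var failures = [];
  for (var i = 0; i < cases.length; i++) {
    if (!cases[i].regex.test(cases[i].subject)) {
      failures.push(cases[i].label);
    }
  }
  return failures;
}
```

Calling `findFailingCases(lookaheadCases)` in the browser you care about gives a quick list of the constructs that misbehave there. Add your own patterns to the array as you narrow things down.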

Fortunately, since the bug is predictable, it's usually possible to work around it. For example, you can avoid the bug with the password regex in Michael's post (/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/) by writing it as /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/ (the .{8,15}$ lookahead must come first here). The important thing is to be aware of the issue, because it can easily introduce latent, difficult-to-diagnose bugs into your code. Just remember that it shows up with variable-length lookahead. If you're using such patterns, test the hell out of them in IE.
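As a sanity check on that rewrite, here is a minimal sketch that compares the two password regexes over a handful of inputs. The function name `findDisagreements` and the sample passwords are my own assumptions, not taken from Michael's post. In a spec-compliant engine the two patterns should agree on every input; in an affected IE version, a disagreement would show where the original pattern goes wrong.

```javascript
// The regex from Michael's post and the reordered workaround.
var originalRegex  = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/;
var reorderedRegex = /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/;

// Illustrative inputs: valid, too short, missing a character class, too long.
var samplePasswords = [
  "Passw0rd",          // 8 chars with digit, lowercase, uppercase
  "Sh0rt",             // too short
  "alllowercase1",     // no uppercase letter
  "ALLUPPERCASE1",     // no lowercase letter
  "NoDigitsHere",      // no digit
  "Abcdefgh1234567",   // exactly 15 chars
  "Abcdefgh12345678"   // 16 chars, too long
];

// Returns every input on which the two regexes give different answers.
function findDisagreements(regexA, regexB, inputs) {
  var disagreements = [];
  for (var i = 0; i < inputs.length; i++) {
    if (regexA.test(inputs[i]) !== regexB.test(inputs[i])) {
      disagreements.push(inputs[i]);
    }
  }
  return disagreements;
}
```

Running `findDisagreements(originalRegex, reorderedRegex, samplePasswords)` in each target browser is a cheap way to confirm the workaround behaves the same as the original is supposed to.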

There Are 4 Responses So Far.

  1. My guess is that after matching a?b, the regex engine notices it has reached the end of the data, and concludes that the following tokens can no longer match. The conclusion is incorrect because the zero-width nature of the lookahead resets the position the regex is at in the string.

    Quantifiers that make their token optional require special handling. E.g. given the regex (a?)* you don’t want the * to forever repeat the case where ? repeats zero times.

    This matches correctly:
    /(?=x?ab)ab/.test("ab");
    This matches, but captures “a” into \1 instead of “ab”:
    /(?=(ab?))ab/.test("ab");

    So it seems that the false end-of-data test only occurs when there are further backtracking positions to attempt. In the first of the two above, all permutations of x? have already been exhausted by the time the “ab” in the lookahead matches.

  2. The suggested code /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/
    does not work for ASP.NET regex validators because the underlying JavaScript does this:

    var rx = new RegExp(val.validationexpression);
    var matches = rx.exec(value);
    return (matches != null && value == matches[0]);

    Since all of these are lookahead groups, none of them consumes any text. I found that using “(?=^.{6,25}$)(?=.*[a-z])(?=.*[A-Z]).*” solves the problem: the lookaheads enforce the length, at least one lowercase letter, and at least one uppercase letter, and the .* then matches the entire text.

    Thanks for your help with the regex.

  3. @Hananiel Sarella, I think you’re misunderstanding something. There’s a .* at the end of the regex I posted, too. It should work just fine.

  4. I never paid attention to it, I guess. I was debugging this problem for a while, and although I found your regex and understood what it was doing, I didn't pay attention to the ‘.*’. Finally, after stepping through the framework JavaScript, I realized what was going on and came up with the solution you had already posted. Figures! Thanks for your good work; it saved me a lot of trouble.
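The validator check quoted in the second response can be sketched as a standalone function, which makes it easy to see why the trailing .* matters. `isFullMatch` is a hypothetical stand-in for the framework code, not the actual ASP.NET script, and the two pattern strings are just the workaround regex with and without its final .* for comparison.

```javascript
// Hypothetical stand-in for the validator logic quoted above: the value is
// accepted only if the regex matches and the match spans the entire value.
function isFullMatch(pattern, value) {
  var rx = new RegExp(pattern);
  var matches = rx.exec(value);
  return matches !== null && value === matches[0];
}

// The workaround regex as a string (backslashes doubled for the string literal).
var withTrailingDotStar = "^(?=.{8,15}$)(?=.*\\d)(?=.*[a-z])(?=.*[A-Z]).*";

// The same lookaheads with nothing after them. Lookaheads are zero-width,
// so a successful match here is always the empty string.
var lookaheadsOnly = "^(?=.{8,15}$)(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])";
```

With these definitions, `isFullMatch(withTrailingDotStar, "Passw0rd")` accepts the value because the .* consumes the whole string, while `isFullMatch(lookaheadsOnly, "Passw0rd")` rejects it even though the regex itself matches.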
