The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's lastIndex property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in a string using the RegExp.prototype.exec method. That's where I've most frequently encountered the bug, and I think it will help explain why the issue exists in the first place.
First of all, if you're not already familiar with how to use exec to iterate over a string, you're missing out on some very powerful functionality. Here's the basic construct:
var regex = /.../g,
    subject = "test",
    match = regex.exec(subject);
while (match != null) {
    // matched text: match[0]
    // match start: match.index
    // match end: regex.lastIndex
    // capturing group n: match[n]
    ...
    match = regex.exec(subject);
}
When the exec method is called for a regex that uses the /g (global) modifier, it searches from the point in the subject string specified by the regex's lastIndex property (which is initially zero, so it searches from the beginning of the string). If the exec method finds a match, it updates the regex's lastIndex property to the character index at the end of the match, and returns an array containing the matched text and any captured subexpressions. If there is no match from the point in the string where the search started, lastIndex is reset to zero, and null is returned.
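As a small illustration of how lastIndex moves, here is a sketch using a made-up regex and subject string (both are assumptions chosen for demonstration, not part of the construct above). It records the state after each of three calls to exec:

```javascript
// Hypothetical regex and subject, chosen only to illustrate lastIndex.
var regex = /\d+/g,
    subject = "a1bb22",
    log = [],
    match;

// Call exec three times: two successful matches, then a failed search.
for (var i = 0; i < 3; i++) {
    match = regex.exec(subject);
    log.push({
        text: match ? match[0] : null,
        index: match ? match.index : null,
        lastIndex: regex.lastIndex
    });
}
// log[0]: "1" found at index 1, lastIndex is now 2
// log[1]: "22" found at index 4, lastIndex is now 6
// log[2]: no match (null), lastIndex reset to 0
```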
You can tighten up the above code by moving the exec method call into the while loop's condition, like so:
var regex = /.../g,
    subject = "test",
    match;
while (match = regex.exec(subject)) {
    ...
}
This cleaner version works essentially the same as before. As soon as exec can't find any further matches and therefore returns null, the loop ends. However, there are a couple of cross-browser issues to be aware of with either version of this code. One is that if the regex contains capturing groups which do not participate in the match, some values in the returned array could be either undefined or an empty string. I've previously discussed that issue in depth in a post about what I called non-participating capturing groups.
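For a quick picture of a non-participating group, here is a minimal sketch with a made-up regex (an assumption for illustration). The group (a) is skipped because the other alternative matches:

```javascript
// Hypothetical regex: the capturing group sits in an alternative
// that is not used when the subject is "b".
var groupMatch = /(a)|b/.exec("b");

// groupMatch[0] is "b". In engines that follow the spec,
// groupMatch[1] is undefined; as noted above, some browsers
// may give an empty string instead.
var groupParticipated = groupMatch[1] !== undefined && groupMatch[1] !== "";
```

Checking for both undefined and the empty string, as groupParticipated does, is one cautious way to treat the two behaviors alike, at the cost of not distinguishing a group that genuinely matched an empty string.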
Another issue (the topic of this post) occurs when your regex matches an empty string. There are many reasons why you might allow a regex to do that, but if you can't think of any, consider cases where you're accepting regexes from an outside source. Here's a simple example of such a regex:
var regex = /^/gm,
    subject = "A\nB\nC",
    match,
    endPositions = [];
while (match = regex.exec(subject)) {
    endPositions.push(regex.lastIndex);
}
You might expect the endPositions array to be set to [0,2,4], since those are the character positions for the beginning of the string and just after each newline character. Thanks to the /m modifier, those are the positions where the regex will match; and since the regex matches empty strings, regex.lastIndex should be the same as match.index. However, Internet Explorer (tested with v5.5–7) sets endPositions to [1,3,5]. Other browsers will go into an infinite loop until you short-circuit the code.
So what's going on here? Remember that every time exec runs, it attempts to match within the subject string starting at the position specified by the lastIndex property of the regex. Since our regex matches a zero-length string, lastIndex remains exactly where we started the search. Therefore, every time through the loop our regex will match at the same position: the start of the string. Internet Explorer tries to be helpful and avoid this situation by automatically incrementing lastIndex when a zero-length string is matched. That might seem like a good idea (in fact, I've seen people adamantly argue that it is a bug that Firefox does not do the same), but it means that in Internet Explorer the lastIndex property cannot be relied on to accurately determine the ending position of a match.
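To see the stuck position without actually looping forever, here is a small sketch that calls exec just twice. The variable names are hypothetical, and the values in the comments assume an engine that leaves lastIndex untouched after a zero-length match (the non-IE behavior described above):

```javascript
var stuckRegex = /^/gm,
    stuckSubject = "A\nB\nC",
    first = stuckRegex.exec(stuckSubject),
    lastIndexAfterFirst = stuckRegex.lastIndex,
    second = stuckRegex.exec(stuckSubject);

// Assuming no automatic increment: lastIndexAfterFirst is 0, so the
// second search starts at 0 again and second.index is also 0.
// A while loop over exec would therefore never advance.
```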
We can correct this situation cross-browser with the following code:
var regex = /^/gm,
    subject = "A\nB\nC",
    match,
    endPositions = [];
while (match = regex.exec(subject)) {
    var zeroLengthMatch = !match[0].length;
    // Fix IE's incorrect lastIndex
    if (zeroLengthMatch && regex.lastIndex > match.index)
        regex.lastIndex--;
    endPositions.push(regex.lastIndex);
    // Avoid an infinite loop with zero-length matches
    if (zeroLengthMatch)
        regex.lastIndex++;
}
You can see an example of the above code in the cross-browser split method I posted a while back. Keep in mind that none of the extra code here is needed if your regex cannot possibly match an empty string.
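If you need this pattern in more than one place, the same two adjustments could be wrapped in a helper. The following is only a sketch: forEachMatch, its signature, and the decision to throw on non-global regexes are my assumptions for illustration, not an existing API.

```javascript
// Hypothetical helper (name and signature are illustrative only).
// Calls callback(match, end) for each match of a global regex,
// where end is the true ending position of the match.
function forEachMatch(regex, subject, callback) {
    var match, zeroLengthMatch;
    // Without /g, exec ignores lastIndex and the loop would never end.
    if (!regex.global) {
        throw new Error("forEachMatch requires a regex with the /g modifier");
    }
    regex.lastIndex = 0; // always start from the beginning of the subject
    while ((match = regex.exec(subject))) {
        zeroLengthMatch = !match[0].length;
        // Undo the extra increment if the engine applied one
        if (zeroLengthMatch && regex.lastIndex > match.index) {
            regex.lastIndex--;
        }
        callback(match, regex.lastIndex);
        // Step past zero-length matches to avoid an infinite loop
        if (zeroLengthMatch) {
            regex.lastIndex++;
        }
    }
}

var ends = [];
forEachMatch(/^/gm, "A\nB\nC", function (match, end) {
    ends.push(end);
});
// ends now holds 0, 2, 4
```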
Another way to deal with this issue is to use String.prototype.replace to iterate over the subject string. The replace method moves forward automatically after zero-length matches, avoiding this issue altogether. Unfortunately, in the three biggest browsers (IE, Firefox, Safari), replace doesn't seem to deal with the lastIndex property except to reset it to zero. Opera gets it right (according to my reading of the spec) and updates lastIndex along the way. Given the current situation, you can't rely on lastIndex in your code when iterating over a string using replace, but you can still easily derive the value for the end of each match. Here's an example:
var regex = /^/gm,
    subject = "A\nB\nC",
    endPositions = [];
subject.replace(regex, function (match) {
    // Not using a named argument for the index since capturing
    // groups can change its position in the list of arguments
    var index = arguments[arguments.length - 2],
        lastIndex = index + match.length;
    endPositions.push(lastIndex);
});
That's perhaps less lucid than before (since we're not actually replacing anything), but there you have it… two cross-browser ways to get around a little-known issue that could otherwise cause tricky, latent bugs in your code.
I’m afraid I have to disagree. The ECMA-262 standard contradicts itself. Firefox slavishly follows the implementation steps for regexp.exec(), while Internet Explorer follows the definition of the lastIndex property which does require the +1 in case of a zero-length match. IE makes the more useful choice. Incrementing lastIndex is what all regex engines do, and is the only way to avoid an infinite loop. The pingback above explains my position in detail, with references to the standard.
Jan, as far as I can tell, the spec doesn’t technically contradict itself on this issue, although it might be poorly designed and/or defy the common-sense expectation that searches wouldn’t continually start at the same position after an empty string match. The problem is that ECMAScript tries to use lastIndex for two purposes, which is one too many. lastIndex, as exposed to the user for global regexes used with methods that deal with lastIndex at all (String.prototype.search doesn’t, for example), is always the end of the last match or zero (unless the user tampers with the value themselves, which can be a useful trick). How lastIndex is used internally by some methods is not really any concern of the user’s, and is likely to differ between implementations. exec is the core regex search method from which all others can be derived. It is not specifically designed to iterate over strings, although that is one of its more common uses.
The ECMAScript design for lastIndex actually adds useful information when using the test method with a global regex—it tells you how far in the string you’ve already tested, which could not otherwise be determined. But then, due to the IE bug (or spec violation) you can’t reliably use it for that purpose anyway.
If you feel strongly that this should be classified as a bug in the spec, there is an existing Firefox ticket at https://bugzilla.mozilla.org/show_bug.cgi?id=252356 which can be added to, and the ECMAScript 4 bug database at http://bugs.ecmascript.org
Hi, I have a problem with a regular expression: exec always returns null.
This is the code:
var links = $("a");
for (var i = 0; i < 5; i++) {
    var regex = new RegExp('^(https?://'+document.domain+')?(/(Pedro|Luis|Carlos|Antonio))?/(es|ca|eu|gl)-ES(/.*)?$','i');
    var a = links[i].pathname;
    alert(regex.exec(a));
} // END FOR
Thank you!