April 2008 — Flagrant Badassery

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

If you haven't seen the prior versions, XRegExp is an MIT-licensed JavaScript library that provides an augmented, cross-browser implementation of regular expressions, including support for additional modifiers and syntax. Several convenience methods and a new, powerful recursive-construct parser that uses regex delimiters are also included.

Here's what you get beyond the standard JavaScript regex features:

Added regex syntax:
- Comprehensive named capture support. (Improved)
- Comment patterns: (?#…). (New)
Added regex modifiers (flags):
- s (singleline), to make dot match all characters including newlines.
- x (extended), for free-spacing and comments.
Added awesome:
- Reduced cross-browser inconsistencies. (More)
- Recursive-construct parser with regex delimiters. (New)
- An easy way to cache and reuse regex objects. (New)
- The ability to safely embed literal text in your regex patterns. (New)
- A method to add modifiers to existing regex objects.
- Regex call and apply methods, which make generically working with functions and regexes easier. (New)

All of this can be yours for the low, low price of 2.4 KB. smile Version 0.5 also introduces extensive documentation and code examples.

If you're using a previous version, note that there are a few non-backward compatible changes for the sake of strict ECMA-262 Edition 3 compliance and compatibility with upcoming ECMAScript 4 changes.

The XRegExp.overrideNative function has been removed, since it is no longer possible to override native constructors in Firefox 3 or ECMAScript 4 (as proposed).
Named capture syntax has been changed from (<name>…) to (?<name>…), which is the standard in most regex libraries and under consideration for ES4. Named capture is now always available, and does not require the k modifier.
Due to cross-browser compatibility issues, previous versions enforced that a leading, unescaped ] within a character class was treated as a literal character, which is how things work in most regex flavors. XRegExp now follows ECMA-262 Edition 3 on this point. [] is an empty set and never matches (this is enforced in all browsers).

Get it while it's hot! Check out the new XRegExp documentation and source code.

The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's lastIndex property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in a string using the RegExp.prototype.exec method. That's where I've most frequently encountered the bug, and I think it will help explain why the issue exists in the first place.

First of all, if you're not already familiar with how to use exec to iterate over a string, you're missing out on some very powerful functionality. Here's the basic construct:

var	regex = /.../g,
	subject = "test",
	match = regex.exec(subject);

while (match != null) {
	// matched text: match[0]
	// match start: match.index
	// match end: regex.lastIndex
	// capturing group n: match[n]

	...

	match = regex.exec(subject);
}

When the exec method is called for a regex that uses the /g (global) modifier, it searches from the point in the subject string specified by the regex's lastIndex property (which is initially zero, so it searches from the beginning of the string). If the exec method finds a match, it updates the regex's lastIndex property to the character index at the end of the match, and returns an array containing the matched text and any captured subexpressions. If there is no match from the point in the string where the search started, lastIndex is reset to zero, and null is returned.

You can tighten up the above code by moving the exec method call into the while loop's condition, like so:

var	regex = /.../g,
	subject = "test",
	match;

while (match = regex.exec(subject)) {
	...
}

This cleaner version works essentially the same as before. As soon as exec can't find any further matches and therefore returns null, the loop ends. However, there are a couple cross-browser issues to be aware of with either version of this code. One is that if the regex contains capturing groups which do not participate in the match, some values in the returned array could be either undefined or an empty string. I've previously discussed that issue in depth in a post about what I called non-participating capturing groups.

Another issue (the topic of this post) occurs when your regex matches an empty string. There are many reasons why you might allow a regex to do that, but if you can't think of any, consider cases where you're accepting regexes from an outside source. Here's a simple example of such a regex:

var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	endPositions.push(regex.lastIndex);
}

You might expect the endPositions array to be set to [0,2,4], since those are the character positions for the beginning of the string and just after each newline character. Thanks to the /m modifier, those are the positions where the regex will match; and since the regex matches empty strings, regex.lastIndex should be the same as match.index. However, Internet Explorer (tested with v5.5–7) sets endPositions to [1,3,5]. Other browsers will go into an infinite loop until you short-circuit the code.

So what's going on here? Remember that every time exec runs, it attempts to match within the subject string starting at the position specified by the lastIndex property of the regex. Since our regex matches a zero-length string, lastIndex remains exactly where we started the search. Therefore, every time through the loop our regex will match at the same position—the start of the string. Internet Explorer tries to be helpful and avoid this situation by automatically incrementing lastIndex when a zero-length string is matched. That might seem like a good idea (in fact, I've seen people adamantly argue that is a bug that Firefox does not do the same), but it means that in Internet Explorer the lastIndex property cannot be relied on to accurately determine the ending position of a match.

We can correct this situation cross-browser with the following code:

var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	var zeroLengthMatch = !match[0].length;
	// Fix IE's incorrect lastIndex
	if (zeroLengthMatch && regex.lastIndex > match.index)
		regex.lastIndex--;

	endPositions.push(regex.lastIndex);

	// Avoid an infinite loop with zero-length matches
	if (zeroLengthMatch)
		regex.lastIndex++;
}

You can see an example of the above code in the cross-browser split method I posted a while back. Keep in mind that none of the extra code here is needed if your regex cannot possibly match an empty string.

Another way to deal with this issue is to use String.prototype.replace to iterate over the subject string. The replace method moves forward automatically after zero-length matches, avoiding this issue altogether. Unfortunately, in the three biggest browsers (IE, Firefox, Safari), replace doesn't seem to deal with the lastIndex property except to reset it to zero. Opera gets it right (according to my reading of the spec) and updates lastIndex along the way. Given the current situation, you can't rely on lastIndex in your code when iterating over a string using replace, but you can still easily derive the value for the end of each match. Here's an example:

var	regex = /^/gm,
	subject = "A\nB\nC",
	endPositions = [];

subject.replace(regex, function (match) {
	// Not using a named argument for the index since capturing
	// groups can change its position in the list of arguments
	var	index = arguments[arguments.length - 2],
		lastIndex = index + match.length;

	endPositions.push(lastIndex);
});

That's perhaps less lucid than before (since we're not actually replacing anything), but there you have it… two cross-browser ways to get around a little-known issue that could otherwise cause tricky, latent bugs in your code.

Month: April 2008

XRegExp 0.5 Released!

An IE lastIndex Bug with Zero-Length Regex Matches