An IE lastIndex Bug with Zero-Length Regex Matches

The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's lastIndex property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in a string using the RegExp.prototype.exec method. That's where I've most frequently encountered the bug, and I think it will help explain why the issue exists in the first place.

First of all, if you're not already familiar with how to use exec to iterate over a string, you're missing out on some very powerful functionality. Here's the basic construct:

var	regex = /.../g,
	subject = "test",
	match = regex.exec(subject);

while (match != null) {
	// matched text: match[0]
	// match start: match.index
	// match end: regex.lastIndex
	// capturing group n: match[n]

	...

	match = regex.exec(subject);
}

When the exec method is called for a regex that uses the /g (global) modifier, it searches from the point in the subject string specified by the regex's lastIndex property (which is initially zero, so it searches from the beginning of the string). If the exec method finds a match, it updates the regex's lastIndex property to the character index at the end of the match, and returns an array containing the matched text and any captured subexpressions. If there is no match from the point in the string where the search started, lastIndex is reset to zero, and null is returned.
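To make those mechanics concrete, here's a minimal sketch that calls exec three times on the same global regex and records lastIndex after each call. The regex and subject string are made up purely for illustration.

```javascript
var regex = /\d+/g,
	subject = "a1b22",
	steps = [],
	match;

// Record the matched text (or null) and lastIndex after each exec call
for (var i = 0; i < 3; i++) {
	match = regex.exec(subject);
	steps.push([match && match[0], regex.lastIndex]);
}
// steps: [["1", 2], ["22", 5], [null, 0]]
```

The third call finds nothing after position 5, so it returns null and lastIndex goes back to zero.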

You can tighten up the above code by moving the exec method call into the while loop's condition, like so:

var	regex = /.../g,
	subject = "test",
	match;

while (match = regex.exec(subject)) {
	...
}

This cleaner version works essentially the same as before. As soon as exec can't find any further matches and therefore returns null, the loop ends. However, there are a couple of cross-browser issues to be aware of with either version of this code. One is that if the regex contains capturing groups that do not participate in the match, some values in the returned array could be either undefined or an empty string, depending on the browser. I've previously discussed that issue in depth in a post about what I called non-participating capturing groups.
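As a quick illustration of that first issue, here's a sketch using a made-up regex whose capturing group can't participate when the match is "b". Normalizing to an empty string is just one possible convention I'm assuming here, not something the earlier post prescribes.

```javascript
var match = /(a)|b/.exec("b");

// match[0] is "b", but group 1 never took part in the match
var group1 = match[1]; // undefined according to the ECMAScript spec

// Treat "did not participate" and "empty string" the same way,
// so later code sees a single representation in every browser
var normalized = group1 === undefined ? "" : group1;
```

Keep in mind this throws away the distinction between a group that matched nothing and a group that matched an empty string, which may or may not matter for your code.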

Another issue (the topic of this post) occurs when your regex matches an empty string. There are many reasons why you might allow a regex to do that, but if you can't think of any, consider cases where you're accepting regexes from an outside source. Here's a simple example of such a regex:

var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	endPositions.push(regex.lastIndex);
}

You might expect the endPositions array to be set to [0,2,4], since those are the character positions for the beginning of the string and just after each newline character. Thanks to the /m modifier, those are the positions where the regex will match; and since the regex matches empty strings, regex.lastIndex should be the same as match.index. However, Internet Explorer (tested with v5.5–7) sets endPositions to [1,3,5]. Other browsers will go into an infinite loop until you short-circuit the code.

So what's going on here? Remember that every time exec runs, it attempts to match within the subject string starting at the position specified by the lastIndex property of the regex. Since our regex matches a zero-length string, lastIndex remains exactly where we started the search. Therefore, every time through the loop our regex will match at the same position: the start of the string. Internet Explorer tries to be helpful and avoid this situation by automatically incrementing lastIndex when a zero-length string is matched. That might seem like a good idea (in fact, I've seen people adamantly argue that it is a bug that Firefox does not do the same), but it means that in Internet Explorer the lastIndex property cannot be relied on to accurately determine the ending position of a match.
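Here's a small sketch of that behavior in a browser that follows the spec. The guard counter is my own addition for illustration, so that the loop stops after three passes instead of running forever.

```javascript
var regex = /^/gm,
	subject = "A\nB\nC",
	positions = [],
	guard = 0,
	match;

// Without the guard, a spec-compliant engine would never leave this loop
while ((match = regex.exec(subject)) && guard++ < 3) {
	positions.push(match.index + ":" + regex.lastIndex);
}
// Spec-compliant engines: positions is ["0:0", "0:0", "0:0"]
```

Every pass matches at position zero and leaves lastIndex at zero, which is exactly why the unguarded loop never ends outside of IE.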

We can correct this situation cross-browser with the following code:

var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	var zeroLengthMatch = !match[0].length;
	// Fix IE's incorrect lastIndex
	if (zeroLengthMatch && regex.lastIndex > match.index)
		regex.lastIndex--;

	endPositions.push(regex.lastIndex);

	// Avoid an infinite loop with zero-length matches
	if (zeroLengthMatch)
		regex.lastIndex++;
}

You can see an example of the above code in the cross-browser split method I posted a while back. Keep in mind that none of the extra code here is needed if your regex cannot possibly match an empty string.
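If you need this in more than one place, the same logic can be wrapped in a helper. This is only a sketch: forEachMatch is a hypothetical name, and the choice to throw for non-global regexes (which would otherwise loop forever) is my assumption.

```javascript
function forEachMatch(regex, subject, callback) {
	var match, zeroLengthMatch;
	if (!regex.global)
		throw new TypeError("forEachMatch needs a regex with the /g modifier");
	regex.lastIndex = 0;
	while ((match = regex.exec(subject))) {
		zeroLengthMatch = !match[0].length;
		// Undo IE's extra increment so lastIndex is the true end of the match
		if (zeroLengthMatch && regex.lastIndex > match.index)
			regex.lastIndex--;
		callback(match, regex.lastIndex);
		// Step forward manually so the next search can't match here again
		if (zeroLengthMatch)
			regex.lastIndex++;
	}
}

var endPositions = [];
forEachMatch(/^/gm, "A\nB\nC", function (match, end) {
	endPositions.push(end);
});
// endPositions: [0, 2, 4]
```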

Another way to deal with this issue is to use String.prototype.replace to iterate over the subject string. The replace method moves forward automatically after zero-length matches, avoiding this issue altogether. Unfortunately, in the three biggest browsers (IE, Firefox, Safari), replace doesn't seem to deal with the lastIndex property except to reset it to zero. Opera gets it right (according to my reading of the spec) and updates lastIndex along the way. Given the current situation, you can't rely on lastIndex in your code when iterating over a string using replace, but you can still easily derive the value for the end of each match. Here's an example:

var	regex = /^/gm,
	subject = "A\nB\nC",
	endPositions = [];

subject.replace(regex, function (match) {
	// Not using a named argument for the index since capturing
	// groups can change its position in the list of arguments
	var	index = arguments[arguments.length - 2],
		lastIndex = index + match.length;

	endPositions.push(lastIndex);
});

That's perhaps less lucid than before (since we're not actually replacing anything), but there you have it… two cross-browser ways to get around a little-known issue that could otherwise cause tricky, latent bugs in your code.
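To show why the index is read from the end of the argument list, here's a sketch of the same replace technique with a made-up regex that has two capturing groups. With groups present, the index is no longer the second argument, but it's still the second to last.

```javascript
var regex = /(\w)=(\d+)/g,
	subject = "x=1, y=22",
	endPositions = [];

subject.replace(regex, function (match) {
	// Arguments here are: match, group 1, group 2, index, subject
	var index = arguments[arguments.length - 2];
	endPositions.push(index + match.length);
	return match;
});
// endPositions: [3, 9]
```

One caveat: engines that support named capturing groups pass an extra groups object as the final argument when the regex uses them, so the offset from the end would need adjusting for such regexes.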

A JScript/VBScript Regex Lookahead Bug

Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via ?, *, {0,n}, or (.|); but not +, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript:

/(?=a?b)ab/.test("ab");
// Should return true, but IE 5.5 – 8b1 return false

/(?=a?b)ab/.test("abc");
// Correctly returns true (even in IE), although the
// added "c" does not take part in the match

I've been aware of this bug for a couple of years, thanks to a blog post by Michael Ash that describes the bug with a password-complexity regex. However, the bug description there is incomplete and subtly incorrect, as shown by the reduced test case above. To be honest, although the errant behavior is predictable, it's a bit tricky to describe because I haven't yet figured out exactly what's happening internally. I'd recommend playing with variations of the above code to get a better understanding of the problem.

Fortunately, since the bug is predictable, it's usually possible to work around. For example, you can avoid the bug with the password regex in Michael's post (/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/) by writing it as /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/ (the .{8,15}$ lookahead must come first here). The important thing is to be aware of the issue, because it can easily introduce latent and difficult-to-diagnose bugs into your code. Just remember that it shows up with variable-length lookahead. If you're using such patterns, test the hell out of them in IE.
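One way to do that testing is to run both forms of the regex over the same inputs and look for disagreements. Here's a minimal sketch; the sample passwords are made up, and a real test would want a much larger set.

```javascript
var original = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/,
	rewritten = /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/,
	samples = ["Passw0rd", "password", "Sh0rt", "ThisIsFarTooLong123"],
	disagreements = [];

for (var i = 0; i < samples.length; i++) {
	// Collect every sample that the two patterns judge differently
	if (original.test(samples[i]) !== rewritten.test(samples[i]))
		disagreements.push(samples[i]);
}
// A correct engine leaves disagreements empty
```

If the list comes back non-empty in some browser, the two patterns aren't being treated as equivalent there, and that's worth a closer look.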

JavaScript Roman Numeral Converter

While looking for something quick to do during a brief internet outage, I wrote some code to convert to and from Roman numerals. Once things were back up I searched for equivalent code, but only found stuff that was multiple pages long, limited the range of what it could convert, or both. I figured I might as well share what I came up with:

function romanize (num) {
  if (!+num) return false;
  var digits = String(+num).split('');
  var key = ['','C','CC','CCC','CD','D','DC','DCC','DCCC','CM',
             '','X','XX','XXX','XL','L','LX','LXX','LXXX','XC',
             '','I','II','III','IV','V','VI','VII','VIII','IX'];
  var roman = '', i = 3;
  while (i--) roman = (key[+digits.pop() + (i * 10)] || '') + roman;
  return Array(+digits.join('') + 1).join('M') + roman;
}

function deromanize (str) {
  str = str.toUpperCase(); // str is already declared as a parameter
  var validator = /^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/;
  var token = /[MDLV]|C[MD]?|X[CL]?|I[XV]?/g;
  var key = {M:1000,CM:900,D:500,CD:400,C:100,XC:90,L:50,XL:40,X:10,IX:9,V:5,IV:4,I:1};
  var num = 0, m;
  if (!(str && validator.test(str))) return false;
  while (m = token.exec(str)) num += key[m[0]];
  return num;
}

How would you rewrite this code? Can you create a shorter version?
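For what it's worth, here's one possible answer to the first question: a table-driven, greedy version. It isn't shorter, and romanizeGreedy is a hypothetical name. Unlike romanize above, it floors fractional input and rejects anything below one or non-finite, which are my assumptions rather than the original's behavior.

```javascript
function romanizeGreedy (num) {
  var values  = [1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1];
  var symbols = ['M','CM','D','CD','C','XC','L','XL','X','IX','V','IV','I'];
  var roman = '', i;
  num = Math.floor(+num);
  if (!isFinite(num) || num < 1) return false;
  // Repeatedly subtract the largest value that still fits
  for (i = 0; i < values.length; i++) {
    while (num >= values[i]) {
      roman += symbols[i];
      num -= values[i];
    }
  }
  return roman;
}

romanizeGreedy(1994); // "MCMXCIV"
```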

Regular Expressions As Functions

Firefox includes a non-standard JavaScript extension that makes regular expressions callable as functions. This serves as a shorthand for calling a regex's exec method. For example, in Firefox /regex/("string") is equivalent to /regex/.exec("string"). Early ECMAScript 4 proposals indicated this functionality would be added to the ES4 specification, but subsequent discussion on the ES4-discuss mailing list suggests it might be dropped.

However, you can implement something similar by adding call and apply methods to RegExp.prototype, which could help with functional programming and duck-typed code that works with both functions and regular expressions. So let's add them:

RegExp.prototype.call = function (context, str) {
	return this.exec(str);
};
RegExp.prototype.apply = function (context, args) {
	return this.exec(args[0]);
};

Note that both of the above methods completely ignore the context argument. You could pass in null or whatever else as the context, and you'd get back the normal result of running exec on the regex. Using the above methods, you can generically work with both regular expressions and functions wherever it's convenient to do so. A few obvious cases where this could be helpful are the JavaScript 1.6 array iteration methods. Following are implementations of filter, every, some, and map that allow them to be used cross-browser:

// Returns an array with the elements of an existing array for which the provided filtering function returns true
Array.prototype.filter = function (func, context) {
	var results = [];
	for (var i = 0; i < this.length; i++) {
		if (i in this && func.call(context, this[i], i, this))
			results.push(this[i]);
	}
	return results;
};
// Returns true if every element in the array satisfies the provided testing function
Array.prototype.every = function (func, context) {
	for (var i = 0; i < this.length; i++) {
		if (i in this && !func.call(context, this[i], i, this))
			return false;
	}
	return true;
};
// Returns true if at least one element in the array satisfies the provided testing function
Array.prototype.some = function (func, context) {
	for (var i = 0; i < this.length; i++) {
		if (i in this && func.call(context, this[i], i, this))
			return true;
	}
	return false;
};
// Returns an array with the results of calling the provided function on every element in the provided array
Array.prototype.map = function (func, context) {
	var results = [];
	for (var i = 0; i < this.length; i++) {
		if (i in this)
			results[i] = func.call(context, this[i], i, this);
	}
	return results;
};

Because the array and null values returned by exec type-convert nicely to true and false, the above code allows you to use something like ["a","b","ab","ba"].filter(/^a/) to return all values that start with "a": ["a","ab"]. The code ["1",1,0,"a",3.1,256].filter(/^[1-9]\d*$/) would return integers greater than zero, regardless of type: ["1",1,256]. str.match(/a?b/g).filter(/^b/) would return all matches of "b" not preceded by "a". This can be a convenient pattern since JavaScript doesn't support lookbehind.

All of the above examples already work with Firefox's native Array.prototype.filter because of the indirect exec calling feature in that browser, but they wouldn't work with the cross-browser implementation of filter above without adding RegExp.prototype.call.
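Here's a self-contained sketch that puts the pieces together. To avoid overwriting a native Array.prototype.filter, it uses a standalone helper with the hypothetical name filterWith, which follows the same func.call approach as the filter implementation above.

```javascript
// Same idea as RegExp.prototype.call above: ignore the context, run exec
RegExp.prototype.call = function (context, str) {
	return this.exec(str);
};

// Hypothetical standalone helper, so built-in array methods are left alone
function filterWith(array, func, context) {
	var results = [];
	for (var i = 0; i < array.length; i++) {
		if (i in array && func.call(context, array[i], i, array))
			results.push(array[i]);
	}
	return results;
}

var startsWithA = filterWith(["a", "b", "ab", "ba"], /^a/);
// ["a", "ab"]
var positiveInts = filterWith(["1", 1, 0, "a", 3.1, 256], /^[1-9]\d*$/);
// ["1", 1, 256]
var evens = filterWith([1, 2, 3, 4], function (n) { return n % 2 === 0; });
// [2, 4], since ordinary functions still work
```

One thing to watch for: a regex with the /g modifier carries its lastIndex from one exec call to the next, so regexes without /g are the safer choice as predicates.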

Does this seem like something that would be useful to you? Can you think of other good examples where call and apply methods would be useful for regular expressions?

Update: This post has been translated into Chinese by PlanABC.net.

Timed Memoization

Certain operations are computationally expensive, but because their results might change over time or due to outside influences, they don't lend themselves to typical memoization — take for example getElementsByClassName. Here's a JavaScript timed memoization decorator / higher-order-function I made to help with these cases, which accepts an optional expiration argument in milliseconds.

function memoize (functor, expiration) {
	var memo = {};
	return function () {
		var key = Array.prototype.join.call(arguments, "§");
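		// Check own properties only, so keys like "toString" don't hit Object.prototype
		if (Object.prototype.hasOwnProperty.call(memo, key))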
			return memo[key];
		if (expiration)
			setTimeout(function () {delete memo[key];}, expiration);
		return memo[key] = functor.apply(this, arguments);
	};
}

This approach allows you to turn any function into a memoizing function. Note that return values are memoized for each set of arguments. Because the cache key is built by joining the arguments into a string, it's only reliable when the arguments are arrays or scalar values (plain objects all serialize to "[object Object]"). You could easily use e.g. a toJSON method rather than join to serialize objects as part of the cache key, at some additional overhead cost.
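As a rough sketch of that idea, here's a variant that builds the cache key with JSON.stringify instead of join. The name memoizeByJson is hypothetical, and it assumes a native or library-provided JSON.stringify is available.

```javascript
function memoizeByJson (functor, expiration) {
	var memo = {};
	return function () {
		// Serialize the whole argument list so objects get distinct keys
		var key = JSON.stringify(Array.prototype.slice.call(arguments));
		if (Object.prototype.hasOwnProperty.call(memo, key))
			return memo[key];
		if (expiration)
			setTimeout(function () {delete memo[key];}, expiration);
		return memo[key] = functor.apply(this, arguments);
	};
}

var calls = 0;
var area = memoizeByJson(function (size) {
	calls++;
	return size.width * size.height;
});

var first = area({width: 2, height: 3});  // computed: 6
var second = area({width: 2, height: 3}); // served from the cache: 6
var third = area({width: 3, height: 2});  // different key, computed again: 6
// calls is now 2
```

Note that property order affects the serialized key, so {width: 2, height: 3} and {height: 3, width: 2} would be cached separately.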

You can use the above code like this:

// Make a function which memoizes for 1000 milliseconds at a time
var fn = memoize(function () {
	Array(500000).join("."); // slow
	return true;
}, 1000);

…Or leave out the expiration argument to permanently memoize.

Here are a couple more posts on JavaScript memoization: