Matching Nested Constructs in JavaScript, Part 2

When I posted my matchRecursive function the other day (which allows easily matching nested constructs), I noted that it could easily be modified to work with a regex pattern rather than a string as the format argument. After looking at it again, I realized the conversion wouldn't be entirely straightforward, so I've gone ahead and reimplemented it as matchRecursiveRegExp to work with regex format delimiters. You can now use it to like e.g. matchRecursiveRegExp(str, "<div\\b[^>]*>", "</div>", "gi") to match the entire contents of all outermost <div> tags.

As before, read the code comments for further details. Note that in this version I've made the variable names more or less indecipherable in order to keep code size down, so see the earlier function for a similar but more readable implementation.

// (c) 2007 Steven Levithan <stevenlevithan.com>
// MIT License

/*** matchRecursiveRegExp
	Accepts a string to search, a left and right format delimiter
	as regex patterns, and optional regex flags. Returns an array
	of matches, allowing nested instances of left/right delimiters.
	Use the "g" flag to return all matches, otherwise only the
	first is returned. Be careful to ensure that the left and
	right format delimiters produce mutually exclusive matches.
	Backreferences are not supported within the right delimiter
	due to how it is internally combined with the left delimiter.
	When matching strings whose format delimiters are unbalanced
	to the left or right, the output is intentionally as a
	conventional regex library with recursion support would
	produce, e.g. "<<x>" and "<x>>" both produce ["x"] when using
	"<" and ">" as the delimiters (both strings contain a single,
	balanced instance of "<x>").

	examples:
		matchRecursiveRegExp("test", "\\(", "\\)")
			returns: []
		matchRecursiveRegExp("<t<<e>><s>>t<>", "<", ">", "g")
			returns: ["t<<e>><s>", ""]
		matchRecursiveRegExp("<div id=\"x\">test</div>", "<div\\b[^>]*>", "</div>", "gi")
			returns: ["test"]

*/
function matchRecursiveRegExp (str, left, right, flags) {
	var	f = flags || "",
		g = f.indexOf("g") > -1,
		x = new RegExp(left + "|" + right, "g" + f),
		l = new RegExp(left, f.replace(/g/g, "")),
		a = [],
		t, s, m;

	do {
		t = 0;
		while (m = x.exec(str)) {
			if (l.test(m[0])) {
				if (!t++) s = x.lastIndex;
			} else if (t) {
				if (!--t) {
					a.push(str.slice(s, m.index));
					if (!g) return a;
				}
			}
		}
	} while (t && (x.lastIndex = s));

	return a;
}

You can download it here.

Let me know if there are any additional features you'd like to see with this.

ECMAScript 3 Regular Expressions are Defective by Design

ECMAScript 3 has some major regex design flaws, and if nothing changes the ES4 group will be propagating some of the mistakes into ECMAScript 4 (aka JavaScript 2).

Recently, longtime JavaScript regex guru David "liorean" Andersson wrote up a couple posts about my biggest gripes with the ECMAScript 3 regex flavor, namely the way that backreferences are handled for non-participating capturing groups and within quantified subpatterns (see ECMAScript 3 Regular Expressions: A specification that doesn't make sense and A quick JS quiz for anybody who think they know regex). I'll avoid rehashing the points here, since I think David already articulated the problems well. For the record, I had planned to submit these issues as ECMAScript 4 bugs, but I already have a number of ES4 regex tickets open and was waiting to see their outcome before submitting more.

Another historically problematic issue has been the fact that, according to ES3, regex literals cause only one object to be created at runtime for a script or function. This issue exhibits itself most frequently as regex literals which use the /g modifier not having their lastIndex property reset in some cases where most developers would expect it. Fortunately, this is already planned to be fixed in ES4. The fact that it has been the third most duplicated Firefox bug report undoubtedly had something to do with this decision.

But getting back to my original rant, although the backreference handling issues might be less visible to some developers than having their regex objects' lastIndex properties seemingly out of whack, they are no more sensible or in line with developer expectations. Additionally, the ES3 handling for these issues is incompatible with other modern regex libraries, and far less useful than the alternative handling (see e.g. Mimicking Conditionals and Capturing Multiple, Optional HTML Attribute Values for a couple examples of where the conventional, Perl-style handling could be put to good use).

As a related rant, IMHO the ECMAScript 4 regex extension proposals miss some opportunities for key feature additions. Here's what ES4 regexes add, along with a few compatibility-related changes and the ability for regex literals to span multiple lines:

  • Character class set operations — intersection and subtraction, with syntax inspired by java.util.regex.
  • (?#…) comment patterns.
  • Named capture — though it seems this wasn't fully thought out. However, it looks like the TG1 group might be willing to change the syntax from that proposed in the draft spec to the more common .NET/Perl syntax, which would be an improvement.
  • The /y (sticky) modifier — similar to several other libraries' use of \G.
  • The /x (extended) modifier — for free-spacing and comments.
  • Unicode character properties — but there's no support for Unicode scripts or blocks, and no \X metasequence to match a single grapheme, which means you'll have to use \P{M}\p{M}*.
  • Support for hex character codes outside Unicode's Basic Multilingual Plane — via \x{n…} and \u{n…}, which are equivalent.

For a description of these features, see the ES4 wiki, but note that many of the finer details of how they'll work are not mentioned, or are being discussed elsewhere, such as on the es4-discuss@mozilla.org mailing list (external archive here) or within the ECMAScript 4 issue database.

Aside from a few details of their currently proposed implementation (which for the most part I've already brought up elsewhere), I think these additions are great. To be honest though, if I could trade all of the ES4 regex extensions for atomic groups and lookbehind, I would. And while it's understandable that different people have different priorities, the lack of atomic groups in particular is a significant omission considering their potentially dramatic performance-enhancing power combined with their minimal implementation burden. Additional features found either in Perl or other Perl-derivative regex libraries which could be quite useful include possessive quantifiers, backtracking control verbs, mode modifiers and mode-modified spans, conditionals, \A and \z assertions, callouts, relative backreferences, recursion, subpatterns as subroutines, match point resetting (via \K), duplicate subpattern numbers (?|…), subpattern definitions (?(DEFINE)…), partial matching, backwards matching, etc.

Since the ECMA TG1 group has stated that they're no longer accepting major spec proposals, I expect the additions will be limited to those already proposed. However, I'm hopeful that the situation will be improved, at least by refining some of the existing ES3 features and ES4 proposals. Since I love both JavaScript and regular expressions, I'd love to see them come together in a way that rivals the best regex libraries. Perhaps ECMAScript could even introduce a little innovation in the space.

JavaScript String Multiplication Performance Exploration

Since JavaScript concatenates strings with the + operator, it would be nifty if it would also let you multiply strings using e.g. str * 10 (as can be done in Python, at least). Since you can't do that, and no native string multiplication method is provided, I recently explored a few ways to pull it off…

A naive approach to writing a string multiplier function goes something like this:

function mul0 (str, num) {
	if (!num) return "";
	var newStr = str;
	while (--num) newStr += str;
	return newStr;
}

As many JavaScripters are aware, that isn't the best approach since string concatenation can be quite slow in Internet Explorer. And while IE tends to get a bad rap for this (fortunately, the IE team is fixing the problem in the next version of their browser), Firefox isn't exactly blazing fast at string concatenation either. Due to the performance issues, the typical string multiplication approach is to build an array and join it. Here's a nice, short way to do that:

function mul1 (str, num) {
	return num ? Array(num + 1).join(str) : "";
}

Note that the falsy num handling is probably not warranted in this case since the function would handle value 0 correctly without it. It's done anyway to keep functionality equivalent across the variations.

Unfortunately, mul1 can still be pretty slow in Firefox 2 when multiplying large strings many times. It might be unnoticeable with small strings and repetition numbers, but the completion time goes up at a super-linear rate as the numbers increase. In search of a faster solution, I tried using a regex to keep down the size of the string being worked with:

var mul2 = function () {
	function mul (str, num) {
		return Array(num + 1).join(str);
	}
	return function (str, num) {
		return num ? str.replace(/^/, mul("$'", num - 1)) : "";
	};
}();

The above multiplies the two-character string "$'" num - 1 times, then uses that as the replacement for a regex which just matches the start of the string ($' returns the text to the right of the match). How does that perform? It delivers in Firefox 2 on my Windows Vista system, with numbers like 95ms vs. 29800ms (mul1) when using a 2700x2700 string length/multiplier. However, based on my testing, that sort of speed gain appears to be limited to Firefox, and in Safari 3 beta mul2 is considerably slower that the alternative versions.

Finally, I tried creating a version which multiplied the string at an exponential rate:

function mul3 (str, num) {
	if (!num) return "";
	var	orig = str,
		soFar = [str],
		added = 1,
		left, i;
	while (added < num) {
		left = num - added;
		str = orig;
		for (i = 2; i < left; i *= 2) {
			str += str;
		}
		soFar.push(str);
		added += (i / 2);
	}
	return soFar.join("");
}

Although that might be more code than you're willing to dedicate to a string multiplication method, it's the fastest of the above versions on average cross-browser. I've also tried a few variations using from zero to two arrays and various array methods (push, concat, etc.), but the above seems to be the fastest on average across the big four browsers.

Make sure to try the tests for yourself, and let me know your thoughts and how you would improve the code.


Edit: Kris Kowal contributed mul4 (shown below, and added to the test page). It uses binary interpolation, and in Kris's words "it takes advantage of a fun bitwise identity: (1 << n) == Math.pow(2, n)". On my system it's significantly faster than mul3 in Firefox, but a little slower than mul3 in IE, Safari, and Opera. Due to its high speed and lighter weight, this looks like the one to beat. Try the test page in several browsers and see what you think.

function mul4 (str, num) {
	var acc = [];
	for (var i = 0; (1 << i) <= num; i++) {
		if ((1 << i) & num)
			acc.push(str);
		str += str;
	}
	return acc.join("");
}

Edit 2: LiuCougar of the Dojo development team posted a follow-up which includes several additional variations, and David Andersson emailed me an additional four variations including this one:

function mul8 (str, num) {
	var	i = Math.ceil(Math.log(num) / Math.LN2),
		res = str;
	do {
		res += res;
	} while (0 < --i);
	return res.slice(0, str.length * num);
}

I should clarify however that this is mostly just academic discussion, since repeating the kinds of strings in the test page as many times as it does is a pretty crazy idea. Still, it's fun to experiment. smile


Edit 3: All the variations posted or emailed in response to this post can be seen at stevenlevithan.com/demo/mul/all.js. For the sake of consistency I've made a few minor adjustments to some of the functions such as whitespace tweaks and renaming the input arguments to str and num.

Matching Nested Constructs in JavaScript

In the past, I've touched on using regexes to match nested constructs up to a predetermined depth, which is the best you can do unless you're using one of the three regex engines (Perl, PCRE, and .NET) which are currently able to handle true recursion.

Well, recently I wanted to be able to support unlimited nesting depth in a fast, flexible, and easy to use way when matching strings in JavaScript, so here's the code I wrote for it. Basic documentation and examples are included in the code comments.

// (c) 2007 Steven Levithan <stevenlevithan.com>
// MIT License

/*** matchRecursive
	accepts a string to search and a format (start and end tokens separated by "...").
	returns an array of matches, allowing nested instances of format.

	examples:
		matchRecursive("test",          "(...)")   -> []
		matchRecursive("(t(e)s)()t",    "(...)")   -> ["t(e)s", ""]
		matchRecursive("t<e>>st",       "<...>")   -> ["e"]
		matchRecursive("t<<e>st",       "<...>")   -> ["e"]
		matchRecursive("t<<e>>st",      "<...>")   -> ["<e>"]
		matchRecursive("<|t<e<|s|>t|>", "<|...|>") -> ["t<e<|s|>t"]
*/
var matchRecursive = function () {
	var	formatParts = /^([\S\s]+?)\.\.\.([\S\s]+)/,
		metaChar = /[-[\]{}()*+?.\\^$|,]/g,
		escape = function (str) {
			return str.replace(metaChar, "\\$&");
		};

	return function (str, format) {
		var p = formatParts.exec(format);
		if (!p) throw new Error("format must include start and end tokens separated by '...'");
		if (p[1] == p[2]) throw new Error("start and end format tokens cannot be identical");

		var	opener = p[1],
			closer = p[2],
			/* Use an optimized regex when opener and closer are one character each */
			iterator = new RegExp(format.length == 5 ? "["+escape(opener+closer)+"]" : escape(opener)+"|"+escape(closer), "g"),
			results = [],
			openTokens, matchStartIndex, match;

		do {
			openTokens = 0;
			while (match = iterator.exec(str)) {
				if (match[0] == opener) {
					if (!openTokens)
						matchStartIndex = iterator.lastIndex;
					openTokens++;
				} else if (openTokens) {
					openTokens--;
					if (!openTokens)
						results.push(str.slice(matchStartIndex, match.index));
				}
			}
		} while (openTokens && (iterator.lastIndex = matchStartIndex));

		return results;
	};
}();

You can download the code here.

Note that the format argument expects a simple string; not a regular expression. However, the code could easily be modified to work with regexes if that's what you were after.

Update: I've posted an alternative version which accepts regex patterns as the format as matchRecursiveRegExp.

Date Format 1.1

I've just updated my ColdFusion-inspired JavaScript Date Format script to version 1.0 1.1, and updated the documentation in the old post along with it. The new release includes "Z" (US timezone abbreviation) and "o" (UTC offset) flags as well as brevity enhancements from Scott Trenda, along with several other new features including a standalone dateFormat function, named and default masks (plus you can easily add your own), easier internationalization, etc.

This update includes one change which is not backwards compatible: mask characters and sequences no longer have to comprise entire words for them to be treated specially. The former handling was intended to make it dead-easy to mix literal characters into date masks, but ended up mostly just being a slight nuisance since most people didn't use it to embed dates in larger strings.

Check out the new Date Format!


Edit: Date Format is now integrated into two JavaScript frameworks:

  • CFJS is a library of almost 70 ColdFusion functions written in JavaScript by Chris Jordan. CFJS has used Date Format, which was a natural fit since it's largely based on ColdFusion's dateFormat and timeFormat functions, since version 0.1.
  • Chiron is an innovative, emerging JavaScript library by Kris Kowal. It's based on Python idioms, and at its heart is an advanced module loader and isolation system the likes of which hasn't been seen yet in the JavaScript world. In addition to integrating Date Format as a module called date.js, Chiron has also integrated my XRegExp library, and uses regular expressions from parseUri in its core. Expect to hear more about Chiron as it gets closer to 0.1 release.