Levels of JavaScript Regex Knowledge

  1. N00b
    • Thinks "regular expressions" is open mic night at a poetry bar.
    • Uses \w, \d, \s, and other shorthand classes purely by accident if at all.
    • Painfully misuses * and especially .*.
    • Puts words in character classes.
    • Uses | in character classes for alternation.
    • Hasn't heard of the exec method.
    • Copies and pastes poorly written regexes from the web.
  2. Trained n00b
    • Uses regexes where methods like slice or indexOf would do.
    • Uses g, i, and m modifiers needlessly.
    • Uses [^\w] instead of \W.
    • Doesn't know why using [\w\d_] gives away their n00bness.
    • Tries to remove HTML tags with replace(/<.*?>/g,"").
    • Escapes all punctuation\!
  3. User
    • Knows when to use regexes, and when to use string methods.
    • Toys with lookahead.
    • Uses regexes in conditionals.
    • Starts to understand why HTML tags are hard to match with regexes.
    • Knows to use (?:…) when a backreference or capture isn't needed.
    • Can read a relatively simple regex and explain its function.
    • Knows their way around the use of replace callback functions.
  4. Haxz0r
    • Uses lookahead with impunity.
    • Sighs at the unavailability of lookbehind and other features from more powerful regex libraries.
    • Knows what $`, $', and $& mean in a replacement string.
    • Knows the difference between string literal and regex metacharacters, and how this impacts the RegExp constructor.
    • Generally knows whether a greedy or lazy quantifier is more appropriate, even when it doesn't change what the regex matches.
    • Has a basic sense of how to avoid regex efficiency problems.
    • Knows how to iterate over strings using the exec method and a while loop.
    • Knows that properties of the global RegExp object and the compile method are deprecated.
  5. Guru
    • Understands the significance of manually modifying a regex object's lastIndex property and when this can be useful within a loop.
    • Can explain how any given regex will or won't work.
    • No longer experiences the excitement of writing complex regexes that work on the first try, since regex behavior has become predictable and obvious.
    • Is immune to catastrophic backtracking, and can easily (and accurately) determine if a nested quantifier is safe.
    • Knows of numerous cross-browser regex syntax and behavior differences.
    • Knows offhand the section number of ECMA-262 3rd Edition that covers regexes.
    • Understands the difference between capturing group nonparticipation vs participating but capturing an empty string, and the behavior differences this can lead to.
    • Has a preference for particular backreference rules related to capturing group participation and quantified alternation, or is at least aware of the implementation inconsistencies.
    • Often knows which browser will run a given regex fastest before testing, based on known internal optimizations and weaknesses.
    • Thinks that writing recursive regexes is easy, so long as there is an upper bound to recursion depth.
  6. Wizard
    • Works on a regex engine.
    • Has patched the engine from time to time.
  7. God
    • Can add features to the engine at a whim.
    • Also created all life on earth using a constructor function.

(Heavily adapted and JavaScriptized from 7 Stages of a [Perl] Regex User.)

Mimicking Lookbehind in JavaScript

Unlike lookaheads, JavaScript doesn't support regex lookbehind syntax. That's unfortunate, but I'm not content with just resigning to that fact. Following are three ways I've come up with to mimic lookbehinds in JavaScript.

For those not familar with the concept of lookbehinds, they are zero-width assertions which, like the more specific \b, ^, and $ metacharacters, don't actually consume anything — they just match a position within text. This can be a very powerful concept. Read this first if you need more details.

Mimicking lookbehind with the replace method and optional capturing groups

This first approach is not much like a real lookbehind, but it might be "good enough" in some simple cases. Here are a few examples:

// Mimic leading, positive lookbehind like replace(/(?<=es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
	return $1 ? $1 + 'x' : $0;
});
// output: tesxt

// Mimic leading, negative lookbehind like replace(/(?<!es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
	return $1 ? $0 : 'x';
});
// output: xestx

// Mimic inner, positive lookbehind like replace(/\w(?<=s)t/g, 'x')
var output = 'testt'.replace(/(?:(s)|\w)t/g, function($0, $1){
	return $1 ? 'x' : $0;
});
// output: text

Unfortunately, there are many cases where lookbehinds can't be mimicked using this construct. Here's one example:

// Trying to mimic positive lookbehind, but this doesn't work
var output = 'ttttt'.replace(/(t)?t/g, function($0, $1){
	return $1 ? $1 + 'x' : $0;
});
// output: txtxt
// desired output: txxxx

The problem is that the regexes are relying on actually consuming the characters which should be within zero-width lookbehind assertions, then simply putting back the match unviolated (an effective no-op) if the backreferences contain or don't contain a value. Since the actual matching process here doesn't work anything like real lookbehinds, this only works in a limited number of scenarios. Additionally, it only works with the replace method, since other regex-related methods don't offer a mechanism to dynamically "undo" matches. However, since you can run arbitrary code in the replacement function, it does offer a limited degree of flexibility.

Mimicking lookbehind through reversal

The next approach uses lookaheads to mimic lookbehinds, and relies on manually reversing the data and writing your regex backwards. You'll also need to write the replacement value backwards if using this with the replace method, flip the match index if using this with the search method, etc. If that sounds a bit confusing, it is. I'll show an example in a second, but first we need a way to reverse our test string, since JavaScript doesn't provide this capability natively.

String.prototype.reverse = function () {
	return this.split('').reverse().join('');
};

Now let's try to pull this off:

// Mimicking lookbehind like (?<=es)t
var output = 'testt'.reverse().replace(/t(?=se)/g, 'x').reverse();
// output: tesxt

That actually works quite nicely, and allows mimicking both positive and negative lookbehind. However, writing a more complex regex with all nodes reversed can get a bit confusing, and since lookahead is used to mimic lookbehind, you can't mix what you intend as real lookaheads in the same pattern.

Note that reversing a string and applying regexes with reversed nodes can actually open up entirely new ways to approach a pattern, and in a few cases might make your code faster, even with the overhead of reversing the data. I'll have to save the efficiency discussion for another day, but before moving on to the third lookbehind-mimicking approach, here's one example of a new pattern approach made possible through reversal.

In my last post, I used the following code to add commas every three digits from the right for all numbers which are not preceded by a dot, letter, or underscore:

String.prototype.commafy = function () {
	return this.replace(/(^|[^\w.])(\d{4,})/g, function($0, $1, $2) {
		return $1 + $2.replace(/\d(?=(?:\d\d\d)+(?!\d))/g, '$&,');
	});
}

Here's an alternative implementation:

String.prototype.commafy = function() {
	return this.
		reverse().
		replace(/\d\d\d(?=\d)(?!\d*[a-z._])/gi, '$&,').
		reverse();
};

I'll leave the analysis for your free time.

Finally, we come to the third lookbehind-mimicking approach:

Mimicking lookbehind using a while loop and regexp.lastIndex

This last approach has the following advantages:

  • It's easier to use (no need to reverse your data and regex nodes).
  • It allows lookahead and lookbehind to be used together.
  • It allows you to more easily automate the mimicking process.

However, the trade off is that, in order to avoid interfering with standard regex backtracking, this approach only allows you to use lookbehinds (positive or negative) at the very start and/or end of your regexes. Fortunately, it's quite common to want to use a lookbehind at the start of a regex.

If you're not already familiar with the exec method available for RegExp objects, make sure to read about it at the Mozilla Developer Center before continuing. In particular, look at the examples which use exec within a while loop.

Here's a quick implementation of this approach, in which we'll actually toy with the regex engine's bump-along mechanism to get it to work as we want:

var data = 'ttttt',
	regex = /t/g,
	replacement = 'x',
	match,
	lastLastIndex = 0,
	output = '';

regex.x = {
	gRegex: /t/g,
	startLb: {
		regex: /t$/,
		type: true
	}
};

function lookbehind (data, regex, match) {
	return (
		(regex.x.startLb ? (regex.x.startLb.regex.test(data.substring(0, match.index)) === regex.x.startLb.type) : true) &&
		(regex.x.endLb ? (regex.x.endLb.regex.test(data.substring(0, regex.x.gRegex.lastIndex)) === regex.x.endLb.type) : true)
	);
}

while (match = regex.x.gRegex.exec(data)) {
	/* If the match is preceded/not by start lookbehind, and the end of the match is preceded/not by end lookbehind */
	if (lookbehind(data, regex, match)) {
		/* replacement can be a function */
		output += data.substring(lastLastIndex, match.index) + match[0].replace(regex, replacement);
		if(!regex.global){
			lastLastIndex = regex.gRegex.lastIndex;
			break;
		}
	/* If the inner pattern matched, but the leading or trailing lookbehind failed */
	} else {
		output += match[0].charAt(0);
		/* Set the regex to try again one character after the failed position, rather than at the end of the last match */
		regex.x.gRegex.lastIndex = match.index + 1;
	}
	lastLastIndex = regex.x.gRegex.lastIndex;
}
output += data.substring(lastLastIndex);

// output: txxxx

That's a fair bit of code, but it's quite powerful. It accounts for using both a leading and trailing lookbehind, and allows using a function for the replacement value. Also, this could relatively easily be made into a function which accepts a string for the regex using normal lookbehind syntax (e.g., "(?<=x)x(?<!x)"), then splits it into the various parts in needs before applying it.

Notes:

  • regex.x.gRegex should be an exact copy of regex, with the difference that it must use the g flag whether or not regex does (in order for the exec method to interact with the while loop as we need it to).
  • regex.x.startLb.type and regex.x.endLb.type use true for "positive," and false for "negative."
  • regex.x.startLb.regex and regex.x.endLb.regex are the patterns you want to use for the lookbehinds, but they must contain a trailing $. The dollar sign in this case does not mean end of the data, but rather end of the data segment they will be tested against.

If you're wondering why there hasn't been any discussion of fixed- vs. variable-length lookbehinds, that's because none of these approaches have any such limitations. They support full, variable-length lookbehind, which no regex engines I know of other than .NET and JGsoft (used by products like RegexBuddy) are capable of.

In conclusion, if you take advantage of all of the above approaches, regex lookbehind syntax can be mimicked in JavaScript in the vast majority of cases. Make sure to take advantage of the comment button if you have feedback about any of this stuff.

Update 2012-04: See my followup blog post, JavaScript Regex Lookbehind Redux, where I've posted a collection of short functions that make it much easier to simulate leading lookbehind.

RegexBuddy 3.0 Beta

RegexBuddy logo

RegexBuddy is one of those tools which, now that I've gotten used to having around, I'd have a hard time living without—kind of like regular expressions themselves. I'm happy to see that the recently released (but little publicized) RegexBuddy 3.0 beta pushes what's already the best regex builder/tester on the market quite a bit further. (If you haven't heard of it before, start here.)

Here are the new features I'm personally most excited about:

  • Flavors: RegexBuddy 3 lets you emulate specific regex flavors, including .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl, POSIX BRE/ERE, and a number of others. Don't know why a particular regex doesn't work in your tool of choice? Now, RegexBuddy can tell you. Although I wouldn't expect this feature to mimic other flavors 100% accurately in all cases (e.g., I don't expect it to mimic specific bugs found in other libraries), it's still a badass new feature which further separates RegexBuddy from every competing tool. (Update: Jan Goyvaerts responded regarding the issue of bug emulation on his blog.)
  • Added syntax support: To accompany the above feature, a significant amount of flavor-specific regex syntax support has been added.
  • Integrated forums: RegexBuddy 3 is the first application I've seen which includes an integrated forum system not available outside of the software. This offers some advantages (e.g., almost no spam) and conveniences (e.g., you can attach a regex together with modifiers and target data with the click of a button), but it's a fairly novel idea. I'm curious to see how much action it will get since it's only available to licensed users… Author Jan Goyvaerts talks a bit about the concept on his blog. I'm sure you'll be able to find me on the new forums from time to time.
  • Enhanced match info: New features like "List All Group Matches in Columns" and "List All Replacements" offer lots of shortcuts to the info you're after.
  • History: You can now easily keep a list of regexes you've tried during your session.
  • New interface: The tabs from RegexBuddy 2 are now panels that you can do much more with. There's also a dual-monitor layout preset which I'll probably make good use of.
  • Debug everywhere: You can now run the debugger at every position in the test data, rather than only at the current cursor position. This is a great way to gain insight into how regular expression engines work internally.

There are a number of other nifty features I haven't mentioned here, but I'll leave them for you to discover (or read about in the changelog). The beta is only available to registered users, but if you already own RegexBuddy 2.x you should give the new beta a try.

Faster JavaScript Trim

Since JavaScript doesn't include a trim method natively, it's included by countless JavaScript libraries – usually as a global function or appended to String.prototype. However, I've never seen an implementation which performs as well as it could, probably because most programmers don't deeply understand or care about regex efficiency issues.

After seeing a particularly bad trim implementation, I decided to do a little research towards finding the most efficient approach. Before getting into the analysis, here are the results:

Method Firefox 2 IE 6
trim1 15ms < 0.5ms
trim2 31ms < 0.5ms
trim3 46ms 31ms
trim4 47ms 46ms
trim5 156ms 1656ms
trim6 172ms 2406ms
trim7 172ms 1640ms
trim8 281ms < 0.5ms
trim9 125ms 78ms
trim10 < 0.5ms < 0.5ms
trim11 < 0.5ms < 0.5ms

Note 1: The comparison is based on trimming the Magna Carta (over 27,600 characters) with a bit of leading and trailing whitespace 20 times on my personal system. However, the data you're trimming can have a major impact on performance, which is detailed below.

Note 2: trim4 and trim6 are the most commonly found in JavaScript libraries today.

Note 3: The aforementioned bad implementation is not included in the comparison, but is shown later.

The analysis

Although there are 11 rows in the table above, they are only the most notable (for various reasons) of about 20 versions I wrote and benchmarked against various types of strings. The following analysis is based on testing in Firefox 2.0.0.4, although I have noted where there are major differences in IE6.

  1. return str.replace(/^\s\s*/, '').replace(/\s\s*$/, '');
    All things considered, this is probably the best all-around approach. Its speed advantage is most notable with long strings — when efficiency matters. The speed is largely due to a number of optimizations internal to JavaScript regex interpreters which the two discrete regexes here trigger. Specifically, the pre-check of required character and start of string anchor optimizations, possibly among others.
  2. return str.replace(/^\s+/, '').replace(/\s+$/, '');
    Very similar to trim1 (above), but a little slower since it doesn't trigger all of the same optimizations.
  3. return str.substring(Math.max(str.search(/\S/), 0), str.search(/\S\s*$/) + 1);
    This is often faster than the following methods, but slower than the above two. Its speed comes from its use of simple, character-index lookups.
  4. return str.replace(/^\s+|\s+$/g, '');
    This commonly thought up approach is easily the most frequently used in JavaScript libraries today. It is generally the fastest implementation of the bunch only when working with short strings which don't include leading or trailing whitespace. This minor advantage is due in part to the initial-character discrimination optimization it triggers. While this is a relatively decent performer, it's slower than the three methods above when working with longer strings, because the top-level alternation prevents a number of optimizations which could otherwise kick in.
  5. str = str.match(/\S+(?:\s+\S+)*/);
    return str ? str[0] : '';

    This is generally the fastest method when working with empty or whitespace-only strings, due to the pre-check of required character optimization it triggers. Note: In IE6, this can be quite slow when working with longer strings.
  6. return str.replace(/^\s*(\S*(\s+\S+)*)\s*$/, '$1');
    This is a relatively common approach, popularized in part by some leading JavaScripters. It's similar in approach (but inferior) to trim8. There's no good reason to use this in JavaScript, especially since it can be very slow in IE6.
  7. return str.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, '$1');
    The same as trim6, but a bit faster due to the use of a non-capturing group (which doesn't work in IE 5.0 and lower). Again, this can be slow in IE6.
  8. return str.replace(/^\s*((?:[\S\s]*\S)?)\s*$/, '$1');
    This uses a simple, single-pass, greedy approach. In IE6, this is crazy fast! The performance difference indicates that IE has superior optimization for quantification of "any character" tokens.
  9. return str.replace(/^\s*([\S\s]*?)\s*$/, '$1');
    This is generally the fastest with very short strings which contain both non-space characters and edge whitespace. This minor advantage is due to the simple, single-pass, lazy approach it uses. Like trim8, this is significantly faster in IE6 than Firefox 2.

Since I've seen the following additional implementation in one library, I'll include it here as a warning:

return str.replace(/^\s*([\S\s]*)\b\s*$/, '$1');

Although the above is sometimes the fastest method when working with short strings which contain both non-space characters and edge whitespace, it performs very poorly with long strings which contain numerous word boundaries, and it's terrible (!) with long strings comprised of nothing but whitespace, since that triggers an exponentially increasing amount of backtracking. Do not use.

A different endgame

There are two methods in the table at the top of this post which haven't been covered yet. For those, I've used a non-regex and hybrid approach.

After comparing and analyzing all of the above, I wondered how an implementation which used no regular expressions would perform. Here's what I tried:

function trim10 (str) {
	var whitespace = ' \n\r\t\f\x0b\xa0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000';
	for (var i = 0; i < str.length; i++) {
		if (whitespace.indexOf(str.charAt(i)) === -1) {
			str = str.substring(i);
			break;
		}
	}
	for (i = str.length - 1; i >= 0; i--) {
		if (whitespace.indexOf(str.charAt(i)) === -1) {
			str = str.substring(0, i + 1);
			break;
		}
	}
	return whitespace.indexOf(str.charAt(0)) === -1 ? str : '';
}

How does that perform? Well, with long strings which do not contain excessive leading or trailing whitespace, it blows away the competition (except against trim1/2/8 in IE, which are already insanely fast there).

Does that mean regular expressions are slow in Firefox? No, not at all. The issue here is that although regexes are very well suited for trimming leading whitespace, apart from the .NET library (which offers a somewhat-mysterious "backwards matching" mode), they don't really provide a method to jump to the end of a string without even considering previous characters. However, the non-regex-reliant trim10 function does just that, with the second loop working backwards from the end of the string until it finds a non-whitespace character.

Knowing that, what if we created a hybrid implementation which combined a regex's universal efficiency at trimming leading whitespace with the alternative method's speed at removing trailing characters?

function trim11 (str) {
	str = str.replace(/^\s+/, '');
	for (var i = str.length - 1; i >= 0; i--) {
		if (/\S/.test(str.charAt(i))) {
			str = str.substring(0, i + 1);
			break;
		}
	}
	return str;
}

Although the above is a bit slower than trim10 with some strings, it uses significantly less code and is still lightning fast. Plus, with strings which contain a lot of leading whitespace (which includes strings comprised of nothing but whitespace), it's much faster than trim10.

In conclusion…

Since the differences between the implementations cross-browser and when used with different data are both complex and nuanced (none of them are faster than all the others with any data you can throw at it), here are my general recommendations for a trim method:

  • Use trim1 if you want a general-purpose implementation which is fast cross-browser.
  • Use trim11 if you want to handle long strings exceptionally fast in all browsers.

To test all of the above implementations for yourself, try my very rudimentary benchmarking page. Background processing can cause the results to be severely skewed, so run the test a number of times (regardless of how many iterations you specify) and only consider the fastest results (since averaging the cost of background interference is not very enlightening).

As a final note, although some people like to cache regular expressions (e.g. using global variables) so they can be used repeatedly without recompilation, IMO this does not make much sense for a trim method. All of the above regexes are so simple that they typically take no more than a nanosecond to compile. Additionally, some browsers automatically cache the most recently used regexes, so a typical loop which uses trim and doesn't contain a bunch of other regexes might not encounter recompilation anyway.


Edit (2008-02-04): Shortly after posting this I realized trim10/11 could be better written. Several people have also posted improved versions in the comments. Here's what I use now, which takes the trim11-style hybrid approach:

function trim12 (str) {
	var	str = str.replace(/^\s\s*/, ''),
		ws = /\s/,
		i = str.length;
	while (ws.test(str.charAt(--i)));
	return str.slice(0, i + 1);
}

New library: Are you a JavaScript regex master, or want to be? Then you need my fancy XRegExp library. It adds new regex syntax (including named capture and Unicode properties); s, x, and n flags; powerful regex utils; and it fixes pesky browser inconsistencies. Check it out!