Mimicking Lookbehind in JavaScript

JavaScript supports lookahead, but it doesn't support regex lookbehind syntax. That's unfortunate, but I'm not content to just resign myself to that fact. Following are three ways I've come up with to mimic lookbehinds in JavaScript.

For those not familiar with the concept of lookbehinds, they are zero-width assertions which, like the more specific \b, ^, and $ metacharacters, don't actually consume anything; they just match a position within text. This can be a very powerful concept. Read this first if you need more details.

Mimicking lookbehind with the replace method and optional capturing groups

This first approach is not much like a real lookbehind, but it might be "good enough" in some simple cases. Here are a few examples:

// Mimic leading, positive lookbehind like replace(/(?<=es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
	return $1 ? $1 + 'x' : $0;
});
// output: tesxt

// Mimic leading, negative lookbehind like replace(/(?<!es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
	return $1 ? $0 : 'x';
});
// output: xestx

// Mimic inner, positive lookbehind like replace(/\w(?<=s)t/g, 'x')
var output = 'testt'.replace(/(?:(s)|\w)t/g, function($0, $1){
	return $1 ? 'x' : $0;
});
// output: text

Unfortunately, there are many cases where lookbehinds can't be mimicked using this construct. Here's one example:

// Trying to mimic positive lookbehind, but this doesn't work
var output = 'ttttt'.replace(/(t)?t/g, function($0, $1){
	return $1 ? $1 + 'x' : $0;
});
// output: txtxt
// desired output: txxxx

The problem is that the regexes rely on actually consuming the characters which should be within zero-width lookbehind assertions, then simply putting the match back unchanged (an effective no-op) depending on whether or not the capturing group participated in the match. Since the actual matching process here doesn't work anything like real lookbehinds, this only works in a limited number of scenarios. Additionally, it only works with the replace method, since other regex-related methods don't offer a mechanism to dynamically "undo" matches. However, since you can run arbitrary code in the replacement function, it does offer a limited degree of flexibility.
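To see the consumption problem concretely, here's a small sketch that records where each match starts and what it swallowed. The matches array and the use of the callback's offset argument are additions for illustration only:

```javascript
// Record the offset and text of each match made by the failing regex above.
// With one capturing group, the callback's third argument is the match offset.
var matches = [];
'ttttt'.replace(/(t)?t/g, function ($0, $1, offset) {
	matches.push(offset + ':' + $0);
	return $0; // leave the string unchanged; we only want to observe the matches
});
// matches: ['0:tt', '2:tt', '4:t']
```

Each two-character match uses up the t that the next match would need to see behind it, which is why only every other t ends up replaced.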

Mimicking lookbehind through reversal

The next approach uses lookaheads to mimic lookbehinds, and relies on manually reversing the data and writing your regex backwards. You'll also need to write the replacement value backwards if using this with the replace method, flip the match index if using this with the search method, etc. If that sounds a bit confusing, it is. I'll show an example in a second, but first we need a way to reverse our test string, since JavaScript doesn't provide this capability natively.

String.prototype.reverse = function () {
	return this.split('').reverse().join('');
};

Now let's try to pull this off:

// Mimicking lookbehind like (?<=es)t
var output = 'testt'.reverse().replace(/t(?=se)/g, 'x').reverse();
// output: tesxt

That actually works quite nicely, and allows mimicking both positive and negative lookbehind. However, writing a more complex regex with all nodes reversed can get a bit confusing, and since lookahead is used to mimic lookbehind, you can't mix what you intend as real lookaheads in the same pattern.
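Here's a rough sketch of the negative case, plus the index flipping needed for search. The reverseString and searchReversed helpers are hypothetical names introduced for this example, and the search helper assumes the last match in the reversed data lines up with the first match in the original, which holds for simple patterns like this one but isn't guaranteed in general:

```javascript
// Standalone helper, used here instead of extending String.prototype
function reverseString(str) {
	return str.split('').reverse().join('');
}

// Mimic negative lookbehind like replace(/(?<!es)t/g, 'x'):
// the lookbehind (?<!es) becomes the lookahead (?!se) on reversed data
var negative = reverseString(reverseString('testt').replace(/t(?!se)/g, 'x'));
// negative: xestx

// Mimic search with a leading lookbehind, like search(/(?<=es)t/).
// reversedSource is the pattern already written backwards, e.g. 't(?=se)'.
function searchReversed(str, reversedSource) {
	var reversed = reverseString(str),
		regex = new RegExp(reversedSource, 'g'),
		match,
		last = null;
	// Keep the last match in the reversed data (assumed to be the first in the original)
	while ((match = regex.exec(reversed))) {
		last = match;
		if (match[0].length === 0) {
			regex.lastIndex++; // step past zero-length matches to avoid looping forever
		}
	}
	// Flip the index: count from the other end, then step back over the match itself
	return last ? str.length - last.index - last[0].length : -1;
}

var index = searchReversed('testt', 't(?=se)');
// index: 3
```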

Note that reversing a string and applying regexes with reversed nodes can actually open up entirely new ways to approach a pattern, and in a few cases might make your code faster, even with the overhead of reversing the data. I'll have to save the efficiency discussion for another day, but before moving on to the third lookbehind-mimicking approach, here's one example of a new pattern approach made possible through reversal.

In my last post, I used the following code to add commas every three digits from the right for all numbers which are not preceded by a dot, letter, or underscore:

String.prototype.commafy = function () {
	return this.replace(/(^|[^\w.])(\d{4,})/g, function($0, $1, $2) {
		return $1 + $2.replace(/\d(?=(?:\d\d\d)+(?!\d))/g, '$&,');
	});
};

Here's an alternative implementation:

String.prototype.commafy = function() {
	return this.
		reverse().
		replace(/\d\d\d(?=\d)(?!\d*[a-z._])/gi, '$&,').
		reverse();
};

I'll leave the analysis for your free time.
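If you'd like a starting point for that analysis, here's a quick sketch that runs both implementations side by side. The function names and the sample list are my own choices for illustration, and agreement on these samples doesn't prove the two regexes are equivalent for every input:

```javascript
function reverseString(str) {
	return str.split('').reverse().join('');
}

// The first implementation, rewritten as a plain function
function commafyForward(str) {
	return str.replace(/(^|[^\w.])(\d{4,})/g, function ($0, $1, $2) {
		return $1 + $2.replace(/\d(?=(?:\d\d\d)+(?!\d))/g, '$&,');
	});
}

// The reversal-based implementation, rewritten as a plain function
function commafyReversed(str) {
	return reverseString(
		reverseString(str).replace(/\d\d\d(?=\d)(?!\d*[a-z._])/gi, '$&,')
	);
}

var samples = ['100', '1000', '1000000', '1000.9999', '.9999', '-1000', '1000MHz', 'Z1000'],
	disagreements = [],
	i;

// Collect any sample where the two implementations produce different output
for (i = 0; i < samples.length; i++) {
	if (commafyForward(samples[i]) !== commafyReversed(samples[i])) {
		disagreements.push(samples[i]);
	}
}
// disagreements: [] (empty for these samples)
```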

Finally, we come to the third lookbehind-mimicking approach:

Mimicking lookbehind using a while loop and regexp.lastIndex

This last approach has the following advantages:

  • It's easier to use (no need to reverse your data and regex nodes).
  • It allows lookahead and lookbehind to be used together.
  • It allows you to more easily automate the mimicking process.

However, the trade-off is that, in order to avoid interfering with standard regex backtracking, this approach only allows you to use lookbehinds (positive or negative) at the very start and/or end of your regexes. Fortunately, it's quite common to want to use a lookbehind at the start of a regex.

If you're not already familiar with the exec method available for RegExp objects, make sure to read about it at the Mozilla Developer Center before continuing. In particular, look at the examples which use exec within a while loop.

Here's a quick implementation of this approach, in which we'll actually toy with the regex engine's bump-along mechanism to get it to work as we want:

var data = 'ttttt',
	regex = /t/g,
	replacement = 'x',
	match,
	lastLastIndex = 0,
	output = '';

regex.x = {
	gRegex: /t/g,
	startLb: {
		regex: /t$/,
		type: true
	}
};

function lookbehind (data, regex, match) {
	return (
		(regex.x.startLb ? (regex.x.startLb.regex.test(data.substring(0, match.index)) === regex.x.startLb.type) : true) &&
		(regex.x.endLb ? (regex.x.endLb.regex.test(data.substring(0, regex.x.gRegex.lastIndex)) === regex.x.endLb.type) : true)
	);
}

while (match = regex.x.gRegex.exec(data)) {
	/* If the start lookbehind passes before the match, and the end lookbehind passes at the end of the match */
	if (lookbehind(data, regex, match)) {
		/* replacement can be a function */
		output += data.substring(lastLastIndex, match.index) + match[0].replace(regex, replacement);
		if(!regex.global){
			/* gRegex lives on regex.x, not on regex itself */
			lastLastIndex = regex.x.gRegex.lastIndex;
			break;
		}
	/* If the inner pattern matched, but the leading or trailing lookbehind failed */
	} else {
		/* Copy everything up to and including the first character of the failed match, so text between matches isn't lost */
		output += data.substring(lastLastIndex, match.index + 1);
		/* Set the regex to try again one character after the failed position, rather than at the end of the last match */
		regex.x.gRegex.lastIndex = match.index + 1;
	}
	lastLastIndex = regex.x.gRegex.lastIndex;
}
output += data.substring(lastLastIndex);

// output: txxxx

That's a fair bit of code, but it's quite powerful. It accounts for using both a leading and trailing lookbehind, and allows using a function for the replacement value. Also, this could relatively easily be made into a function which accepts a string for the regex using normal lookbehind syntax (e.g., "(?<=x)x(?<!x)"), then splits it into the various parts it needs before applying it.

Notes:

  • regex.x.gRegex should be an exact copy of regex, with the difference that it must use the g flag whether or not regex does (in order for the exec method to interact with the while loop as we need it to).
  • regex.x.startLb.type and regex.x.endLb.type use true for "positive," and false for "negative."
  • regex.x.startLb.regex and regex.x.endLb.regex are the patterns you want to use for the lookbehinds, but they must contain a trailing $. The dollar sign in this case does not mean end of the data, but rather end of the data segment they will be tested against.

If you're wondering why there hasn't been any discussion of fixed- vs. variable-length lookbehinds, that's because none of these approaches have any such limitations. They support full, variable-length lookbehind, which no regex engines I know of other than .NET and JGsoft (used by products like RegexBuddy) are capable of.

In conclusion, if you take advantage of all of the above approaches, regex lookbehind syntax can be mimicked in JavaScript in the vast majority of cases. Make sure to take advantage of the comment button if you have feedback about any of this stuff.

Update 2012-04: See my followup blog post, JavaScript Regex Lookbehind Redux, where I've posted a collection of short functions that make it much easier to simulate leading lookbehind.

Commafy Numbers

I've never used the few scripts I've seen that add commas to numbers because usually I want to apply the functionality to entire blocks of text. Having to pull out numbers, add commas, then put them back becomes a needlessly complex task without a method which can just do this in one shot. So, here's my attempt at this (if JavaScript regexes supported lookbehind, it could be even shorter):

String.prototype.commafy = function () {
	return this.replace(/(^|[^\w.])(\d{4,})/g, function($0, $1, $2) {
		return $1 + $2.replace(/\d(?=(?:\d\d\d)+(?!\d))/g, "$&,");
	});
};

Number.prototype.commafy = function () {
	return String(this).commafy();
};

Here are a couple of examples of how this can be used:

(1000).commafy();
// Output: 1,000

var data = '1\n' +
	'10\n' +
	'100\n' +
	'1000\n' +
	'10000\n' +
	'100000\n' +
	'1000000\n' +
	'12345678901234567890\n' +
	'1000.99\n' +
	'1000.9999\n' +
	'.9999\n' +
	'-1000\n' +
	'$1000\n' +
	'"1000"\n' +
	'1000MHz\n' +
	'Z1000';

data.commafy();
/* Output:
1
10
100
1,000
10,000
100,000
1,000,000
12,345,678,901,234,567,890
1,000.99
1,000.9999
.9999
-1,000
$1,000
"1,000"
1,000MHz
Z1000
*/

Note that it adds commas to numbers followed by non-numeric characters, but avoids adding commas to numbers immediately preceded by a dot (decimal point), letter, or underscore. And as shown, this can be applied to individual numbers or entire blocks of text.

This is a decent example of where regular expressions can help to shorten and simplify code even in places you may not initially think to use them.


Edit: I've included an alternative implementation of the above code in my post Mimicking Lookbehind in JavaScript, which temporarily reverses the string to allow a simplified approach.

Smoking FAQ, Part 1

Several of my work buddies have quit smoking recently, which means I need to recruit new smokers to join me for smoke breaks. To help with this, here is the first in possibly a series of pro-smoking pieces. (This is adapted from The Easy Way to Start Smoking.)

Is smoking cool?
Taking even the most objective and detached view of this issue, it's hard to ignore the fact that the vast majority of people who hold the view that smoking is not cool are themselves not cool, and that the people who think it is cool are cool.
What is secondhand smoking?
Secondhand smoking is the best way to fill your lungs without spending $4.50 a day for the pleasure. Nowadays, secondhand smoke is big news and suddenly seems to affect everyone. This is despite the fact that smokers themselves have known about the benefits of secondhand smoking for years. Why do you think we like to huddle together, if not to smoke our own cigs and a little of everyone else's, too?
Is secondhand smoking cool?
The honest response is no. Simply standing next to smokers won't help freeloaders in this respect.
Why do smokers get more colds?
Remember that colds are viral infections, passed from person to person. Catching a cold is simply a side effect of leading an active, sociable life full of friends, lovers, and laughter. Non-smokers might want to ask themselves why it is they catch so few colds. Why are they so unpopular? What is wrong with them?
Where have I hidden your cigs?
That's not funny.

RegexBuddy 3.0 Beta

RegexBuddy is one of those tools which, now that I've gotten used to having around, I'd have a hard time living without—kind of like regular expressions themselves. I'm happy to see that the recently released (but little publicized) RegexBuddy 3.0 beta pushes what's already the best regex builder/tester on the market quite a bit further. (If you haven't heard of it before, start here.)

Here are the new features I'm personally most excited about:

  • Flavors: RegexBuddy 3 lets you emulate specific regex flavors, including .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl, POSIX BRE/ERE, and a number of others. Don't know why a particular regex doesn't work in your tool of choice? Now, RegexBuddy can tell you. Although I wouldn't expect this feature to mimic other flavors 100% accurately in all cases (e.g., I don't expect it to mimic specific bugs found in other libraries), it's still a badass new feature which further separates RegexBuddy from every competing tool. (Update: Jan Goyvaerts responded regarding the issue of bug emulation on his blog.)
  • Added syntax support: To accompany the above feature, a significant amount of flavor-specific regex syntax support has been added.
  • Integrated forums: RegexBuddy 3 is the first application I've seen which includes an integrated forum system not available outside of the software. This offers some advantages (e.g., almost no spam) and conveniences (e.g., you can attach a regex together with modifiers and target data with the click of a button), but it's a fairly novel idea. I'm curious to see how much action it will get since it's only available to licensed users… Author Jan Goyvaerts talks a bit about the concept on his blog. I'm sure you'll be able to find me on the new forums from time to time.
  • Enhanced match info: New features like "List All Group Matches in Columns" and "List All Replacements" offer lots of shortcuts to the info you're after.
  • History: You can now easily keep a list of regexes you've tried during your session.
  • New interface: The tabs from RegexBuddy 2 are now panels that you can do much more with. There's also a dual-monitor layout preset which I'll probably make good use of.
  • Debug everywhere: You can now run the debugger at every position in the test data, rather than only at the current cursor position. This is a great way to gain insight into how regular expression engines work internally.

There are a number of other nifty features I haven't mentioned here, but I'll leave them for you to discover (or read about in the changelog). The beta is only available to registered users, but if you already own RegexBuddy 2.x you should give the new beta a try.