Unlike lookaheads, JavaScript doesn't support regex lookbehind syntax. That's unfortunate, but I'm not content with just resigning to that fact. Following are three ways I've come up with to mimic lookbehinds in JavaScript.
For those not familiar with the concept of lookbehinds, they are zero-width assertions which, like the more specific \b, ^, and $ metacharacters, don't actually consume anything; they just match a position within text. This can be a very powerful concept. Read this first if you need more details.
Mimicking lookbehind with the replace method and optional capturing groups
This first approach is not much like a real lookbehind, but it might be "good enough" in some simple cases. Here are a few examples:
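// Mimic leading, positive lookbehind, like replace(/(?<=es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
    return $1 ? $1 + 'x' : $0;
});
// output: tesxt

// Mimic leading, negative lookbehind, like replace(/(?<!es)t/g, 'x')
var output = 'testt'.replace(/(es)?t/g, function($0, $1){
    return $1 ? $0 : 'x';
});
// output: xestx

// Mimic inner, positive lookbehind, like replace(/\w(?<=s)t/g, 'x')
var output = 'testt'.replace(/(?:(s)|\w)t/g, function($0, $1){
    return $1 ? 'x' : $0;
});
// output: text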
Unfortunately, there are many cases where lookbehinds can't be mimicked using this construct. Here's one example:
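// Trying to mimic (?<=t)t, but this time overlapping matches are required
var output = 'ttttt'.replace(/(t)?t/g, function($0, $1){
    return $1 ? $1 + 'x' : $0;
});
// output: txtxt (a real lookbehind would give txxxx)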
The problem is that the regexes rely on actually consuming the characters which should be within zero-width lookbehind assertions, then putting the match back unchanged (an effective no-op) depending on whether the backreferences contain a value. Because consumed characters cannot take part in a later match, cases that need overlapping matches, like the one above, fail. Since the actual matching process here doesn't work anything like real lookbehinds, this only works in a limited number of scenarios. Additionally, it only works with the replace method, since other regex-related methods don't offer a mechanism to dynamically "undo" matches. However, since you can run arbitrary code in the replacement function, it does offer a limited degree of flexibility.
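The pattern shared by the examples above can be folded into a small helper. This is only a sketch: the name replaceAfter and its parameters are made up for illustration, and it assumes that neither source string contains capturing groups of its own and that a group which did not participate in the match reaches the callback as undefined.

```javascript
// Hypothetical helper (not from the original examples): approach one, generalized.
// lbSource and innerSource are regex source strings without capturing groups.
// positive selects positive (true) or negative (false) lookbehind emulation.
function replaceAfter(str, lbSource, innerSource, replacement, positive) {
    var regex = new RegExp('(' + lbSource + ')?(' + innerSource + ')', 'g');
    return str.replace(regex, function ($0, $1) {
        var lbMatched = $1 !== undefined;
        if (lbMatched === positive) {
            // Put back the consumed "lookbehind" text, then the replacement
            return (lbMatched ? $1 : '') + replacement;
        }
        // Condition not met: return the match unchanged
        return $0;
    });
}

var positiveResult = replaceAfter('testt', 'es', 't', 'x', true);  // tesxt
var negativeResult = replaceAfter('testt', 'es', 't', 'x', false); // xestx
var overlapResult = replaceAfter('ttttt', 't', 't', 'x', true);    // txtxt, not txxxx
```

The last call shows that wrapping the construct in a function does not remove its limitation with overlapping matches.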
Mimicking lookbehind through reversal
The next approach uses lookaheads to mimic lookbehinds, and relies on manually reversing the data and writing your regex backwards. You'll also need to write the replacement value backwards if using this with the replace method, flip the match index if using this with the search method, etc. If that sounds a bit confusing, it is. I'll show an example in a second, but first we need a way to reverse our test string, since JavaScript doesn't provide this capability natively.
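// Note: this reverses by UTF-16 code unit, so surrogate pairs and combining marks are not preserved
String.prototype.reverse = function () {
    return this.split('').reverse().join('');
};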
Now let's try to pull this off:
// Mimic replace(/(?<=es)t/g, 'x'); output: tesxt
var output = 'testt'.reverse().replace(/t(?=se)/g, 'x').reverse();
That actually works quite nicely, and allows mimicking both positive and negative lookbehind. However, writing a more complex regex with all nodes reversed can get a bit confusing, and since lookahead is used to mimic lookbehind, you can't mix what you intend as real lookaheads in the same pattern.
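To make the negative case and the index flipping concrete, here is a minimal sketch. The standalone reverseString function is a hypothetical stand-in for the prototype method above, and the index arithmetic assumes you already know the length of the matched text. Note that searching the reversed string finds what is the last match in the original data, so this is only equivalent to a normal search when there is a single match.

```javascript
// Standalone reversal helper (hypothetical name), avoiding changes to String.prototype
function reverseString(str) {
    return str.split('').reverse().join('');
}

// Negative lookbehind: (?<!es)t becomes t(?!se) on the reversed data
var negative = reverseString(
    reverseString('testt').replace(/t(?!se)/g, 'x')
);
// negative: xestx

// Index flipping for search: (?<=es)t becomes t(?=se) on the reversed data
var data = 'testt',
    matchLength = 1, // length of the matched text ('t')
    reversedIndex = reverseString(data).search(/t(?=se)/),
    index = reversedIndex === -1 ? -1 : data.length - reversedIndex - matchLength;
// index: 3
```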
Note that reversing a string and applying regexes with reversed nodes can actually open up entirely new ways to approach a pattern, and in a few cases might make your code faster, even with the overhead of reversing the data. I'll have to save the efficiency discussion for another day, but before moving on to the third lookbehind-mimicking approach, here's one example of a new pattern approach made possible through reversal.
In my last post, I used the following code to add commas every three digits from the right for all numbers which are not preceded by a dot, letter, or underscore:
String.prototype.commafy = function () {
    return this.replace(/(^|[^\w.])(\d{4,})/g, function($0, $1, $2) {
        return $1 + $2.replace(/\d(?=(?:\d\d\d)+(?!\d))/g, '$&,');
    });
};
Here's an alternative implementation:
String.prototype.commafy = function() {
    return this.
        reverse().
        replace(/\d\d\d(?=\d)(?!\d*[a-z._])/gi, '$&,').
        reverse();
};
I'll leave the analysis for your free time.
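As a starting point for that analysis, the two versions can be run side by side. The sketch below restates them as standalone functions with hypothetical names (commafyNested, commafyReversed, reverseString) so they don't overwrite each other on String.prototype; the sample string is made up, and agreement on one sample is not a proof that the two always agree.

```javascript
// Standalone reversal helper (hypothetical name)
function reverseString(str) {
    return str.split('').reverse().join('');
}

// Version one: outer replace finds eligible numbers, inner replace adds commas
function commafyNested(str) {
    return str.replace(/(^|[^\w.])(\d{4,})/g, function ($0, $1, $2) {
        return $1 + $2.replace(/\d(?=(?:\d\d\d)+(?!\d))/g, '$&,');
    });
}

// Version two: on reversed data, "not preceded by" becomes a negative lookahead
function commafyReversed(str) {
    return reverseString(
        reverseString(str).replace(/\d\d\d(?=\d)(?!\d*[a-z._])/gi, '$&,')
    );
}

var sample = '1234567.12345 a1234 x_9876 100 5000';
var nested = commafyNested(sample);
var reversed = commafyReversed(sample);
// Both give: 1,234,567.12345 a1234 x_9876 100 5,000
```

In the reversed version, the digits left of a number's start appear to the right of each three-digit group, which is why a single lookahead can check for a following dot, letter, or underscore.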
Finally, we come to the third lookbehind-mimicking approach:
Mimicking lookbehind using a while loop and regexp.lastIndex
This last approach has the following advantages:
- It's easier to use (no need to reverse your data and regex nodes).
- It allows lookahead and lookbehind to be used together.
- It allows you to more easily automate the mimicking process.
However, the trade-off is that, in order to avoid interfering with standard regex backtracking, this approach only allows you to use lookbehinds (positive or negative) at the very start and/or end of your regexes. Fortunately, it's quite common to want to use a lookbehind at the start of a regex.
If you're not already familiar with the exec method available for RegExp objects, make sure to read about it at the Mozilla Developer Center before continuing. In particular, look at the examples which use exec within a while loop.
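As a quick refresher, here is a minimal sketch of the exec/lastIndex loop. The helper name findAll and its allowOverlap parameter are invented for this illustration, and the helper rebuilds the regex with only the g flag, which is a simplification. With the g flag, each exec call starts searching at regex.lastIndex and then moves lastIndex to the end of the match; writing to lastIndex yourself changes where the next search begins.

```javascript
// Hypothetical helper: collect the start index of every match of pattern in data
function findAll(pattern, data, allowOverlap) {
    var regex = new RegExp(pattern.source, 'g'), // other flags dropped for brevity
        indexes = [],
        match;
    while ((match = regex.exec(data)) !== null) {
        indexes.push(match.index);
        if (allowOverlap || match[0] === '') {
            // Restart one character after the match start instead of at the match end.
            // For empty matches this also prevents an infinite loop.
            regex.lastIndex = match.index + 1;
        }
    }
    return indexes;
}

var normal = findAll(/tt/, 'tttt', false);     // [0, 2]
var overlapping = findAll(/tt/, 'tttt', true); // [0, 1, 2]
```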
Here's a quick implementation of this approach, in which we'll actually toy with the regex engine's bump-along mechanism to get it to work as we want:
var data = 'ttttt',
    regex = /t/g,
    replacement = 'x',
    match,
    lastLastIndex = 0,
    output = '';

regex.x = {
    gRegex: /t/g,
    startLb: {
        regex: /t$/,
        type: true
    }
};

function lookbehind (data, regex, match) {
    return (
        (regex.x.startLb ? (regex.x.startLb.regex.test(data.substring(0, match.index)) === regex.x.startLb.type) : true) &&
        (regex.x.endLb ? (regex.x.endLb.regex.test(data.substring(0, regex.x.gRegex.lastIndex)) === regex.x.endLb.type) : true)
    );
}
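while (match = regex.x.gRegex.exec(data)) {
    // The match passes if its start is (or is not) preceded by the start lookbehind, and its end by the end lookbehind
    if (lookbehind(data, regex, match)) {
        // Copy any text skipped since the last replacement, then the replaced match (replacement can be a function)
        output += data.substring(lastLastIndex, match.index) + match[0].replace(regex, replacement);
        lastLastIndex = regex.x.gRegex.lastIndex;
        if (!regex.global) {
            break;
        }
    } else {
        // The inner pattern matched, but a lookbehind failed: try again one character
        // after the failed position, rather than at the end of the last match
        regex.x.gRegex.lastIndex = match.index + 1;
    }
}
output += data.substring(lastLastIndex);
// output: txxxx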
That's a fair bit of code, but it's quite powerful. It accounts for using both a leading and trailing lookbehind, and allows using a function for the replacement value. Also, this could relatively easily be made into a function which accepts a string for the regex using normal lookbehind syntax (e.g., "(?<=x)x(?<!x)"), then splits it into the various parts it needs before applying it.
Notes:
- regex.x.gRegex should be an exact copy of regex, with the difference that it must use the g flag whether or not regex does (in order for the exec method to interact with the while loop as we need it to).
- regex.x.startLb.type and regex.x.endLb.type use true for "positive," and false for "negative."
- regex.x.startLb.regex and regex.x.endLb.regex are the patterns you want to use for the lookbehinds, but they must contain a trailing $. The dollar sign in this case does not mean end of the data, but rather end of the data segment they will be tested against.
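One way to package the loop for reuse is sketched below. It does not parse lookbehind syntax out of a pattern string; the pieces are passed in separately. The function name replaceWithLookbehind and its parameter layout are assumptions made for this illustration, and it assumes the lookbehind regexes end with $ and do not use the g flag (test is stateful on global regexes).

```javascript
// Hypothetical wrapper around the exec/lastIndex loop.
// startLb and endLb are optional: { regex: /...$/, type: true | false }
function replaceWithLookbehind(data, regex, replacement, startLb, endLb) {
    var flags = 'g' + (regex.ignoreCase ? 'i' : '') + (regex.multiline ? 'm' : ''),
        gRegex = new RegExp(regex.source, flags), // copy of regex, always global
        output = '',
        lastLastIndex = 0,
        match;

    // True if there is no lookbehind, or if it matches (or not, per type) the text before prefixEnd
    function passes(lb, prefixEnd) {
        return !lb || lb.regex.test(data.substring(0, prefixEnd)) === lb.type;
    }

    while ((match = gRegex.exec(data)) !== null) {
        if (passes(startLb, match.index) && passes(endLb, gRegex.lastIndex)) {
            output += data.substring(lastLastIndex, match.index) +
                match[0].replace(regex, replacement);
            lastLastIndex = gRegex.lastIndex;
            if (!regex.global) { break; }
            if (match[0] === '') { gRegex.lastIndex++; } // avoid looping on empty matches
        } else {
            // A lookbehind failed: retry one character after the failed position
            gRegex.lastIndex = match.index + 1;
        }
    }
    return output + data.substring(lastLastIndex);
}

var positive = replaceWithLookbehind('ttttt', /t/g, 'x', { regex: /t$/, type: true });  // txxxx
var negative = replaceWithLookbehind('ttttt', /t/g, 'x', { regex: /t$/, type: false }); // xtttt
var firstOnly = replaceWithLookbehind('ttttt', /t/, 'x', { regex: /t$/, type: true });  // txttt
```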
If you're wondering why there hasn't been any discussion of fixed- vs. variable-length lookbehinds, that's because none of these approaches have any such limitations. They support full, variable-length lookbehind, which no regex engines I know of other than .NET and JGsoft (used by products like RegexBuddy) are capable of.
In conclusion, if you take advantage of all of the above approaches, regex lookbehind syntax can be mimicked in JavaScript in the vast majority of cases. Make sure to take advantage of the comment button if you have feedback about any of this stuff.
Update 2012-04: See my followup blog post, JavaScript Regex Lookbehind Redux, where I've posted a collection of short functions that make it much easier to simulate leading lookbehind.