Regular Expressions — Flagrant Badassery

XRegExp: An Extended JavaScript Regex Constructor

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

I use regular expressions in JavaScript fairly frequently, and although the exec() method is badass and I love the ability to use a function to generate the replacement in the replace() method, JavaScript regexes lack some very significant features available in many other languages. Amongst my biggest pet peeves is the lack of support for s and x flags, which should enable "dot matches all" (a.k.a, single-line) mode and "free-spacing and comments" mode, respectively. These modifiers are available in almost every other modern regex library.

To remedy this, I've created a (very small) script which extends the JavaScript RegExp constructor, enabling the aforementioned flags. Basically, it gives you a constructor called XRegExp which works exactly like the RegExp constructor except that it also accepts s and x as flags, in addition to the already supported g (global), i (case insensitive), and m (multiline, i.e., ^ and $ match at line breaks). As a bonus, XRegExp also improves cross-browser regex syntax consistency.

Here's how it works:

var regex = new XRegExp("te?", "gi");
var value = "Test".replace(regex, "z"); // value is now "zsz"

Look familiar? If you've used regular expressions in JavaScript before, it should — it's exactly like using RegExp. However, so far we haven't done anything that can't be accomplished using the native RegExp constructor. The s flag is pretty self-explanatory (specific details can be found in the FAQ, below), so here's an example with the x flag:

// Turn email addresses into links
var email = new XRegExp(
    "\\b                                      " +
    "# Capture the address to $1            \n" +
    "(                                        " +
    "  \\w[-.\\w]*               # username \n" +
    "  @                                      " +
    "  [-.a-z0-9]+\\.(?:com|net) # hostname \n" +
    ")                                        " +
    "\\b                                      ", "gix");

value = value.replace(email, "<a href=\"mailto:$1\">$1</a>");

That's certainly different! A couple notes:

When using XRegExp, the normal string escape rules (preceding special characters with "\") are necessary, just as with RegExp. Hence, the three instances of \n are metasequences within the string literal itself. JavaScript converts them to newline characters (which end the comments) before XRegExp sees the string.
The email regex is overly simplistic and only intended for demonstrative purposes.

That's fairly nifty, but we can make this even easier. If you run the following line of code:

XRE.overrideNative();

…Like magic, the RegExp constructor itself will support the s and x flags from that point forward. The tradeoff is that you will then no longer be able to access information about the last match as properties of the global RegExp object. However, those properties are all officially deprecated anyway, and you can access all the same info though a combination of properties on regex instances and use of the exec() method.

Here's a quick FAQ. For the first two questions, I've borrowed portions of the explanations from O'Reilly's Mastering Regular Expressions, 3^rd Edition.

What exactly does the s flag do?

Usually, dot does not match a newline. However, a mode in which dot matches a newline can be as useful as one where dot doesn't. The s flag allows the mode to be selected on a per-regex basis. Note that dots within character classes (e.g., [.a-z]) are always equivalent to literal dots.

As for what exactly is considered a newline character (and therefore not matched by dots outside of character classes unless using the s flag), according to the Mozilla Developer Center it includes the four characters matched by the following regex: [\n\r\u2028\u2029]

What exactly does the x flag do?

First, it causes most whitespace to be ignored, so you can "free-format" the expression for readability. Secondly, it allows comments with a leading #.

Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within a character class (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace, as in new XRegExp("<a \\s+ href=…>", "x"). Note that describing whitespace and comments as ignore-me metacharacters is not quite accurate; it might be better to think of them as do-nothing metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. Finally, don't immediately follow whitespace or a comment with a quantifier (e.g., * or ?), or you will quantify the do-nothing metacharacter.

As for what exactly is whitespace, according to the Mozilla Developer Center it is equivalent to all characters matched by the following regex:
[\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000]

Can the s and x flags be used together?

Yes. You can combine all supported flags (g, i, m, s, x) in any order.

What regex syntax does XRegExp support?

Whatever your browser supports natively.

You mentioned something about improving cross-browser regex syntax consistency?

When using XRegExp, a leading, unescaped ] within a character class is considered a literal character and therefore does not end the class. This is consistent with other regex engines I've used, and is true in Internet Explorer natively. However, in Firefox the following quirks (bugs?) can be experienced:

[^] is equivalent to [\S\s], although it should throw an error.
[^]] is equivalent to [\S\s]], although it should be equivalent to [^\]] or (?!])[\S\s].
[] is equivalent to (?!) (which will never match), although it should throw an error.
[]] is equivalent to (?!)] (which will never match), although it should be equvialent to [\]] or ].

When using XRegExp (or RegExp with XRE.overrideNative()), you don't have to worry about how different browsers handle this, as a leading ] within a character class will never end the class.

Which regex-related methods does XRegExp support?

All of them.

Are regexes built using XRegExp any slower than they would be otherwise?

No.

Does it take any longer to build regexes using XRegExp than it would otherwise?

Yes, by a tiny amount. From personal testing, building regexes using XRegExp typically takes less than a millisecond longer than it would otherwise. This is especially trivial given that regexes should not need to be constructed within loops. Instead, a regex should be assigned to a variable before entering a loop, to avoid rebuilding it during each iteration of the loop.

Which browsers has this been tested with?

Firefox 2, Internet Explorer 5.5–7, and Opera 9.

How big is the script file?

Minified, it's less than 1KB. Gzipping reduces it further.

What license is this released under?

The MIT License.

Does XRegExp affect regular expression literals?

No. Even when using XRE.overrideNative(), Perl-style regex literals (e.g., /pattern/gi) are unaffected.

I found a bug. Why do you suck so bad?

Are you sure the bug is in XRegExp? Regular expression syntax is somewhat complex and often changes its meaning given context. Additionally, metasequences within JavaScript string literals can change things before XRegExp ever sees your regex. In any case, whether or not you're sure you know what's causing the issue, feel free to leave a comment and I'll look into it ASAP.

Here's the script, with comments:

/*----------------------------------------------------------------------
XRegExp 0.1, by Steven Levithan <http://stevenlevithan.com>
MIT-style license
------------------------------------------------------------------------
Adds support for the following regular expression features:
  - The "s" flag: Dot matches all (a.k.a, single-line) mode.
  - The "x" flag: Free-spacing and comments mode.

XRegExp also offers consistent, cross-browser handling when "]" is used
as the first character within a character class (e.g., "[]]" or "[^]]").
----------------------------------------------------------------------*/

var XRegExp = function(pattern, flags){
	if(!flags) flags = "";
	
	/* If the "free-spacing and comments" modifier (x) is enabled, replace unescaped whitespace as well as unescaped pound
	signs (#) and any following characters up to and including the next newline character (\n) with "(?:)". Using "(?:)"
	instead of just removing matches altogether prevents, e.g., "\1 0" from becoming "\10" (which has different meanings
	depending on context). None of this applies within character classes, which are unaffected even when they contain
	whitespace or pound signs (which is consistent with pretty much every library except java.util.regex). */
	if(flags.indexOf("x") !== -1){
		pattern = pattern.replace(XRE.re.xMod, function($0, $1, $2){
			// If $2 is an empty string or its first character is "[", return the match unviolated (an effective no-op).
			return (/[^[]/.test($2.charAt(0)) ? $1 + "(?:)" : $0);
		});
	}
	
	/* Two steps (the order is not important):
	
	1. Since a regex literal will be used to return the final regex function/object, replace characters which are not
	   allowed within regex literals (carriage return, line feed) with the metasequences which represent them (\r, \n),
	   accounting for both escaped and unescaped literal characters within pattern. This step is only necessary to support
	   the XRE.overrideNative() method, since otherwise the RegExp constructor could be used to return the final regex.
	
	2. When "]" is the first character within a character class, convert it to "\]", for consistent, cross-browser handling.
	   This is included to workaround the following Firefox quirks (bugs?):
	     - "[^]" is equivalent to "[\S\s]", although it should throw an error.
	     - "[^]]" is equivalent to "[\S\s]]", although it should be equivalent to "[^\]]" or "(?!])[\S\s]".
	     - "[]" is equivalent to "(?!)" (which will never match), although it should throw an error.
	     - "[]]" is equivalent to "(?!)]" (which will never match), although it should be equvialent to "[\]]" or "]".
	   
	   Note that this step is not just an extra feature. It is in fact required in order to maintain correctness without
	   the aid of browser sniffing when constructing the regexes which deal with character classes (XRE.re.chrClass and
	   XRE.re.xMod). They treat a leading "]" within a character class as a non-terminating, literal character. */
	pattern = pattern.replace(XRE.re.badChr, function($0, $1, $2){
			return $1 + $2.replace(/\r/, "\\r").replace(/\n/, "\\n");
		}).
		replace(XRE.re.chrClass, function($0, $1, $2){
			return $1 + $2.replace(/^(\[\^?)]/, "$1\\]");
		});
	
	// If the "dot matches all" modifier (s) is enabled, replace unescaped dots outside of character classes with [\S\s]
	if(flags.indexOf("s") !== -1){
		pattern = pattern.replace(XRE.re.chrClass, function($0, $1, $2){
			return $1.replace(XRE.re.sMod, function($0, $1, $2){
					return $1 + ($2 === "." ? "[\\S\\s]" : "");
				}) + $2;
		});
	}
	
	// Use an evaluated regex literal to return the regular expression, in order to support the XRE.overrideNative() method.
	return eval("/" + pattern + "/" + flags.replace(/[sx]+/g, ""));
},
XRE = {
	overrideNative: function(){
		/* Override the global RegExp constructor/object with the enhanced XRegExp constructor. This precludes accessing
		properties of the last match via the global RegExp object. However, those properties are deprecated as of
		JavaScript 1.5, and the values are available on RegExp instances or via the exec() method. */
		RegExp = XRegExp;
	},
	re: {
		chrClass: /((?:[^[\\]+|\\(?:[\S\s]|$))*)((?:\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?)?)/g,
		xMod: /((?:[^[#\s\\]+|\\(?:[\S\s]|$))*)((?:\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?|\s*#[^\n\r]*[\n\r]?\s*|\s+)?)/g,
		sMod: /((?:[^\\.]+|\\(?:[\S\s]|$))*)(\.?)/g,
		badChr: /((?:[^\\\r\n]+|\\(?:[^\r\n]|$(?!\s)))*)\\?([\r\n]?)/g
	}
};
/* XRE.re is used to cache the more complex regexes so they don't have to be recompiled every time XRegExp runs. Note that
the included regexes match anything (if they didn't, they would have to be rewritten to avoid catastrophic backtracking on
failure). It's the backreferences, as well as where one match ends and the next begins, that's important. Also note that
the regexes are exactly as they are in order to account for all kinds of caveats involved in interpreting and working with
JavaScript regexes. Do not modify them! */

Download it here.

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

Mastering Regular Expressions, 3rd Edition

After reading positive reviews about the second edition it since I started using regular expressions a little over a year ago, I finally purchased O'Reilly Media's Mastering Regular Expressions by Jeffrey E. F. Friedl, since I discovered that a third edition came out in August 2006. The book arrived today, and it is indeed pretty damn excellent (I'm as excited about it as I can be about a tech book, anyway). I've only spent a few minutes with it so far, but I can see that it is very well presented and demonstrates some cool techniques. Hopefully I'll post about some nifty things I learn over the next few weeks, if I get a chance to actually sit down and read through a significant portion of the book.

Mimicking Conditionals

Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the regex libraries I'm familiar with, conditionals are only supported by .NET, Perl, PCRE (and hence, PHP's preg functions), and JGsoft products (including RegexBuddy).

There are two common types of regex conditionals in those libraries: capturing-group-based and lookaround-based. I'll get to the latter type in a bit, but first I'll address capturing-group-based conditionals, which are able to base logic on whether a capturing group has participated in the match so far. Here's an example:

(a)?b(?(1)c|d)

That matches only "bd" and "abc". The pattern can be expressed as follows:

(if_matched)?inner_pattern(?(1)then|else)

Here's a comparable pattern I created which doesn't require support for conditionals:

(?=(a)()|())\1?b(?:\2c|\3d)

Note that to use it without an "else" part, you still need to include the second empty backreference (in this case, \3) at the end, like this:

(?=(a)()|())\1?b(?:\2c|\3)

As a brief explanation of how that works, there's a zero-length alternation option within the lookahead at the beginning which is used to cancel the effect of the lookahead, while at the same time, the intentionally empty capturing groups within the lookahead are exploited to base the then/else part on which option in the lookahead matched. However, there are a couple issues:

This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.
It interacts with backtracking differently than a real conditional (the "a" part is treated as if it were within an optional, atomic group, e.g., (?>(a))? instead of (a)?), so it might be better to think of this as a new operator which is similar to a conditional.

Here are the regex engines I've briefly tested this pattern with:

Language	Supports fake cond.	Supports real cond.	Notes
.NET	Yes	Yes	Tested using Expresso.
ColdFusion	Yes	No	Tested using ColdFusion MX7.
Java	Yes	No	Tested using Regular Expression Test Applet.
JavaScript	No	No	According to ECMA-262v3, backreferences to non-participating capturing groups always succeed, and most browsers respect that. Unfortunately, this pattern depends on the way most other regex engines handle such groups.
JGsoft	Yes	Yes	(Edit:) Works as of RegexBuddy version 2.4.0. Previous versions contained two bugs (which I reported to JGsoft) which prevented this from working reliably.

As for lookaround-based conditionals, we can mimic them using the same concepts. Here's what a real lookaround-based conditional looks like (this example uses a positive lookahead for the assertion):

(?(?=if_assertion)then|else)

And here's how you can mimic it:

(?:(?=if_assertion()|())\1then|\2else)

Again, to use it without an "else" part, you still need to include the second empty backreference (in this case, \2) at the end, like this:

(?:(?=if_assertion()|())\1then|\2)

Notes:

The above compatibility table applies just the same.
Backtracking does not come into play with lookaround-based conditionals in the same way as with capturing-group-based conditionals. As a result, mimicked lookaround-based conditionals are functionally identical to their "real" counterparts.
In some regex flavors, it may be necessary to write it in the the somewhat less lucid form (?=a()|())(?:\1b|\2c).
Another, potentially more verbose and less efficient way to mimic a lookaround-based conditional is to alternate two opposite lookarounds. E.g., (?=if_assertion)then|(?!if_assertion)else. That will work even in the case of flavors like JavaScript where backreferences to non-participating groups match the empty string.

Mimicking Atomic Groups

Edit (2024): The regex package adds support for atomic groups to native JavaScript regular expressions, making this pattern easier to use.

So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))"), even though, of course, lookarounds only match a position. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference zero)!

Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.

Here's how it's done:

(?=(pattern to make atomic))\1

Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually consume anything, so the rest of the regex can't backtrack into it), followed by "\1" (the backreference you just captured), to act just like an atomic group. That produces the exactly same result as "(?>pattern to make atomic)", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking sweet.

Matching Innermost HTML Elements

On a regular expression forum I visit every once in awhile, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match <table>…</table>, but should only match the inner table within <table>…<table>…</table>…</table>. This logic needed to support an unlimited amount of nested tables.

One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.

It's true that many regex libraries are incapable of recursion (although even then it's often possible to fake it to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. wink

Here's the solution I offered (though there were a few steps to get there):

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also quite fast, and can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).

To demonstrate, the above regex matches the three highlighted segments of the text below:

<table><td><table><td> </td></table></td></table> <table><tr><td> </td></tr></table><table></table>

In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed, and comparatively slow):

<table>(?:.(?!<table>))*?</table>

(Make sure to turn on the "dot matches newline" modifier with the above, or change the dot to [\S\s].)

Basically, the way that works is it matches an opening <table> tag, then it looks at each following character one at a time, checking if they are followed by another instance of <table> before </table>. If so, the match fails, because it's not an innermost table. (In theory, at least.)

Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening <table>, but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.

To address both of those issues, I used the following regex:

<table>(?:[^<]+|<(?!table>))*?</table>

First, that increases speed (in theory… you'll see that there is now a much bigger issue than before), because within each <table> tag it will greedily jump between all characters which are not < in single steps (using [^<]+), and it will only use the negative lookahead when it encounters <. Secondly, it solves the previously noted error by using <(?!table>) instead of .(?!<table>).

If you're wondering about table tags which contain attributes, that's not a problem. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):

<table\b[^>]*>(?:[^<]+|<(?!table\b[^>]*>))*?</table>

At first I thought this closed the case… The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering catastrophic backtracking. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The addition to the regex is highlighted:

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in a couple milliseconds on my system.

However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, an approach which is not susceptible to catastrophic backtracking would be:

<table\b[^>]*>(?!<table\b[^>]*>)(?:[\S\s](?!<table\b[^>]*>))*?</table>

That runs a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.

If you have a copy of RegexBuddy (and if you don't, I highly recommend it), run these regexes through its debugger for an under-the-hood look at how they're handled by a regex engine.

Edit: Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:

<table\b[^>]*>(?:(?=([^<]+))\1|<(?!table\b[^>]*>))*?</table>

Basically, that uses a capturing group inside a positive lookahead followed by \1 to act just like an atomic group!

Update: See Matching Nested Constructs in JavaScript, Part 2 for a way to match outermost HTML elements using JavaScript.