Regular Expressions — Flagrant Badassery

RegexPal: Web-Based Regex Testing Reinvented

Yes, I know, there are many other JavaScript regex testers available. Why did I create yet another? RegexPal brings several new things to the table for such web-based apps, and in my (biased) opinion it's easier to use and more helpful towards learning regular expressions than the others currently available. Additionally, most other such tools are very slow for the kind of data I often work with. They might appear fast when displaying 10 matches, but what about 100, 1000, or 5000? Try generating 5,000 matches (which is easy to do with an any-character pattern such as a dot) in your favorite existing web-based tool and see if your browser ever recovers (doubtful). The same task takes RegexPal less than half a second, and what's more, results overlay the text while you're typing it.

At the moment, RegexPal is short on features, but here are the highlights:

Real-time regex syntax highlighting with backwards and forwards context awareness.
Lightning-fast match highlighting with alternating styles.
Inverted matches (match any text not matched by the regex).

I'm not sure when I'll add additional features, but there are lots of things I'm considering. If there is something you'd like to see, let me know.

A few things to be aware of:

The approach I've used for scrollable rich-text editing (which I haven't seen elsewhere) is fast but a bit buggy. Firefox 2 and IE7 have the fewest issues, but it more or less works in other browsers as well.
The syntax highlighting generally marks corner-case issues that create cross-browser inconsistencies as errors even if they are the result of browser bugs or missing behavior documentation in ECMA-262 v3.
There are different forms of line breaks cross-platform/browser. E.g., Firefox uses \n even on Windows where nearly all programs use \r\n. This can affect the results of certain regexes.
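
To make the line-break point concrete, here's a minimal sketch that runs the same multiline pattern over \r\n and \n text. The sample strings and variable names are made up for illustration, and the counts assume an engine that follows the ECMAScript rules for the dot and for $ with the m flag.

```javascript
// Hypothetical sample text: the same two lines with Windows and Unix breaks.
var windowsText = "a\r\nb";
var unixText = "a\nb";

// The dot matches neither \r nor \n, and with the m flag ^ and $ anchor
// next to either character, so \r\n produces an extra empty "line".
var linePattern = /^.*$/gm;
var windowsLines = windowsText.match(linePattern);
var unixLines = unixText.match(linePattern);

// Normalizing line breaks first makes the result platform-independent.
var normalizedLines = windowsText.replace(/\r\n?/g, "\n").match(linePattern);

console.log(windowsLines.length, unixLines.length, normalizedLines.length); // 3 2 2
```

Normalizing before matching is only one option; depending on the regex, it may be simpler to write \r?\n explicitly in the pattern.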

At least for me, RegexPal is lots of fun to play with and helps to make learning regular expressions easy through its instant feedback. I encourage you to just go play with it and discover its results on your own, but for the curious, I'll keep rambling…

Regex syntax parsing (needed for the syntax highlighting) is somewhat complex, due to the numerous backwards and forwards context awareness issues involved. Take, for example, the pattern \10. What does it mean?

Backreference 10, if not inside a character class and at least 10 capturing groups are opened before that point.
Backreference 1, followed by a literal "0", if not inside a character class and between 1 and 9 capturing groups are opened before that point.
Octal character index 10 (decimal 8), if inside a character class, or if no capturing groups are opened before that point.
The three literal characters "\", "1", and "0", if preceded by an unescaped "\" character.
An incomplete token in a couple of other situations.
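
Here's a rough sketch of three of those readings, using throwaway patterns of my own choosing. It assumes a current engine that still accepts legacy octal escapes in non-Unicode mode. The case with one to nine groups is deliberately left out, since engines have not agreed on it.

```javascript
// Ten capturing groups come first, so \10 is a backreference to group 10.
var tenGroups = /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/;
console.log(tenGroups.test("abcdefghijj"));    // true: the "j" is repeated
console.log(tenGroups.test("abcdefghij\x08")); // false: not an octal escape here

// Inside a character class there are no backreferences, so \10 is octal
// (character index 8).
var octalInClass = /[\10]/;
console.log(octalInClass.test("\x08")); // true
console.log(octalInClass.test("1"));    // false

// After an unescaped backslash, "10" is just two literal digits.
var escapedBackslash = /\\10/;
console.log(escapedBackslash.test("\\10")); // true
```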

Another example is the "-" character. Outside a character class it's always a literal hyphen, but inside a character class…

It creates a range between tokens if:
- There is a preceding and following token in the class, or it's preceded by a token and is the last character in an unclosed character class (caveats follow).
It's a literal character if:
- It's the first or last character in the class.
- It's preceded by an unescaped "\".
- It follows a token which is the end index for a range.
- It follows a hyphen which creates a range.
It's an error if:
- It's creating a range between tokens in reverse character index order (e.g., z-a, @-!, \uFFFF-\b, or \127-\cB).
- It would otherwise create a range, but it's followed or preceded by a token which represents more than one character index (e.g., \d). In fact, in some cases browsers take this to mean that the hyphen should be treated as a literal, but browser bugs cause it to be handled inconsistently so RegexPal flags it as a range error.
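
A few of the hyphen rules above can be checked directly. This is a minimal sketch with patterns picked for illustration. It sticks to the cases engines agree on and skips the \d-style ranges, which are the inconsistent ones.

```javascript
var range = /^[a-c]$/;          // hyphen between two tokens: a range
var leading = /^[-a]$/;         // first character in the class: literal
var escapedHyphen = /^[a\-c]$/; // preceded by an unescaped "\": literal
var afterRange = /^[a-c-e]$/;   // follows the end index of a range: literal

console.log(range.test("b"), range.test("-"));                 // true false
console.log(leading.test("-"));                                // true
console.log(escapedHyphen.test("b"), escapedHyphen.test("-")); // false true
console.log(afterRange.test("-"), afterRange.test("d"));       // true false

// A range in reverse character index order is rejected at compile time.
var reversedThrows = false;
try {
  new RegExp("[z-a]");
} catch (err) {
  reversedThrows = err instanceof SyntaxError;
}
console.log(reversedThrows); // true
```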

Here are a few more things which aren't errors but are flagged as such:

Empty, top-level alternation, except at the end of the pattern, where such an alternation is ignored when highlighting matches in order to create a less surprising experience while the user is in the middle of constructing the regex. Empty, top-level alternation is flagged as an error because it effectively truncates the regex at that point (since it will always match). If a zero-length, top-level alternation is really needed, there are other easy ways to do that more explicitly.
Lookaround quantifiers (e.g., the plus sign in (?!x)+). This would be an actual error with some regex libraries (e.g., PCRE), and although that's not the case in most web browsers, such constructs add no value. As a result, RegexPal flags such quantifiers as an error, since they are almost certainly a user mistake.
\c when not followed by A–Z, \x when not followed by two hex characters, and \u when not followed by four hex characters. Although these do not cause most browsers to throw errors, they are handled inconsistently cross-browser and are hence flagged as errors. They would almost certainly be a user mistake even if the cross-browser issues didn't exist.
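
To show what "truncates the regex" means for empty, top-level alternation, here's a small sketch with a made-up pattern and subject string:

```javascript
// The empty alternative between the two pipes always succeeds, so the
// regex matches the empty string at index 0 and "dog" is never tried.
var truncated = /cat||dog/;
var truncatedMatch = truncated.exec("hot dog");
console.log(JSON.stringify(truncatedMatch[0]), truncatedMatch.index); // "" 0

// Without the empty alternative, the engine keeps scanning and finds "dog".
var intended = /cat|dog/;
var intendedMatch = intended.exec("hot dog");
console.log(intendedMatch[0], intendedMatch.index); // dog 4
```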

Credit to osteele.com, whose text is the basis for the short-and-sweet Quick Reference, and to RegexBuddy from JGsoft for inspiring many of RegexPal's features. The name RegexPal is, in part, a nod to RegexBuddy, but was also selected because it contains both "regex" and "regexp."

Non-Participating Groups: A Cross-Browser Mess

Cross-browser issues surrounding the handling of regular expression non-participating capturing groups (which I'll call NPCGs) present several challenges. The standard sucks to begin with, and the three biggest browsers (IE, Firefox, Safari) each disrespect the rules in their own unique ways.

First, I should explain what NPCGs are, as it seems that even some experienced regex users aren't fully aware of the concept or don't fully understand it. Assuming you're already familiar with the idea of capturing and non-capturing parentheses (see this page if you need a refresher), note that NPCGs are different from groups which capture a zero-length value (i.e., an empty string). This is probably easiest to explain by showing some examples…

The following regexes all potentially contain NPCGs (depending on the data they are run over), because the capturing groups are not required to participate:

/(x)?/
/(x)*/
/(x){0,2}/
/(x)|(y)/ — If this matches, it's guaranteed to contain exactly one NPCG.
/(?!(x))/ — If this matches (which, on its own, it will at least at the end of the string), it's guaranteed to contain an NPCG, because the pattern only succeeds if the match of "x" fails.
/()??/ — This is guaranteed to match within any string and contain an NPCG, because of the use of a lazy ?? quantifier on a capturing group for a zero-length value.

On the other hand, these will never contain an NPCG, because although they are allowed to match a zero-length value, the capturing groups are required to participate:

/(x?)/
/(x*)/
/(x{0,2})/
/((?:xx)?)/ –or– /(xx|)/ — These two are equivalent.
/()?/ –or– /(x?)?/ — These are not required to participate, but their greedy ? quantifiers ensure they will always succeed at capturing at least an empty string.

So, what's the difference between an NPCG and a group which captures an empty string? I guess that's up to the regex library, but typically, backreferences to NPCGs are assigned a special null or undefined value.
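
In JavaScript terms, the distinction looks like this. The sketch shows what a standards-following engine returns; as the results table further down shows, IE reports an empty string in both cases.

```javascript
// Group 1 is optional as a whole, so it can sit the match out entirely.
var nonParticipating = /(x)?/.exec("");
// Group 1 is required; only its contents are optional.
var emptyCapture = /(x?)/.exec("");

console.log(nonParticipating[1]);    // undefined
console.log(emptyCapture[1] === ""); // true

// With alternation, whichever side loses is the non-participating group.
var alternation = /(x)|(y)/.exec("y");
console.log(alternation[1], alternation[2]); // undefined y
```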

Following are the ECMA-262v3 rules (paraphrased) for how NPCGs should be handled in JavaScript:

Within a regex, backreferences to NPCGs match an empty string (i.e., the backreferences always succeed). This is unfortunate, since it prevents some fancy patterns which would otherwise be possible (e.g., see my method for mimicking conditionals), and it's atypical compared to many other regular expression engines including Perl 5 (which ECMA-standard regular expressions are supposedly based on), PCRE, .NET, Java, Python, Ruby, JGsoft, and others.
Within a replacement string, backreferences to NPCGs produce an empty string (i.e., nothing). Unlike the previous point, this is typical elsewhere, and allows you to use a regex like /a(b)|c(d)/ and replace it with "$1$2" without having to worry about null pointers or errors about non-participating groups.
In the result arrays from RegExp.prototype.exec, String.prototype.match (when used with a non-global regex), String.prototype.split, and the arguments available to callback functions with String.prototype.replace, NPCGs return undefined. This is a very logical approach.

References: ECMA-262v3 sections 15.5.4.11, 15.5.4.14, 15.10.2.1, 15.10.2.3, 15.10.2.8, 15.10.2.9.
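
The three rules can be sketched as follows. The values in the comments are what the standard prescribes, not what every browser listed below actually returns, and the variable names are just for illustration.

```javascript
// Rule 1: inside the regex, a backreference to an NPCG matches "".
var backrefSucceeds = /(x)?\1y/.test("y");
console.log(backrefSucceeds); // true

// Rule 2: in a replacement string, a backreference to an NPCG produces "".
var replaced = "cd".replace(/a(b)|c(d)/, "[$1$2]");
console.log(replaced); // [d]

// Rule 3: result arrays and replace() callbacks see undefined for an NPCG.
var execValue = /(x)?y/.exec("y")[1];
var callbackValue = "not called";
"y".replace(/(x)?y/, function (match, group1) {
  callbackValue = group1;
  return match;
});
console.log(execValue, callbackValue); // undefined undefined
```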

Unfortunately, actual browser handling of NPCGs is all over the place, resulting in numerous cross-browser differences which can easily result in subtle (or not so subtle) bugs in your code if you don't know what you're doing. E.g., Firefox incorrectly uses an empty string with the replace() and split() methods, but correctly uses undefined with the exec() method. Conversely, IE correctly uses undefined with the replace() method, incorrectly uses an empty string with the exec() method, and incorrectly returns neither with the split() method since it doesn't splice backreferences into the resulting array. As for the handling of backreferences to non-participating groups within regexes (e.g., /(x)?\1y/.test("y")), Safari uses the more sensible, non-ECMA-compliant approach (returning false for the previous bit of code), while IE, Firefox, and Opera follow the standard. (If you use /(x?)\1y/.test("y") instead, all four browsers will correctly return true.)

Several times I've seen people encounter these differences and diagnose them incorrectly, not having understood the root cause. A recent instance is what prompted this writeup.

Here are cross-browser results from each of the regex and regex-using methods when NPCGs have an impact on the outcome:

Code	ECMA-262v3	IE 5.5 – 7	Firefox 2.0.0.6	Opera 9.23	Safari 3.0.3
`/(x)?\1y/.test("y")`	`true`	`true`	`true`	`true`	`false`
`/(x)?\1y/.exec("y")`	`["y", undefined]`	`["y", ""]`	`["y", undefined]`	`["y", undefined]`	`null`
`/(x)?y/.exec("y")`	`["y", undefined]`	`["y", ""]`	`["y", undefined]`	`["y", undefined]`	`["y", undefined]`
`"y".match(/(x)?\1y/)`	`["y", undefined]`	`["y", ""]`	`["y", undefined]`	`["y", undefined]`	`null`
`"y".match(/(x)?y/)`	`["y", undefined]`	`["y", ""]`	`["y", undefined]`	`["y", undefined]`	`["y", undefined]`
`"y".match(/(x)?\1y/g)`	`["y"]`	`["y"]`	`["y"]`	`["y"]`	`null`
`"y".split(/(x)?\1y/)`	`["", undefined, ""]`	`[ ]`	`["", "", ""]`	`["", undefined, ""]`	`["y"]`
`"y".split(/(x)?y/)`	`["", undefined, ""]`	`[ ]`	`["", "", ""]`	`["", undefined, ""]`	`["", ""]`
`"y".search(/(x)?\1y/)`	`0`	`0`	`0`	`0`	`-1`
`"y".replace(/(x)?\1y/, "z")`	`"z"`	`"z"`	`"z"`	`"z"`	`"y"`
`"y".replace(/(x)?y/, "$1")`	`""`	`""`	`""`	`""`	`""`
`"y".replace(/(x)?\1y/, function($0, $1){ return String($1); })`	`"undefined"`	`"undefined"`	`""`	`"undefined"`	`"y"`
`"y".replace(/(x)?y/, function($0, $1){ return String($1); })`	`"undefined"`	`"undefined"`	`""`	`"undefined"`	`""`
`"y".replace(/(x)?y/, function($0, $1){ return $1; })`	`"undefined"`	`""`	`""`	`"undefined"`	`""`

(Run the tests in your browser.)

The workaround for this mess is to avoid creating any potential for non-participating capturing groups, unless you know exactly what you're doing. Although that shouldn't be necessary, NPCGs are usually easy to avoid anyway. See the examples near the top of this post.
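
Here's a minimal sketch of that workaround. Moving the quantifier inside the group makes the group always participate. Where a pattern can't be restructured, a small helper (captureOrEmpty is a hypothetical name, not part of any library) can paper over the undefined-versus-empty-string difference in exec results.

```javascript
// Risky: group 1 may not participate.
var risky = /(x)?y/;
// Safer: group 1 always participates and captures "" when there is no "x".
var safer = /(x?)y/;

console.log(safer.exec("y")[1] === ""); // true
console.log(safer.exec("xy")[1]);       // x
console.log(/(x?)\1y/.test("y"));       // true

// Hypothetical helper: treat a missing capture as an empty string.
function captureOrEmpty(match, groupNumber) {
  var value = match[groupNumber];
  return value === undefined || value === null ? "" : value;
}

console.log(captureOrEmpty(risky.exec("y"), 1) === ""); // true
```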

Edit (2007-08-16): I've updated this post with data from the newest versions of the listed browsers. The original data contained a few false negatives for Opera and Safari which resulted from a faulty library used to generate the results.

Safari Support with XRegExp 0.2.2

When I released XRegExp 0.2 several days ago, I hadn't yet tested in Safari or Swift. When I remembered to do this shortly afterwards, I found that both of those WebKit-based browsers didn't like it and often crashed when trying to use it! This was obviously a Very Bad Thing, but due to major time availability issues I wasn't able to get around to in-depth bug-shooting and testing until tonight.

It turns out that Safari's regex engine contains a bug which causes an error to be thrown when compiling a regex containing a character class ending with "[\\".

// These throw an error:
[ /[[\\]/ , /[^[\\]/ , /[abc[\\]/ ]

// ...While these are all fine:
[ /[\\[]/ , /[\[\\]/ , /[[]/ , /[\\]/ , /[[\\abc]/ , /[[\/]/ , /[[(\\]/ ]

// Testing:
try {
	RegExp("[[\\]");
	alert("OK!");
} catch (err) {
	alert(err);
	/* Safari shows:
	"SyntaxError: Invalid regular expression: missing terminating ] for
	character class" */
}

As a result, I've changed two instances of [^[\\] to [^\\[] and upped the version number to 0.2.2. XRegExp has now been tested and works without any known issues in all of the following browsers:

Internet Explorer 5.5 – 7
Firefox 2.0.0.4
Opera 9.21
Safari 3.0.2 beta for Windows
Swift 0.2
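
Swapping [^[\\] for [^\\[] is safe because the order of members in a character class doesn't affect what it matches. A quick sketch to confirm that, using the RegExp constructor the same way as the test above (on an affected Safari build the first line would throw instead):

```javascript
var originalClass = new RegExp("[^[\\\\]");  // pattern text: [^[\\]
var reorderedClass = new RegExp("[^\\\\[]"); // pattern text: [^\\[]

// Compare the two classes over every ASCII character.
var classesAgree = true;
for (var code = 0; code < 128; code++) {
  var ch = String.fromCharCode(code);
  if (originalClass.test(ch) !== reorderedClass.test(ch)) {
    classesAgree = false;
  }
}
console.log(classesAgree); // true
```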

You can get the newest version here.

XRegExp 0.2: Now With Named Capture

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

JavaScript's regular expression flavor doesn't support named capture. Well, says who? XRegExp 0.2 brings named capture support, along with several other new features. But first of all, if you haven't seen the previous version, make sure to check out my post on XRegExp 0.1, because not all of the documentation is repeated below.

Highlights

Comprehensive named capture support (New)
Supports regex literals through the addFlags method (New)
Free-spacing and comments mode (x)
Dot matches all mode (s)
Several other minor improvements over v0.1

Named capture

There are several different syntaxes in the wild for named capture. I've compiled the following table based on my understanding of the regex support of the libraries in question. XRegExp's syntax is included at the top.

Library	Capture	Backreference	In replacement	Stored at
XRegExp	`(<name>…)`	`\k<name>`	`${name}`	`result.name`
.NET	`(?<name>…)` `(?'name'…)`	`\k<name>` `\k'name'`	`${name}`	`Matcher.Groups('name')`
Perl 5.10 (beta)	`(?<name>…)` `(?'name'…)`	`\k<name>` `\k'name'` `\g{name}`	`$+{name}`	`$+{name}`
Python	`(?P<name>…)`	`(?P=name)`	`\g<name>`	`result.group('name')`
PHP preg (PCRE 7)	(.NET, Perl, and Python styles)		`$regs['name']`	`$result['name']`

No other major regex library currently supports named capture, although the JGsoft engine (used by products like RegexBuddy) supports both .NET and Python syntax. XRegExp does not use a question mark at the beginning of a named capturing group because that would prevent it from being used in regex literals (JavaScript would immediately throw an "invalid quantifier" error).

XRegExp supports named capture on an on-request basis. You can add named capture support to any regex through the use of the new "k" flag. This is done for compatibility reasons and to ensure that regex compilation time remains as fast as possible in all situations.

Following are several examples of using named capture:

// Add named capture support using the XRegExp constructor
var repeatedWords = new XRegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support using RegExp, after overriding the native constructor
XRegExp.overrideNative();
var repeatedWords = new RegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support to a regex literal
var repeatedWords = /\b (<word> \w+ ) \s+ \k<word> \b/.addFlags("gixk");

var data = "The the test data.";

// Check if data contains repeated words
var hasDuplicates = repeatedWords.test(data);
// hasDuplicates: true

// Use the regex to remove repeated words
var output = data.replace(repeatedWords, "${word}");
// output: "The test data."

In the above code, I've also used the x flag provided by XRegExp, to improve readability. Note that the addFlags method can be called multiple times on the same regex (e.g., /pattern/g.addFlags("k").addFlags("s")), but I'd recommend adding all flags in one shot, for efficiency.
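
For comparison only: engines that implement ES2018 support named capture natively, with a syntax that differs from XRegExp 0.2's (a leading question mark in the group, $<name> in replacements, and a groups property on the result). The sketch below is not XRegExp code and needs no library. It repeats the duplicate-word example without the x flag, which native regexes don't have.

```javascript
// Native ES2018 named group and named backreference.
var repeatedWordsNative = /\b(?<word>\w+)\s+\k<word>\b/gi;
var sample = "The the test data.";

// Named captures are exposed on the groups property of the match result.
var firstRepeat = /\b(?<word>\w+)\s+\k<word>\b/i.exec(sample);
console.log(firstRepeat.groups.word); // The

// In a replacement string, the named reference is written $<word>.
var cleaned = sample.replace(repeatedWordsNative, "$<word>");
console.log(cleaned); // The test data.
```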

Here are a few more examples of using named capture, with an overly simplistic URL-matching regex (for comprehensive URL parsing, see parseUri):

var url = "http://microsoft.com/path/to/file?q=1";
var urlParser = new XRegExp("^(<protocol>[^:/?]+)://(<host>[^/?]*)(<path>[^?]*)\\?(<query>.*)", "k");
var parts = urlParser.exec(url);
/* The result:
parts.protocol: "http"
parts.host: "microsoft.com"
parts.path: "/path/to/file"
parts.query: "q=1" */

// Named backreferences are also available in replace() callback functions as properties of the first argument
var newUrl = url.replace(urlParser, function(match){
	return match.replace(match.host, "yahoo.com");
});
// newUrl: "http://yahoo.com/path/to/file?q=1"

Note that XRegExp's named capture functionality does not support deprecated JavaScript features including the lastMatch property of the global RegExp object and the RegExp.prototype.compile() method.

Singleline (s) and extended (x) modes

The other non-native flags XRegExp supports are s (singleline) for "dot matches all" mode, and x (extended) for "free-spacing and comments" mode. For full details about these modifiers, see the FAQ in my XRegExp 0.1 post. However, one difference from the previous version is that XRegExp 0.2, when using the x flag, now allows whitespace between a regex token and its quantifier (quantifiers are, e.g., +, *?, or {1,3}). Although the previous version's handling/limitation in this regard was documented, it was atypical compared to other regex libraries. This has been fixed.
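
Without the s flag, one common way to get "dot matches all" behavior in a native regex is to replace the dot with a class such as [\s\S]. The sketch below shows the difference; it illustrates the effect of the mode and makes no claim about how XRegExp implements it internally.

```javascript
var text = "<b>bold\ntext</b>";

// The native dot stops at line breaks, so this fails across two lines.
var nativeDot = /<b>.*<\/b>/;
// [\s\S] matches any character at all, including line breaks.
var dotMatchesAll = /<b>[\s\S]*<\/b>/;

console.log(nativeDot.test(text));     // false
console.log(dotMatchesAll.test(text)); // true
```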

Download

XRegExp 0.2.5.

XRegExp has been tested in IE 5.5–7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.2 beta for Windows, and Swift 0.2.

Finally, note that the XRE object from v0.1 has been removed. XRegExp now only creates one global variable: XRegExp. To permanently override the native RegExp constructor/object, you can now run XRegExp.overrideNative();

Levels of JavaScript Regex Knowledge

N00b
- Thinks "regular expressions" is open mic night at a poetry bar.
- Uses \w, \d, \s, and other shorthand classes purely by accident if at all.
- Painfully misuses * and especially .*.
- Puts words in character classes.
- Uses | in character classes for alternation.
- Hasn't heard of the exec method.
- Copies and pastes poorly written regexes from the web.
Trained n00b
- Uses regexes where methods like slice or indexOf would do.
- Uses g, i, and m modifiers needlessly.
- Uses [^\w] instead of \W.
- Doesn't know why using [\w\d_] gives away their n00bness.
- Tries to remove HTML tags with replace(/<.*?>/g,"").
- Escapes all punctuation\!
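
A couple of the trained-n00b items, sketched with made-up input. The first shows one way the tag-stripping regex goes wrong; the second checks over the ASCII range that [\w\d_] adds nothing to \w and that [^\w] is just \W.

```javascript
// A ">" inside an attribute value ends the "tag" too early.
var html = '<a title=">">x</a>';
var stripped = html.replace(/<.*?>/g, "");
console.log(stripped); // ">x

// \w already includes digits and the underscore.
var redundantClassIsSame = true;
var negatedClassIsSame = true;
for (var code = 0; code < 128; code++) {
  var ch = String.fromCharCode(code);
  if (/[\w\d_]/.test(ch) !== /\w/.test(ch)) redundantClassIsSame = false;
  if (/[^\w]/.test(ch) !== /\W/.test(ch)) negatedClassIsSame = false;
}
console.log(redundantClassIsSame, negatedClassIsSame); // true true
```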
User
- Knows when to use regexes, and when to use string methods.
- Toys with lookahead.
- Uses regexes in conditionals.
- Starts to understand why HTML tags are hard to match with regexes.
- Knows to use (?:…) when a backreference or capture isn't needed.
- Can read a relatively simple regex and explain its function.
- Knows their way around the use of replace callback functions.
Haxz0r
- Uses lookahead with impunity.
- Sighs at the unavailability of lookbehind and other features from more powerful regex libraries.
- Knows what $`, $', and $& mean in a replacement string.
- Knows the difference between string literal and regex metacharacters, and how this impacts the RegExp constructor.
- Generally knows whether a greedy or lazy quantifier is more appropriate, even when it doesn't change what the regex matches.
- Has a basic sense of how to avoid regex efficiency problems.
- Knows how to iterate over strings using the exec method and a while loop.
- Knows that properties of the global RegExp object and the compile method are deprecated.
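
Three of the Haxz0r items in code form, as a minimal sketch with throwaway strings:

```javascript
// $` is the text before the match, $& the match itself, $' the text after.
var context = "abc".replace(/b/, "[$`|$&|$']");
console.log(context); // a[a|b|c]c

// In a string literal, "\d" is just "d"; the backslash must be doubled
// for it to reach the RegExp constructor.
var wrong = new RegExp("\d+");
var right = new RegExp("\\d+");
console.log(wrong.test("123"), right.test("123")); // false true

// Iterating over matches with exec and a while loop (the g flag is required,
// otherwise lastIndex never advances and the loop never ends).
var wordPattern = /\w+/g;
var hits = [];
var hit;
while ((hit = wordPattern.exec("one two three")) !== null) {
  hits.push(hit[0] + "@" + hit.index);
}
console.log(hits.join(" ")); // one@0 two@4 three@8
```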
Guru
- Understands the significance of manually modifying a regex object's lastIndex property and when this can be useful within a loop.
- Can explain how any given regex will or won't work.
- No longer experiences the excitement of writing complex regexes that work on the first try, since regex behavior has become predictable and obvious.
- Is immune to catastrophic backtracking, and can easily (and accurately) determine if a nested quantifier is safe.
- Knows of numerous cross-browser regex syntax and behavior differences.
- Knows offhand the section number of ECMA-262 3rd Edition that covers regexes.
- Understands the difference between capturing group nonparticipation vs participating but capturing an empty string, and the behavior differences this can lead to.
- Has a preference for particular backreference rules related to capturing group participation and quantified alternation, or is at least aware of the implementation inconsistencies.
- Often knows which browser will run a given regex fastest before testing, based on known internal optimizations and weaknesses.
- Thinks that writing recursive regexes is easy, so long as there is an upper bound to recursion depth.
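
As one example of manually modifying lastIndex within a loop: a global regex that can match the empty string will loop forever under exec unless lastIndex is bumped past zero-length matches. This sketch assumes standard behavior, where exec leaves lastIndex at the end of the match; some older browsers adjusted it themselves.

```javascript
var digits = /\d*/g; // can match the empty string
var subject = "a12b";
var found = [];
var match;

while ((match = digits.exec(subject)) !== null) {
  // A zero-length match leaves lastIndex where the match started, so move
  // it forward by hand to make the next exec call start one character on.
  if (match.index === digits.lastIndex) {
    digits.lastIndex++;
  }
  found.push(match[0]);
}

console.log(JSON.stringify(found)); // ["","12","",""]
```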
Wizard
- Works on a regex engine.
- Has patched the engine from time to time.
God
- Can add features to the engine at a whim.
- Also created all life on earth using a constructor function.

(Heavily adapted and JavaScriptized from 7 Stages of a [Perl] Regex User.)