JavaScript — Flagrant Badassery

Safari Support with XRegExp 0.2.2

When I released XRegExp 0.2 several days ago, I hadn't yet tested in Safari or Swift. When I remembered to do this shortly afterwards, I found that both of those WebKit-based browsers didn't like it and often crashed when trying to use it! This was obviously a Very Bad Thing, but due to major time availability issues I wasn't able to get around to in-depth bug-shooting and testing until tonight.

It turns out that Safari's regex engine contains a bug which causes an error to be thrown when compiling a regex containing a character class ending with "[\\".

// These throw an error:
[ /[[\\]/ , /[^[\\]/ , /[abc[\\]/ ]

// ...While these are all fine:
[ /[\\[]/ , /[\[\\]/ , /[[]/ , /[\\]/ , /[[\\abc]/ , /[[\/]/ , /[[(\\]/ ]

// Testing:
try {
	RegExp("[[\\\\]"); // the string literal halves the backslashes, so the regex compiled is [[\\]
	alert("OK!");
} catch (err) {
	alert(err);
	/* Safari shows:
	"SyntaxError: Invalid regular expression: missing terminating ] for
	character class" */
}

As a result, I've changed two instances of [^[\\] to [^\\[] and upped the version number to 0.2.2. XRegExp has now been tested and works without any known issues in all of the following browsers:

Internet Explorer 5.5 – 7
Firefox 2.0.0.4
Opera 9.21
Safari 3.0.2 beta for Windows
Swift 0.2

You can get the newest version here.
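The reordering is safe because the order of members inside a character class has no effect on what the class matches. Here is a minimal check of that, assuming an engine that compiles both forms (the sample strings are arbitrary and not taken from XRegExp's tests):

```javascript
// Both classes mean "any character except [ or \"; only the member order differs.
var before = /[^[\\]/; // the order that threw in Safari
var after  = /[^\\[]/; // the reordered form used in XRegExp 0.2.2

var samples = ["[", "\\", "a", "]", " ", "[\\", "x["];
var agree = true;
for (var i = 0; i < samples.length; i++) {
    // Each regex is asked whether the sample contains an allowed character
    if (before.test(samples[i]) !== after.test(samples[i])) {
        agree = false;
    }
}
// agree stays true: both classes accept and reject the same strings
```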

XRegExp 0.2: Now With Named Capture

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

JavaScript's regular expression flavor doesn't support named capture. Well, says who? XRegExp 0.2 brings named capture support, along with several other new features. But first of all, if you haven't seen the previous version, make sure to check out my post on XRegExp 0.1, because not all of the documentation is repeated below.

Highlights

Comprehensive named capture support (New)
Supports regex literals through the addFlags method (New)
Free-spacing and comments mode (x)
Dot matches all mode (s)
Several other minor improvements over v0.1

Named capture

There are several different syntaxes in the wild for named capture. I've compiled the following table based on my understanding of the regex support of the libraries in question. XRegExp's syntax is included at the top.

Library	Capture	Backreference	In replacement	Stored at
XRegExp	`(<name>…)`	`\k<name>`	`${name}`	`result.name`
.NET	`(?<name>…)` `(?'name'…)`	`\k<name>` `\k'name'`	`${name}`	`Matcher.Groups('name')`
Perl 5.10 (beta)	`(?<name>…)` `(?'name'…)`	`\k<name>` `\k'name'` `\g{name}`	`$+{name}`	`$+{name}`
Python	`(?P<name>…)`	`(?P=name)`	`\g<name>`	`result.group('name')`
PHP preg (PCRE 7)	(.NET, Perl, and Python styles)		`$regs['name']`	`$result['name']`

No other major regex library currently supports named capture, although the JGsoft engine (used by products like RegexBuddy) supports both .NET and Python syntax. XRegExp does not use a question mark at the beginning of a named capturing group because that would prevent it from being used in regex literals (JavaScript would immediately throw an "invalid quantifier" error).

XRegExp supports named capture on an on-request basis. You can add named capture support to any regex through the use of the new "k" flag. This is done for compatibility reasons and to ensure that regex compilation time remains as fast as possible in all situations.
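To give a feel for what opt-in named capture involves, here is a rough sketch in plain JavaScript. It is not XRegExp's code: compileNamed and execNamed are hypothetical helpers, and the sketch assumes every capturing group in the pattern is named and that no "(<" sequence is escaped or sits inside a character class.

```javascript
// Hypothetical helper: rewrite (<name>...) groups and \k<name> backreferences
// into syntax the native RegExp constructor understands.
function compileNamed(pattern, flags) {
    var names = [];
    var positions = {};
    // Pass 1: turn each "(<name>" opener into "(" and record its group number
    var source = pattern.replace(/\(<([$\w]+)>/g, function (opener, name) {
        names.push(name);
        positions[name] = names.length;
        return "(";
    });
    // Pass 2: turn \k<name> into a numbered backreference. Wrapping it in (?:...)
    // keeps a following digit from being read as part of the group number.
    source = source.replace(/\\k<([$\w]+)>/g, function (ref, name) {
        return "(?:\\" + positions[name] + ")";
    });
    return { regex: new RegExp(source, flags || ""), names: names };
}

// Hypothetical helper: run the compiled regex and copy each capture onto the
// match array under its name.
function execNamed(compiled, str) {
    var match = compiled.regex.exec(str);
    if (match) {
        for (var i = 0; i < compiled.names.length; i++) {
            match[compiled.names[i]] = match[i + 1];
        }
    }
    return match;
}

var repeated = compileNamed("\\b(<word>\\w+)\\s+\\k<word>\\b", "i");
var result = execNamed(repeated, "The the test data.");
// result.word: "The"
```

Because the rewriting happens once, at compile time, a regex that doesn't ask for named capture pays nothing for it.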

Following are several examples of using named capture:

// Add named capture support using the XRegExp constructor
var repeatedWords = new XRegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support using RegExp, after overriding the native constructor
XRegExp.overrideNative();
var repeatedWords = new RegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support to a regex literal
var repeatedWords = /\b (<word> \w+ ) \s+ \k<word> \b/.addFlags("gixk");

var data = "The the test data.";

// Check if data contains repeated words
var hasDuplicates = repeatedWords.test(data);
// hasDuplicates: true

// Use the regex to remove repeated words
var output = data.replace(repeatedWords, "${word}");
// output: "The test data."

In the above code, I've also used the x flag provided by XRegExp, to improve readability. Note that the addFlags method can be called multiple times on the same regex (e.g., /pattern/g.addFlags("k").addFlags("s")), but I'd recommend adding all flags in one shot, for efficiency.

Here are a few more examples of using named capture, with an overly simplistic URL-matching regex (for comprehensive URL parsing, see parseUri):

var url = "http://microsoft.com/path/to/file?q=1";
var urlParser = new XRegExp("^(<protocol>[^:/?]+)://(<host>[^/?]*)(<path>[^?]*)\\?(<query>.*)", "k");
var parts = urlParser.exec(url);
/* The result:
parts.protocol: "http"
parts.host: "microsoft.com"
parts.path: "/path/to/file"
parts.query: "q=1" */

// Named backreferences are also available in replace() callback functions as properties of the first argument
var newUrl = url.replace(urlParser, function(match){
	return match.replace(match.host, "yahoo.com");
});
// newUrl: "http://yahoo.com/path/to/file?q=1"

Note that XRegExp's named capture functionality does not support deprecated JavaScript features including the lastMatch property of the global RegExp object and the RegExp.prototype.compile() method.

Singleline (s) and extended (x) modes

The other non-native flags XRegExp supports are s (singleline) for "dot matches all" mode, and x (extended) for "free-spacing and comments" mode. For full details about these modifiers, see the FAQ in my XRegExp 0.1 post. However, one difference from the previous version is that XRegExp 0.2, when using the x flag, now allows whitespace between a regex token and its quantifier (quantifiers are, e.g., +, *?, or {1,3}). Although the previous version's handling/limitation in this regard was documented, it was atypical compared to other regex libraries. This has been fixed.
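Both modes can be approximated by rewriting the pattern before it is compiled. The following is only a sketch of that idea, not XRegExp's code: stripFreeSpacing and expandDotAll are hypothetical helpers, and they assume a character class never contains an unescaped "]".

```javascript
// Hypothetical helper for free-spacing mode: drop whitespace and # comments,
// but leave escapes (such as "\ " or "\#") and character classes untouched.
function stripFreeSpacing(pattern) {
    return pattern.replace(/\\[\s\S]|\[(?:[^\\\]]|\\[\s\S])*\]|#.*|\s+/g, function (token) {
        var first = token.charAt(0);
        // Escapes start with a backslash and classes with "["; keep both
        return first === "\\" || first === "[" ? token : "";
    });
}

// Hypothetical helper for dot-matches-all mode: replace each unescaped dot
// outside a character class with [\s\S], which also matches newlines.
function expandDotAll(pattern) {
    return pattern.replace(/\\[\s\S]|\[(?:[^\\\]]|\\[\s\S])*\]|\./g, function (token) {
        return token === "." ? "[\\s\\S]" : token;
    });
}

var spaced = stripFreeSpacing("\\b (\\w+) \\s + \\1 \\b # repeated word");
// spaced: "\\b(\\w+)\\s+\\1\\b" (the space before the quantifier is gone too)

var dotAll = new RegExp(expandDotAll("a.b"));
// dotAll.test("a\nb"): true, whereas /a.b/.test("a\nb") is false
```

Stripping whitespace outright is the naive part: a pattern like "\1 0" would be fused into "\10", so a real implementation has to keep such tokens apart.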

Download

XRegExp 0.2.5.

XRegExp has been tested in IE 5.5–7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.2 beta for Windows, and Swift 0.2.

Finally, note that the XRE object from v0.1 has been removed. XRegExp now only creates one global variable: XRegExp. To permanently override the native RegExp constructor/object, you can now run XRegExp.overrideNative();

parseUri 1.2: Split URLs in JavaScript

Edit (2024): parseUri has had a major update and is now available on GitHub and npm

I've just updated parseUri. If you haven't seen the older version, parseUri is a function which splits any well-formed URI into its parts, all of which are optional. Its combination of accuracy, flexibility, and brevity is unrivaled.

Highlights:

Comprehensively splits URIs, including splitting the query string into key/value pairs. (Enhanced)
Two parsing modes: loose and strict. (New)
Easy to use (returns an object, so you can do, e.g., parseUri(uri).anchor).
Offers convenient, pre-concatenated components (path = directory and file; authority = userInfo, host, and port; etc.)
Change the default names of URI parts without editing the function, by updating parseUri.options.key. (New)
Exceptionally lightweight (1 KB before minification or gzipping).
Released under the MIT License.

Details:

Older versions of this function used what's now called loose parsing mode (which is still the default in this version). Loose mode deviates slightly from the official generic URI spec (RFC 3986), but by doing so allows the function to split URIs in a way that most end users would expect intuitively. Specifically, in loose mode, directories don't need to end with a slash (e.g., the "dir" in "/dir?query" is treated as a directory rather than a file name), and the URI can start with an authority without being preceded by "//" (which means that the "yahoo.com" in "yahoo.com/search/" is treated as the host, rather than part of the directory path). However, the finer details of loose mode preclude it from properly handling relative paths which do not start from root (e.g., "../file.html" or "dir/file.html"). Strict mode, on the other hand, attempts to split URIs according to RFC 3986.

Since I've assumed that most developers will consistently want to use one mode or the other, the parsing mode is not specified as an argument when running parseUri, but rather as a property of the parseUri function itself. Simply run the following line of code to switch to strict mode:

parseUri.options.strictMode = true;

From that point forward, parseUri will work in strict mode (until you turn it back off).
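To make the difference concrete, here is a small sketch that runs both parsers directly on the "yahoo.com/search/" example. The two regexes are copied from parseUri.options.parser in this post's listing so that the snippet runs on its own; the HOST and DIRECTORY constants are names I've made up for the matching positions in parseUri.options.key.

```javascript
// Copied verbatim from parseUri.options.parser
var strict = /^(?:([^:\/?#]+):)?(?:\/\/((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?))?((((?:[^?#\/]*\/)*)([^?#]*))(?:\?([^#]*))?(?:#(.*))?)/;
var loose  = /^(?:(?![^:@]+:[^:@\/]*@)([^:\/?#.]+):)?(?:\/\/)?((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?)(((\/(?:[^?#](?![^?#\/]*\.[^?#\/.]+(?:[?#]|$)))*\/?)?([^?#\/]*))(?:\?([^#]*))?(?:#(.*))?)/;

// Hypothetical constants: indexes of "host" and "directory" in parseUri.options.key
var HOST = 6, DIRECTORY = 10;

var uri = "yahoo.com/search/";
var looseParts  = loose.exec(uri);
var strictParts = strict.exec(uri);

// Loose mode reads the leading "yahoo.com" as the host:
// looseParts[HOST]: "yahoo.com", looseParts[DIRECTORY]: "/search/"

// Strict mode requires "//" before an authority, so everything is a path:
// strictParts[HOST]: undefined (parseUri stores this as ""),
// strictParts[DIRECTORY]: "yahoo.com/search/"
```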

The code:

// parseUri 1.2.2
// (c) Steven Levithan <stevenlevithan.com>
// MIT License

function parseUri (str) {
	var	o   = parseUri.options,
		m   = o.parser[o.strictMode ? "strict" : "loose"].exec(str),
		uri = {},
		i   = 14;

	while (i--) uri[o.key[i]] = m[i] || "";

	uri[o.q.name] = {};
	uri[o.key[12]].replace(o.q.parser, function ($0, $1, $2) {
		if ($1) uri[o.q.name][$1] = $2;
	});

	return uri;
};

parseUri.options = {
	strictMode: false,
	key: ["source","protocol","authority","userInfo","user","password","host","port","relative","path","directory","file","query","anchor"],
	q:   {
		name:   "queryKey",
		parser: /(?:^|&)([^&=]*)=?([^&]*)/g
	},
	parser: {
		strict: /^(?:([^:\/?#]+):)?(?:\/\/((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?))?((((?:[^?#\/]*\/)*)([^?#]*))(?:\?([^#]*))?(?:#(.*))?)/,
		loose:  /^(?:(?![^:@]+:[^:@\/]*@)([^:\/?#.]+):)?(?:\/\/)?((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?)(((\/(?:[^?#](?![^?#\/]*\.[^?#\/.]+(?:[?#]|$)))*\/?)?([^?#\/]*))(?:\?([^#]*))?(?:#(.*))?)/
	}
};

You can download it here.

parseUri has no dependencies, and has been tested in IE 5.5–7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.1 beta for Windows, and Swift 0.2.
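The query string splitting is the one part of the function that may look odd at first: it calls replace purely for its callback and throws away the returned string. Here is that step on its own, as a sketch. The regex is copied from parseUri.options.q.parser, while parseQuery and the sample query string are just for illustration.

```javascript
// Copied from parseUri.options.q.parser
var queryParser = /(?:^|&)([^&=]*)=?([^&]*)/g;

// Hypothetical standalone version of the key/value step inside parseUri
function parseQuery(query) {
    var keys = {};
    // replace() is used as an iterator here; its return value is ignored
    query.replace(queryParser, function ($0, $1, $2) {
        if ($1) keys[$1] = $2; // $1 is the key, $2 the (possibly empty) value
    });
    return keys;
}

var parsed = parseQuery("q=1&lang=en&flag");
// parsed.q: "1"
// parsed.lang: "en"
// parsed.flag: ""
```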

Levels of JavaScript Regex Knowledge

N00b
- Thinks "regular expressions" is open mic night at a poetry bar.
- Uses \w, \d, \s, and other shorthand classes purely by accident if at all.
- Painfully misuses * and especially .*.
- Puts words in character classes.
- Uses | in character classes for alternation.
- Hasn't heard of the exec method.
- Copies and pastes poorly written regexes from the web.
Trained n00b
- Uses regexes where methods like slice or indexOf would do.
- Uses g, i, and m modifiers needlessly.
- Uses [^\w] instead of \W.
- Doesn't know why using [\w\d_] gives away their n00bness.
- Tries to remove HTML tags with replace(/<.*?>/g,"").
- Escapes all punctuation\!
User
- Knows when to use regexes, and when to use string methods.
- Toys with lookahead.
- Uses regexes in conditionals.
- Starts to understand why HTML tags are hard to match with regexes.
- Knows to use (?:…) when a backreference or capture isn't needed.
- Can read a relatively simple regex and explain its function.
- Knows their way around the use of replace callback functions.
Haxz0r
- Uses lookahead with impunity.
- Sighs at the unavailability of lookbehind and other features from more powerful regex libraries.
- Knows what $`, $', and $& mean in a replacement string.
- Knows the difference between string literal and regex metacharacters, and how this impacts the RegExp constructor.
- Generally knows whether a greedy or lazy quantifier is more appropriate, even when it doesn't change what the regex matches.
- Has a basic sense of how to avoid regex efficiency problems.
- Knows how to iterate over strings using the exec method and a while loop.
- Knows that properties of the global RegExp object and the compile method are deprecated.
Guru
- Understands the significance of manually modifying a regex object's lastIndex property and when this can be useful within a loop.
- Can explain how any given regex will or won't work.
- No longer experiences the excitement of writing complex regexes that work on the first try, since regex behavior has become predictable and obvious.
- Is immune to catastrophic backtracking, and can easily (and accurately) determine if a nested quantifier is safe.
- Knows of numerous cross-browser regex syntax and behavior differences.
- Knows offhand the section number of ECMA-262 3rd Edition that covers regexes.
- Understands the difference between capturing group nonparticipation vs participating but capturing an empty string, and the behavior differences this can lead to.
- Has a preference for particular backreference rules related to capturing group participation and quantified alternation, or is at least aware of the implementation inconsistencies.
- Often knows which browser will run a given regex fastest before testing, based on known internal optimizations and weaknesses.
- Thinks that writing recursive regexes is easy, so long as there is an upper bound to recursion depth.
Wizard
- Works on a regex engine.
- Has patched the engine from time to time.
God
- Can add features to the engine at a whim.
- Also created all life on earth using a constructor function.

(Heavily adapted and JavaScriptized from 7 Stages of a [Perl] Regex User.)
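For anyone working toward the Haxz0r and Guru rungs, here is a sketch of the exec-and-while-loop idiom, including the manual lastIndex bump that keeps a zero-length match from looping forever. allMatches is a hypothetical helper name, and it assumes a regex that only uses the g, i, or m flags.

```javascript
// Hypothetical helper: collect every match of a regex in a string
function allMatches(regex, str) {
    // Work on a global copy so the caller's lastIndex is left alone
    var re = new RegExp(regex.source, "g" +
            (regex.ignoreCase ? "i" : "") +
            (regex.multiline  ? "m" : ""));
    var results = [], match;
    while ((match = re.exec(str))) {
        results.push(match[0]);
        // After a zero-length match lastIndex hasn't moved, so advance it by hand
        if (re.lastIndex === match.index) {
            re.lastIndex++;
        }
    }
    return results;
}

var words = allMatches(/\w+/, "one two");
// words: ["one", "two"]

var digits = allMatches(/\d*/, "a12b");
// digits: ["", "12", "", ""]
```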

JavaScript split Bugs: Fixed!

The String.prototype.split method is very handy, so it's a shame that if you use a regular expression as its delimiter, the results can be so wildly different cross-browser that odds are you've just introduced bugs into your code (unless you know precisely what kind of data you're working with and are able to avoid the issues). Here's one example of other people venting about the problems. Following are the inconsistencies cross-browser when using regexes with split:

Internet Explorer excludes almost all empty values from the resulting array (e.g., when two delimiters appear next to each other in the data, or when a delimiter appears at the start or end of the data). This doesn't make any sense to me, since IE does include empty values when using a string as the delimiter.
Internet Explorer and Safari do not splice the values of capturing parentheses into the returned array (this functionality can be useful with simple parsers, etc.)
Firefox does not splice undefined values into the returned array as the result of non-participating capturing groups.
Internet Explorer, Firefox, and Safari have various additional edge-case bugs where they do not follow the split specification (which is actually quite complex).
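For reference, these are the results the specification calls for in the first three cases. The inputs are my own minimal examples; a conforming engine returns the values shown in the comments.

```javascript
// Empty values are kept when delimiters sit at the start or end of the data
var emptyValues = ",a,".split(/,/);
// emptyValues: ["", "a", ""]

// Values of capturing parentheses are spliced into the result
var withCaptures = "a1b".split(/(\d)/);
// withCaptures: ["a", "1", "b"]

// A capturing group that doesn't participate in the match contributes undefined
var nonParticipating = "ab".split(/(x)?b/);
// nonParticipating: ["a", undefined, ""]
```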

The situation is so bad that I've simply avoided using regex-based splitting in the past.

That ends now.

The following script provides a fast, uniform cross-browser implementation of String.prototype.split, and attempts to precisely follow the relevant spec (ECMA-262 v3 §15.5.4.14, pp.103,104).

I've also created a fairly quick and dirty page where you can test the result of more than 50 usages of JavaScript's split method, and quickly compare your browser's results with the correct implementation. On the test page, the pink lines in the third column highlight incorrect results from the native split method. The rightmost column shows the results of the below script. It's all green in every browser I've tested (IE 5.5 – 7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.1 beta, and Swift 0.2).

Run the tests in your browser.

Here's the script:

/*!
 * Cross-Browser Split 1.1.1
 * Copyright 2007-2012 Steven Levithan <stevenlevithan.com>
 * Available under the MIT License
 * ECMAScript compliant, uniform cross-browser split method
 */

/**
 * Splits a string into an array of strings using a regex or string separator. Matches of the
 * separator are not included in the result array. However, if `separator` is a regex that contains
 * capturing groups, backreferences are spliced into the result each time `separator` is matched.
 * Fixes browser bugs compared to the native `String.prototype.split` and can be used reliably
 * cross-browser.
 * @param {String} str String to split.
 * @param {RegExp|String} separator Regex or string to use for separating the string.
 * @param {Number} [limit] Maximum number of items to include in the result array.
 * @returns {Array} Array of substrings.
 * @example
 *
 * // Basic use
 * split('a b c d', ' ');
 * // -> ['a', 'b', 'c', 'd']
 *
 * // With limit
 * split('a b c d', ' ', 2);
 * // -> ['a', 'b']
 *
 * // Backreferences in result array
 * split('..word1 word2..', /([a-z]+)(\d+)/i);
 * // -> ['..', 'word', '1', ' ', 'word', '2', '..']
 */
var split;

// Avoid running twice; that would break the `nativeSplit` reference
split = split || function (undef) {

    var nativeSplit = String.prototype.split,
        compliantExecNpcg = /()??/.exec("")[1] === undef, // NPCG: nonparticipating capturing group
        self;

    self = function (str, separator, limit) {
        // If `separator` is not a regex, use `nativeSplit`
        if (Object.prototype.toString.call(separator) !== "[object RegExp]") {
            return nativeSplit.call(str, separator, limit);
        }
        var output = [],
            flags = (separator.ignoreCase ? "i" : "") +
                    (separator.multiline  ? "m" : "") +
                    (separator.extended   ? "x" : "") + // Proposed for ES6
                    (separator.sticky     ? "y" : ""), // Firefox 3+
            lastLastIndex = 0,
            // Make `global` and avoid `lastIndex` issues by working with a copy
            separator = new RegExp(separator.source, flags + "g"),
            separator2, match, lastIndex, lastLength;
        str += ""; // Type-convert
        if (!compliantExecNpcg) {
            // Doesn't need flags gy, but they don't hurt
            separator2 = new RegExp("^" + separator.source + "$(?!\\s)", flags);
        }
        /* Values for `limit`, per the spec:
         * If undefined: 4294967295 // Math.pow(2, 32) - 1
         * If 0, Infinity, or NaN: 0
         * If positive number: limit = Math.floor(limit); if (limit > 4294967295) limit -= 4294967296;
         * If negative number: 4294967296 - Math.floor(Math.abs(limit))
         * If other: Type-convert, then use the above rules
         */
        limit = limit === undef ?
            -1 >>> 0 : // Math.pow(2, 32) - 1
            limit >>> 0; // ToUint32(limit)
        while (match = separator.exec(str)) {
            // `separator.lastIndex` is not reliable cross-browser
            lastIndex = match.index + match[0].length;
            if (lastIndex > lastLastIndex) {
                output.push(str.slice(lastLastIndex, match.index));
                // Fix browsers whose `exec` methods don't consistently return `undefined` for
                // nonparticipating capturing groups
                if (!compliantExecNpcg && match.length > 1) {
                    match[0].replace(separator2, function () {
                        for (var i = 1; i < arguments.length - 2; i++) {
                            if (arguments[i] === undef) {
                                match[i] = undef;
                            }
                        }
                    });
                }
                if (match.length > 1 && match.index < str.length) {
                    Array.prototype.push.apply(output, match.slice(1));
                }
                lastLength = match[0].length;
                lastLastIndex = lastIndex;
                if (output.length >= limit) {
                    break;
                }
            }
            if (separator.lastIndex === match.index) {
                separator.lastIndex++; // Avoid an infinite loop
            }
        }
        if (lastLastIndex === str.length) {
            if (lastLength || !separator.test("")) {
                output.push("");
            }
        } else {
            output.push(str.slice(lastLastIndex));
        }
        return output.length > limit ? output.slice(0, limit) : output;
    };

    // For convenience
    String.prototype.split = function (separator, limit) {
        return self(this, separator, limit);
    };

    return self;

}();
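The limit handling in the script leans on the unsigned right shift operator, since x >>> 0 performs the ToUint32 conversion described in the comment block. A quick sketch of what that conversion does to various inputs (toUint32 is just a wrapper name I'm using here):

```javascript
// Hypothetical wrapper around the conversion the script applies to `limit`
function toUint32(value) {
    return value >>> 0;
}

var unlimited  = toUint32(-1);         // 4294967295, i.e. Math.pow(2, 32) - 1
var fractional = toUint32(2.9);        // 2
var negative   = toUint32(-2);         // 4294967294
var wrapped    = toUint32(4294967296); // 0
var notNumber  = toUint32(NaN);        // 0
var infinite   = toUint32(Infinity);   // 0
var fromString = toUint32("3");        // 3
var missing    = toUint32(undefined);  // 0
```

The last line shows why the script checks for an undefined limit separately: converting it directly would give 0 and an empty result, where the spec wants no limit at all.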

Download it.

Please let me know if you find any problems. Thanks!

Update: This script has become part of my XRegExp library, which includes many other JavaScript regular expression cross-browser compatibility fixes.