XRegExp 0.2: Now With Named Capture

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

JavaScript's regular expression flavor doesn't support named capture. Well, says who? XRegExp 0.2 brings named capture support, along with several other new features. But first of all, if you haven't seen the previous version, make sure to check out my post on XRegExp 0.1, because not all of the documentation is repeated below.

Highlights

  • Comprehensive named capture support (New)
  • Supports regex literals through the addFlags method (New)
  • Free-spacing and comments mode (x)
  • Dot matches all mode (s)
  • Several other minor improvements over v0.1

Named capture

There are several different syntaxes in the wild for named capture. I've compiled the following table based on my understanding of the regex support of the libraries in question. XRegExp's syntax is included at the top.

Library Capture Backreference In replacement Stored at
XRegExp (<name>…) \k<name> ${name} result.name
.NET (?<name>…)
(?'name'…)
\k<name>
\k'name'
${name} Matcher.Groups('name')
Perl 5.10 (beta) (?<name>…)
(?'name'…)
\k<name>
\k'name'
\g{name}
$+{name} $+{name}
Python (?P<name>…) (?P=name) \g<name> result.group('name')
PHP preg (PCRE 7) (.NET, Perl, and Python styles) $regs['name'] $result['name']

No other major regex library currently supports named capture, although the JGsoft engine (used by products like RegexBuddy) supports both .NET and Python syntax. XRegExp does not use a question mark at the beginning of a named capturing group because that would prevent it from being used in regex literals (JavaScript would immediately throw an "invalid quantifier" error).

XRegExp supports named capture on an on-request basis. You can add named capture support to any regex though the use of the new "k" flag. This is done for compatibility reasons and to ensure that regex compilation time remains as fast as possible in all situations.

Following are several examples of using named capture:

// Add named capture support using the XRegExp constructor
var repeatedWords = new XRegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support using RegExp, after overriding the native constructor
XRegExp.overrideNative();
var repeatedWords = new RegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support to a regex literal
var repeatedWords = /\b (<word> \w+ ) \s+ \k<word> \b/.addFlags("gixk");

var data = "The the test data.";

// Check if data contains repeated words
var hasDuplicates = repeatedWords.test(data);
// hasDuplicates: true

// Use the regex to remove repeated words
var output = data.replace(repeatedWords, "${word}");
// output: "The test data."

In the above code, I've also used the x flag provided by XRegExp, to improve readability. Note that the addFlags method can be called multiple times on the same regex (e.g., /pattern/g.addFlags("k").addFlags("s")), but I'd recommend adding all flags in one shot, for efficiency.

Here are a few more examples of using named capture, with an overly simplistic URL-matching regex (for comprehensive URL parsing, see parseUri):

var url = "http://microsoft.com/path/to/file?q=1";
var urlParser = new XRegExp("^(<protocol>[^:/?]+)://(<host>[^/?]*)(<path>[^?]*)\\?(<query>.*)", "k");
var parts = urlParser.exec(url);
/* The result:
parts.protocol: "http"
parts.host: "microsoft.com"
parts.path: "/path/to/file"
parts.query: "q=1" */

// Named backreferences are also available in replace() callback functions as properties of the first argument
var newUrl = url.replace(urlParser, function(match){
	return match.replace(match.host, "yahoo.com");
});
// newUrl: "http://yahoo.com/path/to/file?q=1"

Note that XRegExp's named capture functionality does not support deprecated JavaScript features including the lastMatch property of the global RegExp object and the RegExp.prototype.compile() method.

Singleline (s) and extended (x) modes

The other non-native flags XRegExp supports are s (singleline) for "dot matches all" mode, and x (extended) for "free-spacing and comments" mode. For full details about these modifiers, see the FAQ in my XRegExp 0.1 post. However, one difference from the previous version is that XRegExp 0.2, when using the x flag, now allows whitespace between a regex token and its quantifier (quantifiers are, e.g., +, *?, or {1,3}). Although the previous version's handling/limitation in this regard was documented, it was atypical compared to other regex libraries. This has been fixed.

Download

XRegExp 0.2.5.

XRegExp has been tested in IE 5.5–7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.2 beta for Windows, and Swift 0.2.

Finally, note that the XRE object from v0.1 has been removed. XRegExp now only creates one global variable: XRegExp. To permanently override the native RegExp constructor/object, you can now run XRegExp.overrideNative();

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

parseUri 1.2: Split URLs in JavaScript

Edit (2024): parseUri has had a major update and is now available on GitHub and npm

I've just updated parseUri. If you haven't seen the older version, parseUri is a function which splits any well-formed URI into its parts, all of which are optional. Its combination of accuracy, flexibility, and brevity is unrivaled.

Highlights:

  • Comprehensively splits URIs, including splitting the query string into key/value pairs. (Enhanced)
  • Two parsing modes: loose and strict. (New)
  • Easy to use (returns an object, so you can do, e.g., parseUri(uri).anchor).
  • Offers convenient, pre-concatenated components (path = directory and file; authority = userInfo, host, and port; etc.)
  • Change the default names of URI parts without editing the function, by updating parseUri.options.key. (New)
  • Exceptionally lightweight (1 KB before minification or gzipping).
  • Released under the MIT License.

Details:

Older versions of this function used what's now called loose parsing mode (which is still the default in this version). Loose mode deviates slightly from the official generic URI spec (RFC 3986), but by doing so allows the function to split URIs in a way that most end users would expect intuitively. However, the finer details of loose mode preclude it from properly handling relative paths which do not start from root (e.g., "../file.html" or "dir/file.html"). On the other hand, strict mode attempts to split URIs according to RFC 3986. Specifically, in loose mode, directories don't need to end with a slash (e.g., the "dir" in "/dir?query" is treated as a directory rather than a file name), and the URI can start with an authority without being preceded by "//" (which means that the "yahoo.com" in "yahoo.com/search/" is treated as the host, rather than part of the directory path).

Since I've assumed that most developers will consistently want to use one mode or the other, the parsing mode is not specified as an argument when running parseUri, but rather as a property of the parseUri function itself. Simply run the following line of code to switch to strict mode:

parseUri.options.strictMode = true;

From that point forward, parseUri will work in strict mode (until you turn it back off).

The code:

// parseUri 1.2.2
// (c) Steven Levithan <stevenlevithan.com>
// MIT License

function parseUri (str) {
	var	o   = parseUri.options,
		m   = o.parser[o.strictMode ? "strict" : "loose"].exec(str),
		uri = {},
		i   = 14;

	while (i--) uri[o.key[i]] = m[i] || "";

	uri[o.q.name] = {};
	uri[o.key[12]].replace(o.q.parser, function ($0, $1, $2) {
		if ($1) uri[o.q.name][$1] = $2;
	});

	return uri;
};

parseUri.options = {
	strictMode: false,
	key: ["source","protocol","authority","userInfo","user","password","host","port","relative","path","directory","file","query","anchor"],
	q:   {
		name:   "queryKey",
		parser: /(?:^|&)([^&=]*)=?([^&]*)/g
	},
	parser: {
		strict: /^(?:([^:\/?#]+):)?(?:\/\/((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?))?((((?:[^?#\/]*\/)*)([^?#]*))(?:\?([^#]*))?(?:#(.*))?)/,
		loose:  /^(?:(?![^:@]+:[^:@\/]*@)([^:\/?#.]+):)?(?:\/\/)?((?:(([^:@]*)(?::([^:@]*))?)?@)?([^:\/?#]*)(?::(\d*))?)(((\/(?:[^?#](?![^?#\/]*\.[^?#\/.]+(?:[?#]|$)))*\/?)?([^?#\/]*))(?:\?([^#]*))?(?:#(.*))?)/
	}
};

You can download it here.

parseUri has no dependencies, and has been tested in IE 5.5–7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.1 beta for Windows, and Swift 0.2.

JavaScript Date Format

Update: The documentation below has been updated for the new Date Format 1.2. Get it now!

Although JavaScript provides a bunch of methods for getting and setting parts of a date object, it lacks a simple way to format dates and times according to a user-specified mask. There are a few scripts out there which provide this functionality, but I've never seen one that worked well for me… Most are needlessly bulky or slow, tie in unrelated functionality, use complicated mask syntaxes that more or less require you to read the documentation every time you want to use them, or don't account for special cases like escaping mask characters within the generated string.

When choosing which special mask characters to use for my JavaScript date formatter, I looked at PHP's date function and ColdFusion's discrete dateFormat and timeFormat functions. PHP uses a crazy mix of letters (to me at least, since I'm not a PHP programmer) to represent various date entities, and while I'll probably never memorize the full list, it does offer the advantages that you can apply both date and time formatting with one function, and that none of the special characters overlap (unlike ColdFusion where m and mm mean different things depending on whether you're dealing with dates or times). On the other hand, ColdFusion uses very easy to remember special characters for masks.

With my date formatter, I've tried to take the best features from both, and add some sugar of my own. It did end up a lot like the ColdFusion implementation though, since I've primarily used CF's mask syntax.

Before getting into further details, here are some examples of how this script can be used:

var now = new Date();

now.format("m/dd/yy");
// Returns, e.g., 6/09/07

// Can also be used as a standalone function
dateFormat(now, "dddd, mmmm dS, yyyy, h:MM:ss TT");
// Saturday, June 9th, 2007, 5:46:21 PM

// You can use one of several named masks
now.format("isoDateTime");
// 2007-06-09T17:46:21

// ...Or add your own
dateFormat.masks.hammerTime = 'HH:MM! "Can\'t touch this!"';
now.format("hammerTime");
// 17:46! Can't touch this!

// When using the standalone dateFormat function,
// you can also provide the date as a string
dateFormat("Jun 9 2007", "fullDate");
// Saturday, June 9, 2007

// Note that if you don't include the mask argument,
// dateFormat.masks.default is used
now.format();
// Sat Jun 09 2007 17:46:21

// And if you don't include the date argument,
// the current date and time is used
dateFormat();
// Sat Jun 09 2007 17:46:22

// You can also skip the date argument (as long as your mask doesn't
// contain any numbers), in which case the current date/time is used
dateFormat("longTime");
// 5:46:22 PM EST

// And finally, you can convert local time to UTC time. Either pass in
// true as an additional argument (no argument skipping allowed in this case):
dateFormat(now, "longTime", true);
now.format("longTime", true);
// Both lines return, e.g., 10:46:21 PM UTC

// ...Or add the prefix "UTC:" to your mask.
now.format("UTC:h:MM:ss TT Z");
// 10:46:21 PM UTC

Following are the special characters supported. Any differences in meaning from ColdFusion's dateFormat and timeFormat functions are noted.

Mask Description
d Day of the month as digits; no leading zero for single-digit days.
dd Day of the month as digits; leading zero for single-digit days.
ddd Day of the week as a three-letter abbreviation.
dddd Day of the week as its full name.
m Month as digits; no leading zero for single-digit months.
mm Month as digits; leading zero for single-digit months.
mmm Month as a three-letter abbreviation.
mmmm Month as its full name.
yy Year as last two digits; leading zero for years less than 10.
yyyy Year represented by four digits.
h Hours; no leading zero for single-digit hours (12-hour clock).
hh Hours; leading zero for single-digit hours (12-hour clock).
H Hours; no leading zero for single-digit hours (24-hour clock).
HH Hours; leading zero for single-digit hours (24-hour clock).
M Minutes; no leading zero for single-digit minutes.
Uppercase M unlike CF timeFormat's m to avoid conflict with months.
MM Minutes; leading zero for single-digit minutes.
Uppercase MM unlike CF timeFormat's mm to avoid conflict with months.
s Seconds; no leading zero for single-digit seconds.
ss Seconds; leading zero for single-digit seconds.
l or L Milliseconds. l gives 3 digits. L gives 2 digits.
t Lowercase, single-character time marker string: a or p.
No equivalent in CF.
tt Lowercase, two-character time marker string: am or pm.
No equivalent in CF.
T Uppercase, single-character time marker string: A or P.
Uppercase T unlike CF's t to allow for user-specified casing.
TT Uppercase, two-character time marker string: AM or PM.
Uppercase TT unlike CF's tt to allow for user-specified casing.
Z US timezone abbreviation, e.g. EST or MDT. With non-US timezones or in the Opera browser, the GMT/UTC offset is returned, e.g. GMT-0500
No equivalent in CF.
o GMT/UTC timezone offset, e.g. -0500 or +0230.
No equivalent in CF.
S The date's ordinal suffix (st, nd, rd, or th). Works well with d.
No equivalent in CF.
'…' or "…" Literal character sequence. Surrounding quotes are removed.
No equivalent in CF.
UTC: Must be the first four characters of the mask. Converts the date from local time to UTC/GMT/Zulu time before applying the mask. The "UTC:" prefix is removed.
No equivalent in CF.

And here are the named masks provided by default (you can easily change these or add your own):

Name Mask Example
default ddd mmm dd yyyy HH:MM:ss Sat Jun 09 2007 17:46:21
shortDate m/d/yy 6/9/07
mediumDate mmm d, yyyy Jun 9, 2007
longDate mmmm d, yyyy June 9, 2007
fullDate dddd, mmmm d, yyyy Saturday, June 9, 2007
shortTime h:MM TT 5:46 PM
mediumTime h:MM:ss TT 5:46:21 PM
longTime h:MM:ss TT Z 5:46:21 PM EST
isoDate yyyy-mm-dd 2007-06-09
isoTime HH:MM:ss 17:46:21
isoDateTime yyyy-mm-dd'T'HH:MM:ss 2007-06-09T17:46:21
isoUtcDateTime UTC:yyyy-mm-dd'T'HH:MM:ss'Z' 2007-06-09T22:46:21Z

A couple issues:

  • In the unlikely event that there is ambiguity in the meaning of your mask (e.g., m followed by mm, with no separating characters), put a pair of empty quotes between your metasequences. The quotes will be removed automatically.
  • If you need to include literal quotes in your mask, the following rules apply:
    • Unpaired quotes do not need special handling.
    • To include literal quotes inside masks which contain any other quote marks of the same type, you need to enclose them with the alternative quote type (i.e., double quotes for single quotes, and vice versa). E.g., date.format('h "o\'clock, y\'all!"') returns "6 o'clock, y'all". This can get a little hairy, perhaps, but I doubt people will really run into it that often. The previous example can also be written as date.format("h") + "o'clock, y'all!".

Here's the code:

/*
 * Date Format 1.2.3
 * (c) 2007-2009 Steven Levithan <stevenlevithan.com>
 * MIT license
 *
 * Includes enhancements by Scott Trenda <scott.trenda.net>
 * and Kris Kowal <cixar.com/~kris.kowal/>
 *
 * Accepts a date, a mask, or a date and a mask.
 * Returns a formatted version of the given date.
 * The date defaults to the current date/time.
 * The mask defaults to dateFormat.masks.default.
 */

var dateFormat = function () {
	var	token = /d{1,4}|m{1,4}|yy(?:yy)?|([HhMsTt])\1?|[LloSZ]|"[^"]*"|'[^']*'/g,
		timezone = /\b(?:[PMCEA][SDP]T|(?:Pacific|Mountain|Central|Eastern|Atlantic) (?:Standard|Daylight|Prevailing) Time|(?:GMT|UTC)(?:[-+]\d{4})?)\b/g,
		timezoneClip = /[^-+\dA-Z]/g,
		pad = function (val, len) {
			val = String(val);
			len = len || 2;
			while (val.length < len) val = "0" + val;
			return val;
		};

	// Regexes and supporting functions are cached through closure
	return function (date, mask, utc) {
		var dF = dateFormat;

		// You can't provide utc if you skip other args (use the "UTC:" mask prefix)
		if (arguments.length == 1 && Object.prototype.toString.call(date) == "[object String]" && !/\d/.test(date)) {
			mask = date;
			date = undefined;
		}

		// Passing date through Date applies Date.parse, if necessary
		date = date ? new Date(date) : new Date;
		if (isNaN(date)) throw SyntaxError("invalid date");

		mask = String(dF.masks[mask] || mask || dF.masks["default"]);

		// Allow setting the utc argument via the mask
		if (mask.slice(0, 4) == "UTC:") {
			mask = mask.slice(4);
			utc = true;
		}

		var	_ = utc ? "getUTC" : "get",
			d = date[_ + "Date"](),
			D = date[_ + "Day"](),
			m = date[_ + "Month"](),
			y = date[_ + "FullYear"](),
			H = date[_ + "Hours"](),
			M = date[_ + "Minutes"](),
			s = date[_ + "Seconds"](),
			L = date[_ + "Milliseconds"](),
			o = utc ? 0 : date.getTimezoneOffset(),
			flags = {
				d:    d,
				dd:   pad(d),
				ddd:  dF.i18n.dayNames[D],
				dddd: dF.i18n.dayNames[D + 7],
				m:    m + 1,
				mm:   pad(m + 1),
				mmm:  dF.i18n.monthNames[m],
				mmmm: dF.i18n.monthNames[m + 12],
				yy:   String(y).slice(2),
				yyyy: y,
				h:    H % 12 || 12,
				hh:   pad(H % 12 || 12),
				H:    H,
				HH:   pad(H),
				M:    M,
				MM:   pad(M),
				s:    s,
				ss:   pad(s),
				l:    pad(L, 3),
				L:    pad(L > 99 ? Math.round(L / 10) : L),
				t:    H < 12 ? "a"  : "p",
				tt:   H < 12 ? "am" : "pm",
				T:    H < 12 ? "A"  : "P",
				TT:   H < 12 ? "AM" : "PM",
				Z:    utc ? "UTC" : (String(date).match(timezone) || [""]).pop().replace(timezoneClip, ""),
				o:    (o > 0 ? "-" : "+") + pad(Math.floor(Math.abs(o) / 60) * 100 + Math.abs(o) % 60, 4),
				S:    ["th", "st", "nd", "rd"][d % 10 > 3 ? 0 : (d % 100 - d % 10 != 10) * d % 10]
			};

		return mask.replace(token, function ($0) {
			return $0 in flags ? flags[$0] : $0.slice(1, $0.length - 1);
		});
	};
}();

// Some common format strings
dateFormat.masks = {
	"default":      "ddd mmm dd yyyy HH:MM:ss",
	shortDate:      "m/d/yy",
	mediumDate:     "mmm d, yyyy",
	longDate:       "mmmm d, yyyy",
	fullDate:       "dddd, mmmm d, yyyy",
	shortTime:      "h:MM TT",
	mediumTime:     "h:MM:ss TT",
	longTime:       "h:MM:ss TT Z",
	isoDate:        "yyyy-mm-dd",
	isoTime:        "HH:MM:ss",
	isoDateTime:    "yyyy-mm-dd'T'HH:MM:ss",
	isoUtcDateTime: "UTC:yyyy-mm-dd'T'HH:MM:ss'Z'"
};

// Internationalization strings
dateFormat.i18n = {
	dayNames: [
		"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat",
		"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"
	],
	monthNames: [
		"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
		"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"
	]
};

// For convenience...
Date.prototype.format = function (mask, utc) {
	return dateFormat(this, mask, utc);
};

Download it here (1.2 KB when minified and gzipped).

Note that the day and month names can be changed (for internationalization or other purposes) by updating the dateFormat.i18n object.

If you have any suggestions or find any issues, lemme know.


Want to learn about aphantasisa and hyperphantasia, the Shen Yun cult, or unmasking cult leader Karen Zerby? Check out my new blog at Life After Tech.

XRegExp: An Extended JavaScript Regex Constructor

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

I use regular expressions in JavaScript fairly frequently, and although the exec() method is badass and I love the ability to use a function to generate the replacement in the replace() method, JavaScript regexes lack some very significant features available in many other languages. Amongst my biggest pet peeves is the lack of support for s and x flags, which should enable "dot matches all" (a.k.a, single-line) mode and "free-spacing and comments" mode, respectively. These modifiers are available in almost every other modern regex library.

To remedy this, I've created a (very small) script which extends the JavaScript RegExp constructor, enabling the aforementioned flags. Basically, it gives you a constructor called XRegExp which works exactly like the RegExp constructor except that it also accepts s and x as flags, in addition to the already supported g (global), i (case insensitive), and m (multiline, i.e., ^ and $ match at line breaks). As a bonus, XRegExp also improves cross-browser regex syntax consistency.

Here's how it works:

var regex = new XRegExp("te?", "gi");
var value = "Test".replace(regex, "z"); // value is now "zsz"

Look familiar? If you've used regular expressions in JavaScript before, it should — it's exactly like using RegExp. However, so far we haven't done anything that can't be accomplished using the native RegExp constructor. The s flag is pretty self-explanatory (specific details can be found in the FAQ, below), so here's an example with the x flag:

// Turn email addresses into links
var email = new XRegExp(
    "\\b                                      " +
    "# Capture the address to $1            \n" +
    "(                                        " +
    "  \\w[-.\\w]*               # username \n" +
    "  @                                      " +
    "  [-.a-z0-9]+\\.(?:com|net) # hostname \n" +
    ")                                        " +
    "\\b                                      ", "gix");

value = value.replace(email, "<a href=\"mailto:$1\">$1</a>");

That's certainly different! A couple notes:

  • When using XRegExp, the normal string escape rules (preceding special characters with "\") are necessary, just as with RegExp. Hence, the three instances of \n are metasequences within the string literal itself. JavaScript converts them to newline characters (which end the comments) before XRegExp sees the string.
  • The email regex is overly simplistic and only intended for demonstrative purposes.

That's fairly nifty, but we can make this even easier. If you run the following line of code:

XRE.overrideNative();

…Like magic, the RegExp constructor itself will support the s and x flags from that point forward. The tradeoff is that you will then no longer be able to access information about the last match as properties of the global RegExp object. However, those properties are all officially deprecated anyway, and you can access all the same info though a combination of properties on regex instances and use of the exec() method.

Here's a quick FAQ. For the first two questions, I've borrowed portions of the explanations from O'Reilly's Mastering Regular Expressions, 3rd Edition.

What exactly does the s flag do?
Usually, dot does not match a newline. However, a mode in which dot matches a newline can be as useful as one where dot doesn't. The s flag allows the mode to be selected on a per-regex basis. Note that dots within character classes (e.g., [.a-z]) are always equivalent to literal dots.

As for what exactly is considered a newline character (and therefore not matched by dots outside of character classes unless using the s flag), according to the Mozilla Developer Center it includes the four characters matched by the following regex: [\n\r\u2028\u2029]
What exactly does the x flag do?
First, it causes most whitespace to be ignored, so you can "free-format" the expression for readability. Secondly, it allows comments with a leading #.

Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within a character class (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace, as in new XRegExp("<a \\s+ href=…>", "x"). Note that describing whitespace and comments as ignore-me metacharacters is not quite accurate; it might be better to think of them as do-nothing metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. Finally, don't immediately follow whitespace or a comment with a quantifier (e.g., * or ?), or you will quantify the do-nothing metacharacter.

As for what exactly is whitespace, according to the Mozilla Developer Center it is equivalent to all characters matched by the following regex:
[\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000]
Can the s and x flags be used together?
Yes. You can combine all supported flags (g, i, m, s, x) in any order.
What regex syntax does XRegExp support?
Whatever your browser supports natively.
You mentioned something about improving cross-browser regex syntax consistency?
When using XRegExp, a leading, unescaped ] within a character class is considered a literal character and therefore does not end the class. This is consistent with other regex engines I've used, and is true in Internet Explorer natively. However, in Firefox the following quirks (bugs?) can be experienced:
  • [^] is equivalent to [\S\s], although it should throw an error.
  • [^]] is equivalent to [\S\s]], although it should be equivalent to [^\]] or (?!])[\S\s].
  • [] is equivalent to (?!) (which will never match), although it should throw an error.
  • []] is equivalent to (?!)] (which will never match), although it should be equvialent to [\]] or ].
When using XRegExp (or RegExp with XRE.overrideNative()), you don't have to worry about how different browsers handle this, as a leading ] within a character class will never end the class.
Which regex-related methods does XRegExp support?
All of them.
Are regexes built using XRegExp any slower than they would be otherwise?
No.
Does it take any longer to build regexes using XRegExp than it would otherwise?
Yes, by a tiny amount. From personal testing, building regexes using XRegExp typically takes less than a millisecond longer than it would otherwise. This is especially trivial given that regexes should not need to be constructed within loops. Instead, a regex should be assigned to a variable before entering a loop, to avoid rebuilding it during each iteration of the loop.
Which browsers has this been tested with?
Firefox 2, Internet Explorer 5.5–7, and Opera 9.
How big is the script file?
Minified, it's less than 1KB. Gzipping reduces it further.
What license is this released under?
The MIT License.
Does XRegExp affect regular expression literals?
No. Even when using XRE.overrideNative(), Perl-style regex literals (e.g., /pattern/gi) are unaffected.
I found a bug. Why do you suck so bad?
Are you sure the bug is in XRegExp? Regular expression syntax is somewhat complex and often changes its meaning given context. Additionally, metasequences within JavaScript string literals can change things before XRegExp ever sees your regex. In any case, whether or not you're sure you know what's causing the issue, feel free to leave a comment and I'll look into it ASAP.

Here's the script, with comments:

/*----------------------------------------------------------------------
XRegExp 0.1, by Steven Levithan <http://stevenlevithan.com>
MIT-style license
------------------------------------------------------------------------
Adds support for the following regular expression features:
  - The "s" flag: Dot matches all (a.k.a, single-line) mode.
  - The "x" flag: Free-spacing and comments mode.

XRegExp also offers consistent, cross-browser handling when "]" is used
as the first character within a character class (e.g., "[]]" or "[^]]").
----------------------------------------------------------------------*/

var XRegExp = function(pattern, flags){
	if(!flags) flags = "";
	
	/* If the "free-spacing and comments" modifier (x) is enabled, replace unescaped whitespace as well as unescaped pound
	signs (#) and any following characters up to and including the next newline character (\n) with "(?:)". Using "(?:)"
	instead of just removing matches altogether prevents, e.g., "\1 0" from becoming "\10" (which has different meanings
	depending on context). None of this applies within character classes, which are unaffected even when they contain
	whitespace or pound signs (which is consistent with pretty much every library except java.util.regex). */
	if(flags.indexOf("x") !== -1){
		pattern = pattern.replace(XRE.re.xMod, function($0, $1, $2){
			// If $2 is an empty string or its first character is "[", return the match unviolated (an effective no-op).
			return (/[^[]/.test($2.charAt(0)) ? $1 + "(?:)" : $0);
		});
	}
	
	/* Two steps (the order is not important):
	
	1. Since a regex literal will be used to return the final regex function/object, replace characters which are not
	   allowed within regex literals (carriage return, line feed) with the metasequences which represent them (\r, \n),
	   accounting for both escaped and unescaped literal characters within pattern. This step is only necessary to support
	   the XRE.overrideNative() method, since otherwise the RegExp constructor could be used to return the final regex.
	
	2. When "]" is the first character within a character class, convert it to "\]", for consistent, cross-browser handling.
	   This is included to workaround the following Firefox quirks (bugs?):
	     - "[^]" is equivalent to "[\S\s]", although it should throw an error.
	     - "[^]]" is equivalent to "[\S\s]]", although it should be equivalent to "[^\]]" or "(?!])[\S\s]".
	     - "[]" is equivalent to "(?!)" (which will never match), although it should throw an error.
	     - "[]]" is equivalent to "(?!)]" (which will never match), although it should be equvialent to "[\]]" or "]".
	   
	   Note that this step is not just an extra feature. It is in fact required in order to maintain correctness without
	   the aid of browser sniffing when constructing the regexes which deal with character classes (XRE.re.chrClass and
	   XRE.re.xMod). They treat a leading "]" within a character class as a non-terminating, literal character. */
	pattern = pattern.replace(XRE.re.badChr, function($0, $1, $2){
			return $1 + $2.replace(/\r/, "\\r").replace(/\n/, "\\n");
		}).
		replace(XRE.re.chrClass, function($0, $1, $2){
			return $1 + $2.replace(/^(\[\^?)]/, "$1\\]");
		});
	
	// If the "dot matches all" modifier (s) is enabled, replace unescaped dots outside of character classes with [\S\s]
	if(flags.indexOf("s") !== -1){
		pattern = pattern.replace(XRE.re.chrClass, function($0, $1, $2){
			return $1.replace(XRE.re.sMod, function($0, $1, $2){
					return $1 + ($2 === "." ? "[\\S\\s]" : "");
				}) + $2;
		});
	}
	
	// Use an evaluated regex literal to return the regular expression, in order to support the XRE.overrideNative() method.
	return eval("/" + pattern + "/" + flags.replace(/[sx]+/g, ""));
},
XRE = {
	overrideNative: function(){
		/* Override the global RegExp constructor/object with the enhanced XRegExp constructor. This precludes accessing
		properties of the last match via the global RegExp object. However, those properties are deprecated as of
		JavaScript 1.5, and the values are available on RegExp instances or via the exec() method. */
		RegExp = XRegExp;
	},
	re: {
		chrClass: /((?:[^[\\]+|\\(?:[\S\s]|$))*)((?:\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?)?)/g,
		xMod: /((?:[^[#\s\\]+|\\(?:[\S\s]|$))*)((?:\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?|\s*#[^\n\r]*[\n\r]?\s*|\s+)?)/g,
		sMod: /((?:[^\\.]+|\\(?:[\S\s]|$))*)(\.?)/g,
		badChr: /((?:[^\\\r\n]+|\\(?:[^\r\n]|$(?!\s)))*)\\?([\r\n]?)/g
	}
};
/* XRE.re is used to cache the more complex regexes so they don't have to be recompiled every time XRegExp runs. Note that
the included regexes match anything (if they didn't, they would have to be rewritten to avoid catastrophic backtracking on
failure). It's the backreferences, as well as where one match ends and the next begins, that's important. Also note that
the regexes are exactly as they are in order to account for all kinds of caveats involved in interpreting and working with
JavaScript regexes. Do not modify them! */

Download it here.

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

parseUri: Split URLs in JavaScript

Update: The following post is outdated. See parseUri 2.0 for the latest, greatest version.

For fun, I spent the 10 minutes needed to convert my parseUri() ColdFusion UDF into a JavaScript function.

For those who haven't already seen it, I'll repeat my explanation from the other post…

parseUri() splits any well-formed URI into its parts (all are optional). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing backslash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., parseUri(uri).anchor, etc.

I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.

This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.

/* parseUri JS v0.1.1, by Steven Levithan <http://stevenlevithan.com>
Splits any well-formed URI into the following parts (all are optional):
----------------------
- source (since the exec method returns the entire match as key 0, we might as well use it)
- protocol (i.e., scheme)
- authority (includes both the domain and port)
  - domain (i.e., host; can be an IP address)
  - port
- path (includes both the directory path and filename)
  - directoryPath (supports directories with periods, and without a trailing backslash)
  - fileName
- query (does not include the leading question mark)
- anchor (i.e., fragment) */
function parseUri(sourceUri){
	var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"],
		uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri),
		uri = {};
	
	for(var i = 0; i < 10; i++){
		uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
	}
	
	/* Always end directoryPath with a trailing backslash if a path was present in the source URI
	Note that a trailing backslash is NOT automatically inserted within or appended to the "path" key */
	if(uri.directoryPath.length > 0){
		uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
	}
	
	return uri;
}

Test it.

Is there any leaner, meaner URI parser out there? 🙂


Edit: This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if people leave comments asking for the support, I'll go ahead and add it.