Steven Levithan – Page 16 – Flagrant Badassery

Faster JavaScript Trim

Since JavaScript doesn't include a trim method natively, it's included by countless JavaScript libraries – usually as a global function or appended to String.prototype. However, I've never seen an implementation which performs as well as it could, probably because most programmers don't deeply understand or care about regex efficiency issues.

After seeing a particularly bad trim implementation, I decided to do a little research towards finding the most efficient approach. Before getting into the analysis, here are the results:

Method	Firefox 2	IE 6
trim1	15ms	< 0.5ms
trim2	31ms	< 0.5ms
trim3	46ms	31ms
trim4	47ms	46ms
trim5	156ms	1656ms
trim6	172ms	2406ms
trim7	172ms	1640ms
trim8	281ms	< 0.5ms
trim9	125ms	78ms
trim10	< 0.5ms	< 0.5ms
trim11	< 0.5ms	< 0.5ms

Note 1: The comparison is based on trimming the Magna Carta (over 27,600 characters) with a bit of leading and trailing whitespace 20 times on my personal system. However, the data you're trimming can have a major impact on performance, which is detailed below.

Note 2: trim4 and trim6 are the most commonly found in JavaScript libraries today.

Note 3: The aforementioned bad implementation is not included in the comparison, but is shown later.

The analysis

Although there are 11 rows in the table above, they are only the most notable (for various reasons) of about 20 versions I wrote and benchmarked against various types of strings. The following analysis is based on testing in Firefox 2.0.0.4, although I have noted where there are major differences in IE6.

return str.replace(/^\s\s*/, '').replace(/\s\s*$/, '');
All things considered, this is probably the best all-around approach. Its speed advantage is most notable with long strings — when efficiency matters. The speed is largely due to a number of optimizations internal to JavaScript regex interpreters which the two discrete regexes here trigger. Specifically, the pre-check of required character and start of string anchor optimizations, possibly among others.
return str.replace(/^\s+/, '').replace(/\s+$/, '');
Very similar to trim1 (above), but a little slower since it doesn't trigger all of the same optimizations.
return str.substring(Math.max(str.search(/\S/), 0), str.search(/\S\s*$/) + 1);
This is often faster than the following methods, but slower than the above two. Its speed comes from its use of simple, character-index lookups.
return str.replace(/^\s+|\s+$/g, '');
This commonly thought up approach is easily the most frequently used in JavaScript libraries today. It is generally the fastest implementation of the bunch only when working with short strings which don't include leading or trailing whitespace. This minor advantage is due in part to the initial-character discrimination optimization it triggers. While this is a relatively decent performer, it's slower than the three methods above when working with longer strings, because the top-level alternation prevents a number of optimizations which could otherwise kick in.
str = str.match(/\S+(?:\s+\S+)*/); return str ? str[0] : '';
This is generally the fastest method when working with empty or whitespace-only strings, due to the pre-check of required character optimization it triggers. Note: In IE6, this can be quite slow when working with longer strings.
return str.replace(/^\s*(\S*(\s+\S+)*)\s*$/, '$1');
This is a relatively common approach, popularized in part by some leading JavaScripters. It's similar in approach (but inferior) to trim8. There's no good reason to use this in JavaScript, especially since it can be very slow in IE6.
return str.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, '$1');
The same as trim6, but a bit faster due to the use of a non-capturing group (which doesn't work in IE 5.0 and lower). Again, this can be slow in IE6.
return str.replace(/^\s*((?:[\S\s]*\S)?)\s*$/, '$1');
This uses a simple, single-pass, greedy approach. In IE6, this is crazy fast! The performance difference indicates that IE has superior optimization for quantification of "any character" tokens.
return str.replace(/^\s*([\S\s]*?)\s*$/, '$1');
This is generally the fastest with very short strings which contain both non-space characters and edge whitespace. This minor advantage is due to the simple, single-pass, lazy approach it uses. Like trim8, this is significantly faster in IE6 than Firefox 2.

Since I've seen the following additional implementation in one library, I'll include it here as a warning:

return str.replace(/^\s*([\S\s]*)\b\s*$/, '$1');

Although the above is sometimes the fastest method when working with short strings which contain both non-space characters and edge whitespace, it performs very poorly with long strings which contain numerous word boundaries, and it's terrible (!) with long strings comprised of nothing but whitespace, since that triggers an exponentially increasing amount of backtracking. Do not use.

A different endgame

There are two methods in the table at the top of this post which haven't been covered yet. For those, I've used a non-regex and hybrid approach.

After comparing and analyzing all of the above, I wondered how an implementation which used no regular expressions would perform. Here's what I tried:

function trim10 (str) {
	var whitespace = ' \n\r\t\f\x0b\xa0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000';
	for (var i = 0; i < str.length; i++) {
		if (whitespace.indexOf(str.charAt(i)) === -1) {
			str = str.substring(i);
			break;
		}
	}
	for (i = str.length - 1; i >= 0; i--) {
		if (whitespace.indexOf(str.charAt(i)) === -1) {
			str = str.substring(0, i + 1);
			break;
		}
	}
	return whitespace.indexOf(str.charAt(0)) === -1 ? str : '';
}

How does that perform? Well, with long strings which do not contain excessive leading or trailing whitespace, it blows away the competition (except against trim1/2/8 in IE, which are already insanely fast there).

Does that mean regular expressions are slow in Firefox? No, not at all. The issue here is that although regexes are very well suited for trimming leading whitespace, apart from the .NET library (which offers a somewhat-mysterious "backwards matching" mode), they don't really provide a method to jump to the end of a string without even considering previous characters. However, the non-regex-reliant trim10 function does just that, with the second loop working backwards from the end of the string until it finds a non-whitespace character.

Knowing that, what if we created a hybrid implementation which combined a regex's universal efficiency at trimming leading whitespace with the alternative method's speed at removing trailing characters?

function trim11 (str) {
	str = str.replace(/^\s+/, '');
	for (var i = str.length - 1; i >= 0; i--) {
		if (/\S/.test(str.charAt(i))) {
			str = str.substring(0, i + 1);
			break;
		}
	}
	return str;
}

Although the above is a bit slower than trim10 with some strings, it uses significantly less code and is still lightning fast. Plus, with strings which contain a lot of leading whitespace (which includes strings comprised of nothing but whitespace), it's much faster than trim10.

In conclusion…

Since the differences between the implementations cross-browser and when used with different data are both complex and nuanced (none of them are faster than all the others with any data you can throw at it), here are my general recommendations for a trim method:

Use trim1 if you want a general-purpose implementation which is fast cross-browser.
Use trim11 if you want to handle long strings exceptionally fast in all browsers.

To test all of the above implementations for yourself, try my very rudimentary benchmarking page. Background processing can cause the results to be severely skewed, so run the test a number of times (regardless of how many iterations you specify) and only consider the fastest results (since averaging the cost of background interference is not very enlightening).

As a final note, although some people like to cache regular expressions (e.g. using global variables) so they can be used repeatedly without recompilation, IMO this does not make much sense for a trim method. All of the above regexes are so simple that they typically take no more than a nanosecond to compile. Additionally, some browsers automatically cache the most recently used regexes, so a typical loop which uses trim and doesn't contain a bunch of other regexes might not encounter recompilation anyway.

Edit (2008-02-04): Shortly after posting this I realized trim10/11 could be better written. Several people have also posted improved versions in the comments. Here's what I use now, which takes the trim11-style hybrid approach:

function trim12 (str) {
	var	str = str.replace(/^\s\s*/, ''),
		ws = /\s/,
		i = str.length;
	while (ws.test(str.charAt(--i)));
	return str.slice(0, i + 1);
}

New library: Are you a JavaScript regex master, or want to be? Then you need my fancy XRegExp library. It adds new regex syntax (including named capture and Unicode properties); s, x, and n flags; powerful regex utils; and it fixes pesky browser inconsistencies. Check it out!

Gauging Interest in Enhanced JavaScript Regex Methods

Update: Some of the functionality discussed here has made its way into later versions of XRegExp.

So, I'll admit that XRegExp 0.1, though hopefully interesting or useful for some people, was scaled back from my original plans. There were two reasons for this:

To get it out the door.
Aside from from a few marginally useful syntax constructs, I'd already included all standard regular expression features which I could think of ways to mimic while still allowing the constructed regular expression objects to be used with built-in JavaScript regex methods without any changes to expected behavior (e.g., backreference ordering).

However, if I don't worry about the regexes being used with built-in methods, and instead create custom methods (possibly with names like xmatch, xreplace, xexec, etc.), a number of significant, additional features become technically possible to mimic. Things like atomic groups, possessive quantifiers, named capture, and even infinite-length lookbehinds (though lookbehinds would have to be limited to appearing at the start and/or end of regexes, or alternatively not be used together with lookaheads).

However, since some of this stuff might be tricky to pull off, and I'm not really sure how useful most people would find this, or if the majority of people prefer regex literals over a constructor even given the feature enhancements possible through a custom constructor, I'd like to gauge interest in this stuff before thinking much more about it. Do you think you would regularly use the features I mentioned, even given that it would require the use of a custom constructor and methods? Do you use regular expressions in JavaScript but don't see yourself including a script just so you can do this stuff? Do you think the convenience of regex literals outweighs the benefits of enhanced syntax? Let me know. If you'd like more details, wanna help out with this, or have any other comments, I'd be happy to hear from you, as well.

XRegExp: An Extended JavaScript Regex Constructor

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

I use regular expressions in JavaScript fairly frequently, and although the exec() method is badass and I love the ability to use a function to generate the replacement in the replace() method, JavaScript regexes lack some very significant features available in many other languages. Amongst my biggest pet peeves is the lack of support for s and x flags, which should enable "dot matches all" (a.k.a, single-line) mode and "free-spacing and comments" mode, respectively. These modifiers are available in almost every other modern regex library.

To remedy this, I've created a (very small) script which extends the JavaScript RegExp constructor, enabling the aforementioned flags. Basically, it gives you a constructor called XRegExp which works exactly like the RegExp constructor except that it also accepts s and x as flags, in addition to the already supported g (global), i (case insensitive), and m (multiline, i.e., ^ and $ match at line breaks). As a bonus, XRegExp also improves cross-browser regex syntax consistency.

Here's how it works:

var regex = new XRegExp("te?", "gi");
var value = "Test".replace(regex, "z"); // value is now "zsz"

Look familiar? If you've used regular expressions in JavaScript before, it should — it's exactly like using RegExp. However, so far we haven't done anything that can't be accomplished using the native RegExp constructor. The s flag is pretty self-explanatory (specific details can be found in the FAQ, below), so here's an example with the x flag:

// Turn email addresses into links
var email = new XRegExp(
    "\\b                                      " +
    "# Capture the address to $1            \n" +
    "(                                        " +
    "  \\w[-.\\w]*               # username \n" +
    "  @                                      " +
    "  [-.a-z0-9]+\\.(?:com|net) # hostname \n" +
    ")                                        " +
    "\\b                                      ", "gix");

value = value.replace(email, "<a href=\"mailto:$1\">$1</a>");

That's certainly different! A couple notes:

When using XRegExp, the normal string escape rules (preceding special characters with "\") are necessary, just as with RegExp. Hence, the three instances of \n are metasequences within the string literal itself. JavaScript converts them to newline characters (which end the comments) before XRegExp sees the string.
The email regex is overly simplistic and only intended for demonstrative purposes.

That's fairly nifty, but we can make this even easier. If you run the following line of code:

XRE.overrideNative();

…Like magic, the RegExp constructor itself will support the s and x flags from that point forward. The tradeoff is that you will then no longer be able to access information about the last match as properties of the global RegExp object. However, those properties are all officially deprecated anyway, and you can access all the same info though a combination of properties on regex instances and use of the exec() method.

Here's a quick FAQ. For the first two questions, I've borrowed portions of the explanations from O'Reilly's Mastering Regular Expressions, 3^rd Edition.

What exactly does the s flag do?

Usually, dot does not match a newline. However, a mode in which dot matches a newline can be as useful as one where dot doesn't. The s flag allows the mode to be selected on a per-regex basis. Note that dots within character classes (e.g., [.a-z]) are always equivalent to literal dots.

As for what exactly is considered a newline character (and therefore not matched by dots outside of character classes unless using the s flag), according to the Mozilla Developer Center it includes the four characters matched by the following regex: [\n\r\u2028\u2029]

What exactly does the x flag do?

First, it causes most whitespace to be ignored, so you can "free-format" the expression for readability. Secondly, it allows comments with a leading #.

Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within a character class (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace, as in new XRegExp("<a \\s+ href=…>", "x"). Note that describing whitespace and comments as ignore-me metacharacters is not quite accurate; it might be better to think of them as do-nothing metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. Finally, don't immediately follow whitespace or a comment with a quantifier (e.g., * or ?), or you will quantify the do-nothing metacharacter.

As for what exactly is whitespace, according to the Mozilla Developer Center it is equivalent to all characters matched by the following regex:
[\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000]

Can the s and x flags be used together?

Yes. You can combine all supported flags (g, i, m, s, x) in any order.

What regex syntax does XRegExp support?

Whatever your browser supports natively.

You mentioned something about improving cross-browser regex syntax consistency?

When using XRegExp, a leading, unescaped ] within a character class is considered a literal character and therefore does not end the class. This is consistent with other regex engines I've used, and is true in Internet Explorer natively. However, in Firefox the following quirks (bugs?) can be experienced:

[^] is equivalent to [\S\s], although it should throw an error.
[^]] is equivalent to [\S\s]], although it should be equivalent to [^\]] or (?!])[\S\s].
[] is equivalent to (?!) (which will never match), although it should throw an error.
[]] is equivalent to (?!)] (which will never match), although it should be equvialent to [\]] or ].

When using XRegExp (or RegExp with XRE.overrideNative()), you don't have to worry about how different browsers handle this, as a leading ] within a character class will never end the class.

Which regex-related methods does XRegExp support?

All of them.

Are regexes built using XRegExp any slower than they would be otherwise?

No.

Does it take any longer to build regexes using XRegExp than it would otherwise?

Yes, by a tiny amount. From personal testing, building regexes using XRegExp typically takes less than a millisecond longer than it would otherwise. This is especially trivial given that regexes should not need to be constructed within loops. Instead, a regex should be assigned to a variable before entering a loop, to avoid rebuilding it during each iteration of the loop.

Which browsers has this been tested with?

Firefox 2, Internet Explorer 5.5–7, and Opera 9.

How big is the script file?

Minified, it's less than 1KB. Gzipping reduces it further.

What license is this released under?

The MIT License.

Does XRegExp affect regular expression literals?

No. Even when using XRE.overrideNative(), Perl-style regex literals (e.g., /pattern/gi) are unaffected.

I found a bug. Why do you suck so bad?

Are you sure the bug is in XRegExp? Regular expression syntax is somewhat complex and often changes its meaning given context. Additionally, metasequences within JavaScript string literals can change things before XRegExp ever sees your regex. In any case, whether or not you're sure you know what's causing the issue, feel free to leave a comment and I'll look into it ASAP.

Here's the script, with comments:

/*----------------------------------------------------------------------
XRegExp 0.1, by Steven Levithan <http://stevenlevithan.com>
MIT-style license
------------------------------------------------------------------------
Adds support for the following regular expression features:
  - The "s" flag: Dot matches all (a.k.a, single-line) mode.
  - The "x" flag: Free-spacing and comments mode.

XRegExp also offers consistent, cross-browser handling when "]" is used
as the first character within a character class (e.g., "[]]" or "[^]]").
----------------------------------------------------------------------*/

var XRegExp = function(pattern, flags){
	if(!flags) flags = "";
	
	/* If the "free-spacing and comments" modifier (x) is enabled, replace unescaped whitespace as well as unescaped pound
	signs (#) and any following characters up to and including the next newline character (\n) with "(?:)". Using "(?:)"
	instead of just removing matches altogether prevents, e.g., "\1 0" from becoming "\10" (which has different meanings
	depending on context). None of this applies within character classes, which are unaffected even when they contain
	whitespace or pound signs (which is consistent with pretty much every library except java.util.regex). */
	if(flags.indexOf("x") !== -1){
		pattern = pattern.replace(XRE.re.xMod, function($0, $1, $2){
			// If $2 is an empty string or its first character is "[", return the match unviolated (an effective no-op).
			return (/[^[]/.test($2.charAt(0)) ? $1 + "(?:)" : $0);
		});
	}
	
	/* Two steps (the order is not important):
	
	1. Since a regex literal will be used to return the final regex function/object, replace characters which are not
	   allowed within regex literals (carriage return, line feed) with the metasequences which represent them (\r, \n),
	   accounting for both escaped and unescaped literal characters within pattern. This step is only necessary to support
	   the XRE.overrideNative() method, since otherwise the RegExp constructor could be used to return the final regex.
	
	2. When "]" is the first character within a character class, convert it to "\]", for consistent, cross-browser handling.
	   This is included to workaround the following Firefox quirks (bugs?):
	     - "[^]" is equivalent to "[\S\s]", although it should throw an error.
	     - "[^]]" is equivalent to "[\S\s]]", although it should be equivalent to "[^\]]" or "(?!])[\S\s]".
	     - "[]" is equivalent to "(?!)" (which will never match), although it should throw an error.
	     - "[]]" is equivalent to "(?!)]" (which will never match), although it should be equvialent to "[\]]" or "]".
	   
	   Note that this step is not just an extra feature. It is in fact required in order to maintain correctness without
	   the aid of browser sniffing when constructing the regexes which deal with character classes (XRE.re.chrClass and
	   XRE.re.xMod). They treat a leading "]" within a character class as a non-terminating, literal character. */
	pattern = pattern.replace(XRE.re.badChr, function($0, $1, $2){
			return $1 + $2.replace(/\r/, "\\r").replace(/\n/, "\\n");
		}).
		replace(XRE.re.chrClass, function($0, $1, $2){
			return $1 + $2.replace(/^(\[\^?)]/, "$1\\]");
		});
	
	// If the "dot matches all" modifier (s) is enabled, replace unescaped dots outside of character classes with [\S\s]
	if(flags.indexOf("s") !== -1){
		pattern = pattern.replace(XRE.re.chrClass, function($0, $1, $2){
			return $1.replace(XRE.re.sMod, function($0, $1, $2){
					return $1 + ($2 === "." ? "[\\S\\s]" : "");
				}) + $2;
		});
	}
	
	// Use an evaluated regex literal to return the regular expression, in order to support the XRE.overrideNative() method.
	return eval("/" + pattern + "/" + flags.replace(/[sx]+/g, ""));
},
XRE = {
	overrideNative: function(){
		/* Override the global RegExp constructor/object with the enhanced XRegExp constructor. This precludes accessing
		properties of the last match via the global RegExp object. However, those properties are deprecated as of
		JavaScript 1.5, and the values are available on RegExp instances or via the exec() method. */
		RegExp = XRegExp;
	},
	re: {
		chrClass: /((?:[^[\\]+|\\(?:[\S\s]|$))*)((?:\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?)?)/g,
		xMod: /((?:[^[#\s\\]+|\\(?:[\S\s]|$))*)((?:\[\^?]?(?:[^\\\]]+|\\(?:[\S\s]|$))*]?|\s*#[^\n\r]*[\n\r]?\s*|\s+)?)/g,
		sMod: /((?:[^\\.]+|\\(?:[\S\s]|$))*)(\.?)/g,
		badChr: /((?:[^\\\r\n]+|\\(?:[^\r\n]|$(?!\s)))*)\\?([\r\n]?)/g
	}
};
/* XRE.re is used to cache the more complex regexes so they don't have to be recompiled every time XRegExp runs. Note that
the included regexes match anything (if they didn't, they would have to be rewritten to avoid catastrophic backtracking on
failure). It's the backreferences, as well as where one match ends and the next begins, that's important. Also note that
the regexes are exactly as they are in order to account for all kinds of caveats involved in interpreting and working with
JavaScript regexes. Do not modify them! */

Download it here.

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

Shiny New Website

So, I have a new domain name and host which lets me run PHP and ColdFusion. Woohoo! I'm new to the world of both PHP and WordPress (which is powering this blog), so this is still pretty rough. Hopefully I'll add more cool features and customize the design before too long.

Most of my old stuff (anything below this post) has been migrated from Blogger. I've updated a few posts in the process, and now that I'm able to run ColdFusion code I've added demos for some of my older stuff, including Leet Translator, REMatch, and both the ColdFusion and JavaScript implementations of parseUri. Check 'em out.

Mastering Regular Expressions, 3rd Edition

After reading positive reviews about the second edition it since I started using regular expressions a little over a year ago, I finally purchased O'Reilly Media's Mastering Regular Expressions by Jeffrey E. F. Friedl, since I discovered that a third edition came out in August 2006. The book arrived today, and it is indeed pretty damn excellent (I'm as excited about it as I can be about a tech book, anyway). I've only spent a few minutes with it so far, but I can see that it is very well presented and demonstrates some cool techniques. Hopefully I'll post about some nifty things I learn over the next few weeks, if I get a chance to actually sit down and read through a significant portion of the book.