Five Free Copies of Upcoming O’Reilly Book ‘High Performance JavaScript’

Update (2010-02-25): This contest is now closed.

Book cover: High Performance JavaScript

Last year, Yahoo! engineer and all-around JavaScript badass Nicholas Zakas asked if I was interested in writing a chapter for a new book on JavaScript performance that he was working on. I agreed, and that book, High Performance JavaScript, is now available for preorder at Amazon and other fine book retailers.

In addition to the wide-ranging content by Nicholas and a chapter on string and regular expression performance by yours truly, chapters were also contributed by an awesome lineup of JavaScript performance gurus: Ross Harmes, Julien Lecomte, Stoyan Stefanov, and Matt Sweeney. This book is unique in its laser-focus on optimizing the performance of your JavaScript applications, and covers many advanced topics in the process. The chapter on strings and regular expressions provides what I think is easily the most in-depth coverage of cross-browser JavaScript regex performance currently available.

Here's the list of chapters:

  1. Loading and Execution
  2. Data Access
  3. DOM Scripting (Stoyan Stefanov)
  4. Algorithms and Flow Control
  5. Strings and Regular Expressions (Steven Levithan)
  6. Responsive Interfaces
  7. Ajax (Ross Harmes)
  8. Programming Practices
  9. Build and Deployment (Julien Lecomte)
  10. Tools (Matt Sweeney)

To celebrate the completion of this book, I'm giving away five copies (I originally offered three, but O'Reilly Media increased the offer to five!). All you need to do is comment on this post by February 24th, and I'll pick five people to send a copy to as soon as it's released (Amazon says March 15th). If you prefer, I'd be happy to send you a copy of Regular Expressions Cookbook instead (please note which book you want in your comment). Four winners will be chosen at random from the pool of unique commenters (I'll be tracking IPs), and the fifth will be chosen based on the reason given for why you want a copy.

Make sure to include your email address in the comment form, since I'll need it to contact you if you're selected (your email address won't be used for any other purpose). Good luck, and congratulations to Nicholas Zakas and all the other authors on completing a fantastic new book!

Edit (2010-02-05): My blog has been offline more often than not for the first two days after posting this, and many people have reported that they were unable to post a comment. I apologize for the screw-up—my blog is now on a different server, and the problems should be resolved. Please try again!

Edit (2010-02-08): O'Reilly Media kindly offered to pick up the tab for this giveaway, and increased the winnings to five books!

Edit (2010-02-09): Nicholas Zakas posted more information about High Performance JavaScript on his blog: Announcing High Performance JavaScript.

Edit (2010-02-25): This contest is now closed. Winners will be announced here shortly.

Edit (2010-03-03): Following are the winners of this giveaway (the first four were chosen randomly):

  1. David Henderson
  2. Daniel Trebbien
  3. Lea Verou
  4. Stefan "schnalle" Schallerl
  5. Adam Crabtree

No. 5 Adam Crabtree, who wants to review the book and share it with members of the DallasJS Meetup Group, wins the nonrandom drawing for the best reason to win a copy. Runners up for this selection were Yoav, who promised to donate the book to a high school library after he's done with it; Nick Carter, who threatened me with his wrath if he doesn't win (I'll have to endure); Paul Irish, who kindly offered to have my last name corrected (to that of a sea monster) in exchange for winning; Alexei, a technical editor of a couple of Nicholas Zakas's previous books who'd like to know how many errors this one contains; and Marcel Korpel, who wants to improve his users' health by reducing the "headaches, general stress and insomnia" they suffer while waiting on his websites. 🙂

The winners have been informed by email about how to collect their prize. Thanks to everyone for playing!

Performance of Greedy vs. Lazy Regex Quantifiers

A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, though with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the imprecision of their patterns, often without realizing this is what they're doing. Hence, the misconception stems from the fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.

Consider the following simple example: when the regexes <.*?> and <[^>]*> are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around reducing backtracking and the number of steps potentially required to match or fail at any given character position, and here we've reduced both to the absolute minimum.
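
For instance, here's the equivalence in JavaScript (a quick sketch you can paste into a console):

	var subject = '<0123456789>';
	/<.*?>/.exec(subject)[0];   // "<0123456789>"
	/<[^>]*>/.exec(subject)[0]; // "<0123456789>" (same match, far fewer steps)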

So what exactly is different about how a backtracking, leftmost-first regex engine (your traditional NFA engine type, which describes most modern, Perl-derivative flavors) goes about matching our subject text when working with these two regexes? Let's first take a look at what happens with <.*?>. I'll use the RegexBuddy debugger style to illustrate each step.

  1. Beginning match attempt at character 0
  2. <
  3. < (ok)
  4. < (backtrack)
  5. <0
  6. <0 (backtrack)
  7. <01
  8. <01 (backtrack)
  9. <012
  10. <012 (backtrack)
  11. <0123
  12. <0123 (backtrack)
  13. <01234
  14. <01234 (backtrack)
  15. <012345
  16. <012345 (backtrack)
  17. <0123456
  18. <0123456 (backtrack)
  19. <01234567
  20. <01234567 (backtrack)
  21. <012345678
  22. <012345678 (backtrack)
  23. <0123456789
  24. <0123456789>
  25. Match found

The "(ok)" marker means the engine completed a successful, zero-length match, and "(backtrack)" marks where the engine backtracks to the last choice point, which it does after trying and failing to match the next token (>). As you can see, with lazy quantification the engine backtracks forward after each character which is not followed by ">".

Here's what happens with <[^>]*>:

  1. Beginning match attempt at character 0
  2. <
  3. <0123456789
  4. <0123456789>
  5. Match found

As you can see, 20 fewer steps are required, and there is no backtracking. This is faster, and can also increase the potential for internal optimization of the matching process. As a simple example, when greedily quantifying an any-character token, the engine can simply move the read-head to the end of the string, rather than testing every character along the way (this is an oversimplification of the process taken by modern regex engines, but you get the idea). You can see this exact kind of internal optimization occurring in IE with the "trim8" pattern in my Faster JavaScript Trim post from a while back.

You might be wondering what happens if we use the regex <.*>. Take a look:

  1. Beginning match attempt at character 0
  2. <
  3. <0123456789>
  4. <0123456789> (backtrack)
  5. <0123456789
  6. <0123456789>
  7. Match found

Although the output didn't change much when used with our subject string of "<0123456789>", you can see that the .* caused the engine to trek all the way to the end of the string, then backtrack from the right. If we'd used a 10,000-character string without newlines or other ">" characters, that would've resulted in nearly 10,000 backtracks.
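
If you'd like to compare these three patterns yourself, here's a minimal benchmark sketch (the subject string, iteration count, and use of console.log are my own arbitrary choices; absolute timings will vary widely by browser and engine):

	var filler = new Array(10001).join('x'), // 10,000 non-">" characters
		subject = '<' + filler + '>' + filler,
		patterns = [/<.*?>/, /<[^>]*>/, /<.*>/],
		i, j, start;
	for (i = 0; i < patterns.length; i++) {
		start = new Date().getTime();
		for (j = 0; j < 1000; j++) {
			patterns[i].test(subject);
		}
		console.log(patterns[i].source + ': ' +
			(new Date().getTime() - start) + 'ms');
	}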

In short:

  • Make your patterns more explicit where it makes sense, to better guide the regex engine and avoid backtracking.
  • Use the more appropriate type of quantification (to your best approximation, since the better choice often depends on the subject string) even when it has no impact on the text matched by your regex.
  • If possible, avoid greedy quantification of the dot and other needlessly flexible tokens.

Regex Performance Optimization

Crafting efficient regular expressions is somewhat of an art. In large part, it centers around controlling/minimizing backtracking and the number of steps it takes the regex engine to match or fail, but the fact that most engines implement different sets of internal optimizations (which can make certain operations faster, avoid work through simplified pre-checks, or skip unnecessary operations altogether) also makes the topic significantly dependent on the particular regex flavor and implementation you're using. Most developers aren't deeply aware of regex performance issues, so when they run into problems with regular expressions running slowly, their first thought is to remove the regexes. In reality, most non-trivial regexes I've seen could be significantly optimized for efficiency, and I've frequently seen dramatic improvements as a result of doing so.
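
To give one hedged example of the kind of hand-optimization I mean (a made-up pattern, and note that some engines perform this transformation automatically): alternatives that share a common prefix can be rewritten so the engine doesn't rescan the prefix for each alternative:

	// Each failed alternative can force the engine to retry "gr" from scratch
	var before = /great|grand|green/;
	// Factoring out the common prefix reduces that rework
	var after = /gr(?:eat|and|een)/;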

The best discussion of regex optimization I've seen is chapter six (a good 60 pages) of Mastering Regular Expressions, Third Edition by Jeffrey Friedl. Unfortunately, other good material on the subject is fairly scarce, so I was pleasantly surprised to stumble upon the recent article Optimizing regular expressions in Java by Cristian Mocanu. Of course, it is in part Java specific, but for the most part it is a good article on the basics of optimization for traditional NFA regex engines. Check it out.

Have you seen any good articles or discussions about regular expression performance or efficiency optimization recently? Do you have any questions about the subject? Experience or pointers to share? Let me know. (I hope to eventually write up an in-depth article on JavaScript regex optimization, with lots of tips, techniques, and cross-browser benchmarks.)

Faster JavaScript Trim

Since JavaScript doesn't include a trim method natively, it's included by countless JavaScript libraries – usually as a global function or appended to String.prototype. However, I've never seen an implementation which performs as well as it could, probably because most programmers don't deeply understand or care about regex efficiency issues.
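
For reference, here's the typical shape of such an implementation (a minimal sketch using the trim1 pattern analyzed below, not taken from any particular library):

	if (!String.prototype.trim) {
		String.prototype.trim = function () {
			return this.replace(/^\s\s*/, '').replace(/\s\s*$/, '');
		};
	}
	// "   hello   ".trim() returns "hello"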

After seeing a particularly bad trim implementation, I decided to do a little research towards finding the most efficient approach. Before getting into the analysis, here are the results:

Method    Firefox 2    IE 6
trim1     15ms         < 0.5ms
trim2     31ms         < 0.5ms
trim3     46ms         31ms
trim4     47ms         46ms
trim5     156ms        1656ms
trim6     172ms        2406ms
trim7     172ms        1640ms
trim8     281ms        < 0.5ms
trim9     125ms        78ms
trim10    < 0.5ms      < 0.5ms
trim11    < 0.5ms      < 0.5ms

Note 1: The comparison is based on trimming the Magna Carta (over 27,600 characters) with a bit of leading and trailing whitespace 20 times on my personal system. However, the data you're trimming can have a major impact on performance, which is detailed below.

Note 2: trim4 and trim6 are the most commonly found in JavaScript libraries today.

Note 3: The aforementioned bad implementation is not included in the comparison, but is shown later.

The analysis

Although there are 11 rows in the table above, they are only the most notable (for various reasons) of about 20 versions I wrote and benchmarked against various types of strings. The following analysis is based on testing in Firefox 2.0.0.4, although I have noted where there are major differences in IE6.

  1. return str.replace(/^\s\s*/, '').replace(/\s\s*$/, '');
    All things considered, this is probably the best all-around approach. Its speed advantage is most notable with long strings — when efficiency matters. The speed is largely due to a number of optimizations internal to JavaScript regex interpreters which the two discrete regexes here trigger: specifically, the pre-check-of-required-character and start-of-string-anchor optimizations, possibly among others.
  2. return str.replace(/^\s+/, '').replace(/\s+$/, '');
    Very similar to trim1 (above), but a little slower since it doesn't trigger all of the same optimizations.
  3. return str.substring(Math.max(str.search(/\S/), 0), str.search(/\S\s*$/) + 1);
    This is often faster than the following methods, but slower than the above two. Its speed comes from its use of simple, character-index lookups.
  4. return str.replace(/^\s+|\s+$/g, '');
    This is the approach people most commonly come up with, and it's easily the most frequently used in JavaScript libraries today. It is generally the fastest implementation of the bunch only when working with short strings which don't include leading or trailing whitespace. This minor advantage is due in part to the initial-character discrimination optimization it triggers. While this is a relatively decent performer, it's slower than the three methods above when working with longer strings, because the top-level alternation prevents a number of optimizations which could otherwise kick in.
  5. str = str.match(/\S+(?:\s+\S+)*/);
    return str ? str[0] : '';

    This is generally the fastest method when working with empty or whitespace-only strings, due to the pre-check of required character optimization it triggers. Note: In IE6, this can be quite slow when working with longer strings.
  6. return str.replace(/^\s*(\S*(\s+\S+)*)\s*$/, '$1');
    This is a relatively common approach, popularized in part by some leading JavaScripters. It's similar in approach (but inferior) to trim8. There's no good reason to use this in JavaScript, especially since it can be very slow in IE6.
  7. return str.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, '$1');
    The same as trim6, but a bit faster due to the use of a non-capturing group (which doesn't work in IE 5.0 and lower). Again, this can be slow in IE6.
  8. return str.replace(/^\s*((?:[\S\s]*\S)?)\s*$/, '$1');
    This uses a simple, single-pass, greedy approach. In IE6, this is crazy fast! The performance difference indicates that IE has superior optimization for quantification of "any character" tokens.
  9. return str.replace(/^\s*([\S\s]*?)\s*$/, '$1');
    This is generally the fastest with very short strings which contain both non-space characters and edge whitespace. This minor advantage is due to the simple, single-pass, lazy approach it uses. Like trim8, this is significantly faster in IE6 than Firefox 2.

Since I've seen the following additional implementation in one library, I'll include it here as a warning:

return str.replace(/^\s*([\S\s]*)\b\s*$/, '$1');

Although the above is sometimes the fastest method when working with short strings which contain both non-space characters and edge whitespace, it performs very poorly with long strings which contain numerous word boundaries, and it's terrible (!) with long strings composed of nothing but whitespace, since that triggers an amount of backtracking which grows rapidly with string length. Do not use.

A different endgame

There are two methods in the table at the top of this post which haven't been covered yet. For those, I've used a non-regex and hybrid approach.

After comparing and analyzing all of the above, I wondered how an implementation which used no regular expressions would perform. Here's what I tried:

function trim10 (str) {
	var whitespace = ' \n\r\t\f\x0b\xa0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000';
	// Scan forward past leading whitespace, then chop it off
	for (var i = 0; i < str.length; i++) {
		if (whitespace.indexOf(str.charAt(i)) === -1) {
			str = str.substring(i);
			break;
		}
	}
	// Scan backward past trailing whitespace, then chop it off
	for (i = str.length - 1; i >= 0; i--) {
		if (whitespace.indexOf(str.charAt(i)) === -1) {
			str = str.substring(0, i + 1);
			break;
		}
	}
	// If no non-whitespace character was ever found, the string is all whitespace
	return whitespace.indexOf(str.charAt(0)) === -1 ? str : '';
}

How does that perform? Well, with long strings which do not contain excessive leading or trailing whitespace, it blows away the competition (except against trim1/2/8 in IE, which are already insanely fast there).

Does that mean regular expressions are slow in Firefox? No, not at all. The issue here is that although regexes are very well suited for trimming leading whitespace, apart from the .NET library (which offers a somewhat-mysterious "backwards matching" mode), they don't really provide a method to jump to the end of a string without even considering previous characters. However, the non-regex-reliant trim10 function does just that, with the second loop working backwards from the end of the string until it finds a non-whitespace character.

Knowing that, what if we created a hybrid implementation which combined a regex's universal efficiency at trimming leading whitespace with the alternative method's speed at removing trailing characters?

function trim11 (str) {
	// A regex handles the leading whitespace efficiently...
	str = str.replace(/^\s+/, '');
	// ...while a backward loop reaches the trailing whitespace without
	// scanning the rest of the string
	for (var i = str.length - 1; i >= 0; i--) {
		if (/\S/.test(str.charAt(i))) {
			str = str.substring(0, i + 1);
			break;
		}
	}
	return str;
}

Although the above is a bit slower than trim10 with some strings, it uses significantly less code and is still lightning fast. Plus, with strings which contain a lot of leading whitespace (which includes strings composed of nothing but whitespace), it's much faster than trim10.

In conclusion…

Since the differences between the implementations, across browsers and data sets, are complex and nuanced (none of them is faster than all the others with any data you can throw at them), here are my general recommendations for a trim method:

  • Use trim1 if you want a general-purpose implementation which is fast cross-browser.
  • Use trim11 if you want to handle long strings exceptionally fast in all browsers.

To test all of the above implementations for yourself, try my very rudimentary benchmarking page. Background processing can cause the results to be severely skewed, so run the test a number of times (regardless of how many iterations you specify) and only consider the fastest results (since averaging the cost of background interference is not very enlightening).

As a final note, although some people like to cache regular expressions (e.g. using global variables) so they can be used repeatedly without recompilation, IMO this does not make much sense for a trim method. All of the above regexes are so simple that they compile almost instantly. Additionally, some browsers automatically cache the most recently used regexes, so a typical loop which uses trim and doesn't contain a bunch of other regexes might not encounter recompilation anyway.
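
For completeness, here's the caching pattern that paragraph refers to (a hypothetical trimCached function; as noted above, it buys little for regexes this simple):

	var TRIM_LEFT = /^\s\s*/,  // compiled once and reused across calls
		TRIM_RIGHT = /\s\s*$/;
	function trimCached(str) {
		return str.replace(TRIM_LEFT, '').replace(TRIM_RIGHT, '');
	}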


Edit (2008-02-04): Shortly after posting this I realized trim10/11 could be better written. Several people have also posted improved versions in the comments. Here's what I use now, which takes the trim11-style hybrid approach:

function trim12 (str) {
	// Trim leading whitespace with a regex, then walk backward from the
	// end until the last non-whitespace character is found
	str = str.replace(/^\s\s*/, '');
	var ws = /\s/,
		i = str.length;
	while (ws.test(str.charAt(--i)));
	return str.slice(0, i + 1);
}

New library: Are you a JavaScript regex master, or want to be? Then you need my fancy XRegExp library. It adds new regex syntax (including named capture and Unicode properties); s, x, and n flags; powerful regex utils; and it fixes pesky browser inconsistencies. Check it out!

Capturing vs. Non-Capturing Groups

I near-religiously use non-capturing groups whenever I do not need to reference a group's contents. Recently, several people have asked me why this is, so here are the reasons:

  • Capturing groups negatively impact performance. The performance hit may be tiny, especially when working with small strings, but it's there.
  • When you need to use several groupings in a single regex, only some of which you plan to reference later, it's convenient to have the backreferences you want to use numbered sequentially (see the short example after this list). E.g., the logic in my parseUri UDF could not be nearly as simple if I had not made appropriate use of capturing and non-capturing groups within the same regex.
  • They might be slightly harder to read, but ultimately, non-capturing groups are less confusing and easier to maintain, especially for others working with your code. If I modify a regex that contains capturing groups, I have to worry about whether they're referenced anywhere outside of the regex itself, and what exactly they're expected to contain.
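
Here's a quick sketch of the sequential-numbering point (a made-up date pattern, purely for illustration):

	// The non-capturing month group doesn't consume a backreference number,
	// so the two groups we care about stay numbered 1 and 2
	var parts = /(\d{4})-(?:\d{2})-(\d{2})/.exec('2010-02-25');
	parts[1]; // "2010"
	parts[2]; // "25"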

Of course, some capturing groups are necessary. There are three scenarios which meet this description:

  • You're using parts of a match to construct a replacement string, or otherwise referencing parts of the match in code outside the regex.
  • You need to reuse parts of the match within the regex itself. E.g., (["'])(?:\\\1|.)*?\1 would match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match, and allowing inner, escaped quotes of the same type as the enclosure (a runnable JavaScript version appears after this list). (Update: For details about this pattern, see Regexes in Depth: Advanced Quoted String Matching.)
  • You need to test if a group has participated in the match so far, as the condition to evaluate within a conditional. E.g., (a)?b(?(1)c|d) only matches the values "bd" and "abc".
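
The second scenario is easy to try in JavaScript (the third isn't: conditionals like (?(1)c|d) aren't supported in JavaScript regexes):

	var quoted = /(["'])(?:\\\1|.)*?\1/;
	quoted.exec('say "hi \\"you\\"" now')[0]; // returns '"hi \"you\""'
	// The backreference \1 forces the closing quote to match the opening
	// one, while \\\1 lets escaped quotes of the same type appear inside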

If a grouping doesn't meet one of the above conditions, there's no need to capture.