JavaScript — Flagrant Badassery

Five Free Copies of Upcoming O’Reilly Book ‘High Performance JavaScript’

Update (2010-02-25): This contest is now closed.

Last year, Yahoo! engineer and all-around JavaScript badass Nicholas Zakas asked if I was interested in writing a chapter for a new book on JavaScript performance that he was working on. I agreed, and that book, High Performance JavaScript, is now available for preorder at Amazon and other fine book retailers.

In addition to the wide-ranging content by Nicholas and a chapter on string and regular expression performance by yours truly, chapters were also contributed by an awesome lineup of JavaScript performance gurus: Ross Harmes, Julien Lecomte, Stoyan Stefanov, and Matt Sweeney. This book is unique in its laser-focus on optimizing the performance of your JavaScript applications, and covers many advanced topics in the process. The chapter on strings and regular expressions provides what I think is easily the most in-depth coverage of cross-browser JavaScript regex performance currently available.

Here's the list of chapters:

Loading and Execution
Data Access
DOM Scripting (Stoyan Stefanov)
Algorithms and Flow Control
Strings and Regular Expressions (Steven Levithan)
Responsive Interfaces
Ajax (Ross Harmes)
Programming Practices
Build and Deployment (Julien Lecomte)
Tools (Matt Sweeney)

To celebrate the completion of this book, ~~I'm giving away three copies.~~ O'Reilly Media increased the offer to five books! All you need to do is comment on this post by February 24th, and I'll pick five people to send a copy to as soon as it's released (Amazon says March 15th). If you prefer, I'd be happy to send you a copy of Regular Expressions Cookbook instead (please note which book you want in your comment). Four winners will be chosen at random from the pool of unique commenters (I'll be tracking IPs), and the fifth based on the reason given for why you want a copy.

Make sure to include your email address in the comment form, since I'll need it to contact you if you're selected (your email address won't be used for any other purpose). Good luck, and congratulations to Nicholas Zakas and all the other authors on completing a fantastic new book!

Edit (2010-02-05): My blog has been offline more often than not for the first two days after posting this, and many people have reported that they were unable to post a comment. I apologize for the screw-up—my blog is now on a different server, and the problems should be resolved. Please try again!

Edit (2010-02-08): O'Reilly Media kindly offered to pick up the tab for this giveaway, and increased the winnings to five books!

Edit (2010-02-09): Nicholas Zakas posted more information about High Performance JavaScript on his blog: Announcing High Performance JavaScript.

Edit (2010-02-25): This contest is now closed. Winners will be announced here shortly.

Edit (2010-03-03): Following are the winners of this giveaway (the first four were chosen randomly):

No. 5 Adam Crabtree, who wants to review the book and share it with members of the DallasJS Meetup Group, wins the nonrandom drawing for the best reason to win a copy. Runners up for this selection were Yoav, who promised to donate the book to a high school library after he's done with it; Nick Carter, who threatened me with his wrath if he doesn't win (I'll have to endure); Paul Irish, who kindly offered to have my last name corrected (to that of a sea monster) in exchange for winning; Alexei, a technical editor of a couple of Nicholas Zakas's previous books who'd like to know how many errors this one contains; and Marcel Korpel, who wants to improve his users' health by reducing the "headaches, general stress and insomnia" they suffer while waiting on his websites. 🙂

The winners have been informed by email about how to collect their prize. Thanks to everyone for playing!

XRegExp 1.0

After stalling for nearly a year, I've finally released XRegExp 1.0, the next generation of my JavaScript regular expression library. Although it doesn't add support for lookbehind (as I've previously suggested) due to what would amount to significant inherent limitations, it fixes a couple of bugs, corrects even more cross-browser regex inconsistencies, and adds a suite of new regular expression functions and methods that make writing regex-intensive JavaScript applications easier than ever. One of these new functions, XRegExp.addToken, fundamentally changes XRegExp's implementation and allows you to easily create your own XRegExp plugins.

Here's XRegExp's abbreviated feature list from the brand new xregexp.com (which includes extensive documentation and code examples):

Adds new regex and replacement text syntax, including comprehensive support for named capture.
Adds two new regex flags: s, to make dot match all characters (aka singleline mode), and x, for free-spacing and comments (aka extended mode).
Provides a suite of 12 functions and methods that make complex regex processing a breeze.
Automagically fixes the most commonly encountered cross-browser inconsistencies in regex behavior and syntax.
Lets you easily create and use plugins that add new syntax and flags to XRegExp's regular expression language.

The full list of changes can be seen in the changelog. Please let me know if you find any bugs or have any suggestions for the library. I'd also love to hear about projects or sites that are using XRegExp (I've got a few listed on the XRegExp homepage now).

Unicode Plugin for XRegExp

Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.

I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the very latest version.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:

\p{L} — Letter
\p{M} — Mark
\p{N} — Number
\p{P} — Punctuation
\p{S} — Symbol
\p{Z} — Separator
\p{C} — Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 136 blocks that the code points U+0000 through U+FFFF are divided into. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:

\p{InAlphabeticPresentationForms}
\p{InArabic}
\p{InArabicPresentationFormsA}
\p{InArabicPresentationFormsB}
\p{InArabicSupplement}
\p{InArmenian}
\p{InArrows}
\p{InBalinese}
\p{InBasicLatin}
\p{InBengali}
\p{InBlockElements}
\p{InBopomofo}
\p{InBopomofoExtended}
\p{InBoxDrawing}
\p{InBraillePatterns}
\p{InBuginese}
\p{InBuhid}
\p{InCham}
\p{InCherokee}
\p{InCJKCompatibility}
\p{InCJKCompatibilityForms}
\p{InCJKCompatibilityIdeographs}
\p{InCJKRadicalsSupplement}
\p{InCJKStrokes}
\p{InCJKSymbolsandPunctuation}
\p{InCJKUnifiedIdeographs}
\p{InCJKUnifiedIdeographsExtensionA}
\p{InCombiningDiacriticalMarks}
\p{InCombiningDiacriticalMarksforSymbols}
\p{InCombiningDiacriticalMarksSupplement}
\p{InCombiningHalfMarks}
\p{InControlPictures}
\p{InCoptic}
\p{InCurrencySymbols}
\p{InCyrillic}
\p{InCyrillicExtendedA}
\p{InCyrillicExtendedB}
\p{InCyrillicSupplement}
\p{InDevanagari}
\p{InDingbats}
\p{InEnclosedAlphanumerics}
\p{InEnclosedCJKLettersandMonths}
\p{InEthiopic}
\p{InEthiopicExtended}
\p{InEthiopicSupplement}
\p{InGeneralPunctuation}
\p{InGeometricShapes}
\p{InGeorgian}
\p{InGeorgianSupplement}
\p{InGlagolitic}
\p{InGreekandCoptic}
\p{InGreekExtended}
\p{InGujarati}
\p{InGurmukhi}
\p{InHalfwidthandFullwidthForms}
\p{InHangulCompatibilityJamo}
\p{InHangulJamo}
\p{InHangulSyllables}
\p{InHanunoo}
\p{InHebrew}
\p{InHighPrivateUseSurrogates}
\p{InHighSurrogates}
\p{InHiragana}
\p{InIdeographicDescriptionCharacters}
\p{InIPAExtensions}
\p{InKanbun}
\p{InKangxiRadicals}
\p{InKannada}
\p{InKatakana}
\p{InKatakanaPhoneticExtensions}
\p{InKayahLi}
\p{InKhmer}
\p{InKhmerSymbols}
\p{InLao}
\p{InLatin1Supplement}
\p{InLatinExtendedA}
\p{InLatinExtendedAdditional}
\p{InLatinExtendedB}
\p{InLatinExtendedC}
\p{InLatinExtendedD}
\p{InLepcha}
\p{InLetterlikeSymbols}
\p{InLimbu}
\p{InLowSurrogates}
\p{InMalayalam}
\p{InMathematicalOperators}
\p{InMiscellaneousMathematicalSymbolsA}
\p{InMiscellaneousMathematicalSymbolsB}
\p{InMiscellaneousSymbols}
\p{InMiscellaneousSymbolsandArrows}
\p{InMiscellaneousTechnical}
\p{InModifierToneLetters}
\p{InMongolian}
\p{InMyanmar}
\p{InNewTaiLue}
\p{InNKo}
\p{InNumberForms}
\p{InOgham}
\p{InOlChiki}
\p{InOpticalCharacterRecognition}
\p{InOriya}
\p{InPhagspa}
\p{InPhoneticExtensions}
\p{InPhoneticExtensionsSupplement}
\p{InPrivateUseArea}
\p{InRejang}
\p{InRunic}
\p{InSaurashtra}
\p{InSinhala}
\p{InSmallFormVariants}
\p{InSpacingModifierLetters}
\p{InSpecials}
\p{InSundanese}
\p{InSuperscriptsandSubscripts}
\p{InSupplementalArrowsA}
\p{InSupplementalArrowsB}
\p{InSupplementalMathematicalOperators}
\p{InSupplementalPunctuation}
\p{InSylotiNagri}
\p{InSyriac}
\p{InTagalog}
\p{InTagbanwa}
\p{InTaiLe}
\p{InTamil}
\p{InTelugu}
\p{InThaana}
\p{InThai}
\p{InTibetan}
\p{InTifinagh}
\p{InUnifiedCanadianAboriginalSyllabics}
\p{InVai}
\p{InVariationSelectors}
\p{InVerticalForms}
\p{InYijingHexagramSymbols}
\p{InYiRadicals}
\p{InYiSyllables}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.
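As a rough illustration of that loose matching rule, here is a minimal sketch of how a block name could be normalized before lookup. The normalizeBlockName helper is hypothetical and is not taken from the plugin's source; it only shows why the three spellings above compare as equal.

```javascript
// Hypothetical helper for illustration only (not the plugin's actual code).
function normalizeBlockName(name) {
	// Drop spaces, hyphens, and underscores, then fold case
	return name.replace(/[ _\-]+/g, '').toLowerCase();
}

var a = normalizeBlockName('InLatinExtendedA');
var b = normalizeBlockName('InLatin Extended-A');
var c = normalizeBlockName('in latin extended a');
// a, b, and c are now the same string: "inlatinextendeda"
```

Any lookup table keyed on the normalized form would then treat all three spellings as the same block.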

All properties and blocks can be inverted by using an uppercase p. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.

IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.

Instead Of:	Use:
[\p{N}]	\p{N}
[\p{N}a-z~]	(?:\p{N}|[a-z~])
[\p{N}\P{Z}]	(?:\p{N}|\P{Z})
[\p{N}\P{Z}a-z~]	(?:\p{N}|\P{Z}|[a-z~])
[^\p{N}]	\P{N}
[^\p{N}a-z~]	(?:(?!\p{N})[^a-z~])
[^\p{N}\P{Z}]	(?:(?!\p{N}|\P{Z})[\S\s])
[^\p{N}\P{Z}a-z~]	(?:(?!\p{N}|\P{Z})[^a-z~])
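If you want to convince yourself that these rewrites are equivalent, the following sketch compares a few rows of the table. It assumes a modern engine with native Unicode property escapes and the u flag (ES2018 or later), which does allow properties inside character classes. That native support stands in for the plugin here purely as a way of checking the rewrite rules, and the sample characters are arbitrary.

```javascript
// Assumption: native \p{...} support with the u flag (ES2018+).
// Each pair is [ class containing a property, emulation from the table ].
var pairs = [
	[ /^[\p{N}a-z~]$/u,   /^(?:\p{N}|[a-z~])$/u          ],
	[ /^[^\p{N}a-z~]$/u,  /^(?:(?!\p{N})[^a-z~])$/u      ],
	[ /^[^\p{N}\P{Z}]$/u, /^(?:(?!\p{N}|\P{Z})[\S\s])$/u ]
];

// A digit, an Arabic-Indic digit, letters, a space, and punctuation
var samples = ['5', '٣', 'a', '~', 'A', ' ', 'Я', '-'];

// True only if both forms agree on every sample character
var allAgree = pairs.every(function (pair) {
	return samples.every(function (ch) {
		return pair[0].test(ch) === pair[1].test(ch);
	});
});
```

This only spot-checks a handful of characters, so treat it as a sanity check rather than a proof.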

Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)

Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.

The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.

To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("Русский")); // true
</script>

Download the Unicode plugin.

Code Challenge: Change Dispenser

I recently encountered a brain teaser that asked for a function that takes an amount of change and returns the equivalent in dollars and coins.

Here's the five-minute solution I first came up with.

function makeChange (money) {
    var i, num,
        output = [],
        coins  = [
            [100, "dollar",  "dollars" ],
            [25,  "quarter", "quarters"],
            [10,  "dime",    "dimes"   ],
            [5,   "nickel",  "nickels" ],
            [1,   "penny",   "pennies" ]
        ];
    money = Math.round(money * 100); // work in whole cents; rounding avoids float precision issues
    for (i = 0; i < coins.length; i++) {
        num = Math.floor(money / coins[i][0]);
        money -= num * coins[i][0];
        if (num) {
            output.push(num + " " + coins[i][num > 1 ? 2 : 1]);
        }
    }
    return output.join(", ");
}

makeChange(0.37); // "1 quarter, 1 dime, 2 pennies"

I feel like I'm missing something, though. How would you improve this code to make it shorter, faster, or otherwise better?
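One detail worth checking in any solution to this puzzle is the conversion to cents, since multiplying a decimal amount by 100 is not always exact in floating point. The sketch below isolates that step. The toCents helper is hypothetical, added here for illustration rather than taken from the solution above.

```javascript
// Hypothetical helper for illustration only.
function toCents(money) {
	// Math.round snaps a product like 28.999999999999996 back to 29
	return Math.round(money * 100);
}

var naive = Math.floor(0.29 * 100); // 28, one cent short
var safe  = toCents(0.29);          // 29
```

Flooring the unrounded product can silently drop a penny, so rounding once up front and working in whole cents from then on keeps the later Math.floor divisions exact.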

Multiple String Replacement Sugar

How many times have you needed to run multiple replacement operations on the same string? It's not too bad, but can get a bit tedious if you write code like this a lot.

str = str.
	replace( /&(?!#?\w+;)/g , '&amp;'    ).
	replace( /"([^"]*)"/g   , '“$1”'     ).
	replace( /</g           , '&lt;'     ).
	replace( />/g           , '&gt;'     ).
	replace( /…/g           , '&hellip;' ).
	replace( /“/g           , '&ldquo;'  ).
	replace( /”/g           , '&rdquo;'  ).
	replace( /‘/g           , '&lsquo;'  ).
	replace( /’/g           , '&rsquo;'  ).
	replace( /—/g           , '&mdash;'  ).
	replace( /–/g           , '&ndash;'  );

A common trick to shorten such code is to look up replacement values using an object as a hash table. Here's a simple implementation of this.

var hash = {
	'<' : '&lt;'    ,
	'>' : '&gt;'    ,
	'…' : '&hellip;',
	'“' : '&ldquo;' ,
	'”' : '&rdquo;' ,
	'‘' : '&lsquo;' ,
	'’' : '&rsquo;' ,
	'—' : '&mdash;' ,
	'–' : '&ndash;'
};

str = str.
	replace( /&(?!#?\w+;)/g , '&amp;' ).
	replace( /"([^"]*)"/g   , '“$1”'  ).
	replace( /[<>…“”‘’—–]/g , function ( $0 ) {
		return hash[ $0 ];
	});

However, this approach has some limitations.

Search patterns are repeated in the hash table and the regular expression character class.
Both the search and replacement are limited to plain text. That's why the first and second replacements had to remain separate in the above code. The first replacement used a regex search pattern, and the second used a backreference in the replacement text.
Replacements don't cascade. This is another reason why the second replacement operation had to remain separate. I want text like "this" to first be replaced with “this”, and eventually end up as &ldquo;this&rdquo;.
It doesn't work in Safari 2.x and other old browsers that don't support using functions to generate replacement text.

With a few lines of String.prototype sugar, you can deal with all of these issues.

String.prototype.multiReplace = function ( hash ) {
	var str = this, key;
	for ( key in hash ) {
		str = str.replace( new RegExp( key, 'g' ), hash[ key ] );
	}
	return str;
};

Now you can use code like this:

str = str.multiReplace({
	'&(?!#?\\w+;)' : '&amp;'   ,
	'"([^"]*)"'    : '“$1”'    ,
	'<'            : '&lt;'    ,
	'>'            : '&gt;'    ,
	'…'            : '&hellip;',
	'“'            : '&ldquo;' ,
	'”'            : '&rdquo;' ,
	'‘'            : '&lsquo;' ,
	'’'            : '&rsquo;' ,
	'—'            : '&mdash;' ,
	'–'            : '&ndash;'
});

If you care about the order of replacements, you should be aware that the current JavaScript specification does not require a particular enumeration order when looping over object properties with for..in. However, recent versions of the big four browsers (IE, Firefox, Safari, Opera) all use insertion order, which allows this to work as described (from top to bottom). ECMAScript 4 proposals indicate that the insertion-order convention will be formally codified in that standard.
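If you'd rather not depend on enumeration order at all, one alternative is to pass the replacements as an array of pairs, since array order is always well defined. This is a minimal sketch under that assumption. The multiReplaceOrdered name and the pair format are made up for illustration and are not part of the method above.

```javascript
// Hypothetical alternative: replacements are applied in array order,
// so the sequence never depends on for..in enumeration.
function multiReplaceOrdered(str, pairs) {
	var i;
	for (i = 0; i < pairs.length; i++) {
		// pairs[i][0] is a regex source string, pairs[i][1] the replacement text
		str = str.replace(new RegExp(pairs[i][0], 'g'), pairs[i][1]);
	}
	return str;
}

var result = multiReplaceOrdered('He said "hi" & left', [
	['&(?!#?\\w+;)', '&amp;'  ],
	['"([^"]*)"',    '“$1”'   ],
	['“',            '&ldquo;'],
	['”',            '&rdquo;']
]);
// result: 'He said &ldquo;hi&rdquo; &amp; left'
```

The trade-off is a slightly noisier call site in exchange for ordering that holds in any engine.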

If you need to worry about rogue properties that show up when people mess with Object.prototype, you can update the code as follows:

String.prototype.multiReplace = function ( hash ) {
	var str = this, key;
	for ( key in hash ) {
		if ( Object.prototype.hasOwnProperty.call( hash, key ) ) {
			str = str.replace( new RegExp( key, 'g' ), hash[ key ] );
		}
	}
	return str;
};

Calling the hasOwnProperty method on Object.prototype rather than on the hash object directly allows this method to work even when you're searching for the string "hasOwnProperty".
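To see why that matters, here is a small sketch of what goes wrong when a hash happens to contain a "hasOwnProperty" key. The variable names are made up for the example.

```javascript
// A hash whose own key shadows the inherited hasOwnProperty method
var hash = { 'hasOwnProperty': 'x' };

// hash.hasOwnProperty is now the string 'x', so calling it throws a TypeError
var threw = false;
try {
	hash.hasOwnProperty('hasOwnProperty');
} catch (e) {
	threw = true;
}

// Borrowing the method from Object.prototype sidesteps the shadowed property
var safe = Object.prototype.hasOwnProperty.call(hash, 'hasOwnProperty');
// threw is true; safe is true
```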

Lemme know if you think this is useful.