Still Alive

Well, I'm back. I didn't mean to go silent for so long, but I've been busy. Although it will be a few months before it comes out, Jan Goyvaerts and I have mostly finished work on our new regex book — stay tuned for more info. During this blogging hiatus I've also attended multiple family reunions, switched jobs, learned a new language (ActionScript 3), put in crazy hours on a new website launch, and about five weeks ago I moved to sunny Baghdad, Iraq for webdev work.

Anyway, now that work is calming down just enough for some breathing room, I should be able to get back to this blogging thing a little more regularly.

Teaser: Relatively soon I hope to release a new version of XRegExp, which will provide a way to easily extend XRegExp with your own, new regex features.

Unicode Plugin for XRegExp

Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.

I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the latest version as of this writing.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:

  • \p{L} — Letter
  • \p{M} — Mark
  • \p{N} — Number
  • \p{P} — Punctuation
  • \p{S} — Symbol
  • \p{Z} — Separator
  • \p{C} — Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 136 blocks that the code points U+0000 through U+FFFF are divided into. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:

  • \p{InAlphabeticPresentationForms}
  • \p{InArabic}
  • \p{InArabicPresentationFormsA}
  • \p{InArabicPresentationFormsB}
  • \p{InArabicSupplement}
  • \p{InArmenian}
  • \p{InArrows}
  • \p{InBalinese}
  • \p{InBasicLatin}
  • \p{InBengali}
  • \p{InBlockElements}
  • \p{InBopomofo}
  • \p{InBopomofoExtended}
  • \p{InBoxDrawing}
  • \p{InBraillePatterns}
  • \p{InBuginese}
  • \p{InBuhid}
  • \p{InCham}
  • \p{InCherokee}
  • \p{InCJKCompatibility}
  • \p{InCJKCompatibilityForms}
  • \p{InCJKCompatibilityIdeographs}
  • \p{InCJKRadicalsSupplement}
  • \p{InCJKStrokes}
  • \p{InCJKSymbolsandPunctuation}
  • \p{InCJKUnifiedIdeographs}
  • \p{InCJKUnifiedIdeographsExtensionA}
  • \p{InCombiningDiacriticalMarks}
  • \p{InCombiningDiacriticalMarksforSymbols}
  • \p{InCombiningDiacriticalMarksSupplement}
  • \p{InCombiningHalfMarks}
  • \p{InControlPictures}
  • \p{InCoptic}
  • \p{InCurrencySymbols}
  • \p{InCyrillic}
  • \p{InCyrillicExtendedA}
  • \p{InCyrillicExtendedB}
  • \p{InCyrillicSupplement}
  • \p{InDevanagari}
  • \p{InDingbats}
  • \p{InEnclosedAlphanumerics}
  • \p{InEnclosedCJKLettersandMonths}
  • \p{InEthiopic}
  • \p{InEthiopicExtended}
  • \p{InEthiopicSupplement}
  • \p{InGeneralPunctuation}
  • \p{InGeometricShapes}
  • \p{InGeorgian}
  • \p{InGeorgianSupplement}
  • \p{InGlagolitic}
  • \p{InGreekandCoptic}
  • \p{InGreekExtended}
  • \p{InGujarati}
  • \p{InGurmukhi}
  • \p{InHalfwidthandFullwidthForms}
  • \p{InHangulCompatibilityJamo}
  • \p{InHangulJamo}
  • \p{InHangulSyllables}
  • \p{InHanunoo}
  • \p{InHebrew}
  • \p{InHighPrivateUseSurrogates}
  • \p{InHighSurrogates}
  • \p{InHiragana}
  • \p{InIdeographicDescriptionCharacters}
  • \p{InIPAExtensions}
  • \p{InKanbun}
  • \p{InKangxiRadicals}
  • \p{InKannada}
  • \p{InKatakana}
  • \p{InKatakanaPhoneticExtensions}
  • \p{InKayahLi}
  • \p{InKhmer}
  • \p{InKhmerSymbols}
  • \p{InLao}
  • \p{InLatin1Supplement}
  • \p{InLatinExtendedA}
  • \p{InLatinExtendedAdditional}
  • \p{InLatinExtendedB}
  • \p{InLatinExtendedC}
  • \p{InLatinExtendedD}
  • \p{InLepcha}
  • \p{InLetterlikeSymbols}
  • \p{InLimbu}
  • \p{InLowSurrogates}
  • \p{InMalayalam}
  • \p{InMathematicalOperators}
  • \p{InMiscellaneousMathematicalSymbolsA}
  • \p{InMiscellaneousMathematicalSymbolsB}
  • \p{InMiscellaneousSymbols}
  • \p{InMiscellaneousSymbolsandArrows}
  • \p{InMiscellaneousTechnical}
  • \p{InModifierToneLetters}
  • \p{InMongolian}
  • \p{InMyanmar}
  • \p{InNewTaiLue}
  • \p{InNKo}
  • \p{InNumberForms}
  • \p{InOgham}
  • \p{InOlChiki}
  • \p{InOpticalCharacterRecognition}
  • \p{InOriya}
  • \p{InPhagspa}
  • \p{InPhoneticExtensions}
  • \p{InPhoneticExtensionsSupplement}
  • \p{InPrivateUseArea}
  • \p{InRejang}
  • \p{InRunic}
  • \p{InSaurashtra}
  • \p{InSinhala}
  • \p{InSmallFormVariants}
  • \p{InSpacingModifierLetters}
  • \p{InSpecials}
  • \p{InSundanese}
  • \p{InSuperscriptsandSubscripts}
  • \p{InSupplementalArrowsA}
  • \p{InSupplementalArrowsB}
  • \p{InSupplementalMathematicalOperators}
  • \p{InSupplementalPunctuation}
  • \p{InSylotiNagri}
  • \p{InSyriac}
  • \p{InTagalog}
  • \p{InTagbanwa}
  • \p{InTaiLe}
  • \p{InTamil}
  • \p{InTelugu}
  • \p{InThaana}
  • \p{InThai}
  • \p{InTibetan}
  • \p{InTifinagh}
  • \p{InUnifiedCanadianAboriginalSyllabics}
  • \p{InVai}
  • \p{InVariationSelectors}
  • \p{InVerticalForms}
  • \p{InYijingHexagramSymbols}
  • \p{InYiRadicals}
  • \p{InYiSyllables}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.
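To make the loose name matching concrete, here is a minimal sketch of how a block name might be normalized and mapped to a code point range. This is not the plugin's actual source: `blockRanges`, `normalizeBlockName`, and `blockToRegex` are hypothetical names, and only three of the blocks are included.

```javascript
// Hypothetical illustration; not the plugin's actual source.
// Keys are stored pre-normalized: lowercase, no spaces, hyphens, or underscores.
var blockRanges = {
    inbasiclatin:     "\\u0000-\\u007F",
    inlatinextendeda: "\\u0100-\\u017F",
    inarabic:         "\\u0600-\\u06FF"
};

// Strip whitespace, underscores, and hyphens, then ignore case
function normalizeBlockName (name) {
    return name.replace(/[\s_-]+/g, "").toLowerCase();
}

// Build a native character class for the named block
function blockToRegex (name) {
    var key = normalizeBlockName(name);
    if (!Object.prototype.hasOwnProperty.call(blockRanges, key)) {
        throw new Error("Unknown block: " + name);
    }
    return new RegExp("[" + blockRanges[key] + "]");
}

normalizeBlockName("InLatin Extended-A");       // "inlatinextendeda"
blockToRegex("in latin extended a").test("ā");  // true (ā is U+0101)
```

Storing the keys in normalized form means each lookup only has to normalize the name being searched for.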

All properties and blocks can be inverted by using an uppercase p. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.

IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.

Instead of:              Use:
[\p{N}]                  \p{N}
[\p{N}a-z~]              (?:\p{N}|[a-z~])
[\p{N}\P{Z}]             (?:\p{N}|\P{Z})
[\p{N}\P{Z}a-z~]         (?:\p{N}|\P{Z}|[a-z~])
[^\p{N}]                 \P{N}
[^\p{N}a-z~]             (?:(?!\p{N})[^a-z~])
[^\p{N}\P{Z}]            (?:(?!\p{N}|\P{Z})[\S\s])
[^\p{N}\P{Z}a-z~]        (?:(?!\p{N}|\P{Z})[^a-z~])
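As a sanity check on the negated forms, the sketch below compares a character class with its lookahead emulation. It assumes an engine with native Unicode property escapes (the ES2018 `u` flag), which accepts `\p{N}` inside a character class. That is an assumption about the runtime and has nothing to do with how the plugin works internally. The sample strings are arbitrary.

```javascript
// Assumes an ES2018+ engine with native \p{...} support (the `u` flag).
var direct   = /^[^\p{N}a-z~]$/u;            // what you would like to write
var emulated = /^(?:(?!\p{N})[^a-z~])$/u;    // the lookahead workaround

var samples = ["A", "a", "~", "5", "٣", " ", "Ж"];
var allAgree = true;
for (var i = 0; i < samples.length; i++) {
    if (direct.test(samples[i]) !== emulated.test(samples[i])) {
        allAgree = false;
    }
}
// allAgree is still true: both patterns accept "A", " ", and "Ж",
// and both reject "a", "~", "5", and "٣" (an Arabic-Indic digit).
```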

Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)

Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.

The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.

To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("Русский")); // true
</script>

Download the Unicode plugin.

Code Challenge: Change Dispenser

I recently encountered a brain teaser that asked for a function that takes an amount of money and returns the equivalent in dollars and coins.

Here's the five-minute solution I first came up with.

function makeChange (money) {
    var i, num,
        output = [],
        coins  = [
            [100, "dollar",  "dollars" ],
            [25,  "quarter", "quarters"],
            [10,  "dime",    "dimes"   ],
            [5,   "nickel",  "nickels" ],
            [1,   "penny",   "pennies" ]
        ];
    money = Math.round(money * 100); // whole cents; rounding guards against results like 0.29 * 100 = 28.999999999999996
    for (i = 0; i < coins.length; i++) {
        num = Math.floor(money / coins[i][0]);
        money -= num * coins[i][0];
        if (num) {
            output.push(num + " " + coins[i][num > 1 ? 2 : 1]);
        }
    }
    return output.join(", ");
}

makeChange(0.37); // "1 quarter, 1 dime, 2 pennies"

I feel like I'm missing something, though. How would you improve this code to make it shorter, faster, or otherwise better?
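One detail worth checking in any solution is the conversion from dollars to cents. Multiplying by 100 does not always yield a whole number, so flooring the result can lose a penny. A small sketch, where `toCents` is a hypothetical helper rather than part of the solution above:

```javascript
// Hypothetical helper: convert a dollar amount to a whole number of cents
function toCents (money) {
    return Math.round(money * 100);
}

0.29 * 100;              // 28.999999999999996 with IEEE 754 doubles
Math.floor(0.29 * 100);  // 28, one cent short
toCents(0.29);           // 29
```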

Multiple String Replacement Sugar

How many times have you needed to run multiple replacement operations on the same string? It's not too bad, but it can get a bit tedious if you write code like this a lot.

str = str.
	replace( /&(?!#?\w+;)/g , '&amp;'    ).
	replace( /"([^"]*)"/g   , '“$1”'     ).
	replace( /</g           , '&lt;'     ).
	replace( />/g           , '&gt;'     ).
	replace( /…/g           , '&hellip;' ).
	replace( /“/g           , '&ldquo;'  ).
	replace( /”/g           , '&rdquo;'  ).
	replace( /‘/g           , '&lsquo;'  ).
	replace( /’/g           , '&rsquo;'  ).
	replace( /—/g           , '&mdash;'  ).
	replace( /–/g           , '&ndash;'  );

A common trick to shorten such code is to look up replacement values using an object as a hash table. Here's a simple implementation of this.

var hash = {
	'<' : '&lt;'    ,
	'>' : '&gt;'    ,
	'…' : '&hellip;',
	'“' : '&ldquo;' ,
	'”' : '&rdquo;' ,
	'‘' : '&lsquo;' ,
	'’' : '&rsquo;' ,
	'—' : '&mdash;' ,
	'–' : '&ndash;'
};

str = str.
	replace( /&(?!#?\w+;)/g , '&amp;' ).
	replace( /"([^"]*)"/g   , '“$1”'  ).
	replace( /[<>…“”‘’—–]/g , function ( $0 ) {
		return hash[ $0 ];
	});

However, this approach has some limitations.

  • Search patterns are repeated in the hash table and the regular expression character class.
  • Both the search and replacement are limited to plain text. That's why the first and second replacements had to remain separate in the above code. The first replacement used a regex search pattern, and the second used a backreference in the replacement text.
  • Replacements don't cascade. This is another reason why the second replacement operation had to remain separate. I want text like "this" to first be replaced with “this”, and eventually end up as &ldquo;this&rdquo;.
  • It doesn't work in Safari 2.x and other old browsers that don't support using functions to generate replacement text.

With a few lines of String.prototype sugar, you can deal with all of these issues.

String.prototype.multiReplace = function ( hash ) {
	var str = this, key;
	for ( key in hash ) {
		str = str.replace( new RegExp( key, 'g' ), hash[ key ] );
	}
	return str;
};

Now you can use code like this:

str = str.multiReplace({
	'&(?!#?\\w+;)' : '&amp;'   ,
	'"([^"]*)"'    : '“$1”'    ,
	'<'            : '&lt;'    ,
	'>'            : '&gt;'    ,
	'…'            : '&hellip;',
	'“'            : '&ldquo;' ,
	'”'            : '&rdquo;' ,
	'‘'            : '&lsquo;' ,
	'’'            : '&rsquo;' ,
	'—'            : '&mdash;' ,
	'–'            : '&ndash;'
});

If you care about the order of replacements, you should be aware that the current JavaScript specification does not require a particular enumeration order when looping over object properties with for..in. However, recent versions of the big four browsers (IE, Firefox, Safari, Opera) all use insertion order, which allows this to work as described (from top to bottom). ECMAScript 4 proposals indicate that the insertion-order convention will be formally codified in that standard.
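If you'd rather not depend on enumeration order at all, one option is to pass the replacements as an array of pairs, since array order is always well defined. The following is a rough sketch of that idea. `multiReplaceOrdered` is a hypothetical name, and it is written as a plain function rather than a String.prototype method only to keep the example self-contained.

```javascript
// Hypothetical variant: takes an array of [pattern, replacement] pairs,
// applied strictly from first to last.
function multiReplaceOrdered (str, pairs) {
    var i;
    for (i = 0; i < pairs.length; i++) {
        str = str.replace(new RegExp(pairs[i][0], "g"), pairs[i][1]);
    }
    return str;
}

var result = multiReplaceOrdered('say "hi" & <go>', [
    ['&(?!#?\\w+;)', '&amp;'  ],
    ['"([^"]*)"',    '“$1”'   ],
    ['<',            '&lt;'   ],
    ['>',            '&gt;'   ],
    ['“',            '&ldquo;'],
    ['”',            '&rdquo;']
]);
// result: 'say &ldquo;hi&rdquo; &amp; &lt;go&gt;'
```

The straight quotes are first turned into curly quotes and then into entities, so the replacements still cascade from top to bottom.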

If you need to worry about rogue properties that show up when people mess with Object.prototype, you can update the code as follows:

String.prototype.multiReplace = function ( hash ) {
	var str = this, key;
	for ( key in hash ) {
		if ( Object.prototype.hasOwnProperty.call( hash, key ) ) {
			str = str.replace( new RegExp( key, 'g' ), hash[ key ] );
		}
	}
	return str;
};

Calling the hasOwnProperty method on Object.prototype rather than on the hash object directly allows this method to work even when you're searching for the string "hasOwnProperty".
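Here's a short illustration of why that matters. The variable names (`searchHash`, `threw`, `safe`) are just for this example.

```javascript
// A hash whose search key happens to be "hasOwnProperty"
var searchHash = { "hasOwnProperty": "x" };

// searchHash.hasOwnProperty is now the string "x", so calling it throws a TypeError
var threw = false;
try {
    searchHash.hasOwnProperty("hasOwnProperty");
} catch (e) {
    threw = e instanceof TypeError;
}

// Borrowing the method from Object.prototype still works
var safe = Object.prototype.hasOwnProperty.call(searchHash, "hasOwnProperty");

// threw === true, safe === true
```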

Lemme know if you think this is useful.

Writing a Regex Book

I'm excited to announce that I've recently started working on a regular expression book for O'Reilly Media. The back story is that a few months ago, Jeffrey Friedl (author of the world's best regular expression book yet ;)) was kind enough to introduce me to his editor at O'Reilly, Andy Oram. After Andy and I discussed what we thought was a good follow-up and alternative approach to Jeffrey's very popular book, I asked Jan Goyvaerts (of RegexBuddy and regular-expressions.info) if he was interested in working together. Long story short, Jan and I are now working on what we hope will be an exceptionally practical, high-quality guide to solving real problems using regular expressions. You can see Jan's announcement on his blog.

Unfortunately, due to work on the book and other responsibilities I probably won't be able to spend as much time on this blog until the book is further along. However, as things progress I hope to share more information about the project, and get some early feedback on a few sections. Let me know if there are particular regex problems you'd like to see solutions for in the book.

Update: The book is now available for pre-order: Regular Expressions Cookbook.