Multiple String Replacement Sugar

How many times have you needed to run multiple replacement operations on the same string? It's not too bad, but can get a bit tedious if you write code like this a lot.

str = str.
	replace( /&(?!#?\w+;)/g , '&amp;'    ).
	replace( /"([^"]*)"/g   , '“$1”'     ).
	replace( /</g           , '&lt;'     ).
	replace( />/g           , '&gt;'     ).
	replace( /…/g           , '&hellip;' ).
	replace( /“/g           , '&ldquo;'  ).
	replace( /”/g           , '&rdquo;'  ).
	replace( /‘/g           , '&lsquo;'  ).
	replace( /’/g           , '&rsquo;'  ).
	replace( /—/g           , '&mdash;'  ).
	replace( /–/g           , '&ndash;'  );

A common trick to shorten such code is to look up replacement values using an object as a hash table. Here's a simple implementation of this.

var hash = {
	'<' : '&lt;'    ,
	'>' : '&gt;'    ,
	'…' : '&hellip;',
	'“' : '&ldquo;' ,
	'”' : '&rdquo;' ,
	'‘' : '&lsquo;' ,
	'’' : '&rsquo;' ,
	'—' : '&mdash;' ,
	'–' : '&ndash;'
};

str = str.
	replace( /&(?!#?\w+;)/g , '&amp;' ).
	replace( /"([^"]*)"/g   , '“$1”'  ).
	replace( /[<>…“”‘’—–]/g , function ( $0 ) {
		return hash[ $0 ];
	});

However, this approach has some limitations.

  • Search patterns are repeated in the hash table and the regular expression character class.
  • Both the search and replacement are limited to plain text. That's why the first and second replacements had to remain separate in the above code. The first replacement used a regex search pattern, and the second used a backreference in the replacement text.
  • Replacements don't cascade. This is another reason why the second replacement operation had to remain separate. I want text like "this" to first be replaced with “this”, and eventually end up as &ldquo;this&rdquo;.
  • It doesn't work in Safari 2.x and other old browsers that don't support using functions to generate replacement text.

With a few lines of String.prototype sugar, you can deal with all of these issues.

String.prototype.multiReplace = function ( hash ) {
	var str = this, key;
	for ( key in hash ) {
		str = str.replace( new RegExp( key, 'g' ), hash[ key ] );
	}
	return str;
};

Now you can use code like this:

str = str.multiReplace({
	'&(?!#?\\w+;)' : '&amp;'   ,
	'"([^"]*)"'    : '“$1”'    ,
	'<'            : '&lt;'    ,
	'>'            : '&gt;'    ,
	'…'            : '&hellip;',
	'“'            : '&ldquo;' ,
	'”'            : '&rdquo;' ,
	'‘'            : '&lsquo;' ,
	'’'            : '&rsquo;' ,
	'—'            : '&mdash;' ,
	'–'            : '&ndash;'
});

If you care about the order of replacements, you should be aware that the current JavaScript specification does not require a particular enumeration order when looping over object properties with for..in. However, recent versions of the big four browsers (IE, Firefox, Safari, Opera) all use insertion order, which allows this to work as described (from top to bottom). ECMAScript 4 proposals indicate that the insertion-order convention will be formally codified in that standard.
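If you'd rather not depend on enumeration order at all, one option is to pass the replacements as an array of [pattern, replacement] pairs, since array order is guaranteed. Here's a minimal sketch of that idea. The multiReplaceOrdered name and the sample input are made up for illustration and aren't part of the code above.

```javascript
// Hypothetical helper: applies [pattern, replacement] pairs in array order.
function multiReplaceOrdered( str, pairs ) {
	for ( var i = 0; i < pairs.length; i++ ) {
		// Each pattern is a regex source string, just like the multiReplace keys.
		str = str.replace( new RegExp( pairs[ i ][ 0 ], 'g' ), pairs[ i ][ 1 ] );
	}
	return str;
}

var result = multiReplaceOrdered( 'Tom & "Jerry" <3', [
	[ '&(?!#?\\w+;)' , '&amp;'   ],
	[ '"([^"]*)"'    , '“$1”'    ],
	[ '<'            , '&lt;'    ],
	[ '“'            , '&ldquo;' ],
	[ '”'            , '&rdquo;' ]
]);

// result -> 'Tom &amp; &ldquo;Jerry&rdquo; &lt;3'
```

An array also avoids a quirk of current engines, where integer-like keys such as '1' are enumerated before other keys regardless of insertion order.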

If you need to worry about rogue properties that show up when people mess with Object.prototype, you can update the code as follows:

String.prototype.multiReplace = function ( hash ) {
	var str = this, key;
	for ( key in hash ) {
		if ( Object.prototype.hasOwnProperty.call( hash, key ) ) {
			str = str.replace( new RegExp( key, 'g' ), hash[ key ] );
		}
	}
	return str;
};

Calling the hasOwnProperty method on Object.prototype rather than on the hash object directly allows this method to work even when you're searching for the string "hasOwnProperty".
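To see why that matters, here's a small sketch of what happens when the hash itself has a key named "hasOwnProperty". The variable names are just for illustration.

```javascript
// A hash whose own key shadows the inherited hasOwnProperty method.
var hash = { 'hasOwnProperty' : 'HOP' };

var directCallFailed = false;
try {
	// hash.hasOwnProperty is now the string 'HOP', so calling it throws.
	hash.hasOwnProperty( 'hasOwnProperty' );
} catch ( e ) {
	directCallFailed = e instanceof TypeError;
}

// Borrowing the method from Object.prototype sidesteps the shadowing.
var safeCheck = Object.prototype.hasOwnProperty.call( hash, 'hasOwnProperty' );

// directCallFailed -> true
// safeCheck -> true
```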

Lemme know if you think this is useful.

Writing a Regex Book

I'm excited to announce that I've recently started working on a regular expression book for O'Reilly Media. The back story is that a few months ago, Jeffrey Friedl (author of the world's best regular expression book yet) was kind enough to introduce me to his editor at O'Reilly, Andy Oram. After Andy and I discussed what we thought was a good follow-up and alternative approach to Jeffrey's very popular book, I asked Jan Goyvaerts (of RegexBuddy and regular-expressions.info) if he was interested in working together. Long story short, Jan and I are now working on what we hope will be an exceptionally practical, high-quality guide to solving real problems using regular expressions. You can see Jan's announcement on his blog.

Unfortunately, due to work on the book and other responsibilities I probably won't be able to spend as much time on this blog until the book is further along. However, as things progress I hope to share more information about the project, and get some early feedback on a few sections. Let me know if there are particular regex problems you'd like to see solutions for in the book.

Update: The book is now available for pre-order: Regular Expressions Cookbook.

Regex Day Contest

A few months ago Ben Nadel (a regex fan and prominent ColdFusion blogger) asked me if I was interested in promoting his idea for a "National Regular Expression Day," where he'd give away some shirts and books and basically just have some fun with regex evangelism. Well, Ben finally kicked it off, assigning the honor to June 1st, 2008. Make sure to check out his blog post, because by simply posting a comment noting your preferred item from his list before June 2nd, you're entered to win it.

I'm all for regex evangelism, so I figured I'd get in on the action with my own regex contest where you can win the best commercial regex products I know of, worth up to $150! The rules are a little different here though. For one thing, you've got more time to enter — I'll keep this open until the end of Friday, June 13. Second, this isn't a lottery. The rules are still pretty simple, though.

  • Write (or link to) some kind of creative regex content in a comment on this blog post.
  • It has to be something new, specifically for this contest.
  • Enter by unlucky Friday, June 13.
  • When submitting your entry, make sure to include an email address where I can reach you in the email field (it won't be visible to others, and I'll only use it to contact you about this contest).
  • You can submit multiple entries, but each will be judged on its own and one person cannot win more than one award.
  • I get to be the judge and jury.

As for what kind of content is eligible, well, pretty much anything as long as it's regex related. You can write a regex joke (preferably not ending with the punchline "now they have two problems"), post a regex article somewhere, create a regex comic strip, share your favorite regex you've written, design a regex superhero, create a regex game, tell a story about how regexes saved the day, link to a blog post you've written about Regular Expression Day, or whatever you can come up with. Go nuts.

Here's what the winners can choose from. If you win first or second place but none of the prizes in that tier interest you, you can pick two items from a lower level.

Good luck, and I hope you have fun with this. (Once again, make sure to check out Ben's post that started this.)

Remove Nested Patterns with One Line of JavaScript

Here's a neat little trick I came up with for removing nested patterns from a string.

var str = "abc<1<2<>3>4>def";

while (str != (str = str.replace(/<[^<>]*>/g, "")));

// str -> "abcdef"

Notice that the regex in this one-liner doesn't try to deal with nested patterns at all. The while loop's condition replaces instances of <…> (where angled brackets are not allowed in the inner pattern) with an empty string. This repeats from the inside out, until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.
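If it helps to see the inside-out behavior, here's a rough sketch that runs the same replacement but records the string after every pass. The removeNestedSteps function is a made-up helper for illustration only.

```javascript
// Hypothetical helper: returns the string as it looks after each pass.
function removeNestedSteps(str) {
    var steps = [str], next;
    // Keep going until a pass leaves the string unchanged.
    while ((next = str.replace(/<[^<>]*>/g, "")) !== str) {
        steps.push(next);
        str = next;
    }
    return steps;
}

var steps = removeNestedSteps("abc<1<2<>3>4>def");

// steps -> ["abc<1<2<>3>4>def", "abc<1<23>4>def", "abc<14>def", "abcdef"]
```

Each pass only removes the innermost pairs, so the number of passes grows with the nesting depth.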

You can use a similar approach to grab nested patterns rather than delete them, as shown below.

[Edit (6/6/2008): The following code does not correctly handle input like "((a)(b))". If you browse tag recursion on this blog you'll find a variety of other approaches for matching nested constructs that actually work correctly.]

var str = "abc(d(e())f)(gh)ijk()",
    re = /\([^()]*\)/,
    output = [],
    match, parts, last;

while (match = re.exec(str)) {
    parts = match[0].split("\uFFFF");
    if (parts.length < 2) {
        last = output.push(match[0]) - 1;
    } else {
        output[last] = parts[0] + output[last] + parts[1];
    }
    str = str.replace(re, "\uFFFF");
}

// output -> ["(d(e())f)", "(gh)", "()"]

Since once again we're working from the inside out, reassembling each complete match requires us to mark the position at which the previous deepest-level match was removed. I've used the Unicode escape sequence \uFFFF to mark such positions, because that's a permanently-unassigned code point.
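Given the limitation noted in the edit above, here's a sketch of a different technique that does handle input like "((a)(b))". It doesn't use a regex or a marker character at all. Instead, it walks the string and tracks nesting depth. The outermostGroups function is hypothetical and not part of the original code, and it assumes you only want the outermost balanced groups.

```javascript
// Hypothetical helper: collects outermost balanced (...) groups
// by counting nesting depth while scanning the string once.
function outermostGroups(str) {
    var output = [], depth = 0, start = -1, i, ch;
    for (i = 0; i < str.length; i++) {
        ch = str.charAt(i);
        if (ch === "(") {
            if (depth === 0) start = i; // an outermost group opens here
            depth++;
        } else if (ch === ")" && depth > 0) { // stray ")" at depth 0 is ignored
            depth--;
            if (depth === 0) output.push(str.slice(start, i + 1));
        }
    }
    return output; // groups left unclosed at the end are dropped
}

var groups = outermostGroups("abc(d(e())f)(gh)ijk()");
var tricky = outermostGroups("((a)(b))");

// groups -> ["(d(e())f)", "(gh)", "()"]
// tricky -> ["((a)(b))"]
```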

Note that using a negated character class like [^()] to match the inner pattern as shown in the examples here only works correctly if you're using single-character delimiters like (…) or <…>. If you want to match/remove nested patterns that use multi-character delimiters, you can use a regex like /<<(?:(?!<<|>>)[\S\s])*>>/. Just change both instances of << to your left delimiter, and >> to your right delimiter.
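Here's a quick sketch of that multi-character version dropped into the same removal loop. The sample string is made up for illustration.

```javascript
var str = "abc<<1<<2>>3>>def<<4>>",
    re = /<<(?:(?!<<|>>)[\S\s])*>>/g;

// Same loop as before: repeat until a pass changes nothing.
while (str != (str = str.replace(re, "")));

// str -> "abcdef"
```

The lookahead only rejects the full two-character delimiters, so a lone < or > inside the pattern is still treated as ordinary text.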

Test Your XRegExps with JRX

Cüneyt Yılmaz's JRX is a cool JavaScript regex tester inspired by the RX tool of Komodo IDE. Cüneyt recently added my XRegExp library to his tester, so JRX is now a nice and easy way to test XRegExp's singleline and extended modes, as well as named capture and other XRegExp-provided syntax. Check it out!

As for XRegExp, it has recently been upgraded to v0.5.2, which resolved a corner-case bug involving XRegExp.matchRecursive. See the changelog for details.

I'll take this opportunity to highlight some of my other favorite online regex testers. I've actively looked for these kinds of apps over the years and have seen more than 50 of them. Odds are you'll find something new here.

Edit (2008-06-18): Updated the list with a couple that have come out very recently.

  • RegexPal — My own JavaScript regex tester. It includes real-time regex syntax and match highlighting. Although RegexPal uses XRegExp to provide the singleline option, unlike JRX it uses JavaScript regex syntax without the XRegExp syntax extensions.
  • regex — Simple name, simple interface. Great set of flavor support (JavaScript, Perl, Python, PCRE, POSIX ERE).
  • Regexp Editor — Java regex tester with regex syntax highlighting.
  • RegExr — ActionScript regex tester with regex syntax highlighting.
  • reWork — JavaScript regex workbench.
  • reAnimator — Fun app for visualizing regex FSAs.
  • RegexMate — JavaScript regex console.
  • The REWizard — IE only, but offers regex building tools and an interesting visualization.
  • MyRegexTester — Includes code generation and plain-text explanations (via the YAPE::Regex::Explain Perl module).
  • Regular Expression Analyzer — Real-time regex explanation tree that mostly emulates Java, JavaScript, and Perl flavors. Its regex parsing code is very readable.
  • Nregex — .NET regex tester.
  • Rubular — Ruby regex tester.

Have fun!