Regular Expressions — Flagrant Badassery

Regular Expressions As Functions

Firefox includes a non-standard JavaScript extension that makes regular expressions callable as functions. This serves as a shorthand for calling a regex's exec method. For example, in Firefox /regex/("string") is equivalent to /regex/.exec("string"). Early ECMAScript 4 proposals indicated this functionality would be added to the ES4 specification, but subsequent discussion on the ES4-discuss mailing list suggests it might be dropped.

However, you can implement something similar by adding call and apply methods to RegExp.prototype, which could help with functional programming and duck-typed code that works with both functions and regular expressions. So let's add them:

RegExp.prototype.call = function (context, str) {
	return this.exec(str);
};
RegExp.prototype.apply = function (context, args) {
	return this.exec(args[0]);
};

Note that both of the above methods completely ignore the context argument. You could pass in null or whatever else as the context, and you'd get back the normal result of running exec on the regex. Using the above methods, you can generically work with both regular expressions and functions wherever it's convenient to do so. A few obvious cases where this could be helpful are the JavaScript 1.6 array iteration methods. Following are implementations of filter, every, some, and map that allow them to be used cross-browser:

// Returns an array with the elements of an existng array for which the provided filtering function returns true
Array.prototype.filter = function (func, context) {
	var results = [];
	for (var i = 0; i < this.length; i++) {
		if (i in this && func.call(context, this[i], i, this))
			results.push(this[i]);
	}
	return results;
};
// Returns true if every element in the array satisfies the provided testing function
Array.prototype.every = function (func, context) {
	for (var i = 0; i < this.length; i++) {
		if (i in this && !func.call(context, this[i], i, this))
			return false;
	}
	return true;
};
// Returns true if at least one element in the array satisfies the provided testing function
Array.prototype.some = function (func, context) {
	for (var i = 0; i < this.length; i++) {
		if (i in this && func.call(context, this[i], i, this))
			return true;
	}
	return false;
};
// Returns an array with the results of calling the provided function on every element in the provided array
Array.prototype.map = function (func, context) {
	var results = [];
	for (var i = 0; i < this.length; i++) {
		if (i in this)
			results[i] = func.call(context, this[i], i, this);
	}
	return results;
};

Because the array and null values returned by exec type-convert nicely to true and false, the above code allows you to use something like ["a","b","ab","ba"].filter(/^a/) to return all values that start with "a": ["a","ab"]. The code ["1",1,0,"a",3.1,256].filter(/^[1-9]\d*$/) would return integers greater than zero, regardless of type: ["1",1,256]. str.match(/a?b/g).filter(/^b/) would return all matches of "b" not preceded by "a". This can be a convenient pattern since JavaScript doesn't support lookbehind.

All of the above examples already work with Firefox's native Array.prototype.filter because of the indirect exec calling feature in that browser, but they wouldn't work with the cross-browser implementation of filter above without adding RegExp.prototype.call.

Does this seem like something that would be useful to you? Can you think of other good examples where call and apply methods would be useful for regular expressions?

Update: This post has been translated into Chinese by PlanABC.net.

10 Reasons to Learn and Use Regular Expressions

10. Regular expressions are everywhere

Here's a short list of programming languages and tools that support regular expressions. The links are to their regex documentation.

The .NET Framework
ActionScript
ColdFusion
Java
JavaScript
Perl
PHP
Python
Ruby
Tcl
Apache HTTP Server
Google Analytics
Google Code Search
Yahoo! Pipes
Microsoft Exchange Server
Windows PowerShell
OpenOffice.org
Oracle Database
Any decent programmer's text editor or IDE
And thousands more…

9. Regular expression mastery can help you stand out from the crowd

Regular expressions might be everywhere, but many experienced programmers are intimidated by them. Knowing how to use regular expressions effectively is a valuable skill that can quickly make your peers take notice.

8. Wielding regular expressions can make you feel like a mighty wizard

Regular expressions can be difficult to master, but doing so is that much more rewarding as a result. Writing a line of cryptic letters and symbols that does what might otherwise take hundreds of lines of code can feel pretty cool.

7. If your search is simple, regular expression syntax is simple

Want to match the word "cat"? The regex is simply cat. ^cat matches "cat" at the beginning of the string, cat$ matches at the end, and cat|dog matches "cat" or "dog". Most regex syntax is very simple once you get the hang of it.

6. Regular expressions are portable

That's a bold lie, yet it's usually true for people who stick to the basics or intentionally write their regexes in a portable way. The majority of regex syntax works the same in a wide variety of programming languages and tools.

5. Regular expressions can help you write short code

This can be especially helpful in JavaScript, where keeping code length down is important for people with slow Internet connections. And although regexes can be hard to read, I'd rather spend a minute stepping through the logic of a regex than doing the same thing with a page full of code. Of course, like with most things in life it's important to find a good balance.

4. Regular expressions save time

Even for newcomers who still struggle with the syntax, regular expressions are often the fastest way to get the job done.

3. Regular expressions are fast

Although typical backtracking regex engines have so-called pathological cases which can take a very long time, regexes written with performance in mind will be fast enough for your needs in almost all cases. To ensure that's true, it's a good idea to at least get a feel for the basics of regex performance optimization.

2. Regular expressions can match just about anything

In other words, regular expressions are powerful. A regular expression guru can find many appropriate uses for regexes where the untrained user might not think to look. As the authors of Programming Perl wrote, "if you take 'text' in the widest possible sense, perhaps 90% of what you do is 90% text processing."

1. Regular expressions are fun

Like any good challenge, regexes can be a lot of fun. Tools like RegexPal can help remove a lot of the guesswork, so you can concentrate on solving problems.

… Feel free to add your own reasons why you think regexes are awesome^2 or the worst idea since unicycles.

Update: This post has been translated into Portuguese (by Fábio Luciano) and Spanish (by Fernando Briano). Thanks guys!

Fun With .NET Regex Balancing Groups

The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

(?<Name1>…) — A standard named capturing group. The captured value is pushed on the Name1 CaptureCollection or stack.
(?<-Name1>…) — Pops the top backreference off the Name1 stack. If there are no backreferences on the stack, the match fails, forcing backtracking.
(?<Name2-Name1>…) — Pops the top backreference off the Name1 stack, and pushes text matched since the last time Name1 participated on top of the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").

^
(?<Counter>A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

(?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
(?(Counter)…) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
(?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and numbers.

^
(?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.

Here's a way to match palindromes (e.g., "redivider"):

(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))

In the above regex, \k<N> is a backreference to the last value on the N capture stack.

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?<Depth>)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?<Depth>)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?<B>)
	(?<C-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?<C>)
		(?<B-C>\w)+ (?(C)(?!))
		(?<A>)
	)?
)+ \b

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

Regex Legends: The People Behind the Magic

Many people have contributed to developing and promoting the use of regular expressions since they were invented about half a century ago. Here's a short list of some of the most influential people behind the technology. I've written this up for two reasons:

For people who've only gotten into the technology recently but are interested in some of the history and pioneers behind it.
Since I fit the above description, I'm hoping readers will help fill me in on other people I've forgotten about or otherwise left out.

Alfred Aho

Aho is the "A" in AWK, and co-author of the Dragon Book – a classic reference covering such topics as building regular expression compilers. He created the initial version of egrep, which provided a big jump in expressiveness from the primitive beginnings of early Unix grep.

Websites: Wikipedia, @Columbia U (photo source)

Jeffrey Friedl

Friedl began using regular expressions with Unix in 1980. He has since written the definitive work on the subject: Mastering Regular Expressions, published by O'Reilly Media. Now in its third edition, it is widely considered a classic programming book (see e.g. this Slashdot review). The longevity of his experience with regexes helps to make him a shrewd opponent in regex debates… odds are he's already succinctly countered your quackery ten or more years ago on Usenet, and has the links to prove it. wink Friedl lives in Kyoto, Japan with his family of three.

Websites: Blog (photo source), O'Reilly bio

Jan Goyvaerts

Goyvaerts – a Belgian who's been living in Thailand for several years – is not as widely known as the others on this list, but his contributions towards helping thousands of people learn and use regular expressions are significant. Goyvaerts creates the best-in-class tools RegexBuddy and PowerGREP, which use his own JGsoft regex engine (notable for its support of most syntax from popular regex flavors including Perl, .NET, and Java). His website regular-expressions.info – based on the PowerGREP/RegexBuddy help files – is the best and most popular multi-flavor regex tutorial or reference online.

Edit: A year after this post, Jan and I coauthored Regular Expressions Cookbook, now in its second edition.

Websites: Regex blog, Just Great Software (photo source)

Philip Hazel

Hazel grew up in South Africa and has a PhD in applied mathematics. He's best known for writing Exim (a popular open source mail transfer agent) and the PCRE regex library. PCRE is one of the best regex libraries in the world and is used by many projects including Apache, PHP, and probably thousands more. Hazel worked for the University of Cambridge's Computing Service for over 30 years until he retired at the end of September 2007.

Websites: Personal site, UIT Cambridge bio (photo source)

Stephen Kleene

In the 1950s, distinguished American mathematician Stephen Kleene invented regular expressions, which is what he called his notation for expressing the algebra of regular sets. The regex * metacharacter (called the Kleene star) is named after him. Kleene helped lay the foundations for theoretical computer science through his work on recursion theory, which resulted in him being awarded the National Medal of Science in 1990.

Websites: Wikipedia, Bio at nap.edu (photo source)

Henry Spencer

Spencer is a Canadian programmer and space enthusiast who created three widely used, adapted, and influential regular expression libraries. In 1986, he was the first to release a regex library which could be freely included in other programs. Perl 2's regex package was based on and enhanced from Spencer's library, but Spencer's technological tour de force was creating the regex package used by Tcl. This implementation, Jeffrey Friedl writes, "is a hybrid [NFA/DFA engine] with the best of both worlds".

Websites: Wikipedia, O'Reilly bio, Lysator, Bio at NASA (photo source)

Ken Thompson

Thompson is a hacker demigod and the principal inventor of Unix. He received the Turing Award in 1983, the National Medal of Technology in 1998, and the IEEE's Tsutomu Kanai Award in 1999. Thompson introduced regular expressions to the computing world by building Stephen Kleene's notation into his version of the QED text editor, and later ed and other tools. Thompson's original regular expression search implementation is still considered by some to be superior to modern, backtracking algorithms. Did I mention this dude flies MiG fighter jets for fun?

Websites: Wikipedia, Linfo, @Bell Labs, Bio at Bell Labs (photo source)

Larry Wall

Wall created and continues to oversee development of Perl, which has done more than any other programming language to popularize and extend the power of regular expressions. Many programming languages including Java, JavaScript, the .NET Framework, PHP, Python, and Ruby have since adopted regex syntax and features similar to Perl's. The recently released Perl 5.10 continues to push the state of the art in regex power, and upcoming changes outlined by Wall for Perl 6 (called Perl 6 rules; described in Apocalypse, Synopsis, and Exegesis 5) fearlessly redesign Perl's regular expression language.

Websites: Personal site, Wikipedia (photo source)

I'm still a newcomer to the field, so please let me know if you think there are others who should be on this list.

The terms single-line and multi-line considered harmful

Alright, that title doesn't really work, but one thing I've encountered quite frequently is that the terms "single-line mode" and "multi-line mode" seem to cause no end of confusion for the vast majority of regex users. Many guides try to explain the terms based on some description of lines, or other unrelated issues. I won't, because I think that's highly misleading. In fact, I shunned the terms in RegexPal in favor of the hopefully more descriptive "dot matches all" (instead of single-line) and "^$ match at line breaks" (instead of multi-line). Another sensible name I've seen used for multi-line mode is "enhanced line-anchor mode".

The first important thing to understand is that single-line and multi-line modes have nothing to do with each other. Hence, they can be applied independently — e.g., sometimes it makes sense to use both at the same time. Single-line mode changes the meaning of the dot, while multi-line mode changes the meaning of ^ and $. There are no other side effects. This means that if your regex does not contain one of those three metacharacters (., ^, $), neither modifier has any impact (except possibly confusing other people who read your regexes).

Let's get more specific…

Dot (.) matches all characters except newlines. In single-line mode, it matches all characters.
Caret (^) matches the position at the beginning of the string. In multi-line mode, it additionally matches the position just after newlines.
Dollar ($) matches the position at the end of the string. With most regex flavors, it also matches just before a string-ending newline. In multi-line mode, it additionally matches the position just before newlines (not only string-ending newlines).

As for exactly which characters are considered newline characters, that is regex-flavor and character-encoding specific. See my last post, JavaScript, Regex, and Unicode, for a precise JavaScript definition.

Finally, although single-line and multi-line modifiers are standard in most Perl-derivative regex flavors, there are a couple notable exceptions.

JavaScript does not have a single-line modifier. Use [\S\s] instead of a dot if you want to match any character including newlines. And for the love of god, don't use (.|\r|\n) or similar — it's terribly inefficient (especially when repeated infinitely) and doesn't match a couple lesser-used newline characters.
In JavaScript, $ without /m matches only the position at the end of the string. It does not match before a string-ending newline.
In Ruby, ^ and $ always match at newlines, and there is no mode to change this. Use \A and \Z to match at the beginning and end of the string only. In fact, what Ruby calls multi-line mode (and implements as /m) is what other regex flavors call single-line mode! Talk about confusing!

Edit: On his blog, Jeffrey Friedl pointed out a related post he'd made on the comp.lang.perl Usenet group back in 1994.

Edit 2: Note that I consider Tcl's "Advanced Regular Expressions" flavor to be more inspired by some features of Perl than actually Perl-derivative. Hence, its peculiarities regarding so-called single-line and multi-line handling are not mentioned here.