Steven Levithan – Page 9 – Flagrant Badassery

10 Reasons to Learn and Use Regular Expressions

10. Regular expressions are everywhere

Here's a short list of programming languages and tools that support regular expressions. The links are to their regex documentation.

The .NET Framework
ActionScript
ColdFusion
Java
JavaScript
Perl
PHP
Python
Ruby
Tcl
Apache HTTP Server
Google Analytics
Google Code Search
Yahoo! Pipes
Microsoft Exchange Server
Windows PowerShell
OpenOffice.org
Oracle Database
Any decent programmer's text editor or IDE
And thousands more…

9. Regular expression mastery can help you stand out from the crowd

Regular expressions might be everywhere, but many experienced programmers are intimidated by them. Knowing how to use regular expressions effectively is a valuable skill that can quickly make your peers take notice.

8. Wielding regular expressions can make you feel like a mighty wizard

Regular expressions can be difficult to master, but doing so is that much more rewarding as a result. Writing a line of cryptic letters and symbols that does what might otherwise take hundreds of lines of code can feel pretty cool.

7. If your search is simple, regular expression syntax is simple

Want to match the word "cat"? The regex is simply cat. ^cat matches "cat" at the beginning of the string, cat$ matches at the end, and cat|dog matches "cat" or "dog". Most regex syntax is very simple once you get the hang of it.
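To see those four patterns in action, here's a minimal sketch using Python's standard re module. The sample strings are made up for this example; any language from the list in reason 10 would behave much the same way for syntax this basic.

```python
import re

# Made-up sample strings for the illustration.
samples = ["cat nap", "bobcat", "hot dog", "bird"]

# For each pattern, collect the samples in which it finds a match.
patterns = ["cat", "^cat", "cat$", "cat|dog"]
results = {p: [s for s in samples if re.search(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r:10} matches in {matched}")
```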

6. Regular expressions are portable

That's a bold lie, yet it's usually true for people who stick to the basics or intentionally write their regexes in a portable way. The majority of regex syntax works the same in a wide variety of programming languages and tools.

5. Regular expressions can help you write short code

This can be especially helpful in JavaScript, where keeping code length down is important for people with slow Internet connections. And although regexes can be hard to read, I'd rather spend a minute stepping through the logic of a regex than doing the same thing with a page full of code. Of course, as with most things in life, it's important to find a good balance.

4. Regular expressions save time

Even for newcomers who still struggle with the syntax, regular expressions are often the fastest way to get the job done.

3. Regular expressions are fast

Although typical backtracking regex engines have so-called pathological cases that can take a very long time to run, regexes written with performance in mind will be fast enough for your needs in almost all cases. To ensure that's true, it's a good idea to at least get a feel for the basics of regex performance optimization.
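To make the pathological case concrete, here's a small sketch in Python, whose re module is a backtracking engine. Nested quantifiers such as (a+)+ are a classic trigger: on input that almost matches, the engine tries an exponential number of ways to split up the run of a's before giving up. The subject string is kept short on purpose so the example finishes quickly, and the variable names are just labels for this illustration.

```python
import re

# Nested quantifier: many ways to divide a run of "a"s between iterations.
risky = re.compile(r"^(a+)+$")
# Accepts the same strings, but there is only one way to match each one.
safe = re.compile(r"^a+$")

# A near miss forces the risky pattern to try every split before failing.
# Keep this short: each extra "a" roughly doubles the work for `risky`.
near_miss = "a" * 16 + "b"

risky_result = risky.search(near_miss)
safe_result = safe.search(near_miss)
print(risky_result, safe_result)
```

Both patterns reject the near miss; the difference is only in how much work the engine does to find that out.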

2. Regular expressions can match just about anything

In other words, regular expressions are powerful. A regular expression guru can find many appropriate uses for regexes where the untrained user might not think to look. As the authors of Programming Perl wrote, "if you take 'text' in the widest possible sense, perhaps 90% of what you do is 90% text processing."

1. Regular expressions are fun

Like any good challenge, regexes can be a lot of fun. Tools like RegexPal can help remove a lot of the guesswork, so you can concentrate on solving problems.

… Feel free to add your own reasons why you think regexes are awesome^2 or the worst idea since unicycles.

Update: This post has been translated into Portuguese (by Fábio Luciano) and Spanish (by Fernando Briano). Thanks guys!

A Few Art Samples

Here be some old drawings I did back in 2002 or thereabouts (I haven't been drawing since then). They're done in pen, pencil, and colored pencil, with some touch-ups and added color from Photoshop. Most of my old art scans are no longer around after numerous computer upgrades, so I'm mostly just posting these here for posterity — but lemme know what you think. You won't hurt my feelings.

Jazz girl

Contortionista

Manga sketch

Fun With .NET Regex Balancing Groups

The .NET Framework's regular expression package includes a unique feature called balancing groups. The name is a misnomer: although they can indeed be used to match balanced constructs, that's not all they're good for, and balancing really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

(?<Name1>…) — A standard named capturing group. The captured value is pushed on the Name1 CaptureCollection or stack.
(?<-Name1>…) — Pops the top backreference off the Name1 stack. If there are no backreferences on the stack, the match fails, forcing backtracking.
(?<Name2-Name1>…) — Pops the top backreference off the Name1 stack, and pushes text matched since the last time Name1 participated on top of the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").

^
(?<Counter>A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

(?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
(?(Counter)…) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
(?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.
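For comparison, here's a rough Python sketch of the same check. Python's re module has no balancing groups, so this is not a translation of the regex above: it captures the run of As and the run of Bs and compares their lengths in ordinary code. The function name is made up for this example.

```python
import re

def same_number_of_as_and_bs(text):
    """Return True for one or more As followed by exactly as many Bs."""
    m = re.fullmatch(r"(A+)(B+)", text)
    # The length comparison stands in for the push/pop counting that
    # the Counter stack does inside the .NET regex.
    return bool(m) and len(m.group(1)) == len(m.group(2))

print(same_number_of_as_and_bs("AAABBB"))
```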

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and digits.

^
(?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.
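Outside of .NET, the same "x out of y" rule is usually easier to express by counting in code. Here's a minimal Python sketch under that assumption; the function name is hypothetical, and it checks the string's total length rather than reproducing the exact behavior of the dot in the regex above.

```python
import re

def is_valid_password(password):
    """At least eight characters, with at least two of three character types."""
    classes = [r"[a-z]", r"[A-Z]", r"[0-9]"]
    # Each class that appears counts once, like one push onto the N stack.
    found = sum(1 for c in classes if re.search(c, password))
    # Requiring two or more plays the role of popping N twice.
    return found >= 2 and len(password) >= 8

print(is_valid_password("abcdefg1"))
```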

Here's a way to match palindromes (e.g., "redivider"):

(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))

In the above regex, \k<N> is a backreference to the last value on the N capture stack.
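The push-then-pop idea can be spelled out with an explicit stack. The following Python sketch is only an analogy, not a port: it tests a whole string, whereas the unanchored regex above can find a palindrome anywhere in the subject text. The function name is made up for this example.

```python
def is_palindrome(text):
    """Push the first half onto a stack, then pop it against the second half."""
    if len(text) < 2:
        return False                     # the regex needs at least two characters
    half = len(text) // 2
    stack = list(text[:half])            # like (?<N>.)+
    rest = text[half + len(text) % 2:]   # skip a middle character, like .?
    for ch in rest:
        # Like (?<-N>\k<N>): the next character must equal the top of the stack.
        if not stack or stack.pop() != ch:
            return False
    return not stack                     # like (?(N)(?!))

print(is_palindrome("redivider"))
```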

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?<Depth>)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?<Depth>)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.
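For engines without balancing groups, the usual fallback is a depth counter in code. Here's a minimal Python sketch of that approach, which handles both the parentheses case and multi-character delimiters. The function name and its parameters are hypothetical, and unlike the dot in the regexes above, it places no restriction on newlines.

```python
def first_balanced_span(text, left="<<", right=">>"):
    """Return the first balanced left...right span in text, or None."""
    start = text.find(left)
    while start != -1:
        depth, i = 0, start
        while i < len(text):
            if text.startswith(left, i):
                depth += 1              # like pushing onto Depth
                i += len(left)
            elif text.startswith(right, i):
                depth -= 1              # like popping from Depth
                i += len(right)
                if depth == 0:
                    return text[start:i]
            else:
                i += 1
        # This opening delimiter is never closed; try the next one.
        start = text.find(left, start + 1)
    return None

print(first_balanced_span("x <<a <<b>> c>> y <<d>>"))
print(first_balanced_span("f(a(b)c) + g(d)", "(", ")"))
```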

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?<B>)
	(?<C-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?<C>)
		(?<B-C>\w)+ (?(C)(?!))
		(?<A>)
	)?
)+ \b
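If you'd like something to test your understanding against, here's a plain Python sketch of the property the regex is after. It doesn't show how the regex works; it just checks a whole string of whitespace-separated words, whereas the regex can match such a run anywhere in the text. The function name is made up for this example.

```python
def has_incrementing_word_lengths(text):
    """True if each word is exactly one character longer than the one before."""
    lengths = [len(word) for word in text.split()]
    # Compare each length with its successor; a single word passes trivially.
    return bool(lengths) and all(b == a + 1 for a, b in zip(lengths, lengths[1:]))

print(has_incrementing_word_lengths("abc abcd abcde abcdef"))
```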

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

80 Secrets to Success

Behold my Mountain Dew collection at work, originally started when someone asked how much Dew I drank per day. This ended up taking me a couple months, since I slowed down from my previous ~4/day consumption rate right around the time I started preserving the evidence.

The photo is by ColdFusion programmer and photographer Joe D'Angelo, whose single greatest purpose in life seems to be trying to sneak up behind me and get me to spill whatever I happen to be eating or drinking at the time.

Regex Legends: The People Behind the Magic

Many people have contributed to developing and promoting the use of regular expressions since they were invented about half a century ago. Here's a short list of some of the most influential people behind the technology. I've written this up for two reasons:

For people who've only gotten into the technology recently but are interested in some of the history and pioneers behind it.
Since I fit the above description, I'm hoping readers will help fill me in on other people I've forgotten about or otherwise left out.

Alfred Aho

Aho is the "A" in AWK, and co-author of the Dragon Book – a classic reference covering such topics as building regular expression compilers. He created the initial version of egrep, which provided a big jump in expressiveness from the primitive beginnings of early Unix grep.

Websites: Wikipedia, @Columbia U (photo source)

Jeffrey Friedl

Friedl began using regular expressions with Unix in 1980. He has since written the definitive work on the subject: Mastering Regular Expressions, published by O'Reilly Media. Now in its third edition, it is widely considered a classic programming book (see e.g. this Slashdot review). The longevity of his experience with regexes helps to make him a shrewd opponent in regex debates… odds are he's already succinctly countered your quackery ten or more years ago on Usenet, and has the links to prove it. ;-) Friedl lives in Kyoto, Japan with his family of three.

Websites: Blog (photo source), O'Reilly bio

Jan Goyvaerts

Goyvaerts – a Belgian who's been living in Thailand for several years – is not as widely known as the others on this list, but his contributions towards helping thousands of people learn and use regular expressions are significant. Goyvaerts creates the best-in-class tools RegexBuddy and PowerGREP, which use his own JGsoft regex engine (notable for its support of most syntax from popular regex flavors including Perl, .NET, and Java). His website regular-expressions.info – based on the PowerGREP/RegexBuddy help files – is the best and most popular multi-flavor regex tutorial or reference online.

Edit: A year after this post, Jan and I coauthored Regular Expressions Cookbook, now in its second edition.

Websites: Regex blog, Just Great Software (photo source)

Philip Hazel

Hazel grew up in South Africa and has a PhD in applied mathematics. He's best known for writing Exim (a popular open source mail transfer agent) and the PCRE regex library. PCRE is one of the best regex libraries in the world and is used by many projects including Apache, PHP, and probably thousands more. Hazel worked for the University of Cambridge's Computing Service for over 30 years until he retired at the end of September 2007.

Websites: Personal site, UIT Cambridge bio (photo source)

Stephen Kleene

In the 1950s, distinguished American mathematician Stephen Kleene invented regular expressions, which is what he called his notation for expressing the algebra of regular sets. The regex * metacharacter (called the Kleene star) is named after him. Kleene helped lay the foundations for theoretical computer science through his work on recursion theory, which led to his being awarded the National Medal of Science in 1990.

Websites: Wikipedia, Bio at nap.edu (photo source)

Henry Spencer

Spencer is a Canadian programmer and space enthusiast who created three widely used, adapted, and influential regular expression libraries. In 1986, he was the first to release a regex library which could be freely included in other programs. Perl 2's regex package was based on and enhanced from Spencer's library, but Spencer's technological tour de force was creating the regex package used by Tcl. This implementation, Jeffrey Friedl writes, "is a hybrid [NFA/DFA engine] with the best of both worlds".

Websites: Wikipedia, O'Reilly bio, Lysator, Bio at NASA (photo source)

Ken Thompson

Thompson is a hacker demigod and the principal inventor of Unix. He received the Turing Award in 1983, the National Medal of Technology in 1998, and the IEEE's Tsutomu Kanai Award in 1999. Thompson introduced regular expressions to the computing world by building Stephen Kleene's notation into his version of the QED text editor, and later ed and other tools. Thompson's original regular expression search implementation is still considered by some to be superior to modern backtracking algorithms. Did I mention this dude flies MiG fighter jets for fun?

Websites: Wikipedia, Linfo, @Bell Labs, Bio at Bell Labs (photo source)

Larry Wall

Wall created and continues to oversee development of Perl, which has done more than any other programming language to popularize and extend the power of regular expressions. Many programming languages including Java, JavaScript, the .NET Framework, PHP, Python, and Ruby have since adopted regex syntax and features similar to Perl's. The recently released Perl 5.10 continues to push the state of the art in regex power, and upcoming changes outlined by Wall for Perl 6 (called Perl 6 rules; described in Apocalypse, Synopsis, and Exegesis 5) fearlessly redesign Perl's regular expression language.

Websites: Personal site, Wikipedia (photo source)

I'm still a newcomer to the field, so please let me know if you think there are others who should be on this list.