A Few Art Samples

Here be some old drawings I did back in 2002 or thereabouts (I haven't been drawing since then). They're done in pen, pencil, and colored pencil, with some touch-ups and added color from Photoshop. Most of my old art scans are no longer around after numerous computer upgrades, so I'm mostly just posting these here for posterity — but lemme know what you think. You won't hurt my feelings.

Jazz girl

Contortionista

Manga sketch

Fun With .NET Regex Balancing Groups

The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

  • (?<Name1>) — A standard named capturing group. The captured value is pushed on the Name1 CaptureCollection or stack.
  • (?<-Name1>) — Pops the top backreference off the Name1 stack. If there are no backreferences on the stack, the match fails, forcing backtracking.
  • (?<Name2-Name1>) — Pops the top backreference off the Name1 stack, and pushes text matched since the last time Name1 participated on top of the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").

^
(?<Counter>A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

  • (?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
  • (?(Counter)) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
  • (?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and numbers.

^
(?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.

Here's a way to match palindromes (e.g., "redivider"):

(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))

In the above regex, \k<N> is a backreference to the last value on the N capture stack.

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?<Depth>)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?<Depth>)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?<B>)
	(?<C-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?<C>)
		(?<B-C>\w)+ (?(C)(?!))
		(?<A>)
	)?
)+ \b

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

80 Secrets to Success

Mountain Dew collection

Behold my Mountain Dew collection at work, originally started when someone asked how much Dew I drank per day. This ended up taking me a couple months, since I slowed down from my previous ~4/day consumption rate right around the time I started preserving the evidence.

The photo is by ColdFusion programmer and photographer Joe D'Angelo, whose single greatest purpose in life seems to be trying to sneak up behind me and getting me to spill whatever I happen to be eating or drinking at the time.

Regex Legends: The People Behind the Magic

Many people have contributed to developing and promoting the use of regular expressions since they were invented about half a century ago. Here's a short list of some of the most influential people behind the technology. I've written this up for two reasons:

  1. For people who've only gotten into the technology recently but are interested in some of the history and pioneers behind it.
  2. Since I fit the above description, I'm hoping readers will help fill me in on other people I've forgotten about or otherwise left out.

Alfred Aho

Alfred Aho

Aho is the "A" in AWK, and co-author of the Dragon Book – a classic reference covering such topics as building regular expression compilers. He created the initial version of egrep, which provided a big jump in expressiveness from the primitive beginnings of early Unix grep.

Websites: Wikipedia, @Columbia U (photo source)

Jeffrey Friedl

Jeffrey Friedl

Friedl began using regular expressions with Unix in 1980. He has since written the definitive work on the subject: Mastering Regular Expressions, published by O'Reilly Media. Now in its third edition, it is widely considered a classic programming book (see e.g. this Slashdot review). The longevity of his experience with regexes helps to make him a shrewd opponent in regex debates… odds are he's already succinctly countered your quackery ten or more years ago on Usenet, and has the links to prove it. wink Friedl lives in Kyoto, Japan with his family of three.

Websites: Blog (photo source), O'Reilly bio

Jan Goyvaerts

Jan Goyvaerts

Goyvaerts – a Belgian who's been living in Thailand for several years – is not as widely known as the others on this list, but his contributions towards helping thousands of people learn and use regular expressions are significant. Goyvaerts creates the best-in-class tools RegexBuddy and PowerGREP, which use his own JGsoft regex engine (notable for its support of most syntax from popular regex flavors including Perl, .NET, and Java). His website regular-expressions.info – based on the PowerGREP/RegexBuddy help files – is the best and most popular multi-flavor regex tutorial or reference online.

Edit: A year after this post, Jan and I coauthored Regular Expressions Cookbook, now in its second edition.

Websites: Regex blog, Just Great Software (photo source)

Philip Hazel

Philip Hazel

Hazel grew up in South Africa and has a PhD in applied mathematics. He's best known for writing Exim (a popular open source mail transfer agent) and the PCRE regex library. PCRE is one of the best regex libraries in the world and is used by many projects including Apache, PHP, and probably thousands more. Hazel worked for the University of Cambridge's Computing Service for over 30 years until he retired at the end of September 2007.

Websites: Personal site, UIT Cambridge bio (photo source)

Stephen Kleene

Stephen Kleene

In the 1950s, distinguished American mathematician Stephen Kleene invented regular expressions, which is what he called his notation for expressing the algebra of regular sets. The regex * metacharacter (called the Kleene star) is named after him. Kleene helped lay the foundations for theoretical computer science through his work on recursion theory, which resulted in him being awarded the National Medal of Science in 1990.

Websites: Wikipedia, Bio at nap.edu (photo source)

Henry Spencer

Henry Spencer

Spencer is a Canadian programmer and space enthusiast who created three widely used, adapted, and influential regular expression libraries. In 1986, he was the first to release a regex library which could be freely included in other programs. Perl 2's regex package was based on and enhanced from Spencer's library, but Spencer's technological tour de force was creating the regex package used by Tcl. This implementation, Jeffrey Friedl writes, "is a hybrid [NFA/DFA engine] with the best of both worlds".

Websites: Wikipedia, O'Reilly bio, Lysator, Bio at NASA (photo source)

Ken Thompson

Ken Thompson

Thompson is a hacker demigod and the principal inventor of Unix. He received the Turing Award in 1983, the National Medal of Technology in 1998, and the IEEE's Tsutomu Kanai Award in 1999. Thompson introduced regular expressions to the computing world by building Stephen Kleene's notation into his version of the QED text editor, and later ed and other tools. Thompson's original regular expression search implementation is still considered by some to be superior to modern, backtracking algorithms. Did I mention this dude flies MiG fighter jets for fun?

Websites: Wikipedia, Linfo, @Bell Labs, Bio at Bell Labs (photo source)

Larry Wall

Larry Wall

Wall created and continues to oversee development of Perl, which has done more than any other programming language to popularize and extend the power of regular expressions. Many programming languages including Java, JavaScript, the .NET Framework, PHP, Python, and Ruby have since adopted regex syntax and features similar to Perl's. The recently released Perl 5.10 continues to push the state of the art in regex power, and upcoming changes outlined by Wall for Perl 6 (called Perl 6 rules; described in Apocalypse, Synopsis, and Exegesis 5) fearlessly redesign Perl's regular expression language.

Websites: Personal site, Wikipedia (photo source)


I'm still a newcomer to the field, so please let me know if you think there are others who should be on this list.

Timed Memoization

Certain operations are computationally expensive, but because their results might change over time or due to outside influences, they don't lend themselves to typical memoization — take for example getElementsByClassName. Here's a JavaScript timed memoization decorator / higher-order-function I made to help with these cases, which accepts an optional expiration argument in milliseconds.

function memoize (functor, expiration) {
	var memo = {};
	return function () {
		var key = Array.prototype.join.call(arguments, "§");
		if (key in memo)
			return memo[key];
		if (expiration)
			setTimeout(function () {delete memo[key];}, expiration);
		return memo[key] = functor.apply(this, arguments);
	};
}

This approach allows you to turn any function into a memoizing function. Note that return values are memoized for each set of arguments. However, due to technical constraints it's only reliable when the arguments are arrays or scalar values, but you could easily use e.g. a toJSON method rather than join to serialize objects as part of the cache key (at some additional overhead cost).

You can use the above code like this:

// Make a function which memoizes for 1000 milliseconds at a time
var fn = memoize(function () {
	Array(500000).join("."); // slow
	return true;
}, 1000);

…Or leave out the expiration argument to permanently memoize.

Here are a couple more posts on JavaScript memoization: