Fun With .NET Regex Balancing Groups

Steven Levithan — Wed, 23 Jan 2008 23:59:29 +0000

The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

(?…) — A standard named capturing group. The captured value is pushed on the Name1 CaptureCollection or stack.
(?<-Name1>…) — Pops the top backreference off the Name1 stack. If there are no backreferences on the stack, the match fails, forcing backtracking.
(?-Name1>…) — Pops the top backreference off the Name1 stack, and pushes text matched since the last time Name1 participated on top of the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").

^
(?A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

(?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
(?(Counter)…) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
(?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and numbers.

^
(?=.*[a-z](?)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.

Here's a way to match palindromes (e.g., "redivider"):

(?.)+.?(?<-N>\k)+(?(N)(?!))

In the above regex, \k is a backreference to the last value on the N capture stack.

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?)
	(?-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?)
		(?-C>\w)+ (?(C)(?!))
		(?)
	)?
)+ \b

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

.NET — Flagrant Badassery

Fun With .NET Regex Balancing Groups