Fun With .NET Regex Balancing Groups

The .NET Framework's regular expression package includes a unique feature called balancing groups. The name is a misnomer: although they can indeed be used to match balanced constructs, that's not all they're good for, and balancing has little to do with how they actually work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

(?<Name1>…): A standard named capturing group. The captured value is pushed onto the Name1 capture stack (exposed in .NET as a CaptureCollection).
(?<-Name1>…): Pops the top capture off the Name1 stack. If the stack is empty, the group fails to match, forcing the regex to backtrack.
(?<Name2-Name1>…): Pops the top capture off the Name1 stack, and pushes the text between that popped capture and the current group onto the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.
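To make the push and pop behavior concrete, here is a minimal C# sketch. The group names Open and Content, the variable names, and the sample input are illustrative assumptions, not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

// Open captures "(". The group (?<Content-Open>\)) then pops Open and pushes
// the text between that "(" and the closing ")" onto the Content stack.
var pairRegex = new Regex(@"(?<Open>\()[^()]*(?<Content-Open>\))");
Match pairMatch = pairRegex.Match("x (abc) y");

Console.WriteLine(pairMatch.Value);                    // (abc)
Console.WriteLine(pairMatch.Groups["Content"].Value);  // abc
Console.WriteLine(pairMatch.Groups["Open"].Success);   // False, since Open was popped
```

After the match, the Open stack is empty again, which is why its Success property reports False even though the group participated.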

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

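First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches one or more As followed by the same number of Bs (e.g., "AAABBB"). Like the other multi-line patterns in this post, it's written in free-spacing style, so it needs the RegexOptions.IgnorePatternWhitespace option.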

^
(?<Counter>A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

(?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
(?(Counter)…) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
(?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.
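For readers who want to try it, the regex above can be run from C# roughly as follows. This is only a sketch: the variable name equalCounts and the test strings are my own choices.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern is copied from the post; the free-spacing layout and the
// # comments require RegexOptions.IgnorePatternWhitespace.
var equalCounts = new Regex(@"
    ^
    (?<Counter>A)+    # For each A, push to the Counter stack
    (?<-Counter>B)+   # For each B, pop from the Counter stack
    (?(Counter)(?!))  # Fail if there are any values on the Counter stack
    $",
    RegexOptions.IgnorePatternWhitespace);

// Hypothetical inputs: two balanced, one with too few Bs, one with too many.
foreach (string input in new[] { "AAABBB", "AB", "AAABB", "AABBB" })
{
    Console.WriteLine($"{input}: {equalCounts.IsMatch(input)}");
}
```

Only the first two inputs should report True: "AAABB" leaves a value on the Counter stack, and "AABBB" runs out of values to pop.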

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and digits.

^
(?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if that height has not already reached at least two. Note that there's an empty alternative at the end of each lookahead, which lets the lookahead succeed without pushing anything when its character type is missing, rather than failing the whole match. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.
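A small C# harness for the password regex might look like the following. It is a sketch under the assumption that the pattern is used exactly as written above; the variable name and sample passwords are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the post; needs IgnorePatternWhitespace for the layout.
var passwordRule = new Regex(@"
    ^
    (?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
    (?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
    (?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
    (?<-N>){2}          # Pop the last two captures off the N stack
    .{8,}               # Match eight or more characters
    ", RegexOptions.IgnorePatternWhitespace);

// Hypothetical sample passwords covering each outcome.
string[] samples = { "abcdefgh", "abcdefg1", "Abc1", "ABCDEFGH1" };
foreach (string sample in samples)
{
    Console.WriteLine($"{sample}: {passwordRule.IsMatch(sample)}");
}
```

"abcdefgh" has only one character type, so the second pop fails; "Abc1" has enough types but is too short for .{8,}.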

Here's a way to match palindromes (e.g., "redivider"):

(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))

In the above regex, \k<N> is a backreference to the last value on the N capture stack.
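Here is a hedged C# sketch of the palindrome regex in use. Note that the ^ and $ anchors are my addition, so that the whole string is tested; without them the regex matches any palindromic substring (for example, "ege" inside "regex"). The variable name and inputs are also illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as in the post, wrapped in ^...$ (an assumption of this sketch)
// so that the entire input has to be a palindrome.
var wholePalindrome = new Regex(@"^(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))$");

foreach (string word in new[] { "redivider", "abba", "regex" })
{
    Console.WriteLine($"{word}: {wholePalindrome.IsMatch(word)}");
}
```

Because both the push and the pop loops must run at least once, a single-character string does not match.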

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?<Depth>)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)
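As a rough illustration, the pattern can be applied from C# like this. The variable name and the sample expression are assumptions made for the sketch.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the post (tabs replaced by spaces); the layout relies on
// RegexOptions.IgnorePatternWhitespace.
var balancedParens = new Regex(@"
    \(
        (?>
            [^()]+
        |
            \( (?<Depth>)
        |
            \) (?<-Depth>)
        )*
        (?(Depth)(?!))
    \)", RegexOptions.IgnorePatternWhitespace);

// Hypothetical input with one nested and one flat parenthesized group.
foreach (Match m in balancedParens.Matches("f(a(b)c) + g(d)"))
{
    Console.WriteLine(m.Value);
}
```

With this input the loop should print the two outermost balanced groups, "(a(b)c)" and "(d)". For an unbalanced input such as "((a)", the regex skips the unmatched opening parenthesis and matches the inner "(a)".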

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?<Depth>)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.
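A sketch of the multi-character variant in C#, combining both options so that the dot can cross line breaks. The variable name and the input text are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the post; Singleline lets the dot match newlines, and
// IgnorePatternWhitespace allows the spaced-out layout.
var balancedDelims = new Regex(@"
    <<
        (?>
            (?! << | >> ) .
        |
            << (?<Depth>)
        |
            >> (?<-Depth>)
        )*
        (?(Depth)(?!))
    >>", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

// Hypothetical input whose delimited section spans two lines.
Match delimMatch = balancedDelims.Match("a <<b\n<<c>> d>> e");
Console.WriteLine(delimMatch.Value);
```

Here the match should run from the first "<<" to the final ">>", including the newline and the nested "<<c>>".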

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?<B>)
	(?<C-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?<C>)
		(?<B-C>\w)+ (?(C)(?!))
		(?<A>)
	)?
)+ \b
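If you'd like to experiment before working out the puzzle, a minimal C# harness along these lines should do. The variable name and the second test string are my own assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the post (tabs replaced by spaces); requires
// RegexOptions.IgnorePatternWhitespace.
var growingWords = new Regex(@"
    (?:
        (?(A)\s|)
        (?<B>)
        (?<C-B>\w)+ (?(B)(?!))
        (?:
            \s
            (?<C>)
            (?<B-C>\w)+ (?(C)(?!))
            (?<A>)
        )?
    )+ \b", RegexOptions.IgnorePatternWhitespace);

// The first input is the post's example; the second is a hypothetical
// counterexample in which the second word fails to grow.
Console.WriteLine(growingWords.Match("abc abcd abcde abcdef").Value);
Console.WriteLine(growingWords.Match("abc abc").Value);
```

Since the pattern is unanchored, a sequence that stops growing still yields a shorter match (just the first word in the second case) rather than no match at all.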

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

32 thoughts on “Fun With .NET Regex Balancing Groups”

Bill Brown says:

February 4, 2008 at 10:34 am

Wow! That is some powerful stuff and I thought I was pretty good at regular expressions already. Thanks for the writeup.
na says:

April 7, 2008 at 11:04 pm

what does the (?> … ) do?
Great article though!
Thanks
Steve says:

April 8, 2008 at 3:18 pm

@na, it’s an atomic group, which means that backtracking positions within the group are thrown away when the regex exits the group.
M-Dayyan says:

May 20, 2008 at 6:07 am

It’s great and clearly.
Thanks
c-dos says:

June 29, 2008 at 4:57 pm

Fantastic Post!
Thank you!
Martin Kirk says:

August 6, 2008 at 1:54 am

Ohhhh

With these balancing groups – its possible to make a Script Parser/validater !!

with the counter, you’ll check for start and ending scope…
JD Bell says:

February 24, 2009 at 9:10 am

See http://www.m-8.dk/resources/RegEx-balancing-group.aspx for a great example of capturing the nested text. For example, for the input:

(7 + (9 * 10))

you can get:

1) 9 * 10
2) 7 + (9 * 10)
Parth Patel says:

October 12, 2009 at 7:02 am

Hi,

I am not used to Regular Expression. In your example

^
(?=.*[a-z](?<N>)|) # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|) # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|) # If 0-9 is found, push to the N stack
(?<-N>){2} # Pop the last two captures off the N stack
.{8,} # Match eight or more characters

What does (?=.*[a-z](?<N>)|) do? I cannot understand why (?<N>) is taken after .*[a-z] and why .*[a-z] is used instead of [a-z]. Can you please explain in detail?

Thanks in advanced
Hong says:

April 1, 2010 at 4:55 pm

This article is great! I was just about to give up using Regex to parse HTML text enclosed by tags when running across this piece. A slight modification of your sample code works like a charm for my need.

Thank you!
Ben Nadel says:

May 21, 2010 at 8:28 am

Steve, I finally got around to playing with .NET integration in ColdFusion. Not knowing anything about the special features of the .NET regular expression engine, I borrowed your above example – I hope you don’t mind:

http://www.bennadel.com/blog/1932-Using-NET-dotnet-Regular-Expressions-In-ColdFusion.htm

Thanks a lot! As always, your in-depth understanding is inspiring.
Steven Levithan says:

May 24, 2010 at 8:40 am

@Ben Nadel, very cool. Thanks for the link.
stoive says:

September 20, 2010 at 2:07 am

This is exactly what I needed. Balancing groups is the one part of .NET regex I hadn’t used before and this explains exactly what I needed to move my parser forward, thank you!
Dave Hall says:

February 10, 2011 at 12:42 am

I discovered this page as well as ‘Expresso’ tonight and used it to learn about “Balancing Groups”.

I used it to work out a moderately complex expression to validate a US phone number with optional parenthesis around the area code. It worked great in all of my Expresso and C# tests. The problem I ran into is when I tried to use the same expression in a javascript function to validate a phone number in a form on a web page. The expression throws a javascript exception in the RegEx constructor. It’s apparently not valid in Javascript. I tried whittling it back down to basics in Javascript and I can’t even figure out how to get the basic named matched sub-expressions working in Javascript. I’m not going to list all the Javascript variants that I tried, but I’ll mention the .Net version that I was able to get working. I was wonderng if you (or anyone) might be up for the challenge of helping me figure out how to translate this to a Javascript equivalent regex or if anyone could direct me to a good regex forum where I can get some help on this.

Here are my test cases and the regex itself.

These values should (and do) match:
“(555) 555-5555”
“(555)-555-5555”
“(555)555-5555”
“555 555-5555”
“555-555-5555”
“”

These values should NOT (and don’t) match:
“(055) 555-5555”
“(155) 555-5555”
“(555) 055-5555”
“(555) 155-5555”
“(555 555-5555”
“555) 555-5555”
“555)555-5555”
“555555-5555”

Here’s the expression:
Note that neither area code (npa) nor the first 3 digit of the local number (nxx) is allowed to start with 0 or 1. That’s the reason for using “[2-9]\d\d” instead of “\d{3}” as in many phone number regex examples.

Note also that the RegEx needs the IgnorePatternWhitespace option as it’s shown here, but it could be reduced to a single line that doesn’t need that option if you’re willing to sacrifice readability.

"^1?
(?:
(?'open'$)?
(?'npa'(?:[2-9]\d\d)?)
(?'close-open'$)?
(?(open)(?!))
(?:
(?:\s)
|
(?:-)
|
(?<=\))
)
(?'nxx'(?:[2-9]\d\d))
-
(?'line'\d{4})
)?
$"

Thanks,
Dave
Anuj says:

August 8, 2011 at 8:46 pm

I had an xml file with certain tag

<xyz> 1800 < 3944877 </xyz>

in this the less than sign is causing problem to read the xml file so I was thinking to use regex to find the less than sign and replace it.
I am using c#

Can any tell me the regex or is there any other way i can do it

I write this :
(<[^]*?>).*(<).*?(<[^]*?>)
but it holds good till each tag is separated with new line.
default_ex says:

March 31, 2012 at 2:56 pm

I took your sample for parsing scoped groups and extended it to match a C-like function body, complete with filtering contents of comments to avoid matching braces inside a comment. It can be found at http://pastebin.com/cZULgAQ8.

Unfortunately it’s a bit trickier to handle the function declaration, where everything must be a certain order and can potentially have inline comments.
Bharath says:

May 16, 2012 at 8:03 am

Could somebody let me know what the order of this expression to find palindromes would be?
(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))
Drake says:

June 6, 2012 at 2:34 am

Thanks a lot, this is a great post!
AS says:

July 29, 2012 at 9:59 am

Hi Steven,
Thanks for the explanation! I was very excited to find out about this feature.
To try it out, I immediately launched RegexBuddy – one of the companion programs to your Regular Expressions Cookbook – but it didn’t seem to understand the syntax. Is it possible that balancing groups are not supported by the JGS suite?
Steven Levithan says:

July 30, 2012 at 11:44 am

@AS, the current version of RegexBuddy doesn’t support balancing groups. However, the Expresso regex tester I mentioned in this post is created with .NET and uses .NET regular expressions. It therefore supports balancing groups. http://www.ultrapico.com/Expresso.htm
Dave says:

July 24, 2013 at 11:57 am

I am trying to work out the use of balancing groups, practicing with the palindrome RegEx

What I am trying to work out is how to make it match a whole string instead of partial. I have tried putting the caret and dollar anchors everywhere in the expression and cannot make it work.

logically it should be this:
^(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))$

Where am I going wrong with this?