Fun With .NET Regex Balancing Groups

The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

  • (?<Name1>) — A standard named capturing group. The captured value is pushed on the Name1 CaptureCollection or stack.
  • (?<-Name1>) — Pops the top backreference off the Name1 stack. If there are no backreferences on the stack, the match fails, forcing backtracking.
  • (?<Name2-Name1>) — Pops the top backreference off the Name1 stack, and pushes text matched since the last time Name1 participated on top of the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").

^
(?<Counter>A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

  • (?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
  • (?(Counter)) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
  • (?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and numbers.

^
(?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.

Here's a way to match palindromes (e.g., "redivider"):

(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))

In the above regex, \k<N> is a backreference to the last value on the N capture stack.

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?<Depth>)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?<Depth>)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?<B>)
	(?<C-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?<C>)
		(?<B-C>\w)+ (?(C)(?!))
		(?<A>)
	)?
)+ \b

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

32 thoughts on “Fun With .NET Regex Balancing Groups”

  1. Ohhhh

    With these balancing groups – its possible to make a Script Parser/validater !!

    with the counter, you’ll check for start and ending scope…

  2. Hi,

    I am not used to Regular Expression. In your example

    ^
    (?=.*[a-z](?)|) # If a-z is found, push to the N stack
    (?=.*[A-Z](?)|) # If A-Z is found, push to the N stack
    (?=.*[0-9](?)|) # If 0-9 is found, push to the N stack
    (?){2} # Pop the last two captures off the N stack
    .{8,} # Match eight or more characters

    What does (?=.*[a-z](?)|) do? I cannot understand why (?) is taken after .*[a-z] and why .*[a-z] is used instead of [a-z]. Can you please explain in detail?

    Thanks in advanced

  3. This article is great! I was just about to give up using Regex to parse HTML text enclosed by tags when running across this piece. A slight modification of your sample code works like a charm for my need.

    Thank you!

  4. This is exactly what I needed. Balancing groups is the one part of .NET regex I hadn’t used before and this explains exactly what I needed to move my parser forward, thank you!

  5. I discovered this page as well as ‘Expresso’ tonight and used it to learn about “Balancing Groups”.

    I used it to work out a moderately complex expression to validate a US phone number with optional parenthesis around the area code. It worked great in all of my Expresso and C# tests. The problem I ran into is when I tried to use the same expression in a javascript function to validate a phone number in a form on a web page. The expression throws a javascript exception in the RegEx constructor. It’s apparently not valid in Javascript. I tried whittling it back down to basics in Javascript and I can’t even figure out how to get the basic named matched sub-expressions working in Javascript. I’m not going to list all the Javascript variants that I tried, but I’ll mention the .Net version that I was able to get working. I was wonderng if you (or anyone) might be up for the challenge of helping me figure out how to translate this to a Javascript equivalent regex or if anyone could direct me to a good regex forum where I can get some help on this.

    Here are my test cases and the regex itself.

    These values should (and do) match:
    “(555) 555-5555”
    “(555)-555-5555”
    “(555)555-5555”
    “555 555-5555”
    “555-555-5555”
    “”

    These values should NOT (and don’t) match:
    “(055) 555-5555”
    “(155) 555-5555”
    “(555) 055-5555”
    “(555) 155-5555”
    “(555 555-5555”
    “555) 555-5555”
    “555)555-5555”
    “555555-5555”

    Here’s the expression:
    Note that neither area code (npa) nor the first 3 digit of the local number (nxx) is allowed to start with 0 or 1. That’s the reason for using “[2-9]\d\d” instead of “\d{3}” as in many phone number regex examples.

    Note also that the RegEx needs the IgnorePatternWhitespace option as it’s shown here, but it could be reduced to a single line that doesn’t need that option if you’re willing to sacrifice readability.

    “^1?
    (?:
    (?’open’\()?
    (?’npa'(?:[2-9]\d\d)?)
    (?’close-open’\))?
    (?(open)(?!))
    (?:
    (?:\s)
    |
    (?:-)
    |
    (?<=\))
    )
    (?'nxx'(?:[2-9]\d\d))

    (?'line'\d{4})
    )?
    $"

    Thanks,
    Dave

  6. I had an xml file with certain tag

    1800 < 3944877

    in this the less than sign is causing problem to read the xml file so I was thinking to use regex to find the less than sign and replace it.
    I am using c#

    Can any tell me the regex or is there any other way i can do it

    I write this :
    (<[^]*?>).*(<).*?(<[^]*?>)
    but it holds good till each tag is separated with new line.

  7. I had an xml file with certain tag

    <xyz> 1800 < 3944877 </xyz>

    in this the less than sign is causing problem to read the xml file so I was thinking to use regex to find the less than sign and replace it.
    I am using c#

    Can any tell me the regex or is there any other way i can do it

    I write this :
    (<[^]*?>).*(<).*?(<[^]*?>)
    but it holds good till each tag is separated with new line.

  8. I took your sample for parsing scoped groups and extended it to match a C-like function body, complete with filtering contents of comments to avoid matching braces inside a comment. It can be found at http://pastebin.com/cZULgAQ8.

    Unfortunately it’s a bit trickier to handle the function declaration, where everything must be a certain order and can potentially have inline comments.

  9. Could somebody let me know what the order of this expression to find palindromes would be?
    (?.)+.?(?\k)+(?(N)(?!))

  10. Hi Steven,
    Thanks for the explanation! I was very excited to find out about this feature.
    To try it out, I immediately launched RegexBuddy – one of the companion programs to your Regular Expressions Cookbook – but it didn’t seem to understand the syntax. Is it possible that balancing groups are not supported by the JGS suite?

  11. I am trying to work out the use of balancing groups, practicing with the palindrome RegEx

    What I am trying to work out is how to make it match a whole string instead of partial. I have tried putting the caret and dollar anchors everywhere in the expression and cannot make it work.

    logically it should be this:
    ^(?.)+.?(?\k)+(?(N)(?!))$

    Where am I going wrong with this?

Leave a Reply

Your email address will not be published. Required fields are marked *