The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.
Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.
(?<Name1>…)
— A standard named capturing group. The captured value is pushed on theName1
CaptureCollection
or stack.(?<-Name1>…)
— Pops the top backreference off theName1
stack. If there are no backreferences on the stack, the match fails, forcing backtracking.(?<Name2-Name1>…)
— Pops the top backreference off theName1
stack, and pushes text matched since the last timeName1
participated on top of theName2
stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.
I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.
First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").
^
(?<Counter>A)+ # For each A, push to the Counter stack
(?<-Counter>B)+ # For each B, pop from the Counter stack
(?(Counter)(?!)) # Fail if there are any values on the Counter stack
$
A few notes about the above regex:
(?<-Counter>B)
causes the match attempt to backtrack or fail if there are no captured values on theCounter
stack. This prevents matching more Bs than As.(?(Counter)…)
is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.(?!)
is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.
Although there's no way to determine the height of the Counter
stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and numbers.
^
(?=.*[a-z](?<N>)|) # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|) # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|) # If 0-9 is found, push to the N stack
(?<-N>){2} # Pop the last two captures off the N stack
.{8,} # Match eight or more characters
Here, by decrementing the height of the N
capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.
Here's a way to match palindromes (e.g., "redivider"):
(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))
In the above regex, \k<N>
is a backreference to the last value on the N
capture stack.
Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.
\(
(?>
[^()]+
|
\( (?<Depth>)
|
\) (?<-Depth>)
)*
(?(Depth)(?!))
\)
Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.
<<
(?>
(?! << | >> ) .
|
<< (?<Depth>)
|
>> (?<-Depth>)
)*
(?(Depth)(?!))
>>
Make sure to use single-line mode (RegexOptions.Singleline
) if you want the dot to match newlines.
Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A
, B
, and C
are not important; the capturing groups are only used to keep count and control the regex's path.
(?:
(?(A)\s|)
(?<B>)
(?<C-B>\w)+ (?(B)(?!))
(?:
\s
(?<C>)
(?<B-C>\w)+ (?(C)(?!))
(?<A>)
)?
)+ \b
Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.
Wow! That is some powerful stuff and I thought I was pretty good at regular expressions already. Thanks for the writeup.
what does the (?> … ) do?
Great article though!
Thanks
@na, it’s an atomic group, which means that backtracking positions within the group are thrown away when the regex exits the group.
It’s great and clearly.
Thanks
Fantastic Post!
Thank you!
Ohhhh
With these balancing groups – its possible to make a Script Parser/validater !!
with the counter, you’ll check for start and ending scope…
See http://www.m-8.dk/resources/RegEx-balancing-group.aspx for a great example of capturing the nested text. For example, for the input:
(7 + (9 * 10))
you can get:
1) 9 * 10
2) 7 + (9 * 10)
Hi,
I am not used to Regular Expression. In your example
^
(?=.*[a-z](?)|) # If a-z is found, push to the N stack
(?=.*[A-Z](?)|) # If A-Z is found, push to the N stack
(?=.*[0-9](?)|) # If 0-9 is found, push to the N stack
(?){2} # Pop the last two captures off the N stack
.{8,} # Match eight or more characters
What does (?=.*[a-z](?)|) do? I cannot understand why (?) is taken after .*[a-z] and why .*[a-z] is used instead of [a-z]. Can you please explain in detail?
Thanks in advanced
This article is great! I was just about to give up using Regex to parse HTML text enclosed by tags when running across this piece. A slight modification of your sample code works like a charm for my need.
Thank you!
Steve, I finally got around to playing with .NET integration in ColdFusion. Not knowing anything about the special features of the .NET regular expression engine, I borrowed your above example – I hope you don’t mind:
http://www.bennadel.com/blog/1932-Using-NET-dotnet-Regular-Expressions-In-ColdFusion.htm
Thanks a lot! As always, your in-depth understanding is inspiring.
@Ben Nadel, very cool. Thanks for the link.
This is exactly what I needed. Balancing groups is the one part of .NET regex I hadn’t used before and this explains exactly what I needed to move my parser forward, thank you!
I discovered this page as well as ‘Expresso’ tonight and used it to learn about “Balancing Groups”.
I used it to work out a moderately complex expression to validate a US phone number with optional parenthesis around the area code. It worked great in all of my Expresso and C# tests. The problem I ran into is when I tried to use the same expression in a javascript function to validate a phone number in a form on a web page. The expression throws a javascript exception in the RegEx constructor. It’s apparently not valid in Javascript. I tried whittling it back down to basics in Javascript and I can’t even figure out how to get the basic named matched sub-expressions working in Javascript. I’m not going to list all the Javascript variants that I tried, but I’ll mention the .Net version that I was able to get working. I was wonderng if you (or anyone) might be up for the challenge of helping me figure out how to translate this to a Javascript equivalent regex or if anyone could direct me to a good regex forum where I can get some help on this.
Here are my test cases and the regex itself.
These values should (and do) match:
“(555) 555-5555”
“(555)-555-5555”
“(555)555-5555”
“555 555-5555”
“555-555-5555”
“”
These values should NOT (and don’t) match:
“(055) 555-5555”
“(155) 555-5555”
“(555) 055-5555”
“(555) 155-5555”
“(555 555-5555”
“555) 555-5555”
“555)555-5555”
“555555-5555”
Here’s the expression:
Note that neither area code (npa) nor the first 3 digit of the local number (nxx) is allowed to start with 0 or 1. That’s the reason for using “[2-9]\d\d” instead of “\d{3}” as in many phone number regex examples.
Note also that the RegEx needs the IgnorePatternWhitespace option as it’s shown here, but it could be reduced to a single line that doesn’t need that option if you’re willing to sacrifice readability.
“^1?
(?:
(?’open’\()?
(?’npa'(?:[2-9]\d\d)?)
(?’close-open’\))?
(?(open)(?!))
(?:
(?:\s)
|
(?:-)
|
(?<=\))
)
(?'nxx'(?:[2-9]\d\d))
–
(?'line'\d{4})
)?
$"
Thanks,
Dave
I had an xml file with certain tag
1800 < 3944877
in this the less than sign is causing problem to read the xml file so I was thinking to use regex to find the less than sign and replace it.
I am using c#
Can any tell me the regex or is there any other way i can do it
I write this :
(<[^]*?>).*(<).*?(<[^]*?>)
but it holds good till each tag is separated with new line.
I had an xml file with certain tag
<xyz> 1800 < 3944877 </xyz>
in this the less than sign is causing problem to read the xml file so I was thinking to use regex to find the less than sign and replace it.
I am using c#
Can any tell me the regex or is there any other way i can do it
I write this :
(<[^]*?>).*(<).*?(<[^]*?>)
but it holds good till each tag is separated with new line.
I took your sample for parsing scoped groups and extended it to match a C-like function body, complete with filtering contents of comments to avoid matching braces inside a comment. It can be found at http://pastebin.com/cZULgAQ8.
Unfortunately it’s a bit trickier to handle the function declaration, where everything must be a certain order and can potentially have inline comments.
Could somebody let me know what the order of this expression to find palindromes would be?
(?.)+.?(?\k)+(?(N)(?!))
Thanks a lot, this is a great post!
Hi Steven,
Thanks for the explanation! I was very excited to find out about this feature.
To try it out, I immediately launched RegexBuddy – one of the companion programs to your Regular Expressions Cookbook – but it didn’t seem to understand the syntax. Is it possible that balancing groups are not supported by the JGS suite?
@AS, the current version of RegexBuddy doesn’t support balancing groups. However, the Expresso regex tester I mentioned in this post is created with .NET and uses .NET regular expressions. It therefore supports balancing groups. http://www.ultrapico.com/Expresso.htm
I am trying to work out the use of balancing groups, practicing with the palindrome RegEx
What I am trying to work out is how to make it match a whole string instead of partial. I have tried putting the caret and dollar anchors everywhere in the expression and cannot make it work.
logically it should be this:
^(?.)+.?(?\k)+(?(N)(?!))$
Where am I going wrong with this?