<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Flagrant Badassery &#187; .NET</title>
	<atom:link href="http://blog.stevenlevithan.com/category/dot-net/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.stevenlevithan.com</link>
	<description>A JavaScript and regular expression centric blog</description>
	<lastBuildDate>Mon, 05 Jul 2010 20:27:50 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Fun With .NET Regex Balancing Groups</title>
		<link>http://blog.stevenlevithan.com/archives/balancing-groups</link>
		<comments>http://blog.stevenlevithan.com/archives/balancing-groups#comments</comments>
		<pubDate>Wed, 23 Jan 2008 23:59:29 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[recursion]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/balancing-groups</guid>
		<description><![CDATA[The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief [...]]]></description>
			<content:encoded><![CDATA[<p>The .NET Framework's regular expression package includes a unique feature called balancing groups. The name is a misnomer: although balancing groups can indeed be used to match balanced constructs, that's not all they're good for, and balance has nothing to do with how they work. Unfortunately, they are also quite poorly documented. Following is a brief description of their functionality, but this post mostly focuses on examples of using them in interesting ways.</p>

<p class="stealth">Note: If you're reading this in a feed reader or aggregator, see the <a href="http://blog.stevenlevithan.com/archives/balancing-groups"><strong>original post</strong></a>, which uses regex syntax highlighting to hopefully make things easier to follow.</p>

<ul>
	<li><code class="regex"><b class="g1">(?&lt;Name1&gt;</b>&hellip;<b class="g1">)</b></code> &mdash; A standard named capturing group. The captured value is pushed on the <code>Name1</code> <code><a href="http://msdn2.microsoft.com/en-us/library/system.text.regularexpressions.capturecollection.aspx">CaptureCollection</a></code> or <em>stack</em>.</li>
	<li><code class="regex"><b class="g1">(?&lt;<b>-</b>Name1&gt;</b>&hellip;<b class="g1">)</b></code> &mdash; Pops the most recent capture off the <code>Name1</code> stack. If the stack is empty, the group fails to match, forcing backtracking.</li>
	<li><code class="regex"><b class="g1">(?&lt;Name2<b>-</b>Name1&gt;</b>&hellip;<b class="g1">)</b></code> &mdash; Pops the most recent capture off the <code>Name1</code> stack, and pushes the text between that popped capture and the start of this group onto the <code>Name2</code> stack. I imagine that in most cases where this feature has been used, it has served mostly as a notational convenience.</li>
</ul>
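
<p>To make the stack behavior concrete, here is a minimal C# sketch of the three constructs. The class name, sample inputs, and the small <code>Name1</code>/<code>Name2</code> patterns are made up for illustration; only <code>Regex.Match</code>, <code>Groups</code>, and <code>Captures</code> come from the framework.</p>

<pre class="code"><code>using System;
using System.Text.RegularExpressions;

class StackDemo
{
    static void Main()
    {
        // (?&lt;Name1&gt;a)+ pushes one capture per "a".
        Match pushed = Regex.Match("aaa", @"(?&lt;Name1&gt;a)+");
        Console.WriteLine(pushed.Groups["Name1"].Captures.Count); // 3

        // (?&lt;-Name1&gt;b) pops one of those captures when it matches "b".
        Match popped = Regex.Match("aaab", @"(?&lt;Name1&gt;a)+(?&lt;-Name1&gt;b)");
        Console.WriteLine(popped.Groups["Name1"].Captures.Count); // 2

        // (?&lt;Name2-Name1&gt;\)) pops the "(" capture and pushes the text
        // between it and the ")" onto the Name2 stack.
        Match moved = Regex.Match("(abc)", @"(?&lt;Name1&gt;\()[^()]*(?&lt;Name2-Name1&gt;\))");
        Console.WriteLine(moved.Groups["Name2"].Value); // abc
    }
}
</code></pre>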

<p>I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using <a href="http://www.ultrapico.com/Expresso.htm">Expresso</a> to play around with balancing groups, and here are a few interesting things I've come up with.</p>

<p>First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches one or more <em>A</em>s followed by the same number of <em>B</em>s (e.g., "AAABBB").</p>

<pre class="code"><code class="regex"><b>^</b>
<b class="g1">(?&lt;Counter&gt;</b>A<b class="g1">)+</b>    <span class="comment"># For each A, push to the Counter stack</span>
<b class="g1">(?&lt;<b>-</b>Counter&gt;</b>B<b class="g1">)+</b>   <span class="comment"># For each B, pop from the Counter stack</span>
<b class="g1">(?(Counter)</b><b class="g2">(?!)</b><b class="g1">)</b>  <span class="comment"># Fail if there are any values on the Counter stack</span>
<b>$</b>
</code></pre>

<p>A few notes about the above regex:</p>
<ul>
	<li><code class="regex"><b class="g1">(?&lt;<b>-</b>Counter&gt;</b>B<b class="g1">)</b></code> causes the match attempt to backtrack or fail if there are no captured values on the <code>Counter</code> stack. This prevents matching more <em>B</em>s than <em>A</em>s.</li>
	<li><code class="regex"><b class="g1">(?(Counter)</b>&hellip;<b class="g1">)</b></code> is a <a href="http://www.regular-expressions.info/conditional.html">conditional</a> without an <em>else</em> part. The way it's used here prevents the match from ending with more <em>A</em>s than <em>B</em>s.</li>
	<li><code class="regex"><b class="g2">(?!)</b></code> is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.</li>
</ul>
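
<p>For anyone who wants to try this from code rather than a regex tester, the following C# sketch runs the same pattern against a few inputs. The class name and test strings are my own choices, and <code>RegexOptions.IgnorePatternWhitespace</code> is assumed so that the line breaks and comments in the pattern are ignored.</p>

<pre class="code"><code>using System;
using System.Text.RegularExpressions;

class EqualCounts
{
    static void Main()
    {
        // IgnorePatternWhitespace allows the line breaks and # comments.
        var regex = new Regex(@"^
            (?&lt;Counter&gt;A)+     # For each A, push to the Counter stack
            (?&lt;-Counter&gt;B)+    # For each B, pop from the Counter stack
            (?(Counter)(?!))   # Fail if anything is left on the stack
            $", RegexOptions.IgnorePatternWhitespace);

        foreach (string input in new[] { "AAABBB", "AB", "AAABB", "AABBB" })
        {
            Console.WriteLine($"{input}: {regex.IsMatch(input)}");
        }
        // Prints:
        // AAABBB: True
        // AB: True
        // AAABB: False
        // AABBB: False
    }
}
</code></pre>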

<p>Although there's no way to read the height of a capture stack from within the regex, you can manipulate that height by pushing or popping a set number of captures. To demonstrate, here's a regex designed to match a password which is at least eight characters long and which contains at least <em>two out of three</em> character types from the set of uppercase letters, lowercase letters, and digits.</p>

<pre class="code"><code class="regex"><b>^</b>
<b class="g1">(?=</b><b>.*</b><i>[a<u>-</u>z]</i><b class="g2">(?&lt;N&gt;)</b><b class="g1">|)</b>  <span class="comment"># If a-z is found, push to the N stack</span>
<b class="g1">(?=</b><b>.*</b><i>[A<u>-</u>Z]</i><b class="g2">(?&lt;N&gt;)</b><b class="g1">|)</b>  <span class="comment"># If A-Z is found, push to the N stack</span>
<b class="g1">(?=</b><b>.*</b><i>[0<u>-</u>9]</i><b class="g2">(?&lt;N&gt;)</b><b class="g1">|)</b>  <span class="comment"># If 0-9 is found, push to the N stack</span>
<b class="g1">(?&lt;<b>-</b>N&gt;){2}</b>          <span class="comment"># Pop the last two captures off the N stack</span>
<b>.{8,}</b>               <span class="comment"># Match eight or more characters</span>
</code></pre>

<p>Here, popping two captures off the <code>N</code> stack causes the match to fail unless at least two of the lookaheads pushed to it. Note the empty alternative at the end of each lookahead: it lets the lookahead succeed without pushing anything when its character type is absent, rather than failing the whole match. This kind of <em>x</em> out of <em>y</em> validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd need alternation or conditionals to account for each possible set and ordering of allowed matches.</p>
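
<p>As a rough check of the two-out-of-three rule, this C# sketch feeds the password pattern a handful of sample strings. The class name and the sample passwords are hypothetical, and <code>RegexOptions.IgnorePatternWhitespace</code> is assumed so the pattern can keep its layout and comments.</p>

<pre class="code"><code>using System;
using System.Text.RegularExpressions;

class PasswordCheck
{
    static void Main()
    {
        var regex = new Regex(@"^
            (?=.*[a-z](?&lt;N&gt;)|)  # If a-z is found, push to the N stack
            (?=.*[A-Z](?&lt;N&gt;)|)  # If A-Z is found, push to the N stack
            (?=.*[0-9](?&lt;N&gt;)|)  # If 0-9 is found, push to the N stack
            (?&lt;-N&gt;){2}          # Pop two captures; fails if fewer exist
            .{8,}               # Match eight or more characters
            ", RegexOptions.IgnorePatternWhitespace);

        foreach (string input in new[] { "password1", "Passw0rd", "PASSWORD", "Ab1" })
        {
            Console.WriteLine($"{input}: {regex.IsMatch(input)}");
        }
        // Prints:
        // password1: True   (lowercase + digit)
        // Passw0rd: True    (all three types)
        // PASSWORD: False   (only one type)
        // Ab1: False        (too short)
    }
}
</code></pre>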

<p>Here's a way to match palindromes (e.g., "redivider"):</p>

<p><code class="regex"><b class="g1">(?&lt;N&gt;</b><b>.</b><b class="g1">)+</b><b>.?</b><b class="g1">(?&lt;<b>-</b>N&gt;</b><b>\k&lt;N&gt;</b><b class="g1">)+(?(N)</b><b class="g2">(?!)</b><b class="g1">)</b></code></p>

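<p>In the above regex, <code class="regex"><b>\k&lt;N&gt;</b></code> is a backreference to the most recently pushed value on the <code>N</code> capture stack. Because the pop happens only when the enclosing group finishes matching, each character in the second half is compared against the top of the stack before that value is removed.</p>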
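
<p>The palindrome pattern is unanchored, so on its own it will also find palindromic substrings. To test whole strings, one option is to wrap it in <code>^</code> and <code>$</code>, as in this C# sketch. The anchors, class name, and sample words are my additions rather than part of the original pattern.</p>

<pre class="code"><code>using System;
using System.Text.RegularExpressions;

class PalindromeCheck
{
    static void Main()
    {
        // Same pattern as above, with ^ and $ added to test the whole string.
        var regex = new Regex(@"^(?&lt;N&gt;.)+.?(?&lt;-N&gt;\k&lt;N&gt;)+(?(N)(?!))$");

        foreach (string input in new[] { "redivider", "abba", "abc" })
        {
            Console.WriteLine($"{input}: {regex.IsMatch(input)}");
        }
        // Prints:
        // redivider: True
        // abba: True
        // abc: False
    }
}
</code></pre>

<p>Note that the pattern needs at least one pushed character and one popped character, so a single-character string does not match.</p>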

<p>Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from <a href="http://regex.info/blog/">Jeffrey Friedl's</a> book, <cite><a href="http://regex.info">Mastering Regular Expressions</a></cite>.</p>

<pre class="code"><code class="regex">\(
	<b class="g1">(?&gt;</b>
		<i>[^()]</i><b>+</b>
	<b class="g1">|</b>
		\( <b class="g2">(?&lt;Depth&gt;)</b>
	<b class="g1">|</b>
		\) <b class="g2">(?&lt;<b>-</b>Depth&gt;)</b>
	<b class="g1">)*</b>
	<b class="g1">(?(Depth)</b><b class="g2">(?!)</b><b class="g1">)</b>
\)
</code></pre>
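
<p>Here is a small C# sketch of running that pattern, assuming <code>RegexOptions.IgnorePatternWhitespace</code> so the indentation is ignored. The class name and sample input are invented for the demonstration.</p>

<pre class="code"><code>using System;
using System.Text.RegularExpressions;

class BalancedParens
{
    static void Main()
    {
        var regex = new Regex(@"
            \(
                (?&gt;
                    [^()]+
                |
                    \( (?&lt;Depth&gt;)
                |
                    \) (?&lt;-Depth&gt;)
                )*
                (?(Depth)(?!))
            \)", RegexOptions.IgnorePatternWhitespace);

        // The second "(" group is never closed, so only the first one matches.
        Match match = regex.Match("f(a, (b + c)) + g(d");
        Console.WriteLine(match.Value); // (a, (b + c))
    }
}
</code></pre>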

<p>Here's a simple variation that makes it easy to use multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "&lt;&lt;" to your left delimiter and each "&gt;&gt;" to your right delimiter.</p>

<pre class="code"><code class="regex">&lt;&lt;
	<b class="g1">(?&gt;</b>
		<b class="g2">(?!</b> &lt;&lt; <b class="g2">|</b> &gt;&gt; <b class="g2">)</b> <b>.</b>
	<b class="g1">|</b>
		&lt;&lt; <b class="g2">(?&lt;Depth&gt;)</b>
	<b class="g1">|</b>
		&gt;&gt; <b class="g2">(?&lt;<b>-</b>Depth&gt;)</b>
	<b class="g1">)*</b>
	<b class="g1">(?(Depth)</b><b class="g2">(?!)</b><b class="g1">)</b>
&gt;&gt;
</code></pre>

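<p>Make sure to use single-line mode (<code>RegexOptions.Singleline</code>) if you want the dot to match newlines. The patterns above that are spread across several lines also need <code>RegexOptions.IgnorePatternWhitespace</code>, since they rely on free-spacing and <code>#</code> comments.</p>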
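
<p>If you'd rather not edit the pattern by hand for each delimiter pair, one possible approach is to build it from strings. The helper below is a hypothetical sketch (the name <code>BuildBalanced</code> is mine), and it assumes <code>Regex.Escape</code> is enough to make arbitrary delimiters safe to embed.</p>

<pre class="code"><code>using System;
using System.Text.RegularExpressions;

class DelimiterDemo
{
    // Hypothetical helper: builds the pattern above for arbitrary delimiters.
    static Regex BuildBalanced(string open, string close)
    {
        string o = Regex.Escape(open);
        string c = Regex.Escape(close);
        string pattern =
            o +
            "(?&gt;" +
                "(?!" + o + "|" + c + ")." +   // any character that doesn't start a delimiter
                "|" + o + "(?&lt;Depth&gt;)" +     // nested opener: push
                "|" + c + "(?&lt;-Depth&gt;)" +    // nested closer: pop
            ")*" +
            "(?(Depth)(?!))" +                 // fail if any opener is unclosed
            c;
        return new Regex(pattern, RegexOptions.Singleline);
    }

    static void Main()
    {
        Regex regex = BuildBalanced("&lt;&lt;", "&gt;&gt;");
        Console.WriteLine(regex.Match("x &lt;&lt;a &lt;&lt;b&gt;&gt; c&gt;&gt; y").Value);
        // Prints: &lt;&lt;a &lt;&lt;b&gt;&gt; c&gt;&gt;
    }
}
</code></pre>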

<p>Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by <code>A</code>, <code>B</code>, and <code>C</code> are not important; the capturing groups are only used to keep count and control the regex's path.</p>

<pre class="code"><code class="regex"><b class="g1">(?:</b>
	<b class="g2">(?(A)</b><b>\s</b><b class="g2">|)</b>
	<b class="g2">(?&lt;B&gt;)</b>
	<b class="g2">(?&lt;C<b>-</b>B&gt;</b><b>\w</b><b class="g2">)+</b> <b class="g2">(?(B)</b><b class="g3">(?!)</b><b class="g2">)</b>
	<b class="g2">(?:</b>
		<b>\s</b>
		<b class="g3">(?&lt;C&gt;)</b>
		<b class="g3">(?&lt;B<b>-</b>C&gt;</b><b>\w</b><b class="g3">)+</b> <b class="g3">(?(C)</b><b class="g4">(?!)</b><b class="g3">)</b>
		<b class="g3">(?&lt;A&gt;)</b>
	<b class="g2">)?</b>
<b class="g1">)+</b> <b>\b</b>
</code></pre>

<p>Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/balancing-groups/feed</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>
