Regex Recursion (Matching Nested Constructs)

On a regular expression advice forum I visit every once in a while, some dude was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

@resourceType{
	field1 = value,
	field2 = "value in quotation marks",
	field3 = "value in quotation marks, with {brackets} in the value",
	field4 = {brackets}
}

The resident regex experts were quick to claim that regexes are only capable of recursion through the use of “balancing groups,” a feature supported exclusively by .NET (see the chapter on .NET in Mastering Regular Expressions). (Update: Also see my own post on the topic: Fun With .NET Regex Balancing Groups.)

Basically, searches requiring recursion have typically been the domain of more traditional parsers, rather than regexes. The problem in this case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there is a known maximum amount of recursion that needs to be accounted for, it's quite possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):

@[^{]+{(?:[^{}]|{[^{}]*})*}

However, this works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see regular-expressions.info.


Edit: This logic is easy to extend to support more levels of recursion, up to a known maximum. Here's a simple example of matching HTML elements and their contents:

No recursion:
<([a-z\d]+)>.*?</\1>

Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>

Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>

…And so on. Note that the above don't support attributes or singleton (self-closed) elements, but that would make the regexes longer and this is only meant for demonstration purposes.


Edit 2: I've since learned that, in addition to using .NET's balancing groups, true recursion is possible with Perl and PCRE regexes. See the following for more information: