Regex Recursion (Matching Nested Constructs)
Some dude posed the following problem on a regex advice forum I visit every once in a while…
He was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).
A single BibTeX entry looks roughly like this:
@resourceType{
field1 = value,
field2 = "value in quotation marks",
field3 = "value in quotation marks, with {brackets} in the value",
field4 = {brackets}
}
The resident regex experts were quick to claim that regexes are only capable of recursion through the use of “balancing groups,” a feature supported exclusively by .NET (see the chapter on .NET in Mastering Regular Expressions).
Basically, searches requiring recursion have typically been the domain of more traditional parsers, rather than regexes. The problem in this particular case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.
Still, given that there is a known maximum amount of recursion that needs to be accounted for, it's quite possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):
@[^{]+{(?:[^{}]|{[^{}]*})*}
However, this works only if:
- braces are always balanced, and
- the level of brace nesting is no more than one.
For those unfamiliar with regular expressions or who are looking for a good tutorial, see www.regular-expressions.info.
And feel free to post your own regex problems in the comments if you think I might be able to help.
Edit: This logic is easy to extend to support more levels of recursion, up to a known maximum. Here's a simple example of matching HTML elements and their contents:
No recursion:
<([a-z\d]+)>.*?</\1>
Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>
Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>
…And so on. Note that the above don't support attributes or singleton (self-closed) elements, but that would make the regexes longer and this is only meant for demonstration purposes.
Edit 2: I've since learned that, in addition to using .NET's balancing groups, true recursion is possible in Perl and PCRE. See the following for more information:
- Regexp Power, an article by Simon Cozens.
- Perl Regular Expression Mastery: Slide 83, by Mark Jason Dominus.
- PCRE man pages.

Comment by Porceleindoll on 26 March 2006:
Hey! Love the background and fonts and all! Didn’t know you were so ‘pretty’ :)! And today I learned what a DOM is (Yeah, I’m behind all this stuff! Just starting to catch up.)
Will keep an eye on your script comments and changes. I’m just getting into PHP and mySQL and server side stuff, basically, throwing myself full-time into learning all these languages and coding.
Take care!
PD