On a regular expression advice forum I visit every once in a while, some dude was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).
A single BibTeX entry looks roughly like this:
@resourceType{ field1 = value, field2 = "value in quotation marks", field3 = "value in quotation marks, with {brackets} in the value", field4 = {brackets} }
The resident regex experts were quick to claim that regexes are only capable of recursion through the use of “balancing groups,” a feature supported exclusively by .NET (see the chapter on .NET in Mastering Regular Expressions). (Update: Also see my own post on the topic: Fun With .NET Regex Balancing Groups.)
Basically, searches requiring recursion have typically been the domain of more traditional parsers, rather than regexes. The problem in this case lies in how you distinguish between the last closing bracket of the @resourceType{…}
block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.
Still, given that there is a known maximum amount of recursion that needs to be accounted for, it's quite possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):
@[^{]+{(?:[^{}]|{[^{}]*})*}
However, this works only if:
- braces are always balanced, and
- the level of brace nesting is no more than one.
For those unfamiliar with regular expressions or who are looking for a good tutorial, see regular-expressions.info.
Edit: This logic is easy to extend to support more levels of recursion, up to a known maximum. Here's a simple example of matching HTML elements and their contents:
No recursion:
<([a-z\d]+)>.*?</\1>
Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>
Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>
…And so on. Note that the above don't support attributes or singleton (self-closed) elements, but that would make the regexes longer and this is only meant for demonstration purposes.
Edit 2: I've since learned that, in addition to using .NET's balancing groups, true recursion is possible with Perl and PCRE regexes. See the following for more information:
- Regexp Power, an article by Simon Cozens.
- Perl Regular Expression Mastery: Slide 83, by Mark Jason Dominus.
- PCRE man pages.
Hey! Love the background and fonts and all! Didn’t know you were so ‘pretty’ :)! And today I learned what a DOM is (Yeah, I’m behind all this stuff! Just starting to catch up.)
Will keep an eye on your script comments and changes. I’m just getting into PHP and mySQL and server side stuff, basically, throwing myself full-time into learning all these languages and coding.
Take care!
PD
I’m seeing a problem in my implementation of this. Firebug gives the following error message:
cont.match(“[^W]*?</\x01”) has no properties
So Firefox apparently interprets the backslash in front of \1 as an octal escape…even though there is no bracket wrapping it. Any ideas why this might be happening?
awesome , I had given up thinking this was impossible thanks!
The so-called resident regex experts misunderstand “regexes are incapable of recursion” which the theory say.
Regexes can match any fixed patterns (i.e. finite patterns) including any finite level of recursion. Pure regexes cannot match arbitrary level of recursion (i.e. infinite level of recursion). Such infinite things require more than single “memory” (e.g. “balancing groups” i.e. “stack”) or more powerful “memory” (e.g. “counter” which is more than finite state translation).
You can use regexes to match any fixed patterns because you can hard-code any fixed patterns in regexes at least.
You are right, recursive regular expressions are also implemented in PCRE, which makes them available in PHP through the preg functions.
What’s more, PCRE lets you either recurse the entire regex pattern, or a part of the pattern, i.e., the regex contained by a set of parentheses, referenced by its capture group number.
I saw your “Edit 2” section, here’s a comprehensive resource about recursive regular expressions. But only for PHP.
Only a matter of time before the feature is implemented in Java.