Flagrant Badassery

A JavaScript and regular expression centric blog

Regex Recursion (Matching Nested Constructs)

On a regular expression advice forum I visit every once in a while, some dude was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

@resourceType{
	field1 = value,
	field2 = "value in quotation marks",
	field3 = "value in quotation marks, with {brackets} in the value",
	field4 = {brackets}
}

The resident regex experts were quick to claim that regexes are only capable of recursion through the use of “balancing groups,” a feature supported exclusively by .NET (see the chapter on .NET in Mastering Regular Expressions). (Update: Also see my own post on the topic: Fun With .NET Regex Balancing Groups.)

Basically, searches requiring recursion have typically been the domain of more traditional parsers, rather than regexes. The problem in this case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there is a known maximum amount of recursion that needs to be accounted for, it's quite possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):

@[^{]+{(?:[^{}]|{[^{}]*})*}

However, this works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see regular-expressions.info.


Edit: This logic is easy to extend to support more levels of recursion, up to a known maximum. Here's a simple example of matching HTML elements and their contents:

No recursion:
<([a-z\d]+)>.*?</\1>

Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>

Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>

…And so on. Note that the above don't support attributes or singleton (self-closed) elements, but that would make the regexes longer and this is only meant for demonstration purposes.


Edit 2: I've since learned that, in addition to using .NET's balancing groups, true recursion is possible with Perl and PCRE regexes. See the following for more information:

There Are 7 Responses So Far. »

  1. Hey! Love the background and fonts and all! Didn’t know you were so ‘pretty’ :)! And today I learned what a DOM is (Yeah, I’m behind all this stuff! Just starting to catch up.)

    Will keep an eye on your script comments and changes. I’m just getting into PHP and mySQL and server side stuff, basically, throwing myself full-time into learning all these languages and coding.

    Take care!
    PD

  2. I’m seeing a problem in my implementation of this. Firebug gives the following error message:

    cont.match(“[^W]*?</\x01”) has no properties

    So Firefox apparently interprets the backslash in front of \1 as an octal escape…even though there is no bracket wrapping it. Any ideas why this might be happening?

  3. awesome , I had given up thinking this was impossible thanks!

  4. sloneczne noclegi

  5. The so-called resident regex experts misunderstand “regexes are incapable of recursion” which the theory say.

    Regexes can match any fixed patterns (i.e. finite patterns) including any finite level of recursion. Pure regexes cannot match arbitrary level of recursion (i.e. infinite level of recursion). Such infinite things require more than single “memory” (e.g. “balancing groups” i.e. “stack”) or more powerful “memory” (e.g. “counter” which is more than finite state translation).

    You can use regexes to match any fixed patterns because you can hard-code any fixed patterns in regexes at least.

  6. Hello!

    I’ve found a site dedicated to grooming.

    Dog grooming refers to both the hygienic care and cleaning of a dog, as well as a process by which a dog’s physical appearance is enhanced for showing or other types of competition. A dog groomer (or simply “groomer”) is a person who earns their living grooming dogs.

    Grooming is an important part of dog care. Depending on the breed, age, and health of the dog, grooming may be a daily activity. Many breeds require significantly less grooming than this, but regular grooming helps to ensure the dog is healthy and comfortable. It is important to note that while many dogs shed, others (such as the Poodle), do not shed (see Moult) as profusely, and require grooming by a professional every 6–8 weeks maximum.

    Dogs can be bathed by being sprayed with a hand-held shower head, or doused with water from a bucket. Often, one bath will not make a dog truly clean. A second bath is excellent to ensure the entire body has been cleaned. Dogs should be bathed with warm, not hot water, in order to make it a more enjoyable experience. Dogs with a heavy or matted coat should never be bathed without first being completely brushed out or clipped of any mats.

    Psi salon

    It’s written in Polish but still have profoundly nice gallery of dogs and cats.
    They use many products such as ones of Ring 5, 1 All System, Bio-Groom to take care of your pupil.

    Bye to every one!

  7. You are right, recursive regular expressions are also implemented in PCRE, which makes them available in PHP through the preg functions.

    What’s more, PCRE lets you either recurse the entire regex pattern, or a part of the pattern, i.e., the regex contained by a set of parentheses, referenced by its capture group number.

    I saw your “Edit 2” section, here’s a comprehensive resource about recursive regular expressions. But only for PHP.

    Only a matter of time before the feature is implemented in Java.

Post a Response

If you are about to post code, please escape your HTML entities (&amp;, &gt;, &lt;).