Matching Innermost HTML Elements

On a regular expression forum I visit every once in awhile, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match <table>…</table>, but should only match the inner table within <table>…<table>…</table>…</table>. This logic needed to support an unlimited amount of nested tables.

One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.

It's true that many regex libraries are incapable of recursion (although even then it's often possible to fake it to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. wink

Here's the solution I offered (though there were a few steps to get there):

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also quite fast, and can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).

To demonstrate, the above regex matches the three highlighted segments of the text below:

<table><td><table><td>&nbsp;</td></table></td></table> <table><tr><td>&nbsp;</td></tr></table><table></table>

In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed, and comparatively slow):

<table>(?:.(?!<table>))*?</table>

(Make sure to turn on the "dot matches newline" modifier with the above, or change the dot to [\S\s].)

Basically, the way that works is it matches an opening <table> tag, then it looks at each following character one at a time, checking if they are followed by another instance of <table> before </table>. If so, the match fails, because it's not an innermost table. (In theory, at least.)

Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening <table>, but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.

To address both of those issues, I used the following regex:

<table>(?:[^<]+|<(?!table>))*?</table>

First, that increases speed (in theory… you'll see that there is now a much bigger issue than before), because within each <table> tag it will greedily jump between all characters which are not < in single steps (using [^<]+), and it will only use the negative lookahead when it encounters <. Secondly, it solves the previously noted error by using <(?!table>) instead of .(?!<table>).

If you're wondering about table tags which contain attributes, that's not a problem. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):

<table\b[^>]*>(?:[^<]+|<(?!table\b[^>]*>))*?</table>

At first I thought this closed the case… The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering catastrophic backtracking. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The addition to the regex is highlighted:

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in a couple milliseconds on my system.

However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, an approach which is not susceptible to catastrophic backtracking would be:

<table\b[^>]*>(?!<table\b[^>]*>)(?:[\S\s](?!<table\b[^>]*>))*?</table>

That runs a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.

If you have a copy of RegexBuddy (and if you don't, I highly recommend it), run these regexes through its debugger for an under-the-hood look at how they're handled by a regex engine.


Edit: Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:

<table\b[^>]*>(?:(?=([^<]+))\1|<(?!table\b[^>]*>))*?</table>

Basically, that uses a capturing group inside a positive lookahead followed by \1 to act just like an atomic group!

Update: See Matching Nested Constructs in JavaScript, Part 2 for a way to match outermost HTML elements using JavaScript.

Regex Recursion (Matching Nested Constructs)

On a regular expression advice forum I visit every once in a while, some dude was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

@resourceType{
	field1 = value,
	field2 = "value in quotation marks",
	field3 = "value in quotation marks, with {brackets} in the value",
	field4 = {brackets}
}

The resident regex experts were quick to claim that regexes are only capable of recursion through the use of “balancing groups,” a feature supported exclusively by .NET (see the chapter on .NET in Mastering Regular Expressions). (Update: Also see my own post on the topic: Fun With .NET Regex Balancing Groups.)

Basically, searches requiring recursion have typically been the domain of more traditional parsers, rather than regexes. The problem in this case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there is a known maximum amount of recursion that needs to be accounted for, it's quite possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):

@[^{]+{(?:[^{}]|{[^{}]*})*}

However, this works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see regular-expressions.info.


Edit: This logic is easy to extend to support more levels of recursion, up to a known maximum. Here's a simple example of matching HTML elements and their contents:

No recursion:
<([a-z\d]+)>.*?</\1>

Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>

Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>

…And so on. Note that the above don't support attributes or singleton (self-closed) elements, but that would make the regexes longer and this is only meant for demonstration purposes.


Edit 2: I've since learned that, in addition to using .NET's balancing groups, true recursion is possible with Perl and PCRE regexes. See the following for more information: