Matching Innermost HTML Elements

On a regular expression forum I visit every once in awhile, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match <table>…</table>, but should only match the inner table within <table>…<table>…</table>…</table>. This logic needed to support an unlimited amount of nested tables.

One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.

It's true that many regex libraries are incapable of recursion (although even then it's often possible to fake it to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. wink

Here's the solution I offered (though there were a few steps to get there):

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also quite fast, and can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).

To demonstrate, the above regex matches the three highlighted segments of the text below:

<table><td><table><td>&nbsp;</td></table></td></table> <table><tr><td>&nbsp;</td></tr></table><table></table>

In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed, and comparatively slow):

<table>(?:.(?!<table>))*?</table>

(Make sure to turn on the "dot matches newline" modifier with the above, or change the dot to [\S\s].)

Basically, the way that works is it matches an opening <table> tag, then it looks at each following character one at a time, checking if they are followed by another instance of <table> before </table>. If so, the match fails, because it's not an innermost table. (In theory, at least.)

Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening <table>, but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.

To address both of those issues, I used the following regex:

<table>(?:[^<]+|<(?!table>))*?</table>

First, that increases speed (in theory… you'll see that there is now a much bigger issue than before), because within each <table> tag it will greedily jump between all characters which are not < in single steps (using [^<]+), and it will only use the negative lookahead when it encounters <. Secondly, it solves the previously noted error by using <(?!table>) instead of .(?!<table>).

If you're wondering about table tags which contain attributes, that's not a problem. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):

<table\b[^>]*>(?:[^<]+|<(?!table\b[^>]*>))*?</table>

At first I thought this closed the case… The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering catastrophic backtracking. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The addition to the regex is highlighted:

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in a couple milliseconds on my system.

However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, an approach which is not susceptible to catastrophic backtracking would be:

<table\b[^>]*>(?!<table\b[^>]*>)(?:[\S\s](?!<table\b[^>]*>))*?</table>

That runs a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.

If you have a copy of RegexBuddy (and if you don't, I highly recommend it), run these regexes through its debugger for an under-the-hood look at how they're handled by a regex engine.


Edit: Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:

<table\b[^>]*>(?:(?=([^<]+))\1|<(?!table\b[^>]*>))*?</table>

Basically, that uses a capturing group inside a positive lookahead followed by \1 to act just like an atomic group!

Update: See Matching Nested Constructs in JavaScript, Part 2 for a way to match outermost HTML elements using JavaScript.

45 thoughts on “Matching Innermost HTML Elements”

  1. Hi there. This is very fantastic! I was looking all around to find this!

    But I’m still searching for an expression matching outermost tags. I was not able to modify your regex, because I’m not that genius 😉

    Maybe you have any idea?

    Greetings from Germany,

    Mazze

  2. Hi Mazze, and thanks. To match outermost HTML elements using a regex, you need to be able to handle recursion. Currently, only .NET, Perl, and PCRE (and hence the PHP preg functions) support recursion, all in their own ways. However, even if you are not using one of those three regex engines, it is possible to handle recursion up to a known maximum level. See the following post for more information, including examples for matching HTML elements:

    https://blog.stevenlevithan.com/archives/regex-recursion

  3. Hi Steve, thank you for answering and sorry for disturbing again. I’m working on php (PRCE) and i found a way to match recursive Patterns:

    $pattern = ‘#\( ( (?>[^()]+) | (?R) )* \)#x’;

    The (?R) is the recursion. This works very fine for finding words in nested and balanced brackets (). But i cant transform it to my problem, because i dont know how to handle the “excluvise part” (?>[^()]+) because i can only exclude single characters not a whole “tag” like this: [^]. (just example – i know it cant work like i want ;-)) Do you have any idea how i could manage this?

    Thank you and cheers,

    Mazze

  4. Mazze, I haven’t used PCRE’s recursion constructs in the past (I don’t know PHP), but according to my understanding of how they work, this should do the trick:

    <table\b[^>]*>(?>(?:[^<]++|<(?!\/?table\b[^>]*>))+|(?R))*<\/table>

    I’ve only briefly tested it at http://www.supercrumbly.com/assets/html/phpregextester/ but it indeed appears to work quite nicely, matching all outermost table elements.

    The pattern could be written more simply as <table\b[^>]*>(?>(?!<\/?table\b[^>]*>).|(?R))*<\/table>, but that would be less efficient.

  5. Hi Steve,

    Someone at houseoffusion.com mentioned your name as the guy with all regex solution in a thread I posted there (http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:52575). I’m new to the world of Regex please my query below, I’ll be grateful if you can be of any help;

    I have a body of text, within that text are random <ul> tags preceded by <p>text</p>, example below;

    <p>Body of text loads of it, sometimes just a one liner and it ends here</p>
    <ul>
    <li>blurb of text one</li>
    <li>blurb of text two</li>
    </ul>

    What I want to be able to do is wherever i have a preceding body of text before any set of <ul> tags, the <p> tag must go round the <ul> tags. Is this at all possible? I bumped into this problem about a week and someone told me regex is the way to go or should I be looking at something totally different?

    Many thanks,
    Jide.

  6. Jide, a very simplistic approach would be to replace (<p\b[^>]*>.*?)</p>(\s*<ul>.*?</ul>) with \1\2</p> (I’m using \ instead of $ for backreferences in the replacement string since I’m assuming you’re using ColdFusion).

    However, what you are doing is taking valid HTML and replacing it with something invalid. If you are doing this to adjust spacing or what have you, that should be done using CSS instead.

  7. Hi Steve

    You’ve provided a rare solution to a common problem for PHP coders and I’m wondering why it isn’t all over the internet. My most common use of it is to pull a specific DIV (with all its contents) from an HTML page. The regex for this–a slight adaptation of your own excellent regex, is:

    $regex = "/<div\b[^>]*>(?>(?:[^<]++|<(?!\/?div\b[^>]*>))+|(?R))*<\/div>/is";

    Note that in PHP the regex has to be enclosed within delimiters–“/” in this case, which means that the forward slash in that final /div has to be escaped. I’ve also added modifiers i and s to make the regex caseless and to execute across newlines.

    To pull a specific DIV (eg ) out of a page is tricky. You might be able to do this in a regex but it can also be done somewhat ponderously using recursion in PHP. Since the result of the code above always starts with a full DIV tag and ends in a you can examine the starting tag; if it isn’t the div you want, strip off the start and end tags, regex the new contents and go round until you find your inner DIV.

  8. Hi John, and thanks! A few notes about your modification:
    1. I’d already escaped the two “/” characters used for closing tags in my table-element-based example for PCRE which you modified.
    2. The “s” modifier is unnecessary here. It simply changes the meaning of the dot metacharacter from “any character except newline” to “any character.” Since there are no dot characters in the regex, the “s” modifier has no impact.
    3. Based on what I know about PHP string literals, I would generally recommend using single-quoted strings when constructing regexes in PHP, unless you need the features which PHP double-quoted string literals provide (variable interpolation, a greater number of string metasequences, etc.). With this regex it probably doesn’t make a difference, but with some it would result in fewer characters needing to be escaped.

    Pulling a specific div element out of a string is a similar affair. Here’s an example:

    $regex = '/<div\b[^>]+?\bid\s*=\s*"MyID"[^>]*>(?:((?:[^<]++|<(?!\/?div\b[^>]*>))+)|(<div\b[^>]*>(?>(?1)|(?2))*<\/div>))?<\/div>/i';

    With the above, I’ve used PCRE’s “subpatterns as subroutines” feature (to reduce redundancy), in addition to recursion. It should match the entire contents of <div> elements with the idMyID” (I’ve tested it at http://www.lumadis.be/regex/test_regex.php).

    Note that there are a few issues with the above regex: it does not support single quotes or no quotes around the value for the id attribute, unencoded “>” characters within attribute values (which would make the HTML/XHTML invalid), or singleton (self-closed) div elements. However, it should be okay for most people’s needs. The issues I just mentioned would not be very hard to handle properly, but since they’d add to the length of the regex I tried to keep it simple.

  9. Great article!!
    I’m pretty new to regex, and stacked at my first task very badly, could anyone help me out please? I want to extract the data from a “real” html table’s rows. It should take all html table code (beginning with , ending with ), with all possible tag properties…
    Is it possible to make it so that the captured groups will be the results (basically the text between the td tags)? Would be a nice/clean solution but I just couldn’t…

    Many thanks in advance!!!
    N

  10. The thing is: You just can’t do this kind of processing with regular expressions with real world data. What this regexp doesn’t handle is attributes that just might contain HTML, for eg. <table title=”></table>”><tr><td> </td></tr></table>

    Of course if you can guarantee that there is no such HTML in the attributes then it’s fine. Otherwise you just can’t get around doing a proper HTML parser.

    I’m doing this for a CMS, where the user can enter title attributes for images, and said user might just enter previously mentioned HTML, and it will blow up.

    Otherwise, great article!

    -Timo

  11. @Timo:

    “You just can’t do this kind of processing with regular expressions with real world data.”

    That’s not the case, although it’s true that it can be a significant challenge if your goal is to emulate the flexibility of browsers at handling invalid markup.

    “What this regexp doesn’t handle is attributes that just might contain HTML”

    Yeah, I intentionally avoiding dealing with edge-case invalid HTML to keep the regular expressions easier to follow. Although it would be ideal if your CMS correctly encoded HTML entities in attribute values, the invalid case can be handled. A simple, non-optimized approach (which doesn’t deal with any other potential issues surrounding invalid markup) is simply changing the two instances of table\b[^>]*> to table\b(?:[^"'>]|"[^"]*"|'[^']*')*>

    Note that even “proper HTML parsers” are not necessarily idiot-proof when it comes to handling invalid markup.

  12. Hi, Thank you Steve for your sugguestion. The original code you provived above, however, dosen’t work for my following text. I made some small modifications on it and it run well. I tried the text with both pieces of our codes at http://www.lumadis.be/regex/test_regex.php , and they proved my findins.

    Please help me to check them and find out why, if you have the time. Thanks 🙂

    Jim

    ———————————————the html text
    <div class="outside"> <div id="nowhere"></div>
    <div id="MyID">
    starting…
        <div class="1">
            <div class="2">    <a href="http://www.abc.com">before div</a>
                <div class="3">andy        <font size="1">inside div</font></div>
                    <a href="#">betwen div</a>
                <div class="3">One of the deepest level</div>
            </div>    
            <table>
                <tr>
                    <td> This is a table</td>
                </tr>
            </table>
            <div class="2">sandy</div>
            note: <strong>Somthing after a div</strong>
        </div>
        Eeg
        <div class="1">Hello, I am here</div>
        Apple …
    </Div>
    Out of the wolrd
    </div>
    ———————————————

    your original code:
    #<div\b[^>]+?\bid\s*=\s*"MyID"[^>]*>(?:((?:[^<]++|<(?!\/?div\b[^>]*>))+)|(<div\b[^>]*>(?>(?1)|(?2))*<\/div>))?<\/div>#i

    my modified code:
    #<div[^>]*MyID[^>]*>([^<]+|<(?!/?div[^>]*>)|<div[^>]*>(?>(?1))*</div>)*<\/div>#i

    a better view of my code:
    $regex    = ‘~
        <div[^>]*MyID[^>]*>
            (
                [^<]+|
                <(?!/?div[^>]*>)|
                <div[^>]*>
                    (?>(?1))*
                </div>
            )*
        </div>
    ~ix’;

  13. @Jim, if you replace the last question mark in my regex with an asterisk, it will work. That regex was quick and dirty, but note that the majority of the smaller changes you made reduce preciseness and optimizations. I’ll try to look both of them over more closely later on.

    @Pavel Donchev, you’re very welcome!

  14. hi steave,
    here in the regular expression we know that can be duplicated , but i want an expression that should not contain duplicate of its result
    for example if my control is for matching (?.*) than it should not have xyz in the result

    if you able to solve it , please reply with your answer to my mail i.e. imran261281@gmail.com

  15. Input:
    {happy} sometext sometext sometext {test {zebra} test} sometext sometext sometext {apple {red {crunchy}}} sometext {yay}

    Desired output (im using preg_match_all()):
    happy
    zebra
    test {zebra} test
    crunchy
    red {crunchy}
    apple {red {crunchy}}
    yay

    the closest I’ve gotten to my desired result is:
    preg_match_all(‘/\{(.*?\}*)\}+/’, $input, $output, PREG_PATTERN_ORDER);

    but, it only works if the closing brackets are all touching. In the example above, it works for the last nested group, but not the test {zebra} test group

    Array
    (
    [0] => Array
    (
    [0] => {happy}
    [1] => {test {zebra}
    [2] => {apple {red {crunchy}}}
    [3] => {yay}
    )

    [1] => Array
    (
    [0] => happy
    [1] => test {zebra
    [2] => apple {red {crunchy}}
    [3] => yay
    )

    )

    any help would be appreciated!

  16. Can you apply this to nested parentheses? I need to parse in JavaScript for nested parentheses with Velocity commands i.e:

    #set ($a = (5 + $b) * $c)

    The number of inner parentheses is unknown.

  17. “basaddery” indeed! Thank you for your solution to a problem that had me thoroughly vexed.

  18. I was looking for a solution like this a long time ago, and no one, even so called experts here know how to do it.
    I will post this site in all the places I can!

    THANKS FOR SHARING.

  19. Hi,

    First, congrats for the great post 🙂

    I need a PCRE expression for php able to catch everything inside a certain element with a certain id. I’ve tried some examples posted here, with and without modifications, but still can’t get what I want. My problem is this :

    <div id=”test”>text 1<div>text 2</div></div>text 3<div>

    I was able to catch everything inside the first div, excluding “text 3”. Can someone help please?

    Thanks 🙂

  20. @yoda.pt:
    I don’t have a one liner regex like those Steve throws around. But if you just want to get the job done:
    I wrote a PHP class that allows you to get or set the content of any element(s) using CSS selectors [plus some other features].
    Its small, quick, and pretty powerful, and does not the use built in the PCRE recursion. [Intentionally – this is quicker and handles invalid HTML.]

    If you [or anyone else] are interested, email me at siteroller at gmail.

  21. Hi Steven,

    First, thanks for this wonderful post.
    I’m new to regular expressions. Above you mentioned regular expression to get outermost table elements “]*>(?>(?:[^<]++|]*>))+|(?R))*”. I tried to convert this to make it work for javascript, but couldn’t able to do it.
    Can you please help me out.

    I have 1 more issue. Can we get all the table elements in an array 1 by 1, starting from outer most to innermost.. for example

    tableArray[0] = ” .. …. …. … .. “;
    tableArray[1] = ” …. …. … “;
    tableArray[2] = “….”;

    It can be done in javascript using regular expression and recursion ?

    Thanks

  22. Your post suggest to trace the table elements. My problem is that I want all the data from the HTML page except for the data in between table start and table end.

    Can you please guide me on that?

  23. Awesome article. I am migrating very old site to Drupal and one of the scenarios was to extract the innermost table out of the page. And your article helped to achieve without any trouble. Thanks a lot.

  24. What if I would like to match the innermost ‘tr’ element that contains some special string inside a ‘td’ (e.g. <tr><td> TEXT </td></tr>)?

  25. It’s great but seems to have a little bug in some cases.

    When I apply replace by field $1, using the generic pattern <table\b[^>]*>(?:(?=([^<]+))\1|<(?!table\b[^>]*>))*?</table>, got this:

    input=”<div class="moz-forward-container"><br></div>
    output=”br>
    correct: “<br>

    input=”<div class="moz-forward-container"><b>Atte. Ing. Arturo DomĂ­nguez</b></div>
    output=”/b>
    correct: “<b>Atte. Ing. Arturo DomĂ­nguez</b>

    input=”<p><H5>ADVERTENCIA LEGAL</H5</p>
    output=”/H5
    correct: “ADVERTENCIA LEGAL</H5

  26. This is not a bug. Unfortunately, though, I can’t fully understand your examples, because the regex you showed only matches table elements. None of your three example inputs contain table tags at all. If I change the three instances of “table” within the regex to “div”, then it matches the div elements from your examples. It sounds like you changed something about the regex that is not shown here. Also note that in your third example input, the </H5> is missing its closing “>”.

    In any case, the regex is not meant to be used with a simple replacement of $1. The capturing group in the regex is mixed with the lookahead and backreference to emulate an atomic grouping—that is its only purpose.

    If you want to be able to replace all matches of innermost table elements with the entire contents of innermost table tags, you’d have to change the regex to this:
    <table\b[^>]*>((?:(?=([^<]+))\2|<(?!table\b[^>]*>))*?)</table>

    Then you could replace matches with $1.

Leave a Reply

Your email address will not be published. Required fields are marked *