Flagrant Badassery

A JavaScript and regular expression centric blog

Remove Nested Patterns with One Line of JavaScript

Here's a neat little trick I came up with for removing nested patterns from a string.

var str = "abc<1<2<>3>4>def";

while (str != (str = str.replace(/<[^<>]*>/g, "")));

// str -> "abcdef"

Notice that the regex in this one-liner doesn't try to deal with nested patterns at all. The while loop's condition replaces instances of <…> (where angled brackets are not allowed in the inner pattern) with an empty string. This repeats from the inside out, until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.

You can use a similar approach to grab nested patterns rather than delete them, as shown below.

[Edit (6/6/2008): The following code does not correctly handle input like "((a)(b))". If you browse tag recursion on this blog you'll find a variety of other approaches for matching nested constructs that actually work correctly.]

var str = "abc(d(e())f)(gh)ijk()",
    re = /\([^()]*\)/,
    output = [],
    match, parts, last;

while (match = re.exec(str)) {
    parts = match[0].split("\uFFFF");
    if (parts.length < 2) {
        last = output.push(match[0]) - 1;
    } else {
        output[last] = parts[0] + output[last] + parts[1];
    }
    str = str.replace(re, "\uFFFF");
}

// output -> ["(d(e())f)", "(gh)", "()"]

Since once again we're working from the inside out, reassembling each complete match requires us to mark the position at which the previous deepest-level match was removed. I've used the Unicode escape sequence \uFFFF to mark such positions, because that's a permanently-unassigned code point.

Note that using a negated character class like [^()] to match the inner pattern as shown in the examples here only works correctly if you're using single-character delimiters like (…) or <…>. If you want to match/remove nested patterns that use multi-character delimiters, you can use a regex like /<<(?:(?!<<|>>)[\S\s])*>>/. Just change both instances of << to your left delimiter, and >> to your right delimiter.

There Are 8 Responses So Far. »

  1. Nice!

  2. PCRE (http://pcre.org/pcre.txt) comes with a useful (but quite complicated and ugly) atom for handling recursion: ?R

    The while trick is small and powerful, nice work.

  3. Very cool! Now I can get down to finishing that JavaScript Lisp parser. ;-)

  4. This hack is way cool!!!

  5. I’ll admit, this snippet is pretty cool, but now I’m thinking about how I would use it to strip HTML tags (even just the simple ones). The hard part here is that it relies on character classes, reducing the possible nest delimiters to a single character. How would you apply this pattern to remove the HTML tags from the following?

    var str = "<p><i>this</i> is <b>my <i>awesome</i> example</b>!</p>"

    I’m thinking it could be done through a combination of lookaheads and backreferences, but I’m getting stuck on the part where the negative-lookahead doesn’t kill the last tag match. Something like this:

    while (str != (str = str.replace(/<(\w+)>((?:(?!<\/?\1>).)*)</\1>/g, "$2")));

    Any ideas?

    (I was about to feel stupid when I reread and saw your links to other nested-construct blog posts, but I’m calling special circumstances on this reply because it only concerns removal and I have a place where I could use it. ^_^)

  6. @Scott Trenda, what would support for nested tags buy you when trying to remove HTML tags? What about something simple like str = str.replace(/<[^>]+>/g, "");?

    I showed at the end of this post how to get around the single-character-delimiter restriction imposed by using a negated character class to match the inner pattern, but in any case you’ve already implemented it perfectly here. Here’s your code again, with an added backslash to escape the final /, the dot changed to [\S\s] to allow the regex to work correctly over multiple lines, and support for tags that include attributes.

    while (str != (str = str.replace(/<(\w+)[^>]*>((?:(?!<\/?\1\b[^>]*>)[\S\s])*)<\/\1>/g, "$2")));

    (Your use of a backreference in the replacement string is interesting and gives this trick new potential that I hadn’t thought about.)

    So, this removes HTML tags that have both an opening and closing tag. Improperly-nested tags might be a bit of an issue though. E.g., "<p>this is my <i>awe<p>some</i> example</p>" will end up as "<p>this is my awesome example". It also doesn’t support self-closed tags. Here’s an example I gave elsewhere of removing div tags and their contents, accounting for nested and self-closed divs.

    while (str != (str = str.replace(/<div\b[^>]*\/>|<div\b[^>]*>(?:(?!<\/?div\b[^>]*>)[\S\s])*<\/div>/gi, "")));

  7. Heh, this pattern came in useful to me again – this time while I was writing a BBCode parser in WebPlus (it’s like ColdFusion) and needed to handle nested tags. Works beautifully now, thanks to your pattern at the bottom. I knew I had seen the failing-negative-lookahead pattern here before, and it turned out to be just what I needed. :)

    I think I hit a breakthrough this time too; I don’t know how apparent/obvious it is on its face, so I figure this is as good of a place to post it as anywhere. I’ve been bemoaning the ability to use functions for replacement strings in other languages, ever since you keyed me into using them in Javascript… until today. The crux of the it centers around the fact (?) that if you do a RegexMatch and RegexSplit separately on the same string, using the same regex, you get a perfectly partitioned view of that string, with access to all the subgroups in the matches as well. Then you can just alternate through the adjacent split parts and matched parts, and pass the array of matches off to whatever handler routine you need.

    Is this technique more obvious than I thought it was? If you think you’d use it (don’t you work with ColdFusion?), I’d be happy to provide the (pseudo-)source for it. That is, if it turns out I didn’t just re-invent the wheel. :)

    In any case, I just wanted to say thanks, again. Congratulations on your new book, too!

  8. Very cool :)

Post a Response

If you are about to post code, please escape your HTML entities (&amp;, &gt;, &lt;).