Remove Nested Patterns with One Line of JavaScript
Here's a neat little trick I came up with for removing nested patterns from a string.
var str = "abc<1<2<>3>4>def"; while (str != (str = str.replace(/<[^<>]*>/g, ""))); // str -> "abcdef"
Notice that the regex in this one-liner doesn't try to deal with nested patterns at all. The while loop's condition replaces instances of <…> (where angled brackets are not allowed in the inner pattern) with an empty string. This repeats from the inside out, until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.
You can use a similar approach to grab nested patterns rather than delete them, as shown below.
[Edit (6/6/2008): The following code does not correctly handle input like "((a)(b))". If you browse tag recursion on this blog you'll find a variety of other approaches for matching nested constructs that actually work correctly.]
var str = "abc(d(e())f)(gh)ijk()",
re = /\([^()]*\)/,
output = [],
match, parts, last;
while (match = re.exec(str)) {
parts = match[0].split("\uFFFF");
if (parts.length < 2) {
last = output.push(match[0]) - 1;
} else {
output[last] = parts[0] + output[last] + parts[1];
}
str = str.replace(re, "\uFFFF");
}
// output -> ["(d(e())f)", "(gh)", "()"]
Since once again we're working from the inside out, reassembling each complete match requires us to mark the position at which the previous deepest-level match was removed. I've used the Unicode escape sequence \uFFFF to mark such positions, because that's a permanently-unassigned code point.
Note that using a negated character class like [^()] to match the inner pattern as shown in the examples here only works correctly if you're using single-character delimiters like (…) or <…>. If you want to match/remove nested patterns that use multi-character delimiters, you can use a regex like /<<(?:(?!<<|>>)[\S\s])*>>/. Just change both instances of << to your left delimiter, and >> to your right delimiter.

Comment by Dean Edwards on 28 May 2008:
Nice!
Comment by Yoan on 28 May 2008:
PCRE (http://pcre.org/pcre.txt) comes with a useful (but quite complicated and ugly) atom for handling recursion: ?R
The while trick is small and powerful, nice work.
Comment by Frederick Polgardy on 30 May 2008:
Very cool! Now I can get down to finishing that JavaScript Lisp parser.
Comment by Ori Peleg on 1 June 2008:
This hack is way cool!!!
Comment by Scott Trenda on 6 June 2008:
I’ll admit, this snippet is pretty cool, but now I’m thinking about how I would use it to strip HTML tags (even just the simple ones). The hard part here is that it relies on character classes, reducing the possible nest delimiters to a single character. How would you apply this pattern to remove the HTML tags from the following?
var str = "<p><i>this</i> is <b>my <i>awesome</i> example</b>!</p>"I’m thinking it could be done through a combination of lookaheads and backreferences, but I’m getting stuck on the part where the negative-lookahead doesn’t kill the last tag match. Something like this:
while (str != (str = str.replace(/<(\w+)>((?:(?!<\/?\1>).)*)</\1>/g, "$2")));Any ideas?
(I was about to feel stupid when I reread and saw your links to other nested-construct blog posts, but I’m calling special circumstances on this reply because it only concerns removal and I have a place where I could use it. ^_^)
Comment by Steven Levithan on 6 June 2008:
@Scott Trenda, what would support for nested tags buy you when trying to remove HTML tags? What about something simple like
str = str.replace(/<[^>]+>/g, "");?I showed at the end of this post how to get around the single-character-delimiter restriction imposed by using a negated character class to match the inner pattern, but in any case you’ve already implemented it perfectly here. Here’s your code again, with an added backslash to escape the final
/, the dot changed to[\S\s]to allow the regex to work correctly over multiple lines, and support for tags that include attributes.while (str != (str = str.replace(/<(\w+)[^>]*>((?:(?!<\/?\1\b[^>]*>)[\S\s])*)<\/\1>/g, "$2")));(Your use of a backreference in the replacement string is interesting and gives this trick new potential that I hadn’t thought about.)
So, this removes HTML tags that have both an opening and closing tag. Improperly-nested tags might be a bit of an issue though. E.g.,
"<p>this is my <i>awe<p>some</i> example</p>"will end up as"<p>this is my awesome example". It also doesn’t support self-closed tags. Here’s an example I gave elsewhere of removing div tags and their contents, accounting for nested and self-closed divs.while (str != (str = str.replace(/<div\b[^>]*\/>|<div\b[^>]*>(?:(?!<\/?div\b[^>]*>)[\S\s])* <\/div>/gi, "")));