Here's a neat little trick I came up with for removing nested patterns from a string.
var str = "abc<1<2<>3>4>def"; while (str != (str = str.replace(/<[^<>]*>/g, ""))); // str -> "abcdef"
Notice that the regex in this one-liner doesn't try to deal with nested patterns at all. The while
loop's condition replaces instances of <…>
(where angled brackets are not allowed in the inner pattern) with an empty string. This repeats from the inside out, until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.
You can use a similar approach to grab nested patterns rather than delete them, as shown below.
[Edit (6/6/2008): The following code does not correctly handle input like "((a)(b))". If you browse tag recursion on this blog you'll find a variety of other approaches for matching nested constructs that actually work correctly.]
var str = "abc(d(e())f)(gh)ijk()", re = /\([^()]*\)/, output = [], match, parts, last; while (match = re.exec(str)) { parts = match[0].split("\uFFFF"); if (parts.length < 2) { last = output.push(match[0]) - 1; } else { output[last] = parts[0] + output[last] + parts[1]; } str = str.replace(re, "\uFFFF"); } // output -> ["(d(e())f)", "(gh)", "()"]
Since once again we're working from the inside out, reassembling each complete match requires us to mark the position at which the previous deepest-level match was removed. I've used the Unicode escape sequence \uFFFF
to mark such positions, because that's a permanently-unassigned code point.
Note that using a negated character class like [^()]
to match the inner pattern as shown in the examples here only works correctly if you're using single-character delimiters like (…)
or <…>
. If you want to match/remove nested patterns that use multi-character delimiters, you can use a regex like /<<(?:(?!<<|>>)[\S\s])*>>/
. Just change both instances of <<
to your left delimiter, and >>
to your right delimiter.
Nice!
PCRE (http://pcre.org/pcre.txt) comes with a useful (but quite complicated and ugly) atom for handling recursion: ?R
The while trick is small and powerful, nice work.
Very cool! Now I can get down to finishing that JavaScript Lisp parser. 😉
This hack is way cool!!!
I’ll admit, this snippet is pretty cool, but now I’m thinking about how I would use it to strip HTML tags (even just the simple ones). The hard part here is that it relies on character classes, reducing the possible nest delimiters to a single character. How would you apply this pattern to remove the HTML tags from the following?
var str = "<p><i>this</i> is <b>my <i>awesome</i> example</b>!</p>"
I’m thinking it could be done through a combination of lookaheads and backreferences, but I’m getting stuck on the part where the negative-lookahead doesn’t kill the last tag match. Something like this:
while (str != (str = str.replace(/<(\w+)>((?:(?!<\/?\1>).)*)</\1>/g, "$2")));
Any ideas?
(I was about to feel stupid when I reread and saw your links to other nested-construct blog posts, but I’m calling special circumstances on this reply because it only concerns removal and I have a place where I could use it. ^_^)
@Scott Trenda, what would support for nested tags buy you when trying to remove HTML tags? What about something simple like
str = str.replace(/<[^>]+>/g, "");
?I showed at the end of this post how to get around the single-character-delimiter restriction imposed by using a negated character class to match the inner pattern, but in any case you’ve already implemented it perfectly here. Here’s your code again, with an added backslash to escape the final
/
, the dot changed to[\S\s]
to allow the regex to work correctly over multiple lines, and support for tags that include attributes.while (str != (str = str.replace(/<(\w+)[^>]*>((?:(?!<\/?\1\b[^>]*>)[\S\s])*)<\/\1>/g, "$2")));
(Your use of a backreference in the replacement string is interesting and gives this trick new potential that I hadn’t thought about.)
So, this removes HTML tags that have both an opening and closing tag. Improperly-nested tags might be a bit of an issue though. E.g.,
"<p>this is my <i>awe<p>some</i> example</p>"
will end up as"<p>this is my awesome example"
. It also doesn’t support self-closed tags. Here’s an example I gave elsewhere of removing div tags and their contents, accounting for nested and self-closed divs.while (str != (str = str.replace(/<div\b[^>]*\/>|<div\b[^>]*>(?:(?!<\/?div\b[^>]*>)[\S\s])* <\/div>/gi, "")));
Heh, this pattern came in useful to me again – this time while I was writing a BBCode parser in WebPlus (it’s like ColdFusion) and needed to handle nested tags. Works beautifully now, thanks to your pattern at the bottom. I knew I had seen the failing-negative-lookahead pattern here before, and it turned out to be just what I needed. 🙂
I think I hit a breakthrough this time too; I don’t know how apparent/obvious it is on its face, so I figure this is as good of a place to post it as anywhere. I’ve been bemoaning the ability to use functions for replacement strings in other languages, ever since you keyed me into using them in Javascript… until today. The crux of the it centers around the fact (?) that if you do a RegexMatch and RegexSplit separately on the same string, using the same regex, you get a perfectly partitioned view of that string, with access to all the subgroups in the matches as well. Then you can just alternate through the adjacent split parts and matched parts, and pass the array of matches off to whatever handler routine you need.
Is this technique more obvious than I thought it was? If you think you’d use it (don’t you work with ColdFusion?), I’d be happy to provide the (pseudo-)source for it. That is, if it turns out I didn’t just re-invent the wheel. 🙂
In any case, I just wanted to say thanks, again. Congratulations on your new book, too!
Very cool 🙂
Great hack and very useful for me, thanks for the post
Hello Steven. At all I like your approaches. Thank you for sharing. Regarding nested braces we can you different approach.
function f(str) {
var braceReg = /[()]/g,
output = [],
deep = 0,
S, start;
while(S = braceReg.exec(str)) {
if (deep == 0) {
start = braceReg.lastIndex – 1;
}
(S == ‘(‘ ? deep++ : deep–);
if (deep == 0) {
output.push(str.slice(start, braceReg.lastIndex));
}
}
return output;
}
print(f(“abc(d(e(test)(test))f)(gh)ijk()”)); //(d(e(test)(test))f),(gh),()
print(f(“((a)(b))”)); //((a)(b))
Concerning the case where the delimiters are more than one char…
The efficiency of these regexes can be improved (at the expense of a bit more complexity), by implementing Jeffrey Friedl’s unrolling-the-loop technique. The following are improved versions written in /Javascript/ syntax:
Unrolling Steve’s <<example>> regex…
/<<[^<>]*(?:(?:(?!<<)<|(?!>>)>)[^<>]*)*>>/
Unrolling Scott’s <tag>example</tag> regex…
while (str != (str = str.replace(/<(\w+)\b[^>]*>([^<]*(?:(?!<\/?\1\b)<[^<]*)*)<\/\1>/g, "$2")));
But more importantly, these fix a rather nasty PHP bug…
The regex constructs recommended in the original post (and the ones from the comments which follow), lead directly to segmentation-fault problems when parsing semi-longish strings with PHP. This type of regex construct results in a lot of recursive calls in the PCRE library which quickly results in a stack overflow (and segmentation fault) when applied to a long subject text. And the length of subject text which causes the overflow is directly related to the stack size of the executable linked with the PCRE library: typically PHP.EXE (the command line utility) or HTTPD.EXE (Apache web server) on a Windows box. PHP.EXE is typically built with an 8MB stack and HTTPD.EXE is built with a 256KB stack. My testing reveals that Scott’s HTML tag matching regex and Steve’s improved version both generate a stack overflow and seg-fault when applied to a mere 673 byte subject string when the PHP executable has a 256KB stack (i.e. Apache running on Windows). When the PHP executable has an 8MB stack executable (which is the typical value for Apache on Linux boxes), these regexes blow up with a 22999 byte string. And Steve’s original <<example>> regex also blows up with similarly sized strings. The improved regexes I provided above do not suffer this limitation and happily work with subject strings up to 26MB regardless of the executable stack size.
The Drupal community recently wrangled with this problem (which is how I became intimately familiar with the issue) – you can read about that here: Optimize CSS option causes php cgi to segfault in pcre function “match”. For details on the PCRE stack usage issue see: PCRE DISCUSSION OF STACK USAGE.
Hi,
I would like to create some regular expression to highlight syntax code. In my code, the variables are defined inside brackets [var]. (I can put variables inside another ones, [var[i]])
I have some problems with nested brackets.
I’m using /\[(.*?)\]/g but for [[%1]stack[i]] , ‘stack’ and the last bracket is not coloured.
Any help will be appreciated
Thanks in advanced
Dear Steven Levithan,
Grace to this powerful line of code, I have been able to build the kernel of my wiki parser engine (http://epsilonwiki.free.fr) as explained in this page : http://epsilonwiki.free.fr/index.php?view=parser .
Thank you a lot.
Alain Marty
s/loop’s/loops
Dear Steven Levithan,
In my previous post (30 July 2011), I write how useful for me had been this “Remove Nested Patterns with One Line of JavaScript”. Since that date, I have been working, as an amateur, on a wiki parser based on a LISP flavored syntax : http://epsilonwiki.free.fr/lambdawiki/. Thes “One line of Javascript” can be seen in line 389 of the parser’s code which can be got here : http://epsilonwiki.free.fr/lambdawiki/meca/parser.js .
I would be happy if you could give me your opinion on this work.
Thank you in advance.
Alain Marty
Hi marty alain, that’s very cool that this has become an important piece of your code. Unfortunately, I don’t think I will be able to review a large piece of work like that. I’ll just have to assume that it’s awesome. 🙂
Hi Steven Levithan,
On 11 June 2012, you assumed that lambdawiki was awsome, thanks.
And what about http://epsilonwiki.free.fr/lambdaway/ and its Lisp flavored syntax ?
LambdaWay is still using your “Remove Nested Patterns with One Line of JavaScript”
upon a unique RegExp : /\{(?:([\w\d@\+\-\*\/=]+)(?: |\uFFFF)*([^{}]*))?\}/g
and a few more lines of Javascript code.
Best regards,
Alain Marty
Hi Steven Levithan,
Inspired by your “Remove Nested Patterns with One Line of JavaScript” I built this function :
function doLoop( str ) {
while (str != (str = str.replace(
/\{([^\s{}]*)(?:[\s]*)([^{}]*)\}/g, // catch {first rest}
function (m, f, r) {
if (html.hasOwnProperty( first )) // 1) html operators
return html[first]();
if (xhtml.hasOwnProperty( first )) // 2) xtended operators
return xhtml[first]();
return ‘(‘ + first + ‘ ‘ + rest + ‘)’; // 3) default :
};
)));
return str;
}
which is the heart of my little 3Okb alphawiki :
– http://epsilonwiki.free.fr/alphawiki/.
Thak you a lot.
Best regards
Marty Alain
Hi Steven Levithan,
Here is a very short code : http://epsilonwiki.free.fr/alphawiki/?view=eval_evolution
This page presents a comparison between two s-expressions evaluators written “a minima”. The first is based on a standard “recursive evaluation working on an array” and the second on a “regexp loop evaluation working on strings” (inspired by your own work).
Any opinion would be a great help for me to go further.
Best regards
Alain Marty