Remove Nested Patterns with One Line of JavaScript

Here's a neat little trick I came up with for removing nested patterns from a string.

var str = "abc<1<2<>3>4>def";

while (str != (str = str.replace(/<[^<>]*>/g, "")));

// str -> "abcdef"

Notice that the regex in this one-liner doesn't try to deal with nested patterns at all. The while loop's condition replaces instances of <…> (where angled brackets are not allowed in the inner pattern) with an empty string. This repeats from the inside out, until the regex no longer matches. At that point, the result of the replacement is the same as the subject string, and the loop ends.

You can use a similar approach to grab nested patterns rather than delete them, as shown below.

[Edit (6/6/2008): The following code does not correctly handle input like "((a)(b))". If you browse tag recursion on this blog you'll find a variety of other approaches for matching nested constructs that actually work correctly.]

var str = "abc(d(e())f)(gh)ijk()",
    re = /\([^()]*\)/,
    output = [],
    match, parts, last;

while (match = re.exec(str)) {
    parts = match[0].split("\uFFFF");
    if (parts.length < 2) {
        last = output.push(match[0]) - 1;
    } else {
        output[last] = parts[0] + output[last] + parts[1];
    }
    str = str.replace(re, "\uFFFF");
}

// output -> ["(d(e())f)", "(gh)", "()"]

Since once again we're working from the inside out, reassembling each complete match requires us to mark the position at which the previous deepest-level match was removed. I've used the Unicode escape sequence \uFFFF to mark such positions, because that's a permanently-unassigned code point.

Note that using a negated character class like [^()] to match the inner pattern as shown in the examples here only works correctly if you're using single-character delimiters like (…) or <…>. If you want to match/remove nested patterns that use multi-character delimiters, you can use a regex like /<<(?:(?!<<|>>)[\S\s])*>>/. Just change both instances of << to your left delimiter, and >> to your right delimiter.

XRegExp 0.5 Released!

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

If you haven't seen the prior versions, XRegExp is an MIT-licensed JavaScript library that provides an augmented, cross-browser implementation of regular expressions, including support for additional modifiers and syntax. Several convenience methods and a new, powerful recursive-construct parser that uses regex delimiters are also included.

Here's what you get beyond the standard JavaScript regex features:

  • Added regex syntax:
    • Comprehensive named capture support. (Improved)
    • Comment patterns: (?#…). (New)
  • Added regex modifiers (flags):
    • s (singleline), to make dot match all characters including newlines.
    • x (extended), for free-spacing and comments.
  • Added awesome:
    • Reduced cross-browser inconsistencies. (More)
    • Recursive-construct parser with regex delimiters. (New)
    • An easy way to cache and reuse regex objects. (New)
    • The ability to safely embed literal text in your regex patterns. (New)
    • A method to add modifiers to existing regex objects.
    • Regex call and apply methods, which make generically working with functions and regexes easier. (New)

All of this can be yours for the low, low price of 2.4 KB. smile Version 0.5 also introduces extensive documentation and code examples.

If you're using a previous version, note that there are a few non-backward compatible changes for the sake of strict ECMA-262 Edition 3 compliance and compatibility with upcoming ECMAScript 4 changes.

  • The XRegExp.overrideNative function has been removed, since it is no longer possible to override native constructors in Firefox 3 or ECMAScript 4 (as proposed).
  • Named capture syntax has been changed from (<name>…) to (?<name>…), which is the standard in most regex libraries and under consideration for ES4. Named capture is now always available, and does not require the k modifier.
  • Due to cross-browser compatibility issues, previous versions enforced that a leading, unescaped ] within a character class was treated as a literal character, which is how things work in most regex flavors. XRegExp now follows ECMA-262 Edition 3 on this point. [] is an empty set and never matches (this is enforced in all browsers).

Get it while it's hot! Check out the new XRegExp documentation and source code.

Fun With .NET Regex Balancing Groups

The .NET Framework's regular expression package includes a unique feature called balancing groups, which is a misnomer since although they can indeed be used to match balanced constructs, that's not all they're good for and really has nothing to do with how they work. Unfortunately, balancing groups are quite poorly documented. Following is a brief description of their functionality, but this post will mostly focus on examples of using them in interesting ways.

Note: If you're reading this in a feed reader or aggregator, see the original post, which uses regex syntax highlighting to hopefully make things easier to follow.

  • (?<Name1>) — A standard named capturing group. The captured value is pushed on the Name1 CaptureCollection or stack.
  • (?<-Name1>) — Pops the top backreference off the Name1 stack. If there are no backreferences on the stack, the match fails, forcing backtracking.
  • (?<Name2-Name1>) — Pops the top backreference off the Name1 stack, and pushes text matched since the last time Name1 participated on top of the Name2 stack. I imagine that in most cases where this feature has been used, it mostly just served as a notational convenience.

I'm not a .NET coder, but I recognize the potential of this functionality. This evening I spent a few minutes using Expresso to play around with balancing groups, and here are a few interesting things I've come up with.

First, here's a simple example of using balancing groups outside the context of recursion and nested constructs. This regex matches any number of As followed by the same number of Bs (e.g., "AAABBB").

^
(?<Counter>A)+    # For each A, push to the Counter stack
(?<-Counter>B)+   # For each B, pop from the Counter stack
(?(Counter)(?!))  # Fail if there are any values on the Counter stack
$

A few notes about the above regex:

  • (?<-Counter>B) causes the match attempt to backtrack or fail if there are no captured values on the Counter stack. This prevents matching more Bs than As.
  • (?(Counter)) is a conditional without an else part. The way it's used here prevents the match from ending with more As than Bs.
  • (?!) is an empty negative lookahead. It will never match, and is hence an easy way to force a match attempt to backtrack or fail.

Although there's no way to determine the height of the Counter stack from within the regex, you can directly manipulate that number by incrementing or decrementing it by set amounts. To demonstrate, here's a regex designed to match a password which is at least eight characters long, and which contains at least two out of three character types from the set of uppercase letters, lowercase letters, and numbers.

^
(?=.*[a-z](?<N>)|)  # If a-z is found, push to the N stack
(?=.*[A-Z](?<N>)|)  # If A-Z is found, push to the N stack
(?=.*[0-9](?<N>)|)  # If 0-9 is found, push to the N stack
(?<-N>){2}          # Pop the last two captures off the N stack
.{8,}               # Match eight or more characters

Here, by decrementing the height of the N capture stack by two, we cause the match to fail if it hadn't already reached at least two. Note that there's an empty alternation at the end of each lookahead, which is used to cancel the effect of the lookahead if it would otherwise cause the match to fail. This kind of x out of y validation of orthogonal rules would normally be unmanageable using regular expressions, since without equivalent functionality we'd have to use a bunch of alternation or conditionals to account for each possible set and ordering of allowed matches.

Here's a way to match palindromes (e.g., "redivider"):

(?<N>.)+.?(?<-N>\k<N>)+(?(N)(?!))

In the above regex, \k<N> is a backreference to the last value on the N capture stack.

Moving on to what is undoubtedly the most common usage of balancing groups, following is an example of matching balanced sets of parentheses. It's taken from Jeffrey Friedl's book, Mastering Regular Expressions.

\(
	(?>
		[^()]+
	|
		\( (?<Depth>)
	|
		\) (?<-Depth>)
	)*
	(?(Depth)(?!))
\)

Here's a simple variation which allows easily using multi-character delimiters. To swap in your own delimiters (such as HTML tags), change each instance of "<<" to your left delimiter and ">>" to your right delimiter.

<<
	(?>
		(?! << | >> ) .
	|
		<< (?<Depth>)
	|
		>> (?<-Depth>)
	)*
	(?(Depth)(?!))
>>

Make sure to use single-line mode (RegexOptions.Singleline) if you want the dot to match newlines.

Finally, here's a way to match words of incrementally increasing length (e.g., "abc abcd abcde abcdef"), starting from any word length (the preceding example started from a word length of three). See if you can figure out how it works. The values stored by A, B, and C are not important; the capturing groups are only used to keep count and control the regex's path.

(?:
	(?(A)\s|)
	(?<B>)
	(?<C-B>\w)+ (?(B)(?!))
	(?:
		\s
		(?<C>)
		(?<B-C>\w)+ (?(C)(?!))
		(?<A>)
	)?
)+ \b

Have you seen or devised any other non-conventional uses for so-called balancing group definitions? If so, please share.

Matching Nested Constructs in JavaScript, Part 2

When I posted my matchRecursive function the other day (which allows easily matching nested constructs), I noted that it could easily be modified to work with a regex pattern rather than a string as the format argument. After looking at it again, I realized the conversion wouldn't be entirely straightforward, so I've gone ahead and reimplemented it as matchRecursiveRegExp to work with regex format delimiters. You can now use it to like e.g. matchRecursiveRegExp(str, "<div\\b[^>]*>", "</div>", "gi") to match the entire contents of all outermost <div> tags.

As before, read the code comments for further details. Note that in this version I've made the variable names more or less indecipherable in order to keep code size down, so see the earlier function for a similar but more readable implementation.

// (c) 2007 Steven Levithan <stevenlevithan.com>
// MIT License

/*** matchRecursiveRegExp
	Accepts a string to search, a left and right format delimiter
	as regex patterns, and optional regex flags. Returns an array
	of matches, allowing nested instances of left/right delimiters.
	Use the "g" flag to return all matches, otherwise only the
	first is returned. Be careful to ensure that the left and
	right format delimiters produce mutually exclusive matches.
	Backreferences are not supported within the right delimiter
	due to how it is internally combined with the left delimiter.
	When matching strings whose format delimiters are unbalanced
	to the left or right, the output is intentionally as a
	conventional regex library with recursion support would
	produce, e.g. "<<x>" and "<x>>" both produce ["x"] when using
	"<" and ">" as the delimiters (both strings contain a single,
	balanced instance of "<x>").

	examples:
		matchRecursiveRegExp("test", "\\(", "\\)")
			returns: []
		matchRecursiveRegExp("<t<<e>><s>>t<>", "<", ">", "g")
			returns: ["t<<e>><s>", ""]
		matchRecursiveRegExp("<div id=\"x\">test</div>", "<div\\b[^>]*>", "</div>", "gi")
			returns: ["test"]

*/
function matchRecursiveRegExp (str, left, right, flags) {
	var	f = flags || "",
		g = f.indexOf("g") > -1,
		x = new RegExp(left + "|" + right, "g" + f),
		l = new RegExp(left, f.replace(/g/g, "")),
		a = [],
		t, s, m;

	do {
		t = 0;
		while (m = x.exec(str)) {
			if (l.test(m[0])) {
				if (!t++) s = x.lastIndex;
			} else if (t) {
				if (!--t) {
					a.push(str.slice(s, m.index));
					if (!g) return a;
				}
			}
		}
	} while (t && (x.lastIndex = s));

	return a;
}

You can download it here.

Let me know if there are any additional features you'd like to see with this.

Matching Nested Constructs in JavaScript

In the past, I've touched on using regexes to match nested constructs up to a predetermined depth, which is the best you can do unless you're using one of the three regex engines (Perl, PCRE, and .NET) which are currently able to handle true recursion.

Well, recently I wanted to be able to support unlimited nesting depth in a fast, flexible, and easy to use way when matching strings in JavaScript, so here's the code I wrote for it. Basic documentation and examples are included in the code comments.

// (c) 2007 Steven Levithan <stevenlevithan.com>
// MIT License

/*** matchRecursive
	accepts a string to search and a format (start and end tokens separated by "...").
	returns an array of matches, allowing nested instances of format.

	examples:
		matchRecursive("test",          "(...)")   -> []
		matchRecursive("(t(e)s)()t",    "(...)")   -> ["t(e)s", ""]
		matchRecursive("t<e>>st",       "<...>")   -> ["e"]
		matchRecursive("t<<e>st",       "<...>")   -> ["e"]
		matchRecursive("t<<e>>st",      "<...>")   -> ["<e>"]
		matchRecursive("<|t<e<|s|>t|>", "<|...|>") -> ["t<e<|s|>t"]
*/
var matchRecursive = function () {
	var	formatParts = /^([\S\s]+?)\.\.\.([\S\s]+)/,
		metaChar = /[-[\]{}()*+?.\\^$|,]/g,
		escape = function (str) {
			return str.replace(metaChar, "\\$&");
		};

	return function (str, format) {
		var p = formatParts.exec(format);
		if (!p) throw new Error("format must include start and end tokens separated by '...'");
		if (p[1] == p[2]) throw new Error("start and end format tokens cannot be identical");

		var	opener = p[1],
			closer = p[2],
			/* Use an optimized regex when opener and closer are one character each */
			iterator = new RegExp(format.length == 5 ? "["+escape(opener+closer)+"]" : escape(opener)+"|"+escape(closer), "g"),
			results = [],
			openTokens, matchStartIndex, match;

		do {
			openTokens = 0;
			while (match = iterator.exec(str)) {
				if (match[0] == opener) {
					if (!openTokens)
						matchStartIndex = iterator.lastIndex;
					openTokens++;
				} else if (openTokens) {
					openTokens--;
					if (!openTokens)
						results.push(str.slice(matchStartIndex, match.index));
				}
			}
		} while (openTokens && (iterator.lastIndex = matchStartIndex));

		return results;
	};
}();

You can download the code here.

Note that the format argument expects a simple string; not a regular expression. However, the code could easily be modified to work with regexes if that's what you were after.

Update: I've posted an alternative version which accepts regex patterns as the format as matchRecursiveRegExp.