Matching Nested Constructs in JavaScript, Part 2

When I posted my matchRecursive function the other day (which allows easily matching nested constructs), I noted that it could easily be modified to work with a regex pattern rather than a string as the format argument. After looking at it again, I realized the conversion wouldn't be entirely straightforward, so I've gone ahead and reimplemented it as matchRecursiveRegExp to work with regex format delimiters. You can now use it to like e.g. matchRecursiveRegExp(str, "<div\\b[^>]*>", "</div>", "gi") to match the entire contents of all outermost <div> tags.

As before, read the code comments for further details. Note that in this version I've made the variable names more or less indecipherable in order to keep code size down, so see the earlier function for a similar but more readable implementation.

// (c) 2007 Steven Levithan <stevenlevithan.com>
// MIT License

/*** matchRecursiveRegExp
	Accepts a string to search, a left and right format delimiter
	as regex patterns, and optional regex flags. Returns an array
	of matches, allowing nested instances of left/right delimiters.
	Use the "g" flag to return all matches, otherwise only the
	first is returned. Be careful to ensure that the left and
	right format delimiters produce mutually exclusive matches.
	Backreferences are not supported within the right delimiter
	due to how it is internally combined with the left delimiter.
	When matching strings whose format delimiters are unbalanced
	to the left or right, the output is intentionally as a
	conventional regex library with recursion support would
	produce, e.g. "<<x>" and "<x>>" both produce ["x"] when using
	"<" and ">" as the delimiters (both strings contain a single,
	balanced instance of "<x>").

	examples:
		matchRecursiveRegExp("test", "\\(", "\\)")
			returns: []
		matchRecursiveRegExp("<t<<e>><s>>t<>", "<", ">", "g")
			returns: ["t<<e>><s>", ""]
		matchRecursiveRegExp("<div id=\"x\">test</div>", "<div\\b[^>]*>", "</div>", "gi")
			returns: ["test"]

*/
function matchRecursiveRegExp (str, left, right, flags) {
	var	f = flags || "",
		g = f.indexOf("g") > -1,
		x = new RegExp(left + "|" + right, "g" + f),
		l = new RegExp(left, f.replace(/g/g, "")),
		a = [],
		t, s, m;

	do {
		t = 0;
		while (m = x.exec(str)) {
			if (l.test(m[0])) {
				if (!t++) s = x.lastIndex;
			} else if (t) {
				if (!--t) {
					a.push(str.slice(s, m.index));
					if (!g) return a;
				}
			}
		}
	} while (t && (x.lastIndex = s));

	return a;
}

You can download it here.

Let me know if there are any additional features you'd like to see with this.

6 thoughts on “Matching Nested Constructs in JavaScript, Part 2”

  1. You can avoid the second regex test l.test(m[0]) by wraping the left part in parenthesses new RegExp( ‘(‘ + left + “)|” + right, “g” + f) and then checking if there is something in this group if(m[1]). Regards.

  2. This thread is pretty old so I’m not sure if you see this anymore but I thought I’d make a comment / give some updated code.

    This function works great, but will not return nested results/matches.
    For example:

    Given the string: “”;

    var str
    var div_tags = matchRecursiveRegExp(

  3. This thread is pretty old so I’m not sure if you see this anymore but I thought I’d make a comment / give some updated code.

    This function works great, but will not return nested results/matches.
    For example:

    Given the string: “”;

    var str = “”;
    var div_tags = matchRecursiveRegExp(str, “]*>”, “”, “g”);

    The div_tags array will only contain 1 item (the outer div) instead of 2.

    This might be expected if you are not doing a global search, but if you are doing a global search I would think that you’d want to get all the matches.

    The solution is easy though. Basically every time you increment t, and not doing a set of s, you store off the index for later in an array. Then if said array has content, slice the string and keep going.

    Here’s the code:
    function matchRecursiveRegExp (str, left, right, flags) {
    var str_copy = str,
    f = flags || “”,
    g = f.indexOf(“g”) > -1,
    x = new RegExp(left + “|” + right, “g” + f),
    l = new RegExp(left, f.replace(/g/g, “”)),
    a = [],
    b = [],
    keep_going, t, s, m;

    do {
    t = 0;
    keep_going = false;
    while (m = x.exec(str_copy))
    {
    if (l.test(m[0])) {
    if (!t++) {
    s = x.lastIndex;
    } else {
    //another start has occurred
    //save that off
    b.push(m.index);
    }
    } else if (t) {
    if (!–t) {
    a.push(str_copy.slice(s, m.index));
    //found the end match
    if (!g) return a;
    }
    }
    }
    if (g && b.length) {
    //nested matches found
    //slice the string with the index that was
    //saved and keep going
    str_copy = str_copy.slice(b.shift());
    keep_going = true;
    }
    } while (keep_going || (t && (x.lastIndex = s)));

    return a;
    }

  4. I forgot to escape my HTML 😛

    String >div id=’foo'<>div id=’bar'<Hello World>/div<><

    Left >div[^<]<
    Right >\/div<

  5. Seems to be an issue with the line:

    x = new RegExp(left + “|” + right, “g” + f)

    When passing g as a flag this results in “gg” and my browser doesn’t like that very much.

Leave a Reply

Your email address will not be published. Required fields are marked *