Steven Levithan – Page 17 – Flagrant Badassery

Mimicking Conditionals

Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the regex libraries I'm familiar with, conditionals are only supported by .NET, Perl, PCRE (and hence, PHP's preg functions), and JGsoft products (including RegexBuddy).

There are two common types of regex conditionals in those libraries: capturing-group-based and lookaround-based. I'll get to the latter type in a bit, but first I'll address capturing-group-based conditionals, which are able to base logic on whether a capturing group has participated in the match so far. Here's an example:

(a)?b(?(1)c|d)

That matches only "bd" and "abc". The pattern can be expressed as follows:

(if_matched)?inner_pattern(?(1)then|else)

Here's a comparable pattern I created which doesn't require support for conditionals:

(?=(a)()|())\1?b(?:\2c|\3d)

Note that to use it without an "else" part, you still need to include the second empty backreference (in this case, \3) at the end, like this:

(?=(a)()|())\1?b(?:\2c|\3)

As a brief explanation of how that works, there's a zero-length alternation option within the lookahead at the beginning which is used to cancel the effect of the lookahead, while at the same time, the intentionally empty capturing groups within the lookahead are exploited to base the then/else part on which option in the lookahead matched. However, there are a couple issues:

This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.
It interacts with backtracking differently than a real conditional (the "a" part is treated as if it were within an optional, atomic group, e.g., (?>(a))? instead of (a)?), so it might be better to think of this as a new operator which is similar to a conditional.

Here are the regex engines I've briefly tested this pattern with:

Language	Supports fake cond.	Supports real cond.	Notes
.NET	Yes	Yes	Tested using Expresso.
ColdFusion	Yes	No	Tested using ColdFusion MX7.
Java	Yes	No	Tested using Regular Expression Test Applet.
JavaScript	No	No	According to ECMA-262v3, backreferences to non-participating capturing groups always succeed, and most browsers respect that. Unfortunately, this pattern depends on the way most other regex engines handle such groups.
JGsoft	Yes	Yes	(Edit:) Works as of RegexBuddy version 2.4.0. Previous versions contained two bugs (which I reported to JGsoft) which prevented this from working reliably.

As for lookaround-based conditionals, we can mimic them using the same concepts. Here's what a real lookaround-based conditional looks like (this example uses a positive lookahead for the assertion):

(?(?=if_assertion)then|else)

And here's how you can mimic it:

(?:(?=if_assertion()|())\1then|\2else)

Again, to use it without an "else" part, you still need to include the second empty backreference (in this case, \2) at the end, like this:

(?:(?=if_assertion()|())\1then|\2)

Notes:

The above compatibility table applies just the same.
Backtracking does not come into play with lookaround-based conditionals in the same way as with capturing-group-based conditionals. As a result, mimicked lookaround-based conditionals are functionally identical to their "real" counterparts.
In some regex flavors, it may be necessary to write it in the the somewhat less lucid form (?=a()|())(?:\1b|\2c).
Another, potentially more verbose and less efficient way to mimic a lookaround-based conditional is to alternate two opposite lookarounds. E.g., (?=if_assertion)then|(?!if_assertion)else. That will work even in the case of flavors like JavaScript where backreferences to non-participating groups match the empty string.

Emulating Atomic Groups

Edit (2024): The lightweight Regex+ JavaScript library adds support for atomic groups and possessive quantifiers to native JavaScript regular expressions, making this pattern much easier to use.

So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))"), even though, of course, lookarounds only match a position. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference zero)!

Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.

Here's how it's done:

(?=(pattern to make atomic))\1

Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually consume anything, so the rest of the regex can't backtrack into it), followed by "\1" (the backreference you just captured), to act just like an atomic group. That produces the exactly same result as "(?>pattern to make atomic)", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking sweet.

Matching Innermost HTML Elements

On a regular expression forum I visit every once in awhile, a user asked how he could match all innermost tables within HTML source code. In other words, he wanted to match all tables which did not contain other tables. The regex should match <table>…</table>, but should only match the inner table within <table>…<table>…</table>…</table>. This logic needed to support an unlimited amount of nested tables.

One of the resident regex experts quickly claimed that regexes are not suited for parsing nested HTML data, and that this was therefore impossible using regular expressions, period.

It's true that many regex libraries are incapable of recursion (although even then it's often possible to fake it to an acceptable level). However, when people make claims like that, it encourages me to try to prove otherwise. wink

Here's the solution I offered (though there were a few steps to get there):

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

That matches all innermost (or deepest level) tables, and supports an unlimited amount of nesting. It's also quite fast, and can easily be modified to work with other HTML elements (just change the three instances of "table" to whatever element name you want).

To demonstrate, the above regex matches the three highlighted segments of the text below:

<table><td><table><td> </td></table></td></table> <table><tr><td> </td></tr></table><table></table>

In order to explain how it works, I'll show the progression of gradually more solid regexes I tried along the way to the final result. Here was my first stab at the regex, which is probably easiest to follow (note that it's somewhat flawed, and comparatively slow):

<table>(?:.(?!<table>))*?</table>

(Make sure to turn on the "dot matches newline" modifier with the above, or change the dot to [\S\s].)

Basically, the way that works is it matches an opening <table> tag, then it looks at each following character one at a time, checking if they are followed by another instance of <table> before </table>. If so, the match fails, because it's not an innermost table. (In theory, at least.)

Within a couple minutes I realized there was a slight flaw. In order for it to work, there must be at least one character before it encounters a nested table (e.g., "<table>1<table></table></table>" has no problem, but "<table><table></table></table>" would return incorrect results). This is easily fixable by using another negative lookahead immediately after the opening <table>, but in any case this regex is also slower than it needs to be, since it tests a negative lookahead against every character contained within table tags.

To address both of those issues, I used the following regex:

<table>(?:[^<]+|<(?!table>))*?</table>

First, that increases speed (in theory… you'll see that there is now a much bigger issue than before), because within each <table> tag it will greedily jump between all characters which are not < in single steps (using [^<]+), and it will only use the negative lookahead when it encounters <. Secondly, it solves the previously noted error by using <(?!table>) instead of .(?!<table>).

If you're wondering about table tags which contain attributes, that's not a problem. Here's an updated regex to accomplish this (the added parts are highlighted in yellow):

<table\b[^>]*>(?:[^<]+|<(?!table\b[^>]*>))*?</table>

At first I thought this closed the case… The regex supports an unlimited amount of recursion within its context, despite the traditional wisdom that regexes are incapable of recursion. However, one of the forum moderators noted that its performance headed south very quickly when run against certain examples of real world data. This was a result of the regex triggering catastrophic backtracking. Although this is something I should've anticipated (nested quantifiers should always warrant extra attention and care), it's very easy to fix using an atomic grouping or possessive quantifier (I'll use an atomic grouping here since they're more widely supported). The addition to the regex is highlighted:

<table\b[^>]*>(?:(?>[^<]+)|<(?!table\b[^>]*>))*?</table>

And that's it. As a result of all this, the regex not only does its job, but it performs quite impressively. When running it over a source code test case (which previously triggered catastrophic backtracking) containing nearly 100,000 characters and lots of nested tables, it correctly returned all innermost tables in a couple milliseconds on my system.

However, note that neither possessive quantifiers nor atomic groupings are supported by some programming languages, such as JavaScript. If you want to pull this off in JavaScript, an approach which is not susceptible to catastrophic backtracking would be:

<table\b[^>]*>(?!<table\b[^>]*>)(?:[\S\s](?!<table\b[^>]*>))*?</table>

That runs a little bit slower than (but produces the same result as) the earlier regex which relied on an atomic grouping.

If you have a copy of RegexBuddy (and if you don't, I highly recommend it), run these regexes through its debugger for an under-the-hood look at how they're handled by a regex engine.

Edit: Using a trick I just stumbled upon (which I'll have to blog about in a second), the regex can be rewritten in a way that does not rely on an atomic grouping but is nearly as fast as the one that does:

<table\b[^>]*>(?:(?=([^<]+))\1|<(?!table\b[^>]*>))*?</table>

Basically, that uses a capturing group inside a positive lookahead followed by \1 to act just like an atomic group!

Update: See Matching Nested Constructs in JavaScript, Part 2 for a way to match outermost HTML elements using JavaScript.

parseUri: Split URLs in JavaScript

Update: The following post is outdated. See parseUri 2.0 for the latest, greatest version.

For fun, I spent the 10 minutes needed to convert my parseUri() ColdFusion UDF into a JavaScript function.

For those who haven't already seen it, I'll repeat my explanation from the other post…

parseUri() splits any well-formed URI into its parts (all are optional). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing backslash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., parseUri(uri).anchor, etc.

I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.

This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.

/* parseUri JS v0.1.1, by Steven Levithan <http://stevenlevithan.com>
Splits any well-formed URI into the following parts (all are optional):
----------------------
- source (since the exec method returns the entire match as key 0, we might as well use it)
- protocol (i.e., scheme)
- authority (includes both the domain and port)
  - domain (i.e., host; can be an IP address)
  - port
- path (includes both the directory path and filename)
  - directoryPath (supports directories with periods, and without a trailing backslash)
  - fileName
- query (does not include the leading question mark)
- anchor (i.e., fragment) */
function parseUri(sourceUri){
	var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"],
		uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri),
		uri = {};
	
	for(var i = 0; i < 10; i++){
		uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
	}
	
	/* Always end directoryPath with a trailing backslash if a path was present in the source URI
	Note that a trailing backslash is NOT automatically inserted within or appended to the "path" key */
	if(uri.directoryPath.length > 0){
		uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
	}
	
	return uri;
}

Test it.

Is there any leaner, meaner URI parser out there? 🙂

Edit: This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if people leave comments asking for the support, I'll go ahead and add it.

Regexes in Depth: Advanced Quoted String Matching

In my previous post, one of the examples I used of when capturing groups are appropriate demonstrated how to match quoted strings:

(["'])(?:\\\1|.)*?\1

(Note: This has some issues. See Edit 2, at the end of this post, if you intend to use this in an application.)

To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.

On his blog, Ben Nadel asked:

I do not follow the \\\1 in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (*?) know when to stop if it can have quotes in side of it? What am I missing?

Good question. Following is the response I gave, slightly updated to improve clarity:

First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:

"test"
'test'
"t'es't"
'te"st'
'te\'st'
"\"te\"\"st\""

In other words, it allows any number of escaped quotes of the same type as the enclosure.

As for how the regex works, it uses a trick similar in construct to the examples I gave in my blog post about regex recursion.

Basically, the inner grouping matches escaped quotes OR any single character, with the escaped quote part before the dot in the test attempt sequence. So, as the lazy repetition operator (*?) steps through the match looking for the first closing quote, it jumps right past each instance of the two characters which together make up an escaped quote. In other words, pairing something other than the quote mark with the quote mark makes the lazy repetition operator treat them as one node, and continue on it's way through the string.

Note that if you wanted to support multi-line quotes in libraries without an option to make dots match newlines (e.g., JavaScript), change the dot to [\S\s]. Also, with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent to each other:

(["'])(?:\\\1|.)*?\1 (the regex being discussed)
(["']).*?(?<!\\)\1 (uses a negative lookbehind to achieve logic which is possibly simpler to understand)

Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. And if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.

One final thing worth noting is that in neither regex did I try to use anything like [^\1] for matching the inner, quoted content. If [^\1] worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes. First of all, the reason we can't greedily repeat an "any character" pattern such as a dot or [\S\s] is that we would then no longer be able to distinguish between multiple discrete quotes within the same string, and our match would go from the start of the first quote to the end of the last quote. The reason we can't use [^\1] either is because you can't use backreferences within character classes, even though in this case the backreference's value is only one character in length. Also note that the patterns [\1] and [^\1] actually do have special meaning, though possibly not what you would expect. With most regular expression libraries, they assert: match a single character which is/is not octal index 1 in the character set. To assert that outside of a character class, you'd typically need to use a leading zero (e.g., \01), but inside a character class the leading zero is often optional.

If anyone has questions about how or why other specific regex patterns work or don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.

Edit: Just for kicks, here's a version which adds support for fancy, angled “…” and ‘…’ pairs. This uses a lookbehind and conditionals. Libraries which support both features include the .NET framework and PCRE.

(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))

In the above, I'm using nested conditionals to achieve an if/elseif/else construct. Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with " or '.

“test”
“te“st”
“te\”st”
‘test’
‘t‘e“"”s\’t’

Edit 2: Note that these regexes were designed more for illustrative purposes than practical use within programming. One issue is that they don't account for escaped backslashes within the quotes (e.g., they treat \\" as a backslash followed by an escaped double quote, rather than an escaped backslash followed by an unescaped double quote. However, that's easy to address. For the first regex, just replace it with this:

(["'])(?:\\?.)*?\1

To also avoid an issue with quote marks which are followed by an escaped quote of the same type but are not followed by a closing quote, make the first quantifier possessive, like so:

(["'])(?:\\?+.)*?\1

Or, if the regex engine you're using doesn't support possessive quantifiers or atomic groups (which can be used equivalently), use one of the following:

(["'])(?:(?=(\\?))\2.)*?\1 «OR» (["'])(?:(?!\1)[^\\]|\\.)*\1

The former mimics an atomic group, and the later utilizes a negative lookahead which allows replacing the lazy star with a greedy one. There is still potential to optimize these for efficiency, and none of them account for the outermost opening quote marks being escaped or other issues regarding context (e.g., you might want to ignore quoted strings within code comments), but still, these are probably good enough for most people's needs.