August 2007 — Flagrant Badassery

Let's say you wanted to find all <div> tags, and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for any other attributes which might exist, and allow attributes to appear in any order.

I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):

<div\b(?>\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+)*>

The finer details of the pattern are designed for efficiency (even with bad data such as unclosed <div> tags) over simplicity. Note that it will capture the id to backreference one and the class to backreference two regardless of the order the attributes appear in (i.e., class remains constant as backreference two even if it comes before id, or if id doesn't exist).

The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:

<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>

In the above, a backreference to a capturing group within a lookahead is used to mimic an atomic group, so the backreference numbers for the id and class values are shifted to two and three, respectively.

Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.

I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article. wink If you have any specific questions about the pattern, feel free to ask.

Unfortunately for JavaScripters including myself, neither of the above regexes work as described in Firefox 2.0.0.6 or Opera 9.23, although the latter regex works fine in IE, and either will work in Safari 3 beta since that browser supports atomic groups (unlike all other major browsers). It doesn't work in Firefox or Opera since those two browsers—unlike most other regex engines—reset backreference values when an alternation option fails before the engine reaches a capturing group within it. Of course, you could achieve the same end-result using more verbose code paired with multiple regexes, but that just wouldn't be as cool. Or you could just use the DOM, which would usually be more appropriate for something like this in JavaScript anyway.

Yes I know, there are many other JavaScript regex testers available. Why did I create yet another? RegexPal brings several new things to the table for such web-based apps, and in my (biased) opinion it's easier to use and more helpful towards learning regular expressions than the others currently available. Additionally, most other such tools are very slow for the kind of data I often work with. They might appear fast when displaying 10 matches, but what about 100, 1000, or 5000? Try generating 5,000 matches (which is easy to do with an any-character pattern such as a dot) in your favorite existing web-based tool and see if your browser ever recovers (doubtful). The same task takes RegexPal less than half a second, and what's more, results overlay the text while you're typing it.

At the moment, RegexPal is short on features, but here are the highlights:

Real-time regex syntax highlighting with backwards and forwards context awareness.
Lightning-fast match highlighting with alternating styles.
Inverted matches (match any text not matched by the regex).

I'm not sure when I'll add additional features, but there are lots of things I'm considering. If there is something you'd like to see, let me know.

A few things to be aware of:

The approach I've used for scrollable rich-text editing (which I haven't seen elsewhere) is fast but a bit buggy. Firefox 2 and IE7 have the least issues, but it more or less works in other browsers as well.
The syntax highlighting generally marks corner-case issues that create cross-browser inconsistencies as errors even if they are the result of browser bugs or missing behavior documentation in ECMA-262 v3.
There are different forms of line breaks cross-platform/browser. E.g., Firefox uses \n even on Windows where nearly all programs use \r\n. This can affect the results of certain regexes.

At least for me, RegexPal is lots of fun to play with and helps to make learning regular expressions easy through its instant feedback. I encourage you to just go play with it and discover its results on your own, but for the curious, I'll keep rambling…

Regex syntax parsing (needed for the syntax highlighting) is somewhat complex, due to the numerous backwards and forwards context awareness issues involved. Take, for example, the pattern \10. What does it mean?

Backreference 10, if not inside a character class and at least 10 capturing groups are opened before that point.
Backreference 1, followed by a literal "0", if not inside a character class and between 1 and 9 capturing groups are opened before that point.
Octal character index 10 (decimal 8), if inside a character class, or if no capturing groups are opened before that point.
The three literal characters "\", "1", and "0", if preceded by an unescaped "\" character.
An incomplete token in a couple other situations.

Another example is the "-" character. Outside a character class it's always a literal hyphen, but inside a character class…

It creates a range between tokens if:
- There is a preceding and following token in the class, or it's preceded by a token and is the last character in an unclosed character class (caveats follow).
It's a literal character if:
- It's the first or last character in the class.
- It's preceded by an unescaped "\".
- It follows a token which is the end index for a range.
- It follows a hyphen which creates a range.
It's an error if:
- It's creating a range between tokens in reverse character index order (e.g., z-a, @-!, \uFFFF-\b, or \127-\cB).
- It would otherwise create a range, but it's followed or preceded by a token which represents more than one character index (e.g., \d). In fact, in some cases browsers take this to mean that the hyphen should be treated as a literal, but browser bugs cause it to be handled inconsistently so RegexPal flags it as a range error.

Here are a few more things which aren't errors but are flagged as such:

Empty, top-level alternation, except at the end of the pattern, where such an alternation is ignored when highlighting matches in order to create a less surprising experience while the user is in the middle of constructing the regex. Empty, top-level alternation is flagged as an error because it effectively truncates the regex at that point (since it will always match). If a zero-length, top-level alteration is really needed, there are other easy ways to do that more explicitly.
Lookaround quantifiers (e.g., the plus sign in (?!x)+). This would be an actual error with some regex libraries (e.g., PCRE), and although that's not the case in most web browsers, such constructs add no value. As a result, RegexPal flags such quantifiers as an error, since they are almost certainly a user mistake.
\c when not followed by A–Z, \x when not followed by two hex characters, and \u when not followed by four hex characters. Although these do not cause most browsers to throw errors, they are handled inconsistently cross-browser and are hence flagged as errors. They would almost certainly be a user mistake even if the cross-browser issues didn't exist.

Credit to osteele.com from where the text of the short-and-sweet Quick Reference is based, and to RegexBuddy from JGsoft for inspiring many of RegexPal's features. The name RegexPal is, in part, a nod to RegexBuddy, but also selected because it contains both "regex" and "regexp." wink

Month: August 2007

Capturing Multiple, Optional HTML Attribute Values

RegexPal: Web-Based Regex Testing Reinvented