Let's say you wanted to find all <div> tags, and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for any other attributes which might exist, and allow attributes to appear in any order.
I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):
<div\b(?>\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+)*>
The finer details of the pattern are designed for efficiency (even with bad data such as unclosed <div> tags) over simplicity. Note that it will capture the id to backreference one and the class to backreference two regardless of the order the attributes appear in (i.e., class remains constant as backreference two even if it comes before id, or if id doesn't exist).
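To check the capture-order claim, here's a minimal Python sketch. It rests on two assumptions that aren't part of the original write-up: that the pattern ends by closing the atomic group, repeating it with *, and matching the tag's closing >, and that you're running Python 3.11 or later, the first version whose re module accepts atomic groups. The helper name id_and_class is made up for this illustration.

```python
import re
import sys

# Assumption: the pattern is closed with ")*>" so the atomic group repeats
# once per attribute (or run of whitespace) until the tag's closing ">".
DIV_ATTRS = r'<div\b(?>\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+)*>'

# Python's re module only accepts atomic groups, (?>...), from 3.11 onward.
ATOMIC_GROUPS_SUPPORTED = sys.version_info >= (3, 11)


def id_and_class(tag):
    """Return (id, class) for one <div> opening tag, or None if it doesn't match.

    A missing attribute shows up as None in its slot.
    """
    match = re.match(DIV_ATTRS, tag)
    return match.groups() if match else None


if ATOMIC_GROUPS_SUPPORTED:
    print(id_and_class('<div id="main" class="box">'))   # ('main', 'box')
    print(id_and_class('<div class="box" id="main">'))   # ('main', 'box')
    print(id_and_class('<div class="box">'))             # (None, 'box')
    print(id_and_class('<div id="main" class="box"'))    # None (unclosed tag)
```

On older Python versions the guard simply skips the calls, since compiling the pattern would raise re.error there.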
The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:
<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>
In the above, a backreference to a capturing group within a lookahead is used to mimic an atomic group, so the backreference numbers for the id and class values are shifted to two and three, respectively.
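Here's a rough Python sketch of both points: that a lookahead followed by \1 behaves like an atomic group, and that id and class land in backreferences two and three whatever their order. Python is used because its re module keeps a group's last value across repetitions. The tail of the pattern (|\s+))\1)*>) is an assumption, filled in to mirror the atomic-group version, and id_and_class is a made-up helper name.

```python
import re

# Plain grouping can give back an "a" so that the trailing "ab" still fits.
assert re.match(r'(?:a+)ab', 'aaab') is not None
# Lookahead + backreference: \1 must re-match all three a's captured inside
# the lookahead, and the engine never re-enters the lookahead to capture fewer.
assert re.match(r'(?=(a+))\1ab', 'aaab') is None

# Assumption: the truncated tail is "|\s+))\1)*>".
DIV_ATTRS = r'<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>'


def id_and_class(tag):
    """Return (id, class) for one <div> opening tag, or None if it doesn't match."""
    match = re.match(DIV_ATTRS, tag)
    if match is None:
        return None
    # Group 1 is only the helper capture that emulates atomicity.
    return match.group(2), match.group(3)


print(id_and_class('<div id="main" class="box">'))    # ('main', 'box')
print(id_and_class('<div class="box" id="main">'))    # ('main', 'box')
print(id_and_class('<div title="t" class="box">'))    # (None, 'box')
print(id_and_class('<div id="main"'))                 # None (unclosed tag)
```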
Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.
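One way to picture "as many other attributes as you want" is to generate the alternation from a list of names. The sketch below does that in Python with the lookahead form of the pattern. The function names build_tag_pattern and attribute_values are hypothetical, and it still only handles double-quoted values with no whitespace around the equals sign.

```python
import re


def build_tag_pattern(tag, attrs):
    """Compile a pattern that captures each listed attribute's value.

    Values land in groups 2, 3, ... in the order of `attrs`; group 1 is the
    helper capture that stands in for the atomic group.
    """
    captures = "|".join(re.escape(name) + r'="([^"]*)"' for name in attrs)
    return re.compile(
        r"<" + re.escape(tag)
        + r"\b(?:(?=(\s+(?:" + captures + r")|[^\s>]+|\s+))\1)*>"
    )


def attribute_values(pattern, attrs, text):
    """Return {name: value or None} for the first matching tag in `text`."""
    match = pattern.search(text)
    if match is None:
        return None
    return {name: match.group(i + 2) for i, name in enumerate(attrs)}


ATTRS = ["id", "class", "title"]
DIV = build_tag_pattern("div", ATTRS)

html = '<p>intro</p><div title="Hello" class="box" id="main">'
print(attribute_values(DIV, ATTRS, html))
# {'id': 'main', 'class': 'box', 'title': 'Hello'}
```

The group numbers follow the order of the list, not the order in the tag, which is the property the pattern is built around.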
I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article. If you have any specific questions about the pattern, feel free to ask.
Unfortunately for JavaScripters including myself, neither of the above regexes works as described in Firefox 2.0.0.6 or Opera 9.23, although the latter regex works fine in IE, and either will work in Safari 3 beta since that browser supports atomic groups (unlike all other major browsers). The latter regex fails in Firefox and Opera because those two browsers, unlike most other regex engines, reset backreference values when an alternation option fails before the engine reaches a capturing group within it. Of course, you could achieve the same end result using more verbose code paired with multiple regexes, but that just wouldn't be as cool. Or you could just use the DOM, which would usually be more appropriate for something like this in JavaScript anyway.
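For comparison, the "more verbose code paired with multiple regexes" route might look something like the following. This is a Python sketch of the general idea rather than the JavaScript the post has in mind. The names OPEN_DIV, attribute, and divs_with_id_and_class are made up, and the same simplifications apply (double-quoted values only, no > inside attribute values).

```python
import re

# Step 1: grab each <div ...> opening tag as a whole.
OPEN_DIV = re.compile(r"<div\b[^>]*>")


def attribute(tag, name):
    """Step 2: look up one double-quoted attribute value inside a single tag."""
    found = re.search(r"\s" + re.escape(name) + r'="([^"]*)"', tag)
    return found.group(1) if found else None


def divs_with_id_and_class(html):
    """Return an (id, class) pair for every <div> opening tag in `html`."""
    return [
        (attribute(tag, "id"), attribute(tag, "class"))
        for tag in OPEN_DIV.findall(html)
    ]


html = '<div class="a" id="x"><span id="no"></span></div><div class="b">'
print(divs_with_id_and_class(html))   # [('x', 'a'), (None, 'b')]
```

Each attribute gets its own search, so order stops mattering, at the cost of scanning every tag once per attribute.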