Capturing Multiple, Optional HTML Attribute Values
Let's say you wanted to find all <div> tags, and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for any other attributes which might exist, and allow attributes to appear in any order.
I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):
<div\b(?>\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+)*>
The finer details of the pattern are designed for efficiency (even with bad data such as unclosed <div> tags) over simplicity. Note that it will capture the id to backreference one and the class to backreference two regardless of the order the attributes appear in (i.e., class remains constant as backreference two even if it comes before id, or if id doesn't exist).
The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:
<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>
In the above, a backreference to a capturing group within a lookahead is used to mimic an atomic group, so the backreference numbers for the id and class values are shifted to two and three, respectively.
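As a rough illustration, here is a minimal Python sketch of the lookahead-based version. It assumes the pattern ends with `|\s+))\1)*>` (the same `)*>` close as the complete `<img>` pattern quoted in the comments), and it relies on Python's `re` module keeping a group's last captured value across repetitions of the outer group. The names `DIV_PATTERN` and `capture_id_and_class` are made up for the example.

```python
import re

# Group 1 is the lookahead wrapper that mimics the atomic group;
# group 2 is the id value and group 3 is the class value.
DIV_PATTERN = re.compile(
    r'<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>'
)


def capture_id_and_class(html):
    """Return an (id, class) tuple for each <div> opening tag in html."""
    return [(m.group(2), m.group(3)) for m in DIV_PATTERN.finditer(html)]


if __name__ == "__main__":
    sample = '<div class="box" title="t" id="main"><div id="solo"></div></div>'
    # class comes before id in the first tag, yet id is still group 2.
    print(capture_id_and_class(sample))
```

A missing attribute shows up as `None` rather than shifting the other value into its slot, and an unclosed tag such as `<div id="a"` simply produces no match.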
Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.
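To show what adding more attributes might look like, here is a hedged Python sketch that builds the lookahead-based pattern for any tag name and attribute list. The helpers `build_tag_pattern` and `extract_attributes` are hypothetical names for this example, and the sketch keeps the same simplifications as the regexes above (double-quoted values only, no whitespace around the equals signs).

```python
import re


def build_tag_pattern(tag, attributes):
    """Compile a pattern with one capturing group per listed attribute."""
    alternatives = "|".join(
        rf'{re.escape(name)}="([^"]*)"' for name in attributes
    )
    # Group 1 is the lookahead wrapper, so attribute values start at group 2
    # and follow the order of the `attributes` list, not the order in the tag.
    return re.compile(
        rf'<{re.escape(tag)}\b(?:(?=(\s+(?:{alternatives})|[^\s>]+|\s+))\1)*>'
    )


def extract_attributes(html, tag, attributes):
    """Return one dict per matching tag, mapping attribute name to value."""
    pattern = build_tag_pattern(tag, attributes)
    return [
        dict(zip(attributes, m.groups()[1:]))  # skip the wrapper group
        for m in pattern.finditer(html)
    ]


if __name__ == "__main__":
    html = '<img width="10" src="a.png" alt="A"><img src="b.png">'
    for row in extract_attributes(html, "img", ["alt", "src", "width"]):
        print(row)
```

Attributes that a tag lacks come back as `None`, so each result has the same keys whatever the tag actually contains.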
I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article.
If you have any specific questions about the pattern, feel free to ask.
Unfortunately for JavaScripters like me, neither of the above regexes works as described in Firefox 2.0.0.6 or Opera 9.23, although the latter regex works fine in IE, and either will work in Safari 3 beta, since that browser supports atomic groups (unlike all other major browsers). The regexes fail in Firefox and Opera because those two browsers, unlike most other regex engines, reset backreference values when an alternation option fails before the engine reaches a capturing group within it. Of course, you could achieve the same end result using more verbose code paired with multiple regexes, but that just wouldn't be as cool. Or you could just use the DOM, which would usually be more appropriate for something like this in JavaScript anyway.

Comment by Gianna on 18 August 2007:
hi nice post, i enjoyed it
Comment by Michael Geary on 1 September 2007:
I’ve learned a lot about regexes from reading your blog - thanks! But you seriously need to get a fluid width theme. The code in this article runs off the edge of the fixed width column - not readable at all.
Comment by Steve on 2 September 2007:
I agree that I need to change the theme, for a number of reasons. Which browser and version are you using? What you describe should only be the case if your browser doesn’t automatically break on any of the characters in the regexes, and doesn’t respect <wbr/> or entity 8203.
Comment by Niel Drummond on 14 September 2007:
looks neat, I have done something like the above before in perl.. needless to say, IE6 jscript does not support lookahead, /
Comment by Steve on 14 September 2007:
Hi Niel, IE6 supports lookahead just fine (although it does have a bug involving multiple lookaheads at the same match position). Note the specific mention in my post of the second pattern working correctly in IE. That should be true down to IE 5.5.
Comment by Edward on 4 October 2007:
You are a Master f!@$ing Sensei at Expressions….. I find it difficult to fathom the basics of JavaScript (maybe I’m just not that sharp) but your grasp of the subject is incredible. You got a bright future … Unless the Feds use your brain to create an army of neo-scientists or something.
Hats Off
Comment by George Fisher on 19 December 2007:
Searched high and low for this. Thanks for the post. I was looking for numerous attributes of an img tag and this is working for me …
<img\b(?>\s+(?:alt="([^"]*)"|class="([^"]*)"|style="([^"]*)"|src="([^"]*)"|height="([^"]*)"|width="([^"]*)")|[^\s>]+|\s+)*>
Thanks very much
Comment by Steve on 19 December 2007:
@George, you’re very welcome. Note that if you wanted the attributes to be required while maintaining the rest of the regex’s flexibility, you could use backreferences to empty capturing groups like so (I’ll just use the first two attributes from your regex to keep it short):
<img\b(?>\s+(?:alt="([^"]*)"()|class="([^"]*)"())|[^\s>]+|\s+)*\2\4>
Backref one would be the alt value, backref three would be the class value, and backrefs two and four are just overhead towards making the magic happen.
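Here is a small Python sketch of the required-attributes trick, offered as an illustration rather than a definitive implementation. It swaps the atomic group for the lookahead emulation so that it also runs on Python versions before 3.11 (where the `re` module gained atomic groups), which shifts every group number up by one. It also assumes an engine in which a backreference to a group that never participated fails to match, as Python's does. `REQUIRED_PATTERN` and `alt_and_class` are made-up names.

```python
import re

# Lookahead emulation of the atomic group, so the numbering shifts by one:
# group 2 = alt value,   group 3 = empty marker set when alt was seen,
# group 4 = class value, group 5 = empty marker set when class was seen.
# The trailing \3\5 only succeeds if both markers participated in the match.
REQUIRED_PATTERN = re.compile(
    r'<img\b(?:(?=(\s+(?:alt="([^"]*)"()|class="([^"]*)"())|[^\s>]+|\s+))\1)*\3\5>'
)


def alt_and_class(tag):
    """Return (alt, class) if the tag carries both attributes, else None."""
    m = REQUIRED_PATTERN.search(tag)
    return (m.group(2), m.group(4)) if m else None


if __name__ == "__main__":
    print(alt_and_class('<img class="c" src="x.png" alt="a">'))
    print(alt_and_class('<img alt="a" src="x.png">'))
```

The empty groups match nothing, so they cost no characters; they only record that their alternative was taken at some point in the tag.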
Comment by George Fisher on 20 December 2007:
Thanks again. I’m probably doing something wrong, but copying the new string into my trusty RegexBuddy doesn’t produce a match. It matches without the backrefs, but not with them.
Comment by Steve on 21 December 2007:
@George, I’m not sure what you’re describing as “the new string”, but the regexes discussed here work fine in RegexBuddy 3.1. If you have v3.x, I would recommend asking for help on its integrated forum. If not, I would recommend you upgrade. ;)
Comment by Phillip on 26 March 2008:
Could you help me modify your regex so that it captures multiple attributes of a style? For example, I want to capture only the styles bold, italic and underline.
<span style="font-style: italic; font-weight:bold; height:100px;" class="class" title="title">styled text </span>
many thanks
Comment by Steve on 26 March 2008:
@Phillip, I’m not sure I understand what you’re trying to do, since bold, italic, and underline are merely values for their related CSS properties. If you’re just checking for the existence of certain words within your style attribute, that’s unrelated to this post. Also, you haven’t mentioned which regex flavor you’re using.
In any case, if you need one-on-one regex advice I’d recommend regexadvice.com or the RegexBuddy forums.