Capturing Multiple, Optional HTML Attribute Values

Let's say you wanted to find all <div> tags, and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for any other attributes which might exist, and allow attributes to appear in any order.

I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):

<div\b(?>\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+)*>

The finer details of the pattern are designed for efficiency (even with bad data such as unclosed <div> tags) over simplicity. Note that it will capture the id to backreference one and the class to backreference two regardless of the order the attributes appear in (i.e., class remains constant as backreference two even if it comes before id, or if id doesn't exist).

The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:

<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>

In the above, a backreference to a capturing group within a lookahead is used to mimic an atomic group, so the backreference numbers for the id and class values are shifted to two and three, respectively.
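To make the shifted group numbers concrete, here is a minimal sketch of the emulated pattern in Python. Python is my choice for illustration only (the post doesn't use it), and I'm assuming its re module keeps values captured in earlier iterations of the repeated group, as most engines do. The names DIV_ATTRS and capture_div_attrs are hypothetical.

```python
import re

# Lookahead plus backreference stands in for the atomic group.
# Group 1 is the sacrificial capture consumed by \1;
# the id value lands in group 2 and the class value in group 3.
DIV_ATTRS = re.compile(
    r'<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>'
)


def capture_div_attrs(html):
    """Return (id, class) for the first <div> tag, or None if no tag matches.

    Either element is None when that attribute is absent from the tag.
    """
    m = DIV_ATTRS.search(html)
    if m is None:
        return None
    return m.group(2), m.group(3)


# class appears before id here, but id is still read from group 2.
result = capture_div_attrs('<div class="note" title="x" id="main">')
```

Like the simplified regexes above, this only handles double-quoted values with no whitespace around the equals signs.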

Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.
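As a sketch of how the attribute list might be extended, the alternation can be generated from a list of names. This uses Python and the lookahead-based emulation rather than a true atomic group. The helper names build_tag_regex and capture_attrs are made up for this example.

```python
import re


def build_tag_regex(tag, attrs):
    """Compile a pattern that captures each listed attribute of <tag>.

    Group 1 is the sacrificial capture used to mimic the atomic group,
    so the attribute at index i (0-based) lands in group i + 2.
    """
    alternatives = "|".join(re.escape(a) + r'="([^"]*)"' for a in attrs)
    return re.compile(
        "<" + re.escape(tag)
        + r"\b(?:(?=(\s+(?:" + alternatives + r")|[^\s>]+|\s+))\1)*>"
    )


def capture_attrs(tag, attrs, html):
    """Return {attribute name: value or None} for the first matching tag."""
    m = build_tag_regex(tag, attrs).search(html)
    if m is None:
        return None
    return {name: m.group(i + 2) for i, name in enumerate(attrs)}


# The values come back keyed by name, whatever order the tag uses.
found = capture_attrs("img", ["src", "alt", "width"],
                      '<img width="10" alt="logo" src="a.png">')
```

The group order follows the order of the list, not the order of the attributes in the tag.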

I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article. ;-) If you have any specific questions about the pattern, feel free to ask.

Unfortunately for JavaScripters including myself, neither of the above regexes works as described in Firefox 2.0.0.6 or Opera 9.23, although the latter regex works fine in IE, and either will work in Safari 3 beta since that browser supports atomic groups (unlike all other major browsers). The latter regex fails in Firefox and Opera because those two browsers, unlike most other regex engines, reset backreference values when an alternation option fails before the engine reaches a capturing group within it. Of course, you could achieve the same end result using more verbose code paired with multiple regexes, but that just wouldn't be as cool. Or you could just use the DOM, which would usually be more appropriate for something like this in JavaScript anyway.
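For comparison, here is a rough sketch of the more verbose multi-regex route: one pattern finds each tag, and a second pulls the attributes out of it. It is written in Python for illustration only, with the same simplifications as above (double-quoted values, no whitespace around the equals signs). The names TAG, ATTR and div_attrs_two_step are hypothetical.

```python
import re

# Step 1: find each opening <div> tag.
TAG = re.compile(r'<div\b[^>]*>')
# Step 2: pull id/class pairs out of a single tag's text.
ATTR = re.compile(r'\s(id|class)="([^"]*)"')


def div_attrs_two_step(html):
    """Return a list of (id, class) tuples, one per <div> tag found."""
    results = []
    for tag in TAG.finditer(html):
        found = {"id": None, "class": None}
        for name, value in ATTR.findall(tag.group(0)):
            found[name] = value
        results.append((found["id"], found["class"]))
    return results


pairs = div_attrs_two_step('<div class="a" id="b"><div id="c">')
```

This doesn't depend on how an engine treats captures inside a repeated group, at the cost of a second pass over each tag.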

16 thoughts on “Capturing Multiple, Optional HTML Attribute Values”

Gianna says:

August 18, 2007 at 6:40 pm

hi nice post, i enjoyed it
Michael Geary says:

September 1, 2007 at 8:03 pm

I’ve learned a lot about regexes from reading your blog – thanks! But you seriously need to get a fluid width theme. The code in this article runs off the edge of the fixed width column – not readable at all.
Steve says:

September 2, 2007 at 4:45 pm

I agree that I need to change the theme, for a number of reasons. Which browser and version are you using? What you describe should only be the case if your browser doesn’t automatically break on any of the characters in the regexes, and doesn’t respect <wbr/> or entity 8203.
Niel Drummond says:

September 14, 2007 at 7:11 am

Looks neat. I have done something like the above before in Perl. Needless to say, IE6 JScript does not support lookahead. :/
Steve says:

September 14, 2007 at 9:32 am

Hi Niel, IE6 supports lookahead just fine (although it does have a bug involving multiple lookaheads at the same match position). Note the specific mention in my post of the second pattern working correctly in IE. That should be true down to IE 5.5.
Edward says:

October 4, 2007 at 10:58 pm

You are a Master f!@$ing Sensei at Expressions….. I find it difficult to fathom the basics of JavaScript (maybe I’m just not that sharp) but your grasp of the subject is incredible. You got a bright future … Unless the Feds use your brain to create an army of neo-scientists or something.

Hats Off
George Fisher says:

December 19, 2007 at 10:00 pm

Searched high and low for this. Thanks for the post. I was looking for numerous attributes of an img tag and this is working for me …

<img\b(?>\s+(?:alt="([^"]*)"|class="([^"]*)"|style="([^"]*)"|src="([^"]*)"|height="([^"]*)"|width="([^"]*)")|[^\s>]+|\s+)*>

Thanks very much
Steve says:

December 19, 2007 at 11:04 pm

@George, you’re very welcome. Note that if you wanted the attributes to be required while maintaining the rest of the regex’s flexibility, you could use backreferences to empty capturing groups like so (I’ll just use the first two attributes from your regex to keep it short):

<img\b(?>\s+(?:alt="([^"]*)"()|class="([^"]*)"())|[^\s>]+|\s+)*\2\4>

Backref one would be the alt value, backref three would be the class value, and backrefs two and four are just overhead towards making the magic happen.
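A minimal sketch of the same idea in Python, for illustration only. It uses the lookahead-based emulation instead of a true atomic group, so every group number shifts up by one: alt is group 2, class is group 4, and the empty groups are 3 and 5. It also assumes that a backreference to a group that never participated fails the match, which is how Python's re module behaves as far as I know. The names IMG_REQUIRED and required_img_attrs are hypothetical.

```python
import re

# Each attribute alternative ends with an empty capturing group.
# The trailing \3\5 can only succeed if both empty groups took part
# in the match, which makes both attributes required.
IMG_REQUIRED = re.compile(
    r'<img\b(?:(?=(\s+(?:alt="([^"]*)"()|class="([^"]*)"())|[^\s>]+|\s+))\1)*\3\5>'
)


def required_img_attrs(html):
    """Return (alt, class) only when the first <img> tag has both attributes."""
    m = IMG_REQUIRED.search(html)
    if m is None:
        return None
    return m.group(2), m.group(4)


both = required_img_attrs('<img class="c" src="a.png" alt="x">')
missing = required_img_attrs('<img alt="x">')  # no class attribute
```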
George Fisher says:

December 20, 2007 at 7:48 am

Thanks again. I’m probably doing something wrong, but copying the new string into my trusty RegexBuddy doesn’t produce a match. It matches without the backrefs, but not with them.
Steve says:

December 21, 2007 at 4:00 pm

@George, I’m not sure what you’re describing as “the new string”, but the regexes discussed here work fine in RegexBuddy 3.1. If you have v3.x, I would recommend asking for help on its integrated forum. If not, I would recommend you upgrade. 😉
Phillip says:

March 26, 2008 at 2:16 pm

Could you help me modify your regex so that it captures multiple attributes of a style? For example, I want to capture only the styles bold, italics and underline.

<span style="font-style: italic; font-weight:bold; height:100px;" class="class" title="title">styled text </span>

many thanks
Steve says:

March 26, 2008 at 7:47 pm

@Phillip, I’m not sure I understand what you’re trying to do, since bold, italic, and underline are merely values for their related CSS properties. If you’re just checking for the existence of certain words within your style attribute, that’s unrelated to this post. Also, you haven’t mentioned which regex flavor you’re using.

In any case, if you need one-on-one regex advice I’d recommend regexadvice.com or the RegexBuddy forums.
kurt says:

June 7, 2010 at 11:13 am

How would one make sure at least n attributes are matched?

I have the following:

(?:<SPAN|<span)+\b(?:(?=(\s+(?:datafield="?([^"|\s|>]*)"?|class="?dataElement"?|id="?([^"|\s|>]*)"?)|[^\s>]+|\s+))\1)*>(.*?)(?:</SPAN>|</span>)

And I would ideally like to match spans that have datafield, dataElement, and id (in any order). I’ve tried using {3,}, but I don’t think it was applied as I assumed, due to the last sequence in the optional capturing group that says all but " \s or >. Your help would be greatly appreciated.
ridgerunner says:

August 20, 2010 at 5:14 pm

Just when one thinks one has become a “Regex Master”, how soon we are humbled to learn new tricks that define a previously hidden new higher level. After taking the time to absorb the subtleties embedded in this post, I come away richer.

I learned 2 new tricks:
1.) How to capture randomly ordered items in a known order.
2.) How to simulate atomic grouping by combining lookahead with a sacrificial capture group.

Both are gems. Thank you!
Daan says:

July 24, 2011 at 1:50 pm

Wow, this is great man, tnx very muchos!
