Let's say you wanted to find all <div> tags and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions probably knows how tricky it gets to match or capture multiple, specific attribute values with one regex: the regex has to allow for any other attributes that might be present, and for the attributes to appear in any order.
I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):
The finer details of the pattern favor efficiency (even with bad data such as unclosed <div> tags) over simplicity. Note that it will capture the id to backreference one and the class to backreference two regardless of the order the attributes appear in (i.e., class remains backreference two even if it comes before id, or if id doesn't exist).
The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:
In the above, a backreference to a capturing group within a lookahead is used to mimic an atomic group. Because that helper group takes backreference one, the backreference numbers for the id and class values are shifted to two and three, respectively.
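As a rough sketch of the lookahead-plus-backreference trick, here is a version that avoids atomic groups entirely. It is written in Python only for illustration, under the same simplifications (double-quoted values, no whitespace around the equals signs); DIV_ATTRS_PORTABLE is a made-up name and the pattern is an assumption about the general shape, not a definitive implementation.

```python
import re

# Group 1 is the helper capture inside the lookahead,
# so id moves to group 2 and class to group 3.
DIV_ATTRS_PORTABLE = re.compile(
    r'<div\b'
    r'(?:'
    r'(?=('                                  # lookahead finds the next chunk without consuming it
    r'\s+(?:id="([^"]*)"|class="([^"]*)")'   # a wanted attribute; capture its value
    r'|[^\s>]+'                              # any other run of non-space, non-">" characters
    r'|\s+'                                  # whitespace between attributes
    r'))\1'                                  # \1 consumes exactly the chunk the lookahead found
    r')*'
    r'>'
)

match = DIV_ATTRS_PORTABLE.search('<div class="box" data-x="1" id="main">')
print(match.group(2), match.group(3))
```

Once the lookahead has succeeded, the engine will not retry the alternatives inside it, and \1 can only match the text already captured, which is what gives the atomic-group-like behavior. The sketch also relies on the engine keeping a group's last captured value across iterations of the repeated group, which Python's re module does.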
Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.
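To illustrate adding more attributes, the sketch below builds the lookahead-based form of the pattern from a list of attribute names. The helper build_tag_pattern, its parameters, and the sample tag are hypothetical names introduced for this example, and the same simplifications apply (double-quoted values, no whitespace around the equals signs).

```python
import re

def build_tag_pattern(tag, attrs):
    """Compile a pattern for <tag ...> that captures each listed attribute's value.

    Group 1 is the lookahead helper; attrs[i] lands in group i + 2.
    """
    # One capturing alternative per wanted attribute, in the listed order.
    wanted = "|".join(rf'{re.escape(name)}="([^"]*)"' for name in attrs)
    return re.compile(
        rf'<{re.escape(tag)}\b(?:(?=(\s+(?:{wanted})|[^\s>]+|\s+))\1)*>'
    )

names = ["id", "class", "title"]
pattern = build_tag_pattern("div", names)
match = pattern.search('<div title="Hello there" class="box">')

# Skip the helper group, then pair each name with its captured value.
values = dict(zip(names, match.groups()[1:]))
print(values)
```

Each value ends up in the group that corresponds to its position in the list, not its position in the tag, so title is still the third value here even though it appears first.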
I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article. If you have any specific questions about the pattern, feel free to ask.