<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Flagrant Badassery &#187; Cross-Browser Issues</title>
	<atom:link href="http://blog.stevenlevithan.com/category/cross-browser/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.stevenlevithan.com</link>
	<description>A JavaScript and regular expression centric blog</description>
	<pubDate>Sat, 09 Aug 2008 15:09:54 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
	<language>en</language>
			<item>
		<title>An IE lastIndex Bug with Zero-Length Regex Matches</title>
		<link>http://blog.stevenlevithan.com/archives/exec-bugs</link>
		<comments>http://blog.stevenlevithan.com/archives/exec-bugs#comments</comments>
		<pubDate>Mon, 14 Apr 2008 02:24:59 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
		
		<category><![CDATA[Cross-Browser Issues]]></category>

		<category><![CDATA[JavaScript]]></category>

		<category><![CDATA[Regular Expressions]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/exec-bugs</guid>
		<description><![CDATA[The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's lastIndex property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in [...]]]></description>
			<content:encoded><![CDATA[<p>The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's <code>lastIndex</code> property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in a string using the <code>RegExp.prototype.exec</code> method. That's where I've most frequently encountered the bug, and I think it will help explain why the issue exists in the first place.</p>

<p>First of all, if you're not already familiar with how to use <code>exec</code> to iterate over a string, you're missing out on some very powerful functionality. Here's the basic construct:</p>

<pre class="code">var	regex = /.../g,
	subject = "test",
	match = regex.exec(subject);

while (match != null) {
	<span class="comment">// matched text: match[0]
	// match start: match.index
	// match end: regex.lastIndex
	// capturing group n: match[n]</span>

	...

	match = regex.exec(subject);
}
</pre>

<p>When the <code>exec</code> method is called for a regex that uses the <code>/g</code> (global) modifier, it searches from the point in the subject string specified by the regex's <code>lastIndex</code> property (which is initially zero, so it searches from the beginning of the string). If the <code>exec</code> method finds a match, it updates the regex's <code>lastIndex</code> property to the character index at the end of the match, and returns an array containing the matched text and any captured subexpressions. If there is no match from the point in the string where the search started, <code>lastIndex</code> is reset to zero, and <code>null</code> is returned.</p>
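<p>To make the <code>lastIndex</code> bookkeeping concrete, here is a small sketch that records each step of the loop. The regex and subject string are made up for illustration; only the <code>exec</code>, <code>index</code>, and <code>lastIndex</code> behavior comes from the description above.</p>

```javascript
// Illustrative values only: the regex and subject are invented for this sketch
var regex = /\d+/g,
	subject = "a1b22c333",
	steps = [],
	match = regex.exec(subject);

while (match != null) {
	// Record the matched text, where it starts, and where the next search begins
	steps.push(match[0] + " @ " + match.index + "-" + regex.lastIndex);
	match = regex.exec(subject);
}

// The final, failed search has reset lastIndex to zero
var lastIndexAfterLoop = regex.lastIndex;
```

<p>Each entry in <code>steps</code> pairs a match with its start and end positions, and <code>lastIndexAfterLoop</code> shows the reset that happens once no further match is found.</p>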

<p>You can tighten up the above code by moving the <code>exec</code> method call into the <code>while</code> loop's condition, like so:</p>

<pre class="code">var	regex = /.../g,
	subject = "test",
	match;

while (match = regex.exec(subject)) {
	...
}
</pre>

<p>This cleaner version works essentially the same as before. As soon as <code>exec</code> can't find any further matches and therefore returns <code>null</code>, the loop ends. However, there are a couple cross-browser issues to be aware of with either version of this code. One is that if the regex contains capturing groups which do not participate in the match, some values in the returned array could be either <code>undefined</code> or an empty string. I've previously discussed that issue in depth in a post about what I called <a href="http://blog.stevenlevithan.com/archives/npcg-javascript">non-participating capturing groups</a>.</p>

<p>Another issue (the topic of <em>this</em> post) occurs when your regex matches an empty string. There are many reasons why you might allow a regex to do that, but if you can't think of any, consider cases where you're accepting regexes from an outside source. Here's a simple example of such a regex:</p>

<pre class="code">var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	endPositions.push(regex.lastIndex);
}
</pre>

<p>You might expect the <code>endPositions</code> array to be set to <code>[0,2,4]</code>, since those are the character positions for the beginning of the string and just after each newline character. Thanks to the <code>/m</code> modifier, those are the positions where the regex will match; and since the regex matches empty strings, <code>regex.lastIndex</code> should be the same as <code>match.index</code>. However, Internet Explorer (tested with v5.5&ndash;7) sets <code>endPositions</code> to <code>[1,3,5]</code>. Other browsers will go into an infinite loop until you short-circuit the code.</p>

<p>So what's going on here? Remember that every time <code>exec</code> runs, it attempts to match within the subject string starting at the position specified by the <code>lastIndex</code> property of the regex. Since our regex matches a zero-length string, <code>lastIndex</code> remains exactly where we started the search. Therefore, every time through the loop our regex will match at the same position&mdash;the start of the string. Internet Explorer tries to be helpful and avoid this situation by automatically incrementing <code>lastIndex</code> when a zero-length string is matched. That might seem like a good idea (in fact, I've seen people adamantly argue that it is a bug that Firefox does not do the same), but it means that in Internet Explorer the <code>lastIndex</code> property cannot be relied on to accurately determine the ending position of a match.</p>

<p>We can correct this situation cross-browser with the following code:</p>

<pre class="code">var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	var zeroLengthMatch = !match[0].length;
	<span class="comment">// Fix IE's incorrect lastIndex</span>
	if (zeroLengthMatch &amp;&amp; regex.lastIndex > match.index)
		regex.lastIndex--;

	endPositions.push(regex.lastIndex);

	<span class="comment">// Avoid an infinite loop with zero-length matches</span>
	if (zeroLengthMatch)
		regex.lastIndex++;
}
</pre>

<p>You can see an example of the above code in the <a href="http://blog.stevenlevithan.com/archives/cross-browser-split">cross-browser split method</a> I posted a while back. Keep in mind that none of the extra code here is needed if your regex cannot possibly match an empty string.</p>
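<p>If you need this in more than one place, the same two adjustments can be wrapped in a helper. The following is only a sketch: <code>collectMatches</code> and the shape of the objects it returns are hypothetical names chosen for this example, and it assumes the regex was created with the <code>/g</code> modifier.</p>

```javascript
// Hypothetical helper; the name and return shape are assumptions for this sketch
function collectMatches(regex, subject) {
	var results = [], match, zeroLengthMatch;
	if (!regex.global)
		throw new Error("collectMatches requires a regex with the /g modifier");
	regex.lastIndex = 0; // always start from the beginning of the subject
	while ((match = regex.exec(subject)) !== null) {
		zeroLengthMatch = !match[0].length;
		// Undo the extra increment IE applies after a zero-length match
		if (zeroLengthMatch && regex.lastIndex > match.index)
			regex.lastIndex--;
		results.push({text: match[0], start: match.index, end: regex.lastIndex});
		// Step past zero-length matches to avoid an infinite loop
		if (zeroLengthMatch)
			regex.lastIndex++;
	}
	return results;
}

var lineStarts = collectMatches(/^/gm, "A\nB\nC");
```

<p>Here <code>lineStarts</code> holds one entry per line start, each with identical <code>start</code> and <code>end</code> values since the matches are empty.</p>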

<p>Another way to deal with this issue is to use <code>String.prototype.replace</code> to iterate over the subject string. The <code>replace</code> method moves forward automatically after zero-length matches, avoiding this issue altogether. Unfortunately, in the three biggest browsers (IE, Firefox, Safari), <code>replace</code> doesn't seem to deal with the <code>lastIndex</code> property except to reset it to zero. Opera gets it right (according to my reading of the spec) and updates <code>lastIndex</code> along the way. Given the current situation, you can't rely on <code>lastIndex</code> in your code when iterating over a string using <code>replace</code>, but you can still easily derive the value for the end of each match. Here's an example:</p>

<pre class="code">var	regex = /^/gm,
	subject = "A\nB\nC",
	endPositions = [];

subject.replace(regex, function (match) {
	<span class="comment">// Not using a named argument for the index since capturing
	// groups can change its position in the list of arguments</span>
	var	index = arguments[arguments.length - 2],
		lastIndex = index + match.length;

	endPositions.push(lastIndex);
});
</pre>
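<p>The comment about argument position is easier to see with a regex that has capturing groups. In this sketch (the regex and subject are invented for illustration), the callback receives the match, then one argument per group, then the index, then the subject string. Note that engines supporting named capturing groups append one more argument for such regexes, so counting back from the end as done here assumes the regex has no named groups.</p>

```javascript
var regex = /([a-z])(\d)?/g, // two capturing groups, the second optional
	subject = "a1 b",
	spans = [];

subject.replace(regex, function (match) {
	// arguments here: match, group 1, group 2, index, subject
	var index = arguments[arguments.length - 2];
	spans.push([match, index, index + match.length]);
	return match; // leave the text unchanged; only the positions matter
});
```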

<p>That's perhaps less lucid than before (since we're not actually replacing anything), but there you have it&hellip; two cross-browser ways to get around a little-known issue that could otherwise cause tricky, latent bugs in your code.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/exec-bugs/feed</wfw:commentRss>
		</item>
		<item>
		<title>A JScript/VBScript Regex Lookahead Bug</title>
		<link>http://blog.stevenlevithan.com/archives/regex-lookahead-bug</link>
		<comments>http://blog.stevenlevithan.com/archives/regex-lookahead-bug#comments</comments>
		<pubDate>Mon, 24 Mar 2008 05:50:58 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
		
		<category><![CDATA[Cross-Browser Issues]]></category>

		<category><![CDATA[JavaScript]]></category>

		<category><![CDATA[Regular Expressions]]></category>

		<category><![CDATA[VBScript]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-lookahead-bug</guid>
		<description><![CDATA[Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via ?, *, {0,n}, or (.&#124;); but not +, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript:

/(?=a?b)ab/.test("ab");
// Should return true, but IE 5.5 [...]]]></description>
			<content:encoded><![CDATA[<p>Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via <code>?</code>, <code>*</code>, <code>{0,<em>n</em>}</code>, or <code>(.|)</code>; but not <code>+</code>, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript:</p>

<pre class="code"><span class="regex">/(?=a?b)ab/</span>.test("ab");
<span class="comment">// Should return true, but IE 5.5 &ndash; 8b1 return false</span>

<span class="regex">/(?=a?b)ab/</span>.test("abc");
<span class="comment">// Correctly returns true (even in IE), although the
// added "c" does not take part in the match</span>
</pre>

<p>I've been aware of this bug for a couple years, thanks to a <a href="http://regexadvice.com/blogs/mash/archive/2004/10/05/320.aspx">blog post by Michael Ash</a> that describes the bug with a password-complexity regex. However, the bug description there is incomplete and subtly incorrect, as shown by the above, reduced test case. To be honest, although the errant behavior is predictable, it's a bit tricky to describe because I haven't yet figured out exactly what's happening internally. I'd recommend playing with variations of the above code to get a better understanding of the problem.</p>
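<p>As a starting point for that kind of experimentation, here is a small sketch that runs several variations and collects the results. The list of variations is my own selection based on the constructs named above; in an engine that follows the spec, every entry should come out <code>true</code>, so any <code>false</code> points at the bug.</p>

```javascript
// Variations on the reduced test case; the selection is illustrative, not exhaustive
var lookaheadTests = [
	{regex: /(?=a?b)ab/, subject: "ab"},
	{regex: /(?=a*b)ab/, subject: "ab"},
	{regex: /(?=a{0,2}b)ab/, subject: "ab"},
	{regex: /(?=(.|)b)ab/, subject: "ab"},
	{regex: /(?=a+b)ab/, subject: "ab"},  // no optional elision
	{regex: /(?=a?b)ab/, subject: "abc"}  // extra trailing character
];

var lookaheadResults = [], i, t;
for (i = 0; i < lookaheadTests.length; i++) {
	t = lookaheadTests[i];
	lookaheadResults.push({
		pattern: t.regex.source,
		subject: t.subject,
		matched: t.regex.test(t.subject)
	});
}
```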

<p>Fortunately, since the bug is predictable, it's usually possible to work around it. For example, you can avoid the bug with the password regex in Michael's post (<code>/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/</code>) by writing it as <code>/^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/</code> (the <code>.{8,15}$</code> lookahead must come first here). The important thing is to be aware of the issue, because it can easily introduce latent, difficult-to-diagnose bugs into your code. Just remember that it shows up with variable-length lookahead. If you're using such patterns, test the hell out of them in IE.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/regex-lookahead-bug/feed</wfw:commentRss>
		</item>
		<item>
		<title>JavaScript, Regex, and Unicode</title>
		<link>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode</link>
		<comments>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode#comments</comments>
		<pubDate>Wed, 02 Jan 2008 06:11:24 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
		
		<category><![CDATA[Cross-Browser Issues]]></category>

		<category><![CDATA[JavaScript]]></category>

		<category><![CDATA[Regular Expressions]]></category>

		<category><![CDATA[unicode]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode</guid>
		<description><![CDATA[Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, [...]]]></description>
			<content:encoded><![CDATA[<p>Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.</p>

<p>According to ECMA-262 3rd Edition, <code>\s</code>, <code>\S</code>, <code>.</code>, <code>^</code>, and <code>$</code> use Unicode-based interpretations of <em>whitespace</em> and <em>newline</em>, while <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\b</code>, and <code>\B</code> use ASCII-only interpretations of <em>digit</em>, <em>word character</em>, and <em>word boundary</em> (e.g. <code>/a\b/.<wbr/>test("na&iuml;ve")</code> returns <code>true</code>). Actual browser implementations often differ on these points. For example, Firefox 2 considers <code>\d</code> and <code>\D</code> to be Unicode-aware, while Firefox 3 fixes this bug &mdash; making <code>\d</code> equivalent to <code>[0-9]</code> as with most other browsers.</p>
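<p>A few quick checks make the ASCII-only versus Unicode-aware split visible. This is a sketch with sample characters of my own choosing; the comments state what the spec description above implies, and as noted, individual browsers may disagree.</p>

```javascript
var unicodeChecks = {
	// "\u00ef" (i with diaeresis) is not an ASCII word character,
	// so a word boundary follows the "a": expected true
	boundaryInsideNaive: /a\b/.test("na\u00efve"),
	// U+0663 is a Unicode digit but is not in [0-9]: expected false
	arabicDigitIsDigit: /\d/.test("\u0663"),
	// U+00E9 (e with acute) is a letter but not in [A-Za-z0-9_]: expected false
	accentedLetterIsWord: /\w/.test("\u00e9"),
	// U+00A0 (no-break space) counts as whitespace: expected true
	nbspIsWhitespace: /\s/.test("\u00a0")
};
```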

<p>Here again are the affected tokens, along with their definitions:</p>

<ul>
	<li><code>\d</code> &mdash; Digits.</li>
	<li><code>\s</code> &mdash; Whitespace.</li>
	<li><code>\w</code> &mdash; Word characters.</li>
	<li><code>\D</code> &mdash; All except digits.</li>
	<li><code>\S</code> &mdash; All except whitespace.</li>
	<li><code>\W</code> &mdash; All except word characters.</li>
	<li><code>.</code> &mdash; All except newlines.</li>
	<li><code>^</code> (with <code>/m</code>) &mdash; The positions at the beginning of the string and just after newlines.</li>
	<li><code>$</code> (with <code>/m</code>) &mdash; The positions at the end of the string and just before newlines.</li>
	<li><code>\b</code> &mdash; Word boundary positions.</li>
	<li><code>\B</code> &mdash; Not word boundary positions.</li>
</ul>

<p>All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms <em>digit</em>, <em>whitespace</em>, <em>word character</em>, <em>word boundary</em>, and <em>newline</em> depends on the regex flavor, character encoding, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:</p>

<ul>
	<li><em>Digit</em> &mdash; The characters 0-9 only.</li>
	<li><em>Whitespace</em> &mdash; Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".</li>
	<li><em>Word character</em> &mdash; The characters A-Z, a-z, 0-9, and _ only.</li>
	<li><em>Word boundary</em> &mdash; The position between a <em>word character</em> and a non-<em>word character</em>, or between a <em>word character</em> and the start or end of the string.</li>
	<li><em>Newline</em> &mdash; The line feed, carriage return, line separator, and paragraph separator characters.</li>
</ul>

<p>Here again are the newline characters, with their character codes:</p>

<ul>
	<li><code>\u000a</code> &mdash; Line feed &mdash; <code>\n</code></li>
	<li><code>\u000d</code> &mdash; Carriage return &mdash; <code>\r</code></li>
	<li><code>\u2028</code> &mdash; Line separator</li>
	<li><code>\u2029</code> &mdash; Paragraph separator</li>
</ul>

<p>Note that ECMAScript 4 proposals indicate that the <a href="http://en.wikipedia.org/wiki/C0_and_C1_control_codes">C1</a>/Unicode NEL "next line" control character (<code>\u0085</code>) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, <code>/\r^$\n/m.test("\r\n")</code> returns <code>true</code>.</p>
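<p>The four newline characters and the CRLF quirk can be checked with a short loop. This sketch uses made-up subject strings (<code>"a"</code>, the newline character, then <code>"b"</code>); per the definitions above, the dot should match none of the four characters, while <code>^</code> and <code>$</code> with <code>/m</code> should match around each of them.</p>

```javascript
var newlines = ["\n", "\r", "\u2028", "\u2029"],
	newlineReport = [], i, ch;

for (i = 0; i < newlines.length; i++) {
	ch = newlines[i];
	newlineReport.push({
		code: ch.charCodeAt(0).toString(16),
		dotMatches: /./.test(ch),                    // should be false
		startsLineAfter: /^b/m.test("a" + ch + "b"), // should be true
		endsLineBefore: /a$/m.test("a" + ch + "b")   // should be true
	});
}

// A line start and a line end both sit between the \r and the \n
var crlfSplit = /\r^$\n/m.test("\r\n");
```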

<p>As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's <a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes">Basic Multilingual Plane</a>, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by <code>\s</code> according to ECMA-262 3rd Edition and Unicode 5.1:</p>

<ul>
	<li><code>\u0009</code> &mdash; Tab &mdash; <code>\t</code></li>
	<li><code>\u000a</code> &mdash; Line feed &mdash; <code>\n</code> &mdash; (newline character)</li>
	<li><code>\u000b</code> &mdash; Vertical tab &mdash; <code>\v</code></li>
	<li><code>\u000c</code> &mdash; Form feed &mdash; <code>\f</code></li>
	<li><code>\u000d</code> &mdash; Carriage return &mdash; <code>\r</code> &mdash; (newline character)</li>
	<li><code>\u0020</code> &mdash; Space</li>
	<li><code>\u00a0</code> &mdash; No-break space</li>
	<li><code>\u1680</code> &mdash; Ogham space mark</li>
	<li><code>\u180e</code> &mdash; Mongolian vowel separator</li>
	<li><code>\u2000</code> &mdash; En quad</li>
	<li><code>\u2001</code> &mdash; Em quad</li>
	<li><code>\u2002</code> &mdash; En space</li>
	<li><code>\u2003</code> &mdash; Em space</li>
	<li><code>\u2004</code> &mdash; Three-per-em space</li>
	<li><code>\u2005</code> &mdash; Four-per-em space</li>
	<li><code>\u2006</code> &mdash; Six-per-em space</li>
	<li><code>\u2007</code> &mdash; Figure space</li>
	<li><code>\u2008</code> &mdash; Punctuation space</li>
	<li><code>\u2009</code> &mdash; Thin space</li>
	<li><code>\u200a</code> &mdash; Hair space</li>
	<li><code>\u2028</code> &mdash; Line separator &mdash; (newline character)</li>
	<li><code>\u2029</code> &mdash; Paragraph separator &mdash; (newline character)</li>
	<li><code>\u202f</code> &mdash; Narrow no-break space</li>
	<li><code>\u205f</code> &mdash; Medium mathematical space</li>
	<li><code>\u3000</code> &mdash; Ideographic space</li>
</ul>
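<p>To see how a particular engine handles this list, you can loop over the code points and collect any that <code>\s</code> fails to match. This is a sketch rather than a conformance test: the list is copied from above, and the result depends on the engine and on the Unicode version it follows (for instance, later Unicode versions reclassified U+180E, so newer engines may report it).</p>

```javascript
// Code points copied from the list above
var whitespaceCodes = [
	0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x00a0, 0x1680, 0x180e,
	0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
	0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000
];

var unmatchedWhitespace = [], i, code;
for (i = 0; i < whitespaceCodes.length; i++) {
	code = whitespaceCodes[i];
	if (!/\s/.test(String.fromCharCode(code))) {
		// Format as a four-digit \uXXXX escape for readability
		unmatchedWhitespace.push("\\u" + ("0000" + code.toString(16)).slice(-4));
	}
}
```

<p>An empty <code>unmatchedWhitespace</code> array means the engine agrees with the list; anything else names the characters it treats differently.</p>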

<p>To test which characters or positions are matched by all of the tokens mentioned here in your browser, see <a href="http://stevenlevithan.com/regex/xregexp/tests/unicode.html"><strong>JavaScript Regex and Unicode Tests</strong></a>. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode/feed</wfw:commentRss>
		</item>
		<item>
		<title>When innerHTML isn&#8217;t Fast Enough</title>
		<link>http://blog.stevenlevithan.com/archives/faster-than-innerhtml</link>
		<comments>http://blog.stevenlevithan.com/archives/faster-than-innerhtml#comments</comments>
		<pubDate>Wed, 12 Sep 2007 04:39:32 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
		
		<category><![CDATA[Cross-Browser Issues]]></category>

		<category><![CDATA[JavaScript]]></category>

		<category><![CDATA[Performance]]></category>

		<category><![CDATA[innerhtml]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/faster-than-innerhtml</guid>
		<description><![CDATA[This post isn't about the pros and cons of innerHTML vs. W3C DOM methods. That has been hashed and rehashed elsewhere. Instead, I'll show how you can combine the use of innerHTML and DOM methods to make your code potentially hundreds of times faster than innerHTML on its own, when working with large numbers of [...]]]></description>
			<content:encoded><![CDATA[<p>This post isn't about the pros and cons of <code>innerHTML</code> vs. W3C DOM methods. That has been hashed and rehashed <a href="http://www.dustindiaz.com/innerhtml-vs-dom-methods/">elsewhere</a>. Instead, I'll show how you can combine the use of <code>innerHTML</code> and DOM methods to make your code potentially hundreds of times faster than <code>innerHTML</code> on its own, when working with large numbers of elements.</p>

<p>In some browsers (most notably, Firefox), although <code>innerHTML</code> is generally much faster than DOM methods, it spends a disproportionate amount of time clearing out existing elements vs. creating new ones. Knowing this, we can destroy the existing elements quickly by replacing their parent using standard DOM methods, and then create the new elements using <code>innerHTML</code>. (This technique is something I discovered during the development of <a href="http://regexpal.com" title="JavaScript regex tester">RegexPal</a>, and is one of its two main performance optimizations. The other is one-shot markup generation for match highlighting, which avoids needing to loop over matches or reference them individually.)</p>

<h3 class="sub">The code:</h3>

<pre class="code">
function replaceHtml(el, html) {
	var oldEl = typeof el === "string" ? document.getElementById(el) : el;
	/*@cc_on <span class="comment">// Pure innerHTML is slightly faster in IE</span>
		oldEl.innerHTML = html;
		return oldEl;
	@*/
	var newEl = oldEl.cloneNode(false);
	newEl.innerHTML = html;
	oldEl.parentNode.replaceChild(newEl, oldEl);
	<span class="comment">/* Since we just removed the old element from the DOM, return a reference
	to the new element, which can be used to restore variable references. */</span>
	return newEl;
}
</pre>

<p>You can use the above as <code>el = replaceHtml(el, newHtml)</code> instead of <code>el.innerHTML = newHtml</code>.</p>
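<p>Here is a sketch of how that might look in practice. <code>buildListHtml</code> and the <code>"results"</code> element id are hypothetical, and the helper assumes the items contain no markup characters. The DOM part only runs when a document and the <code>replaceHtml</code> function above are available.</p>

```javascript
// Hypothetical helper: builds the markup in one pass, then joins it once.
// Assumes the items contain no characters that need HTML escaping.
function buildListHtml(items) {
	var parts = [], i;
	for (i = 0; i < items.length; i++)
		parts.push("<li>" + items[i] + "</li>");
	return parts.join("");
}

// Only touch the DOM when one exists and replaceHtml (defined above) is loaded
if (typeof document !== "undefined" && typeof replaceHtml === "function") {
	var list = document.getElementById("results"); // "results" is an assumed id
	if (list) {
		// Reassign the variable: the old node is no longer in the document
		list = replaceHtml(list, buildListHtml(["one", "two", "three"]));
	}
}
```

<p>One thing to keep in mind with this approach: since the element is swapped for a shallow clone, any other variables still pointing at the old node, and any listeners attached to it with <code>addEventListener</code>, do not carry over to the new one.</p>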

<h3 class="sub">innerHTML is already pretty fast...is this really warranted?</h3>

<p>That depends on how many elements you're overwriting. In RegexPal, every keydown event potentially triggers the destruction and creation of thousands of elements (in order to make the syntax and match highlighting work). In such cases, the above approach has an enormous positive impact. Even something as simple as <code>el.innerHTML += str</code> or <code>el.innerHTML = ""</code> could be a performance disaster if the element you're updating happens to have a few thousand children.</p>

<p>I've created a page which allows you to easily <a href="http://stevenlevithan.com/demo/replaceHtml.html" class="bold">test the performance difference</a> of <code>innerHTML</code> and my <code>replaceHtml</code> function with various numbers of elements. Make sure to try it out in a few browsers for comparison. Following are a couple examples of typical results from Firefox 2.0.0.6 on my system:</p>

<pre class="code"><span class="comment"><strong>1000 elements...</strong></span>
innerHTML (destroy only): <strong>156ms</strong>
innerHTML (create only): <strong>15ms</strong>
innerHTML (destroy &amp; create): <strong>172ms</strong>
replaceHtml (destroy only): <strong>0ms</strong> (<span class="comment"><strong>faster</strong></span>)
replaceHtml (create only): <strong>15ms</strong> (~ same speed)
replaceHtml (destroy &amp; create): <strong>15ms</strong> (<span class="comment"><strong>11.5x faster</strong></span>)

<span class="comment"><strong>15000 elements...</strong></span>
innerHTML (destroy only): <strong>14703ms</strong>
innerHTML (create only): <strong>250ms</strong>
innerHTML (destroy &amp; create): <strong>14922ms</strong>
replaceHtml (destroy only): <strong>31ms</strong> (<span class="comment"><strong>474.3x faster</strong></span>)
replaceHtml (create only): <strong>250ms</strong> (~ same speed)
replaceHtml (destroy &amp; create): <strong>297ms</strong> (<span class="comment"><strong>50.2x faster</strong></span>)
</pre>
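<p>The "x faster" figures above are simply the ratio of the two timings, rounded to one decimal place. A small sketch of that arithmetic, using a hypothetical <code>speedup</code> helper and the timings listed above:</p>

```javascript
// Hypothetical helper: ratio of two timings, rounded to one decimal place.
// A candidate time of 0ms gives Infinity, which is why that row just says "faster".
function speedup(baselineMs, candidateMs) {
	return Math.round((baselineMs / candidateMs) * 10) / 10;
}

var smallDestroyAndCreate = speedup(172, 15);    // 1000 elements
var largeDestroyOnly = speedup(14703, 31);       // 15000 elements
var largeDestroyAndCreate = speedup(14922, 297); // 15000 elements
```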

<p>I think the numbers speak for themselves. Comparable performance improvements can also be seen in Safari. In Opera, <code>replaceHtml</code> is still typically faster than <code>innerHTML</code>, but by a narrower margin. In IE, simple use of <code>innerHTML</code> is typically faster than mixing it with DOM methods, but not by nearly the same kinds of margins as you can see above. Nevertheless, IE's conditional compilation feature is used to avoid the relatively minor performance penalty, by just using <code>innerHTML</code> with that browser.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/faster-than-innerhtml/feed</wfw:commentRss>
		</item>
		<item>
		<title>Capturing Multiple, Optional HTML Attribute Values</title>
		<link>http://blog.stevenlevithan.com/archives/multi-attr-capture</link>
		<comments>http://blog.stevenlevithan.com/archives/multi-attr-capture#comments</comments>
		<pubDate>Wed, 15 Aug 2007 05:49:45 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
		
		<category><![CDATA[Cross-Browser Issues]]></category>

		<category><![CDATA[Regular Expressions]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/multi-attr-capture</guid>
		<description><![CDATA[Let's say you wanted to find all &#60;div&#62; tags, and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for [...]]]></description>
			<content:encoded><![CDATA[<p>Let's say you wanted to find all <code>&lt;div&gt;</code> tags, and capture their <code>id</code> and <code>class</code> attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for any other attributes which might exist, and allow attributes to appear in any order.</p>

<p>I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):</p>

<p><code class="regex">&lt;div<b>\b</b><b class="g1">(?&gt;</b><b>\s+</b><b class="g2">(?:</b>id="<b class="g3">(</b><i>[^"]</i><b>*</b><b class="g3">)</b>"<b class="g2">|</b>class="<b class="g3">(</b><i>[^"]</i><b>*</b><b class="g3">)</b>"<b class="g2">)</b><b class="g1">|</b><i>[^<b>\s</b>&gt;]</i><b>+</b><b class="g1">|</b><b>\s+</b><wbr/><b class="g1">)*</b>&gt;</code></p>

<p>The finer details of the pattern are designed for efficiency (even with bad data such as unclosed <code>&lt;div&gt;</code> tags) over simplicity. Note that it will capture the <code>id</code> to backreference one and the <code>class</code> to backreference two regardless of the order the attributes appear in (i.e., <code>class</code> remains constant as backreference two even if it comes before <code>id</code>, or if <code>id</code> doesn't exist).</p>

<p>The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:</p>

<p><code class="regex">&lt;div<b>\b</b><b class="g1">(?:</b><b class="g2">(?=</b><b class="g3">(</b><b>\s+</b><b class="g4">(?:</b>id="<b class="g5">(</b><i>[^"]</i><b>*</b><b class="g5">)</b>"<b class="g4">|</b>class="<b class="g5">(</b><i>[^"]</i><b>*</b><b class="g5">)</b>"<b class="g4">)</b><b class="g3">|</b><i>[^<b>\s</b>&gt;]</i><b>+</b><wbr/><b class="g3">|</b><b>\s+</b><b class="g3">)</b><b class="g2">)</b><b>\1</b><b class="g1">)*</b>&gt;</code></p>

<p>In the above, a backreference to a capturing group within a lookahead is used to <a href="http://blog.stevenlevithan.com/archives/mimic-atomic-groups">mimic an atomic group</a>, so the backreference numbers for the <code>id</code> and <code>class</code> values are shifted to two and three, respectively.</p>
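<p>The lookahead-plus-backreference trick is easier to follow on a tiny pattern. In this sketch (patterns and subject are made up for illustration), a plain <code>a+</code> can give back a character so the rest of the pattern matches, while the mimicked atomic version cannot, because the regex engine does not backtrack into a lookahead once it has matched.</p>

```javascript
// Plain greedy quantifier: a+ backtracks from "aaa" to "aa" so "ab" can match
var backtracking = /a+ab/.test("aaab");

// Mimicked atomic group: the lookahead captures "aaa", \1 consumes it,
// and nothing can be given back, so "ab" never matches
var atomicMimic = /(?=(a+))\1ab/.test("aaab");

// The atomic version still matches when no backtracking is needed
var atomicNoBacktrackNeeded = /(?=(a+))\1b/.test("aaab");
```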

<p>Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.</p>

<p>I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article. <img src="/wp-includes/images/smilies/icon_wink.gif" alt="wink" /> If you have any specific questions about the pattern, feel free to ask.</p>
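<p>If a single regex proves unreliable in a given engine, the same values can be collected with one regex to find each tag and one more per attribute. This is a simplified sketch, not the pattern above: <code>getDivAttrs</code> is a hypothetical name, it handles double-quoted values only, and it assumes attribute values never contain a <code>&gt;</code> character.</p>

```javascript
// Hypothetical helper: one regex finds each opening div tag,
// then separate regexes pull out id and class in any order
function getDivAttrs(html) {
	var tagRegex = /<div\b[^>]*>/g,
		results = [], tag, id, cls;
	while ((tag = tagRegex.exec(html)) !== null) {
		// The leading \s keeps attributes such as data-id from matching
		id = /\sid="([^"]*)"/.exec(tag[0]);
		cls = /\sclass="([^"]*)"/.exec(tag[0]);
		results.push({
			id: id ? id[1] : null,
			className: cls ? cls[1] : null
		});
	}
	return results;
}

var divAttrs = getDivAttrs('<div class="b" id="a"><div title="x">');
```

<p>Each entry in <code>divAttrs</code> has the <code>id</code> and <code>class</code> values for one tag, with <code>null</code> standing in for a missing attribute.</p>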

<p>Unfortunately for JavaScripters including myself, neither of the above regexes works as described in Firefox 2.0.0.6 or Opera 9.23, although the latter regex works fine in IE, and either will work in Safari 3 beta since that browser supports atomic groups (unlike all other major browsers). The latter regex fails in Firefox and Opera because those two browsers&mdash;unlike most other regex engines&mdash;reset backreference values when an alternation option fails before the engine reaches a capturing group within it. Of course, you could achieve the same end result using more verbose code paired with multiple regexes, but that just wouldn't be as cool. Or you could just use the DOM, which would usually be more appropriate for something like this in JavaScript anyway.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/multi-attr-capture/feed</wfw:commentRss>
		</item>
	</channel>
</rss>
