<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Flagrant Badassery &#187; Cross-Browser Issues</title>
	<atom:link href="http://blog.stevenlevithan.com/category/cross-browser/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.stevenlevithan.com</link>
	<description>A JavaScript and regular expression centric blog</description>
	<lastBuildDate>Thu, 25 Oct 2012 16:59:05 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>What the JavaScript RegExp API Got Wrong, &amp; How to Fix It</title>
		<link>http://blog.stevenlevithan.com/archives/fixing-javascript-regexp</link>
		<comments>http://blog.stevenlevithan.com/archives/fixing-javascript-regexp#comments</comments>
		<pubDate>Mon, 01 Mar 2010 08:43:04 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=327</guid>
		<description><![CDATA[Over the last few years, I've occasionally commented on JavaScript's RegExp API, syntax, and behavior on the ES-Discuss mailing list. Recently, JavaScript inventor Brendan Eich suggested that, in order to get more discussion going, I write up a list of regex changes to consider for future ECMAScript standards (or as he humorously put it, have [...]]]></description>
				<content:encoded><![CDATA[<p>Over the last few years, I've occasionally commented on JavaScript's RegExp API, syntax, and behavior on the <a href="https://mail.mozilla.org/listinfo/es-discuss">ES-Discuss mailing list</a>. Recently, JavaScript inventor <a href="http://weblogs.mozillazine.org/roadmap/">Brendan Eich</a> suggested that, in order to get more discussion going, I write up a list of regex changes to consider for future <abbr title="ECMAScript is the official name of the JavaScript language standard.">ECMAScript</abbr> standards (or as he humorously put it, have my "95 [regex] theses nailed to the <abbr title="ECMAScript 3rd Edition">ES3</abbr> cathedral door"). I figured I'd give it a shot, but I'm going to split my response into a few parts. In this post, I'll be discussing issues with the current RegExp API and behavior. I'll be leaving aside new features that I'd like to see added, and merely suggesting ways to make existing capabilities better. I'll discuss possible new features in a follow-up post.</p>

<p>For a language as widely used as JavaScript, any realistic change proposal must strongly consider backward compatibility. For this reason, some of the following proposals might <em>not</em> be particularly realistic, but nevertheless I think that <em>a</em>) it's worthwhile to consider what might change if backward compatibility wasn't a concern, and <em>b</em>) in the long run, all of these changes would improve the ease of use and predictability of how regular expressions work in JavaScript.</p>


<h3 style="margin-bottom:0">Remove RegExp.prototype.lastIndex and replace it with an argument for start position</h3>
<p style="margin-top:0"><em>Actual proposal: Deprecate RegExp.prototype.lastIndex and add a "pos" argument to the RegExp.prototype.exec/test methods</em></p>

<p>JavaScript's <code>lastIndex</code> property serves too many purposes at once:</p>

<dl>
	<dt>It lets users manually specify where to start a regex search</dt>
	<dd>You could claim this is not <code>lastIndex</code>'s intended purpose, but it's nevertheless an important use since there's no alternative feature that allows this. <code>lastIndex</code> is not very good at this task, though. You need to compile your regex with the <code>/g</code> flag to let <code>lastIndex</code> be used this way; and even then, it only specifies the starting position for the <code>regexp.exec</code>/<code>test</code> methods. It cannot be used to set the start position for the <code>string.match</code>/<code>replace</code>/<code>search</code>/<code>split</code> methods.</dd>

	<dt>It indicates the position where the last match ended</dt>
	<dd>Even though you could derive the match end position by adding the match index and length, this use of <code>lastIndex</code> serves as a convenient and commonly used complement to the <code>index</code> property on match arrays returned by <code>exec</code>. As always, using <code>lastIndex</code> like this works only for regexes compiled with <code>/g</code>.</dd>

	<dt>It's used to track the position where the next search should start</dt>
	<dd>This comes into play, e.g., when using a regex to iterate over all matches in a string. However, the fact that <code>lastIndex</code> is actually set to the end position of the last match rather than the position where the next search should start (unlike equivalents in other programming languages) causes a problem after zero-length matches, which are easily possible with regexes like <code>/\w*/g</code> or <code>/^/mg</code>. Hence, you're forced to manually increment <code>lastIndex</code> in such cases. I've posted about this issue in more detail before (see: <em><a href="http://blog.stevenlevithan.com/archives/exec-bugs">An IE lastIndex Bug with Zero-Length Regex Matches</a></em>), as has Jan Goyvaerts (<em><a href="http://www.regexguru.com/2008/04/watch-out-for-zero-length-matches/">Watch Out for Zero-Length Matches</a></em>).</dd>
</dl>
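
<p>A rough sketch of the three roles described above. The sample string and variable names are illustrative, and the noted results assume an engine that follows the spec:</p>

<pre class="code">var str = "ab cd";

// 1. Manual start position: ignored without /g
var nonglobal = /\w+/;
nonglobal.lastIndex = 3;
var first = nonglobal.exec(str); // matches "ab" at index 0

// 2. With /g, the search starts at lastIndex, which then holds the match end
var global = /\w+/g;
global.lastIndex = 3;
var second = global.exec(str); // matches "cd" at index 3
var matchEnd = global.lastIndex; // 5, i.e. second.index + second[0].length

// 3. After a zero-length match, lastIndex does not advance
var emptyable = /\w*/g;
emptyable.lastIndex = 2; // position of the space character
var third = emptyable.exec(str); // matches "" at index 2
var stuckAt = emptyable.lastIndex; // still 2, so the next exec repeats this match
</pre>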

<p>Unfortunately, <code>lastIndex</code>'s versatility results in it not working ideally for any specific use. I think <code>lastIndex</code> is misplaced anyway; if you need to store a search's ending (or next-start) position, it should be a property of the target string and not the regular expression. Here are three reasons this would work better:</p>

<ul>
	<li>It would let you use the same regex with multiple strings, without losing track of the next search position within each one.</li>
	<li>It would allow using multiple regexes with the same string and having each one pick up from where the last one left off.</li>
	<li>If you search two strings with the same regex, you're probably not expecting the search within the second string to start from an arbitrary position just because a match was found in the first string.</li>
</ul>
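
<p>The third point is easy to reproduce. A minimal sketch, using arbitrary example strings:</p>

<pre class="code">var regex = /a/g;

var inFirst = regex.test("ba");     // true; lastIndex is now 2
var inSecond = regex.test("ab");    // false: the search started at index 2, past the "a"
var afterFailure = regex.lastIndex; // 0, reset by the failed search
var retry = regex.test("ab");       // true now that lastIndex is back at 0
</pre>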

<p>In fact, Perl uses this approach of storing next-search positions with strings to great effect, and adds various features around it.</p>

<p>So that's my case for <code>lastIndex</code> being misplaced, but I go one further in that I don't think <code>lastIndex</code> should be included in JavaScript at all. Perl's tactic works well for Perl (especially when considered as a complete package), but some other languages (including Python) let you provide a search-start position as an argument when calling regex methods, which I think is an approach that is more natural and easier for developers to understand and use. I'd therefore fix <code>lastIndex</code> by getting rid of it completely. Regex methods and regex-using string methods would use internal search position trackers that are not observable by the user, and the <code>exec</code> and <code>test</code> methods would get a second argument (called <code>pos</code>, for position) that specifies where to start their search. It might be convenient to also give the <code>String</code> methods <code>search</code>, <code>match</code>, <code>replace</code>, and <code>split</code> their own <code>pos</code> arguments, but that is not as important and the functionality it would provide is not currently possible via <code>lastIndex</code> anyway.</p>
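
<p>The proposed <code>pos</code> argument can be approximated in current JavaScript with a helper. The following is only a sketch: <code>execAt</code> is a hypothetical name, not an existing or proposed API, and it works by copying the regex with <code>/g</code> so that <code>lastIndex</code> is honored while the caller's regex is left untouched.</p>

<pre class="code">// Hypothetical helper: search str with regex, starting at position pos
function execAt(regex, str, pos) {
    // copy the regex with /g forced on, keeping the other standard flags
    var flags = "g" +
        (regex.ignoreCase ? "i" : "") +
        (regex.multiline ? "m" : "");
    var copy = new RegExp(regex.source, flags);
    copy.lastIndex = pos || 0;
    return copy.exec(str);
}

var original = /\w+/;
var found = execAt(original, "ab cd ef", 3); // matches "cd" at index 3
var missing = execAt(original, "ab", 2);     // null: nothing left to match
</pre>

<p>Recompiling the regex on every call is wasteful, which is part of why native support would be preferable.</p>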

<p>Following are examples of how some common uses of <code>lastIndex</code> could be rewritten if these changes were made:</p>

<p>Start search from position 5, using <code>lastIndex</code> (the status quo):</p>

<pre class="code">var regexGlobal = /\w+/g,
    result;

regexGlobal.lastIndex = 5;
result = regexGlobal.test(str);
<span class="comment">// must reset lastIndex or future tests will continue from the
// match-end position (defensive coding)</span>
regexGlobal.lastIndex = 0;

var regexNonglobal = /\w+/;

regexNonglobal.lastIndex = 5;
<span class="comment">// no go - lastIndex will be ignored. instead, you have to do this</span>
result = regexNonglobal.test(str.slice(5));
</pre>

<p>Start search from position 5, using <code>pos</code>:</p>

<pre class="code">var regex = /\w+/, <span class="comment">// flag /g doesn't matter</span>
    result = regex.test(str, 5);
</pre>

<p>Match iteration, using <code>lastIndex</code>:</p>

<pre class="code">var regex = /\w*/g,
    matches = [],
    match;

<span class="comment">// the /g flag is required for this regex. if your code was provided a non-
// global regex, you'd need to recompile it with /g, and if it already had /g,
// you'd need to reset its lastIndex to 0 before entering the loop</span>

while (match = regex.exec(str)) {
    matches.push(match);
    <span class="comment">// avoid an infinite loop on zero-length matches</span>
    if (regex.lastIndex == match.index) {
        regex.lastIndex++;
    }
}
</pre>

<p>Match iteration, using <code>pos</code>:</p>

<pre class="code">var regex = /\w*/, <span class="comment">// flag /g doesn't matter</span>
    pos = 0,
    matches = [],
    match;

while (match = regex.exec(str, pos)) {
    matches.push(match);
    pos = match.index + (match[0].length || 1);
}
</pre>

<p>Of course, you could easily add your own sugar to further simplify match iteration, or JavaScript could add a method dedicated to this purpose similar to Ruby's <code>scan</code> (although JavaScript already sort of has this via the use of replacement functions with <code>string.replace</code>).</p>
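
<p>As one possible form of such sugar in today's JavaScript, here is a sketch of a helper that hides the <code>/g</code> and zero-length-match bookkeeping. The name <code>collectMatches</code> is hypothetical.</p>

<pre class="code">// Hypothetical helper: return an array of exec-style match arrays
function collectMatches(regex, str) {
    // work on a /g copy so the caller's regex and its lastIndex are untouched
    var flags = "g" +
        (regex.ignoreCase ? "i" : "") +
        (regex.multiline ? "m" : "");
    var copy = new RegExp(regex.source, flags),
        matches = [],
        match;

    while (match = copy.exec(str)) {
        matches.push(match);
        // avoid an infinite loop on zero-length matches
        if (copy.lastIndex === match.index) {
            copy.lastIndex++;
        }
    }
    return matches;
}

var found = collectMatches(/\w*/, "ab cd");
// four matches: "ab" (index 0), "" (index 2), "cd" (index 3), "" (index 5)
</pre>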

<p>To reiterate, I'm describing what I would do if backward compatibility was irrelevant. I don't think it would be a good idea to add a <code>pos</code> argument to the <code>exec</code> and <code>test</code> methods unless the <code>lastIndex</code> property was deprecated or removed, due to the functionality overlap. If a <code>pos</code> argument existed, people would expect <code>pos</code> to be <code>0</code> when it's not specified. Having <code>lastIndex</code> around to sometimes screw up this expectation would be confusing and probably lead to latent bugs. Hence, if <code>lastIndex</code> was deprecated in favor of <code>pos</code>, it should be a means toward the end of removing <code>lastIndex</code> altogether.</p>


<h3 style="margin-bottom:0">Remove String.prototype.match's nonglobal operating mode</h3>
<p style="margin-top:0"><em>Actual proposal: Deprecate String.prototype.match and add a new matchAll method</em></p>

<p><code>String.prototype.match</code> currently works very differently depending on whether the <code>/g</code> (global) flag has been set on the provided regex:</p>

<ul>
	<li>For regexes with <code>/g</code>: If no matches are found, <code>null</code> is returned; otherwise an array of simple matches is returned.</li>
	<li>For regexes without <code>/g</code>: The <code>match</code> method operates as an alias of <code>regexp.exec</code>. If a match is not found, <code>null</code> is returned; otherwise you get an array containing the (single) match in key zero, with any backreferences stored in the array's subsequent keys. The array is also assigned special <code>index</code> and <code>input</code> properties.</li>
</ul>
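
<p>A short illustration of the two modes, using an arbitrary sample string:</p>

<pre class="code">var str = "a1 b2";

// without /g: same result as exec, with captures plus index and input
var single = str.match(/(\w)(\d)/);  // ["a1", "a", "1"], index 0

// with /g: a plain list of matched substrings, no captures or index
var every = str.match(/(\w)(\d)/g);  // ["a1", "b2"]

// no match: null in either mode
var none = "zz".match(/(\w)(\d)/g);  // null
</pre>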

<p>The <code>match</code> method's nonglobal mode is confusing and unnecessary. The reason it's unnecessary is obvious: If you want the functionality of <code>exec</code>, just use it (no need for an alias). It's confusing because, as described above, the <code>match</code> method's two modes return very different results. The difference is not merely whether you get one match or all matches&mdash;you get a completely different kind of result. And since the result is an array in either case, you have to know the status of the regex's <code>global</code> property to know which type of array you're dealing with.</p>

<p>I'd change <code>string.match</code> by making it always return an array containing all matches in the target string. I'd also make it return an empty array, rather than <code>null</code>, when no matches are found (an idea that comes from Dean Edwards's <a href="http://code.google.com/p/base2/">base2</a> library). If you want the first match only or you need backreferences and extra match details, that's what <code>regexp.exec</code> is for.</p>

<p>Unfortunately, if you want to consider this change as a realistic proposal, it would require some kind of language version- or mode-based switching of the <code>match</code> method's behavior (unlikely to happen, I would think). So, instead of that, I'd recommend deprecating the <code>match</code> method altogether in favor of a new method (perhaps <code>RegExp.prototype.matchAll</code>) with the changes prescribed above.</p>
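
<p>The behavior described for the new method can be sketched with a standalone function. The name <code>simpleMatchAll</code> is hypothetical and is used here only to avoid implying any built-in method; the sketch simply forces <code>/g</code> and substitutes an empty array for <code>null</code>.</p>

<pre class="code">// Hypothetical helper: always returns an array of all matched substrings
function simpleMatchAll(str, regex) {
    var flags = "g" +
        (regex.ignoreCase ? "i" : "") +
        (regex.multiline ? "m" : "");
    return str.match(new RegExp(regex.source, flags)) || [];
}

var pairs = simpleMatchAll("a1 b2", /\w\d/); // ["a1", "b2"], even without /g
var none = simpleMatchAll("zz", /\d/);       // [], not null
</pre>

<p>Returning an empty array means callers can check <code>length</code> or loop over the result without a separate <code>null</code> test.</p>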


<h3 style="margin-bottom:0">Get rid of /g and RegExp.prototype.global</h3>
<p style="margin-top:0"><em>Actual proposal: Deprecate /g and RegExp.prototype.global, and add a boolean replaceAll argument to String.prototype.replace</em></p>

<p>If the last two proposals were implemented and therefore <code>regexp.lastIndex</code> and <code>string.match</code> were things of the past (or <code>string.match</code> no longer sometimes served as an alias of <code>regexp.exec</code>), the only method where <code>/g</code> would still have any impact is <code>string.replace</code>. Additionally, although <code>/g</code> follows prior art from Perl, etc., it doesn't really make sense to have something that is not an attribute of a regex stored as a regex flag. Really, <code>/g</code> is more of a statement about how you want methods to apply their own functionality, and it's not uncommon to want to use the same pattern with and without <code>/g</code> (currently you'd have to construct two different regexes to do so). If it was up to me, I'd get rid of the <code>/g</code> flag and its corresponding <code>global</code> property, and instead simply give the <code>string.replace</code> method an additional argument that indicates whether you want to replace the first match only (the default handling) or all matches. This could be done with either a <code>replaceAll</code> boolean or, for greater readability, a <code>scope</code> string that accepts values <code>'one'</code> and <code>'all'</code>. This new argument would have the additional benefit of allowing replace-all functionality with nonregex searches.</p>

<p>Note that SpiderMonkey already has a proprietary third <code>string.replace</code> argument ("flags") that this proposal would conflict with. I doubt this conflict would cause much heartburn, but in any case, a new <code>replaceAll</code> argument would provide the same functionality that SpiderMonkey's <code>flags</code> argument is most useful for (that is, allowing global replacements with nonregex searches).</p>
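
<p>To make the <code>scope</code> idea above concrete, here is a sketch written as a standalone function. The name <code>replaceWithScope</code> and its argument order are assumptions for illustration. The nonregex branch uses <code>split</code>/<code>join</code>, so in that branch the replacement is inserted literally (no <code>$</code> tokens or replacement functions), and an empty search string is not handled.</p>

<pre class="code">// Hypothetical helper: scope is "one" (default) or "all"
function replaceWithScope(str, search, replacement, scope) {
    var all = scope === "all";

    if (search instanceof RegExp) {
        // ignore the regex's own /g flag; scope alone decides
        var flags = (all ? "g" : "") +
            (search.ignoreCase ? "i" : "") +
            (search.multiline ? "m" : "");
        return str.replace(new RegExp(search.source, flags), replacement);
    }

    // nonregex search
    if (!all) {
        return str.replace(search, replacement);
    }
    return str.split(search).join(replacement);
}

var one = replaceWithScope("a.b.c", ".", "-", "one");    // "a-b.c"
var every = replaceWithScope("a.b.c", ".", "-", "all");  // "a-b-c"
var sameRegexOne = replaceWithScope("aXbxc", /x/i, "-", "one"); // "a-bxc"
var sameRegexAll = replaceWithScope("aXbxc", /x/i, "-", "all"); // "a-b-c"
</pre>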


<h3 style="margin-bottom:0">Change the behavior of backreferences to nonparticipating groups</h3>
<p style="margin-top:0"><em>Actual proposal: Make backreferences to nonparticipating groups fail to match</em></p>

<p>I'll keep this brief since David "liorean" Andersson and I have previously argued for this on ES-Discuss and elsewhere. David posted about this in detail on his blog (see: <em><a href="http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/">ECMAScript 3 Regular Expressions: A specification that doesn't make sense</a></em>), and I've previously touched on it here (<em><a href="http://blog.stevenlevithan.com/archives/es3-regexes-broken">ECMAScript 3 Regular Expressions are Defective by Design</a></em>). On several occasions, Brendan Eich has also stated that he'd like to see this changed. The short explanation of this behavior is that, in JavaScript, backreferences to capturing groups that have not (yet) participated in a match always succeed (i.e., they match the empty string), whereas the opposite is true in all other regex flavors: they fail to match and therefore cause the regex engine to backtrack or fail. JavaScript's behavior means that <code>/(a|(b))\2c/.test("ac")</code> returns <code>true</code>. The (negative) implications of this reach quite far when pushing the boundaries of regular expressions.</p>
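
<p>The example regex can be inspected directly. This sketch assumes an engine that follows the ECMAScript rules described above:</p>

<pre class="code">var regex = /(a|(b))\2c/;

// group 2 never participates when the first alternative matches "a",
// yet the backreference \2 succeeds by matching the empty string
var passes = regex.test("ac");   // true
var result = regex.exec("ac");   // ["ac", "a", undefined]

// when group 2 does participate, \2 must repeat its text
var strict = regex.test("bc");   // false: "b" is not followed by another "b"
var repeated = regex.test("bbc"); // true
</pre>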

<p>I think everyone agrees that changing to the traditional backreferencing behavior would be an improvement&mdash;it provides far more intuitive handling, compatibility with other regex flavors, and great potential for creative use (e.g., see my post on <em><a href="http://blog.stevenlevithan.com/archives/mimic-conditionals">Mimicking Conditionals</a></em>). The bigger question is whether it would be safe, in light of backward compatibility. I think it would be, since I imagine that more or less no one uses the unintuitive JavaScript behavior intentionally. The JavaScript behavior amounts to automatically adding a <code>?</code> quantifier after backreferences to nonparticipating groups, which is what people already do explicitly if they actually want backreferences to nonzero-length subpatterns to be optional. Also note that Safari 3.0 and earlier did not follow the spec on this point and used the more intuitive behavior, although that has <a href="http://bugs.webkit.org/show_bug.cgi?id=14931">changed</a> in more recent versions (notably, this change was due to a <a href="http://blog.stevenlevithan.com/archives/npcg-javascript">write-up</a> on my blog rather than reports of real-world errors).</p>

<p>Finally, it's probably worth noting that .NET's ECMAScript regex mode (enabled via the <code>RegexOptions.ECMAScript</code> flag) indeed switches .NET to ECMAScript's unconventional backreferencing behavior.</p>


<h3 style="margin-bottom:0">Make \d \D \w \W \b \B support Unicode (like \s \S . ^ $, which already do)</h3>
<p style="margin-top:0"><em>Actual proposal: Add a /u flag (and corresponding RegExp.prototype.unicode property) that changes the meaning of \d, \w, \b, and related tokens</em></p>

<p>Unicode-aware digit and word character matching is not an existing JavaScript capability (short of constructing character class monstrosities that are hundreds or thousands of characters long), and since JavaScript lacks lookbehind you can't reproduce a Unicode-aware word boundary. You could therefore say this proposal is outside the stated scope of this post, but I'm including it here because I consider this more of a fix than a new feature.</p>

<p>According to current JavaScript standards, <code>\s</code>, <code>\S</code>, <code>.</code>, <code>^</code>, and <code>$</code> use Unicode-based interpretations of <em>whitespace</em> and <em>newline</em>, whereas <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\b</code>, and <code>\B</code> use ASCII-only interpretations of <em>digit</em>, <em>word character</em>, and <em>word boundary</em> (e.g., <code>/na\b/.test("na&iuml;ve")</code> unfortunately returns <code>true</code>). See my post on <em><a href="http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode">JavaScript, Regex, and Unicode</a></em> for further details. Adding Unicode support to these tokens would cause unexpected behavior for thousands of websites, but it could be implemented safely via a new <code>/u</code> flag (inspired by Python's <code>re.U</code> or <code>re.UNICODE</code> flag) and a corresponding <code>RegExp.prototype.unicode</code> property. Since it's actually fairly common to <em>not</em> want these tokens to be Unicode enabled in particular regex patterns, a new flag that activates Unicode support would offer the best of both worlds.</p>
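
<p>The split between the two groups of tokens can be checked with a few one-liners. The test characters below are arbitrary picks, written as escapes to keep the code ASCII-only:</p>

<pre class="code">// ASCII-only tokens
var boundaryInsideWord = /na\b/.test("na\u00EFve"); // true: U+00EF is not a word character
var accentedIsWord = /\w/.test("\u00EF");           // false
var arabicIndicDigit = /\d/.test("\u0663");         // false: ARABIC-INDIC DIGIT THREE

// Unicode-aware tokens
var nbspIsSpace = /\s/.test("\u00A0");              // true: NO-BREAK SPACE
var lineSeparator = /^b/m.test("a\u2028b");         // true: U+2028 counts as a newline
</pre>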


<h3 style="margin-bottom:0">Change the behavior of backreference resetting during subpattern repetition</h3>
<p style="margin-top:0"><em>Actual proposal: Never reset backreference values during a match</em></p>

<p>Like the last backreferencing issue, this too was covered by David Andersson in his post <em><a href="http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/">ECMAScript 3 Regular Expressions: A specification that doesn't make sense</a></em>. The issue here involves the value remembered by capturing groups nested within a quantified, outer group (e.g., <code>/((a)|(b))*/</code>). According to traditional behavior, the value remembered by a capturing group within a quantified grouping is whatever the group matched the last time it participated in the match. So, the value of <code>$1</code> after <code>/(?:(a)|(b))*/</code> is used to match <code>"ab"</code> would be <code>"a"</code>. However, according to ES3/ES5, the value of backreferences to nested groupings is reset/erased after the outer grouping is repeated. Hence, <code>/(?:(a)|(b))*/</code> would still match <code>"ab"</code>, but after the match is complete <code>$1</code> would reference a nonparticipating capturing group, which in JavaScript would match an empty string within the regex itself, and be returned as <code>undefined</code> in, e.g., the array returned by <code>regexp.exec</code>.</p>
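
<p>The example above, run as code. The noted result assumes an engine that implements the ES3/ES5 rule:</p>

<pre class="code">var result = /(?:(a)|(b))*/.exec("ab");

// ES3/ES5 rule: group 1 is erased when the outer group repeats
//   result is ["ab", undefined, "b"]
// traditional behavior: group 1 would keep its last value
//   result would be ["ab", "a", "b"]
var groupOne = result[1];
var groupTwo = result[2];
</pre>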

<p>My case for change is that current JavaScript behavior breaks from the norm in other regex flavors, does not lend itself to various types of creative patterns (see one example in my post on <em><a href="http://blog.stevenlevithan.com/archives/multi-attr-capture">Capturing Multiple, Optional HTML Attribute Values</a></em>), and in my opinion is far less intuitive than the more common, alternative regex behavior.</p>

<p>I believe this behavior is safe to change for two reasons. First, this is generally an edge case issue for all but hardcore regex wizards, and I'd be surprised to find regexes that rely on JavaScript's version of this behavior. Second, and more importantly, Internet Explorer does not implement this rule and follows the more traditional behavior.</p>


<h3 style="margin-bottom:0">Add an /s flag, already</h3>
<p style="margin-top:0"><em>Actual proposal: Add an /s flag (and corresponding RegExp.prototype.dotall property) that changes dot to match all characters including newlines</em></p>

<p>I'll sneak this one in as a change/fix rather than a new feature since it's not exactly difficult to use <code>[\s\S]</code> in place of a dot when you want the behavior of <code>/s</code>. I presume the <code>/s</code> flag has been excluded thus far to save novices from themselves and limit the damage of runaway backtracking, but what ends up happening is that people write horrifically inefficient patterns like <code>(.|\r|\n)*</code> instead.</p>

<p><!--You might ask why JavaScript should bother to add <code>/s</code> if you can already mimic it.-->Regex searches in JavaScript are seldom line-based, and it's therefore more common to want dot to include newlines than to match anything-but-newlines (although both modes are useful). It makes good sense to keep the default meaning of dot (no newlines) since it is shared by other regex flavors and required for backward compatibility, but adding support for the <code>/s</code> flag is overdue. A boolean indicating whether this flag was set should show up on regexes as a property named either <code>singleline</code> (the <a href="http://blog.stevenlevithan.com/archives/singleline-multiline-confusing">unfortunate name</a> from Perl, .NET, etc.) or the more descriptive <code>dotall</code> (used in Java, Python, PCRE, etc.).</p>
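
<p>For reference, the workaround and the inefficient alternative mentioned above look like this (the sample string is arbitrary):</p>

<pre class="code">var str = "first\nsecond";

var dot = /first.second/.test(str);           // false: dot excludes newlines
var anyChar = /first[\s\S]second/.test(str);  // true: the usual stand-in for /s

// also true, but alternation inside a quantified group backtracks far more
// than a character class on long inputs
var slow = /first(.|\r|\n)*second/.test(str);
</pre>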


<h3>Personal preferences</h3>

<p>Following are a few changes that would suit my preferences, although I don't think most people would consider them significant issues:</p>

<ul>
	<li>Allow regex literals to use unescaped forward slashes within character classes (e.g., <code>/[/]/</code>). This was already included in the abandoned <a href="http://wiki.ecmascript.org/doku.php?id=proposals:extend_regexps#regexp_scanning">ES4 change proposals</a>.</li>
	<li>Allow an unescaped <code>]</code> as the first character in character classes (e.g., <code>[]]</code> or <code>[^]]</code>). This is allowed in probably every other regex flavor, but creates an empty class followed by a literal <code>]</code> in JavaScript. I'd like to imagine that no one uses empty classes intentionally, since they don't work consistently cross-browser and there are widely-used/common-sense alternatives (<code>(?!)</code> instead of <code>[]</code>, and <code>[\s\S]</code> instead of <code>[^]</code>). Unfortunately, adherence to this JavaScript quirk is tested in <a href="http://acid3.acidtests.org/">Acid3</a> (test 89), which is likely enough to kill requests for this backward-incompatible but reasonable change.</li>
	<li>Change the <code>$&#038;</code> token used in replacement strings to <code>$0</code>. It just makes sense. (Equivalents in other replacement text flavors for comparison: Perl: <code>$&#038;</code>; Java: <code>$0</code>; .NET: <code>$0</code>, <code>$&#038;</code>; PHP: <code>$0</code>, <code>\0</code>; Ruby: <code>\0</code>, <code>\&#038;</code>; Python: <code>\g&lt;0&gt;</code>.)</li>
	<li>Get rid of the special meaning of <code>[\b]</code>. Within character classes, the metasequence <code>\b</code> matches a backspace character (equivalent to <code>\x08</code>). This is a worthless convenience since no one cares about matching backspace characters, and it's confusing given that <code>\b</code> matches a word boundary when used outside of character classes. Even though this would break from regex tradition (which I'd usually advocate following), I think that <code>\b</code> should have no special meaning inside character classes and simply match a literal <code>b</code>.</li>
</ul>
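
<p>A few of the quirks listed above, shown as code. The noted results assume an engine that follows the spec for empty classes:</p>

<pre class="code">// [\b] is a backspace inside a class, a word boundary outside
var backspace = /[\b]/.test("\b");      // true: matches U+0008
var letterB = /[\b]/.test("b");         // false
var boundary = /\bcat\b/.test("a cat"); // true

// empty classes
var emptyClass = /[]/.test("abc");      // false: matches nothing
var negatedEmpty = /[^]/.test("\n");    // true: matches any character

// a leading ] closes the class rather than joining it
var bracket = /[]]/.test("]");          // false: empty class, then a literal ]
</pre>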


<h3>Fixed in ES3: Remove octal character references</h3>

<p>ECMAScript 3 removed octal character references from regular expression syntax, although <code>\0</code> was kept as a convenient exception that allows easily matching a NUL character. However, browsers have generally kept full octal support around for backward compatibility. Octals are very confusing in regular expressions since their syntax overlaps with backreferences and an extra leading zero is allowed outside of character classes. Consider the following regexes:</p>

<ul>
	<li><code>/a\1/</code>: <code>\1</code> is an octal.</li>
	<li><code>/(a)\1/</code>: <code>\1</code> is a backreference.</li>
	<li><code>/(a)[\1]/</code>: <code>\1</code> is an octal.</li>
	<li><code>/(a)\1\2/</code>: <code>\1</code> is a backreference; <code>\2</code> is an octal.</li>
	<li><code>/(a)\01\001[\01\001]/</code>: All occurrences of <code>\01</code> and <code>\001</code> are octals. However, according to the ES3+ specs, the numbers after each <code>\0</code> should be treated (barring nonstandard extensions) as literal characters, completely changing what this regex matches. <em>(Edit-2012: Actually, a close reading of the spec shows that any 0-9 following <code>\0</code> should cause a <code>SyntaxError</code>.)</em></li>
	<li><code>/(a)\0001[\0001]/</code>: The <code>\0001</code> outside the character class is an octal; but inside, the octal ends at the third zero (i.e., the character class matches character index zero <em>or</em> <code>"1"</code>). This regex is therefore equivalent to <code>/(a)\x01[\x00\x31]/</code>; although, as mentioned just above, adherence to ES3 would change the meaning.</li>
	<li><code>/(a)\00001[\00001]/</code>: Outside the character class, the octal ends at the fourth zero and is followed by a literal <code>"1"</code>. Inside, the octal ends at the third zero and is followed by a literal <code>"01"</code>. And once again, ES3's exclusion of octals and inclusion of <code>\0</code> could change the meaning.</li>
	<li><code>/\1(a)/</code>: Given that, in JavaScript, backreferences to capturing groups that have not (yet) participated match the empty string, does this regex match <code>"a"</code> (i.e., <code>\1</code> is treated as a backreference since a corresponding capturing group appears in the regex) or does it match <code>"\x01a"</code> (i.e., the <code>\1</code> is treated as an octal since it appears <em>before</em> its corresponding group)? Unsurprisingly, browsers disagree.</li>
	<li><code>/(\2(a)){2}/</code>: Now things get really hairy. Does this regex match <code>"aa"</code>, <code>"aaa"</code>, <code>"\x02aaa"</code>, <code>"2aaa"</code>, <code>"\x02a\x02a"</code>, or <code>"2a2a"</code>? All of these options seem plausible, and browsers disagree on the correct choice.</li>
</ul>
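
<p>Since the answers depend on the engine, the simplest way to see how a given browser resolves the last case is to probe it. The following sketch uses a hypothetical helper, <code>matchesWhole</code>, and deliberately lists no expected results:</p>

<pre class="code">// Hypothetical helper: does regex match the candidate from start to end?
function matchesWhole(regex, candidate) {
    // the noncapturing wrapper leaves group numbering unchanged
    return new RegExp("^(?:" + regex.source + ")$").test(candidate);
}

var candidates = ["aa", "aaa", "\x02aaa", "2aaa", "\x02a\x02a", "2a2a"],
    accepted = [],
    i = 0;

while (i !== candidates.length) {
    if (matchesWhole(/(\2(a)){2}/, candidates[i])) {
        accepted.push(candidates[i]);
    }
    i++;
}
// accepted now holds whichever interpretations this engine supports
</pre>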

<p>There are other issues to worry about, too, like whether octal escapes go up to <code>\377</code> (<code>\xFF</code>, 8-bit) or <code>\777</code> (<code>\u01FF</code>, 9-bit); but in any case, octals in regular expressions are a confusing cluster-cuss. Even though ECMAScript has already cleaned up this mess by removing support for octals, browsers have not followed suit. I wish they would, because unlike browser makers, I don't have to worry about this bit of legacy (I never use octals in regular expressions, and neither should you).</p>

<h3>Fixed in ES5: Don't cache regex literals</h3>

<p>According to ES3 rules, regex literals did not create a new regex object if a literal with the same pattern/flag combination was already used in the same script or function (this did not apply to regexes created by the <code>RegExp</code> constructor). A common side effect of this was that regex literals using the <code>/g</code> flag did not have their <code>lastIndex</code> property reset in some cases where most developers would expect it. Several browsers didn't follow the spec on this unintuitive behavior, but Firefox did, and as a result it became the <a href="http://whereswalden.com/2010/01/15/more-es5-incompatible-changes-regular-expressions-now-evaluate-to-a-new-object-not-the-same-object-each-time-theyre-encountered/">second most duplicated</a> JavaScript <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=98409">bug report</a> for Mozilla. Fortunately, ES5 got rid of this rule, and now regex literals must be recompiled every time they're encountered (this change is coming in Firefox 3.7).</p>
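
<p>The difference can be observed with a sketch like the following (function names are illustrative). The noted results are for ES5 behavior:</p>

<pre class="code">function makeRegex() {
    return /a/g;
}

// ES5: each evaluation of the literal creates a new object
var distinct = makeRegex() !== makeRegex(); // true

function containsA(str) {
    return /a/g.test(str);
}

// ES5: both calls start from lastIndex 0, so both succeed.
// With ES3-style caching, the second call would reuse the first call's
// regex, start from lastIndex 1, and fail.
var results = [containsA("a"), containsA("a")]; // [true, true]
</pre>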


<p>&mdash;&mdash;&mdash;<br />So there you have it. I've outlined what I think the JavaScript RegExp API got wrong. Do you agree with all of these proposals, or <em>would</em> you if you didn't have to worry about backward compatibility? Are there better ways than what I've proposed to fix the issues discussed here? Got any other gripes with existing JavaScript regex features? I'm eager to hear feedback about this.</p>

<p>Since I've been focusing on the negative in this post, I'll note that I find working with regular expressions in JavaScript to be a generally pleasant experience. There's a hell of a lot that JavaScript got right.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/fixing-javascript-regexp/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>An IE lastIndex Bug with Zero-Length Regex Matches</title>
		<link>http://blog.stevenlevithan.com/archives/exec-bugs</link>
		<comments>http://blog.stevenlevithan.com/archives/exec-bugs#comments</comments>
		<pubDate>Mon, 14 Apr 2008 02:24:59 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/exec-bugs</guid>
		<description><![CDATA[The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's lastIndex property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in [...]]]></description>
				<content:encoded><![CDATA[<p>The bottom line of this blog post is that Internet Explorer incorrectly increments a regex object's <code>lastIndex</code> property after a successful, zero-length match. However, for anyone who isn't sure what I'm talking about or is interested in how to work around the problem, I'll describe the issue with examples of iterating over each match in a string using the <code>RegExp.prototype.exec</code> method. That's where I've most frequently encountered the bug, and I think it will help explain why the issue exists in the first place.</p>

<p>First of all, if you're not already familiar with how to use <code>exec</code> to iterate over a string, you're missing out on some very powerful functionality. Here's the basic construct:</p>

<pre class="code">var	regex = /.../g,
	subject = "test",
	match = regex.exec(subject);

while (match != null) {
	<span class="comment">// matched text: match[0]
	// match start: match.index
	// match end: regex.lastIndex
	// capturing group n: match[n]</span>

	...

	match = regex.exec(subject);
}
</pre>

<p>When the <code>exec</code> method is called for a regex that uses the <code>/g</code> (global) modifier, it searches from the point in the subject string specified by the regex's <code>lastIndex</code> property (which is initially zero, so it searches from the beginning of the string). If the <code>exec</code> method finds a match, it updates the regex's <code>lastIndex</code> property to the character index at the end of the match, and returns an array containing the matched text and any captured subexpressions. If there is no match from the point in the string where the search started, <code>lastIndex</code> is reset to zero, and <code>null</code> is returned.</p>

<p>You can tighten up the above code by moving the <code>exec</code> method call into the <code>while</code> loop's condition, like so:</p>

<pre class="code">var	regex = /.../g,
	subject = "test",
	match;

while (match = regex.exec(subject)) {
	...
}
</pre>

<p>This cleaner version works essentially the same as before. As soon as <code>exec</code> can't find any further matches and therefore returns <code>null</code>, the loop ends. However, there are a couple of cross-browser issues to be aware of with either version of this code. One is that if the regex contains capturing groups which do not participate in the match, some values in the returned array could be either <code>undefined</code> or an empty string. I've previously discussed that issue in depth in a post about what I called <a href="http://blog.stevenlevithan.com/archives/npcg-javascript">non-participating capturing groups</a>.</p>
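<p>To make the loop concrete, here is a small sketch with an actual pattern in place of the <code>...</code> placeholders. The pattern, the subject string, and the <code>found</code> array are assumptions chosen for illustration. The <code>hasDigits</code> check treats <code>undefined</code> and an empty string alike, so it sidesteps the non-participating group difference mentioned above:</p>

```javascript
var regex = /(\d+)|([a-z]+)/g, // illustrative pattern: digits or lowercase letters
	subject = "ab 12 cd",
	match,
	found = [];

while ((match = regex.exec(subject))) {
	found.push({
		text: match[0],        // matched text
		start: match.index,    // match start
		end: regex.lastIndex,  // match end
		// Group 1 does not participate when the letters alternative matches,
		// so match[1] may be undefined or "" depending on the browser.
		hasDigits: !!match[1]
	});
}

// found now describes "ab" (0 to 2), "12" (3 to 5), and "cd" (6 to 8).
// The final, failed exec call has reset regex.lastIndex to 0.
```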

<p>Another issue (the topic of <em>this</em> post) occurs when your regex matches an empty string. There are many reasons why you might allow a regex to do that, but if you can't think of any, consider cases where you're accepting regexes from an outside source. Here's a simple example of such a regex:</p>

<pre class="code">var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	endPositions.push(regex.lastIndex);
}
</pre>

<p>You might expect the <code>endPositions</code> array to be set to <code>[0,2,4]</code>, since those are the character positions for the beginning of the string and just after each newline character. Thanks to the <code>/m</code> modifier, those are the positions where the regex will match; and since the regex matches empty strings, <code>regex.lastIndex</code> should be the same as <code>match.index</code>. However, Internet Explorer (tested with v5.5&ndash;7) sets <code>endPositions</code> to <code>[1,3,5]</code>. Other browsers will go into an infinite loop until you short-circuit the code.</p>

<p>So what's going on here? Remember that every time <code>exec</code> runs, it attempts to match within the subject string starting at the position specified by the <code>lastIndex</code> property of the regex. Since our regex matches a zero-length string, <code>lastIndex</code> remains exactly where we started the search. Therefore, every time through the loop our regex will match at the same position (the start of the string). Internet Explorer tries to be helpful and avoid this situation by automatically incrementing <code>lastIndex</code> when a zero-length string is matched. That might seem like a good idea (in fact, I've seen people adamantly argue that it is a bug that Firefox does not do the same), but it means that in Internet Explorer the <code>lastIndex</code> property cannot be relied on to accurately determine the ending position of a match.</p>
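<p>You can observe the stuck position without risking an infinite loop by calling <code>exec</code> just twice and recording <code>lastIndex</code> after each call. This is only a sketch, and the variable names are made up for the example. In a browser that follows the spec, every value below is zero; per the description above, IE would report a <code>lastIndex</code> one past the match instead:</p>

```javascript
var regex = /^/gm,
	subject = "A\nB\nC",
	first = regex.exec(subject),       // zero-length match at index 0
	indexAfterFirst = regex.lastIndex, // stays at 0 in spec-following engines
	second = regex.exec(subject),      // same zero-length match at index 0 again
	indexAfterSecond = regex.lastIndex;

// Spec-following engines: first.index === second.index === 0, and both
// recorded lastIndex values are 0, which is why the unguarded loop never ends.
```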

<p>We can correct this situation cross-browser with the following code:</p>

<pre class="code">var	regex = /^/gm,
	subject = "A\nB\nC",
	match,
	endPositions = [];

while (match = regex.exec(subject)) {
	var zeroLengthMatch = !match[0].length;
	<span class="comment">// Fix IE's incorrect lastIndex</span>
	if (zeroLengthMatch &amp;&amp; regex.lastIndex > match.index)
		regex.lastIndex--;

	endPositions.push(regex.lastIndex);

	<span class="comment">// Avoid an infinite loop with zero-length matches</span>
	if (zeroLengthMatch)
		regex.lastIndex++;
}
</pre>

<p>You can see an example of the above code in the <a href="http://blog.stevenlevithan.com/archives/cross-browser-split">cross-browser split method</a> I posted a while back. Keep in mind that none of the extra code here is needed if your regex cannot possibly match an empty string.</p>
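<p>If you need this in more than one place, the same two adjustments can be wrapped in a function. The following is a minimal sketch rather than a tested library routine: the name <code>matchEndPositions</code> is made up, and the guard against non-global regexes is an addition of mine, since without <code>/g</code> the <code>exec</code> method ignores <code>lastIndex</code> and the loop would never finish.</p>

```javascript
// Hypothetical helper: returns the end position of every match of a
// global regex in subject, using the lastIndex fixes shown above.
function matchEndPositions(regex, subject) {
	var endPositions = [],
		match,
		zeroLengthMatch;

	if (!regex.global) {
		throw new TypeError("matchEndPositions requires a regex with the /g flag");
	}
	regex.lastIndex = 0; // always search from the start of the subject

	while ((match = regex.exec(subject))) {
		zeroLengthMatch = !match[0].length;
		// Fix IE's incorrect lastIndex
		if (zeroLengthMatch && regex.lastIndex > match.index) {
			regex.lastIndex--;
		}
		endPositions.push(regex.lastIndex);
		// Avoid an infinite loop with zero-length matches
		if (zeroLengthMatch) {
			regex.lastIndex++;
		}
	}
	return endPositions;
}

var lineStarts = matchEndPositions(/^/gm, "A\nB\nC"); // [0, 2, 4]
var wordEnds = matchEndPositions(/\w/g, "ab");        // [1, 2]
```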

<p>Another way to deal with this issue is to use <code>String.prototype.replace</code> to iterate over the subject string. The <code>replace</code> method moves forward automatically after zero-length matches, avoiding this issue altogether. Unfortunately, in the three biggest browsers (IE, Firefox, Safari), <code>replace</code> doesn't seem to deal with the <code>lastIndex</code> property except to reset it to zero. Opera gets it right (according to my reading of the spec) and updates <code>lastIndex</code> along the way. Given the current situation, you can't rely on <code>lastIndex</code> in your code when iterating over a string using <code>replace</code>, but you can still easily derive the value for the end of each match. Here's an example:</p>

<pre class="code">var	regex = /^/gm,
	subject = "A\nB\nC",
	endPositions = [];

subject.replace(regex, function (match) {
	<span class="comment">// Not using a named argument for the index since capturing
	// groups can change its position in the list of arguments</span>
	var	index = arguments[arguments.length - 2],
		lastIndex = index + match.length;

	endPositions.push(lastIndex);
});
</pre>

<p>That's perhaps less lucid than before (since we're not actually replacing anything), but there you have it&hellip; two cross-browser ways to get around a little-known issue that could otherwise cause tricky, latent bugs in your code.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/exec-bugs/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>A JScript/VBScript Regex Lookahead Bug</title>
		<link>http://blog.stevenlevithan.com/archives/regex-lookahead-bug</link>
		<comments>http://blog.stevenlevithan.com/archives/regex-lookahead-bug#comments</comments>
		<pubDate>Mon, 24 Mar 2008 05:50:58 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[VBScript]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-lookahead-bug</guid>
		<description><![CDATA[Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via ?, *, {0,n}, or (.&#124;); but not +, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript: /(?=a?b)ab/.test("ab"); // Should return true, but [...]]]></description>
				<content:encoded><![CDATA[<p>Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via <code>?</code>, <code>*</code>, <code>{0,<em>n</em>}</code>, or <code>(.|)</code>; but not <code>+</code>, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript:</p>

<pre class="code"><span class="regex">/(?=a?b)ab/</span>.test("ab");
<span class="comment">// Should return true, but IE 5.5 &ndash; 8b1 return false</span>

<span class="regex">/(?=a?b)ab/</span>.test("abc");
<span class="comment">// Correctly returns true (even in IE), although the
// added "c" does not take part in the match</span>
</pre>

<p>I've been aware of this bug for a couple of years, thanks to a <a href="http://regexadvice.com/blogs/mash/archive/2004/10/05/320.aspx">blog post by Michael Ash</a> that describes the bug with a password-complexity regex. However, the bug description there is incomplete and subtly incorrect, as shown by the above, reduced test case. To be honest, although the errant behavior is predictable, it's a bit tricky to describe because I haven't yet figured out exactly what's happening internally. I'd recommend playing with variations of the above code to get a better understanding of the problem.</p>
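<p>One way to play with such variations is a small harness like the following sketch. The list of patterns is my own selection based on the constructs named at the top of this post, and I have not recorded per-variant IE results here, so treat the comments as expectations rather than measurements. In a browser without the bug, every line of <code>results</code> ends in <code>true</code>.</p>

```javascript
var subject = "ab",
	variants = [
		/(?=a?b)ab/,     // optional via ?
		/(?=a*b)ab/,     // optional via *
		/(?=a{0,1}b)ab/, // optional via {0,n}
		/(?=(a|)b)ab/,   // alternation with a zero-length option
		/(?=a+b)ab/      // + quantifier: not expected to trigger the bug
	],
	results = [],
	i;

for (i = 0; i < variants.length; i++) {
	// Every variant should match "ab"; a false here points at the bug
	results.push(variants[i].source + " -> " + variants[i].test(subject));
}
```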

<p>Fortunately, since the bug is predictable, it's usually possible to work around. For example, you can avoid the bug with the password regex in Michael's post (<code>/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/</code>) by writing it as <code>/^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/</code> (the <code>.{8,15}$</code> lookahead must come first here). The important thing is to be aware of the issue, because it can easily introduce latent, difficult-to-diagnose bugs into your code. Just remember that it shows up with variable-length lookahead. If you're using such patterns, test the hell out of them in IE.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/regex-lookahead-bug/feed</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>JavaScript, Regex, and Unicode</title>
		<link>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode</link>
		<comments>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode#comments</comments>
		<pubDate>Wed, 02 Jan 2008 06:11:24 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode</guid>
		<description><![CDATA[Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore. According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, [...]]]></description>
				<content:encoded><![CDATA[<p>Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.</p>

<p>According to ECMA-262 3rd Edition, <code>\s</code>, <code>\S</code>, <code>.</code>, <code>^</code>, and <code>$</code> use Unicode-based interpretations of <em>whitespace</em> and <em>newline</em>, while <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\b</code>, and <code>\B</code> use ASCII-only interpretations of <em>digit</em>, <em>word character</em>, and <em>word boundary</em> (e.g. <code>/a\b/.<wbr/>test("na&iuml;ve")</code> returns <code>true</code>). Actual browser implementations often differ on these points. For example, Firefox 2 considers <code>\d</code> and <code>\D</code> to be Unicode-aware, while Firefox 3 fixes this bug &mdash; making <code>\d</code> equivalent to <code>[0-9]</code> as with most other browsers.</p>

<p>Here again are the affected tokens, along with their definitions:</p>

<ul>
	<li><code>\d</code> &mdash; Digits.</li>
	<li><code>\s</code> &mdash; Whitespace.</li>
	<li><code>\w</code> &mdash; Word characters.</li>
	<li><code>\D</code> &mdash; All except digits.</li>
	<li><code>\S</code> &mdash; All except whitespace.</li>
	<li><code>\W</code> &mdash; All except word characters.</li>
	<li><code>.</code> &mdash; All except newlines.</li>
	<li><code>^</code> (with <code>/m</code>) &mdash; The positions at the beginning of the string and just after newlines.</li>
	<li><code>$</code> (with <code>/m</code>) &mdash; The positions at the end of the string and just before newlines.</li>
	<li><code>\b</code> &mdash; Word boundary positions.</li>
	<li><code>\B</code> &mdash; Not word boundary positions.</li>
</ul>

<p>All of the above are standard in Perl-derivative regex flavors. However, the meanings of the terms <em>digit</em>, <em>whitespace</em>, <em>word character</em>, <em>word boundary</em>, and <em>newline</em> depend on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:</p>

<ul>
	<li><em>Digit</em> &mdash; The characters 0-9 only.</li>
	<li><em>Whitespace</em> &mdash; Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".</li>
	<li><em>Word character</em> &mdash; The characters A-Z, a-z, 0-9, and _ only.</li>
	<li><em>Word boundary</em> &mdash; The position between a <em>word character</em> and non-<em>word character</em>.</li>
	<li><em>Newline</em> &mdash; The line feed, carriage return, line separator, and paragraph separator characters.</li>
</ul>
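<p>A few spot checks make these definitions tangible. This is a sketch with sample characters of my own choosing (an Arabic-Indic digit and a Latin letter with a diaeresis); the values in the comments are what the definitions above call for, and individual browsers may differ as noted earlier.</p>

```javascript
var checks = {
	asciiDigit: /^\d$/.test("7"),                 // true: 0-9 are digits
	arabicIndicDigit: /^\d$/.test("\u0663"),      // false: \d is ASCII-only
	accentedLetter: /^\w$/.test("\u00ef"),        // false: \w is ASCII-only
	boundaryInNaive: /a\b/.test("na\u00efve"),    // true: "\u00ef" is not a word character
	noBreakSpace: /^\s$/.test("\u00a0")           // true: \s is Unicode-based
};
```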

<p>Here again are the newline characters, with their character codes:</p>

<ul>
	<li><code>\u000a</code> &mdash; Line feed &mdash; <code>\n</code></li>
	<li><code>\u000d</code> &mdash; Carriage return &mdash; <code>\r</code></li>
	<li><code>\u2028</code> &mdash; Line separator</li>
	<li><code>\u2029</code> &mdash; Paragraph separator</li>
</ul>

<p>Note that ECMAScript 4 proposals indicate that the <a href="http://en.wikipedia.org/wiki/C0_and_C1_control_codes">C1</a>/Unicode NEL "next line" control character (<code>\u0085</code>) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, <code>/\r^$\n/m.test("\r\n")</code> returns <code>true</code>.</p>
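<p>The newline rules can be checked the same way. In this sketch (variable names are illustrative), the dot is tested against each of the four newline characters, and a CRLF pair is shown to produce two line boundaries rather than one:</p>

```javascript
var newlines = ["\n", "\r", "\u2028", "\u2029"],
	dotMatchesNewline = [],
	i;

for (i = 0; i < newlines.length; i++) {
	// The dot should not match any of the four newline characters
	dotMatchesNewline.push(/./.test(newlines[i]));
}

// ^ and $ both match between the \r and the \n of a CRLF pair
var crlfIsSplit = /\r^$\n/m.test("\r\n"); // true

// Consequently "A\r\nB" has three line starts: index 0, after \r, and after \n
var lineStartCount = "A\r\nB".match(/^/gm).length; // 3
```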

<p>As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's <a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes">Basic Multilingual Plane</a>, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by <code>\s</code> according to ECMA-262 3rd Edition and Unicode 5.1:</p>

<ul>
	<li><code>\u0009</code> &mdash; Tab &mdash; <code>\t</code></li>
	<li><code>\u000a</code> &mdash; Line feed &mdash; <code>\n</code> &mdash; (newline character)</li>
	<li><code>\u000b</code> &mdash; Vertical tab &mdash; <code>\v</code></li>
	<li><code>\u000c</code> &mdash; Form feed &mdash; <code>\f</code></li>
	<li><code>\u000d</code> &mdash; Carriage return &mdash; <code>\r</code> &mdash; (newline character)</li>
	<li><code>\u0020</code> &mdash; Space</li>
	<li><code>\u00a0</code> &mdash; No-break space</li>
	<li><code>\u1680</code> &mdash; Ogham space mark</li>
	<li><code>\u180e</code> &mdash; Mongolian vowel separator</li>
	<li><code>\u2000</code> &mdash; En quad</li>
	<li><code>\u2001</code> &mdash; Em quad</li>
	<li><code>\u2002</code> &mdash; En space</li>
	<li><code>\u2003</code> &mdash; Em space</li>
	<li><code>\u2004</code> &mdash; Three-per-em space</li>
	<li><code>\u2005</code> &mdash; Four-per-em space</li>
	<li><code>\u2006</code> &mdash; Six-per-em space</li>
	<li><code>\u2007</code> &mdash; Figure space</li>
	<li><code>\u2008</code> &mdash; Punctuation space</li>
	<li><code>\u2009</code> &mdash; Thin space</li>
	<li><code>\u200a</code> &mdash; Hair space</li>
	<li><code>\u2028</code> &mdash; Line separator &mdash; (newline character)</li>
	<li><code>\u2029</code> &mdash; Paragraph separator &mdash; (newline character)</li>
	<li><code>\u202f</code> &mdash; Narrow no-break space</li>
	<li><code>\u205f</code> &mdash; Medium mathematical space</li>
	<li><code>\u3000</code> &mdash; Ideographic space</li>
</ul>
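<p>To see how a particular engine handles this list, you can probe each code point directly. The following sketch collects the listed characters that <code>\s</code> fails to match. The result depends on the engine and on the Unicode version it was built against, so an empty <code>unmatched</code> array is the expectation under the rules above rather than a guarantee (engines using later Unicode data may, for example, report <code>\u180e</code>).</p>

```javascript
var codes = [
		0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x00a0, 0x1680, 0x180e,
		0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
		0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000
	],
	unmatched = [],
	i;

for (i = 0; i < codes.length; i++) {
	if (!/^\s$/.test(String.fromCharCode(codes[i]))) {
		// Record the character as a \uXXXX escape for readability
		unmatched.push("\\u" + ("0000" + codes[i].toString(16)).slice(-4));
	}
}
```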

<p>To test which characters or positions are matched by all of the tokens mentioned here in your browser, see <a href="http://xregexp.com/tests/unicode.html"><strong>JavaScript Regex and Unicode Tests</strong></a>. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.</p>

<div class="update">
<p><strong>Update:</strong> My new <a href="http://xregexp.com/plugins/"><strong>Unicode plugin</strong></a> for <a href="http://xregexp.com/"><strong>XRegExp</strong></a> allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode/feed</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>When innerHTML isn&#8217;t Fast Enough</title>
		<link>http://blog.stevenlevithan.com/archives/faster-than-innerhtml</link>
		<comments>http://blog.stevenlevithan.com/archives/faster-than-innerhtml#comments</comments>
		<pubDate>Wed, 12 Sep 2007 04:39:32 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[innerhtml]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/faster-than-innerhtml</guid>
		<description><![CDATA[This post isn't about the pros and cons of innerHTML vs. W3C DOM methods. That has been hashed and rehashed elsewhere. Instead, I'll show how you can combine the use of innerHTML and DOM methods to make your code potentially hundreds of times faster than innerHTML on its own, when working with large numbers of [...]]]></description>
				<content:encoded><![CDATA[<p>This post isn't about the pros and cons of <code>innerHTML</code> vs. W3C DOM methods. That has been hashed and rehashed <a href="http://www.dustindiaz.com/innerhtml-vs-dom-methods/">elsewhere</a>. Instead, I'll show how you can combine the use of <code>innerHTML</code> and DOM methods to make your code potentially hundreds of times faster than <code>innerHTML</code> on its own, when working with large numbers of elements.</p>

<p>In some browsers (most notably, Firefox), although <code>innerHTML</code> is generally much faster than DOM methods, it spends a disproportionate amount of time clearing out existing elements vs. creating new ones. Knowing this, we can get the best of both: destroy the existing elements quickly by removing their parent with standard DOM methods, then create the new elements with <code>innerHTML</code>. (This technique is something I discovered during the development of <a href="http://regexpal.com" title="JavaScript regex tester">RegexPal</a>, and is one of its two main performance optimizations. The other is one-shot markup generation for match highlighting, which avoids needing to loop over matches or reference them individually.)</p>

<h3 class="sub">The code:</h3>

<pre class="code">
function replaceHtml(el, html) {
	var oldEl = typeof el === "string" ? document.getElementById(el) : el;
	/*@cc_on <span class="comment">// Pure innerHTML is slightly faster in IE</span>
		oldEl.innerHTML = html;
		return oldEl;
	@*/
	var newEl = oldEl.cloneNode(false);
	newEl.innerHTML = html;
	oldEl.parentNode.replaceChild(newEl, oldEl);
	<span class="comment">/* Since we just removed the old element from the DOM, return a reference
	to the new element, which can be used to restore variable references. */</span>
	return newEl;
}
</pre>

<p>You can use the above as <code>el = replaceHtml(el, newHtml)</code> instead of <code>el.innerHTML = newHtml</code>.</p>
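<p>Because the old element is swapped out, any other variable still pointing at it goes stale. One way to keep the reassignment in a single place is a small closure around the current element. This is only a sketch: <code>makeHtmlUpdater</code> is a hypothetical helper, not part of the function above, and the <code>"output"</code> id in the usage comment is made up.</p>

```javascript
// Hypothetical helper: remembers the current element so callers never
// hold a stale reference. replaceFn is expected to behave like
// replaceHtml(el, html) and return the element now in the document.
function makeHtmlUpdater(el, replaceFn) {
	var current = el;
	return function (html) {
		current = replaceFn(current, html);
		return current;
	};
}

// Usage in a browser (assumes replaceHtml from above and an element
// with the made-up id "output"):
//   var update = makeHtmlUpdater(document.getElementById("output"), replaceHtml);
//   update("<b>first</b>");
//   update("<i>second</i>"); // operates on the element created by the first call
```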

<h3 class="sub">innerHTML is already pretty fast...is this really warranted?</h3>

<p>That depends on how many elements you're overwriting. In RegexPal, every keydown event potentially triggers the destruction and creation of thousands of elements (in order to make the syntax and match highlighting work). In such cases, the above approach has an enormous positive impact. Even something as simple as <code>el.innerHTML += str</code> or <code>el.innerHTML = ""</code> could be a performance disaster if the element you're updating happens to have a few thousand children.</p>

<p>I've created a page which allows you to easily <a href="http://stevenlevithan.com/demo/replaceHtml.html" class="bold">test the performance difference</a> of <code>innerHTML</code> and my <code>replaceHtml</code> function with various numbers of elements. Make sure to try it out in a few browsers for comparison. Following are a couple examples of typical results from Firefox 2.0.0.6 on my system:</p>

<pre class="code"><span class="comment"><strong>1000 elements...</strong></span>
innerHTML (destroy only): <strong>156ms</strong>
innerHTML (create only): <strong>15ms</strong>
innerHTML (destroy &amp; create): <strong>172ms</strong>
replaceHtml (destroy only): <strong>0ms</strong> (<span class="comment"><strong>faster</strong></span>)
replaceHtml (create only): <strong>15ms</strong> (~ same speed)
replaceHtml (destroy &amp; create): <strong>15ms</strong> (<span class="comment"><strong>11.5x faster</strong></span>)

<span class="comment"><strong>15000 elements...</strong></span>
innerHTML (destroy only): <strong>14703ms</strong>
innerHTML (create only): <strong>250ms</strong>
innerHTML (destroy &amp; create): <strong>14922ms</strong>
replaceHtml (destroy only): <strong>31ms</strong> (<span class="comment"><strong>474.3x faster</strong></span>)
replaceHtml (create only): <strong>250ms</strong> (~ same speed)
replaceHtml (destroy &amp; create): <strong>297ms</strong> (<span class="comment"><strong>50.2x faster</strong></span>)
</pre>

<p>I think the numbers speak for themselves. Comparable performance improvements can also be seen in Safari. In Opera, <code>replaceHtml</code> is still typically faster than <code>innerHTML</code>, but by a narrower margin. In IE, simple use of <code>innerHTML</code> is typically faster than mixing it with DOM methods, but not by nearly the same kinds of margins as you can see above. Nevertheless, IE's conditional compilation feature is used to avoid the relatively minor performance penalty, by just using <code>innerHTML</code> with that browser.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/faster-than-innerhtml/feed</wfw:commentRss>
		<slash:comments>92</slash:comments>
		</item>
	</channel>
</rss>
