<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Flagrant Badassery &#187; Regular Expressions</title>
	<atom:link href="http://blog.stevenlevithan.com/category/regular-expressions/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.stevenlevithan.com</link>
	<description>A JavaScript and regular expression centric blog</description>
	<lastBuildDate>Mon, 05 Jul 2010 20:27:50 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Regex Syntax Highlighter</title>
		<link>http://blog.stevenlevithan.com/archives/regex-syntax-highlighter</link>
		<comments>http://blog.stevenlevithan.com/archives/regex-syntax-highlighter#comments</comments>
		<pubDate>Mon, 05 Jul 2010 08:55:36 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Project Releases]]></category>
		<category><![CDATA[Regular Expressions]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=381</guid>
		<description><![CDATA[Do you regularly post regular expressions online? Have you seen the regex syntax highlighting in RegexPal, RegexBuddy, or on my blog (example), and wanted to apply it to your own websites?

Prompted by blog reader Mark McDonnell, I've extracted the regex syntax highlighting engine built into RegexPal and made it into its own library, unimaginatively named [...]]]></description>
			<content:encoded><![CDATA[<p>Do you regularly post regular expressions online? Have you seen the regex syntax highlighting in <a href="http://regexpal.com/">RegexPal</a>, <a href="http://www.regexbuddy.com/cgi-bin/affref.pl?aff=SteveL">RegexBuddy</a>, or on my blog (<a href="http://blog.stevenlevithan.com/archives/multi-attr-capture">example</a>), and wanted to apply it to your own websites?</p>

<p>Prompted by blog reader <a href="http://www.integralist.co.uk/">Mark McDonnell</a>, I've extracted the regex syntax highlighting engine built into RegexPal and made it into its own library, unimaginatively named <a href="http://stevenlevithan.com/regex/syntaxhighlighter/"><strong>JavaScript Regex Syntax Highlighter</strong></a>. When combined with the provided CSS, this 1.6 KB self-contained JavaScript file can be used, for instance, to automatically apply regex syntax highlighting to any HTML element with the "<code>regex</code>" class. You can see an example of doing just that on my quick and dirty <a href="http://stevenlevithan.com/regex/syntaxhighlighter/">test page</a>.</p>

<div style="border:1px solid #d3d3d3; background:#f6f6f6; padding:5px; margin:15px 0;">
<p style="text-align:center; margin:0;">Highlighting example:<br />
<code class="regex">&lt;table<b>\b</b><i>[^&gt;]</i><b>*</b>&gt;<b class="g1">(?:</b><b class="g2">(?=</b><b class="g3">(</b><i>[^&lt;]</i><b>+</b><b class="g3">)</b><b class="g2">)</b><b>\1</b><b class="g1">|</b>&lt;<b class="g2">(?!</b>table<b>\b</b><i>[^&gt;]</i><b>*</b>&gt;<b class="g2">)</b><b class="g1">)</b><b class="g1">*?</b>&lt;/table&gt;</code></p>
</div>

<p>Although the library is simple (there's just one function to call), the syntax highlighting is pretty advanced and handles all valid JavaScript regex syntax and errors (with errors highlighted in red). An example of its advanced highlighting support is that it knows, based on the context, whether <code>\10</code> is backreference 10, backreference 1 followed by a literal zero, octal character index 10, or something else altogether due to its position in the surrounding pattern. Speaking of octal escapes (which are de facto browser extensions; not part of the spec.), they are correctly highlighted according to their subtle differences inside and outside character classes (outside of character classes only, octals can include a fourth digit if the leading digit is a zero).</p>

<p>As far as I'm aware, this is the first JavaScript library for highlighting regex syntax, with or without the level of completeness included here. For people who might feel inclined to use or improve upon my work, I've made the licensing as permissive as possible to avoid getting in your way. RegexPal is already open source under the GNU LGPL 3.0 License, but this new library is released under the MIT License. If you plan to customize or help upgrade this code, note that it could probably use a bit of an overhaul (it's ripped from RegexPal with minimal modification), and might <em>require</em> an overhaul if you want to cleanly add support for additional regex flavors. Another nifty feature I plan to eventually add is explanatory <code>title</code> attributes for each element in the returned HTML, which might be particularly helpful for deciphering any highlighted errors or warnings.</p>

<p>Let me know if this library is useful for you, or if there are any other features you'd like to see added or changed. Thanks!</p>

<p>Link: <a href="http://stevenlevithan.com/regex/syntaxhighlighter/"><strong>JavaScript Regex Syntax Highlighter</strong></a>.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/regex-syntax-highlighter/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>What the JavaScript RegExp API Got Wrong, &amp; How to Fix It</title>
		<link>http://blog.stevenlevithan.com/archives/fixing-javascript-regexp</link>
		<comments>http://blog.stevenlevithan.com/archives/fixing-javascript-regexp#comments</comments>
		<pubDate>Mon, 01 Mar 2010 08:43:04 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=327</guid>
		<description><![CDATA[Over the last few years, I've occasionally commented on JavaScript's RegExp API, syntax, and behavior on the ES-Discuss mailing list. Recently, JavaScript inventor Brendan Eich suggested that, in order to get more discussion going, I write up a list of regex changes to consider for future ECMAScript standards (or as he humorously put it, have [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last few years, I've occasionally commented on JavaScript's RegExp API, syntax, and behavior on the <a href="https://mail.mozilla.org/listinfo/es-discuss">ES-Discuss mailing list</a>. Recently, JavaScript inventor <a href="http://weblogs.mozillazine.org/roadmap/">Brendan Eich</a> suggested that, in order to get more discussion going, I write up a list of regex changes to consider for future <abbr title="ECMAScript is the official name of the JavaScript language standard.">ECMAScript</abbr> standards (or as he humorously put it, have my "95 [regex] theses nailed to the <abbr title="ECMAScript 3rd Edition">ES3</abbr> cathedral door"). I figured I'd give it a shot, but I'm going to split my response into a few parts. In this post, I'll be discussing issues with the current RegExp API and behavior. I'll be leaving aside new features that I'd like to see added, and merely suggesting ways to make existing capabilities better. I'll discuss possible new features in a follow-up post.</p>

<p>For a language as widely used as JavaScript, any realistic change proposal must strongly consider backward compatibility. For this reason, some of the following proposals might <em>not</em> be particularly realistic, but nevertheless I think that <em>a</em>) it's worthwhile to consider what might change if backward compatibility wasn't a concern, and <em>b</em>) in the long run, all of these changes would improve the ease of use and predictability of how regular expressions work in JavaScript.</p>


<h3 style="margin-bottom:0">Remove RegExp.prototype.lastIndex and replace it with an argument for start position</h3>
<p style="margin-top:0"><em>Actual proposal: Deprecate RegExp.prototype.lastIndex and add a "pos" argument to the RegExp.prototype.exec/test methods</em></p>

<p>JavaScript's <code>lastIndex</code> property serves too many purposes at once:</p>

<dl>
	<dt>It lets users manually specify where to start a regex search</dt>
	<dd>You could claim this is not <code>lastIndex</code>'s intended purpose, but it's nevertheless an important use since there's no alternative feature that allows this. <code>lastIndex</code> is not very good at this task, though. You need to compile your regex with the <code>/g</code> flag to let <code>lastIndex</code> be used this way; and even then, it only specifies the starting position for the <code>regexp.exec</code>/<code>test</code> methods. It cannot be used to set the start position for the <code>string.match</code>/<code>replace</code>/<code>search</code>/<code>split</code> methods.</dd>

	<dt>It indicates the position where the last match ended</dt>
	<dd>Even though you could derive the match end position by adding the match index and length, this use of <code>lastIndex</code> serves as a convenient and commonly used complement to the <code>index</code> property on match arrays returned by <code>exec</code>. As always, using <code>lastIndex</code> like this works only for regexes compiled with <code>/g</code>.</dd>

	<dt>It's used to track the position where the next search should start</dt>
	<dd>This comes into play, e.g., when using a regex to iterate over all matches in a string. However, the fact that <code>lastIndex</code> is actually set to the end position of the last match rather than the position where the next search should start (unlike equivalents in other programming languages) causes a problem after zero-length matches, which are easily possible with regexes like <code>/\w*/g</code> or <code>/^/mg</code>. Hence, you're forced to manually increment <code>lastIndex</code> in such cases. I've posted about this issue in more detail before (see: <em><a href="http://blog.stevenlevithan.com/archives/exec-bugs">An IE lastIndex Bug with Zero-Length Regex Matches</a></em>), as has Jan Goyvaerts (<em><a href="http://www.regexguru.com/2008/04/watch-out-for-zero-length-matches/">Watch Out for Zero-Length Matches</a></em>).</dd>
</dl>
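
<p>A minimal sketch of the three roles listed above, using a made-up sample string. The variable names are illustrative only:</p>

```javascript
var str = "ab cd";

// 1. Manual start position: honored only when the regex has /g
var regexGlobal = /\w+/g;
regexGlobal.lastIndex = 2;
var fromTwo = regexGlobal.exec(str);   // fromTwo[0] is "cd", fromTwo.index is 3

var regexNonglobal = /\w+/;
regexNonglobal.lastIndex = 2;
var ignored = regexNonglobal.exec(str); // lastIndex is ignored; ignored[0] is "ab"

// 2. After a match, lastIndex holds the match-end position
var endOfMatch = regexGlobal.lastIndex; // 5, i.e. fromTwo.index + fromTwo[0].length

// 3. After a zero-length match, lastIndex equals the match index,
//    so the next search would start in the same place
var regexEmpty = /\w*/g;
regexEmpty.lastIndex = 2;
var zero = regexEmpty.exec(str);        // zero[0] is "", zero.index is 2
var stuckAt = regexEmpty.lastIndex;     // still 2
```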

<p>Unfortunately, <code>lastIndex</code>'s versatility results in it not working ideally for any specific use. I think <code>lastIndex</code> is misplaced anyway; if you need to store a search's ending (or next-start) position, it should be a property of the target string and not the regular expression. Here are three reasons this would work better:</p>

<ul>
	<li>It would let you use the same regex with multiple strings, without losing track of the next search position within each one.</li>
	<li>It would allow using multiple regexes with the same string and having each one pick up from where the last one left off.</li>
	<li>If you search two strings with the same regex, you're probably not expecting the search within the second string to start from an arbitrary position just because a match was found in the first string.</li>
</ul>
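
<p>The third point can be reproduced with a short sketch. The strings here are invented for the example:</p>

```javascript
var digits = /\d+/g;

var first = digits.test("abc 123"); // true; lastIndex is now 7
var second = digits.test("42");     // false: the search began at index 7
var third = digits.test("42");      // true: the failed search reset lastIndex to 0
```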

<p>In fact, Perl uses this approach of storing next-search positions with strings to great effect, and adds various features around it.</p>

<p>So that's my case for <code>lastIndex</code> being misplaced, but I go one further in that I don't think <code>lastIndex</code> should be included in JavaScript at all. Perl's tactic works well for Perl (especially when considered as a complete package), but some other languages (including Python) let you provide a search-start position as an argument when calling regex methods, which I think is an approach that is more natural and easier for developers to understand and use. I'd therefore fix <code>lastIndex</code> by getting rid of it completely. Regex methods and regex-using string methods would use internal search position trackers that are not observable by the user, and the <code>exec</code> and <code>test</code> methods would get a second argument (called <code>pos</code>, for position) that specifies where to start their search. It might be convenient to also give the <code>String</code> methods <code>search</code>, <code>match</code>, <code>replace</code>, and <code>split</code> their own <code>pos</code> arguments, but that is not as important and the functionality it would provide is not currently possible via <code>lastIndex</code> anyway.</p>

<p>Following are examples of how some common uses of <code>lastIndex</code> could be rewritten if these changes were made:</p>

<p>Start search from position 5, using <code>lastIndex</code> (the status quo):</p>

<pre class="code">var regexGlobal = /\w+/g,
    result;

regexGlobal.lastIndex = 5;
result = regexGlobal.test(str);
<span class="comment">// must reset lastIndex or future tests will continue from the
// match-end position (defensive coding)</span>
regexGlobal.lastIndex = 0;

var regexNonglobal = /\w+/;

regexNonglobal.lastIndex = 5;
<span class="comment">// no go - lastIndex will be ignored. instead, you have to do this</span>
result = regexNonglobal.test(str.slice(5));
</pre>

<p>Start search from position 5, using <code>pos</code>:</p>

<pre class="code">var regex = /\w+/, <span class="comment">// flag /g doesn't matter</span>
    result = regex.test(str, 5);
</pre>

<p>Match iteration, using <code>lastIndex</code>:</p>

<pre class="code">var regex = /\w*/g,
    matches = [],
    match;

<span class="comment">// the /g flag is required for this regex. if your code was provided a non-
// global regex, you'd need to recompile it with /g, and if it already had /g,
// you'd need to reset its lastIndex to 0 before entering the loop</span>

while (match = regex.exec(str)) {
    matches.push(match);
    <span class="comment">// avoid an infinite loop on zero-length matches</span>
    if (regex.lastIndex == match.index) {
        regex.lastIndex++;
    }
}
</pre>

<p>Match iteration, using <code>pos</code>:</p>

<pre class="code">var regex = /\w*/, <span class="comment">// flag /g doesn't matter</span>
    pos = 0,
    matches = [],
    match;

while (match = regex.exec(str, pos)) {
    matches.push(match);
    pos = match.index + (match[0].length || 1);
}
</pre>

<p>Of course, you could easily add your own sugar to further simplify match iteration, or JavaScript could add a method dedicated to this purpose similar to Ruby's <code>scan</code> (although JavaScript already sort of has this via the use of replacement functions with <code>string.replace</code>).</p>
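
<p>As a rough sketch of such sugar, the proposed <code>pos</code> argument can be approximated today with a helper. Both <code>execFrom</code> and <code>collectMatches</code> are hypothetical names, not part of any standard or library, and the helper copies only the <code>i</code> and <code>m</code> flags:</p>

```javascript
// Hypothetical helper: emulates the proposed regex.exec(str, pos)
function execFrom(regex, str, pos) {
    // Copy the regex with /g forced on, so that lastIndex is honored
    // and the caller's regex object is never modified
    var flags = "g" + (regex.ignoreCase ? "i" : "") + (regex.multiline ? "m" : ""),
        copy = new RegExp(regex.source, flags);
    copy.lastIndex = pos || 0;
    return copy.exec(str);
}

// Match iteration built on the helper, mirroring the pos-based loop above
function collectMatches(regex, str) {
    var pos = 0,
        matches = [],
        match;
    while (match = execFrom(regex, str, pos)) {
        matches.push(match[0]);
        // step past zero-length matches to avoid an infinite loop
        pos = match.index + (match[0].length || 1);
    }
    return matches;
}

var words = collectMatches(/\w+/, "one two  three"); // ["one", "two", "three"]
var withEmpties = collectMatches(/\w*/, "ab c");     // ["ab", "", "c", ""]
```

<p>Recompiling the regex on every call is wasteful, so this is meant only to show the shape of the API, not to serve as a drop-in replacement.</p>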

<p>To reiterate, I'm describing what I would do if backward compatibility was irrelevant. I don't think it would be a good idea to add a <code>pos</code> argument to the <code>exec</code> and <code>test</code> methods unless the <code>lastIndex</code> property was deprecated or removed, due to the functionality overlap. If a <code>pos</code> argument existed, people would expect <code>pos</code> to be <code>0</code> when it's not specified. Having <code>lastIndex</code> around to sometimes screw up this expectation would be confusing and probably lead to latent bugs. Hence, if <code>lastIndex</code> was deprecated in favor of <code>pos</code>, it should be a means toward the end of removing <code>lastIndex</code> altogether.</p>


<h3 style="margin-bottom:0">Remove String.prototype.match's nonglobal operating mode</h3>
<p style="margin-top:0"><em>Actual proposal: Deprecate String.prototype.match and add a new matchAll method</em></p>

<p><code>String.prototype.match</code> currently works very differently depending on whether the <code>/g</code> (global) flag has been set on the provided regex:</p>

<ul>
	<li>For regexes with <code>/g</code>: If no matches are found, <code>null</code> is returned; otherwise an array of simple matches is returned.</li>
	<li>For regexes without <code>/g</code>: The <code>match</code> method operates as an alias of <code>regexp.exec</code>. If a match is not found, <code>null</code> is returned; otherwise you get an array containing the (single) match in key zero, with any backreferences stored in the array's subsequent keys. The array is also assigned special <code>index</code> and <code>input</code> properties.</li>
</ul>
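
<p>A short illustration of the two modes, using an arbitrary sample string:</p>

```javascript
var str = "a1 b2";

var all = str.match(/[a-z](\d)/g);
// all is ["a1", "b2"]: whole matches only, no backreferences

var one = str.match(/[a-z](\d)/);
// one is ["a1", "1"], with one.index 0 and one.input "a1 b2"

var none = str.match(/x/g);
// none is null (the nonglobal mode also returns null when nothing matches)
```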

<p>The <code>match</code> method's nonglobal mode is confusing and unnecessary. The reason it's unnecessary is obvious: If you want the functionality of <code>exec</code>, just use it (no need for an alias). It's confusing because, as described above, the <code>match</code> method's two modes return very different results. The difference is not merely whether you get one match or all matches&mdash;you get a completely different kind of result. And since the result is an array in either case, you have to know the status of the regex's <code>global</code> property to know which type of array you're dealing with.</p>

<p>I'd change <code>string.match</code> by making it always return an array containing all matches in the target string. I'd also make it return an empty array, rather than <code>null</code>, when no matches are found (an idea that comes from Dean Edwards's <a href="http://code.google.com/p/base2/">base2</a> library). If you want the first match only or you need backreferences and extra match details, that's what <code>regexp.exec</code> is for.</p>

<p>Unfortunately, if you want to consider this change as a realistic proposal, it would require some kind of language version- or mode-based switching of the <code>match</code> method's behavior (unlikely to happen, I would think). So, instead of that, I'd recommend deprecating the <code>match</code> method altogether in favor of a new method (perhaps <code>RegExp.prototype.matchAll</code>) with the changes prescribed above.</p>
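
<p>A minimal sketch of the behavior described above, written as a standalone function. The name <code>matchAllSketch</code> is hypothetical and exists only for this example; the function copies only the <code>i</code> and <code>m</code> flags:</p>

```javascript
// Hypothetical stand-in for the proposed method
function matchAllSketch(str, regex) {
    var flags = "g" + (regex.ignoreCase ? "i" : "") + (regex.multiline ? "m" : ""),
        globalCopy = new RegExp(regex.source, flags);
    // string.match returns null when there are no matches; use [] instead
    return str.match(globalCopy) || [];
}

var found = matchAllSketch("a1 b2", /\d/);  // ["1", "2"], even without /g
var nothing = matchAllSketch("a1 b2", /x/); // []
```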


<h3 style="margin-bottom:0">Get rid of /g and RegExp.prototype.global</h3>
<p style="margin-top:0"><em>Actual proposal: Deprecate /g and RegExp.prototype.global, and add a boolean replaceAll argument to String.prototype.replace</em></p>

<p>If the last two proposals were implemented and therefore <code>regexp.lastIndex</code> and <code>string.match</code> were things of the past (or <code>string.match</code> no longer sometimes served as an alias of <code>regexp.exec</code>), the only method where <code>/g</code> would still have any impact is <code>string.replace</code>. Additionally, although <code>/g</code> follows prior art from Perl, etc., it doesn't really make sense to have something that is not an attribute of a regex stored as a regex flag. Really, <code>/g</code> is more of a statement about how you want methods to apply their own functionality, and it's not uncommon to want to use the same pattern with and without <code>/g</code> (currently you'd have to construct two different regexes to do so). If it was up to me, I'd get rid of the <code>/g</code> flag and its corresponding <code>global</code> property, and instead simply give the <code>string.replace</code> method an additional argument that indicates whether you want to replace the first match only (the default handling) or all matches. This would have the additional benefit of allowing replace-all functionality with nonregex searches.</p>

<p>Note that SpiderMonkey already has a proprietary third <code>string.replace</code> argument ("flags") that this proposal would conflict with. I doubt this conflict would cause much heartburn, but in any case, a new <code>replaceAll</code> argument would provide the same functionality that SpiderMonkey's <code>flags</code> argument is most useful for (that is, allowing global replacements with nonregex searches).</p>
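
<p>To make the idea concrete, here is a rough sketch of a replace function that takes a boolean instead of relying on <code>/g</code>. The function name <code>replaceSketch</code> and its parameter order are assumptions made for illustration:</p>

```javascript
// Hypothetical helper; the replaceAll parameter mirrors the proposed argument
function replaceSketch(str, search, replacement, replaceAll) {
    var source, flags;
    if (search instanceof RegExp) {
        // any /g flag on the provided regex is deliberately ignored
        source = search.source;
        flags = (search.ignoreCase ? "i" : "") + (search.multiline ? "m" : "");
    } else {
        // nonregex search: escape regex metacharacters in the search text
        source = String(search).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
        flags = "";
    }
    return str.replace(new RegExp(source, flags + (replaceAll ? "g" : "")), replacement);
}

var firstOnly = replaceSketch("a.b.c", ".", "-");          // "a-b.c"
var everyDot = replaceSketch("a.b.c", ".", "-", true);     // "a-b-c"
var samePattern = replaceSketch("aXbxc", /x/i, "_", true); // "a_b_c"
```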


<h3 style="margin-bottom:0">Change the behavior of backreferences to nonparticipating groups</h3>
<p style="margin-top:0"><em>Actual proposal: Make backreferences to nonparticipating groups fail to match</em></p>

<p>I'll keep this brief since David "liorean" Andersson and I have previously argued for this on ES-Discuss and elsewhere. David posted about this in detail on his blog (see: <em><a href="http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/">ECMAScript 3 Regular Expressions: A specification that doesn't make sense</a></em>), and I've previously touched on it here (<em><a href="http://blog.stevenlevithan.com/archives/es3-regexes-broken">ECMAScript 3 Regular Expressions are Defective by Design</a></em>). On several occasions, Brendan Eich has also stated that he'd like to see this changed. The short explanation of this behavior is that, in JavaScript, backreferences to capturing groups that have not (yet) participated in a match always succeed (i.e., they match the empty string), whereas the opposite is true in all other regex flavors: they fail to match and therefore cause the regex engine to backtrack or fail. JavaScript's behavior means that <code>/(a|(b))\2c/.test("ac")</code> returns <code>true</code>. The (negative) implications of this reach quite far when pushing the boundaries of regular expressions.</p>
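
<p>The example regex can be inspected a little further with <code>exec</code>:</p>

```javascript
var regex = /(a|(b))\2c/;

// Group 2 never participates when matching "ac", yet \2 succeeds
// by matching the empty string
var result = regex.exec("ac");
// result is ["ac", "a", undefined]

// Writing the optional backreference explicitly gives the same match
var explicit = /(a|(b))\2?c/.exec("ac");
```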

<p>I think everyone agrees that changing to the traditional backreferencing behavior would be an improvement&mdash;it provides far more intuitive handling, compatibility with other regex flavors, and great potential for creative use (e.g., see my post on <em><a href="http://blog.stevenlevithan.com/archives/mimic-conditionals">Mimicking Conditionals</a></em>). The bigger question is whether it would be safe, in light of backward compatibility. I think it would be, since I imagine that more or less no one uses the unintuitive JavaScript behavior intentionally. The JavaScript behavior amounts to automatically adding a <code>?</code> quantifier after backreferences to nonparticipating groups, which is what people already do explicitly if they actually want backreferences to nonzero-length subpatterns to be optional. Also note that Safari 3.0 and earlier did not follow the spec on this point and used the more intuitive behavior, although that has <a href="http://bugs.webkit.org/show_bug.cgi?id=14931">changed</a> in more recent versions (notably, this change was due to a <a href="http://blog.stevenlevithan.com/archives/npcg-javascript">write up</a> on my blog rather than reports of real-world errors).</p>

<p>Finally, it's probably worth noting that .NET's ECMAScript regex mode (enabled via the <code>RegexOptions.ECMAScript</code> flag) indeed switches .NET to ECMAScript's unconventional backreferencing behavior.</p>


<h3 style="margin-bottom:0">Make \d \D \w \W \b \B support Unicode (like \s \S . ^ $, which already do)</h3>
<p style="margin-top:0"><em>Actual proposal: Add a /u flag (and corresponding RegExp.prototype.unicode property) that changes the meaning of \d, \w, \b, and related tokens</em></p>

<p>Unicode-aware digit and word character matching is not an existing JavaScript capability (short of constructing character class monstrosities that are hundreds or thousands of characters long), and since JavaScript lacks lookbehind you can't reproduce a Unicode-aware word boundary. You could therefore say this proposal is outside the stated scope of this post, but I'm including it here because I consider this more of a fix than a new feature.</p>

<p>According to current JavaScript standards, <code>\s</code>, <code>\S</code>, <code>.</code>, <code>^</code>, and <code>$</code> use Unicode-based interpretations of <em>whitespace</em> and <em>newline</em>, whereas <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\b</code>, and <code>\B</code> use ASCII-only interpretations of <em>digit</em>, <em>word character</em>, and <em>word boundary</em> (e.g., <code>/na\b/.test("na&iuml;ve")</code> unfortunately returns <code>true</code>). See my post on <em><a href="http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode">JavaScript, Regex, and Unicode</a></em> for further details. Adding Unicode support to these tokens would cause unexpected behavior for thousands of websites, but it could be implemented safely via a new <code>/u</code> flag (inspired by Python's <code>re.U</code> or <code>re.UNICODE</code> flag) and a corresponding <code>RegExp.prototype.unicode</code> property. Since it's actually fairly common to <em>not</em> want these tokens to be Unicode enabled in particular regex patterns, a new flag that activates Unicode support would offer the best of both worlds.</p>
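
<p>A few checks that show the split between the Unicode-aware and ASCII-only tokens. The sample characters are my own picks, written as escapes to avoid encoding problems:</p>

```javascript
// "\u00EF" is i with diaeresis, so the string below is "naive" with an accent
var asciiBoundary = /na\b/.test("na\u00EFve"); // true: the accented letter is not a word character
var accentedWord = /^\w+$/.test("na\u00EFve"); // false, for the same reason
var arabicDigit = /\d/.test("\u0663");         // false: ARABIC-INDIC DIGIT THREE is not in 0-9
var nbsp = /\s/.test("\u00A0");                // true: no-break space counts as whitespace
```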


<h3 style="margin-bottom:0">Change the behavior of backreference resetting during subpattern repetition</h3>
<p style="margin-top:0"><em>Actual proposal: Never reset backreference values during a match</em></p>

<p>Like the last backreferencing issue, this too was covered by David Andersson in his post <em><a href="http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/">ECMAScript 3 Regular Expressions: A specification that doesn't make sense</a></em>. The issue here involves the value remembered by capturing groups nested within a quantified, outer group (e.g., <code>/((a)|(b))*/</code>). According to traditional behavior, the value remembered by a capturing group within a quantified grouping is whatever the group matched the last time it participated in the match. So, the value of <code>$1</code> after <code>/(?:(a)|(b))*/</code> is used to match <code>"ab"</code> would be <code>"a"</code>. However, according to ES3/ES5, the value of backreferences to nested groupings is reset/erased after the outer grouping is repeated. Hence, <code>/(?:(a)|(b))*/</code> would still match <code>"ab"</code>, but after the match is complete <code>$1</code> would reference a nonparticipating capturing group, which in JavaScript would match an empty string within the regex itself, and be returned as <code>undefined</code> in, e.g., the array returned by <code>regexp.exec</code>.</p>
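
<p>The same example as runnable code. The values in the comments assume an engine that follows the ES3/ES5 rule; an engine using the traditional behavior would differ:</p>

```javascript
var result = /(?:(a)|(b))*/.exec("ab");

// In an engine that follows ES3/ES5 here:
//   result[0] is "ab", result[1] is undefined, result[2] is "b"
// Under the traditional behavior, result[1] would instead be "a"
var whole = result[0],
    groupOne = result[1],
    groupTwo = result[2];
```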

<p>My case for change is that current JavaScript behavior breaks from the norm in other regex flavors, does not lend itself to various types of creative patterns (see one example in my post on <em><a href="http://blog.stevenlevithan.com/archives/multi-attr-capture">Capturing Multiple, Optional HTML Attribute Values</a></em>), and in my opinion is far less intuitive than the more common, alternative regex behavior.</p>

<p>I believe this behavior is safe to change for two reasons. First, this is generally an edge case issue for all but hardcore regex wizards, and I'd be surprised to find regexes that rely on JavaScript's version of this behavior. Second, and more importantly, Internet Explorer does not implement this rule and follows the more traditional behavior.</p>


<h3 style="margin-bottom:0">Add an /s flag, already</h3>
<p style="margin-top:0"><em>Actual proposal: Add an /s flag (and corresponding RegExp.prototype.dotall property) that changes dot to match all characters including newlines</em></p>

<p>I'll sneak this one in as a change/fix rather than a new feature since it's not exactly difficult to use <code>[\s\S]</code> in place of a dot when you want the behavior of <code>/s</code>. I presume the <code>/s</code> flag has been excluded thus far to save novices from themselves and limit the damage of runaway backtracking, but what ends up happening is that people write horrifically inefficient patterns like <code>(.|\r|\n)*</code> instead.</p>

<p>Regex searches in JavaScript are seldom line-based, and it's therefore more common to want dot to include newlines than to match anything-but-newlines (although both modes are useful). It makes good sense to keep the default meaning of dot (no newlines) since it is shared by other regex flavors and required for backward compatibility, but adding support for the <code>/s</code> flag is overdue. A boolean indicating whether this flag was set should show up on regexes as a property named either <code>singleline</code> (the <a href="http://blog.stevenlevithan.com/archives/singleline-multiline-confusing">unfortunate name</a> from Perl, .NET, etc.) or the more descriptive <code>dotall</code> (used in Java, Python, PCRE, etc.).</p>
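
<p>For reference, a small sketch of the workaround and of the pattern it should replace, using a made-up two-line string:</p>

```javascript
var text = "first\nsecond";

var dot = /first.second/.test(text);          // false: dot excludes newlines
var anyChar = /first[\s\S]second/.test(text); // true: [\s\S] matches any character
var slow = /first(.|\r|\n)second/.test(text); // true, but through needless alternation
```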


<h3>Personal preferences</h3>

<p>Following are a few changes that would suit my preferences, although I don't think most people would consider them significant issues:</p>

<ul>
	<li>Allow regex literals to use unescaped forward slashes within character classes (e.g., <code>/[/]/</code>). This was already included in the abandoned <a href="http://wiki.ecmascript.org/doku.php?id=proposals:extend_regexps#regexp_scanning">ES4 change proposals</a>.</li>
	<li>Allow an unescaped <code>]</code> as the first character in character classes (e.g., <code>[]]</code> or <code>[^]]</code>). This is allowed in probably every other regex flavor, but creates an empty class followed by a literal <code>]</code> in JavaScript. I'd like to imagine that no one uses empty classes intentionally, since they don't work consistently cross-browser and there are widely-used/common-sense alternatives (<code>(?!)</code> instead of <code>[]</code>, and <code>[\s\S]</code> instead of <code>[^]</code>). Unfortunately, adherence to this JavaScript quirk is tested in <a href="http://acid3.acidtests.org/">Acid3</a> (test 89), which is likely enough to kill requests for this backward-incompatible but reasonable change.</li>
	<li>Change the <code>$&#038;</code> token used in replacement strings to <code>$0</code>. It just makes sense. (Equivalents in other replacement text flavors for comparison: Perl: <code>$&#038;</code>; Java: <code>$0</code>; .NET: <code>$0</code>, <code>$&#038;</code>; PHP: <code>$0</code>, <code>\0</code>; Ruby: <code>\0</code>, <code>\&#038;</code>; Python: <code>\g&lt;0&gt;</code>.)</li>
	<li>Get rid of the special meaning of <code>[\b]</code>. Within character classes, the metasequence <code>\b</code> matches a backspace character (equivalent to <code>\x08</code>). This is a worthless convenience since no one cares about matching backspace characters, and it's confusing given that <code>\b</code> matches a word boundary when used outside of character classes. Even though this would break from regex tradition (which I'd usually advocate following), I think that <code>\b</code> should have no special meaning inside character classes and simply match a literal <code>b</code>.</li>
</ul>
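
<p>Some of the current behaviors mentioned in this list, shown as a quick sketch. Empty classes are left out because, as noted, they are not reliable cross-browser:</p>

```javascript
var wrapped = "abc".replace(/b/, "[$&]"); // "a[b]c": $& inserts the whole match

var backspace = /[\b]/.test("\b");        // true: inside a class, \b is a backspace (\x08)
var letterB = /[\b]/.test("b");           // false
var wordBoundary = /\bb/.test("b");       // true: outside a class, \b is a word boundary

// The common-sense alternatives to empty classes
var neverMatches = /(?!)/.test("anything"); // false: stands in for []
var anyCharacter = /[\s\S]/.test("\n");     // true: stands in for [^]
```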


<h3>Fixed in ES3: Remove octal character references</h3>

<p>ECMAScript 3 removed octal character references from regular expression syntax, although <code>\0</code> was kept as a convenient exception that allows easily matching a NUL character. However, browsers have generally kept full octal support around for backward compatibility. Octals are very confusing in regular expressions since their syntax overlaps with backreferences and an extra leading zero is allowed outside of character classes. Consider the following regexes:</p>

<ul>
	<li><code>/a\1/</code>: <code>\1</code> is an octal.</li>
	<li><code>/(a)\1/</code>: <code>\1</code> is a backreference.</li>
	<li><code>/(a)[\1]/</code>: <code>\1</code> is an octal.</li>
	<li><code>/(a)\1\2/</code>: <code>\1</code> is a backreference; <code>\2</code> is an octal.</li>
	<li><code>/(a)\01\001[\01\001]/</code>: All occurrences of <code>\01</code> and <code>\001</code> are octals. However, according to the ES3+ specs, the numbers after each <code>\0</code> should be treated (barring nonstandard extensions) as literal characters, completely changing what this regex matches.</li>
	<li><code>/(a)\0001[\0001]/</code>: The <code>\0001</code> outside the character class is an octal; but inside, the octal ends at the third zero (i.e., the character class matches character index zero <em>or</em> <code>"1"</code>). This regex is therefore equivalent to <code>/(a)\x01[\x00\x31]/</code>; although, as mentioned just above, adherence to ES3 would change the meaning.</li>
	<li><code>/(a)\00001[\00001]/</code>: Outside the character class, the octal ends at the fourth zero and is followed by a literal <code>"1"</code>. Inside, the octal ends at the third zero and is followed by a literal <code>"01"</code>. And once again, ES3's exclusion of octals and inclusion of <code>\0</code> could change the meaning.</li>
	<li><code>/\1(a)/</code>: Given that, in JavaScript, backreferences to capturing groups that have not (yet) participated match the empty string, does this regex match <code>"a"</code> (i.e., <code>\1</code> is treated as a backreference since a corresponding capturing group appears in the regex) or does it match <code>"\x01a"</code> (i.e., the <code>\1</code> is treated as an octal since it appears <em>before</em> its corresponding group)? Unsurprisingly, browsers disagree.</li>
	<li><code>/(\2(a)){2}/</code>: Now things get really hairy. Does this regex match <code>"aa"</code>, <code>"aaa"</code>, <code>"\x02aaa"</code>, <code>"2aaa"</code>, <code>"\x02a\x02a"</code>, or <code>"2a2a"</code>? All of these options seem plausible, and browsers disagree on the correct choice.</li>
</ul>
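
<p>To see how a particular engine resolves cases like these, a few probes can be run directly. The following is only a minimal sketch: the <code>probes</code> array and its subject strings are illustrative choices, not part of any library. The first two results follow from the rules described above; the third is deliberately left without an expected value, since it is one of the cases where engines have disagreed.</p>

```javascript
// Each probe pairs a regex with a subject string (hypothetical test data).
var probes = [
    [/(a)\1/, "aa"],     // \1 is a backreference to capturing group 1
    [/\0/, "\u0000"],    // \0 matches a NUL character
    [/\1(a)/, "a"]       // \1 appears before its group: backreference or octal?
];

// Run every probe and collect the boolean results in order.
var results = probes.map(function (probe) {
    return probe[0].test(probe[1]);
});

console.log(results[0]); // true
console.log(results[1]); // true
console.log(results[2]); // engine-dependent, as discussed above
```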

<p>There are other issues to worry about, too, like whether octal escapes go up to <code>\377</code> (<code>\xFF</code>, 8-bit) or <code>\777</code> (<code>\u01FF</code>, 9-bit); but in any case, octals in regular expressions are a confusing cluster-cuss. Even though ECMAScript has already cleaned up this mess by removing support for octals, browsers have not followed suit. I wish they would, because unlike browser makers, I don't have to worry about this bit of legacy (I never use octals in regular expressions, and neither should you).</p>

<h3>Fixed in ES5: Don't cache regex literals</h3>

<p>According to ES3 rules, regex literals did not create a new regex object if a literal with the same pattern/flag combination was already used in the same script or function (this did not apply to regexes created by the <code>RegExp</code> constructor). A common side effect of this was that regex literals using the <code>/g</code> flag did not have their <code>lastIndex</code> property reset in some cases where most developers would expect it. Several browsers didn't follow the spec on this unintuitive behavior, but Firefox did, and as a result it became the <a href="http://whereswalden.com/2010/01/15/more-es5-incompatible-changes-regular-expressions-now-evaluate-to-a-new-object-not-the-same-object-each-time-theyre-encountered/">second most duplicated</a> JavaScript <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=98409">bug report</a> for Mozilla. Fortunately, ES5 got rid of this rule, and now regex literals must be recompiled every time they're encountered (this change is coming in Firefox 3.7).</p>
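
<p>The difference is easiest to see with a <code>/g</code> literal inside a function. The sketch below assumes an engine that follows the ES5 rule; the function names <code>containsDigit</code> and <code>makeRegex</code> are made up for illustration. Under the old ES3 caching behavior, the second call would have reused the same object with its <code>lastIndex</code> left at 2, and so would have returned <code>false</code>.</p>

```javascript
function containsDigit(str) {
    // ES5: evaluating this literal creates a new RegExp whose lastIndex is 0.
    var re = /\d/g;
    return re.test(str);
}

function makeRegex() {
    return /x/g;
}

var first = containsDigit("a1");
var second = containsDigit("a1");
var distinctObjects = makeRegex() !== makeRegex();

console.log(first, second);   // true true under ES5 rules
console.log(distinctObjects); // true under ES5 rules
```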


<p>&mdash;&mdash;&mdash;<br />So there you have it. I've outlined what I think the JavaScript RegExp API got wrong. Do you agree with all of these proposals, or <em>would</em> you if you didn't have to worry about backward compatibility? Are there better ways than what I've proposed to fix the issues discussed here? Got any other gripes with existing JavaScript regex features? I'm eager to hear feedback about this.</p>

<p>Since I've been focusing on the negative in this post, I'll note that I find working with regular expressions in JavaScript to be a generally pleasant experience. There's a hell of a lot that JavaScript got right.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/fixing-javascript-regexp/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>&#8216;Regular Expressions Cookbook&#8217; Giveaway on Jan Goyvaerts&#8217;s Regex Guru</title>
		<link>http://blog.stevenlevithan.com/archives/regular-expressions-cookbook-giveaway-on-jan-goyvaerts-regex-guru</link>
		<comments>http://blog.stevenlevithan.com/archives/regular-expressions-cookbook-giveaway-on-jan-goyvaerts-regex-guru#comments</comments>
		<pubDate>Thu, 18 Feb 2010 13:24:42 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[contest]]></category>
		<category><![CDATA[schwag]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=312</guid>
		<description><![CDATA[If you're not already a subscriber, check out Regex Guru, an excellent blog on all things regex by Jan Goyvaerts (coauthor of Regular Expressions Cookbook and creator of regular-expressions.info, RegexBuddy, PowerGREP, and RegexMagic). Now's a better time than ever to check out the site since he's giving away five copies of Regular Expressions Cookbook; just [...]]]></description>
			<content:encoded><![CDATA[<p>If you're not already a subscriber, check out <a href="http://www.regexguru.com/">Regex Guru</a>, an excellent blog on all things regex by Jan Goyvaerts (coauthor of <em><a href="http://www.amazon.com/dp/0596520689/?tag=slfb-20">Regular Expressions Cookbook</a></em> and creator of <a href="http://regexp.info">regular-expressions.info</a>, <a href="http://www.regexbuddy.com/cgi-bin/affref.pl?aff=SteveL">RegexBuddy</a>, <a href="http://www.powergrep.com/cgi-bin/affref.pl?aff=SteveL">PowerGREP</a>, and <a href="http://www.regexmagic.com/cgi-bin/affref.pl?aff=SteveL">RegexMagic</a>). Now's a better time than ever to check out the site since he's giving away five copies of <em>Regular Expressions Cookbook</em>; just leave a comment on <a href="http://www.regexguru.com/2010/02/regular-expressions-cookbook-is-in-the-money-win-a-copy/">this post</a> (but make sure to read the rules listed there first) by Feb. 28<sup>th</sup> and you're in the running.</p>

<p>Note that Jan's contest is separate from my <a href="http://blog.stevenlevithan.com/archives/high-performance-javascript">ongoing giveaway</a> to promote the release of <em><a href="http://www.amazon.com/dp/059680279X/?tag=slfb-20">High Performance JavaScript</a></em> (ends Feb. 24<sup>th</sup>). You can be entered in both contests at the same time.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/regular-expressions-cookbook-giveaway-on-jan-goyvaerts-regex-guru/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Validate Phone Numbers: A Detailed Guide</title>
		<link>http://blog.stevenlevithan.com/archives/validate-phone-number</link>
		<comments>http://blog.stevenlevithan.com/archives/validate-phone-number#comments</comments>
		<pubDate>Tue, 09 Feb 2010 06:57:36 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[validation]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=275</guid>
		<description><![CDATA[

Following are a couple recipes I wrote for Regular Expressions Cookbook, composing a fairly comprehensive guide to validating and formatting North American and international phone numbers using regular expressions. The regexes in these recipes are all pretty straightforward, but hopefully this gives an example of the depth you can expect from the book.

For more than [...]]]></description>
			<content:encoded><![CDATA[<a href="http://www.amazon.com/dp/0596520689/?tag=slfb-20"><img class="right" src="http://oreilly.com/catalog/covers/9780596520687_cat.gif" alt="Book cover: Regular Expressions Cookbook"></a>

<p>Following are a couple recipes I wrote for <em><a href="http://www.amazon.com/dp/0596520689/?tag=slfb-20">Regular Expressions Cookbook</a></em>, composing a fairly comprehensive guide to validating and formatting North American and international phone numbers using regular expressions. The regexes in these recipes are all pretty straightforward, but hopefully this gives an example of the depth you can expect from the book.</p>

<p>For more than 100 detailed regular expression recipes that include equal coverage for eight programming languages (C#, Java, JavaScript, Perl, PHP, Python, Ruby, and VB.NET), get your very own copy of <em><a href="http://www.amazon.com/dp/0596520689/?tag=slfb-20">Regular Expressions Cookbook</a></em>. Also available in <a href="http://www.books.ru/shop/search?query=978-5-93286-181-3">Russian</a>, <a href="http://www.amazon.de/dp/3897219573">German</a>, <a href="http://www.oreilly.co.jp/books/9784873114507/">Japanese</a>, and <a href="http://knihy.cpress.cz/knihy/pocitacova-literatura/programovani/regularni-vyrazy-kucharka-programatora/">Czech</a>.</p>

<ul>
	<li><a href="#r4-2">Validate and Format North American Phone Numbers</a>
		<ul>
			<li>Variations:
				<ul>
					<li><a href="#r4-2-v-invalid">Eliminate invalid phone numbers</a></li>
					<li><a href="#r4-2-v-inline">Find phone numbers in documents</a></li>
					<li><a href="#r4-2-v-leading1">Allow a leading &ldquo;1&rdquo;</a></li>
					<li><a href="#r4-2-v-local">Allow seven-digit phone numbers</a></li>
				</ul>
			</li>
		</ul>
	</li>
	<li><a href="#r4-3">Validate International Phone Numbers</a>
		<ul>
			<li>Variations:
				<ul>
					<li><a href="#r4-3-v-epp">Validate international phone numbers in EPP format</a></li>
				</ul>
			</li>
		</ul>
	</li>
</ul>


<div style="border:1px solid #999; background:#f3f3f3; padding:15px 15px 0; margin-bottom:25px;">
	<p>Following is an excerpt from <em><a href="http://www.amazon.com/dp/0596520689/?tag=slfb-20">Regular Expressions Cookbook</a></em> (O'Reilly, 2009) by Jan Goyvaerts and Steven Levithan. Reprinted with permission.</p>
</div>


<h3 id="r4-2">Validate and Format North American Phone Numbers</h3>

<h4 id="r4-2-p">Problem</h4>

<p>You want to determine whether a user entered a North American phone number in a common format, including the local area code. These formats include <code>1234567890</code>, <code>123-456-7890</code>, <code>123.456.7890</code>, <code>123 456 7890</code>, <code>(123) 456 7890</code>, and all related combinations. If the phone number is valid, you want to convert it to your standard format, <code>(123) 456-7890</code>, so that your phone number records are consistent.</p>

<h4 id="r4-2-s">Solution</h4>

<p>A regular expression can easily check whether a user entered something that looks like a valid phone number. By using capturing groups to remember each set of digits, the same regular expression can be used to replace the subject text with precisely the format you want.</p>

<h5 id="r4-2-s-regex">Regular expression</h5>

<div class="indent">
	<p><code>^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>

<h5 id="r4-2-s-replacement">Replacement</h5>

<div class="indent">
	<p><code>($1) $2-$3</code><br />
	   <strong>Replacement text flavors:</strong> .NET, Java, JavaScript, Perl, PHP</p>

	<p><code>(\1) \2-\3</code><br />
	   <strong>Replacement text flavors:</strong> Python, Ruby</p>
</div>

<h5 id="r4-2-s-csharp">C#</h5>

<pre class="indent">Regex regexObj =
    new Regex(@"^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$");

if (regexObj.IsMatch(subjectString)) {
    string formattedPhoneNumber =
        regexObj.Replace(subjectString, "($1) $2-$3");
} else {
    // Invalid phone number
}</pre>

<h5 id="r4-2-s-js">JavaScript</h5>

<pre class="indent">var regexObj = /^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;

if (regexObj.test(subjectString)) {
    var formattedPhoneNumber =
        subjectString.replace(regexObj, "($1) $2-$3");
} else {
    // Invalid phone number
}</pre>

<h5 id="r4-2-s-other">Other programming languages</h5>

<p>See Recipes 3.5 and 3.15 for help implementing this regular expression with other programming languages.</p>

<h4 id="r4-2-d">Discussion</h4>

<p>This regular expression matches three groups of digits. The first group can optionally be enclosed with parentheses, and the first two groups can optionally be followed with a choice of three separators (a hyphen, dot, or space). The following layout breaks the regular expression into its individual parts, omitting the redundant groups of digits:</p>

<pre class="indent">^        # Assert position at the beginning of the string.
\(       # Match a literal "("...
  ?      #   between zero and one time.
(        # Capture the enclosed match to backreference 1...
  [0-9]  #   Match a digit...
    {3}  #     exactly three times.
)        # End capturing group 1.
\)       # Match a literal ")"...
  ?      #   between zero and one time.
[-. ]    # Match one character from the set "-. "...
  ?      #   between zero and one time.
&#x22ef;        # [Match the remaining digits and separator.]
$        # Assert position at the end of the string.</pre>

<p>Let&rsquo;s look at each of these parts more closely. The <code>^</code> and <code>$</code> at the beginning and end of the regular expression are a special kind of metacharacter called an <em>anchor</em> or <em>assertion</em>. Instead of matching text, assertions match a position within the text. Specifically, <code>^</code> matches at the beginning of the text, and <code>$</code> at the end. This ensures that the phone number regex does not match within longer text, such as <code>123-456-78901</code>.</p>
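
<p>A small sketch of what the anchors buy you. The <code>unanchored</code> variant is not part of the recipe; it is the same pattern with <code>^</code> and <code>$</code> stripped, shown only for comparison.</p>

```javascript
var anchored = /^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;
// Same pattern without the anchors (for comparison only).
var unanchored = /\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})/;

var tooLong = "123-456-78901";

var anchoredResult = anchored.test(tooLong);
var partialMatch = unanchored.exec(tooLong)[0];

console.log(anchoredResult); // false
console.log(partialMatch);   // "123-456-7890" (the trailing "1" is ignored)
```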

<p>As we&rsquo;ve repeatedly seen, parentheses are special characters in regular expressions, but in this case we want to allow a user to enter parentheses and have our regex recognize them. This is a textbook example of where we need a backslash to escape a special character so the regular expression treats it as literal input. Thus, the <code>\(</code> and <code>\)</code> sequences that enclose the first group of digits match literal parenthesis characters. Both are followed by a question mark, which makes them optional. We&rsquo;ll explain more about the question mark after discussing the other types of tokens in this regular expression.</p>

<p>The parentheses that appear without backslashes are capturing groups and are used to remember the values matched within them so that the matched text can be recalled later. In this case, backreferences to the captured values are used in the replacement text so we can easily reformat the phone number as needed.</p>

<p>Two other types of tokens used in this regular expression are character classes and quantifiers. Character classes allow you to match any one out of a set of characters. <code>[0-9]</code> is a character class that matches any digit. The regular expression flavors covered by this book all include the shorthand character class <code>\d</code> that also matches a digit, but in some flavors <code>\d</code> matches a digit from any language&rsquo;s character set or script, which is not what we want here. See Recipe 2.3 for more information about <code>\d</code>.</p>

<p><code>[-. ]</code> is another character class, one that allows any one of three separators. It&rsquo;s important that the hyphen appears first in this character class, because if it appeared between other characters, it would create a range, as with <code>[0-9]</code>. Another way to ensure that a hyphen inside a character class matches a literal version of itself is to escape it with a backslash. <code>[.\- ]</code> is therefore equivalent.</p>
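
<p>The following sketch shows why hyphen placement matters. The <code>accidentalRange</code> class is a deliberately broken example, not something from the recipe: placing the hyphen between the space and the dot creates a range from U+0020 to U+002E, which happens to include the plus sign.</p>

```javascript
var hyphenFirst = /^[-. ]$/;
var hyphenEscaped = /^[.\- ]$/;
var accidentalRange = /^[ -.]$/; // range from space (U+0020) to dot (U+002E)

var separators = ["-", ".", " "];

// Both correct spellings accept exactly the three separators.
var firstOk = separators.every(function (ch) { return hyphenFirst.test(ch); });
var escapedOk = separators.every(function (ch) { return hyphenEscaped.test(ch); });

console.log(firstOk, escapedOk);        // true true
console.log(hyphenFirst.test("+"));     // false
console.log(accidentalRange.test("+")); // true: "+" (U+002B) falls inside the range
```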

<p>Finally, quantifiers allow you to repeat a token or group. <code>{3}</code> is a quantifier that causes its preceding element to be repeated exactly three times. The regular expression <code>[0-9]{3}</code> is therefore equivalent to <code>[0-9][0-9][0-9]</code>, but is shorter and hopefully easier to read. A question mark (mentioned earlier) is a special quantifier that causes its preceding element to repeat zero or one time. It could also be written as <code>{0,1}</code>. Any quantifier that allows something to be repeated zero times effectively makes that element optional. Since a question mark is used after each separator, the phone number digits are allowed to run together.</p>
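
<p>As a quick check of these equivalences, the sketch below compares each quantified form with its longhand spelling. The variable names and sample strings are illustrative only.</p>

```javascript
var compact = /^[0-9]{3}$/;
var expanded = /^[0-9][0-9][0-9]$/;

var questionMark = /^12-?34$/;
var interval = /^12-{0,1}34$/;

// Collect results for each pair over the same inputs.
var digitSamples = ["123", "12", "1234"];
var compactResults = digitSamples.map(function (s) { return compact.test(s); });
var expandedResults = digitSamples.map(function (s) { return expanded.test(s); });

var optionalSamples = ["1234", "12-34", "12--34"];
var questionResults = optionalSamples.map(function (s) { return questionMark.test(s); });
var intervalResults = optionalSamples.map(function (s) { return interval.test(s); });

console.log(compactResults.join(), expandedResults.join());  // true,false,false for both
console.log(questionResults.join(), intervalResults.join()); // true,true,false for both
```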

<p>Note that although this recipe claims to handle North American phone numbers, it&rsquo;s actually designed to work with <em>North American Numbering Plan</em> (NANP) numbers. The NANP is the telephone numbering plan for the countries that share the country code &ldquo;1&rdquo;. This includes the United States and its territories, Canada, Bermuda, and 16 Caribbean nations. It excludes Mexico and the Central American nations.</p>

<h4 id="r4-2-v">Variations</h4>

<h5 id="r4-2-v-invalid">Eliminate invalid phone numbers</h5>

<p>So far, the regular expression matches any 10-digit number. If you want to limit matches to valid phone numbers according to the North American Numbering Plan, here are the basic rules:</p>

<ul>
	<li><em>Area codes</em> start with a number from 2&ndash;9, followed by 0&ndash;8, and then any third digit.</li>
	<li>The second group of three digits, known as the <em>central office</em> or <em>exchange code</em>, starts with a number from 2&ndash;9, followed by any two digits.</li>
	<li>The final four digits, known as the <em>station code</em>, have no restrictions.</li>
</ul>

<p>These rules can easily be implemented with a few character classes:</p>

<div class="indent">
	<p><code>^\(?([2-9][0-8][0-9])\)?[-. ]?([2-9][0-9]{2})[-. ]?([0-9]{4})$</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>
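
<p>Here is a brief sketch of the stricter regex in action. The sample numbers are made up for illustration; each rejected one breaks exactly one of the rules listed above.</p>

```javascript
var nanp = /^\(?([2-9][0-8][0-9])\)?[-. ]?([2-9][0-9]{2})[-. ]?([0-9]{4})$/;

var validNumber = nanp.test("212-555-0123");
var areaStartsWith1 = nanp.test("123-456-7890");     // area code must start with 2-9
var areaSecondDigit9 = nanp.test("292-456-7890");    // second area code digit must be 0-8
var exchangeStartsWith1 = nanp.test("212-155-0123"); // exchange code must start with 2-9

console.log(validNumber);         // true
console.log(areaStartsWith1);     // false
console.log(areaSecondDigit9);    // false
console.log(exchangeStartsWith1); // false
```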

<p>Beyond the basic rules just listed, there are a variety of reserved, unassigned, and restricted phone numbers. Unless you have very specific needs that require you to filter out as many phone numbers as possible, don&rsquo;t go overboard trying to eliminate unused numbers. New area codes that fit the rules listed earlier are made available regularly, and even if a phone number is valid, that doesn&rsquo;t necessarily mean it was issued or is in active use.</p>

<h5 id="r4-2-v-inline">Find phone numbers in documents</h5>

<p>Two simple changes allow the previous regular expression to match phone numbers within longer text:</p>

<div class="indent">
	<p><code>\(?\b([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})\b</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>

<p>Here, the <code>^</code> and <code>$</code> assertions that bound the regular expression to the beginning and end of the text have been removed. In their place, word boundary tokens (<code>\b</code>) have been added to ensure that the matched text stands on its own and is not part of a longer number or word.</p>

<p>Similar to <code>^</code> and <code>$</code>, <code>\b</code> is an assertion that matches a position rather than any actual text. Specifically, <code>\b</code> matches the position between a word character and either a nonword character or the beginning or end of the text. Letters, numbers, and underscore are all considered word characters (see Recipe 2.6).</p>

<p>Note that the first word boundary token appears after the optional, opening parenthesis. This is important because there is no word boundary to be matched between two nonword characters, such as the opening parenthesis and a preceding space character. The first word boundary is relevant only when matching a number without parentheses, since the word boundary always matches between the opening parenthesis and the first digit of a phone number.</p>
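
<p>In JavaScript, one way to collect every phone number in a document is to add the <code>/g</code> flag and call <code>match</code>. This is only a sketch; the <code>text</code> string is made-up sample input. Note that the 11-digit run at the end is skipped, because no word boundary follows its tenth digit.</p>

```javascript
var finder = /\(?\b([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})\b/g;

var text = "Call (123) 456-7890 or 987.654.3210, but not 12345678901.";

// With the /g flag, match() returns an array of all matched substrings.
var found = text.match(finder);

console.log(found); // ["(123) 456-7890", "987.654.3210"]
```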

<h5 id="r4-2-v-leading1">Allow a leading &ldquo;1&rdquo;</h5>

<p>You can allow an optional, leading &ldquo;1&rdquo; for the country code (which covers the North American Numbering Plan region) via the addition shown in the following regex:</p>

<div class="indent">
	<p><code>^(?:\+?1[-. ]?)?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>

<p>In addition to the phone number formats shown previously, this regular expression will also match strings such as <code>+1 (123) 456-7890</code> and <code>1-123-456-7890</code>. It uses a noncapturing group, written as <code>(?:&#x22ef;)</code>. When a question mark follows an unescaped left parenthesis like this, it&rsquo;s not a quantifier, but instead helps to identify the type of grouping. Standard capturing groups require the regular expression engine to keep track of backreferences, so it&rsquo;s more efficient to use noncapturing groups whenever the text matched by a group does not need to be referenced later. Another reason to use a noncapturing group here is to allow you to keep using the same replacement string as in the previous examples. If we added a capturing group, we&rsquo;d have to change <code>$1</code> to <code>$2</code> (and so on) in the replacement text shown earlier in this recipe.</p>

<p>The full addition to this version of the regex is <code>(?:\+?1[-. ]?)?</code>. The &ldquo;1&rdquo; in this pattern is preceded by an optional plus sign, and optionally followed by one of three separators (hyphen, dot, or space). The entire, added noncapturing group is also optional, but since the &ldquo;1&rdquo; is required within the group, the preceding plus sign and separator are not allowed if there is no leading &ldquo;1&rdquo;.</p>
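
<p>The sketch below exercises this version of the regex with a few made-up inputs. Because the added group is noncapturing, the same replacement string still works, and a plus sign that is not followed by &ldquo;1&rdquo; is rejected.</p>

```javascript
var withCountryCode =
    /^(?:\+?1[-. ]?)?\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;

// The leading "1" is matched but not captured, so $1-$3 are unchanged.
var formattedA = "+1 (123) 456-7890".replace(withCountryCode, "($1) $2-$3");
var formattedB = "1-123-456-7890".replace(withCountryCode, "($1) $2-$3");

// A plus sign with no "1" after it does not match.
var plusWithoutOne = withCountryCode.test("+(123) 456-7890");

console.log(formattedA);     // "(123) 456-7890"
console.log(formattedB);     // "(123) 456-7890"
console.log(plusWithoutOne); // false
```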

<h5 id="r4-2-v-local">Allow seven-digit phone numbers</h5>

<p>To allow matching phone numbers that omit the local area code, enclose the first group of digits together with its surrounding parentheses and following separator in an optional, noncapturing group:</p>

<div class="indent">
	<p><code>^(?:\(?([0-9]{3})\)?[-. ]?)?([0-9]{3})[-. ]?([0-9]{4})$</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>

<p>Since the area code is no longer required as part of the match, simply replacing any match with <code>($1) $2-$3</code> might now result in something like <code>() 123-4567</code>, with an empty set of parentheses. To work around this, add code outside the regex that checks whether group 1 matched any text, and adjust the replacement text accordingly.</p>
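
<p>One possible way to do that in JavaScript is sketched below. The <code>formatPhone</code> function is a hypothetical helper, not part of the recipe: it uses <code>exec</code> rather than <code>replace</code> so that it can check whether group 1 participated in the match before building the output.</p>

```javascript
var sevenOrTen = /^(?:\(?([0-9]{3})\)?[-. ]?)?([0-9]{3})[-. ]?([0-9]{4})$/;

// Hypothetical helper: returns the formatted number, or null if invalid.
function formatPhone(subject) {
    var match = sevenOrTen.exec(subject);
    if (!match) {
        return null; // Invalid phone number
    }
    var local = match[2] + "-" + match[3];
    // match[1] is undefined when the optional area code group did not participate.
    return match[1] ? "(" + match[1] + ") " + local : local;
}

console.log(formatPhone("123.456.7890")); // "(123) 456-7890"
console.log(formatPhone("123 4567"));     // "123-4567"
console.log(formatPhone("12345"));        // null
```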

<h4 id="r4-2-sa">See Also</h4>

<p><a href="#r4-3">Recipe 4.3</a> shows how to validate international phone numbers.</p>

<p>The North American Numbering Plan (NANP) is the telephone numbering plan for the United States and its territories, Canada, Bermuda, and 16 Caribbean nations. More information is available at <em><a href="http://www.nanpa.com">http://www.nanpa.com</a></em>.</p>


<hr />

<h3 id="r4-3">Validate International Phone Numbers</h3>

<h4 id="r4-3-p">Problem</h4>

<p>You want to validate international phone numbers. The numbers should start with a plus sign, followed by the country code and national number.</p>

<h4 id="r4-3-s">Solution</h4>

<h5 id="r4-3-s-regex">Regular expression</h5>

<div class="indent">
	<p><code>^\+(?:[0-9] ?){6,14}[0-9]$</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>

<h5 id="r4-3-s-js">JavaScript</h5>

<pre class="indent">function validate (phone) {
    var regex = /^\+(?:[0-9] ?){6,14}[0-9]$/;

    if (regex.test(phone)) {
        // Valid international phone number
    } else {
        // Invalid international phone number
    }
}</pre>

<h5 id="r4-3-s-other">Other programming languages</h5>

<p>See Recipe 3.5 for help implementing this regular expression with other programming languages.</p>

<h4 id="r4-3-d">Discussion</h4>

<p>The rules and conventions used to print international phone numbers vary significantly around the world, so it&rsquo;s hard to provide meaningful validation for an international phone number unless you adopt a strict format. Fortunately, there is a simple, industry-standard notation specified by ITU-T E.123. This notation requires that international phone numbers include a leading plus sign (known as the <em>international prefix symbol</em>), and allows only spaces to separate groups of digits. Although the tilde character (~) can appear within a phone number to indicate the existence of an additional dial tone, it has been excluded from this regular expression since it is merely a procedural element (in other words, it is not actually dialed) and is infrequently used. Thanks to the international phone numbering plan (ITU-T E.164), phone numbers cannot contain more than 15 digits. The shortest international phone numbers in use contain seven digits.</p>

<p>With all of this in mind, let&rsquo;s look at the regular expression again after breaking it into its pieces. Because this version is written using free-spacing style, the literal space character has been replaced with <code>\x20</code>:</p>

<pre class="indent">^         # Assert position at the beginning of the string.
\+        # Match a literal "+" character.
(?:       # Group but don't capture...
  [0-9]   #   Match a digit.
  \x20    #   Match a space character...
    ?     #     Between zero and one time.
)         # End the noncapturing group.
  {6,14}  #   Repeat the preceding group between 6 and 14 times.
[0-9]     # Match a digit.
$         # Assert position at the end of the string.</pre>
<div class="indent">
	<p><strong>Regex options:</strong> Free-spacing<br />
	   <strong>Regex flavors:</strong> .NET, Java, PCRE, Perl, Python, Ruby</p>
</div>

<p>The <code>^</code> and <code>$</code> anchors at the edges of the regular expression ensure that it matches the whole subject text. The noncapturing group&mdash;enclosed with <code>(?:&#x22ef;)</code>&mdash;matches a single digit, followed by an optional space character. Repeating this grouping with the interval quantifier <code>{6,14}</code> enforces the rules for the minimum and maximum number of digits, while allowing space separators to appear anywhere within the number. The second instance of the character class <code>[0-9]</code> completes the rule for the number of digits (bumping it up from between 6 and 14 digits to between 7 and 15), and ensures that the phone number does not end with a space.</p>
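
<p>The digit limits and separator rules can be checked with a handful of inputs. The sample numbers below are invented for illustration and are not taken from any numbering plan.</p>

```javascript
var intl = /^\+(?:[0-9] ?){6,14}[0-9]$/;

var spaced = intl.test("+1 123 456 7890");          // 11 digits with spaces
var sevenDigits = intl.test("+1234567");            // minimum length
var sixDigits = intl.test("+123456");               // one digit too short
var sixteenDigits = intl.test("+1234567890123456"); // one digit too long
var trailingSpace = intl.test("+1234567 ");         // must end with a digit
var hyphens = intl.test("+12-345-6789");            // only spaces may separate groups

console.log(spaced, sevenDigits);                              // true true
console.log(sixDigits, sixteenDigits, trailingSpace, hyphens); // false false false false
```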

<h4 id="r4-3-v">Variations</h4>

<h5 id="r4-3-v-epp">Validate international phone numbers in EPP format</h5>

<div class="indent">
	<p><code>^\+[0-9]{1,3}\.[0-9]{4,14}(?:x.+)?$</code><br />
	   <strong>Regex options:</strong> None<br />
	   <strong>Regex flavors:</strong> .NET, Java, JavaScript, PCRE, Perl, Python, Ruby</p>
</div>

<p>This regular expression follows the international phone number notation specified by the Extensible Provisioning Protocol (EPP). EPP is a relatively recent protocol (finalized in 2004), designed for communication between domain name registries and registrars. It is used by a growing number of domain name registries, including <em>.com</em>, <em>.info</em>, <em>.net</em>, <em>.org</em>, and <em>.us</em>. The significance of this is that EPP-style international phone numbers are increasingly used and recognized, and therefore provide a good alternative format for storing (and validating) international phone numbers.</p>

<p>EPP-style phone numbers use the format <em>+CCC.NNNNNNNNNNxEEEE</em>, where <em>C</em> is the 1&ndash;3 digit country code, <em>N</em> is up to 14 digits, and <em>E</em> is the (optional) extension. The leading plus sign and the dot following the country code are required. The literal &ldquo;x&rdquo; character is required only if an extension is provided.</p>
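
<p>A short sketch of the EPP regex applied to a few made-up inputs. Note that an <code>x</code> with nothing after it is rejected, since the pattern requires at least one character of extension once the <code>x</code> is present.</p>

```javascript
var epp = /^\+[0-9]{1,3}\.[0-9]{4,14}(?:x.+)?$/;

var plain = epp.test("+1.1234567890");
var withExtension = epp.test("+44.2071234567x123");
var spaceInsteadOfDot = epp.test("+1 1234567890");
var emptyExtension = epp.test("+1.1234567890x");

console.log(plain, withExtension);             // true true
console.log(spaceInsteadOfDot, emptyExtension); // false false
```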

<h4 id="r4-3-sa">See Also</h4>

<p><a href="#r4-2">Recipe 4.2</a> provides more options for validating North American phone numbers.</p>

<p>ITU-T Recommendation E.123 (&ldquo;Notation for national and international telephone numbers, e-mail addresses and Web addresses&rdquo;) can be downloaded here: <em><a href="http://www.itu.int/rec/T-REC-E.123">http://www.itu.int/rec/T-REC-E.123</a></em>.</p>

<p>ITU-T Recommendation E.164 (&ldquo;The international public telecommunication numbering plan&rdquo;) can be downloaded at <em><a href="http://www.itu.int/rec/T-REC-E.164">http://www.itu.int/rec/T-REC-E.164</a></em>.</p>

<p>National numbering plans can be downloaded at <em><a href="http://www.itu.int/ITU-T/inr/nnp">http://www.itu.int/ITU-T/inr/nnp</a></em>.</p>

<p>RFC 4933 defines the syntax and semantics of EPP contact identifiers, including international phone numbers. You can download RFC 4933 at <em><a href="http://tools.ietf.org/html/rfc4933">http://tools.ietf.org/html/rfc4933</a></em>.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/validate-phone-number/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Five Free Copies of Upcoming O&#8217;Reilly Book &#8216;High Performance JavaScript&#8217;</title>
		<link>http://blog.stevenlevithan.com/archives/high-performance-javascript</link>
		<comments>http://blog.stevenlevithan.com/archives/high-performance-javascript#comments</comments>
		<pubDate>Wed, 03 Feb 2010 07:57:34 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Project Releases]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[contest]]></category>
		<category><![CDATA[regex performance]]></category>
		<category><![CDATA[schwag]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=243</guid>
		<description><![CDATA[
Update (2010-02-25): This contest is now closed.




Last year, Yahoo! engineer and all-around JavaScript badass Nicholas Zakas asked if I was interested in writing a chapter for a new book on JavaScript performance that he was working on. I agreed, and that book, High Performance JavaScript, is now available for preorder at Amazon and other fine [...]]]></description>
			<content:encoded><![CDATA[<div class="update">
<p><strong>Update (2010-02-25):</strong> This contest is now closed.</p>
</div>

<a href="http://www.amazon.com/dp/059680279X/?tag=slfb-20"><img class="right" src="http://blog.stevenlevithan.com/assets/images/hpjs_cover_s.png" alt="Book cover: High Performance JavaScript" width="180" height="236"></a>

<p>Last year, Yahoo! engineer and all-around JavaScript badass <a href="http://www.nczonline.net/">Nicholas Zakas</a> asked if I was interested in writing a chapter for a new book on JavaScript performance that he was working on. I agreed, and that book, <strong><em><a href="http://www.amazon.com/dp/059680279X/?tag=slfb-20">High Performance JavaScript</a></em></strong>, is now available for preorder at Amazon and other fine book retailers.</p>

<p>In addition to the wide-ranging content by Nicholas and a chapter on string and regular expression performance by yours truly, chapters were also contributed by an awesome lineup of JavaScript performance gurus: <a href="http://techfoolery.com/">Ross Harmes</a>, <a href="http://www.julienlecomte.net/blog/">Julien Lecomte</a>, <a href="http://www.phpied.com/">Stoyan Stefanov</a>, and Matt Sweeney. This book is unique in its laser-focus on optimizing the performance of your JavaScript applications, and covers many advanced topics in the process. The chapter on strings and regular expressions provides what I think is easily the most in-depth coverage of cross-browser JavaScript regex performance currently available.</p>

<p>Here's the list of chapters:</p>

<ol>
	<li>Loading and Execution</li>
	<li>Data Access</li>
	<li>DOM Scripting <em>(Stoyan Stefanov)</em></li>
	<li>Algorithms and Flow Control</li>
	<li>Strings and Regular Expressions <em>(Steven Levithan)</em></li>
	<li>Responsive Interfaces</li>
	<li>Ajax <em>(Ross Harmes)</em></li>
	<li>Programming Practices</li>
	<li>Build and Deployment <em>(Julien Lecomte)</em></li>
	<li>Tools <em>(Matt Sweeney)</em></li>
</ol>

<p>To celebrate the completion of this book, <del>I'm giving away three copies.</del> <ins>O'Reilly Media increased the offer to five books!</ins> All you need to do is comment on this post by February 24<sup>th</sup>, and I'll pick five people to send a copy to as soon as it's released (Amazon says March 15<sup>th</sup>). If you prefer, I'd be happy to send you a copy of <em><a href="http://www.amazon.com/dp/0596520689/?tag=slfb-20">Regular Expressions Cookbook</a></em> instead (please note which book you want in your comment). Four winners will be chosen at random from the pool of unique commenters (I'll be tracking IPs), and the fifth based on the reason given for why you want a copy.</p>

<p>Make sure to include your email address in the comment form, since I'll need it to contact you if you're selected (your email address won't be used for any other purpose). Good luck, and congratulations to Nicholas Zakas and all the other authors on completing a fantastic new book!</p>

<p><em>Edit (2010-02-05):</em> My blog was offline more often than not for the first two days after I posted this, and many people have reported that they were unable to post a comment. I apologize for the screw-up. My blog is now on a different server, and the problems should be resolved. Please try again!</p>

<p><em>Edit (2010-02-08):</em> O'Reilly Media kindly offered to pick up the tab for this giveaway, and increased the winnings to five books!</p>

<p><em>Edit (2010-02-09):</em> Nicholas Zakas posted more information about <em>High Performance JavaScript</em> on his blog: <a href="http://www.nczonline.net/blog/2010/02/09/announcing-high-performance-javascript/">Announcing High Performance JavaScript</a>.</p>

<p><em>Edit (2010-02-25):</em> This contest is now closed. Winners will be announced here shortly.</p>

<p id="hpjs-winners"><em>Edit (2010-03-03):</em> Following are the <strong>winners of this giveaway</strong> (the first four were chosen randomly):</p>

<ol>
	<li><a href="/archives/high-performance-javascript#comment-47516">David Henderson</a></li>
	<li><a href="/archives/high-performance-javascript#comment-47686">Daniel Trebbien</a></li>
	<li><a href="/archives/high-performance-javascript#comment-47179">Lea Verou</a></li>
	<li><a href="/archives/high-performance-javascript#comment-47033">Stefan "schnalle" Schallerl</a></li>
	<li><a href="/archives/high-performance-javascript#comment-47099">Adam Crabtree</a></li>
</ol>

<p>No. 5 Adam Crabtree, who wants to review the book and share it with members of the <a href="http://meetup.com/DallasJS">DallasJS Meetup Group</a>, wins the nonrandom drawing for the best reason to win a copy. Runners-up for this selection were <a href="/archives/high-performance-javascript#comment-47000">Yoav</a>, who promised to donate the book to a high school library after he's done with it; <a href="/archives/high-performance-javascript#comment-47079">Nick Carter</a>, who threatened me with his wrath if he didn't win (I'll have to endure); <a href="/archives/high-performance-javascript#comment-47095">Paul Irish</a>, who kindly offered to have my last name corrected (to that of a sea monster) in exchange for winning; <a href="/archives/high-performance-javascript#comment-47487">Alexei</a>, a technical editor of a couple of Nicholas Zakas's previous books who'd like to know how many errors this one contains; and <a href="/archives/high-performance-javascript#comment-48183">Marcel Korpel</a>, who wants to improve his users' health by reducing the "headaches, general stress and insomnia" they suffer while waiting on his websites. <img src='http://blog.stevenlevithan.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>

<p>The winners have been informed by email about how to collect their prize. Thanks to everyone for playing!</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/high-performance-javascript/feed</wfw:commentRss>
		<slash:comments>278</slash:comments>
		</item>
	</channel>
</rss>
