<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Flagrant Badassery &#187; Unicode</title>
	<atom:link href="http://blog.stevenlevithan.com/category/unicode/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.stevenlevithan.com</link>
	<description>A JavaScript and regular expression centric blog</description>
	<lastBuildDate>Mon, 05 Jul 2010 20:27:50 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Unicode Plugin for XRegExp</title>
		<link>http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin</link>
		<comments>http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin#comments</comments>
		<pubDate>Sat, 02 Aug 2008 00:39:36 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[xregexp]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/?p=88</guid>
		<description><![CDATA[I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, the latest version at the time of writing.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:


	\p{L} &#8212; Letter
	\p{M} &#8212; Mark
	\p{N} &#8212; Number
	\p{P} &#8212; Punctuation
	\p{S} &#8212; [...]]]></description>
			<content:encoded><![CDATA[<p>I've released a simple plugin for XRegExp (my <a href="http://xregexp.com/">JavaScript regex library</a>) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, the latest version at the time of writing.</p>

<p>The Unicode plugin enables the following Unicode properties/categories in any XRegExp:</p>

<ul>
	<li><code>\p{L}</code> &mdash; Letter</li>
	<li><code>\p{M}</code> &mdash; Mark</li>
	<li><code>\p{N}</code> &mdash; Number</li>
	<li><code>\p{P}</code> &mdash; Punctuation</li>
	<li><code>\p{S}</code> &mdash; Symbol</li>
	<li><code>\p{Z}</code> &mdash; Separator</li>
	<li><code>\p{C}</code> &mdash; Other (control, format, private use, surrogate, and unassigned codes)</li>
</ul>

<p>It also enables all 136 blocks into which the code points U+0000 through U+FFFF are divided. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:</p>

<ul style="-moz-column-count:2; font-size:80%;">
	<li><code>\p{InAlphabeticPresentationForms}</code></li>
	<li><code>\p{InArabic}</code></li>
	<li><code>\p{InArabicPresentationFormsA}</code></li>
	<li><code>\p{InArabicPresentationFormsB}</code></li>
	<li><code>\p{InArabicSupplement}</code></li>
	<li><code>\p{InArmenian}</code></li>
	<li><code>\p{InArrows}</code></li>
	<li><code>\p{InBalinese}</code></li>
	<li><code>\p{InBasicLatin}</code></li>
	<li><code>\p{InBengali}</code></li>
	<li><code>\p{InBlockElements}</code></li>
	<li><code>\p{InBopomofo}</code></li>
	<li><code>\p{InBopomofoExtended}</code></li>
	<li><code>\p{InBoxDrawing}</code></li>
	<li><code>\p{InBraillePatterns}</code></li>
	<li><code>\p{InBuginese}</code></li>
	<li><code>\p{InBuhid}</code></li>
	<li><code>\p{InCham}</code></li>
	<li><code>\p{InCherokee}</code></li>
	<li><code>\p{InCJKCompatibility}</code></li>
	<li><code>\p{InCJKCompatibilityForms}</code></li>
	<li><code>\p{InCJKCompatibilityIdeographs}</code></li>
	<li><code>\p{InCJKRadicalsSupplement}</code></li>
	<li><code>\p{InCJKStrokes}</code></li>
	<li><code>\p{InCJKSymbolsandPunctuation}</code></li>
	<li><code>\p{InCJKUnifiedIdeographs}</code></li>
	<li><code>\p{InCJKUnifiedIdeographsExtensionA}</code></li>
	<li><code>\p{InCombiningDiacriticalMarks}</code></li>
	<li><code>\p{InCombiningDiacriticalMarksforSymbols}</code></li>
	<li><code>\p{InCombiningDiacriticalMarksSupplement}</code></li>
	<li><code>\p{InCombiningHalfMarks}</code></li>
	<li><code>\p{InControlPictures}</code></li>
	<li><code>\p{InCoptic}</code></li>
	<li><code>\p{InCurrencySymbols}</code></li>
	<li><code>\p{InCyrillic}</code></li>
	<li><code>\p{InCyrillicExtendedA}</code></li>
	<li><code>\p{InCyrillicExtendedB}</code></li>
	<li><code>\p{InCyrillicSupplement}</code></li>
	<li><code>\p{InDevanagari}</code></li>
	<li><code>\p{InDingbats}</code></li>
	<li><code>\p{InEnclosedAlphanumerics}</code></li>
	<li><code>\p{InEnclosedCJKLettersandMonths}</code></li>
	<li><code>\p{InEthiopic}</code></li>
	<li><code>\p{InEthiopicExtended}</code></li>
	<li><code>\p{InEthiopicSupplement}</code></li>
	<li><code>\p{InGeneralPunctuation}</code></li>
	<li><code>\p{InGeometricShapes}</code></li>
	<li><code>\p{InGeorgian}</code></li>
	<li><code>\p{InGeorgianSupplement}</code></li>
	<li><code>\p{InGlagolitic}</code></li>
	<li><code>\p{InGreekandCoptic}</code></li>
	<li><code>\p{InGreekExtended}</code></li>
	<li><code>\p{InGujarati}</code></li>
	<li><code>\p{InGurmukhi}</code></li>
	<li><code>\p{InHalfwidthandFullwidthForms}</code></li>
	<li><code>\p{InHangulCompatibilityJamo}</code></li>
	<li><code>\p{InHangulJamo}</code></li>
	<li><code>\p{InHangulSyllables}</code></li>
	<li><code>\p{InHanunoo}</code></li>
	<li><code>\p{InHebrew}</code></li>
	<li><code>\p{InHighPrivateUseSurrogates}</code></li>
	<li><code>\p{InHighSurrogates}</code></li>
	<li><code>\p{InHiragana}</code></li>
	<li><code>\p{InIdeographicDescriptionCharacters}</code></li>
	<li><code>\p{InIPAExtensions}</code></li>
	<li><code>\p{InKanbun}</code></li>
	<li><code>\p{InKangxiRadicals}</code></li>
	<li><code>\p{InKannada}</code></li>
	<li><code>\p{InKatakana}</code></li>
	<li><code>\p{InKatakanaPhoneticExtensions}</code></li>
	<li><code>\p{InKayahLi}</code></li>
	<li><code>\p{InKhmer}</code></li>
	<li><code>\p{InKhmerSymbols}</code></li>
	<li><code>\p{InLao}</code></li>
	<li><code>\p{InLatin1Supplement}</code></li>
	<li><code>\p{InLatinExtendedA}</code></li>
	<li><code>\p{InLatinExtendedAdditional}</code></li>
	<li><code>\p{InLatinExtendedB}</code></li>
	<li><code>\p{InLatinExtendedC}</code></li>
	<li><code>\p{InLatinExtendedD}</code></li>
	<li><code>\p{InLepcha}</code></li>
	<li><code>\p{InLetterlikeSymbols}</code></li>
	<li><code>\p{InLimbu}</code></li>
	<li><code>\p{InLowSurrogates}</code></li>
	<li><code>\p{InMalayalam}</code></li>
	<li><code>\p{InMathematicalOperators}</code></li>
	<li><code>\p{InMiscellaneousMathematicalSymbolsA}</code></li>
	<li><code>\p{InMiscellaneousMathematicalSymbolsB}</code></li>
	<li><code>\p{InMiscellaneousSymbols}</code></li>
	<li><code>\p{InMiscellaneousSymbolsandArrows}</code></li>
	<li><code>\p{InMiscellaneousTechnical}</code></li>
	<li><code>\p{InModifierToneLetters}</code></li>
	<li><code>\p{InMongolian}</code></li>
	<li><code>\p{InMyanmar}</code></li>
	<li><code>\p{InNewTaiLue}</code></li>
	<li><code>\p{InNKo}</code></li>
	<li><code>\p{InNumberForms}</code></li>
	<li><code>\p{InOgham}</code></li>
	<li><code>\p{InOlChiki}</code></li>
	<li><code>\p{InOpticalCharacterRecognition}</code></li>
	<li><code>\p{InOriya}</code></li>
	<li><code>\p{InPhagspa}</code></li>
	<li><code>\p{InPhoneticExtensions}</code></li>
	<li><code>\p{InPhoneticExtensionsSupplement}</code></li>
	<li><code>\p{InPrivateUseArea}</code></li>
	<li><code>\p{InRejang}</code></li>
	<li><code>\p{InRunic}</code></li>
	<li><code>\p{InSaurashtra}</code></li>
	<li><code>\p{InSinhala}</code></li>
	<li><code>\p{InSmallFormVariants}</code></li>
	<li><code>\p{InSpacingModifierLetters}</code></li>
	<li><code>\p{InSpecials}</code></li>
	<li><code>\p{InSundanese}</code></li>
	<li><code>\p{InSuperscriptsandSubscripts}</code></li>
	<li><code>\p{InSupplementalArrowsA}</code></li>
	<li><code>\p{InSupplementalArrowsB}</code></li>
	<li><code>\p{InSupplementalMathematicalOperators}</code></li>
	<li><code>\p{InSupplementalPunctuation}</code></li>
	<li><code>\p{InSylotiNagri}</code></li>
	<li><code>\p{InSyriac}</code></li>
	<li><code>\p{InTagalog}</code></li>
	<li><code>\p{InTagbanwa}</code></li>
	<li><code>\p{InTaiLe}</code></li>
	<li><code>\p{InTamil}</code></li>
	<li><code>\p{InTelugu}</code></li>
	<li><code>\p{InThaana}</code></li>
	<li><code>\p{InThai}</code></li>
	<li><code>\p{InTibetan}</code></li>
	<li><code>\p{InTifinagh}</code></li>
	<li><code>\p{InUnifiedCanadianAboriginalSyllabics}</code></li>
	<li><code>\p{InVai}</code></li>
	<li><code>\p{InVariationSelectors}</code></li>
	<li><code>\p{InVerticalForms}</code></li>
	<li><code>\p{InYijingHexagramSymbols}</code></li>
	<li><code>\p{InYiRadicals}</code></li>
	<li><code>\p{InYiSyllables}</code></li>
</ul>

<p>In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, <code>\p{InLatinExtendedA}</code>, <code>\p{InLatin Extended-A}</code>, and <code>\p{in latin extended a}</code> are all equivalent.</p>
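
<p>The loose matching described above can be sketched in plain JavaScript. The helper name <code>normalizeBlockName</code> is hypothetical and is not part of XRegExp or the plugin; it only illustrates the idea of reducing each spelling to one canonical key before comparing.</p>

```javascript
// Hypothetical helper (not part of XRegExp): reduce a block name to a
// canonical key by dropping whitespace, hyphens, and underscores, then lowercasing.
function normalizeBlockName(name) {
	return name.replace(/[\s_-]+/g, "").toLowerCase();
}

var names = ["InLatinExtendedA", "InLatin Extended-A", "in latin extended a"];
var keys = names.map(normalizeBlockName);

console.log(keys); // all three entries are "inlatinextendeda"
```

<p>Because every spelling collapses to the same key, a lookup table indexed by that key is enough to resolve any of the variants.</p>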

<p>All properties and blocks can be inverted by using an uppercase p. For example, <code>\P{N}</code> matches any code point that is not in the Number category. <code>\P{InArabic}</code> matches code points that are not in the Arabic block.</p>

<p><strong>IMPORTANT:</strong> The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.</p>

<table cellspacing="0">
	<thead>
		<tr>
			<th>Instead Of:</th>
			<th>Use:</th>
		</tr>
	</thead>
	<tbody>
		<tr><td><code>[\p{N}]</code></td><td><code>\p{N}</code></td></tr>
		<tr class="altBg"><td><code>[\p{N}a-z~]</code></td><td><code>(?:\p{N}|[a-z~])</code></td></tr>
		<tr><td><code>[\p{N}\P{Z}]</code></td><td><code>(?:\p{N}|\P{Z})</code></td></tr>
		<tr class="altBg"><td><code>[\p{N}\P{Z}a-z~]</code></td><td><code>(?:\p{N}|\P{Z}|[a-z~])</code></td></tr>
		<tr><td><code>[^\p{N}]</code></td><td><code>\P{N}</code></td></tr>
		<tr class="altBg"><td><code>[^\p{N}a-z~]</code></td><td><code>(?:(?!\p{N})[^a-z~])</code></td></tr>
		<tr><td><code>[^\p{N}\P{Z}]</code></td><td><code>(?:(?!\p{N}|\P{Z})[\S\s])</code></td></tr>
		<tr class="altBg"><td><code>[^\p{N}\P{Z}a-z~]</code></td><td><code>(?:(?!\p{N}|\P{Z})[^a-z~])</code></td></tr>
	</tbody>
</table>
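
<p>The negated rewrites in the table can be spot-checked against a reference. The sketch below assumes an ES2018 or later engine, where the native <code>u</code> flag allows <code>\p{...}</code> inside character classes; that feature postdates this plugin and is used here only as the reference side of the comparison. The sample characters are arbitrary.</p>

```javascript
// Assumption: an ES2018+ engine, where /u enables \p{...} inside character classes.
// In each pair, the first regex is the native reference and the second
// mirrors the emulated form from the table.
var pairs = [
	[/^[^\p{N}a-z~]$/u, /^(?:(?!\p{N})[^a-z~])$/u],
	[/^[^\p{N}\P{Z}]$/u, /^(?:(?!\p{N}|\P{Z})[\S\s])$/u]
];

// Samples: ASCII digit, Arabic-Indic digit, lowercase letter, tilde,
// uppercase letter, space, and a Cyrillic letter.
var samples = ["7", "\u0661", "q", "~", "Q", " ", "\u0416"];

var agree = pairs.every(function (pair) {
	return samples.every(function (ch) {
		return pair[0].test(ch) === pair[1].test(ch);
	});
});

console.log(agree); // true
```

<p>The negative lookahead rejects a character if any of the excluded properties match, and the trailing class then consumes exactly one character, which is why the two forms agree.</p>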

<p>Additionally, Unicode subcategories like <code>\p{Nd}</code> and scripts like <code>\p{Latin}</code> are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts <em>or</em> blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)</p>

<p>Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.</p>

<p>The Unicode plugin clocks in at a mere 5.2 KB after minification (using the <a href="http://developer.yahoo.com/yui/compressor/">YUI Compressor</a>) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more <a href="http://xregexp.com/">JavaScript regex goodness</a>.</p>

<p>To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.</p>

<pre class="code">&lt;script src="xregexp.js"&gt;&lt;/script&gt;
&lt;script src="xregexp-unicode.js"&gt;&lt;/script&gt;
&lt;script&gt;
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("&#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;")); <span class="comment">// true</span>
&lt;/script&gt;
</pre>

<p><a href="http://xregexp.com/plugins/"><strong>Download the Unicode plugin</strong></a>.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>JavaScript, Regex, and Unicode</title>
		<link>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode</link>
		<comments>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode#comments</comments>
		<pubDate>Wed, 02 Jan 2008 06:11:24 +0000</pubDate>
		<dc:creator>Steven Levithan</dc:creator>
				<category><![CDATA[Cross-Browser Issues]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode</guid>
		<description><![CDATA[Not every shorthand character class or other piece of JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, [...]]]></description>
			<content:encoded><![CDATA[<p>Not every shorthand character class or other piece of JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.</p>

<p>According to ECMA-262 3rd Edition, <code>\s</code>, <code>\S</code>, <code>.</code>, <code>^</code>, and <code>$</code> use Unicode-based interpretations of <em>whitespace</em> and <em>newline</em>, while <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\b</code>, and <code>\B</code> use ASCII-only interpretations of <em>digit</em>, <em>word character</em>, and <em>word boundary</em> (e.g. <code>/a\b/.<wbr/>test("na&iuml;ve")</code> returns <code>true</code>). Actual browser implementations often differ on these points. For example, Firefox 2 considers <code>\d</code> and <code>\D</code> to be Unicode-aware, while Firefox 3 fixes this bug &mdash; making <code>\d</code> equivalent to <code>[0-9]</code> as with most other browsers.</p>
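
<p>A minimal sketch of the ASCII-only behavior described above, assuming an engine that follows the specification for these tokens. The variable names are only for illustration.</p>

```javascript
// \b and \w are ASCII-only, so a word boundary falls between "a" and the
// non-ASCII letter U+00EF (i with diaeresis).
var boundaryAfterA = /a\b/.test("na\u00efve");
var allWordChars = /^\w+$/.test("na\u00efve");

// In a spec-compliant engine, \d is [0-9] and does not match
// U+0661 ARABIC-INDIC DIGIT ONE.
var arabicDigit = /\d/.test("\u0661");
var asciiDigit = /^\d$/.test("5");

console.log(boundaryAfterA); // true
console.log(allWordChars); // false
console.log(arabicDigit); // false
console.log(asciiDigit); // true
```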

<p>Here again are the affected tokens, along with their definitions:</p>

<ul>
	<li><code>\d</code> &mdash; Digits.</li>
	<li><code>\s</code> &mdash; Whitespace.</li>
	<li><code>\w</code> &mdash; Word characters.</li>
	<li><code>\D</code> &mdash; All except digits.</li>
	<li><code>\S</code> &mdash; All except whitespace.</li>
	<li><code>\W</code> &mdash; All except word characters.</li>
	<li><code>.</code> &mdash; All except newlines.</li>
	<li><code>^</code> (with <code>/m</code>) &mdash; The positions at the beginning of the string and just after newlines.</li>
	<li><code>$</code> (with <code>/m</code>) &mdash; The positions at the end of the string and just before newlines.</li>
	<li><code>\b</code> &mdash; Word boundary positions.</li>
	<li><code>\B</code> &mdash; Not word boundary positions.</li>
</ul>

<p>All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms <em>digit</em>, <em>whitespace</em>, <em>word character</em>, <em>word boundary</em>, and <em>newline</em> depends on the regex flavor, character encoding, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:</p>

<ul>
	<li><em>Digit</em> &mdash; The characters 0-9 only.</li>
	<li><em>Whitespace</em> &mdash; Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".</li>
	<li><em>Word character</em> &mdash; The characters A-Z, a-z, 0-9, and _ only.</li>
	<li><em>Word boundary</em> &mdash; The position between a <em>word character</em> and a non-<em>word character</em>, or between a <em>word character</em> and the start or end of the string.</li>
	<li><em>Newline</em> &mdash; The line feed, carriage return, line separator, and paragraph separator characters.</li>
</ul>

<p>Here again are the newline characters, with their character codes:</p>

<ul>
	<li><code>\u000a</code> &mdash; Line feed &mdash; <code>\n</code></li>
	<li><code>\u000d</code> &mdash; Carriage return &mdash; <code>\r</code></li>
	<li><code>\u2028</code> &mdash; Line separator</li>
	<li><code>\u2029</code> &mdash; Paragraph separator</li>
</ul>

<p>Note that ECMAScript 4 proposals indicate that the <a href="http://en.wikipedia.org/wiki/C0_and_C1_control_codes">C1</a>/Unicode NEL "next line" control character (<code>\u0085</code>) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, <code>/\r^$\n/m.test("\r\n")</code> returns <code>true</code>.</p>
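
<p>The newline rules can be checked directly. This is a small sketch, assuming an engine that implements the ECMA-262 3rd Edition newline set listed above; the variable names are only for illustration.</p>

```javascript
// With /m, ^ and $ treat \r and \n as separate newline characters,
// so both anchors match at the position between them.
var crlfSplit = /\r^$\n/m.test("\r\n");

// The dot excludes all four newline characters, including U+2028 and U+2029.
var newlines = ["\n", "\r", "\u2028", "\u2029"];
var dotMatchesNone = newlines.every(function (ch) {
	return !/./.test(ch);
});

// Each newline character starts a new line as far as the /m anchors are concerned.
var eachStartsLine = newlines.every(function (ch) {
	return /^b$/m.test("a" + ch + "b");
});

// NEL (U+0085) is not a newline in ECMA-262 3rd Edition, so the dot matches it.
var dotMatchesNel = /./.test("\u0085");

console.log(crlfSplit); // true
console.log(dotMatchesNone); // true
console.log(eachStartsLine); // true
console.log(dotMatchesNel); // true
```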

<p>As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's <a href="http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes">Basic Multilingual Plane</a>, from version 2.1 or later of the Unicode standard. The following characters should be matched by <code>\s</code> according to ECMA-262 3rd Edition and Unicode 5.1:</p>

<ul>
	<li><code>\u0009</code> &mdash; Tab &mdash; <code>\t</code></li>
	<li><code>\u000a</code> &mdash; Line feed &mdash; <code>\n</code> &mdash; (newline character)</li>
	<li><code>\u000b</code> &mdash; Vertical tab &mdash; <code>\v</code></li>
	<li><code>\u000c</code> &mdash; Form feed &mdash; <code>\f</code></li>
	<li><code>\u000d</code> &mdash; Carriage return &mdash; <code>\r</code> &mdash; (newline character)</li>
	<li><code>\u0020</code> &mdash; Space</li>
	<li><code>\u00a0</code> &mdash; No-break space</li>
	<li><code>\u1680</code> &mdash; Ogham space mark</li>
	<li><code>\u180e</code> &mdash; Mongolian vowel separator</li>
	<li><code>\u2000</code> &mdash; En quad</li>
	<li><code>\u2001</code> &mdash; Em quad</li>
	<li><code>\u2002</code> &mdash; En space</li>
	<li><code>\u2003</code> &mdash; Em space</li>
	<li><code>\u2004</code> &mdash; Three-per-em space</li>
	<li><code>\u2005</code> &mdash; Four-per-em space</li>
	<li><code>\u2006</code> &mdash; Six-per-em space</li>
	<li><code>\u2007</code> &mdash; Figure space</li>
	<li><code>\u2008</code> &mdash; Punctuation space</li>
	<li><code>\u2009</code> &mdash; Thin space</li>
	<li><code>\u200a</code> &mdash; Hair space</li>
	<li><code>\u2028</code> &mdash; Line separator &mdash; (newline character)</li>
	<li><code>\u2029</code> &mdash; Paragraph separator &mdash; (newline character)</li>
	<li><code>\u202f</code> &mdash; Narrow no-break space</li>
	<li><code>\u205f</code> &mdash; Medium mathematical space</li>
	<li><code>\u3000</code> &mdash; Ideographic space</li>
</ul>
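
<p>To see how a given engine compares with the list above, the code points can be tested one by one. This is a rough sketch; <code>findUnmatched</code> is a hypothetical helper name, and the result depends on the engine and on the Unicode version it tracks. For instance, U+180E was reclassified out of the space separator category in Unicode 6.3, so newer engines will likely report it as unmatched.</p>

```javascript
// Code points from the list above (ECMA-262 3rd Edition read against Unicode 5.1).
var expected = [
	0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x00a0, 0x1680, 0x180e,
	0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
	0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000
];

// Hypothetical helper: return the given code points that \s fails to match
// in the current engine, formatted as U+XXXX.
function findUnmatched(codePoints) {
	return codePoints.filter(function (cp) {
		return !/^\s$/.test(String.fromCharCode(cp));
	}).map(function (cp) {
		return "U+" + ("0000" + cp.toString(16).toUpperCase()).slice(-4);
	});
}

// Prints the listed code points that this engine does not treat as whitespace.
console.log(findUnmatched(expected));
```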

<p>To test which characters or positions are matched by all of the tokens mentioned here in your browser, see <a href="http://xregexp.com/tests/unicode.html"><strong>JavaScript Regex and Unicode Tests</strong></a>. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.</p>

<div class="update">
<p><strong>Update:</strong> My new <a href="http://xregexp.com/plugins/"><strong>Unicode plugin for XRegExp</strong></a> allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>
