XRegExp Updates

A few days ago, I posted a long-overdue XRegExp bug fix release (version 1.5.1). This was mainly to address an IE issue that a number of people have written to me and blogged about. Specifically, RegExp.prototype.exec no longer throws an error in IE when it is simultaneously provided a nonstring argument and called on a regex with a capturing group that matches an empty string. That's an edge case of an edge case, but it was causing XRegExp to conflict with jQuery 1.7.1 (oops). You can see the full list of changes in the changelog.

But wait, there's more… XRegExp's Unicode plugin has been updated to support Unicode 6.1 (released January 2012), rather than Unicode 5.2. I've also added a new test suite with 265 tests so far, and more on the way.

More substantial changes to XRegExp are planned and coming soon. Follow the brand new XRegExp repository on GitHub to keep up to date or to fork it and help shape the future of this one-of-a-kind JavaScript library. 🙂

Unicode Plugin for XRegExp

Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.

I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the very latest version.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:

  • \p{L} — Letter
  • \p{M} — Mark
  • \p{N} — Number
  • \p{P} — Punctuation
  • \p{S} — Symbol
  • \p{Z} — Separator
  • \p{C} — Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 136 blocks that the code points U+0000 through U+FFFF are divided into. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:

  • \p{InAlphabeticPresentationForms}
  • \p{InArabic}
  • \p{InArabicPresentationFormsA}
  • \p{InArabicPresentationFormsB}
  • \p{InArabicSupplement}
  • \p{InArmenian}
  • \p{InArrows}
  • \p{InBalinese}
  • \p{InBasicLatin}
  • \p{InBengali}
  • \p{InBlockElements}
  • \p{InBopomofo}
  • \p{InBopomofoExtended}
  • \p{InBoxDrawing}
  • \p{InBraillePatterns}
  • \p{InBuginese}
  • \p{InBuhid}
  • \p{InCham}
  • \p{InCherokee}
  • \p{InCJKCompatibility}
  • \p{InCJKCompatibilityForms}
  • \p{InCJKCompatibilityIdeographs}
  • \p{InCJKRadicalsSupplement}
  • \p{InCJKStrokes}
  • \p{InCJKSymbolsandPunctuation}
  • \p{InCJKUnifiedIdeographs}
  • \p{InCJKUnifiedIdeographsExtensionA}
  • \p{InCombiningDiacriticalMarks}
  • \p{InCombiningDiacriticalMarksforSymbols}
  • \p{InCombiningDiacriticalMarksSupplement}
  • \p{InCombiningHalfMarks}
  • \p{InControlPictures}
  • \p{InCoptic}
  • \p{InCurrencySymbols}
  • \p{InCyrillic}
  • \p{InCyrillicExtendedA}
  • \p{InCyrillicExtendedB}
  • \p{InCyrillicSupplement}
  • \p{InDevanagari}
  • \p{InDingbats}
  • \p{InEnclosedAlphanumerics}
  • \p{InEnclosedCJKLettersandMonths}
  • \p{InEthiopic}
  • \p{InEthiopicExtended}
  • \p{InEthiopicSupplement}
  • \p{InGeneralPunctuation}
  • \p{InGeometricShapes}
  • \p{InGeorgian}
  • \p{InGeorgianSupplement}
  • \p{InGlagolitic}
  • \p{InGreekandCoptic}
  • \p{InGreekExtended}
  • \p{InGujarati}
  • \p{InGurmukhi}
  • \p{InHalfwidthandFullwidthForms}
  • \p{InHangulCompatibilityJamo}
  • \p{InHangulJamo}
  • \p{InHangulSyllables}
  • \p{InHanunoo}
  • \p{InHebrew}
  • \p{InHighPrivateUseSurrogates}
  • \p{InHighSurrogates}
  • \p{InHiragana}
  • \p{InIdeographicDescriptionCharacters}
  • \p{InIPAExtensions}
  • \p{InKanbun}
  • \p{InKangxiRadicals}
  • \p{InKannada}
  • \p{InKatakana}
  • \p{InKatakanaPhoneticExtensions}
  • \p{InKayahLi}
  • \p{InKhmer}
  • \p{InKhmerSymbols}
  • \p{InLao}
  • \p{InLatin1Supplement}
  • \p{InLatinExtendedA}
  • \p{InLatinExtendedAdditional}
  • \p{InLatinExtendedB}
  • \p{InLatinExtendedC}
  • \p{InLatinExtendedD}
  • \p{InLepcha}
  • \p{InLetterlikeSymbols}
  • \p{InLimbu}
  • \p{InLowSurrogates}
  • \p{InMalayalam}
  • \p{InMathematicalOperators}
  • \p{InMiscellaneousMathematicalSymbolsA}
  • \p{InMiscellaneousMathematicalSymbolsB}
  • \p{InMiscellaneousSymbols}
  • \p{InMiscellaneousSymbolsandArrows}
  • \p{InMiscellaneousTechnical}
  • \p{InModifierToneLetters}
  • \p{InMongolian}
  • \p{InMyanmar}
  • \p{InNewTaiLue}
  • \p{InNKo}
  • \p{InNumberForms}
  • \p{InOgham}
  • \p{InOlChiki}
  • \p{InOpticalCharacterRecognition}
  • \p{InOriya}
  • \p{InPhagspa}
  • \p{InPhoneticExtensions}
  • \p{InPhoneticExtensionsSupplement}
  • \p{InPrivateUseArea}
  • \p{InRejang}
  • \p{InRunic}
  • \p{InSaurashtra}
  • \p{InSinhala}
  • \p{InSmallFormVariants}
  • \p{InSpacingModifierLetters}
  • \p{InSpecials}
  • \p{InSundanese}
  • \p{InSuperscriptsandSubscripts}
  • \p{InSupplementalArrowsA}
  • \p{InSupplementalArrowsB}
  • \p{InSupplementalMathematicalOperators}
  • \p{InSupplementalPunctuation}
  • \p{InSylotiNagri}
  • \p{InSyriac}
  • \p{InTagalog}
  • \p{InTagbanwa}
  • \p{InTaiLe}
  • \p{InTamil}
  • \p{InTelugu}
  • \p{InThaana}
  • \p{InThai}
  • \p{InTibetan}
  • \p{InTifinagh}
  • \p{InUnifiedCanadianAboriginalSyllabics}
  • \p{InVai}
  • \p{InVariationSelectors}
  • \p{InVerticalForms}
  • \p{InYijingHexagramSymbols}
  • \p{InYiRadicals}
  • \p{InYiSyllables}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.

All properties and blocks can be inverted by using an uppercase p. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.

IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.

Instead Of: Use:
[\p{N}]\p{N}
[\p{N}a-z~](?:\p{N}|[a-z~])
[\p{N}\P{Z}](?:\p{N}|\P{Z})
[\p{N}\P{Z}a-z~](?:\p{N}|\P{Z}|[a-z~])
[^\p{N}]\P{N}
[^\p{N}a-z~](?:(?!\p{N})[^a-z~])
[^\p{N}\P{Z}](?:(?!\p{N}|\P{Z})[\S\s])
[^\p{N}\P{Z}a-z~](?:(?!\p{N}|\P{Z})[^a-z~])

Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)

Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.

The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.

To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("Русский")); // true
</script>

Download the Unicode plugin.

JavaScript, Regex, and Unicode

Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations often differ on these points. For example, Firefox 2 considers \d and \D to be Unicode-aware, while Firefox 3 fixes this bug — making \d equivalent to [0-9] as with most other browsers.

Here again are the affected tokens, along with their definitions:

  • \d — Digits.
  • \s — Whitespace.
  • \w — Word characters.
  • \D — All except digits.
  • \S — All except whitespace.
  • \W — All except word characters.
  • . — All except newlines.
  • ^ (with /m) — The positions at the beginning of the string and just after newlines.
  • $ (with /m) — The positions at the end of the string and just before newlines.
  • \b — Word boundary positions.
  • \B — Not word boundary positions.

All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms digit, whitespace, word character, word boundary, and newline depend on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:

  • Digit — The characters 0-9 only.
  • Whitespace — Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".
  • Word character — The characters A-Z, a-z, 0-9, and _ only.
  • Word boundary — The position between a word character and non-word character.
  • Newline — The line feed, carriage return, line separator, and paragraph separator characters.

Here again are the newline characters, with their character codes:

  • \u000a — Line feed — \n
  • \u000d — Carriage return — \r
  • \u2028 — Line separator
  • \u2029 — Paragraph separator

Note that ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, /\r^$\n/m.test("\r\n") returns true.

As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by \s according to ECMA-262 3rd Edition and Unicode 5.1:

  • \u0009 — Tab — \t
  • \u000a — Line feed — \n — (newline character)
  • \u000b — Vertical tab — \v
  • \u000c — Form feed — \f
  • \u000d — Carriage return — \r — (newline character)
  • \u0020 — Space
  • \u00a0 — No-break space
  • \u1680 — Ogham space mark
  • \u180e — Mongolian vowel separator
  • \u2000 — En quad
  • \u2001 — Em quad
  • \u2002 — En space
  • \u2003 — Em space
  • \u2004 — Three-per-em space
  • \u2005 — Four-per-em space
  • \u2006 — Six-per-em space
  • \u2007 — Figure space
  • \u2008 — Punctuation space
  • \u2009 — Thin space
  • \u200a — Hair space
  • \u2028 — Line separator — (newline character)
  • \u2029 — Paragraph separator — (newline character)
  • \u202f — Narrow no-break space
  • \u205f — Medium mathematical space
  • \u3000 — Ideographic space

To test which characters or positions are matched by all of the tokens mentioned here in your browser, see JavaScript Regex and Unicode Tests. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.

Update: My new Unicode plugin for XRegExp allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.