August 2008 — Flagrant Badassery

Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.

I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the very latest version.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:

\p{L} — Letter
\p{M} — Mark
\p{N} — Number
\p{P} — Punctuation
\p{S} — Symbol
\p{Z} — Separator
\p{C} — Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 136 blocks that the code points U+0000 through U+FFFF are divided into. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:

\p{InAlphabeticPresentationForms}
\p{InArabic}
\p{InArabicPresentationFormsA}
\p{InArabicPresentationFormsB}
\p{InArabicSupplement}
\p{InArmenian}
\p{InArrows}
\p{InBalinese}
\p{InBasicLatin}
\p{InBengali}
\p{InBlockElements}
\p{InBopomofo}
\p{InBopomofoExtended}
\p{InBoxDrawing}
\p{InBraillePatterns}
\p{InBuginese}
\p{InBuhid}
\p{InCham}
\p{InCherokee}
\p{InCJKCompatibility}
\p{InCJKCompatibilityForms}
\p{InCJKCompatibilityIdeographs}
\p{InCJKRadicalsSupplement}
\p{InCJKStrokes}
\p{InCJKSymbolsandPunctuation}
\p{InCJKUnifiedIdeographs}
\p{InCJKUnifiedIdeographsExtensionA}
\p{InCombiningDiacriticalMarks}
\p{InCombiningDiacriticalMarksforSymbols}
\p{InCombiningDiacriticalMarksSupplement}
\p{InCombiningHalfMarks}
\p{InControlPictures}
\p{InCoptic}
\p{InCurrencySymbols}
\p{InCyrillic}
\p{InCyrillicExtendedA}
\p{InCyrillicExtendedB}
\p{InCyrillicSupplement}
\p{InDevanagari}
\p{InDingbats}
\p{InEnclosedAlphanumerics}
\p{InEnclosedCJKLettersandMonths}
\p{InEthiopic}
\p{InEthiopicExtended}
\p{InEthiopicSupplement}
\p{InGeneralPunctuation}
\p{InGeometricShapes}
\p{InGeorgian}
\p{InGeorgianSupplement}
\p{InGlagolitic}
\p{InGreekandCoptic}
\p{InGreekExtended}
\p{InGujarati}
\p{InGurmukhi}
\p{InHalfwidthandFullwidthForms}
\p{InHangulCompatibilityJamo}
\p{InHangulJamo}
\p{InHangulSyllables}
\p{InHanunoo}
\p{InHebrew}
\p{InHighPrivateUseSurrogates}
\p{InHighSurrogates}
\p{InHiragana}
\p{InIdeographicDescriptionCharacters}
\p{InIPAExtensions}
\p{InKanbun}
\p{InKangxiRadicals}
\p{InKannada}
\p{InKatakana}
\p{InKatakanaPhoneticExtensions}
\p{InKayahLi}
\p{InKhmer}
\p{InKhmerSymbols}
\p{InLao}
\p{InLatin1Supplement}
\p{InLatinExtendedA}
\p{InLatinExtendedAdditional}
\p{InLatinExtendedB}
\p{InLatinExtendedC}
\p{InLatinExtendedD}
\p{InLepcha}
\p{InLetterlikeSymbols}
\p{InLimbu}
\p{InLowSurrogates}
\p{InMalayalam}
\p{InMathematicalOperators}
\p{InMiscellaneousMathematicalSymbolsA}
\p{InMiscellaneousMathematicalSymbolsB}
\p{InMiscellaneousSymbols}
\p{InMiscellaneousSymbolsandArrows}
\p{InMiscellaneousTechnical}
\p{InModifierToneLetters}
\p{InMongolian}
\p{InMyanmar}
\p{InNewTaiLue}
\p{InNKo}
\p{InNumberForms}
\p{InOgham}
\p{InOlChiki}
\p{InOpticalCharacterRecognition}
\p{InOriya}
\p{InPhagspa}
\p{InPhoneticExtensions}
\p{InPhoneticExtensionsSupplement}
\p{InPrivateUseArea}
\p{InRejang}
\p{InRunic}
\p{InSaurashtra}
\p{InSinhala}
\p{InSmallFormVariants}
\p{InSpacingModifierLetters}
\p{InSpecials}
\p{InSundanese}
\p{InSuperscriptsandSubscripts}
\p{InSupplementalArrowsA}
\p{InSupplementalArrowsB}
\p{InSupplementalMathematicalOperators}
\p{InSupplementalPunctuation}
\p{InSylotiNagri}
\p{InSyriac}
\p{InTagalog}
\p{InTagbanwa}
\p{InTaiLe}
\p{InTamil}
\p{InTelugu}
\p{InThaana}
\p{InThai}
\p{InTibetan}
\p{InTifinagh}
\p{InUnifiedCanadianAboriginalSyllabics}
\p{InVai}
\p{InVariationSelectors}
\p{InVerticalForms}
\p{InYijingHexagramSymbols}
\p{InYiRadicals}
\p{InYiSyllables}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.

All properties and blocks can be inverted by using an uppercase p. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.

IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.

Instead Of:	Use:
`[\p{N}]`	`\p{N}`
`[\p{N}a-z~]`	`(?:\p{N}\|[a-z~])`
`[\p{N}\P{Z}]`	`(?:\p{N}\|\P{Z})`
`[\p{N}\P{Z}a-z~]`	`(?:\p{N}\|\P{Z}\|[a-z~])`
`[^\p{N}]`	`\P{N}`
`[^\p{N}a-z~]`	`(?:(?!\p{N})[^a-z~])`
`[^\p{N}\P{Z}]`	`(?:(?!\p{N}\|\P{Z})[\S\s])`
`[^\p{N}\P{Z}a-z~]`	`(?:(?!\p{N}\|\P{Z})[^a-z~])`

Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)

Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.

The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.

To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("Русский")); // true
</script>

Download the Unicode plugin.

Month: August 2008

Unicode Plugin for XRegExp