Flagrant Badassery

A JavaScript and regular expression centric blog

Unicode Plugin for XRegExp

Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.

I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the latest version as of this writing.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:

  • \p{L} — Letter
  • \p{M} — Mark
  • \p{N} — Number
  • \p{P} — Punctuation
  • \p{S} — Symbol
  • \p{Z} — Separator
  • \p{C} — Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 136 blocks defined within the code point range U+0000 through U+FFFF (the Basic Multilingual Plane). Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:

  • \p{InAlphabeticPresentationForms}
  • \p{InArabic}
  • \p{InArabicPresentationFormsA}
  • \p{InArabicPresentationFormsB}
  • \p{InArabicSupplement}
  • \p{InArmenian}
  • \p{InArrows}
  • \p{InBalinese}
  • \p{InBasicLatin}
  • \p{InBengali}
  • \p{InBlockElements}
  • \p{InBopomofo}
  • \p{InBopomofoExtended}
  • \p{InBoxDrawing}
  • \p{InBraillePatterns}
  • \p{InBuginese}
  • \p{InBuhid}
  • \p{InCham}
  • \p{InCherokee}
  • \p{InCJKCompatibility}
  • \p{InCJKCompatibilityForms}
  • \p{InCJKCompatibilityIdeographs}
  • \p{InCJKRadicalsSupplement}
  • \p{InCJKStrokes}
  • \p{InCJKSymbolsandPunctuation}
  • \p{InCJKUnifiedIdeographs}
  • \p{InCJKUnifiedIdeographsExtensionA}
  • \p{InCombiningDiacriticalMarks}
  • \p{InCombiningDiacriticalMarksforSymbols}
  • \p{InCombiningDiacriticalMarksSupplement}
  • \p{InCombiningHalfMarks}
  • \p{InControlPictures}
  • \p{InCoptic}
  • \p{InCurrencySymbols}
  • \p{InCyrillic}
  • \p{InCyrillicExtendedA}
  • \p{InCyrillicExtendedB}
  • \p{InCyrillicSupplement}
  • \p{InDevanagari}
  • \p{InDingbats}
  • \p{InEnclosedAlphanumerics}
  • \p{InEnclosedCJKLettersandMonths}
  • \p{InEthiopic}
  • \p{InEthiopicExtended}
  • \p{InEthiopicSupplement}
  • \p{InGeneralPunctuation}
  • \p{InGeometricShapes}
  • \p{InGeorgian}
  • \p{InGeorgianSupplement}
  • \p{InGlagolitic}
  • \p{InGreekandCoptic}
  • \p{InGreekExtended}
  • \p{InGujarati}
  • \p{InGurmukhi}
  • \p{InHalfwidthandFullwidthForms}
  • \p{InHangulCompatibilityJamo}
  • \p{InHangulJamo}
  • \p{InHangulSyllables}
  • \p{InHanunoo}
  • \p{InHebrew}
  • \p{InHighPrivateUseSurrogates}
  • \p{InHighSurrogates}
  • \p{InHiragana}
  • \p{InIdeographicDescriptionCharacters}
  • \p{InIPAExtensions}
  • \p{InKanbun}
  • \p{InKangxiRadicals}
  • \p{InKannada}
  • \p{InKatakana}
  • \p{InKatakanaPhoneticExtensions}
  • \p{InKayahLi}
  • \p{InKhmer}
  • \p{InKhmerSymbols}
  • \p{InLao}
  • \p{InLatin1Supplement}
  • \p{InLatinExtendedA}
  • \p{InLatinExtendedAdditional}
  • \p{InLatinExtendedB}
  • \p{InLatinExtendedC}
  • \p{InLatinExtendedD}
  • \p{InLepcha}
  • \p{InLetterlikeSymbols}
  • \p{InLimbu}
  • \p{InLowSurrogates}
  • \p{InMalayalam}
  • \p{InMathematicalOperators}
  • \p{InMiscellaneousMathematicalSymbolsA}
  • \p{InMiscellaneousMathematicalSymbolsB}
  • \p{InMiscellaneousSymbols}
  • \p{InMiscellaneousSymbolsandArrows}
  • \p{InMiscellaneousTechnical}
  • \p{InModifierToneLetters}
  • \p{InMongolian}
  • \p{InMyanmar}
  • \p{InNewTaiLue}
  • \p{InNKo}
  • \p{InNumberForms}
  • \p{InOgham}
  • \p{InOlChiki}
  • \p{InOpticalCharacterRecognition}
  • \p{InOriya}
  • \p{InPhagspa}
  • \p{InPhoneticExtensions}
  • \p{InPhoneticExtensionsSupplement}
  • \p{InPrivateUseArea}
  • \p{InRejang}
  • \p{InRunic}
  • \p{InSaurashtra}
  • \p{InSinhala}
  • \p{InSmallFormVariants}
  • \p{InSpacingModifierLetters}
  • \p{InSpecials}
  • \p{InSundanese}
  • \p{InSuperscriptsandSubscripts}
  • \p{InSupplementalArrowsA}
  • \p{InSupplementalArrowsB}
  • \p{InSupplementalMathematicalOperators}
  • \p{InSupplementalPunctuation}
  • \p{InSylotiNagri}
  • \p{InSyriac}
  • \p{InTagalog}
  • \p{InTagbanwa}
  • \p{InTaiLe}
  • \p{InTamil}
  • \p{InTelugu}
  • \p{InThaana}
  • \p{InThai}
  • \p{InTibetan}
  • \p{InTifinagh}
  • \p{InUnifiedCanadianAboriginalSyllabics}
  • \p{InVai}
  • \p{InVariationSelectors}
  • \p{InVerticalForms}
  • \p{InYijingHexagramSymbols}
  • \p{InYiRadicals}
  • \p{InYiSyllables}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.
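As a minimal sketch of that loose matching rule, a name can be reduced to a lookup key before it is compared. The helper name normalizeBlockName is hypothetical and is not taken from the plugin's source; it only illustrates the rule described above.

```javascript
// Hypothetical helper (not the plugin's own code): reduce a block name to a
// canonical key by dropping spaces, hyphens, and underscores, then lowercasing.
function normalizeBlockName(name) {
    return name.replace(/[\s_-]+/g, "").toLowerCase();
}

var names = ["InLatinExtendedA", "InLatin Extended-A", "in latin extended a"];
var keys = names.map(normalizeBlockName);

// All three spellings collapse to the same key.
console.log(keys);
```

With a key like this, a block table only needs one entry per block, no matter how the name is spelled in the pattern.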

All properties and blocks can be inverted by using an uppercase p. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.
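One way to picture the inversion is as a rewrite from a token to a plain character class, where the uppercase P simply adds a ^. The sketch below is an assumption about how such a rewrite could work, not the plugin's actual implementation; the function expandUnicodeTokens and the two-entry blockRanges table are made up for illustration.

```javascript
// Illustrative table only: two BMP blocks and their code point ranges.
var blockRanges = {
    inarabic: "\\u0600-\\u06FF",
    inbasiclatin: "\\u0000-\\u007F"
};

// Hypothetical rewrite: \p{Name} becomes [range], \P{Name} becomes [^range].
// This naive version does not account for escaped backslashes in the pattern.
function expandUnicodeTokens(pattern) {
    return pattern.replace(/\\([pP])\{([^}]+)\}/g, function (match, p, name) {
        var key = name.replace(/[\s_-]+/g, "").toLowerCase();
        if (!Object.prototype.hasOwnProperty.call(blockRanges, key)) {
            throw new SyntaxError("Unknown Unicode token: " + match);
        }
        return "[" + (p === "P" ? "^" : "") + blockRanges[key] + "]";
    });
}

var arabic = new RegExp("^" + expandUnicodeTokens("\\p{InArabic}") + "+$");
var notArabic = new RegExp("^" + expandUnicodeTokens("\\P{InArabic}") + "+$");

console.log(arabic.test("عربي"), notArabic.test("abc"));
```

A rewrite of this kind always emits a complete bracketed class, so its output cannot be dropped inside another character class.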

IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.

Instead of:         Use:
[\p{N}]             \p{N}
[\p{N}a-z~]         (?:\p{N}|[a-z~])
[\p{N}\P{Z}]        (?:\p{N}|\P{Z})
[\p{N}\P{Z}a-z~]    (?:\p{N}|\P{Z}|[a-z~])
[^\p{N}]            \P{N}
[^\p{N}a-z~]        (?:(?!\p{N})[^a-z~])
[^\p{N}\P{Z}]       (?:(?!\p{N}|\P{Z})[\S\s])
[^\p{N}\P{Z}a-z~]   (?:(?!\p{N}|\P{Z})[^a-z~])
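To check that the emulations in the table above behave like the character classes they replace, the two forms can be compared side by side. The sketch below does not use the plugin. It assumes an engine with native ES2018 Unicode property escapes (the u flag), which do allow \p{N} inside a character class, and uses them as a reference. The sample characters are arbitrary.

```javascript
// Each pair: [character class form, emulation from the table above].
var pairs = [
    ["[\\p{N}a-z~]", "(?:\\p{N}|[a-z~])"],
    ["[^\\p{N}a-z~]", "(?:(?!\\p{N})[^a-z~])"],
    ["[^\\p{N}\\P{Z}]", "(?:(?!\\p{N}|\\P{Z})[\\S\\s])"]
];

// Digits, letters, a tilde, separators, and punctuation.
var samples = ["7", "٣", "a", "~", "Z", " ", "\u2028", "é", "-"];

var mismatches = [];
pairs.forEach(function (pair) {
    // Assumption: native property escapes stand in for the plugin's tokens.
    var classForm = new RegExp("^" + pair[0] + "$", "u");
    var emulation = new RegExp("^" + pair[1] + "$", "u");
    samples.forEach(function (ch) {
        if (classForm.test(ch) !== emulation.test(ch)) {
            mismatches.push(pair[0] + " vs " + pair[1] + " on " + JSON.stringify(ch));
        }
    });
});

console.log(mismatches.length === 0 ? "all forms agree" : mismatches);
```

The negated rows work because the lookahead rejects the excluded properties first, and the class that follows then consumes exactly one character.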

Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)

Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.

The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.

To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("Русский")); // true
</script>

Download the Unicode plugin.

There Are 9 Responses So Far.

  1. […] http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin […]

  2. […] public links >> unicode Unicode plugin for XRegExp Saved by anix on Mon 22-12-2008 Ruby 1.9 and Unicode: The BOM Will Fuck Your Shit Up Saved by […]

  3. Hi, found your site through this blog http://blogamundo.net/dev/2008/11/21/unicode-script-property-and-javascript/ and I coded a JS function that matches the request. The author thinks you might be interested in integrating the code back into your JS plugin.

    Cheers

  4. Hi,

    Could you tell me if it’s possible to catch unicode uppercase letters with your plugin?

    Thanks in advance.

  5. Hello and thank you for this post: \p is really missing in Firefox javascript.
    However, I am trying to implement it in a FF extension but I’m getting an error “invalid regexp flag g” in the line
    regex = RegExp(output.join(""), real.replace.call(flags, /[^gimy]+/g, ""));
    Any idea?
    Thanks again,
    oo

  6. @X3mE, Unicode subcategories like \p{Lu} (uppercase letter) are not currently supported. For now, you could try matching letters with \p{L} and comparing them to their .toUpperCase() equivalent.

    @Ooz, recent versions of Firefox (later than 3.5, I believe) choke on redundant regex flags. I need to post a fix for this along with broader improvements for the Unicode plugin, but I may not get to it for a little while. XRegExp currently adds g flags during some operations without worrying about whether a regex already has the g flag. I believe you can fix it by changing this (in RegExp.prototype.addFlags):

    var regex = XRegExp(this.source, (flags || "") + XRegExp._getNativeFlags(this)),

    To this:

    var regex = XRegExp(this.source, ((flags || "") + XRegExp._getNativeFlags(this)).replace(/(.)(?=.*\1)/g, "")),

  7. […] JavaScript: Unicode Plugin for XRegExp […]

  8. Fantastic, Steven!
    Thank you very much, it works fine. It’s open source and reusable as long as we include the MIT license and credit, isn’t it?

  9. @Ooz, that’s correct.
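The two workarounds suggested in response 6 can be sketched on their own, without XRegExp. The helper names isUppercaseLetter and dedupeFlags are hypothetical, and the native ES2018 \p{L} escape is used here only as a stand-in for the plugin's \p{L}. The extra toLowerCase() check is an addition to the suggestion: without it, caseless letters would count as uppercase because they equal their own .toUpperCase() result.

```javascript
// Assumption: native ES2018 property escape as a stand-in for the plugin's \p{L}.
var letter = /^\p{L}$/u;

// Hypothetical helper: a letter counts as uppercase when uppercasing leaves it
// unchanged but lowercasing changes it (this excludes caseless letters).
function isUppercaseLetter(ch) {
    return letter.test(ch) && ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}

// Hypothetical helper using the regex from the suggested fix: drop any flag
// that appears again later in the string, so each flag survives only once.
function dedupeFlags(flags) {
    return flags.replace(/(.)(?=.*\1)/g, "");
}

console.log(isUppercaseLetter("Р"), isUppercaseLetter("р"), dedupeFlags("gig"));
```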
