Unicode Plugin for XRegExp
Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.
I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the latest version at the time of writing.
The Unicode plugin enables the following Unicode properties/categories in any XRegExp:
\p{L}: Letter
\p{M}: Mark
\p{N}: Number
\p{P}: Punctuation
\p{S}: Symbol
\p{Z}: Separator
\p{C}: Other (control, format, private use, surrogate, and unassigned codes)
It also enables all 136 blocks that the code points U+0000 through U+FFFF are divided into. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:
\p{InAlphabeticPresentationForms}
\p{InArabic}
\p{InArabicPresentationFormsA}
\p{InArabicPresentationFormsB}
\p{InArabicSupplement}
\p{InArmenian}
\p{InArrows}
\p{InBalinese}
\p{InBasicLatin}
\p{InBengali}
\p{InBlockElements}
\p{InBopomofo}
\p{InBopomofoExtended}
\p{InBoxDrawing}
\p{InBraillePatterns}
\p{InBuginese}
\p{InBuhid}
\p{InCham}
\p{InCherokee}
\p{InCJKCompatibility}
\p{InCJKCompatibilityForms}
\p{InCJKCompatibilityIdeographs}
\p{InCJKRadicalsSupplement}
\p{InCJKStrokes}
\p{InCJKSymbolsandPunctuation}
\p{InCJKUnifiedIdeographs}
\p{InCJKUnifiedIdeographsExtensionA}
\p{InCombiningDiacriticalMarks}
\p{InCombiningDiacriticalMarksforSymbols}
\p{InCombiningDiacriticalMarksSupplement}
\p{InCombiningHalfMarks}
\p{InControlPictures}
\p{InCoptic}
\p{InCurrencySymbols}
\p{InCyrillic}
\p{InCyrillicExtendedA}
\p{InCyrillicExtendedB}
\p{InCyrillicSupplement}
\p{InDevanagari}
\p{InDingbats}
\p{InEnclosedAlphanumerics}
\p{InEnclosedCJKLettersandMonths}
\p{InEthiopic}
\p{InEthiopicExtended}
\p{InEthiopicSupplement}
\p{InGeneralPunctuation}
\p{InGeometricShapes}
\p{InGeorgian}
\p{InGeorgianSupplement}
\p{InGlagolitic}
\p{InGreekandCoptic}
\p{InGreekExtended}
\p{InGujarati}
\p{InGurmukhi}
\p{InHalfwidthandFullwidthForms}
\p{InHangulCompatibilityJamo}
\p{InHangulJamo}
\p{InHangulSyllables}
\p{InHanunoo}
\p{InHebrew}
\p{InHighPrivateUseSurrogates}
\p{InHighSurrogates}
\p{InHiragana}
\p{InIdeographicDescriptionCharacters}
\p{InIPAExtensions}
\p{InKanbun}
\p{InKangxiRadicals}
\p{InKannada}
\p{InKatakana}
\p{InKatakanaPhoneticExtensions}
\p{InKayahLi}
\p{InKhmer}
\p{InKhmerSymbols}
\p{InLao}
\p{InLatin1Supplement}
\p{InLatinExtendedA}
\p{InLatinExtendedAdditional}
\p{InLatinExtendedB}
\p{InLatinExtendedC}
\p{InLatinExtendedD}
\p{InLepcha}
\p{InLetterlikeSymbols}
\p{InLimbu}
\p{InLowSurrogates}
\p{InMalayalam}
\p{InMathematicalOperators}
\p{InMiscellaneousMathematicalSymbolsA}
\p{InMiscellaneousMathematicalSymbolsB}
\p{InMiscellaneousSymbols}
\p{InMiscellaneousSymbolsandArrows}
\p{InMiscellaneousTechnical}
\p{InModifierToneLetters}
\p{InMongolian}
\p{InMyanmar}
\p{InNewTaiLue}
\p{InNKo}
\p{InNumberForms}
\p{InOgham}
\p{InOlChiki}
\p{InOpticalCharacterRecognition}
\p{InOriya}
\p{InPhagspa}
\p{InPhoneticExtensions}
\p{InPhoneticExtensionsSupplement}
\p{InPrivateUseArea}
\p{InRejang}
\p{InRunic}
\p{InSaurashtra}
\p{InSinhala}
\p{InSmallFormVariants}
\p{InSpacingModifierLetters}
\p{InSpecials}
\p{InSundanese}
\p{InSuperscriptsandSubscripts}
\p{InSupplementalArrowsA}
\p{InSupplementalArrowsB}
\p{InSupplementalMathematicalOperators}
\p{InSupplementalPunctuation}
\p{InSylotiNagri}
\p{InSyriac}
\p{InTagalog}
\p{InTagbanwa}
\p{InTaiLe}
\p{InTamil}
\p{InTelugu}
\p{InThaana}
\p{InThai}
\p{InTibetan}
\p{InTifinagh}
\p{InUnifiedCanadianAboriginalSyllabics}
\p{InVai}
\p{InVariationSelectors}
\p{InVerticalForms}
\p{InYijingHexagramSymbols}
\p{InYiRadicals}
\p{InYiSyllables}
In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.
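The loose name matching described above can be sketched in a few lines. This is an illustration only, not the plugin's actual code: `normalizeBlockName`, `blockToCharClass`, and the three-entry `SAMPLE_BLOCKS` table are hypothetical names, and the ranges are the standard Unicode ranges for those blocks.

```javascript
// Hypothetical sketch: NOT the plugin's internals.
// Keys are block names after normalization; values are BMP code point ranges.
const SAMPLE_BLOCKS = new Map([
  ["inbasiclatin", "\\u0000-\\u007F"],
  ["inarabic", "\\u0600-\\u06FF"],
  ["inlatinextendeda", "\\u0100-\\u017F"],
]);

// Ignore casing, spaces, hyphens, and underscores when comparing names.
function normalizeBlockName(name) {
  return name.replace(/[\s\-_]+/g, "").toLowerCase();
}

// Turn a block name into the source text of an equivalent character class.
function blockToCharClass(name) {
  const range = SAMPLE_BLOCKS.get(normalizeBlockName(name));
  if (range === undefined) {
    throw new Error("Unknown block: " + name);
  }
  return "[" + range + "]";
}

// All three spellings resolve to the same character class.
const spellings = ["InLatinExtendedA", "InLatin Extended-A", "in latin extended a"];
const classes = spellings.map(blockToCharClass);
console.log(classes.every((c) => c === classes[0])); // true
console.log(new RegExp(blockToCharClass("In_Arabic")).test("\u0627")); // true
```

Normalizing once at lookup time keeps the table small, since only one spelling of each block name has to be stored.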
All properties and blocks can be inverted by using an uppercase P. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.
IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.
Instead Of: | Use: |
---|---|
[\p{N}] | \p{N} |
[\p{N}a-z~] | (?:\p{N}|[a-z~]) |
[\p{N}\P{Z}] | (?:\p{N}|\P{Z}) |
[\p{N}\P{Z}a-z~] | (?:\p{N}|\P{Z}|[a-z~]) |
[^\p{N}] | \P{N} |
[^\p{N}a-z~] | (?:(?!\p{N})[^a-z~]) |
[^\p{N}\P{Z}] | (?:(?!\p{N}|\P{Z})[\S\s]) |
[^\p{N}\P{Z}a-z~] | (?:(?!\p{N}|\P{Z})[^a-z~]) |
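The rewrites in the table can be sanity-checked in an engine that does accept properties inside character classes. The sketch below assumes a modern JavaScript engine with native ES2018 property escapes (the `u` flag), which is not the plugin; the `disagreements` helper and the sample characters are hypothetical and chosen only for illustration.

```javascript
// Assumption: native \p{..} support under the "u" flag (ES2018), not XRegExp.
const samples = ["5", "\u0663", "a", "z", "~", "A", " ", "\u00E9", "-", "\u4E2D"];

// Each pair: [character class form, alternation/lookahead form from the table].
const pairs = [
  [/^[\p{N}a-z~]$/u, /^(?:\p{N}|[a-z~])$/u],
  [/^[^\p{N}a-z~]$/u, /^(?:(?!\p{N})[^a-z~])$/u],
  [/^[^\p{N}\P{Z}]$/u, /^(?:(?!\p{N}|\P{Z})[\S\s])$/u],
  [/^[^\p{N}\P{Z}a-z~]$/u, /^(?:(?!\p{N}|\P{Z})[^a-z~])$/u],
];

// Return the sample characters on which the two forms give different answers.
function disagreements(classRegex, emulatedRegex, chars) {
  return chars.filter((ch) => classRegex.test(ch) !== emulatedRegex.test(ch));
}

for (const [classRegex, emulatedRegex] of pairs) {
  const diff = disagreements(classRegex, emulatedRegex, samples);
  console.log(classRegex.source, "vs", emulatedRegex.source, "differences:", diff.length);
}
```

The negative lookahead first rules out the excluded properties, and the trailing class then consumes exactly one character, which is why the negated forms need both parts.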
Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)
Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.
The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.
To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.
<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
var unicodeWord = new XRegExp("^\\p{L}+$");
alert(unicodeWord.test("Русский")); // true
</script>
Comment by dda on 7 January 2009:
Hi, found your site through this blog http://blogamundo.net/dev/2008/11/21/unicode-script-property-and-javascript/ and I coded a JS function that matches the request. The author thinks you might be interested in integrating the code back into your JS plugin.
Cheers
Comment by X3mE on 16 February 2009:
Hi,
Could you tell me if it’s possible to catch unicode uppercase letters with your plugin?
Thanks in advance.
Comment by Ooz on 16 September 2009:
Hello and thank you for this post: \p is really missing in Firefox javascript.
However, I am trying to implement it in a FF extension but I’m getting an error “invalid regexp flag g” in the line
regex = RegExp(output.join(""), real.replace.call(flags, /[^gimy]+/g, ""));
Any idea?
Thanks again,
oo
Comment by Steven Levithan on 16 September 2009:
@X3mE, Unicode subcategories like \p{Lu} (uppercase letter) are not currently supported. For now, you could try matching letters with \p{L} and comparing them to their .toUpperCase() equivalent.
@Ooz, recent versions of Firefox (later than 3.5, I believe) choke on redundant regex flags. I need to post a fix for this along with broader improvements for the Unicode plugin, but I may not get to it for a little while. XRegExp currently adds g flags during some operations without worrying about whether a regex already has the g flag. I believe you can fix it by changing this (in RegExp.prototype.addFlags):
var regex = XRegExp(this.source, (flags || "") + XRegExp._getNativeFlags(this)),
To this:
var regex = XRegExp(this.source, ((flags || "") + XRegExp._getNativeFlags(this)).replace(/(.)(?=.*\1)/g, "")),
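Both suggestions in this reply can be tried in isolation. The sketch below is only an illustration: `dedupeFlags` and `isUppercaseLetter` are hypothetical names, the letter test defaults to a native `\p{L}` regex (an assumption; with the plugin you would pass something like `XRegExp("^\\p{L}$")` instead), and the extra `.toLowerCase()` check is an addition so that caseless letters are not counted as uppercase.

```javascript
// Hypothetical helpers; only the .replace() pattern is taken from the reply above.

// Drop every flag character that occurs again later in the string,
// so each flag survives exactly once.
function dedupeFlags(flags) {
  return flags.replace(/(.)(?=.*\1)/g, "");
}

// Treat ch as an uppercase letter if it is a letter, equals its own
// uppercase form, and differs from its lowercase form.
// Assumption: the default regex relies on native ES2018 \p{L} support.
function isUppercaseLetter(ch, letterRegex = /^\p{L}$/u) {
  return letterRegex.test(ch) && ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}

console.log(dedupeFlags("gig")); // "ig"
console.log(new RegExp("a", dedupeFlags("gig")).global); // true
console.log(isUppercaseLetter("\u0420")); // true  (Cyrillic capital Er)
console.log(isUppercaseLetter("\u0440")); // false (Cyrillic small er)
```

Note that this case comparison is only an approximation of \p{Lu}; characters whose uppercase form is more than one character long, for example, will not be reported as uppercase.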
Comment by Ooz on 23 September 2009:
Fantastic, Steven!
Thank you very much, it works fine. It’s open source and reusable as long as we put MIT license and credit, isn’t it?
Comment by Steven Levithan on 23 September 2009:
@Ooz, that’s correct.