Unicode Plugin for XRegExp

Update: Many of the details described below are now out of date. Get the latest version of the Unicode plugin for XRegExp.

I've released a simple plugin for XRegExp (my JavaScript regex library) that adds support for Unicode properties and blocks to JavaScript regular expressions. It uses the Unicode 5.1 character database, which is the very latest version.

The Unicode plugin enables the following Unicode properties/categories in any XRegExp:

  • \p{L} — Letter
  • \p{M} — Mark
  • \p{N} — Number
  • \p{P} — Punctuation
  • \p{S} — Symbol
  • \p{Z} — Separator
  • \p{C} — Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 136 blocks that the code points U+0000 through U+FFFF are divided into. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). Here are the supported blocks in alphabetical order:

  • \p{InAlphabeticPresentationForms}
  • \p{InArabic}
  • \p{InArabicPresentationFormsA}
  • \p{InArabicPresentationFormsB}
  • \p{InArabicSupplement}
  • \p{InArmenian}
  • \p{InArrows}
  • \p{InBalinese}
  • \p{InBasicLatin}
  • \p{InBengali}
  • \p{InBlockElements}
  • \p{InBopomofo}
  • \p{InBopomofoExtended}
  • \p{InBoxDrawing}
  • \p{InBraillePatterns}
  • \p{InBuginese}
  • \p{InBuhid}
  • \p{InCham}
  • \p{InCherokee}
  • \p{InCJKCompatibility}
  • \p{InCJKCompatibilityForms}
  • \p{InCJKCompatibilityIdeographs}
  • \p{InCJKRadicalsSupplement}
  • \p{InCJKStrokes}
  • \p{InCJKSymbolsandPunctuation}
  • \p{InCJKUnifiedIdeographs}
  • \p{InCJKUnifiedIdeographsExtensionA}
  • \p{InCombiningDiacriticalMarks}
  • \p{InCombiningDiacriticalMarksforSymbols}
  • \p{InCombiningDiacriticalMarksSupplement}
  • \p{InCombiningHalfMarks}
  • \p{InControlPictures}
  • \p{InCoptic}
  • \p{InCurrencySymbols}
  • \p{InCyrillic}
  • \p{InCyrillicExtendedA}
  • \p{InCyrillicExtendedB}
  • \p{InCyrillicSupplement}
  • \p{InDevanagari}
  • \p{InDingbats}
  • \p{InEnclosedAlphanumerics}
  • \p{InEnclosedCJKLettersandMonths}
  • \p{InEthiopic}
  • \p{InEthiopicExtended}
  • \p{InEthiopicSupplement}
  • \p{InGeneralPunctuation}
  • \p{InGeometricShapes}
  • \p{InGeorgian}
  • \p{InGeorgianSupplement}
  • \p{InGlagolitic}
  • \p{InGreekandCoptic}
  • \p{InGreekExtended}
  • \p{InGujarati}
  • \p{InGurmukhi}
  • \p{InHalfwidthandFullwidthForms}
  • \p{InHangulCompatibilityJamo}
  • \p{InHangulJamo}
  • \p{InHangulSyllables}
  • \p{InHanunoo}
  • \p{InHebrew}
  • \p{InHighPrivateUseSurrogates}
  • \p{InHighSurrogates}
  • \p{InHiragana}
  • \p{InIdeographicDescriptionCharacters}
  • \p{InIPAExtensions}
  • \p{InKanbun}
  • \p{InKangxiRadicals}
  • \p{InKannada}
  • \p{InKatakana}
  • \p{InKatakanaPhoneticExtensions}
  • \p{InKayahLi}
  • \p{InKhmer}
  • \p{InKhmerSymbols}
  • \p{InLao}
  • \p{InLatin1Supplement}
  • \p{InLatinExtendedA}
  • \p{InLatinExtendedAdditional}
  • \p{InLatinExtendedB}
  • \p{InLatinExtendedC}
  • \p{InLatinExtendedD}
  • \p{InLepcha}
  • \p{InLetterlikeSymbols}
  • \p{InLimbu}
  • \p{InLowSurrogates}
  • \p{InMalayalam}
  • \p{InMathematicalOperators}
  • \p{InMiscellaneousMathematicalSymbolsA}
  • \p{InMiscellaneousMathematicalSymbolsB}
  • \p{InMiscellaneousSymbols}
  • \p{InMiscellaneousSymbolsandArrows}
  • \p{InMiscellaneousTechnical}
  • \p{InModifierToneLetters}
  • \p{InMongolian}
  • \p{InMyanmar}
  • \p{InNewTaiLue}
  • \p{InNKo}
  • \p{InNumberForms}
  • \p{InOgham}
  • \p{InOlChiki}
  • \p{InOpticalCharacterRecognition}
  • \p{InOriya}
  • \p{InPhagspa}
  • \p{InPhoneticExtensions}
  • \p{InPhoneticExtensionsSupplement}
  • \p{InPrivateUseArea}
  • \p{InRejang}
  • \p{InRunic}
  • \p{InSaurashtra}
  • \p{InSinhala}
  • \p{InSmallFormVariants}
  • \p{InSpacingModifierLetters}
  • \p{InSpecials}
  • \p{InSundanese}
  • \p{InSuperscriptsandSubscripts}
  • \p{InSupplementalArrowsA}
  • \p{InSupplementalArrowsB}
  • \p{InSupplementalMathematicalOperators}
  • \p{InSupplementalPunctuation}
  • \p{InSylotiNagri}
  • \p{InSyriac}
  • \p{InTagalog}
  • \p{InTagbanwa}
  • \p{InTaiLe}
  • \p{InTamil}
  • \p{InTelugu}
  • \p{InThaana}
  • \p{InThai}
  • \p{InTibetan}
  • \p{InTifinagh}
  • \p{InUnifiedCanadianAboriginalSyllabics}
  • \p{InVai}
  • \p{InVariationSelectors}
  • \p{InVerticalForms}
  • \p{InYijingHexagramSymbols}
  • \p{InYiRadicals}
  • \p{InYiSyllables}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA}, \p{InLatin Extended-A}, and \p{in latin extended a} are all equivalent.

All properties and blocks can be inverted by using an uppercase p. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.

IMPORTANT: The use of Unicode properties or blocks within character classes is not currently supported. However, you can emulate their use with alternation and/or lookahead, as shown below.

Instead Of: Use:
[\p{N}]\p{N}
[\p{N}a-z~](?:\p{N}|[a-z~])
[\p{N}\P{Z}](?:\p{N}|\P{Z})
[\p{N}\P{Z}a-z~](?:\p{N}|\P{Z}|[a-z~])
[^\p{N}]\P{N}
[^\p{N}a-z~](?:(?!\p{N})[^a-z~])
[^\p{N}\P{Z}](?:(?!\p{N}|\P{Z})[\S\s])
[^\p{N}\P{Z}a-z~](?:(?!\p{N}|\P{Z})[^a-z~])

Additionally, Unicode subcategories like \p{Nd} and scripts like \p{Latin} are not currently supported. (For comparison, ECMAScript 4 regex proposals include Unicode properties/categories, but not scripts or blocks. Of the major regex flavors, only Perl and PCRE support Unicode scripts.)

Considering the comprehensive support that XRegExp has for other, extended regex features, I'm not happy with the limitations described above. Hopefully this will come in handy for some people anyway. If there is interest in this plugin, I may add the missing features in future versions.

The Unicode plugin clocks in at a mere 5.2 KB after minification (using the YUI Compressor) and gzipping. This would be added to the 2.5 KB of XRegExp itself, which gives you a lot more JavaScript regex goodness.

To activate this plugin, simply load it after loading XRegExp 0.6.1 or later.

<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
	var unicodeWord = new XRegExp("^\\p{L}+$");
	alert(unicodeWord.test("Русский")); // true
</script>

Download the Unicode plugin.

Test Your XRegExps with JRX

Cüneyt Yılmaz's JRX is a cool JavaScript regex tester inspired by the RX tool of Komodo IDE. Cüneyt recently added my XRegExp library to his tester, so JRX is now a nice and easy way to test XRegExp's singleline and extended modes, as well as named capture and other XRegExp-provided syntax. Check it out!

As for XRegExp, it has recently been upgraded to v0.5.2, which resolved a corner-case bug involving XRegExp.matchRecursive. See the changelog for details.

I'll take this opportunity to highlight some of my other favorite online regex testers. I've actively looked for these kinds of apps over the years and have seen more than 50 of them. Odds are you'll find something new here.

Edit (2008-06-18): Updated the list with a couple that have come out very recently.

  • RegexPal — My own JavaScript regex tester. It includes real-time regex syntax and match highlighting. Although RegexPal uses XRegExp to provide the singleline option, unlike JRX it uses JavaScript regex syntax without the XRegExp syntax extensions.
  • regex — Simple name, simple interface. Great set of flavor support (JavaScript, Perl, Python, PCRE, POSIX ERE).
  • Regexp Editor — Java regex tester with regex syntax highlighting.
  • RegExr — ActionScript regex tester with regex syntax highlighting.
  • reWork — JavaScript regex workbench.
  • reAnimator — Fun app for visualizing regex FSAs.
  • RegexMate — JavaScript regex console.
  • The REWizard — IE only, but offers regex building tools and an interesting visualization.
  • MyRegexTester — Includes code generation and plain-text explanations (via the YAPE::Regex::Explain Perl module).
  • Regular Expression Analyzer — Real-time regex explanation tree that mostly emulates Java, JavaScript, and Perl flavors. Its regex parsing code is very readable.
  • Nregex — .NET regex tester.
  • Rubular — Ruby regex tester.

Have fun!

XRegExp 0.5 Released!

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

If you haven't seen the prior versions, XRegExp is an MIT-licensed JavaScript library that provides an augmented, cross-browser implementation of regular expressions, including support for additional modifiers and syntax. Several convenience methods and a new, powerful recursive-construct parser that uses regex delimiters are also included.

Here's what you get beyond the standard JavaScript regex features:

  • Added regex syntax:
    • Comprehensive named capture support. (Improved)
    • Comment patterns: (?#…). (New)
  • Added regex modifiers (flags):
    • s (singleline), to make dot match all characters including newlines.
    • x (extended), for free-spacing and comments.
  • Added awesome:
    • Reduced cross-browser inconsistencies. (More)
    • Recursive-construct parser with regex delimiters. (New)
    • An easy way to cache and reuse regex objects. (New)
    • The ability to safely embed literal text in your regex patterns. (New)
    • A method to add modifiers to existing regex objects.
    • Regex call and apply methods, which make generically working with functions and regexes easier. (New)

All of this can be yours for the low, low price of 2.4 KB. smile Version 0.5 also introduces extensive documentation and code examples.

If you're using a previous version, note that there are a few non-backward compatible changes for the sake of strict ECMA-262 Edition 3 compliance and compatibility with upcoming ECMAScript 4 changes.

  • The XRegExp.overrideNative function has been removed, since it is no longer possible to override native constructors in Firefox 3 or ECMAScript 4 (as proposed).
  • Named capture syntax has been changed from (<name>…) to (?<name>…), which is the standard in most regex libraries and under consideration for ES4. Named capture is now always available, and does not require the k modifier.
  • Due to cross-browser compatibility issues, previous versions enforced that a leading, unescaped ] within a character class was treated as a literal character, which is how things work in most regex flavors. XRegExp now follows ECMA-262 Edition 3 on this point. [] is an empty set and never matches (this is enforced in all browsers).

Get it while it's hot! Check out the new XRegExp documentation and source code.

Safari Support with XRegExp 0.2.2

When I released XRegExp 0.2 several days ago, I hadn't yet tested in Safari or Swift. When I remembered to do this shortly afterwards, I found that both of those WebKit-based browsers didn't like it and often crashed when trying to use it! This was obviously a Very Bad Thing, but due to major time availability issues I wasn't able to get around to in-depth bug-shooting and testing until tonight.

It turns out that Safari's regex engine contains a bug which causes an error to be thrown when compiling a regex containing a character class ending with "[\\".

// These throw an error:
[ /[[\\]/ , /[^[\\]/ , /[abc[\\]/ ]

// ...While these are all fine:
[ /[\\[]/ , /[\[\\]/ , /[[]/ , /[\\]/ , /[[\\abc]/ , /[[\/]/ , /[[(\\]/ ]

// Testing:
try {
	RegExp("[[\\]");
	alert("OK!");
} catch (err) {
	alert(err);
	/* Safari shows:
	"SyntaxError: Invalid regular expression: missing terminating ] for
	character class" */
}

As a result, I've changed two instances of [^[\\] to [^\\[] and upped the version number to 0.2.2. XRegExp has now been tested and works without any known issues in all of the following browsers:

  • Internet Explorer 5.5 – 7
  • Firefox 2.0.0.4
  • Opera 9.21
  • Safari 3.0.2 beta for Windows
  • Swift 0.2

You can get the newest version here.

XRegExp 0.2: Now With Named Capture

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.

JavaScript's regular expression flavor doesn't support named capture. Well, says who? XRegExp 0.2 brings named capture support, along with several other new features. But first of all, if you haven't seen the previous version, make sure to check out my post on XRegExp 0.1, because not all of the documentation is repeated below.

Highlights

  • Comprehensive named capture support (New)
  • Supports regex literals through the addFlags method (New)
  • Free-spacing and comments mode (x)
  • Dot matches all mode (s)
  • Several other minor improvements over v0.1

Named capture

There are several different syntaxes in the wild for named capture. I've compiled the following table based on my understanding of the regex support of the libraries in question. XRegExp's syntax is included at the top.

Library Capture Backreference In replacement Stored at
XRegExp (<name>…) \k<name> ${name} result.name
.NET (?<name>…)
(?'name'…)
\k<name>
\k'name'
${name} Matcher.Groups('name')
Perl 5.10 (beta) (?<name>…)
(?'name'…)
\k<name>
\k'name'
\g{name}
$+{name} $+{name}
Python (?P<name>…) (?P=name) \g<name> result.group('name')
PHP preg (PCRE 7) (.NET, Perl, and Python styles) $regs['name'] $result['name']

No other major regex library currently supports named capture, although the JGsoft engine (used by products like RegexBuddy) supports both .NET and Python syntax. XRegExp does not use a question mark at the beginning of a named capturing group because that would prevent it from being used in regex literals (JavaScript would immediately throw an "invalid quantifier" error).

XRegExp supports named capture on an on-request basis. You can add named capture support to any regex though the use of the new "k" flag. This is done for compatibility reasons and to ensure that regex compilation time remains as fast as possible in all situations.

Following are several examples of using named capture:

// Add named capture support using the XRegExp constructor
var repeatedWords = new XRegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support using RegExp, after overriding the native constructor
XRegExp.overrideNative();
var repeatedWords = new RegExp("\\b (<word> \\w+ ) \\s+ \\k<word> \\b", "gixk");

// Add named capture support to a regex literal
var repeatedWords = /\b (<word> \w+ ) \s+ \k<word> \b/.addFlags("gixk");

var data = "The the test data.";

// Check if data contains repeated words
var hasDuplicates = repeatedWords.test(data);
// hasDuplicates: true

// Use the regex to remove repeated words
var output = data.replace(repeatedWords, "${word}");
// output: "The test data."

In the above code, I've also used the x flag provided by XRegExp, to improve readability. Note that the addFlags method can be called multiple times on the same regex (e.g., /pattern/g.addFlags("k").addFlags("s")), but I'd recommend adding all flags in one shot, for efficiency.

Here are a few more examples of using named capture, with an overly simplistic URL-matching regex (for comprehensive URL parsing, see parseUri):

var url = "http://microsoft.com/path/to/file?q=1";
var urlParser = new XRegExp("^(<protocol>[^:/?]+)://(<host>[^/?]*)(<path>[^?]*)\\?(<query>.*)", "k");
var parts = urlParser.exec(url);
/* The result:
parts.protocol: "http"
parts.host: "microsoft.com"
parts.path: "/path/to/file"
parts.query: "q=1" */

// Named backreferences are also available in replace() callback functions as properties of the first argument
var newUrl = url.replace(urlParser, function(match){
	return match.replace(match.host, "yahoo.com");
});
// newUrl: "http://yahoo.com/path/to/file?q=1"

Note that XRegExp's named capture functionality does not support deprecated JavaScript features including the lastMatch property of the global RegExp object and the RegExp.prototype.compile() method.

Singleline (s) and extended (x) modes

The other non-native flags XRegExp supports are s (singleline) for "dot matches all" mode, and x (extended) for "free-spacing and comments" mode. For full details about these modifiers, see the FAQ in my XRegExp 0.1 post. However, one difference from the previous version is that XRegExp 0.2, when using the x flag, now allows whitespace between a regex token and its quantifier (quantifiers are, e.g., +, *?, or {1,3}). Although the previous version's handling/limitation in this regard was documented, it was atypical compared to other regex libraries. This has been fixed.

Download

XRegExp 0.2.5.

XRegExp has been tested in IE 5.5–7, Firefox 2.0.0.4, Opera 9.21, Safari 3.0.2 beta for Windows, and Swift 0.2.

Finally, note that the XRE object from v0.1 has been removed. XRegExp now only creates one global variable: XRegExp. To permanently override the native RegExp constructor/object, you can now run XRegExp.overrideNative();

Update: This version of XRegExp is outdated. See XRegExp.com for the latest, greatest version.