JavaScript, Regex, and Unicode

Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations often differ on these points. For example, Firefox 2 considers \d and \D to be Unicode-aware, while Firefox 3 fixes this bug — making \d equivalent to [0-9] as with most other browsers.

Here again are the affected tokens, along with their definitions:

  • \d — Digits.
  • \s — Whitespace.
  • \w — Word characters.
  • \D — All except digits.
  • \S — All except whitespace.
  • \W — All except word characters.
  • . — All except newlines.
  • ^ (with /m) — The positions at the beginning of the string and just after newlines.
  • $ (with /m) — The positions at the end of the string and just before newlines.
  • \b — Word boundary positions.
  • \B — Not word boundary positions.

All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms digit, whitespace, word character, word boundary, and newline depend on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:

  • Digit — The characters 0-9 only.
  • Whitespace — Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".
  • Word character — The characters A-Z, a-z, 0-9, and _ only.
  • Word boundary — The position between a word character and non-word character.
  • Newline — The line feed, carriage return, line separator, and paragraph separator characters.

Here again are the newline characters, with their character codes:

  • \u000a — Line feed — \n
  • \u000d — Carriage return — \r
  • \u2028 — Line separator
  • \u2029 — Paragraph separator

Note that ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, /\r^$\n/m.test("\r\n") returns true.

As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by \s according to ECMA-262 3rd Edition and Unicode 5.1:

  • \u0009 — Tab — \t
  • \u000a — Line feed — \n — (newline character)
  • \u000b — Vertical tab — \v
  • \u000c — Form feed — \f
  • \u000d — Carriage return — \r — (newline character)
  • \u0020 — Space
  • \u00a0 — No-break space
  • \u1680 — Ogham space mark
  • \u180e — Mongolian vowel separator
  • \u2000 — En quad
  • \u2001 — Em quad
  • \u2002 — En space
  • \u2003 — Em space
  • \u2004 — Three-per-em space
  • \u2005 — Four-per-em space
  • \u2006 — Six-per-em space
  • \u2007 — Figure space
  • \u2008 — Punctuation space
  • \u2009 — Thin space
  • \u200a — Hair space
  • \u2028 — Line separator — (newline character)
  • \u2029 — Paragraph separator — (newline character)
  • \u202f — Narrow no-break space
  • \u205f — Medium mathematical space
  • \u3000 — Ideographic space

To test which characters or positions are matched by all of the tokens mentioned here in your browser, see JavaScript Regex and Unicode Tests. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.

Update: My new Unicode plugin for XRegExp allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.

/(bb|[^b]{2})/

/(bb|[^b]{2})/

…That is the question. ThinkGeek is selling that on a t-shirt for the "regular expression junkie + lover of literature." Wearing that around would be a sure way to get me to notice you, especially if you have a nice rack or happen to be Angelina Jolie.

Note that the parentheses are unnecessary except in the case of Perl-folk wishing to avoid using the naughty $& variable for performance reasons. Also, the expression is buggy in that many alternatives to "bb" are not matched, so it would probably be better written as simply /bb|.*/s. For a JavaScript solution you could try var answer = /^bb$/.test(i);.

Alright, I'm done debugging the shirt now.

Performance of Greedy vs. Lazy Regex Quantifiers

A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, but with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the impreciseness of their patterns, often without realizing this is what they're doing. Hence, the previously mentioned misconception stems from that fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.

Consider the following simple example: When the regexes <.*?> and <[^>]*> are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around the ideas of reducing backtracking and the steps which are potentially required to match or fail at any given character position, and here we've reduced both cases to the absolute minimum.

So what exactly is different about how a backtracking, leftmost-first regex engine (your traditional NFA engine type, which describes most modern, Perl-derivative flavors) goes about matching our subject text when working with these two regexes? Let's first take a look at what happens with <.*?>. I'll use the RegexBuddy debugger style to illustrate each step.

Note: If you're reading this in a feed reader or aggregator which strips out inline styles, see the original post, which should make things much easier to follow.

  1. Beginning match attempt at character 0
  2. <
  3. <ok
  4. <backtrack
  5. <0
  6. <0backtrack
  7. <01
  8. <01backtrack
  9. <012
  10. <012backtrack
  11. <0123
  12. <0123backtrack
  13. <01234
  14. <01234backtrack
  15. <012345
  16. <012345backtrack
  17. <0123456
  18. <0123456backtrack
  19. <01234567
  20. <01234567backtrack
  21. <012345678
  22. <012345678backtrack
  23. <0123456789
  24. <0123456789>
  25. Match found

The text ok means the engine completed a successful, zero-length match, and backtrack shows where the engine backtracks to the last choice point, which it does after trying and failing to match the next token (>). As you can see here, with lazy quantification the engine backtracks forward after each character which is not followed by ">".

Here's what happens with <[^>]*>:

  1. Beginning match attempt at character 0
  2. <
  3. <0123456789
  4. <0123456789>
  5. Match found

As you can see, 20 fewer steps are required, and there is no backtracking. This is faster, and can also increase the potential for internal optimization of the matching process. As a simple example, when greedily quantifying an any-character token, the engine can simply move the read-head to the end of the string, rather than testing every character along the way (this is an oversimplification of the process taken by modern regex engines, but you get the idea). You can see this exact kind of internal optimization occurring in IE with the "trim8" pattern in my Faster JavaScript Trim post from a while back.

You might be wondering what happens if we use the regex <.*>. Take a look:

  1. Beginning match attempt at character 0
  2. <
  3. <0123456789>
  4. <0123456789>backtrack
  5. <0123456789
  6. <0123456789>
  7. Match found

Although the output didn't change much when used with our subject string of "<0123456789>", you can see that the .* caused the engine to trek all the way to the end of the string, then backtrack from the right. If we'd used a 10,000-character string without newlines or other ">" characters, that would've resulted in nearly 10,000 backtracks.

In short:

  • Make your patterns more explicit where it makes sense, to better guide the regex engine and avoid backtracking.
  • Be careful about using the more appropriate type of quantification (to your best approximation, since this often depends on the subject string) even when it has no impact on the text matched by your regex.
  • If possible, avoid greedy quantification of the dot and other needlessly flexible tokens.

Matching Nested Constructs in JavaScript, Part 2

When I posted my matchRecursive function the other day (which allows easily matching nested constructs), I noted that it could easily be modified to work with a regex pattern rather than a string as the format argument. After looking at it again, I realized the conversion wouldn't be entirely straightforward, so I've gone ahead and reimplemented it as matchRecursiveRegExp to work with regex format delimiters. You can now use it to like e.g. matchRecursiveRegExp(str, "<div\\b[^>]*>", "</div>", "gi") to match the entire contents of all outermost <div> tags.

As before, read the code comments for further details. Note that in this version I've made the variable names more or less indecipherable in order to keep code size down, so see the earlier function for a similar but more readable implementation.

// (c) 2007 Steven Levithan <stevenlevithan.com>
// MIT License

/*** matchRecursiveRegExp
	Accepts a string to search, a left and right format delimiter
	as regex patterns, and optional regex flags. Returns an array
	of matches, allowing nested instances of left/right delimiters.
	Use the "g" flag to return all matches, otherwise only the
	first is returned. Be careful to ensure that the left and
	right format delimiters produce mutually exclusive matches.
	Backreferences are not supported within the right delimiter
	due to how it is internally combined with the left delimiter.
	When matching strings whose format delimiters are unbalanced
	to the left or right, the output is intentionally as a
	conventional regex library with recursion support would
	produce, e.g. "<<x>" and "<x>>" both produce ["x"] when using
	"<" and ">" as the delimiters (both strings contain a single,
	balanced instance of "<x>").

	examples:
		matchRecursiveRegExp("test", "\\(", "\\)")
			returns: []
		matchRecursiveRegExp("<t<<e>><s>>t<>", "<", ">", "g")
			returns: ["t<<e>><s>", ""]
		matchRecursiveRegExp("<div id=\"x\">test</div>", "<div\\b[^>]*>", "</div>", "gi")
			returns: ["test"]

*/
function matchRecursiveRegExp (str, left, right, flags) {
	var	f = flags || "",
		g = f.indexOf("g") > -1,
		x = new RegExp(left + "|" + right, "g" + f),
		l = new RegExp(left, f.replace(/g/g, "")),
		a = [],
		t, s, m;

	do {
		t = 0;
		while (m = x.exec(str)) {
			if (l.test(m[0])) {
				if (!t++) s = x.lastIndex;
			} else if (t) {
				if (!--t) {
					a.push(str.slice(s, m.index));
					if (!g) return a;
				}
			}
		}
	} while (t && (x.lastIndex = s));

	return a;
}

You can download it here.

Let me know if there are any additional features you'd like to see with this.

ECMAScript 3 Regular Expressions are Defective by Design

ECMAScript 3 has some major regex design flaws, and if nothing changes the ES4 group will be propagating some of the mistakes into ECMAScript 4 (aka JavaScript 2).

Recently, longtime JavaScript regex guru David "liorean" Andersson wrote up a couple posts about my biggest gripes with the ECMAScript 3 regex flavor, namely the way that backreferences are handled for non-participating capturing groups and within quantified subpatterns (see ECMAScript 3 Regular Expressions: A specification that doesn't make sense and A quick JS quiz for anybody who think they know regex). I'll avoid rehashing the points here, since I think David already articulated the problems well. For the record, I had planned to submit these issues as ECMAScript 4 bugs, but I already have a number of ES4 regex tickets open and was waiting to see their outcome before submitting more.

Another historically problematic issue has been the fact that, according to ES3, regex literals cause only one object to be created at runtime for a script or function. This issue exhibits itself most frequently as regex literals which use the /g modifier not having their lastIndex property reset in some cases where most developers would expect it. Fortunately, this is already planned to be fixed in ES4. The fact that it has been the third most duplicated Firefox bug report undoubtedly had something to do with this decision.

But getting back to my original rant, although the backreference handling issues might be less visible to some developers than having their regex objects' lastIndex properties seemingly out of whack, they are no more sensible or in line with developer expectations. Additionally, the ES3 handling for these issues is incompatible with other modern regex libraries, and far less useful than the alternative handling (see e.g. Mimicking Conditionals and Capturing Multiple, Optional HTML Attribute Values for a couple examples of where the conventional, Perl-style handling could be put to good use).

As a related rant, IMHO the ECMAScript 4 regex extension proposals miss some opportunities for key feature additions. Here's what ES4 regexes add, along with a few compatibility-related changes and the ability for regex literals to span multiple lines:

  • Character class set operations — intersection and subtraction, with syntax inspired by java.util.regex.
  • (?#…) comment patterns.
  • Named capture — though it seems this wasn't fully thought out. However, it looks like the TG1 group might be willing to change the syntax from that proposed in the draft spec to the more common .NET/Perl syntax, which would be an improvement.
  • The /y (sticky) modifier — similar to several other libraries' use of \G.
  • The /x (extended) modifier — for free-spacing and comments.
  • Unicode character properties — but there's no support for Unicode scripts or blocks, and no \X metasequence to match a single grapheme, which means you'll have to use \P{M}\p{M}*.
  • Support for hex character codes outside Unicode's Basic Multilingual Plane — via \x{n…} and \u{n…}, which are equivalent.

For a description of these features, see the ES4 wiki, but note that many of the finer details of how they'll work are not mentioned, or are being discussed elsewhere, such as on the es4-discuss@mozilla.org mailing list (external archive here) or within the ECMAScript 4 issue database.

Aside from a few details of their currently proposed implementation (which for the most part I've already brought up elsewhere), I think these additions are great. To be honest though, if I could trade all of the ES4 regex extensions for atomic groups and lookbehind, I would. And while it's understandable that different people have different priorities, the lack of atomic groups in particular is a significant omission considering their potentially dramatic performance-enhancing power combined with their minimal implementation burden. Additional features found either in Perl or other Perl-derivative regex libraries which could be quite useful include possessive quantifiers, backtracking control verbs, mode modifiers and mode-modified spans, conditionals, \A and \z assertions, callouts, relative backreferences, recursion, subpatterns as subroutines, match point resetting (via \K), duplicate subpattern numbers (?|…), subpattern definitions (?(DEFINE)…), partial matching, backwards matching, etc.

Since the ECMA TG1 group has stated that they're no longer accepting major spec proposals, I expect the additions will be limited to those already proposed. However, I'm hopeful that the situation will be improved, at least by refining some of the existing ES3 features and ES4 proposals. Since I love both JavaScript and regular expressions, I'd love to see them come together in a way that rivals the best regex libraries. Perhaps ECMAScript could even introduce a little innovation in the space.