January 2008 — Flagrant Badassery

Alright, that title doesn't really work, but one thing I've encountered quite frequently is that the terms "single-line mode" and "multi-line mode" seem to cause no end of confusion for the vast majority of regex users. Many guides try to explain the terms based on some description of lines, or other unrelated issues. I won't, because I think that's highly misleading. In fact, I shunned the terms in RegexPal in favor of the hopefully more descriptive "dot matches all" (instead of single-line) and "^$ match at line breaks" (instead of multi-line). Another sensible name I've seen used for multi-line mode is "enhanced line-anchor mode".

The first important thing to understand is that single-line and multi-line modes have nothing to do with each other. Hence, they can be applied independently — e.g., sometimes it makes sense to use both at the same time. Single-line mode changes the meaning of the dot, while multi-line mode changes the meaning of ^ and $. There are no other side effects. This means that if your regex does not contain one of those three metacharacters (., ^, $), neither modifier has any impact (except possibly confusing other people who read your regexes).

Let's get more specific…

Dot (.) matches all characters except newlines. In single-line mode, it matches all characters.
Caret (^) matches the position at the beginning of the string. In multi-line mode, it additionally matches the position just after newlines.
Dollar ($) matches the position at the end of the string. With most regex flavors, it also matches just before a string-ending newline. In multi-line mode, it additionally matches the position just before newlines (not only string-ending newlines).

As for exactly which characters are considered newline characters, that is regex-flavor and character-encoding specific. See my last post, JavaScript, Regex, and Unicode, for a precise JavaScript definition.

Finally, although single-line and multi-line modifiers are standard in most Perl-derivative regex flavors, there are a couple notable exceptions.

JavaScript does not have a single-line modifier. Use [\S\s] instead of a dot if you want to match any character including newlines. And for the love of god, don't use (.|\r|\n) or similar — it's terribly inefficient (especially when repeated infinitely) and doesn't match a couple lesser-used newline characters.
In JavaScript, $ without /m matches only the position at the end of the string. It does not match before a string-ending newline.
In Ruby, ^ and $ always match at newlines, and there is no mode to change this. Use \A and \Z to match at the beginning and end of the string only. In fact, what Ruby calls multi-line mode (and implements as /m) is what other regex flavors call single-line mode! Talk about confusing!

Edit: On his blog, Jeffrey Friedl pointed out a related post he'd made on the comp.lang.perl Usenet group back in 1994.

Edit 2: Note that I consider Tcl's "Advanced Regular Expressions" flavor to be more inspired by some features of Perl than actually Perl-derivative. Hence, its peculiarities regarding so-called single-line and multi-line handling are not mentioned here.

Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations often differ on these points. For example, Firefox 2 considers \d and \D to be Unicode-aware, while Firefox 3 fixes this bug — making \d equivalent to [0-9] as with most other browsers.

Here again are the affected tokens, along with their definitions:

\d — Digits.
\s — Whitespace.
\w — Word characters.
\D — All except digits.
\S — All except whitespace.
\W — All except word characters.
. — All except newlines.
^ (with /m) — The positions at the beginning of the string and just after newlines.
$ (with /m) — The positions at the end of the string and just before newlines.
\b — Word boundary positions.
\B — Not word boundary positions.

All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms digit, whitespace, word character, word boundary, and newline depend on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:

Digit — The characters 0-9 only.
Whitespace — Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".
Word character — The characters A-Z, a-z, 0-9, and _ only.
Word boundary — The position between a word character and non-word character.
Newline — The line feed, carriage return, line separator, and paragraph separator characters.

Here again are the newline characters, with their character codes:

\u000a — Line feed — \n
\u000d — Carriage return — \r
\u2028 — Line separator
\u2029 — Paragraph separator

Note that ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, /\r^$\n/m.test("\r\n") returns true.

As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by \s according to ECMA-262 3rd Edition and Unicode 5.1:

\u0009 — Tab — \t
\u000a — Line feed — \n — (newline character)
\u000b — Vertical tab — \v
\u000c — Form feed — \f
\u000d — Carriage return — \r — (newline character)
\u0020 — Space
\u00a0 — No-break space
\u1680 — Ogham space mark
\u180e — Mongolian vowel separator
\u2000 — En quad
\u2001 — Em quad
\u2002 — En space
\u2003 — Em space
\u2004 — Three-per-em space
\u2005 — Four-per-em space
\u2006 — Six-per-em space
\u2007 — Figure space
\u2008 — Punctuation space
\u2009 — Thin space
\u200a — Hair space
\u2028 — Line separator — (newline character)
\u2029 — Paragraph separator — (newline character)
\u202f — Narrow no-break space
\u205f — Medium mathematical space
\u3000 — Ideographic space

To test which characters or positions are matched by all of the tokens mentioned here in your browser, see JavaScript Regex and Unicode Tests. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.

Update: My new Unicode plugin for XRegExp allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.

Month: January 2008

The terms single-line and multi-line considered harmful

JavaScript, Regex, and Unicode