Alright, that title doesn't really work, but one thing I've encountered quite frequently is that the terms "single-line mode" and "multi-line mode" seem to cause no end of confusion for the vast majority of regex users. Many guides try to explain the terms based on some description of lines, or other unrelated issues. I won't, because I think that's highly misleading. In fact, I shunned the terms in RegexPal in favor of the hopefully more descriptive "dot matches all" (instead of single-line) and "^$ match at line breaks" (instead of multi-line). Another sensible name I've seen used for multi-line mode is "enhanced line-anchor mode".
The first important thing to understand is that single-line and multi-line modes have nothing to do with each other. Hence, they can be applied independently; e.g., sometimes it makes sense to use both at the same time. Single-line mode changes the meaning of the dot, while multi-line mode changes the meaning of ^ and $. There are no other side effects. This means that if your regex does not contain one of those three metacharacters (., ^, $), neither modifier has any impact (except possibly confusing other people who read your regexes).
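To make the independence concrete, here is a minimal sketch using Python's re module, where single-line mode is spelled re.DOTALL and multi-line mode is spelled re.MULTILINE. Python is only my choice for illustration; the sample text and variable names are made up, and other flavors differ in details.

```python
import re

text = "first line\nsecond line"

# Single-line mode (re.DOTALL, alias re.S) only changes what "." matches.
dot_plain = re.findall(r"line.second", text)
dot_single = re.findall(r"line.second", text, re.DOTALL)

# Multi-line mode (re.MULTILINE, alias re.M) only changes where "^" and "$" match.
caret_plain = re.findall(r"^\w+", text)
caret_multi = re.findall(r"^\w+", text, re.MULTILINE)

# The flags are independent, so either one or both can be set.
multi_only = re.findall(r"^.+$", text, re.MULTILINE)
both = re.findall(r"^.+$", text, re.DOTALL | re.MULTILINE)

# A pattern with no ".", "^", or "$" is unaffected by either flag.
no_meta_plain = re.findall(r"line", text)
no_meta_flags = re.findall(r"line", text, re.DOTALL | re.MULTILINE)

print(dot_plain, dot_single)      # [] ['line\nsecond']
print(caret_plain, caret_multi)   # ['first'] ['first', 'second']
print(multi_only)                 # ['first line', 'second line']
print(both)                       # ['first line\nsecond line']
print(no_meta_plain == no_meta_flags)  # True
```

Note how adding re.DOTALL to the last ^.+$ search changes only how far the dot can run, while re.MULTILINE changes only where the anchors may match.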
Let's get more specific…
- Dot (.) matches all characters except newlines. In single-line mode, it matches all characters.
- Caret (^) matches the position at the beginning of the string. In multi-line mode, it additionally matches the position just after newlines.
- Dollar ($) matches the position at the end of the string. With most regex flavors, it also matches just before a string-ending newline. In multi-line mode, it additionally matches the position just before newlines (not only string-ending newlines).
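The three rules above can be checked one at a time. This is a rough sketch in Python's re module, which happens to be one of the flavors where $ also matches just before a string-ending newline; the test strings are arbitrary examples of mine.

```python
import re

# Dot: skips "\n" by default, matches it in single-line mode (re.DOTALL).
dot_default = re.findall(r"a.b", "a\nb")
dot_single = re.findall(r"a.b", "a\nb", re.DOTALL)

# Caret: start of string only; in multi-line mode also just after each "\n".
caret_default = re.findall(r"^\w+", "one\ntwo")
caret_multi = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# Dollar: end of string, plus just before a string-ending "\n" in this flavor.
# In multi-line mode it also matches just before every other "\n".
dollar_default = re.findall(r"\w+$", "one\ntwo\n")
dollar_multi = re.findall(r"\w+$", "one\ntwo\n", re.MULTILINE)

print(dot_default, dot_single)        # [] ['a\nb']
print(caret_default, caret_multi)     # ['one'] ['one', 'two']
print(dollar_default, dollar_multi)   # ['two'] ['one', 'two']
```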
As for exactly which characters are considered newline characters, that is regex-flavor and character-encoding specific. See my last post, JavaScript, Regex, and Unicode, for a precise JavaScript definition.
Finally, although single-line and multi-line modifiers are standard in most Perl-derivative regex flavors, there are a couple notable exceptions.
- JavaScript does not have a single-line modifier. Use [\S\s] instead of a dot if you want to match any character including newlines. And for the love of god, don't use (.|\r|\n) or similar; it's terribly inefficient (especially when repeated infinitely) and doesn't match a couple lesser-used newline characters.
- In JavaScript, $ without /m matches only the position at the end of the string. It does not match before a string-ending newline.
- In Ruby, ^ and $ always match at newlines, and there is no mode to change this. Use \A and \Z to match at the beginning and end of the string only. In fact, what Ruby calls multi-line mode (and implements as /m) is what other regex flavors call single-line mode! Talk about confusing!
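The [\S\s] trick is not specific to JavaScript: a class that pairs a shorthand with its negation matches any character in any flavor that supports those shorthands. Here is a small sketch in Python (used purely for illustration, with a made-up sample string) comparing it against a plain dot and against the dot in single-line mode.

```python
import re

html = "<b>bold\ntext</b>"  # hypothetical sample input

# A plain dot cannot cross the newline, so this search fails.
plain_dot = re.search(r"<b>(.*)</b>", html)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
char_class = re.search(r"<b>([\s\S]*)</b>", html)

# Single-line mode (re.DOTALL) gives the dot the same reach.
single_line = re.search(r"<b>(.*)</b>", html, re.DOTALL)

print(plain_dot)                   # None
print(repr(char_class.group(1)))   # 'bold\ntext'
print(char_class.group(1) == single_line.group(1))  # True
```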
Edit: On his blog, Jeffrey Friedl pointed out a related post he'd made on the comp.lang.perl Usenet group back in 1994.
Edit 2: Note that I consider Tcl's "Advanced Regular Expressions" flavor to be more inspired by some features of Perl than actually Perl-derivative. Hence, its peculiarities regarding so-called single-line and multi-line handling are not mentioned here.
Oh yes! Those two flags are in the top 10 of the least inspired names in the flags history if you discount those weird Windows native method flags that no one can remember. A while ago I was working with a regular expression and I remember having the epiphany: wait a minute! These two options are not mutually exclusive! And I have been using the C# Regex for at least two years.
Steve, great breakdown and explanation. Just out of curiosity, was the selection of [\S\s] arbitrary? Or, would it work just as well as [\W\w], which is what I have done in the past?
Ben, thanks! And yes, [\S\s], [\W\w], and [\D\d] are all equivalent. In ES3 (which only supports the Unicode BMP), [\0-\uffff] is also equivalent.
“JavaScript does not have a single-line modifier. Use [\S\s] instead of a dot if you want to match any character including newlines.”
Thanks a lot for this one… I wasn’t sure what to use instead of single-line mode here in JavaScript.
Python refers to these as “MULTILINE” = “M” and “DOTALL” = “S”.
http://docs.python.org/lib/node46.html
This was really helpful, especially the bit about [\s\S] vs. . in JavaScript. Thanks!
Very helpful. I’ve been poring over the net trying to figure out why “s” wasn’t working in JS. Thanks a lot.