Over the last few years, I've occasionally commented on JavaScript's RegExp API, syntax, and behavior on the ES-Discuss mailing list. Recently, JavaScript inventor Brendan Eich suggested that, in order to get more discussion going, I write up a list of regex changes to consider for future ECMAScript standards (or as he humorously put it, have my "95 [regex] theses nailed to the ES3 cathedral door"). I figured I'd give it a shot, but I'm going to split my response into a few parts. In this post, I'll be discussing issues with the current RegExp API and behavior. I'll be leaving aside new features that I'd like to see added, and merely suggesting ways to make existing capabilities better. I'll discuss possible new features in a follow-up post.
For a language as widely used as JavaScript, any realistic change proposal must strongly consider backward compatibility. For this reason, some of the following proposals might not be particularly realistic, but nevertheless I think that a) it's worthwhile to consider what might change if backward compatibility wasn't a concern, and b) in the long run, all of these changes would improve the ease of use and predictability of how regular expressions work in JavaScript.
Remove RegExp.prototype.lastIndex and replace it with an argument for start position
Actual proposal: Deprecate RegExp.prototype.lastIndex and add a "pos" argument to the RegExp.prototype.exec/test methods
JavaScript's lastIndex property serves too many purposes at once:
- It lets users manually specify where to start a regex search
  - You could claim this is not lastIndex's intended purpose, but it's nevertheless an important use since there's no alternative feature that allows this. lastIndex is not very good at this task, though. You need to compile your regex with the /g flag to let lastIndex be used this way; and even then, it only specifies the starting position for the regexp.exec/test methods. It cannot be used to set the start position for the string.match/replace/search/split methods.
- It indicates the position where the last match ended
  - Even though you could derive the match end position by adding the match index and length, this use of lastIndex serves as a convenient and commonly used complement to the index property on match arrays returned by exec. As always, using lastIndex like this works only for regexes compiled with /g.
- It's used to track the position where the next search should start
  - This comes into play, e.g., when using a regex to iterate over all matches in a string. However, the fact that lastIndex is actually set to the end position of the last match rather than the position where the next search should start (unlike equivalents in other programming languages) causes a problem after zero-length matches, which are easily possible with regexes like /\w*/g or /^/mg. Hence, you're forced to manually increment lastIndex in such cases. I've posted about this issue in more detail before (see: An IE lastIndex Bug with Zero-Length Regex Matches), as has Jan Goyvaerts (Watch Out for Zero-Length Matches).
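The three roles described above can be seen in a short sketch. The string and variable names are only illustrative; the behavior is that of the standard exec method with and without /g.

```javascript
var str = "one two";

// 1. Manual start position: honored only when the regex has /g.
var global = /\w+/g;
global.lastIndex = 4;
var fromFour = global.exec(str);       // matches "two" at index 4

var nonglobal = /\w+/;
nonglobal.lastIndex = 4;
var fromZero = nonglobal.exec(str);    // lastIndex ignored; matches "one" at index 0

// 2. End of the last match: index 4 + length 3.
var endOfMatch = global.lastIndex;     // 7

// 3. Next-search position after a zero-length match: lastIndex does not
//    advance, so a naive loop would keep matching at the same spot.
var empty = /\w*/g;
empty.lastIndex = 3;                   // position of the space
var zeroLength = empty.exec(str);      // matches "" at index 3
var stuckAt = empty.lastIndex;         // still 3
```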
Unfortunately, lastIndex's versatility results in it not working ideally for any specific use. I think lastIndex is misplaced anyway; if you need to store a search's ending (or next-start) position, it should be a property of the target string and not the regular expression. Here are three reasons this would work better:
- It would let you use the same regex with multiple strings, without losing track of the next search position within each one.
- It would allow using multiple regexes with the same string and having each one pick up from where the last one left off.
- If you search two strings with the same regex, you're probably not expecting the search within the second string to start from an arbitrary position just because a match was found in the first string.
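The third point is easy to reproduce. The following is a minimal sketch with made-up input strings, showing a /g regex carrying its position from one string into another:

```javascript
var word = /\w+/g;

// A match in the first string leaves lastIndex at 5.
var first = word.test("hello world");   // true
var carriedOver = word.lastIndex;       // 5

// The same regex now starts at index 5 in an unrelated string,
// which skips all of "short" and finds nothing.
var second = word.test("short");        // false

// The failed match reset lastIndex to 0, so the same call now succeeds.
var third = word.test("short");         // true
```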
In fact, Perl uses this approach of storing next-search positions with strings to great effect, and adds various features around it.
So that's my case for lastIndex being misplaced, but I go one further in that I don't think lastIndex should be included in JavaScript at all. Perl's tactic works well for Perl (especially when considered as a complete package), but some other languages (including Python) let you provide a search-start position as an argument when calling regex methods, which I think is an approach that is more natural and easier for developers to understand and use. I'd therefore fix lastIndex by getting rid of it completely. Regex methods and regex-using string methods would use internal search position trackers that are not observable by the user, and the exec and test methods would get a second argument (called pos, for position) that specifies where to start their search. It might be convenient to also give the String methods search, match, replace, and split their own pos arguments, but that is not as important and the functionality it would provide is not currently possible via lastIndex anyway.
Following are examples of how some common uses of lastIndex could be rewritten if these changes were made:
Start search from position 5, using lastIndex (the status quo):
var regexGlobal = /\w+/g,
    result;

regexGlobal.lastIndex = 5;
result = regexGlobal.test(str);
// must reset lastIndex or future tests will continue from the
// match-end position (defensive coding)
regexGlobal.lastIndex = 0;

var regexNonglobal = /\w+/;

regexNonglobal.lastIndex = 5;
// no go - lastIndex will be ignored. instead, you have to do this
result = regexNonglobal.test(str.slice(5));
Start search from position 5, using pos:
var regex = /\w+/, // flag /g doesn't matter
result = regex.test(str, 5);
Match iteration, using lastIndex:
var regex = /\w*/g,
    matches = [],
    match;

// the /g flag is required for this regex. if your code was provided a non-
// global regex, you'd need to recompile it with /g, and if it already had /g,
// you'd need to reset its lastIndex to 0 before entering the loop

while (match = regex.exec(str)) {
    matches.push(match);
    // avoid an infinite loop on zero-length matches
    if (regex.lastIndex == match.index) {
        regex.lastIndex++;
    }
}
Match iteration, using pos:
var regex = /\w*/, // flag /g doesn't matter
pos = 0,
matches = [],
match;
while (match = regex.exec(str, pos)) {
matches.push(match);
pos = match.index + (match[0].length || 1);
}
Of course, you could easily add your own sugar to further simplify match iteration, or JavaScript could add a method dedicated to this purpose similar to Ruby's scan (although JavaScript already sort of has this via the use of replacement functions with string.replace).
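Until something like pos exists, the same effect can be approximated in user code. The sketch below is one possible approach, not a definitive implementation: execAt and scan are hypothetical helper names, and the copy made inside execAt carries over only the i and m flags.

```javascript
// Hypothetical helper: emulates the proposed regex.exec(str, pos) by running
// a /g copy of the regex, so the caller's regex and its lastIndex are untouched.
function execAt(regex, str, pos) {
    var flags = "g" +
        (regex.ignoreCase ? "i" : "") +
        (regex.multiline ? "m" : "");
    var copy = new RegExp(regex.source, flags);
    copy.lastIndex = pos || 0;
    return copy.exec(str);
}

// Hypothetical helper: collects every match as a string, stepping one
// character forward after a zero-length match to avoid an infinite loop.
function scan(regex, str) {
    var matches = [],
        pos = 0,
        match;
    while (match = execAt(regex, str, pos)) {
        matches.push(match[0]);
        pos = match.index + (match[0].length || 1);
    }
    return matches;
}

var words = scan(/\w+/, "one two three");   // ["one", "two", "three"]
var withEmpties = scan(/\w*/, "ab c");      // ["ab", "", "c", ""]
```

Building a new RegExp on every call is wasteful; a real version would probably cache the /g copy.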
To reiterate, I'm describing what I would do if backward compatibility was irrelevant. I don't think it would be a good idea to add a pos argument to the exec and test methods unless the lastIndex property was deprecated or removed, due to the functionality overlap. If a pos argument existed, people would expect pos to be 0 when it's not specified. Having lastIndex around to sometimes screw up this expectation would be confusing and probably lead to latent bugs. Hence, if lastIndex was deprecated in favor of pos, it should be a means toward the end of removing lastIndex altogether.
Remove String.prototype.match's nonglobal operating mode
Actual proposal: Deprecate String.prototype.match and add a new matchAll method
String.prototype.match currently works very differently depending on whether the /g (global) flag has been set on the provided regex:
- For regexes with /g: If no matches are found, null is returned; otherwise an array of simple matches is returned.
- For regexes without /g: The match method operates as an alias of regexp.exec. If a match is not found, null is returned; otherwise you get an array containing the (single) match in key zero, with any backreferences stored in the array's subsequent keys. The array is also assigned special index and input properties.
The match method's nonglobal mode is confusing and unnecessary. The reason it's unnecessary is obvious: If you want the functionality of exec, just use it (no need for an alias). It's confusing because, as described above, the match method's two modes return very different results. The difference is not merely whether you get one match or all matches; you get a completely different kind of result. And since the result is an array in either case, you have to know the status of the regex's global property to know which type of array you're dealing with.
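A small illustration of the two modes, using an arbitrary sample string and pattern:

```javascript
var str = "a1 b2";

// With /g: a flat array of whole matches, with no captures and no index.
var all = str.match(/([a-z])(\d)/g);     // ["a1", "b2"]

// Without /g: the same shape exec returns, i.e. the whole match followed
// by the captures, plus index and input properties.
var one = str.match(/([a-z])(\d)/);      // ["a1", "a", "1"], with index 0

// Both results are arrays, so the type alone doesn't reveal which mode ran.
var bothArrays = Array.isArray(all) && Array.isArray(one);   // true

// No match: null in either mode.
var none = str.match(/x/g);              // null
```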
I'd change string.match by making it always return an array containing all matches in the target string. I'd also make it return an empty array, rather than null, when no matches are found (an idea that comes from Dean Edwards's base2 library). If you want the first match only or you need backreferences and extra match details, that's what regexp.exec is for.
Unfortunately, if you want to consider this change as a realistic proposal, it would require some kind of language version- or mode-based switching of the match method's behavior (unlikely to happen, I would think). So, instead of that, I'd recommend deprecating the match method altogether in favor of a new method (perhaps RegExp.prototype.matchAll) with the changes prescribed above.
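The behavior described above could be sketched in user code roughly as follows. The name matchEvery is hypothetical (chosen only for this illustration), and the helper carries over just the i and m flags.

```javascript
// Hypothetical helper sketching the proposed behavior: always return every
// match, and return an empty array (never null) when nothing matches.
function matchEvery(str, regex) {
    var flags = "g" +
        (regex.ignoreCase ? "i" : "") +
        (regex.multiline ? "m" : "");
    // Work on a /g copy so the result doesn't depend on the caller's flag.
    return str.match(new RegExp(regex.source, flags)) || [];
}

var digits = matchEvery("a1b22c333", /\d+/);   // ["1", "22", "333"]
var nothing = matchEvery("abc", /\d+/);        // []
```

Because the result is always an array, callers can loop over it or read its length without a null check.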
Get rid of /g and RegExp.prototype.global
Actual proposal: Deprecate /g and RegExp.prototype.global, and add a boolean replaceAll argument to String.prototype.replace
If the last two proposals were implemented and therefore regexp.lastIndex and string.match were things of the past (or string.match no longer sometimes served as an alias of regexp.exec), the only method where /g would still have any impact is string.replace. Additionally, although /g follows prior art from Perl, etc., it doesn't really make sense to have something that is not an attribute of a regex stored as a regex flag. Really, /g is more of a statement about how you want methods to apply their own functionality, and it's not uncommon to want to use the same pattern with and without /g (currently you'd have to construct two different regexes to do so). If it was up to me, I'd get rid of the /g flag and its corresponding global property, and instead simply give the string.replace method an additional argument that indicates whether you want to replace the first match only (the default handling) or all matches. This could be done with either a replaceAll boolean or, for greater readability, a scope string that accepts values 'one' and 'all'. This new argument would have the additional benefit of allowing replace-all functionality with nonregex searches.
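As a rough sketch of the scope idea, here is a standalone function rather than a change to string.replace itself. The name replaceScoped and its argument order are assumptions made for illustration only.

```javascript
// Hypothetical helper: the scope argument ("one" or "all") decides how many
// matches are replaced, regardless of any /g flag on the search regex.
function replaceScoped(str, search, replacement, scope) {
    var all = scope === "all";
    if (search instanceof RegExp) {
        var flags = (all ? "g" : "") +
            (search.ignoreCase ? "i" : "") +
            (search.multiline ? "m" : "");
        return str.replace(new RegExp(search.source, flags), replacement);
    }
    if (!all) {
        return str.replace(search, replacement);
    }
    // Nonregex search, all occurrences. The replacement is inserted
    // literally on this path; $-tokens are not expanded.
    return str.split(search).join(replacement);
}

var firstOnly = replaceScoped("banana", /a/g, "o", "one");   // "bonana"
var every = replaceScoped("banana", /a/, "o", "all");        // "bonono"
var literal = replaceScoped("1+1+1", "+", "-", "all");       // "1-1-1"
```

Note that the same pattern is used with both scopes here, which is exactly the case that currently requires two separate regexes.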
Note that SpiderMonkey already has a proprietary third string.replace argument ("flags") that this proposal would conflict with. I doubt this conflict would cause much heartburn, but in any case, a new replaceAll argument would provide the same functionality that SpiderMonkey's flags argument is most useful for (that is, allowing global replacements with nonregex searches).
Change the behavior of backreferences to nonparticipating groups
Actual proposal: Make backreferences to nonparticipating groups fail to match
I'll keep this brief since David "liorean" Andersson and I have previously argued for this on ES-Discuss and elsewhere. David posted about this in detail on his blog (see: ECMAScript 3 Regular Expressions: A specification that doesn't make sense), and I've previously touched on it here (ECMAScript 3 Regular Expressions are Defective by Design). On several occasions, Brendan Eich has also stated that he'd like to see this changed. The short explanation of this behavior is that, in JavaScript, backreferences to capturing groups that have not (yet) participated in a match always succeed (i.e., they match the empty string), whereas the opposite is true in all other regex flavors: they fail to match and therefore cause the regex engine to backtrack or fail. JavaScript's behavior means that /(a|(b))\2c/.test("ac") returns true. The (negative) implications of this reach quite far when pushing the boundaries of regular expressions.
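To make the example concrete, here is the same regex run against three inputs. Under the traditional behavior found in other flavors, the first test would be expected to return false instead.

```javascript
// Group 2 never participates when the input is "ac", yet \2 still
// succeeds by matching the empty string.
var nonparticipating = /(a|(b))\2c/.test("ac");   // true

// When group 2 does participate, \2 must match what it captured.
var participating = /(a|(b))\2c/.test("bbc");     // true
var mismatch = /(a|(b))\2c/.test("bc");           // false
```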
I think everyone agrees that changing to the traditional backreferencing behavior would be an improvement: it provides far more intuitive handling, compatibility with other regex flavors, and great potential for creative use (e.g., see my post on Mimicking Conditionals). The bigger question is whether it would be safe, in light of backward compatibility. I think it would be, since I imagine that more or less no one uses the unintuitive JavaScript behavior intentionally. The JavaScript behavior amounts to automatically adding a ? quantifier after backreferences to nonparticipating groups, which is what people already do explicitly if they actually want backreferences to nonzero-length subpatterns to be optional. Also note that Safari 3.0 and earlier did not follow the spec on this point and used the more intuitive behavior, although that has changed in more recent versions (notably, this change was due to a write-up on my blog rather than reports of real-world errors).
Finally, it's probably worth noting that .NET's ECMAScript regex mode (enabled via the RegexOptions.ECMAScript flag) indeed switches .NET to ECMAScript's unconventional backreferencing behavior.
Make \d \D \w \W \b \B support Unicode (like \s \S . ^ $, which already do)
Actual proposal: Add a /u flag (and corresponding RegExp.prototype.unicode property) that changes the meaning of \d, \w, \b, and related tokens
Unicode-aware digit and word character matching is not an existing JavaScript capability (short of constructing character class monstrosities that are hundreds or thousands of characters long), and since JavaScript lacks lookbehind you can't reproduce a Unicode-aware word boundary. You could therefore say this proposal is outside the stated scope of this post, but I'm including it here because I consider this more of a fix than a new feature.
According to current JavaScript standards, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, whereas \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g., /na\b/.test("naïve") unfortunately returns true). See my post on JavaScript, Regex, and Unicode for further details. Adding Unicode support to these tokens would cause unexpected behavior for thousands of websites, but it could be implemented safely via a new /u flag (inspired by Python's re.U or re.UNICODE flag) and a corresponding RegExp.prototype.unicode property. Since it's actually fairly common to not want these tokens to be Unicode enabled in particular regex patterns, a new flag that activates Unicode support would offer the best of both worlds.
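The split between the two groups of tokens can be checked directly. In this sketch the non-ASCII characters are written as \u escapes so the example doesn't depend on file encoding.

```javascript
// "na\u00EFve" is "naïve". \b treats "ï" as a nonword character,
// so it finds a word boundary right after "na".
var boundary = /na\b/.test("na\u00EFve");   // true

// \w and \d are ASCII-only.
var wordChar = /\w/.test("\u00EF");         // false
var arabicDigit = /\d/.test("\u0661");      // false (ARABIC-INDIC DIGIT ONE)

// \s, by contrast, matches Unicode whitespace such as the no-break space.
var nbsp = /\s/.test("\u00A0");             // true
```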
Change the behavior of backreference resetting during subpattern repetition
Actual proposal: Never reset backreference values during a match
Like the last backreferencing issue, this too was covered by David Andersson in his post ECMAScript 3 Regular Expressions: A specification that doesn't make sense. The issue here involves the value remembered by capturing groups nested within a quantified, outer group (e.g., /((a)|(b))*/). According to traditional behavior, the value remembered by a capturing group within a quantified grouping is whatever the group matched the last time it participated in the match. So, the value of $1 after /(?:(a)|(b))*/ is used to match "ab" would be "a". However, according to ES3/ES5, the value of backreferences to nested groupings is reset/erased after the outer grouping is repeated. Hence, /(?:(a)|(b))*/ would still match "ab", but after the match is complete $1 would reference a nonparticipating capturing group, which in JavaScript would match an empty string within the regex itself, and be returned as undefined in, e.g., the array returned by regexp.exec.
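Here is that example as it behaves in an engine that follows the ES3/ES5 rule (as noted below, not every browser does):

```javascript
var match = /(?:(a)|(b))*/.exec("ab");

// The whole match is "ab". On the second repetition, (b) matches and the
// value captured by (a) during the first repetition is discarded.
var whole = match[0];     // "ab"
var group1 = match[1];    // undefined (would be "a" under traditional behavior)
var group2 = match[2];    // "b"
```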
My case for change is that current JavaScript behavior breaks from the norm in other regex flavors, does not lend itself to various types of creative patterns (see one example in my post on Capturing Multiple, Optional HTML Attribute Values), and in my opinion is far less intuitive than the more common, alternative regex behavior.
I believe this behavior is safe to change for two reasons. First, this is generally an edge case issue for all but hardcore regex wizards, and I'd be surprised to find regexes that rely on JavaScript's version of this behavior. Second, and more importantly, Internet Explorer does not implement this rule and follows the more traditional behavior.
Add an /s flag, already
Actual proposal: Add an /s flag (and corresponding RegExp.prototype.dotall property) that changes dot to match all characters including newlines
I'll sneak this one in as a change/fix rather than a new feature since it's not exactly difficult to use [\s\S] in place of a dot when you want the behavior of /s. I presume the /s flag has been excluded thus far to save novices from themselves and limit the damage of runaway backtracking, but what ends up happening is that people write inefficient patterns that rely on backtracking like (.|\r|\n)* instead.
Regex searches in JavaScript are seldom line-based, and it's therefore more common to want dot to include newlines than to match anything-but-newlines (although both modes are useful). It makes good sense to keep the default meaning of dot (no newlines) since it is shared by other regex flavors and required for backward compatibility, but adding support for the /s flag is overdue. A boolean indicating whether this flag was set should show up on regexes as a property named either singleline (the unfortunate name from Perl, .NET, etc.) or the more descriptive dotall (used in Java, Python, PCRE, etc.).
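For reference, the existing workarounds look like this (the sample text is arbitrary):

```javascript
var text = "first\nsecond";

// Dot stops at the newline.
var dotOnly = /first.second/.test(text);               // false

// [\s\S] matches any character, newline included, without alternation.
var anyChar = /first[\s\S]second/.test(text);          // true

// The alternation form also works, but it adds a choice point for
// every character it consumes.
var alternation = /first(.|\r|\n)second/.test(text);   // true
```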
Personal preferences
Following are a few changes that would suit my preferences, although I don't think most people would consider them significant issues:
- Allow regex literals to use unescaped forward slashes within character classes (e.g., /[/]/). This was already included in the abandoned ES4 change proposals.
- Allow an unescaped ] as the first character in character classes (e.g., []] or [^]]). This is allowed in probably every other regex flavor, but creates an empty class followed by a literal ] in JavaScript. I'd like to imagine that no one uses empty classes intentionally, since they don't work consistently cross-browser and there are widely-used/common-sense alternatives ((?!) instead of [], and [\s\S] instead of [^]). Unfortunately, adherence to this JavaScript quirk is tested in Acid3 (test 89), which is likely enough to kill requests for this backward-incompatible but reasonable change.
- Change the $& token used in replacement strings to $0. It just makes sense. (Equivalents in other replacement text flavors for comparison: Perl: $&; Java: $0; .NET: $0, $&; PHP: $0, \0; Ruby: \0, \&; Python: \g<0>.)
- Get rid of the special meaning of [\b]. Within character classes, the metasequence \b matches a backspace character (equivalent to \x08). This is a worthless convenience since no one cares about matching backspace characters, and it's confusing given that \b matches a word boundary when used outside of character classes. Even though this would break from regex tradition (which I'd usually advocate following), I think that \b should have no special meaning inside character classes and simply match a literal b.
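The current behavior behind three of these items, as a quick sketch. The empty-class results assume an engine that follows the spec on this point, which, as mentioned above, has not been true of every browser.

```javascript
// $& in a replacement string inserts the whole match.
var wrapped = "abc".replace(/b/, "[$&]");             // "a[b]c"

// $0 has no special meaning, so it is inserted literally.
var literalDollarZero = "abc".replace(/b/, "[$0]");   // "a[$0]c"

// [\b] matches a backspace character, not a word boundary or a literal "b".
var backspace = /[\b]/.test("\x08");                  // true
var letterB = /[\b]/.test("b");                       // false

// A leading ] closes the class: []] is an empty class (which can never
// match) followed by a literal "]".
var emptyClass = /[]]/.test("]");                     // false

// [^] matches any character, including a newline.
var negatedEmpty = /[^]/.test("\n");                  // true
```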
Fixed in ES3: Remove octal character references
ECMAScript 3 removed octal character references from regular expression syntax, although \0 was kept as a convenient exception that allows easily matching a NUL character. However, browsers have generally kept full octal support around for backward compatibility. Octals are very confusing in regular expressions since their syntax overlaps with backreferences and an extra leading zero is allowed outside of character classes. Consider the following regexes:
- /a\1/: \1 is an octal.
- /(a)\1/: \1 is a backreference.
- /(a)[\1]/: \1 is an octal.
- /(a)\1\2/: \1 is a backreference; \2 is an octal.
- /(a)\01\001[\01\001]/: All occurrences of \01 and \001 are octals. However, according to the ES3+ specs, the numbers after each \0 should be treated (barring nonstandard extensions) as literal characters, completely changing what this regex matches. (Edit-2012: Actually, a close reading of the spec shows that any 0-9 following \0 should cause a SyntaxError.)
- /(a)\0001[\0001]/: The \0001 outside the character class is an octal; but inside, the octal ends at the third zero (i.e., the character class matches character index zero or "1"). This regex is therefore equivalent to /(a)\x01[\x00\x31]/; although, as mentioned just above, adherence to ES3 would change the meaning.
- /(a)\00001[\00001]/: Outside the character class, the octal ends at the fourth zero and is followed by a literal "1". Inside, the octal ends at the third zero and is followed by a literal "01". And once again, ES3's exclusion of octals and inclusion of \0 could change the meaning.
- /\1(a)/: Given that, in JavaScript, backreferences to capturing groups that have not (yet) participated match the empty string, does this regex match "a" (i.e., \1 is treated as a backreference since a corresponding capturing group appears in the regex) or does it match "\x01a" (i.e., the \1 is treated as an octal since it appears before its corresponding group)? Unsurprisingly, browsers disagree.
- /(\2(a)){2}/: Now things get really hairy. Does this regex match "aa", "aaa", "\x02aaa", "2aaa", "\x02a\x02a", or "2a2a"? All of these options seem plausible, and browsers disagree on the correct choice.
There are other issues to worry about, too, like whether octal escapes go up to \377 (\xFF, 8-bit) or \777 (\u01FF, 9-bit); but in any case, octals in regular expressions are a confusing cluster-cuss. Even though ECMAScript has already cleaned up this mess by removing support for octals, browsers have not followed suit. I wish they would, because unlike browser makers, I don't have to worry about this bit of legacy (I never use octals in regular expressions, and neither should you).
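The first four (uncontroversial) cases from the list can be checked as follows. This sketch assumes an engine that keeps the legacy octal support described above; results could differ in one that drops it.

```javascript
// No capturing groups: \1 is read as an octal escape for character code 1.
var octalNoGroup = /a\1/.test("a\x01");        // true

// One capturing group: \1 is a backreference to it.
var backref = /(a)\1/.test("aa");              // true

// Backreferences don't exist inside character classes, so \1 is an octal.
var octalInClass = /(a)[\1]/.test("a\x01");    // true

// \1 is a backreference, but \2 has no group to refer to and becomes an octal.
var mixed = /(a)\1\2/.test("aa\x02");          // true
```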
Fixed in ES5: Don't cache regex literals
According to ES3 rules, regex literals did not create a new regex object if a literal with the same pattern/flag combination was already used in the same script or function (this did not apply to regexes created by the RegExp constructor). A common side effect of this was that regex literals using the /g flag did not have their lastIndex property reset in some cases where most developers would expect it. Several browsers didn't follow the spec on this unintuitive behavior, but Firefox did, and as a result it became the second most duplicated JavaScript bug report for Mozilla. Fortunately, ES5 got rid of this rule, and now regex literals must be recompiled every time they're encountered (this change is coming in Firefox 3.7).
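A minimal sketch of the ES5 rule, with illustrative names. One practical consequence is that a /g literal written directly inside a loop condition starts from lastIndex 0 on every pass, so it should be stored in a variable first.

```javascript
function makeRegex() {
    return /\d/g;   // evaluated anew on every call under ES5
}

var a = makeRegex();
var b = makeRegex();

a.test("1");                      // advances a.lastIndex to 1
var distinct = a !== b;           // true: two separate objects
var independent = b.lastIndex;    // 0: b is unaffected by a's match

// Hoisting the literal keeps one regex object (and one lastIndex)
// alive across iterations.
var digit = /\d/g,
    count = 0;
while (digit.exec("1a2b3")) {
    count++;
}
// count is now 3
```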
———
So there you have it. I've outlined what I think the JavaScript RegExp API got wrong. Do you agree with all of these proposals, or would you if you didn't have to worry about backward compatibility? Are there better ways than what I've proposed to fix the issues discussed here? Got any other gripes with existing JavaScript regex features? I'm eager to hear feedback about this.
Since I've been focusing on the negative in this post, I'll note that I find working with regular expressions in JavaScript to be a generally pleasant experience. There's a hell of a lot that JavaScript got right.
Instead of returning an array, I think many of these methods should return a generator instead.
Hey Steve, good list — I may transcribe it into wiki.ecmascript.org for you unless you have time and want to (mail me for edit access).
One small correction: the /[/]/ reality-based change was fixed in ES5.
/be
Thanks, Brendan! I’d be happy to add it to the wiki (and I will email you when I actually have the time to do so), but other obligations preclude me from taking this on for at least a couple weeks. If you’d like to add it and get to it before I do, by all means, that would be awesome.
This has nothing to do with JavaScript but… how nice it is to find you. We knew each other in Japan, ’98, I think. Your mom taught me and my sister, Jan, some art classes. I’ve been trying to get back in contact with Jessica, I hope you can help.
I wrote a blog post yesterday complaining about two things (one well known, one new to me) in JavaScript related to regular expressions.
http://blog.getify.com/2010/11/to-capture-or-not/#lookclosely
Summary:
1) I don’t understand why regular expressions default to having ( ) groups “capture”. My perspective is that far more often people are using ( ) to do grouping for operator binding, not for capturing. Yes, they *can* remember to always use (?: ) for that, but it makes the regexes even longer and more complicated than the already nearly opaque syntax.
So, why isn’t there a global modifier flag like “c” or “n” which can either turn on or turn off default capturing? Why do I have to specifically opt-out of capturing for every single group in my regex? Capturing is less performant, so I don’t understand why it’s the default behavior? It’s not that capturing is not useful, I just don’t understand why it’s default for the simpler ( ) syntax instead of the simpler and more performant (non-capturing) behavior.
2) Secondly, and more disturbingly, I find it frustrating that the designers of the .match(), .exec(), and especially the .split() API functions made a (IMHO) faulty assumption in how they specified the behavior with regards to the results returned.
Those three API functions assume that if the regular expression has one or more capture groups in it, that the return results should always include those matched group values in the results array. It’s true that this may sometimes be desired, but it’s not always desired, and in fact is sometimes quite more difficult to deal with. The API for these functions does not provide a way to use capture groups and NOT have the captured values returned in the results.
It’s true that in most cases, use of ( ) without intention to have capture can be achieved with (?: ) instead. However, there’s one particular case (which I’ve run up against now) that this API assumption of return result behavior is not appropriate and there’s no way to override it.
Imagine this scenario:
var str = "some 'g' good \"s\" stuff going on 'h' here";
var results = str.split(/(["']).\1/);
// want: ["some", "good", "stuff going on", "here"]
// get: ["some", "'", "good", "\"", "stuff going on", "'", "here"]
Notice that I’m needing to use a capture group for the \1 back-reference, but it’s definitely more confusing and difficult to deal with the extra values in the result array.
In this scenario, we see the problem with the assumption that use of a capture group implies desire to have captured values returned. Capture groups were originally designed for the use inside the regular expression by back-references, and in my scenario, that’s ALL I want the capture group to be used for — I do not want it to affect the results.
So, my question is, wouldn’t it make more sense for these API functions to have the result return behavior controlled explicitly with a specific parameter/flag rather than assumed by the regular expression structure?
For instance, str.split(/(["']).\1/, 0, false) would say “do not return me the capture groups”.
The default value for that parameter could still be true, so as to preserve existing behavior. But having it be controllable would make the API more sensible and flexible.
What do you think?
@Kyle Simpson, I’ve replied to the issues raised here on your blog post at http://bit.ly/auyJe8
Feedback on \u from a native speaker of a superset of US-ASCII:
Having a separate RegExp flag for the purposes of Unicode word characters and word boundaries: most excellent!
Rolling it up together with changed behaviour for \d and \D (use typically corresponding to code using parseInt, or maybe parseFloat): bad!
As long as these flags are regexp-global, that would impact use negatively the same way that you can’t have parts of a RegExp be case sensitive and other parts case insensitive, requiring the use of multiple RegExp passes and much additional flow logic to counteract bad underlying design.
While I’m sure there exist some case for (non-)Unicode-digit character groups (a subset of which, I guess, would be /[1234567890??????????]/, or /[0-9\u0660-\u0669] for matching Roman and Arabic-Indic digits) for text processing, layouting or formatting purposes, for more common use cases parseInt(‘1234567890’, 10) produces 1234567890 but parseInt(‘??????????’, 10) unforgivingly yields NaN.
Glad I added the \u escapes there; your blog engine’s or comment form processing neuters Unicode like ‘\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669’ to ASCII question marks. 🙂
I only skimmed it because after beginning to read it I wanted only a fraction of the stupid contained in this post.
The big one is about eschewing the global modifier method in lieu of just searching globally by default and not having another option. What a great way to artificially increase overhead! Moron.
One glaring omission from the JS regexp API is a way to retrieve the starting index of matching captured groups from the search string.
For example:
var result = /(?:a+)(?:b+)(c+)/.exec("aaabbccc");
result[1]; // ccc, but what index does it start at?
Java has Matcher#regionStart for this. Python has MatchObject.start(). JavaScript has … nothing.
Ahhh.. Thank you very much. Finally I know why all my scripts don’t work anymore.
var match;
while (match = /^(\d+) (\d+) (\d+) (\d+)$/mg.exec(data)) {
// endless loop in es5
// worked fine before
}