ECMAScript 3 Regular Expressions are Defective by Design

ECMAScript 3 has some major regex design flaws, and if nothing changes, the ES4 group will propagate some of those mistakes into ECMAScript 4 (aka JavaScript 2).

Recently, longtime JavaScript regex guru David "liorean" Andersson wrote a couple of posts about my biggest gripes with the ECMAScript 3 regex flavor, namely the way backreferences are handled for non-participating capturing groups and within quantified subpatterns (see ECMAScript 3 Regular Expressions: A specification that doesn't make sense and A quick JS quiz for anybody who think they know regex). I'll avoid rehashing the points here, since I think David already articulated the problems well. For the record, I had planned to submit these issues as ECMAScript 4 bugs, but I already have a number of ES4 regex tickets open and was waiting to see their outcome before submitting more.
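For readers who haven't run into these two behaviors, here is a minimal sketch of each. The patterns and variable names are my own illustrations rather than examples taken from David's posts, and the comments about Perl-style flavors describe the conventional handling as I understand it.

```javascript
// 1. Backreference to a capturing group that did not participate in the match.
//    Group 1 is skipped because the "b" alternative is taken. Under ES3 rules,
//    \1 then matches the empty string, so the overall match succeeds.
//    In Perl-style flavors, \1 would fail to match here.
const nonParticipating = /^(?:(a)|b)\1$/.test("b");
console.log(nonParticipating); // true

// 2. Capturing group inside a quantified subpattern.
//    ES3 resets the enclosed captures at the start of every iteration.
//    The final iteration matches "b", so group 1 ends up undefined,
//    where Perl-style flavors would keep "a" from the earlier iteration.
const quantified = /(?:(a)|b)+/.exec("ab");
console.log(quantified[0]); // "ab"
console.log(quantified[1]); // undefined
```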

Another historically problematic issue is that, according to ES3, a regex literal causes only one object to be created at runtime for a script or function. This most often shows up when a regex literal using the /g modifier does not have its lastIndex property reset in cases where most developers would expect it to be. Fortunately, this is already planned to be fixed in ES4. The fact that it has been the third most duplicated Firefox bug report undoubtedly had something to do with this decision.
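The symptom is easiest to see with a single regex object that is deliberately reused, which is effectively what a shared ES3 literal amounts to. This is only a sketch of the lastIndex mechanics, with variable names of my own choosing; engines that create a fresh object each time a literal is evaluated won't show the problem for literals, but will still behave this way for any /g regex stored in a variable.

```javascript
// One regex object with /g, reused across calls.
const re = /foo/g;

const first = re.test("foo");          // matches at index 0
const indexAfterFirst = re.lastIndex;  // 3: the next search starts after the match
const second = re.test("foo");         // search starts at index 3 and fails
const indexAfterSecond = re.lastIndex; // 0: a failed match resets lastIndex
const third = re.test("foo");          // matches again

console.log(first, second, third); // true false true
```

Resetting `re.lastIndex = 0` before each call, or dropping /g when only a yes/no answer is needed, avoids the alternating results.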

But getting back to my original rant, although the backreference handling issues might be less visible to some developers than having their regex objects' lastIndex properties seemingly out of whack, they are no more sensible or in line with developer expectations. Additionally, the ES3 handling for these issues is incompatible with other modern regex libraries, and far less useful than the alternative handling (see e.g. Mimicking Conditionals and Capturing Multiple, Optional HTML Attribute Values for a couple examples of where the conventional, Perl-style handling could be put to good use).

As a related rant, IMHO the ECMAScript 4 regex extension proposals miss some opportunities for key feature additions. Here's what ES4 regexes add, along with a few compatibility-related changes and the ability for regex literals to span multiple lines:

Character class set operations — intersection and subtraction, with syntax inspired by java.util.regex.
(?#…) comment patterns.
Named capture — though it seems this wasn't fully thought out. However, it looks like the TG1 group might be willing to change the syntax from that proposed in the draft spec to the more common .NET/Perl syntax, which would be an improvement.
The /y (sticky) modifier — similar to several other libraries' use of \G.
The /x (extended) modifier — for free-spacing and comments.
Unicode character properties — but there's no support for Unicode scripts or blocks, and no \X metasequence to match a single grapheme, which means you'll have to use \P{M}\p{M}*.
Support for hex character codes outside Unicode's Basic Multilingual Plane — via \x{n…} and \u{n…}, which are equivalent.
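To make a few of these concrete, here is a rough sketch of sticky matching, the \P{M}\p{M}* grapheme workaround, and a code point escape beyond the Basic Multilingual Plane. One loud assumption: the code uses the syntax that JavaScript engines eventually shipped (the /y and /u flags, \p{…} property escapes, and \u{…}), which is not necessarily identical to what the ES4 draft proposed. The sample strings are my own.

```javascript
// Sticky matching: with /y, a match must begin exactly at lastIndex.
const sticky = /foo/y;
sticky.lastIndex = 3;
const matchesAtThree = sticky.test("barfoo"); // "foo" starts at index 3
sticky.lastIndex = 0;
const matchesAtZero = sticky.test("barfoo");  // "bar" is at index 0, so no match

// Approximating \X: one non-mark character followed by any combining marks.
// "e\u0301" is "e" plus a combining acute accent.
const graphemes = "e\u0301a".match(/\P{M}\p{M}*/gu);

// A character outside the Basic Multilingual Plane (U+1D11E, a musical symbol).
const astral = /^\u{1D11E}$/u.test("\u{1D11E}");

console.log(matchesAtThree, matchesAtZero); // true false
console.log(graphemes.length);              // 2
console.log(astral);                        // true
```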

For a description of these features, see the ES4 wiki, but note that many of the finer details of how they'll work are not mentioned, or are being discussed elsewhere, such as on the es4-discuss@mozilla.org mailing list (external archive here) or within the ECMAScript 4 issue database.

Aside from a few details of their currently proposed implementation (which for the most part I've already brought up elsewhere), I think these additions are great. To be honest though, if I could trade all of the ES4 regex extensions for atomic groups and lookbehind, I would. And while it's understandable that different people have different priorities, the lack of atomic groups in particular is a significant omission considering their potentially dramatic performance-enhancing power combined with their minimal implementation burden. Additional features found either in Perl or other Perl-derivative regex libraries which could be quite useful include possessive quantifiers, backtracking control verbs, mode modifiers and mode-modified spans, conditionals, \A and \z assertions, callouts, relative backreferences, recursion, subpatterns as subroutines, match point resetting (via \K), duplicate subpattern numbers (?|…), subpattern definitions (?(DEFINE)…), partial matching, backwards matching, etc.
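To show what an atomic group buys you, here is a small sketch that imitates one with a capturing lookahead followed by a backreference. This is a workaround rather than real atomic group syntax, and the patterns are illustrative only. It relies on the fact that a lookahead is not backtracked into once it has matched, so whatever it captured is locked in.

```javascript
// Ordinary greedy quantifier: a+ first takes all three characters, then
// gives one back so the trailing "a" can match.
const plain = /a+a/.test("aaa");

// Atomic-style version, standing in for (?>a+)a in flavors that support it.
// The lookahead captures the longest run of "a", \1 consumes exactly that
// run, and nothing can be given back, so the trailing "a" never matches.
const atomicLike = /(?=(a+))\1a/.test("aaa");

console.log(plain);      // true
console.log(atomicLike); // false
```

Refusing to give characters back is also what lets atomic groups fail fast on patterns that would otherwise backtrack heavily.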

Since the ECMA TG1 group has stated that they're no longer accepting major spec proposals, I expect the additions will be limited to those already proposed. However, I'm hopeful that the situation will be improved, at least by refining some of the existing ES3 features and ES4 proposals. Since I love both JavaScript and regular expressions, I'd love to see them come together in a way that rivals the best regex libraries. Perhaps ECMAScript could even introduce a little innovation in the space.

7 thoughts on “ECMAScript 3 Regular Expressions are Defective by Design”

liorean says:

November 27, 2007 at 4:46 pm

Just a note: what you call “the ECMAScript 4 discussion forum” is in fact not a forum, but an external archive of the es4-discuss@mozilla.org mailing list. For the actual mailing list and official archives, see https://mail.mozilla.org/listinfo/es4-discuss .
Steve says:

November 27, 2007 at 9:11 pm

Thanks for the clarification. I’ve edited the post to correct that. Incidentally, while doing so I also toned down a lot of the previous, undue negativity towards the ES4 regex extension proposals.
