Crafting efficient regular expressions is somewhat of an art. In large part, it centers on controlling and minimizing backtracking and the number of steps the regex engine needs to match or fail. However, most engines implement different sets of internal optimizations (which can make certain operations faster, or avoid work via simplified pre-checks, skipped operations, etc.), so the topic also depends to a significant extent on the particular regex flavor and implementation you're using. Most developers aren't deeply aware of regex performance issues, so when they run into regular expressions that run slowly, their first thought is to remove the regexes. In reality, most non-trivial regexes I've seen could be significantly optimized for efficiency, and I've frequently seen dramatic improvements as a result of doing so.
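As a quick, contrived illustration of the backtracking point, compare two patterns that match exactly the same strings (exact timings of course vary by implementation):

var slow = /^(?:a+)+$/; // nested quantifier: exponential backtracking on failure
var fast = /^a+$/;      // matches the same strings, but fails in linear time
var str = "aaaaaaaaaaaaaaaaaaaaaaaaa!"; // 25 a's followed by a non-matching char
// fast.test(str) fails almost immediately; slow.test(str) forces the engine
// to try a huge number of ways to split up the a's before it can fail.
console.log(fast.test(str)); // false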
The best discussion of regex optimization I've seen is chapter six (a good 60 pages) of Mastering Regular Expressions, Third Edition by Jeffrey Friedl. Unfortunately, other good material on the subject is fairly scarce, so I was pleasantly surprised to stumble upon the recent article Optimizing regular expressions in Java by Cristian Mocanu. It is partly Java-specific, but for the most part it's a good article on the basics of optimization for traditional NFA regex engines. Check it out.
Have you seen any good articles or discussions about regular expression performance or efficiency optimization recently? Do you have any questions about the subject? Experience or pointers to share? Let me know. (I hope to eventually write up an in-depth article on JavaScript regex optimization, with lots of tips, techniques, and cross-browser benchmarks.)
I’ve got an awesome tip for you. Did you know you can use ?: to avoid capturing a backreference when one isn’t needed? LOL!
I’m just kidding. Obviously you know plenty enough as it is, but thanks to your elite regex wizardry (which I abbreviate to regexardry) I know enough about regular expressions to yell at the n00bs at work.
Hi Steve,
Great blog, just wondering if you could help. I have the regular expression for the bit I want to disregard; how do I keep the bit I want?
All I want to do is strip a URL of its anchor.
So I had var regexp = /\/#/;
but that's the bit I want to throw away! How do I keep the other bit? Surely it must be easily possible, like in PHP.
thanks in advance 🙂
great blog btw..
jon
Hi Jon, thanks!
Just replace the text you match with an empty string, and what’s left will be the part you want to keep, e.g.:
str = str.replace(/#.*/, "");
Or in this case you could also use
str = str.match(/^[^#]*/)[0];
However, for the case you described you don't really need to use a regex. Try this instead (checking the result of indexOf first, since it returns -1 when there is no "#"):
var hashIndex = str.indexOf("#");
if (hashIndex > -1) str = str.substring(0, hashIndex);
great, thanks a lot! Friday afternoon here in the UK would explain why I didn't see that one..
have a good weekend.
j
I figured this would be valuable to mention regarding scanning performance on large strings. It's an observation from my (almost ready to release) lex clone written in ECMAScript.
Because I wrote a lexical analysis routine in JS, there are potentially a lot of regex tests being run against a long input source, many of which won't match at all; nevertheless, the regex engine will scan the whole string.
The solution? Limit the size of the input to the first N characters of your input. If there’s no match, forget it. If there’s a match and its length is exactly equal to N, re-run the match against the full string (since you may have accidentally truncated it).
Even though this now means you may be regexing more times than without a limit, you are scanning less data each time.
This cut my execution time down massively. Lexing 11KB of JavaScript source originally took 1500ms on Safari 3 and 650ms on FF3. I tried all sorts of things and was on the verge of giving up (what use is a lexical analyzer that takes that long to execute?).
With input size limits imposed this came down to 120ms in Safari and 300ms in FF (the tables turned in terms of which browser was faster).
There are other optimizations I still need to find (string buffering & traversal), but that will be by far the biggest one I resolve 😉
This is the relevant code:
* (self is a copy of the reference to "this")
* self.In is the input source (it gets eaten from left to right, so the start is the important bit)
* The RegExp (re) has been optimized to have the caret "^" injected at the start
* _minInputSize is the optimization factor. I default to 32 chars of input.
——————————–
/** @private */
var _scanByRegExp = function _scanByRegExp(re) {
    var match = "";
    var matches;
    // Run the regex against only the first _minInputSize chars of the input
    if ((matches = re.exec(self.In.substring(0, _minInputSize)))
            && matches.index == 0) {
        match = matches[0];
        // FSA optimization check:
        // If the match ran right up to the truncation point, there may be
        // more of this token, so try again without the limit
        if (match.length == _minInputSize) {
            matches = re.exec(self.In);
            match = matches[0];
        }
    }
    return match;
};
————————————-
It can also help to “chew up” text that does not contain the matches inside the regExp itself.
For example a regExp that searches for non-word matches in plain text:
/(<|>|::|--|@|###)/
could be sped up by adding an additional chew-up blind-match:
/(<|>|::|--|@|###)|[^<>:\-@#]+/
The chew-up blind-match then has to be filtered out (e.g. as an empty match); see the sketch at the end of this comment.
This also works for matches containing words:
/\b(http:|ftp:|gopher:)/
becomes:
/\b(http:|ftp:|gopher:)|\b[^:]{7,}/
I had a gigantic regExp to parse wiki code for syntax highlighting, consisting of “|”-separated subexpressions for all existing wiki codes. Just by adding a chew-up expression I cut the execution time in half (the code is for the Wikipedia editor wikEd and I was using JavaScript under Firefox 3.5).
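Here's a minimal sketch (my own, with made-up input) of how the chew-up matches can be filtered out by testing whether the capturing group participated:

// The trailing chew-up alternative consumes runs of text that can't contain
// a token, so the engine skips them in one step instead of retrying every
// token alternative at each position.
var re = /(<|>|::|--|@|###)|[^<>:\-@#]+/g;
var text = "a -- b ### c"; // made-up input
var tokens = [];
var m;
while ((m = re.exec(text)) !== null) {
    if (m[1] !== undefined) { // group 1 is set only for real tokens
        tokens.push(m[1]);
    }
}
// tokens is now ["--", "###"]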
And what was the other good material on the subject?
Hi, found good content on the blog.
I am new to regex and have one query about it.
Suppose I have the expression (hello1)|(hello2)|(hello3).
Now when the regex matches, I want to know which subexpression matched; how can I get this information?
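One way to get at this in JavaScript, as a minimal sketch (capturing groups that did not participate in the match come back undefined, so testing them reveals which alternative matched):

var re = /(hello1)|(hello2)|(hello3)/;
var m = re.exec("say hello2 to me");
if (m) {
    if (m[1] !== undefined) console.log("first alternative matched");
    else if (m[2] !== undefined) console.log("second alternative matched");
    else console.log("third alternative matched");
}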
Hi, I need to optimise the following regex so that it executes faster. Can anyone help? Thanks
^([\d\w]{15}[\x01]\d{12}[\x01]\d{2}(.){6}((13((0[0-9]|([1-4][0-9])|5[0-9]))|14((0[0-9]|([1-2][0-9])|30)))[0-5][0-9])801(?:.*))$
I have a general regex optimization problem. It's unlikely anyone has a solution, but I wondered if anyone has attempted anything similar? I am posting here because I'm not sure where else to ask.
The problem can be stated thus:
Given a known set S of possible strings, construct the fastest-executing regex that matches only a specified subset of those strings. Note: it's okay for the regex to match strings that are outside of S; we only care about the behavior of the regex in the domain of S.
For example, for the set of strings {“fish”, “moose”, “fishes”, “cat”, “wishes”, “poultry”, “sardines”} what is the fastest executing regex that will match only {“fish”, “fishes”, “sardines”} but none of the others in the set?
For small sets a regex would probably not out-perform a direct check against the subset, but for filtering a large set of thousands of known possible strings it would probably be much faster to use a regex.
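As a minimal sketch of the naive baseline (my own example, assuming the strings contain no regex metacharacters; escape them otherwise), you can anchor an alternation built from the subset. The optimization being asked about would then amount to restructuring that alternation, e.g. factoring common prefixes into ^(?:fish(?:es)?|sardines)$ so the engine gives up sooner on non-matches:

var subset = ["fish", "fishes", "sardines"];
var re = new RegExp("^(?:" + subset.join("|") + ")$");
console.log(re.test("fishes"));   // true
console.log(re.test("sardines")); // true
console.log(re.test("moose"));    // false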