<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Regex Performance Optimization</title>
	<atom:link href="http://blog.stevenlevithan.com/archives/regex-optimization/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.stevenlevithan.com/archives/regex-optimization</link>
	<description>A JavaScript and regular expression centric blog</description>
	<lastBuildDate>Wed, 08 Sep 2010 23:20:23 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: asd0z</title>
		<link>http://blog.stevenlevithan.com/archives/regex-optimization/comment-page-1#comment-52434</link>
		<dc:creator>asd0z</dc:creator>
		<pubDate>Mon, 10 May 2010 13:03:42 +0000</pubDate>
		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-optimization#comment-52434</guid>
		<description>And what was the other good material on the subject?</description>
		<content:encoded><![CDATA[<p>And what was the other good material on the subject?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cacycle</title>
		<link>http://blog.stevenlevithan.com/archives/regex-optimization/comment-page-1#comment-39411</link>
		<dc:creator>Cacycle</dc:creator>
		<pubDate>Sat, 01 Aug 2009 23:18:39 +0000</pubDate>
		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-optimization#comment-39411</guid>
		<description>It can also help to &quot;chew up&quot; text that does not contain the matches inside the regExp itself.

For example a regExp that searches for non-word matches in plain text:

/(&lt;&#124;&gt;&#124;::&#124;--&#124;@&#124;###)/

could be sped up by adding an additional chew-up blind-match:

/(&lt;&#124;&gt;&#124;::&#124;--&#124;@&#124;###)&#124;[^&lt;&gt;:\-@#]+/

The chew-up blind-match then has to be filtered out (e.g. by skipping matches whose capture group is empty).

This also works for matches containing words:

/\b(http:&#124;ftp:&#124;gopher:)/

becomes:

/\b(http:&#124;ftp:&#124;gopher:)&#124;\b[^:]{7,}/

I had a gigantic regExp to parse wiki code for syntax highlighting, consisting of &quot;&#124;&quot;-separated subexpressions for all existing wiki codes. Just by adding a chew-up expression I cut the execution time in half (the code is for the Wikipedia editor wikEd and I was using JavaScript under Firefox 3.5).</description>
		<content:encoded><![CDATA[<p>It can also help to &#8220;chew up&#8221; text that does not contain the matches inside the regExp itself.</p>
<p>For example a regExp that searches for non-word matches in plain text:</p>
<p>/(&lt;|&gt;|::|--|@|###)/</p>
<p>could be sped up by adding an additional chew-up blind-match:</p>
<p>/(&lt;|&gt;|::|--|@|###)|[^&lt;&gt;:\-@#]+/</p>
<p>The chew-up blind-match then has to be filtered out (e.g. by skipping matches whose capture group is empty).</p>
<p>This also works for matches containing words:</p>
<p>/\b(http:|ftp:|gopher:)/</p>
<p>becomes:</p>
<p>/\b(http:|ftp:|gopher:)|\b[^:]{7,}/</p>
<p>I had a gigantic regExp to parse wiki code for syntax highlighting, consisting of &#8220;|&#8221;-separated subexpressions for all existing wiki codes. Just by adding a chew-up expression I cut the execution time in half (the code is for the Wikipedia editor wikEd and I was using JavaScript under Firefox 3.5).</p>
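<p>A minimal sketch of the chew-up idea applied to the first regExp above, assuming a global regex driven by an exec loop. The function name findTokens is hypothetical, and the hyphen in the negated class is escaped so that "--" is left for the first alternative instead of being consumed by the chew-up.</p>

```javascript
// Hypothetical helper illustrating the chew-up idea; the name is not from wikEd.
function findTokens(text) {
  // Group 1 holds a real token. The second alternative only consumes
  // runs of characters that cannot start any token.
  var re = /(<|>|::|--|@|###)|[^<>:\-@#]+/g;
  var tokens = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    // Chew-up matches leave group 1 unset, so they are skipped here.
    if (m[1] !== undefined) {
      tokens.push({ token: m[1], index: m.index });
    }
  }
  return tokens;
}
```

<p>Whether this is faster than the plain alternation depends on the engine and the input, so it is worth timing both versions on real data.</p>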
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris Corbyn</title>
		<link>http://blog.stevenlevithan.com/archives/regex-optimization/comment-page-1#comment-25531</link>
		<dc:creator>Chris Corbyn</dc:creator>
		<pubDate>Thu, 08 Jan 2009 11:39:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-optimization#comment-25531</guid>
		<description>I figured this would be valuable to mention regarding scanning performance on large strings.  It is an observation from my (almost ready to release) lex clone written in ECMAScript.

Because I&#039;m writing a lexical analysis routine in JS, a lot of regex tests are potentially run against a long input source. Many of them won&#039;t match at all, but the regexp engine will nevertheless scan the whole string.

The solution?  Limit the size of the input to the first N characters of your input.  If there&#039;s no match, forget it.  If there&#039;s a match and its length is exactly equal to N, re-run the match against the full string (since you may have accidentally truncated it).

Even though this now means you may be regexing more times than without a limit, you are scanning less data each time.

This cut my execution time down massively.  Lexing 11KB of JavaScript source took 1500ms on Safari 3 and 650ms on FF3.  I tried all sorts of things and was on the verge of giving up (what use is a lexical analyzer that&#039;s going to take so long to execute?).

With input size limits imposed this came down to 120ms in Safari and 300ms in FF (the tables turned in terms of which browser was faster).

There are other optimizations I need to find now (string buffering &amp; traversing), but this will be by far the biggest one I resolve ;)

This is the relevant code:
 * (self is a copy of the reference to &quot;this&quot;)
 * self.In is the input source (it gets eaten from left-to-right so the start is the important bit)
 * The RegExp (re) has been optimized to inject the caret &quot;^&quot; to the start
 * _minInputSize is the optimization factor.  I default to 32 chars of input.
--------------------------------
  /** @private */
  var _scanByRegExp = function _scanByRegExp(re) {
    var match = &#039;&#039;;
    var matches;
    if ((matches = re.exec(self.In.substring(0, _minInputSize)))
      &amp;&amp; matches.index == 0) {
      match = matches[0];
      
      //FSA optimization check:
      //If it looks like there&#039;s more of this token, try without the limit
      if (match.length == _minInputSize) {
        matches = re.exec(self.In);
        match = matches ? matches[0] : &#039;&#039;;
      }
    }
    return match;
  };
-------------------------------------</description>
		<content:encoded><![CDATA[<p>I figured this would be valuable to mention regarding scanning performance on large strings.  It is an observation from my (almost ready to release) lex clone written in ECMAScript.</p>
<p>Because I&#8217;m writing a lexical analysis routine in JS, a lot of regex tests are potentially run against a long input source. Many of them won&#8217;t match at all, but the regexp engine will nevertheless scan the whole string.</p>
<p>The solution?  Limit the size of the input to the first N characters of your input.  If there&#8217;s no match, forget it.  If there&#8217;s a match and its length is exactly equal to N, re-run the match against the full string (since you may have accidentally truncated it).</p>
<p>Even though this now means you may be regexing more times than without a limit, you are scanning less data each time.</p>
<p>This cut my execution time down massively.  Lexing 11KB of JavaScript source took 1500ms on Safari 3 and 650ms on FF3.  I tried all sorts of things and was on the verge of giving up (what use is a lexical analyzer that&#8217;s going to take so long to execute?).</p>
<p>With input size limits imposed this came down to 120ms in Safari and 300ms in FF (the tables turned in terms of which browser was faster).</p>
<p>There are other optimizations I need to find now (string buffering &amp; traversing), but this will be by far the biggest one I resolve ;)</p>
<p>This is the relevant code:<br />
 * (self is a copy of the reference to &#8220;this&#8221;)<br />
 * self.In is the input source (it gets eaten from left-to-right so the start is the important bit)<br />
 * The RegExp (re) has been optimized to inject the caret &#8220;^&#8221; to the start<br />
 * _minInputSize is the optimization factor.  I default to 32 chars of input.<br />
--------------------------------<br />
  /** @private */<br />
  var _scanByRegExp = function _scanByRegExp(re) {<br />
    var match = '';<br />
    var matches;<br />
    if ((matches = re.exec(self.In.substring(0, _minInputSize)))<br />
      &amp;&amp; matches.index == 0) {<br />
      match = matches[0];</p>
<p>      //FSA optimization check:<br />
      //If it looks like there&#8217;s more of this token, try without the limit<br />
      if (match.length == _minInputSize) {<br />
        matches = re.exec(self.In);<br />
        match = matches ? matches[0] : '';<br />
      }<br />
    }<br />
    return match;<br />
  };<br />
-------------------------------------</p>
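<p>For readers who want to try the idea outside the lexer, here is a standalone sketch of the same prefix-limited scan. The function name and the input and limit parameters are hypothetical stand-ins for self.In and _minInputSize. It assumes re is anchored with ^ and has no g or y flag, and it keeps the rule described above that a token which fails on the prefix is given up on, so tokens that can only match with more than limit characters are missed.</p>

```javascript
// Hypothetical standalone variant of the scan; not the lex clone's actual API.
function scanByRegExp(re, input, limit) {
  // First try only the leading `limit` characters of the input.
  var matches = re.exec(input.substring(0, limit));
  if (matches === null) {
    return '';
  }
  var match = matches[0];
  // A match that fills the whole prefix may have been cut off,
  // so retry against the full input.
  if (match.length === limit) {
    matches = re.exec(input);
    // If the retry fails (e.g. a trailing \b no longer holds),
    // the token is not really there, so report no match.
    match = matches === null ? '' : matches[0];
  }
  return match;
}
```

<p>For example, scanByRegExp(/^[a-z]+/, "abcdefghij 123", 4) first matches "abcd" in the prefix, sees that the match is as long as the limit, and re-runs on the full input to get "abcdefghij".</p>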
]]></content:encoded>
	</item>
	<item>
		<title>By: jon</title>
		<link>http://blog.stevenlevithan.com/archives/regex-optimization/comment-page-1#comment-5065</link>
		<dc:creator>jon</dc:creator>
		<pubDate>Fri, 05 Oct 2007 16:13:59 +0000</pubDate>
		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-optimization#comment-5065</guid>
		<description>great thanks a lot! Friday afternoon here in UK, would explain why i didn&#039;t see that one..

have a good weekend.
j</description>
		<content:encoded><![CDATA[<p>great thanks a lot! Friday afternoon here in UK, would explain why i didn&#8217;t see that one..</p>
<p>have a good weekend.<br />
j</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://blog.stevenlevithan.com/archives/regex-optimization/comment-page-1#comment-5064</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Fri, 05 Oct 2007 15:18:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.stevenlevithan.com/archives/regex-optimization#comment-5064</guid>
		<description>Hi Jon, thanks!

Just replace the text you match with an empty string, and what&#039;s left will be the part you want to keep, e.g.:

&lt;code&gt;str = str.replace(/#.*/, &quot;&quot;);&lt;/code&gt;

Or in this case you could also use &lt;code&gt;str = str.match(/^[^#]*/)[0];&lt;/code&gt;

However, for the case you described you don&#039;t really need to use a regex. Try this instead:

&lt;code&gt;str = str.substring(0, str.indexOf(&quot;#&quot;));&lt;/code&gt;</description>
		<content:encoded><![CDATA[<p>Hi Jon, thanks!</p>
<p>Just replace the text you match with an empty string, and what&#8217;s left will be the part you want to keep, e.g.:</p>
<p><code>str = str.replace(/#.*/, "");</code></p>
<p>Or in this case you could also use <code>str = str.match(/^[^#]*/)[0];</code></p>
<p>However, for the case you described you don&#8217;t really need to use a regex. Try this instead:</p>
<p><code>str = str.substring(0, str.indexOf("#"));</code></p>
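<p>One caveat with the indexOf version: when the string contains no "#", indexOf returns -1 and substring(0, -1) yields an empty string, whereas the two regex versions leave the string unchanged. A small sketch of a guarded variant follows; the function name stripFragment is hypothetical.</p>

```javascript
// Hypothetical helper; keeps the whole string when there is no "#".
function stripFragment(str) {
  var hashIndex = str.indexOf("#");
  // indexOf returns -1 when "#" is absent, so check before slicing.
  return hashIndex === -1 ? str : str.substring(0, hashIndex);
}
```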
]]></content:encoded>
	</item>
</channel>
</rss>
