RegexPal Now Open Source

RegexPal (easily the most del.icio.used regex tester wink) is now released under the Creative Commons Attribution-Share Alike 3.0 License GNU LGPL.

There are certainly many more features that can be added to the app and things that can be improved, so if you are interested in helping out or creating your own version, you are welcome to do so. If there is interest I'll create a Google Code project, but for now there is a package you can download which includes all files for the regexpal.com website. Two of the files in the package (xregexp.js and helpers.js) are dual-licensed under the MIT License.

If you're only interested in the JavaScript, you can see the three source files here:

For regex aficionados particularly, there is some stuff here you might find interesting, including the latest, as-yet-unreleased version of my XRegExp library, and the regex syntax parser used for RegexPal's syntax highlighting (which includes lots of details on the minutiae of regex syntax and cross-browser regex handling).

Automatic HTML Summary / Teaser

When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. Recently on ColdFusion developer Ben Nadel's blog, he tackled the problem of closing XHTML tags in a truncated string using ColdFusion and it's underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version, and added some additional functionality. Specifically, the following code additionally truncates the string for you (based on a user-specified number of characters), and in the process only counts text outside of HTML tags towards the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.

function getLeadingHtml (input, maxChars) {
	// token matches a word, tag, or special character
	var	token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*(\/)?>|</g,
		selfClosingTag = /^(?:[hb]r|img)$/i,
		output = "",
		charCount = 0,
		openTags = [],
		match;

	// Set the default for the max number of characters
	// (only counts characters outside of HTML tags)
	maxChars = maxChars || 250;

	while ((charCount < maxChars) && (match = token.exec(input))) {
		// If this is an HTML tag
		if (match[2]) {
			output += match[0];
			// If this is not a self-closing tag
			if (!(match[3] || selfClosingTag.test(match[2]))) {
				// If this is a closing tag
				if (match[1]) openTags.pop();
				else openTags.push(match[2]);
			}
		} else {
			charCount += match[0].length;
			if (charCount <= maxChars) output += match[0];
		}
	}

	// Close any tags which were left open
	var i = openTags.length;
	while (i--) output += "</" + openTags[i] + ">";
	
	return output;
};

This is all pretty straightforward stuff, but I figured I might as well pass it on.

Here's an example of the output:

var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
var output = getLeadingHtml(input, 40);

/* Output:
<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>
*/

Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):

^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?

That should work the .NET, Java, and JGsoft regex engines. In won't work in most others because of the {0,99} in the lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind, so with those two you could replace the {0,99} quantifier with *. Since the .NET and JGsoft engines additionally support lookaround-based conditionals, you could save two more characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>).

If you want to mimic the lookbehind in JavaScript, you could use the following:

// JavaScript doesn't include a native reverse method for strings,
// so we need to create one
String.prototype.reverse = function() {
	return this.split("").reverse().join("");
};
// Mimic the regex /^[\S\s]{1,100}(?:(?<=<[^>]*)[^>]*>)?/ through
// node-by-node reversal
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();

When innerHTML isn’t Fast Enough

This post isn't about the pros and cons of innerHTML vs. W3C DOM methods. That has been hashed and rehashed elsewhere. Instead, I'll show how you can combine the use of innerHTML and DOM methods to make your code potentially hundreds of times faster than innerHTML on its own, when working with large numbers of elements.

In some browsers (most notably, Firefox), although innerHTML is generally much faster than DOM methods, it spends a disproportionate amount of time clearing out existing elements vs. creating new ones. Knowing this, we can combine the speed of destroying elements by removing their parent using the standard DOM methods with creating new elements using innerHTML. (This technique is something I discovered during the development of RegexPal, and is one of its two main performance optimizations. The other is one-shot markup generation for match highlighting, which avoids needing to loop over matches or reference them individually.)

The code:

function replaceHtml(el, html) {
	var oldEl = typeof el === "string" ? document.getElementById(el) : el;
	/*@cc_on // Pure innerHTML is slightly faster in IE
		oldEl.innerHTML = html;
		return oldEl;
	@*/
	var newEl = oldEl.cloneNode(false);
	newEl.innerHTML = html;
	oldEl.parentNode.replaceChild(newEl, oldEl);
	/* Since we just removed the old element from the DOM, return a reference
	to the new element, which can be used to restore variable references. */
	return newEl;
};

You can use the above as el = replaceHtml(el, newHtml) instead of el.innerHTML = newHtml.

innerHTML is already pretty fast...is this really warranted?

That depends on how many elements you're overwriting. In RegexPal, every keydown event potentially triggers the destruction and creation of thousands of elements (in order to make the syntax and match highlighting work). In such cases, the above approach has enormous positive impact. Even something as simple as el.innerHTML += str or el.innerHTML = "" could be a performance disaster if the element you're updating happens to have a few thousand children.

I've created a page which allows you to easily test the performance difference of innerHTML and my replaceHtml function with various numbers of elements. Make sure to try it out in a few browsers for comparison. Following are a couple examples of typical results from Firefox 2.0.0.6 on my system:

1000 elements...
innerHTML (destroy only): 156ms
innerHTML (create only): 15ms
innerHTML (destroy & create): 172ms
replaceHtml (destroy only): 0ms (faster)
replaceHtml (create only): 15ms (~ same speed)
replaceHtml (destroy & create): 15ms (11.5x faster)

15000 elements...
innerHTML (destroy only): 14703ms
innerHTML (create only): 250ms
innerHTML (destroy & create): 14922ms
replaceHtml (destroy only): 31ms (474.3x faster)
replaceHtml (create only): 250ms (~ same speed)
replaceHtml (destroy & create): 297ms (50.2x faster)

I think the numbers speak for themselves. Comparable performance improvements can also be seen in Safari. In Opera, replaceHtml is still typically faster than innerHTML, but by a narrower margin. In IE, simple use of innerHTML is typically faster than mixing it with DOM methods, but not by nearly the same kinds of margins as you can see above. Nevertheless, IE's conditional compilation feature is used to avoid the relatively minor performance penalty, by just using innerHTML with that browser.

RegexPal: Web-Based Regex Testing Reinvented

Yes I know, there are many other JavaScript regex testers available. Why did I create yet another? RegexPal brings several new things to the table for such web-based apps, and in my (biased) opinion it's easier to use and more helpful towards learning regular expressions than the others currently available. Additionally, most other such tools are very slow for the kind of data I often work with. They might appear fast when displaying 10 matches, but what about 100, 1000, or 5000? Try generating 5,000 matches (which is easy to do with an any-character pattern such as a dot) in your favorite existing web-based tool and see if your browser ever recovers (doubtful). The same task takes RegexPal less than half a second, and what's more, results overlay the text while you're typing it.

At the moment, RegexPal is short on features, but here are the highlights:

  • Real-time regex syntax highlighting with backwards and forwards context awareness.
  • Lightning-fast match highlighting with alternating styles.
  • Inverted matches (match any text not matched by the regex).
regexpal.com screenshot

I'm not sure when I'll add additional features, but there are lots of things I'm considering. If there is something you'd like to see, let me know.

A few things to be aware of:

  • The approach I've used for scrollable rich-text editing (which I haven't seen elsewhere) is fast but a bit buggy. Firefox 2 and IE7 have the least issues, but it more or less works in other browsers as well.
  • The syntax highlighting generally marks corner-case issues that create cross-browser inconsistencies as errors even if they are the result of browser bugs or missing behavior documentation in ECMA-262 v3.
  • There are different forms of line breaks cross-platform/browser. E.g., Firefox uses \n even on Windows where nearly all programs use \r\n. This can affect the results of certain regexes.

At least for me, RegexPal is lots of fun to play with and helps to make learning regular expressions easy through its instant feedback. I encourage you to just go play with it and discover its results on your own, but for the curious, I'll keep rambling…

Regex syntax parsing (needed for the syntax highlighting) is somewhat complex, due to the numerous backwards and forwards context awareness issues involved. Take, for example, the pattern \10. What does it mean?

  • Backreference 10, if not inside a character class and at least 10 capturing groups are opened before that point.
  • Backreference 1, followed by a literal "0", if not inside a character class and between 1 and 9 capturing groups are opened before that point.
  • Octal character index 10 (decimal 8), if inside a character class, or if no capturing groups are opened before that point.
  • The three literal characters "\", "1", and "0", if preceded by an unescaped "\" character.
  • An incomplete token in a couple other situations.

Another example is the "-" character. Outside a character class it's always a literal hyphen, but inside a character class…

  • It creates a range between tokens if:
    • There is a preceding and following token in the class, or it's preceded by a token and is the last character in an unclosed character class (caveats follow).
  • It's a literal character if:
    • It's the first or last character in the class.
    • It's preceded by an unescaped "\".
    • It follows a token which is the end index for a range.
    • It follows a hyphen which creates a range.
  • It's an error if:
    • It's creating a range between tokens in reverse character index order (e.g., z-a, @-!, \uFFFF-\b, or \127-\cB).
    • It would otherwise create a range, but it's followed or preceded by a token which represents more than one character index (e.g., \d). In fact, in some cases browsers take this to mean that the hyphen should be treated as a literal, but browser bugs cause it to be handled inconsistently so RegexPal flags it as a range error.

Here are a few more things which aren't errors but are flagged as such:

  • Empty, top-level alternation, except at the end of the pattern, where such an alternation is ignored when highlighting matches in order to create a less surprising experience while the user is in the middle of constructing the regex. Empty, top-level alternation is flagged as an error because it effectively truncates the regex at that point (since it will always match). If a zero-length, top-level alteration is really needed, there are other easy ways to do that more explicitly.
  • Lookaround quantifiers (e.g., the plus sign in (?!x)+). This would be an actual error with some regex libraries (e.g., PCRE), and although that's not the case in most web browsers, such constructs add no value. As a result, RegexPal flags such quantifiers as an error, since they are almost certainly a user mistake.
  • \c when not followed by A–Z, \x when not followed by two hex characters, and \u when not followed by four hex characters. Although these do not cause most browsers to throw errors, they are handled inconsistently cross-browser and are hence flagged as errors. They would almost certainly be a user mistake even if the cross-browser issues didn't exist.

Credit to osteele.com from where the text of the short-and-sweet Quick Reference is based, and to RegexBuddy from JGsoft for inspiring many of RegexPal's features. The name RegexPal is, in part, a nod to RegexBuddy, but also selected because it contains both "regex" and "regexp." wink

Non-Participating Groups: A Cross-Browser Mess

Cross-browser issues surrounding the handling of regular expression non-participating capturing groups (which I'll call NPCGs) present several challenges. The standard sucks to begin with, and the three biggest browsers (IE, Firefox, Safari) each disrespect the rules in their own unique ways.

First, I should explain what NPCGs are, as it seems that even some experienced regex users aren't fully aware of or understand the concept. Assuming you're already familiar with the idea of capturing and non-capturing parentheses (see this page if you need a refresher), note that NPCGs are different from groups which capture a zero-length value (i.e., an empty string). This is probably easiest to explain by showing some examples…

The following regexes all potentially contain NPCGs (depending on the data they are run over), because the capturing groups are not required to participate:

  • /(x)?/
  • /(x)*/
  • /(x){0,2}/
  • /(x)|(y)/ — If this matches, it's guaranteed to contain exactly one NPCG.
  • /(?!(x))/ — If this matches (which, on its own, it will at least at the end of the string), it's guaranteed to contain an NPCG, because the pattern only succeeds if the match of "x" fails.
  • /()??/ — This is guaranteed to match within any string and contain an NPCG, because of the use of a lazy ?? quantifier on a capturing group for a zero-length value.

On the other hand, these will never contain an NPCG, because although they are allowed to match a zero-length value, the capturing groups are required to participate:

  • /(x?)/
  • /(x*)/
  • /(x{0,2})/
  • /((?:xx)?)/ –or– /(xx|)/ — These two are equivalent.
  • /()?/ –or– /(x?)?/ — These are not required to participate, but their greedy ? quantifiers ensure they will always succeed at capturing at least an empty string.

So, what's the difference between an NPCG and a group which captures an empty string? I guess that's up to the regex library, but typically, backreferences to NPCGs are assigned a special null or undefined value.

Following are the ECMA-262v3 rules (paraphrased) for how NPCGs should be handled in JavaScript:

  • Within a regex, backreferences to NPCGs match an empty string (i.e., the backreferences always succeed). This is unfortunate, since it prevents some fancy patterns which would otherwise be possible (e.g., see my method for mimicking conditionals), and it's atypical compared to many other regular expression engines including Perl 5 (which ECMA-standard regular expressions are supposedly based on), PCRE, .NET, Java, Python, Ruby, JGsoft, and others.
  • Within a replacement string, backreferences to NPCGs produce an empty string (i.e., nothing). Unlike the previous point, this is typical elsewhere, and allows you to use a regex like /a(b)|c(d)/ and replace it with "$1$2" without having to worry about null pointers or errors about non-participating groups.
  • In the result arrays from RegExp.prototype.exec, String.prototype.match (when used with a non-global regex), String.prototype.split, and the arguments available to callback functions with String.prototype.replace, NPCGs return undefined. This is a very logical approach.

References: ECMA-262v3 sectons 15.5.4.11, 15.5.4.14, 15.10.2.1, 15.10.2.3, 15.10.2.8, 15.10.2.9.

Unfortunately, actual browser handling of NPCGs is all over the place, resulting in numerous cross-browser differences which can easily result in subtle (or not so subtle) bugs in your code if you don't know what you're doing. E.g., Firefox incorrectly uses an empty string with the replace() and split() methods, but correctly uses undefined with the exec() method. Conversely, IE correctly uses undefined with the replace() method, incorrectly uses an empty string with the exec() method, and incorrectly returns neither with the split() method since it doesn't splice backreferences into the resulting array. As for the handling of backreferences to non-participating groups within regexes (e.g., /(x)?\1y/.test("y")), Safari uses the more sensible, non-ECMA-compliant approach (returning false for the previous bit of code), while IE, Firefox, and Opera follow the standard. (If you use /(x?)\1y/.test("y") instead, all four browsers will correctly return true.)

Several times I've seen people encounter these differences and diagnose them incorrectly, not having understood the root cause. A recent instance is what prompted this writeup.

Here are cross-browser results from each of the regex and regex-using methods when NPCGs have an impact on the outcome:

Code ECMA-262v3 IE 5.5 – 7 Firefox 2.0.0.6 Opera 9.23 Safari 3.0.3
/(x)?\1y/.test("y") true true true true false
/(x)?\1y/.exec("y") ["y", undefined] ["y", ""] ["y", undefined] ["y", undefined] null
/(x)?y/.exec("y") ["y", undefined] ["y", ""] ["y", undefined] ["y", undefined] ["y", undefined]
"y".match(/(x)?\1y/) ["y", undefined] ["y", ""] ["y", undefined] ["y", undefined] null
"y".match(/(x)?y/) ["y", undefined] ["y", ""] ["y", undefined] ["y", undefined] ["y", undefined]
"y".match(/(x)?\1y/g) ["y"] ["y"] ["y"] ["y"] null
"y".split(/(x)?\1y/) ["", undefined, ""] [ ] ["", "", ""] ["", undefined, ""] ["y"]
"y".split(/(x)?y/) ["", undefined, ""] [ ] ["", "", ""] ["", undefined, ""] ["", ""]
"y".search(/(x)?\1y/) 0 0 0 0 -1
"y".replace(/(x)?\1y/, "z") "z" "z" "z" "z" "y"
"y".replace(/(x)?y/, "$1") "" "" "" "" ""
"y".replace(/(x)?\1y/,
    function($0, $1){
        return String($1);
    })
"undefined" "undefined" "" "undefined" "y"
"y".replace(/(x)?y/,
    function($0, $1){
        return String($1);
    })
"undefined" "undefined" "" "undefined" ""
"y".replace(/(x)?y/,
    function($0, $1){
        return $1;
    })
"undefined" "" "" "undefined" ""

(Run the tests in your browser.)

The workaround for this mess is to avoid creating any potential for non-participating capturing groups, unless you know exactly what you're doing. Although that shouldn't be necessary, NPCGs are usually easy to avoid anyway. See the examples near the top of this post.

Edit (2007-08-16): I've updated this post with data from the newest versions of the listed browsers. The original data contained a few false negatives for Opera and Safari which resulted from a faulty library used to generate the results.