Regular Expressions — Flagrant Badassery

RegexPal Now Open Source

RegexPal (easily the most del.icio.used regex tester wink ) is now released under the ~~Creative Commons Attribution-Share Alike 3.0 License~~ GNU LGPL.

There are certainly many more features that can be added to the app and things that can be improved, so if you are interested in helping out or creating your own version, you are welcome to do so. If there is interest I'll create a Google Code project, but for now ~~there is a package you can download which includes all files for the regexpal.com website~~. Two of the files in the package (xregexp.js and helpers.js) are dual-licensed under the MIT License.

If you're only interested in the JavaScript, you can see the three source files here:

For regex aficionados particularly, there is some stuff here you might find interesting, including the latest, as-yet-unreleased version of my XRegExp library, and the regex syntax parser used for RegexPal's syntax highlighting (which includes lots of details on the minutiae of regex syntax and cross-browser regex handling).

Regex Performance Optimization

Crafting efficient regular expressions is somewhat of an art. In large part, it centers around controlling/minimizing backtracking and the number of steps it takes the regex engine to match or fail, but the fact that most engines implement different sets of internal optimizations (which can either make certain operations faster, or avoid work by performing simplified pre-checks or skipping unnecessary operations, etc.) also makes the topic dependent on the particular regex flavor and implementation you're using to a significant extent. Most developers aren't deeply aware of regex performance issues, so when they run into problems with regular expressions running slowly, their first thought is to remove the regexes. In reality, most non-trivial regexes I've seen could be significantly optimized for efficiency, and I've frequently seen dramatic improvements as a result of doing so.

The best discussion of regex optimization I've seen is chapter six (a good 60 pages) of Mastering Regular Expressions, Third Edition by Jeffrey Friedl. Unfortunately, other good material on the subject is fairly scarce, so I was pleasantly surprised to stumble upon the recent article Optimizing regular expressions in Java by Cristian Mocanu. Of course, it is in part Java specific, but for the most part it is a good article on the basics of optimization for traditional NFA regex engines. Check it out.

Have you seen any good articles or discussions about regular expression performance or efficiency optimization recently? Do you have any questions about the subject? Experience or pointers to share? Let me know. (I hope to eventually write up an in-depth article on JavaScript regex optimization, with lots of tips, techniques, and cross-browser benchmarks.)

Automatic HTML Summary / Teaser

When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. Recently on ColdFusion developer Ben Nadel's blog, he tackled the problem of closing XHTML tags in a truncated string using ColdFusion and it's underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version, and added some additional functionality. Specifically, the following code additionally truncates the string for you (based on a user-specified number of characters), and in the process only counts text outside of HTML tags towards the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.

function getLeadingHtml (input, maxChars) {
	// token matches a word, tag, or special character
	var	token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*(\/)?>|</g,
		selfClosingTag = /^(?:[hb]r|img)$/i,
		output = "",
		charCount = 0,
		openTags = [],
		match;

	// Set the default for the max number of characters
	// (only counts characters outside of HTML tags)
	maxChars = maxChars || 250;

	while ((charCount < maxChars) && (match = token.exec(input))) {
		// If this is an HTML tag
		if (match[2]) {
			output += match[0];
			// If this is not a self-closing tag
			if (!(match[3] || selfClosingTag.test(match[2]))) {
				// If this is a closing tag
				if (match[1]) openTags.pop();
				else openTags.push(match[2]);
			}
		} else {
			charCount += match[0].length;
			if (charCount <= maxChars) output += match[0];
		}
	}

	// Close any tags which were left open
	var i = openTags.length;
	while (i--) output += "</" + openTags[i] + ">";
	
	return output;
};

This is all pretty straightforward stuff, but I figured I might as well pass it on.

Here's an example of the output:

var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
var output = getLeadingHtml(input, 40);

/* Output:
<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>
*/

Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):

^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?

That should work the .NET, Java, and JGsoft regex engines. In won't work in most others because of the {0,99} in the lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind, so with those two you could replace the {0,99} quantifier with *. Since the .NET and JGsoft engines additionally support lookaround-based conditionals, you could save two more characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>).

If you want to mimic the lookbehind in JavaScript, you could use the following:

// JavaScript doesn't include a native reverse method for strings,
// so we need to create one
String.prototype.reverse = function() {
	return this.split("").reverse().join("");
};
// Mimic the regex /^[\S\s]{1,100}(?:(?<=<[^>]*)[^>]*>)?/ through
// node-by-node reversal
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();

Automatic Regex Generation

Has it ever been done before? I'm interested in building an app (although I will possibly never get to it) which can automatically generate relatively-basic regular expressions by simply having the user enter some test data and mark the parts which the regex should match (all other text would serve as examples of what the regex should not match). By "relatively basic," I mean that the regexes should at least be capable of using literal characters, character classes, shorthand character classes like \d and \s, greedy quantification, alternation, grouping, and word boundaries.

Of course, there are major issues involved with this. For example, where would the algorithm draw the line between exactness and flexibility? I would imagine that it would attempt to create regexes which are as flexible as possible, since it's easy to rule out many non-matches by including a few non-matching examples which are very similar to strings which should match. Secondly, what would be the regex operator precedence (e.g., would it choose to attempt quantification before alternation)? There are numerous other implementation details which might present challenges, as well.

One thing that could sometimes improve the quality of the generated regexes is that the generator can cheat, by filtering expected results though a library of common patterns when this can be done with a reasonable level of certainty.

Have you ever considered this kind of thing in the past or seen it done elsewhere? Any thoughts on the issues or how the implementation might work? I have some ideas about simple algorithm approaches, but haven't proved out that they would work well in a respectable number of cases. Can you see something like this being useful to you?

Capturing Multiple, Optional HTML Attribute Values

Let's say you wanted to find all <div> tags, and capture their id and class attribute values. Anyone who's spent much time parsing HTML with regular expressions is probably aware that it can get quite tricky to match or capture multiple, specific attribute values with one regex, considering that the regex needs to allow for any other attributes which might exist, and allow attributes to appear in any order.

I needed to do something like that for a project recently, so here's what I wrote to solve the problem (after removing support for single-quoted/non-quoted values and whitespace before and after the equals signs, so you can more easily see what's going on):

<div\b(?>\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+)*>

The finer details of the pattern are designed for efficiency (even with bad data such as unclosed <div> tags) over simplicity. Note that it will capture the id to backreference one and the class to backreference two regardless of the order the attributes appear in (i.e., class remains constant as backreference two even if it comes before id, or if id doesn't exist).

The regex uses an atomic group, so if you want to pull this off with similar efficiency in a regex flavor which lacks atomic groups or possessive quantifiers, you can mimic it like so:

<div\b(?:(?=(\s+(?:id="([^"]*)"|class="([^"]*)")|[^\s>]+|\s+))\1)*>

In the above, a backreference to a capturing group within a lookahead is used to mimic an atomic group, so the backreference numbers for the id and class values are shifted to two and three, respectively.

Note that you can easily add as many other attributes as you want to this regex, and it will capture all of their values in the listed order regardless of where they appear in the tag. This construct can also be adapted to a number of other, similar scenarios.

I realize I haven't explained how the regexes actually work or justified any of the details from an efficiency standpoint, but I wanted to share this without having to turn it into a 10-page article. wink If you have any specific questions about the pattern, feel free to ask.

Unfortunately for JavaScripters including myself, neither of the above regexes work as described in Firefox 2.0.0.6 or Opera 9.23, although the latter regex works fine in IE, and either will work in Safari 3 beta since that browser supports atomic groups (unlike all other major browsers). It doesn't work in Firefox or Opera since those two browsers—unlike most other regex engines—reset backreference values when an alternation option fails before the engine reaches a capturing group within it. Of course, you could achieve the same end-result using more verbose code paired with multiple regexes, but that just wouldn't be as cool. Or you could just use the DOM, which would usually be more appropriate for something like this in JavaScript anyway.