Steven Levithan – Page 12 – Flagrant Badassery

Date Format 1.1

I've just updated my ColdFusion-inspired JavaScript Date Format script to version ~~1.0~~ 1.1, and updated the documentation in the old post along with it. The new release includes "Z" (US timezone abbreviation) and "o" (UTC offset) flags as well as brevity enhancements from Scott Trenda, along with several other new features including a standalone dateFormat function, named and default masks (plus you can easily add your own), easier internationalization, etc.

This update includes one change which is not backwards compatible: mask characters and sequences no longer have to comprise entire words for them to be treated specially. The former handling was intended to make it dead-easy to mix literal characters into date masks, but ended up mostly just being a slight nuisance since most people didn't use it to embed dates in larger strings.
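
To illustrate the change, here is a minimal sketch of flag matching that doesn't depend on word boundaries. It is not the Date Format source: miniFormat, its handful of flags, and the quoted-literal handling are assumptions made for the example.

```javascript
// Hypothetical miniature formatter, for illustration only
function miniFormat(date, mask) {
	function pad(n) { return (n < 10 ? "0" : "") + n; }
	var flags = {
		yyyy: date.getFullYear(),
		mm: pad(date.getMonth() + 1),
		dd: pad(date.getDate()),
		HH: pad(date.getHours()),
		MM: pad(date.getMinutes())
	};
	// Flags are recognized anywhere in the mask, even when adjacent to
	// other characters; quoted text is passed through as a literal
	return mask.replace(/yyyy|mm|dd|HH|MM|"[^"]*"|'[^']*'/g, function (token) {
		return flags.hasOwnProperty(token) ? flags[token] : token.slice(1, -1);
	});
}

var sample = new Date(2007, 9, 9, 14, 5); // October 9, 2007, 14:05 local time
var stamped = miniFormat(sample, 'yyyymmdd"T"HHMM');
```

Here yyyymmdd is read as three adjacent flags and the quoted "T" passes through as a literal, so stamped ends up as 20071009T1405 for the sample date.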

Check out the new Date Format!

Edit: Date Format is now integrated into two JavaScript frameworks:

CFJS is a library of almost 70 ColdFusion functions written in JavaScript by Chris Jordan. CFJS has used Date Format since version 0.1; it was a natural fit, since Date Format is largely based on ColdFusion's dateFormat and timeFormat functions.
Chiron is an innovative, emerging JavaScript library by Kris Kowal. It's based on Python idioms, and at its heart is an advanced module loader and isolation system the likes of which haven't been seen yet in the JavaScript world. In addition to integrating Date Format as a module called date.js, Chiron has also integrated my XRegExp library, and uses regular expressions from parseUri in its core. Expect to hear more about Chiron as it gets closer to its 0.1 release.

RegexPal Now Open Source

RegexPal (easily the most del.icio.used regex tester) is now released under the ~~Creative Commons Attribution-Share Alike 3.0 License~~ GNU LGPL.

There are certainly many more features that can be added to the app and things that can be improved, so if you are interested in helping out or creating your own version, you are welcome to do so. If there is interest I'll create a Google Code project, but for now ~~there is a package you can download which includes all files for the regexpal.com website~~. Two of the files in the package (xregexp.js and helpers.js) are dual-licensed under the MIT License.

If you're only interested in the JavaScript, you can see the three source files here:

For regex aficionados particularly, there is some stuff here you might find interesting, including the latest, as-yet-unreleased version of my XRegExp library, and the regex syntax parser used for RegexPal's syntax highlighting (which includes lots of details on the minutiae of regex syntax and cross-browser regex handling).

Regex Performance Optimization

Crafting efficient regular expressions is something of an art. In large part, it centers on controlling and minimizing backtracking and the number of steps the regex engine takes to match or fail. However, most engines implement different sets of internal optimizations (which can make certain operations faster, or avoid work by performing simplified pre-checks or skipping unnecessary operations), so the topic also depends to a significant extent on the particular regex flavor and implementation you're using. Most developers aren't deeply aware of regex performance issues, so when they run into regular expressions that run slowly, their first thought is to remove the regexes. In reality, most non-trivial regexes I've seen could be significantly optimized for efficiency, and I've frequently seen dramatic improvements as a result of doing so.
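
As a small illustration of backtracking (a sketch only), the two patterns below accept exactly the same strings, but the first uses nested quantifiers that give a backtracking engine many ways to divide up a run of x characters before it can report failure. The timeTest helper and the sample string are made up for the example, and actual timings depend on the engine.

```javascript
// Nested quantifiers: a run of x's can be split between the two x+ and
// across repetitions of the group in many ways, all of which get tried
// before the engine can report failure
var nested = /^(x+x+)+y$/;
// Accepts the same strings (two or more x's, then y) with one way to match
var flat = /^xx+y$/;

// Hypothetical helper: reports the test result and the elapsed milliseconds
function timeTest(regex, str) {
	var start = new Date().getTime(),
		matched = regex.test(str);
	return {matched: matched, ms: new Date().getTime() - start};
}

// 20 x's with no trailing y, so both patterns have to fail
var subject = new Array(21).join("x");
var nestedResult = timeTest(nested, subject);
var flatResult = timeTest(flat, subject);
```

Both results report matched as false. In a backtracking engine, comparing the two ms values while lengthening subject a character or two at a time should show how quickly the nested version degrades.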

The best discussion of regex optimization I've seen is chapter six (a good 60 pages) of Mastering Regular Expressions, Third Edition by Jeffrey Friedl. Unfortunately, other good material on the subject is fairly scarce, so I was pleasantly surprised to stumble upon the recent article Optimizing regular expressions in Java by Cristian Mocanu. Of course, it is in part Java specific, but for the most part it is a good article on the basics of optimization for traditional NFA regex engines. Check it out.

Have you seen any good articles or discussions about regular expression performance or efficiency optimization recently? Do you have any questions about the subject? Experience or pointers to share? Let me know. (I hope to eventually write up an in-depth article on JavaScript regex optimization, with lots of tips, techniques, and cross-browser benchmarks.)

Automatic HTML Summary / Teaser

When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. Recently on ColdFusion developer Ben Nadel's blog, he tackled the problem of closing XHTML tags in a truncated string using ColdFusion and its underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version, and added some additional functionality. Specifically, the following code additionally truncates the string for you (based on a user-specified number of characters), and in the process only counts text outside of HTML tags towards the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.

function getLeadingHtml (input, maxChars) {
	// token matches a word, tag, or special character
	var	token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*(\/)?>|</g,
		selfClosingTag = /^(?:[hb]r|img)$/i,
		output = "",
		charCount = 0,
		openTags = [],
		match;

	// Set the default for the max number of characters
	// (only counts characters outside of HTML tags)
	maxChars = maxChars || 250;

	while ((charCount < maxChars) && (match = token.exec(input))) {
		// If this is an HTML tag
		if (match[2]) {
			output += match[0];
			// If this is not a self-closing tag
			if (!(match[3] || selfClosingTag.test(match[2]))) {
				// If this is a closing tag
				if (match[1]) openTags.pop();
				else openTags.push(match[2]);
			}
		} else {
			charCount += match[0].length;
			if (charCount <= maxChars) output += match[0];
		}
	}

	// Close any tags which were left open
	var i = openTags.length;
	while (i--) output += "</" + openTags[i] + ">";
	
	return output;
};

This is all pretty straightforward stuff, but I figured I might as well pass it on.

Here's an example of the output:

var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
var output = getLeadingHtml(input, 40);

/* Output:
<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>
*/

Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):

^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?

That should work in the .NET, Java, and JGsoft regex engines. It won't work in most others because of the {0,99} in the lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind, so with those two you could replace the {0,99} quantifier with *. Since the .NET and JGsoft engines additionally support lookaround-based conditionals, you could save two more characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>).

If you want to mimic the lookbehind in JavaScript, you could use the following:

// JavaScript doesn't include a native reverse method for strings,
// so we need to create one
String.prototype.reverse = function() {
	return this.split("").reverse().join("");
};
// Mimic the regex /^[\S\s]{1,100}(?:(?<=<[^>]*)[^>]*>)?/ through
// node-by-node reversal
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();
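
If reversing the string feels too clever, the same idea can also be written without any regex. The following is a minimal sketch with a hypothetical function name, leadingChars: it takes the first max characters and, if that cut lands inside a tag, extends the result through the tag's closing >.

```javascript
// Hypothetical helper: first `max` characters, extended to the end of the
// current tag if the cut would otherwise land inside one
function leadingChars(str, max) {
	var head = str.slice(0, max),
		lastOpen = head.lastIndexOf("<"),
		lastClose = head.lastIndexOf(">");
	// The cut is inside a tag if the last "<" has no ">" after it
	if (lastOpen > lastClose) {
		var end = str.indexOf(">", max);
		// If the tag is never closed, fall through and return the plain cut
		if (end !== -1) return str.slice(0, end + 1);
	}
	return head;
}

// 95 x's followed by a link, so character 100 falls inside the <a> tag
var sampleText = new Array(96).join("x") + '<a href="#">link</a>';
var leading = leadingChars(sampleText, 100);
```

Like the lookbehind regex, this treats any < that isn't followed by a > within the first max characters as the start of an open tag, so a literal < in the text will be handled the same way.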

When innerHTML isn’t Fast Enough

This post isn't about the pros and cons of innerHTML vs. W3C DOM methods. That has been hashed and rehashed elsewhere. Instead, I'll show how you can combine the use of innerHTML and DOM methods to make your code potentially hundreds of times faster than innerHTML on its own, when working with large numbers of elements.

In some browsers (most notably, Firefox), although innerHTML is generally much faster than DOM methods, it spends a disproportionate amount of time clearing out existing elements vs. creating new ones. Knowing this, we can get the best of both: destroy the existing elements quickly by replacing their parent using standard DOM methods, and create the new elements using innerHTML. (This technique is something I discovered during the development of RegexPal, and is one of its two main performance optimizations. The other is one-shot markup generation for match highlighting, which avoids needing to loop over matches or reference them individually.)

The code:

function replaceHtml(el, html) {
	var oldEl = typeof el === "string" ? document.getElementById(el) : el;
	/*@cc_on // Pure innerHTML is slightly faster in IE
		oldEl.innerHTML = html;
		return oldEl;
	@*/
	var newEl = oldEl.cloneNode(false);
	newEl.innerHTML = html;
	oldEl.parentNode.replaceChild(newEl, oldEl);
	/* Since we just removed the old element from the DOM, return a reference
	to the new element, which can be used to restore variable references. */
	return newEl;
};

You can use the above as el = replaceHtml(el, newHtml) instead of el.innerHTML = newHtml.
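
As for the one-shot markup generation mentioned earlier, the general idea can be sketched as a single replace call that wraps every match at once, producing one string to hand to replaceHtml or innerHTML. This is an illustration only, not RegexPal's actual code: the function name is hypothetical, and the sketch assumes the regex has the global flag and the text contains no characters that need HTML escaping.

```javascript
// Hypothetical helper: builds highlighted markup in one pass
function highlightMatches(text, regex) {
	// $& inserts the matched substring, so every match is wrapped
	// without looping over the matches individually
	return text.replace(regex, "<b>$&</b>");
}

var markup = highlightMatches("a1b22c", /\d+/g);
// markup is now 'a<b>1</b>b<b>22</b>c'
```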

innerHTML is already pretty fast...is this really warranted?

That depends on how many elements you're overwriting. In RegexPal, every keydown event potentially triggers the destruction and creation of thousands of elements (in order to make the syntax and match highlighting work). In such cases, the above approach has enormous positive impact. Even something as simple as el.innerHTML += str or el.innerHTML = "" could be a performance disaster if the element you're updating happens to have a few thousand children.

I've created a page which allows you to easily test the performance difference of innerHTML and my replaceHtml function with various numbers of elements. Make sure to try it out in a few browsers for comparison. Following are a couple examples of typical results from Firefox 2.0.0.6 on my system:

1000 elements...
innerHTML (destroy only): 156ms
innerHTML (create only): 15ms
innerHTML (destroy & create): 172ms
replaceHtml (destroy only): 0ms (faster)
replaceHtml (create only): 15ms (~ same speed)
replaceHtml (destroy & create): 15ms (11.5x faster)

15000 elements...
innerHTML (destroy only): 14703ms
innerHTML (create only): 250ms
innerHTML (destroy & create): 14922ms
replaceHtml (destroy only): 31ms (474.3x faster)
replaceHtml (create only): 250ms (~ same speed)
replaceHtml (destroy & create): 297ms (50.2x faster)
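
The "x faster" figures are consistent with dividing the innerHTML time by the replaceHtml time and rounding to one decimal place. The helper below is a hypothetical sketch of that arithmetic, not code from the test page:

```javascript
// Hypothetical helper: how many times faster the candidate is than the baseline
function speedup(baselineMs, candidateMs) {
	// A 0ms candidate can't be expressed as a ratio
	if (candidateMs === 0) return null;
	return Math.round(baselineMs / candidateMs * 10) / 10;
}

var small = speedup(172, 15);     // 11.5
var destroy = speedup(14703, 31); // 474.3
var large = speedup(14922, 297);  // 50.2
```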

I think the numbers speak for themselves. Comparable performance improvements can also be seen in Safari. In Opera, replaceHtml is still typically faster than innerHTML, but by a narrower margin. In IE, simple use of innerHTML is typically faster than mixing it with DOM methods, but not by nearly the same kinds of margins as you can see above. Nevertheless, IE's conditional compilation feature is used to avoid the relatively minor performance penalty, by just using innerHTML with that browser.