Flagrant Badassery

A JavaScript and regular expression centric blog

Automatic HTML Summary / Teaser

When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. Recently, ColdFusion developer Ben Nadel tackled the problem of closing XHTML tags in a truncated string on his blog, using ColdFusion and its underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version and added some functionality. Specifically, the following code also truncates the string for you (based on a user-specified number of characters). In the process, it counts only text outside of HTML tags toward the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.

function getLeadingHtml (input, maxChars) {
	// token matches a word, tag, or special character
	// (the lazy [^>]*? leaves the trailing slash of <tag /> for group 3 to capture)
	var	token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|</g,
		selfClosingTag = /^(?:[hb]r|img)$/i,
		output = "",
		charCount = 0,
		openTags = [],
		match;

	// Set the default for the max number of characters
	// (only counts characters outside of HTML tags)
	maxChars = maxChars || 250;

	while ((charCount < maxChars) && (match = token.exec(input))) {
		// If this is an HTML tag
		if (match[2]) {
			output += match[0];
			// If this is not a self-closing tag
			if (!(match[3] || selfClosingTag.test(match[2]))) {
				// If this is a closing tag
				if (match[1]) openTags.pop();
				else openTags.push(match[2]);
			}
		} else {
			charCount += match[0].length;
			if (charCount <= maxChars) output += match[0];
		}
	}

	// Close any tags which were left open
	var i = openTags.length;
	while (i--) output += "</" + openTags[i] + ">";
	
	return output;
};

This is all pretty straightforward stuff, but I figured I might as well pass it on.

Here's an example of the output:

var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
var output = getLeadingHtml(input, 40);

/* Output:
<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>
*/
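
To see why only text outside of tags counts toward the limit, it can help to look at the individual tokens. The following is a minimal sketch, not part of the function above: listTokens is a hypothetical helper, and it uses a simplified form of the token pattern (without the closing-slash and self-closing captures) just to show how the input is split and which pieces add to the character count.

```javascript
// Hypothetical helper for illustration only: lists each token and how many
// characters it would contribute to the visible-text count
function listTokens (input) {
	// Simplified token pattern: a word, a single non-word character, a tag, or a stray "<"
	var token = /\w+|[^\w<]|<\/?(\w+)[^>]*>|</g,
		tokens = [],
		match;

	while ((match = token.exec(input))) {
		tokens.push({
			text: match[0],
			isTag: !!match[1],
			// Tags are copied to the output but never counted
			counted: match[1] ? 0 : match[0].length
		});
	}

	return tokens;
}

var tokens = listTokens('<p>Hi <br>you</p>');
// tokens: "<p>", "Hi", " ", "<br>", "you", "</p>"
// counted text characters: 2 + 1 + 3 = 6
```

Because whole words are single tokens, a word that would push the count past the limit is dropped in its entirety, which is what keeps the teaser from ending mid-word.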


Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):

^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?

That should work with regex engines which at least support finite-length lookbehind, including .NET, Java, PCRE, and JGsoft, but not Perl or Python, both of which support only fixed-length lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind (so we could replace the {0,99} quantifier with *). Many other flavors including JavaScript don't support lookbehind at all.

With the .NET, PCRE, and JGsoft engines (which additionally support lookaround-based conditionals), you could save two characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>). If you want to mimic the lookbehind in JavaScript, you could reverse the string, match it against a reversed pattern that uses lookahead instead, and then reverse the result:

// JavaScript doesn't include a native reverse method for strings, so we need to create one
String.prototype.reverse = function() {
	return this.split("").reverse().join("");
};
// Mimic the regex /^[\S\s]{1,100}(?:(?<=<[^>]*)[^>]*>)?/ through node-by-node reversal
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();
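
As a rough sketch of how the reversal trick behaves on real input, here is the same idea wrapped in a standalone function with a configurable limit. The names leadingCharsKeepingTags and limit are made up for this example, and the helper avoids extending String.prototype so that it is self-contained.

```javascript
// Hypothetical wrapper: returns the first `limit` characters of `input`,
// extended to the next ">" if the cut would otherwise land inside a tag
function leadingCharsKeepingTags (input, limit) {
	function reverse (str) {
		return str.split("").reverse().join("");
	}

	// Reversed pattern: the lookbehind becomes a lookahead,
	// and the ^ anchor becomes $
	var regex = new RegExp("(?:>[^>]*(?=[^>]*<))?[\\S\\s]{1," + limit + "}$");
	var match = reverse(input).match(regex);

	return match ? reverse(match[0]) : "";
}

var sample = 'Hello <strong class="x">world</strong>!';
var lead = leadingCharsKeepingTags(sample, 10);
// lead is 'Hello <strong class="x">' because the cut at 10 characters fell inside the tag
```

When the cut does not fall inside a tag, the optional group simply does not participate, and the result is the plain first `limit` characters.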

There Are 4 Responses So Far.

  1. Good stuff. It took me a minute to figure out why you were adding match[0] to the output IF the current match was a tag… but then I remembered the whole point was to add tags and then make sure they are closed. I like that you are only counting characters for non-tag matches. Slick.

  2. @Ben, thanks! And thanks for bringing this up.

    BTW, I’ve just added some stuff to the end of the above post, including an example of node-by-node regex reversal for mimicking lookbehind.

  3. Alternatively, in JavaScript you could use the browser’s HTML parser to fix mistakes for you.
    // assuming str is your string of truncated code

    var ele = document.createElement("div");
    ele.innerHTML = str;
    str = ele.innerHTML;

    I understand your focus is on fantastic regex, so I appreciate the approach; this one is just different, for a similar result. It will also fix attributes not enclosed in quotation marks.

  4. @Matt Foster, good point. :-) Assuming that all browsers will close open tags when using innerHTML, the function could be reduced to the following if you still wanted its ability to truncate while only counting characters outside of tags, and never ending in the middle of a word or tag:

    function getLeadingHtml (input, maxChars) {
    	// token matches a word, tag, or special character
    	var token = /\w+|[^\w<]|<\/?(\w+)[^>]*>|</g,
    		match,
    		output = "",
    		charCount = 0;
    
    	// Set the default for the max number of characters (only counts characters outside HTML tags)
    	maxChars = maxChars || 250;
    
    	while ((charCount < maxChars) && (match = token.exec(input))) {
    		// If this is an HTML tag
    		if (match[1]) {
    			output += match[0];
    		} else {
    			charCount += match[0].length;
    			if (charCount <= maxChars) {
    				output += match[0];
    			}
    		}
    	}
    
    	return output;
    };
