When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. Recently, ColdFusion developer Ben Nadel tackled on his blog the problem of closing XHTML tags in a truncated string using ColdFusion and its underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version and added some functionality. Specifically, the following code also truncates the string for you (based on a user-specified number of characters), and in the process only counts text outside of HTML tags towards the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.
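function getLeadingHtml (input, maxChars) {
    // token matches a word, a single non-word character, a tag, or a stray "<".
    // [^>]*? is lazy so that (\/)? can capture the slash in tags like <br />.
    var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|</g,
        selfClosingTag = /^(?:[hb]r|img)$/i,
        output = "",
        charCount = 0,
        openTags = [],
        match;
    maxChars = maxChars || 250;
    while ((charCount < maxChars) && (match = token.exec(input))) {
        if (match[2]) {
            // A tag: copy it through without counting it towards the length
            output += match[0];
            // Track it unless it is self-closing (<x />) or a singleton element
            if (!(match[3] || selfClosingTag.test(match[2]))) {
                if (match[1]) openTags.pop();
                else openTags.push(match[2]);
            }
        } else {
            // Text: count it, and only keep it if it fits within the limit
            charCount += match[0].length;
            if (charCount <= maxChars) output += match[0];
        }
    }
    // Close any tags still open, innermost first
    var i = openTags.length;
    while (i--) output += "</" + openTags[i] + ">";
    return output;
}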
This is all pretty straightforward stuff, but I figured I might as well pass it on.
Here's an example of the output:
var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
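var output = getLeadingHtml(input, 40);
// output: '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>'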
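For comparison, here's a rough sketch of the strip-then-truncate approach mentioned at the top. The name stripAndTruncate is made up for illustration, and the tag-stripping regex is deliberately crude (it assumes no ">" inside attribute values):

```javascript
// Hypothetical helper, shown only for contrast with getLeadingHtml.
function stripAndTruncate(input, maxChars) {
    maxChars = maxChars || 250;
    // Crude tag removal: drop anything that looks like <...>
    var text = input.replace(/<[^>]*>/g, "");
    // Take the leftmost maxChars characters, wherever that happens to land
    return text.slice(0, maxChars);
}
```

Run against the same input with a limit of 40, this returns "Ninjas are mammalswho love to flip out a": the markup is gone, removing the <br> has fused "mammals" and "who", and the cut lands in the middle of a word.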
Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):
^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?
That should work in the .NET, Java, and JGsoft regex engines. It won't work in most others because of the {0,99} in the lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind, so with those two you could replace the {0,99} quantifier with *. Since the .NET and JGsoft engines additionally support lookaround-based conditionals, you could save two more characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>).
If you want to mimic the lookbehind in JavaScript, you could use the following:
String.prototype.reverse = function() {
    return this.split("").reverse().join("");
};
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();
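JavaScript engines that implement the ES2018 regex features (lookbehind and the "s" flag) should be able to run the original pattern more or less directly, without the reversal trick. Here's a minimal sketch under that assumption; firstChunk is a hypothetical name, and the unbounded * inside the lookbehind relies on JavaScript lookbehind not being restricted to bounded lengths:

```javascript
// Hypothetical helper; assumes an engine with ES2018 lookbehind and the "s" flag.
function firstChunk(input) {
    // Up to 100 characters, extended to the closing ">" when the cut
    // would otherwise land inside a tag.
    var regex = /^.{1,100}(?:(?<=<[^>]*)[^>]*>)?/s;
    var match = input.match(regex);
    // An empty input has nothing to match, so fall back to an empty string
    return match ? match[0] : "";
}
```

For example, given 95 "x" characters followed by '<a href="http://example.com/">link</a>', the 100-character cut falls inside the opening tag, so the match runs on through that tag's closing ">".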