Automatic HTML Summary / Teaser
When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. Recently, on his blog, ColdFusion developer Ben Nadel tackled the problem of closing XHTML tags in a truncated string using ColdFusion and its underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version and added some functionality. Specifically, the following code also truncates the string for you (based on a user-specified number of characters). In the process, it counts only text outside of HTML tags toward the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.
function getLeadingHtml (input, maxChars) {
    // token matches a word, tag, or special character
    // ([^>]*? is lazy so that a trailing "/" can be captured by group 3;
    // a greedy [^>]* would consume the slash and leave group 3 unset)
    var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|</g,
        selfClosingTag = /^(?:[hb]r|img)$/i,
        output = "",
        charCount = 0,
        openTags = [],
        match;

    // Set the default for the max number of characters
    // (only counts characters outside of HTML tags)
    maxChars = maxChars || 250;

    while ((charCount < maxChars) && (match = token.exec(input))) {
        // If this is an HTML tag
        if (match[2]) {
            output += match[0];
            // If this is not a self-closing tag
            if (!(match[3] || selfClosingTag.test(match[2]))) {
                // If this is a closing tag
                if (match[1]) openTags.pop();
                else openTags.push(match[2]);
            }
        } else {
            charCount += match[0].length;
            if (charCount <= maxChars) output += match[0];
        }
    }

    // Close any tags which were left open
    var i = openTags.length;
    while (i--) output += "</" + openTags[i] + ">";

    return output;
}
This is all pretty straightforward stuff, but I figured I might as well pass it on.
Here's an example of the output:
var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
var output = getLeadingHtml(input, 40);
/* Output:
<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>
*/
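To make the "only counts text outside of HTML tags" rule more concrete, here is a minimal sketch that walks a string with a token pattern modeled on the one above and adds up only the non-tag tokens. The helper name countVisibleChars and the sample string are made up for illustration and are not part of the original function.

```javascript
// Token pattern modeled on the one in getLeadingHtml: a word, a tag,
// or a single non-word character. Group 2 is set only for tags.
var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*>|</g;

// Hypothetical helper (not part of the original post): returns the number
// of characters that would count toward maxChars, i.e. text outside tags.
function countVisibleChars(input) {
    var count = 0, match;
    token.lastIndex = 0; // always start scanning from the beginning
    while ((match = token.exec(input))) {
        // Tags (match[2] is set) contribute nothing to the count
        if (!match[2]) count += match[0].length;
    }
    return count;
}

var sample = '<p><a href="#">Ninjas</a> are <em>mammals</em><br></p>';
console.log(countVisibleChars(sample)); // 18, the length of "Ninjas are mammals"
```

Because the tag alternative swallows everything up to the closing ">", attribute values such as the href above never contribute to the count.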
Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):
^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?
That should work in the .NET, Java, and JGsoft regex engines. It won't work in most others because of the {0,99} in the lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind, so with those two you could replace the {0,99} quantifier with *. Since the .NET and JGsoft engines additionally support lookaround-based conditionals, you could save two more characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>).
If you want to mimic the lookbehind in JavaScript, you could use the following:
// JavaScript doesn't include a native reverse method for strings,
// so we need to create one
String.prototype.reverse = function() {
    return this.split("").reverse().join("");
};

// Mimic the regex /^[\S\s]{1,100}(?:(?<=<[^>]*)[^>]*>)?/ through
// node-by-node reversal
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();
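For comparison, JavaScript engines that implement the ES2018 lookbehind syntax can run the pattern directly, with no reversal. The following is only a sketch under that assumption; in an engine without lookbehind support, the regex literal is a syntax error. The sample input is made up for illustration.

```javascript
// Assumes an engine with ES2018 lookbehind support.
// [\s\S] stands in for the "dot matches newline" modifier.
var leading = /^[\s\S]{1,100}(?:(?<=<[^>]*)[^>]*>)?/;

// Illustrative input: 95 characters of text, then a tag that
// straddles the 100-character boundary.
var input = "x".repeat(95) + '<a href="http://example.com/">link</a>';
var output = input.match(leading)[0];

// The match runs past 100 characters to the end of the opening <a> tag
console.log(output.slice(95)); // <a href="http://example.com/">
```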
Comment by Ben Nadel on 3 October 2007:
Good stuff. It took me a minute to figure out why you were adding match[0] to the output IF the current match was a tag… but then I remembered the whole point was to add tags and then make sure they are closed. I like that you are only counting characters for non-tag matches. Slick.
Comment by Steve on 3 October 2007:
@Ben, thanks! And thanks for bringing this up.
BTW, I’ve just added some stuff to the end of the above post, including an example of node-by-node regex reversal for mimicking lookbehind.
Comment by Matt Foster on 10 October 2007:
Alternatively, in JavaScript you could use the browser’s HTML parser to fix mistakes for you.
// assuming str is your string of truncated code
var ele = document.createElement("div");
ele.innerHTML = str;
str = ele.innerHTML;
I understand your focus is on fantastic regex, so I appreciate the approach; this one is just different, for a similar result. Also, this will fix attributes not enclosed in quotation marks.
Comment by Steve on 10 October 2007:
@Matt Foster, good point. 🙂 Assuming that all browsers will close open tags when using innerHTML, the function could be reduced to the following if you still wanted its ability to truncate while only counting characters outside of tags, and never ending in the middle of a word or tag:
Comment by Marc Selman on 9 July 2008:
I have rewritten your code to C# for anybody who would like to use it, here it is:
Comment by andrew cates on 23 September 2009:
The above didn’t work for me. Here’s the working C# code:
// Requires: using System; using System.Collections.Generic;
// using System.Text.RegularExpressions;
public string GetLeadingHtml(string input, int? maxChars) {
    // token matches a word, tag, or special character
    // ([^>]*? is lazy so that a trailing "/" can be captured by group 3)
    Regex token = new Regex("\\w+|[^\\w<]|<(/)?(\\w+)[^>]*?(/)?>|<");
    Regex selfClosingTag = new Regex("^(?:[hb]r|img)$", RegexOptions.IgnoreCase);
    string output = String.Empty;
    int charCount = 0;
    List<string> openTags = new List<string>();
    Match match;
    // Set the default for the max number of characters
    // (only counts characters outside of HTML tags)
    if (maxChars == null)
        maxChars = 250;
    MatchCollection matches = token.Matches(input);
    int matchCounter = 0;
    while (charCount < maxChars && matchCounter < matches.Count)
    {
        match = matches[matchCounter];
        matchCounter++;
        // If this is an HTML tag
        if (match.Groups[2].Success)
        {
            output += match.Groups[0].Value;
            // If this is not a self-closing tag
            if (!(match.Groups[3].Success || selfClosingTag.IsMatch(match.Groups[2].Value)))
            {
                // If this is a closing tag
                if (match.Groups[1].Success)
                {
                    // Guard against a stray closing tag with nothing open
                    if (openTags.Count > 0)
                        openTags.RemoveAt(openTags.Count - 1);
                }
                else
                {
                    openTags.Add(match.Groups[2].Value);
                }
            }
        }
        else
        {
            charCount += match.Groups[0].Length;
            if (charCount <= maxChars)
                output += match.Groups[0].Value;
        }
    }
    // Close any tags which were left open
    var i = openTags.Count;
    while (i-- > 0) output += "</" + openTags[i] + ">";
    return output;
}
Comment by trev on 5 October 2009:
Is there a coldfusion version of this function?
Comment by silent on 13 October 2009:
It doesn’t work in Firefox… When I call the function for the first time, it’s OK, but the second time, “match['index']” doesn’t reset; it just continues from the index of the last function call.
Any help?
Comment by Maxime on 23 November 2009:
silent, I had the same problem. The solution is simple: just add this line
token.lastIndex = 0;
before the while statement.
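As a minimal sketch of why that reset helps: a regex with the g flag remembers where its last exec call stopped in its lastIndex property, so if the same regex object is reused after a loop that exited early, the next search resumes from that offset. The shared regex and the firstWord helper below are hypothetical, written only to reproduce the effect.

```javascript
// A global regex created once and shared between calls (hypothetical setup
// that mirrors an engine reusing the same regex object across calls).
var token = /\w+/g;

// Hypothetical helper: returns the first word found by one exec call.
function firstWord(input, reset) {
    if (reset) token.lastIndex = 0; // the fix suggested above
    var match = token.exec(input);
    return match ? match[0] : null;
}

// Without the reset, the second call resumes at index 5 and skips "gamma"
var withoutReset = [firstWord("alpha beta", false), firstWord("gamma delta", false)];
console.log(withoutReset); // [ 'alpha', 'delta' ]

// With the reset, every call starts from the beginning of its input
var withReset = [firstWord("alpha beta", true), firstWord("gamma delta", true)];
console.log(withReset); // [ 'alpha', 'gamma' ]
```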
Comment by Max Paperno on 27 January 2010:
Here’s a ColdFusion version of this function. Thanks Steve! Love the regex solution.
Tested and working, but YMMV. Posted to CFLib.org but not up there yet.
Pingback by TYPO3 HTML save cropping - nerdcenter on 16 March 2010:
[…] The remedy is the following crop method, a port of the JavaScript version by Steven Levithan […]
Comment by vince on 9 May 2011:
this Cold Fusion version is working great. thanks.