Automatic HTML Summary / Teaser
When generating an HTML content teaser or summary, many people just strip all tags before grabbing the leftmost n characters. ColdFusion developer Ben Nadel recently tackled, on his blog, the problem of closing XHTML tags in a truncated string using ColdFusion and its underlying Java methods. After seeing this, I created a roughly equivalent JavaScript version and added some functionality. Specifically, the following code also truncates the string for you (based on a user-specified number of characters). In the process, it counts only text outside of HTML tags towards the length, avoids ending the string in the middle of a tag or word, and avoids adding closing tags for singleton elements like <br> or <img>.
function getLeadingHtml (input, maxChars) {
    // token matches a word, tag, or special character
    // ([^>]*? is lazy so that a trailing "/" is captured by the third group)
    var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|</g,
        selfClosingTag = /^(?:[hb]r|img)$/i,
        output = "",
        charCount = 0,
        openTags = [],
        match;
    // Set the default for the max number of characters
    // (only counts characters outside of HTML tags)
    maxChars = maxChars || 250;
    while ((charCount < maxChars) && (match = token.exec(input))) {
        // If this is an HTML tag
        if (match[2]) {
            output += match[0];
            // If this is not a self-closing tag
            if (!(match[3] || selfClosingTag.test(match[2]))) {
                // If this is a closing tag
                if (match[1]) openTags.pop();
                else openTags.push(match[2]);
            }
        } else {
            charCount += match[0].length;
            if (charCount <= maxChars) output += match[0];
        }
    }
    // Close any tags which were left open
    var i = openTags.length;
    while (i--) output += "</" + openTags[i] + ">";
    return output;
};
This is all pretty straightforward stuff, but I figured I might as well pass it on.
Here's an example of the output:
var input = '<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out and cut off people\'s heads all the time!</u></strong></p>';
var output = getLeadingHtml(input, 40);
/* Output:
<p><a href="http://www.realultimatepower.net/">Ninjas</a> are mammals<br>who <strong><em>love</em> to <u>flip out </u></strong></p>
*/
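To make the tokenizing step a bit more concrete, here is a minimal sketch of how each match can be classified from its capture groups. It uses a token pattern of the same shape as the one in getLeadingHtml, with the attribute part written lazily ([^>]*?) so that an optional trailing slash lands in the third group. The helper name classifyTokens and the sample string are made up for illustration.

```javascript
// Token pattern of the same shape as in getLeadingHtml; the lazy [^>]*?
// is an assumption of this sketch so that group 3 can capture a trailing "/"
var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|</g;

// classifyTokens is a hypothetical helper, used here only for illustration
function classifyTokens(input) {
    var result = [],
        match,
        kind;
    // Start from the beginning of the input on every call
    token.lastIndex = 0;
    while ((match = token.exec(input))) {
        if (!match[2]) kind = "text";          // word, single character, or stray "<"
        else if (match[1]) kind = "close";     // </tag>
        else if (match[3]) kind = "selfclose"; // <tag/>
        else kind = "open";                    // <tag>
        result.push(kind + ":" + match[0]);
    }
    return result;
}

var tokens = classifyTokens("<p>Hi <br/>you</p>");
// tokens: ["open:<p>", "text:Hi", "text: ", "selfclose:<br/>", "text:you", "close:</p>"]
```

Only the "text" tokens would count towards the character limit; the tag tokens are copied through and tracked so they can be closed later.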
Edit: On a related note, here's a regex I posted earlier on Ben's site which matches the first 100 characters in a string, unless it ends in the middle of an HTML tag, in which case it will match until the end of the tag (use this with the "dot matches newline" modifier):
^.{1,100}(?:(?<=<[^>]{0,99})[^>]*>)?
That should work with the .NET, Java, and JGsoft regex engines. It won't work in most others because of the {0,99} in the lookbehind. Note that .NET and JGsoft actually support infinite-length lookbehind, so with those two you could replace the {0,99} quantifier with *. Since the .NET and JGsoft engines additionally support lookaround-based conditionals, you could save two more characters by writing it as ^.{1,100}(?(?<=<[^>]{0,99})[^>]*>).
If you want to mimic the lookbehind in JavaScript, you could use the following:
// JavaScript doesn't include a native reverse method for strings,
// so we need to create one
String.prototype.reverse = function() {
    return this.split("").reverse().join("");
};
// Mimic the regex /^[\S\s]{1,100}(?:(?<=<[^>]*)[^>]*>)?/ through
// node-by-node reversal
var regex = /(?:>[^>]*(?=[^>]*<))?[\S\s]{1,100}$/;
var output = input.reverse().match(regex)[0].reverse();
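For JavaScript engines that do support lookbehind natively (it was added to the language in ES2018, and unbounded quantifiers are allowed inside it), the pattern could probably be used directly without any reversal. The following is only a rough sketch under that assumption; the helper name leadingChunk and the sample string are made up for illustration.

```javascript
// Assumes an ES2018+ engine with lookbehind support.
// [\s\S] stands in for "dot matches newline".
var leading = /^[\s\S]{1,100}(?:(?<=<[^>]*)[^>]*>)?/;

// leadingChunk is a hypothetical helper name
function leadingChunk(input) {
    var match = leading.exec(input);
    return match ? match[0] : "";
}

// 95 filler characters, so that character 100 falls inside the <a ...> tag
var sample = "x".repeat(95) + '<a href="foo">link</a> and more text';
var chunk = leadingChunk(sample);
// chunk is the 95 filler characters followed by the complete <a href="foo"> tag
```

When character 100 is not inside a tag, the optional group simply fails to match and the result is the first 100 characters.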


Comment by Ben Nadel on 3 October 2007:
Good stuff. It took me a minute to figure out why you were adding match[0] to the output IF the current match was a tag… but then I remembered the whole point was to add tags and then make sure they are closed. I like that you are only counting characters for non-tag matches. Slick.
Comment by Steve on 3 October 2007:
@Ben, thanks! And thanks for bringing this up.
BTW, I’ve just added some stuff to the end of the above post, including an example of node-by-node regex reversal for mimicking lookbehind.
Comment by Matt Foster on 10 October 2007:
Alternatively in javascript you could use the browser’s processor to fix mistakes for you.
//assuming str is your string of truncated code
var ele = document.createElement("div");
ele.innerHTML = str;
str = ele.innerHTML;
I understand your focus is on fantastic regex so I appreciate the approach, this one is just different, for a similar result. Also this will fix attributes not enclosed in quotations.
Comment by Steve on 10 October 2007:
@Matt Foster, good point.
Assuming that all browsers will close open tags when using innerHTML, the function could be reduced to the following if you still wanted its ability to truncate while only counting characters outside of tags, and never ending in the middle of a word or tag:
function getLeadingHtml (input, maxChars) {
    // token matches a word, tag, or special character
    var token = /\w+|[^\w<]|<\/?(\w+)[^>]*>|</g,
        match,
        output = "",
        charCount = 0;
    // Set the default for the max number of characters (only counts characters outside HTML tags)
    maxChars = maxChars || 250;
    while ((charCount < maxChars) && (match = token.exec(input))) {
        // If this is an HTML tag
        if (match[1]) {
            output += match[0];
        } else {
            charCount += match[0].length;
            if (charCount <= maxChars) {
                output += match[0];
            }
        }
    }
    return output;
};
Comment by Marc Selman on 9 July 2008:
I have rewritten your code to C# for anybody who would like to use it, here it is:
public static string GetLeadingHtml(string input, int maxChars)
{
    // token matches a word, tag, or special character
    Regex token = new Regex(@"\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|<");
    Regex selfClosingTag = new Regex("^(?:[hb]r|img)$", RegexOptions.IgnoreCase);
    string output = "";
    int charCount = 0;
    List<string> openTags = new List<string>();
    Match match;
    MatchCollection matches = token.Matches(input);
    int matchCounter = 0;
    // (only counts characters outside of HTML tags)
    while (charCount < maxChars && matchCounter < matches.Count)
    {
        match = matches[matchCounter];
        matchCounter++;
        // If this is an HTML tag
        if (match.Groups[2].Success)
        {
            output += match.Groups[0].Value;
            // If this is not a self-closing tag
            if (!(match.Groups[3].Success || selfClosingTag.IsMatch(match.Groups[2].Value)))
            {
                // If this is a closing tag
                if (match.Groups[1].Success)
                {
                    openTags.RemoveAt(openTags.Count - 1);
                }
                else
                {
                    openTags.Add(match.Groups[2].Value);
                }
            }
        }
        else
        {
            charCount += match.Groups[0].Length;
            if (charCount <= maxChars)
            {
                output += match.Groups[0].Value;
            }
        }
    }
    // Close any tags which were left open
    int i = openTags.Count - 1;
    while (i >= 0)
    {
        output += "</" + openTags[i--] + ">";
    }
    return output;
}
Comment by andrew cates on 23 September 2009:
The above didn’t work for me. Here’s the working C# code:
public string GetLeadingHtml(string input, int? maxChars)
{
    // token matches a word, tag, or special character
    Regex token = new Regex("\\w+|[^\\w<]|<(/)?(\\w+)[^>]*?(/)?>|<");
    Regex selfClosingTag = new Regex("^(?:[hb]r|img)$", RegexOptions.IgnoreCase);
    string output = String.Empty;
    int charCount = 0;
    List<string> openTags = new List<string>();
    Match match;
    // Set the default for the max number of characters
    // (only counts characters outside of HTML tags)
    if (maxChars == null)
        maxChars = 250;
    MatchCollection matches = token.Matches(input);
    int matchCounter = 0;
    while (charCount < maxChars && matchCounter < matches.Count)
    {
        match = matches[matchCounter];
        matchCounter++;
        // If this is an HTML tag
        if (match.Groups[2].Success)
        {
            output += match.Groups[0].Value;
            // If this is not a self-closing tag
            if (!(match.Groups[3].Success || selfClosingTag.IsMatch(match.Groups[2].Value)))
            {
                // If this is a closing tag
                if (match.Groups[1].Success)
                {
                    openTags.RemoveAt(openTags.Count - 1);
                }
                else
                {
                    openTags.Add(match.Groups[2].Value);
                }
            }
        }
        else
        {
            charCount += match.Groups[0].Length;
            if (charCount <= maxChars)
                output += match.Groups[0].Value;
        }
    }
    // Close any tags which were left open
    var i = openTags.Count;
    while (i-- > 0) output += "</" + openTags[i] + ">";
    return output;
}
Comment by trev on 5 October 2009:
Is there a coldfusion version of this function?
Comment by silent on 13 October 2009:
It doesn't work in Firefox… When I call the function for the first time it is OK, but the second time “match['index']” doesn't reset; it just continues from the index of the last function call.
Any help?
Comment by Maxime on 23 November 2009:
silent, I had the same problem. The solution is simple: just add this line
token.lastIndex = 0;
before the while statement.
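As a rough illustration of why the reset helps: a regex with the g flag remembers where its last exec call stopped, so if the same regex object gets reused after a loop exits early, the next search starts from that old position. This is only a minimal sketch; sharedToken and firstWord are made-up names, and the regex is deliberately defined outside the function so that it is shared between calls.

```javascript
// A global regex shared between calls keeps its lastIndex
var sharedToken = /\w+/g;

// firstWord is a hypothetical helper for this demo
function firstWord(input, resetFirst) {
    // Resetting lastIndex makes exec start at the beginning of the new input
    if (resetFirst) sharedToken.lastIndex = 0;
    var match = sharedToken.exec(input);
    return match ? match[0] : null;
}

// Without the reset, the second call resumes at index 5 and skips "gamma"
var withoutReset = [firstWord("alpha beta", false), firstWord("gamma delta", false)];
// withoutReset: ["alpha", "delta"]

// With the reset, each call starts from the beginning
var withReset = [firstWord("alpha beta", true), firstWord("gamma delta", true)];
// withReset: ["alpha", "gamma"]
```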
Comment by Max Paperno on 27 January 2010:
Here’s a ColdFusion version of this function. Thanks Steve! Love the regex solution.
Tested and working, but YMMV. Posted to CFLib.org but not up there yet.
function getLeadingHtml(input, maxChars) {
    // token matches a word, tag, or special character
    var token = "[[:word:]]+|[^[:word:]<]|(?:<(\/)?([[:word:]]+)[^>]*(\/)?>)|<";
    var selfClosingTag = "^(?:[hb]r|img)$";
    var output = "";
    var charCount = 0;
    var openTags = "";
    var strPos = 0;
    var tag = "";
    var i = 1;
    var match = REFind(token, input, i, "true");
    while ( (charCount LT maxChars) AND match.pos[1] ) {
        // If this is an HTML tag
        if (match.pos[3]) {
            output = output & Mid(input, match.pos[1], match.len[1]);
            tag = Mid(input, match.pos[3], match.len[3]);
            // If this is not a self-closing tag
            if ( NOT ( match.pos[4] OR REFindNoCase(selfClosingTag, tag) ) ) {
                // If this is a closing tag
                if ( match.pos[2] AND ListFindNoCase(openTags, tag) ) {
                    openTags = ListDeleteAt(openTags, ListFindNoCase(openTags, tag));
                } else {
                    openTags = ListAppend(openTags, tag);
                }
            }
        } else {
            charCount = charCount + match.len[1];
            if (charCount LTE maxChars) output = output & Mid(input, match.pos[1], match.len[1]);
        }
        i = i + match.len[1];
        match = REFind(token, input, i, "true");
    }
    // Close any tags which were left open
    while ( ListLen(openTags) ) {
        output = output & "</" & ListLast(openTags) & ">";
        openTags = ListDeleteAt(openTags, ListLen(openTags));
    }
    if ( Len(input) GT Len(output) ) output = output & "…";
    return output;
}
Pingback by TYPO3 HTML save cropping - nerdcenter on 16 March 2010:
[...] The following crop method provides a remedy; it is a port of the JavaScript version by Steven Levithan [...]
Comment by vince on 9 May 2011:
this Cold Fusion version is working great. thanks.