REMatch (ColdFusion)

Following are some UDFs I wrote recently to make using regexes in ColdFusion a bit easier. The biggest deal here is my reMatch() function.

reMatch(), in its most basic usage, is similar to JavaScript's String.prototype.match() method. Compare getting the first number in a string using reMatch() vs. built-in ColdFusion functions:

  • reMatch:
    <cfset num = reMatch("\d+", string) />
  • reReplace:
    <cfset num = reReplace(string, "\D*(\d+).*", "\1") />
  • reFind:
    <cfset match = reFind("\d+", string, 1, TRUE) />
    <cfset num = mid(string, match.pos[1], match.len[1]) />

All of the above would return the same result, unless a number wasn't found in the string, in which case the reFind()-based method would throw an error since the mid() function would be passed a start value of 0. I think it's pretty clear from the above which approach is easiest to use for a situation like this, and it would be easy to envision scenarios where this functionality could more drastically improve code brevity.

Still, that's just the beginning of what reMatch() can do. Change the scope argument from the default of "ONE" to "ALL" (to follow the convention used by reReplace(), etc.), and the function will return an array of all matches. Finally, set the returnLenPos argument to TRUE and the function will return either a struct or array of structs (based on the value of scope) containing the len, pos, AND value of each match. This is very different from how the returnSubExpressions argument of reFind() works. When using returnSubExpressions, you get back a struct containing arrays of the len and pos (but not value) of each backreference from the first match.

Here's the code, with four additional UDFs (reMatchNoCase(), match(), matchNoCase(), and reEscape()) added for good measure:

See the demo and get the source code.

Now that I've got a deeply featured match function, all I need Adobe to add to ColdFusion in the way to regex support is lookbehinds, atomic groups, possessive quantifiers, conditionals, balancing groups, etc., etc.…

Regex Recursion (Matching Nested Constructs)

On a regular expression advice forum I visit every once in a while, some dude was trying to scrape BibTeX entries from Web pages using JavaScript (from within a Firefox extension).

A single BibTeX entry looks roughly like this:

@resourceType{
	field1 = value,
	field2 = "value in quotation marks",
	field3 = "value in quotation marks, with {brackets} in the value",
	field4 = {brackets}
}

The resident regex experts were quick to claim that regexes are only capable of recursion through the use of “balancing groups,” a feature supported exclusively by .NET (see the chapter on .NET in Mastering Regular Expressions). (Update: Also see my own post on the topic: Fun With .NET Regex Balancing Groups.)

Basically, searches requiring recursion have typically been the domain of more traditional parsers, rather than regexes. The problem in this case lies in how you distinguish between the last closing bracket of the @resourceType{…} block and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.

Still, given that there is a known maximum amount of recursion that needs to be accounted for, it's quite possible. Here's the solution offered, which works just fine with JavaScript (it doesn't use any advanced regex features, actually):

@[^{]+{(?:[^{}]|{[^{}]*})*}

However, this works only if:

  • braces are always balanced, and
  • the level of brace nesting is no more than one.

For those unfamiliar with regular expressions or who are looking for a good tutorial, see regular-expressions.info.


Edit: This logic is easy to extend to support more levels of recursion, up to a known maximum. Here's a simple example of matching HTML elements and their contents:

No recursion:
<([a-z\d]+)>.*?</\1>

Up to one level of recursion:
<([a-z\d]+)>(?:<\1>.*?</\1>|.)*?</\1>

Up to two levels of recursion:
<([a-z\d]+)>(?:<\1>(?:<\1>.*?</\1>|.)*?</\1>|.)*?</\1>

…And so on. Note that the above don't support attributes or singleton (self-closed) elements, but that would make the regexes longer and this is only meant for demonstration purposes.


Edit 2: I've since learned that, in addition to using .NET's balancing groups, true recursion is possible with Perl and PCRE regexes. See the following for more information: