Regexes in Depth: Advanced Quoted String Matching

In my previous post, one of my examples of when capturing groups are appropriate showed how to match quoted strings:

(["'])(?:\\\1|.)*?\1

(Note: This has some issues. See Edit 2, at the end of this post, if you intend to use this in an application.)

To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.

On his blog, Ben Nadel asked:

I do not follow the \\\1 in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (*?) know when to stop if it can have quotes in side of it? What am I missing?

Good question. Following is the response I gave, slightly updated to improve clarity:


First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:

  • "test"
  • 'test'
  • "t'es't"
  • 'te"st'
  • 'te\'st'
  • "\"te\"\"st\""

In other words, it allows any number of escaped quotes of the same type as the enclosure.

As for how the regex works, it uses a trick similar in construct to the examples I gave in my blog post about regex recursion.

Basically, the inner grouping matches escaped quotes OR any single character, with the escaped quote part before the dot in the test attempt sequence. So, as the lazy repetition operator (*?) steps through the match looking for the first closing quote, it jumps right past each instance of the two characters which together make up an escaped quote. In other words, pairing something other than the quote mark with the quote mark makes the lazy repetition operator treat them as one node, and continue on its way through the string.
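To see the stepping behavior concretely, here is a minimal sketch using Python's re module (my choice for illustration; the post itself targets ColdFusion and JavaScript). The names QUOTED, samples, and found are made up for this example.

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close.
# The alternation tries the two-character escaped quote before the lone dot.
QUOTED = re.compile(r"""(["'])(?:\\\1|.)*?\1""")

samples = [
    r'''"test"''',
    r"""'te\'st'""",
    r'''"\"te\"\"st\""''',
]

for s in samples:
    # Each sample is matched in full, escaped quotes included.
    print(QUOTED.search(s).group(0))

# Two discrete quoted strings in one subject string stay separate,
# because the lazy quantifier stops at the first unescaped closing quote.
text = r"""say "a" and then 'b\'c' again"""
found = [m.group(0) for m in QUOTED.finditer(text)]
print(found)
```

Note the use of finditer() with group(0): Python's findall() would return only the contents of group 1 (the quote character) for a pattern like this.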

Note that if you want to support multi-line quotes in libraries without an option to make dots match newlines (e.g., JavaScript), change the dot to [\S\s]. Also, with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent to each other:

  • (["'])(?:\\\1|.)*?\1 (the regex being discussed)
  • (["']).*?(?<!\\)\1 (uses a negative lookbehind to achieve logic which is possibly simpler to understand)

Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. And if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.
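As a quick sanity check of that equivalence, the sketch below runs both patterns over the same input in Python, whose re module supports fixed-width lookbehind. It only compares them on one sample string, so treat it as an illustration rather than a proof; STEP_OVER, LOOKBEHIND, and the sample text are names invented here.

```python
import re

# The pattern from the post: step over escaped quotes as a two-character unit.
STEP_OVER = re.compile(r"""(["'])(?:\\\1|.)*?\1""")
# The lookbehind variant: accept a closing quote only if no backslash precedes it.
LOOKBEHIND = re.compile(r"""(["']).*?(?<!\\)\1""")

text = r"""a "x\"y" b 'p\'q' c"""

step_over_hits = [m.group(0) for m in STEP_OVER.finditer(text)]
lookbehind_hits = [m.group(0) for m in LOOKBEHIND.finditer(text)]

print(step_over_hits)
print(lookbehind_hits)
```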

One final thing worth noting is that in neither regex did I try to use anything like [^\1] for matching the inner, quoted content. If [^\1] worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes. First of all, the reason we can't greedily repeat an "any character" pattern such as a dot or [\S\s] is that we would then no longer be able to distinguish between multiple discrete quotes within the same string, and our match would go from the start of the first quote to the end of the last quote. The reason we can't use [^\1] either is that you can't use backreferences within character classes, even though in this case the backreference's value is only one character in length. Also note that the patterns [\1] and [^\1] actually do have special meaning, though possibly not what you would expect. With most regular expression libraries, they assert: match a single character which is/is not octal index 1 in the character set. To assert that outside of a character class, you'd typically need to use a leading zero (e.g., \01), but inside a character class the leading zero is often optional.
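Here is a small sketch of that pitfall as it plays out in Python's re module, where numeric escapes inside a character class are read as octal character codes. Other engines may behave differently; NAIVE and the other names are hypothetical.

```python
import re

# Inside a character class, \1 is an octal escape for U+0001,
# not a backreference to group 1.
matches_ctrl = re.fullmatch(r"[\1]", "\x01") is not None
# So [^\1] happily matches a quote mark: it is simply "not U+0001".
matches_quote = re.fullmatch(r"[^\1]", '"') is not None

# A naive attempt to exclude the opening quote from the content.
NAIVE = re.compile(r"""(["'])[^\1]*\1""")
STEP_OVER = re.compile(r"""(["'])(?:\\\1|.)*?\1""")

text = '"one" and "two"'

# The greedy class runs to the end and backtracks to the LAST quote.
naive_hit = NAIVE.search(text).group(0)
# The lazy pattern keeps the two quoted strings apart.
lazy_hits = [m.group(0) for m in STEP_OVER.finditer(text)]

print(matches_ctrl, matches_quote)
print(naive_hit)
print(lazy_hits)
```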


If anyone has questions about how or why other specific regex patterns work or don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.


Edit: Just for kicks, here's a version which adds support for fancy, angled “…” and ‘…’ pairs. This uses a lookbehind and conditionals. Libraries which support both features include the .NET framework and PCRE.

(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))

In the above, I'm using nested conditionals to achieve an if/elseif/else construct. Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with " or '):

  • “test”
  • “te“st”
  • “te\”st”
  • ‘test’
  • ‘t‘e“"”s\’t’
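Python's re module also happens to support both lookbehind and conditionals, so the pattern can be tried there as a rough check. This is a sketch on the listed examples only; FANCY, samples, and results are names introduced for illustration.

```python
import re

# if group 1 matched: close with the same straight quote (\1)
# elif group 2 matched: close with the right double curly quote
# else: close with the right single curly quote
FANCY = re.compile(r"""(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))""")

samples = [
    "“test”",
    "“te“st”",
    r"“te\”st”",
    "‘test’",
    r"""‘t‘e“"”s\’t’""",
    '"test"',
]

# Each sample should be matched from its first to its last character.
results = [FANCY.search(s).group(0) for s in samples]
for r in results:
    print(r)
```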

Edit 2: Note that these regexes were designed more for illustrative purposes than practical use within programming. One issue is that they don't account for escaped backslashes within the quotes (e.g., they treat \\" as a backslash followed by an escaped double quote, rather than an escaped backslash followed by an unescaped double quote). However, that's easy to address. For the first regex, just replace it with this:

(["'])(?:\\?.)*?\1
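To make the escaped-backslash problem concrete, the following sketch (Python, with made-up names ORIGINAL, REVISED, and text) feeds both patterns a quoted string that ends in an escaped backslash and is followed by a second quoted string.

```python
import re

ORIGINAL = re.compile(r"""(["'])(?:\\\1|.)*?\1""")
# Revised: an optional backslash is consumed together with whatever follows it,
# so \\ is stepped over as a pair and cannot "escape" the closing quote.
REVISED = re.compile(r"""(["'])(?:\\?.)*?\1""")

# Subject text: "a\\" "b"  (first literal ends with an escaped backslash)
text = r'''"a\\" "b"'''

# The original misreads the second backslash plus quote as an escaped quote
# and runs on into the opening quote of the next literal.
original_hit = ORIGINAL.search(text).group(0)
# The revised pattern finds both literals separately.
revised_hits = [m.group(0) for m in REVISED.finditer(text)]

print(original_hit)
print(revised_hits)
```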

To also avoid an issue with quote marks which are followed by an escaped quote of the same type but are not followed by a closing quote, make the first quantifier possessive, like so:

(["'])(?:\\?+.)*?\1

Or, if the regex engine you're using doesn't support possessive quantifiers or atomic groups (which can be used equivalently), use one of the following:

(["'])(?:(?=(\\?))\2.)*?\1    «OR»    (["'])(?:(?!\1)[^\\]|\\.)*\1

The former mimics an atomic group, and the latter utilizes a negative lookahead which allows replacing the lazy star with a greedy one. There is still potential to optimize these for efficiency, and none of them account for the outermost opening quote marks being escaped or other issues regarding context (e.g., you might want to ignore quoted strings within code comments), but still, these are probably good enough for most people's needs.
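The unterminated-quote issue is easier to see with a test case. The sketch below uses Python and sticks to the two portable patterns, so it does not depend on possessive quantifier support (Python only gained that syntax in version 3.11). PLAIN, ATOMIC_LIKE, GREEDY, and the sample strings are names chosen for this example.

```python
import re

PLAIN = re.compile(r"""(["'])(?:\\?.)*?\1""")
# Lookahead captures the optional backslash; \2 then consumes it.
# Because the lookahead is not re-entered on backtracking, the
# backslash cannot be given back, which mimics an atomic group.
ATOMIC_LIKE = re.compile(r"""(["'])(?:(?=(\\?))\2.)*?\1""")
# Greedy variant: any non-backslash that is not the closing quote,
# or a backslash plus the character it escapes.
GREEDY = re.compile(r"""(["'])(?:(?!\1)[^\\]|\\.)*\1""")

unterminated = r'''"abc\"'''      # opening quote, escaped quote, no closing quote
complete = r'''"abc\" def"'''     # same start, but properly closed

# PLAIN backtracks, lets the dot eat the backslash, and wrongly
# accepts the escaped quote as the closing quote.
plain_hit = PLAIN.search(unterminated)
atomic_hit = ATOMIC_LIKE.search(unterminated)
greedy_hit = GREEDY.search(unterminated)

print(plain_hit.group(0) if plain_hit else None)
print(atomic_hit, greedy_hit)
print(ATOMIC_LIKE.search(complete).group(0))
print(GREEDY.search(complete).group(0))
```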

Capturing vs. Non-Capturing Groups

I near-religiously use non-capturing groups whenever I do not need to reference a group's contents. Recently, several people have asked me why this is, so here are the reasons:

  • Capturing groups negatively impact performance. The performance hit may be tiny, especially when working with small strings, but it's there.
  • When you need to use several groupings in a single regex, only some of which you plan to reference later, it's convenient to have the backreferences you want to use numbered sequentially. E.g., the logic in my parseUri UDF could not be nearly as simple if I had not made appropriate use of capturing and non-capturing groups within the same regex.
  • They might be slightly harder to read, but ultimately, non-capturing groups are less confusing and easier to maintain, especially for others working with your code. If I modify a regex and it contains capturing groups, I have to worry about whether they're referenced anywhere outside of the regex itself, and what exactly they're expected to contain.
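The numbering point is easy to demonstrate. In this sketch (Python, with an invented URL pattern that is much simpler than anything in parseUri), only the host and port are of interest; ALL_CAPTURING and SELECTIVE are hypothetical names.

```python
import re

# Every group captures: the wanted values land in groups 2 and 4.
ALL_CAPTURING = re.compile(r"(https?|ftp)://([^/:]+)(:(\d+))?")
# Only the wanted values capture: they land in groups 1 and 2.
SELECTIVE = re.compile(r"(?:https?|ftp)://([^/:]+)(?::(\d+))?")

url = "http://example.com:8080/path"

all_groups = ALL_CAPTURING.match(url).groups()
wanted_groups = SELECTIVE.match(url).groups()

print(all_groups)      # four groups, two of them noise
print(wanted_groups)   # just host and port, numbered sequentially
```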

Of course, some capturing groups are necessary. There are three scenarios which meet this description:

  • You're using parts of a match to construct a replacement string, or otherwise referencing parts of the match in code outside the regex.
  • You need to reuse parts of the match within the regex itself. E.g., (["'])(?:\\\1|.)*?\1 would match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match, and allowing inner, escaped quotes of the same type as the enclosure. (Update: For details about this pattern, see Regexes in Depth: Advanced Quoted String Matching.)
  • You need to test if a group has participated in the match so far, as the condition to evaluate within a conditional. E.g., (a)?b(?(1)c|d) only matches the values "bd" and "abc".

If a grouping doesn't meet one of the above conditions, there's no need to capture.
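Each of the three scenarios can be shown in a couple of lines. The sketch below uses Python's re module, which supports conditionals; the first two patterns and all variable names are invented for illustration, while the third pattern is the one quoted above.

```python
import re

# 1. Referencing captured text from a replacement string.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

# 2. Reusing captured text inside the regex itself (a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)

# 3. Testing whether a group participated, as a conditional's condition.
COND = re.compile(r"(a)?b(?(1)c|d)")
candidates = ["bd", "abc", "bc", "abd"]
# fullmatch() is used so that "abd" cannot sneak through by matching only "bd".
accepted = [s for s in candidates if COND.fullmatch(s)]

print(swapped)
print(doubled)
print(accepted)
```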

More URI-Related UDFs

To follow up my parseUri() function, here are several more UDFs I've written recently to help with URI management:

  • getPageUri()
    Returns a struct containing the relative and absolute URIs of the current page. The difference between getPageUri().relative and CGI.SCRIPT_NAME is that the former will include the query string, if present.
  • matchUri(testUri, [masterUri])
    Returns a Boolean indicating whether or not two URIs are the same, disregarding the following differences:
    • Fragments (page anchors), e.g., "#top".
    • Inclusion of "index.cfm" in paths, e.g., "/dir/" vs. "/dir/index.cfm" (supports trailing query strings).
    If masterUri is not provided, the current page is used for comparison (supports both relative and absolute URIs).
  • replaceUriQueryKey(uri, key, substring)
    Replaces a URI query key and its value with a supplied key=value pair. Works with relative and absolute URIs, as well as standalone query strings (with or without a leading "?"). This is also used to support the following two UDFs:
  • addUriQueryKey(uri, key, value)
    Removes any existing instances of the supplied key, then appends it together with the provided value to the provided URI.
  • removeUriQueryKey(uri, key)
    Removes one or more query keys (comma delimited) and their values from the provided URI.
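For readers who don't use ColdFusion, here is a rough Python analogue of the add and remove behavior described above. It is not a port of the UDFs: it leans on the standard urllib.parse module instead of regexes, the function names are simply snake_case renderings, and it assumes a URI with a "?" before its query (it does not handle bare query strings without one).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def remove_uri_query_key(uri, keys):
    """Drop every query parameter named in the comma-delimited `keys`."""
    unwanted = {k.strip() for k in keys.split(",")}
    parts = urlsplit(uri)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in unwanted
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def add_uri_query_key(uri, key, value):
    """Remove existing instances of `key`, then append key=value."""
    parts = urlsplit(remove_uri_query_key(uri, key))
    pairs = parse_qsl(parts.query, keep_blank_values=True) + [(key, value)]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(add_uri_query_key("/dir/page.cfm?a=1&key=old#top", "key", "value"))
print(remove_uri_query_key("http://example.com/?a=1&b=2&c=3", "a, c"))
```

One design difference to be aware of: urlencode() re-encodes every remaining value, so a query string that was not already in normalized form may come back looking slightly different.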

View the source code.

Now that I have these at my disposal, I frequently find myself using them in combination with each other. E.g.:

<a href="<cfoutput>#addUriQueryKey(
	getPageUri().relative,
	"key",
	"value"
)#</cfoutput>">Link</a>.

Let me know if you find any of these useful.

In other news, this cracked me up.

parseUri: Split URLs in ColdFusion

Update: I've added a JavaScript implementation of the following UDF. See parseUri: Split URLs in JavaScript.

Here's a UDF I wrote recently which allows me to show off my regex skillz. parseUri() splits any well-formed URI into its components (all are optional).

The core code is already very brief, but I could replace everything within the <cfloop> with one line of code if I didn't have to account for bugs in the reFind() function (tested in CF7). Note that all components are split with a single regex, using backreferences. My favorite part of this UDF is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers.

Since the function returns a struct, you can do, e.g., parseUri(uri).anchor, etc. Check it out:

See the demo and get the source code.
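To give a feel for the single-regex approach without reproducing the UDF, here is a minimal Python sketch. The pattern is adapted from the generic one in RFC 3986, Appendix B, not from parseUri(), and it does not attempt the directory/filename split described above. URI_PARTS, parse_uri, and the key names (other than "anchor") are assumptions made for this example.

```python
import re

# One regex, one pass. Non-capturing groups wrap the delimiters so that
# only the five components themselves are captured.
URI_PARTS = re.compile(
    r"(?:(?P<scheme>[^:/?#]+):)?"
    r"(?://(?P<authority>[^/?#]*))?"
    r"(?P<path>[^?#]*)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<anchor>.*))?"
)


def parse_uri(uri):
    """Return a dict of URI components; missing components are empty strings."""
    # Every component is optional, so match() always succeeds.
    return URI_PARTS.match(uri).groupdict(default="")


parts = parse_uri("http://www.example.com:8080/dir/file.cfm?a=1#top")
print(parts["authority"], parts["path"], parts["anchor"])

relative = parse_uri("/dir/?q")
print(relative["path"], relative["query"])
```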

REMatch (ColdFusion)

Following are some UDFs I wrote recently to make using regexes in ColdFusion a bit easier. The biggest deal here is my reMatch() function.

reMatch(), in its most basic usage, is similar to JavaScript's String.prototype.match() method. Compare getting the first number in a string using reMatch() vs. built-in ColdFusion functions:

  • reMatch:
    <cfset num = reMatch("\d+", string) />
  • reReplace:
    <cfset num = reReplace(string, "\D*(\d+).*", "\1") />
  • reFind:
    <cfset match = reFind("\d+", string, 1, TRUE) />
    <cfset num = mid(string, match.pos[1], match.len[1]) />

All of the above would return the same result, unless a number wasn't found in the string, in which case the reFind()-based method would throw an error since the mid() function would be passed a start value of 0. I think it's pretty clear from the above which approach is easiest to use for a situation like this, and it would be easy to envision scenarios where this functionality could more drastically improve code brevity.

Still, that's just the beginning of what reMatch() can do. Change the scope argument from the default of "ONE" to "ALL" (to follow the convention used by reReplace(), etc.), and the function will return an array of all matches. Finally, set the returnLenPos argument to TRUE and the function will return either a struct or array of structs (based on the value of scope) containing the len, pos, AND value of each match. This is very different from how the returnSubExpressions argument of reFind() works. When using returnSubExpressions, you get back a struct containing arrays of the len and pos (but not value) of each backreference from the first match.
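The argument combinations are easier to picture with a sketch. Below is a rough Python analogue of the behavior described above, not the UDF itself: the argument order follows the reMatch() call shown earlier, pos is 1-based to follow ColdFusion's convention, and returning an empty string when scope is "ONE" and nothing matches is my assumption. Python's regex flavor also differs from ColdFusion's.

```python
import re


def re_match(pattern, string, scope="ONE", return_len_pos=False):
    """Return the first match, or all matches when scope is "ALL"."""

    def describe(m):
        if return_len_pos:
            # 1-based position, mirroring ColdFusion string functions.
            return {"len": len(m.group(0)), "pos": m.start() + 1, "value": m.group(0)}
        return m.group(0)

    if scope.upper() == "ALL":
        return [describe(m) for m in re.finditer(pattern, string)]

    first = re.search(pattern, string)
    # Assumption: an empty string signals "no match" for scope "ONE".
    return describe(first) if first else ""


subject = "abc 123 def 45"
print(re_match(r"\d+", subject))
print(re_match(r"\d+", subject, scope="ALL"))
print(re_match(r"\d+", subject, return_len_pos=True))
```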

Here's the code, with four additional UDFs (reMatchNoCase(), match(), matchNoCase(), and reEscape()) added for good measure:

See the demo and get the source code.

Now that I've got a deeply featured match function, all I need Adobe to add to ColdFusion in the way of regex support is lookbehinds, atomic groups, possessive quantifiers, conditionals, balancing groups, etc., etc.…