Regexes in Depth: Advanced Quoted String Matching

In my previous post, one of the examples I used of when capturing groups are appropriate demonstrated how to match quoted strings:

(["'])(?:\\\1|.)*?\1

(Note: This has some issues. See Edit 2, at the end of this post, if you intend to use this in an application.)

To recap, that will match values enclosed in either double or single quotes, while requiring that the same quote type start and end the match. It also allows inner, escaped quotes of the same type as the enclosure.

On his blog, Ben Nadel asked:

I do not follow the \\\1 in the middle group. You said that that was an escaped closing of the same type (group 1). I do not follow. Does that mean that the middle group can have quotes in it? If that is the case, how does the reluctant search in the middle (*?) know when to stop if it can have quotes in side of it? What am I missing?

Good question. Following is the response I gave, slightly updated to improve clarity:


First, to ensure we're on the same page, here are some examples of the kinds of quoted strings the regex will correctly match:

  • "test"
  • 'test'
  • "t'es't"
  • 'te"st'
  • 'te\'st'
  • "\"te\"\"st\""

In other words, it allows any number of escaped quotes of the same type as the enclosure.
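As a quick sanity check, the sample strings above can be run through the regex. This is only a sketch using Python's re module (my choice for illustration; the names QUOTED, samples, and matched are hypothetical), and it assumes Python handles backreferences and lazy quantifiers the same way as the engines discussed in this post, which it does for this pattern.

```python
import re

# The regex under discussion: an opening quote, then a lazily repeated
# "escaped quote of the same type OR any single character", then the
# same quote type again.
QUOTED = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

samples = [
    '"test"',
    "'test'",
    '''"t'es't"''',
    """'te"st'""",
    r"'te\'st'",
    r'"\"te\"\"st\""',
]

# For each sample, record the text the regex actually matched.
matched = [QUOTED.search(s).group(0) for s in samples]

for sample, result in zip(samples, matched):
    print(sample, "->", result)
```

Each sample should come back unchanged, meaning the match runs from the opening quote to the real closing quote rather than stopping at an inner or escaped one.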

As for how the regex works, it uses a trick similar in construct to the examples I gave in my blog post about regex recursion.

Basically, the inner grouping matches either an escaped quote or any single character, with the escaped-quote alternative tried before the dot. So, as the lazy repetition operator (*?) steps through the match looking for the first closing quote, it jumps right past each instance of the two characters which together make up an escaped quote. In other words, pairing the backslash with the quote mark makes the lazy repetition operator treat them as one node and continue on its way through the string.
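To see why the order of the alternatives matters, here is a small sketch in Python (the names ESCAPE_FIRST and DOT_FIRST are hypothetical, and the swapped pattern is my own variation, not something from the original discussion). With the dot tried first, the backslash is consumed on its own and the escaped quote is then mistaken for the closing quote.

```python
import re

# Order used in the post: escaped quote first, then any character.
ESCAPE_FIRST = re.compile(r'''(["'])(?:\\\1|.)*?\1''')

# Hypothetical variation with the two alternatives swapped.
DOT_FIRST = re.compile(r'''(["'])(?:.|\\\1)*?\1''')

text = r"'te\'st'"

escape_first = ESCAPE_FIRST.search(text).group(0)
dot_first = DOT_FIRST.search(text).group(0)

print(escape_first)  # 'te\'st'
print(dot_first)     # 'te\'
```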

Note that if you wanted to support multi-line quotes in libraries without an option to make dots match newlines (e.g., JavaScript), change the dot to [\S\s]. Also, with regex engines which support negative lookbehinds (i.e., not those used by ColdFusion, JavaScript, etc.), the following two patterns would be equivalent to each other:

  • (["'])(?:\\\1|.)*?\1 (the regex being discussed)
  • (["']).*?(?<!\\)\1 (uses a negative lookbehind to achieve logic which is possibly simpler to understand)

Because I use JavaScript and ColdFusion a lot, I automatically default to constructing patterns in ways which don't require lookbehinds. And if you can create a pattern which avoids lookbehinds it will often be faster, though in this case it wouldn't make much of a difference.
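The two points above can be checked with a rough sketch in Python, which happens to support lookbehind (the names DOT, ANY_CHAR, and LOOKBEHIND are hypothetical). It only shows that the two patterns agree on the sample strings from this post and that swapping the dot for [\S\s] lets a quote span a line break; it is not a proof that the patterns agree on every input.

```python
import re

DOT = re.compile(r'''(["'])(?:\\\1|.)*?\1''')
ANY_CHAR = re.compile(r'''(["'])(?:\\\1|[\S\s])*?\1''')  # dot replaced by [\S\s]
LOOKBEHIND = re.compile(r'''(["']).*?(?<!\\)\1''')

# A quoted string containing a line break.
multi_line = '"line one\nline two"'
dot_result = DOT.search(multi_line)                  # dot cannot cross the newline
any_result = ANY_CHAR.search(multi_line).group(0)    # [\S\s] can

# Sample strings from earlier in the post.
samples = ['"test"', "'test'", r"'te\'st'", r'"\"te\"\"st\""']
same_on_samples = all(
    DOT.search(s).group(0) == LOOKBEHIND.search(s).group(0) for s in samples
)

print(dot_result, repr(any_result), same_on_samples)
```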

One final thing worth noting is that in neither regex did I try to use anything like [^\1] for matching the inner, quoted content. If [^\1] worked as you might expect, it might allow us to construct a slightly faster regex which would greedily jump from the start to the end of each quote and/or between escaped quotes. First of all, the reason we can't greedily repeat an "any character" pattern such as a dot or [\S\s] is that we would then no longer be able to distinguish between multiple discrete quotes within the same string, and our match would go from the start of the first quote to the end of the last quote. The reason we can't use [^\1] either is because you can't use backreferences within character classes, even though in this case the backreference's value is only one character in length. Also note that the patterns [\1] and [^\1] actually do have special meaning, though possibly not what you would expect. With most regular expression libraries, they assert: match a single character which is/is not octal index 1 in the character set. To assert that outside of a character class, you'd typically need to use a leading zero (e.g., \01), but inside a character class the leading zero is often optional.
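Both halves of that explanation can be demonstrated with a short sketch. I'm using Python's re module here as one example of a library where \1 inside a character class is read as an octal escape; other libraries may behave differently, so treat the second part as an illustration of Python only. The variable names are hypothetical.

```python
import re

text = '"one" and "two"'

# A greedy dot runs from the first opening quote to the last closing quote.
greedy = re.search(r'''(["']).*\1''', text).group(0)

# A lazy dot stops at the first closing quote.
lazy = re.search(r'''(["']).*?\1''', text).group(0)

# In Python, [\1] is not a backreference: it matches the character with
# octal code 1, and [^\1] matches anything except that character.
probe = 'a\x01b"'
class_hits = re.findall(r'[\1]', probe)
negated_hits = re.findall(r'[^\1]', probe)

print(greedy)
print(lazy)
print(class_hits, negated_hits)
```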


If anyone has questions about how or why other specific regex patterns work or don't work, let me know, and I can try to make "Regexes in Depth" a regular feature here.


Edit: Just for kicks, here's a version which adds support for fancy, angled “…” and ‘…’ pairs. This uses a lookbehind and conditionals. Libraries which support both features include the .NET framework and PCRE.

(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))

In the above, I'm using nested conditionals to achieve an if/elseif/else construct. Here are some examples of the kinds of quoted strings the above regex adds support for (in addition to preserving support for quotes enclosed with " or '):

  • “test”
  • “te“st”
  • “te\”st”
  • ‘test’
  • ‘t‘e“"”s\’t’
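Python's re module also supports both lookbehind and conditionals, so the pattern can be tried there as well. The following is only a sketch under that assumption (FANCY, samples, and matched are hypothetical names); I have not checked it against .NET or PCRE here.

```python
import re

# Group 1 captures a straight quote, group 2 captures an opening curly
# double quote, and the third alternative (no group) is an opening curly
# single quote. The nested conditional then picks the matching closer.
FANCY = re.compile(r'''(?:(["'])|(“)|‘).*?(?<!\\)(?(1)\1|(?(2)”|’))''')

samples = [
    '“test”',
    '“te“st”',
    '“te\\”st”',
    '‘test’',
    '‘t‘e“"”s\\’t’',
    '"test"',
    "'te\\'st'",
]

# For each sample, record the text the regex actually matched.
matched = [FANCY.search(s).group(0) for s in samples]

for sample, result in zip(samples, matched):
    print(sample, "->", result)
```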

Edit 2: Note that these regexes were designed more for illustrative purposes than practical use within programming. One issue is that they don't account for escaped backslashes within the quotes (e.g., they treat \\" as a backslash followed by an escaped double quote, rather than an escaped backslash followed by an unescaped double quote). However, that's easy to address. For the first regex, just replace it with this:

(["'])(?:\\?.)*?\1
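Here is a small sketch of the difference, again in Python with hypothetical names (ORIGINAL, REVISED). The test input contains a string ending in an escaped backslash, followed by a second quoted string.

```python
import re

ORIGINAL = re.compile(r'''(["'])(?:\\\1|.)*?\1''')
REVISED = re.compile(r'''(["'])(?:\\?.)*?\1''')

# The first quoted value ends with an escaped backslash: "a\\"
text = r'"a\\" "b"'

original_match = ORIGINAL.search(text).group(0)
revised_match = REVISED.search(text).group(0)

print(original_match)  # runs past the real closing quote
print(revised_match)   # stops at the real closing quote
```

The original pattern reads the second backslash plus the quote as an escaped quote and keeps going until the next quote mark, while the revised pattern consumes the two backslashes as a pair and ends the match where it should.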

To also avoid an issue with quote marks which are followed by an escaped quote of the same type but are not followed by a closing quote, make the first quantifier possessive, like so:

(["'])(?:\\?+.)*?\1

Or, if the regex engine you're using doesn't support possessive quantifiers or atomic groups (which can be used equivalently), use one of the following:

(["'])(?:(?=(\\?))\2.)*?\1    «OR»    (["'])(?:(?!\1)[^\\]|\\.)*\1

The former mimics an atomic group, and the latter utilizes a negative lookahead which allows replacing the lazy star with a greedy one. There is still potential to optimize these for efficiency, and none of them account for the outermost opening quote marks being escaped or other issues regarding context (e.g., you might want to ignore quoted strings within code comments), but still, these are probably good enough for most people's needs.
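The unterminated-quote issue and the two workarounds can be sketched in Python as follows. I'm deliberately leaving out the possessive version, since possessive quantifiers are only available in newer Python releases. The names LAZY, ATOMIC_LIKE, and LOOKAHEAD are hypothetical.

```python
import re

LAZY = re.compile(r'''(["'])(?:\\?.)*?\1''')
ATOMIC_LIKE = re.compile(r'''(["'])(?:(?=(\\?))\2.)*?\1''')  # mimics an atomic group
LOOKAHEAD = re.compile(r'''(["'])(?:(?!\1)[^\\]|\\.)*\1''')  # greedy star

# An opening quote followed by an escaped quote, with no closing quote.
unterminated = r'"abc\"'

# The plain lazy version backtracks, gives up the optional backslash,
# and then accepts the escaped quote as a closing quote.
lazy_result = LAZY.search(unterminated)

# The two workarounds cannot backtrack that way, so they find no match.
atomic_result = ATOMIC_LIKE.search(unterminated)
lookahead_result = LOOKAHEAD.search(unterminated)

# On well-formed input, all three agree.
well_formed = r'"a\\" "b"'
first_matches = [p.search(well_formed).group(0) for p in (LAZY, ATOMIC_LIKE, LOOKAHEAD)]

print(lazy_result.group(0), atomic_result, lookahead_result)
print(first_matches)
```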

22 thoughts on “Regexes in Depth: Advanced Quoted String Matching”

  1. Hi Steven, I stumbled upon your regex examples while trying to stop my head from bashing in to this wall. I wonder if you’d be able to take a quick look at this problem I’ve got – it seems to be something similar to the above.

    I’m trying to rip through a bunch of PHP code and pull out some parameters within a function. The function name is __() and it’s the first 2 variables that are passed that I’m interested in. So something like:

    __('new_test_token',"– JUST A TEST –");
    __('new_test_token2',"– JUST ANOTHER \"TEST\" –");

    would be normal, there might also be extra parameters passed too. So far I have this expression:

    __\((["'])([a-zA-Z0-9_\-\.]*?)\1,(["'])([^"]*?)\3(\s*,?.*?)\)

    Which works fine as long as there are no escaped quotes in the string (as in the first example). The first variable has a strict use, so that’s no problem. And I think the last part matches onwards ok. It’s just the ([^"]*?) expression I’m having a problem with. I’ve tried a few variations of your code above, but without much success.

    Any help would be greatly appreciated 🙂

  2. Tony, PHP string literals add complexity because, in PHP, double-quoted strings and single-quoted strings work differently (e.g., in single-quoted strings, “\” is not a metacharacter unless it’s followed by another “\” or a single quote, whereas it’s always a metacharacter in double-quoted strings).

    Try this regex (the first and second parameters are captured to $1 and $2):

    __\(\s*((?>'(?:[^'\\]|\\['\\]?+)*'|"(?:[^"\\]|\\.)*")),\s*((?>'(?:[^'\\]|\\['\\]?+)*'|"(?:[^"\\]|\\.)*"))\s*(?:,.*?)?\);

    That might work for your needs, but there are a couple issues involving recursion, etc., which are not accounted for. Either way, let me know how that works out.

  3. Wow, thanks, Steve. That seems to have done the trick on a first pass. I’ll do a bit of testing today and see how it goes.

    Thanks again!

  4. If we wanted to deal with, say, only double-quoted strings, we could get by with "([^\\"]|\\.)*" and for single quotes '([^\\']|\\.)*' so why not just do a choice between them:

    ("([^\\"]|\\.)*"|'([^\\']|\\.)*')

    It may not be pretty, but I think it’s a midge faster than the other versions and works with just about the most primitive regex engines.

  5. A fair point, Ted. However, this post was born out of examples of using backreferences. I think the regex (["'])(\\?+.)*?\1 is quite pretty due to its brevity, but the rest of the regexes in this post are not great (this is one of my oldest posts here). If you really cared about performance you could unroll the loops, resulting in "[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'. That’s just as portable as what you posted if you take out the ?:s which create non-capturing groups.

  6. I am trying to adapt this in VB/VBScript to parse a dynamic sql result string for parameters. I will then take each matched parameter and use a command parameter to help alleviate sql injection concerns. I was getting tripped up with some of the more advanced input scenarios though, but now I think I have it straightened out. Thought I’d share it with you since I found your post here to do all the heavy lifting for me and others may find this useful.

    Since VBS escapes a double quote by doubling it, I have modified the regex to handle doubled quotes instead of JavaScript-style backslash escapes.

    <%
    'SQL query tests using parameters
    
    '@@@@@@@@@@@@@@@@@@@@     Functions     @@@@@@@@@@@@@@@@@@@@
    	function ConvertToParams(sqlIn)
    		retVal = sqlIn
    		
    		Set RegularExpressionObject = New RegExp
    		
    		With RegularExpressionObject
    		.Pattern = "([""'])(?:(?!\1)[^""""]|"""".)*\1"
    		.IgnoreCase = True
    		.Global = True
    		End With
    		
    		Set expressionmatch = RegularExpressionObject.Execute(retVal)
    		
    		if expressionmatch.Count > 0 Then
    			For Each expressionmatched in expressionmatch
    				Response.Write "<B>" & expressionmatched.Value & "</B> was matched at position <B>" & expressionmatched.FirstIndex & "</B><BR>"
    			Next
    		end if
    		
    		ConvertToParams = retVal
    	end function	
    	
    '@@@@@@@@@@@@@@@@@@@@     End Functions     @@@@@@@@@@@@@@@@@@@@
    %>
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
     <head>
     <title>Parameterized Queries Test</title>
     <script type="application/javascript">
     </script>
     </head>
     <body>
     <p>
     <% 
    	 
    	 SQL = "INSERT INTO dbo.TEST " & _
    			"(TEXT) " & _
    			"VALUES " & _
    			"(""I would say that the """"preference"""" is for a more open solution that will meet everyone's needs.  Wouldn't you agree with these thoughts I've had?"") " & _
    			"go " & _
    			"INSERT INTO dbo.TEST " & _
    			"(TEXT) " & _
    			"VALUES " & _
    			"(""O'brien's favorite pastime isn't listed although it """"should"""" be"") " & _
    			"go "
    
    	 response.write("SQL=" & SQL & "<br /><br />" &vbcrlf)	
    	 response.write(ConvertToParams(SQL) & "<br /><br /><br /><br />" &vbcrlf)		 
     %>
     </p>
     </body>
    </html>
    
  7. In PHP, I’m using this pattern :
    '/([\"\'])(?:.*[^\\\\]+)*(?:(?:\\\\{2})*)+\1/xU'

    Works great for me, even with multiple backslashed-backslashes, like :

    name="manda yugana \\\"gantenx\\\" banget"

    or

    name="manda yugana \\\\\"gantenx\\\\\" banget"

  8. I like this regex, pretty cool. However, one caveat to mention… in source code for a language like javascript that supports regex literals, this regex would match what it thinks are string literals inside a regex literal, which some might argue it should not.

    currently working on my own set of code to properly recognize/tokenize string literals, regex literals, and single/multi-line comments from JS code, and I’ve found that regex literals in particular throw lots of monkey wrench into the process.

    The only way i’ve found is a combination of limited regexes and a lot of stateful loop iteration processing to identify all the various cases where strings appear inside of regexes, or regexes appear inside of strings, or all the other weird cases that can happen. lots of fun, let me tell ya. 🙂

  9. I am trying to get a regular expression that will only remove double quotes between any html tag.

    <TABLE borderColor=#111111 cellSpacing=0 cellPadding=2 width= “100%” border=0>I Like “cheese”

    Only want to remove quotes from width parameter. Any help would be appreciated…

  10. hi, surfing 1 hrs on web and have to say we have here
    interesting site and I’m gonna like it.

    look forward to surfing around and reading alot of topics here.

    figured I say what’s up!

  11. Hi, You have a great website. I am trying to figure out a regex for finding a text that has this pattern:
    /">any text

    I tried:
    ([/">])[0-9a-zA-Z]

    but it doesn’t seem to be working, any idea?
    Thanks

    PS. I am coding in Java

  12. Hi, You have a great website. I am trying to figure out a regex for finding a text that has this pattern:
    /">any text

    I tried:
    ([/">])[0-9a-zA-Z]/

    but it doesn’t seem to be working, any idea?
    Thanks

    PS. I am coding in Java

  13. Good day Steve:
    I was trying to retrieve some quoted strings with lex. It has become clear, after I tried and studied your examples, that lex (or flex) supports neither lookaround nor the lazy star / lazy plus operators.
    So this is what I did after a day off from school was wholly burnt:
    (")((\\\\)|(\\")|([^*"]))*(")
    The result from lex looks just fine for me so far but could you still tell me whether it’s the right thing or not in such circumstances?

  14. By the way, just to clarify, the flex i mentioned isn’t the adobe’s sdk. It’s a lexical analyzer employed just for tokenizing. It’s quite lame, i haven’t expected a clashing here.

  15. Hi
    I want to find the missing single and double quotes in an HTML file using regex. Can anyone help me?

  16. Hi Steve,

    I know this is a really old post, but how would you account for the following case:

    \"Text"

    So far I’ve come to this, but I can’t seem to find a solution as concise as yours:

    (?<![^\\]\\)('|").+?(?<![^\\]\\)\1

  17. Thanks so much. The correct pattern for PHP should be:

    preg_match('/^(["\'])(?:\\\\\1|.)*?\1/', $str, $matches);

    Note that the backslash character is repeated four times.

  18. Many thanks for your analysis, it has been very helpful. However, there are several issues with your regex. For example it starts capturing if the initial " is quoted, as in:

    abc"na\\\i\\"l\\\" t\he 2\"x4\" plank"def

    Also, an escaped \' cannot appear inside a single-quoted string.

    Please compare:

    Your regex: https://regex101.com/r/S7PQjk/1
    An alternative: https://regex101.com/r/Cu8lIA/1/
