Flagrant Badassery

A JavaScript and regular expression centric blog

Mimicking Atomic Groups

So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))"), even though, of course, lookarounds only match a position. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference zero)!

Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.

Here's how it's done:

(?=(pattern to make atomic))\1

Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually consume anything, so the rest of the regex can't backtrack into it), followed by "\1" (the backreference you just captured), to act just like an atomic group. That produces the exactly same result as "(?>pattern to make atomic)", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking sweet.

There Are 16 Responses So Far. »

  1. All right, man. You’ve officially eclipsed my regex capacity. I considered myself pretty proficient until today. I’ve read this post 3 or 4 times and…my mind is officially blown.

    That aside, you’ve introduced some pretty innovative stuff in this space. Keep it coming. I always learn something – even if I don’t understand what it is.

  2. If this works im in awe. I was pretty shocked that javascript does not support atomic groups, took me a long time to realise, it does apear like it would in the references ive found but all my testing sugests otherwise. You just saved my day, well, actually, probably saved me several days.

  3. @nic, Safari 3 supports atomic groups, but no other browser I’ve tested does, and they’re not in the ECMAScript 3 or proposed ECMAScript 4 specs.

    BTW, since posting this I’ve learned that emulating atomic groups using capturing groups in lookahead is actually a fairly well-known trick. It’s mentioned in the official Perl documentation, as well as in Jeffrey Friedl’s Mastering Regular Expressions.

    See my post just after this one, Mimicking Conditionals, for another cool trick which I haven’t seen elsewhere.

  4. Of course, since possessive quantifiers are semantically equivalent to regular quantifiers in atomic groups, possessive quantifiers can be emulated using the same pattern, too.

  5. Absolutely (in fact I mentioned them in the post), but since possessive quantifiers are really just a notational convenience, I could get into a semantic argument about if using syntax less convenient than a regular atomic grouping can really be considered to be emulating a notational convenience.

  6. Thanks for the post. My situation was somewhat different: I did not have the need to fake an aotmic group, but since the .Net RegEx engine does not support possessive quantifiers, I was able to achieve the same results using your atomic groups sample. Thanks again.

  7. Hi,

    Just tried this via RegexBuddy and couldn’t get it to match.

    I took an existing regex that is part of the RegexBuddy library “HTML file (atomic)” which looks like this…

    <html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>

    …and then modified it to use the technique you discuss here but it doesn’t work…

    <html>(?=(.*?<head>))\1(?=(.*?<title>))\2(?=(.*?</title>))\3(?=(.*?</head>))\4(?=(.*?<body[^>]*>))\5(?=(.*?</body>))\6.*?</html>

    Any help appreciated.

    M.

  8. @Mark McDonnell, your modified regex works fine for me in RegexBuddy. Did you enable the “Dot matches newline” option? (When you use RegexBuddy’s library it will apply appropriate options automatically for that specific regex.)

  9. Hi Steve,

    Because I wanted to check out how to mimic atomic groups I changed the regex flavour to JavaScript and so now I’m unable to select “Dot matches new line”.

    I then tried changing the dot to [\S\s] but that also didn’t work?

    <html>(?=([\S\s]*?<head>))\1(?=([\S\s]*?<title>))\2(?=([\S\s]*?</title>))\3(?=([\S\s]*?</head>))\4(?=([\S\s]*?<body[^>]*>))\5(?=([\S\s]*?</body>))\6.*?</html>

  10. Dude, there’s still a dot in your updated regex (near the end). 😉

    In any case, I can tell you that according to the ES3/5 spec the method described here should work 100% reliably (and I’ve never had any problem with it in any browser or in RegexBuddy). If you find a bug in e.g. RegexBuddy’s JavaScript-flavor regex emulation (which I don’t think exists in this case), it would probably be best to report it on the RegexBuddy forums.

  11. OMG what an idiot I am! :-$ sorry about that 🙂

    Works like a charm now!

    Thanks again for your help clearing that up for me.

  12. This doesn’t appear to be exactly equivalent to atomic grouping, it doesn’t seem to play very nice with repetition.

    The regex /^(?>[abc])*$/ matches the string “abc” but /^((?=[abc]))\1*$/ does not.

  13. @Andrew, you need to quantify the entire emulated atomic group pattern. Thus, given your example, you need to use ^(?:(?=([abc]))\1)*$.

  14. […] took lookahead hacking to the next level, by using something similar to mimic conditionals and atomic groups. Mind = blown. This entry was posted in Tips and tagged RegExp, regular expressions. Bookmark the […]

  15. Hi Steven, this question is not about the trick itself but about something you said at the top of the post: “Consider that by using this technique, you can return text to your application (using backreferences) which wasn’t contained within your actual match (backreference zero)!” Is that so? If my string is “my black cat”, do you have a trick to capture “dog”? I tried this a while ago with similar ideas but didn’t succeed.

  16. Hello I am so grateful I found your blog, I really found you by error, while I was looking on Digg for something else, Anyhow I am here now and would just like to say
    thanks a lot for a remarkable post and a all round thrilling blog (I also
    love the theme/design), I don’t have time
    to read through it all at the moment but I have saved it and also added
    your RSS feeds, so when I have time I will be back to read a lot more,
    Please do keep up the excellent job.

Post a Response

If you are about to post code, please escape your HTML entities (&amp;, &gt;, &lt;).