Mimicking Atomic Groups

So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))"), even though, of course, lookarounds only match a position. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference zero)!

Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.

Here's how it's done:

(?=(pattern to make atomic))\1

Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually consume anything, so the rest of the regex can't backtrack into it), followed by "\1" (the backreference you just captured), to act just like an atomic group. That produces the exactly same result as "(?>pattern to make atomic)", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking sweet.

20 thoughts on “Mimicking Atomic Groups”

  1. All right, man. You’ve officially eclipsed my regex capacity. I considered myself pretty proficient until today. I’ve read this post 3 or 4 times and…my mind is officially blown.

    That aside, you’ve introduced some pretty innovative stuff in this space. Keep it coming. I always learn something – even if I don’t understand what it is.

  2. If this works im in awe. I was pretty shocked that javascript does not support atomic groups, took me a long time to realise, it does apear like it would in the references ive found but all my testing sugests otherwise. You just saved my day, well, actually, probably saved me several days.

  3. @nic, Safari 3 supports atomic groups, but no other browser I’ve tested does, and they’re not in the ECMAScript 3 or proposed ECMAScript 4 specs.

    BTW, since posting this I’ve learned that emulating atomic groups using capturing groups in lookahead is actually a fairly well-known trick. It’s mentioned in the official Perl documentation, as well as in Jeffrey Friedl’s Mastering Regular Expressions.

    See my post just after this one, Mimicking Conditionals, for another cool trick which I haven’t seen elsewhere.

  4. Of course, since possessive quantifiers are semantically equivalent to regular quantifiers in atomic groups, possessive quantifiers can be emulated using the same pattern, too.

  5. Absolutely (in fact I mentioned them in the post), but since possessive quantifiers are really just a notational convenience, I could get into a semantic argument about if using syntax less convenient than a regular atomic grouping can really be considered to be emulating a notational convenience.

  6. Thanks for the post. My situation was somewhat different: I did not have the need to fake an aotmic group, but since the .Net RegEx engine does not support possessive quantifiers, I was able to achieve the same results using your atomic groups sample. Thanks again.

  7. Hi,

    Just tried this via RegexBuddy and couldn’t get it to match.

    I took an existing regex that is part of the RegexBuddy library “HTML file (atomic)” which looks like this…

    <html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>

    …and then modified it to use the technique you discuss here but it doesn’t work…

    <html>(?=(.*?<head>))\1(?=(.*?<title>))\2(?=(.*?</title>))\3(?=(.*?</head>))\4(?=(.*?<body[^>]*>))\5(?=(.*?</body>))\6.*?</html>

    Any help appreciated.

    M.

  8. @Mark McDonnell, your modified regex works fine for me in RegexBuddy. Did you enable the “Dot matches newline” option? (When you use RegexBuddy’s library it will apply appropriate options automatically for that specific regex.)

  9. Hi Steve,

    Because I wanted to check out how to mimic atomic groups I changed the regex flavour to JavaScript and so now I’m unable to select “Dot matches new line”.

    I then tried changing the dot to [\S\s] but that also didn’t work?

    <html>(?=([\S\s]*?<head>))\1(?=([\S\s]*?<title>))\2(?=([\S\s]*?</title>))\3(?=([\S\s]*?</head>))\4(?=([\S\s]*?<body[^>]*>))\5(?=([\S\s]*?</body>))\6.*?</html>

  10. Dude, there’s still a dot in your updated regex (near the end). 😉

    In any case, I can tell you that according to the ES3/5 spec the method described here should work 100% reliably (and I’ve never had any problem with it in any browser or in RegexBuddy). If you find a bug in e.g. RegexBuddy’s JavaScript-flavor regex emulation (which I don’t think exists in this case), it would probably be best to report it on the RegexBuddy forums.

  11. This doesn’t appear to be exactly equivalent to atomic grouping, it doesn’t seem to play very nice with repetition.

    The regex /^(?>[abc])*$/ matches the string “abc” but /^((?=[abc]))\1*$/ does not.

  12. Hi Steven, this question is not about the trick itself but about something you said at the top of the post: “Consider that by using this technique, you can return text to your application (using backreferences) which wasn’t contained within your actual match (backreference zero)!” Is that so? If my string is “my black cat”, do you have a trick to capture “dog”? I tried this a while ago with similar ideas but didn’t succeed.

  13. 2018???????????
    ????2018??????????????????????
    100??????????????? 100????????????
    ????????????????
    ???????????????????????
    ??????????????????????????
    ??????????
    ??????????????????
    ???????????????????????????????????

  14. Wow man you are a hero! I am working on a solution to avoid catastrophic backtracking in JS, but it is amazingly hard without atomic groups.

    “`js
    var evilRegex = /^(a+)+$/g;
    var atomicGroupedEvilRegex = /^(?=((a+)+))\1$/g;
    var harmlessRegex = /^a+$/g;

    var a5 = “aaaaa”;
    var x = “x”;
    var string15 = a5 + a5 + a5 + x;

    evilRegex.test(string15); // 4.8k ops/sec
    atomicGroupedEvilRegex.test(string15); // 22M ops/sec
    harmlessRegex.test(string15); // 21M ops/sec
    “`

  15. I am actually having some difficulties with this..

    I am trying to make this regex (with atomic groups) for JS:

    /^(?=((?>.*?\S){20}))/

    The idea here is that I could create some sort of a “substring” with spaces ignored…

    Like:

    “While still in high school I signed up to participate in amateur night at the Educational Alliance. I wanted to show my mother I had talent.”

    Would return me:

    “While still in high scho”

    That contains 24 chars, but contains exactly 20 without the spaces…

    I tried doing

    ^((?=(?:(?=(.*?\S))\1){20}))

    This is not quite working, I just can’t figure it out why 🙁

  16. Jeffrey Friedl sends his regards! (this topic was described in his famous book)

Leave a Reply

Your email address will not be published. Required fields are marked *