Mimicking Atomic Groups
So, I was messing around with RegexBuddy and discovered that capturing groups work inside lookarounds (e.g., "(?=(captured))
"), even though, of course, lookarounds only match a position. Consider that by using this technique, you can return text to your application (using backreferences) which wasn't contained within your actual match (backreference zero)!
Thinking back to the regex I just posted about (which matches innermost HTML elements, supporting an infinite amount of nesting), I realized this technique could actually be used to fake an atomic grouping. So, I've added a note on the end of the last post with an improved non-atomic-group-reliant version, which sure enough is nearly identical in speed to the regex which uses a real atomic grouping.
Here's how it's done:
(?=(pattern to make atomic))\1
Basically, it uses a capturing group inside a positive lookahead (which captures but doesn't actually consume anything, so the rest of the regex can't backtrack into it), followed by "\1
" (the backreference you just captured), to act just like an atomic group. That produces the exactly same result as "(?>pattern to make atomic)
", but can be used in programming languages which don't support atomic groups or possessive quantifiers (assuming they do support lookaheads). I can now use such constructs in languages like JavaScript and ColdFusion, and I think that's pretty freaking sweet.
Comment by Rob on 4 April 2007:
All right, man. You’ve officially eclipsed my regex capacity. I considered myself pretty proficient until today. I’ve read this post 3 or 4 times and…my mind is officially blown.
That aside, you’ve introduced some pretty innovative stuff in this space. Keep it coming. I always learn something – even if I don’t understand what it is.
Comment by nic_tester on 28 November 2007:
If this works im in awe. I was pretty shocked that javascript does not support atomic groups, took me a long time to realise, it does apear like it would in the references ive found but all my testing sugests otherwise. You just saved my day, well, actually, probably saved me several days.
Comment by Steve on 28 November 2007:
@nic, Safari 3 supports atomic groups, but no other browser I’ve tested does, and they’re not in the ECMAScript 3 or proposed ECMAScript 4 specs.
BTW, since posting this I’ve learned that emulating atomic groups using capturing groups in lookahead is actually a fairly well-known trick. It’s mentioned in the official Perl documentation, as well as in Jeffrey Friedl’s Mastering Regular Expressions.
See my post just after this one, Mimicking Conditionals, for another cool trick which I haven’t seen elsewhere.
Comment by liorean on 29 November 2007:
Of course, since possessive quantifiers are semantically equivalent to regular quantifiers in atomic groups, possessive quantifiers can be emulated using the same pattern, too.
Comment by Steve on 29 November 2007:
Absolutely (in fact I mentioned them in the post), but since possessive quantifiers are really just a notational convenience, I could get into a semantic argument about if using syntax less convenient than a regular atomic grouping can really be considered to be emulating a notational convenience.
Comment by Jason on 11 April 2008:
Thanks for the post. My situation was somewhat different: I did not have the need to fake an aotmic group, but since the .Net RegEx engine does not support possessive quantifiers, I was able to achieve the same results using your atomic groups sample. Thanks again.
Comment by Mark McDonnell on 25 May 2010:
Hi,
Just tried this via RegexBuddy and couldn’t get it to match.
I took an existing regex that is part of the RegexBuddy library “HTML file (atomic)” which looks like this…
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
…and then modified it to use the technique you discuss here but it doesn’t work…
<html>(?=(.*?<head>))\1(?=(.*?<title>))\2(?=(.*?</title>))\3(?=(.*?</head>))\4(?=(.*?<body[^>]*>))\5(?=(.*?</body>))\6.*?</html>
Any help appreciated.
M.
Comment by Steven Levithan on 25 May 2010:
@Mark McDonnell, your modified regex works fine for me in RegexBuddy. Did you enable the “Dot matches newline” option? (When you use RegexBuddy’s library it will apply appropriate options automatically for that specific regex.)
Comment by Mark McDonnell on 26 May 2010:
Hi Steve,
Because I wanted to check out how to mimic atomic groups I changed the regex flavour to JavaScript and so now I’m unable to select “Dot matches new line”.
I then tried changing the dot to [\S\s] but that also didn’t work?
<html>(?=([\S\s]*?<head>))\1(?=([\S\s]*?<title>))\2(?=([\S\s]*?</title>))\3(?=([\S\s]*?</head>))\4(?=([\S\s]*?<body[^>]*>))\5(?=([\S\s]*?</body>))\6.*?</html>
Comment by Steven Levithan on 26 May 2010:
Dude, there’s still a dot in your updated regex (near the end). 😉
In any case, I can tell you that according to the ES3/5 spec the method described here should work 100% reliably (and I’ve never had any problem with it in any browser or in RegexBuddy). If you find a bug in e.g. RegexBuddy’s JavaScript-flavor regex emulation (which I don’t think exists in this case), it would probably be best to report it on the RegexBuddy forums.
Comment by Mark McDonnell on 26 May 2010:
OMG what an idiot I am! :-$ sorry about that 🙂
Works like a charm now!
Thanks again for your help clearing that up for me.
Comment by Andrew on 18 April 2012:
This doesn’t appear to be exactly equivalent to atomic grouping, it doesn’t seem to play very nice with repetition.
The regex /^(?>[abc])*$/ matches the string “abc” but /^((?=[abc]))\1*$/ does not.
Comment by Steven Levithan on 24 April 2012:
@Andrew, you need to quantify the entire emulated atomic group pattern. Thus, given your example, you need to use
^(?:(?=([abc]))\1)*$
.Pingback by Hacking lookahead to mimic intersection, subtraction and negation | Lea Verou on 28 May 2012:
[…] took lookahead hacking to the next level, by using something similar to mimic conditionals and atomic groups. Mind = blown. This entry was posted in Tips and tagged RegExp, regular expressions. Bookmark the […]
Comment by Roman on 17 May 2014:
Hi Steven, this question is not about the trick itself but about something you said at the top of the post: “Consider that by using this technique, you can return text to your application (using backreferences) which wasn’t contained within your actual match (backreference zero)!” Is that so? If my string is “my black cat”, do you have a trick to capture “dog”? I tried this a while ago with similar ideas but didn’t succeed.
Comment by Muzietto on 16 May 2018:
Still useful after all those years… thank you!
Comment by ?????????????? on 14 August 2018:
2018???????????
????2018??????????????????????
100??????????????? 100????????????
????????????????
???????????????????????
??????????????????????????
??????????
??????????????????
???????????????????????????????????
Comment by inf3rno on 25 July 2019:
Wow man you are a hero! I am working on a solution to avoid catastrophic backtracking in JS, but it is amazingly hard without atomic groups.
“`js
var evilRegex = /^(a+)+$/g;
var atomicGroupedEvilRegex = /^(?=((a+)+))\1$/g;
var harmlessRegex = /^a+$/g;
var a5 = “aaaaa”;
var x = “x”;
var string15 = a5 + a5 + a5 + x;
evilRegex.test(string15); // 4.8k ops/sec
atomicGroupedEvilRegex.test(string15); // 22M ops/sec
harmlessRegex.test(string15); // 21M ops/sec
“`