Excited by the fact that I can mimic atomic groups when using most regex libraries which don't support them, I set my sights on another of my most wanted features which is commonly lacking: conditionals (which provide an if-then-else construct). Of the regex libraries I'm familiar with, conditionals are only supported by .NET, Perl, PCRE (and hence, PHP's preg functions), and JGsoft products (including RegexBuddy).
There are two common types of regex conditionals in those libraries: capturing-group-based and lookaround-based. I'll get to the latter type in a bit, but first I'll address capturing-group-based conditionals, which are able to base logic on whether a capturing group has participated in the match so far. Here's an example:
(a)?b(?(1)c|d)
That matches only "bd" and "abc". The pattern can be expressed as follows:
(if_matched)?inner_pattern(?(1)then|else)
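To try the real construct, here is a minimal sketch in Python, whose re module also supports capturing-group-based conditionals. Python is not among the libraries listed above, and the helper name matches_whole is my own, not part of any library.

```python
import re

# Real capturing-group-based conditional:
# if group 1 took part in the match, require "c"; otherwise require "d".
REAL = re.compile(r"(a)?b(?(1)c|d)")


def matches_whole(pattern, text):
    """Return True when the pattern matches the entire string (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


for candidate in ["abc", "bd", "abd", "bc"]:
    # Only "abc" and "bd" are accepted.
    print(candidate, matches_whole(REAL, candidate))
```

Whole-string matching is used here so that a substring hit (for example, "bd" inside "abd") does not hide the behavior of the conditional.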
Here's a comparable pattern I created which doesn't require support for conditionals:
(?=(a)()|())\1?b(?:\2c|\3d)
Note that to use it without an "else" part, you still need to include the second empty backreference (in this case, \3) at the end, like this:
(?=(a)()|())\1?b(?:\2c|\3)
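As a rough check, the mimicked patterns can be run next to their real counterparts. The sketch below assumes Python's re module, which is not one of the engines I tested; it appears to treat a backreference to a non-participating group as a failed match, which is what the trick needs. The sample strings and the accepted helper are illustrative only.

```python
import re

# Real conditionals, with and without an "else" part.
REAL = re.compile(r"(a)?b(?(1)c|d)")
REAL_NO_ELSE = re.compile(r"(a)?b(?(1)c)")

# Mimicked versions, copied from the patterns above.
FAKE = re.compile(r"(?=(a)()|())\1?b(?:\2c|\3d)")
FAKE_NO_ELSE = re.compile(r"(?=(a)()|())\1?b(?:\2c|\3)")

SAMPLES = ["abc", "bd", "abd", "bc", "ab", "b"]


def accepted(pattern, samples=SAMPLES):
    """List the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if pattern.fullmatch(s)]


print(accepted(REAL), accepted(FAKE))
print(accepted(REAL_NO_ELSE), accepted(FAKE_NO_ELSE))
```

On these samples each mimicked pattern accepts the same strings as the real one. That is only a spot check on a handful of inputs, not a proof of equivalence.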
As a brief explanation of how the mimicked pattern works: the lookahead at the beginning contains a zero-length alternation option, which cancels the effect of the lookahead (it can never fail). At the same time, the intentionally empty capturing groups within the lookahead record which option matched, and the then/else part is based on that. However, there are a couple of issues:
- This doesn't work with some regex engines, due to how they handle backreferences for non-participating capturing groups.
- It interacts with backtracking differently than a real conditional (the "a" part is treated as if it were within an optional, atomic group, e.g., (?>(a))? instead of (a)?), so it might be better to think of this as a new operator which is similar to a conditional.
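One way the backtracking difference can show up is sketched below, again assuming Python's re module. Both patterns are variations I made up for illustration: the inner pattern is changed from "b" to "ab" so that the engine has a reason to give up the optional "a".

```python
import re

# Real conditional: if (a)? backtracks to match nothing, group 1 is unset
# and the "else" branch applies.
REAL = re.compile(r"(a)?ab(?(1)c|d)")

# Mimicked version: the lookahead has already recorded that "a" was seen,
# and that decision is not revisited when \1? later gives up its "a".
FAKE = re.compile(r"(?=(a)()|())\1?ab(?:\2c|\3d)")

for text in ["aabc", "abd"]:
    real_hit = REAL.fullmatch(text) is not None
    fake_hit = FAKE.fullmatch(text) is not None
    print(text, real_hit, fake_hit)
```

For "aabc" the two agree. For "abd" the real conditional matches by taking the "else" branch, while the mimicked pattern stays committed to the "then" branch and fails.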
Here are the regex engines I've briefly tested this pattern with:
Language | Supports fake cond. | Supports real cond. | Notes |
---|---|---|---|
.NET | Yes | Yes | Tested using Expresso. |
ColdFusion | Yes | No | Tested using ColdFusion MX7. |
Java | Yes | No | Tested using Regular Expression Test Applet. |
JavaScript | No | No | According to ECMA-262v3, backreferences to non-participating capturing groups always succeed, and most browsers respect that. Unfortunately, this pattern depends on the way most other regex engines handle such groups. |
JGsoft | Yes | Yes | (Edit:) Works as of RegexBuddy version 2.4.0. Previous versions contained two bugs (which I reported to JGsoft) which prevented this from working reliably. |
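Whether an engine can run the mimicked pattern comes down to how it treats a backreference to a group that did not participate. A small probe along these lines can tell the two behaviors apart. It is written for Python's re module as an assumed example, the function name is hypothetical, and the same test pattern could be ported to other engines.

```python
import re


def backref_to_unset_group_fails():
    """Probe how the engine treats a backreference to a non-participating group.

    On the input "b", (a)? matches nothing, so \\1 refers to an unset group.
    If the engine fails the backreference, there is no match and this returns
    True. If the engine lets it match the empty string (the ECMA-262v3
    behavior described in the table), the pattern matches and this returns
    False.
    """
    return re.fullmatch(r"(a)?\1b", "b") is None


print(backref_to_unset_group_fails())
```

A result of True means the engine behaves the way the mimicked conditional needs.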
As for lookaround-based conditionals, we can mimic them using the same concepts. Here's what a real lookaround-based conditional looks like (this example uses a positive lookahead for the assertion):
(?(?=if_assertion)then|else)
And here's how you can mimic it:
(?:(?=if_assertion()|())\1then|\2else)
Again, to use it without an "else" part, you still need to include the second empty backreference (in this case, \2) at the end, like this:
(?:(?=if_assertion()|())\1then|\2)
Notes:
- The above compatibility table applies just the same.
- Backtracking does not come into play with lookaround-based conditionals in the same way as with capturing-group-based conditionals. As a result, mimicked lookaround-based conditionals are functionally identical to their "real" counterparts.
- In some regex flavors, it may be necessary to write it in the somewhat less lucid form (?=a()|())(?:\1b|\2c).
- Another, potentially more verbose and less efficient way to mimic a lookaround-based conditional is to alternate two opposite lookarounds, e.g., (?=if_assertion)then|(?!if_assertion)else. That will work even in the case of flavors like JavaScript where backreferences to non-participating groups match the empty string.
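Both approaches from the notes can be sketched in Python. As far as I know, Python's re module has no real lookaround-based conditional, so this compares the two mimics against each other only. The condition (next character is a digit), the "then" part (exactly three digits), and the "else" part (one or more word characters) are made up for the example, and the sketch uses the less lucid form with the lookahead in front of the group.

```python
import re

# Mimic with empty capturing groups: group 1 is set when the assertion holds,
# group 2 when it does not.
WITH_GROUPS = re.compile(r"(?=\d()|())(?:\1\d{3}|\2\w+)")

# Mimic with two opposite lookarounds.
WITH_OPPOSITES = re.compile(r"(?=\d)\d{3}|(?!\d)\w+")

# Plain alternation, shown only for contrast: it is NOT a conditional.
PLAIN_ALTERNATION = re.compile(r"\d{3}|\w+")

SAMPLES = ["123", "abc", "12", "1234"]


def accepted(pattern, samples=SAMPLES):
    """List the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if pattern.fullmatch(s)]


print(accepted(WITH_GROUPS))
print(accepted(WITH_OPPOSITES))
print(accepted(PLAIN_ALTERNATION))
```

Both mimics accept only "123" and "abc". The plain alternation also accepts "12" and "1234", because nothing stops it from falling through to the "else" part when the condition holds but the "then" part fails.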