Flagrant Badassery

A JavaScript and regular expression centric blog

Creating Grammatical Regexes Using XRegExp.build

Recently, I've added three new addons for XRegExp v2.0 (currently in release candidate stage on GitHub):

  • XRegExp.build — Lets you build regexes using named subpatterns. Inspired by Lea Verou's RegExp.create.
  • XRegExp Prototype Methods — Adds a collection of methods to be inherited by XRegExp regexes: apply, call, forEach, globalize, xexec, and xtest. These also work for native RegExps copied by XRegExp.
  • XRegExp Unicode Properties — Includes the remaining nine properties (beyond what's already available in other XRegExp addons) required for Level-1 Unicode support: Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, and Assigned.

Jumping right into some code, the following demonstrates how the new XRegExp.build addon can be used to create a grammatical pattern for matching real numbers:

// Approach 1: Make all of the subpatterns reusable

var lib = {
    digit:             /[0-9]/,
    exponentIndicator: /[Ee]/,
    digitSeparator:    /[_,]/,
    sign:              /[+-]/,
    point:             /[.]/
};
lib.preexponent = XRegExp.build('(?xn)\
    {{sign}} ?              \
    (?= {{digit}}           \
      | {{point}}           \
    )                       \
    ( {{digit}} {1,3}       \
      ( {{digitSeparator}} ?\
        {{digit}} {3}       \
      ) *                   \
    ) ?                     \
    ( {{point}}             \
      {{digit}} +           \
    ) ?                     ',
    lib
);
lib.exponent = XRegExp.build('(?x)\
    {{exponentIndicator}}\
    {{sign}} ?           \
    {{digit}} +          ',
    lib
);
lib.real = XRegExp.build('(?x)\
    ^              \
    {{preexponent}}\
    {{exponent}} ? \
    $              ',
    lib
);

// Approach 2: No need to reuse the subpatterns. {{sign}} and {{digit}} are
// defined twice, but that can be avoided by defining them before constructing
// the main pattern (see Approach 1).

var real = XRegExp.build('(?x)\
    ^              \
    {{preexponent}}\
    {{exponent}} ? \
    $              ',
    {
        preexponent: XRegExp.build('(?xn)\
            {{sign}} ?              \
            (?= {{digit}}           \
              | {{point}}           \
            )                       \
            ( {{digit}} {1,3}       \
              ( {{digitSeparator}} ?\
                {{digit}} {3}       \
              ) *                   \
            ) ?                     \
            ( {{point}}             \
              {{digit}} +           \
            ) ?                     ',
            {
                sign:           /[+-]/,
                digit:          /[0-9]/,
                digitSeparator: /[_,]/,
                point:          /[.]/
            }
        ),
        exponent: XRegExp.build('(?x)\
            {{exponentIndicator}}\
            {{sign}} ?           \
            {{digit}} +          ',
            {
                sign:              /[+-]/,
                digit:             /[0-9]/,
                exponentIndicator: /[Ee]/
            }
        )
    }
);

The real and lib.real regexes created by the above code are identical. Here are a few examples of strings they match:

  • -1
  • 1,000
  • 10_000_000
  • 1,111.1111
  • 01.0
  • .1
  • 1e2
  • +1.1e-2

And here are a few examples of strings they don't match:

  • ,100
  • 10,00
  • 1,0000
  • 1.
  • 1.1,111
  • 1k

Grammatical patterns like this are easier to read, write, and maintain, and look more like a BNF than the typical line-noisy regular expressions that some people have come to hate.

Note that the {{…}} syntax shown here works only for regexes created by XRegExp.build. Named subpatterns can be provided as strings or regex objects (strings are passed to the XRegExp constructor). The provided patterns are automatically wrapped in (?:…) so they can be quantified as a unit and don't interfere with the surrounding pattern in unexpected ways. A leading ^ and trailing unescaped $ are stripped from subpatterns if both are present, which allows embedding independently useful anchored patterns. Flags can be provided via XRegExp.build's optional third (flags) argument. Native flags used by provided subpatterns are ignored in favor of the flags argument. Backreferences in the outer pattern and provided subpatterns are automatically renumbered to work correctly within the larger combined pattern. The syntax ({{name}}) works as shorthand for named capture via (?<name>{{name}}). The {{…}} syntax can be escaped with a backslash.

Play around with the above details a bit, and I think you'll find that XRegExp.build works intuitively and handles any edge cases you throw at it.

Feel free to share how you might alter the above regexes. And make sure to check out the fancy new XRegExp v2.0 and its upgraded addons at GitHub.

Post a Response

If you are about to post code, please escape your HTML entities (&amp;, &gt;, &lt;).