parseUri 2.0: A mighty but tiny URI parser

I created parseUri v1 17 years ago, but never hosted it on GitHub or npm because it predates both. Nevertheless, it’s been used very widely ever since, thanks to its tiny size and the fact that it predates JavaScript’s built-in URL constructor. After this short gap, I just released v2: github.com/slevithan/parseuri. It’s still tiny (nothing similar comes close, even among libraries that support far fewer URI parts, types, and edge cases), and it offers several advantages over URL:

  • parseUri gives you many additional properties (authority, userinfo, subdomain, domain, tld, resource, directory, filename, suffix) that are not available from URL.
  • URL throws if, for example, it isn’t given a protocol, and in many other cases of valid (but unsupported) and invalid URIs. parseUri makes a best-effort attempt even with partial or invalid URIs and is extremely good with edge cases.
  • URL’s rules don’t allow correctly handling many non-web protocols. For example, URL doesn’t throw on any of 'git://localhost:1234', 'ssh://myid@192.168.1.101', or 't2ab:///path/entry', but it also doesn’t get their details correct since it treats everything after : up to ? or # as part of the pathname.
  • parseUri includes a “friendly” parsing mode (in addition to its default mode) that handles human-friendly URLs like 'example.com/index.html' as expected.
  • parseUri includes partial/extensible support for second-level domains like in '//example.co.uk'.
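
To see the point about missing protocols in action, here is a minimal sketch that uses only the built-in URL constructor. The tryUrl helper is a hypothetical name introduced for this illustration; it is not part of parseUri or any library.

```javascript
// Hypothetical helper: returns the parsed href, or null if URL rejects the input.
function tryUrl(input) {
    try {
        return new URL(input).href;
    } catch (err) {
        return null; // URL throws a TypeError for input it can't parse
    }
}

const withProtocol = tryUrl('https://example.com/index.html');
const withoutProtocol = tryUrl('example.com/index.html');
const protocolRelative = tryUrl('//example.co.uk');

console.log(withProtocol);     // "https://example.com/index.html"
console.log(withoutProtocol);  // null
console.log(protocolRelative); // null
```

Without a base URL, the last two inputs are rejected outright, which is exactly where a best-effort parser earns its keep.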

Conversely, parseUri is single-purpose and doesn’t do normalization. But of course you can pass URIs through a normalizer separately, if you need that. Or, if you want to create an exceptionally lightweight URI normalizer, parseUri would be a great base to build on. 😊

So although it’s needed less often these days because of the built-in URL, if URL is ever not enough for your needs, this is an extremely accurate, flexible, and lightweight option.

Check it out!

XRegExp 3.0.0!

After 3+ years, XRegExp 3.0.0 has been released. Standout features are dramatically better performance (many common operations are 2x to 50x faster) and support for full 21-bit Unicode (thanks to Mathias Bynens). I’ve also just finished updating all the documentation on xregexp.com so go check that out. 🙂

If you haven’t used XRegExp before, it’s an MIT licensed JavaScript library that provides augmented (and extensible!) regular expressions. You get new modern syntax and flags beyond what browsers support natively. XRegExp is also a regex utility belt with tools to make your client-side grepping and parsing easier, while freeing you from worrying about pesky cross-browser inconsistencies and things like manually manipulating lastIndex or slicing strings when tokenizing.
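
To make that last point concrete, here is roughly what the manual bookkeeping looks like with native regexes alone. This is a sketch, not XRegExp code: it assumes an engine that supports the sticky (y) flag, and the tokenTypes table and tokenize function are hypothetical names made up for the example.

```javascript
// Hypothetical token table for illustration only. Each regex is sticky (y),
// so it can match only at exactly lastIndex.
const tokenTypes = [
    ['number', /\d+/y],
    ['ident', /[A-Za-z_]\w*/y],
    ['space', /\s+/y],
    ['punct', /[^\w\s]/y]
];

function tokenize(input) {
    const tokens = [];
    let pos = 0;
    while (pos < input.length) {
        let matched = false;
        for (const [type, regex] of tokenTypes) {
            regex.lastIndex = pos; // must be reset by hand before every attempt
            const match = regex.exec(input);
            if (match) {
                if (type !== 'space') {
                    tokens.push({type: type, value: match[0]});
                }
                pos = regex.lastIndex;
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new SyntaxError('Unexpected character at ' + pos);
        }
    }
    return tokens;
}

const tokens = tokenize('x1 = 42;');
console.log(tokens.map(function (t) { return t.value; })); // ["x1", "=", "42", ";"]
```

Every attempt has to set lastIndex by hand and read it back afterward, which is the kind of housekeeping a utility layer can hide.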

Version 3.0.0 has lots of additional features, options, fine tuning, cross-browser fixes, some new simplified syntax, and thousands of new tests. And it still supports all the browsers. Check out the long list of changes. There are a few minor breaking changes that shouldn’t affect most people and have easy workarounds. I’ve listed them all below, but see the full changelog if you need more details about them.

  • XRegExp.forEach no longer accepts or returns its context. Use binding with the provided callback instead.
  • Moved character data for Unicode category L (Letter) from Unicode Base to Unicode Categories. This has no effect if you’re already using Unicode Categories or XRegExp-All.
  • Using the same name for multiple named capturing groups in a single regex is now a SyntaxError.
  • Removed the 'all' shortcut used by XRegExp.install/uninstall.
  • Removed the Prototypes addon, which added methods apply, call, forEach, globalize, xexec, and xtest to XRegExp.prototype. These were all just aliases of methods on the XRegExp object.
  • A few changes affect custom addons only: changed the format for providing custom Unicode data, replaced XRegExp.addToken’s trigger and customFlags options with new flag and optionalFlags options, and removed the this.hasFlag function previously available within token definition functions.

You can download the new release on GitHub or install via npm. I’d love to hear feedback and common regex-related use cases that you think could be simplified via new XRegExp features. Let me know here or in GitHub issues. Thanks!

JavaScript Regex Lookbehind Redux

Five years ago I posted Mimicking Lookbehind in JavaScript on this blog, wherein I detailed several ways to emulate positive and negative lookbehind in JavaScript. My approaches back then were all fairly rough, and it was complicated to properly customize any of them to work with a given pattern. Plus, they were only designed to simulate lookbehind in a regex-based replacement.

To make it much easier to use lookbehind, I recently posted a collection of short functions on GitHub. They use XRegExp v2, so you should check that out, too.

Here's the code:

// Simulating infinite-length leading lookbehind in JavaScript. Uses XRegExp.
// Captures within lookbehind are not included in match results. Lazy
// repetition in lookbehind may lead to unexpected results.

(function (XRegExp) {

    function prepareLb(lb) {
        // Allow mode modifier before lookbehind
        var parts = /^((?:\(\?[\w$]+\))?)\(\?<([=!])([\s\S]*)\)$/.exec(lb);
        return {
            // $(?!\s) allows use of (?m) in lookbehind
            lb: XRegExp(parts ? parts[1] + "(?:" + parts[3] + ")$(?!\\s)" : lb),
            // Positive or negative lookbehind. Use positive if no lookbehind group
            type: parts ? parts[2] === "=" : !parts
        };
    }

    XRegExp.execLb = function (str, lb, regex) {
        var pos = 0, match, leftContext;
        lb = prepareLb(lb);
        while (match = XRegExp.exec(str, regex, pos)) {
            leftContext = str.slice(0, match.index);
            if (lb.type === lb.lb.test(leftContext)) {
                return match;
            }
            pos = match.index + 1;
        }
        return null;
    };

    XRegExp.testLb = function (str, lb, regex) {
        return !!XRegExp.execLb(str, lb, regex);
    };

    XRegExp.searchLb = function (str, lb, regex) {
        var match = XRegExp.execLb(str, lb, regex);
        return match ? match.index : -1;
    };

    XRegExp.matchAllLb = function (str, lb, regex) {
        var matches = [], pos = 0, match, leftContext;
        lb = prepareLb(lb);
        while (match = XRegExp.exec(str, regex, pos)) {
            leftContext = str.slice(0, match.index);
            if (lb.type === lb.lb.test(leftContext)) {
                matches.push(match[0]);
                pos = match.index + (match[0].length || 1);
            } else {
                pos = match.index + 1;
            }
        }
        return matches;
    };

    XRegExp.replaceLb = function (str, lb, regex, replacement) {
        var output = "", pos = 0, lastEnd = 0, match, leftContext;
        lb = prepareLb(lb);
        while (match = XRegExp.exec(str, regex, pos)) {
            leftContext = str.slice(0, match.index);
            if (lb.type === lb.lb.test(leftContext)) {
                // Doesn't work correctly if lookahead in regex looks outside of the match
                output += str.slice(lastEnd, match.index) + XRegExp.replace(match[0], regex, replacement);
                lastEnd = match.index + match[0].length;
                if (!regex.global) {
                    break;
                }
                pos = match.index + (match[0].length || 1);
            } else {
                pos = match.index + 1;
            }
        }
        return output + str.slice(lastEnd);
    };

}(XRegExp));

That's less than 0.5 KB after minification and gzipping. It provides a collection of functions that make it simple to emulate leading lookbehind:

  • XRegExp.execLb
  • XRegExp.testLb
  • XRegExp.searchLb
  • XRegExp.matchAllLb
  • XRegExp.replaceLb

Each of these functions takes three arguments: the string to search, the lookbehind pattern as a string (can use XRegExp syntax extensions), and the main regex. XRegExp.replaceLb takes a fourth argument for the replacement value, which can be a string or function.

Usage examples follow:

XRegExp.execLb("Fluffy cat", "(?i)(?<=fluffy\\W+)", XRegExp("(?i)(?<first>c)at"));
// -> ["cat", "c"]
// Result has named backref: result.first -> "c"

XRegExp.execLb("Fluffy cat", "(?i)(?<!fluffy\\W+)", /cat/i);
// -> null

XRegExp.testLb("Fluffy cat", "(?i)(?<=fluffy\\W+)", /cat/i);
// -> true

XRegExp.testLb("Fluffy cat", "(?i)(?<!fluffy\\W+)", /cat/i);
// -> false

XRegExp.searchLb("Catwoman's fluffy cat", "(?i)(?<=fluffy\\W+)", /cat/i);
// -> 18

XRegExp.searchLb("Catwoman's fluffy cat", "(?i)(?<!fluffy\\W+)", /cat/i);
// -> 0

XRegExp.matchAllLb("Catwoman's cats are fluffy cats", "(?i)(?<=fluffy\\W+)", /cat\w*/i);
// -> ["cats"]

XRegExp.matchAllLb("Catwoman's cats are fluffy cats", "(?i)(?<!fluffy\\W+)", /cat\w*/i);
// -> ["Catwoman", "cats"]

XRegExp.replaceLb("Catwoman's fluffy cat is a cat", "(?i)(?<=fluffy\\W+)", /cat/ig, "dog");
// -> "Catwoman's fluffy dog is a cat"

XRegExp.replaceLb("Catwoman's fluffy cat is a cat", "(?i)(?<!fluffy\\W+)", /cat/ig, "dog");
// -> "dogwoman's fluffy cat is a dog"

XRegExp.replaceLb("Catwoman's fluffy cat is a cat", "(?i)(?<!fluffy\\W+)", /cat/ig, function ($0) {
    var first = $0.charAt(0);
    return first === first.toUpperCase() ? "Dog" : "dog";
});
// -> "Dogwoman's fluffy cat is a dog"
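
The loop shared by these functions (find a candidate match, then test the text to its left against the lookbehind pattern anchored at the end) can also be sketched with native regexes alone. The execWithLeftContext helper below is a hypothetical, simplified stand-in written for this illustration. It is not part of XRegExp, it takes the lookbehind as a plain regex plus a positive/negative boolean rather than a (?<=…) string, and it assumes an engine that exposes the flags property on regexes.

```javascript
// Hypothetical native-only helper, not part of XRegExp.
// lbRegex: what must (isPositive = true) or must not (false) precede the match.
function execWithLeftContext(str, lbRegex, isPositive, regex) {
    // Anchor the lookbehind pattern to the very end of the left context
    var lbFlags = lbRegex.flags.replace(/[gy]/g, "");
    var lb = new RegExp("(?:" + lbRegex.source + ")(?![\\s\\S])", lbFlags);
    // Work on a global copy so lastIndex sets where each search starts
    var re = new RegExp(regex.source, regex.flags.replace(/[gy]/g, "") + "g");
    var pos = 0, match;
    while (pos <= str.length) {
        re.lastIndex = pos;
        match = re.exec(str);
        if (!match) {
            return null;
        }
        if (lb.test(str.slice(0, match.index)) === isPositive) {
            return match;
        }
        pos = match.index + 1; // Rejected, so retry from the next position
    }
    return null;
}

var text = "Catwoman's fluffy cat";
var afterFluffy = execWithLeftContext(text, /fluffy\W+/i, true, /cat/i);
var notAfterFluffy = execWithLeftContext(text, /fluffy\W+/i, false, /cat/i);
console.log(afterFluffy.index);    // 18
console.log(notAfterFluffy.index); // 0
```

In engines that implement ES2018 lookbehind, /(?<=fluffy\W+)cat/i finds the same first match natively, which makes for a handy cross-check.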

Easy peasy lemon squeezy. 🙂

Creating Grammatical Regexes Using XRegExp.build

Recently, I've added three new addons for XRegExp v2.0 (currently in release candidate stage on GitHub):

  • XRegExp.build — Lets you build regexes using named subpatterns. Inspired by Lea Verou's RegExp.create.
  • XRegExp Prototype Methods — Adds a collection of methods to be inherited by XRegExp regexes: apply, call, forEach, globalize, xexec, and xtest. These also work for native RegExps copied by XRegExp.
  • XRegExp Unicode Properties — Includes the remaining nine properties (beyond what's already available in other XRegExp addons) required for Level-1 Unicode support: Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, and Assigned.

Jumping right into some code, the following demonstrates how the new XRegExp.build addon can be used to create a grammatical pattern for matching real numbers:

// Approach 1: Make all of the subpatterns reusable

var lib = {
    digit:             /[0-9]/,
    exponentIndicator: /[Ee]/,
    digitSeparator:    /[_,]/,
    sign:              /[+-]/,
    point:             /[.]/
};
lib.preexponent = XRegExp.build('(?xn)\
    {{sign}} ?              \
    (?= {{digit}}           \
      | {{point}}           \
    )                       \
    ( {{digit}} {1,3}       \
      ( {{digitSeparator}} ?\
        {{digit}} {3}       \
      ) *                   \
    ) ?                     \
    ( {{point}}             \
      {{digit}} +           \
    ) ?                     ',
    lib
);
lib.exponent = XRegExp.build('(?x)\
    {{exponentIndicator}}\
    {{sign}} ?           \
    {{digit}} +          ',
    lib
);
lib.real = XRegExp.build('(?x)\
    ^              \
    {{preexponent}}\
    {{exponent}} ? \
    $              ',
    lib
);

// Approach 2: No need to reuse the subpatterns. {{sign}} and {{digit}} are
// defined twice, but that can be avoided by defining them before constructing
// the main pattern (see Approach 1).

var real = XRegExp.build('(?x)\
    ^              \
    {{preexponent}}\
    {{exponent}} ? \
    $              ',
    {
        preexponent: XRegExp.build('(?xn)\
            {{sign}} ?              \
            (?= {{digit}}           \
              | {{point}}           \
            )                       \
            ( {{digit}} {1,3}       \
              ( {{digitSeparator}} ?\
                {{digit}} {3}       \
              ) *                   \
            ) ?                     \
            ( {{point}}             \
              {{digit}} +           \
            ) ?                     ',
            {
                sign:           /[+-]/,
                digit:          /[0-9]/,
                digitSeparator: /[_,]/,
                point:          /[.]/
            }
        ),
        exponent: XRegExp.build('(?x)\
            {{exponentIndicator}}\
            {{sign}} ?           \
            {{digit}} +          ',
            {
                sign:              /[+-]/,
                digit:             /[0-9]/,
                exponentIndicator: /[Ee]/
            }
        )
    }
);

The real and lib.real regexes created by the above code are identical. Here are a few examples of strings they match:

  • -1
  • 1,000
  • 10_000_000
  • 1,111.1111
  • 01.0
  • .1
  • 1e2
  • +1.1e-2

And here are a few examples of strings they don't match:

  • ,100
  • 10,00
  • 1,0000
  • 1.
  • 1.1,111
  • 1k

Grammatical patterns like this are easier to read, write, and maintain, and look more like a BNF than the typical line-noisy regular expressions that some people have come to hate.

Note that the {{…}} syntax shown here works only for regexes created by XRegExp.build. A few other details:

  • Named subpatterns can be provided as strings or regex objects (strings are passed to the XRegExp constructor).
  • The provided patterns are automatically wrapped in (?:…) so they can be quantified as a unit and don't interfere with the surrounding pattern in unexpected ways.
  • A leading ^ and trailing unescaped $ are stripped from subpatterns if both are present, which allows embedding independently useful anchored patterns.
  • Flags can be provided via XRegExp.build's optional third (flags) argument. Native flags used by provided subpatterns are ignored in favor of the flags argument.
  • Backreferences in the outer pattern and provided subpatterns are automatically renumbered to work correctly within the larger combined pattern.
  • The syntax ({{name}}) works as shorthand for named capture via (?<name>{{name}}).
  • The {{…}} syntax can be escaped with a backslash.
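
To illustrate just the wrapping and anchor-stripping rules described above, here is a native-only sketch. The interpolate function is a hypothetical, heavily simplified stand-in and not how XRegExp.build is implemented: it doesn't renumber backreferences, support the ({{name}}) shorthand, honor backslash-escaped {{…}}, or understand XRegExp syntax such as (?x), so the pattern is written compactly.

```javascript
// Hypothetical, simplified stand-in for the interpolation step only
function interpolate(pattern, subs, flags) {
    var source = pattern.replace(/\{\{(\w+)\}\}/g, function (whole, name) {
        if (!Object.prototype.hasOwnProperty.call(subs, name)) {
            // This sketch's own choice for unknown names
            throw new ReferenceError("Undefined subpattern: " + name);
        }
        var sub = subs[name];
        var src = sub instanceof RegExp ? sub.source : String(sub);
        // Strip a leading ^ and trailing unescaped $ only if both are present
        if (/^\^(?:[^\\]|\\[\s\S])*\$$/.test(src)) {
            src = src.slice(1, -1);
        }
        // Wrap so the subpattern can be quantified as a unit
        return "(?:" + src + ")";
    });
    // Flags on the subpatterns are dropped; only the flags argument applies
    return new RegExp(source, flags);
}

var lib = {
    digit:             /[0-9]/,
    exponentIndicator: /[Ee]/,
    digitSeparator:    /[_,]/,
    sign:              /[+-]/,
    point:             /[.]/
};
lib.preexponent = interpolate(
    "{{sign}}?(?={{digit}}|{{point}})" +
    "(?:{{digit}}{1,3}(?:{{digitSeparator}}?{{digit}}{3})*)?" +
    "(?:{{point}}{{digit}}+)?",
    lib
);
lib.exponent = interpolate("{{exponentIndicator}}{{sign}}?{{digit}}+", lib);
lib.real = interpolate("^{{preexponent}}{{exponent}}?$", lib);

console.log(lib.real.test("1,111.1111")); // true
console.log(lib.real.test("10,00"));      // false
```

Because every subpattern is wrapped in (?:…), a quantifier such as {{digit}}{3} applies to the whole subpattern rather than to its last token.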

Play around with the above details a bit, and I think you'll find that XRegExp.build works intuitively and handles any edge cases you throw at it.

Feel free to share how you might alter the above regexes. And make sure to check out the fancy new XRegExp v2.0 and its upgraded addons on GitHub.

XRegExp Updates

A few days ago, I posted a long-overdue XRegExp bug fix release (version 1.5.1). This was mainly to address an IE issue that a number of people have written to me and blogged about. Specifically, RegExp.prototype.exec no longer throws an error in IE when it is simultaneously provided a nonstring argument and called on a regex with a capturing group that matches an empty string. That's an edge case of an edge case, but it was causing XRegExp to conflict with jQuery 1.7.1 (oops). You can see the full list of changes in the changelog.
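
For reference, here is what that edge case looks like in a standards-compliant engine, where exec converts its argument to a string before matching. This sketch shows only the triggering conditions using a native regex; the specific pattern and input are made up for illustration.

```javascript
// A nonstring argument, plus a capturing group that matches an empty string
var match = /(\d)(x?)/.exec(123);

console.log(match[0]);    // "1"
console.log(match[2]);    // "" (the group participated but matched nothing)
console.log(match.index); // 0
```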

But wait, there's more… XRegExp's Unicode plugin has been updated to support Unicode 6.1 (released January 2012), rather than Unicode 5.2. I've also added a new test suite with 265 tests so far, and more on the way.

More substantial changes to XRegExp are planned and coming soon. Follow the brand new XRegExp repository on GitHub to keep up to date or to fork it and help shape the future of this one-of-a-kind JavaScript library. 🙂