Win a Free Copy of Regex Cookbook 2nd Edition

Update: This contest is now finished. See the list of winners.

I'm excited to announce the release of Regular Expressions Cookbook 2nd Edition, which I wrote together with regex superguru Jan Goyvaerts. It has actually been available as an ebook for a couple of weeks on oreilly.com, but as of now, it is also in stock on amazon.com.

To promote this release, O'Reilly Media is giving away free copies to 15 people who comment on this post on or before September 7th! To get a free copy, you must read the details at the end of this post. But first, some FAQs about the second edition:

Wait…this is a cookbook?

The book tackles hundreds of real-world regular expression tasks in a problem, solution, discussion format. There's also a detailed regular expression tutorial included, in the same format. Check out Jeff Atwood's and Ben Nadel's reviews of the first edition, which have more details about this.

Update: Rob Friesel Jr. posted an awesome, detailed review of the second edition.

The first edition was a bestseller, and is now available in eight languages. It briefly held Amazon's #1 spot for computer books upon its release in mid-2009, and the ebook version was O'Reilly's top seller of 2010.

What has changed with this new edition?

The second edition adds more content and updates existing chapters. There are numerous improvements. The most noticeable is a new chapter written by Jan, titled Source Code and Log Files, along with various new recipes spread across four of the other chapters.

There are 101 new pages in the second edition, and that's after shortening and removing some content from the first edition. There were 125 recipes in the first edition, upped to 146 in the second. Note that many of the book's recipes provide solutions and in-depth discussions for more than one problem. Tons of changes, ranging from minor copyedits and errata corrections to major revisions and the addition of significant new content, were made throughout the existing content. Everything was brought up to date with the latest standards, tools, and programming language versions. In particular, updates to Java and Perl since the first edition brought very significant regular expression changes. Plus, we've covered some advanced regular expression features that already existed the last time around, but didn't make it into the first edition.

The first edition was already groundbreaking for the depth of its explanations and its equal coverage of all regexes in eight programming languages (C#, Java, JavaScript, Perl, PHP, Python, Ruby, and VB.NET). The second edition significantly improves upon this by providing even more details about the many peculiarities and differences of the APIs, syntax, and behavior of these regex flavors and programming languages. The second edition also adds coverage of XRegExp when it provides a better solution than native JavaScript. IMO, the second edition is easily the most comprehensive source of information about the use of modern regexes across multiple programming languages—far more detailed than anything else in print or online.

What will interest long-term fans the most?

Even though there are lots of important changes throughout the book, the new recipes and the updated coverage for the latest programming languages are probably the main reasons for owners of the first edition to upgrade. The fully new recipes cover creating a regex-based parser, validating password complexity, adding thousands separators to numbers, matching various kinds of numbers, decoding XML entities, and everything in the new Source Code and Log Files chapter. The coverage of XRegExp is also completely new in the second edition.

What will cause readers to trip over themselves in their haste to buy a copy?

If you read through the book, you'll learn about a lot of things, much more than just regular expressions. You'll almost certainly learn something new about Unicode, phone numbers, and XML, as just a few examples. You'll learn that the eighth floor of the Saks Fifth Avenue store in New York City has its own ZIP code, which also happens to be the only ZIP code that includes letters. You'll learn that the Chicago Manual of Style and Merriam-Webster's Biographical Dictionary disagree on the correct alphabetical listing of the name Charles de Gaulle (my girlfriend and I are in opposing camps). Jan and I put a ton of research into the book, and we pay attention to details. I think that shines through.

Oh yeah, and along the way, you'll also become a Master Chef of regular expressions, able to slice and dice text with the best of them. But not everyone will want to actually read through the book. Some readers will prefer to take advantage of the cookbook format and read only the parts that solve their immediate problems. That's fine, too.

Many developers complain that they're continually relearning regular expressions, going back to the reference documents every time they need to write a new regex. The problem/solution approach of Regular Expressions Cookbook means you learn by doing, and we think that helps the details stick with you more securely than with the other books and websites out there.

Many regex novices turn to Google to get prewritten regexes that solve their problems. Unfortunately, if you're not already fluent in regex, you won't realize that 90% of the regexes out there have some kind of problem, be it returning false positives or negatives, performing inefficiently (or maybe even crashing your server when fed malicious data), being more complicated than necessary, not being portable, or what have you. When you use the regexes in Regular Expressions Cookbook, not only do you get detailed coverage of all the related issues (which helps you customize the solution, if necessary), you also get the peace of mind that you're using proven solutions by real subject-matter experts.

So how do I enter to win a free copy?

Simply comment on this blog post on or before 11:59 PM EDT on September 7th, and you'll be in the running. I wish I could leave the contest open a bit longer, but I'll be moving to California to work for a little Internet startup. You'll need to use your actual first and last name and email address with your comment. Names are published, but email addresses are not. Each person commenting has only one chance to win, regardless of how many comments they post. If you don't know what to write in your comment, just mention whether you'd prefer a print or ebook copy.

Shortly after this contest ends, I'll randomly choose 15 winners and contact them by email. If you prefer a printed copy, I'll be asking for your address and, if you're outside of the U.S., your phone number. O'Reilly will pay for shipping to anywhere in the world. Good luck!

Coauthor Jan Goyvaerts has written up his own summary of the changes in What's New in The Second Edition of Regular Expressions Cookbook.


Follow me on Twitter @slevithan or on GitHub at slevithan.

JavaScript Regex Lookbehind Redux

Five years ago I posted Mimicking Lookbehind in JavaScript on this blog, wherein I detailed several ways to emulate positive and negative lookbehind in JavaScript. My approaches back then were all fairly rough, and it was complicated to properly customize any of them to work with a given pattern. Plus, they were only designed to simulate lookbehind in a regex-based replacement.

To make it much easier to use lookbehind, I recently posted a collection of short functions on GitHub. They use XRegExp v2, so you should check that out, too.

Here's the code:

// Simulating infinite-length leading lookbehind in JavaScript. Uses XRegExp.
// Captures within lookbehind are not included in match results. Lazy
// repetition in lookbehind may lead to unexpected results.

(function (XRegExp) {

    function prepareLb(lb) {
        // Allow mode modifier before lookbehind
        var parts = /^((?:\(\?[\w$]+\))?)\(\?<([=!])([\s\S]*)\)$/.exec(lb);
        return {
            // $(?!\s) allows use of (?m) in lookbehind
            lb: XRegExp(parts ? parts[1] + "(?:" + parts[3] + ")$(?!\\s)" : lb),
            // Positive or negative lookbehind. Use positive if no lookbehind group
            type: parts ? parts[2] === "=" : !parts
        };
    }

    XRegExp.execLb = function (str, lb, regex) {
        var pos = 0, match, leftContext;
        lb = prepareLb(lb);
        while (match = XRegExp.exec(str, regex, pos)) {
            leftContext = str.slice(0, match.index);
            if (lb.type === lb.lb.test(leftContext)) {
                return match;
            }
            pos = match.index + 1;
        }
        return null;
    };

    XRegExp.testLb = function (str, lb, regex) {
        return !!XRegExp.execLb(str, lb, regex);
    };

    XRegExp.searchLb = function (str, lb, regex) {
        var match = XRegExp.execLb(str, lb, regex);
        return match ? match.index : -1;
    };

    XRegExp.matchAllLb = function (str, lb, regex) {
        var matches = [], pos = 0, match, leftContext;
        lb = prepareLb(lb);
        while (match = XRegExp.exec(str, regex, pos)) {
            leftContext = str.slice(0, match.index);
            if (lb.type === lb.lb.test(leftContext)) {
                matches.push(match[0]);
                pos = match.index + (match[0].length || 1);
            } else {
                pos = match.index + 1;
            }
        }
        return matches;
    };

    XRegExp.replaceLb = function (str, lb, regex, replacement) {
        var output = "", pos = 0, lastEnd = 0, match, leftContext;
        lb = prepareLb(lb);
        while (match = XRegExp.exec(str, regex, pos)) {
            leftContext = str.slice(0, match.index);
            if (lb.type === lb.lb.test(leftContext)) {
                // Doesn't work correctly if lookahead in regex looks outside of the match
                output += str.slice(lastEnd, match.index) + XRegExp.replace(match[0], regex, replacement);
                lastEnd = match.index + match[0].length;
                if (!regex.global) {
                    break;
                }
                pos = match.index + (match[0].length || 1);
            } else {
                pos = match.index + 1;
            }
        }
        return output + str.slice(lastEnd);
    };

}(XRegExp));

That's less than 0.5 KB after minification and gzipping. It provides a collection of functions that make it simple to emulate leading lookbehind:

  • XRegExp.execLb
  • XRegExp.testLb
  • XRegExp.searchLb
  • XRegExp.matchAllLb
  • XRegExp.replaceLb

Each of these functions takes three arguments: the string to search, the lookbehind pattern as a string (can use XRegExp syntax extensions), and the main regex. XRegExp.replaceLb takes a fourth argument for the replacement value, which can be a string or function.

Usage examples follow:

XRegExp.execLb("Fluffy cat", "(?i)(?<=fluffy\\W+)", XRegExp("(?i)(?<first>c)at"));
// -> ["cat", "c"]
// Result has named backref: result.first -> "c"

XRegExp.execLb("Fluffy cat", "(?i)(?<!fluffy\\W+)", /cat/i);
// -> null

XRegExp.testLb("Fluffy cat", "(?i)(?<=fluffy\\W+)", /cat/i);
// -> true

XRegExp.testLb("Fluffy cat", "(?i)(?<!fluffy\\W+)", /cat/i);
// -> false

XRegExp.searchLb("Catwoman's fluffy cat", "(?i)(?<=fluffy\\W+)", /cat/i);
// -> 18

XRegExp.searchLb("Catwoman's fluffy cat", "(?i)(?<!fluffy\\W+)", /cat/i);
// -> 0

XRegExp.matchAllLb("Catwoman's cats are fluffy cats", "(?i)(?<=fluffy\\W+)", /cat\w*/i);
// -> ["cats"]

XRegExp.matchAllLb("Catwoman's cats are fluffy cats", "(?i)(?<!fluffy\\W+)", /cat\w*/i);
// -> ["Catwoman", "cats"]

XRegExp.replaceLb("Catwoman's fluffy cat is a cat", "(?i)(?<=fluffy\\W+)", /cat/ig, "dog");
// -> "Catwoman's fluffy dog is a cat"

XRegExp.replaceLb("Catwoman's fluffy cat is a cat", "(?i)(?<!fluffy\\W+)", /cat/ig, "dog");
// -> "dogwoman's fluffy cat is a dog"

XRegExp.replaceLb("Catwoman's fluffy cat is a cat", "(?i)(?<!fluffy\\W+)", /cat/ig, function ($0) {
    var first = $0.charAt(0);
    return first === first.toUpperCase() ? "Dog" : "dog";
});
// -> "Dogwoman's fluffy cat is a dog"

Easy peasy lemon squeezy. 🙂
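
If you don't have XRegExp at hand, the core idea behind these functions can be sketched with native regexes alone: find a candidate match, then test the text to its left against the lookbehind pattern anchored at the end. The helper name execLbSketch and its signature are hypothetical (not part of XRegExp). It takes the lookbehind body and a positive/negative boolean instead of parsing (?<=…) syntax, and it assumes the lookbehind should share the main regex's case sensitivity.

```javascript
// Hypothetical helper, not part of XRegExp: emulates a leading lookbehind
// by testing the text to the left of each candidate match.
function execLbSketch(str, lbBody, isPositive, regex) {
    var flags = (regex.ignoreCase ? "i" : "") + (regex.multiline ? "m" : "");
    // Anchor the lookbehind body to the end of the left context.
    // Assumption: the lookbehind borrows case sensitivity from the main regex.
    var lb = new RegExp("(?:" + lbBody + ")$", regex.ignoreCase ? "i" : "");
    // A global copy lets lastIndex set where each search attempt starts
    var search = new RegExp(regex.source, "g" + flags);
    var pos = 0, match;
    while (pos <= str.length) {
        search.lastIndex = pos;
        match = search.exec(str);
        if (!match) {
            return null;
        }
        if (isPositive === lb.test(str.slice(0, match.index))) {
            return match;
        }
        // Rejected by the lookbehind check, so retry one character later
        pos = match.index + 1;
    }
    return null;
}

var hit = execLbSketch("Catwoman's fluffy cat", "fluffy\\W+", true, /cat/i);
var miss = execLbSketch("Fluffy cat", "fluffy\\W+", false, /cat/i);
console.log(hit && hit.index); // 18
console.log(miss); // null
```

This leaves out what the XRegExp versions add, such as mode modifiers before the lookbehind, XRegExp syntax extensions, and named backreferences on the result.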

Creating Grammatical Regexes Using XRegExp.build

Recently, I've added three new addons for XRegExp v2.0 (currently in release candidate stage on GitHub):

  • XRegExp.build — Lets you build regexes using named subpatterns. Inspired by Lea Verou's RegExp.create.
  • XRegExp Prototype Methods — Adds a collection of methods to be inherited by XRegExp regexes: apply, call, forEach, globalize, xexec, and xtest. These also work for native RegExps copied by XRegExp.
  • XRegExp Unicode Properties — Includes the remaining nine properties (beyond what's already available in other XRegExp addons) required for Level-1 Unicode support: Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, and Assigned.

Jumping right into some code, the following demonstrates how the new XRegExp.build addon can be used to create a grammatical pattern for matching real numbers:

// Approach 1: Make all of the subpatterns reusable

var lib = {
    digit:             /[0-9]/,
    exponentIndicator: /[Ee]/,
    digitSeparator:    /[_,]/,
    sign:              /[+-]/,
    point:             /[.]/
};
lib.preexponent = XRegExp.build('(?xn)\
    {{sign}} ?              \
    (?= {{digit}}           \
      | {{point}}           \
    )                       \
    ( {{digit}} {1,3}       \
      ( {{digitSeparator}} ?\
        {{digit}} {3}       \
      ) *                   \
    ) ?                     \
    ( {{point}}             \
      {{digit}} +           \
    ) ?                     ',
    lib
);
lib.exponent = XRegExp.build('(?x)\
    {{exponentIndicator}}\
    {{sign}} ?           \
    {{digit}} +          ',
    lib
);
lib.real = XRegExp.build('(?x)\
    ^              \
    {{preexponent}}\
    {{exponent}} ? \
    $              ',
    lib
);

// Approach 2: No need to reuse the subpatterns. {{sign}} and {{digit}} are
// defined twice, but that can be avoided by defining them before constructing
// the main pattern (see Approach 1).

var real = XRegExp.build('(?x)\
    ^              \
    {{preexponent}}\
    {{exponent}} ? \
    $              ',
    {
        preexponent: XRegExp.build('(?xn)\
            {{sign}} ?              \
            (?= {{digit}}           \
              | {{point}}           \
            )                       \
            ( {{digit}} {1,3}       \
              ( {{digitSeparator}} ?\
                {{digit}} {3}       \
              ) *                   \
            ) ?                     \
            ( {{point}}             \
              {{digit}} +           \
            ) ?                     ',
            {
                sign:           /[+-]/,
                digit:          /[0-9]/,
                digitSeparator: /[_,]/,
                point:          /[.]/
            }
        ),
        exponent: XRegExp.build('(?x)\
            {{exponentIndicator}}\
            {{sign}} ?           \
            {{digit}} +          ',
            {
                sign:              /[+-]/,
                digit:             /[0-9]/,
                exponentIndicator: /[Ee]/
            }
        )
    }
);

The real and lib.real regexes created by the above code are identical. Here are a few examples of strings they match:

  • -1
  • 1,000
  • 10_000_000
  • 1,111.1111
  • 01.0
  • .1
  • 1e2
  • +1.1e-2

And here are a few examples of strings they don't match:

  • ,100
  • 10,00
  • 1,0000
  • 1.
  • 1.1,111
  • 1k
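
To check these lists without XRegExp, the grammar can be expanded by hand into a single native regex. The realNative pattern below is my own manual expansion and is assumed to be equivalent. It is not the literal source that XRegExp.build generates (which wraps each subpattern in its own group).

```javascript
// Hand-expanded native equivalent of the grammar (an assumption, not XRegExp.build output):
// sign? (?= digit | point) (digit{1,3} (separator? digit{3})*)? (point digit+)? exponent?
var realNative = /^[+-]?(?=[0-9]|[.])(?:[0-9]{1,3}(?:[_,]?[0-9]{3})*)?(?:[.][0-9]+)?(?:[Ee][+-]?[0-9]+)?$/;

var shouldMatch = ["-1", "1,000", "10_000_000", "1,111.1111", "01.0", ".1", "1e2", "+1.1e-2"];
var shouldFail = [",100", "10,00", "1,0000", "1.", "1.1,111", "1k"];

var allMatch = shouldMatch.every(function (s) { return realNative.test(s); });
var noneMatch = shouldFail.every(function (s) { return !realNative.test(s); });
console.log(allMatch, noneMatch); // true true
```

Compare the one-line version with the grammatical one above, and the readability argument more or less makes itself.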

Grammatical patterns like this are easier to read, write, and maintain, and look more like a BNF than the typical line-noisy regular expressions that some people have come to hate.

Note that the {{…}} syntax shown here works only for regexes created by XRegExp.build. A few more details:

  • Named subpatterns can be provided as strings or regex objects (strings are passed to the XRegExp constructor).
  • The provided patterns are automatically wrapped in (?:…) so they can be quantified as a unit and don't interfere with the surrounding pattern in unexpected ways.
  • A leading ^ and trailing unescaped $ are stripped from subpatterns if both are present, which allows embedding independently useful anchored patterns.
  • Flags can be provided via XRegExp.build's optional third (flags) argument. Native flags used by provided subpatterns are ignored in favor of the flags argument.
  • Backreferences in the outer pattern and provided subpatterns are automatically renumbered to work correctly within the larger combined pattern.
  • The syntax ({{name}}) works as shorthand for named capture via (?<name>{{name}}).
  • The {{…}} syntax can be escaped with a backslash.
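
To make two of those behaviors concrete (the (?:…) wrapping and the stripping of paired ^/$ anchors), here is a rough sketch in plain JavaScript. The function name buildSketch is hypothetical, and this is not how XRegExp.build is implemented. It ignores XRegExp syntax, backreference renumbering, the ({{name}}) shorthand, and backslash escaping of {{…}}.

```javascript
// Hypothetical mini-version of the {{name}} substitution idea; not XRegExp.build itself.
function buildSketch(pattern, subs, flags) {
    var source = pattern.replace(/\{\{([\w$]+)\}\}/g, function (whole, name) {
        if (!Object.prototype.hasOwnProperty.call(subs, name)) {
            throw new ReferenceError("Undefined subpattern: " + name);
        }
        var sub = subs[name];
        // Regex objects contribute only their source; their own flags are ignored
        var src = sub instanceof RegExp ? sub.source : String(sub);
        // Strip a leading ^ and trailing $ only when both are present
        var anchored = /^\^([\s\S]*?)\$$/.exec(src);
        // ...and only if the $ is unescaped (not preceded by an odd run of backslashes)
        if (anchored && !/(?:^|[^\\])(?:\\\\)*\\$/.test(anchored[1])) {
            src = anchored[1];
        }
        // Wrap so the subpattern can be quantified as a unit
        return "(?:" + src + ")";
    });
    return new RegExp(source, flags || "");
}

var time = buildSketch("^{{hours}}:{{minutes}}$", {
    hours: /^(?:1[0-2]|0?[1-9])$/,  // independently useful anchored pattern
    minutes: /^[0-5][0-9]$/
});
console.log(time.source); // ^(?:(?:1[0-2]|0?[1-9])):(?:[0-5][0-9])$
console.log(time.test("10:59"), time.test("13:00")); // true false
```

Without the anchor stripping, the embedded ^ and $ would end up in the middle of the combined pattern and it could never match.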

Play around with the above details a bit, and I think you'll find that XRegExp.build works intuitively and handles any edge cases you throw at it.

Feel free to share how you might alter the above regexes. And make sure to check out the fancy new XRegExp v2.0 and its upgraded addons at GitHub.

Ideas for Regular Expressions Cookbook Second Edition

I'm happy to report that work is underway by Jan Goyvaerts and me on the second edition of Regular Expressions Cookbook. In the new edition, we'll be fixing all known errata, improving existing content, updating everything to support the latest versions of the book's eight covered programming languages, and sprinkling several new recipes into existing chapters.

We'll also be adding a new chapter, tentatively titled "Source Code and Log Files". This new chapter will aim to assist programmers and system admins with common tasks, using regex-based solutions. The recipe list for this chapter remains to be determined, but it will include things like how to find strings and comments in various programming languages, and how to find 404 records within Apache HTTP Server logs.

Do you have an idea for a recipe that you'd like to see added? Suggestions are particularly welcome for this new chapter, but ideas for the rest of the book are also welcome.

New language editions

The first edition of Regular Expressions Cookbook has now been published in eight languages. You can get your favorite language version of it from the following sites:

XRegExp Updates

A few days ago, I posted a long-overdue XRegExp bug fix release (version 1.5.1). This was mainly to address an IE issue that a number of people have written to me and blogged about. Specifically, RegExp.prototype.exec no longer throws an error in IE when it is simultaneously provided a nonstring argument and called on a regex with a capturing group that matches an empty string. That's an edge case of an edge case, but it was causing XRegExp to conflict with jQuery 1.7.1 (oops). You can see the full list of changes in the changelog.

But wait, there's more… XRegExp's Unicode plugin has been updated to support Unicode 6.1 (released January 2012), rather than Unicode 5.2. I've also added a new test suite with 265 tests so far, and more on the way.

More substantial changes to XRegExp are planned and coming soon. Follow the brand new XRegExp repository on GitHub to keep up to date or to fork it and help shape the future of this one-of-a-kind JavaScript library. 🙂