A JScript/VBScript Regex Lookahead Bug

Here's one of the oddest and most significant regex bugs in Internet Explorer. It can appear when using optional elision within lookahead (e.g., via ?, *, {0,n}, or (.|); but not +, interval quantifiers starting from one or higher, or alternation without a zero-length option). An example in JavaScript:

/(?=a?b)ab/.test("ab");
// Should return true, but IE 5.5 – 8b1 return false

/(?=a?b)ab/.test("abc");
// Correctly returns true (even in IE), although the
// added "c" does not take part in the match

I've been aware of this bug for a couple years, thanks to a blog post by Michael Ash that describes the bug with a password-complexity regex. However, the bug description there is incomplete and subtly incorrect, as shown by the above, reduced test case. To be honest, although the errant behavior is predictable, it's a bit tricky to describe because I haven't yet figured out exactly what's happening internally. I'd recommend playing with variations of the above code to get a better understanding of the problem.

Fortunately, since the bug is predictable, it's usually possible to work around. For example, you can avoid the bug with the password regex in Michael's post (/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/) by writing it as /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*/ (the .{8,15}$ lookahead must come first here). The important thing is to be aware of the issue, because it can easily introduce latent and difficult to diagnose bugs into your code. Just remember that it shows up with variable-length lookahead. If you're using such patterns, test the hell out of them in IE.

Solving Algebraic Equations Using Regular Expressions

Regexes suck at math. To a regex engine, the characters 0 through 9 are no more special than any others.

I should mention that there are a couple exceptions. Perl and PCRE allow dynamic code to be run at any point during the matching process, which presents a great deal of extra potential. Perl does this with code embedded in the regex; PCRE with callouts to external functions. But those regex flavors are the exceptions, and even with them the extended capabilities only take you so far. Generally, math-related problems such as matching numeric ranges (useful for tasks like matching a set of years within longer text) are a pain in the a** to deal with, when they're possible at all.

However, the power and expressiveness of even basic regular expression syntax can lead to some nifty tricks. Things like matching only non-prime-length strings! The primality regex is somewhat famous now, but another hack that might surprise you is using regexes to solve a simple class of linear equations. I stumbled on the idea for this pattern while messing around with RegexBuddy's awesome debugger. The implementation itself is dead simple and should work pretty much universally, with the exception of strict POSIX ERE implementations or other esoteric flavors which don't allow backreferences. Here's the template:

^(.*)\1{A−1}(.*)\2{B−1}$

That lets us solve for x and y with an equation like 17x + 12y = 51. A and B are placeholders for constants that in this case are 17 and 12. So, the regex becomes ^(.*)\1{16}(.*)\2{11}$. We subtract one from values A and B because we're repeating backreferences to subpatterns that have already matched once before. If you run that regex against a 51-character string, the length of $1 (backreference one) will be 3 (which tells us that x = 3), and the length of $2 (backreference two) will be 0 (meaning that y = 0). Indeed, 17×3 + 12×0 = 51. If the problem is unsolvable, the regex will not match the string. If there are multiple possible solutions, the one with the highest value of x will be returned since x is handled earlier in the regex.

Try it out. You can use as many variables as you'd like as long as the equation follows the same form. E.g., 11x + 2y + 5z = 115 can be solved with ^(.*)\1{10}(.*)\2{1}(.*)\3{4}$ and a 115-character subject string (the result: 11×10 + 2×0 + 5×1 = 115). Run ^(.*)\1{12}$ against a 247-character string and you'll get back a 19-character value for backreference one, demonstrating that 13×19 = 247. Keep in mind that as the integers and string lengths get higher and the number of variables increase, the amount of backtracking by the regex engine will also increase. At some threshold this pattern will slow to the point that it's unusable. But I don't really care; it's still cool. wink

JavaScript Roman Numeral Converter

While looking for something quick to do during a brief internet outage, I wrote some code to convert to and from Roman numerals. Once things were back up I searched for equivalent code, but only found stuff that was multiple pages long, limited the range of what it could convert, or both. I figured I might as well share what I came up with:

function romanize (num) {
  if (!+num) return false;
  var digits = String(+num).split('');
  var key = ['','C','CC','CCC','CD','D','DC','DCC','DCCC','CM',
             '','X','XX','XXX','XL','L','LX','LXX','LXXX','XC',
             '','I','II','III','IV','V','VI','VII','VIII','IX'];
  var roman = '', i = 3;
  while (i--) roman = (key[+digits.pop() + (i * 10)] || '') + roman;
  return Array(+digits.join('') + 1).join('M') + roman;
}

function deromanize (str) {
  var str = str.toUpperCase();
  var validator = /^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/;
  var token = /[MDLV]|C[MD]?|X[CL]?|I[XV]?/g;
  var key = {M:1000,CM:900,D:500,CD:400,C:100,XC:90,L:50,XL:40,X:10,IX:9,V:5,IV:4,I:1};
  var num = 0, m;
  if (!(str && validator.test(str))) return false;
  while (m = token.exec(str)) num += key[m[0]];
  return num;
}

How would you rewrite this code? Can you create a shorter version?

Highlights From SXSWi 2008

I'm back from the SXSW Interactive conference, which offered a pretty decent selection of panels and good times. I ended up attending about 20 panels over three and a half days, and about as many bars and after-parties. Here were some of the personal highlights, for anyone who's interested.

The first session I attended on Saturday morning was High Performance Web Sites (panel info) by Steve Souders. I've read Steve's book, so there wasn't really anything new there for me. For those out of the loop, there's a summary of the rules in his book on the Yahoo! Developer Network. Apart from his YSlow add-on for Firebug, he also mentioned HttpWatch ($295) for IE, which seemed pretty cool. What was more interesting for me though was meeting Steve the next day at the O'Reilly Media booth, where we discussed the success of his book (which I believe was the best selling tech book on Amazon last year), working with O'Reilly Media and specifically his editor Andy Oram (who also edited Mastering Regular Expressions, among other top-selling O'Reilly books), and the upcoming Velocity conference that Steve will co-chair. At one point I asked him why he moved from Yahoo! to Google. His response was that he'll be able to focus more on open source and green computing, which are issues he's passionate about. The probable pay raise and that fact that he now knows about pretty much everything that both Yahoo! and Google are doing for web performance can't hurt either. I also learned that he's working on a sequel to his book (as opposed to a second edition), which will include a range of additional front-end performance tips and insight.

Another early Saturday panel I attended was Accessible Rich Media (panel info, podcast). Developing rich yet accessible internet applications seemed to be one of the focuses for SXSWi this year, but the main reason I'm mentioning this panel here is to point out some of the tools they highlighted: NVDA (an open source, Windows screen reader), Colour Contrast Analyser, the iCITA Firefox Accessibility Extension, and Fangs (a screen reader emulator). They also talked up WAI-ARIA and the accessibility built into Dojo's Dijit control library.

From there it was on to 10 Things We've Learned at 37signals (panel info, podcast), which as I've mentioned previously was one of the highlights for me. Co-founder Jason Fried covered some of the overarching principals at 37signals, including ways they try to improve productivity within their team, as well as the usefulness and simplicity of their software.

Following are the main points he discussed, as described on the accompanying slides. Listen to the podcast for more details.

  1. Ignore the great unknown
  2. Watch out for red flags (by this he meant words such as need, can't, only, fast, and easy, which often lead to project delays)
  3. Be successful and make money by helping other people be successful and make money
  4. Target nonconsumers and nonconsumption
  5. Question your work regularly
  6. Read your product
  7. Err on the side of simple
  8. Invest in what doesn't change
  9. Follow the chefs (this was about becoming famous and successful by giving away your knowledge)
  10. Interruption is the enemy of productivity
  11. Road maps send you in the wrong direction
  12. Be clear in crisis
  13. Make tiny decisions
  14. Make it matter

It turns out they've learned a few more than 10 things.

Skipping right along to the Keynote Interview with Mark Zuckerberg (panel info, podcast, video)… um, yeah, it was salvaged only by the crowd's reaction, which has been much written about. Many people seem to agree that the interviewer was fairly self-centered and didn't understand her audience.

The only discussion that might have gotten more laughs than the Zuckerberg interview was LOLWUT? Why Do I Keep Coming Back to This Website? (panel info), which chronicled I Can Has Cheezburger? from its inception to its massive success, subsequent buyout, and further expansion. It turns out that lolcats pay the salaries of nine people these days, including four developers. Another important thing I learned was that there's a loldog site at I Has a Hotdog. The folks behind those sites are also working on a new political photos site, which they introduced with an awesome image of Hillary Clinton shouting "THIS IS … SPARTA!"

Side note: One cool thing at the convention center was the number of Rock Band and Guitar Hero 3 sets available. If you include the Interactive trade show, ScreenBurn arcade, and a secondary Microsoft booth in one of the hallways, there were at least five Guitar Hero 3 setups plus four more Rock Band sets. At one point, Guitar Hero world champion Freddie Wong was showing off at the main Microsoft booth, where he racked up an over 100-note streak playing at expert difficulty with the guitar behind his head. After pretending to play some rock music myself, I ended up talking to a number of folks at the Microsoft booth. The big things they were pushing were Silverlight, Deep Zoom, Live Search (especially for multimedia), Live Maps, and Expression Web. They also had a setup where one of the Microsoft people was demoing Project Maestro, a Minority Report -style system where a pair of wireless gloves were being used as the input to move, stack, rotate, and zoom in and out of photos using simple hand gestures. Pretty slick, but it would've been cooler if they'd let you try it out for yourself. Oh, and the Sun booth was giving away free blu-ray movies.

L-R: Chris Wilson, Arun Ranganathan, Charles McCathieNevile, and Brendan Eich

Then there was Browser Wars: Deja Vu All Over Again? (panel info) with Brendan Eich, Mozilla CTO and inventor of JavaScript; Chris Wilson, Internet Explorer Platform Architect and chair of the HTML Working Group; and some dude, from Opera. Just kidding, Charles McCathieNevile is Opera's Chief Standards Officer and an all-around web standards titan.

Joined by moderator Arun Ranganathan (previously of Netscape), the three browser representatives fielded questions on the mobile web (a particularly big deal for Opera), Silverlight, ECMAScript 4, SVG, etc. A podcast of the session should eventually be available from 2008.sxsw.com/coverage/podcasts, and will probably be worth checking out. The discussion was attended by a full-capacity crowd, with many who didn't get a seat early being turned away at the door. My friend and coworker Ryan Christie was one of the people turned away, but that worked out because outside the doors he struck up a conversation with fellow shut-out John Resig, who mentioned that Mozilla was hosting a mini-party at a nearby grill later that afternoon.

I didn't want to miss the Mozilla get-together, so Ryan and I headed over shortly afterwards. It was a lot of fun hanging out with the Mozilla people—Brendan Eich, John Resig, Aza Raskin, Melissa Shapiro (who was exceptionally friendly and outgoing, btw)—and all the other people who stopped by. The only downside was that because they had a full bar serving free drinks, I probably downed a few too many.

John Resig (right) and I at Moonshine Grill

Along with free food and drinks, there was plenty of schwag to go around including Mozilla Labs t-shirts, stickers from Mozilla Japan, badges, fake tattoos, and of course screaming monkeys that could be launched slingshot-style across the room. I proceeded to do just that as much as possible. The Mozilla crew also had a competition going for the best Firefox add-on idea, which would net three winners a set of the coolest gear on hand: a nifty Firefox-branded backpack, stuffed animal Firefox, and Mozilla beanie. I submitted some of the worst ideas I could think of (e.g., an email and phone number harvester that would automatically sell the data it gathers as you browse the web to Russian spammers), but the rest of the ideas must have been pretty silly too because I ended up winning first place for a Firefox 3 add-on which would leak memory to remind you of the old days. The idea was actually a collaboration with Ryan, so he took first dibs on the winnings (making off with the stuffed animal). Runner-up winner was a token useful add-on that would show you phishing websites and the sites they were spoofing side by side, I guess as an educational experience. Third place winner was I SERVED U AN AD BUT I EATED IT, which would replace all ads with lolcats.

After the prize handouts, Brendan Eich came by and encouraged me to submit an ES4 ticket on the regex behavior for backreferences to non-participating groups which I've previously mentioned here and on the ES4-discuss mailing list. I think it's something he'd like to see change as well. He also mentioned something about the /x (extended) and /y (sticky) modifiers being among the best ES4 regex extensions, which I wholeheartedly agree with (ES4's /y is similar to but better designed than Perl/PCRE/.NET/Java/Ruby's \G token). Of course, Brendan is someone I deeply respect and admire, so just the fact that he remembered me from the mailing list was pretty cool.

One panel I missed that I would've loved to attend was Secrets of JavaScript Libraries (panel info, slides) hosted by John Resig (jQuery) along with Alex Russell (Dojo), Sam Stephenson (Prototype), Andrew Dupont (Prototype), and Thomas Fuchs (Scriptaculous). Unfortunately, my Tuesday flight left a little too early, but I'll be looking out for the podcast.

So those were most of the highlights for me. Here be a few final observations:

  • Core Conversations is a bullshit format. Seriously, get some mics next time.
  • The porn industry is well represented at SXSWi.
  • Austin seems to be comprised of little more than office space, hotels, and bars.
  • 80% of the world's MacBook Pros and iPhones showed up for the conference.

(Photos by Ryan Christie and his POS Motorola RIZR.)

SXSW Interactive 2008

Despite flight cancellations and a last-minute work crisis, I've been in Austin, Texas with a couple work buddies since late-Friday to attend the packed-out South by Southwest Interactive conference. It's been a lot of fun so far. 11 panels into it, the standout for me has easily been 10 Things We've Learned at 37signals by 37signals co-founder and CEO Jason Fried. I'll try to post more about the conference sometime after leaving, but hey, if anyone reading this is here in Austin and wants to meet up, grab some drinks, hit an after party, play Rock Band at the ScreenBurn arcade, whatever, send me an email: steves_list [at] hotmail.com.