Levels of JavaScript Regex Knowledge

  1. N00b
    • Thinks "regular expressions" is open mic night at a poetry bar.
    • Uses \w, \d, \s, and other shorthand classes purely by accident if at all.
    • Painfully misuses * and especially .*.
    • Puts words in character classes.
    • Uses | in character classes for alternation.
    • Hasn't heard of the exec method.
    • Copies and pastes poorly written regexes from the web.
  2. Trained n00b
    • Uses regexes where methods like slice or indexOf would do.
    • Uses g, i, and m modifiers needlessly.
    • Uses [^\w] instead of \W.
    • Doesn't know why using [\w\d_] gives away their n00bness.
    • Tries to remove HTML tags with replace(/<.*?>/g,"").
    • Escapes all punctuation\!
  3. User
    • Knows when to use regexes, and when to use string methods.
    • Toys with lookahead.
    • Uses regexes in conditionals.
    • Starts to understand why HTML tags are hard to match with regexes.
    • Knows to use (?:…) when a backreference or capture isn't needed.
    • Can read a relatively simple regex and explain its function.
    • Knows their way around the use of replace callback functions.
  4. Haxz0r
    • Uses lookahead with impunity.
    • Sighs at the unavailability of lookbehind and other features from more powerful regex libraries.
    • Knows what $`, $', and $& mean in a replacement string.
    • Knows the difference between string literal and regex metacharacters, and how this impacts the RegExp constructor.
    • Generally knows whether a greedy or lazy quantifier is more appropriate, even when it doesn't change what the regex matches.
    • Has a basic sense of how to avoid regex efficiency problems.
    • Knows how to iterate over strings using the exec method and a while loop.
    • Knows that properties of the global RegExp object and the compile method are deprecated.
  5. Guru
    • Understands the significance of manually modifying a regex object's lastIndex property and when this can be useful within a loop.
    • Can explain how any given regex will or won't work.
    • No longer experiences the excitement of writing complex regexes that work on the first try, since regex behavior has become predictable and obvious.
    • Is immune to catastrophic backtracking, and can easily (and accurately) determine if a nested quantifier is safe.
    • Knows of numerous cross-browser regex syntax and behavior differences.
    • Knows offhand the section number of ECMA-262 3rd Edition that covers regexes.
    • Understands the difference between capturing group nonparticipation vs participating but capturing an empty string, and the behavior differences this can lead to.
    • Has a preference for particular backreference rules related to capturing group participation and quantified alternation, or is at least aware of the implementation inconsistencies.
    • Often knows which browser will run a given regex fastest before testing, based on known internal optimizations and weaknesses.
    • Thinks that writing recursive regexes is easy, so long as there is an upper bound to recursion depth.
  6. Wizard
    • Works on a regex engine.
    • Has patched the engine from time to time.
  7. God
    • Can add features to the engine at a whim.
    • Also created all life on earth using a constructor function.

(Heavily adapted and JavaScriptized from 7 Stages of a [Perl] Regex User.)
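
A few of the items above are easier to see in code. The following is only a sketch: the sample strings and the findAll helper are made up for illustration, and the backreference result assumes a spec-conformant modern engine.

```javascript
// Illustrative only: sample strings and helper names are hypothetical.

// N00b: inside a character class, | is a literal pipe, not alternation.
const pipeIsLiteral = /[a|b]/.test("|"); // true

// Haxz0r: $` is the text before the match, $& the match, $' the text after.
const tokens = "abc".replace(/b/, "[$`|$&|$']"); // "a[a|b|c]c"

// Haxz0r: a string literal consumes one level of backslashes before the
// RegExp constructor ever sees the pattern.
const fromEscaped = new RegExp("\\d+").source; // "\d+" (one or more digits)
const fromUnescaped = new RegExp("\d+").source; // "d+" (one or more letter d)

// Haxz0r/Guru: iterate with exec and a while loop. A zero-length match
// leaves lastIndex where it was, so it is bumped manually to avoid an
// endless loop.
function findAll(re, text) {
  if (!re.global) throw new Error("findAll needs a regex with the g flag");
  re.lastIndex = 0;
  const found = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    found.push([m.index, m[0]]);
    if (m[0] === "") re.lastIndex++;
  }
  return found;
}
const words = findAll(/\w+/g, "ab cd"); // [[0, "ab"], [3, "cd"]]
const empties = findAll(/x*/g, "ab"); // [[0, ""], [1, ""], [2, ""]]

// Guru: a group that did not participate yields undefined, while a group
// that participated but matched nothing yields an empty string.
const nonParticipating = /(a)?b/.exec("b")[1]; // undefined
const emptyCapture = /(a*)b/.exec("b")[1]; // ""

// Per the ECMAScript rules, a backreference to a nonparticipating group
// matches the empty string, so this succeeds.
const backrefToUnset = /(a)?\1b/.test("b"); // true
```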

15 thoughts on “Levels of JavaScript Regex Knowledge”

  1. I would assert that there are shades of gray between the levels. I am probably most typically in the “Trained n00b” category, however there are some aspects of the “User” category I fall into, but not all of them. So would that make me a n00bified user?

  2. I guess. 😉

    This is mostly just intended humorously, but it’s also meant to be soul-crushingly tough on people so don’t sweat it if you’re not yet among the highest levels.

  3. “Toys with lookahead.”

    Now, do you mean “Has some clue about lookahead”, or are we talking about the same kind of toying that blew up my parents’ breaker box when I toyed with electricity in the 8th grade?

  4. Hm, well, I’m more trained noob than anything else here, but I desperately need to challenge this statement:

    Starts to understand why HTML tags are hard to match with regexes.

    Why is that hard? It seems so neat to use regexp for that. I need to nuke all HTML tags except for a select few from a chunk of code, but all my attempts at using regexp for this go to rot 🙁

  5. @nic_tester:

    It’s difficult (if not impossible) for the following reasons:

    1. HTML tags nest with no upper bound on nesting depth. Most regex flavors do not support infinite recursion (certainly not JavaScript’s).
    2. HTML attribute values can contain unencoded < and > characters. This requires extra handling.
    3. HTML attribute values can be surrounded by double quotes, single quotes, or no quotes (and in IE, backtick quotes). Multiple attribute values within the same tag can use different quote styles, and quoted values can contain quotes of alternative types. All of this complicates the handling for point 2.
    4. Attributes can appear in any order or not at all. This significantly complicates things if you need to work with more than one specific attribute.
    5. Browsers support a whole lot of invalid markup most people wouldn’t think about handling. Accounting for such issues is often quite difficult, and not doing so can result in security hazards.
    6. HTML comments can contain HTML tags, which throws off a lot of simple handling.
    7. HTML tags are sometimes mixed into content which uses unencoded < and > characters which are not part of HTML tags.
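
    As a rough illustration of points 2, 6, and 7, here is what the naive replace(/<.*?>/g,"") approach does to a few made-up inputs. The sample strings are hypothetical, chosen only to trigger each problem.

```javascript
// The naive tag stripper from the "Trained n00b" level.
const naive = /<.*?>/g;

// Point 2: an unencoded > inside an attribute value ends the match early.
const attrCase = '<a title="1 > 0">link</a>'.replace(naive, "");
// ' 0">link'

// Point 6: a tag inside a comment is treated as the end of the comment.
const commentCase = "<!-- <b>old</b> -->text".replace(naive, "");
// 'old -->text'

// Point 7: plain < and > in content are mistaken for a tag.
const contentCase = "if a < b and c > d".replace(naive, "");
// 'if a  d'
```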

    For your task, if you don’t need to account for the edge cases mentioned above you could use something like str.replace(/<(?!\/?(?:a|select|few)\b)[^>]+>/gi, "") to get rid of all tags other than a, select, and few.
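
    A minimal sketch of that kind of whitelist replace, ignoring all of the edge cases listed above. The stripTagsExcept helper and the sample markup are hypothetical. Note that the optional slash sits inside the lookahead; with it outside, the engine can backtrack past the slash and strip the closing tags of the elements you wanted to keep.

```javascript
// Hypothetical helper: removes every tag whose name is not in `allowed`.
// Assumes `allowed` holds plain tag names with no regex metacharacters.
function stripTagsExcept(html, allowed) {
  const names = allowed.join("|");
  // The lookahead rejects both <name ...> and </name> for allowed names.
  const re = new RegExp("<(?!/?(?:" + names + ")\\b)[^>]+>", "gi");
  return html.replace(re, "");
}

const sample = '<p>See <a href="/x">this</a> and <b>that</b></p>';
const cleaned = stripTagsExcept(sample, ["a"]);
// 'See <a href="/x">this</a> and that'
```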

    If you need additional regex construction advice you might want to try someplace like the RegexBuddy or regexadvice.com forums.

  6. That’s already partially represented at the User level (“Knows when to use regexes…”). A guru should have learned that lesson long ago, and I don’t think that how comfortable you are with “more advanced” parsing languages/tools/models is directly related to your level of regex mastery.

    Naturally, anything outside of my domain knowledge is going to be underrepresented.

  7. I have to agree with Steve’s statement; I don’t believe there is any connection between comfort levels for regexes and “more advanced” parsing languages … etc. I believe if more of those who are comfortable with the “more advanced” techniques took the time to really understand regexes, they would likely find that many of the “more advanced” techniques are a piece of cake to replace with a relatively simple regex.

  8. Hey, I know this post was a while ago, but I found it looking for help on matching attributes within an HTML tag. After reading it, it made me more determined to work it out for myself, and I just did 🙂

    \w+\s*=\s*(["'\w])(?:(?:.*?\1)|[^\s|>]*)

    There may be an easier way but I’ve tested this and it works fine with attributes written like attr="val", attr='val', attr='hello "value"' and attr=val.
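
    A quick sketch of that pattern at work, written with straight quotes in the character class and the g flag added so that match returns every attribute. The sample tag is made up. It only covers the simple cases listed; for example, an unquoted value whose first character recurs later in the tag can over-match.

```javascript
// The attribute pattern from the comment, with the g flag added here.
const attrPattern = /\w+\s*=\s*(["'\w])(?:(?:.*?\1)|[^\s|>]*)/g;

// Hypothetical tag mixing double-quoted, single-quoted, and unquoted values.
const tag = `<a href="x.html" onclick='go("y")' id=z>`;

const attrs = tag.match(attrPattern);
// ['href="x.html"', `onclick='go("y")'`, 'id=z']
```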

    I’m using it in a function that removes non-white-listed attributes (mostly to catch onmouseover, onfocus, etc). I probably wouldn’t have been as determined if I hadn’t read this post. I felt compelled to try and fit into a higher level, I’m probably a User with a few points in haxz0r. Go me!

  9. Hey, you could add to 5. Guru that you are a Guru if you write a code converter, such as “VB” to “JavaScript”, using only recursive regexes. 😉

    I just finished my own code converter.

    Sorry for my poor English, I’m Brazilian!

    Thanks!
