Flagrant Badassery

A JavaScript and regular expression centric blog

Levels of JavaScript Regex Knowledge

(Adapted from 7 Stages of a [Perl] Regex User.)

  1. N00b
    • Thinks "regular expressions" is open mic night at a poetry bar.
    • Uses \w, \d, \s, and other shorthand classes purely by accident if at all.
    • Painfully misuses * and especially .*.
    • Puts words in character classes.
    • Uses | in character classes for alternation.
    • Hasn't heard of the exec method.
    • Copies and pastes poorly written regexes from the web, taking credit for this on the job.
  2. Trained n00b
    • Uses regexes where methods like substr or indexOf would do.
    • Uses modifiers like i and m needlessly.
    • Uses [^\w] instead of \W.
    • Doesn't know why using [\w\d_] gives away their n00bness.
    • Tries to remove HTML tags with replace(/<.*>/g,"") or replace(/<.*?>/g,"").
    • Backslashes needlessly\!
  3. User
    • Knows when to use regexes, and when to use string methods.
    • Toys with lookahead.
    • Uses regexes in conditionals.
    • Starts to understand why HTML tags are hard to match with regexes.
    • Knows to use (?:…) when a backreference or capture isn't needed.
    • Can read a relatively simple regex and explain its function.
  4. Haxz0r
    • Uses lookahead with impunity.
    • Sighs at the unavailability of lookbehind and other features from more powerful regex libraries.
    • Knows what $`, $', and $& mean in a replacement string.
    • Knows the difference between string literal and regex metacharacters, and how this impacts the RegExp constructor.
    • Generally knows whether a lazy or greedy quantifier is more appropriate even when it doesn't change what the regex matches.
    • Knows their way around the use of replace callback functions.
    • Has read Mastering Regular Expressions.
    • Knows how to "unroll the loop" (but might not yet be immune to catastrophic backtracking).
    • Knows how to step through data using the exec method and a while loop.
    • Knows that properties of the global RegExp object and the compile method are deprecated.
  5. Guru
    • Can explain how any given regex will or won't work.
    • Can easily (and accurately) determine if a nested quantifier is safe.
    • Understands the significance of manually modifying a regex object's lastIndex property and when this can be useful within a loop.
    • Knows of numerous cross-browser regex syntax and behavior differences.
    • Knows offhand the section number of ECMA-262 3rd Edition which covers regexes.
    • Has a preference for particular backreference rules related to capturing group participation and quantified alternation, or is at least aware of the implementation inconsistencies.
    • Often knows which browser will run a given regex fastest before testing, based on known internal optimizations and weaknesses.
  6. Wizard
    • Works on a regex engine.
    • Has patched the engine from time to time.
  7. God
    • Can add features to the engine at a whim.
    • Also created all life on earth using a constructor function.
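
Several of the Haxz0r and Guru items above are easier to recognize in code. What follows is a minimal sketch, not part of the original list. The function names (findNumbers, doubleNumbers, collectMatches), the variable names, and the sample strings are all made up for illustration.

```javascript
// Step through data with exec and a while loop. With the g flag,
// each exec call resumes from the regex's lastIndex.
function findNumbers(text) {
  const re = /\d+/g;
  const found = [];
  let match;
  while ((match = re.exec(text)) !== null) {
    found.push({ value: match[0], index: match.index });
  }
  return found;
}

// Replacement-string tokens: $& is the matched text, $` is the text
// before the match, and $' is the text after it.
const wrappedMatch = "a-b".replace(/-/, "[$&]"); // "a[-]b"
const textBefore = "a-b".replace(/-/, "$`");     // "aab"
const textAfter = "a-b".replace(/-/, "$'");      // "abb"

// A replace callback receives the match and returns its replacement.
function doubleNumbers(text) {
  return text.replace(/\d+/g, (digits) => String(Number(digits) * 2));
}

// In a string passed to the RegExp constructor, each regex backslash
// must itself be escaped for the string literal.
const fromLiteral = /\d+\.\d+/;
const fromString = new RegExp("\\d+\\.\\d+");

// Manually advancing lastIndex: a pattern that can match the empty
// string would otherwise keep this loop stuck at one position.
function collectMatches(re, text) {
  if (!re.global) {
    throw new TypeError("collectMatches expects a regex with the g flag");
  }
  const found = [];
  let match;
  re.lastIndex = 0;
  while ((match = re.exec(text)) !== null) {
    found.push([match.index, match[0]]);
    if (re.lastIndex === match.index) {
      re.lastIndex++; // step past a zero-length match
    }
  }
  return found;
}
```

For example, collectMatches(/a*/g, "baab") reports empty matches at positions 0, 3, and 4 and the match "aa" at position 1, and it terminates only because lastIndex is bumped after each empty match.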

There Are 12 Responses So Far.

  1. I would assert that there are shades of gray between the levels. I am probably most typically in the “Trained n00b” category; however, there are some aspects of the “User” category I fall into, though not all of them. So would that make me a n00bified user?

  2. I guess. ;-)

    This is mostly just intended humorously, but it’s also meant to be soul-crushingly tough on people so don’t sweat it if you’re not yet among the highest levels.

  3. Level 1 is too advanced. I’m going to go emo out in a corner now. kthxbie.

  4. “Toys with lookahead.”

    Now, do you mean “Has some clue about lookahead”, or are we talking about the same kind of toying that blew up my parents’ breaker box when I toyed with electricity in the 8th grade?

  5. *Bows to the god of regex* hehe

  6. Hm, well, I’m more trained noob than anything else here but I desperately need to challenge this statement:

    Starts to understand why HTML tags are hard to match with regexes.

    Why is that hard? It seems so neat to use regexp for that. I need to nuke all html tags except for a select few from a chunk of code but all my attempts at using regexp for this go to rot :(

  7. @nic_tester:

    It’s difficult (if not impossible) for the following reasons:

    1. HTML tags nest. Most regex flavors do not support recursion (certainly not JavaScript’s).
    2. HTML attribute values can contain unencoded < and > characters. Whether or not they’re allowed in valid markup is often irrelevant.
    3. HTML attribute values can be surrounded by double quotes, single quotes, or no quotes. Also, multiple attribute values within the same tag can use different quote styles, and quoted values can contain quotes of the alternative type. All of this complicates the handling for point 2.
    4. Attributes can appear in any order or not at all. This complicates things if you need to work with more than one specific attribute.
    5. Browsers support a whole lot of invalid markup most people wouldn’t think about handling. Accounting for such issues is often quite difficult, and not doing so can result in security hazards.
    6. HTML comments can contain HTML tags, which throws off a lot of simple handling.
    7. HTML tags are sometimes mixed into content which uses unencoded < and > characters which are not part of HTML tags.

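    For your task, if you don’t need to account for the edge cases mentioned above, you could use something like str.replace(/<(?!\/?(?:a|select|few)\b)[^>]+>/gi, "") to get rid of all tags other than a, select, and few. (The lookahead comes before the optional slash so that closing tags such as </a> are protected too; with the slash matched first, the regex can backtrack to an empty slash and strip them.)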

    If you need additional regex construction advice you might want to try someplace like the RegexBuddy or regexadvice.com forums.
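
To make the whitelist idea in response 7 concrete, here is a rough sketch. The helper name stripTagsExcept and the sample markup are assumptions made for illustration, and the sketch ignores all of the edge cases listed in that response. It places the negative lookahead before the optional slash so that closing tags of allowed elements survive as well.

```javascript
// Hypothetical helper: remove every tag except those named in `allowed`.
// Assumes the names are plain tag names with no regex metacharacters.
function stripTagsExcept(html, allowed) {
  // The lookahead sits before the optional slash, so </a> is protected
  // along with <a ...>. A slash needs no escaping in a constructor string.
  const pattern = "<(?!/?(?:" + allowed.join("|") + ")\\b)[^>]+>";
  return html.replace(new RegExp(pattern, "gi"), "");
}

const kept = stripTagsExcept('<p>Hi <a href="#">there</a>, <b>you</b></p>', ["a"]);
// kept is 'Hi <a href="#">there</a>, you'

// Reason 2 in action: a > inside an attribute value ends the match early.
const mangled = stripTagsExcept('<img alt="a > b" src="x.png">text', ["a"]);
// mangled is ' b" src="x.png">text'
```

The second call shows why this approach is only suitable when you control the markup: the unencoded > inside the alt value cuts the tag in half and leaves attribute debris in the output.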

  8. In Guru level, I would add:

    Knows when Regular Expressions will not work, and is not afraid of more advanced parsing methods (which typically require good understanding of regular expressions anyway).

    Alas, I think I’m at the trained n00b level.

    As Steve Pavlina says, if you think you’re a 7, you’re probably a 3.

    http://www.stevepavlina.com/blog/2005/07/how-to-get-from-a-7-to-a-10/

  9. That’s already partially represented at the User level (“Knows when to use regexes…”). A guru should have learned that lesson long ago, and I don’t think that how comfortable you are with “more advanced” parsing languages/tools/models is directly related to your level of regex mastery.

    Naturally, anything outside of my domain knowledge is going to be underrepresented.

  10. I have to agree with Steve’s statement: I don’t believe there is any connection between comfort levels for regexes and “more advanced” parsing languages, etc. I believe that if more of those who are comfortable with the “more advanced” techniques took the time to really understand regexes, they would likely find that many of those techniques are a piece of cake to replace with a relatively simple regex.

  11. [...] can match practically anything. In other words, regular expressions are powerful. A regular expression guru can find many appropriate uses for regular expressions [...]

  12. Hey, I know this post was a while ago, but I found it while looking for help on matching attributes within an HTML tag. After reading it, I was more determined to work it out for myself, and I just did :)

    \w+\s*=\s*(["'\w])(?:(?:.*?\1)|[^\s|>]*)

    There may be an easier way but I’ve tested this and it works fine with attributes written like attr="val", attr='val', attr='hello "value"', and attr=val.

    I’m using it in a function that removes non-white-listed attributes (mostly to catch onmouseover, onfocus, etc). I probably wouldn’t have been as determined if I hadn’t read this post. I felt compelled to try and fit into a higher level, I’m probably a User with a few points in haxz0r. Go me!
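
As a quick check of the attribute pattern in response 12, the sketch below runs it unchanged (apart from adding the g flag) against a sample tag. The helper name findAttributes and the sample tags are assumptions made for illustration.

```javascript
// The pattern from response 12, unchanged, with the g flag added so that
// String.prototype.match returns every attribute it finds.
function findAttributes(tag) {
  const re = /\w+\s*=\s*(["'\w])(?:(?:.*?\1)|[^\s|>]*)/g;
  return tag.match(re) || [];
}

const attrs = findAttributes(`<a href="x.html" title='Hi "there"' id=main>`);
// attrs is ['href="x.html"', "title='Hi \"there\"'", 'id=main']

// Caveat: for an unquoted value, the captured "delimiter" is the value's
// first character, so the lazy .*?\1 branch runs on to the next occurrence
// of that character if there is one later on the line.
const odd = findAttributes("<p id=main name=x>");
// odd is ['id=main nam', 'e=x']
```

The second call suggests the pattern is safest when unquoted values are rare, or when the backreference branch is restricted to the two quote characters.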
