December 2007 — Flagrant Badassery

JavaScript Password Validation

Since I've seen tons of password validation help requests on regexadvice.com (where I hang out from time to time), I've written up a more general-purpose JavaScript password validation function. It's reasonably straightforward, and covers the validation requirements I've most frequently encountered. Plus, if it doesn't handle your exact needs, its functionality can be augmented by passing it custom functions and regular expressions.

Here are the validation types supported out of the box. All are optional, which means that all passwords are allowed by default.

Minimum and maximum length.
Minimum n lowercase characters (a–z).
Minimum n uppercase characters (A–Z).
Minimum n combined a–z and A–Z characters.
Minimum n numeric characters (0–9).
Minimum n special characters (characters other than a–z, A–Z, and 0–9).
Ban particular words (tested case-insensitively).
Ban n-length character sequences (e.g. "abc", "XYZ", or "789", with a sequence length of 3; does not apply to special characters).
Ban n-length qwerty character sequences (e.g. "qwerty" or "asdf", with a sequence length of 4; does not apply to special characters).
Ban sequential, identical characters (e.g. "aa" or "!!").
Use custom regular expressions (tested using RegExp.prototype.test) and functions (the password is provided as the first argument, and a Boolean value is expected in return).

Here's an example of how it can be used:

var password = "password";
var passed = validatePassword(password, {
	length:   [8, Infinity],
	lower:    1,
	upper:    1,
	numeric:  1,
	special:  1,
	badWords: ["password", "steven", "levithan"],
	badSequenceLength: 4
});
// passed: false

The above requires that password is at least eight characters long; has at least one lowercase, uppercase, numeric, and special character; doesn't include the words "password", "steven", or "levithan"; and doesn't include an alphanumeric sequence four or more characters in length (e.g. "1234").

Here's the code (there are no external library dependencies):

/*
	Password Validator 0.1
	(c) 2007 Steven Levithan <stevenlevithan.com>
	MIT License
*/

function validatePassword (pw, options) {
	// default options (allows any password)
	var o = {
		lower:    0,
		upper:    0,
		alpha:    0, /* lower + upper */
		numeric:  0,
		special:  0,
		length:   [0, Infinity],
		custom:   [ /* regexes and/or functions */ ],
		badWords: [],
		badSequenceLength: 0,
		noQwertySequences: false,
		noSequential:      false
	};

	for (var property in options)
		o[property] = options[property];

	var	re = {
			lower:   /[a-z]/g,
			upper:   /[A-Z]/g,
			alpha:   /[A-Z]/gi,
			numeric: /[0-9]/g,
			special: /[\W_]/g
		},
		rule, i;

	// enforce min/max length
	if (pw.length < o.length[0] || pw.length > o.length[1])
		return false;

	// enforce lower/upper/alpha/numeric/special rules
	for (rule in re) {
		if ((pw.match(re[rule]) || []).length < o[rule])
			return false;
	}

	// enforce word ban (case insensitive)
	for (i = 0; i < o.badWords.length; i++) {
		if (pw.toLowerCase().indexOf(o.badWords[i].toLowerCase()) > -1)
			return false;
	}

	// enforce the no sequential, identical characters rule
	if (o.noSequential && /([\S\s])\1/.test(pw))
		return false;

	// enforce alphanumeric/qwerty sequence ban rules
	if (o.badSequenceLength) {
		var	lower   = "abcdefghijklmnopqrstuvwxyz",
			upper   = lower.toUpperCase(),
			numbers = "0123456789",
			qwerty  = "qwertyuiopasdfghjklzxcvbnm",
			start   = o.badSequenceLength - 1,
			seq     = "_" + pw.slice(0, start);
		for (i = start; i < pw.length; i++) {
			seq = seq.slice(1) + pw.charAt(i);
			if (
				lower.indexOf(seq)   > -1 ||
				upper.indexOf(seq)   > -1 ||
				numbers.indexOf(seq) > -1 ||
				(o.noQwertySequences && qwerty.indexOf(seq) > -1)
			) {
				return false;
			}
		}
	}

	// enforce custom regex/function rules
	for (i = 0; i < o.custom.length; i++) {
		rule = o.custom[i];
		if (rule instanceof RegExp) {
			if (!rule.test(pw))
				return false;
		} else if (rule instanceof Function) {
			if (!rule(pw))
				return false;
		}
	}

	// great success!
	return true;
}

You can download it here.

Lemme know if you have any feature requests or other suggestions about how to improve it, or if you need help writing custom rules for it.

/(bb|[^b]{2})/

$/(bb|[^b]{2})/$

…That is the question. ThinkGeek is selling that on a t-shirt for the "regular expression junkie + lover of literature." Wearing that around would be a sure way to get me to notice you, especially if you have a nice rack or happen to be Angelina Jolie.

Note that the parentheses are unnecessary except in the case of Perl-folk wishing to avoid using the naughty $& variable for performance reasons. Also, the expression is buggy in that many alternatives to "bb" are not matched, so it would probably be better written as simply /bb|.*/s. For a JavaScript solution you could try var answer = /^bb$/.test(i);.

Alright, I'm done debugging the shirt now.

Performance of Greedy vs. Lazy Regex Quantifiers

A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, but with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the impreciseness of their patterns, often without realizing this is what they're doing. Hence, the previously mentioned misconception stems from that fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.

Consider the following simple example: When the regexes <.*?> and <[^>]*> are applied to the subject string "<0123456789>", they are functionally equivalent. The only difference is how the regex engine goes about generating the match. However, the latter regex (which uses a greedy quantifier) is more efficient, because it describes what the user really means: match the character "<", followed by any number of characters which are not ">", and finally, match the character ">". Defined this way, it requires no backtracking in the case of any successful match, and only one backtracking step in the case of any unsuccessful match. Hand-optimization of regex patterns largely revolves around the ideas of reducing backtracking and the steps which are potentially required to match or fail at any given character position, and here we've reduced both cases to the absolute minimum.

So what exactly is different about how a backtracking, leftmost-first regex engine (your traditional NFA engine type, which describes most modern, Perl-derivative flavors) goes about matching our subject text when working with these two regexes? Let's first take a look at what happens with <.*?>. I'll use the RegexBuddy debugger style to illustrate each step.

Note: If you're reading this in a feed reader or aggregator which strips out inline styles, see the original post, which should make things much easier to follow.

Beginning match attempt at character 0
<
<ok
<backtrack
<0
<0backtrack
<01
<01backtrack
<012
<012backtrack
<0123
<0123backtrack
<01234
<01234backtrack
<012345
<012345backtrack
<0123456
<0123456backtrack
<01234567
<01234567backtrack
<012345678
<012345678backtrack
<0123456789
<0123456789>
Match found

The text ok means the engine completed a successful, zero-length match, and backtrack shows where the engine backtracks to the last choice point, which it does after trying and failing to match the next token (>). As you can see here, with lazy quantification the engine backtracks forward after each character which is not followed by ">".

Here's what happens with <[^>]*>:

Beginning match attempt at character 0
<
<0123456789
<0123456789>
Match found

As you can see, 20 fewer steps are required, and there is no backtracking. This is faster, and can also increase the potential for internal optimization of the matching process. As a simple example, when greedily quantifying an any-character token, the engine can simply move the read-head to the end of the string, rather than testing every character along the way (this is an oversimplification of the process taken by modern regex engines, but you get the idea). You can see this exact kind of internal optimization occurring in IE with the "trim8" pattern in my Faster JavaScript Trim post from a while back.

You might be wondering what happens if we use the regex <.*>. Take a look:

Beginning match attempt at character 0
<
<0123456789>
<0123456789>backtrack
<0123456789
<0123456789>
Match found

Although the output didn't change much when used with our subject string of "<0123456789>", you can see that the .* caused the engine to trek all the way to the end of the string, then backtrack from the right. If we'd used a 10,000-character string without newlines or other ">" characters, that would've resulted in nearly 10,000 backtracks.

In short:

Make your patterns more explicit where it makes sense, to better guide the regex engine and avoid backtracking.
Be careful about using the more appropriate type of quantification (to your best approximation, since this often depends on the subject string) even when it has no impact on the text matched by your regex.
If possible, avoid greedy quantification of the dot and other needlessly flexible tokens.