Flagrant Badassery

A JavaScript and regular expression centric blog

JavaScript, Regex, and Unicode

Not all shorthand character classes and other JavaScript regex syntax are Unicode-aware. In some cases it is important to know exactly what certain tokens match, and that's what this post explores.

According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations often differ on these points. For example, Firefox 2 considers \d and \D to be Unicode-aware, while Firefox 3 fixes this bug — making \d equivalent to [0-9] as with most other browsers.
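
The split between Unicode-based and ASCII-only tokens can be probed directly. The following is a minimal sketch, assuming an engine that follows the specification on these points (as noted above, some browsers do not); the `probes` object and the sample characters are illustrative choices, not part of the original tests.

```javascript
// Probe a few tokens with non-ASCII input. The values noted in the comments
// assume a spec-conforming engine; older browsers may differ.
var probes = {
  // "\u00ef" (i with diaeresis) is not an ASCII word character,
  // so a word boundary follows the "a" in "naive" spelled with it.
  boundaryInsideNaive: /a\b/.test("na\u00efve"),
  // \w is ASCII-only, so it does not match "\u00ef".
  wordMatchesAccented: /\w/.test("\u00ef"),
  // \d is ASCII-only, so it does not match ARABIC-INDIC DIGIT THREE.
  digitMatchesArabicIndic: /\d/.test("\u0663"),
  // \s is Unicode-based, so it matches NO-BREAK SPACE.
  spaceMatchesNoBreakSpace: /\s/.test("\u00a0")
};
```

On a conforming engine the first and last probes should be true and the middle two false.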

Here again are the affected tokens, along with their definitions:

  • \d — Digits.
  • \s — Whitespace.
  • \w — Word characters.
  • \D — All except digits.
  • \S — All except whitespace.
  • \W — All except word characters.
  • . — All except newlines.
  • ^ (with /m) — The positions at the beginning of the string and just after newlines.
  • $ (with /m) — The positions at the end of the string and just before newlines.
  • \b — Word boundary positions.
  • \B — Not word boundary positions.

All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms digit, whitespace, word character, word boundary, and newline depends on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:

  • Digit — The characters 0-9 only.
  • Whitespace — Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".
  • Word character — The characters A-Z, a-z, 0-9, and _ only.
  • Word boundary — The position between a word character and non-word character.
  • Newline — The line feed, carriage return, line separator, and paragraph separator characters.

Here again are the newline characters, with their character codes:

  • \u000a — Line feed — \n
  • \u000d — Carriage return — \r
  • \u2028 — Line separator
  • \u2029 — Paragraph separator
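
To illustrate how the four newline characters interact with the dot and the multiline anchors, here is a small sketch. The helper name `describeNewline` and the "a"/"b" sample text are assumptions made for the example, and the expected behavior assumes a spec-conforming engine.

```javascript
// The four characters JavaScript regexes treat as newlines.
var NEWLINES = ["\u000a", "\u000d", "\u2028", "\u2029"];

// Report how the dot and the /m anchors treat one newline character.
function describeNewline(nl) {
  var sample = "a" + nl + "b";
  return {
    dotMatches: /./.test(nl),          // the dot should not match a newline
    caretAfter: /^b/m.test(sample),    // ^ should match just after the newline
    dollarBefore: /a$/m.test(sample)   // $ should match just before the newline
  };
}

var newlineResults = [];
for (var i = 0; i < NEWLINES.length; i++) {
  newlineResults.push(describeNewline(NEWLINES[i]));
}
```

Each entry in `newlineResults` should report that the dot does not match while both anchors do.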

Note that ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, /\r^$\n/m.test("\r\n") returns true.
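
Because the regex engine sees CRLF as two separate newline characters, code that wants to treat it as one line break has to say so explicitly. The sketch below shows one common way to do that; `splitLines` is a hypothetical helper name, not something from the test page.

```javascript
// The regex engine finds a line start and a line end between "\r" and "\n".
var crlfPositionExists = /\r^$\n/m.test("\r\n");

// Split text into lines, treating CRLF as a single newline sequence.
// "\r\n" is listed first so that it is consumed as one break rather than two.
function splitLines(text) {
  return text.split(/\r\n|[\n\r\u2028\u2029]/);
}

var lines = splitLines("one\r\ntwo\nthree\u2028four");
```

Putting `\r\n` before the character class matters: with the alternatives reversed, a CRLF would produce an extra empty line.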

As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by \s according to ECMA-262 3rd Edition and Unicode 5.1:

  • \u0009 — Tab — \t
  • \u000a — Line feed — \n — (newline character)
  • \u000b — Vertical tab — \v
  • \u000c — Form feed — \f
  • \u000d — Carriage return — \r — (newline character)
  • \u0020 — Space
  • \u00a0 — No-break space
  • \u1680 — Ogham space mark
  • \u180e — Mongolian vowel separator
  • \u2000 — En quad
  • \u2001 — Em quad
  • \u2002 — En space
  • \u2003 — Em space
  • \u2004 — Three-per-em space
  • \u2005 — Four-per-em space
  • \u2006 — Six-per-em space
  • \u2007 — Figure space
  • \u2008 — Punctuation space
  • \u2009 — Thin space
  • \u200a — Hair space
  • \u2028 — Line separator — (newline character)
  • \u2029 — Paragraph separator — (newline character)
  • \u202f — Narrow no-break space
  • \u205f — Medium mathematical space
  • \u3000 — Ideographic space
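
Since browsers disagree on what \s matches, one workaround is to spell the list out as an explicit character class. This is a minimal sketch built from the characters listed above; the names `WS`, `TRIM_RE`, and `trimListedWhitespace` are made up for the example.

```javascript
// An explicit character class built from the whitespace list above, written
// as a string so it can be reused inside larger patterns.
var WS = "[\\t\\n\\v\\f\\r \\u00a0\\u1680\\u180e\\u2000-\\u200a" +
         "\\u2028\\u2029\\u202f\\u205f\\u3000]";

// Matches a run of listed whitespace at the start or at the end of a string.
var TRIM_RE = new RegExp("^" + WS + "+|" + WS + "+$", "g");

// Trim using the explicit list instead of the engine's own \s.
function trimListedWhitespace(str) {
  return str.replace(TRIM_RE, "");
}
```

Because the class is explicit, the result does not depend on how a given engine interprets \s.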

To test which characters or positions are matched by all of the tokens mentioned here in your browser, see JavaScript Regex and Unicode Tests. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.
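
For a rough idea of how such a check can work, the sketch below scans every 16-bit code unit and records the ones a single-character token matches. It is an illustrative helper (`charsMatchedBy` is a hypothetical name), not the code behind the test page, and it does not handle the zero-width tokens (^, $, \b, \B), which need surrounding context.

```javascript
// Return the character codes (0x0000-0xFFFF) matched by a one-character regex.
// The regex must not use the /g flag, since test() would then keep state
// between calls via lastIndex.
function charsMatchedBy(regex) {
  var matched = [];
  for (var code = 0; code <= 0xFFFF; code++) {
    if (regex.test(String.fromCharCode(code))) {
      matched.push(code);
    }
  }
  return matched;
}

var digitCodes = charsMatchedBy(/\d/);
var wordCodes = charsMatchedBy(/\w/);
```

On an engine that follows the specification, `digitCodes` should contain only the ten codes for 0-9 and `wordCodes` only the 63 codes for A-Z, a-z, 0-9, and _.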

Update: My new Unicode plugin for XRegExp allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.

There Are 20 Responses So Far. »

  1. [...] JavaScript, Regex, and Unicode (tags: javascript regex unicode) [...]

  2. Hi Steve, I really like the GUI of your test page. A couple of suggestions:

    An input box would be nice, so that you can enter arbitrary regexps. (Perhaps with more clickable examples like in my version – Feel free to reuse any of this code if it’s helpful).

    Using timeouts to update the large result sets. (and allow the user to cancel the test)

    Why do you avoid some of the special unicode blocks? I think it’s still interesting to see how they are treated by the regexp. Although not printing the chars is a good idea.

    I also saw that FF3 fixed the \d group. I wonder what other changes were made since FF2. It’s actually quite scary that different browsers handle regexps in different ways. It would be interesting to see a table of browsers and their results. An Acid test for regexps perhaps?

  3. Dude, don’t get me started on cross-browser regex differences. I could probably name 30 off the top of my head (one day I’ll write up a list). Look around on this blog and you’ll find discussion of a number of them.

    Showing progress and allowing the user to cancel long-running tests (as in your app) would indeed be a Good Thing. As for providing an input box, I think our apps serve a somewhat different purpose — yours being more based around discovery of the meaning of any one-character token, and mine around exposing differences in Unicode support (which e.g. requires special handling for the zero-width assertions).

    I skip some of the Unicode blocks which are reserved for private use in order to make the tests a little faster (not so much by skipping the tests as by making the browser generate and render fewer DOM nodes in the log). I think I can get away with it since my app only allows testing a limited set of tokens, and with tokens like the dot it’s the inverse of what they match that is relevant. I might take your advice and switch things up a bit though if I feel like spending more time on this later. Thanks for the suggestions!

  4. SUPER THANKS!!! For more than 2 days I was trying to finish an article to post on my blog, but I was having a LOT of problems with regular expressions in JavaScript, until I found your test application. The method was supposed to get the caller method name, but because of the \r\n the result was coming back "null".
    Thanks again AWESOME!!

    function GetMethodName(value)
    {
        if (value !== null)
        {
            // value = "Function trycatch (){ … }".
            // index : 1 => ([Ff]unction[ ]*);
            // index : 2 => (\\w*) - The method name;
            // index : 3 => ([ ]*\\(([\\w|,| ]*)\\));
            // Backslashes are doubled because the pattern is a string literal.
            // Note: inside [...] the | is a literal pipe, not alternation.
            var pattern = new RegExp("([Ff]unction[ ]*)(\\w*)([ ]*\\(([\\w|,| ]*)\\))", "m");
            // Try to find the pattern.
            var m = value.match(pattern);

            if ( (m !== null) && (m.length > 2) )
                return m[2];
        }
        return undefined;
    }

    function CallerMethod()
    {
    var MethodName = GetMethodName(arguments.callee.toString());
    }

  5. @TheCodeMaster, I’m glad I could help!

    BTW, here’s one way you could rewrite the above GetMethodName function:

    function GetMethodName (value) {
    	return (/function ([\w$]+)/.exec(value) || [])[1];
    }

    In addition to that being shorter I also fixed a couple bugs in the regex, but for the most part the results will be the same.

  6. Sorry Steve, but can you explain to me what is happening on this code:
    return (/function ([\w$]+)/.exec(value) || [])[1];

    I didn't get the ".exec(value)", "[]", and "[1]";

    let’s say the input would be :

    Edit (Steve): Removed a long code sample since it had no impact on the provided code’s output.

    thanks again!

  7. It was intentionally tricky for the sake of brevity. If you’re not familiar with the exec method, you can read about it at the Mozilla Developer Center. Essentially, it returns either null (if there is no match) or an array containing the matched text and any backreferences. That’s the same result as from String.prototype.match with a non-global regex. [] creates a new, empty array, so that we don’t try to access a key on the null value in case there is no match. Array key one (accessed using [1]) will either be the value matched by ([\w$]+) or undefined (in the case of reading from an empty array).
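
To make the explanation above concrete, here is the one-liner next to a longer equivalent. The function names `getMethodNameShort` and `getMethodNameVerbose` are hypothetical, chosen only to tell the two versions apart.

```javascript
// Compact form: fall back to an empty array when exec() returns null,
// so that reading index 1 yields undefined instead of throwing.
function getMethodNameShort(value) {
  return (/function ([\w$]+)/.exec(value) || [])[1];
}

// Equivalent step-by-step form.
function getMethodNameVerbose(value) {
  // exec() returns null on failure, or an array holding the whole match
  // at index 0 and the first capturing group at index 1.
  var match = /function ([\w$]+)/.exec(value);
  if (match === null) {
    return undefined;
  }
  return match[1];
}
```

Both versions return the function name for a named function's source text and undefined when there is no name to capture.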

  8. Look, this is great and it's just what I need for my stuff. Unfortunately I'm having problems with symbols. I work with XML, and people have been copying and pasting into the text areas from Microsoft Word, and these stupid special characters that Microsoft has make my programs fail. I need a string that contains a-z, A-Z, 0-9, and the typical special characters that you can type from the keyboard, and that's it, nothing else. If anyone can give me some help with this, that would be great.

  9. There are many different common keyboard layouts. Also, reserved XML characters are present on most keyboards and might need separate, special handling. I would recommend that you refer to regex character class documentation, the Windows charmap application, and perhaps a regex forum like regexadvice.com.

  10. Just for grins, what would the regex be to simply strip out all Unicode chars (i.e. all vals > 255?)

  11. @Jason, I think you’re confusing your terminology a little, but to remove all Unicode code points higher than 255 decimal (FF hex) in JavaScript you could use replace(/[\u0100-\uFFFF]+/g, ""). The range caps at FFFF hex because these regexes operate on 16-bit code units; characters outside Unicode’s Basic Multilingual Plane are stored as surrogate pairs, and both halves of a pair fall within this range. It might be more useful to kill everything outside the 128 US-ASCII characters. That would be replace(/[\u0080-\uFFFF]+/g, "").
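
The two replacements above can be wrapped as small helpers. This is only a sketch; the names `stripAbove255` and `stripNonAscii` are invented for the example.

```javascript
// Remove every code unit above 255 decimal (FF hex).
function stripAbove255(str) {
  return str.replace(/[\u0100-\uFFFF]+/g, "");
}

// Remove everything outside the 128 US-ASCII characters.
function stripNonAscii(str) {
  return str.replace(/[\u0080-\uFFFF]+/g, "");
}

// "\u00ef" (i with diaeresis) survives the first helper but not the second;
// the curly quotes "\u201c" and "\u201d" are removed by both.
var keptLatin1 = stripAbove255("na\u00efve \u201cquote\u201d");
var asciiOnly = stripNonAscii("na\u00efve \u201cquote\u201d");
```

Both helpers also remove characters stored as surrogate pairs, since each half of a pair lies between \uD800 and \uDFFF.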

  12. [...] http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode [...]

  15. There’s loads of handy JavaScript Unicode-related tools available at http://u-n-i.co/de/

    They might solve this problem

  16. Hi Steve,

    Here is my code to validate a file name against blank spaces in an attachment for a sharepoint list.

    —————————————-

    function PreSaveAction()
    {
        var attachment;
        var filename = "";
        var fileNameSpecialCharacters = new RegExp('\\s', 'g');
        try {
            attachment = document.getElementById("idAttachmentsTable").getElementsByTagName("span")[0].firstChild;
            filename = attachment.data;
        }
        catch (e) {
        }
        if (fileNameSpecialCharacters.test(filename)) {
            alert("Please remove the special characters and white spaces from file attachment name.");
            return false;
        }
        else {
            return true;
        }
    }
    ————————————-

    It's working fine with all browsers except IE. Can you suggest an alternate way to fix this?

    Is this an issue with the '\s' metacharacter?

    BR,
    Sijo

  17. I had an issue in BigMachines where the only way to solve a business need was to add some funky footer JS script… the issue there was that IE would not match SPACE characters when replacing string content. The source string included &nbsp; characters in the HTML source instead of spaces. Now in the regular expression, [\s] did not match &nbsp;. Adding the Unicode character to the regular expression addressed the IE limitation => [\s\u00A0].
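
The workaround described in the comment above might look like the following in practice. This is a sketch under the assumption that the goal is to collapse runs of whitespace, including no-break spaces, into single ordinary spaces; `SPACE_LIKE` and `normalizeSpaces` are hypothetical names.

```javascript
// \s plus an explicit no-break space, for engines whose \s does not
// include \u00A0 (as reported for IE in the comment above).
var SPACE_LIKE = /[\s\u00A0]+/g;

// Collapse each run of whitespace or no-break spaces into one plain space.
function normalizeSpaces(str) {
  return str.replace(SPACE_LIKE, " ");
}

var normalized = normalizeSpaces("a\u00A0\u00A0b  c");
```

Adding \u00A0 to the class is harmless on engines where \s already matches it.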

  18. [...] of JavaScript’s character classes are not Unicode-aware, per ECMA-262. Have a look at http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode for more [...]

  19. [...] ”JavaScript, Regex, and Unicode“ by Steven Levithan [...]
