JavaScript, Regex, and Unicode
Not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. In some cases it can be important to know exactly what certain tokens match, and that's what this post will explore.
According to ECMA-262 3rd Edition, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations often differ on these points. For example, Firefox 2 considers \d and \D to be Unicode-aware, while Firefox 3 fixes this bug, making \d equivalent to [0-9] as with most other browsers.
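The split described above can be checked with a few one-line tests. This is a minimal sketch, assuming an engine that follows the specification; the sample characters are illustrative choices rather than ones taken from the post.

```javascript
// ASCII-only tokens: \d, \w, and \b ignore non-ASCII digits and letters.
var arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE

var results = {
    // \d accepts "3" but not a non-ASCII digit.
    digitIsAscii: /^\d$/.test("3") && !/^\d$/.test(arabicIndicThree),
    // \w accepts "e" but not "é" (\u00e9).
    wordIsAscii: /^\w$/.test("e") && !/^\w$/.test("\u00e9"),
    // "ï" (\u00ef) is not a word character, so \b matches right after the "a".
    boundaryInsideWord: /a\b/.test("na\u00efve"),
    // Unicode-aware token: \s should match the no-break space.
    whitespaceIsUnicode: /^\s$/.test("\u00a0")
};

console.log(results);
```

A spec-compliant engine should report true for every property; a false value points at one of the browser differences mentioned above.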
Here again are the affected tokens, along with their definitions:
- \d: Digits.
- \s: Whitespace.
- \w: Word characters.
- \D: All except digits.
- \S: All except whitespace.
- \W: All except word characters.
- . (dot): All except newlines.
- ^ (with /m): The positions at the beginning of the string and just after newlines.
- $ (with /m): The positions at the end of the string and just before newlines.
- \b: Word boundary positions.
- \B: Not word boundary positions.
All of the above are standard in Perl-derivative regex flavors. However, the meaning of the terms digit, whitespace, word character, word boundary, and newline depends on the regex flavor, character set, and platform you're using, so here are the official JavaScript meanings as they apply to regexes:
- Digit — The characters 0-9 only.
- Whitespace — Tab, line feed, vertical tab, form feed, carriage return, space, no-break space, line separator, paragraph separator, and "any other Unicode 'space separator'".
- Word character — The characters A-Z, a-z, 0-9, and _ only.
- Word boundary — The position between a word character and non-word character.
- Newline — The line feed, carriage return, line separator, and paragraph separator characters.
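One way to confirm the digit and word character definitions in a given engine is to count how many 16-bit character codes each token matches. The helper name countMatches is hypothetical, and this is only a sketch of the idea rather than the test page's actual code.

```javascript
// Count how many of the 65,536 16-bit character codes a
// single-character token matches.
function countMatches(regex) {
    var count = 0;
    for (var code = 0; code <= 0xFFFF; code++) {
        if (regex.test(String.fromCharCode(code))) {
            count++;
        }
    }
    return count;
}

// 10 if \d is exactly [0-9].
var digitCount = countMatches(/^\d$/);
// 63 if \w is exactly [A-Za-z0-9_] (26 + 26 + 10 + 1).
var wordCount = countMatches(/^\w$/);

console.log("\\d matches " + digitCount + " characters");
console.log("\\w matches " + wordCount + " characters");
```

Larger counts would indicate an engine that treats these tokens as Unicode-aware, as Firefox 2 did for \d.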
Here again are the newline characters, with their character codes:
- \u000a: Line feed (\n)
- \u000d: Carriage return (\r)
- \u2028: Line separator
- \u2029: Paragraph separator
Note that ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard. Also note that although CRLF (a carriage return followed by a line feed) is treated as a single newline sequence in most contexts, /\r^$\n/m.test("\r\n") returns true.
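The newline rules can be exercised directly. The following is a minimal sketch, assuming an engine that recognizes exactly the four newline characters listed above; the variable names are illustrative.

```javascript
var lineBreaks = ["\n", "\r", "\u2028", "\u2029"];

// The dot should match none of the four newline characters.
var dotMatchesNone = lineBreaks.every(function (ch) {
    return !/./.test(ch);
});

// With /m, ^ and $ should match on either side of each of them,
// so the "b" between two breaks counts as a whole line.
var anchorsSeeAll = lineBreaks.every(function (ch) {
    return /^b$/m.test("a" + ch + "b" + ch + "c");
});

// CRLF behaves as two newlines: an empty line sits between \r and \n.
var crlfIsTwoNewlines = /\r^$\n/m.test("\r\n");

// NEL (\u0085) is not in the list above, so the dot should match it.
var dotMatchesNel = /./.test("\u0085");

console.log(dotMatchesNone, anchorsSeeAll, crlfIsTwoNewlines, dotMatchesNel);
```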
As for whitespace, ECMA-262 3rd Edition uses an interpretation based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters which should be matched by \s according to ECMA-262 3rd Edition and Unicode 5.1:
- \u0009: Tab (\t)
- \u000a: Line feed (\n), a newline character
- \u000b: Vertical tab (\v)
- \u000c: Form feed (\f)
- \u000d: Carriage return (\r), a newline character
- \u0020: Space
- \u00a0: No-break space
- \u1680: Ogham space mark
- \u180e: Mongolian vowel separator
- \u2000: En quad
- \u2001: Em quad
- \u2002: En space
- \u2003: Em space
- \u2004: Three-per-em space
- \u2005: Four-per-em space
- \u2006: Six-per-em space
- \u2007: Figure space
- \u2008: Punctuation space
- \u2009: Thin space
- \u200a: Hair space
- \u2028: Line separator, a newline character
- \u2029: Paragraph separator, a newline character
- \u202f: Narrow no-break space
- \u205f: Medium mathematical space
- \u3000: Ideographic space
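The list above can be turned into a quick check of an engine's \s. This is a rough sketch; the names listedWhitespace and unmatchedWhitespace are hypothetical. Engines that follow a Unicode version other than 5.1 may legitimately report differences, since a character's category can change between versions, so the output is informational rather than a pass/fail verdict.

```javascript
// The 25 code points listed above, as hex strings.
var listedWhitespace = [
    "0009", "000a", "000b", "000c", "000d", "0020", "00a0", "1680", "180e",
    "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008",
    "2009", "200a", "2028", "2029", "202f", "205f", "3000"
];

// Return the listed code points that \s fails to match in this engine.
function unmatchedWhitespace(codes) {
    return codes.filter(function (hex) {
        var ch = String.fromCharCode(parseInt(hex, 16));
        return !/^\s$/.test(ch);
    });
}

var misses = unmatchedWhitespace(listedWhitespace);
console.log(misses.length
    ? "Not matched by \\s: " + misses.join(", ")
    : "All listed characters are matched by \\s");
```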
To test which characters or positions are matched by all of the tokens mentioned here in your browser, see JavaScript Regex and Unicode Tests. Note that Firefox 2.0.0.11, IE 7, and Safari 3.0.3 beta all get some of the tests wrong.
Update: My new Unicode plugin for XRegExp allows you to easily match Unicode categories, scripts, and blocks in JavaScript regular expressions.
Comment by Will Moffat on 4 January 2008:
Hi Steve, I really like the GUI of your test page. A couple of suggestions:
- An input box would be nice, so that you can enter arbitrary regexps. (Perhaps with more clickable examples like in my version – Feel free to reuse any of this code if it’s helpful).
- Using timeouts to update the large result sets. (and allow the user to cancel the test)
- Why do you avoid some of the special unicode blocks? I think it’s still interesting to see how they are treated by the regexp. Although not printing the chars is a good idea.
I also saw that FF3 fixed the \d group. I wonder what other changes were made since FF2. It’s actually quite scary that different browsers handle regexps in different ways. It would be interesting to see a table of browsers and their results. An Acid test for regexps perhaps?
Comment by Steve on 4 January 2008:
Dude, don’t get me started on cross-browser regex differences. I could probably name 30 off the top of my head (one day I’ll write up a list). Look around on this blog and you’ll find discussion of a number of them.
Showing progress and allowing the user to cancel long-running tests (as in your app) would indeed be a Good Thing. As for providing an input box, I think our apps serve a somewhat different purpose — yours being more based around discovery of the meaning of any one-character token, and mine around exposing differences in Unicode support (which e.g. requires special handling for the zero-width assertions).
I skip some of the Unicode blocks which are reserved for private use in order to make the tests a little faster (not so much by skipping the tests as by making the browser generate and render fewer DOM nodes in the log). I think I can get away with it since my app only allows testing a limited set of tokens, and with tokens like the dot it’s the inverse of what they match that is relevant. I might take your advice and switch things up a bit though if I feel like spending more time on this later. Thanks for the suggestions!
Comment by TheCodeMaster on 10 January 2008:
SUPER THANKS!!!, for more than 2 days I was trying to finish an article to post on my blog, but I was having a LOT OF problems with regular expressions on JavaScript, until I found your test application, the method was supposed to get the caller method name, but because of the \r\n the result was coming “null”.
Thanks again AWESOME!!
function GetMethodName(value)
{
    if (value !== null)
    {
        // value = "Function trycatch (){ … }".
        // index : 1 => ([Ff]unction[ ]*);
        // index : 2 => (\\w*) – The method name;
        // index : 3 => ([ ]*\\(([\w|,| ]*)\\));
        var pattern = new RegExp("([Ff]unction[ ]*)(\\w*)([ ]*\\(([\w|,| ]*)\\))", "m");
        // Try to find the pattern.
        var m = value.match(pattern);
        if ( (m !== null) && (m.length > 2) )
            return m[2];
    }
    return undefined;
}
function CallerMethod()
{
    var MethodName = GetMethodName(arguments.callee.toString());
}
Comment by Steve on 11 January 2008:
@TheCodeMaster, I’m glad I could help!
BTW, here’s one way you could rewrite the above GetMethodName function:
In addition to that being shorter, I also fixed a couple of bugs in the regex, but for the most part the results will be the same.
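The code sample from this comment is not present in the text. Only one line of it is quoted in the follow-up question, so the following is a hypothetical reconstruction built around that line, not necessarily the function as originally posted.

```javascript
// Hypothetical reconstruction: only the return statement is quoted in the thread.
function GetMethodName(value) {
    // exec returns null when nothing matches; "|| []" swaps in an empty
    // array so that reading index 1 gives undefined instead of throwing.
    return (/function ([\w$]+)/.exec(value) || [])[1];
}

function trycatch() {}

var found = GetMethodName(trycatch.toString());
var missing = GetMethodName("not a match");

console.log(found);   // the captured name
console.log(missing); // undefined
```

Wrapping the exec call this way keeps the function to a single expression while still handling the no-match case.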
Comment by TheCodeMaster on 14 January 2008:
Sorry Steve, but can you explain to me what is happening on this code:
return (/function ([\w$]+)/.exec(value) || [])[1];
i didn’t get the “.exec(value)” , “[]” and “[1]”;
let’s say the input would be :
Edit (Steve): Removed a long code sample since it had no impact on the provided code’s output.
thanks again!
Comment by Steve on 14 January 2008:
It was intentionally tricky for the sake of brevity. If you’re not familiar with the exec method, you can read about it at the Mozilla Developer Center. Essentially, it returns either null (if there is no match) or an array containing the matched text and any backreferences. That’s the same result as from String.prototype.match with a non-global regex. [] creates a new, empty array, so that we don’t try to access a key on the null value in case there is no match. Array key one (accessed using [1]) will either be the value matched by ([\w$]+) or undefined (in the case of reading from an empty array).
Comment by tony G on 30 January 2008:
look this is great and its just what I need for my stuff.. unfortunately I’m having problems with symbols. I work with XML and people have been copying and pasting to the text areas from Microsoft word and these stupid special characters that Microsoft has make my programs fail I need a string that contains a-z A-Z 0-9 the typical special characters that u can type from the keyboard and thats it nothing else any one can give me some help with this would be great.
Comment by Steve on 30 January 2008:
There are many different common keyboard layouts. Also, reserved XML characters are present on most keyboards and might need separate, special handling. I would recommend that you refer to regex character class documentation, the Windows charmap application, and perhaps a regex forum like regexadvice.com.
Comment by Jason on 31 March 2008:
Just for grins, what would the regex be to simply strip out all Unicode chars (i.e. all vals > 255?)
Comment by Steve on 31 March 2008:
@Jason, I think you’re confusing your terminology a little, but to remove all Unicode code points higher than 255 decimal (FF hex) in JavaScript you could use replace(/[\u0100-\uFFFF]+/g, ""). The range caps at FFFF hex because JavaScript only supports Unicode’s Basic Multilingual Plane. It might be more useful to kill everything outside the 128 US-ASCII characters. That would be replace(/[\u0080-\uFFFF]+/g, "").
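The two replacements suggested above can be wrapped in small helpers. The function names are hypothetical and the sample string is illustrative. Characters outside the Basic Multilingual Plane are stored in JavaScript strings as pairs of surrogate code units, which fall inside both ranges, so they are removed as well.

```javascript
// Remove everything above code point 255 (keeps Latin-1 characters).
function stripAboveLatin1(text) {
    return text.replace(/[\u0100-\uFFFF]+/g, "");
}

// Remove everything outside the 128 US-ASCII characters.
function stripNonAscii(text) {
    return text.replace(/[\u0080-\uFFFF]+/g, "");
}

var sample = "caf\u00e9 \u2603"; // "café" followed by a space and a snowman

console.log(stripAboveLatin1(sample)); // keeps the é, drops the snowman
console.log(stripNonAscii(sample));    // drops both
```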
Comment by David Higgins on 19 September 2011:
There’s loads of handy JavaScript Unicode-related tools available at http://u-n-i.co/de/
They might solve this problem
Comment by Sijo on 12 December 2011:
Hi Steve,
Here is my code to validate a file name against blank spaces in an attachment for a sharepoint list.
----------------------------------------
function PreSaveAction()
{
    var attachment;
    var filename = "";
    var fileNameSpecialCharacters = new RegExp('\\s', 'g');
    try {
        attachment = document.getElementById("idAttachmentsTable").getElementsByTagName("span")[0].firstChild;
        filename = attachment.data;
    }
    catch (e) {
    }
    if (fileNameSpecialCharacters.test(filename)) {
        alert("Please remove the special characters and white spaces from file attachment name.");
        return false;
    }
    else {
        return true;
    }
}
----------------------------------------
It’s working fine with all browsers except IE. Can you suggest an alternate way to fix this?
Is this an issue with the ‘\s’ metacharacter?
BR,
Sijo
Comment by Stefan on 25 January 2012:
I had an issue in BigMachines where the only way to solve a business need was to add some funky footer JS script… the issue there was that IE would not match SPACE characters when replacing string content. The source string included &nbsp; characters in the HTML source instead of spaces. Now in the regular expression, [\s] did not match &nbsp;. Adding the Unicode character to the regular expression addressed the IE limitation => [\s\u00A0].
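The workaround described in this comment can be sketched as follows. The helper name normalizeSpaces is hypothetical; the point is only that listing \u00A0 explicitly in the character class covers engines whose \s does not match the no-break space, and is harmless in engines where it does.

```javascript
// Whitespace runs, with the no-break space named explicitly for
// engines whose \s does not include it.
var SPACE_RUN = /[\s\u00A0]+/g;

// Collapse each run of whitespace (including no-break spaces) to one space.
function normalizeSpaces(text) {
    return text.replace(SPACE_RUN, " ");
}

var normalized = normalizeSpaces("a\u00A0\u00A0b \u00A0c");
console.log(normalized);
```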
Comment by Sagar P on 10 February 2016:
I need a regular expression to accept letters, numbers, and a few special symbols, i.e. - . ( ) and space, from an English keyboard only.
As soon as the user tries to type from a non-English keyboard, it should not accept the entered character.
I have used var reg = /^[\u0080-\uFFFF]$/; it accepts all characters, numbers, and digits for an English keyboard.
Thanks in advance