parseUri: Split URLs in JavaScript

Update: The following post is outdated. See parseUri 1.2 for the latest, greatest version.

For fun, I spent the 10 minutes needed to convert my parseUri() ColdFusion UDF into a JavaScript function.

For those who haven't already seen it, I'll repeat my explanation from the other post…

parseUri() splits any well-formed URI into its parts (all are optional). Note that all parts are split with a single regex using backreferences, and all groupings which don't contain complete URI parts are non-capturing. My favorite bit of this function is its robust support for splitting the directory path and filename (it supports directories with periods, and without a trailing slash), which I haven't seen matched in other URI parsers. Since the function returns an object, you can do, e.g., parseUri(uri).anchor, etc.

I should note that, by design, this function does not attempt to validate the URI it receives, as that would limit its flexibility. IMO, validation is an entirely unrelated process that should come before or after splitting a URI into its parts.

This function has no dependencies, and should work cross-browser. It has been tested in IE 5.5–7, Firefox 2, and Opera 9.

/* parseUri JS v0.1.1, by Steven Levithan <http://stevenlevithan.com>
Splits any well-formed URI into the following parts (all are optional):
----------------------
- source (since the exec method returns the entire match as key 0, we might as well use it)
- protocol (i.e., scheme)
- authority (includes both the domain and port)
  - domain (i.e., host; can be an IP address)
  - port
- path (includes both the directory path and filename)
  - directoryPath (supports directories with periods, and without a trailing slash)
  - fileName
- query (does not include the leading question mark)
- anchor (i.e., fragment) */
function parseUri(sourceUri){
	var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"],
		uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri),
		uri = {};
	
	for(var i = 0; i < 10; i++){
		uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
	}
	
	/* Always end directoryPath with a trailing slash if a path was present in the source URI
	Note that a trailing slash is NOT automatically inserted within or appended to the "path" key */
	if(uri.directoryPath.length > 0){
		uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
	}
	
	return uri;
}
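As a rough usage sketch, here is what the returned object looks like for a few inputs. The sample URIs are made up for illustration, and the function is repeated in compact form only so the snippet runs on its own:

```javascript
// Compact copy of parseUri v0.1.1 from the listing above (same regex, same logic)
function parseUri(sourceUri){
	var uriPartNames = ["source","protocol","authority","domain","port","path","directoryPath","fileName","query","anchor"],
		uriParts = new RegExp("^(?:([^:/?#.]+):)?(?://)?(([^:/?#]*)(?::(\\d*))?)((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[\\?#]|$)))*/?)?([^?#/]*))?(?:\\?([^#]*))?(?:#(.*))?").exec(sourceUri),
		uri = {};

	for(var i = 0; i < 10; i++){
		uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : "");
	}

	// Normalize directoryPath so it always ends with "/" when a path was present
	if(uri.directoryPath.length > 0){
		uri.directoryPath = uri.directoryPath.replace(/\/?$/, "/");
	}

	return uri;
}

// A full URI: every part is populated
var full = parseUri("http://www.example.com:8080/dir.1/sub/file.html?q=1#top");
// full.protocol      → "http"
// full.authority     → "www.example.com:8080"
// full.domain        → "www.example.com"
// full.port          → "8080"
// full.path          → "/dir.1/sub/file.html"
// full.directoryPath → "/dir.1/sub/"
// full.fileName      → "file.html"
// full.query         → "q=1"
// full.anchor        → "top"

// A directory without a trailing slash: "sub" is treated as a directory, not a file
var dir = parseUri("/dir/sub?q");
// dir.path          → "/dir/sub"  (left as entered)
// dir.directoryPath → "/dir/sub/" (slash appended)
// dir.fileName      → ""

// No protocol and no leading "//": the first segment is still read as the domain
var bare = parseUri("www.foo.com:80/dir/");
// bare.domain → "www.foo.com", bare.port → "80", bare.path → "/dir/"
```

Parts that are absent from the input come back as empty strings rather than undefined, so callers can concatenate them without extra checks.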

Test it.

Is there any leaner, meaner URI parser out there? 🙂


Edit: This function doesn't currently support URIs which include a username or username/password pair (e.g., "http://user:password@domain.com/"). I didn't care about this when I originally wrote the ColdFusion UDF this is based on, since I never use such URIs. However, since I've released this I kind of feel like the support should be there. Supporting such URIs and appropriately splitting the parts would be easy. What would take longer is setting up an appropriate, large list of all kinds of URIs (both well-formed and not) to retest the function against. However, if people leave comments asking for the support, I'll go ahead and add it.

29 thoughts on “parseUri: Split URLs in JavaScript”

  1. Damn, that's a serious regex! Thanks for posting this, it will be very handy. I have been using the following until now. Maybe you can point out if there is something wrong with it:

    function parseUrl(data) {
        var e = /((http|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+\.[^#?\s]+)(#[\w\-]+)?/;

        if (data.match(e)) {
            return {
                url: RegExp['$&'],
                protocol: RegExp.$1,
                host: RegExp.$2,
                path: RegExp.$3,
                file: RegExp.$5,
                hash: RegExp.$6
            };
        }
        else {
            return {url: "", protocol: "", host: "", path: "", file: "", hash: ""};
        }
    }

  2. Boyan, thanks! As for the code you posted, well, beyond being far less powerful/flexible, the first thing that jumps out at me when looking over the regex is that it wouldn’t even match or split the URI “http://www.google.com/”. In other words, it’s deeply flawed.

  3. @thunder down under:

    Poly9’s URL parser is weak. Ajaxian posted the reasons for this I sent them, though in my defense I hadn’t meant for them to actually publish the list. Rather, it was part of my pitch towards why they might want to feature another URI parser even though they’d done so recently.

    IMO, rewriting Poly9’s parser to depend on a massive library like Prototype is extra weak.

    (For those who didn’t find this via Ajaxian, here’s the link: parseUri: Another JavaScript URL parser.)

  4. Nice work, Dan G. Switzer, II.

    BTW, one of the fundamental differences between our two UDFs (which adds some complexity to mine) is that with, e.g., the URIs “/dir/sub” and “/dir/sub?q”, your UDF will treat “sub” as the file name, while mine will treat it as part of the directory path. Since many people enter directory paths without a trailing slash (and such URIs work with every HTTP server I’m familiar with), I’ve found this adjustment to be a necessity.

    Also, one issue I noticed during a very brief test is that, e.g., with the URI “www.foo.com:80/dir/”, your UDF treats the “80” as part of the directory path, returns no authority, and returns “www.foo.com” as the scheme. Although this may be technically correct according to generic URI syntax (I understand why the scheme comes out the way it does, but I’m not so sure about “80” as part of the directory path), it prevents the common scenario of users entering URIs which start with a domain name, without the leading “//” to identify it as the authority. Other examples of differences are that your UDF will treat “www.foo.com” as a file name, and “www.foo.com/dir/” as one component comprised solely of a directory path. On the other hand, in all of the above cases parseUri() will identify “www.foo.com” as the domain, and “/dir/” as the path. I’m not noting this to claim superiority, but rather to point out additional areas where I’ve found that slightly diverging from the official generic URI syntax spec allows the function to become much more “real-world ready,” and able to actually be tested against end user input.

    Finally, I know code brevity was probably not your goal, but page weight becomes especially important with a JavaScript implementation. The over 90 lines of code (after stripping all comments and empty lines) in the post you linked to seems on the heavy side.

    Nevertheless, it’s a solid, fully-featured implementation, and gives me more incentive to add support for the missing pieces from my function (username/password/segment [these shouldn’t add any lines of code], and param splitting).

  5. hi,

    i’m not familiar with regular expressions, so i tried to extract the user infos as an exercise…
    so i added “userInfo”, “userName”, “password” in between “authority” and “domain” in uriPartNames, and added this part to your regexp :
    “(” + “(?:(([^:]+)?(?::)?([^:]+)?)?@)?” + “([^:/?#]*)(?::(\\d*))?)?”

    well, it seems to work with :
    http://userName:password@www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top
    http://userName:@www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top
    http://userName@www.domain.com:81/dir1/dir.2/index.html?id=1&test=2#top

    please tell me if i’m wrong and/or if there is a better way to do it !

    thank you.

  6. Seb, that seems pretty reasonable, and after a minute testing it with several URIs it seems to hold up well (aside from when you start a URI with a username/password pair, but I’m not sure if I’d do anything to change the behavior).

    BTW, here are a couple ways your addition to the regex can be tweaked, after a quick lookover.

    – Change “(?::)?” to simply “:?” (the grouping is not necessary to make it optional).
    – Replace both instances of “[^:]+” with “[^:@]+” (this will improve efficiency when tested against certain types of values, by reducing the amount of backtracking required).

    Whenever I find some time to do more extensive re-testing, I’ll go ahead and add support for these and other, additional URI parts.

  7. BTW, I’ve updated my local copy of the regex to include support for usernames and passwords, while also appropriately splitting URIs which start with a username/password pair (i.e., they’re not preceded by a protocol and/or “//”). I’ll include this in v0.2 of this function, along with a few other minor changes/tweaks. Hopefully I’ll release this within a few days (after more testing).

    Also, Dan G. Switzer, I’ve decided against supporting filename param segments (e.g., “file.gif;p=5”), since as far as I understand they’re deprecated by RFC 3986, and in any case they can easily be tested for after the fact since they’re picked up as part of the file name. I’ve also decided against returning an array of objects containing the names and values of each discrete query parameter, since this is easy to implement in a separate function when needed (queries have only two, easily distinguishable delimiters: “&” and “=”), and it would add to the function’s length. I also don’t want to get carried away with the idea (e.g., returning arrays containing each subdomain, directory, etc.).

  8. I’m not sure this should be within the scope of this function, but I find it useful to be able to actually access the query string variables. As such, I added some code to the function to create an object (called queryVars) that serves as a hash of URL variables. That way you can do parseUri(window.location).queryVars.MyURLVar to access the value of a URL variable. Please note I just did this in 5 minutes and I’m sure it’s not foolproof, but it’s an idea… The code is as follows:

    for(var i = 0; i < 10; i++) { uri[uriPartNames[i]] = (uriParts[i] ? uriParts[i] : ""); if ( uriParts[i] && uriPartNames[i] == 'query' ) { uri['queryVars'] = {}; var qString = uriParts[i]; qString = qString.split('&'); for (var j=0; j

  9. @derek:

    Thanks. License is MIT-style.

    @Thomas Messier and Paul Irish:

    Since it’s clear that query-splitting is helpful for some users, I’ve gone ahead and added an implementation of this functionality (to the forthcoming version of parseUri) which uses 4 lines of code and additionally supports query keys which aren’t followed by “=” as well as query values which contain “=”. This, along with support for userInfo and extensive new demos, is all ready to go, but I’m hoping to release this on my own domain, and I’m currently having some trouble with my new host. I’ll include an update here as soon as this is resolved (hopefully within a couple days).
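The query-splitting behavior described in this comment (keys with no “=”, and values that themselves contain “=”) can be sketched as a separate helper. This is only an illustration under my own assumptions, not the implementation from the forthcoming release: the name splitQuery is hypothetical, no percent-decoding is done, and a repeated key simply overwrites the earlier value.

```javascript
// Hypothetical helper (not part of parseUri): turn "a=1&b&c=x=y" into a key/value map.
function splitQuery(query){
	var vars = {},
		pairs = query ? query.split("&") : [];

	for(var i = 0; i < pairs.length; i++){
		if(!pairs[i]) continue; // skip empty segments such as the middle of "a=1&&b=2"

		// Split on the FIRST "=" only, so values may contain "=" themselves
		var eq = pairs[i].indexOf("="),
			key = (eq === -1 ? pairs[i] : pairs[i].slice(0, eq)),
			value = (eq === -1 ? "" : pairs[i].slice(eq + 1));

		vars[key] = value; // a later duplicate key overwrites an earlier one
	}

	return vars;
}

var vars = splitQuery("id=1&flag&expr=a=b");
// vars.id   → "1"
// vars.flag → ""    (key without "=")
// vars.expr → "a=b" (value containing "=")
```

Because the query part returned by parseUri() never includes the leading question mark, its value can be passed to a helper like this directly.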

  10. Yeah, bad ass piece of code and some really masterful regexery! Saved me a good hour. Keep up the good work

  11. Well, I’m still having problems with setting up my blog the way I want it with my new host (e.g., they’re still trying to resolve issues with URL rewriting, etc.), but since I don’t know when everything will be resolved, here’s a link to the demo page for the latest version of parseUri:

    http://stevenlevithan.com/demo/parseuri/js/

  12. Steve,

    I mentioned in your blog that I’m integrating your URI parser in my module loader / js library project. Thanks for posting the page and for adding user:pass. I’ve been writing up test cases and ran into one that fails:

    path/to/file

    The parser catches the path as the domain, which I believe should only be caught with prefix double slashes.

    //path/to/file

    If you would like, I can notify you when my test bank is online.
    Also, this would be a good time to bring up licensing issues. I’m not so much trying to get something for free as to provide one, preferably as open and simple as possible (BSD, possibly Apache if I integrate Google’s excanvas). If you’re planning to GPL this, please let me know so I can back out.

    Kris.
    http://cixar.com

  13. Steve,

    Hi again. I’ve fixed the problem with ‘path/to/file’ parsing. You’ll find my rendition of your code here:

    https://cixar.com/tracs/javascript/browser/modules.js#L296

    I also deconstructed the regex to aid maintainability, but I don’t imagine you’re interested in that. Something of the regex’s “majesty” is lost in the process. I also made some changes that no doubt make the implementation incompatible with your requirements, but are verily required for the module loader, particularly that I need to treat directories as files unless there’s a terminal slash. This is necessary for my relative URL resolution algorithm to work consistently, and to preserve this axiom:

    format(parse(url)) == format(parse(format(parse(url))))

    I would have liked to verify that the algorithm worked with the “axiom”:

    url == format(parse(url))

    However, some legal URLs don’t pass this axiom without being first transformed into a normal form. The former axiom, I believe, is sufficient to demonstrate that no data is lost or mangled.
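To make the difference between the two axioms concrete, here is a small sketch. The parse and format functions are hypothetical stand-ins, not the module loader's code: parse uses the generic pattern from RFC 3986, Appendix B, and (like parseUri) stores missing parts as empty strings, which is exactly what makes it lossy for some legal URLs.

```javascript
// Stand-in parse/format pair (hypothetical; written only to illustrate the axioms).
// The pattern is the generic URI regex given in RFC 3986, Appendix B.
var rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

function parse(url){
	var m = rfc3986.exec(url);
	return {
		protocol:  m[2] || "",
		authority: m[4] || "",
		path:      m[5] || "",
		query:     m[7] || "", // "absent" and "present but empty" both become ""
		anchor:    m[9] || ""
	};
}

function format(u){
	return (u.protocol ? u.protocol + ":" : "") +
		(u.authority ? "//" + u.authority : "") +
		u.path +
		(u.query ? "?" + u.query : "") +
		(u.anchor ? "#" + u.anchor : "");
}

// Weaker axiom: formatting is stable after one round trip
function isStable(url){
	var once = format(parse(url));
	return once === format(parse(once));
}

var legal = "http://example.com/?#";     // empty query and empty anchor are legal
var roundTripped = format(parse(legal)); // → "http://example.com/"
// roundTripped === legal → false (the stronger axiom fails)
// isStable(legal)        → true  (the weaker axiom holds)
```

The URL with an empty query and anchor is changed by the first round trip but not by the second, which is the normal-form behavior the comment describes.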

    I’ve lifted your tests and added a few. You can find the test scaffold script code, data, and publication respectively.

    https://cixar.com/tracs/javascript/browser/test/http/url.html
    https://cixar.com/tracs/javascript/browser/test/http/urls.txt
    http://cixar.com/~kris/javascript/test/http/url.html (Beware that this will pwn your browser. FF/Safari/Opera so far.)

    Thanks again for posting. This greatly accelerates the development of script relative javascript module importing.

    Kris.

  14. Hi Kris,

    You said…

    > I’ve been writing up test cases and ran into one that fails:
    >
    > path/to/file
    >
    > The parser catches the path as the domain, which I believe should only
    > be caught with prefix double slashes.
    >
    > //path/to/file

    According to RFC 3986, you are correct. However, note that I am very much aware of this case and explained the reason for my approach towards handling such URIs at http://stevenlevithan.com/demo/parseuri/js/ :

    “Finally, note that this function assumes that any URI which does not start with “protocol:”, “//” (authority), “userInfo@”, “/” (path), “?” (query), or “#” (anchor) begins with a host. This is atypical compared to most URI parsers, which usually treat such strings as beginning with a path. This alternative approach, however, allows the function to parse URIs such as “www.google.com” or “www.google.com/dir/” in a way that users would probably expect. I feel that slightly diverging from the official generic URI syntax spec (RFC 3986) like this allows the function to become much more “real-world ready,” and able to actually be tested against end user input.”

    Note that requiring the authority to be preceded by “//” would also change handling for URIs starting with userInfo (e.g., email addresses would no longer be handled correctly). However, I understand that for some it may be very important to handle relative paths not starting from root, and as a result I will include some kind of “strict mode” (I’m not sure what name to use) in the next version of this function. This mode will probably be implemented using an optional function argument, and will utilize a secondary regex. There are two changes I plan to implement for “strict mode.” First, the authority must be preceded by “//” to be identified as such, and second, the directory path must be followed by a terminating slash. I expect that because of these two handling changes the resulting regex will be significantly shorter and less complex. I know you already implemented these changes in your code (I have not analyzed or tested it), but I plan to re-approach the entire regex with these changes in mind, and if nothing else it will show possible differences in ways to construct such a pattern.

    One question… do you think it might be better to implement this alternative mode as a second function, rather than triggering it using a second function argument? For one thing, the code to fix directory paths not ending in “/” won’t be necessary in strict mode, and secondly, I imagine that most people will consistently want one handling or the other.

    BTW, the code is released under an MIT License. According to my understanding, MIT licensing is even more permissive than BSD, but let me know if there are any problems with this.

    Finally, there are a few additional URIs you may wish to add to your test cases at the end of the following page:

    http://stevenlevithan.com/demo/parseuri/cf/dynList.cfm

    That page is dynamically rendered… it runs the ColdFusion versions of parseUri and formatParsedUri on the back-end for each of the URIs listed. The main thing to note is the additional URIs at the bottom of the page, which were created specifically to try to find issues with the handling of userInfo and the directory/file splitting. You can see that with the directory/file splitting there are a couple corner cases which cause minor issues, but all of that should go away by requiring a terminating slash.

    Thanks,
    Steve

  15. Steve,

    Thanks. I do think that separate functions for end-user and standards-compliant URI parsing would be a better approach than passing in a regex argument. I presume that your original approach was intended for form validation or URI extrapolation for converting URIs to links in a body of plain text. If that’s the case, you might consider adding /.\s/ to your list of terminals since it’s often the case that URIs appear at the end of sentences. If you’re looking to transform URLs in plain tainted text, you might consider writing an algorithm to normalize the results back to a standard URL so the user can put it in an href, for example, s{kris.kowal@gmail.com}{mailto:kris.kowal@gmail.com}, or s{google.com}{//google.com/}.

    Thanks also for explicating your concerns with the “standard” approach for parsing “//”. I’ll have to ponder an acceptable compromise if I advertise it in the documentation as capable of handling mailtos. At the moment, I’m only using it as the basis of a URL resolver, so that I can fully qualify a URL relative, absolute, domain absolute, or protocol absolute in comparison to another URL. So, in my case absolute strictness is necessary. This was relatively easy to implement starting on the foundation of yours. You are of course welcome to take the modified code back if it’s of use to you.

    Meanwhile my module loader + URL resolution appears to work splendidly in all browsers. Of course, nothing else I’ve written works in IE yet, but such a day will come :-).

    Thanks for releasing MIT; that greatly simplifies distribution. I plan to leave your name and link in a references section.

    Kris.

  16. Kris,

    Thanks. My function was originally designed for two purposes: 1, extracting URL parts from user input, for which a “loose” approach is often imperative (unless I want to provide instructions which people might not follow anyway), and 2, extracting portions of URLs on the backend whenever I’m working with URLs. For the second task, I have no problem working with either a strict or loose approach, but I know that for some people and tasks a strict approach is preferable. Here is how I would split a URL in strict mode (with the backslashes doubled, as is necessary when using a string literal with the JavaScript RegExp constructor):

    ^(?:([^:/?#]+):)?(?://((?:(([^:@]*):?([^:@]*))?@)?([^:/?#]*)(?::(\\d*))?))?((((?:[^?#/]*/)*)([^?#]*))(?:\\?([^#]*))?(?:#(.*))?)

    That captures the same 14 parts as the current version of my function:
    url/source (backreference 0), protocol, authority, userInfo, user, password, host, port, relative, path, directory, file, query, anchor.

    Compare that regex to the more complex, loose counterpart I currently have online:

    ^(?:(?![^:@]+:[^:@/]*@)([^:/?#.]+):)?(?://)?((?:(([^:@]*):?([^:@]*))?@)?([^:/?#]*)(?::(\\d*))?)(((/(?:[^?#](?![^?#/]*\\.[^?#/.]+(?:[?#]|$)))*/?)?([^?#/]*))(?:\\?([^#]*))?(?:#(.*))?)

    There are several major and subtle differences in the strict regex. E.g., one change that might not be immediately obvious is that in strict mode I can allow the dot (.) character in the protocol (scheme). You may wish to replace the strict regex you’re currently using with the one above, since it is shorter, faster (due to not using the negative lookaheads), and I believe it might more closely follow the official spec (again, I have not yet looked closely at your modifications). However, I’ve spent less than 10 minutes so far writing and testing the strict regex, so if you find any problems with it, let me know.
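For anyone who wants to try the strict pattern quoted above, here is a minimal wrapper around it. The function and variable names are hypothetical; the regex string is copied from the comment, and the part names follow the order of the 14 parts listed there.

```javascript
// Names follow the order of the 14 parts listed in the comment above
var strictPartNames = ["source","protocol","authority","userInfo","user","password","host","port","relative","path","directory","file","query","anchor"];

// Strict pattern copied verbatim from the comment (backslashes doubled for the string literal)
var strictPattern = new RegExp("^(?:([^:/?#]+):)?(?://((?:(([^:@]*):?([^:@]*))?@)?([^:/?#]*)(?::(\\d*))?))?((((?:[^?#/]*/)*)([^?#]*))(?:\\?([^#]*))?(?:#(.*))?)");

// Hypothetical wrapper: maps each backreference to its name, "" when absent
function parseUriStrict(sourceUri){
	var parts = strictPattern.exec(sourceUri),
		uri = {};

	for(var i = 0; i < strictPartNames.length; i++){
		uri[strictPartNames[i]] = parts[i] || "";
	}

	return uri;
}

var a = parseUriStrict("http://user:pass@www.example.com:81/dir/sub/file.html?q=1#top");
// a.protocol → "http"        a.userInfo  → "user:pass"
// a.host     → "www.example.com"   a.port → "81"
// a.directory → "/dir/sub/"  a.file → "file.html"
// a.query    → "q=1"         a.anchor → "top"

// Without a leading "//", nothing is treated as the authority
var b = parseUriStrict("path/to/file");
// b.host → ""   b.directory → "path/to/"   b.file → "file"
```

The second call shows the strict handling discussed earlier in the thread: “path/to/file” is read as a relative path rather than as a host followed by a path.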

    Thanks,
    Steve

  17. @Scott:

    No, it makes no such assumption. It simply splits the URI in the most logical way according to its rules. See my note on how this function intentionally does not attempt to validate the URIs it receives.

  18. Thanks Steve for a very useful function. There is one thing I’d like to do with this function, and I’m not sure how to do it – I need to split the hostname further, and only retain the TLD portion. So I would match google.com in mail.google.com and google.com and http://www.google.com. My pseudo regex for this would be [optional some characters including dots][some characters without dots].com. The first portion I wouldn’t need access to. The second portion I would. I’m not quite sure how to express this in real regex, in particular it’s not clear how to indicate that a piece of a match should be “named”. Is it parens? Anyway, any tips you could give on this would be great. FYI I need to write this function so I can set cookies via Javascript that can be set in one subdomain and read in another. According to the rfc for cookies you should be able to set the Domain attribute to the TLD portion, prepending a dot, and that cookie will be sent by the browser to subdomains.

  19. @josh

    JavaScript doesn’t support named capturing groups. I’m assigning names to each part by mapping names from the uriPartNames array to the array of backreferences returned by the RegExp.exec() method. Parentheses are used to capture the backreferences, but not all of the parentheses are part of capturing groups.

    As for your task, there are some cases you might not be thinking about. E.g., how would “www.google.co.uk”, “64.233.287.99”, or something like “localhost” be handled? By the way, the top-level domain (TLD) from your example would be “com”, not “google.com”.
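To show why those cases matter, here is a deliberately naive sketch of the “keep the last two labels” idea. The helper name is hypothetical, and the approach is knowingly wrong for multi-part suffixes such as “co.uk”; handling those properly needs a maintained suffix list (for example, the Public Suffix List) rather than a regex.

```javascript
// Hypothetical helper: naive guess at a shared cookie domain for a host name.
function naiveCookieDomain(host){
	// IPv4-looking hosts and single-label hosts (e.g. "localhost") are returned unchanged,
	// since there is no parent domain to share a cookie with
	if(/^\d{1,3}(?:\.\d{1,3}){3}$/.test(host) || host.indexOf(".") === -1){
		return host;
	}

	// Keep only the last two labels and prepend a dot
	var labels = host.split(".");
	return "." + labels.slice(-2).join(".");
}

// naiveCookieDomain("mail.google.com")  → ".google.com"
// naiveCookieDomain("google.com")       → ".google.com"
// naiveCookieDomain("localhost")        → "localhost"
// naiveCookieDomain("www.google.co.uk") → ".co.uk"  (wrong: this is the failure case)
```

The host passed in would typically be the domain part returned by parseUri(), so the URI splitting and the domain trimming stay separate steps.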
