Uniform Resource Locator (URL)
The most recognizable hallmark of the Web is a simple text string known as the Uniform Resource Locator (URL). Each well-formed, fully qualified URL is meant to conclusively address and uniquely identify a single resource on a remote server (and in doing so, implement a couple of related, auxiliary functions). The URL syntax is the cornerstone of the address bar, the most important user interface (UI) security indicator in every browser.
In addition to true URLs used for content retrieval, several classes of pseudo-URLs use a similar syntax to provide convenient access to browser-level features, including the integrated scripting engine, several special document-rendering modes, and so on. Perhaps unsurprisingly, these pseudo-URL actions can have a significant impact on the security of any site that decides to link to them.
The ability to figure out how a particular URL will be interpreted by the browser, and the side effects it will have, is one of the most basic and common security tasks attempted by humans and web applications alike, but it can be a problematic one. The generic URL syntax, the work of Tim Berners-Lee, is codified primarily in RFC 3986; its practical uses on the Web are outlined in RFCs 1738 and 2616, and a couple of other, less-significant standards. These documents are remarkably detailed, resulting in a fairly complex parsing model, but they are not precise enough to lead to harmonious, compatible implementations in all client software. In addition, individual software vendors have chosen to deviate from the specifications for their own reasons.
Uniform Resource Locator Structure
A fully qualified absolute URL is one that specifies all the information required to access a particular resource, without depending in any way on where the navigation began. Its overall layout is:

scheme://login:password@address:port/path/to/resource?query_string#fragment_id

In contrast, a relative URL, such as ../file.php?text=hello+world, omits some of this information and must be interpreted in the context of a base URL associated with the current browsing context.
The segments of the absolute URL seem intuitive, but each comes with a set of gotchas, so let’s review them now.
Scheme Name
The scheme name is a case-insensitive string that ends with a single colon, indicating the protocol to be used to retrieve the resource. The official registry of valid URL schemes is maintained by the Internet Assigned Numbers Authority (IANA), a body more widely known for its management of the IP address space. IANA's current list of valid scheme names includes several dozen entries such as http:, https:, and ftp:; in practice, a much broader set of schemes is informally recognized by common browsers and third-party applications, some of which have special security consequences. (Of particular interest are several types of pseudo-URLs, such as data: or javascript:.)
Before they can do any further parsing, browsers and web applications need to distinguish fully qualified absolute URLs from relative ones. The presence of a valid scheme in front of the address is meant to be the key difference, as defined in RFC 1738: In a compliant absolute URL, only alphanumerics and the characters "+", "-", and "." may appear before the required ":". In practice, however, browsers deviate from this guidance a bit. All of them ignore leading newlines and whitespace. Internet Explorer ignores the entire nonprintable character range of ASCII codes 0x01 to 0x1F. Chrome additionally skips 0x00, the NUL character. Most implementations also ignore newlines and tabs in the middle of scheme names, and Opera accepts high-bit characters in the string.
Because of these incompatibilities, applications that depend on the ability to differentiate between relative and absolute URLs must conservatively reject any anomalous syntax—but as we will soon find out, even this is not enough.
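Translated into code, such a conservative check might look like the following Python sketch. The function name and the exact policy are illustrative only; the point is to reject outright anything that the various browser parsers might disagree on:

import re

# RFC-style scheme: a letter, then letters, digits, "+", "-", or ".".
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")

def classify_url(url):
    # Browsers disagree on how to treat embedded control characters,
    # so the only safe response is to reject such URLs outright.
    if re.search(r"[\x00-\x1F]", url):
        return "reject"
    return "absolute" if SCHEME_RE.match(url) else "relative"

print(classify_url("https://example.com/"))    # absolute
print(classify_url("java\tscript:alert(1)"))   # reject (embedded tab)
print(classify_url("../file.php?text=hello"))  # relative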
Indicator of a Hierarchical URL
In order to comply with the generic syntax rules laid out in RFC 1738, every absolute, hierarchical URL is required to contain the fixed string “//” right before the authority section. If the string is missing, the format and function of the remainder of the URL is undefined for the purpose of that specification and must be treated as an opaque, scheme-specific value.
An example of a nonhierarchical URL is the mailto: protocol, used to specify email addresses and possibly a subject line (mailto:user@example.com?subject=Hello+world). Such URLs are passed down to the default mail client without making any further attempt to parse them.
The concept of a generic, hierarchical URL syntax is, in theory, an elegant one. It ought to enable applications to extract some information about the address without knowing how a particular scheme works. For example, without a preconceived notion of the wacky-widget: protocol, and by applying the concept of generic URL syntax alone, the browser could decide that http://example.com/test1/ and wacky-widget://example.com/test2/ reference the same, trusted remote host.
Regrettably, the specification has an interesting flaw: The aforementioned RFC says nothing about what the implementer should do when encountering URLs where the scheme is known to be nonhierarchical but where a "//" prefix still appears, or vice versa. In fact, a reference parser implementation provided in RFC 1630 contains an unintentional loophole that gives a counterintuitive meaning to the latter class of URLs. In RFC 3986, published some years later, the authors sheepishly acknowledge this flaw and permit implementations to try to parse such URLs for compatibility reasons. As a consequence, many browsers interpret the following examples in unexpected ways:
http:example.com/ In Firefox, Chrome, and Safari, this address may be treated identically to http://example.com/ when no fully qualified base URL context exists and as a relative reference to a directory named example.com when a valid base URL is available.
javascript://example.com/%0Aalert(1) This string is interpreted as a valid nonhierarchical pseudo-URL in all modern browsers, and the JavaScript alert(1) code will execute, showing a simple dialog window.
mailto://user@example.com Internet Explorer accepts this URL as a valid nonhierarchical reference to an email address; the “//” part is simply skipped. Other browsers disagree.
Credentials to Access the Resource
The credentials portion of the URL is optional. This location can specify a username, and perhaps a password, that may be required to retrieve the data from the server. The method through which these credentials are exchanged is not specified as a part of the abstract URL syntax, and it is always protocol specific. For those protocols that do not support authentication, the behavior of a credential-bearing URL is simply undefined.
When no credentials are supplied, the browser will attempt to fetch the resource anonymously. In the case of HTTP and several other protocols, this means not sending any authentication data; for FTP, it involves logging into a guest account named ftp with a bogus password.
Most browsers accept almost any characters, other than general URL section delimiters, in this section, with two exceptions: Safari, for unclear reasons, rejects a broader set of characters, including "<", ">", "{", and "}", while Firefox also rejects newlines. (The Firefox behavior is possibly out of concern for FTP, which transmits user credentials without any encoding; in that protocol, a newline transmitted as is would be misinterpreted by the server as the beginning of a new FTP command. Other browsers may transmit FTP credentials in noncompliant percent-encoded form or simply strip any problematic characters later on.)
Server Address
For all fully qualified hierarchical URLs, the server address section must specify a case-insensitive DNS name (such as example.com), a raw IPv4 address (such as 127.0.0.1), or an IPv6 address in square brackets (such as [0:0:0:0:0:0:0:1]), indicating the location of a server hosting the requested resource. Firefox will also accept IPv4 addresses and hostnames in square brackets, but other implementations reject them immediately.
Although the RFC permits only canonical notations for IP addresses, standard C libraries used by most applications are much more relaxed, accepting noncanonical IPv4 addresses that mix octal, decimal, and hexadecimal notation or concatenate some or all of the octets into a single integer. As a result, the following options are recognized as equivalent:
http://127.0.0.1/ This is a canonical representation of an IPv4 address.
http://0x7f.1/ This is a representation of the same address that uses a hexadecimal number to represent the first octet and concatenates all the remaining octets into a single decimal value.
http://017700000001/ The same address is denoted using a 0-prefixed octal value, with all octets concatenated into a single 32-bit integer.
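For illustration, the decoding rules can be reproduced in a few lines of Python. This is a minimal sketch of the classic BSD inet_aton() behavior (the helper name is made up, and real C libraries tolerate still more variants):

def parse_legacy_ipv4(text):
    def to_int(part):
        if part.lower().startswith("0x"):
            return int(part, 16)            # hexadecimal component
        if len(part) > 1 and part.startswith("0"):
            return int(part, 8)             # 0-prefixed octal component
        return int(part, 10)                # plain decimal
    parts = [to_int(p) for p in text.split(".")]
    value = 0
    for p in parts[:-1]:
        value = (value << 8) | p            # leading parts are single octets
    # The final component absorbs all remaining bytes of the address.
    value = (value << (8 * (5 - len(parts)))) | parts[-1]
    return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))

print(parse_legacy_ipv4("0x7f.1"))        # 127.0.0.1
print(parse_legacy_ipv4("017700000001"))  # 127.0.0.1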
A similar laid-back approach can be seen with DNS names. Theoretically, DNS labels need to conform to a very narrow character set (specifically, alphanumerics, ".", and "-", as defined in RFC 1035), but many browsers will happily ask the underlying operating system resolver to look up almost anything, and the operating system will usually also not make a fuss. The exact set of characters accepted in the hostname and passed to the resolver varies from client to client. Safari is the most rigorous, while Internet Explorer is the most permissive. Perhaps of note, several control characters in the 0x0A–0x0D and 0xA0–0xAD ranges are ignored by most browsers in this portion of the URL.
One fascinating behavior of the URL parsers in all of the mainstream browsers is their willingness to treat the character "。" (ideographic full stop, Unicode code point U+3002) identically to a period in hostnames, but not anywhere else in the URL. This is reportedly because certain Chinese keyboard mappings make it much easier to type this symbol than the expected 7-bit ASCII value.
Server Port
The server port is an optional section that describes a nonstandard network port to connect to on the previously specified server. Virtually all application-level protocols supported by browsers and third-party applications use TCP or UDP as the underlying transport method, and both TCP and UDP rely on 16-bit port numbers to separate traffic between unrelated services running on a single machine. Each scheme is associated with a default port on which servers for that protocol are customarily run (80 for HTTP, 21 for FTP, and so on), but the default can be overridden at the URL level.
An interesting and unintended side effect of this feature is that browsers can be tricked into sending attacker-supplied data to random network services that do not speak the protocol the browser expects them to. For example, one may point a browser to http://mail.example.com:25/, where 25 is a port used by the Simple Mail Transfer Protocol (SMTP) service rather than HTTP. This fact has caused a range of security problems and prompted a number of imperfect workarounds.
Hierarchical File Path
The next portion of the URL, the hierarchical file path, is envisioned as a way to identify a specific resource to be retrieved from the server, such as /documents/2009/my_diary.txt. The specification quite openly builds on top of the Unix directory semantics, mandating the resolution of “/../” and “/./” segments in the path and providing a directory-based method for sorting out relative references in non–fully qualified URLs.
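The dot-segment resolution itself is straightforward to express in code. Below is a rough Python sketch of the "remove dot segments" procedure from RFC 3986, simplified to handle absolute paths only:

def remove_dot_segments(path):
    output = []
    for segment in path.split("/"):
        if segment == ".":
            continue                        # "/./" collapses to "/"
        elif segment == "..":
            if len(output) > 1:
                output.pop()                # "/../" removes the previous segment
        else:
            output.append(segment)
    normalized = "/".join(output)
    if path.endswith(("/.", "/..")) and not normalized.endswith("/"):
        normalized += "/"                   # preserve the implied trailing slash
    return normalized or "/"

print(remove_dot_segments("/documents/2009/../2010/./my_diary.txt"))
# /documents/2010/my_diary.txt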
Using the filesystem model must have seemed like a natural choice in the 1990s, when web servers acted as simple gateways to a collection of static files and the occasional executable script. But since then, many contemporary web application frameworks have severed any remaining ties with the filesystem, interfacing directly with database objects or registered locations in resident program code. Mapping these data structures to well-behaved URL paths is possible, but it is not always done, or done carefully. All of this makes automated content retrieval, indexing, and security testing more complicated than it should be.
Query String
The query string is an optional section used to pass arbitrary, nonhierarchical parameters to the resource identified earlier by the path. One common example is passing user-supplied terms to a server-side script that implements search functionality, such as:

http://example.com/search.php?query=hello+world

Most web developers are accustomed to a particular layout of the query string; this familiar format is generated by browsers when handling HTML-based forms and follows this syntax:

name1=value1&name2=value2...

Surprisingly, such a layout is not mandated in the URL RFCs. Instead, the query string is treated as an opaque blob of data that may be interpreted by the final recipient as it sees fit, and unlike the path, it is not encumbered with specific parsing rules.

Hints of the commonly used format can be found in the informational RFC 1630, in the mail-related RFC 2368, and in HTML specifications dealing with forms. None of this is binding, and therefore, while it may be impolite, it is not a mistake for web applications to employ arbitrary formats for whatever data they wish to put in that part of the URL.
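Python's urllib, for example, implements this conventional name=value&... layout rather than anything actually required by the URL RFCs; a quick demonstration:

from urllib.parse import parse_qs

print(parse_qs("query=hello+world&lang=en"))
# {'query': ['hello world'], 'lang': ['en']}

Note that the plus sign is decoded to a space, another purely conventional rule inherited from HTML forms.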
Fragment ID
The fragment ID is an opaque value with a role similar to the query string but that provides optional instructions for the client application rather than the server. (In fact, the value is not supposed to be sent to the server at all.) Neither the format nor function of the fragment ID is clearly specified in the RFCs, but it is hinted that it may be used to address “subresources” in the retrieved document or to provide other document-specific rendering cues.
In practice, fragment identifiers have only a single sanctioned use in the browser: that of specifying the name of an anchor HTML element for in-document navigation. The logic is simple. If an anchor name is supplied in the URL and a matching HTML tag can be located, the document will be scrolled to that location for viewing; otherwise, nothing happens. Because the information is encoded in the URL, this particular view of a lengthy document could be easily shared with others or bookmarked. In this use, the meaning of a fragment ID is limited to scrolling an existing document, so there is no need to retrieve any new data from the server when only this portion of the URL is updated in response to user actions.
This interesting property has led to another, more recent and completely ad hoc use of this value: to store miscellaneous state information needed by client-side scripts. For example, consider a map-browsing application that puts the currently viewed map coordinates in the fragment identifier so that it will know to resume from that same location if the link is bookmarked or shared. Unlike updating the query string, changing the fragment ID on-the-fly will not trigger a time-consuming page reload, making this data-storage trick a killer feature.
Putting It All Together
Each of the aforementioned URL segments is delimited by certain reserved characters: slashes, colons, question marks, and so on. To make the whole approach usable, these delimiting characters should not appear anywhere in the URL for any other purpose. With this assumption in mind, imagine a sample algorithm to split absolute URLs into the aforementioned functional parts in a manner at least vaguely consistent with how browsers accomplish this task. A reasonably decent example of such an algorithm could be:
STEP 1: Extract the scheme name.
Scan for the first “:” character. The part of the URL to its left is the scheme name. Bail out if the scheme name does not conform to the expected set of characters; the URL may need to be treated as a relative one if so.
STEP 2: Consume the hierarchical URL identifier.
The string “//” should follow the scheme name. Skip it if found; bail out if not.
Unlike UNIX-derived operating systems, Microsoft Windows uses backslashes instead of slashes to delimit file paths (say, c:\windows\system32\calc.exe). Microsoft probably tried to compensate for the possibility that users would be confused by the need to type a different type of slash on the Web, or hoped to resolve other possible inconsistencies with file: URLs and similar mechanisms that would be interfacing directly with the local filesystem. Other Windows filesystem specifics (such as case insensitivity) are not replicated, however.
STEP 3: Grab the authority section.
Scan for the next "/", "?", or "#", whichever comes first, to extract the authority section from the URL. As mentioned above, most browsers will also accept "\" as a delimiter in place of a forward slash, which may need to be accounted for. The semicolon (;) is another acceptable authority delimiter in browsers other than Internet Explorer and Safari; the reason for this decision is unknown.
STEP 3A: Find the credentials, if any.
Once the authority section is extracted, locate the at symbol (@) in the substring. If found, the leading snippet constitutes login credentials, which should be further tokenized at the first occurrence of a colon (if present) to split the login and password data.
STEP 3B: Extract the destination address.
The remainder of the authority section is the destination address. Look for the first colon to separate the hostname from the port number. A special case is needed for bracket-enclosed IPv6 addresses, too.
STEP 4: Identify the path (if present).
If the authority section is followed immediately by a forward slash (or, for some implementations, a backslash or semicolon, as noted earlier), scan for the next "?", "#", or end-of-string, whichever comes first. The text in between constitutes the path section, which should be normalized according to Unix path semantics.
STEP 5: Extract the query string (if present).
If the last successfully parsed segment is followed by a question mark, scan for the next “#” character or end-of-string, whichever comes first. The text in between is the query string.
STEP 6: Extract the fragment identifier (if present).
If the last successfully parsed segment is followed by “#”, everything from that character to the end-of-string is the fragment identifier. Either way, you’re done!
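For illustration only, these six steps can be condensed into a single Python regular expression. The sketch deliberately ignores the browser quirks discussed in this chapter (backslashes, semicolons, stray control characters, and so on), so it should not be mistaken for a faithful model of any real browser:

import re

URL_RE = re.compile(r"""
    ^([a-zA-Z][a-zA-Z0-9+.\-]*):   # STEP 1: scheme name
    //                             # STEP 2: hierarchical marker
    (?:([^/?\#@]*)@)?              # STEP 3A: credentials, if any
    ([^/?\#]*)                     # STEP 3B: host and optional port
    ([^?\#]*)                      # STEP 4: path
    (?:\?([^\#]*))?                # STEP 5: query string
    (?:\#(.*))?                    # STEP 6: fragment identifier
    $""", re.VERBOSE)

match = URL_RE.match("http://user:pw@example.com:8080/a/b?x=1#top")
print(match.groups())
# ('http', 'user:pw', 'example.com:8080', '/a/b', 'x=1', 'top')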
This algorithm may seem mundane, but it reveals subtle details that even seasoned programmers normally don't think about. It also illustrates why it is extremely difficult for casual users to understand how a particular URL may be parsed. Let's start with this fairly simple case:

http://example.com@167772161/

The target of this URL, a concatenated IP address that decodes to 10.0.0.1, is not readily apparent to a nonexpert, and many users would believe they are visiting example.com instead. (This @-based trick was quickly embraced to facilitate all sorts of online fraud targeted at casual users. Attempts to mitigate its impact ranged from the heavy-handed and oddly specific, such as disabling URL-based authentication in Internet Explorer or crippling it with warnings in Firefox, to the fairly sensible, such as hostname highlighting in the address bar of several browsers.) But all right, that was an easy one! So let's have a peek at this syntax instead:

http://example.com\@coredump.cx/

In Firefox, that URL will take the user to coredump.cx, because example.com will be interpreted as a valid value for the login field. In almost all other browsers, "\" will be interpreted as a path delimiter, and the user will land on example.com instead.

An even more frustrating example exists for Internet Explorer. Consider this:

http://example.com;.coredump.cx/

Microsoft's browser permits ";" in the hostname and successfully resolves this label, thanks to the appropriate configuration of the coredump.cx domain. Most other browsers will autocorrect the URL to http://example.com/;.coredump.cx and take the user to example.com instead (except for Safari, where the syntax causes an error). If this looks messy, remember that we are just getting started with how browsers work!
Reserved Characters and Percent Encoding
The URL-parsing algorithm outlined in the previous section relies on the assumption that certain reserved, syntax-delimiting characters will not appear literally in the URL in any other capacity (that is, they won't be a part of the username, request path, and so on). These generic, syntax-disrupting delimiters are:
: / ? # [ ] @

The RFC also names a couple of lower-tier delimiters without giving them any specific purpose, presumably to allow scheme- or application-specific features to be implemented within any of the top-level sections:

! $ & ' ( ) * + , ; =

All of the above characters are in principle off-limits, but there are legitimate cases where one would want to include them in the URL (for example, to accommodate arbitrary search terms entered by the user and passed to the server in the query string). Therefore, rather than ban them, the standard provides a method to encode all spurious occurrences of these values. The method, simply called percent encoding or URL encoding, substitutes characters with a percent sign (%) followed by two hexadecimal digits representing the matching ASCII value. For example, "/" will be encoded as %2F (uppercase is customary but not enforced). It follows that, to avoid ambiguity, the naked percent sign itself must be encoded as %25. Any intermediaries that handle existing URLs (browsers and web applications included) are further compelled never to attempt to decode or encode reserved characters in relayed URLs, because the meaning of such a URL may suddenly change.
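Standard libraries implement exactly this substitution. A quick Python illustration (passing safe="" so that even slashes are escaped):

from urllib.parse import quote

print(quote("100% legit?/weird&input", safe=""))
# 100%25%20legit%3F%2Fweird%26input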
Regrettably, the immutability of reserved characters in existing URLs is at odds with the need to respond to URLs that are technically illegal because they misuse these characters, yet are encountered by the browser in the wild. This topic is not covered by the specifications at all, which forces browser vendors to improvise and causes cross-implementation inconsistencies. For example, should the URL http://a@b@c/ be translated to http://a@b%40c/ or perhaps to http://a%40b@c/? Internet Explorer and Safari think the former makes more sense; other browsers side with the latter view.
The remaining characters not in the reserved set are not supposed to have any particular significance within the URL syntax itself. However, some (such as nonprintable ASCII control characters) are clearly incompatible with the idea that URLs should be human readable and transport-safe. Therefore, the RFC outlines a confusingly named subset of unreserved characters (consisting of alphanumerics, "-", ".", "_", and "~") and says that only this subset and the reserved characters in their intended capacity are formally allowed to appear in the URL as is.
Curiously, these unreserved characters are only allowed to appear in an unescaped form; they are not required to do so. User agents may encode or decode them at whim, and doing so does not change the meaning of the URL at all. This property brings up yet another way to confuse users: the use of noncanonical representations of unreserved characters. Specifically, all of the following are equivalent:
http://example.com/
http://%65xample.%63om/
http://%65%78%61%6d%70%6c%65%2e%63%6f%6d/
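A canonicalizing filter can therefore safely decode unreserved characters, and only those, before comparing URLs. A minimal Python sketch, with a hypothetical decode_unreserved() helper:

import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")

def decode_unreserved(url):
    # Decode %XX only when it maps to an unreserved character; decoding
    # a reserved delimiter (such as %2F) could change the URL's meaning.
    def repl(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else m.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, url)

print(decode_unreserved("http://%65xample.%63om/a%2Fb"))
# http://example.com/a%2Fb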
Lastly, contrary to the requirements spelled out in the RFC, most browsers also do not encode fragment identifiers at all. This poses an unexpected challenge to client-side scripts that rely on this string and expect certain potentially unsafe characters never to appear literally.
Handling of Non-US-ASCII Text
Many languages used around the globe rely on characters outside the basic, 7-bit ASCII character set or the default 8-bit code page traditionally used by all PC-compatible systems (CP437). Heck, some languages depend on alphabets that are not based on Latin at all.
In order to accommodate the needs of an often-ignored but formidable non-English user base, various 8-bit code pages with an alternative set of high-bit characters were devised long before the emergence of the Web: ISO 8859-1, CP850, and Windows 1252 for Western European languages; ISO 8859-2, CP852, and Windows 1250 for Eastern and Central Europe; and KOI8-R and Windows 1251 for Russia. And, because several alphabets could not be accommodated in the 256-character space, we saw the rise of complex variable-width encodings, such as Shift JIS for katakana.
The incompatibility of these character maps made it difficult to exchange documents between computers configured for different code pages. By the early 1990s, this growing problem led to the creation of Unicode—a sort of universal character set, too large to fit within 8 bits but meant to encompass practically all regional scripts and specialty pictographs known to man. Unicode was followed by UTF-8, a relatively simple, variable-width representation of these characters, which was theoretically safe for all applications capable of handling traditional 8-bit formats. Unfortunately, UTF-8 required more bytes to encode high-bit characters than did most of its competitors, and to many users, this seemed wasteful and unnecessary. Because of this criticism, it took well over a decade for UTF-8 to gain traction on the Web, and it only did so long after all the relevant protocols had solidified.
This unfortunate delay had some bearing on the handling of URLs that contain user input. Browsers needed to accommodate such use very early on, but when developers turned to the relevant standards, they found no meaningful advice. Even years later, in 2005, RFC 3986 had just this to say:
In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification.
Percent-encoded octets . . . may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.
Alas, despite this wishful thinking, none of the remaining standards addressed the topic. It was always possible to put raw high-bit characters in a URL, but without knowing the code page they should be interpreted in, the server would not be able to tell whether that %B1 was supposed to mean "±", "ą", or some other squiggly character specific to the user's native script.
Sadly, browser vendors have not taken the initiative and come up with a consistent solution to this problem. Most browsers internally transcode URL path segments to UTF-8 (or ISO 8859-1, if sufficient), but then they generate the query string in the code page of the referring page instead. In certain cases, when URLs are entered manually or passed to certain specialized APIs, high-bit characters may also be downgraded to their 7-bit US-ASCII look-alikes, replaced with question marks, or even completely mangled due to implementation flaws.
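The convention that eventually prevailed, and that modern libraries generally assume, is to encode text as UTF-8 first and percent-encode the resulting octets. A short Python illustration:

from urllib.parse import quote, unquote

print(quote("żółw"))                   # %C5%BC%C3%B3%C5%82w
print(unquote("%C5%BC%C3%B3%C5%82w"))  # żółw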
An astute reader might wonder why this limitation would matter for hostnames; that is, why was it important to have localized domain names in non-Latin alphabets, too? That question may be difficult to answer now. Quite simply, several folks thought a lack of these encodings would prevent businesses and individuals around the world from fully embracing and enjoying the Web, and, rightly or not, they were determined to make it happen.
This pursuit led to the creation of Internationalized Domain Names in Applications (IDNA): first RFC 3490, which outlined a rather contrived scheme to encode arbitrary Unicode strings using alphanumerics and dashes, and then RFC 3492, which described a way to apply this encoding to DNS labels using a format known as Punycode. Punycode looks roughly like this:

xn--[US-ASCII part]-[encoded Unicode data]

A compliant browser presented with a technically illegal URL that contained a literal non-US-ASCII character anywhere in the hostname was supposed to transform the name to Punycode before performing a DNS lookup. Conversely, when presented with Punycode in an existing URL, it should put a decoded, human-readable form of the string in the address bar.

Of all the URL-based encoding approaches, IDNA soon proved to be the most problematic. In essence, the domain name in the URL shown in the browser’s address bar is one of the most important security indicators on the Web, as it allows users to quickly differentiate sites they trust and have done business with from the rest of the Internet. When the hostname shown by the browser consists of 38 familiar and distinctive characters, only fairly careless victims will be tricked into thinking that their favorite example.com domain and an impostor examp1e.com site are the same thing. But IDNA casually and indiscriminately extended these 38 characters to some 100,000 glyphs supported by Unicode, many of which look exactly alike and are separated from each other based on functional differences alone.
How bad is it? Let's consider Cyrillic, for example. This alphabet has a number of homoglyphs that look practically identical to their Latin counterparts but that have completely different Unicode values and resolve to completely different Punycode DNS names: Cyrillic а (U+0430), е (U+0435), о (U+043E), р (U+0440), с (U+0441), у (U+0443), and х (U+0445) are, in most typefaces, virtually indistinguishable from Latin a, e, o, p, c, y, and x.
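The effect is easy to reproduce. In this Python sketch, the second hostname contains Cyrillic letters, so it compares unequal to its Latin look-alike and encodes to an entirely different Punycode label (the exact xn-- output is omitted here):

latin = "example.com"
spoofed = "еxаmple.com"  # "е" and "а" are Cyrillic U+0435 and U+0430
print(latin == spoofed)        # False: different code points, same appearance
print(spoofed.encode("idna"))  # b'xn--...' label, nothing like the original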
When IDNA was proposed and first implemented in browsers, nobody seriously considered the consequences of this issue. Browser vendors apparently assumed that DNS registrars would prevent people from registering look-alike names, and registrars figured it was the browser vendors' problem to have unambiguous visuals in the address bar.
In 2002, the significance of the problem was finally recognized by all parties involved. That year, Evgeniy Gabrilovich and Alex Gontmakher published "The Homograph Attack," a paper exploring the vulnerability in great detail. They noted that any registrar-level work-arounds, even if implemented, would have a fatal flaw: An attacker could always purchase a wholesome top-level domain and then, on his own name server, set up a subdomain record that, with the IDNA transformation applied, would decode to a string visually identical to example.com/ (the last character being merely a nonfunctional look-alike of the actual ASCII slash). To the naked eye, the resulting URL would appear to begin with example.com/ while actually pointing at the attacker's server.
There is nothing that a registrar can do to prevent this attack, and the ball is in the browser vendors’ court. But what options do they have, exactly?
As it turns out, there aren't many. We now realize that the poorly envisioned IDNA standard cannot be fixed in a simple and painless way. Browser developers have responded to this risk by reverting to incomprehensible Punycode when a user's locale does not match the script seen in a particular DNS label (which causes problems when browsing foreign sites or when using imported or simply misconfigured computers); permitting IDNA only in certain country-specific, top-level domains (ruling out the use of internationalized domain names in .com and other high-profile TLDs); and blacklisting certain "bad" characters that resemble slashes, periods, white spaces, and so forth (a fool's errand, given the number of typefaces used around the world).
These measures are drastic enough to severely hinder the adoption of internationalized domain names, probably to a point where the standard’s lingering presence causes more security problems than it brings real usability benefits to non-English users.
Common URL Schemes and Their Function
Let's leave the bizarre world of URL parsing behind us and go back to the basics. Earlier in this chapter, we implied that certain schemes may have unexpected security consequences and that because of this, any web application handling user-supplied URLs must be cautious. To explain this point a bit better, it is useful to review all the URL schemes commonly supported in a typical browser environment. These can be combined into four basic groups.
Browser-Supported, Document-Fetching Protocols
These schemes, handled internally by the browser, offer a way to retrieve arbitrary content using a particular transport protocol and then display it using common, browser-level rendering logic. This is the most rudimentary and the most expected function of a URL.
The list of commonly supported schemes in this category is surprisingly short: http: (RFC 2616), the primary transport mode used on the Web and the focus of the next chapter of this book; https:, an encrypted version of HTTP (RFC 2818); and ftp:, an older file transfer protocol (RFC 959). All browsers also support file: (previously also known as local:), a system-specific method for accessing the local filesystem or NFS and SMB shares. (This last scheme is usually not directly accessible through Internet-originating pages, though.)
Two additional, obscure cases also deserve a brief mention: built-in support for the gopher: scheme, one of the failed predecessors of the Web (RFC 1436), which is still present in Firefox, and shttp:, an alternative, failed take on HTTPS (RFC 2660), still recognized in Internet Explorer (but today, simply aliased to HTTP).
Protocols Claimed by Third-Party Applications and Plug-ins
For these schemes, matching URLs are simply dispatched to external, specialized applications that implement functionality such as media playback, document viewing, or IP telephony. At this point, the involvement of the browser (mostly) ends.
Scores of external protocol handlers exist today, and it would take another thick book to cover them all. Some of the most common examples include the acrobat: scheme, predictably routed to Adobe Acrobat Reader; callto: and sip: schemes claimed by all sorts of instant messengers and telephony software; daap:, itpc:, and itms: schemes used by Apple iTunes; mailto:, news:, and nntp: protocols claimed by mail and Usenet clients; mmst:, mmsu:, msbd:, and rtsp: protocols for streaming media players; and so on. Browsers are sometimes also included on the list. The previously mentioned firefoxurl: scheme launches Firefox from within another browser, while cf: gives access to Chrome from Internet Explorer.
For the most part, when these schemes appear in URLs, they have no impact on the security of the web applications that allow them through (although this is not guaranteed, especially in the case of plugin-supported content). It is worth noting, however, that third-party protocol handlers tend to be notoriously buggy and are sometimes abused to compromise the operating system. Therefore, restricting the ability to navigate to mystery protocols is a common courtesy to the users of any reasonably trustworthy website.
Nonencapsulating Pseudo-Protocols
An array of protocols is reserved to provide convenient access to the browser's scripting engine and other internal functions, without actually retrieving any remote content and perhaps without establishing an isolated document context to display the result. Many of these pseudo-protocols are highly browser-specific and are either not directly accessible from the Internet or are incapable of doing harm. However, there are several important exceptions to this rule.
Perhaps the best-known exception is the javascript: scheme (in earlier years, also available under aliases such as livescript: or mocha: in Netscape browsers). This scheme gives access to the JavaScript programming engine in the context of the currently viewed website. In Internet Explorer, vbscript: offers similar capabilities through the proprietary Visual Basic interface.
Another important case is the data: protocol (RFC 2397), which permits short, inline documents to be created without any extra network requests and sometimes inherits much of their operating context from the referring page. An example of a data: URL is:

data:text/plain,Why,%20hello%20there!

These externally accessible pseudo-URLs are of acute significance to site security. When navigated to, their payload may execute in the context of the originating domain, possibly stealing sensitive data or altering the appearance of the page for the affected user. The exact rules of context inheritance for such URLs vary from browser to browser, but as you might expect, the consequences are substantial.
Encapsulating Pseudo-Protocols
This special class of pseudo-protocols may be used to prefix any other URL in order to force a special decoding or rendering mode for the retrieved resource. Perhaps the best-known example is the view-source: scheme supported by Firefox and Chrome, used to display the pretty-printed source of an HTML page. This scheme is used in the following way:

view-source:http://www.example.com/

Other protocols that function similarly include jar:, which allows content to be extracted from ZIP files on the fly in Firefox; wyciwyg: and view-cache:, which give access to cached pages in Firefox and Chrome, respectively; an oddball feed: scheme, which is meant to access news feeds in Safari; and a host of poorly documented protocols associated with the Windows help subsystem and other components of Microsoft Windows (hcp:, its:, mhtml:, mk:, ms-help:, ms-its:, and ms-itss:).
The common property of many encapsulating protocols is that they allow an attacker to hide the actual URL that will ultimately be interpreted by the browser from naïve filters: view-source:javascript: (or even view-source:view-source:javascript:) followed by malicious code is a simple way to accomplish this. Some security restrictions may be present to limit such trickery, but they should not be relied upon. Another significant problem, recurring especially with Microsoft's mhtml:, is that using the protocol may cause some of the content directives provided by the server at the HTTP level to be ignored, possibly leading to widespread misery.
Closing Note on Scheme Detection
The sheer number of pseudo-protocols is the primary reason why web applications need to carefully screen user-supplied URLs. The wonky, browser-specific URL-parsing patterns, coupled with the open-ended nature of the list of supported schemes, mean that it is unsafe to simply blacklist known bad schemes; for example, a check for javascript: may be circumvented if this keyword is spliced with a tab or a newline, replaced with vbscript:, or prefixed with another encapsulating scheme.
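In other words, the only robust approach is an allowlist applied after normalization. A minimal Python sketch, assuming the application wishes to permit only a few well-understood document-fetching schemes:

from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "ftp"}

def is_acceptable_url(url):
    # Strip characters that some browsers ignore inside scheme names,
    # so that "java\nscript:" cannot slip past the check.
    cleaned = url.strip()
    for ch in "\t\r\n":
        cleaned = cleaned.replace(ch, "")
    return urlsplit(cleaned).scheme.lower() in ALLOWED_SCHEMES

print(is_acceptable_url("https://example.com/"))   # True
print(is_acceptable_url("java\nscript:alert(1)"))  # False
print(is_acceptable_url("view-source:http://a/"))  # False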
Resolution of Relative URLs
Relative URLs have been mentioned on several occasions earlier, and they deserve some additional attention at this point, too. The reason for their existence is that on almost every web page on the Internet, a considerable number of URLs will reference resources hosted on that same server, perhaps in the same directory. It would be inconvenient and wasteful to require a fully qualified URL to appear in the document every time such a reference is needed, so short, relative URLs (such as ../other_file.txt) are used instead. The missing details are inferred from the URL of the referring document.
Because relative URLs are allowed to appear in exactly the same scenar- ios in which any absolute URL may appear, a method to distinguish between the two is necessary within the browser. Web applications also benefit from the ability to make the distinction, because most types of URL filters may want to scrutinize absolute URLs only and allow local references through as is.
The specification may make this task seem very simple: If the URL string does not begin with a valid scheme name followed by a colon and, preferably, a valid "//" sequence, it should be interpreted as a relative reference. And if no context for parsing such a relative URL exists, it should be rejected. Everything else is a safe relative link, right?
Predictably, it's not as easy as it seems. First, as outlined in previous sections, the accepted set of characters in a valid scheme name, and the patterns accepted in lieu of "//", vary from one implementation to another. Perhaps more interestingly, it is a common misconception that relative links can point only to resources on the same server; quite a few other, less obvious variants of relative URLs exist.
Let’s have a quick peek at the known classes of relative URLs to better illustrate this possibility.
Scheme, but no authority present (http:foo.txt)
This infamous loophole is hinted at in RFC 3986 and attributed to an oversight in one of the earlier specs. While said specs descriptively classified such URLs as (invalid) absolute references, they also provided a promiscuous reference-parsing algorithm keen on interpreting them incorrectly.
In the latter interpretation, these URLs would set a new protocol and path, query, or fragment ID but have the authority section copied over from the referring location. This syntax is accepted by several browsers, but inconsistently. For example, in some cases, http:foo.txt may be treated as a relative reference, while https:example.com may be parsed as an absolute one!
No scheme, but authority present (//example.com)
This is another notoriously confusing but at least well-documented quirk. While /example.com is a reference to a local resource on the current server, the standard compels browsers to treat //example.com as a very different case: a reference to a different authority over the current protocol. In this scenario, the scheme will be copied over from the referring location, and all other URL details will be derived from the relative URL.
No scheme, no authority, but path present (../notes.txt)
This is the most common variant of a relative link. Protocol and authority information is copied over from the referring URL. If the relative URL does not start with a slash, the path of the referring URL is also copied over, up to the rightmost "/". For example, given the base URL http://www.example.com/files/, the entire path is preserved, whereas given http://www.example.com/files/index.html, the filename is truncated. The new path is then appended, and standard path normalization follows on the concatenated value. The query string and fragment ID are derived only from the relative URL.
No scheme, no authority, no path, but query string present (?search=bunnies)
In this scenario, protocol, authority, and path information are copied verbatim from the referring URL. The query string and fragment ID are derived from the relative URL.
Only fragment ID present (#bunnies)
All information except for the fragment ID is copied verbatim from the referring URL; only the fragment ID is substituted. Following this type of relative URL does not cause the page to be reloaded under normal circumstances, as noted earlier.
Because of the risk of potential misunderstandings between application-level URL filters and the browser when handling these types of relative references, it is a good design practice never to output user-supplied relative URLs verbatim. Where feasible, they should be explicitly rewritten to absolute references, and all security checks should be carried out against the resulting fully qualified address instead.
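Python's urllib.parse.urljoin() follows the resolution rules described above and can serve as that rewriting step (the base URL and the evil.example host below are, of course, just illustrations):

from urllib.parse import urljoin

base = "http://www.example.com/files/index.html"
for rel in ("../notes.txt", "//evil.example/x", "?search=bunnies", "#bunnies"):
    print(urljoin(base, rel))

# http://www.example.com/notes.txt
# http://evil.example/x
# http://www.example.com/files/index.html?search=bunnies
# http://www.example.com/files/index.html#bunnies

Each printed result is the fully qualified address against which the security checks should then be carried out.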