Why are URLs case-sensitive?
URLs are not case-sensitive, only parts of them.
For example, nothing is case-sensitive in the URL
With reference to RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
First, from Wikipedia, a URL looks like:
(I've removed the user:password part because it is not interesting and rarely used)
schemes are case-insensitive
The host subcomponent is case-insensitive.
The path component contains data...
The query component contains non-hierarchical data...
Individual media types may define their own restrictions on or structures within the fragment identifier syntax for specifying different types of subsets, views, or external references
The scheme and host are case-insensitive.
The rest of the URL is case-sensitive.
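To make the split concrete, here is a minimal sketch that lowercases only the scheme and host and leaves everything else alone. The helper name `normalize_case` and the example.com URLs are made up for illustration; it uses Python's standard `urllib.parse` and is not meant as a complete URL normalizer.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(url: str) -> str:
    """Lowercase the case-insensitive parts (scheme, host) and nothing else."""
    parts = urlsplit(url)
    # Keep any user:password prefix untouched; only the host[:port] part
    # after the last "@" is lowercased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    # Path, query and fragment are data for the server: pass them through as-is.
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(normalize_case("HTTP://Example.COM/Path?Q=A#Frag"))
# http://example.com/Path?Q=A#Frag

# Two URLs that differ only in host case compare equal after normalization...
print(normalize_case("http://EXAMPLE.com/a5B") == normalize_case("http://example.com/a5B"))
# ...but a difference in the path survives.
print(normalize_case("http://example.com/a5B") == normalize_case("http://example.com/a5b"))
```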
Why is the
This seems to be the main question.
It is difficult to answer "why" something was done if it was not documented, but we can make a very good guess.
I've picked very specific quotes from the spec, with emphasis on data.
Let's look at the URL again:
Location - The location has a canonical form, and is case-insensitive. Why? Probably so you could buy a domain name without having to buy thousands of variants.
Data - the data is used by the target server, and the application can choose what it means. It wouldn't make any sense to make data case insensitive. The application should have more options, and defining case-insensitivity in the spec will limit these options.
This is also a useful distinction for HTTPS: the data is encrypted, but the host is visible.
Is it useful?
Case-sensitivity has its pitfalls when it comes to caching and canonical URLs, but it is certainly useful. Some examples:
- Base64, which is used in Data URIs.
- Sites can encode Base64 data in the URL, for example: http://tryroslyn.azurewebsites.net/#f:r/A4VwRgNglgxgBDCBDAziuBhOBvGB7AOxQBc4SAnKAgczLgF44AiAUQPwBMBTDuKuYgAsucAKoAlADIBCJgG4AvkA
- URL shorteners utilize case sensitivity:
/a5B might be different than
- As you've mentioned, Wikipedia can differentiate "AIDS" from "Aids".
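As a rough illustration of the Base64 point, the sketch below encodes two bytes, lowercases the resulting token as a case-insensitive URL would allow, and decodes both versions. The payload `b"Hi"` is arbitrary; only the standard `base64` module is used.

```python
import base64

payload = b"Hi"
token = base64.urlsafe_b64encode(payload).decode("ascii")

# Pretend some intermediary treated the URL as case-insensitive
# and lowercased the token on the way.
mangled = token.lower()

print(token, base64.urlsafe_b64decode(token))
print(mangled, base64.urlsafe_b64decode(mangled))
print(base64.urlsafe_b64decode(mangled) == payload)  # False
```

The lowercased token is still valid Base64, so nothing fails loudly; it just decodes to different bytes.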
Simple. The OS is case-sensitive. Web servers generally do not care unless they have to hit the file system at some point. This is where Linux and other Unix-based operating systems enforce the rules of the file system, of which case sensitivity is a major part. This is why IIS has never been case-sensitive: Windows was never case-sensitive.
There have been some strong arguments in the comments (since deleted) about whether URLs have any relationship with the file system as I have stated. These arguments have become heated. It is extremely short-sighted to believe that there is not a relationship. There absolutely is! Let me explain further.
Application programmers are not generally systems internals programmers. I am not being insulting. They are two separate disciplines, and system internals knowledge is not required to write applications when applications can simply make calls to the OS. Since application programmers are not systems internals programmers, bypassing the OS services is not possible. I say this because these are two separate camps and they rarely cross over. Applications are written to use OS services as a rule. There are some rare exceptions, of course.
Back when web servers began to appear, application developers did not attempt to bypass OS services. There were several reasons for this. One, it was not necessary. Two, application programmers generally did not know how to bypass OS services. Three, most OSes were either extremely stable and robust, or extremely simple and lightweight, so bypassing them was not worth the cost.
Keep in mind that the early web servers ran either on expensive computers, such as DEC VAX/VMS servers and the Unix systems of the day (Berkeley and Ultrix, among others) on main-frame or mid-frame computers, or, soon after, on lightweight computers such as PCs running Windows 3.1. By the time more modern search engines began to appear, such as Google in 1997/8, Windows had moved on to Windows NT, and other OSes such as Novell and Linux had also begun to run web servers. Apache was the dominant web server, though there were others, such as IIS and O'Reilly, which were also very popular. None of them at the time bypassed OS services. It is likely that none of the web servers do even today.
Early web servers were quite simple. They still are today. Any request made for a resource via an HTTP request that exists on a hard-drive was/is made by the web server through the OS file system.
File systems are rather simple mechanisms. When a request is made for access to a file and that file exists, the request is passed to the authorization sub-system and, if access is granted, the original request is satisfied. If the resource does not exist or access is not authorized, an exception is thrown by the system. When an application makes a request, a trigger is set and the application waits. When the request is answered, the trigger is thrown and the application processes the response. It still works that way today. If the application sees that the request has been satisfied, it continues; if the request has failed, the application executes an error condition within its code, or dies if the error is not handled. Simple.
In the case of a web server, assuming that a URL request for a path/file is made, the web server takes the path/file portion of the URL request (URI) and makes a request to the file system and it is either satisfied or throws an exception. The web server then processes the response. If, for example, the path and file requested is found and access granted by the authorization sub-system, then the web server processes that I/O request as normal. If the file system throws an exception, then the web server returns a 404 error if the file is Not Found or a 403 Forbidden if the reason code is unauthorized.
Since some OSes are case sensitive and file systems of this type require exact matches, the path/file that is requested of the web server must match what exists on the hard drive exactly. The reason for this is simple. Web servers do not guess what you mean. No computer does so without being programmed to. Web servers simply process requests as they receive them. If the path/file portion of the URL request being passed directly to the file system does not match what is on the hard drive, then the file system throws an exception and the web server returns a 404 Not Found error.
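The flow described above can be sketched in a few lines. This is only an illustration of a server that hands the path straight to the file system, not how any particular web server is implemented; `status_for` and the file names are hypothetical, and a real server would also have to guard against things like `../` in the path.

```python
import tempfile
from pathlib import Path


def status_for(doc_root: Path, url_path: str) -> int:
    """Map a URL path onto a file under doc_root and return an HTTP status."""
    # Hypothetical helper: the path is passed to the file system unchanged,
    # so the file system decides whether "Index.html" matches "index.html".
    target = doc_root / url_path.lstrip("/")
    try:
        with open(target, "rb"):
            return 200
    except FileNotFoundError:
        return 404  # Not Found
    except PermissionError:
        return 403  # Forbidden


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Index.html").write_text("hello")
    print(status_for(root, "/Index.html"))    # exact match on disk
    print(status_for(root, "/missing.html"))  # no such file
    print(status_for(root, "/index.html"))    # result depends on the file system
```

The last call is the interesting one: on a case-sensitive file system (typical on Linux) the lookup fails, while on a case-insensitive one (the Windows default) it succeeds, without the server code changing at all.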
It is really that simple, folks. It is not rocket science. There is an absolute relationship between the path/file portion of a URL and the file system.
A URL claims to be a UNIFORM Resource Locator and can point to resources that predate the web. Some of these are case-sensitive (e.g. many FTP servers), and URLs need to be able to represent these resources in a reasonably intuitive fashion.
Case insensitivity requires more work when looking for a match (either in the OS or above it).
If you define URLs as case-sensitive, individual servers can still implement them as case-insensitive if they want. The reverse is not true.
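A small sketch of that asymmetry, assuming a toy route table (`ROUTES` and both lookup functions are invented for the example): a server that receives the path unchanged can choose to fold case itself, at the cost of one extra step per lookup.

```python
# Hypothetical route table; keys are stored in lowercase.
ROUTES = {
    "/about": "about page",
    "/contact": "contact page",
}


def lookup_sensitive(path: str):
    """Exact match: what a case-sensitive spec gives you by default."""
    return ROUTES.get(path)


def lookup_insensitive(path: str):
    """Opt-in case-insensitivity: fold the path before matching."""
    return ROUTES.get(path.lower())


print(lookup_sensitive("/About"))    # None
print(lookup_insensitive("/About"))  # about page
```

If the spec had folded case before the server ever saw the path, there would be no way to write `lookup_sensitive`.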
Case insensitivity can be non-trivial in international contexts: see https://en.wikipedia.org/wiki/Dotted_and_dotless_I. Also, RFC 1738 allowed the use of characters outside the ASCII range provided they were encoded, but didn't specify a charset. This is fairly important for something calling itself the WORLD wide web. Defining URLs as case-insensitive would open up a lot of scope for bugs.
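The dotted/dotless I problem is easy to reproduce with Python's built-in, locale-independent case mapping. This is just a demonstration of the pitfall, not a recommendation for how to compare URLs.

```python
dotless_i = "\u0131"     # ı  LATIN SMALL LETTER DOTLESS I
dotted_cap_i = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing the dotless i gives a plain ASCII "I"...
print(dotless_i.upper())                       # I
# ...so a round trip through upper() and lower() does not give it back.
print(dotless_i.upper().lower() == dotless_i)  # False

# Lowercasing the dotted capital changes the length of the string:
# it becomes "i" followed by a combining dot above.
print(len(dotted_cap_i.lower()))               # 2
```

Any rule of the form "compare after lowercasing" has to decide what to do with cases like these, and a Turkish-locale implementation would decide differently again.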
If you are trying to pack a lot of data into a URI (e.g. a Data URI), you can pack more in if upper and lower case are distinct.
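A back-of-the-envelope sketch of how much more, assuming an alphabet of ASCII letters and digits only (real schemes such as Base64 add a couple of symbols, and the 4-character length is an arbitrary choice):

```python
import math

MIXED_CASE = 26 + 26 + 10  # a-z, A-Z, 0-9 -> 62 symbols
SINGLE_CASE = 26 + 10      # a-z, 0-9      -> 36 symbols


def bits_per_char(alphabet_size: int) -> float:
    """Information carried by one character of the given alphabet."""
    return math.log2(alphabet_size)


def distinct_ids(alphabet_size: int, length: int) -> int:
    """Number of different strings of exactly `length` characters."""
    return alphabet_size ** length


print(round(bits_per_char(MIXED_CASE), 2), round(bits_per_char(SINGLE_CASE), 2))
print(distinct_ids(MIXED_CASE, 4))   # 14776336
print(distinct_ids(SINGLE_CASE, 4))  # 1679616
```

With only four characters, a case-sensitive alphabet already gives nearly nine times as many distinct values, which is the same property short-link services rely on.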