Is it possible to sniff HTTPS URLs?

TL;DR An attacker cannot see anything past the domain.

Structure of an HTTP request

An HTTP request consists of a request line (the method, the path, and the protocol version), followed by headers and an optional body. The most common methods are GET, POST, and HEAD, which retrieve a page, submit data, and request only the response headers, respectively. The path in the URL is sent in the request line, right alongside the headers. When HTTPS is used, TLS encrypts the entirety of the HTTP traffic, including the method, the path, and the headers. Take this example, with wget loading the page foo.example.com/some/page.html. This text, as ASCII, is sent to the server:

GET /some/page.html HTTP/1.1
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: foo.example.com

The server will then respond with an HTTP status code, some headers of its own, and optionally some data (such as HTML). An example, giving a 301 redirect and some plain text as a response, may be:

HTTP/1.1 301 Moved Permanently
Date: Wed, 27 Dec 2017 04:42:54 GMT
Server: Apache
Location: https://bar.example.com/new/location.html
Content-Length: 56
Content-Type: text/plain

Thank you Mario, but our princess is in another castle!

This tells the client that the page has moved to the URL given in the Location header.

Over plain HTTP, this text is sent directly to the site over TCP. With HTTPS, TLS sits in a layer between TCP and HTTP, so all of it is encrypted, including the page you are requesting with the GET method. Note that, although the Host header is part of the request and thus encrypted, the host can often still be obtained through an rDNS lookup on the IP address, or by checking the SNI field of the TLS handshake, which transmits the hostname in plaintext.
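To make the layout of the request concrete, here is a minimal Python sketch that splits the example request above into its parts. The function name parse_request is made up for illustration, and the parsing is deliberately simplistic (no body handling, no repeated headers). Every field it extracts is something TLS would hide from an eavesdropper.

```python
def parse_request(raw: str):
    """Split a raw HTTP/1.x request into request line fields and headers."""
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The request line carries the method, the path, and the protocol version.
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, version, headers


# The wget request from the example above, with explicit CRLF line endings.
RAW_REQUEST = (
    "GET /some/page.html HTTP/1.1\r\n"
    "User-Agent: Wget/1.19.1 (linux-gnu)\r\n"
    "Accept: */*\r\n"
    "Accept-Encoding: identity\r\n"
    "Host: foo.example.com\r\n"
    "\r\n"
)

method, path, version, headers = parse_request(RAW_REQUEST)
print(method, path, version)
print(headers["Host"])
```
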

Structure of a URL

https://foo.example.com/some/page.html#some-fragment
| proto |    domain    |     path     |  fragment  |
  • proto - There are only two protocols in common use, HTTP and HTTPS.
  • domain - The domain (here foo.example.com, a subdomain of example.com) is detectable with rDNS or SNI.
  • path - The path is completely encrypted and can only be read by the target server.
  • fragment - The fragment is visible only to the web browser and is not transmitted.
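The same breakdown can be reproduced with Python's standard urllib.parse module. This is only a sketch: the visibility notes in the dictionary restate the list above and are not something the library computes.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://foo.example.com/some/page.html#some-fragment")

# Who can observe each component, per the list above.
visibility = {
    "proto": (parts.scheme, "visible (inferred from the port and the TLS handshake)"),
    "domain": (parts.hostname, "visible via rDNS or SNI"),
    "path": (parts.path, "encrypted, readable only by the server"),
    "fragment": (parts.fragment, "never transmitted, browser only"),
}

for name, (value, note) in visibility.items():
    print(f"{name:9} {value:20} {note}")
```
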

What an attacker can see

So what can an attacker see if you make a request over HTTPS? Let's take the previous hypothetical request from the perspective of a passive eavesdropper on the network. If I wanted to know what you are accessing, I have only limited options:

  • I see you making a web request encrypted with HTTPS going to 203.0.113.98.
  • I see that the destination port is 443, which I know is used for HTTPS.
  • I do an rDNS lookup and see that IP is used for example.com and example.org.
  • I look at the SNI record and see you are connecting to foo.example.com.

This is all I could do. I would not be able to see the path you are requesting, or even what method you are using, short of heuristic analysis based on the sizes of the data being sent and received, known as a traffic analysis attack. For a large service like Wikipedia, I would have no idea what article you are viewing based on analysis of the unencrypted data alone.
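As an illustration of the SNI step, here is a rough Python sketch of how an eavesdropper might pull the hostname out of a TLS ClientHello. Both function names are hypothetical. The ClientHello is a synthetic, stripped-down one built by the sketch itself rather than a real capture, and the parser does no bounds checking and assumes the whole handshake message fits in a single TLS record.

```python
import struct


def build_client_hello(hostname: str) -> bytes:
    """Build a minimal synthetic ClientHello record with one SNI extension."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name  # name_type 0 = host_name
    sni_data = struct.pack("!H", len(entry)) + entry       # server_name_list
    extension = struct.pack("!HH", 0, len(sni_data)) + sni_data  # type 0 = SNI
    extensions = struct.pack("!H", len(extension)) + extension
    body = (
        b"\x03\x03"                            # client version
        + bytes(32)                            # random (zeros in this toy example)
        + b"\x00"                              # empty session id
        + struct.pack("!H", 2) + b"\x13\x01"   # a single cipher suite
        + b"\x01\x00"                          # one compression method: null
        + extensions
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body  # type 1 = ClientHello
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


def extract_sni(record: bytes):
    """Return the SNI hostname from a ClientHello record, or None."""
    if len(record) < 6 or record[0] != 0x16 or record[5] != 0x01:
        return None                       # not a handshake record / ClientHello
    pos = 5 + 4                           # record header + handshake header
    pos += 2 + 32                         # client version + random
    pos += 1 + record[pos]                # session id
    pos += 2 + struct.unpack_from("!H", record, pos)[0]  # cipher suites
    pos += 1 + record[pos]                # compression methods
    end = pos + 2 + struct.unpack_from("!H", record, pos)[0]
    pos += 2
    while pos + 4 <= end:
        ext_type, ext_len = struct.unpack_from("!HH", record, pos)
        pos += 4
        if ext_type == 0:
            # Skip list length (2 bytes) and name type (1 byte).
            name_len = struct.unpack_from("!H", record, pos + 3)[0]
            return record[pos + 5:pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None


sniffed = extract_sni(build_client_hello("foo.example.com"))
print(sniffed)
```

Nothing here requires a key: the hostname is read straight out of the unencrypted handshake bytes.
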

An important note about referers on older browsers

Even though HTTPS encrypts the path you are accessing, if you click a hyperlink within that site which goes to an unencrypted page, the full path may be leaked in the Referer header. This is no longer the case for many newer browsers, but older or non-compliant browsers may still behave this way, as will websites which set the HTML5 referrer meta tag to always send the information. An example of what a client would send unencrypted when going from https://example.com/private/details.html to http://example.org/public/page.html in such a case would be:

GET /public/page.html HTTP/1.1
Referer: https://example.com/private/details.html
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.org

As such, navigating from an HTTPS page to an HTTP page may leak the full URL (excluding the fragment) of the previous page, so keep that in mind.
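The rule that compliant browsers traditionally applied (the policy named no-referrer-when-downgrade) can be sketched in a few lines of Python. The function referer_for and its always_send flag are hypothetical names used only for this illustration, and real browsers implement more policies than this one.

```python
from urllib.parse import urlsplit, urlunsplit


def referer_for(source_url: str, target_url: str, always_send: bool = False):
    """Return the Referer value a client would send, or None if it is withheld."""
    src = urlsplit(source_url)
    dst = urlsplit(target_url)
    # Withhold the referrer on an HTTPS -> HTTP downgrade, unless the page
    # (or a non-compliant browser) insists on always sending it.
    if src.scheme == "https" and dst.scheme != "https" and not always_send:
        return None
    # The fragment is never included in the Referer header.
    return urlunsplit((src.scheme, src.netloc, src.path, src.query, ""))


private = "https://example.com/private/details.html#notes"
public = "http://example.org/public/page.html"

print(referer_for(private, public))                    # withheld
print(referer_for(private, public, always_send=True))  # leaked over plain HTTP
```
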


The naive answer is no: the URL is encrypted in the TLS stream. But that answer ignores a great deal of relevant information.

Suppose it's Wikipedia. How long is an HTTP GET request for https://en.wikipedia.org/wiki/Cryptography versus https://en.wikipedia.org/wiki/Information_security, assuming all the header fields are the same? If you can measure the length of a request, which will likely be submitted in a single TLS record, then you can probably tell these apart.
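As a rough illustration in Python, the sketch below builds two otherwise identical GET requests and compares their lengths. The header set and the helper name get_request are made up for the example. A real browser sends more headers, but as long as they are the same for both requests the difference in length is unchanged.

```python
def get_request(path: str, host: str = "en.wikipedia.org") -> bytes:
    """Build a plain GET request with a fixed, hypothetical set of headers."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: example-client\r\n"
        "Accept: */*\r\n"
        "\r\n"
    ).encode("ascii")


crypto = get_request("/wiki/Cryptography")
infosec = get_request("/wiki/Information_security")

# Encryption hides the bytes, but (without padding) not how many there are.
print(len(crypto), len(infosec), len(infosec) - len(crypto))
```
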

That doesn't help you to distinguish a request for the article on cryptography from the article on choreography, of course. It also doesn't help if the TLS client cleverly adds some padding, ignored by the server, to the TLS record to round it to a multiple of some block size. But English Wikipedia has a much longer article on cryptography than on choreography. So even if the endpoints pad their TLS records to the maximum 16384 bytes, you can probably distinguish the article on cryptography from the article on choreography.
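A small Python sketch of why padding helps with requests but not with responses. The request lengths and article sizes below are invented placeholders, not measurements, and the helper names are hypothetical. Only the 16384-byte maximum record size comes from the text above.

```python
import math

MAX_RECORD = 16384  # maximum TLS record payload size mentioned above


def padded_length(n: int, block: int) -> int:
    """Round n up to the next multiple of block."""
    return math.ceil(n / block) * block


def records_needed(n: int, record_size: int = MAX_RECORD) -> int:
    """Number of maximum-size records needed to carry n bytes."""
    return math.ceil(n / record_size)


# The two paths are the same length, so request size cannot separate them.
same_length = len("/wiki/Cryptography") == len("/wiki/Choreography")

# Padding also hides small differences between requests
# (70 and 78 are hypothetical request lengths).
hidden_by_padding = padded_length(70, 64) == padded_length(78, 64)

# Hypothetical response sizes in bytes, purely for illustration.
LONG_ARTICLE = 250_000
SHORT_ARTICLE = 40_000

# Even with every record padded to the maximum, the record counts differ.
print(same_length, hidden_by_padding)
print(records_needed(LONG_ARTICLE), records_needed(SHORT_ARTICLE))
```
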

There's a complication from your perspective as the attacker: the client may use the same TLS stream for many requests and many responses. But they will likely all be timed in a burst as the victim loads a single page with embedded CSS, images, JavaScript, etc., and then go silent as the victim reads the page. The timing and number of these requests provide another variable on which you can discriminate what page they were looking for.
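One simple way to recover page loads from a shared stream is to group observed records into bursts separated by quiet periods. The Python sketch below assumes you already have record timestamps in seconds. The timestamps, the one-second threshold, and the name split_bursts are all invented for illustration.

```python
def split_bursts(timestamps, max_gap: float = 1.0):
    """Group timestamps into bursts; a gap longer than max_gap starts a new one."""
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


# Hypothetical observation: a page load, 45 seconds of reading, another load.
observed = [0.00, 0.05, 0.11, 0.30, 0.42, 45.00, 45.08, 45.20]

bursts = split_bursts(observed)
for burst in bursts:
    print(f"burst at t={burst[0]:.2f}s with {len(burst)} records")
```
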

All these variables can be fed into a probabilistic model of pages—here's one example, lifted from the anonymity bibliography. Defeating that one example doesn't mean that the distribution of data an attacker on the network learns for one page is indistinguishable from another page, just that that particular distinguisher isn't as effective.
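To give a flavor of how such variables can be combined, here is a deliberately crude Python sketch. It is a nearest-neighbor match on two features, not a real probabilistic model and not the method from the linked example. The fingerprints and observations are invented numbers, and closest_page is a hypothetical helper.

```python
import math


def closest_page(observation, fingerprints):
    """Return the page whose (total bytes, request count) is nearest."""
    def distance(a, b):
        # Relative difference per feature, so bytes do not swamp the counts.
        return math.sqrt(sum(((x - y) / max(x, y, 1)) ** 2 for x, y in zip(a, b)))

    return min(fingerprints, key=lambda page: distance(observation, fingerprints[page]))


# Hypothetical fingerprints the attacker built by loading each page themselves.
fingerprints = {
    "Cryptography": (250_000, 30),
    "Choreography": (40_000, 12),
}

guess_a = closest_page((243_000, 28), fingerprints)
guess_b = closest_page((42_000, 11), fingerprints)
print(guess_a, guess_b)
```
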

So, are you, as the eavesdropper, guaranteed to be able to read the URL off the wire? No: it is encrypted in the TLS stream (unless the NULL cipher is chosen!), so at best you can infer it from other observable variables with probabilistic dependencies on it.

On the other hand, is the victim guaranteed that their URL is concealed from an eavesdropper? No: there are many variables dependent on the URL that an attacker may be able to infer juicy information about, like which sexually transmitted disease you're reading about at the Mayo Clinic.

(Note that anything in the fragment of a URL—the part after the # mark in https://en.wikipedia.org/wiki/Cryptography#Terminology—is not transmitted in the HTTP GET request at all, unless there is some script on the page that triggers different network traffic dependent on the URL fragment.)
