How do CDN (Content Distribution Networks) work?

You don't simply host the whole site with the CDN, just your content.

I just realized I answered a similar question a while back: What does akamaihd.net do?

Data request flowImage by WikiMedia

So your site references http://akamai/myfile.ext. This will request myfile.ext from akamai. akamai can then send an HTTP redirect to the actual content server.

Now, when that last step is cached, great, all future requests will go to the closest content server.

How does that work?

Let's assume this website:

<html>
  <body>
    <img src="http://cdn/oliver.png" />
  </body>
</html>

I request this website from my own webserver. The .html file is not hosted with cdn. Neither is the DNS of my webserver.

Initial request

So my browser got that HTML file and now parses it. It finds the referenced image and notes that it is located at http://cdn/oliver.png. It requests that file.

To do that, it need to find the IP address of cdn. In our example, that IP address is 10.10.10.10.

With that IP address, it can connect to the cdn server and request /oliver.png.

Geo Location

Now cdn realizes, "that guy is from Germany!". So instead of sending me my awesome picture that I wanted, it sends me an HTTP redirect saying:

/oliver.png is not here. It's at 10.10.33.33/oliver.png

So my browser will ask 10.10.33.33 (which is hopefully closer to me) for the picture.

Seriously?

I'm not saying this is how ALL CDNs work, but it would be one approach.

You could also implement a DNS daemon that returns different results for a name lookup depending on the location of whoever sent the query.
But I doubt that this is done in practice. But maybe I just can't imagine how to properly set that up. See fluffy's answer for how that could work.

Who runs CDNs?

Most global players have their own content delivery network in a way (or so I would assume). Some providers just offload certain services to larger CDNs (like Microsoft does with MSDN downloads). And this might somehow touch on your second subject.

Consider this, in the MSDN Microsoft offers product downloads. These downloads are then provided by Akamai. If you can determine the URL of that download, you can just download the product without ever getting in touch with Microsoft.

Is that a security issue? Not really, because what is being downloaded is still protected (by a product key).

But how about other data?

If your data is security relevant, then it isn't CDN material. If you don't want something to be available as widely as possible, don't put it in a CDN.


A pretty common approach to CDN is to use what's known as "anycast." How this works is that your distributed servers are colocated with DNSes that respond with that server as the destination; for example, you might have three servers in different hosting facilities, and their respective DNSes all claim their IP address to be the canonical one for your server (call it, say, content.example.com). The DNSes are each configured to have the same global IP address, and then each of the servers' facilities use BGP updates to make it so that the route to the closest server wins - so when you do a name lookup on content.example.com, the fastest/closest/most available DNS responds to the request with its HTTP server.

In this way, no GeoIP tricks are necessary, and you're always being served content by whichever server is fastest for you - which may or may not have anything to do with its physical location, due to the heterogenous nature of the Internet.

It is my understanding that Akamai at least partially works in this way.


Also available are Origin Pull type CDNs.

Amazon Cloudfront is able to use this technique.

You set up a CNAME like media.example.com that points to their assigned server name and leave all your content on your server. For images and content you want delivered over the CDN, you use media.example.com in the URL. The request goes to their server network and if the content is not available, their servers pull the content from your server. Once in the system, the content is distributed to server farms closest to where the demand exists and remains there for the assigned TTL. Your server no longer sees any traffic on the cached content until the TTL expires and Cloudfront has to refresh it.