Dynamically blocking excessive HTTP bandwidth use?

Solution 1:

If your PIX is running version 7.2 or later of the OS, or can be upgraded to it, then you can implement QoS policies at the firewall level. In particular, this allows you to shape traffic and should let you limit the bandwidth used by bots. Cisco has a good guide to this here.
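As a rough sketch, a rate-limiting policy using the Modular Policy Framework on PIX 7.2+ might look something like the following. The ACL, class, and policy names, along with the 1,000,000 bps rate and 37,500-byte burst, are placeholder values to tune for your environment:

! Police outbound HTTP to roughly 1 Mbps (placeholder figures)
access-list HTTP_LIMIT extended permit tcp any any eq www
class-map http-class
 match access-list HTTP_LIMIT
policy-map http-policy
 class http-class
  police output 1000000 37500
service-policy http-policy interface outside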

Solution 2:

You can configure how frequently Google's bot crawls your site - have a look at Google Webmaster Tools. I'm not sure if Yahoo has anything similar. At any rate, that could reduce your crawler traffic by up to 50%.

Alternatively, some web servers can limit traffic per connection, so you can try that. I'd personally stay away from hardware solutions, since they're most likely going to cost more.
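For example, if you happen to be running Apache 2.4, mod_ratelimit can cap the bandwidth of each response. The location and the 400 KiB/s figure below are just placeholders:

# Sketch: limit each response served under / to ~400 KiB/s
<IfModule mod_ratelimit.c>
    <Location "/">
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 400
    </Location>
</IfModule>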


Solution 3:

To reduce the crawling load, you can set a Crawl-delay in robots.txt - this only works with Microsoft and Yahoo. For Google, you'll need to specify a slower crawl rate through their Webmaster Tools (http://www.google.com/webmasters/).

Be VERY careful when implementing this because if you slow down the crawl too much, robots won’t be able to get to all of your site, and you may lose pages from the index.

Here are some examples (these go in your robots.txt file):

# Yahoo's Slurp Robot - Please wait 7 seconds in between visits
User-agent: slurp
Crawl-delay: 7

# MSN Robot - Please wait 5 seconds in between visits
User-agent: msnbot
Crawl-delay: 5

Slightly off-topic, but you can also specify a Sitemap or Sitemap index file.

If you’d like to provide search engines with a comprehensive list of your best URLs, you can also provide one or more Sitemap autodiscovery directives. Please note that user-agent does not apply to this directive, so you cannot use this to specify a sitemap to some but not all search engines.

# Please read my sitemap and index everything!

Sitemap: http://yourdomain.com/sitemap.axd

Solution 4:

We use a WatchGuard firewall (ours is an X1000, which is end-of-life now). They have many features revolving around blocking domains or IPs that show up time and time again or use an excessive amount of bandwidth.

This would need some tweaking, because you obviously would not want to block Jon Skeet on Stack Overflow :)
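One hedged aid for that tweaking: before blacklisting an IP that looks like a greedy crawler, you can check whether it really belongs to a search engine. Google, for instance, documents that genuine Googlebot addresses reverse-resolve to a googlebot.com name, which in turn resolves back to the same IP (the address below is only an illustrative example):

host 66.249.66.1
# 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
host crawl-66-249-66-1.googlebot.com
# crawl-66-249-66-1.googlebot.com has address 66.249.66.1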


Solution 5:

I'd recommend Microsoft ISA Server 2006. Specifically for this requirement, it limits each IP to 600 HTTP requests per minute by default, and you can apply an exception for Jon Skeet (sorry, I realise that "joke" has been made already!).

You get the additional benefits of application-level filtering, the ability to load-balance across multiple web servers (instead of using NLB on those servers), VPN termination, etc. There are a number of commercial extensions available, and you can even write your own ISAPI filter if you're feeling brave.

It's obviously not open source, but it has benefits for a Windows shop and runs on commodity hardware.