Control over the Internet Archive besides just "Disallow /"?

Note: This answer is increasingly out-of-date.

The largest contributor to the Internet Archive's web collection has been Alexa Internet. Material that Alexa crawls for its own purposes is donated to the IA a few months later. Adding the disallow rule mentioned in the question does not affect those crawls, but the Wayback Machine will retroactively honor it by denying access; the material will still be in the archive. If you really want to keep your material out of the Internet Archive, you should exclude Alexa's robot.
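Alexa's crawler has historically identified itself with the ia_archiver user agent, so a minimal exclusion in robots.txt would look something like this (assuming the bot still honors that token):

User-agent: ia_archiver
Disallow: /

This blocks that one agent entirely, rather than relying on the Wayback Machine's retroactive handling.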

There may be ways to affect Alexa's crawls, but I'm not familiar with them.

Since IA developed its own crawler (Heritrix), it has started doing its own crawls, but those tend to be targeted: election crawls for the Library of Congress, national crawls for France and Australia, and so on. It does not engage in the kind of sustained, world-scale crawls that Google and Alexa conduct. IA's largest crawl was a special project to crawl 2 billion pages.

As these crawls run on schedules driven by project-specific factors, you cannot affect how often, or whether, they visit your site.

The only way to directly affect how and when IA crawls your site is to use their Archive-It service, which lets you specify custom crawls. The resulting data will (eventually) be incorporated into IA's web collection. This is, however, a paid subscription service.


Most search engines support the "Crawl-delay" directive, but I don't know if IA does. You could try it though:

User-agent: ia_archiver
Crawl-delay: 3600
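As an aside, you can check how a standards-following parser reads such a file with Python's urllib.robotparser (whether a given crawler actually honors Crawl-delay is still up to that crawler):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from above, parsed directly from a string.
robots_txt = """\
User-agent: ia_archiver
Crawl-delay: 3600
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Returns the parsed delay (in seconds) for the matching user agent.
print(rp.crawl_delay("ia_archiver"))  # -> 3600
```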

This would require a delay of at least 3600 seconds (i.e. one hour) between requests, which works out to at most 24 requests per day, or roughly 720 requests per month.
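A quick sanity check on that arithmetic:

```python
crawl_delay = 3600                    # seconds between requests
seconds_per_day = 24 * 60 * 60        # 86400
requests_per_day = seconds_per_day // crawl_delay   # 24
requests_per_month = requests_per_day * 30          # 720
print(requests_per_day, requests_per_month)
```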

I don't think #2 is possible: the IA bot grabs assets as and when it sees fit. It may have a file size limit to avoid using too much storage.