What user agent should I set?

I'm the primary designer and author of a fairly large-scale web crawler (see metadatalabs.com/mlbot (archived link)). What you're asking touches on a topic that's very important to us--perhaps the most important part of running a crawler: that of politeness.

First: the "Mozilla" token is there to tell the site what your browser's capabilities are. If your bot isn't trying to act like a browser, there's no particular reason to include it.

As for your user agent string and other politeness-related items:

  1. Select a name that you know nobody else is using. I suspect that if you use "Goofybot", you'll be fine. But I'd check it out to be sure.

  2. Your user agent string should include a link to more information about the bot. For example, our string reads "MLBot (www.metadatalabs.com/mlbot)".

  3. Make sure that if somebody searches for "Goofybot", your page about the bot ranks high (preferably first) in the search results.

  4. Your page about the bot should say what you're using the information for, what IP addresses you crawl from, and include a way for people to contact you about problems with the bot.

  5. You should respond to any questions or complaints quickly, using the "customer is always right" philosophy. Remember, if your bot caused a problem that this person is complaining about then it probably caused problems on a dozen other sites that nobody complained about. They either didn't see the problems or they just put a block on your IP address.

  6. You should build in the facility to prevent your bot from accessing a particular domain name. Some people won't want you to crawl at all and don't have the access or technical ability to create a robots.txt or add a block in .htaccess. We found that this ability lets us tell somebody, "We're sorry MLBot caused a problem. We have instructed it never to crawl your site again." Perhaps not surprisingly, that calms people down very quickly.

  7. If you don't already respect robots.txt, do it. Nothing will get you a bad reputation faster than ignoring robots.txt.
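
Items 6 and 7 can be combined into one check that runs before every fetch. Below is a minimal sketch using Python's standard urllib.robotparser; it is not how MLBot does it. The bot name, the blocked host, and the helper name may_fetch are all made up for illustration, and the robots.txt rules are parsed from in-memory lines so the example runs without touching the network. A real crawler would fetch and cache one robots.txt per host.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical bot name, used only for illustration.
BOT_NAME = "Goofybot"

# Hosts whose owners asked never to be crawled (item 6).
# In practice this would be loaded from a file or database.
BLOCKED_HOSTS = {"complained.example.com"}


def may_fetch(url, robots, blocked_hosts=BLOCKED_HOSTS, agent=BOT_NAME):
    """Return True only if the host is not blocked and robots.txt allows the URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host in blocked_hosts:
        # The site owner opted out; skip without even consulting robots.txt.
        return False
    return robots.can_fetch(agent, url)


# Rules that would normally come from http://<host>/robots.txt.
robots = RobotFileParser()
robots.parse([
    "User-agent: Goofybot",
    "Disallow: /private/",
])

print(may_fetch("http://example.com/index.html", robots))
print(may_fetch("http://example.com/private/report.html", robots))
print(may_fetch("http://complained.example.com/index.html", robots))
```

Checking the blocklist first means a site that asked to be left alone is never contacted at all, not even to fetch its robots.txt.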

Wow. That went on longer than I expected. In the past four years, I've made every one of those mistakes I allude to above, and others besides. We found, however, that if we're open about what we're doing and communicate honestly (including posting information about mistakes before we get complaints), the majority of Webmasters view us as a good Internet citizen.


Mozilla/2.0 and Mozilla/5.0 are both references to the Mozilla browser. The token has become largely meaningless, since many crawlers use it, but it should tell the site to treat your crawler as it would any random user browsing with a regular browser.

It is, however, good etiquette to include a URL in the section that follows the Mozilla token, linking to a page about who you are and why you are crawling. Ask Jeeves can get away with just the name, but you should include a URL.

E.g.

Mozilla/5.0 (compatible; http://example.org/)

This will allow web admins to figure out why you are crawling their site and also to contact you if there is a problem with how your crawler is behaving.
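
For what it's worth, here is a rough sketch of setting such a string on outgoing requests with Python's standard urllib. The bot name, the info URL, and the helper names build_user_agent and make_request are placeholders I made up. Putting the bot name inside the parentheses next to the URL is my own choice, not something the example above requires. Nothing is actually sent over the network here; the request object is only constructed.

```python
from urllib.request import Request

# Placeholder values; substitute your own bot name and info page.
BOT_NAME = "Goofybot"
INFO_URL = "http://example.org/goofybot"


def build_user_agent(name, info_url, mozilla_prefix=False):
    """Build a user agent string that names the bot and links to its info page."""
    if mozilla_prefix:
        # Browser-style form, for bots that want to be treated like a browser.
        return f"Mozilla/5.0 (compatible; {name}; {info_url})"
    # Plain form: bot name followed by the info URL in parentheses.
    return f"{name} ({info_url})"


def make_request(url, user_agent):
    """Create a request carrying the given User-Agent header (not yet sent)."""
    return Request(url, headers={"User-Agent": user_agent})


ua = build_user_agent(BOT_NAME, INFO_URL)
req = make_request("http://example.com/", ua)
print(req.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would send it with that header; any HTTP client that lets you set headers works the same way.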


I think the following links can help:

  • http://www.user-agents.org/
  • http://en.wikipedia.org/wiki/User_agent#Format
  • http://tools.ietf.org/html/rfc1945#section-10.15