Giving malicious crawlers and scripts a hard time

You're hurting yourself.

The "attacker"/crawler... probably doesn't pay for their traffic or processing power (ie. they use a botnet, hijacked servers or at least are on a connection that doesn't make them pay for traffic), but you will be billed for traffic and CPU/storage/memory, or your server's hoster has a "fair usage" clause, under which your server's connection will be throttled or cut off if you serve Gigabytes of data on the short term, or your storage bandwidth will be reduced, or your CPU usage limited.

Also, why would anyone be foolish enough to download gigabytes of data when they're just looking for a specific page? Either they're only checking whether that page exists, in which case the page's size doesn't matter, or they will set both a timeout and a maximum response size: there's no point waiting seconds for one server to finish responding when you have hundreds of other servers to scan, especially since greylisting is a well-known technique for slowing down attackers.
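
To make that concrete, here is a rough Python sketch of how a scanner might cap both the wait and the download size. The probe() helper, the limits, and the use of the requests library are my own illustration, not something a particular scanner is known to do:

    import requests

    MAX_BYTES = 64 * 1024   # give up after 64 KB; existence of the page is all that matters
    TIMEOUT = 5             # seconds to wait for the connection and for each read

    def probe(url):
        """Return the status code, downloading at most MAX_BYTES of the body."""
        try:
            resp = requests.get(url, timeout=TIMEOUT, stream=True)
        except requests.RequestException:
            return None  # unreachable or too slow: move on to the next target
        received = 0
        for chunk in resp.iter_content(chunk_size=8192):
            received += len(chunk)
            if received >= MAX_BYTES:
                break  # stop reading; a huge response costs the scanner almost nothing
        resp.close()
        return resp.status_code

A server feeding such a client gigabytes of junk only hurts its own bandwidth bill.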


Consider that serving anything other than an HTTP 404 page for /administrator/index.php may get your server onto lists of potential targets, which means even more scans in the future. Crackers who pay for such lists don't need to scan millions of IPs themselves, so they can concentrate on you with much more sophisticated attacks than checking for the default admin page.

Unless your server is set up with the purpose of attracting malicious activity, looking like a potential victim will do you no good.


As already said, it's probably not worth it, but it is a very interesting topic to think about. There was a very good talk on this topic at DEF CON 21 called "Making Problems for Script Kiddies and Scanner Monkeys", which you can find here: https://www.youtube.com/watch?v=I3pNLB3Cq24

Several ideas are presented, and some are very simple and effective, like sending certain random HTTP response codes, which do not affect the end user but significantly slow down scanners. The talk is worth a look :)

Edit: This is brief summary of the talk for those who do not have time to watch it: Browsers interpret many HTTP responses the same way, independent of their response code. That is of course not the case for all response codes (like 302 redirects), but for example if your browser gets a 404 "not found" code, it will render the page the same way if it was a 200 "OK" code. But scanners/crawlers are different. They mostly depend on the returned response code. For example if they get a 404 response code, they conclude that the file does not exist, but if they get a 200 response code, they conclude that the file exist and will do some stuff with it (scan it, report it to the user, or something else).
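
In other words, the scanner's decision logic typically boils down to something like the following sketch (the wordlist and helper are hypothetical, just to show how much weight the status code carries):

    import requests

    # Hypothetical wordlist of paths a scanner might brute-force.
    WORDLIST = ["/admin/", "/backup.zip", "/wp-login.php", "/.git/config"]

    def scan(base_url):
        found = []
        for path in WORDLIST:
            try:
                status = requests.head(base_url + path, timeout=5).status_code
            except requests.RequestException:
                continue
            # The whole decision hinges on the status code alone:
            if status == 200:
                found.append(path)  # treated as "exists", will be probed further
            # 404 (or anything else) is silently treated as "does not exist"
        return found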

But what if you set your web server to send only 200 codes, even if the resource does not exist? Normal users probably won't notice, but scanners will get confused because every resource they try to access (for example by brute force) will be reported as existing. Or what if you return only 404 responses? Most scanners will think that none of the resources they are trying to access exist.
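
As a minimal sketch of the idea, here is a toy Python server that answers every request with 200. In practice you would configure your real web server to do this (and still serve genuine content for legitimate paths); this standalone example is only for illustration:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ConfuseScanners(BaseHTTPRequestHandler):
        def do_GET(self):
            # Answer every request with 200, whether the path exists or not.
            # A real deployment would serve real content for known paths and
            # only fall back to this behaviour for unknown ones.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><body>Nothing to see here.</body></html>")

        def log_message(self, *args):
            pass  # keep the demo quiet

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), ConfuseScanners).serve_forever()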

The talk addresses this and tests various response codes against various scanners, and most of them can easily be confused that way.

Edit 2: Another idea I just had. Instead of sending 10 GB of data to those you think are scanners, as you suggested, why not send an HTTP response with a Content-Length header of 10000000000 but put only a couple of bytes in the response body? Most clients will wait for the rest of the response until the connection times out, which would massively slow down scanners. But again, it's probably not worth it, and you would have to be sure to do it only to clients you have identified as scanners.
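
A bare-bones sketch of that trick in Python, using a raw socket so the fake Content-Length can be sent directly (the port, the advertised size, and the hold time are arbitrary; a real setup would also only do this for clients it has flagged as scanners, and would need to handle connections concurrently):

    import socket
    import time

    HOST, PORT = "0.0.0.0", 8080
    FAKE_LENGTH = 10_000_000_000  # advertised body size: ~10 GB

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(5)

    while True:
        conn, _ = srv.accept()
        try:
            conn.recv(4096)  # read (and ignore) the request
            conn.sendall(
                b"HTTP/1.1 200 OK\r\n"
                b"Content-Type: text/html\r\n"
                + f"Content-Length: {FAKE_LENGTH}\r\n\r\n".encode()
                + b"<html>"  # only a few bytes of the promised body
            )
            # Hold the connection open for a while; the client keeps waiting
            # for the remaining bytes until its own timeout fires.
            # (This toy server handles one connection at a time.)
            time.sleep(30)
        finally:
            conn.close()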