Massive 404 attack with non-existent URLs. How to prevent this?
I often see cases where another site links to tons of pages on my site that don't exist. Even if you click through to that page and can't find the link:
- The site might previously have had those links
- The site may be cloaking and serving those links only to Googlebot and not to visitors
It is a waste of resources, but it won't confuse Google and it won't hurt your rankings. Here is what Google's John Mueller (who works on Webmaster Tools and Sitemaps) has to say about 404 errors that appear in Webmaster tools:
HELP! MY SITE HAS 939 CRAWL ERRORS!!1
I see this kind of question several times a week; you’re not alone - many websites have crawl errors.
- 404 errors on invalid URLs do not harm your site’s indexing or ranking in any way. It doesn’t matter if there are 100 or 10 million, they won’t harm your site’s ranking. http://googlewebmastercentral.blogspot.ch/2011/05/do-404s-hurt-my-site.html
- In some cases, crawl errors may come from a legitimate structural issue within your website or CMS. How can you tell? Double-check the origin of the crawl error. If there's a broken link on your site, in your page's static HTML, then that's always worth fixing. (thanks +Martino Mosna)
- You don’t need to fix crawl errors in Webmaster Tools. The “mark as fixed” feature is only to help you, if you want to keep track of your progress there; it does not change anything in our web-search pipeline, so feel free to ignore it if you don’t need it. http://support.google.com/webmasters/bin/answer.py?answer=2467403
- We list crawl errors in Webmaster Tools by priority, which is based on several factors. If the first page of crawl errors is clearly irrelevant, you probably won’t find important crawl errors on further pages. http://googlewebmastercentral.blogspot.ch/2012/03/crawl-errors-next-generation.html
- There’s no need to “fix” crawl errors on your website. Finding 404’s is normal and expected of a healthy, well-configured website. If you have an equivalent new URL, then redirecting to it is a good practice. Otherwise, you should not create fake content, you should not redirect to your homepage, you shouldn’t robots.txt disallow those URLs -- all of these things make it harder for us to recognize your site’s structure and process it properly. We call these “soft 404” errors. http://support.google.com/webmasters/bin/answer.py?answer=181708
- Obviously - if these crawl errors are showing up for URLs that you care about, perhaps URLs in your Sitemap file, then that’s something you should take action on immediately. If Googlebot can’t crawl your important URLs, then they may get dropped from our search results, and users might not be able to access them either.
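The "redirect to an equivalent new URL" advice above can be sketched as an Apache config fragment (mod_alias assumed; both paths are made up for illustration):

```apacheconf
# If a removed URL has a real replacement, a permanent (301) redirect
# passes visitors and crawlers to it:
Redirect permanent /old-article /new-article

# If there is no replacement, do nothing: the server's normal 404
# response is exactly what a crawler expects for a gone page. Redirecting
# everything to the homepage instead produces the "soft 404" errors
# described above.
```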
There are tons of scripts out there that optimistically scan random IP addresses on the internet to find vulnerabilities known in various kinds of software. 99.99% of the time, they find nothing (like on your site), and 0.01% of the time, the script will pwn the machine and do whatever the script controller wants. Typically, these scripts are run by anonymous botnets from machines that have previously been pwned, not from the actual machine of the original script kiddie.
What should you do?
- Make sure that your site is not vulnerable. This requires constant vigilance.
- If this generates so much load that normal site performance is impacted, add an IP-based blocking rule to stop accepting connections from the offending addresses.
- Learn to filter out scans for CMD.EXE or cPanel or phpMyAdmin or tons of other vulnerabilities when looking through your server logs.
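The last two bullets can be sketched with standard shell tools. This is a minimal sketch, assuming an Apache-style common log format; the sample log lines, the scan-pattern list, and the request threshold are all assumptions you would tune for a real server:

```shell
#!/bin/sh
# Sketch: (1) filter vulnerability-scan noise out of an access log, and
# (2) count requests per client IP to find candidates for an IP block.
# sample_access.log stands in for your real log; the IPs are from the
# documentation ranges (203.0.113.0/24, 198.51.100.0/24).
LOG=sample_access.log
cat > "$LOG" <<'EOF'
203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /phpMyAdmin/index.php HTTP/1.1" 404 196
203.0.113.7 - - [10/Oct/2023:13:55:37 +0000] "GET /cgi-bin/cmd.exe HTTP/1.1" 404 196
198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 512
EOF

# 1) The log with common scan probes (cmd.exe, phpMyAdmin, cPanel, ...)
#    filtered out -- what's left is the traffic worth reading.
grep -viE 'cmd\.exe|phpmyadmin|cpanel|wp-login|\.env' "$LOG"

# 2) Requests per client IP (field 1 of common log format); any IP over
#    the limit is printed as an nginx-style "deny" line. limit=1 only
#    makes sense for this tiny sample; use a real threshold in practice.
awk -v limit=1 '{count[$1]++}
     END {for (ip in count) if (count[ip] > limit) print "deny " ip ";"}' "$LOG"
```

The scanner noise and the legitimate hit separate cleanly here; on a real log the pattern list grows over time, which is exactly why blocking every scanner is not worth chasing.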
You seem to believe that any 404 returned from your server to anyone will impact what Google thinks about your site. This is not true. Only 404s returned by Google crawlers, and perhaps Chrome users, will affect your site. As long as all links on your site are proper links, and you don't invalidate links you have previously exposed to the world, you will not see any impact. The script bots don't talk to Google in any way.
If you are getting attacked in a real way, you will need to sign up for some kind of DoS mitigation provider service. Verisign, Neustar, CloudFlare, and Prolexic are all vendors that have various kinds of plans for various kinds of attacks -- from simple web proxying (which may even be free from some providers), to DNS-based on-demand filtering, to full BGP-based point-of-presence swings that send all of your traffic through "scrubbing" data centers with rules that mitigate attacks.
But from what you're saying, it sounds like you're just seeing the normal vulnerability scripts that any IP on the Internet will see if it's listening on port 80. You can literally put up a new machine, start an empty Apache, and within a few hours, you'll start seeing those lines in the access log.
This probably isn't actually an attack but a scan or probe.
Depending on the scanner/prober, it might be benign (just looking for issues in some kind of research capacity), or it could be set up to attack automatically if it finds an opening.
Web browsers send valid referrer information, but other programs can make up whatever referrer they like.
The referrer is simply a piece of information that is optionally provided by programs accessing your web site. It can be anything they choose to set it to, such as random.yu. It can even be a real web site that they just selected.
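The referrer is literally one line of text in the request the client composes, and nothing validates it. A minimal sketch (the path, host, and the random.yu referrer are all made up for illustration):

```shell
#!/bin/sh
# Print the raw HTTP request a scanner might send. The Referer header is
# whatever string the client feels like using; a server will log it
# exactly as received, indistinguishable from a browser-sent referrer.
printf 'GET /some/page HTTP/1.1\r\nHost: www.example.com\r\nReferer: http://random.yu/\r\nConnection: close\r\n\r\n'
```

Piped into something like `nc www.example.com 80`, that request would show up in the server's access log with random.yu as the referrer.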
You can't really fix this or prevent it. If you tried to block every request of this type, you would end up having to maintain a very large list, and it isn't worth it.
As long as your host keeps up with patches and vulnerability fixes, this should not cause you any actual problems.