Is this Google proxy a fake crawler: google-proxy-66-249-81-131.google.com?

These are not fake. They are private proxies used by staff members for various manual tasks, audits, and reviews, and should not be blocked...


I have also found that a Google proxy accessed my website many times (30+) within the very same second:

66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329
66.249.81.106 - - [30/Aug/2013:01:26:35 +0200] "GET /index.php HTTP/1.1" 200 280329

...

and this was raising my server load. This was strange, because I had set the following in robots.txt:

Crawl-delay: 1

(this asks a crawler to access the site at most about once per second. Note that Googlebot itself does not support the Crawl-delay directive, so it would not have limited Google in any case.)

So I wrote a PHP script to block any IP address (Google's or otherwise) that makes more than 30 requests in one second, but I discovered something different. I used this code to determine the visitor's IP address:

function get_visitor_ip_address($server)
{
    // $server is the author's own wrapper object (class not shown); the commented-out
    // lines show the plain $_SERVER equivalent of each call.
    // Caution: every key except REMOTE_ADDR comes from a request header that a client can forge.
    foreach (array('HTTP_CLIENT_IP', 'HTTP_X_FORWARDED_FOR', 'HTTP_X_FORWARDED', 'HTTP_X_CLUSTER_CLIENT_IP', 'HTTP_FORWARDED_FOR', 'HTTP_FORWARDED', 'REMOTE_ADDR') as $key)
    {
        //if (array_key_exists($key, $_SERVER) === true)
        if ($server->testIp($key))
        {
            //foreach (explode(',', $_SERVER[$key]) as $ip)
            foreach (explode(',', $server->getEscaped($key)) as $ip)
            {
                $ip = trim($ip); // just to be safe
                // Without an IPV4/IPV6 flag, FILTER_VALIDATE_IP accepts both families;
                // private and reserved ranges are rejected.
                if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE) !== false) return $ip;
            }
        }
    }
    return null; // no public address found in any of the keys
}

but this code returned a different IP address (usually from the Middle East, Africa, or similar locations, e.g. 197.132.255.244). This is from my PHP logs:

IP address 197.132.255.244 banned at 2013-08-30 01:26:35 for the 1. time exceeding 30 visits in a second, banned for 30 minutes
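The rule visible in that log line (more than 30 visits within one second, then a 30-minute ban) could be sketched roughly as follows. This is not my actual script, only a minimal illustration: the class name `VisitLimiter` and its methods are hypothetical, and the state is kept in a plain array, whereas a real site would need storage shared between requests (a database, APCu, or similar).

```php
<?php
// Minimal per-IP limiter sketch (hypothetical names, in-memory state only).
// More than $maxHits requests in the same one-second window triggers a ban.
final class VisitLimiter
{
    /** @var array<string, array{second:int, hits:int, bannedUntil:int}> */
    private $state = [];
    private $maxHits;
    private $banSeconds;

    public function __construct(int $maxHits = 30, int $banSeconds = 1800)
    {
        $this->maxHits = $maxHits;
        $this->banSeconds = $banSeconds;
    }

    /** Returns true if the request may proceed, false if the IP is banned. */
    public function allow(string $ip, int $now): bool
    {
        $s = $this->state[$ip] ?? ['second' => $now, 'hits' => 0, 'bannedUntil' => 0];

        if ($now < $s['bannedUntil']) {
            return false; // still inside an earlier ban
        }
        if ($s['second'] !== $now) {
            $s['second'] = $now; // a new one-second window starts
            $s['hits'] = 0;
        }
        $s['hits']++;
        $allowed = $s['hits'] <= $this->maxHits;
        if (!$allowed) {
            $s['bannedUntil'] = $now + $this->banSeconds; // start the ban
        }
        $this->state[$ip] = $s;
        return $allowed;
    }
}
```

A call would look like `$limiter->allow($ip, time())`, where `$ip` is whatever address you decide to count against.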

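Interestingly, my Apache server recorded the Google proxy IP address in its access log, not 197.132.255.244 (see the Apache log lines at the beginning: same date and time). That is expected, because the access log records the address of the directly connecting peer, which here is the proxy. I tested this several times.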

> > >

My PHP script, on the other hand, looks for the IP address in several places. Notice the different server parameters in the PHP code:

'HTTP_CLIENT_IP', 'HTTP_X_FORWARDED_FOR', 'HTTP_X_FORWARDED', 'HTTP_X_CLUSTER_CLIENT_IP', 'HTTP_FORWARDED_FOR', 'HTTP_FORWARDED', 'REMOTE_ADDR'

This finds and logs the "correct" IP address, 197.132.255.244 (tested several times with various attackers).

http://whois.domaintools.com/197.132.255.244
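One caveat: all of those parameters except REMOTE_ADDR are request headers, so any client can forge them. A more careful variant would trust a forwarded address only when the connecting peer really is a Google host, checked with a reverse DNS lookup followed by a forward lookup. The sketch below is an illustration under assumptions, not what I run: the function names are hypothetical, it assumes the original address arrives in X-Forwarded-For (I did not record which header matched), and `gethostbynamel()` only returns IPv4 addresses. The two optional resolver parameters exist only so the logic can be exercised without network access.

```php
<?php
// Forward-confirmed reverse DNS: the PTR name must end in .google.com or
// .googlebot.com and must resolve back to the same IP address.
function is_google_host(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';  // real lookup by default
    $forward = $forward ?? 'gethostbynamel'; // IPv4 only

    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false; // gethostbyaddr() returns the IP unchanged when there is no PTR
    }
    if (!preg_match('/\.(google|googlebot)\.com$/i', $host)) {
        return false;
    }
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}

// Use the forwarded address only when the peer is a verified Google host;
// otherwise fall back to REMOTE_ADDR, which cannot be set by a header.
function effective_client_ip(array $server, ?callable $reverse = null, ?callable $forward = null): string
{
    $peer = $server['REMOTE_ADDR'] ?? '';
    $xff = $server['HTTP_X_FORWARDED_FOR'] ?? '';
    if ($xff !== '' && is_google_host($peer, $reverse, $forward)) {
        foreach (explode(',', $xff) as $candidate) {
            $candidate = trim($candidate);
            if (filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE) !== false) {
                return $candidate;
            }
        }
    }
    return $peer;
}
```

In a page script this would be called as `effective_client_ip($_SERVER)`. DNS lookups are slow, so the result per peer address should be cached.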

> > >

My conclusion:

I think some people use Google services (such as Google Translate or Google Mobile) to access blocked websites (in schools, etc.), but also for DoS attacks and similar activity. How?

This way:

http://www.gmodules.com/ig/proxy?url=http://www.yoursite.com
http://www.google.com/translate?langpair=de|en&u=www.yoursite.com 

(replace www.yoursite.com with your own website)

or in other ways:

http://www.tech-recipes.com/rx/1322/use_google_proxy_bypass_blocked_site/

I think it is up to you. You can find and block the original IP address (197.132.255.244) with the help of the PHP function above, which works even when the attacker is using a Google proxy, and show that visitor a short message such as "you have exceeded our limits" or an empty/error page, as I do...

Or you can block the Google proxy IP (66.249.81.106 or similar), for example directly in the .htaccess file, if the proxy exceeds your allowed limits. This will not block the Google crawler, but it may break things for real visitors (not attackers) who want to translate your page, etc.
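For the first option, the short message can be sent with a proper status code so that well-behaved clients know to back off. The following is only a sketch: the function names are hypothetical, and the choice of status 429 (Too Many Requests) with a Retry-After header is my assumption about what fits here, not something my script necessarily does.

```php
<?php
// Describe the response for a visitor who exceeded the limit.
// Kept separate from sending so it can be inspected without output.
function limit_exceeded_response(int $retryAfterSeconds): array
{
    return [
        'status'  => 429, // Too Many Requests
        'headers' => [
            'Retry-After'  => (string) $retryAfterSeconds,
            'Content-Type' => 'text/plain; charset=utf-8',
        ],
        'body'    => "You have exceeded our limits.\n",
    ];
}

// Send the response described above; must run before any other output.
function send_response(array $response): void
{
    http_response_code($response['status']);
    foreach ($response['headers'] as $name => $value) {
        header($name . ': ' . $value);
    }
    echo $response['body'];
}
```

For a 30-minute ban, the script would call `send_response(limit_exceeded_response(1800));` and then `exit;`.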