By richard [at] richardwooding [dot] com. How to scrape a website with Google IPs. Feel free to comment at the end of this page. Source code. (This is not spoofing: the infrastructure running the code is owned by Google. I will add the ability to change the user agent later.)
This was a test related to this blog post: http://blog.mocality.co.ke/2012/01/13/google-what-were-you-thinking
Mocality posted this: blog.mocality.co.ke/2012/01/13/goo… & Google just responded: plus.google.com/11526406426894… "Mortified" is a good description; not a good day.
— Matt Cutts (@mattcutts) January 13, 2012
Please note: I definitely believe Google was involved, since the telephone callers were selling Google products and Google appears to have admitted wrongdoing, judging by Matt Cutts' tweet and Nelson Mattos' Google+ update, which are the responses I know of. (If anybody has seen a more formal admission, please add it to the comments below.) If anything, it shows that Google might be "eating their own dog food" and using Google App Engine in their own projects.
This application is hosted on Google App Engine infrastructure. I hacked it together on Friday (13/1/2012) evening and Saturday morning. It simply sends an HTTP request to a URL. Please don't abuse it.
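For readers who want to picture what "sends an HTTP request to a URL" might look like, here is a minimal sketch using only the Python standard library. It is not the code running on this page: the names `build_request`, `fetch` and `DEFAULT_USER_AGENT` are hypothetical, and code running on App Engine would go through Google's own fetch service rather than a plain socket.

```python
import urllib.request

# Hypothetical default; this is NOT the user agent App Engine sends.
DEFAULT_USER_AGENT = "example-fetcher/0.1"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a GET request for url with the given User-Agent header."""
    # Refuse anything that is not plain HTTP(S), e.g. file:// or ftp://.
    if not url.startswith(("http://", "https://")):
        raise ValueError("only http and https URLs are supported")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Send the request and return (status code, response body as bytes)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Calling `fetch("http://your-own-site.example/")` against a server you control should leave one line in that server's access log, with the source address of whichever machine ran the code.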
If you point it at a URL you own and then check your logs, you will see that the request is recorded as originating from an IP owned by Google. You can use http://www.ipchecking.com/ to do reverse checks.
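If you would rather do the reverse check yourself than use a website, a reverse DNS (PTR) lookup is one option. The sketch below is only an illustration: `reverse_lookup` is a hypothetical helper, and a PTR record is set by whoever controls the address block, so treat it as a hint and confirm ownership with a WHOIS query if it matters.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the reverse-DNS hostname for ip, or None if the lookup fails.

    The resolver argument exists so the function can be exercised
    without network access; by default it uses the system resolver.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname
```

For example, `reverse_lookup("74.125.64.88")` asks your resolver for the hostname of the first address in my log. What comes back depends on the DNS records at the time you run it.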
The IPs I am using are not spoofed: the infrastructure running my code is owned by Google. Because Google hosts a cloud service, Google IPs can be used to scrape websites. On request I will send you the source code.
Please note: I now think this is a more plausible way of sending a request from a Google IP than using Google's search cache or their translation service. The requesting IPs jump around a bit, but a real scraper would be a long-running process, so it would be more likely to stay consistently on one IP.
My access log for a few attempts showed the following (the IPs are coincidentally in the same range as in the blog post):
74.125.64.88 - - [13/Jan/2012:23:16:34 +0100] "GET / HTTP/1.1" 200 1395 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~httprequesttest)"
74.125.64.85 - - [14/Jan/2012:00:01:18 +0100] "GET /major.html HTTP/1.1" 200 1395 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~httprequesttest)"
74.125.156.95 - - [14/Jan/2012:00:01:46 +0100] "GET /nice.html HTTP/1.1" 200 1395 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~httprequesttest)"
74.125.156.86 - - [14/Jan/2012:00:02:04 +0100] "GET /thisisnotaurl.html HTTP/1.1" 200 1395 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: s~httprequesttest)"
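If you want to pull such entries out of your own log, a small parser helps. This is a rough sketch that assumes your server writes the same combined log format as the entries above. `LOG_PATTERN`, `APPID_PATTERN` and `parse_line` are names made up for this example, and the regular expression is only as strict as these sample lines require.

```python
import re

# One entry in the combined log format, as in the sample lines above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# App Engine's default user agent carries the application id.
APPID_PATTERN = re.compile(r'appid: (?P<appid>[^)\s]+)')


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    appid = APPID_PATTERN.search(entry["user_agent"])
    entry["appid"] = appid.group("appid") if appid else None
    return entry
```

Filtering on `entry["appid"]` only catches requests that still carry the default user agent, so the source IP remains the more reliable thing to look at.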
Please note: The User-Agent shown above is just the default and can be changed.
My blog: http://richardwooding.com