Today I talked with a workmate about Google's and Yahoo's different behaviour regarding robots.txt. It seems that when Yahoo's Inktomi starts to crawl from a datacenter, it first fetches the robots.txt. Google seems to have this data available in all datacenters and probably bases its refetching on time, or maybe on new URLs? This is pure guessing of course.
One interesting thing I discovered was when I had a somewhat screwed-up rewrite rule on my main site. I redirected every incoming request that did not match an existing local file to my index.php. The problem was that I redirected really everything, without checking the extension. And I didn't have a robots.txt (even though, as I was told, you should always have one, since the search engines like to find one, even if it's empty).
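For illustration, the broken setup probably looked something like this in mod_rewrite terms (a hypothetical .htaccess sketch, not my actual config). The fix is simply an extra condition that lets robots.txt pass through to the filesystem:

```apache
# Broken version: EVERY request that is not an existing file
# gets rewritten to index.php -- including /robots.txt.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^.*$ index.php [L]

# Safer version: exclude robots.txt explicitly, so a crawler
# gets either the real file or a plain 404 instead of HTML.
# RewriteCond %{REQUEST_URI} !^/robots\.txt$
# RewriteCond %{REQUEST_FILENAME} !-f
# RewriteRule ^.*$ index.php [L]
```

With the original rules, /robots.txt never exists on disk, so the `!-f` condition matches and the index page is served in its place.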
So Googlebot visited me and requested the robots.txt, which was served as the index page, i.e. HTML. This seemed to confuse Google: it indexed some pages, but not many (mainly from the image gallery), and it requested the robots.txt over and over again! When I discovered that in the stats, I just added a blank robots.txt, and from then on I got almost no additional requests for it from Google, but many hits on the normal pages. I don't have good stats to show this, as the change was in the middle of the month, but I will set up a test domain and play a bit with Google; this seems quite interesting.
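What an HTML page served as robots.txt looks like to a strict parser can be sketched with Python's urllib.robotparser (this is my assumption about how such a parser reacts, not necessarily what Googlebot does internally): none of the HTML lines form a valid User-agent/Disallow record, so the parser ends up with no rules at all and falls back to allowing everything.

```python
from urllib.robotparser import RobotFileParser

# What a crawler got from my site when asking for /robots.txt:
# the index page, i.e. HTML instead of robots.txt records.
html_response = """<html>
<head><title>My site</title></head>
<body><h1>Welcome</h1></body>
</html>""".splitlines()

rp = RobotFileParser()
rp.parse(html_response)
# No line matches the "field: value" record format, so the parser
# holds no rules at all and defaults to allowing every URL.
print(rp.can_fetch("Googlebot", "http://example.com/anything"))

# A proper (even restrictive) robots.txt parses cleanly:
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /private/"])
print(rp2.can_fetch("Googlebot", "http://example.com/private/page"))
```

So the HTML response is not even "wrong" rules from the parser's point of view; it is simply no rules, which may be exactly the kind of ambiguity that made Googlebot keep re-requesting the file.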
My guess is that Google penalized my site for not having a properly parsable robots.txt. What could be interesting is the fact that the indexing increased right after the correction; maybe Google can somehow be tricked into indexing a site faster. Well, I hope I will find some answers when I run these tests…