Googlebot

From DreamHost

Googlebot is the search bot used by Google. It collects documents from the web to build a searchable index for the Google search engine. Googlebot has two versions, deepbot and freshbot. Deepbot, the deep crawler, tries to follow every link on the web and download as many pages as it can for the Google indexers. It completes this process about once a month. Freshbot crawls the web looking for fresh content.

Googlebot behaving badly

On a very small percentage of customer sites (less than 0.01%), the Google crawler gets caught in a loop and ends up hitting those sites pretty hard.

Even when it is not in a loop, Googlebot will hit every page on your site, so any code that exists will be run! If some barely-used code makes things go haywire when accessed, we have a problem! In these cases it can cause really poor server performance, or even crash a server. It's not uncommon for an out-of-control script being hit by Googlebot to use 50% or more of both CPUs on a shared hosting server!

When faced with a heavily loaded server, we track down the culprit by checking the access.log files for activity. Here's how we look at the last 10,000 entries, counted per IP and sorted by number of requests (busiest last):

tail -n 10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n

That gives us something that looks like this:

364 65.214.44.72
436 64.172.17.3
657 204.74.126.70
1847 66.249.66.167

When we see 66.249 we immediately know who it is:

host 66.249.66.167
Name: crawl-66-249-66-167.googlebot.com
Address: 66.249.66.167
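
A user-agent string is easy to fake, so the name returned by the reverse lookup is worth double-checking with a forward lookup. The sketch below is one way to do that. It assumes the `host` utility is installed and prints its current "domain name pointer" / "has address" output (the older Name:/Address: output shown above would need different parsing). The function names are illustrative, and the accepted domains, googlebot.com and google.com, follow Google's published guidance for verifying its crawler.

```shell
# Illustrative helper (not a standard tool): succeed if a reverse-DNS
# name sits under one of the domains Google documents for its crawler.
is_google_crawler_name() {
    case "$1" in
        *.googlebot.com|*.google.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Illustrative helper: reverse-resolve the IP, check the domain, then
# forward-resolve the name and confirm it maps back to the same IP.
# Assumes the modern `host` output format and working DNS.
verify_googlebot() {
    ip="$1"
    name=$(host "$ip" 2>/dev/null |
        awk '/domain name pointer/ {print $NF; exit}')
    name="${name%.}"    # drop the trailing dot of the fully qualified name
    is_google_crawler_name "$name" || return 1
    host "$name" 2>/dev/null |
        awk '/has address/ {print $NF}' | grep -qxF "$ip"
}
```

A call such as `verify_googlebot 66.249.66.167 && echo "verified"` would then print the message only when both lookups agree.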

More than 1,800 of the last 10,000 requests is generally far more than is necessary to index a customer site.
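
To tell a loop from an ordinary crawl, it helps to see which URLs the busy client is requesting. The sketch below assumes the Apache common or combined log format, where the request path is the seventh whitespace-separated field; `top_paths` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: read an access log on stdin and print the most
# requested paths for client IPs that start with the given prefix.
# Usage: top_paths IP_PREFIX [HOW_MANY]   (HOW_MANY defaults to 10)
top_paths() {
    prefix="$1"
    limit="${2:-10}"
    # Keep lines whose first field (client IP) begins with the prefix,
    # then count the seventh field (request path), busiest first.
    awk -v p="$prefix" 'index($1, p) == 1 {print $7}' |
        sort | uniq -c | sort -rn | head -n "$limit"
}
```

For example, `tail -n 10000 access.log | top_paths 66.249. 20` lists the 20 paths that addresses beginning with 66.249. requested most often. A handful of URLs with very high counts may point to a loop, while an even spread looks more like a normal crawl of the whole site.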

Occasionally the cause is the one-click "Gallery" with the slideshow function turned on; in those cases we'll just block the bot from the gallery. More often, though, the bot is simply churning through the entire site over and over. In those cases we have only a few options:

  • Let the site stay up and crash/slow the server.

(We really can't... there are hundreds of other users who would suffer!)

  • Disable the site.

(The customer usually doesn't want this though!)

  • Disable the customer's entire account.

(That's harsh!)

  • Have the customer fix their site's inefficiencies.

(But in the meantime?)

(You have your own server now, let it crash!)

  • Block Googlebot via .htaccess from visiting that site.

(Hooray! If you can fix your site, or Googlebot starts to behave, the block can be removed.)

While we have received less-than-ecstatic feedback from customers for taking that last option, it's really the best one available!

Don't worry: if we do block Googlebot, we always confirm afterwards that it actually was the problem, and we notify the customer so that they are aware of the change and of their options at that point.
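
For reference, a user-agent block of this kind can be written in a few lines of .htaccess. The exact rules DreamHost applies are not shown here, so the following is only a sketch of one common approach, assuming Apache with mod_rewrite enabled. `googlebot_block_rules` is a hypothetical helper that simply prints the rules so they can be reviewed before being added to a file.

```shell
# Hypothetical helper: print mod_rewrite rules that answer
# "403 Forbidden" to any request whose User-Agent contains "Googlebot".
# Assumes Apache with mod_rewrite enabled; not DreamHost's exact rules.
googlebot_block_rules() {
    cat <<'EOF'
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule .* - [F]
EOF
}
```

Run from the site's document root, `googlebot_block_rules >> .htaccess` appends the three lines to that site's .htaccess file. If the file already contains rewrite rules, these lines likely belong above them, since an earlier rule flagged [L] can end processing first. Deleting the lines lifts the block.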

Our ultimate goal is to provide the best service possible, and a broken spider or a broken site that crashes a server makes everybody unhappy!

What can I do about Googlebot?

If your site is not meshing well with the Google crawler for some reason, there are a few possible solutions. First off, if DreamHost has brought it to your attention, then it is causing stability issues on your server (this is quite rare, but it does happen to a fraction of a fraction of a percent of the sites we serve). That being the case, it cannot be allowed to continue. Here are the various ways to address it:

-upgrade the account to a service plan that can support the added activity (if your site must be indexed, that's fine; we just need you on a plan that can support the additional CPU and memory usage involved)

-block the Google crawler (this can be limited to the folder where it is hitting the site excessively, as happens with the Gallery slideshow: if that is in a 'gallery' subfolder, there is no need to block Google from the entire site; it just needs its access to the gallery folder revoked)

-work with Google: this actually works; they are reasonably responsive if you have set up your webmaster (http://www.google.com/webmasters/) account. Just avoid relying on their tools to simply slow the crawl, as that setting does not last indefinitely; you really need to get hold of them and have them determine what is causing the interaction with your site
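
Blocking only the folder that is being hit can also be requested through robots.txt, which Googlebot honors. Unlike an .htaccess block, this relies on the crawler's cooperation, and a change may take a while to be picked up. A minimal sketch, assuming the gallery lives in a top-level folder named /gallery/ as in the example above; `robots_group` is a hypothetical helper that prints one robots.txt group for a given user agent and list of paths.

```shell
# Hypothetical helper: print one robots.txt group.
# Usage: robots_group USER_AGENT PATH_PREFIX [PATH_PREFIX...]
robots_group() {
    printf 'User-agent: %s\n' "$1"
    shift
    for prefix in "$@"; do
        printf 'Disallow: %s\n' "$prefix"
    done
}
```

Running `robots_group Googlebot /gallery/` prints the two lines `User-agent: Googlebot` and `Disallow: /gallery/`, which belong in the robots.txt file at the top level of the site. If that file already has a Googlebot group, the Disallow line should be merged into it by hand.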

Gallery Slideshow

If you're running a Gallery on your site, one of the most common Googlebot problems is the crawler trying to index the slideshow. It's so much of a problem that the good people at Gallery have a page dedicated to preventing it:

http://codex.gallery2.org/Gallery2:How_to_keep_robots_off_CPU_intensive_pages

Pages such as slideshow are very CPU intensive. To an indexing robot they are also totally useless 
since the information they provide is redundant. So the administrator has every reason to keep the 
robots from visiting such pages.

Using URL rewrite module, the default slideshow URL is the following form:
"/v/my_album/my_sub_album/my_photo.jpg/slideshow.html". The problem is that there is no way to 
exclude that sort of URL in robots.txt syntax. In order to make the URL excludable, some URL 
rewriting is required.

There is no need for fiddling with mod_rewrite directly as the nifty rewrite module can handle the 
details itself. By default the "View Slideshow" rewrite target is "v/%path%/slideshow.html". The 
constant slideshow URL mark ("/slideshow.html") is on the right side of the variable path ("%path%") 
and this is why we could not express the slideshow ban in robots.txt syntax. Reversing this order 
will provide us with an excludable URL.

So change the rewrite target for "View Slideshow" from "v/%path%/slideshow.html" to "v/slideshow/%path%".

Then add "Disallow: /v/slideshow/" to your robots.txt. If you use the PATH_INFO mode of URL rewrite 
module then this will be "Disallow: /main.php/v/slideshow/".

And that's it: no more spiders hogging your precious resources in vain!
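
The reasoning above comes down to prefix matching: in the original robots.txt convention, a Disallow value is compared against the start of the URL path. The sketch below imitates that rule with a hypothetical `is_disallowed` helper to show why the default slideshow URL cannot be excluded while the reordered one can. It assumes plain prefix matching only; crawlers that support wildcard extensions may behave differently.

```shell
# Hypothetical helper: succeed if URL path $2 starts with Disallow
# value $1 (plain prefix matching, no wildcard extensions).
is_disallowed() {
    case "$2" in
        "$1"*) return 0 ;;
        *) return 1 ;;
    esac
}

rule="/v/slideshow/"
for url in \
    "/v/my_album/my_sub_album/my_photo.jpg/slideshow.html" \
    "/v/slideshow/my_album/my_sub_album/my_photo.jpg"
do
    if is_disallowed "$rule" "$url"; then
        echo "blocked: $url"
    else
        echo "allowed: $url"
    fi
done
```

The first URL, in the default form, is reported as allowed because the constant part comes last; the second, in the reordered form, is reported as blocked.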
