Bots, spiders and crawlers

Bots, spiders and other crawlers hitting your dynamic pages can cause extensive resource usage (memory and CPU) and consequently high load on the server, slowing down your sites. How do you reduce server load from bots, spiders and other crawlers?

By creating a "robots.txt" file at the root of your site/domain, you can tell search engines what content on your site they should and should not index. This can be helpful, for example, if you want to keep a portion of your site out of the Google search engine index. We also offer this in our web panel: https://panel.dreamhost.com/index.cgi?tree=goodies.robots&
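
For example, a robots.txt along these lines (the /private/ path is just a placeholder; use your own directory name) asks Google's crawler to stay out of that part of the site while leaving everything else crawlable:

User-agent: Googlebot
Disallow: /private/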

Note that this file is meant as a suggestion to compliant search engines only, and does not prevent all search engines (or other similar tools, such as email/content scrapers) from accessing the content or making it available. However, most of the major engines respect robots.txt directives.

Use Caution

Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from search engines. Also note that bad bots will likely ignore your robots.txt file, so you may want to block their user-agent with .htaccess.

Bad bots may also use your robots.txt file as a target list, so you may want to skip listing directories in robots.txt. Bad bots may also use false or misleading user-agents, so blocking user-agents with .htaccess may not work as well as anticipated.
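
If you do decide to block by user-agent anyway, here is a minimal .htaccess sketch using mod_rewrite (the bot names "BadBot" and "EvilScraper" below are placeholders; substitute the agents you actually see in your access logs):

# Deny requests whose User-Agent matches the listed (placeholder) bad bots
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
</IfModule>

The [F] flag returns a 403 Forbidden to matching requests before your dynamic pages are ever run.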

If you don't want to block anyone, this is a good default robots.txt file:

User-agent: *
Disallow:

(But of course you might as well remove robots.txt in this case, if you don't mind 404 requests in your logs.) Beyond that, I'd recommend only blocking specific user-agents and files/directories, rather than *, unless you're 100% sure that's what you want.

Blocking ALL Bots

Here is how you can prevent them from touching your files:

Place a robots.txt file in your site's base web directory, or in the folder that is (or may be) getting hit heavily.

To disallow all bots:

User-agent: *
Disallow: /

To disallow them from a specific folder (warning: this may be used as a target list by bad bots):

User-agent: *
Disallow: /yourfolder/

Slowing Good Bots

To slow some, but not all, good bots:

User-agent: *
Crawl-Delay: 3600
Disallow:
(where the number is the minimum delay in seconds between successive crawler accesses)
  • Note: this is of limited use, as Googlebot has been confirmed to ignore the Crawl-Delay directive.


To slow Googlebot down, you will need to sign up for a Google Webmaster Tools account at http://www.google.com/webmasters/tools. Once you have signed up, you can "Set crawl rate" for your site and generate a robots.txt file that will make your site both Googlebot friendly and shared-server friendly. We recommend that you set the crawl rate to slow and, if possible or as it applies, block Googlebot from crawling areas you would rather it not crawl, such as member areas, admin areas, etc.
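
For example, a robots.txt along these lines (the /members/ and /admin/ paths are just placeholders for your own directories) asks Googlebot to skip those areas:

User-agent: Googlebot
Disallow: /members/
Disallow: /admin/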
