Finding Causes of Heavy Usage

The instructions provided in this article or section require shell access unless otherwise stated.

You can use the PuTTY client on Windows, or SSH on UNIX and UNIX-like systems such as Linux or Mac OS X.
Your account must be configured for shell access in the Control Panel.


Overview

It’s possible that your site sometimes uses more resources than you expect. If so, you may notice your site loading slowly or not at all (downtime). This may be due to heavy usage on your website.

The main causes of heavy usage are bots and inefficient scripts.

Plugins and modules that add functionality to dynamic websites can also compromise performance. Removing plugins and modules almost always reduces resource usage (unless a plugin or module exists specifically to make the site more efficient, such as a caching or anti-spam plugin).

Since removing a plugin also removes its functionality, this article walks you through alternative ways to find the causes of heavy usage and mitigate their impact.

Viewing your access.log

You can confirm exactly what is hitting your site in your access.log file.

  • This can be found in your FTP account.
  • Your user must also be a Shell user to view these logs.

To view your access.log file:

  1. Review the Enabling Shell Access wiki to confirm your user is a Shell user.
  2. Log into your account using the instructions from the FTP wiki.
    Once logged in you’ll already be in your /home/username directory.
  3. Navigate to the following directory in your FTP client:
    /logs/example.com/http/
    If instead you're logged into a shell terminal, you can cd into your /logs directory by entering the following:
    cd ~/logs
    Then cd into your domain's directory and then its http directory.
  4. In that /http directory, you’ll find your access.log as well as other log files.
  5. Download or edit that file to view the contents.
    Notes:
    • You can also access the file via SSH.
    • Make sure your user is a Shell user first by visiting the Enabling Shell Access wiki.
    • A brief example SSH session is shown below.
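
    For example, a minimal SSH session to change into the log directory and view the file (assuming your domain is example.com; substitute your own domain) looks like this:

    [server]$ cd ~/logs/example.com/http
    [server]$ ls
    [server]$ less access.log

    Press q to exit less; you can also use tail to view only the most recent entries.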


    How to examine your access.log

    You might find abuse from specific IPs; often this is due to bots hitting your site. However, you may also find IPs that are not associated with bots over-browsing your site.

    This section lists a few commands you can run via SSH to help identify which IPs are hitting your site.

    Listing IP hits

    Important: Make sure that after you log into the server via SSH, you are in your /logs/example.com/http directory. This is where you’ll run the following commands.


    • cat access.log | awk '{print $1}' | sort | uniq -c | sort -n
      Generates a list of IP addresses, each preceded by the number of times it hit the site.

    • tail -10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n
      Generates the same list, limited to the last 10,000 requests to the site.

    • host 66.249.66.167
      Example output: 167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com
      The 'host' command determines the hosting company from which a specific IP is hitting your site. In this example, the IP belongs to Google.

    • tail -f -q access.log
      Watches your server log in real time to see if the issue presents itself with a specific IP (useful for intermittent issues).

    • order allow,deny
      deny from 66.249.66.167
      allow from all
      These are not shell commands but .htaccess directives; added to your site's .htaccess file, they block the IP (in this example, the Google IP above).
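
    As a rough sketch combining the commands above (run from your /logs/example.com/http directory), the following loop lists the ten busiest IPs from the last 10,000 requests and looks up each one with 'host':

    # Top 10 IPs from the last 10,000 requests, each with a reverse-DNS lookup
    tail -10000 access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10 |
    while read count ip; do
        printf '%7s  %-16s %s\n' "$count" "$ip" "$(host "$ip" | head -1)"
    done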

    Listing top files, folders, and domains

    • awk '{print $7}' access.log | cut -d? -f1 | sort | uniq -c | sort -nk1 | tail -n10
      Generates a list of the ten files or directories on your site being requested the most (query strings are stripped by the cut command).

    • for k in `ls -S */http/access.log`; do wc -l $k; done | sort -rn
      Generates a list of traffic (total logged requests) for all domains listed under a specific user (on a shared server), busiest first.
      This command must be run in your /logs/ directory.
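
    If the top entries are mostly images, CSS, or JavaScript files, a variation of the first command (a sketch; adjust the extension list to match your site) filters those out so the dynamic pages stand out:

    # Top 10 most-requested paths, ignoring common static file types
    awk '{print $7}' access.log | cut -d? -f1 | grep -viE '\.(css|js|png|jpe?g|gif|ico|svg|woff2?)$' | sort | uniq -c | sort -n | tail -n10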

    If you have a VPS or Dedicated plan

    • for k in `ls -S /home/*/logs/*/http/access.log`; do wc -l $k; done | sort -rn
      Generates a list of all traffic for all domains (for multiple domains on a VPS or Dedicated server), busiest first.
      You can run this command from within any directory.

    • tail -f -q /home/*/logs/*/http/access.log
      Watches all of your server logs in real time to see if the issue presents itself with a specific IP (useful for intermittent issues).
      You can run this command from within any directory.

    Bots, spiders, and crawlers

    Bots, spiders, and other crawlers hitting your dynamic pages can cause extensive resource (memory and CPU) usage. This can lead to high load on the server and slow down your site(s).

    One option to reduce server load from bots, spiders, and other crawlers is to create a "robots.txt" file at the root of your site/domain. This tells search engines what content on your site they should and should not index. This can be helpful, for example, if you want to keep a portion of your site out of the Google search engine index.

    If you prefer not to create this file yourself, you can have DreamHost create one for you automatically (on a per-domain basis) on the (Panel > 'Goodies' > 'Block Spiders') page.

    Note: While most of the major search engines respect robots.txt directives, this file only acts as a suggestion to compliant search engines and does not prevent search engines (or other similar tools, such as email/content scrapers) from accessing the content or making it available.
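
    For example, a minimal robots.txt that asks Google's crawler to stay out of a hypothetical /private/ directory (the directory name here is only a placeholder) would look like this:

    User-agent: Googlebot
    Disallow: /private/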


    Blocking robots

    The problem may be that Google, Yahoo, or another search engine bot is over-browsing your site. (This is the sort of problem that feeds on itself; if the bot is not able to complete its search because of a lack of resources, it may launch the same search over and over again.)

    Blocking Googlebots

    In the following example, the IP of 66.249.66.167 was found in your access.log. You can check which company this IP belongs to by running the ‘host’ command:

    [server]$ host 66.249.66.167
    167.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-167.googlebot.com.
    

    To block this Googlebot, use the following in your robots.txt file:

    # go away Googlebot
    User-agent: Googlebot
    Disallow: /
    
    Explanation of the fields above:
    # go away Googlebot
    This is a comment which is only used so you know why you created this rule.
    User-agent
    The name of the bot to which the next rule will apply.
    Disallow
    The path of the URL you wish to block. This forward slash means the entire site will be blocked.

    Further information about Google robots is available in Google's webmaster documentation.

    Blocking Yahoo

    Yahoo's crawling bots comply with the crawl-delay rule in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, you would add the following:

    # slow down Yahoo
    User-agent: Slurp
    Crawl-delay: 10
    
    Explanation of the fields above:
    # slow down Yahoo
    This is a comment which is only used so you know why you created this rule.
    User-agent: Slurp
    Slurp is the Yahoo User-agent name. You must use this name when writing rules for Yahoo's crawler.
    Crawl-delay
    Tells the User-agent to wait 10 seconds between each request to the server.

    Further information about Yahoo robots is available in Yahoo's documentation.

    Slowing good bots

    Use the following to slow some, but not all, good bots:

    User-agent: *
    Crawl-Delay: 3600
    
    Explanation of the fields above:
    User-agent: *
    Applies to all User-agents.
    Crawl-delay
    Tells the User-agent to wait 3600 seconds between each request to the server.
    Notes:
    • Googlebot ignores the crawl-delay directive.
    • To slow down Googlebot, you’ll need to sign up at Google Webmaster Tools.
    • Once your account is created, you can set the crawl rate and generate a robots.txt file.


    Blocking all bots

    To disallow all bots:

    User-agent: *
    Disallow: /
    

    To disallow them on a specific folder:

    User-agent: *
    Disallow: /yourfolder/
    
    Warning: Bad bots may use this content as a list of targets.


    Explanation of the fields above:
    User-agent: *
    Applies to all User-agents.
    Disallow: /  
    Disallows the indexing of everything.
    Disallow: /yourfolder/  
    Disallows the indexing of this single folder.

    Use caution

    Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from legitimate search engines. Also, note that bad bots will likely ignore your robots.txt file, so you may want to block their User-agent with an .htaccess file instead (a sketch follows below).

    Bad bots may also use your robots.txt file as a target list, so you may want to skip listing directories in robots.txt. Bad bots may also use false or misleading User-agents, so blocking User-agents with .htaccess may not work as well as anticipated.
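
    With that caveat in mind, here is a minimal sketch of blocking by User-agent in .htaccess, using the same Apache 2.2-style directives as the IP block shown earlier. "BadBot" is only a placeholder; replace it with the User-agent string you actually see in your access.log:

    # Deny requests whose User-Agent header contains the placeholder string "BadBot"
    SetEnvIfNoCase User-Agent "BadBot" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot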

    If you don't want to block anyone, this is a good default robots.txt file:

    User-agent: *
    Disallow:
    

    In this case you could also simply remove the robots.txt file, since an empty Disallow rule permits everything anyway; the only downside is the 404 requests for robots.txt that will appear in your logs.

    DreamHost recommends that you only block specific User-agents and files/directories, rather than *, unless you're 100% sure that's what you want.

    Blocking bad referrers

    For detailed instructions, please visit the wiki on how to block referrers.

    My unique IP is making a lot of connections

    You may find in your access.log that your site’s Unique IP is making a lot of connections. This is not an issue and can be safely ignored.

    This occurs because Apache is internally generating these connections in order to shut down unneeded processes.
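
    These requests normally identify themselves in the log's User-agent field as Apache's "internal dummy connection", so (assuming your log format records the User-agent) you can count them with:

    [server]$ grep -c 'internal dummy connection' access.log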

    Apache's documentation describes these internal dummy connections in more detail.
