Finding Causes of Heavy Usage

The instructions provided in this article require shell access unless otherwise stated.

You can use the PuTTY client on Windows, or SSH on UNIX and UNIX-like systems such as Linux or Mac OS X.
Your account must be configured for shell access in the Control Panel.


The main causes of heavy usage are bots and inefficient scripts.

Many plugins and modules on dynamic websites that offer additional functionality can also compromise performance. Reducing plugins and modules will almost always result in consuming fewer resources (unless the plugin or module is specifically for the purpose of making the site more efficient, such as a caching plugin or an anti-spam plugin).

However, since removing a plugin also removes its functionality, here are some other things to try to reduce your resource consumption.

Bots, Spiders, and Crawlers

Bots, spiders, and other crawlers hitting your dynamic pages can cause extensive resource usage (memory and CPU), high load on the server, and, as a result, slow sites. How do you reduce server load from bots, spiders, and other crawlers?

By creating a "robots.txt" file at the root of your site/domain, you can tell search engines what content on your site they should and should not index. This can be helpful, for example, if you want to keep a portion of your site out of the Google search engine index. We also offer this in our web panel: https://panel.dreamhost.com/index.cgi?tree=goodies.robots&

Note that this file is only a suggestion to compliant search engines; it does not prevent all search engines (or other similar tools, such as email/content scrapers) from accessing the content or making it available. However, most of the major engines respect robots.txt directives.
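
For example, a minimal robots.txt sketch that keeps a single directory out of Google's index while leaving the rest of the site crawlable might look like this (the /private/ directory is hypothetical; use your own paths):

   # keep a hypothetical /private/ directory out of Google's index
   User-agent: Googlebot
   Disallow: /private/

   # all other bots may crawl everything
   User-agent: *
   Disallow: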

Use Caution

Blocking all bots (User-agent: *) from your entire site (Disallow: /) will get your site de-indexed from search engines. Also note that bad bots will likely ignore your robots.txt file, so you may want to block their user-agent with .htaccess.

Bad bots may also use your robots.txt file as a target list, so you may want to skip listing directories in robots.txt. Bad bots may also use false or misleading user-agents, so blocking user-agents with .htaccess may not work as well as anticipated.
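
As a sketch of the .htaccess approach, the rules below return a 403 Forbidden to any client whose user-agent contains the string "BadBot" (a placeholder; substitute the user-agent strings you actually see in your logs):

   # deny requests from a hypothetical bad bot, matched by user-agent
   RewriteEngine On
   RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
   RewriteRule .* - [F,L]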

If you don't want to block anyone, this is a good default robots.txt file:

User-agent: *
Disallow:

(But of course you might as well remove robots.txt in this case, if you don't mind 404 requests in your logs.) Beyond that, I'd recommend only blocking specific user-agents and files/directories, rather than *, unless you're 100% sure that's what you want.

Blocking Robots

The problem may be that Google, Yahoo, or another search engine bot is busily over-browsing your site. (This is the sort of problem that feeds on itself; if the bot is not able to complete its crawl because of a lack of resources, it may launch the same crawl over and over again.) You can identify a crawler by running a reverse lookup on an IP address that is hitting the site frequently:

   yourserver: 04:41 PM# host 66.249.66.167
   Name: crawl-66-249-66-167.googlebot.com
   Address: 66.249.66.167

To take care of that, create a file named robots.txt in your domain folder with the following contents:

   # go away
   User-agent: *
   Disallow: /

Yahoo's crawling bots (identified by a user agent string containing "Yahoo! Slurp" or "inktomisearch.com") do comply with the Crawl-delay rule in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch pages more often than once every 10 seconds, you would add:

   # slow down Yahoo
   User-agent: Slurp
   Crawl-delay: 10

If you do not see any IP that is clearly the cause, and you have a lot of content that people might be hotlinking, be sure to try blocking that as well:

Preventing hotlinking
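
As a rough sketch of hotlink protection in .htaccess (example.com and the image extensions are placeholders; adjust them for your site):

   # return 403 for image requests whose referer is some other site
   RewriteEngine On
   RewriteCond %{HTTP_REFERER} !^$
   RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
   RewriteRule \.(gif|jpe?g|png)$ - [F,NC]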

Blocking ALL Bots

Here is how you can prevent them from touching your files:

Place a robots.txt file at the web root of the domain that is, or may be, getting hit a lot.

To disallow all bots:

User-agent: *
Disallow: /

To disallow them on a specific folder: (Warning: may be used as a target list by bad bots)

User-agent: *
Disallow: /yourfolder/

Slowing Good Bots

To slow some, but not all, good bots:

User-agent: *
Crawl-Delay: 3600
Disallow:
(where the number is the minimum delay in seconds between successive crawler accesses)
  • Note: this is of limited use, as Googlebot has been confirmed to ignore the Crawl-Delay directive.

To slow Googlebot down you will need to sign up for a Google Webmaster Tools account at http://www.google.com/webmasters/tools. Once you have signed up, you can "Set crawl rate" for your site and generate a robots.txt file that will make your site both Googlebot-friendly and shared-server-friendly. We recommend that you set the crawl rate to slow and, if possible or as it applies, block Googlebot from crawling areas that you would rather it not crawl, such as member areas, admin areas, etc.
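
A minimal robots.txt sketch for that last point, assuming hypothetical /members/ and /admin/ directories:

   # keep Googlebot out of hypothetical member and admin areas
   User-agent: Googlebot
   Disallow: /members/
   Disallow: /admin/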

Checking for abnormally high visits from a limited number of IPs

You might find abuse from specific IPs. Very often this overlaps with the bot problem above, but you may also find IPs not associated with bots over-browsing your site. In the case of abuse, you should be able to determine the cause by looking at the access logs for your domains, located in /home/YOUR_USERNAME/logs/YOUR_DOMAIN.com/http

While logged in via the shell, change into that log directory and enter this command to see the IPs hitting the domain the most:

cat access.log | awk '{print $1}' | sort | uniq -c | sort -n

and this command is even more useful in some cases as it specifically targets the last 10,000 hits:

tail -10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n

Finally, if you have a lot of domains, you may want to use this to aggregate them (run it from inside your ~/logs directory):

for k in $(ls --color=none); do echo "Top visitors by IP for: $k"; awk '{print $1}' ~/logs/$k/http/access.log | sort | uniq -c | sort -n | tail; done

This command is great if you want to see what is being called the most (that can often show you that a specific script is being abused if it's being called way more times than anything else in the site):

awk '{print $7}' access.log | cut -d? -f1 | sort | uniq -c | sort -nk1 | tail -n10

If you have multiple domains on a PS (PS only!), run this command to get the traffic for all domains on the PS:

for k in $(ls -S /home/*/logs/*/http/access.log); do wc -l $k; done | sort -r -n

Here is an alternative to the above command which does the same thing; this one is for VPS only, using an admin user:

sudo find /home/*/logs -type f -name "access.log" -exec wc -l "{}" \; | sort -r -n

If you're on a shared server, you can run this command, which will do the same as the one above but only for the domains in your logs directory. You have to run this command while you're in your user's logs directory:

for k in $(ls -S */http/access.log); do wc -l $k; done | sort -r -n

If you find IPs that are connecting a lot, first check to see who it is, replacing 1.1.1.1 with the IP address:

host 1.1.1.1

and then block it by editing or creating a file named .htaccess in the root of your website's folder (typically something like /home/USER/DOMAIN.COM/.htaccess):

order allow,deny
deny from 1.1.1.1
allow from all

Note that this information must appear first in the .htaccess file. See the htaccess page for more information.

If the issue is intermittent, you can watch your server logs to see if it presents itself:

 tail -f -q /home/*/logs/*/http/access.log

Checking Processes

If all of that fails to help, it's quite likely a script is causing the issue. What you need to do then is check which processes are running under your user the most. Enter this from the command line and you'll get details on the processes running as the logged-in user (you'll likely have to run it a few times):

   for k in $(pgrep -u $USER); do echo "======== PID: $k"; cat /proc/$k/environ | tr '\000' '\n'; done

Once you get a few lines of output, hit Ctrl+C to stop it so you can actually look at what is coming up (note: it might not catch a process right away, but if you run it while your account is busy you should get a lot of information). What this is doing is showing you the environment of each process your user is running.

PATH=/usr/local/bin:/usr/bin:/bin
DOCUMENT_ROOT=/home/username/userdomain.org
HTTP_ACCEPT=*/*
HTTP_CONNECTION=Keep-Alive
HTTP_HOST=userdomain.org
HTTP_REFERER=http://userdomain.org/weblog
HTTP_USER_AGENT=Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
REDIRECT_STATUS=200
REDIRECT_URL=/weblog/dotclear/rss.php
REMOTE_ADDR=76.170.16.89
REMOTE_PORT=47739
SCRIPT_FILENAME=/dh/cgi-system/php.cgi
SCRIPT_URI=http://userdomain.org/weblog/dotclear/rss.php
SCRIPT_URL=/weblog/dotclear/rss.php
SERVER_ADDR=208.97.191.135
SERVER_ADMIN=webmaster@userdomain.org
SERVER_NAME=userdomain.org
SERVER_PORT=80
SERVER_SOFTWARE=Apache/1.3.37 (Unix) mod_throttle/3.1.2 DAV/1.0.3 mod_fastcgi/2.4.2 mod_gzip/1.3.26.1a PHP/4.4.4 mod_ssl/2.8.22 OpenSSL/0.9.7e
GATEWAY_INTERFACE=CGI/1.1
SERVER_PROTOCOL=HTTP/1.1
REQUEST_METHOD=GET
QUERY_STRING=
REQUEST_URI=/weblog/dotclear/rss.php
SCRIPT_NAME=/cgi-system/php.cgi
PATH_INFO=/weblog/dotclear/rss.php
PATH_TRANSLATED=/home/username/userdomain.org/weblog/dotclear/rss.php

Your results won't look exactly like this, but if you have the patience you should be able to get some useful information out of them (in the example above, the script running was rss.php in the user's /home/username/userdomain.org/weblog/dotclear folder). Even then you may need some trial and error to figure out exactly which processes are eating up the load, but this should help you narrow it down. These are the same steps DreamHost takes when investigating a user that is crashing a machine or Apache service, so there is no need to disable the site or user while you investigate - and, as you can see, it's not that easy. Fortunately, you have the inside track, since you probably already know from your statistics where people are going the most.
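
Once you have identified a likely culprit (rss.php in the hypothetical example above), a quick way to confirm it is being hammered is to count how often it appears in the access log; adjust the script path and domain to match your own setup:

   # count requests for the script identified above (hypothetical path and domain)
   grep -c "/weblog/dotclear/rss.php" ~/logs/userdomain.org/http/access.log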

Finally, if that doesn't get you to the source of the usage, or if you have questions or need guidance, please contact support. While support may not be able to find the specific cause for you, they will do what they can - the goal is better service for everyone, including yourself!

Your site's own IP is making a lot of connections

If you find that your site's unique IP is making a lot of connections to your site, this is not an issue. What's actually going on here is strange but harmless: these connections are all generated by Apache internally to shut down unneeded processes, and should be ignored. The fact that they appear in access logs at all is somewhat unfortunate, but difficult to avoid. There's a fuller explanation on the Httpd Wiki if you're curious.
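
As a rough check (assuming a reasonably recent Apache, which tags these requests with an "internal dummy connection" user-agent string; older versions may not), you can count how many log entries they account for, substituting your own domain in the path:

   # count Apache's internal dummy-connection requests (log path is an example)
   grep -c "internal dummy connection" ~/logs/example.com/http/access.log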