Finding Causes of Heavy Usage

From DreamHost
The instructions provided in this article or section require shell access unless otherwise stated.

You can use the PuTTY client on Windows, or SSH on UNIX and UNIX-like systems such as Linux or Mac OS X.
Your account must be configured for shell access in the Control Panel.


The main cause of heavy usage is inefficient scripts. Many plugins and modules that add functionality to dynamic websites can also hurt performance. Removing plugins and modules almost always reduces resource consumption (unless the plugin or module exists specifically to make the site more efficient, such as a caching plugin or an anti-spam plugin).

However, since you would lose the functionality of the plugin by removing it, here are some other things to try to see if you can reduce your resource consumption.

Blocking Robots

The problem may be that Google, Yahoo or another search engine bot is busily over-browsing your site. (This is the sort of problem that feeds on itself; if the bot is not able to complete its search because of lack of resources, it may launch the same search over and over again.)

   yourserver: 04:41 PM# host 66.249.66.167
   Name: crawl-66-249-66-167.googlebot.com
   Address: 66.249.66.167

To take care of that, create a file named robots.txt in your domain folder with the following contents. Note that this asks every compliant robot to stay out of the entire site:

   # go away
   User-agent: *
   Disallow: /

Yahoo's crawling bots (identified by a user agent string containing either "Yahoo's Slurp" or "inktomisearch.com") comply with the Crawl-delay rule in robots.txt, which limits their fetching activity. For example, to tell Yahoo not to fetch a page more than once every 10 seconds, you would add:

   # slow down Yahoo
   User-agent: Slurp
   Crawl-delay: 10
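
Before editing robots.txt, it can help to confirm which crawler is actually responsible. The following is a minimal sketch, not a DreamHost tool: it assumes your access log uses Apache's "combined" format (where the user agent is the sixth double-quote-delimited field), and the function name top_agents is made up for illustration.

```shell
# Sketch: count requests per user agent in an Apache access log.
# Assumption: the log is in "combined" format, so splitting each line
# on double quotes puts the user agent string in field 6.
top_agents() {
    # $1 = path to an access log
    awk -F'"' '{print $6}' "$1" | sort | uniq -c | sort -rn | head -n 10
}

# Usage (from the directory holding the log): top_agents access.log
```

If one bot dominates the list, a User-agent block aimed at that bot is likely a better fit than blocking every robot.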

If you do not see any IP that is clearly the cause, and you have a lot of content that people might be hotlinking, try blocking that:

Preventing hotlinking

Checking for abnormally high visits from a limited number of IPs

You might find abuse from specific IPs. Very often this overlaps with the bot problem above, but you might also find IPs that are not associated with bots over-browsing your site. If it is abuse, you should be able to determine the cause by looking at each of the access logs for your domains, located in /home/YOUR_USERNAME/logs/YOUR_DOMAIN.com/http

While logged in via the shell, enter this command to see the IPs hitting the domain the most:

awk '{print $1}' access.log | sort | uniq -c | sort -n

This command is even more useful in some cases, as it looks only at the last 10,000 hits:

tail -n 10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n

Finally, if you have a lot of domains, you may want to use this to report on all of them at once:

for k in ~/logs/*; do echo "Top visitors by IP for: $(basename "$k")"; awk '{print $1}' "$k/http/access.log" | sort | uniq -c | sort -n | tail; done

This command is great if you want to see what is being called the most (which can often show that a specific script is being abused, if it is called far more often than anything else on the site):

awk '{print $7}' access.log | cut -d'?' -f1 | sort | uniq -c | sort -nk1 | tail -n 10

If you have multiple domains and are on a PS (PS only!), run this command to get the hit counts for all domains on the PS, busiest first:

for k in /home/*/logs/*/http/access.log; do wc -l "$k"; done | sort -r -n

Here is an alternative to the above command that does the same thing. It is for VPS only, using an admin user:

sudo find /home/*/logs -type f -name "access.log" -exec wc -l "{}" \; | sort -r -n

If you're on a shared server, you can run this command, which does the same as the one above but only for the domains in your logs directory. You have to run it while you're in your user's logs directory:

for k in */http/access.log; do wc -l "$k"; done | sort -r -n

If you find IPs that are connecting a lot, first check to see who it is, replacing 1.1.1.1 with the IP address:

host 1.1.1.1

and then block it by editing or creating a file named .htaccess in the root of your website's folder (typically this will be something like /home/USER/DOMAIN.COM/.htaccess). With "order allow,deny" the deny rules are evaluated last, so the deny line wins over "allow from all":

order allow,deny
allow from all
deny from 1.1.1.1
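
To judge whether a block is justified, it can help to see what a given address was actually requesting. The following is a minimal sketch under the same assumption as the commands above (the client IP is field 1 and the request path is field 7 of each log line); the function name requests_from_ip is made up for illustration.

```shell
# Sketch: list the paths one IP requested most often.
# Assumption: standard Apache access log layout, with the client IP in
# field 1 and the request path in field 7.
requests_from_ip() {
    # $1 = IP address to inspect, $2 = path to an access log
    # The IP is compared as an exact string, so 1.1.1.1 does not
    # also match 11.1.1.1 or 1.1.1.10.
    awk -v ip="$1" '$1 == ip {print $7}' "$2" \
        | cut -d'?' -f1 | sort | uniq -c | sort -rn | head -n 10
}

# Usage: requests_from_ip 1.1.1.1 access.log
```

A single script taking nearly all of the hits from one address usually points to abuse of that script rather than ordinary browsing.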

If the issue is intermittent, you can watch your server logs live to see when it presents itself:

 tail -f -q /home/*/logs/*/http/access.log
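
For an intermittent problem that has already passed, counting hits per minute in an existing log can show when the spike occurred. This is a minimal sketch, assuming the usual Apache timestamp in field 4 (for example [10/Oct/2000:13:55:36); the function name hits_per_minute is made up for illustration.

```shell
# Sketch: show the busiest minutes in an access log.
# Assumption: field 4 holds a timestamp such as [10/Oct/2000:13:55:36,
# so characters 2 through 18 are the date, hour, and minute.
hits_per_minute() {
    # $1 = path to an access log
    awk '{print substr($4, 2, 17)}' "$1" | sort | uniq -c | sort -rn | head -n 10
}

# Usage: hits_per_minute access.log
```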

Checking Processes

If all that fails to help, it's quite likely you have a script that is causing the issue. What you need to do then is check which processes are running under your user the most. Enter this from the command line and you'll get details on the processes running as the logged-in user (you'll likely have to run it a few times):

   for k in $(pgrep -u "$USER"); do echo "======== PID: $k"; tr '\000' '\n' < "/proc/$k/environ"; done

Once you get a few lines of output, press Ctrl+C if it is still running so you can actually look at what came up (note: it might not catch a process right away, but if you run it when your account is busy you should get a lot of information). What this does is show you the environment of each process your user is running.

PATH=/usr/local/bin:/usr/bin:/bin
DOCUMENT_ROOT=/home/username/userdomain.org
HTTP_ACCEPT=*/*
HTTP_CONNECTION=Keep-Alive
HTTP_HOST=userdomain.org
HTTP_REFERER=http://userdomain.org/weblog
HTTP_USER_AGENT=Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
REDIRECT_STATUS=200
REDIRECT_URL=/weblog/dotclear/rss.php
REMOTE_ADDR=76.170.16.89
REMOTE_PORT=47739
SCRIPT_FILENAME=/dh/cgi-system/php.cgi
SCRIPT_URI=http://userdomain.org/weblog/dotclear/rss.php
SCRIPT_URL=/weblog/dotclear/rss.php
SERVER_ADDR=208.97.191.135
SERVER_ADMIN=webmaster@userdomain.org
SERVER_NAME=userdomain.org
SERVER_PORT=80
SERVER_SOFTWARE=Apache/1.3.37 (Unix) mod_throttle/3.1.2 DAV/1.0.3 mod_fastcgi/2.4.2 mod_gzip/1.3.26.1a PHP/4.4.4 mod_ssl/2.8.22 OpenSSL/0.9.7e
GATEWAY_INTERFACE=CGI/1.1
SERVER_PROTOCOL=HTTP/1.1
REQUEST_METHOD=GET
QUERY_STRING=
REQUEST_URI=/weblog/dotclear/rss.php
SCRIPT_NAME=/cgi-system/php.cgi
PATH_INFO=/weblog/dotclear/rss.php
PATH_TRANSLATED=/home/username/userdomain.org/weblog/dotclear/rss.php

Your results won't look exactly like this, but with some patience you should be able to get useful information out of it (in the example above, the script running was rss.php in the user's /home/username/userdomain.org/weblog/dotclear folder). Even then, you may still need some trial and error to learn exactly which processes are driving up the load, but this should help you narrow it down. These are the exact steps DreamHost takes when investigating a user that is crashing a machine or the Apache service without disabling the site or user, so you can see why it's not that easy. Fortunately, you have the inside track, as you'll probably know from your statistics where people are going the most.
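
The full environment dump is long, so it can be easier to print only the fields that identify the request. The following is a minimal sketch, assuming the processes expose CGI-style variables like the ones in the example above; the function names summarize_environ and show_requests are made up for illustration.

```shell
# Sketch: reduce a process environment to "label client-IP request-URI".
# Assumption: the environment contains CGI-style REMOTE_ADDR and
# REQUEST_URI entries, as in the sample output above.
summarize_environ() {
    # $1 = NUL-separated environ file, $2 = label to print (e.g. the PID)
    tr '\000' '\n' < "$1" | awk -F= -v label="$2" '
        $1 == "REMOTE_ADDR" { addr = substr($0, length($1) + 2) }
        $1 == "REQUEST_URI" { uri  = substr($0, length($1) + 2) }
        END { if (uri != "") print label, addr, uri }'
}

# Apply the summary to every process owned by the current user.
show_requests() {
    for pid in $(pgrep -u "$USER"); do
        # Skip processes that exited or cannot be read.
        [ -r "/proc/$pid/environ" ] || continue
        summarize_environ "/proc/$pid/environ" "$pid"
    done
}

# Usage: show_requests
```

Running show_requests several times while the account is busy and looking for a URI that keeps reappearing is one way to spot the script worth investigating.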

If that doesn't get you to the source of the usage here are links to some other useful articles in the wiki:

Finally, if all else fails, or you have questions or need guidance, please contact support. While support may not be able to find the specific cause for you, they will do what they can. The goal is better service for everyone, including yourself!

Your site's own IP is making a lot of connections

If you find that your site's unique IP is making a lot of connections to your site, this is not an issue. What's actually going on here is strange but harmless. These connections are all generated by Apache internally to shut down unneeded processes, and should be ignored. The fact that they appear in access logs at all is somewhat unfortunate, but difficult to avoid. There's a fuller explanation on the Httpd Wiki if you're curious.
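
If those internal connections clutter the top-visitor counts, they can be filtered out before counting. This is a minimal sketch that assumes you already know your site's IP address; the function name top_ips_excluding is made up for illustration.

```shell
# Sketch: top visitor IPs, ignoring one address (for example, the
# site's own IP).
# Assumption: the client IP is field 1 of each access log line.
top_ips_excluding() {
    # $1 = IP address to ignore, $2 = path to an access log
    awk -v skip="$1" '$1 != skip {print $1}' "$2" \
        | sort | uniq -c | sort -n | tail -n 10
}

# Usage: top_ips_excluding 208.97.191.135 access.log
```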
