High load average, low CPU usage - why?

Solution 1:

With some further investigation, it appears that the performance problem is mostly due to a high number of network calls between two systems (Oracle SSXA and UCM). The calls are quick but numerous and serialized, hence the low CPU usage (mostly waiting for I/O), the high load average (many calls waiting to be processed) and, above all, the long response times (the many small call latencies add up).
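
To illustrate the effect (this is only a sketch with a made-up 5 ms round trip, not the actual SSXA/UCM calls): a request that makes hundreds of small, strictly serialized remote calls spends almost all of its time waiting, so the CPU stays idle while the total response time balloons.

// Minimal sketch: each "remote call" just sleeps for a hypothetical 5 ms
// network round trip, consuming no CPU while it waits.
public class SerializedCalls {

    static void remoteCall() throws InterruptedException {
        Thread.sleep(5); // thread blocks on (simulated) I/O; CPU stays idle
    }

    public static void main(String[] args) throws InterruptedException {
        int calls = 400; // hypothetical number of small calls per request
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            remoteCall(); // strictly serialized, so the latencies accumulate
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // roughly 400 * 5 ms = ~2 seconds of response time, almost all of it waiting
        System.out.println(calls + " serialized calls took " + elapsedMs + " ms");
    }
}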

Thanks for your insight on this problem!

Solution 2:

When you say 'High Load average' I assume you mean that prstat shows, for 'load average' at the bottom of its output, figures like

Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11

These numbers look similar to the ones that top provides and probably mean the average size of the run queue, i.e. how many processes are running or waiting to run. This isn't the percentage of processor time being used but how many 'things' are harassing the CPU for time to run. Admittedly, these do look quite high, but this all depends on the app that you are running; the processes may not actually be doing much once they get their slot. See here for a nice explanation regarding top.
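
If you want to sanity-check that figure from inside the JVM itself, the standard java.lang.management API exposes the same 1-minute load average the OS reports. A minimal sketch (what counts as "too high" depends entirely on your box and workload):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // 1-minute average, or -1 if unavailable
        int cpus = os.getAvailableProcessors();
        // A load average persistently far above the CPU count means work is
        // queueing up, whether that work is runnable or stuck waiting on I/O.
        System.out.printf("load average %.2f on %d CPUs%n", load, cpus);
    }
}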

I'm not familiar with WebLogic but I have noticed that, generally, with Apache Tomcat many Java threads can be spawned simultaneously for what appear to be relatively few requests. This could be what is causing those high load average numbers. Make sure that you are using connection pooling where appropriate to connect to the backend and consider upping the number of idle threads that are available to your app to handle connections (I'm not sure how you do this on WebLogic; Tomcat has a per-connector thread pool or a general executor thread pool). If you don't do this then brand new threads may be spawned to process every request.
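
WebLogic and Tomcat size their pools through server configuration rather than application code, but the underlying idea looks like this (a sketch with plain java.util.concurrent; the pool size of 50 is a made-up number, not a recommendation):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RequestDispatch {
    // A fixed pool of worker threads: idle workers are reused for new requests
    // instead of spawning (and later tearing down) a brand new thread each time.
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(50);

    public static void handle(Runnable backendCall) {
        WORKERS.submit(backendCall);
        // The pattern to avoid: new Thread(backendCall).start();
        // one fresh thread per request quickly inflates the number of threads
        // competing for CPU and I/O, which is exactly what drives load average up.
    }
}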

As to performance, you need to nail down which part of your app is suffering. Is it the processing happening on the WebLogic/Java side of things, the database access, DNS lookups (if they're being done for some reason...), network issues, or something at the OS level?

99% of the time it will be your code and how it talks to the database that is holding things up. After that it will be the configuration of the web app. Past this point you will be working on squeezing the last milliseconds out of your app or looking at providing higher concurrency with the same hardware. For this finer-grained performance tuning you need metrics.
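
Even before reaching for a proper profiler, a bit of crude wall-clock timing around the suspect layers will usually tell you whether the time goes into the database, the Java processing or somewhere else. A rough sketch (the helper and its usage are hypothetical, not from the original setup):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class QueryTimer {
    // Runs a query and logs how long it took, so slow database access can be
    // told apart from slow processing of the results on the Java side.
    public static int countRows(Connection conn, String sql) throws SQLException {
        long start = System.nanoTime();
        int rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rows++;
            }
        } finally {
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(sql + " took " + ms + " ms and returned " + rows + " rows");
        }
        return rows;
    }
}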

For Java I'd suggest installing Java Melody. It can provide a lot of info regarding what your program is doing and help narrow down where it is spending time. I've only used it with Tomcat, but it should work fine with any Java EE container/servlet thingy.
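
For reference, wiring it in is usually just a matter of dropping the JavaMelody jars into WEB-INF/lib and declaring its filter in web.xml, along these lines (from memory; check the JavaMelody documentation for the exact class names and versions):

<filter>
    <filter-name>monitoring</filter-name>
    <filter-class>net.bull.javamelody.MonitoringFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>monitoring</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
<listener>
    <listener-class>net.bull.javamelody.SessionListener</listener-class>
</listener>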

There are a number of ways you can tune Java, so take a look at Java's performance tuning guidelines (I'm sure you probably have) and make sure you're setting a heap size etc. suitable for your program. Java Melody can help you track down how much of Java's heap you're consuming, as well as how hard the garbage collector is working and how often it is interrupting your program to clear objects.
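
You can also get a quick read on heap pressure from inside the application itself with the standard management API (just a sketch; JavaMelody or GC logs will give you the same information with history attached):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // If "used" keeps brushing against "max", the garbage collector will run
        // constantly and steal time from the application; raise -Xmx/-Xms (or fix
        // the leak) accordingly. getMax() returns -1 if no limit is defined.
        System.out.printf("heap used: %d MB of %d MB max%n",
                heap.getUsed() / (1024 * 1024),
                heap.getMax() / (1024 * 1024));
    }
}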

I hope that has been helpful. If you provide any more information, I may be able to update this answer and hone it more towards your needs.


Solution 3:

As a side note, load average also includes things waiting for disk activity (i.e. harassing the disk) as well as those waiting for the CPU; it's a sum of both, so you might have a problem in either one.

See http://en.wikipedia.org/wiki/Load_(computing) "Linux also includes [in its load average] processes in uninterruptible sleep states (usually waiting for disk activity)"

In my case, the particular problem I ran into was that I had a high load average, but also lots of idle CPU and low disk usage.

It appears that, at least in my case, threads/processes waiting for I/O sometimes show up in the load average but do not cause an increase in the "await" column, even though they are still I/O bound.

You can tell that this is the case with the following code, if you run it in jruby (it just starts 100 threads, each doing lots of I/O):

100.times { Thread.new { loop { File.open('big', 'w') do |f| f.seek 10_000_000_000; f.puts 'a'; end}}}

Running that gives top output like this:

top - 17:45:32 up 38 days,  2:13,  3 users,  load average: 95.18, 50.29, 23.83
Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.5%us, 11.3%sy,  0.0%ni, 85.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32940904k total, 23239012k used,  9701892k free,   983644k buffers
Swap: 34989560k total,        0k used, 34989560k free,  5268548k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31866 packrd    18   0 19.9g  12g  11m S 117.0 41.3   4:43.85 java
  912 root      11  -5     0    0    0 S  2.0  0.0   1:40.46 kjournald

So you can see that there is lots of idle CPU and 0.0%wa, but a very high load average.

iostat similarly shows the disk as basically idle:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
       9.62    0.00    8.75    0.00    0.00   81.62

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    49.00  0.00  6.40     0.00   221.60    69.25     0.01    0.81   0.66   0.42
sda1              0.00    49.00  0.00  6.40     0.00   221.60    69.25     0.01    0.81   0.66   0.42
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

See also http://linuxgazette.net/141/misc/lg/tracking_load_average_issues.html

As a further side note, this also seems to imply that (at least in this case, running CentOS) the load average counts each thread separately in the total.


Solution 4:

Had the same problem today. After some research and diagnosis I realised that my small VPS was running out of disk space.

In a shell/prompt (Linux/Unix), type

df -h

to see the free disk space on your machine. If you are running out of disk space, that can be the problem.


Solution 5:

Another useful tool that will help in this situation is nmon.

It includes a variety of ways to view the same data presented by the other tools, in one little package.

If this is content that cannot be cached, I would recommend placing multiple servers behind a load balancer such as haproxy in TCP mode to distribute the load.
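
A bare-bones haproxy configuration for that setup might look something like the following (purely illustrative; the addresses, server names and the WebLogic-style port 7001 are made up and need adjusting to your own topology):

# layer-4 (tcp mode) balancing across two identical app servers
frontend app_in
    bind *:80
    mode tcp
    default_backend app_servers

backend app_servers
    mode tcp
    balance roundrobin
    server app1 10.0.0.1:7001 check
    server app2 10.0.0.2:7001 check

TCP mode keeps haproxy out of the HTTP parsing business; if you later need sticky sessions or header-based routing, you would switch the relevant sections to http mode instead.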