wa (Waiting for I/O) from top command is big

Solution 1:

Here are a few tools to find disk activity:

  • iotop
  • vmstat 1
  • iostat 1
  • lsof
  • strace -e trace=open <application>
  • strace -e trace=open -p <pid>

In ps auxf you'll also see which processes are in uninterruptible disk sleep (D state) because they are waiting for I/O.
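That same check can be scripted. Here is a minimal sketch, assuming a Linux /proc filesystem, that lists processes currently in D state (field positions per the proc(5) stat format):

```python
import glob

def procs_in_d_state():
    """Return (pid, comm) pairs for processes in uninterruptible sleep (state D)."""
    waiting = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                data = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # comm is parenthesised and may contain spaces, so locate it explicitly
        state = data[data.rindex(")") + 2]  # field 3 of /proc/<pid>/stat
        if state == "D":
            pid = int(data.split(" ", 1)[0])
            comm = data[data.index("(") + 1:data.rindex(")")]
            waiting.append((pid, comm))
    return waiting

if __name__ == "__main__":
    for pid, comm in procs_in_d_state():
        print(pid, comm)
```

On a healthy, idle box this usually prints nothing; run it during a load spike to catch the offenders.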

Some days the load increases to 40 without any increase in the number of visitors.

You may also want to create a backup and check whether the hard drive is slowly failing. A hard drive generally starts to slow down before it dies, which could also explain the high load.

Solution 2:

The output from top suggests that the DBMS is experiencing most of the I/O waits, so database tuning issues are an obvious candidate to investigate.

I/O waiting on a database server - particularly on load spikes - is a clue that your DBMS is either disk bound (i.e. you need a faster disk subsystem) or has a tuning issue. You should probably also profile your database server - i.e. get a trace of what it is doing and which queries are taking the time.

Some starting points for diagnosing database tuning issues:

  • Find the queries that take up the most time, and look at their query plans. See if any have odd plans, such as a table scan where there shouldn't be one. Maybe the database needs an index added.

  • Long resource wait times may mean that some key resource pool needs to be expanded.

  • Long I/O wait times may mean that you need a faster disk subsystem.

  • Are your log and data volumes on separate drives? Database logs have a lot of small sequential writes (essentially they behave like a ring buffer). If you have a busy random-access workload sharing the same disks as your logs, this will disproportionately affect the throughput of the logging. For a database transaction to commit, the log entries must be written out to disk, so this places a bottleneck on the whole system.

    Note that some MySQL storage engines don't use logs so this may not be an issue in your case.
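For the first bullet, MySQL's slow query log is the usual way to find the queries worth running EXPLAIN on. A hedged my.cnf fragment (option names vary slightly between MySQL versions, and the log path here is just an example):

```ini
[mysqld]
slow_query_log                = 1
slow_query_log_file           = /var/log/mysql/slow.log
long_query_time               = 1    # seconds; log anything slower than this
log_queries_not_using_indexes = 1    # also catch likely table scans
```

Tools such as mysqldumpslow can then summarise the log by query pattern.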

Footnote: Queuing systems

Queuing systems (a statistical model for throughput) slow down hyperbolically as the system approaches saturation. As a high-level approximation: a system that is 50% saturated has an average queue length of 2, a system that is 90% saturated has a queue length of 10, and a system that is 99% saturated has a queue length of 100.

Thus, on a system that is close to saturation, small changes in load can result in large changes in wait times - in this case manifesting as time spent waiting on I/O. If the I/O capacity of your disk subsystem is nearly saturated, even a small increase in load can significantly degrade response times.
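The figures above follow the 1 / (1 - utilisation) rule of thumb from simple queueing models. A quick sketch (the function name is mine, not from any library):

```python
def queue_length(utilisation):
    """Approximate average queue length as 1 / (1 - utilisation),
    the rule of thumb used above. Valid for 0 <= utilisation < 1."""
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return 1.0 / (1.0 - utilisation)

for u in (0.5, 0.9, 0.99):
    print(f"{u:.0%} saturated -> average queue length {queue_length(u):.0f}")
```

Note how the last percentage point of utilisation costs far more than the first fifty - which is exactly why a nearly saturated disk subsystem reacts so violently to small load changes.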


Solution 3:

Run iotop, or atop -dD, to see which processes are doing I/O. Use strace if you need a closer look.

Tags:

Linux

Top