Why is response time exploding when request frequency drops?

Solution 1:

This is common behavior in data centers. The periods when your response time is slow correspond to what is commonly called the batch window. This is a period when user activity is expected to be low and batch processes can be run. Backups are also done during this period. These activities can strain server and network resources, causing performance issues like the ones you are seeing.

There are a few resources that can cause issues (a quick way to snapshot all four is sketched after the list):

  • High CPU load. This can cause Apache to wait for a time slice to process the request.
  • High memory usage. This can flush buffers that enable Apache to serve resources without reading them from disk. It can also cause paging/swapping of Apache workers.
  • High disk activity. This can cause disk I/O activity to be queued with corresponding delays in serving content.
  • High network activity. This can cause packets to be queued for transmission, increase retries and otherwise degrade service.
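If you want to correlate the slow periods with one of these four resources quickly, a small script polling the system counters during the batch window can help. Here is a minimal sketch using the third-party psutil package; the one-minute interval and the exact counters logged are just illustrative choices:

```python
# Sketch: log CPU, memory, disk and network pressure once a minute so the
# batch window can be compared against the quiet daytime baseline.
# Assumes the third-party psutil package; the interval is arbitrary.
import time
import psutil

def snapshot():
    cpu = psutil.cpu_percent(interval=1)       # % CPU busy over 1 s
    mem = psutil.virtual_memory().percent      # % RAM in use
    disk = psutil.disk_io_counters()           # cumulative read/write bytes
    net = psutil.net_io_counters()             # cumulative sent/recv bytes
    return cpu, mem, disk.read_bytes + disk.write_bytes, net.bytes_sent + net.bytes_recv

if __name__ == "__main__":
    prev_disk, prev_net = None, None
    while True:
        cpu, mem, disk_total, net_total = snapshot()
        if prev_disk is not None:
            print(f"cpu={cpu:5.1f}%  mem={mem:5.1f}%  "
                  f"disk={(disk_total - prev_disk) / 2**20:8.1f} MiB/min  "
                  f"net={(net_total - prev_net) / 2**20:8.1f} MiB/min")
        prev_disk, prev_net = disk_total, net_total
        time.sleep(60)
```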

I use sar to investigate issues like this. atsar can be used to gather sar data into daily data files. These can be examined to see what the system behavior is like during the daytime, when performance is normal, and overnight, when performance is variable.
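sar can break the history down by resource: sar -u for CPU, sar -r for memory, sar -d for disk and sar -n DEV for network, with -f pointing at a daily data file and -s/-e restricting the time window. If you prefer to post-process the data, sadf can export it in a parseable form. A rough sketch follows; the RHEL-style /var/log/sa path is an assumption (Debian-family systems use /var/log/sysstat), and the exact column layout of sadf -d can vary between sysstat versions, hence the defensive parsing:

```python
# Sketch: compare daytime vs. overnight CPU load from a sysstat daily file.
# Assumes sysstat's sadf with -d (semicolon-separated) output; the data
# file path and the 08:00-20:00 "daytime" split are placeholders.
import subprocess
from statistics import mean

DATAFILE = "/var/log/sa/sa15"   # hypothetical example: data for the 15th

out = subprocess.run(
    ["sadf", "-d", DATAFILE, "--", "-u"],
    capture_output=True, text=True, check=True,
).stdout

day, night = [], []
for line in out.splitlines():
    if not line or line.startswith("#"):
        continue                              # skip header/comment lines
    fields = line.split(";")
    try:
        hour = int(fields[2].split(" ")[1].split(":")[0])
        busy = 100.0 - float(fields[-1])      # last column is %idle
    except (IndexError, ValueError):
        continue                              # skip restart markers etc.
    (day if 8 <= hour < 20 else night).append(busy)

if day and night:
    print(f"avg CPU busy  day: {mean(day):.1f}%   night: {mean(night):.1f}%")
```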

If you are monitoring the system with munin or some other system that gathers and graphs resource utilization, you may find some indicators there. I still find sar more precise.

There are tools like nice and ionice that can be applied to batch processes to minimize their impact. However, they are only effective for CPU and disk I/O contention; they are unlikely to resolve memory or network issues.
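You can also apply the same idea from inside a batch job rather than on its command line. A minimal sketch, roughly equivalent to launching the job under nice -n 19 ionice -c3; os.nice is standard library, while the I/O priority call assumes the psutil package and a Linux kernel:

```python
# Sketch: a batch job lowering its own CPU and disk I/O priority before
# doing heavy work. The psutil dependency and the Linux-only
# IOPRIO_CLASS_IDLE class are assumptions.
import os
import psutil

def drop_priority():
    os.nice(19)                                         # lowest CPU priority
    psutil.Process().ionice(psutil.IOPRIO_CLASS_IDLE)   # disk time only when idle

if __name__ == "__main__":
    drop_priority()
    # ... run the heavy batch work here ...
```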

Moving backup activity to a separate network can reduce network contention. Some backup software can be configured to limit the bandwidth it uses, which could reduce or resolve network contention issues.

Depending on how the batch processes are triggered, you may be able to limit the number of batch processes running in parallel. This may actually improve the performance of the batch processes themselves, as they are likely experiencing the same resource contention.
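If the jobs are launched from a scheduler you control, capping concurrency can be as simple as pushing them through a fixed-size worker pool. A sketch; the job commands and the two-worker limit are hypothetical placeholders:

```python
# Sketch: run batch jobs through a pool capped at 2 workers instead of
# firing them all at once. The commands below are invented examples.
import subprocess
from concurrent.futures import ThreadPoolExecutor

JOBS = [
    ["/usr/local/bin/rebuild-reports"],
    ["/usr/local/bin/rotate-archives"],
    ["/usr/local/bin/rebuild-search-index"],
    ["/usr/local/bin/expire-sessions"],
]

def run(cmd):
    # Each job runs to completion; the pool ensures at most 2 run in parallel.
    return subprocess.run(cmd, check=False).returncode

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run, JOBS))

print(dict(zip((" ".join(j) for j in JOBS), results)))
```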

Solution 2:

This relationship may also run in the other direction if the request senders wait for a previous request to complete before submitting a new one. In that case, traffic drops as request time grows (for whatever reason), due to client-side queuing.
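This is just Little's law at work for synchronous clients: with a fixed number of clients each waiting for the previous response, throughput is roughly concurrency divided by response time, so the request rate falls automatically when responses slow down. A quick illustration; the client count and response times are made up:

```python
# Sketch: with N synchronous clients, observed request rate is roughly
# N / response_time (Little's law). All numbers are invented.
clients = 20

for resp_time in (0.05, 0.2, 1.0, 5.0):   # seconds per request
    rate = clients / resp_time             # requests per second
    print(f"response time {resp_time:5.2f}s  ->  ~{rate:7.1f} req/s")
```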

Or it can be an artifact of your measurement: if the graph above shows completed requests, as opposed to arriving requests, the rate will drop as request processing time grows (assuming finite capacity :D).


Solution 3:

Though @BillThor's answer may be correct, it seems unlikely that the period of low load is entirely taken up by backup processes (i.e. that the periods match precisely).

An alternative explanation is simply caching. If a given script / database / whatever has not been used recently, the relevant cached data may have been dropped to free up memory for the rest of the operating system. This might be indexes on a database, OS buffers for a file, or anything else similar. A query then has to reconstitute this information if it has been a while since the last one. In busy periods this won't happen, because the previous query will have been recent and the cache is still warm. This would also explain why you see both low and high response times during the busy period.
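The effect is easy to reproduce in miniature: anything cached that gets evicted during idle time makes the first request after a quiet spell pay the full rebuild cost. A toy sketch; the idle-eviction threshold and the 2 s rebuild cost are invented purely to show the shape of the effect:

```python
# Sketch: a toy cache whose entries are dropped after an idle period, so
# the first query after a quiet spell pays the full rebuild cost.
# The 60 s eviction window and 2 s rebuild cost are invented numbers.
import time

_cache = {}          # key -> (value, timestamp of last use)
IDLE_EVICT = 60.0    # entries unused this long are assumed evicted
REBUILD_COST = 2.0   # pretend cost of re-reading indexes/files from disk

def query(key):
    now = time.time()
    hit = key in _cache and now - _cache[key][1] < IDLE_EVICT
    if not hit:
        time.sleep(REBUILD_COST)              # cold: rebuild from "disk"
        _cache[key] = ("result", now)
    else:
        _cache[key] = (_cache[key][0], now)   # warm: just refresh last-use time
    return "warm" if hit else "cold"

print(query("report"))   # cold (slow, ~2 s)
print(query("report"))   # warm (fast) -- stays warm while queries keep coming
```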


Solution 4:

What you're seeing there looks, to me, like it could be a statistical issue. It might not be (@BillThor's answer could well be right), but I'll post this for completeness.

The response time graphs are percentile based. A sample pool of 800-1000 requests gives a reasonably stable percentile estimate; a pool of 50-100 requests may not.

If you assume that the number of slow requests isn't a linear function of request volume, so that an order of magnitude increase in requests doesn't produce an order of magnitude increase in slow requests, then higher request volumes will result in lower average request times: the same handful of slow requests is diluted by many more fast ones.
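A quick simulation makes both points concrete. Suppose each interval contains a fixed handful of intrinsically slow requests (locks, cron jobs, cold caches) regardless of traffic, plus N ordinary fast requests; then the mean and the 95th percentile both fall as N grows, and a 50-request pool gives a much noisier percentile estimate than a 1000-request pool. All distributions and counts here are invented:

```python
# Sketch: a fixed background of slow requests per interval plus N fast
# requests. Higher N dilutes the slow ones, so mean and p95 fall.
# Distributions and counts are invented for illustration.
import random
from statistics import mean, quantiles

random.seed(1)

def interval(n_fast, n_slow=5):
    fast = [random.uniform(0.05, 0.2) for _ in range(n_fast)]   # seconds
    slow = [random.uniform(2.0, 5.0) for _ in range(n_slow)]
    return fast + slow

for n_fast in (50, 200, 1000):
    times = interval(n_fast)
    p95 = quantiles(times, n=100)[94]    # 95th percentile cut point
    print(f"{n_fast:5d} fast + 5 slow:  mean={mean(times):5.2f}s  p95={p95:5.2f}s")
```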