Calculating days until disk is full

Solution 1:

Honestly, "days until full" is a lousy metric anyway -- filesystems get really stupid (fragmentation, slow allocations) as they approach 100% utilization.
I recommend the traditional 85%, 90%, and 95% thresholds (warning, alarm, and critical you-really-need-to-fix-this-NOW, respectively). This should give you lots of warning time on modern disks. Take a 1TB drive as an example:

- At 85%, you still have ~150GB free, but you're aware of a potential problem.
- At 90%, you should be planning a disk expansion or some other mitigation.
- At 95%, you've got 50GB left and should darn well have a fix in motion.

This also ensures that your filesystem functions more-or-less optimally: it has plenty of free space to deal with creating/modifying/moving large files.

If your disks aren't modern (or your usage pattern involves bigger quantities of data being thrown onto the disk) you can easily adjust the thresholds.
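If you want to codify those thresholds, a minimal sketch might look like this (the function name is mine; the default cutoffs are just the ones suggested above):

```python
def disk_status(used_bytes, capacity_bytes,
                warning=85.0, alarm=90.0, critical=95.0):
    """Map disk utilization onto warning/alarm/critical thresholds."""
    pct = 100.0 * used_bytes / capacity_bytes
    if pct >= critical:
        return "critical"
    if pct >= alarm:
        return "alarm"
    if pct >= warning:
        return "warning"
    return "ok"
```

Adjusting for faster-filling disks is then just a matter of passing lower cutoffs.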

If you're still set on a "days until full" metric, you can extract the data from Graphite and do some math on it. IBM's monitoring tools implement several days-until-full metrics that can give you an idea of how to implement it, but basically you're taking the rate of change between two points in history.
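The basic two-point calculation is simple; here's a minimal sketch (function name and units are mine, not from any particular tool):

```python
def days_until_full(used_then, used_now, days_between, capacity):
    """Project days until full from two historical utilization samples."""
    rate = (used_now - used_then) / days_between  # growth per day
    if rate <= 0:
        return None  # flat or shrinking; no meaningful projection
    return (capacity - used_now) / rate
```

For example, a disk that went from 800GB to 850GB used over a week, on a 1TB drive, projects to full in about 21 days.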

For the sake of your sanity you could use Graphite's derivative() function (which gives you the rate of change over time) and project from that, but if you really want "smarter" alerts I suggest using daily and weekly rates of change (calculated from peak usage for the day/week).

The specific projection you use (smallest rate of change, largest rate of change, average rate of change, weighted average, etc.) depends on your environment. IBM's tools offer so many different views because it's really hard to nail down a one-size-fits-all pattern.
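To illustrate the difference between those projections, here's a sketch that computes a few of them from a series of daily peak usages (all names here are hypothetical; pick whichever projection suits your environment):

```python
def projection_report(daily_peaks, capacity):
    """Days-until-full under the smallest, largest, and average daily
    growth rates, computed from consecutive daily peak-usage samples."""
    rates = [b - a for a, b in zip(daily_peaks, daily_peaks[1:])]
    growth = [r for r in rates if r > 0]
    if not growth:
        return {}  # never grew during the window; nothing to project
    headroom = capacity - daily_peaks[-1]
    return {
        "optimistic": headroom / min(growth),   # slowest observed growth
        "pessimistic": headroom / max(growth),  # fastest observed growth
        "average": headroom / (sum(growth) / len(growth)),
    }
```

The spread between "optimistic" and "pessimistic" on real data is usually wide, which is exactly why there's no one-size-fits-all answer.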

Ultimately no algorithm is going to be very good at doing the kind of calculation you want. Disk utilization is driven by users, and users are the antithesis of the Rational Actor model: All of your predictions can go out the window with one crazy person deciding that today is the day they're going to perform a full system memory dump to their home directory. Just Because.

Solution 2:

We've recently rolled out a custom solution for this using linear regression.

In our system the primary source of disk exhaustion is stray log files that aren't being rotated.

Since these grow very predictably, we can perform a linear regression on the disk utilization (e.g., z = numpy.polyfit(times, utilization, 1)), then calculate the 100% mark given the linear model (e.g., (100 - z[1]) / z[0]).
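A runnable sketch of that calculation with numpy, using synthetic data (utilization in percent, time in days; the starting point and 0.5%/day growth rate are made up for illustration):

```python
import numpy as np

# Synthetic example: utilization climbing 0.5% per day from 60%,
# sampled at 90-minute intervals over a week (112 points).
times = np.arange(112) * (1.5 / 24.0)   # sample times, in days
utilization = 60.0 + 0.5 * times        # percent used (perfectly linear here)

z = np.polyfit(times, utilization, 1)   # z[0] = slope (%/day), z[1] = intercept
days_to_full = (100.0 - z[1]) / z[0]    # where the fitted line crosses 100%
```

Note that days_to_full is measured from the start of the sample window, so subtract the window length if you want "days from now."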

Our deployed implementation uses Ruby and GSL, though numpy works quite well too.

Feeding this a week's worth of average utilization data at 90-minute intervals (112 points) has picked out likely candidates for disk exhaustion without too much noise so far.

The class in the gist is wrapped in another class that pulls data from Scout, posts alerts to Slack, and sends some runtime telemetry to StatsD. I'll leave that bit out since it's specific to our infrastructure.