Searching for a fast/efficient histogram algorithm (with pre-specified bins)

GSL (GNU Scientific Library) contains a histogram implementation.

Here is the documentation: http://www.gnu.org/software/gsl/manual/html_node/Histograms.html.

And here is an example use: http://www.gnu.org/software/gsl/manual/html_node/Example-programs-for-histograms.html.


I've written my own histogram code in C, as it's simple enough that I didn't even think to look for a library. Normally you just need to create an array to contain the number of bins that you want [num_bins = (int)(max_val - min_val + 1);], and as you encounter each sample you can divide by the number of bins [bin_idx = (int)((value - min_val) / bin_width);] (where bin_width = (max_val - min_val)/num_bins) to find where it belongs and then increment the bin counter. This is an easy, fast, single pass through the data. Do check my arithmetic above for edge cases.

The problem you might encounter is that the domain of your input might not be known. Having 100 bins over the whole range of double isn't going to be much good if all your data is within only a small fraction of that. The solution is to make a first pass over the data to find the min/max of your range. There's really no quick fix to this and most libraries will ask for min/max up front.


The "ideal" histogram algorithm will depend upon the range you expect to capture. Generally any histogram algorithm will look like this:

const int NSAMPLES = whatever;
double samples[NSAMPLES] = { 1.0, 3.93, 1e30, ... }; // your data set
const int NBUCKETS = 10; // or whatever
int counts[NBUCKETS] = { 0 };
for (int i = 0; i != NSAMPLES; ++i) {
    counts[TRANSFER(samples[i])]++;
}

where TRANSFER() is some function that maps your inputs to a bin (0th or Nth bin mapping to "out of range" of applicable).

The exact implementation of TRANSFER() depends a lot on the expected distribution of your sample and where you are interested in detail. Some common approaches I have seen:

  • uniform distribution in range [a,b] (requires linear transform)
  • logarithmic distribution of unsigned integer values (best when combined with some bit twiddling hacks to quickly determine the nearest power-of-two or similar).

If you don't know the distribution up-front, then you really can't have an efficient mechanism to bin them effectively: you'll either have to guess (biased or uninformative results) or store everything and sort it at the end, binning into equal-sized buckets (poor performance).