How to generate Zipf distributed numbers efficiently?

The only C++11 Zipf random generator I could find calculated the probabilities explicitly and used std::discrete_distribution. This works fine for small ranges, but it is not useful if you need to generate Zipf values over a very wide range (for database testing, in my case), since it will exhaust memory. So I implemented the rejection-inversion algorithm cited in the code comment below in C++.

I have not rigorously tested this code, and some optimizations are probably possible, but it only requires constant space and seems to work well.

#include <algorithm>
#include <cmath>
#include <limits>
#include <random>

/** Zipf-like random distribution.
 *
 * "Rejection-inversion to generate variates from monotone discrete
 * distributions", Wolfgang Hörmann and Gerhard Derflinger
 * ACM TOMACS 6.3 (1996): 169-184
 */
template<class IntType = unsigned long, class RealType = double>
class zipf_distribution
{
public:
    typedef RealType input_type;
    typedef IntType result_type;

    static_assert(std::numeric_limits<IntType>::is_integer, "");
    static_assert(!std::numeric_limits<RealType>::is_integer, "");

    zipf_distribution(const IntType n=std::numeric_limits<IntType>::max(),
                      const RealType q=1.0)
        : n(n)
        , q(q)
        , H_x1(H(1.5) - 1.0)
        , H_n(H(n + 0.5))
        , dist(H_x1, H_n)
    {}

    IntType operator()(std::mt19937& rng)
    {
        while (true) {
            const RealType u = dist(rng);
            const RealType x = H_inv(u);
            const IntType  k = clamp<IntType>(std::round(x), 1, n);
            if (u >= H(k + 0.5) - h(k)) {
                return k;
            }
        }
    }

private:
    /** Clamp x to [min, max]. */
    template<typename T>
    static constexpr T clamp(const T x, const T min, const T max)
    {
        return std::max(min, std::min(max, x));
    }

    /** (exp(x) - 1) / x */
    static double
    expxm1bx(const double x)
    {
        return (std::abs(x) > epsilon)
            ? std::expm1(x) / x
            : (1.0 + x/2.0 * (1.0 + x/3.0 * (1.0 + x/4.0)));
    }

    /** H(x) = log(x) if q == 1, (x^(1-q) - 1)/(1 - q) otherwise.
     * H(x) is an integral of h(x).
     *
     * Note the numerator is one less than in the paper, in order to work
     * with all positive q.
     */
    RealType H(const RealType x) const
    {
        const RealType log_x = std::log(x);
        return expxm1bx((1.0 - q) * log_x) * log_x;
    }

    /** log(1 + x) / x */
    static RealType
    log1pxbx(const RealType x)
    {
        return (std::abs(x) > epsilon)
            ? std::log1p(x) / x
            : 1.0 - x * ((1/2.0) - x * ((1/3.0) - x * (1/4.0)));
    }

    /** The inverse function of H(x) */
    RealType H_inv(const RealType x) const
    {
        const RealType t = std::max(-1.0, x * (1.0 - q));
        return std::exp(log1pxbx(t) * x);
    }

    /** The hat function h(x) = 1 / (x ^ q) */
    RealType h(const RealType x) const
    {
        return std::exp(-q * std::log(x));
    }

    static constexpr RealType epsilon = 1e-8;

    IntType                                  n;     ///< Number of elements
    RealType                                 q;     ///< Exponent
    RealType                                 H_x1;  ///< H(x_1)
    RealType                                 H_n;   ///< H(n)
    std::uniform_real_distribution<RealType> dist;  ///< [H(x_1), H(n)]
};
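
For reference, here is a minimal sketch of the closed forms that H and H_inv evaluate, written as free functions. The names H_simple and H_inv_simple are hypothetical and not part of the class above. They skip the expxm1bx/log1pxbx helpers, so they are presumably less accurate when q is very close to 1, but they make it easy to check that the two functions really are inverses of each other.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical free-function form of H, for illustration only:
// H(x) = log(x) if q == 1, (x^(1-q) - 1) / (1 - q) otherwise.
double H_simple(double x, double q)
{
    assert(x > 0.0);
    return (q == 1.0) ? std::log(x)
                      : (std::pow(x, 1.0 - q) - 1.0) / (1.0 - q);
}

// Hypothetical free-function form of the inverse of H:
// exp(y) if q == 1, (1 + y * (1 - q))^(1 / (1 - q)) otherwise.
double H_inv_simple(double y, double q)
{
    return (q == 1.0) ? std::exp(y)
                      : std::pow(1.0 + y * (1.0 - q), 1.0 / (1.0 - q));
}
```

Feeding H_simple(x, q) into H_inv_simple with the same q should give back x up to rounding error, which is a cheap sanity check if you modify the class.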

As a complement to the very nice rejection-inversion implementation given above, here's a C++ class with the same API that is simpler and faster, but only for a small number of bins. On my machine, it's about 2.3x faster for N=300. It's faster because it performs a direct table lookup instead of computing logs and powers. The table eats cache, though. Making a guess based on the size of my CPU's d-cache, I imagine that the rejection-inversion algorithm given above will become faster somewhere around N=35K. Also, initializing the table requires a call to std::pow() for each bin, so this wins on performance only if you are drawing more than N values out of it. Otherwise, rejection-inversion is faster. Choose wisely.

(I've set up the API so it looks a lot like what the C++ standards committee might come up with.)

#include <cmath>
#include <limits>
#include <random>
#include <vector>

/**
 * Example usage:
 *
 *    std::random_device rd;
 *    std::mt19937 gen(rd());
 *    zipf_table_distribution<> zipf(300);
 *
 *    for (int i = 0; i < 100; i++)
 *        printf("draw %d %lu\n", i, zipf(gen));
 */
template<class IntType = unsigned long, class RealType = double>
class zipf_table_distribution
{
   public:
      typedef IntType result_type;

      static_assert(std::numeric_limits<IntType>::is_integer, "");
      static_assert(!std::numeric_limits<RealType>::is_integer, "");

      /// zipf_table_distribution(N, s)
      /// Zipf distribution for `N` items, in the range `[1,N]` inclusive.
      /// The distribution follows the power-law 1/n^s with exponent `s`.
      /// This uses a table-lookup, and thus provides values more
      /// quickly than zipf_distribution. However, the table can take
      /// up a considerable amount of RAM, and initializing this table
      /// can consume significant time.
      zipf_table_distribution(const IntType n,
                              const RealType q=1.0) :
         _n(init(n,q)),
         _q(q),
         _dist(_pdf.begin(), _pdf.end())
      {}
      void reset() {}

      IntType operator()(std::mt19937& rng)
      {
         return _dist(rng);
      }

      /// Returns the parameter the distribution was constructed with.
      RealType s() const { return _q; }
      /// Returns the minimum value potentially generated by the distribution.
      result_type min() const { return 1; }
      /// Returns the maximum value potentially generated by the distribution.
      result_type max() const { return _n; }

   private:
      std::vector<RealType>               _pdf;  ///< Prob. distribution
      IntType                             _n;    ///< Number of elements
      RealType                            _q;    ///< Exponent
      std::discrete_distribution<IntType> _dist; ///< Draw generator

      /** Initialize the probability mass function */
      IntType init(const IntType n, const RealType q)
      {
         _pdf.reserve(n+1);
         _pdf.emplace_back(0.0);
         for (IntType i=1; i<=n; i++)
            _pdf.emplace_back(std::pow((double) i, -q));
         return n;
      }
};

Pre-calculating the probabilities alone does not help much. But the cumulative sums (sum_probs) are in ascending order, so we can use a binary search to find the Zipf value. That reduces the cost of generating each Zipf-distributed number from O(n) to O(log(n)), which is a big improvement in efficiency.

Here it is; just replace the zipf() function in genzipf.c with the following one:

int zipf(double alpha, int n)
{
  static int first = TRUE;      // Static first time flag
  static double c = 0;          // Normalization constant
  static double *sum_probs;     // Pre-calculated sum of probabilities
  double z;                     // Uniform random number (0 < z < 1)
  int zipf_value = 0;           // Computed Zipf value to be returned (0 = not found)
  int    i;                     // Loop counter
  int low, high, mid;           // Binary-search bounds

  // Compute normalization constant on first call only
  if (first == TRUE)
  {
    for (i=1; i<=n; i++)
      c = c + (1.0 / pow((double) i, alpha));
    c = 1.0 / c;

    sum_probs = malloc((n+1)*sizeof(*sum_probs));
    sum_probs[0] = 0;
    for (i=1; i<=n; i++) {
      sum_probs[i] = sum_probs[i-1] + c / pow((double) i, alpha);
    }
    first = FALSE;
  }

  // Pull a uniform random number (0 < z < 1)
  do
  {
    z = rand_val(0);
  }
  while ((z == 0) || (z == 1));

  // Map z to the value
  low = 1;
  high = n;
  do {
    mid = (low+high)/2;         // Integer division already rounds down
    if (sum_probs[mid] >= z && sum_probs[mid-1] < z) {
      zipf_value = mid;
      break;
    } else if (sum_probs[mid] >= z) {
      high = mid-1;
    } else {
      low = mid+1;
    }
  } while (low <= high);

  // Assert that zipf_value is between 1 and N
  assert((zipf_value >=1) && (zipf_value <= n));

  return(zipf_value);
}
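
If you are working in C++ rather than C, the same cumulative-table lookup can be sketched with std::lower_bound. build_zipf_cdf and zipf_from_uniform are hypothetical names, and the uniform value z is assumed to come from whatever generator you already use. This is only a sketch of the idea, not a drop-in replacement for genzipf.c.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Build the cumulative distribution: cdf[k-1] = P(X <= k) for k = 1..n.
std::vector<double> build_zipf_cdf(std::size_t n, double alpha)
{
    std::vector<double> cdf(n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::pow(static_cast<double>(i + 1), -alpha);
        cdf[i] = sum;
    }
    for (double& c : cdf)
        c /= sum;  // normalize so that cdf.back() == 1.0
    return cdf;
}

// Map a uniform z in [0, 1) to a Zipf value in [1, n] in O(log n).
std::size_t zipf_from_uniform(const std::vector<double>& cdf, double z)
{
    assert(!cdf.empty());
    // First entry with cdf >= z, the same condition the C loop searches for.
    auto it = std::lower_bound(cdf.begin(), cdf.end(), z);
    if (it == cdf.end())
        --it;  // guard in case z exceeds the last entry
    return static_cast<std::size_t>(it - cdf.begin()) + 1;
}
```

Building the table is O(n) in time and memory and is done once; each draw is then a single binary search.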

The following line in your code is executed n times for each call to zipf():

sum_prob = sum_prob + c / pow((double) i, alpha);

It is regrettable that it is necessary to call the pow() function because, internally, this function sums not one but two Taylor series [considering that pow(x, alpha) == exp(alpha*log(x))]. If alpha is an integer, of course, then you can speed the code up a lot by replacing pow() with simple multiplication. If alpha is a rational number, then you may be able to speed the code up to a lesser degree by coding a Newton-Raphson iteration to take the place of the two Taylor series. If the last condition holds, please advise.
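
For the integer case, something like the following might do. ipow and zipf_weight are hypothetical helpers (not from your code), and the sketch assumes alpha is a non-negative integer.

```cpp
#include <cassert>

// Hypothetical helper: x^k for a non-negative integer k, using only
// multiplication (exponentiation by squaring).
double ipow(double x, unsigned k)
{
    double result = 1.0;
    while (k > 0) {
        if (k & 1u)
            result *= x;  // this bit of the exponent is set
        x *= x;           // square the base for the next bit
        k >>= 1;
    }
    return result;
}

// Unnormalized Zipf weight 1 / i^alpha for integer alpha, without pow().
double zipf_weight(unsigned i, unsigned alpha)
{
    assert(i >= 1);
    return 1.0 / ipow(static_cast<double>(i), alpha);
}
```

Whether this beats pow() in practice depends on your compiler and math library, so it is worth measuring before committing to it.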

Fortunately, you have indicated that alpha does not change. Can you not speed the code up a lot by preparing a table of pow((double) i, alpha), then letting zipf() look numbers up in the table? That way, zipf() would not have to call pow() at all. I suspect that this would save significant time.
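
A minimal sketch of that table idea, assuming alpha and n are fixed for the lifetime of the table. make_weight_table and zipf_lookup are hypothetical names, and the loop only mirrors the accumulation line quoted above rather than reproducing your whole function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: table[i] = 1 / i^alpha for i = 1..n (table[0] unused).
// pow() is called n times here, once, instead of n times per draw.
std::vector<double> make_weight_table(std::size_t n, double alpha)
{
    std::vector<double> table(n + 1, 0.0);
    for (std::size_t i = 1; i <= n; ++i)
        table[i] = std::pow(static_cast<double>(i), -alpha);
    return table;
}

// Linear scan over the table; c is the normalization constant
// 1 / (table[1] + ... + table[n]) and z is a uniform value in (0, 1).
std::size_t zipf_lookup(const std::vector<double>& table, double c, double z)
{
    assert(table.size() > 1);
    double sum_prob = 0.0;
    for (std::size_t i = 1; i < table.size(); ++i) {
        sum_prob += c * table[i];  // table lookup replaces c / pow(i, alpha)
        if (sum_prob >= z)
            return i;
    }
    return table.size() - 1;  // guard against rounding in the final sum
}
```

Each draw is still O(n), but the inner loop now does one multiply and one add per step instead of a pow() call.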

Yet further improvements are possible. What if you factored a function sumprob() out of zipf()? Could you not prepare an even more aggressive look-up table for sumprob()'s use?

Maybe some of these ideas will move you in the right direction. See what you cannot do with them.

Update: I see that your question as now revised may not be able to use this answer. From the present point, your question may resolve into a question in complex variable theory. Such are often not easy questions, as you know. It may be that a sufficiently clever mathematician has discovered a relevant recurrence relation or some trick like the normal distribution's Box-Muller technique but, if so, I am not acquainted with the technique. Good luck. (It probably does not matter to you but, in case it does, the late N. N. Lebedev's excellent 1972 book Special Functions and Their Applications is available in English translation from the Russian in an inexpensive paperback edition. If you really, really wanted to crack this problem, you might read Lebedev next -- but, of course, that is a desperate measure, isn't it?)
