How can I generate sorted uniformly distributed random numbers efficiently in C++?

There are ways to generate samples that are already sorted, but I think that it might be better to generate partially sorted samples.

Divide the output range into k buckets of equal width. The numbers of samples landing in the buckets follow a multinomial distribution with equal probabilities. The slow way to sample that multinomial is to generate n integers in [0, k) and count how many fall in each bucket. A more efficient way is to draw k Poisson samples with rate n/k, conditioned on their sum not exceeding n, then distribute the remaining n - sum samples the slow way. Sampling the Poisson distribution exactly is tricky, but when n/k is very large (as it will be here), it is excellently approximated by rounding a normal distribution with mean and variance n/k. If that's unacceptable, the slow method does parallelize well.
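
To put some numbers on it: with n = 1e9 and k = 1e6 the rate is n/k = 1000, so each bucket count is drawn by rounding a normal with mean 1000 and standard deviation sqrt(1000) ≈ 31.6. The conditioned total is roughly Normal(n, sqrt(n)), so the rejection loop accepts about half the time (roughly two attempts on average), and the leftover n - sum is on the order of sqrt(n) ≈ 3e4, which makes the slow top-up step negligible.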

Given the bucket counts, compute the prefix sums to find the bucket boundaries. For each bucket in parallel, generate the given number of samples within the bucketed range and sort them. If we choose n/k well, each bucket will almost certainly fit in L1 cache. For n = 1e9, I think I'd try k = 1e5 or k = 1e6.

Here's a sequential implementation. It's a little unpolished, since we really should avoid double-sampling the bucket boundaries (which are closed), but I'll leave that to you. I'm not familiar with OMP, but I think you can get a pretty good parallel implementation by adding a pragma to the for loop at the end of SortedUniformSamples (a sketch of that pragma follows the listing).

#include <algorithm>
#include <cmath>
#include <iostream>
#include <numeric>
#include <random>
#include <span>
#include <vector>

template <typename Dist, typename Gen>
void SortedSamples(std::span<double> samples, Dist dist, Gen& gen) {
  for (double& sample : samples) {
    sample = dist(gen);
  }
  std::sort(samples.begin(), samples.end());
}

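// Approximately samples a multinomial split of n items into samples.size()
// equally likely buckets: draw each count from a rounded normal approximation
// of Poisson(n / k), reject until the total does not exceed n, then assign
// the remaining items to uniformly random buckets.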
template <typename Gen>
void ApproxMultinomialSample(std::span<std::size_t> samples, std::size_t n,
                             Gen& gen) {
  double lambda = static_cast<double>(n) / samples.size();
  std::normal_distribution<double> approx_poisson{lambda, std::sqrt(lambda)};
  std::size_t sum;
  do {
    for (std::size_t& sample : samples) {
      sample = std::lrint(approx_poisson(gen));
    }
    sum = std::accumulate(samples.begin(), samples.end(), std::size_t{0});
  } while (sum > n);
  std::uniform_int_distribution<std::size_t> uniform{0, samples.size() - 1};
  for (; sum < n; sum++) {
    samples[uniform(gen)]++;
  }
}

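// Generates sorted Uniform(0, 1) samples: pick per-bucket counts, turn them
// into bucket boundaries with a prefix sum, then fill and sort each bucket
// independently over its own sub-range.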
template <typename Gen>
void SortedUniformSamples(std::span<double> samples, Gen& gen) {
  static constexpr std::size_t kTargetBucketSize = 1024;
  if (samples.size() < kTargetBucketSize) {
    SortedSamples(samples, std::uniform_real_distribution<double>{0, 1}, gen);
    return;
  }
  std::size_t num_buckets = samples.size() / kTargetBucketSize;
  std::vector<std::size_t> bucket_counts(num_buckets);
  ApproxMultinomialSample(bucket_counts, samples.size(), gen);
  std::vector<std::size_t> prefix_sums(num_buckets + 1);
  std::partial_sum(bucket_counts.begin(), bucket_counts.end(),
                   prefix_sums.begin() + 1);
  for (std::size_t i = 0; i < num_buckets; i++) {
    // Use subspan to avoid forming &samples[samples.size()] for the last bucket.
    SortedSamples(samples.subspan(prefix_sums[i],
                                  prefix_sums[i + 1] - prefix_sums[i]),
                  std::uniform_real_distribution<double>{
                      static_cast<double>(i) / num_buckets,
                      static_cast<double>(i + 1) / num_buckets},
                  gen);
  }
}

int main() {
  std::vector<double> samples(100000000);
  std::default_random_engine gen;
  SortedUniformSamples(samples, gen);
  if (std::is_sorted(samples.begin(), samples.end())) {
    std::cout << "sorted\n";
  }
}
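
For what it's worth, here's a rough, untested sketch of how that OMP pragma might look on the final loop. The per-bucket work is independent, but each thread needs its own engine (sharing one Gen across threads would be a data race), so the per-thread seeding below is my own assumption rather than a vetted scheme:

// Untested sketch: replaces the final loop in SortedUniformSamples.
// Needs #include <omp.h> for omp_get_thread_num().
const auto base_seed = gen();
#pragma omp parallel
{
  // Per-thread engine; seeding by thread id is an assumption, not a vetted
  // scheme for statistical independence.
  std::default_random_engine local_gen(base_seed + omp_get_thread_num());
  #pragma omp for
  for (std::size_t i = 0; i < num_buckets; i++) {
    SortedSamples(samples.subspan(prefix_sums[i],
                                  prefix_sums[i + 1] - prefix_sums[i]),
                  std::uniform_real_distribution<double>{
                      static_cast<double>(i) / num_buckets,
                      static_cast<double>(i + 1) / num_buckets},
                  local_gen);
  }
}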

If your standard library has a high-quality implementation of poisson_distribution, you could also do this:

template <typename Gen>
void MultinomialSample(std::span<std::size_t> samples, std::size_t n,
                       Gen& gen) {
  double lambda = static_cast<double>(n) / samples.size();
  std::poisson_distribution<std::size_t> poisson{lambda};
  std::size_t sum;
  do {
    for (std::size_t& sample : samples) {
      sample = poisson(gen);
    }
    sum = std::accumulate(samples.begin(), samples.end(), std::size_t{0});
  } while (sum > n);
  std::uniform_int_distribution<std::size_t> uniform{0, samples.size() - 1};
  for (; sum < n; sum++) {
    samples[uniform(gen)]++;
  }
}

I'd be tempted to rely on the fact that the gaps between consecutive elements of a sorted set of uniformly distributed variables are distributed like i.i.d. exponentials normalised by their total: draw exponential gaps, accumulate them, and divide by the overall sum. This can be exploited to run in O(N) time rather than O(N log N).

A quick implementation would do something like:

#include <random>
#include <vector>

template <typename T>
void computeSortedUniform2(std::vector<T>& elements)
{
    std::random_device rd;
    std::mt19937 prng(rd());

    std::exponential_distribution<T> dist(static_cast<T>(1));

    // Running sum of exponential gaps; n elements need n + 1 gaps in total,
    // so start from zero and draw one extra gap after the loop.
    T sum = 0;

    for (auto& elem : elements) {
        elem = sum += dist(prng);
    }

    sum += dist(prng);

    // Normalise so the samples lie in (0, 1).
    for (auto& elem : elements) {
        elem /= sum;
    }
}

This example is simplified by assuming you want values in Uniform(0, 1), but it should be easy to generalise. Making this work with OMP isn't quite trivial, because the running sum is a sequential dependency, but it shouldn't be too hard; see the sketch below.
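
Here's a rough, untested sketch of one way the OMP version could go (the chunking scheme, the per-thread seeding, and the name computeSortedUniformParallel are my own assumptions): each thread accumulates exponential gaps over its own chunk, a cheap sequential pass turns the per-chunk totals into offsets, and a second parallel pass shifts and normalises.

#include <omp.h>
#include <random>
#include <vector>

template <typename T>
void computeSortedUniformParallel(std::vector<T>& elements)
{
    std::vector<T> chunk_totals(omp_get_max_threads() + 1, T(0));

    #pragma omp parallel
    {
        const int nt = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        const std::size_t n = elements.size();
        const std::size_t begin = n * tid / nt;
        const std::size_t end = n * (tid + 1) / nt;

        // Per-thread PRNG; seeding by thread id is an assumption, not a
        // vetted scheme for statistical independence.
        std::mt19937 prng(std::random_device{}() + tid);
        std::exponential_distribution<T> dist(static_cast<T>(1));

        // Each thread computes a local prefix sum of exponential gaps.
        T sum = 0;
        for (std::size_t i = begin; i < end; ++i) {
            elements[i] = sum += dist(prng);
        }
        chunk_totals[tid + 1] = sum;

        #pragma omp barrier
        #pragma omp single
        {
            // One extra gap so the largest sample stays below 1, then turn
            // the per-chunk totals into running offsets.
            chunk_totals[nt] += dist(prng);
            for (int t = 1; t <= nt; ++t) {
                chunk_totals[t] += chunk_totals[t - 1];
            }
        }  // implicit barrier at the end of single

        // Shift by the preceding chunks' total and normalise.
        const T offset = chunk_totals[tid];
        const T total = chunk_totals[nt];
        for (std::size_t i = begin; i < end; ++i) {
            elements[i] = (elements[i] + offset) / total;
        }
    }
}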

If you care about the last ~50% of performance, there are some numeric tricks that might speed up generating random deviates (e.g. there are faster and better PRNGs than the MT) as well as converting them to doubles (though recent compilers may already know these tricks). A couple of references: Daniel Lemire's blog and Melissa O'Neill's PCG site.
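
As one example of the double-conversion trick (my guess at which trick is meant; the references above cover it properly): keep the top 53 bits of a 64-bit PRNG output and scale by 2^-53, which avoids the overhead of std::uniform_real_distribution.

#include <cstdint>

// Map a 64-bit random word to a uniform double in [0, 1): keep the top
// 53 bits (the width of a double's mantissa) and scale by 2^-53.
inline double to_unit_double(std::uint64_t x) {
    return (x >> 11) * 0x1.0p-53;
}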

I've just benchmarked this and discovered that clang's std::uniform_real_distribution and std::exponential_distribution are both very slow. numpy's Ziggurat-based implementations are 8 times faster, such that I can generate 1e9 doubles in ~10 seconds on a single thread of my laptop using the above algorithm (the std implementations take ~80 seconds). I've not tried OP's implementation on 1e9 elements, but with 1e8 elements mine is ~15 times faster.