How Slow Is Python Really (Part II)?

C++ x150 x450 x530

Instead of array I used bits (and dark magic).
Thanks @ace for the faster random function.

How does it work: the first 15th bits of the integer s represent the array S[15]; the zeroes represent -1, the ones represent +1. The array F is build in a similar way. But with two bit for each symbol.

  • 00 represent -1
  • 01 and 10 represent 0
  • 11 represent 1

Cause S and F have a different representation I have to interleave S with itself to be comparable with F.

  • 0 (-1) became 00 (-1 in the representation of F)
  • 1 (+1) became 11 (+1 in the representation of F)

Now we can simply use Carnot to compute the inner product. Remember that one variable can only assume value 00 or 11

0 . 00 = 11 (-1 * -1 = +1)
0 . 01 = 10 (-1 * 0 = 0)
0 . 10 = 01 (-1 * 0 = 0)
0 . 11 = 00 (-1 * +1 = -1)
1 . 00 = 00 (+1 * -1 = -1)
1 . 10 = 10 (+1 * 0 = 0)
1 . 01 = 01 (+1 * 0 = 0)
1 . 11 = 11 (+1 * +1 = +1)

Looks like a not xor to me. :)

Sum the ones is just a game of shift and mask, nothing really complex.

#include <array>
#include <ctime>

// From standford bithacks
// http://graphics.stanford.edu/~seander/bithacks.html
inline int32_t interleaveBit(int32_t x)
{
   static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
   x = (x | ( x << 8)) & B[3];
   x = (x | ( x << 4)) & B[2];
   x = (x | ( x << 2)) & B[1];
   x = (x | ( x << 1)) & B[0];
   return x | (x << 1);
}

inline int32_t sumOnes(int32_t v)
{
   static int b[] = { 1, 0, 0, 1};
   int s = 0;
   for( int i = 0; i < 8; ++i)
   {
      const int a = 3&(v>>(i*2));
      s += b[a];
   }
   return s;
}

inline int32_t sumArray(int32_t v)
{
   static int b[] = { -1, 0, 0, 1};
   int s = 0;
   for( int i = 0; i < 8; ++i)
   {
      const int a = 3&(v>>(i*2));
      s += b[a];
   }
   return s;
}

uint32_t x, y = 24252, z=57768, w=1564; //PRNG seeds

int32_t myRand()
{
   uint32_t t;
   t = x ^ (x<<1);
   x = y;
   y = z;
   z = w;
   w = w ^ ( w >> 19) ^ t ^ (t >> 8);
   return w;
}

int main()
{
   std::array<int, 8> leadingZero{0};
   x = static_cast<int32_t>(time(nullptr)); // seed PRNG
   const int maxS = 1 << 15;
   for(int s = 0; s < maxS; ++s)
   {
      const int32_t x = interleaveBit(s);
      for(int i = 0; i < 1000; ++i)
      {
         int32_t random;
         do
         {
            random = 0xFFFF & myRand();
         }while(sumOnes(random) == 0);
         int j = 7;
         while( j >= 0 )
         {
            const int32_t h = (x >> (j*2));
            const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
            if(sumArray(l) == 0)
            {
               leadingZero[j]++;
            } else
            {
               break;
            }
            j--;
         }

      }
   }
   for(int i = 7; i >= 0; --i)
   {
      printf("%d ", leadingZero[i]);
   }
   printf("\n");
   return 0;
}

Here a sample output:

6332350 2525218 1041716 438741 192917 87159 41023 19908 

real 0m0.372s
user 0m0.371s
sys  0m0.001s

The program has been compiled with:

gcc -std=c++11 -O3 -msse4.2 -Wall -lstdc++ 26371.cpp -o fastPy

on Fedora 20 with gcc 4.8.2 The Cpu is a i7 8core.

Probably I can gain some ms tweaking compiler parameters.

While this is the OP solution time on my machine:

time pypy 26371.py
[6330609, 2523914, 1040885, 439303, 192708, 86987, 40710, 19498]

real 0m53.061s
user 0m53.016s
sys  0m0.022s

Edit:

Just adding openmp and change the order of the for I have a x3 gain, leading to a x450 performance improvement against OP code. :D In this case the leadingZero array must be atomic. The random global... are random, they will be more random.

 #pragma omp parallel for
 for(int i = 0; i < 1000; ++i)
 {
    int32_t random;
    do
    {
       random = 0xFFFF & myRand();
    }while(sumOnes(random) == 0);
    for(int s = 0; s < maxS; ++s)
    {
       const int32_t x = interleaveBit(s);
       int j = 7;
       while( j >= 0 )
       {
          const int32_t h = (x >> (j*2));
          const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
          if( sumArray(l) == 0 )
          {
             leadingZero[j]++;
          } else
          {
             break;
          }
          j--;
       }
    }
 }

need to add -fopenmp to the compiler flag


Edit:2 As suggester by user71404 I changed the sumOnes and sumArray functions and now it's uber fast.

real  0m0.101s
user  0m0.101s
sys   0m0.000s

With openmp is slower, cause the atomics add too much overhead.

real  0m0.253s
user  0m1.870s
sys   0m0.001s

Without atomics is even faster, but I get wrong result.

2137992 1147218 619297 321243 155815 70946 32919 15579

real   0m0.048s
user   0m0.338s
sys    0m0.001s

To understand sumArray consider that 16 bit represent and array of 8 numbers.
00 have no 1 and represent -1
01 and 10 have one 1 and represent 0
11 have two 1s and represent 1
So that built-in count the number of bit set to 1 [http://en.wikipedia.org/wiki/Hamming_weight] and to each group we remove 1. Cool.

sumOnes is just black magic.

Here the latest compile flags and code.

gcc -std=c++11 -mfpmath=sse -O3 -flto -march=native -funroll-loops -Wall -lstdc++

#include <cstdint>
#include <cstdio>
#include <ctime>

inline int32_t interleaveBit(int32_t x)
{
   static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
   x = (x | ( x << 8)) & B[3];
   x = (x | ( x << 4)) & B[2];
   x = (x | ( x << 2)) & B[1];
   x = (x | ( x << 1)) & B[0];
   return x | (x << 1);
}

inline int32_t sumOnes(int32_t v)
{
   /* 0xAAAA == 0b1010 1010 1010 1010 */
   return !!(0xAAAA & (v ^ ~(v << 1)));
}

inline int32_t sumArray(int32_t v)
{
   return __builtin_popcount(v) - 8;
}

uint32_t x, y = 24252, z = 57768, w = 1564; //PRNG seeds

int32_t myRand()
{
   uint32_t t;
   t = x ^ (x << 1);
   x = y;
   y = z;
   z = w;
   w = w ^ ( w >> 19) ^ t ^ (t >> 8);
   return w;
}

int main()
{
   int leadingZero[8] = { 0 };
   x = static_cast<int32_t>(time(nullptr)); // seed PRNG
   const int maxS = 1 << 15;
   for( int i = 0; i < 1000; ++i )
   {
      int32_t random;
      do
      {
         random = 0xFFFF & myRand();
      } while(sumOnes(random) == 0 );
      for( int s = 0; s < maxS; ++s )
      {
         const int32_t x = interleaveBit(s);
         int j = 7;
         while( j >= 0 )
         {
            const int32_t h = (x >> (j * 2));
            const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
            if( sumArray(l) == 0 )
            {
               leadingZero[j]++;
            } else
            {
               break;
            }
            j--;
         }
      }
   }
   printf("[%d, %d, %d, %d, %d, %d, %d, %d]\n",
      leadingZero[7], leadingZero[6],
      leadingZero[5], leadingZero[4],
      leadingZero[3], leadingZero[2],
      leadingZero[1], leadingZero[0]);
   return 0;
}

C++ bit magic

~16ms multithreaded, 56ms singlethreaded. ~4000 speedup.

(speedup is based on multithreaded code on my i7-2820QM and the 1 min 9 seconds mentioned in the question. Since the OP's system has worse single threaded performance than my CPU but better multi threaded perfiormance i expect this number to be accurate)

The multithreaded part is quite inefficient due to the spawning of threads. I could probably do better by leveraging my custom job library but that one has bugs under unix systems.. For an explanation and almost identical code without threading refer to https://codegolf.stackexchange.com/a/26485/20965.

edit

I gave each thread it's own RNG and cut down the bit length to 32 which reduced the runtime by a few ms.

#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <array>
#include <tuple>
#include <memory>
#include <thread>
#include <future>
#include <string.h>


#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif



void convolve()
{
    static const unsigned threadCount = 32;
    static const unsigned n = 8;
    static const unsigned m = 8;
    static const unsigned totalIters = 1000;
    static_assert( n <= 16, "packing of F fails when n > 16.");
    static uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;

    std::array< uint32_t, m * threadCount > out;
    std::vector< std::future<void> > threads;

    for( int threadId = 0; threadId < threadCount; threadId++)
    {
        threads.emplace_back( std::async( [&, threadId]
        {
            std::random_device rd;
            std::knuth_b gen(rd());
            uint32_t nextRandomNumber = gen();

            const unsigned iters = totalIters / threadCount;

            std::array< uint32_t, m > leadingZeros;
            for( auto& x : leadingZeros )
                x = 0;

            for( unsigned i = 0; i < iters; i++ )
            {
                // generate random bit mess
                uint32_t F;
                do {
                    // this funky looking construction shortens the dependancy chain of F
                    F = nextRandomNumber & fmask;
                    nextRandomNumber = gen();
                } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

                // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
                // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
                // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
                // this results in the distribution ( -1, 0, 0, 1 )
                // to ease calculations we generate r = LSB(F) and l = MSB(F)

                uint32_t r = F % ( 1 << n );
                // modulo is required because the behaviour of the leftmost bit is implementation defined
                uint32_t l = ( F >> 16 ) % ( 1 << n );

                uint32_t posBits = l & ~r;
                uint32_t negBits = ~l & r;
                assert( (posBits & negBits) == 0 );

                uint32_t mask = posBits | negBits;
                uint32_t totalBits = popcnt( mask );
                // if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
                if ( totalBits & 1 )
                    continue;

                uint32_t adjF = posBits & ~negBits;
                uint32_t desiredBits = totalBits / 2;

                uint32_t S = (1 << (n + m -1));
                // generate all possible N+1 bit strings
                // 1 = +1
                // 0 = -1
                while ( S-- )
                {
                    for( int shift = 0; shift < m; shift++ )
                    {
                        uint32_t s = (S >> shift) % ( 1 << n );
                        auto firstBits = (s & mask) ^ adjF;

                        if ( desiredBits == popcnt( firstBits ) )
                        {
                            leadingZeros[shift] = leadingZeros[shift] + 1;
                        }
                        else
                        {
                            break;
                        }
                    }
                }
            }

            memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m );
        } ));

    };

    std::array< uint32_t, m > leadingZeros;
    for( auto& x : leadingZeros )
        x = 0;

    for( auto& thread : threads )
    {
        thread.wait();
    }

    for( int i = 0; i < (threadCount * m); i++ )
    {
        leadingZeros[i % m] += out[i];
    }


    for( auto x : leadingZeros )
        std::cout << x << ", ";

    std::cout << std::endl;
}

int main()
{
    typedef std::chrono::high_resolution_clock clock;
    int rounds = 100;

    // do some rounds to get the cpu up to speed..
    for( int i = 0; i < rounds / 10; i++ )
    {
        convolve();
    }


    auto start = clock::now();

    for( int i = 0; i < rounds; i++ )
        convolve();

    auto end = clock::now();
    double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;

    std::cout << seconds/rounds*1000 << " msec/round" << std::endl;

    return 0;
}

Sample output:

   6317312, 2515072, 1034368, 434048, 190144, 85200, 39804, 19168,
   6226944, 2481408, 1031168, 438080, 192896, 86816, 40484, 19490,
   6321152, 2524672, 1045376, 442880, 195680, 88464, 41656, 20212,
   6330624, 2517504, 1031104, 430208, 187696, 83976, 38976, 18708,
   6304768, 2510336, 1030720, 433056, 190880, 86824, 40940, 19840,
   6272512, 2494720, 1028160, 432352, 189168, 84752, 39540, 19052,
   6233600, 2507520, 1046912, 447008, 198224, 89984, 42092, 20292,

Julia: 0.7s, 120x faster

As user20768 demonstrated, a straightforward port of the code to Julia is about twice as fast as PyPy. But we can do a lot better than that.

function pleadingzerocounts(; n = 8,
                              m = 8,
                              iters = 1000)
  @parallel (+) for S = 1:2^(8+8-1)
    leading_counts = zeros(Int, m)
    F = Array(Int, n)
    for i = 1:iters
      flag = 0
      while flag == 0
        for i = 1:n
          v = (1-(rand(Int8)&3))%2
          @inbounds F[i] = v
          flag += v & 1
        end
      end
      for j = 1:m
        sum = 0
        for i = 1:n
          @inbounds sum += S & (1 << (j + i - 2)) > 0 ? F[i] : -F[i]
        end
        sum == 0 ?
          (leading_counts[j] += 1) :
          break
      end
    end
    leading_counts
  end
end

function main()
  # Warm up the JIT
  pleadingzerocounts()
  # Then go for real
  println(@time pleadingzerocounts())
end

You can run this using julia -p 8 -e 'require("golf.jl");main()' (the 8 is the number of processes, you might want to play around with it). On the latest Julia prerelease this takes 0.7s vs. 1m22s for PyPy.

If you have enough cores on your computer, and perhaps spin up a few AWS instances, you should be able to shave off some more :)

Tags:

Fastest Code