How slow is Python really? (Or how fast is your language?)

Python2.7 + Numpy 1.8.1: 10.242 s

Fortran 90+: 0.029 s 0.003 s 0.022 s 0.010 s

Damn straight you lost your bet! Not a drop of parallelization here too, just straight Fortran 90+.

EDIT I've taken Guy Sirton's algorithm for permuting the array S (good find :D). I apparently also had the -g -traceback compiler flags active which were slowing this code down to about 0.017s. Currently, I am compiling this as

ifort -fast -o convolve convolve_random_arrays.f90

For those who don't have ifort, you can use

gfortran -O3 -ffast-math -o convolve convolve_random_arrays.f90

EDIT 2: The decrease in run-time is because I was doing something wrong previously and got an incorrect answer. Doing it the right way is apparently slower. I still can't believe that C++ is faster than mine, so I'm probably going to spend some time this week trying to tweak the crap out of this to speed it up.

EDIT 3: By simply changing the RNG section using one based on BSD's RNG (as suggested by Sampo Smolander) and eliminating the constant divide by m1, I cut the run-time to the same as the C++ answer by Guy Sirton. Using static arrays (as suggested by Sharpie) drops the run-time to under the C++ run-time! Yay Fortran! :D

EDIT 4 Apparently this doesn't compile (with gfortran) and run correctly (incorrect values) because the integers are overstepping their limits. I've made corrections to ensure it works, but this requires one to have either ifort 11+ or gfortran 4.7+ (or another compiler that allows iso_fortran_env and the F2008 int64 kind).

Here's the code:

program convolve_random_arrays
   use iso_fortran_env
   implicit none
   integer(int64), parameter :: a1 = 1103515245
   integer(int64), parameter :: c1 = 12345
   integer(int64), parameter :: m1 = 2147483648
   real, parameter ::    mi = 4.656612873e-10 ! 1/m1
   integer, parameter :: n = 6
   integer :: p, pmax, iters, i, nil(0:1), seed
   !integer, allocatable ::  F(:), S(:), FS(:)
   integer :: F(n), S(n+1), FS(2)

   !n = 6
   !allocate(F(n), S(n+1), FS(2))
   iters = 1000
   nil = 0

   !call init_random_seed()

   S = -1
   pmax = 2**(n+1)
   do p=1,pmax
      do i=1,iters
         F = rand_int_array(n)
         if(all(F==0)) then
            do while(all(F==0))
               F = rand_int_array(n)
            enddo
         endif

         FS = convolve(F,S)

         if(FS(1) == 0) then
            nil(0) = nil(0) + 1
            if(FS(2) == 0) nil(1) = nil(1) + 1
         endif

      enddo
      call permute(S)
   enddo

   print *,"first zero:",nil(0)
   print *," both zero:",nil(1)

 contains
   pure function convolve(x, h) result(y)
!x is the signal array
!h is the noise/impulse array
      integer, dimension(:), intent(in) :: x, h
      integer, dimension(abs(size(x)-size(h))+1) :: y
      integer:: i, j, r
      y(1) = dot_product(x,h(1:n-1))
      y(2) = dot_product(x,h(2:n  ))
   end function convolve

   pure subroutine permute(x)
      integer, intent(inout) :: x(:)
      integer :: i

      do i=1,size(x)
         if(x(i)==-1) then
            x(i) = 1
            return
         endif
         x(i) = -1
      enddo
   end subroutine permute

   function rand_int_array(i) result(x)
     integer, intent(in) :: i
     integer :: x(i), j
     real :: y
     do j=1,i
        y = bsd_rng()
        if(y <= 0.25) then
           x(j) = -1
        else if (y >= 0.75) then
           x(j) = +1
        else
           x(j) = 0
        endif
     enddo
   end function rand_int_array

   function bsd_rng() result(x)
      real :: x
      integer(int64) :: b=3141592653
      b = mod(a1*b + c1, m1)
      x = real(b)*mi
   end function bsd_rng
end program convolve_random_arrays

I suppose the question now is will you stop using slow-as-molasses Python and use fast-as-electrons-can-move Fortran ;).


Python 2.7 - 0.882s 0.283s

(OP's original: 6.404s)

Edit: Steven Rumbalski's optimization by precomputing F values. With this optimization cpython beats pypy's 0.365s.

import itertools
import operator
import random

n=6
iters = 1000
firstzero = 0
bothzero = 0

choicesF = filter(any, itertools.product([-1, 0, 0, 1], repeat=n))

for S in itertools.product([-1,1], repeat = n+1):
    for i in xrange(iters):
        F = random.choice(choicesF)
        if not sum(map(operator.mul, F, S[:-1])):
            firstzero += 1
            if not sum(map(operator.mul, F, S[1:])):
                bothzero += 1

print "firstzero", firstzero
print "bothzero", bothzero

OP's original code uses such tiny arrays there is no benefit to using Numpy, as this pure python implementation demonstrates. But see also this numpy implementation which is three times faster again than my code.

I also optimize by skipping the rest of the convolution if the first result isn't zero.


C++ bit magic

0.84ms with simple RNG, 1.67ms with c++11 std::knuth

0.16ms with slight algorithmic modification (see edit below)

The python implementation runs in 7.97 seconds on my rig. So this is 9488 to 4772 times faster depending on what RNG you choose.

#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <tuple>

#if 0
// C++11 random
std::random_device rd;
std::knuth_b gen(rd());

uint32_t genRandom()
{
    return gen();
}
#else
// bad, fast, random.

uint32_t genRandom()
{
    static uint32_t seed = std::random_device()();
    auto oldSeed = seed;
    seed = seed*1664525UL + 1013904223UL; // numerical recipes, 32 bit
    return oldSeed;
}
#endif

#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif



std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;

    uint32_t S = (1 << (n+1));
    // generate all possible N+1 bit strings
    // 1 = +1
    // 0 = -1
    while ( S-- )
    {
        uint32_t s1 = S % ( 1 << n );
        uint32_t s2 = (S >> 1) % ( 1 << n );
        uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
        static_assert( n < 16, "packing of F fails when n > 16.");


        for( unsigned i = 0; i < iters; i++ )
        {
            // generate random bit mess
            uint32_t F;
            do {
                F = genRandom() & fmask;
            } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

            // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
            // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
            // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
            // this results in the distribution ( -1, 0, 0, 1 )
            // to ease calculations we generate r = LSB(F) and l = MSB(F)

            uint32_t r = F % ( 1 << n );
            // modulo is required because the behaviour of the leftmost bit is implementation defined
            uint32_t l = ( F >> 16 ) % ( 1 << n );

            uint32_t posBits = l & ~r;
            uint32_t negBits = ~l & r;
            assert( (posBits & negBits) == 0 );

            // calculate which bits in the expression S * F evaluate to +1
            unsigned firstPosBits = ((s1 & posBits) | (~s1 & negBits));
            // idem for -1
            unsigned firstNegBits = ((~s1 & posBits) | (s1 & negBits));

            if ( popcnt( firstPosBits ) == popcnt( firstNegBits ) )
            {
                firstZero++;

                unsigned secondPosBits = ((s2 & posBits) | (~s2 & negBits));
                unsigned secondNegBits = ((~s2 & posBits) | (s2 & negBits));

                if ( popcnt( secondPosBits ) == popcnt( secondNegBits ) )
                {
                    bothZero++;
                }
            }
        }
    }

    return std::make_pair(firstZero, bothZero);
}

int main()
{
    typedef std::chrono::high_resolution_clock clock;
    int rounds = 1000;
    std::vector< std::pair<unsigned, unsigned> > out(rounds);

    // do 100 rounds to get the cpu up to speed..
    for( int i = 0; i < 10000; i++ )
    {
        convolve();
    }


    auto start = clock::now();

    for( int i = 0; i < rounds; i++ )
    {
        out[i] = convolve();
    }

    auto end = clock::now();
    double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;

#if 0
    for( auto pair : out )
        std::cout << pair.first << ", " << pair.second << std::endl;
#endif

    std::cout << seconds/rounds*1000 << " msec/round" << std::endl;

    return 0;
}

Compile in 64-bit for extra registers. When using the simple random generator the loops in convolve() run without any memory access, all variables are stored in the registers.

How it works: rather than storing S and F as in-memory arrays, it is stored as bits in an uint32_t.
For S, the n least significant bits are used where an set bit denotes an +1 and an unset bit denotes an -1.
F requires at least 2 bits to create an distribution of [-1, 0, 0, 1]. This is done by generating random bits and examining the 16 least significant (called r) and 16 most significant bits (called l). If l & ~r we assume that F is +1, if ~l & r we assume that F is -1. Otherwise F is 0. This generates the distribution we're looking for.

Now we have S, posBits with an set bit on every location where F == 1 and negBits with an set bit on every location where F == -1.

We can prove that F * S (where * denotes multiplication) evaluates to +1 under the condition (S & posBits) | (~S & negBits). We can also generate similar logic for all cases where F * S evaluates to -1. And finally, we know that sum(F * S) evaluates to 0 if and only if there is an equal amount of -1's and +1's in the result. This is very easy to calculate by simply comparing the number of +1 bits and -1 bits.

This implementation uses 32 bit ints, and the maximum n accepted is 16. It is possible to scale the implementation to 31 bits by modifying the random generate code, and to 63 bits by using uint64_t instead of uint32_t.

edit

The folowing convolve function:

std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;
    uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
    static_assert( n < 16, "packing of F fails when n > 16.");


    for( unsigned i = 0; i < iters; i++ )
    {
        // generate random bit mess
        uint32_t F;
        do {
            F = genRandom() & fmask;
        } while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );

        // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
        // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
        // and  ~MSB(F) & LSB(F) returns 1 for all elements that are negative
        // this results in the distribution ( -1, 0, 0, 1 )
        // to ease calculations we generate r = LSB(F) and l = MSB(F)

        uint32_t r = F % ( 1 << n );
        // modulo is required because the behaviour of the leftmost bit is implementation defined
        uint32_t l = ( F >> 16 ) % ( 1 << n );

        uint32_t posBits = l & ~r;
        uint32_t negBits = ~l & r;
        assert( (posBits & negBits) == 0 );

        uint32_t mask = posBits | negBits;
        uint32_t totalBits = popcnt( mask );
        // if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
        if ( totalBits & 1 )
            continue;

        uint32_t adjF = posBits & ~negBits;
        uint32_t desiredBits = totalBits / 2;

        uint32_t S = (1 << (n+1));
        // generate all possible N+1 bit strings
        // 1 = +1
        // 0 = -1
        while ( S-- )
        {
            // calculate which bits in the expression S * F evaluate to +1
            auto firstBits = (S & mask) ^ adjF;
            auto secondBits = (S & ( mask << 1 ) ) ^ ( adjF << 1 );

            bool a = desiredBits == popcnt( firstBits );
            bool b = desiredBits == popcnt( secondBits );
            firstZero += a;
            bothZero += a & b;
        }
    }

    return std::make_pair(firstZero, bothZero);
}

cuts the runtime to 0.160-0.161ms. Manual loop unroll (not pictured above) makes that 0.150. The less trivial n=10, iter=100000 case runs under 250ms. I'm sure i can get it under 50ms by leveraging additional cores but that's too easy.

This is done by making the inner loop branch free and swapping the F and S loop.
If bothZero is not required i can cut down the run time to 0.02ms by sparsely looping over all possible S arrays.

Tags:

Fastest Code