Parity of permutation with parallelism

The number in each array slot is the proper slot for that item. Think of it as a direct link from the "from" slot to the "to" slot. An array like this is very easy to sort in O(N) time with a single CPU just by following the links, so it would be a shame to have to use a generic sorting algorithm to solve this problem. Thankfully...

You can do this easily in O(log N) time with Ω(N) CPUs.

Let A be your array. Since each slot has a single link out (the number stored in that slot) and a single link in (its own index is stored in exactly one slot), the links decompose the array into some number of cycles.

The parity of the permutation is the parity of N-m, where N is the length of the array and m is the number of cycles (counting fixed points as 1-cycles), so we can get your answer by counting the cycles.
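For reference, here's what the serial O(N) version looks like -- a minimal Python sketch (serial_parity is just an illustrative name), counting cycles by following the links once:

def serial_parity(A):
    # Parity of the permutation A: 0 = even, 1 = odd.
    N = len(A)
    seen = [False] * N
    m = 0                       # number of cycles, fixed points included
    for i in range(N):
        if not seen[i]:
            m += 1              # found a new cycle
            j = i
            while not seen[j]:  # walk the cycle, marking every slot
                seen[j] = True
                j = A[j]
    return (N - m) % 2

For example, serial_parity([1, 0, 3, 4, 2]) returns 1: the cycles are (0 1) and (2 3 4), so N-m = 5-2 = 3, which is odd.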

First, make an array S of length N, and set S[i] = i.

Then:

Repeat ceil(log_2(N)) times:
    foreach i in [0,N), in parallel:
        if S[i] < S[A[i]] then:
            S[A[i]] = S[i]
        A[i] = A[A[i]]

When this is finished, every S[i] will contain the smallest index in the cycle containing i. The first pass of the inner loop propagates the smallest S[i] to the next slot in the cycle by following the link in A[i]. Then each link is made twice as long, so the next pass will propagate it to 2 new slots, etc. It takes at most ceil(log_2(N)) passes to propagate the smallest S[i] around the cycle.
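Here's a serial Python sketch of those passes (propagate_leaders is my name for it; the snapshot copies emulate a synchronous PRAM step, in which every processor reads before any processor writes):

def propagate_leaders(A):
    # After ceil(log_2(N)) pointer-jumping passes, S[i] holds the
    # smallest index in the cycle containing i.
    N = len(A)
    A = list(A)                  # work on a copy of the permutation
    S = list(range(N))
    for _ in range((N - 1).bit_length()):   # ceil(log_2(N)) passes
        oldS, oldA = S[:], A[:]  # snapshot: all reads precede all writes
        for i in range(N):       # "foreach i in [0,N), in parallel"
            if oldS[i] < oldS[oldA[i]]:
                S[oldA[i]] = oldS[i]   # push the smaller label along the link
            A[i] = oldA[oldA[i]]       # double the link length
    return S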

Let's call the smallest slot in each cycle the cycle's "leader". The number of leaders is the number of cycles. We can mark the leaders like this:

foreach i in [0,N), in parallel:
    if (S[i] == i) then:
        S[i] = 1  //leader
    else:
        S[i] = 0  //not leader

Finally, we sum the elements of S (a standard O(log N) parallel reduction) to get the number of cycles in the permutation, from which we can easily calculate its parity.
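Putting the pieces together -- again a sketch, reusing propagate_leaders from above, with the serial sum standing in for the O(log N) tree reduction:

def parallel_parity(A):
    # Parity via the parallel cycle count: 0 = even, 1 = odd.
    N = len(A)
    S = propagate_leaders(A)
    for i in range(N):           # "in parallel": mark the leaders
        S[i] = 1 if S[i] == i else 0
    m = sum(S)                   # on a PRAM: an O(log N) tree reduction
    return (N - m) % 2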


You didn't specify a machine model, so I'll assume that we're working with an EREW PRAM. The complexity measure you care about is called "span", the number of rounds the computation takes. There is also "work" (number of operations, summed over all processors) and "cost" (span times number of processors).

From the point of view of theory, the obvious answer is to modify an O(log n)-depth sorting network (AKS or Goodrich's Zigzag Sort) to count swaps, then return (number of swaps) mod 2. The code is very complex, and the constant factors are quite large.

A more practical algorithm is to use Batcher's bitonic sorting network instead, which raises the span to O(log^2 n) but has reasonable constant factors (small enough that people actually use it in practice to sort on GPUs).
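To make the swap-counting idea concrete, here's a serial Python sketch of it on the bitonic network (bitonic_parity is an illustrative name; it assumes n is a power of two, which you can arrange by padding the permutation with fixed points n, n+1, ... without changing its parity). Every comparator that fires applies one transposition, so the swap count has the same parity as the permutation:

def bitonic_parity(perm):
    a = list(perm)
    n = len(a)                   # assumed to be a power of two
    swaps = 0
    k = 2
    while k <= n:                # merge bitonic runs of length k
        j = k // 2
        while j >= 1:            # one layer of comparators per inner pass;
            for i in range(n):   # on a PRAM the whole layer fires at once
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:   # comparator out of order
                        a[i], a[l] = a[l], a[i]      # one transposition
                        swaps += 1
            j //= 2
        k *= 2
    return swaps % 2             # 0 = even, 1 = odd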

I can't think of a practical deterministic algorithm with span O(log n), but here's a randomized algorithm with span O(log n) with high probability. Assume n processors and let the (modifiable) input be Perm. Let Coin be an array of n Booleans.

In each of O(log n) passes, the processors do the following in parallel, where i ∈ {0…n-1} identifies the processor, and swaps ← 0 initially. Lower case variables denote processor-local variables.

Coin[i] ← true with probability 1/2, false with probability 1/2
(barrier synchronization required in asynchronous models)
if Coin[i]
    j ← Perm[i]
    if not Coin[j]
        Perm[i] ← Perm[j]
        Perm[j] ← j
        swaps ← swaps + 1
    end if
end if
(barrier synchronization required in asynchronous models)

Afterwards, we sum the processor-local values of swaps with an O(log n) parallel reduction and take the total mod 2.
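Here's a serial Python simulation of the whole thing (randomized_parity and the pass cap are illustrative choices; serializing the inner loop is safe because the active processors in a pass touch pairwise-disjoint slots):

import random

def randomized_parity(perm):
    Perm = list(perm)
    n = len(Perm)
    swaps = 0
    for _ in range(5 * max(1, n.bit_length())):  # > 2 log_{4/3}(n) passes
        if all(Perm[i] == i for i in range(n)):
            return swaps % 2                     # converged to the identity
        Coin = [random.getrandbits(1) == 1 for _ in range(n)]
        for i in range(n):                       # "in parallel"
            if Coin[i]:
                j = Perm[i]
                if not Coin[j]:
                    Perm[i], Perm[j] = Perm[j], j   # one transposition
                    swaps += 1
    # Rare fallback: finish with the O(n)-span serial cycle count.
    seen = [False] * n
    m = 0
    for i in range(n):
        if not seen[i]:
            m += 1
            while not seen[i]:
                seen[i], i = True, Perm[i]
    return (swaps + n - m) % 2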

Each pass reduces the number of i such that Perm[i] ≠ i by at least 1/4 of the current total in expectation: a displaced element j becomes fixed whenever its unique preimage i (the slot with Perm[i] = j) flips heads and j flips tails, which happens with probability 1/4. Thanks to the linearity of expectation, the expected total after r passes is at most n(3/4)^r, so after r = 2 log_{4/3}(n) = O(log n) passes the expected total is at most 1/n, which by Markov's inequality also bounds the probability that the algorithm has not converged to the identity permutation as required. On failure, we can just switch to the O(n)-span serial algorithm without blowing up the expected span, or just try again.