Can (compiled) matrix permanent evaluation be further sped-up?

Looking at CompilePrint[compiledGlynnAlgorithm], there are some lines with CopyTensor in it which aren't really needed. There's also a few CoerceTensor lines in there when it might be faster to just coerce the integer matrix once at the beginning.

By slightly adjusting the function, all instances of CopyTensor and CoerceTensor go away, giving a small increase in speed:

compiledGlynnAlgorithmAlt = Compile[{
    {d, _Complex, 2}, {a, _Complex, 2}}, 
   Total@Map[Apply[Times, (#.a)*#] &, d],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True];


n = 20;
rc = RandomComplex[{-I - 1, I + 1}, {n, n}];
a = compiledGlynnAlgorithmAlt[δGrayCodeList[n], rc]; // AbsoluteTiming
b = compiledGlynnAlgorithm[δGrayCodeList[n], rc]; // AbsoluteTiming
a == b
(* {0.582192, Null} *)
(* {0.690600, Null} *)
(* True *)

Some more performance can be squeezed out by caching the resulting sign of each row in δGrayCodeList[n]; the result is no longer exactly the same, but the relative difference is small:

δGrayCodeListSigns[n_] := δGrayCodeListSigns[n] = Times @@@ δGrayCodeList[n]

compiledGlynnAlgorithmKnownSign = 
  Compile[{{d, _Integer, 2}, {a, _Complex, 2}, {s, _Integer, 1}},
   s.Map[ Apply[Times, (#.a)] &, d]
   , CompilationTarget -> "C"
   , RuntimeAttributes -> {Listable}];

n = 20;
rc = RandomComplex[{-I - 1, I + 1}, {n, n}];

a = compiledGlynnAlgorithmAlt[δGrayCodeList[n], rc]; // AbsoluteTiming
b = compiledGlynnAlgorithm[δGrayCodeList[n], rc]; // AbsoluteTiming
c = compiledGlynnAlgorithmKnownSign[
      δGrayCodeList[n], rc, δGrayCodeListSigns[n]
    ]; // AbsoluteTiming

Abs[c - b]/Abs[b]

(* {0.565806, Null} *)
(* {0.614640, Null} *)
(* {0.430388, Null} *)
(* 2.49266*10^-13 *)

You might get a speed up by restricting compiledGlynnAlgorithm to work on just one row of the Gray Code list, allowing the Listable and Parallelization to come into play. I say "might" because the speed up will depend on the details of your hardware.

Redefine compiledGlynnAlgorithm like so (note that it now takes a one dimensional list for d):

compiledGlynnAlgorithm = Compile[{{d, _Integer, 1}, {a, _Complex, 2}},
  Apply[Times, (d.a) d], 
   CompilationTarget -> "C", RuntimeAttributes -> {Listable}, Parallelization -> True]

And put the Total into Permanent

Permanent[mArg_List /; (MatrixQ[mArg, NumericQ])] := 
  Total@compiledGlynnAlgorithm[δGrayCodeList[mArg // Length], mArg] // 
    #/2^((mArg // Length) - 1) &;

a bit more speed

As ssch suggested, a little more performance can be squeezed out by exploiting the fact that the product of a given row of the Gray Code list is either 1 or -1. Furthermore, these occur alternately. So we can redefine compiledGlynnAlgorithm to remove the multiplication by d:

compiledGlynnAlgorithm = Compile[{{d, _Integer, 1}, {a, _Complex, 2}},
  Apply[Times, (d.a)],
  CompilationTarget -> "C", RuntimeAttributes -> {Listable}, Parallelization -> True]

and modify Permanent to Total the odd and even rows of the result separately:

Permanent[mArg_List /; (MatrixQ[mArg, NumericQ])] :=
 Module[{x},
  x = compiledGlynnAlgorithm[δGrayCodeList[mArg // Length], mArg];
  (Total[x[[;; ;; 2]]] - Total[x[[2 ;; ;; 2]]]) // #/2^((mArg // Length) - 1) &]

On my machine this gives about a factor of 3.5 speed increase over the original code for a 20x20 matrix.


Here's my relatively compact implementation of Glynn's formula, which incorporates the Gray code optimization:

SetAttributes[GrayCode, Listable];
GrayCode[n_Integer] := BitXor[n, BitShiftRight[n]]

permanent[mat_?MatrixQ] /; Equal @@ Dimensions[mat] :=
          Module[{b = 2^(Length[mat] - 1)},
                 PadRight[{}, b, {1, -1}].(Times @@@
                 ((2 IntegerDigits[2 b - GrayCode[Range[0, b - 1]] - 1,
                                   2] - 1).mat))/b]

Here is a compiled version of the Gray-permuted Glynn formula. The code is adapted from Knuth's method for generating tuples in Gray order:

cPermanent = Compile[{{m, _Real, 2}},
                     Module[{n = Length[m], d, f, j, p, s},
                            d = Table[1, {n}]; f = Range[0, n - 1];
                            j = 0; s = 1; p = Times @@ Total[m];
                            While[j < n - 1,
                                  f[[1]] = 0; d[[n - j]] *= -1;
                                  s *= -1; p += s (Times @@ (d.m));
                                  f[[j + 1]] = f[[j + 2]]; f[[j + 2]] = j + 1;
                                  j = f[[1]]];
                            p/2^j], "RuntimeOptions" -> "Quality"];

(Modifying the compiled routine to handle complex matrices is completely straightforward.)

At least in the limited tests I did, these two perform quite well for symbolic and numerical matrices, compared to the MathWorld routine. I do not have version 10, so I can't say whether these routines are any better than the now built-in Permanent[] function.

For reference, here is an implementation of Ryser's formula, adapted from old code by Ilan Vardi:

permanent[mat_?MatrixQ] /; Equal @@ Dimensions[mat] :=
          Module[{n = Length[mat], l},
                 l = Range[2^n - 1];
                 (1 - 2 Mod[n, 2]) PadRight[{}, 2^n - 1, {-1, 1}].(Times @@@
                 Accumulate[JacobiSymbol[-1, l]
                            mat[[IntegerExponent[l, 2] + 1]]])]

It is a bit slower than the routine based on Glynn's formula.