Faster derangements?

Chunks of derangements

Since I've already written LibraryLink code generating permutations, generating derangements requires just a few tweaks:

/* derangements.c */
#include "WolframLibrary.h"

DLLEXPORT mint WolframLibrary_getVersion() {
    return WolframLibraryVersion;
}
DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) {
    return LIBRARY_NO_ERROR;
}
DLLEXPORT void WolframLibrary_uninitialize(WolframLibraryData libData) {}

DLLEXPORT int nextDerangementsChunk(
        WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res
) {
    /* Values tensor: list of integers in original order. */
    MTensor valuesT = MArgument_getMTensor(Args[0]);
    /* Actual data of values tensor. */
    mint* values = libData->MTensor_getIntegerData(valuesT);
    /* Number of elements in list. */
    mint n = libData->MTensor_getDimensions(valuesT)[0];

    /* Ordered values tensor: list of integers in non-increasing order. */
    MTensor orderedValuesT = MArgument_getMTensor(Args[1]);
    /* Actual data of ordered values tensor. */
    mint* orderedValues = libData->MTensor_getIntegerData(orderedValuesT);

    /* `stateT` tensor: `{next1, next2, ..., head, ref}`. */
    MTensor stateT = MArgument_getMTensor(Args[2]);
    /*
     * First `n` elements of `next` array contain indices of next nodes
     * in emulated linked list. Other elements of `stateT` tensor are used
     * only through direct pointers.
     */
    mint* next = libData->MTensor_getIntegerData(stateT);
    /* Pointer to index of head node. */
    mint* head = next + n;
    /* Pointer to index of reference node. */
    mint* ref = head + 1;

    /* Number of permutations in returned chunk. */
    mint chunkSize = MArgument_getInteger(Args[3]);
    /* Dimensions of returned `chunk` tensor. */
    mint chunkDims[2] = {chunkSize, n};
    /* 2-dimensional tensor with chunk of permutations to be returned. */
    MTensor chunkT;
    libData->MTensor_new(MType_Integer, 2, chunkDims, &chunkT);
    /* Actual data of the chunk tensor. */
    mint* chunk = libData->MTensor_getIntegerData(chunkT);

    mint i;
    for (i = 0; i < chunkSize; i++) {
        /*
         * Based on:
         * Aaron Williams. 2009. Loopless generation of multiset permutations
         * using a constant number of variables by prefix shifts.
         * http://webhome.csc.uvic.ca/~haron/CoolMulti.pdf
         */
        mint afterRef = next[*ref];
        mint localRef;
        if (next[afterRef] < n && orderedValues[*ref] >= orderedValues[next[afterRef]]) {
            localRef = afterRef;
        } else {
            localRef = *ref;
        }
        mint newHead = next[localRef];

        next[localRef] = next[newHead];
        next[newHead] = *head;

        if (orderedValues[newHead] < orderedValues[*head]) {
            *ref = newHead;
        }
        *head = newHead;

        /* Populate i-th permutation in chunk. */
        mint j, index;
        for (j = 0, index = *head; j < n; j++) {
            if (orderedValues[index] == values[j]) {
                /*
                 * This is not a derangement. Decrement index so that i-th place
                 * will be populated with next permutation.
                 */
                i--;
                break;
            }
            chunk[i*n + j] = orderedValues[index];
            index = next[index];
        }
    }

    /* Return control over state tensor back to Wolfram Language. */
    libData->MTensor_disown(stateT);

    /* Set chunk tensor as returned value of LibraryFunction. */
    MArgument_setMTensor(Res, chunkT);

    return LIBRARY_NO_ERROR;
}

Save the above code in a derangements.c file in the same directory as the current notebook, or paste it as a string, instead of {"derangements.c"}, as the first argument of CreateLibrary in the code below. Pass appropriate optimization flags for your compiler in "CompileOptions"; the ones below are for GCC.

Needs@"CCompilerDriver`"
SetDirectory@NotebookDirectory[];
CreateLibrary[{"derangements.c"}, "derangements"(*,
    "CompileOptions" -> "-Wall -march=native -O3"*)
]
nextDerangementsChunk = LibraryFunctionLoad[%, "nextDerangementsChunk",
    {{Integer, 1, "Constant"}, {Integer, 1, "Constant"}, {Integer, 1, "Shared"}, Integer},
    {Integer, 2}
]

nextDerangementsChunk accepts four arguments: a list of integers for which we want to generate derangements, a list of the same integers in non-increasing order, a list representing the "state" of the generator, and the number of derangements in the returned chunk. The "generator state" is described more precisely in my permutations post.

As a usage example let's generate derangements of {2, 1, 4, 1, 3} in two 5-element chunks and one 2-element chunk:

values = {2, 1, 4, 1, 3};
ordered = Reverse@Sort@values;
state = Join[Range@Length@values, {0, Length@values - 2}];
nextDerangementsChunk[values, ordered, state, 5]
nextDerangementsChunk[values, ordered, state, 5]
nextDerangementsChunk[values, ordered, state, 2]
(* {{1, 4, 3, 2, 1}, {3, 4, 1, 2, 1}, {4, 3, 1, 2, 1}, {1, 4, 1, 3, 2}, {1, 3, 1, 4, 2}} *)
(* {{1, 4, 2, 3, 1}, {4, 2, 1, 3, 1}, {1, 3, 2, 4, 1}, {1, 2, 3, 4, 1}, {3, 2, 1, 4, 1}} *)
(* {{1, 3, 1, 2, 4}, {1, 2, 1, 3, 4}} *)

Currently nextDerangementsChunk performs no validation of its arguments; passing inconsistent arguments can lead to an infinite loop or a kernel crash.


Number of derangements

The above algorithm requires the expected number of derangements up front, so we need to calculate in advance how many derangements of our list there are.

In general the number of derangements of a multiset is given by a certain integral of a product of Laguerre polynomials. For a list of unique elements there's a built-in function that gives the number of derangements: Subfactorial.
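Concretely, for a multiset with element multiplicities $n_1, \ldots, n_k$, the count is (if I recall the Even–Gillis result correctly):

$$ D = (-1)^{n_1 + \cdots + n_k} \int_0^\infty e^{-x} \prod_{i=1}^k L_{n_i}(x) \, dx $$

Expanding the product as $\sum_j c_j x^j$ and using $\int_0^\infty x^j e^{-x} \, dx = j!$ reduces the integral to $\sum_j c_j \, j!$, which is what the code below computes, taking Abs in place of the sign factor.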

We'll use the Subfactorial function for that special case and Laguerre polynomials in general:

multiSubfactorial = With[{tallied = Tally@#},
    If[tallied === {{1, Length@#}},
        Subfactorial@Length@#
    (* else *),
        With[
            {coeffs = Block[{x}, 
                CoefficientList[Times @@ (LaguerreL[#1, x]^#2 & @@@ tallied), x]
            ]},
            Abs@Total[Factorial@Range[0, Length@coeffs - 1] coeffs]
        ]
    ]
]&;

All derangements

derangements // ClearAll
derangements[empty:_[]] := {empty}
derangements[_[_]] = {};
derangements[list_List /; VectorQ[Unevaluated@list, IntegerQ]] :=
    With[{n = Length@list},
        nextDerangementsChunk[
            list,
            Reverse@Sort@list,
            Join[Range@n, {0, n - 2}],
            multiSubfactorial@Tally[list][[All, 2]]
        ]
    ]
derangements[expr_ /; Not@AtomQ@Unevaluated@expr] :=
    With[{n = Length@expr, list = List @@ expr},
    With[{tallied = Sort@Tally@list},
    With[{unique = Head@expr @@ tallied[[All, 1]]},
        unique[[#]] & /@ nextDerangementsChunk[
            Lookup[PositionIndex@tallied[[All, 1]], list][[All, 1]],
            Flatten@Reverse@
                MapIndexed[ConstantArray[First@#2, Last@#1]&, tallied],
            Join[Range@n, {0, n - 2}],
            multiSubfactorial@tallied[[All, 2]]
        ]
    ]]]

Check that it generates the same derangements as other methods for integer lists:

And @@ (Function[s, Sort@derangements@s === Sort@Select[Permutations@s, FreeQ[s - #, 0] &]] /@ Join @@ (Tuples[Range@#, #] & /@ Range@6))
(* True *)

and symbolic lists:

ClearAll[f]
And @@ (Function[s, Sort@derangements@s === Sort@Select[Permutations@s, FreeQ[s - #, 0] &]] /@ Join @@ (Tuples[f /@ Range@#, #] & /@ Range@6))
(* True *)

Benchmarks

For the list of unique integers from the OP, derangements is ten times faster than the Pick-based filtering:

s = Range@9;
(res1 = Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1]&@Permutations[s]) // MaxMemoryUsed // RepeatedTiming
(res2 = derangements@s) // MaxMemoryUsed // RepeatedTiming
Sort@res1 === Sort@res2
(* {0.052, 78385160} *)
(* {0.0043, 9613720} *)
(* True *)

The difference in speed and memory usage is bigger for multisets with many duplicates, where the ratio of derangements to permutations can be much lower than 1/E.

s = Join[ConstantArray[1, 6], Range[2, 7]];
(res1 = Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1] &@Permutations[s]) // MaxMemoryUsed // RepeatedTiming
(res2 = derangements@s) // MaxMemoryUsed // RepeatedTiming
Sort@res1 === Sort@res2
(* {0.13, 191603344} *)
(* {0.0054,   70728} *)
(* True *)

s = Join[ConstantArray[1, 7], ConstantArray[2, 5], Range[3, 5]];
(res1 = Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1] &@Permutations[s]) // MaxMemoryUsed // RepeatedTiming
(res2 = derangements@s) // MaxMemoryUsed // RepeatedTiming
Sort@res1 === Sort@res2
(* {0.518, 778380768} *)
(* {0.016,    182984} *)
(* True *)

This is the fastest method I have come up with:

s = Range @ 9;

Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1] & @ Permutations[s] // 
  Length // RepeatedTiming
{0.0408, 133496}

Here is one way to generate them directly. It is based on a standard way of generating all permutations, but discards invalid ones early:

derangements[{}, ___] = {{}};
derangements[list_List, orig_List] := 
 Union @@ 
   (Prepend[#] /@ derangements[DeleteCases[list, #, 1, 1], Rest@orig] &) /@ 
     DeleteCases[list, First@orig]
derangements[list_List] := derangements[list, list]

Basically, we generate them recursively, keeping track of the original list (or what's left of it) to make sure that we don't use an element in the same position in which it originally appeared.

For each element i in the input list that isn't equal to the currently forbidden one (DeleteCases[list, First@orig]), we recursively call derangements on list with that element removed and the remainder of orig, and prepend i to each of the resulting derangements.

This in itself is quite slow, and comes somewhere between Combinatorica and my original filtering approach:

s = Range@9;
Needs["Combinatorica`"] // Quiet
Derangements[s] // Length // AbsoluteTiming
Select[Permutations[s], FreeQ[s - #, 0] &] // Length // AbsoluteTiming
Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1] &@Permutations[s] // Length // AbsoluteTiming
derangements[s] // Length // AbsoluteTiming
{4.91098, 133496}
{1.09925, 133496}
{0.0509919, 133496}
{2.44123, 133496}

However, since this is recursive, it's possible to memoise this approach, after which it's only 3-4 times slower than Mr. Wizard's filtering approach:

derangementsMemo[{}, ___] = {{}};
derangementsMemo[list_List, orig_List] := derangementsMemo[list, orig] = 
 Union @@ 
   (Prepend[#] /@ derangementsMemo[DeleteCases[list, #, 1, 1], Rest@orig] &) /@ 
     DeleteCases[list, First@orig]
derangementsMemo[list_List] := derangementsMemo[list, list]

derangementsMemo[s] // Length // AbsoluteTiming
{0.179871, 133496}

That said, I'm not sure how feasible memoisation of this kind of combinatorial function is in the long run. You might want to clean out the cache after each use of derangementsMemo.

I'm sure some more experienced Mathematica users would be able to optimise this approach even further.

I believe there must also be at least one other strategy for generating the derangements recursively, based on the recurrence for the subfactorial numbers:

$$ !n = (n-1)(!(n-1) + !(n-2)) $$

However, I haven't yet tried to wrap my head around it, but will report back if I do.


Packing

Mr. Wizard's solution is fast because it utilizes functions optimized for packed arrays. The above recursive solutions can, to some extent, also use packed arrays.

To take advantage of packing we must make sure that the calls ending the recursion return packed arrays, and replace the unpacking Prepend[#] /@ ... with PadLeft. After some cosmetic changes and localizing the memoisation we get:

ClearAll[pad, derangementsInternal, derangementsPacked]
pad[_, _]@{} = {};
pad[i_, x_]@l_ := PadLeft[l, {Length@l, i}, x];

derangementsInternal[1] =
  If[Last@#1 === Last@#2, {}, Developer`ToPackedArray@{#1}]&;
derangementsInternal[i_][list_, orig_] := derangementsInternal[i][list, orig] =
  Join @@
    (pad[i, #]@derangementsInternal[i - 1][DeleteCases[list, #, 1, 1], orig]&) /@
      Complement[list, orig[[-i ;; -i]]]

derangementsPacked[{}] = {{}};
derangementsPacked[list_List] := Internal`InheritedBlock[{derangementsInternal},
  derangementsInternal[Length@list][list, list]
]

which is faster and more memory efficient than the Pick-based solution:

s = Range@9;
(res1 = Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1] &@Permutations[s]) // MaxMemoryUsed // RepeatedTiming
(res2 = derangementsPacked@s) // MaxMemoryUsed // RepeatedTiming
res1 === res2
{0.051, 78384288}
{0.032, 34599000}
True

The advantage of this approach over the filtering ones grows with the number of duplicates:

s = Join[ConstantArray[1, 7], ConstantArray[2, 5], Range[3, 5]];
(res1 = Pick[#, Unitize[Times @@ (#\[Transpose] - s)], 1] &@Permutations[s]) // MaxMemoryUsed // RepeatedTiming
(res2 = derangementsPacked@s) // MaxMemoryUsed // RepeatedTiming
res1 === res2
{0.63, 778380768}
{0.0019,  824784}
True

For a large enough number of duplicates it can even beat the LibraryLink generator, which for the above example is ten times slower: {0.016, 182984}.