Generate only unique combinations when input contains duplicates

My take based on recursion. This approach will be good when the total number of elements is large compared to the number of distinct elements. The dependence on how many elements are picked (length of the subsets) is like that of Pascal's triangle, just like for Subsets.

Wrapper:

pickDistinct[list_, num_] := Block[
  {dist, count, elemsLeft, n, picked, result},
  {dist, count} = Transpose[Tally[list]];
  dist = Reverse[dist[[Ordering[count]]]];
  count = Reverse[Sort[count]];
  n = Length[dist];
  picked = ConstantArray[0, n];
  elemsLeft = Table[Total@Drop[count, i], {i, 0, n - 1}];

  result = Reap[pickDistinctRec[1, num]][[2, 1]];
  Table[
   Join @@ Table[ConstantArray[dist[[j]], result[[i, j]]], {j, 1, n}]
   , {i, Length[result]}
   ]
  ]

Recursion:

pickDistinctRec[pos_, leftToPick_] :=

 If[pos == n, picked[[pos]] = leftToPick; Sow[picked],
  Do[
   picked[[pos]] = m;
   pickDistinctRec[pos + 1, leftToPick - m]
   , {m, Min[leftToPick, count[[pos]]], 
    Max[0, leftToPick - elemsLeft[[pos + 1]]], -1}
   ]
  ]

For an example with 5 different symbols each occuring 10 times:

list = Sort@Mod[Range[50], 5];
AbsoluteTiming[ans1 = pickDistinct[list, 5];]
AbsoluteTiming[ans2 = Union@Subsets[list, {5}];]

{0.002985, Null}

{1.553072, Null}

so a factor 500 faster. Unfortunately I have 10.0.2.0 so I cannot compare to kglr's algorithm.

UPDATE AND IMPROVEMENT:

The following two tweaks give speedups relative to the original. The original is kept since it has been used for comparison in other posts.

Method 1 (minor):

Do the conversion to OP's subset format in-place instead of looping at the end. Also bail out from recursion as soon as there are no elements left to pick; this is important for the case of small subsets from list of many distinct symbols.

pickDistinct1[list_, num_] := Block[
  {dist, count, elemsLeft, n, picked, result},
  {dist, count} = Transpose[Tally[list]];
  dist = Reverse[dist[[Ordering[count]]]];
  count = Reverse[Sort[count]];
  n = Length[dist];
  picked = ConstantArray[0, n];
  elemsLeft = Table[Total@Drop[count, i], {i, 0, n - 1}];

  Reap[pickDistinctRec1[1, num]][[2, 1]]
  ]

pickDistinctRec1[pos_, leftToPick_] :=

 If[pos == n, picked[[pos]] = leftToPick; 
  Sow[Join @@ ConstantArray @@@ Transpose[{dist, picked}]],
  Do[
   picked[[pos]] = m;
   pickDistinctRec1[pos + 1, leftToPick - m]
   , {m, Min[leftToPick, count[[pos]]], 
    Max[0, leftToPick - elemsLeft[[pos + 1]]], -1}
   ]
  ]

pickDistinctRec1[pos_, 0] :=

 Sow[Join @@ 
   ConstantArray @@@ 
    Transpose[{dist, PadRight[Take[picked, pos - 1], n]}]]

Method 2 ((more) major):

Don't use OP's subset format, but instead return the number of times each symbol has been picked. The function now returns two things: a list of distinct symbols, sorted from largest to smalles frequency, and a list of all the subsets in the format above. NOTE: Whether this is better or not depends a whole lot on how the subsets will be processed after, but that's for the user to decide.

pickDistinct2[list_, num_] := 
 Block[{dist, count, elemsLeft, n, picked, result}, {dist, count} = 
   Transpose[Tally[list]];
  dist = Reverse[dist[[Ordering[count]]]];
  count = Reverse[Sort[count]];
  n = Length[dist];
  picked = ConstantArray[0, n];
  elemsLeft = Table[Total@Drop[count, i], {i, 0, n - 1}];
  {dist, Reap[pickDistinctRec2[1, num]][[2, 1]]}
  ]

pickDistinctRec2[pos_, leftToPick_] := 
 If[pos == n, picked[[pos]] = leftToPick;
  Sow[picked],
  Do[picked[[pos]] = m;
   pickDistinctRec2[pos + 1, leftToPick - m], {m, 
    Min[leftToPick, count[[pos]]], 
    Max[0, leftToPick - elemsLeft[[pos + 1]]], -1}]]

pickDistinctRec2[pos_, 0] := Sow[PadRight[Take[picked, pos - 1], n]]

Comparison:

list = Sort @ Mod[Range[300], 17];
First @ AbsoluteTiming[ans = pickDistinct[list, 8];]
First @ AbsoluteTiming[ans1 = pickDistinct1[list, 8];]
First @ AbsoluteTiming[ans2 = pickDistinct2[list, 8];]

35.994123

16.761077

7.780696

Check:

ans === ans1

True

Length[ans] == Length[ans2[[2]]]

True

The approach is now comparable to ciao's:

First @ AbsoluteTiming[fnAns = fn[list, 8];]
Sort[Sort /@ fnAns] === Sort[Sort /@ ans]

20.015753

True

and with

fn2[list_, len_] := 
 Module[{t = Tally[list], u = Union[list]}, 
  Select[Join @@ 
    Permutations /@ 
     IntegerPartitions[len, {Length@Union@list}, Range[0, len]], 
   And @@ GreaterEqual @@@ Transpose[{t[[All, 2]], #}] &]
  ]

First @ AbsoluteTiming[fnAns2 = fn2[list, 8];]
Sort[fnAns2] === Sort[ans2[[2]]]

8.756621

True

This is a quick-and-dirty, I'm sure there's more in it, busy with other things but will revisit when time permits.

fn[list_, len_] := Module[{t = Tally[list], u = Union[list]},
   Flatten[ConstantArray @@@ Transpose[{u, #}]] & /@ 
    Select[Join @@ 
      Permutations /@ 
       IntegerPartitions[len, {Length@Union@list}, Range[0, len]],
     And @@ GreaterEqual @@@ Transpose[{t[[All, 2]], #}] &]];

Tested against what appear to be the current speed champions shows improvement:

list = Sort@Mod[Range[300], 17];
AbsoluteTiming[ans1 = subs[list, 8];] // First
AbsoluteTiming[ans2 = pickDistinct[list, 8];] // First
AbsoluteTiming[ans3 = fn[list, 8];] // First

Sort[Sort /@ ans1] == Sort[Sort /@ ans2] == Sort[Sort /@ ans3]

26.6541

29.7349

18.3133

True

Here is a recursive version that is probably based on similar ideas as the one by @Marius, but looks a little simpler (which is subjective of course) and does not use any mutable state:

ClearAll[subs]
subs[list_List, len_] := Map[
  List @@ Flatten[#, Infinity, ll]&, 
  Flatten @ subs[ll[], Counts[list], len]
] 

subs[accum_, _, 0] := accum;

subs[accum_, counts_, left_] := 
  With[{fst = First @ Normal @ counts[[{1}]]},
    With[{elem = First @ fst, countLeft = Last @ fst},
        {
            (* Add element, update max count for it *)
            subs[
                ll[accum, elem], 
                DeleteCases[ReplacePart[counts, Key[elem] -> countLeft - 1], 0], 
                left -1 
            ],
            (* Skip element *)
            Replace[
                KeyDrop[counts, elem],
                {<||> -> Nothing, rc_ :> subs[accum, rc, left]}
            ]
        }
    ]
]

It uses linked lists to accumulate the individual sublists, and at each step maintains a partially accumulated list, an association with remaining counts for different elements, and the total number of slots left.

Example:

subs[{a, a, b, c, c, c}, 3]

(* {{a, a, b}, {a, a, c}, {a, b, c}, {a, c, c}, {b, c, c}, {c, c, c}} *)

It seems to be reasonably fast on larger lists, although is probably not the absolute fastest code:

list = Sort @ Mod[Range[50], 5];
subs[list, 5] // Length // AbsoluteTiming

(* {0.005467, 126} *)

Generate only unique combinations when input contains duplicates

Tags:

Permutation

Combinatorics

Subset

Related

Recent Posts