Checking for duplicates in sublists

There are a lot of questions on this site already about deleting duplicates and filtering lists. You'll probably learn a lot if you take the time to search for and read some of them. From several you could learn that the second argument of DeleteDuplicates is rarely a highly efficient way to approach a problem such as this and it is far better to use GatherBy where possible.

Compare the performance:

ss = Subsets[Range@16, {8}];

DeleteDuplicates[ss, Times @@ Take[#1, 4] == Times @@ Take[#2, 4] &] // Timing // First

First /@ GatherBy[ss, Times @@ Take[#, 4] &] // Timing // First

1.919

0.

The second method is so fast we must run it multiple times slow it down:

Do[First /@ GatherBy[ss, Times @@ Take[#, 4] &], {400}] // Timing // First

1.919

So we see that GatherBy is about 400 times faster on this size of problem. On larger sets it will be comparatively even faster as it has superior computational complexity.


For fun, as an additional optimization we can compute the product in a vectorized form (with tuning thanks to Michael E2) and extract our indices using Szabolcs's fine method from the second link below. This is roughly 850 times faster than DeleteDuplicates:

Do[
  With[{prod = Times @@ Take[ss\[Transpose], 4]},
    ss[[GatherBy[Range@Length@prod, prod[[#]] &][[All, 1]]]]
  ],
  {850}
] // Timing // First

1.903


Related examples:

How to represent a list as a cycle
How to efficiently find positions of duplicates?
DeleteDuplicates[] does not work as expected on floating point values
Sort+Union on a list
Intersection for lists of numeric data


How about:

DeleteDuplicates[list, Times @@ Take[#1, 4] == Times @@ Take[#2, 4] &]

If your L1 is long (length n) it may be reasonable to calculate each Product once rather than n-1 times like DeleteDuplicates does.

Let's calculate those products and then we can compare only the results.

l2 = {Times @@ #[[ ;; 4]], #} & /@ list; 
DeleteDuplicates[l2, #1[[ 1]] == #2[[ 1]] &][[;; , 2]]

We can see, it does matter:

n = 10^3;
L1 = RandomInteger[{1, 4}, {n, 7}];

DeleteDuplicates[L1, 
                 Times @@ Take[#1, 4] == Times @@ Take[#2, 4] &
                ] // AbsoluteTiming // First
0.203125
DeleteDuplicates[{Times @@ #[[ ;; 4]], #} & /@ L1 ,
                 #1[[ 1]] == #2[[ 1]] &
                ][[ ;; , 2]] // AbsoluteTiming // First
0.078125