Efficiently Delete Duplicates

Don't use Catenate; it is unpacking your lists!

On["Packing"]

Catenate[list] // ByteCount

Developer`FromPackedArray::punpack1: Unpacking array with dimensions {30}. >>

Developer`FromPackedArray::punpack1: Unpacking array with dimensions {30}. >>

Developer`FromPackedArray::punpack1: Unpacking array with dimensions {30}. >>

General::stop: Further output of Developer`FromPackedArray::punpack1 will be suppressed during this calculation. >>

1440000080

Join instead:

Join @@ list // ByteCount   (* no messages *)
480000144
  • On this packed array DeleteDuplicates works OK.
  • MaxMemoryUsed is significant but not overwhelming for modern hardware.
  • Union is slightly more memory efficient if order does not matter.

Code:

DeleteDuplicates[Join @@ list] // ByteCount       (*  479855632   *)

DeleteDuplicates[Join @@ list] // MaxMemoryUsed   (*  2039547560  *)    

Union @@ list // MaxMemoryUsed                    (*  1936000536  *)

I would use Union @@ list:

a := RandomInteger[10^10, 30];
list = Table[a, 2*^6];

MaxMemoryUsed[Union @@ list;]/MaxMemoryUsed[2 list;]

2.01666

If it is important that no sorting is applied, the following uses less memory than your Fold approach and its faster, though still slow (~sqrt(300)x more time)

Catenate[Reap[Nest[(Sow[First[#]]; DeleteCases[Rest[#],
  Alternatives @@ First[#]]) &, list, Length[list]]][[2, 1]]]