Julia: How to return the number of unique elements in an Array

If you want an exact answer length(unique(x)) is as efficient as it gets for general objects. If your values have a limited domain, eg UInt8, it may be more efficient to use a fixed size table. If you can accept an approximation, then you can use the HyperLogLog data structure / algorithm, which is implemented in the OnlineStats package:

https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStats.HyperLogLog


It appears that length(Set(x)) is somewhat faster than length(unique(x)).

julia> using StatsBase, BenchmarkTools

julia> num_unique(x) = length(Set(x));

julia> a = sample(1:100, 200);

julia> num_unique(x) == length(unique(x))
true

julia> @benchmark length(unique(x)) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial: 
  memory estimate:  450.50 KiB
  allocs estimate:  36
  --------------
  minimum time:     498.130 μs (0.00% GC)
  median time:      570.588 μs (0.00% GC)
  mean time:        579.011 μs (2.41% GC)
  maximum time:     2.321 ms (63.03% GC)
  --------------
  samples:          5264
  evals/sample:     1

julia> @benchmark num_unique(x) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial: 
  memory estimate:  288.68 KiB
  allocs estimate:  8
  --------------
  minimum time:     283.031 μs (0.00% GC)
  median time:      393.317 μs (0.00% GC)
  mean time:        397.878 μs (4.24% GC)
  maximum time:     33.499 ms (98.80% GC)
  --------------
  samples:          6704
  evals/sample:     1

And another benchmark for an array of strings:

julia> using Random

julia> @benchmark length(unique(x)) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial: 
  memory estimate:  450.50 KiB
  allocs estimate:  36
  --------------
  minimum time:     818.024 μs (0.00% GC)
  median time:      895.944 μs (0.00% GC)
  mean time:        906.568 μs (1.61% GC)
  maximum time:     1.964 ms (51.19% GC)
  --------------
  samples:          3049
  evals/sample:     1

julia> @benchmark num_unique(x) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial: 
  memory estimate:  144.68 KiB
  allocs estimate:  8
  --------------
  minimum time:     367.018 μs (0.00% GC)
  median time:      378.666 μs (0.00% GC)
  mean time:        384.486 μs (1.07% GC)
  maximum time:     1.314 ms (70.80% GC)
  --------------
  samples:          4527
  evals/sample:     1

if you don't need the x array after, length(unique!(x)) is slightly faster. with Floats and Integers, you can use map reduce if your array is already sorted.

function count_unique_sorted(x)
    f(a) = (a,0)
    function op(a,b)
    if a[1] == b[1]
        return (b[1],a[2])
    else
        return (b[1],a[2]+1)
    end
    end
    return mapreduce(f,op,x)[2]+1 
end

If you don't care about the order of the array x, you can sort and count in one function:

count_unique_sorted!(x)=count_unique_sorted(sort!(x))

Some benchmarks:

using Random,StatsBase, BenchmarkTools
x  = sample(1:100,200)
length(unique(x)) == count_unique_sorted(sort(x)) #true

Using length(unique(x)):

@benchmark length(unique(x))
BenchmarkTools.Trial:
  memory estimate:  6.08 KiB
  allocs estimate:  17
  --------------
  minimum time:     3.350 μs (0.00% GC)
  median time:      3.688 μs (0.00% GC)
  mean time:        5.352 μs (24.35% GC)
  maximum time:     6.691 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     8

Using Set:

@benchmark length(Set(x))
BenchmarkTools.Trial:
  memory estimate:  2.82 KiB
  allocs estimate:  8
  --------------
  minimum time:     2.256 μs (0.00% GC)
  median time:      2.467 μs (0.00% GC)
  mean time:        3.654 μs (26.04% GC)
  maximum time:     5.297 ms (99.91% GC)
  --------------
  samples:          10000
  evals/sample:     9

Using count_unique_sorted!:

x2 = copy(x)
@benchmark count_unique_sorted!(x2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     948.387 ns (0.00% GC)
  median time:      990.323 ns (0.00% GC)
  mean time:        1.038 μs (0.00% GC)
  maximum time:     2.481 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     31

Using count_unique_sorted with an already sorted array

x3 = sort(x)
@benchmark count_unique_sorted(x3)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     140.962 ns (0.00% GC)
  median time:      146.831 ns (0.00% GC)
  mean time:        154.121 ns (0.00% GC)
  maximum time:     381.806 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     852

Using count_unique_sorted and sorting the array

@benchmark count_unique_sorted(sort(x))
BenchmarkTools.Trial:
  memory estimate:  1.77 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.470 μs (0.00% GC)
  median time:      1.630 μs (0.00% GC)
  mean time:        2.367 μs (21.82% GC)
  maximum time:     4.880 ms (99.94% GC)
  --------------
  samples:          10000
  evals/sample:     10

For strings, sorting and counting is slower than making a Set.

Tags:

Julia