How to subtract two list (fast)?

You can use comm to remove anything that’s common to both lists:

listr=($(comm -3 <(printf "%s\n" "${list1[@]}" | sort) <(printf "%s\n" "${list2[@]}" | sort) | sort -n))

This sorts both lists in the order comm expects, compares them, outputs only items which are unique to either list, and sorts them again in numeric order.

If both lists are sorted lexicographically (as per LC_COLLATE), you can avoid sorting again:

listr=($(comm --nocheck-order -3 <(printf "%s\n" "${list1[@]}") <(printf "%s\n" "${list2[@]}")))

This also works great if the values you need to compare are stored in files.


#!/bin/zsh
list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
list2=( 1 2 3   5   7 8 9    11 12 )
listr=("${(@)list1:|list2}")
typeset -p listr

Abstract:

  • For long lists, if lists are already sorted, comm (alg7) is the fastest.

  • The zsh solution is (by far) the fastest if no file is read, that is, the lists are given "in memory". However, that is not a fair comparison with all the other solutions that have to read values from files. I changed the original code (which avoided the time to read files from testing) to a code that also include the time to read files.


This is a Community Answer, it is meant to only report the time of each answer.

Please DO edit and add your solution/option to compare all.


List of algorithms

  • alg1: naive looped solution.
  • alg2: using external sort and uniq -u
  • alg3: processing an string in bash.
  • alg4: join -v on sorted lists (Thanks @Kusalananda)
  • alg5: comm (Thanks @Stephen Kitt)
  • alg6: zsh (Thanks @Llua)
  • alg7: comm but on already sorted files (thanks @Stephen Kitt)

Note from zsh manual:

${name:|arrayname}
If arrayname is the name (N.B., not contents) of an array variable, then any elements contained in arrayname are removed from the substitution of name.

Testing

As there are several ways to solve this, we need a general framework to test (in fairness) the answers.

Some guidelines (change them if you find them unfair):

  • Measure enough repetitions to have a reasonably precision.
  • Measure inside the shell used (avoid loading/unloading of the shell).
  • Avoid output overhead by either not printing or redirecting to /dev/null.

Testing code:

#!/bin/bash
alg1(){
         arr=( "${list1[@]}" )
         for i in "${list2[@]}"; do
             for j in "${!arr[@]}"; do
         if [[ "$i" == "${arr[j]}" ]]; then
             unset arr["$j"]
             break
         fi
             done
     done
     printf '%s ' "${arr[@]}"; echo
}

alg2(){
         arr=($({ printf '%s\n' "${list1[@]}" "${list2[@]}"; } | sort | uniq -u))
         printf '%s ' "${arr[@]}"; echo
}

alg3(){
         a=" $(printf '%s ' ${list1[@]})" # Watch the space at the start!!.
         for i in "${list2[@]}"; do
         a=${a/ "$i" / };
     done
     printf '%s\n' "$a"
}

alg4(){  join -v 1 list1.txt list2.txt ; }

alg5(){  #listr=$(
                    comm -3 <(printf "%s\n" "${list1[@]}" | sort) \
                            <(printf "%s\n" "${list2[@]}" | sort) |
                sort -n
     #)
      }

alg6(){ zsh -c '  alg6(){
                           list1=( $(cat list1.txt) )
                           list2=( $(cat list2.txt) )
                           listr=("${(@)list1:|list2}")
                           typeset -p listr
                        }
                  TIMEFMT="%E %U %S"
                  time ( for ((k=0;k<'"$1"';k++)); do alg6; done; )
                '
      }
#: <<-\_comment_
alg7(){ comm -23 list1.txt list2.txt; }

try(){ for ((k=0;k<$1;k++)); do "$2"; done; }

#list1=( 1 2 3 4 5 6 7 8 9 10 11 12 )
#list2=( 1 2 3   5   7 8 9    11 12 )

#list1=( a a b b b c     d d   )
#list2=(     b b   c c c d d e )

size=1000000
list1=( "0" $(seq 1 "$size") )
list2=( "${list1[@]}" ); unset "list2[123]" "list2[234]" "list2[345]"

printf '%s\n' "${list1[@]}" | sort >list1.txt
printf '%s\n' "${list2[@]}" | sort >list2.txt

repeats=${1:-10}; shift
out=${1:-no}    ; shift
(($#==0)) && set -- alg{1..7}

TIMEFORMAT='%R %U %S'
for   i
do    printf '%s ' "$i"
      if [[ $out == no ]]; then
      [[ $i != alg6 ]] &&
          time try "$repeats" "$i" >/dev/null ||
          time alg6 "$repeats" >/dev/null
      else
      [[ $i != alg6 ]] &&
          time try "$repeats" "$i"            ||
          time alg6 "$repeats"
      fi
done

Results:

Short list (as presented in the code):

$ ./script
alg1 2.056 0.806 1.237
alg2 3.478 3.126 1.756
alg3 0.999 0.728 0.304
alg4 1.186 0.780 0.434
alg5 5.234 1.926 1.722
alg6 2.71s 1.64s 1.26s
     2.758 1.637 1.265
alg7 1.156 0.799 0.422

The times for alg6 are as reported by zsh and after as reported by bash.
Also, the execution time of zsh is really smaller (0.050) if the reading of files is removed from the function to outside.

Longer List

Testing a list of only 500 values (10 repeats) reveals that alg1 is very inefficient. Removing it from further testing:

alg1 4.149 3.471 0.657
alg2 0.145 0.055 0.063
alg3 0.219 0.180 0.009
alg4 0.039 0.015 0.003
alg5 0.149 0.018 0.027
alg6 0.06s 0.02s 0.02s
     0.087 0.030 0.018
alg7 0.033 0.008 0.008

Testing 5k values (10 repeats) reveals that alg3 is also inefficient:

alg2 0.590 0.526 0.187
alg3 12.957 12.888 0.044
alg4 0.098 0.047 0.008
alg5 0.654 0.028 0.036
alg6 0.16s 0.12s 0.04s
     0.211 0.118 0.044
alg7 0.038 0.022 0.014

Testing 50k values (10 repeats):

alg2 6.487 5.838 1.611
alg4 0.488 0.469 0.019
alg5 5.073 0.250 0.056
alg6 1.42s 1.20s 0.21s
     1.467 1.206 0.216
alg7 0.271 0.247 0.014

For 500k (10 repeats)

alg4 5.471 5.269 0.156
alg6 15.14s 13.33s 1.91s
     15.215 13.335 1.926
alg7 2.833 2.655 0.138

For 1M (10 repeats)

alg4 11.127 10.804 0.251
alg7 5.772 5.525 0.230

Tags:

Shell

Bash

Ksh

Zsh