What is the most efficient way to compute the difference of lines from two files?

To extend the comment of @L3viathan If order of element is not important set is the rigth way. here a dummy example you can adapt:

l1 = [0,1,2,3,4,5]
l2 = [3,4,5]
setL1 = set(l1)  # transform the list into a set
setL2 = set(l2)
setDiff = setl1 - setl2  # make the difference 
listeDiff = list(setDiff)  # if you want to have your element back in a list

as you see is pretty straightforward in python.


Try using sets:

with open("list_a.txt") as f:
    set_a = set(f)

with open("list_b.txt") as f:
    set_b = set(f)

set_c = set_a - set_b

with open("list_c.txt","w") as f:
    for c in set_c:
        f.write(c)

The complexity of subtracting two sets is O(n) in the size of the set a.


you can create one set of the first file contents, then just use difference or symmetric_difference depending on what you call a difference

with open("list_a.txt") as f:
    set_a = set(f)

with open("list_b.txt") as f:
    diffs = set_a.difference(f)

if list_b.txt contains more items than list_a.txt you want to swap them or use set_a.symmetric_difference(f) instead, depending on what you need.

difference(f) works but still has to construct a new set internally. Not a great performance gain (see set issubset performance difference depending on the argument type), but it's shorter.


In case order matters you can presort the lists together with item indices and then iterate over them together:

list_2 = sorted(list_2)
diff_idx = []
j = 0
for i, x in sorted(enumerate(list_1), key=lambda x: x[1]):
    if x != list_2[j]:
        diff_idx.append(i)
    else:
        j += 1
diff = [list_1[i] for i in sorted(diff_idx)]

This has time complexity of the sorting algorithm, i.e. O(n*log n).