Script optimisation to find duplicate filenames in a huge CSV

The main issue with your code is that you're collecting all the pathnames in a variable and then looping over them, calling basename once per pathname. Starting a separate basename process for every pathname is what makes it slow.

The loop also runs over the unquoted variable expansion ${PATHLIST}, which is unwise if the pathnames contain spaces or shell globbing characters. In bash (or another shell that supports arrays), one would use an array instead.
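To see why the unquoted expansion is fragile, here is a minimal sketch. The helper count_args and the sample pathname are made up for the illustration; it assumes the default IFS.

```shell
#!/bin/sh
# Hypothetical helper: print how many arguments it received.
count_args () {
    echo "$#"
}

# Made-up pathname containing a space.
PATHLIST='/data/my dir/a.json'

count_args $PATHLIST      # unquoted: split on the space into two words
count_args "$PATHLIST"    # quoted: stays a single word
```

In bash, storing the pathnames in an array and iterating over "${paths[@]}" keeps each pathname intact in the same way the quoted expansion does here.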

Suggestion:

$ sed -e '1d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d
quad_list_14.json

The first sed picks out the pathnames (and discards the header line). This might also be written as awk -F';' 'NR > 1 { print $2 }' file.csv, or as tail -n +2 file.csv | cut -d ';' -f 2.

The sort -u gives us unique pathnames, so that a file which is merely listed several times is not reported as a duplicate, and the following sed gives us the basenames. The final sort with uniq -d tells us which basenames are duplicated.

The last sed 's#.*/##', which gives you the basenames, is reminiscent of the parameter expansion ${pathname##*/}, which gives the same result as $( basename "$pathname" ) for any pathname that does not end in a slash. It just deletes everything up to and including the last / in the string.
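As a small illustration of that expansion, the following sketch strips the directory part without starting any external process. The function name base_of and the pathname are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: print the last component of "$1"
# using only parameter expansion (no external basename call).
base_of () {
    printf '%s\n' "${1##*/}"
}

# Made-up pathname; the space is harmless because "$1" is quoted.
base_of '/data/run 1/quad_list_14.json'
```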

The main difference from your code is that instead of the loop that calls basename multiple times, a single sed is used to produce the basenames from a list of pathnames.
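The pipeline can be tried on a few lines of made-up data. This is only a sketch: the column names, the sample rows and the wrapper function dup_basenames are assumptions, and the only thing taken from the text is that the file has a header line and the pathname in the second ;-delimited field.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline: list basenames that occur
# under more than one distinct pathname in the CSV file "$1".
dup_basenames () {
    sed -e '1d' -e 's/^[^;]*;//' -e 's/;.*//' "$1" |
        sort -u |
        sed 's#.*/##' |
        sort |
        uniq -d
}

# Made-up sample data; the real file's columns may differ.
sample=$(mktemp)
cat >"$sample" <<'EOF'
time;path;event;pid
1;/data/a/quad_list_14.json;IN_OPEN;100
2;/data/a/quad_list_14.json;IN_OPEN;101
3;/data/b/quad_list_14.json;IN_OPEN;102
4;/data/b/other.json;IN_OPEN;103
EOF

dup_basenames "$sample"
rm -f "$sample"
```

With this sample, /data/a/quad_list_14.json appears twice but is collapsed by sort -u, so quad_list_14.json is reported only because it also exists under /data/b.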


Alternative for only looking at IN_OPEN entries:

sed -e '/;IN_OPEN;/!d' -e 's/^[^;]*;//' -e 's/;.*//' file.csv | sort -u | sed 's#.*/##' | sort | uniq -d

The following AWK script should do the trick, without using too much memory:

#!/usr/bin/awk -f

BEGIN {
    FS = ";"
}

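# Field 2 holds the pathname. Lines whose second field contains
# no "/" (such as a header line) are skipped.
{
    idx = match($2, "/[^/]+$")    # position of the last "/"
    if (idx > 0) {
        path = substr($2, 1, idx)     # directory part, including the final "/"
        name = substr($2, idx + 1)    # basename
        # Print a name once, the first time it turns up under a different directory.
        if (paths[name] && paths[name] != path && !output[name]) {
            print name
            output[name] = 1
        }
        paths[name] = path
    }
}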

It extracts the path and name of each file, and stores the last path it has seen for every name. If it has previously seen a different path for that name, it outputs the name, unless it has already done so.
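For a quick check of that logic, the same AWK program can be run inline on a few made-up lines. The wrapper dup_names_awk, the column names and the sample rows are assumptions for the sketch; -F';' stands in for the BEGIN block.

```shell
#!/bin/sh
# Hypothetical wrapper: run the AWK logic on the CSV file "$1".
dup_names_awk () {
    awk -F';' '
        {
            idx = match($2, "/[^/]+$")
            if (idx > 0) {
                path = substr($2, 1, idx)
                name = substr($2, idx + 1)
                if (paths[name] && paths[name] != path && !output[name]) {
                    print name
                    output[name] = 1
                }
                paths[name] = path
            }
        }
    ' "$1"
}

# Made-up sample data; the real file may have different columns.
sample=$(mktemp)
cat >"$sample" <<'EOF'
time;path;event;pid
1;/data/a/quad_list_14.json;IN_OPEN;100
2;/data/a/quad_list_14.json;IN_OPEN;101
3;/data/b/quad_list_14.json;IN_OPEN;102
4;/data/b/other.json;IN_OPEN;103
EOF

dup_names_awk "$sample"
rm -f "$sample"
```

Unlike the sort-based pipeline, this needs no sorting and prints each name at the point where the second directory is first seen. Its memory use grows with the number of distinct basenames rather than with the number of lines.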