Efficiently merge / sort / unique a large number of text files

A simple fix that works at least in Bash: since printf is a builtin, the command-line argument limits don't apply to it:

printf "%s\0" * | xargs -0 cat | sort -u > /tmp/bla.txt

(echo * | xargs would also work, except that it mishandles file names containing whitespace and other special characters.)
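A quick illustration of that caveat, using ls as a stand-in for cat so the breakage is visible (output is from a GNU userland; exact wording may differ):

$ touch 'foo bar' baz
$ echo * | xargs ls
ls: cannot access 'foo': No such file or directory
ls: cannot access 'bar': No such file or directory
baz
$ printf '%s\0' * | xargs -0 ls
'foo bar'   baz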


find . -maxdepth 1 -type f ! -name ".*" -exec cat {} + | sort -u -o /path/to/sorted.txt

This will concatenate all non-hidden regular files in the current directory and sort their combined contents (while removing duplicate lines) into the file /path/to/sorted.txt.
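If your sort is GNU sort, a variation of the same idea skips cat entirely and hands the NUL-delimited file list straight to sort (a sketch; assumes GNU find for -print0 and GNU sort for --files0-from):

find . -maxdepth 1 -type f ! -name ".*" -print0 | sort -u --files0-from=- -o /path/to/sorted.txt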


With GNU sort, and a shell where printf is built-in (all POSIX-like ones nowadays except some variants of pdksh):

printf '%s\0' * | sort -u --files0-from=- > output

Now, a problem with that is that the two components of the pipeline run concurrently and independently, so by the time the left one expands the * glob, the right one may already have created the output file. That could cause problems (maybe not with -u here), as output would then be both an input and an output file. You may therefore want to send the output to another directory (> ../output for instance, as shown below), or make sure the glob doesn't match the output file.
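For example, writing the result one directory up keeps it out of the glob's reach:

printf '%s\0' * | sort -u --files0-from=- > ../output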

Another way to address it in this instance is to write it:

printf '%s\0' * | sort -u --files0-from=- -o output

That way, it's sort that opens output for writing, and (in my tests) it doesn't do so until it has received the full list of files (so long after the glob has been expanded). It also avoids clobbering output if none of the input files are readable.

Another way to write it with zsh or bash:

sort -u --files0-from=<(printf '%s\0' *) -o output

That's using process substitution (where <(...) is replaced by a file path that refers to the reading end of the pipe printf is writing to). That feature comes from ksh, but ksh insists on making the expansion of <(...) a separate argument to the command, so you can't use it with the --option=<(...) syntax. It would work with this syntax though:

sort -u --files0-from <(printf '%s\0' *) -o output

Note that you'll see a difference from the approaches that feed sort the output of cat when some files don't end in a newline character (cat simply joins the last line of one file onto the first line of the next):

$ printf a > a
$ printf b > b
$ printf '%s\0' a b | sort -u --files0-from=-
a
b
$ printf '%s\0' a b | xargs -r0 cat | sort -u
ab

Also note that sort sorts using the locale's collation algorithm (strcoll()), and sort -u reports one of each set of lines that sort the same by that algorithm, not lines that are unique at byte level. If you only care about lines being unique at byte level and don't care so much about the order they're sorted in, you may want to fix the locale to C, where sorting is based on byte values (memcmp()); that would probably speed things up significantly:

printf '%s\0' * | LC_ALL=C sort -u --files0-from=- -o output
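To see the ordering difference, compare the default locale's collation with byte-value sorting (the first result assumes a typical UTF-8 locale such as en_US.UTF-8):

$ printf '%s\n' a B | sort
a
B
$ printf '%s\n' a B | LC_ALL=C sort
B
a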

Tags: shell, uniq, sort