Fastest `uniq` tool in Linux

Let's consider how each solution works.

  • `uniq`: This requires that the file already be sorted. If it isn't, you have to pipe it through sort first, which means sort has to read the entire file into memory, reorder it (O(n log n)), and then write it to the pipe. The work of uniq itself is very cheap, since it only has to compare adjacent lines of its input.

  • `sort -u`: This combines the work of sort | uniq. It has to collect all the unique inputs into memory like the awk script does, but it then also wastes time sorting them before producing the output. This is O(n log n), although here n is the number of unique items rather than the total number of inputs, so it's better than the pipeline above.

  • `sed`: I'm not sure why you listed this, as I can't think of a good way to do it with sed at all. Maybe if you sort the input first and pipe it to a sed script there's a way to compare adjacent lines, but then sed would just be doing what uniq does, and uniq probably does it about as efficiently as possible.

  • `awk`: This is likely the best option because it does only the minimal amount of work necessary. As it reads each line, it does an efficient hash lookup to see whether the line is already in its memory, storing only the unique lines as hash keys with a counter as the value. (If the line wasn't previously present, the condition is true, so the line gets printed; otherwise it isn't. The one-liner is sketched just after this list.) This takes O(n) time and O(u) memory, where u is the number of unique lines.
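
For reference, here is roughly what each invocation looks like. The file name input.txt is just a placeholder, and the awk command is the usual `!seen[$0]++` idiom, which matches the hash-and-counter behavior described above:

```sh
# sort | uniq: sort everything first, then drop adjacent duplicates
sort input.txt | uniq

# sort -u: same result, but the deduplication happens inside sort itself
sort -u input.txt

# awk: print each line only the first time it is seen; no sorting,
# so the original input order is preserved
awk '!seen[$0]++' input.txt
```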

Every method will use a considerable amount of memory, either for sorting the input or for keeping track of which inputs it has already seen so it can drop the duplicates.
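
If you want to see the difference on your own data, GNU time (the /usr/bin/time binary on most Linux systems, not the shell built-in) reports both elapsed time and peak memory for a command. Again, input.txt is just a placeholder:

```sh
# The -v flag prints "Elapsed (wall clock) time" and
# "Maximum resident set size" along with the other statistics.
/usr/bin/time -v sort -u input.txt > /dev/null

/usr/bin/time -v awk '!seen[$0]++' input.txt > /dev/null
```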