Efficient Out-Of-Core Sorting

It's funny, as I heard this very question not a month ago... along with the response our local guru gave.

"Use the unix sort command"

Though we admittedly thought it was a joke at the expense of the asker... it turns out that it was not. The reasoning is that those smart guys have already put a lot of thought into how to handle very large files, and came up with a very impressive implementation that makes good use of the available resources.

Therefore, unless you plan on re-inventing the wheel (i.e., you have the time and this is business critical), simply using unix sort is probably an excellent idea.

The only drawback is its arcane syntax. This page is dedicated to the command and various explanations.

My personal advice: take a small sample of the data and test that the command does exactly what you want.
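As an illustration, here is a minimal sketch of that advice in Python, driving GNU coreutils sort through subprocess. The file names, the comma delimiter, and the choice of the second field as a numeric key are hypothetical; substitute whatever your data actually looks like.

```python
import subprocess

# Hypothetical input: a comma-separated sample file we want sorted
# numerically on its second field.
cmd = [
    "sort",
    "-t", ",",           # field separator
    "-k", "2,2n",        # sort key: second field only, numeric order
    "-S", "1G",          # in-memory buffer size before spilling to disk
    "-T", "/tmp",        # directory for temporary run files
    "-o", "sample_sorted.csv",
    "sample.csv",
]
subprocess.run(cmd, check=True)

# Sanity check on the small sample: the chosen key should be non-decreasing.
with open("sample_sorted.csv") as f:
    keys = [float(line.split(",")[1]) for line in f]
assert keys == sorted(keys)
```

The `-S` and `-T` flags are the ones that matter for out-of-core work: sort fills its buffer, spills sorted runs into the temporary directory, and merges them at the end.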


The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.

One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of runs you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you put it into a heap. Then, as you write each record out, you read in another record. (Knuth calls this technique replacement selection.)

You then check whether that record would sort before or after the record you just wrote out. If it would sort after, you insert it into your heap and continue. If it would sort before, you insert it into a second heap.

You stop adding records to the current run when the first heap is completely empty and the second heap is taking up all your memory. At that point, you repeat the process starting from the second heap, writing a new run to a new file.
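Here is a minimal sketch of that run-generation loop in Python, using `heapq`. It assumes records are plain comparable values; `memory_slots` stands in for "how many records fit in memory", and a real implementation would write each run to disk instead of yielding it as an in-memory list.

```python
import heapq

def replacement_selection_runs(records, memory_slots):
    """Split a stream of comparable records into sorted runs using
    replacement selection (each run is yielded as a list for simplicity)."""
    it = iter(records)
    current = []                        # heap feeding the run being written
    for _ in range(memory_slots):       # initial fill: as much as fits in memory
        try:
            heapq.heappush(current, next(it))
        except StopIteration:
            break

    pending = []                        # records deferred to the next run
    run = []
    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)            # "write the record out"
        try:
            nxt = next(it)              # "read in another record"
        except StopIteration:
            nxt = None
        if nxt is not None:
            if nxt >= smallest:         # still fits in the current run
                heapq.heappush(current, nxt)
            else:                       # would sort before what was just written
                heapq.heappush(pending, nxt)
        if not current:                 # first heap is empty: close the run
            yield run
            run = []
            current, pending = pending, []   # second heap seeds the next run
```

For example, `list(replacement_selection_runs([5, 1, 9, 2, 8, 3, 7, 0, 6, 4], 3))` yields two sorted runs of five records each, even though only three records were ever held in "memory" at once.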

This will usually produce considerably longer intermediate runs in the initial phase, so merging them is substantially less work. Assuming the input records are in random order, you can expect this to approximately double the length of each run--but if the input is even partially sorted, this can take advantage of that existing ordering to extend the run lengths even more.
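To show where that saving lands, here is a hedged sketch of the merge phase, again in Python. It assumes each run has been written to a text file with one record per line (so records compare lexicographically as strings), and that there are few enough runs to keep one open file handle per run.

```python
import heapq
from contextlib import ExitStack

def merge_runs(run_paths, out_path):
    """k-way merge of sorted run files (one record per line) into a
    single sorted output file."""
    with ExitStack() as stack:
        runs = [stack.enter_context(open(p)) for p in run_paths]
        out = stack.enter_context(open(out_path, "w"))
        # heapq.merge streams the k inputs lazily, always emitting the
        # smallest available line next, so memory stays bounded by k records.
        out.writelines(heapq.merge(*runs))
```

Fewer, longer runs mean fewer inputs to this merge (or fewer merge passes if you have to merge in stages), which is where the reduction in work comes from.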

As an aside, I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits first publication of the method to "H. Seward", in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I don't have a copy of the third edition, so I don't have a page number for that.