Sorting 10 GB of data in 1 GB of memory. How will I do it?

Please see this link. This guy has explained it beautifully.

An example of a disk-based application: the external mergesort algorithm (Wikipedia)
A merge sort divides the unsorted list into n sublists, each containing 1 element, and then repeatedly merges sublists to produce new sorted sublists until there is only 1 sublist remaining.
The external mergesort algorithm sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1. Read 100 MB of the data in main memory and sort by some conventional sorting method, like quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.
4. Read the first 10 MB of each sorted chunk (of 100 MB) into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
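
To make the two phases concrete, here is a minimal Python sketch of the same idea: sort chunks that fit in memory into temporary run files, then stream a k-way merge over them. It assumes the input is a text file with one integer per line; the chunk_lines value stands in for the 100 MB chunks and should be tuned so each chunk fits comfortably in RAM.

```python
import heapq
import os
import tempfile

def external_sort(input_path, output_path, chunk_lines=1_000_000):
    run_paths = []

    # Phase 1: read chunks that fit in memory, sort each one, write sorted runs.
    with open(input_path) as src:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_lines), src)]
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{v}\n" for v in chunk)
            run_paths.append(run_path)

    # Phase 2: k-way merge. heapq.merge is lazy, so each run is streamed
    # sequentially and never loaded completely into memory.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs, key=int))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```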

We use merge sort: the data is first divided into sorted chunks, then the chunks are merged.

  1. Divide the data into 10 groups, each of size 1 GB.
  2. Sort each group and write it to disk.
  3. Load the first 10 items from each sorted group into main memory.
  4. Output the smallest item in main memory to disk, then load the next item from the group whose item was chosen.
  5. Repeat step 4 until all items have been output (see the sketch after this list).
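
A minimal sketch of that merge loop (steps 3-5), assuming each sorted group has been written to its own file with one integer per line. For simplicity it keeps one item per group in a heap and relies on the file objects' read buffers instead of explicit 10-item buffers.

```python
import heapq

def merge_sorted_groups(group_files, out_file):
    # Prime the heap with the first item of each group.
    heap = []
    for idx, f in enumerate(group_files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (int(line), idx))

    while heap:
        value, idx = heapq.heappop(heap)      # smallest item currently in memory
        out_file.write(f"{value}\n")          # step 4: output it
        line = group_files[idx].readline()    # refill from the same group
        if line:
            heapq.heappush(heap, (int(line), idx))
```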

Split the file into parts (buffers) that you can sort in place.

Then, when all buffers are sorted, take two (or more) at a time and merge them (as in merge sort) until only one buffer remains, which will be the sorted file.
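
A sketch of one such two-way merge, assuming the runs are text files of one integer per line, each ending with a newline. Repeated pairwise merging takes roughly log2(number of runs) passes over the data; the 10-way merge described in the next answer finishes in a single pass.

```python
def merge_two_runs(path_a, path_b, out_path):
    # Merge two sorted runs into a single sorted file, reading both inputs
    # sequentially so neither has to fit in memory.
    with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
        la, lb = a.readline(), b.readline()
        while la and lb:
            if int(la) <= int(lb):
                out.write(la)
                la = a.readline()
            else:
                out.write(lb)
                lb = b.readline()
        # One run is exhausted; copy the remainder of the other.
        remaining, tail = (a, la) if la else (b, lb)
        while tail:
            out.write(tail)
            tail = remaining.readline()
```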


For sorting 10 GB of data using only 1 GB of RAM:

  1. Read 1 GB of the data into main memory and sort it with a conventional in-memory algorithm such as quicksort.
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all of the data is in sorted 1 GB chunks (there are 10 GB / 1 GB = 10 chunks), which now need to be merged into one single output file.
  4. Read the first 90 MB of each sorted chunk (of 1 GB) into input buffers in main memory and allocate the remaining 100 MB for an output buffer. (For better performance, we can make the output buffer larger and the input buffers slightly smaller.)
  5. Perform a 10-way merge and store the result in the output buffer.
  6. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 90 MB input buffers empties, refill it with the next 90 MB of its associated 1 GB sorted chunk until no more data from that chunk is available.

This is the external merge sort approach: the full data set stays on disk, and only fixed-size buffers of it live in memory at any time.
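
A sketch of the merge phase (steps 4-6), assuming the 10 sorted chunks are text files of one integer per line. The buffering argument of open() stands in for the 90 MB input buffers and the 100 MB output buffer described above; the exact sizes are illustrative, not prescriptive.

```python
import heapq

INPUT_BUF = 90 * 1024 * 1024     # ~90 MB read buffer per input chunk (illustrative)
OUTPUT_BUF = 100 * 1024 * 1024   # ~100 MB write buffer for the merged output (illustrative)

def merge_chunks(chunk_paths, output_path):
    # 10-way merge: each chunk is read sequentially, buffer by buffer,
    # so no chunk is ever loaded completely into memory.
    chunks = [open(p, buffering=INPUT_BUF) for p in chunk_paths]
    try:
        with open(output_path, "w", buffering=OUTPUT_BUF) as out:
            out.writelines(heapq.merge(*chunks, key=int))
    finally:
        for f in chunks:
            f.close()
```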

Tags:

Algorithm