If your head is spinning trying to flow the intricacies of polyphase merging the final touch to the performance is a simple and pleasing idea.
Obviously the longer the initial runs in the data are the fewer the merge operations needed to sort the file. It is possible to use merge operations to sort data without any pre-sorting and these pure merge sorts offer a surprisingly good performance.
If you merge data that is in random order the average run length is 2. If you make use of a sort routine using enough memory to hold N records then the run length can be increased to N with a subsequent reduction in the number of merges need to sort the whole file.
Is there any way of using the memory to increase the average run lengths beyond N?
The answer is yes.
If you use an incremental sort procedure, i.e. one that can add and remove data items while maintain the sorted order, such as heap sort. In this case you can initially read in and sort N records into a heap and write out the largest record to start the run reading in a new record to replace it.
As the sort procedure is incremental the new record can be placed in its correct sorted position in the heap and the largest record written out again. In this way the run length can be extended beyond N and it only fails when the new record that is read in is larger than the first record that was written out to start the run. When this happens you have no choice but to write out all N records and start a new run by reading in N more records.
If you do the statistics it turns out that using this method the average length of the run is 2N and of course never less than N.
Niklaus Wirth describes a procedure that works in exactly this way that using 6 files and enough RAM to store only 100 records it is possible to sort a file with 165,680,100 initial accidental runs in only 20 passes!
Sorting is a strange subject.
Ok I admit it that merge sorting's best days were when computers kept you warm, had lots of flashing lights and cool tape drives.
This is what a computer should look like - and now you know what the tape drives are doing for most of the time....
The point is however that merge sorting isn't a relic - Java for one uses it as a standard way of sorting collections and many programmers are puzzled as to why this is - why isn't QuickSort in use?
As mentioned in the introduction the answer is that QuickSort has a worst case running time of O(n2) but merge sort doesn't have this worst case performance and so offers O(nlog2n) in the worse case.
It is a method of sorting that uses only sequential access and there are lots of situations when this is the case. If you have data coming in over a live link then sorting it using merge sort like methods may be a cost effective way - building up runs by sending it to a set of files and then merging the files. Similarly with today's huge datasets - big data - you might not be able to store the whole thing in memory and using some merge sort gain becomes a possibility.
So even when sequential access isn't an issue simple merge sorting has advantages.
One completely new application is in parallel processing. Hadoop makes it relatively easy to split up a calculation among many computers and get back a single result. The splitting up is called a map operation where each computer gets to compute its part of the result. Getting the final answer is by way of a reduce operation when each of the machine result is merged to a single answer. A merge sort can be implemented in parallel by allowing each machine to sort a portion of the data small enough to fit into memory and then the final result is obtained by merging. Of course you might not have enough machines to get the job done in one merge so balanced and polyphase merge becomes another useful algorithm.
There are even good reasons to use modified merge sorts to make best use of any caches that are available. In this case the size of each run is arranged to just fit into the cache.
Exactly how does computer memory work? What is surprising is that it still works in more or less the same way as when Babbage designed his Analytical Engine or the IBM 360 accessed core memory. [ ... ]
What makes the difference between a truly awful and a truly great programmer is the way that they use data. The code matters but it is almost an after thought once you have decided on the data structu [ ... ]