Magic of Merging
Written by Mike James   
Thursday, 20 August 2020

The final touch

If your head is spinning trying to follow the intricacies of polyphase merging, the final touch to its performance is a simple and pleasing idea.

Obviously, the longer the initial runs in the data, the fewer the merge operations needed to sort the file. It is possible to use merge operations to sort data without any pre-sorting, and these pure merge sorts offer surprisingly good performance.

If you merge data that is in random order, the average run length is 2. If you make use of a sort routine with enough memory to hold N records, then the run length can be increased to N, with a corresponding reduction in the number of merges needed to sort the whole file.
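
As a concrete sketch, this is what the pre-sort stage might look like in Python - read the data in chunks of N records, sort each chunk in memory and emit it as a run. The tiny N and the plain lists standing in for files are just for illustration:

```python
N = 4  # number of records that fit in memory; tiny for demonstration

def make_runs(records, n=N):
    """Yield sorted runs of at most n records each."""
    run = []
    for rec in records:
        run.append(rec)
        if len(run) == n:          # memory full: sort and emit a run
            yield sorted(run)
            run = []
    if run:                        # leftover partial chunk
        yield sorted(run)

print(list(make_runs([9, 3, 7, 1, 8, 2, 6, 5])))
# [[1, 3, 7, 9], [2, 5, 6, 8]] - two runs of length N instead of many of length 2
```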

Is there any way of using the memory to increase the average run lengths beyond N?

The answer is yes.

You use an incremental sort procedure, i.e. one that can add and remove data items while maintaining the sorted order - a heap, for example. In this case you initially read in and sort N records into a heap, then write out the largest record to start the run, reading in a new record to replace it.

As the sort procedure is incremental, the new record can be placed in its correct sorted position in the heap and the largest record written out again. In this way the run length can be extended beyond N, and it only fails when a newly read record is larger than the record most recently written out. When this happens you have no choice but to write out all N records and start a new run by reading in N more records.

If you do the statistics, it turns out that using this method the average run length is 2N, and of course it is never less than N.
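
Here is a minimal sketch of the idea using Python's heapq module. It follows the classic replacement-selection formulation - a min-heap producing ascending runs, the mirror image of the largest-first description above - and it sets aside records that can't extend the current run for the next one. The function name and the tiny n are illustrative only:

```python
import heapq
from itertools import islice

def replacement_selection(records, n):
    """Yield ascending sorted runs while holding only n records in memory.

    Records that cannot extend the current run are saved for the next one.
    On random input the average run length tends to 2n.
    """
    it = iter(records)
    heap = list(islice(it, n))          # fill memory with the first n records
    heapq.heapify(heap)
    run, pending = [], []               # pending: records for the next run
    for rec in it:
        smallest = heapq.heappop(heap)
        run.append(smallest)            # write out the next record of the run
        if rec >= smallest:
            heapq.heappush(heap, rec)   # still fits into the current run
        else:
            pending.append(rec)         # too small - save it for the next run
        if not heap:                    # current run cannot be extended
            yield run
            run, heap, pending = [], pending, []
            heapq.heapify(heap)
    while heap:                         # flush whatever is left
        run.append(heapq.heappop(heap))
    if run:
        yield run
    if pending:
        yield sorted(pending)

print(list(replacement_selection([5, 1, 9, 2, 8, 3, 7, 4, 6], 3)))
# [[1, 2, 5, 8, 9], [3, 4, 6, 7]] - runs longer than n = 3
```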

Niklaus Wirth describes a procedure that works in exactly this way: using six files and enough RAM to store only 100 records, it is possible to sort a file with 165,680,100 initial runs in only 20 passes!

Sorting is a strange subject.

The Future

OK, I admit it - merge sorting's best days were when computers kept you warm and had lots of flashing lights and cool tape drives.

 


This is what a computer should look like - and now you know what the tape drives are doing for most of the time...

 

 

The point, however, is that merge sorting isn't a relic - Java, for one, uses a merge-sort variant as its standard way of sorting object collections, and many programmers are puzzled as to why. Why isn't QuickSort used instead?

As mentioned in the introduction, the answer is that QuickSort has a worst case running time of O(n²), whereas merge sort has no such pathological case and delivers O(n log₂ n) even in the worst case.
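
A minimal, textbook sketch of the in-memory version makes the guarantee visible - the array is always split exactly in half, so the recursion depth is log₂ n whatever the input order:

```python
def merge_sort(a):
    """Textbook top-down merge sort: the split is always exactly in half,
    so the work is O(n log2 n) regardless of the input order."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # <= keeps equal keys in order (stable)
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:] # one side is empty; append the rest

print(merge_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```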

 

It is also a method of sorting that needs only sequential access, and there are lots of situations where that is all you have. If you have data arriving over a live link, then sorting it with merge-sort-like methods may be cost effective - build up runs by writing them to a set of files and then merge the files. Similarly, with today's huge datasets - big data - you may not be able to store the whole thing in memory, and using some form of merge sort becomes a possibility.
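
The merge phase itself needs nothing more than a k-way merge that reads each run strictly sequentially. In Python, heapq.merge does exactly this; the in-memory lists below are stand-ins for run files or buffered streams:

```python
import heapq

# Three sorted runs - stand-ins for run files or buffered network streams.
run_a = [1, 4, 9]
run_b = [2, 3, 8]
run_c = [5, 6, 7]

# heapq.merge reads each input strictly sequentially and yields one sorted
# stream, which is exactly the access pattern tape-style merging relies on.
print(list(heapq.merge(run_a, run_b, run_c)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```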

So even when sequential access isn't an issue, simple merge sorting has advantages.

One completely new application is in parallel processing. Hadoop makes it relatively easy to split a calculation among many computers and get back a single result. The splitting up is called a map operation, where each computer computes its part of the result. The final answer is obtained by way of a reduce operation, in which each machine's result is merged into a single answer. A merge sort can be implemented in parallel by letting each machine sort a portion of the data small enough to fit into memory; the final result is then obtained by merging. Of course, you might not have enough machines to get the job done in one merge, so balanced and polyphase merging become useful once again.
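
A toy version of this map-reduce pattern, sketched with Python's multiprocessing in place of Hadoop - the chunking scheme and the worker count of four are arbitrary choices for illustration:

```python
import heapq
from multiprocessing import Pool

def map_sort(chunk):
    """The 'map' step: each worker sorts a portion that fits in its memory."""
    return sorted(chunk)

if __name__ == "__main__":
    data = [9, 2, 7, 4, 1, 8, 3, 6, 5, 0]
    chunks = [data[i::4] for i in range(4)]      # split among 4 workers
    with Pool(4) as pool:
        sorted_chunks = pool.map(map_sort, chunks)
    # The 'reduce' step: a single k-way merge of the sorted portions.
    print(list(heapq.merge(*sorted_chunks)))     # [0, 1, 2, ..., 9]
```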

There are even good reasons to use modified merge sorts to make the best use of any caches that are available. In this case the size of each run is arranged to just fit into the cache.

The days of the merge sort are far from over. 

Related Articles

Merge sort as folk dance       

Sorting Algorithms as Dances      

Sequential storage       

QuickSort Exposed       

Quick Median       

 
