Magic of Merging
Written by Mike James
Thursday, 20 August 2020
Balanced merging

The splitting process is a bit wasteful because it moves data around the disk or tape without doing anything to improve the order of the file. A better method is to split the runs as they are being produced by the merge process. All this involves is using n input files and n output files. Each time you come to the end of the current run on all of the input files you simply switch to a different output file. As each set of runs on the input files is merged into a single run, each output file receives a run in turn, building up a roughly equal number of runs on each file ready for the next phase of the merge. There are a few fine details to clear up, but this is the essence of the balanced merge procedure and it works very well.

If you start off with a total of r runs equally distributed across the n input files then the first merge pass reduces the number of runs to r/n. The next pass cuts this down to r/n² and after k passes there are r/nᵏ runs, so a single run is reached after logₙ(r) passes. As each pass copies all N items, the balanced merge takes N·logₙ(r) operations - at worst, starting from runs of a single item, N·logₙ(N) - which is very acceptable.
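To see the mechanics in code, here is a minimal sketch of a balanced merge in Python - the function name balanced_merge is my own invention, and each "file" is simulated as a simple list of runs rather than a real tape or disk file:

import heapq
from itertools import zip_longest

def balanced_merge(files):
    """Balanced merge: each 'file' is a list of runs (sorted lists).
    Every pass merges the i-th run from each input file into one
    longer run and deals the merged runs out to the output files in
    rotation, cutting the number of runs by a factor of n per pass."""
    n = len(files)
    while sum(len(f) for f in files) > 1:
        outputs = [[] for _ in range(n)]
        for i, group in enumerate(zip_longest(*files)):
            # Some files may hold one run fewer than the others,
            # in which case zip_longest pads with None.
            run = list(heapq.merge(*(r for r in group if r is not None)))
            outputs[i % n].append(run)
        files = outputs
    return next(f for f in files if f)[0]

# Four initial runs dealt across two 'files' are sorted in two passes:
print(balanced_merge([[[1, 4, 9], [3, 5, 8]], [[2, 6], [0, 7]]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]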
Polyphase merge

If you find balanced merging easy to understand then you might already be wondering if it is the most efficient method, and the answer is no, because there are two improvements that can be made to it. The first is to make better use of the files that you have open. In a balanced merge only half of the files are being used to sort the data; the other half are just receiving the results ready for the next pass. A better method is to use n input files and one output file. The data on the n input files is merged onto the one output file until one of the input files is empty. At that point the empty file is closed, the output file becomes one of the input files and a new output file is opened.

This uses fewer files, but how many passes does it need? In a perfect case exactly one file would become empty at the end of each pass - if more than one did then subsequent passes would be merging with fewer than n files and so wouldn't be as efficient. This means that the distribution of runs on the final pass has to be such that there is exactly one run on each input file. For example, if you are using three files the final distribution of runs would be

File 1   File 2   File 3
  1        1        0
because the next pass will merge the single runs on files 1 and 2 into a single run on file 3, i.e. giving a sorted file. Given that this is the distribution of runs needed at the pass just before the merge is complete, we can work back to discover what the distribution was before this. We know that each pass reduces the file with the smallest number of runs to zero, because this is what stops the pass and starts the next one. As one run from each file is merged to form one bigger run on the output file, it doesn't take much working out to see that each pass reduces the number of runs on each input file by the number of runs on the file with the smallest count, i.e. the one that stops the pass. For example, if the number of runs is

File 1   File 2   File 3
  2        0        1
then after a merge pass the number of runs will be

File 1   File 2   File 3
  1        1        0
i.e. the ideal finishing distribution. The rule is that to find the new distribution of runs after each pass, subtract the minimum number of runs from each of the input files and add it to the current output file. In the example above file 3 had the smallest number of runs, and subtracting this from file 1 and adding it to file 2 gives the result shown, i.e. 1 1 0. Turning this rule around says that to go from a run distribution to the one before it, find the largest number of runs, set it to 0 and add it to each of the remaining files. For example, starting from the run distribution 1 1 0, the largest number of runs is 1, so zeroing the run count on file 2 and adding 1 to the others gives 2 0 1. The pass before that the run distribution must have been 0 2 3, before that it must have been 3 5 0, and so on. Writing these numbers out in a table gives you a better idea of the pattern:

File 1   File 2   File 3
  1        1        0
  2        0        1
  0        2        3
  3        5        0
  8        0        5
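The backward rule is easy to mechanise. The following Python function is a sketch of my own - the name perfect_distributions is invented, and ties are broken toward the higher-numbered file purely so that the output reproduces the table above:

def perfect_distributions(n_files, levels):
    """Build the table of perfect polyphase run distributions by
    working backwards from the finished state: one run on each
    input file and none on the output file."""
    dist = [1] * (n_files - 1) + [0]
    table = [dist[:]]
    for _ in range(levels):
        # Reverse rule: zero the file with the most runs and add
        # that count to every other file. Ties go to the
        # higher-numbered file, matching the table in the text.
        big = max(range(n_files), key=lambda i: (dist[i], i))
        m = dist[big]
        dist = [runs + m for runs in dist]
        dist[big] = 0
        table.append(dist[:])
    return table

for row in perfect_distributions(3, 4):
    print(row)
# [1, 1, 0], [2, 0, 1], [0, 2, 3], [3, 5, 0], [8, 0, 5]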
You can extend this table back as far as you like. But you might at this stage be wondering what the fascination is with the way the distribution of runs changes after each pass. Well, suppose you have a data file that you can split into exactly 8 sorted chunks - how should you split the chunks up for a three-file polyphase merge? The answer is that if you write three chunks to file 1 and five chunks to file 2 then the polyphase merge will give you a fully sorted file after only four passes. In other words, the table tells you exactly how to distribute any number of runs to make the polyphase merge work out perfectly. You can construct a table like this for any number of files and the rule is the same. For example, for four files the perfect run distribution is:

File 1   File 2   File 3   File 4
  1        1        1        0
  2        2        0        1
  4        0        2        3
  0        4        6        7
  7       11       13        0
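The same sketch function reproduces the four-file table, which shows that the rule really is the same for any number of files:

for row in perfect_distributions(4, 4):
    print(row)
# [1, 1, 1, 0], [2, 2, 0, 1], [4, 0, 2, 3], [0, 4, 6, 7], [7, 11, 13, 0]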
These numbers are actually a form of generalised Fibonacci sequence, but this is mostly of theoretical interest because in a program they would be built up in exactly the same way as the table is. For example, if you decide to use a four-file polyphase merge then you would start out by storing one sorted chunk of the file on each of the files except one. If all of the data has been used up at that point then the distribution of runs is complete and you can sort the data in a single pass. If there is still data then you have to add runs to the files to move to the next level of the table, and so on up the table until all of the data is used up. After that you begin the merge process that ends in a sorted file.

There are a few complications to take into account - for example, what do you do if there isn't enough data to complete the distribution of runs at the final level of the table? These are messy considerations involving sharing out the shortage of runs as evenly as possible and using dummy runs to make up the difference. The details are very messy and don't add much to your understanding of the overall idea. Nothing terrible happens if you get it wrong - you just end up with more than one empty file at some stage in the proceedings, reducing the number of merge files actually operating.
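Purely as an illustration of the process just described, here is a sketch of the merge phase in Python - again the "files" are simulated as lists of runs, polyphase_merge is an invented name, and a perfect starting distribution is assumed (in practice you would pad with empty dummy runs first):

import heapq

def polyphase_merge(files):
    """Polyphase merge. files is a list of n lists of runs (sorted
    lists); exactly one list starts empty and acts as the output.
    Assumes a perfect run distribution - pad with empty dummy runs
    if the real data falls short of a level in the table."""
    n = len(files)
    out = next(i for i, f in enumerate(files) if not f)
    while sum(len(f) for f in files) > 1:
        inputs = [i for i in range(n) if i != out]
        # One pass: merge a run from every input file onto the output
        # file until the input with the fewest runs is exhausted.
        while all(files[i] for i in inputs):
            run = list(heapq.merge(*(files[i].pop(0) for i in inputs)))
            files[out].append(run)
        # The file that just emptied becomes the new output file.
        out = next(i for i in inputs if not files[i])
    return next(f for f in files if f)[0]

# Eight runs dealt out as 3 and 5, as in the three-file table above,
# give a sorted file after exactly four passes:
files = [[[4], [12], [20]], [[1], [7], [9], [15], [18]], []]
print(polyphase_merge(files))  # [1, 4, 7, 9, 12, 15, 18, 20]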