Universal Hashing |
Written by Mike James | ||||||
Friday, 27 January 2023 | ||||||
Page 2 of 2
ParityThis looks like a very easy and very efficient way of creating an m bit hash needing only m ands and m shifts. The problem is that there is still the parity function to write. As this needs to examine each bit to effectively count the number of ones in the data this seems to need m operations. In fact the job can be done in a number of operations that equals the number of bits set. This algorithm is attributed Brian Kernigan but its based on a technique well worth knowing. How do you zero the least significant set bit? At first it seems impossibly difficult to do without using shifts and tests to find the first bit set. However if you start out with a value and subtract 1 then this always zeros the least significant bit that was set - think about it... For example:
or
and so on. At this point you might think that no progress has been made because while you have zeroed the least significant set bit you have set other lower order bits. Of course the trick is that the original data already had these bits zeroed. So if you And the new value with the original then the result will have all of those bits zeroed in addition to the least significant set bit. Thus to zero the least significant set bit in x all you have to do is:
If you keep doing this then eventually you will zero all of the bits and x will be zero. Now you can put this to use in a parity function that only iterates the number of set bits times:
Once again this can be written more compactly as:
Second MethodWith a fast parity function the method described above can return a hash in m*b operations where b is the number of bits set in the intermediate result - say m^2/2 on average. The second method can be more efficient but it makes use of the columns in the matrices rather than the rows. So in this case we need to generate m bit values to represent the columns - but why bother? It costs the same to generate to generate 32 bit positive values. We can simply truncate the result at the end to m bits and get the same result as if we had truncated the columns to m bits before the calculation. To avoid interacting with the previous code we can change the constructor to create a second random matrix in column order:
Now the elements of the array, or at least the first m bits are regarded as the columns of the matrix. Now to work out the hash we select the columns that correspond to the bits set in the data and exclusive or them together:
At the end we truncate the result to m bits by right shifting the result down to leave only m bits. With this in place you can now try out the alternative method of calculation.
Notice that you get a different result because they are using two different matrices. Can these routines be improved? The answer is yes but how exactly depends on the language and architecture you are working with. The most obvious improvement is to unroll any fixed sized loop to eliminate the loop structure. A more complex optimization would be to use a very long binary word representation or to use a GPU to parallelize the algorithms. Notice that either algorithm lends itself to being implemented in hardware quite easily. If you need a universal family of hash functions to try out another algorithm then either of the two methods works well. Further reading
You can download the code for both the Windows Forms version and the WPF version in a single ZIP file from the CodeBin (note you have to register first). Related ArticlesHashing solves one of the basic problems of computing - finding something that you have stored somewhere. Extensible hashing and perfect hashing are ideas that are worth exploring. A look at a way of creating a set of hash functions. To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.
. <ASIN:0071460519> <ASIN:3540431012> |
||||||
Last Updated ( Friday, 27 January 2023 ) |