The first step towards perfect hashing is to make a slight change to the way that you think of a hashing function.

A function is something that returns a value that depends on an input value. A good hashing function should distribute the key values as evenly as possible though out the hash table. In most cases hash functions are constructed using bitwise logical operators and shift functions in an effort to jumble up the bits that represent the key as much as possible - but there is another way.

Instead of using operators to provide the random jumbling why not use a table?

Any function can be converted into a lookup table simply by storing its outputs indexed by the input values.

As an example consider creating hash function for a single ASCII character.

To construct a random hash function using a table all you have to do is set up a 256-element array initialised so that the first element contains 1, the second 2 and so on, i.e.:

The next step is to shuffle the array by moving values randomly around. There are many possible shuffling algorithms but the following is OK:

Random R = new Random(); for (int i = 0; i < 128; i++){ int j=R.Next(255); int temp=h[i]; h[i] = h[j]; h[j]=temp; }

If you're not too happy about the goodness of the random number generator then do the shuffle more than once.

Finally how to use the random table?

That's easy - given that the key is just a single character in c, convert it to its ASCII code and look up what is stored in that element of the table i.e.:

int hash =h[ Convert.ToInt32(c)];

As the table has been shuffled the ASCII value is randomly assigned to another value in the range 0 to 255 and this is the result of the hash function.

This is fine but only gives us a hash function that will hash a single key made up of a single character. Extending this to a multi-character key isn't at all difficult and just involves finding a way of using the table on each character to build up a final hash value.

I'm quite sure that you can think up your own methods but a particularly good one that retains the full benefit of the randomness in the table is:

for(int i=0;i<s.Length;i++){ int c = Convert.ToInt32(s.Substring(i, 1)); hash=h(hash ^ c); }

This takes a random walk around the table starting from h(c) and using the ^ (Exclusive OR) operator to combine the result of the previous step with the ASCII code of the next character. This gives a result in the range 0 to 255 that is very random and depends on all the characters in the string.

If you need a second byte for the hash value simply start the random walk again from a different initial point, e.g. by adding one to the initial character's ASCII code.

Perfect hashing

This use of a table to construct a hash function produces excellent hash function behaviour but it also opens up another possibility.

As the table determines where any particular key will be hashed to and the table is something that we create why not try to create tables with advantageous properties. For example, why not test the quality of the hashing function by trying it out on a random selection of keys and see where they are hashed to. If the hash function produces a lot of collisions then you can scrap it and try again.

Of course being programmers the obvious thing to do is to write a program that generates hash functions and tests them automatically - a sort of designer hash function factory!

Now take this search for a good hashing function one tiny step further. Let's look for a perfect hash function! If you know that there is a particular set of data that you want to work with you could use this as your test set of keys and leave the hash function factory searching until it found a table based hash function that produces zero collisions! There is no guarantee that such a table will be found in a reasonable amount of time but this is a gamble worth taking.

A perfect hash function can be used to store the test set of keys without collision and so you can find them again with a single lookup.

Of course as soon as you move beyond the test set of keys collisions will happen but in some applications this might not be important.

For example, if you were writing a compiler then you could search for a perfect hash function for the keywords of the language as part of the tokenisation pass. In this case it would be even better if the hash function hashed each keyword onto consecutive values without gaps i.e. the first keyword hashed to 1, the second to 2 and so on.

Minimal perfect hash function

Such a hash function is called a minimal perfect hash function for the set of keys.

The uses of a minimal perfect hash function aren't that wide but they are important.

For example, in the case of the compiler tokenisation routine each identifier could be hashed in turn. The hash values of the keywords could be arranged to lie in a known range by using a minimal perfect hash function constructed using them. The non-keyword identifiers would be stored in the table in the usual way with collision handling provided.

Consider the problem of detecting particular key words within an input stream of characters. A minimal perfect hash function constructed for the words that you are trying to detect in the input stream would work very well as an alarm system. Simply hash the data stream and look out for the particular range of values that the keywords occupy. Notice that using this method you are testing for the presence of all the keywords with a single hash function!

Perfect hash functions sound useful but the difficulty is in constructing them. The crude trial and error method is very slow and to do the job properly you need to invent a few heuristics to guide the search by modifying the existing table, rather than starting with a completely fresh table each time.

Error correcting codes are essential to computing and all sorts of communications. At first they seem a bit like magic. How can you possibly not only detect an error but correct it as well? How do the [ ... ]

Part II of our look at data takes us into more sophisticated structures that are fundamental to computing - stacks, queues, deques and trees. If you don't know about these four then you are goin [ ... ]