One of the basic problems of computing is finding something that you have stored somewhere. For example, if you are keeping a table of names and telephone numbers, how can you organise the list so that you can find the phone number given the name?
The most obvious solution is to keep the list sorted into alphabetical order and use a binary search to locate the name. This is efficient, but only as long as the list is kept in strict order. If you only have to add names and numbers infrequently then you could use an overflow table that isn't sorted into order. To find a name you would use a binary search on the sorted table and, if you didn't find the name there, a linear search on the overflow table would either find the name or confirm that the name wasn't known. Using an overflow table in this way works as long as the overflow table is kept short. If you are adding a lot of names then the time to perform the linear search in the overflow table will quickly become too much. In this case it is time to think of hashing.
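The sorted-table-plus-overflow scheme can be sketched in a few lines of Python (the function name and the data layout are my own, purely for illustration):

```python
import bisect

def find_number(sorted_entries, overflow, name):
    """Look up a phone number: binary search on the sorted table,
    then a linear search of the unsorted overflow table."""
    names = [n for n, _ in sorted_entries]       # keys for the binary search
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:      # found in the sorted table
        return sorted_entries[i][1]
    for n, number in overflow:                   # fall back to a linear scan
        if n == name:
            return number
    return None                                  # name not known
```

As the article notes, this is fine only while the overflow list stays short; every lookup that misses the sorted table pays for a full scan of the overflow.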
Hashing is one of the great ideas of computing and every programmer should know something about it. The basic idea is remarkably simple - in fact it is so simple that you may find yourself wondering where the magic comes from.
Suppose you have a function, FHASH(name), that computes a number in the range 1 to N, depending in some way on the value of name. Why not store the name and phone number at location FHASH(name) in the table?
Using this scheme, finding a name in the table is just a matter of computing FHASH(name) and looking at this location in the table to see if name is stored there! Yes, that's all there is to it. Well, there are a few practical details, but using FHASH(name) to give the location where name is stored is the key idea.
To make sure that there can be no confusion about how simple hashing really is, let's look at a small example. If the hashing function is the sum of the ASCII codes of the first two letters of the name minus 128, then "JAMES" would be stored at ASC("J")+ASC("A")-128 or, in plain English, at location 11 in the table. If you want to know if you have a phone number for JAMES you simply compute FHASH("JAMES") and look in location 11 to see if it's there.
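The arithmetic is easy to check in Python (fhash here is just a stand-in for the FHASH of the text):

```python
def fhash(name):
    # sum of the ASCII codes of the first two letters, minus 128
    return ord(name[0]) + ord(name[1]) - 128

print(fhash("JAMES"))  # ASC("J") + ASC("A") - 128 = 74 + 65 - 128 = 11
```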
No sorting and no searching required. When you compute the hash function you know where to store the data and you know where to find the data.
Now that we have a concrete example to look at you might begin to see what the practical problems referred to earlier are. In particular where does our hashing function suggest that we store the phone number for JAMESON?
Yes, you're absolutely right - in exactly the same place that we store the phone number for JAMES. Clearly, how you design the hashing function is all-important. Indeed, it isn't difficult to show that no matter how much effort you put into a hashing function, collisions such as the one typified by JAMES and JAMESON will happen.
A good hashing function will tend to scatter the data values over the storage table as much as possible - hence another name for the technique is scatter storage - but collisions will still occur.
There are two main approaches to dealing with collisions - chaining and open addressing. Chaining is simply an adaptation of the overflow method described earlier. If you compute the hash function and find the location already occupied, then the second item to arrive is stored in an overflow table and a pointer to it is stored along with the item in the main table. If more collisions occur, the extra data items are stored in the overflow table and pointers are set to form a chain of items that all have the same hash value. Now when you look for, or store, an item you may have to perform a linear search of a chained list, but as long as there aren't too many collisions this shouldn't take much time. The real problem with chaining is the extra storage needed to implement the pointers.
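A minimal sketch of chaining in Python - here the pointer chains of the description become Python lists, and the hash function is an arbitrary stand-in, not one recommended by the text:

```python
class ChainedTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, size=11):
        self.buckets = [[] for _ in range(size)]   # one chain per slot

    def _fhash(self, name):
        # stand-in hash function: sum of all character codes, mod table size
        return sum(ord(c) for c in name) % len(self.buckets)

    def store(self, name, number):
        chain = self.buckets[self._fhash(name)]
        for i, (n, _) in enumerate(chain):
            if n == name:                          # already present: update
                chain[i] = (name, number)
                return
        chain.append((name, number))               # collision: extend the chain

    def find(self, name):
        # linear search of the (hopefully short) chain for this hash value
        for n, number in self.buckets[self._fhash(name)]:
            if n == name:
                return number
        return None
```

A deliberately tiny table forces collisions, and the chains absorb them without any item being lost.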
This brings us to the alternative - open addressing. The simplest form of open addressing is the interestingly named `linear probe'. If you go to store an item in the table and find that the location indicated by the hash function is already occupied then search down the table for the first free location and store the item there. This results in data items with the same hash value being stored close to each other in the table.
When you are looking for a name using this scheme you now compute the hash function and inspect sequential locations in the table until you either find what you are looking for or the first blank entry. The table is regarded as circular, i.e. the next location after the last is the first.
Obviously if the table is getting near full then hashing with a linear probe gets to look very like a simple linear search of an unsorted table! In practice linear probing works very well as long as the table is less than 75% full.
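A linear-probe table along these lines might look like this in Python (again the hash function is an arbitrary stand-in; the table is treated as circular, exactly as described above):

```python
class LinearProbeTable:
    """Open addressing with a linear probe over a circular table."""

    def __init__(self, size=16):
        self.slots = [None] * size

    def _fhash(self, name):
        # stand-in hash function: sum of all character codes, mod table size
        return sum(ord(c) for c in name) % len(self.slots)

    def store(self, name, number):
        i = self._fhash(name)
        for _ in range(len(self.slots)):
            if self.slots[i] is None or self.slots[i][0] == name:
                self.slots[i] = (name, number)     # free slot or update
                return
            i = (i + 1) % len(self.slots)          # wrap past the last slot
        raise RuntimeError("table is full")

    def find(self, name):
        i = self._fhash(name)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:              # first blank entry: not stored
                return None
            if self.slots[i][0] == name:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None                                # searched the whole table
```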
An alternative to linear probing is double hashing. In linear probing the locations searched are given by FHASH(name)+n where n is 0,1,2,3... The problem with linear probing is that the initial hashing function `scatters' the data into the table, but after that the linear stepping through the table groups the data together. It is more like `blobbing' the data into the table than scattering it!
A better method would be to use two hash functions - one for the initial location in the table and the second to scatter the collisions a little. That is, use FHASH(name)+n*FHASH2(name) where n is 0,1,2,3... and FHASH2 is a second hashing function. With a careful choice of the second hash function you can make savings in the time taken to find a data item.
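A sketch of the probe sequence under double hashing; the particular choice of FHASH2 below (1 plus a remainder, so the step is never zero) is a common convention, not something prescribed above:

```python
def probe_sequence(name, size, probes):
    """First `probes` slots examined under double hashing:
    FHASH(name) + n * FHASH2(name), taken modulo the table size."""
    codes = sum(ord(c) for c in name)
    h1 = codes % size                # initial location in the table
    h2 = 1 + codes % (size - 1)      # step size: never zero, so probing always moves
    return [(h1 + n * h2) % size for n in range(probes)]
```

When the table size is prime, the step is coprime to it, so the sequence eventually visits every slot rather than clumping around the initial location.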
At this point I could start into a description of what makes a good second hash function but I have to admit that unless memory is in very short supply it is usually simpler to make the table bigger using a linear probe than use double hashing.
The point is that the efficiency of hashing is most affected by how full the table is and as long as the table is only around 50% used then there isn't much to be gained by using the more complicated double hashing.
Finally I cannot leave the topic of hashing without describing a real hashing function rather than the simplistic one used in the earlier example. Most good hashing functions work by computing the remainder after dividing by the table size N.
This always gives a value between 0 and N-1, so it is suitable, and if N is a prime number then it is also excellent at scattering the data around the table. Of course, if you have a text value that you want to hash, you first have to convert it into a suitable numeric value, and a simple scheme like the one in the example will not do.
You need to produce a different numeric value for as many of the possible text values as you can, and adding together the ASCII codes of the first two letters clearly doesn't work. A better method is to weight each of the ASCII codes by the position of the letter - multiplying by 1 for the first character, 10 for the second, 100 for the third and so on - before adding them up to give a single value.
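Putting the two ideas together - position-weighted character codes, then the remainder on division by a prime table size - gives something like this (a sketch; the table size of 101 is my own arbitrary choice):

```python
def fhash(name, table_size=101):       # 101 is prime, which scatters well
    value = 0
    weight = 1
    for ch in name:
        value += ord(ch) * weight      # weight 1, 10, 100, ... by position
        weight *= 10
    return value % table_size          # remainder is always 0 .. table_size-1
```

Unlike the two-letter toy function, this one sends JAMES and JAMESON to different slots, because the extra letters change the weighted sum.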
In many computer languages and most frameworks, hash functions are provided for you to just use. Even if you don't make use of them for scatter storage, the framework most probably will as part of its implementation of data structures such as dictionary objects and lookup tables.
One small twist to the story is that hash functions often turn up as part of cryptographic and security libraries because they are also used to protect passwords and to sign documents. The connection is that if you give a system a password it can compute a hash value and store only the hash. When you log on again you supply the password and, as long as the re-computed hash matches the stored value, you are authenticated. Notice that for an attacker to gain access all they have to do is find any word with the same hash. In the case of signing documents or detecting changes, the entire document is fed into a hash function - any change to the document is very likely, but not certain, to produce a change in the hash value. There are lots of other creative uses of hash functions, but it all started with scatter storage.
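The store-only-the-hash idea can be sketched as follows; note that SHA-256 is used purely for illustration - real systems use deliberately slow, purpose-built password-hashing functions such as bcrypt or Argon2:

```python
import hashlib

def hash_password(password, salt):
    """Store only the hash of the (salted) password, never the password."""
    return hashlib.sha256(salt + password.encode()).hexdigest()

def check_password(attempt, salt, stored_hash):
    # re-compute the hash and compare it with the stored value
    return hash_password(attempt, salt) == stored_hash
```

The system keeps only the salt and the hex digest; a correct password re-produces the stored hash, while a wrong one (almost certainly) does not.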