The Invertible Bloom Filter
Written by Mike James
Friday, 15 July 2022
If you think that the Bloom filter is magic, wait until you see the invertible Bloom filter. This not only keeps a record of data, it allows you to add, delete and make a list of the data you have stored.
There is something special about using hash functions to manage storage. It seems to give you magical powers at no cost. The Bloom filter, for example, can tell you almost instantly if you have ever encountered an item of data before. The catch is that if it tells you that you have never seen the data then it is always 100% correct, but if it tells you that you have then it might be wrong. You can make the probability of a false positive as small as you like, but it is the price you pay for the lightning lookup time.

Bloom Filter Basics

The Bloom filter is easy to describe, but if you want a full account, including a C# implementation demonstrating how it works, then see: The Bloom Filter. For a quick summary:

Assuming you have k hash functions h1, h2 .. hk and a bit array B, then when an item of data d arrives you set the bits stored in the bit array at h1(d), h2(d) .. hk(d). That is, after the update:

B[h1(d)], B[h2(d)] .. B[hk(d)]
are all set to 1.

When a new item of data x arrives and you want to know if you have encountered it before, you simply work out the hash functions h1(x), h2(x) .. hk(x) and look in the corresponding locations in the bit array B. If any one of them is zero then you can conclude with certainty that you have not encountered the data before, because if you had, the bit would have been set.

If all of the bits are set to one you can't conclude with certainty that you have seen the data item, because a hash function can map two different data items to the same location. In other words, for some data items, a and b, it occasionally happens that h(a)=h(b). This is usually referred to as a hash collision. What this means is that other data might have set some of the bits. The method can tolerate a few bits that are accidentally set, but it is possible for them all to be set by data other than x. However, by using a big enough bit array and a suitable number of hash functions you can make the probability of a false positive as small as you like.

You trade off the slight chance that you get a false positive for the speed and storage economy offered by a Bloom filter. In general, Bloom filters are ideal when you need to check for the presence/absence of some data element and the cost of getting the presence test wrong is low.

An Invertible Bloom Filter

The principle of Bloom filters is both clever and satisfying but has some drawbacks. In particular, you can't remove a data item from a filter because you might zero a bit that was also set by another data item, and so mark that item as not being in the filter as well. You also cannot use a Bloom filter to make a list of what is stored in the filter or retrieve an item of data based on a key. Sometimes not being able to retrieve a value is a good thing from the point of view of security or privacy, but other applications need retrieval.
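To make the basic filter and its limitations concrete, here is a minimal Python sketch. The class name, the default sizes and the way the k hash functions are derived (SHA-256 salted with the hash number) are all assumptions made for illustration; this is not the C# implementation referred to earlier.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array of size m and k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item):
        # Hypothetical stand-in for h1 .. hk: salt SHA-256 with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set B[h1(d)], B[h2(d)] .. B[hk(d)] to 1.
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means "definitely never added"; True only means "probably added".
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))  # True: an added item is never reported missing
```

Notice that there is no remove method. Clearing the bits for one item could also clear bits that another item relies on, which is exactly the drawback described above.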
The invertible Bloom filter works in more or less the same way as a basic Bloom filter, but it works with key-value pairs (x,y) and, instead of a bit array, it uses an array of three-component cells, each of which can store a key, a value and a count. So B[i].count is the number of times B[i] has been used, B[i].key is the key and B[i].value is the value stored. When a key-value pair (x,y) needs to be stored, all you do is compute the hash functions on the key, h1(x), h2(x) .. hk(x), store x and y in each of the locations and increment the corresponding count. That is, for each i from 1 to k:

B[hi(x)].count = B[hi(x)].count + 1
B[hi(x)].key = x
B[hi(x)].value = y
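As a rough sketch of this first version of the structure in Python, the three-component cell can be modelled as a small data class. The names Cell, positions and naive_store, and the salted SHA-256 stand-in for h1 .. hk, are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
import hashlib


@dataclass
class Cell:
    count: int = 0  # number of times this cell has been used
    key: int = 0    # key stored in this cell
    value: int = 0  # value stored in this cell


def positions(key, k, m):
    """Hypothetical stand-in for h1(x) .. hk(x): k indices in range(m)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def naive_store(B, k, x, y):
    # Plain assignment into each hashed cell, plus a count increment.
    for i in positions(x, k, len(B)):
        B[i].count += 1
        B[i].key = x
        B[i].value = y


B = [Cell() for _ in range(16)]
naive_store(B, 3, 42, 7)
print([c for c in B if c.count > 0])  # the cells touched by key 42
```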
Of course, this being a Bloom filter, the hash functions will result in storing multiple data items in the same location. So how do we cope with this? You can't simply store x and then store z in the same location because this would wipe out all trace of x. The solution is to use a reversible storage function. For example, if you store a value in B by adding it:

B = B + x

you can remove it by subtracting it:

B = B - x
If B already stored a value before you added x then when you subtract x you get the value back again. You can use addition, but a much easier function to use is XOR. If you XOR a value with another, then XORing the result with that same value a second time returns you to the original value. For example, with 4-bit values:

0101 XOR 0011 = 0110

Do the same operation on the result:

0110 XOR 0011 = 0101

and you get the number you started with. In other words:

(a XOR b) XOR b = a
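and XOR is its own inverse operation. To create the invertible Bloom filter all we do is XOR the data into the stored elements and increment the count. That is, for each i from 1 to k:

B[hi(x)].count = B[hi(x)].count + 1
B[hi(x)].key = B[hi(x)].key XOR x
B[hi(x)].value = B[hi(x)].value XOR y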
This is the complete algorithm for storing an element and it corresponds to the operation:

INSERT(x, y): insert the key-value pair, (x, y), into B. This operation always succeeds, assuming that all keys are distinct.
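The insert operation can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: keys and values are taken to be integers so that they can be XORed directly, the names Cell, positions, insert and undo_insert are hypothetical, and the k hash functions are simulated with salted SHA-256. The table is also split into k equal sub-tables, one per hash function, so that a key never hits the same cell twice and cancels itself out.

```python
from dataclasses import dataclass
import hashlib


@dataclass
class Cell:
    count: int = 0
    key: int = 0    # XOR of all keys stored in this cell
    value: int = 0  # XOR of all values stored in this cell


def positions(key, k, m):
    """Hypothetical h1 .. hk: hash number i only indexes sub-table i."""
    width = m // k
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        out.append(i * width + int.from_bytes(digest[:8], "big") % width)
    return out


def insert(B, k, x, y):
    # XOR the pair into each hashed cell and increment the count.
    for i in positions(x, k, len(B)):
        B[i].count += 1
        B[i].key ^= x
        B[i].value ^= y


def undo_insert(B, k, x, y):
    # XOR is its own inverse, so applying it again restores the cell.
    for i in positions(x, k, len(B)):
        B[i].count -= 1
        B[i].key ^= x
        B[i].value ^= y


B = [Cell() for _ in range(9)]
insert(B, 3, 5, 100)
insert(B, 3, 9, 200)
undo_insert(B, 3, 9, 200)

expected = [Cell() for _ in range(9)]
insert(expected, 3, 5, 100)
print(B == expected)  # True: only the pair (5, 100) remains
```

The final comparison shows why the reversible storage function matters: even where the two keys collide in a cell, XORing the second pair out again leaves exactly what the first pair put there.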
Last Updated (Friday, 15 July 2022)