The Bloom Filter
Written by Mike James   
Thursday, 23 June 2022
Article Index
The Bloom Filter
A Bloom filter in C#

A Bloom filter in C#

To make sure that you follow the way the Bloom filter actually works let's implement a simple version in C#. This is not a production version and it certainly isn't optimized. Optimization would depend very much on what you were using it for. What this implementation does is focus your attention on some of the difficulties of implementing a good Bloom filter.

The first simplicity is that instead of custom crafting a bit array we can use the supplied BitArray object. This can be found in the Collections namespace:

using System.Collections;

We can also create a Bloom filter class:

class Bloom
{
 const int m = 1000;
 const int k = 5;
 BitArray bloom = new BitArray(m);

The constants m and k set the size of the bit array and the number of hash functions to use.

Before we can do anything else we need to solve the problem of finding k hash functions. This is difficult because finding one hash function is tough to find k is very tough. Fortunately there are a number of easy ways out of the problem. The first is that we can use the built in GetHashCode method that is built into every string. While this is not likely to be optimized it is supposed to be good enough for hash storage.

This gives us one hash function. The simplest way to generate k-1 more is to use the hash value as the seed for a random number generator and use the next k-1 random numbers as additional hash values. This isn't ideal and in practice it tends to produce a Bloom filter that is about as good in terms of false positives as using k/2 true hash values.

An alternative way of doing that job that is nearly as efficient is to generate two independent hash values h0 and h1 and then forming:

h[k]=h0+i*h1;

for i = 2,,, k-1. This is easy but it needs two independent hash functions and this would mean having to actually code one in C#.

The simple random number solution to finding k hash values is very easy to implement:

private int[] hashk(string s, int k)
{
 int[] hashes = new int[k];
 hashes[0] = Math.Abs(s.GetHashCode());
 Random R = new Random(hashes[0]);
 for (int i = 1; i < k; i++)
 {
  hashes[i] =  R.Next();
 }
 return hashes;
}

When the method returns we have k hash values in an int array ready to be used.

The Bloom class needs two other methods. One to add a data value:

public void AddData(string s)
{
 int[] hashes = hashk(s, k);
 for (int i = 0; i < k; i++)
 {
  bloom.Set(hashes[i] % m,true);
 }
}

Notice the way that the BitArray uses a Set method to set the bit at the correct location to either true i.e. 1 or false i.e. 0. When this method returns all of the locations indicated by the hash array have been set to 1.

The final method is the one that looks up a value to see it if is already in the filter:

public Boolean LookUp(string s)
{
 int[] hashes = hashk(s, k);
 for (int i = 0; i < k; i++)
 {
  if (bloom[hashes[i] % m] == false) return false;
 }
 return true;
}

 

This has to simply scan the locations indicated by the hash array and return true if they are all true. Notice that the only certainty is if you find a false i.e.0 BitArray element because then you can be sure that the string has never been entered into the filter.

To try it out you would do something like:

Bloom MyBloom = new Bloom();
string data = "Hello Bloom Filter";
MyBloom.AddData(data);

and to lookup a string:

string data = "Hello Bloom Filter";
MessageBox.Show(MyBloom.LookUp(data).ToString());

Of course in a real situation you would add lots of data  and look up lots of data.

If you plan to make use of a Bloom filter then it is worth first writing two good independent hash functions. However you are also going to want to implement it using something closer to the machine-like C, if not assembler, to get the final drop of performance out of it.

bloom1

Related articles

Hashing

Hashing solves one of the basic problems of computing - finding something that you have stored somewhere.

Advanced Hashing

Extensible hashing and perfect hashing are ideas that are worth exploring.

The Invertible Bloom Filter

A Bloom filter that you can delete, retrieve and list all data values.

Universal Hashing

A look at a way of creating a set of hash functions

More Information

Paper on using two hash values to create more:

Less Hashing, Same Performance…

MurmurHash - a good general purpose hash.

 

Killer Algorithms

killercover

Contents

  1. Algorithms - clever ways of doing things
  2. Finding Things With Binary Search
  3. Hashing - The Greatest Idea In Programming 
  4. Advanced Hashing*
  5. Universal Hashing*
  6. QuickSort Exposed*
  7. Quick Median*
  8. The Fast Fourier Transform
  9. Backprop
  10. The Bloom Filter 
  11. The Invertible Bloom Filter !!!NEW
  12. Kalman Filter
  13. Public Key Encryption*
  14. Linear Programming

* First Draft

espbook

 

Comments




or email your comment to: comments@i-programmer.info

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


The Fundamentals of Pointers

Despite the fact that pointers have been long regarded as "dangerous" they are still deeply embedded in the way we do things. Much of the difficulty in using them stems from not understanding where th [ ... ]



Compilers, Interpreters, VMs and JIT

The distinction between a compiler and an interpreter is one that can cause controversy. One programmer's compiler is another's interpreter and the whole subject gets very murky when you throw in the  [ ... ]


Other Articles



Last Updated ( Saturday, 25 June 2022 )