Now that we have the mathematics out of the way, we need to return to look again at the idea of information as measured by I(A)=-log_{2}(p).

You may have noticed that powers of two seem to figure in the calculation of information and this seems to be very natural and appropriate when it comes to computers.

However, to see how deep and natural the connection is we can play a simple game.

Suppose I toss a fair coin and you have to find out if I got a head or a tail. A very boring game because you can simply ask me and I will tell you – but some information has flowed between us and we can attempt to use our new measure of information to find out how much.

Before I answer the question the two answers “Heads” or “Tails” are equally possible and so their probabilities are simply p=0.5 and q=0.5 respectively. So when I give you the answer the amount of information I pass to you is

I(answer)=-log_{2}(.5)=-(-1)=1

This is quite interesting because in this binary split between two equally likely alternatives I am passing you one unit of information.

We can extend this a little.

Suppose I toss two coins and you have to ask questions to find out if the result was (Head, Head), (Head,Tail), (Tail, Head) etc.. but you can only ask about one coin at a time.

Coins and information

Then you would ask one question and get 1 unit of information and then ask you second question and get another unit of information. As information is additive you now have 2 units of information.

Whenever you encounter a binary tree of this sort the number of units of information that you get by knowing which outcome occurred is just the number of levels that the tree has.

In other words, for two coins there are two levels and getting to the bottom of the tree gets you two units of information. For three coins you get three units of information and so on. Notice that this isn’t entirely obvious because for 2 coins you have four possible outcomes and for 3 coins you have 8:

It seems that if you find out which of 32 possible outcomes occurred you only gain 5 units of information – but then you only needed to ask 5 binary questions!

You could of course do the job less efficiently. You could ask 32 questions of the type “Did you get HTHTH?” until you get the right answer but this is up to you. If you want to gather 5 units of information in an inefficient way then that’s your choice and not a reflection of the amount of information you actually gain.

A 2-bit tree

Now consider the following task – represent the outcome of the coin-tossing in a program.

Well most programmers would say let the first coin be represented by a single bit, 0 for tails 1 for head. Then the two-coin situation can be represented by two bits 00, 01, 10, and 11. The three-coin situation can be represented by three bits 000, 001, 010, and so on.

You can see that this looks a lot like the way information increases with number of coins. If you have n coins you need n bits to represent a final outcome and you have n units of information when you discover the final outcome.

The connection should by now be fairly obvious. The unit of information is the bit and if a symbol occurs with probability p then it carries –log_{2}(p) bits of information and needs a minimum of –log_{2}(p) bits of information to represent or code it. You can of course use more bits to code it if you want to but –log_{2}(p) is the very smallest number you can use without throwing away information.

Notice that this conclusion is very deep. There is a connection between the probability of a symbol occurring and the amount of information it contains and this determines the smallest number of bits needed to represent it.

We go from probabilities to bits - which is not an obvious connection.

The alphabet game

At this point you should be thinking that this is all very reasonable but how does it work in more complicated situations?

It is time for a very small simple example to show that it all does make good sense.

Consider a standard alphabet of 26 letters. Let’s for the moment assume that all of the letters are equally likely. Then the probability of getting any letter is 1/26 and the information contained in any letter is

–log_{2}(1/26)=4.7bits

What this means is that you need 5 bits to encode this standard alphabet and a single letter carries just less than 5 bits of information.

Notice that in theory we can find a code that only uses 4.7bits per letter on average and this is the sort of question that information theorists spend their lives working out. For most of us a storage scheme that uses 5 bits is good enough.

As a sanity check you can also see that five bits really are enough to code 26 letters or other symbols. If you run though the list of possibilities – 00000, 00001, 00010, and so on you will find that there are 32 possible five-bit codes and so as predicted we have more than enough bits for the job.

Even the guessing game analogy still works. If I think of a letter at random then how many questions do you have to ask to discover which letter it is?

The simple minded answer is 26 at most - is it an A?; is it a B? and so on.. But by asking a slightly more clever question the same job can be done in 5 questions at most.

Can you work out what the questions are?

Compression

Information theory has lots of interesting applications in computing and many of them are explored in other articles but the one I want to bring to your attention is “coding theory”.

A good few years ago at the start of the personal computer revolution a self-appointed guru announced that he had invented a “data compression machine” – in hardware.

These were days well before zipping files was common and the idea of data compression was strange, exotic and very, very desirable – after all the largest disks at the time were only 180Kbytes! Compression is fine but this particular guru claimed that he could get an 80% compression ratio and all the data could be fed through a second time to get a compounded compression rate of 80% of the 80%.

Such ideas are the same as perpetual motion machines and faster than light travel but without information theory you can’t see why this particular claim is nonsense.

Compression works by finding the code that represents the data most efficiently.

If the data D contains I(D) bits of information and it is currently represented by more bits than I(D) you can compress it by finding the optimal code. That’s what programs that zip files attempt to do – they scan the data and work out a code custom made for it. The details of the code are stored along with the data in a “dictionary” but even with this overhead the savings made are enough to provide large space savings in most cases.

Now why can’t you repeat the procedure to get twice the savings?

It should be obvious now with information theory. You can’t do it again because once you have coded the information using an optimal code –you can’t improve on perfection! So the smallest number of bits needed to represent any data is the measure of the information it contains.

Computational grammar is a subject that is sometimes viewed as a form of torture by computer science students, but understanding something about it really does help ....