Information theory – perhaps one of the most remarkable inventions of the twentieth century – naturally leads on to the consideration of how information can be coded, and hence to coding theory.
A Programmer's Guide To Theory – First Draft
In Information Theory we learned that if a message or symbol occurs with probability p then the information contained in that symbol or message is –log_{2}(p) bits.
For example, consider the amount of information contained in a single letter from an alphabet. Assuming there are 26 letters in the alphabet, and assuming they are all used equally often, the probability of receiving any one particular letter is 1/26, which gives –log_{2}(1/26) = 4.7 bits per letter.
It is also obvious from other elementary considerations that five bits are more than enough to represent 26 symbols. Quite simply, five bits allow you to count up to more than 26, and so you can assign one letter of the alphabet to each number. In fact, five bits are enough to represent 32 symbols because you can count from 0 up to 31, i.e. 11111 = 31.
Using all five bits you can associate, say, A with 0 and Z with 25, and the values from 26 to 31 are just wasted. It seems a shame to waste bits just because we can’t quite find a way to use 4.7 bits – or can we?
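A quick check of these numbers, as a sketch in Python:

```python
import math

# Information in one of 26 equally likely letters
bits_per_letter = math.log2(26)          # about 4.7 bits

# Whole bits needed to give each letter its own code
whole_bits = math.ceil(bits_per_letter)  # 5 bits

# Five bits can label 2**5 = 32 symbols, so some codes go unused
wasted_codes = 2**whole_bits - 26        # 6 wasted codes

print(bits_per_letter, whole_bits, wasted_codes)
```

The six wasted codes, 26 to 31, are exactly the gap between the 4.7 bits a letter actually carries and the 5 whole bits we spend on it.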
Can we split the bit?
Average information
Before we go on to consider “splitting the bit” we need to look more carefully at the calculation of the number of bits of information in one symbol from an alphabet of 26.
Clearly the general case is that a symbol from an alphabet of n symbols carries log_{2}(n) bits of information, assuming that every symbol has equal probability.
Of course, what is wrong with this analysis is that the letters of the alphabet are not all equally likely to occur. For example, if you don’t receive a letter “e” you would be surprised – count the number in this sentence. However, your surprise at seeing the letter “z” should be higher – again count the number in some other sentence than this one!
Empirical studies can provide us with the actual rates at which letters occur – just count them in a book, for example. If you add the “Space” character to the set of symbols you will discover that it is by far the most probable character, with a probability of 0.185 of occurring, followed by “e” at 0.100, “t” at 0.080 and so on, down to “z”, which has a probability of only 0.0005. It makes you wonder why we bother with “z” at all!
But, given that “z” is so unlikely, its information content is very high: –log_{2}(0.0005) = 10.96 bits. In other words, “z” contains nearly 11 bits of information compared to just over 3 bits for “e”.
If you find this argument difficult to follow, just imagine that you are placing bets on the next symbol to appear. You expect the next symbol to be a Space character, so you are very surprised when it turns out to be a “z”. Your betting instincts tell you what the expected behaviour of the world is, and when something happens that isn't expected you gain some information.
Taken to the extreme it seems almost obvious. The sun rises every morning, and so the news that it has risen this morning is really no news – no information. Now consider the opposite event – that is some information!
We can define an average information that we expect to get from a single character taken from the alphabet.
If the i^{th} symbol occurs with probability p_{i} then the information it contains is –log_{2}(p_{i}) bits and the average information contained in one symbol, averaged over all symbols is:
Average information = –p_{1}log_{2}(p_{1}) – p_{2}log_{2}(p_{2}) – p_{3}log_{2}(p_{3}) – … – p_{n}log_{2}(p_{n})
You can see that this is reasonable because each symbol occurs with probability p_{i} and each time it occurs it provides –log_{2}(p_{i}) bits of information. Notice that while “z” carries nearly 11 bits of information, the fact that it doesn’t occur very often means that its contribution to the average information is only about 0.0005 x 11, or roughly 0.0055 bits per symbol.
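The average-information formula can be sketched as a small Python function. The two-symbol probabilities below are purely illustrative, not the empirical letter frequencies:

```python
import math

def average_information(probs):
    """Average information in bits per symbol: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs)

# A fair two-symbol alphabet versus a biased one
print(average_information([0.5, 0.5]))  # 1.0 bit per symbol
print(average_information([0.9, 0.1]))  # about 0.47 bits per symbol
```

Notice how skewing the probabilities drags the average down, even though the rare symbol individually carries more information.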
Applying this formula to the letters of the alphabet and their empirical probabilities gives an average information of 4.08 bits per symbol, which should be compared to log_{2}(27) = 4.75 bits for a single symbol from an alphabet of 27 equally likely characters. Notice that the average information for 27 equally likely characters also works out to 4.75 bits – to see why, try evaluating the formula with every p_{i} equal to 1/27.
There is also a nice theorem which says that the average information per symbol is largest when all of the symbols are equally likely. This means that if any of the symbols of an alphabet are used more or less often than others, the average information per symbol is lower – and this observation is the key to data compression techniques.
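A sketch of this theorem in Python, comparing a uniform 27-symbol alphabet with an uneven one of my own devising (the uneven distribution is just an illustration, not the empirical letter frequencies):

```python
import math

def average_information(probs):
    """Average information in bits per symbol: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 27
uniform = [1 / n] * n                       # all symbols equally likely
uneven = [0.5] + [0.5 / (n - 1)] * (n - 1)  # one symbol hogs half the probability

print(average_information(uniform))  # log2(27), about 4.75 bits
print(average_information(uneven))   # strictly less
```

Any departure from equal probabilities lowers the average, which is exactly the slack that compression schemes exploit.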
