Programmer's Guide To Theory - Splitting the Bit
Written by Mike James   
Monday, 22 July 2024

The two symbols that are least likely now are D and E with a combined probability of 0.55. This also completes the coding because there are now only two groups of symbols and we might as well combine these to produce the finished tree. 

[Figure: The final step]
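The merging procedure is easy to express in code. The following is a minimal sketch in Python of building the coding tree by repeatedly combining the two least likely nodes. The symbol probabilities are illustrative assumptions made for this sketch, not necessarily the exact values used earlier in the article:

import heapq
import itertools

# Assumed, purely illustrative probabilities for the five symbols
probs = {"A": 0.20, "B": 0.10, "C": 0.15, "D": 0.25, "E": 0.30}

def build_tree(probs):
    # Repeatedly merge the two least likely nodes until one tree remains
    tie = itertools.count()   # tie-breaker so heap entries always compare
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # least likely node
        p2, _, right = heapq.heappop(heap)   # next least likely node
        heapq.heappush(heap, (p1 + p2, next(tie), (left, right)))
    return heap[0][2]                        # the finished coding tree

tree = build_tree(probs)

A leaf is just a symbol and an internal node is a pair of sub-trees, so the finished value mirrors the tree in the diagram, although which group ends up on the left or the right depends on arbitrary tie-breaking choices.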

 

This coding tree gives the most efficient representation of the five letters possible. To find the code for a symbol you simply move down the tree reading off the zeros and ones as you go until you arrive at the symbol.

To decode a set of bits that has just arrived you start at the top of the tree and take each branch in turn according to whether the bit is a 0 or a 1 until you run out of bits and arrive at the symbol. Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.
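Continuing the sketch above, reading the codes off the tree and decoding a bit stream can both be written as simple tree walks. This reuses the tree and probs defined earlier and is only a sketch:

def assign_codes(node, prefix="", table=None):
    # Move down the tree, collecting 0s and 1s until a symbol is reached
    if table is None:
        table = {}
    if isinstance(node, str):        # a leaf, i.e. a single symbol
        table[node] = prefix or "0"
    else:
        left, right = node
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    return table

def decode(bits, tree):
    # Start at the top of the tree and branch on each bit until a symbol is reached
    symbols, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):    # arrived at a symbol
            symbols.append(node)
            node = tree              # back to the top for the next symbol
    return "".join(symbols)

codes = assign_codes(tree)
bits = "".join(codes[s] for s in "BADE")
print(bits, "->", decode(bits, tree))   # the original BADE is recovered

The actual bit patterns depend on arbitrary choices - which branch is labelled 0 and how probability ties are broken - so they may differ from those in the figure, but the code lengths, and therefore the average, come out the same.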

The theoretical average information in a symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier. If you try to code B you will find that it corresponds to 111, i.e. three bits, and it corresponds to moving down the far right-hand branch of the tree. If you code D you will find it corresponds to 00, i.e. the far left-hand branch of the tree. In fact, each remaining letter is coded as either a two- or three-bit code and, guess what? If the symbols occur with their specified probabilities, the average length of code used is 2.3 bits.

So we have indeed split the bit! The code we are using averaged 2.3 bits to send a symbol.
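You can check this numerically. A rough calculation using the assumed probabilities and the codes from the sketches above compares the average information with the average code length; with these made-up values the two figures come out within a few hundredths of a bit of each other, close to the 2.3 bits quoted above:

import math

# probs and codes come from the sketches above
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(p * len(codes[sym]) for sym, p in probs.items())
print(f"average information per symbol: {entropy:.2f} bits")
print(f"average code length per symbol: {avg_len:.2f} bits")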

Notice that there are some problems with variable-length codes in that they are more difficult to store because you need some way of knowing how many bits belong to each code word. The most common way of overcoming this is to use a prefix code - a set of code words in which no code word forms the initial bits of another. This wastes some possible code words, but it still generally produces a good degree of data compression.
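Codes read off a tree automatically have this property, which is why the decoder above never needs to be told how long each code is. A quick check, reusing the codes dictionary from the sketch above:

def is_prefix_free(codes):
    # True if no code word forms the initial bits of any other code word
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(is_prefix_free(codes))   # True - no length markers are needed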


Summary

  • Information theory predicts the number of bits required to represent a message, which sometimes is a fractional quantity.

  • In particular, the average information contained in a message is generally a fractional number of bits.

  • No coding method can use fewer bits than the average information, but inefficient coding methods can use many more bits.

  • The average information in a message is at a maximum when all of the symbols are equally likely.

  • We can get close to the number of bits in the average information using the Shannon-Fano coding algorithm, which attempts to split the symbols into groups that are close to being equally likely.

  • Shannon-Fano coding is good, but in most cases it isn't optimal. To get an optimal code we need to use the Huffman coding algorithm, which builds the coding tree from the bottom up by repeatedly combining the least likely symbols. The shortest codes represent the most likely symbols.

  • Practical methods of representing messages in fewer bits generally don't use theoretically optimal codes, because speed of coding is important. The most common method of compressing data is to find repeating patterns and represent them as entries in a dictionary - see the sketch below.
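As a rough illustration of the dictionary idea, here is a minimal LZW-style sketch in Python - one of many pattern-dictionary schemes and not the specific method used by any particular compression format - that replaces repeating patterns with numeric dictionary entries:

def lzw_compress(text):
    # Replace repeated patterns with references to a growing dictionary
    dictionary = {chr(i): i for i in range(256)}   # start with all single characters
    next_code = 256
    current = ""
    output = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:                # keep extending a known pattern
            current = candidate
        else:
            output.append(dictionary[current])     # emit the longest known pattern
            dictionary[candidate] = next_code      # remember the new, longer pattern
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_compress("ABABABAB"))   # eight characters collapse to five dictionary codes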

Related Articles

Information Theory

How Error Correcting Codes Work

Claude Shannon

Introduction to Cryptography with Open-Source Software 

A Programmer's Guide To Theory

Now available as a paperback and ebook from Amazon.


Contents

  1. What Is Computer Science?
    Part I What Is Computable?
  2. What Is Computation?
  3. The Halting Problem
  4. Finite State Machines
    Extract 1: Finite State Machines
  5. Practical Grammar
  6. Numbers, Infinity and Computation
    Extract 1: Numbers 
    Extract 2: Aleph Zero The First Transfinite
    Extract 3: In Search Of Aleph-One
    Extract 4: Transcendental Numbers
  7. Kolmogorov Complexity and Randomness
    Extract 1: Kolmogorov Complexity
  8. The Algorithm of Choice
  9. Gödel’s Incompleteness Theorem
  10. Lambda Calculus
    Part II Bits, Codes and Logic
  11. Information Theory 
  12. Splitting the Bit 
  13. Error Correction 
  14. Boolean Logic
    Part III Computational Complexity
  15. How Hard Can It Be?
    Extract 1: Where Do The Big Os Come From
  16. Recursion
    Extract 1: What Is Recursion
    Extract 2: Why Recursion
  17. NP Versus P Algorithms
    Extract 1: NP & Co-NP
    Extract 2: NP Complete

