Programmer's Guide To Theory - Splitting the Bit
Written by Mike James   
Monday, 22 July 2024

The two symbols that are least likely now are D and E with a combined probability of 0.55. This also completes the coding because there are now only two groups of symbols and we might as well combine these to produce the finished tree. 

[Figure 4: The final step]

This coding tree gives the most efficient representation possible of the five letters. To find the code for a symbol you simply move down the tree, reading off the zeros and ones as you go, until you arrive at the symbol.

To decode a set of bits that has just arrived you start at the top of the tree and take each branch in turn according to whether the next bit is a 0 or a 1, until you arrive at a symbol; then you start again at the top of the tree for the next symbol. Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.

The theoretical average information in a symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier. If you code B you will find that it corresponds to 111, i.e. three bits, reached by moving down the far right-hand branch of the tree. If you code D you will find it corresponds to 00, i.e. the far left-hand branch of the tree. In fact, each of the letters is coded as either a two- or a three-bit code and, guess what, if the symbols occur with their specified probabilities the average length of code used is 2.3 bits.
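
To make the tree walking concrete, here is a minimal Python sketch of encoding and decoding with a code table. The code words for D (00) and B (111) are the ones given above; the code words for A, C and E are assumptions, chosen only to be consistent with a tree of this shape.

# A hypothetical prefix code consistent with the tree described:
# D=00 and B=111 are from the article, the codes for A, C and E are assumed.
CODES = {"D": "00", "E": "01", "A": "10", "C": "110", "B": "111"}

def encode(message):
    """Encode each symbol by walking down the tree - here a table lookup."""
    return "".join(CODES[symbol] for symbol in message)

def decode(bits):
    """Start at the top of the tree and branch on each bit until a symbol
    is reached, then start again at the top for the next symbol."""
    reverse = {code: symbol for symbol, code in CODES.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:          # reached a leaf of the tree
            symbols.append(reverse[current])
            current = ""
    return "".join(symbols)

bits = encode("BADE")
print(bits)            # 111100001
print(decode(bits))    # BADE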

So we have indeed split the bit! The code we are using averages 2.3 bits to send a symbol.
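
The averaging itself is simple arithmetic. In the sketch below the probabilities are purely illustrative - the values actually used in the example were given earlier in the article - but they show how the average information and the average code length are worked out.

import math

# Hypothetical probabilities for the five symbols - not the article's
# actual values, just an illustration of the calculation.
probs = {"A": 0.15, "B": 0.15, "C": 0.15, "D": 0.25, "E": 0.30}
codes = {"D": "00", "E": "01", "A": "10", "C": "110", "B": "111"}

# Average information (entropy): sum over symbols of -p * log2(p).
entropy = -sum(p * math.log2(p) for p in probs.values())

# Average code length: each code word's length weighted by its probability.
avg_length = sum(probs[s] * len(codes[s]) for s in probs)

print(f"average information = {entropy:.2f} bits per symbol")     # about 2.25
print(f"average code length = {avg_length:.2f} bits per symbol")  # 2.30

With these made-up probabilities the average code length comes out at 2.3 bits per symbol and the average information at just under that; the exact figures depend on the probabilities used for the example.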

Notice that there are some problems with variable-length codes in that they are more difficult to store because you need some way of knowing how many bits belong to each symbol. The most common way of overcoming this is to use code words with the prefix property - no code word is the initial part of any other - so the end of each symbol's code can be recognized as soon as it arrives. This wastes some possible code words, but it still generally produces a good degree of data compression.
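
A quick check of this prefix property for the hypothetical code table used above - it is what lets the decoder find the symbol boundaries without any extra length information:

# Check the prefix property: no code word may be the start of another,
# otherwise the decoder could not tell where one symbol ends.
codes = {"D": "00", "E": "01", "A": "10", "C": "110", "B": "111"}

def is_prefix_free(code_words):
    words = list(code_words)
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(is_prefix_free(codes.values()))   # True - decodable without length markers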


Summary

  • Information theory predicts the number of bits required to represent a message, and this can be a fractional quantity.

  • In particular, if we consider the average information contained in a message, a fractional number of bits may be needed to represent it.

  • No coding method can use fewer bits than the average information, but inefficient coding methods can use many more bits.

  • The average information in a message is at a maximum when all of the symbols are equally likely.

  • We can get close to the number of bits given by the average information using the Shannon-Fano coding algorithm, which attempts to split the symbols into groups that are as close to equally likely as possible.

  • Shannon-Fano coding is good, but in most cases it isn't optimal. To get an optimal code we need the Huffman coding algorithm, which uses variable-length codes - the shortest codes represent the most likely symbols. A sketch of the algorithm follows this list.

  • Codes that represent messages in a smaller number of bits are generally not found by theoretically optimal methods, because the speed of coding matters. The most common method of compressing data is to find repeating patterns and represent them as entries in a dictionary.
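
The sketch promised above: a rough Python implementation of the Huffman algorithm, which repeatedly merges the two least likely groups using a priority queue. The probabilities at the bottom are hypothetical and used only to show the shape of the output.

import heapq

def huffman_codes(probabilities):
    """Build a prefix code by repeatedly combining the two least likely
    groups of symbols, in the spirit of the tree construction above."""
    # Each heap entry is (probability, tie_breaker, {symbol: code_so_far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # least likely group
        p2, _, group2 = heapq.heappop(heap)   # next least likely group
        # One group's codes gain a leading 0, the other's a leading 1.
        merged = {s: "0" + c for s, c in group1.items()}
        merged.update({s: "1" + c for s, c in group2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical probabilities - not the article's actual values.
print(huffman_codes({"A": 0.15, "B": 0.15, "C": 0.15, "D": 0.25, "E": 0.30}))

The most likely symbols end up nearest the top of the tree and so get the shortest codes, which is why the average code length can get close to the average information.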

Related Articles

Information Theory

How Error Correcting Codes Work

Claude Shannon

Introduction to Cryptography with Open-Source Software 

