Compression - the Starr Guide

Written by Darren Starr

Wednesday, 02 November 2011

Article Index
Compression - the Starr Guide
Dictionary-based compression

Page 1 of 2

Compression is an exciting topic - and here's an exciting explanation that shows why, via encryptions, Morse code, entropy and saying "umm" too often.

Dan Brown proved to the world that people love to feel as if they’re intelligent by understanding nerdy topics like encryption and ciphers. Probably the best example of this was when he presented DaVinci’s eccentric method of writing which was to write everything using a complex script in reverse. There were a lot of people that didn't know about this cipher beforehand, though once explained became perfectly clear and at once everyone understood at least one of the most primitive forms of encryption.

He did this with many of the classical forms of encryption. Newspapers all over the world publish “Jumble” puzzles which for the most part can be considered encryption for the masses. All this attention goes to encryption, but people don’t get all that excited over compression even though compression is far more entertaining to the nerd at heart.

Compression as encryption

One of my greatest joys recently as a programmer was to reverse engineer a file format from the mid-1970’s which was designed for a custom television broadcast subtitle insertion system.

The file format was very primitive and there was no documentation relating to it anywhere. Text was stored in a format that supported multiple languages and alphabets and wherever possible, compression was employed since it was necessary to store several hours’ worth of subtitles with typesetting information on a floppy disc which didn't even have a file system. The disc itself was a single file!

Well, this was a grand challenge that took me 2 months of hacking on and off to accomplish, but eventually, I became the first person in nearly ten years to be able to read this file format. The compression proved to be an encryption system and the key was to understand how programmers with limited resources thought in 1978. Oh… did I forget to add that the characters in the files I had were in a language I didn't speak?

Squeeze Me Baby!

Compression is an exciting topic since nearly ALL forms of even the most lossless compression systems can be understood by nearly anyone. In fact, the logic and math behind most compression systems isn't any more complex than some of the word problem assignments my 8 year old daughter has to complete for her homework assignments.

While I’m sure most of the regular readers here already understand at least a little about compression, I’ll attempt to make this article educational for everyone except those who deem themselves compression “experts” already.

Compression comes in all forms, shapes and sizes. By my own personal definition and since I’m writing the article, that’s all that really matters <insert evil laughing sound of your choice here>,

data compression is storing any amount of data in less space than it took originally.

Simply removing the last carriage return from a document so it no longer ends on a new line IS data compression as it saved space no matter how little it might have been.

In fact, ancient languages such as Hebrew in its Aramaic form (using the characters it they write today in Israel and in the Old Testament) made use of a great form of compression. There was no white space. In fact, the space character is a very modern addition to the language. Instead, the Hebrew/Aramaic alphabet defines special characters to end words with. In addition, “th gnrll dnt wrt wth vwls”. You have to know the language well enough to decipher the vowel sounds of each word to decompress it.

So, between the lack of whitespace and the lack of vowels, Hebrew IS a very well compressed language. If everything were written in Hebrew, we’d save a ton of paper. Of course, in a world with 7 billion people and a total of around 4 million that are able to read the language, we’d save a lot more trees as there’s no point buying books you can’t read.

Other simple forms of compression include simply removing double letters from sentences. They stand out to the practiced eye as speling errors, but they actualy don’t decrease the legibility of the words they present.

Sadly, when I read FaceBook, I realize that there are many people I went to schol with who write similarly to this even with the asistance speling checkers.

Combined with known acronyms like “LOL”, “ROFLMAO”, “WTF”, “AFAIK” and many others, data compression is used either intentionally or accidentally by a large portion of the world’s population.

The removal of double characters from text is in fact a perfect example of data compression. In fact, it’s an example of one of the most important theories of data compression.

“The Identification and Representation or Elimination of Redundancies”

That’s right.

Compression is almost all about finding patterns and deciding either coming up with a way of abbreviating them using a known dictionary. For example, when we see “LOL” we know precisely what it means. However if someone doesn’t know what it means, they can look it up or, if two way communication is possible, then they can ask whoever said it in the first place.

There are ways of inferring the meanings of these compression keys as well. For example, when we read “AFAIR”, the fact that it’s all upper case at that I know the writer tends to use all upper case for these acronyms AND I’m already familiar with “AFAIK”, I can identify that the R must mean something which is similar to “know” but different enough that it justified the use of a new letter. So, even if the reader misinterpret the intended meaning which in this case would be “recall”, the context of the sentence should allow the reader to either “get the gist” or correct the meaning.

In the case of images like JPEG and JPEG2000 or movies like MPEG-(pick a number any number) the AWESOME compression levels these offer are provided by leaving out much of the actual data expecting the viewer’s minds to compensate or simply ignore the omissions.

There of course is a point where leaving out too much information makes the content of the images utterly indecipherable, so although this still counts as compression, it’s a type of compression that’s unrecoverable. Like scrambling an egg to make it use less space, it can’t be undone and if you never saw a scrambled egg before, then there’s no possible way you could figure out that it was in fact once an egg. But at least it uses less space.

Run-length encoding

The simplest form of compression in my opinion is called run-length encoding.

Next time you’re listening to someone speaking who inserts the sound “umm” between each thought, if you catch them repeating “umm” multiple times between two actual words, you may suggest using the data compression algorithm called “Run-Length Encoding” which in their case can be highly efficient. To employ run-length encoding in this scenario, tell the speaker to decide beforehand how many times they will repeat the sound “umm” consecutively and instead to say for example “Five umms”.

For it to work in data compression though, we often need to have some means of knowing there is a “Run” coming, so using a keyword to prefix the representation of the run is a great idea. I personally prefer in this case the word “Beep!”. So from now on, instead of the speaker saying “umm umm umm umm umm”, they would say “Beep! Five Umms”.

The compressed form of the speakers repetitious “umm”s has compressed five words and syllables down to only three words and syllables. We can even abbreviate further to “Beep! Five” as the pattern “Beep!” followed by a number implies we’re discussing the number of “Umms” to be represented. This is a great achievement. Unfortunately, it does not work well for two “umms.”

Prev - Next >>

Last Updated ( Wednesday, 02 November 2011 )