Data compression the dictionary way

 

If you already know something about coding theory you might have heard of Huffman coding and statistical modeling of data etc. You might also think that the data compression methods that we actually use, e.g. when zipping a file, are sophisticated variation on these methods. While it is true that these traditional methods are capable of excellent compression performance they are very slow to implement and not really suitable for real time use.

The breakthrough came in 1977 with a theoretical paper by Jacob Ziv and Abraham Lempel. This paper started the whole subject of dictionary compression but, although it was a very theoretical sounding paper, it is amazing that no-one thought of the technique before. It is even possible that someone might have implemented the idea in a practical form before the great breakthrough - it is so simple.

So what is dictionary coding? Consider the text of this article as a subject for compression. Instead of giving you each word spelled out letter by letter I could give you the page number and line number of the word in a standard dictionary. You can easily imagine, and a back of the envelope calculation will quickly convince you, that you reduce the amount of data needed to store the text using dictionary references. This is the basis of dictionary compression.

The problem is what dictionary should you use. If you use a standard dictionary you can refine the contents of the dictionary to reduce the storage needed to a minimum but there is a hidden cost. To make sense of the data you need the dictionary and the storage needed for it has to be taken into account. You may be able to store this article in only a few KBytes using dictionary coding but you would be less impressed if I then told you that you needed several Megabytes to store the dictionary needed to decode it.

The Sliding Window

Clearly dictionary coding is tantalising but how to construct the dictionary without incurring large storage overheads is the real problem. This is where "sliding window" dictionary compression comes in. Why not use the document as its own dictionary? Childishly simple isn't it? While you are worry what on earth I can be on about it's time to be precise.

Start off with a buffer that acts as a window onto the file being compresses. Divide the buffer into two portions;

window1

The chunk to the left is used as the dictionary and the chunk to the right is used as a look ahead buffer that holds the section of the file that we are trying to compress.

What happens is that you try to find a match for the string in the look-ahead buffer in the dictionary section. Once you find a match the string in the look-ahead is coded by generating a three-element token:

 match position,length,next character

So for example, the token 32,14,d means that the phrase in the look-ahead buffer matches the dictionary at position 32 and the match continues for 14 characters. The first character in the look-ahead after the match fails is "d".

You should be able to see that this token can be decoded back to the phrase that it represents simply by getting the 14 characters starting at position 32 from the dictionary and adding a letter "d" onto the end.

Thus we have succeeded in reducing a phrase of 15 characters to a three element token and this must take less space to store. Notice that as the dictionary is just the first part of the file read into the window there is no need to store a separate dictionary - the file is its own dictionary!

Before you start to criticise the idea of using the start of a file as a dictionary for the rest notice that there is more to the method.

Looking up the look-ahead buffer in the dictionary is simple enough but there is one subtlety. You don't want to code the phrase in the look-ahead buffer as the first match in the dictionary but as the longest match. Why? Well it doesn't matter what sort of match you get, the token still consists of three elements so obviously you do better to code as many characters per token as possible.

This makes the first part of the algorithm:

  •  1) Find the longest match between the phrase in the look-ahead buffer and the dictionary and code this as a three-element token P,N,C where P is the position of the match, N is the length and C is the character following the matching phrase.

Now what do you do? You have successfully coded N+1 and so finished with N+1 characters. This is where the "Sliding" part of the sliding windows algorithm comes in. Having coded N+1 characters why not just slide the window N+1 characters to the right to read in N+1 more characters:

window2

The N+1 characters that were in the look-ahead are now in the dictionary. The look-ahead buffer now contains nothing but characters that are yet to be coded so we can repeat the step of finding the best match and issuing a token.

But wait, what about the N+1 characters that fell off the left-hand end of the dictionary buffer when the window moved? The answer is that if the window slides over the entire file so that data enters at the right and is coded as it moves from the look-ahead to the dictionary then the data that is lost out of the left-hand end of the window will always have been coded at some time in the past.

So the second step in the method is:

  •  2) Shift the window N+1 character to the right and repeat step 1

That's all there is too it. You don't need to store a separate dictionary because the file in both its compressed and uncompressed form is the dictionary.

And this is the method that all of the super quick and powerful data compression methods use - PKZIP, Bitlocker  and even the QIC-122 tape compression standard.

Of course they all throw in their own added extras to make the process more efficient but sliding window compression is the foundation of them all. This form of sliding window compression is generally called LZ77 after its inventors. It is remarkably good at file compression because the dictionary changes to suit the section of the file currently being processed. The idea is that most files follow a principle of local similarity. In other words, if a symbol or phrase has just occurred it is very likely to occur again. Of course the efficiency of the whole scheme depends on the size of the window.

There are of course plenty of practical details that I have glossed over in the description but these are the sort of thing that a programmer can easily notice and solve.

The main question that must be worrying you concerns how the whole coding process gets going. When the window first slides over a data file there is nothing in the dictionary and data slides in until it fills the look-ahead buffer. At this point the contents of the look-ahead have to be coded before they can pass on to the dictionary - but the dictionary is empty! The simple solution is to use a token of the form 0,0,C to code any characters that don't match anything in the dictionary. You should be able to see that the first portion of any file will be codes as a stream of 0,0,C tokens which become less as the dictionary fills.

Decoding also starts out with an empty dictionary. The first token must be a 0,0,C type and this is used to place a character in the look-ahead buffer, which would now be better named an expansion buffer as each token is read from the file and expanded into the the look-ahead.

Notice that the expansion is done so the the phrase is inserted at the boundary between the dictionary and the look-ahead. The expansion works by taking the N characters starting at P from the dictionary and then adding the character in the token to the end. Thus after each token is expanded the look-ahead contains N+1 new characters. These are then shifted into the dictionary ready to help with the decoding of the next token.

Even though the principle is simple, tuning for good efficiency is a difficult job. Don't be tempted to implement your own sliding window compression algorithm - there are too many good libraries just waiting to be used.

<ASIN:012620862X>

<ASIN:1846286026>

<ASIN:1848000715>

<ASIN:1558514341>

<ASIN:1584883138>

<ASIN:0127887741>

 
 

   
RSS feed of all content
I Programmer - full contents
Copyright © 2013 i-programmer.info. All Rights Reserved.
Joomla! is Free Software released under the GNU/GPL License.