Zip for the Genome

Zip for the Genome - G-SQueeZ

Written by Mike James

Monday, 19 July 2010

The G-SQueeZ algorithm is a new way of compressing genome data that could save a lot of money - but does it deserve to be patented?

The Zip compression algorithm is so ubiquitous that even dumb users understand the verb "to zip". It is as much as any algorithm inventor can hope for. Now we have the G-SQueeZ compression algorithm aimed at making genomic squencing data take up a lot fewer bytes.

Sqeez

It is claimed that the new algorithm can achieve compression rates of 80%, which is not at all surprising given the high degree of regularity and hence redundancy in the data. What makes the compression method so suitable for the data is that it preserves the order and it allows access to portions of the data without decompressing the entire file.

The new algorithm is a direct application of Huffman encoding - a fundamental algorithm from information theory. A standard Huffman encoding is computed from each base plus additional data - the higher the frequency of occurrence of the base+data the shorter the code used to represent it. This is pure Huffman coding procedure.

The data to be encoded is first scanned to create a frequency table and then a Huffman coding tree. The data is then scanned a second time and the data encoded using the Huffman code. The resulting code is then written out, complete with a header giving the Huffman code dictionary. To most programmers this would seem to be a common sense application of an existing algorithm.

If you are looking for the novel idea that makes this algorithm different you probably aren't going to find it. This is a worthwhile implementation of the standard Huffman algorithm - but it isn't a new idea and it isn't a new algorithm.

If you want to use G-SQueeZ for Academic/non-commercial use then fine, you can even have the source code, but if you want to investigate or use it from a within a company that makes money, no matter how small, then you have to pay for the privilege. Which is a bit strange considering the company concerned is TGen which claims to be a Non-profit Biomedical Research Institute.

The creators are also applying for a patent on the algorithm. If they get the patent then presumably it will cover any compression method that is an implementation of the Huffman code - which is arguably all such methods.

Sqeez

What is worse is that if you want to read the details of the algorithm you will have to pay, typically $25 for a day's access to the journal in question - bioinformatics.

Link to the paper on the Bioinformatics Journal

Of course if you have an academic affiliation the chances are that you will be allowed to access the paper for "free" - which means the institution pays for you. Programmers often don't have an academic affiliation and the combination of patent and academic restriction is oppressive in every sense.

The final though has to be - what if Huffman had patented his much more groundbreaking and fundamental algorithm?

Further reading