Zip for the Genome - G-SQueeZ
Written by Mike James   
Monday, 19 July 2010

The G-SQueeZ algorithm is a new way of compressing genome data that could save a lot of money - but does it deserve to be patented?

Banner

 

The Zip compression algorithm is so ubiquitous that even dumb users understand the verb "to zip". It is as much as any algorithm inventor can hope for. Now we have the G-SQueeZ compression algorithm aimed at making genomic squencing data take up a lot fewer bytes.

Sqeez

It is claimed that the new algorithm can achieve compression rates of 80%, which is not at all surprising given the high degree of regularity and hence redundancy in the data. What makes the compression method so suitable for the data is that it preserves the order and it allows access to portions of the data without decompressing the entire file.

The new algorithm is a direct application of  Huffman encoding - a fundamental algorithm from information theory. A standard Huffman encoding is computed from each base plus additional data - the higher the frequency of occurrence of the base+data the shorter the code used to represent it. This is pure Huffman coding procedure.

The data to be encoded is first scanned to create a frequency table and then a Huffman coding tree. The data is then scanned a second time and the data encoded using the Huffman code. The resulting code is then written out, complete with a header giving the Huffman code dictionary. To most programmers this would seem to be a common sense application of an existing algorithm.

If you are looking for the novel idea that makes this algorithm different you probably aren't going to find it. This is a worthwhile  implementation of the standard Huffman algorithm - but it isn't a new idea and it isn't a new algorithm.

If you want to use G-SQueeZ for Academic/non-commercial use then fine, you can even have the source code, but if you want to investigate or use it from a within a company that  makes money, no matter how small, then you have to pay for the privilege. Which is a bit strange considering the company concerned is TGen which claims to be a Non-profit Biomedical Research Institute.

The creators are also applying for a patent on the algorithm. If they get the patent then presumably it will cover any compression method that is an implementation of the Huffman code - which is arguably all such methods.

Sqeez

What is worse is that if you want to read the details of the algorithm you will have to pay, typically $25 for a day's access to the journal in question - bioinformatics.

Link to the paper on the Bioinformatics Journal

Of course if you have an academic affiliation the chances are that you will be allowed to access the paper for "free" - which means the institution pays for you. Programmers often don't have an academic affiliation and the combination of patent and academic restriction is oppressive in every sense.

The final though has to be - what if Huffman had patented his much more groundbreaking and fundamental algorithm?

Further reading

Coding theory

Data compression the dictionary way

Claude Shannon

Information theory

 

Banner


Raspberry Pi CM5 - Expensive And Undocumented
27/11/2024

So the unexpected has happened - the Compute Module 5 has been launched. But it simply emphasises some problems with adopting the Pi as an IoT device.



Discover PostgreSQL How-Tos
16/12/2024

A veritable treasure trove of assorted how-to recipes for PostgreSQL, stored as a Github repository, has been started by Nikolay Samokhvalov, well known in the PostgreSQL world.


More News

<ASIN:0470089857>

<ASIN:0262101068>

<ASIN:059615450X>

<ASIN:354072799X>

<ASIN:0199230234>

<ASIN:1420063677>

<ASIN:0763751863>

Last Updated ( Monday, 19 July 2010 )