Use Rust To Reduce The Size Of Your SQLite Database
Written by Nikos Vaggalis   
Monday, 12 September 2022

Meet sqlite-zstd, a Rust library that compresses your database many fold, leading to great savings in size while conserving its search capabilities intact.

As pointed out in "In Praise Of SQLite", SQLite is not a toy database:

Despite its compact size and absence of the client server model, SQLite is a RDBMS with all the features that make something relational - that is tables, indexes, constraints, triggers, transactions and the like. However, there are few built-in functions compared to PostgreSQL or Oracle. 

SQLite doesn’t have any compression features. This has changed with the Rust-based library, sqlite-zstd which promises to:

provide transparent dictionary-based row-level compression that allows compressing a sqlite database's entries almost as well as if you were compressing the whole DB file, while at the same time retaining random access.

Suffice it to say that we are all aware of the benefits of compressing data, be it that of a PDF document, a humble ZIP file or in this case a database. As a proof of concept and walkthrough of the tool, I'll work with a sample database used by the Joplin note taking app. Well work with Windows because it offers a more straightforward experience.

The sample database.sqlite is 2.6 GB in size.Some would say 'what kind of notes do you have that occupy that much space?'. It is true that that number sounds extravagant. It is like that because Joplin allows you to scrape any web page you encounter and store it as markdown inside the database. Since I'm a person with a wide array of interests, when I find something interesting I keep a copy of it. As such the number of pages I dump into Joplin accumulates and the size of the database increases. Therefore sqlite-zstd was a godsend. Of course your use case might be different; the library's main use case example online showcases compressing a database with 7 million book entries in JSON format, and does so by reducing its size from 2.2GB to 550KB!

But in order to get to that, the first hurdle was to find the 64-bit version of sqlite's CLI, sqlite3, since the official build is offered only in 32-bit, and since zstd is a 64-bit library it needed the corresponding version. While you can build it manually, why go to that trouble when someone has already done it for you? Hop over to the SQLite shell builder Github repo and download the latest 64-bit release for Windows (Ubuntu, MacOS as well).

Having got hold of the CLI it's time to execute it to load both the library and sample database. After that we will enable the transparent row-level compression of the 'body' column, the column that keeps the bulk of the text of the 'notes' table. Saying that,you can invoke the transparent row-level compression function several times on the same table with different columns to compress.

You call it like this:

SELECT
zstd_enable_transparent('{"table": "objects", "column": "data1", "compression_level": 19, "dict_chooser": "''a''"}'),


and as a result of that calling, the table will be renamed to '_table_name_zstd', while 'table_name' will become a view that can be queried as normally, including SELECT, INSERT, UPDATE, and DELETE queries. Keep in mind that this function will not compress any data by itself, you need to call zstd_incremental_maintenance afterwards.

The following differences apply when compression is active: 

  • The compressed column may only contain blob or text data, depending on the affinity of the declared data type (e.g. VARCHAR(10) is fine, but int is not).
  • The primary key must not be null for any row, otherwise updating may not work as expected
    sqlite3_changes() will return 0 for modifying queries (see here).
  • The SQLite streaming blob reading API will be somewhat useless since the blob is fully copied into memory anyways.
  • Attaching a database containing compressed tables using ATTACH 'foo.db' is not supported.
  • DDL statements (like ALTER TABLE and CREATE INDEX) are only partially supported 

Also, 'dict_chooser' is an SQL expression that decides how to partition the data. Example partitioning keys:

I went with the simplest case of 'a' which means that all the rows are compressed with the same dictionary.

And now in code:

$ sqlite3

sqlite>
#opening database
.open database.sqlite

#loading extension
.load sqlite_zstd

#enabling enabling transparent row-level compression
select zstd_enable_transparent('{"table": "notes", "column": "body", "compression_level": 19, "dict_chooser": "''a''"}');

Unfortunately I got an error back :

Runtime error: Could not create insert trigger

Caused by:
0: near "order": syntax error
1: Error code


Enabling the debug log of the library reveals what happens behind the covers (many columns removed for brevity):

[2022-08-22T16:57:46Z INFO sqlite_zstd::create_extension] [sqlite-zstd] initialized
sqlite> select zstd_enable_transparent('{"table": "notes", "column": "body", "compression_level": 19, "dict_chooser": "''a''"}');

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] cols=[ColumnInfo { name: "id", coltype: "TEXT", is_primary_key: true, to_compress: false, is_dict_id: false }, ColumnInfo { name: "parent_id", coltype: "TEXT", is_primary_key: false, to_compress: false, is_dict_id: false }, ColumnInfo { name: "title", coltype: "TEXT", is_primary_key: false, to_compress: false, is_dict_id: false }, ColumnInfo { name: "*body", coltype: "TEXT", is_primary_key: false, to_compress: true, is_dict_id: false }, ColumnInfo { name: "order", }]

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] [run] alter table `notes` rename to `_notes_zstd`

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] [run]
create view `notes` as
select `id`, `parent_id`, `title`, zstd_decompress_col(`body`, 1, `_body_dict`, true) as `body`, `created_time`, `updated_time`, `is_conflict`, `latitude`, `longitude`, `order`, `master_key_id`
from `_notes_zstd`

[2022-08-22T16:57:59Z DEBUG sqlite_zstd::transparent] [run]
create trigger `notes_insert_trigger`
instead of insert on `notes`
for each row
begin
insert into `_notes_zstd`(id, parent_id, title, body, _body_dict, created_time, order, user_created_time, user_updated_time) select new.`id`,
new.`parent_id`,
new.`title`,
new.`body` as `body`, null as `_body_dict`,
new.`created_time`,`,
new.`application_data`,
new.`order`,
new.`user_created_time`,
end;

After some digging I understood that the unquoted order entry was being considered as a reserved SQL word and thus the error. Changing order to 'order' made the SQL Create Trigger statement pass.This, of course, was an edge case due to the special column name that the Joplin app is using and if it were another app I might not even have noticed. But in any case, the quoting mechanism should be fixed library-side, therefore I opened an issue on its GitHub repo.

Until it is fixed and a new version is released, what can you do? I renamed the column just to make it pass:

alter table notes rename 'order' to 'test'

Then run sqlite_zstd::transparent again, which now runs to completion.

As already said, sqlite_zstd::transparent just enables the compression. The real work is done by:

select zstd_incremental_maintenance(null, 1);

followed by:

vacuum;

in order to reclaim the free space.

After the couple of minutes that it takes for the operations to complete, let's observe the gains. From 2,663,996 KB original size to 1,912,900 KB. And that is without tweaking the settings and using the default dictionary chooser. Not bad!

Let's see if the search works too :

select * from notes where body like 'python%' limit 1 ;

It works like a charm!

Compression enables you to save storage space by reducing the size of the database, but it has a performance impact because the data must be compressed and decompressed when being accessed. As such the library must be loaded whenever the database is accessed. For example, let's see how Joplin behaves when trying to load the compressed database (assuming the renaming of the order column and the that 'notes' has became a view instead of a table) - it has no impact on its functionality.



Yes there's no such function. Still I can run SQL queries against it through the CLI.

Bonus material

Instead of fiddling with the sqlite3 CLI you can do the same easily and GUI-based by using Db browser for SQLite. Just load the database and the extension through the GUI.Then run your SQL on it.

The big conclusion here is that any application that can side-load this library can reduce the size of its database by 50 to 95% while not affecting its basic functionality. Sure the performance impact is there, but considering most operations still run at over 50k per seconds you’ll probably have other bottlenecks.

There’s other optimizations to be done, but the same method should work for other databases, with barely any modifications required for say PostgreSQL.

Is this the humble start of a greater impact on the state of database technology ?

 Rust Use Continues To Increase

More Information

sqlite-zstd

Related Articles

In Praise Of SQLite

Take The Beginner's Series to Rust

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Be Counted In the Python Developer Survey
09/10/2024

Conducted annually by the Python Software Foundation in conjunction with JetBrains, this survey is the major source of knowledge about the current state of the Python community. The eighth iterat [ ... ]



The Data Engineering Vault
11/10/2024

A curated network of knowledge designed to facilitate exploration, discovery, and deep learning in the field of data engineering.


More News

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Monday, 12 September 2022 )