DuckDB And Hydra Partner To Get DuckDB Into PostgreSQL
Written by Nikos Vaggalis   
Monday, 11 November 2024

The offspring of that partnership is pg_duckdb, an extension that embeds the DuckDB engine into the PostgreSQL database, allowing it to handle analytical workloads.

In this modern era of data oriented workloads, software engineers usually juggle many disparate sources of data by using many different developer tools. What if you could have one tool to rule them all? That tool exists and is called Postgres; it really can handle anything you throw at it.

The news is that in yet another twist in the PostgreSQL-for-everything narrative, PostgreSQL now also becomes OLAP+OLTP = HTAP ready, thanks to MotherDuck's partnership with Hydra.

Saying that, HTAP readiness is nothing new for PostgreSQL as we've examined before in "Hydra Turns PostgreSQL Into A Column Store", an open-source extension that adds columnar tables to Postgres for efficient analytical reporting, allowing it to become HTAP ready by being able to support both transactional and analytical workloads to provide real-time data access and analysis capabilities in a single system.

While since Hydra's inception PostgreSQL could handle analytical workloads, however outsourcing the analytical engine to DuckDB offers optimized performance plus the ability to read and write files on external object stores like AWS and S3, therefore becoming ideal for querying data against Data Lakes.

At this point, it's important to note that PostgreSQL's integration with DuckDB is not something new. There's also another extension pg_lakehouse (now pg_analytics) by ParadeDB, which we examined in Pg_lakehouse Makes PostgreSQL Quack. In there we found that :

Queries are pushed down to DuckDB for processing data like events, metrics, historical snapshots, vendor data, but that's one part of the equation, the query engine. The other is being able to fetch those foreign object stores like S3 and table formats like Iceberg or Delta Lake.

So what is the difference between pg_analytics and pg_duckdb, you say? pg_analytics uses the foreign data wrapper (FDW) API to connect to those disparate data sources, while with pg_duckdb the query is executed by DuckDB directly. Other than that, pg_duckdb is the officially endorsed PostgreSQL extension for DuckDB, which does imply better support and features implementation?

And it's not just pg_analytics, there's pg_mooncake as well! So there's actually three extensions interfacing with DuckDB!
And now the problem begins to show its ugly head. All those three extension do embed their own build of libduckdb (DuckDB as a library, just like Sqlite. DuckDB FDWs depend on it) which becomes an issue as this raised Githb issue highlights in that :

Embedding libduckdb requires a substantial amount of compilation time and results in a dramatic increase in the size of the package. This problem is exacerbated when considering combination for 3 PostgreSQL major version and 5 OS distribution, plus it is not feasible to install these extensions simultaneously.

The fix goes to suggest the adoption of a shared libduckdb library which could significantly reduce compilation times and package sizes and also eliminate extension conflicts.

But enough talk already, just show me what pg_duckdb can do!
Sure. For instance, the following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.

Then you can read from an Iceberg table :

Of course you can also write back to your data lake and so on.

Installation wise, there's Docker images available on Dockerhub which are based on the official Postgres image:

docker run -d -e POSTGRES_PASSWORD=duckdb pgduckdb/pgduckdb:16-main

Pre-built binaries that are available via pgxman:

pgxman install pg_duckdb

Or, if you feel brave enough, compile from source by cloning the Github repo.

To conclude. We have documented the whys of going all the way Postgres in "It's 2024. Why Does PostgreSQL Still Dominate?". Since then, there's been an avalanche of new developments that has firmly established Postgres at the number one spot in the hearts of the developers.

In return for Postgres taking such good care of them, the thriving PostgreSQL community gives back by thinking up all kinds of novel ideas that go to improve it. Of course, this isn't unexpected because PostgreSQL has been built with great extensibility in mind since its inception.

And this, ladies and gentlemen, is how you build a product that lasts through the ages...

 

More Information

pg_duckdb
pg_analytics
pg_mooncake

Related Articles

Pg_lakehouse Makes PostgreSQL Quack

Hydra Turns PostgreSQL Into A Column Store

It's 2024. Why Does PostgreSQL Still Dominate? 

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Random Gifts For Programmers
24/11/2024

Not really random. Not even pseudo random, more stuff that caught my attention and that I, for one, would like to be given. And, yes, if I'm not given them, I'd probably buy some for myself.



Microsoft Introduces Vector Abstractions Library For .NET
21/11/2024

Microsoft has announced a preview release of the Microsoft Extensions VectorData Abstractions library, which can be used to help integrate vector stores into .NET applications and libraries.


More News

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Wednesday, 13 November 2024 )