Pg_lakehouse Makes PostgreSQL Quack
Written by Nikos Vaggalis   
Monday, 08 July 2024

Pg_Lakehouse from ParadeDB is an extension that turns PostgreSQL into the analytical engine of DuckDB. Why is that useful? How do you use it?

Thanks to extensions, many of which we've already reported on, database developers can enjoy Postgres-for-everything, Postgres-as-anything:

  • As a distributed database? get Citus.
  • As SQLite? Get pgsqlite, the pure python module and command line tool that makes it simple to import SQLite databases into Postgres.
  • As a Columnstore? Get Hydra, the open-source extension that adds columnar tables to Postgres for efficient analytical reporting.
  • As a Vector engine? Get pg_vector, an extension for PostgreSQL that renders it a viable alternative to specialized vector stores used in LLMs.
  • As a Graph engine? Get Apache AGE
  • As Elasticsearch? Get pg_bmp25, an extension that bakes ElasticSearch FTS capabilities into PostgreSQL itself.
  • As MongoDB or SQLServer? Get MangoDB (now called FerretDb) and BabelFish, two proxies that move your application workloads from MongoDB and SQL Server respectively to PostgreSQL.

It's about time for PostgreSQL to assume the role of DuckDB and now it can, thanks to the pg_lakehouse extension by the makers of pg_bmp25, ParadeDB. DuckDB is, of course, the alternative to SQLite for analytical workloads; local first, embeddable and suitable for data science work.

With pg_lake PostgreSQL is powered up with those high performance analytical query engine capabilities too.
Queries are pushed down to DuckDB for processing data like events, metrics, historical snapshots, vendor data, but that's one part of the equation, the query engine. The other is being able to fetch those foreign object stores like S3 and table formats like Iceberg or Delta Lake.

pg_lakehouse uses the foreign data wrapper (FDW) API to connect to those data sources. While other FDWs like aws_s3 have existed in the Postgres extension ecosystem, these FDWs suffered from two limitations:

Lack of support for most object stores and table formats
Too slow over large datasets to be a viable analytical engine

pg_lakehouse differentiates itself by supporting a wide breadth of stores and formats and by being very fast (thanks to DuckDB).

As such it supports the following object stores and table formats:

Object Stores

  • Amazon S3
  • S3-compatible stores (MinIO, R2)
  • Azure Blob Storage
  • Azure Data Lake Storage Gen2
  • Google Cloud Storage
  • HTTP server
  • Local file system

Table Formats

  • Parquet
  • CSV
  • Apache Iceberg
  • Delta Lake
  • JSON (Coming Soon)

The connection is made upon creating the foreign table when you pass in the path of the object store to load. For instance:
path 's3://paradedb-benchmarks/yellow_tripdata_2024-01. parquet',

An example will make it comprehensible. The following example uses pg_lakehouse to query an example dataset of 3 million NYC taxi trips from January 2024, hosted in a public us-east-1 S3 bucket provided by ParadeDB.

So there you have it, you can query your parquet files from PostgreSQL too!

pg_lakehouse is under active and heavy development, with a long roadmap ahead:

  • Write Support: pg_lakehouse is currently read-only from object stores. Adding write support will enable developers to further centralize data lake operations inside Postgres.
  • Support for Apache Iceberg tables.
  • Wider Object Store Coverage

paradedb sq

 

More Information

pg_lakehouse

Related Articles

The Enduring Influence Of Postgres

Hydra Turns PostgreSQL Into A Column Store

Turn PostgreSQL Into A Vector Store

pgsqlite - Load SQLite databases into PostgresSQL

AGE - The Open Source PostgreSQL Extension For Graph Database Functionality

Move Over To PostgreSQL With Babelfish and MangoDB

Citus For PostgreSQL Goes Fully Open Source

ElasticSearch Search Capabilities Baked Into PostgreSQL

Turn Your SQLite Database Into A Server

 

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Kafka 3.9 Adds Dynamic KRaft Quorums
16/12/2024

Kafka 3.9 has been released. The team says this is a major release and the final in the 3.x line. It This will also be the final major release to feature the deprecated Apache ZooKeeper mode. Kafka is [ ... ]



GitHub Announces Open Source Security Fund
03/12/2024

A new security-focused program, the GitHub Secure Open Source Fund, will invest $1.25M across 125 open source projects. The project is backed by the support of organizations including American Express [ ... ]


More News

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Monday, 08 July 2024 )