PeerDB Brings Real Time Streaming To PostgreSQL
Written by Nikos Vaggalis   
Thursday, 23 November 2023

PeerDB is an ETL/ELT tool built for PostgreSQL. It aims to make any task that involves streaming data from PostgreSQL to third party targets as effortless as possible.

But the basics first. Why the need to stream data from PostgreSQL, or any database for that matter?

For a start, it's what Change Data Capture (CDC), the most popular technique for streaming data, is used for. CDC is a way of capturing changes made in the database and forwarding them in real time to external systems (such as Kafka) through connectors such as the ones offered by Debezium, the open source distributed platform that turns your existing databases into event streams. There are many ways to implement CDC, such as row versioning, pub/sub, triggers and log monitoring, with the log-based approach being the most popular and automated. The use cases of CDC include real-time analytics, replication to data warehouses, queues and storage, or any other customized solution.
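To make the idea concrete, here is a minimal Python sketch of what the consuming side of CDC does: it replays a stream of change events, in order, against a copy of the table. The event shape (`op`, `id`, `row`) and the function names are hypothetical and chosen for illustration only; they are not the format used by Debezium or PeerDB.

```python
from typing import Any

# Hypothetical event shape: {"op": ..., "id": primary key, "row": changed columns}
ChangeEvent = dict[str, Any]
Replica = dict[int, dict[str, Any]]


def apply_change(replica: Replica, event: ChangeEvent) -> None:
    """Apply one insert/update/delete event to a replica keyed by primary key."""
    op, key = event["op"], event["id"]
    if op == "insert":
        replica[key] = dict(event["row"])
    elif op == "update":
        # Only the changed columns are carried by the event in this sketch.
        replica.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(events: list[ChangeEvent]) -> Replica:
    """Replay events in log order to rebuild the current state of the table."""
    replica: Replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


events = [
    {"op": "insert", "id": 1, "row": {"name": "alice", "balance": 10}},
    {"op": "insert", "id": 2, "row": {"name": "bob", "balance": 5}},
    {"op": "update", "id": 1, "row": {"balance": 25}},
    {"op": "delete", "id": 2},
]
users_replica = replay(events)
print(users_replica)
```

Order matters here: replaying the same events in a different sequence could leave the replica in a different state, which is one reason the database log, with its strict ordering, is such a convenient source for CDC.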

The most popular tool for enabling CDC is, of course, the open source Debezium. Compared to Debezium, PeerDB is significantly simpler to set up and manage. You just define your Peers, the source and the target, and let them exchange data like Linux does with pipes. Actually, the scheme could very well be described as a glorified pipe.

For instance, to mirror data from a Postgres instance to a Snowflake one you just have to run:

CREATE PEER postgres_peer
FROM postgres (. . . );

CREATE PEER snowflake_peer
FROM snowflake (. . . );

CREATE MIRROR real_time_cdc
FROM postgres_peer
TO snowflake_peer
WITH TABLE MAPPING (transactions:transactions, users:users);

The transactions and users tables are now replicated in real time from Postgres to Snowflake, so that when you insert, update or delete rows in the Postgres tables, the same operation is mirrored on the Snowflake ones too.

Besides sporting a developer-friendly API, as seen above, PeerDB also claims better performance than similar tools:

  • 2x to 16x faster large data loads
    When you are moving larger datasets (10s of GB to a few TB) from Postgres to any supported target, PeerDB can be 2x to 16x faster than other tools. This speeds up initial loads in WAL-based replication as well as query or watermark based replication.
  • Change Data Capture (CDC) with 5s to 60s lag on target
    PeerDB is designed for real-time streaming from Postgres. If your application is latency sensitive, you can configure refresh intervals as low as a few seconds.

Since PeerDB talks "Postgres", it also supports native Postgres features such as:

  • Advanced data types - PeerDB natively supports replicating advanced data types from Postgres, including ARRAYs, JSON/JSONB, HSTORE, ENUMs, Geospatial etc.
  • Partitioned Tables - PeerDB has comprehensive support for replicating partitioned tables.
  • Efficient replication of TOAST (large) columns.

Being Postgres-native also means that you can use the tools you are already familiar with on PeerDB as well:

  • Client tools like pgAdmin, psql to run SQL commands.
  • BI tools like Grafana, Tableau to visually monitor syncs and transforms.
  • Database migration and versioning tools like Flyway to manage your ETL.
  • Any language (Python, Go, Node.js etc) and scheduler (Airflow) for development.

PeerDB supports a number of different modes of streaming: log based (CDC), cursor based (timestamp or integer) and XMIN. At the time of writing it also supports a number of connectors, Snowflake among them, as the example above shows.
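The cursor based mode can be illustrated with a small, self-contained Python sketch. It is not PeerDB's implementation: it uses two in-memory SQLite databases as stand-ins for the source and the target, and the helper name `sync_new_rows` and the table layout are assumptions made for the example. The idea is simply to remember the highest integer key copied so far (the watermark) and to fetch only rows above it on the next run.

```python
import sqlite3


def sync_new_rows(source: sqlite3.Connection,
                  target: sqlite3.Connection,
                  last_seen_id: int) -> int:
    """Copy rows with id above the watermark and return the new watermark."""
    rows = source.execute(
        "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    target.executemany(
        "INSERT INTO transactions (id, amount) VALUES (?, ?)", rows
    )
    target.commit()
    # If nothing new arrived, the watermark stays where it was.
    return rows[-1][0] if rows else last_seen_id


# Two in-memory databases stand in for the source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute(
        "CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount INTEGER)"
    )

source.executemany("INSERT INTO transactions VALUES (?, ?)", [(1, 100), (2, 250)])
source.commit()
watermark = sync_new_rows(source, target, 0)          # first run copies ids 1 and 2

source.execute("INSERT INTO transactions VALUES (?, ?)", (3, 75))
source.commit()
watermark = sync_new_rows(source, target, watermark)  # second run copies only id 3

copied = target.execute(
    "SELECT id, amount FROM transactions ORDER BY id"
).fetchall()
print(watermark, copied)
```

An integer cursor like this only picks up appended rows. Updates and deletes on already copied rows are invisible to it, which is the gap that log based CDC fills.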

Of course, it is free and open source, and available as a Docker image. There's also a Cloud and Enterprise offering, fully managed and hosted on AWS, Azure and GCP, which requires a paid subscription.

To conclude, PostgreSQL never ceases to amaze. With PeerDB included, its ecosystem goes from strength to strength.

More Information

PeerDB official

PeerDB GitHub

Related Articles

pg_later - Native Asynchronous Queries Within Postgres 
