ElasticSearch Search Capabilities Baked Into PostgreSQL
Written by Nikos Vaggalis   
Tuesday, 19 December 2023

That's right. No need anymore for a separate database alongside PostgreSQL. ParadeDB has FTS covered.

Really the ingenuity around the PostgreSQL ecosystem never ceases to amaze me.

PostgreSQL comes with many bells and whistles but really, Extensibility has to be its killer feature. There's just an extension for everything. This is because PostgreSQL has been built with that kind of philosophy in mind, as examined in The Enduring Influence Of Postgres :

With extensibility as an architectural core, it is possible to be creative and stop worrying so much about discipline: you can try many extensions and let the strong succeed. Done well, the “second system” is not doomed; it benefits from the confidence, pet projects, and ambitions developed during the first system

As a matter of fact, here at I Programmer we have been closely monitoring the landscape, covering many new extensions
that take the PostgreSQL experience to a whole other level:

pg_ivm - Materialised Views On Steroids
an extension module for PostgreSQL 14 that provides an Incremental View Maintenance (IVM) feature. That means that materialized views are updated immediately after a base table is modified.

pgsqlite - Load SQLite databases into PostgresSQL
a pure python module and command line tool that makes it simple to import a SQLite database into Postgres, saving a ton of time and hassle in the process.

pg_later - Native Asynchronous Queries Within Postgres
an interesting project and extension built by Tembo which enables Postgres to execute queries asynchronously. Fire your query - but don't forget to check later for the result.

Hydra Turns PostgreSQL Into A Column Store
Hydra is an open-source extension that adds columnar tables to Postgres for efficient analytical reporting.

Turn PostgreSQL Into A Vector Store
pg_vector is an extension for PostgreSQL that renders it a viable alternative to specialized vector stores used in LLMs.

paradedb

And here's yet another; pg_bmp25 which bakes ElasticSearch FTS capabilities into PostgreSQL itself. It is true that full-text search was probably the only thing not good enough on Postgres.

Many projects function just fine with Postgres's Full Text Search native engine and other built-in extensions like trigram search (pg_trgm). In 2016, an open source search library called Tantivy emerged and later on, a library called pgrx.

On the other hand a few projects have attempted to put such a search engine inside of Postgres , but every single one has relied on an external ElasticSearch instance. This meant introducing a complex and expensive piece of infrastructure into the data stack. Perhaps the best-known example of this kind of design is a Postgres extension called ZomboDB.

All that might make PostgreSQL come close to ElasticSearch level querying but not close enough. That is what pg_bmp25 aims to change.

pg_bm25, named after BM25, the algorithm used by modern search engines to calculate the relevance scores of search results, is a Rust extension that aims to bridge the gap between the native capabilities of Postgres’ full text search and those of a specialized search engine like ElasticSearch. The goal is to eliminate the need to bring a cumbersome service like ElasticSearch into the data stack.

With pg_bm25 you get:

  • 100% Postgres native, with zero dependencies on an external search engine
  • Query times over 1M rows are 20x faster compared to tsquery and ts_rank, Postgres’ built-in full text search and sort functions
  • Support for fuzzy search, aggregations, highlighting, and relevance tuning
  • Relevance scoring uses BM25, the same algorithm used by ElasticSearch
  • Real-time search — new data is immediately searchable without manual reindexing

At this point, let's take a look at yet another property of PostgreSQL that is relevant here;that of Commercial Adaptations:

PostgreSQL has long been an attractive starting point for building commercial database systems, given its permissive open source license, its robust codebase, its flexibility, and breadth of functionality

As the result, many products are based on, or are even forks of it:

  • Greenplum was the first effort to offer a shared-nothing parallel, scale-out version of PostgreSQL.
  • EnterpriseDB was founded in 2004 as an open-source-based business, selling PostgreSQL in both a vanilla and enhanced edition with related services for enterprise customers.
  • ParAccel enhanced the Postgres optimizer with new heuristics for queries with many joins. In 2011, Amazon invested in ParAccel, and in 2012 announced AWS Redshift, a hosted data warehouse as a service in the public cloud based on ParAccel technology.
  • CitusDB was founded in 2010 to offer a shared-nothing parallel implementation of PostgreSQL. While it started as a fork of PostgreSQL, as of 2016 CitusDB is implemented via public PostgreSQL extension APIs and can be installed into a vanilla PostgreSQL installation.

And of course PeerDb as examined in "PeerDB Brings Real Time Streaming To PostgreSQL" an ETL/ELT tool built for PostgreSQL that makes all tasks that require streaming data from PostgreSQL to third party counterparts as effortless as it gets.

The Commercial Adaptation property is relevant here too because while pg_bm25 is available for self hosted PostgreSQL instances, Cloud providers are too strict to allow users installing third party extensions, eg on an RDS or Aurora managed instance. As such, pg_bmp25 is offered preinstalled on a managed Postgres database called ParadeDB.

Philippe Noël, senior ParadeDB dev states :

Most customers should have a "search node" isolated from their production database, for independent scaling + resilience, and that ParadeDB should be shipped alongside, not inside, your production DB.

As such, ParadeDB provides the most comprehensive search features of any Postgres database, and is a good fit for you if:

  • You want a single, Postgres-based source of truth and hate duplicating data across multiple services.
  • You want to perform full-text search over large volumes of documents stored in Postgres without compromising on performance or scalability.
  • You want to combine ANN/similarity search with full text search for improved semantic matching.

On the other hand it is not a good fit if:

  • You are looking for an observability solution that ingests and monitors logs.

ParadeDB however goes beyond FTS and pg_bmp25. They also publish the pg_sparse extension which enables sparse vector search using HNSW in Postgres , as such combining searching with pgvector and pg_bm25, that way ParadeDB can offer both keyword-based search as well as semantic search.

While pg_bmp25 and pg_sparse are available for downloading and installing on your selfhosted PostgreSQL instance,
the easiest way to have them both is to run the ParadeDB Dockerfile:

docker run \
-e POSTGRES_USER=<user> \
-e POSTGRES_PASSWORD=<password> \
-e POSTGRES_DB=<dbname> \
-p 5432:5432 \
-d \
paradedb/paradedb:latest

ParadeDB is currently in public beta and ParadeDB managed Cloud is currently being revamped.
To get started follow the quickstart guide.

paradedb2

More Information

ParadeDB

Quickstart 

Related Articles

The Enduring Influence Of Postgres

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Google Intensive AI Course - Free On Kaggle
05/11/2024

Google is offering a 5-Day Gen AI Intensive Course designed to equip data scientists with the knowledge and skills to tackle generative AI projects with confidence. It runs on the Kaggle platform from [ ... ]



Flutter Forked As Flock
05/11/2024

One of developers who worked on the Flutter team at Google has created an open-source form of the framework. Matt Carroll says Flock will be "Flutter+", will remain constantly up to date with Flutter, [ ... ]


More News

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Tuesday, 19 December 2023 )