Getting Going With RAG
Written by Nikos Vaggalis   
Monday, 20 January 2025

IBM has produced a cookbook of tips and methodologies on how to use RAG to power up any kind of business applications. Microsoft and Docling both provide tools for data ingestion from a range of document formats 

I talked about the value of RAG in my recent article RAG from Scratch explaining why this technique is preferable to fine-tuning:

RAG allows LLMs to amplify the user's query by connecting to external data in real time when generating their output.
This approach is lighter in resources, doesn't need constant updating since it consumes the data at run time and of course the big boon is that it retrieves up to date answers.

There are plenty of tutorials on the topic of RAG, but few are as high quality as Langchain's RAG from Scratch, as examined in the homonymous article. Well here's another one, this time from IBM.

The IBM RAG Cookbook is not a simple tutorial though; it provides an insider's view and end-to-end coverage of the entire RAG pipeline, from document ingestion and answer generation to system evaluation. It might have been considered as an indirect advertisement for IBM's AI platform watson.x if it didn't incorporate solutions done with open-source frameworks like LangChain and LlamaIndex.

As such the cookbook caters for all audiences; developers looking for open source solutions or enterprises evaluating and building products on the watson.x platform.

Its material is split into the following categories:

  • Architecture
  • Data Ingestion
  • Chunking
  • Embedding
  • Storage and Retrieval
  • Answer Generation
  • Result Evaluation
  • Orchestration
  • User Interfaces

Seen as steps, when combined and applied in turn, result in the RAG pipeline which subsequent output being the anticipated business application.

ibmragj

Each category is showcased using both the watson.x platform and its open source counterparts. For instance, in the Data Ingestion category:

Ingestion is the process of parsing information from source documents so that it can be embedded into a search space for later retrieval. While this is a straightforward process for plain text complications arise when the source documents are in non 'text' formats, eg. Microsoft Word or PDF, and when they contain complex formatting such as repeating headers and footers, text in multiple columns, or tables.

Three alternatives are offered:

  1. If you are a non-technical user, your documents are relatively simple and you need a solution with no code, use Watsonx Orchestrate.
  2. If you are a technical user and your documents are relatively simple, then start with Watson Discovery
  3. If the document is too complex (i.e. includes nested tables or irregular table formats), Watson Discovery may not capture the whole document structure. You may want to implement a custom data ingestion pipeline using open source libraries for such cases like LangChain and LlamaIndex.

These are followed by practical examples of ingesting documents in all three cases.

The underlying data-ingestion open-source libraries used in converting PDFs to plain text are PyPDFLoader and PyMuPDF, which of course do their job well, but usually PDF is not the only document format found in an enterprise's data silo;
there's also Microsoft office documents, Images, HTML, AsciiDoc and Markdown.

As far as its Office documents go, Microsoft has recently released the MarkItDown utility which converts various file formats to markdown:

  • PDF
  • PowerPoint
  • Word
  • Excel
  • Images (EXIF metadata and OCR)
  • Audio (EXIF metadata and speech transcription)
  • HTML
  • Text-based formats (CSV, JSON, XML)
  • ZIP files (iterates over contents)

Docling is an alternative solution for parsing documents and exporting them to the desired format in preparation for gen AI that is gaining ground fast. From IBM Deep Search it is open-sourced under an MIT License and can read PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown and export them to to HTML, Markdown and JSON with embedded and referenced images. 

docling

The rest of the sections of the IBM Cookbook follow the same pattern, but I'd like to highlight the one onChunking. Effective chunking methodologies are crucial for optimizing search performance and relevance, as such there isn't consensus on what the best way is. This section is offering the most comprehensive overview I've ever encountered by comparing the various chunking techniques available and when each one is the most appropriate in using.

The rest of the narrative follows the pipeline through Embedding to finally an overview of providing an UI for your application.

To sum it up, this is really good and insightful content recommended whether you are an IBM customer or just a developer looking to utilize RAG.

ibmlogo

More Information

Introducing the IBM RAG Cookbook

Microsoft Markitdown

Docling

Related Articles

RAG from Scratch

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Banner


Zasper - The JupyterLab Alternative
23/01/2025

Zasper is a supercharged IDE written in Go that allows you to run data science workloads locally and efficiently.



Neural Networks Learn How To Run A Motor
18/02/2025

Neural Networks are great at learning patterns and this makes them potentially very good at controlling difficult things. For example, a neural network can learn to balance a pole on its point. But do [ ... ]


More News

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Monday, 20 January 2025 )