Getting Going With RAG

Written by Nikos Vaggalis

Monday, 20 January 2025

IBM has produced a cookbook of tips and methodologies on how to use RAG to power up any kind of business application. Microsoft and Docling both provide tools for data ingestion from a range of document formats.

I talked about the value of RAG in my recent article RAG from Scratch explaining why this technique is preferable to fine-tuning:

RAG allows LLMs to amplify the user's query by connecting to external data in real time when generating their output.
This approach is lighter in resources, doesn't need constant updating since it consumes the data at run time and of course the big boon is that it retrieves up to date answers.
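
To make the idea concrete, here is a minimal sketch of the "retrieve, then augment" step, using only the Python standard library. Everything in it is an assumption for illustration: the names retrieve and build_prompt, the sample DOCS and the bag-of-words cosine scoring, which stands in for the learned embeddings and vector store a real system would use.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query (toy scoring)."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Prepend the retrieved passages to the user's question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base, standing in for external data read at run time.
DOCS = [
    "The refund window for online orders is 30 days.",
    "Our support desk is open on weekdays from 9 to 5.",
    "Gift cards cannot be exchanged for cash.",
]

if __name__ == "__main__":
    print(build_prompt("What is the refund window?", DOCS, k=1))
```

The prompt that reaches the LLM carries the retrieved passage, so updating DOCS changes the answers without retraining anything.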

There are plenty of tutorials on the topic of RAG, but few are as high quality as LangChain's RAG from Scratch, as examined in the article of the same name. Well, here's another one, this time from IBM.

The IBM RAG Cookbook is not a simple tutorial though; it provides an insider's view and end-to-end coverage of the entire RAG pipeline, from document ingestion through answer generation to system evaluation. It might have been considered an indirect advertisement for IBM's AI platform watsonx if it didn't incorporate solutions built with open-source frameworks like LangChain and LlamaIndex.

As such, the cookbook caters for all audiences: developers looking for open-source solutions, or enterprises evaluating and building products on the watsonx platform.

Its material is split into the following categories:

Architecture
Data Ingestion
Chunking
Embedding
Storage and Retrieval
Answer Generation
Result Evaluation
Orchestration
User Interfaces

Seen as steps, these categories, when combined and applied in turn, form the RAG pipeline, whose output is the anticipated business application.


Each category is showcased using both the watsonx platform and its open-source counterparts. For instance, in the Data Ingestion category:

Ingestion is the process of parsing information from source documents so that it can be embedded into a search space for later retrieval. While this is a straightforward process for plain text, complications arise when the source documents are in non-'text' formats, e.g. Microsoft Word or PDF, and when they contain complex formatting such as repeating headers and footers, text in multiple columns, or tables.

Three alternatives are offered:

If you are a non-technical user, your documents are relatively simple and you need a solution with no code, use Watsonx Orchestrate.
If you are a technical user and your documents are relatively simple, then start with Watson Discovery.
If the document is too complex (i.e. includes nested tables or irregular table formats), Watson Discovery may not capture the whole document structure. For such cases, you may want to implement a custom data ingestion pipeline using open-source libraries like LangChain and LlamaIndex.

These are followed by practical examples of ingesting documents in all three cases.
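
To give a flavour of what one step of a custom ingestion pipeline might involve, here is a minimal sketch that strips the repeating headers and footers mentioned above from text that has already been extracted page by page. It is not taken from the cookbook; the function name strip_repeated_lines, the min_share threshold and the digit-masking trick are all assumptions made for illustration.

```python
import re
from collections import Counter
from math import ceil


def _key(line):
    """Mask digits so that 'Page 1' and 'Page 2' count as the same line."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on at least `min_share` of the pages.

    `pages` is a list of strings, one per page of extracted text.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct (masked) line once per page.
        counts.update({_key(line) for line in page.splitlines() if line.strip()})

    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, ceil(min_share * len(pages)))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if _key(line) not in repeated]
        cleaned.append("\n".join(kept).strip())
    return cleaned


if __name__ == "__main__":
    sample = [
        "ACME Corp Confidential\nIntro text.\nPage 1",
        "ACME Corp Confidential\nMore text.\nPage 2",
        "ACME Corp Confidential\nFinal text.\nPage 3",
    ]
    print(strip_repeated_lines(sample))
```

A heuristic like this is crude, since a genuinely repeated sentence in the body would also be dropped, which is part of why complex documents call for dedicated tooling.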

The underlying open-source data-ingestion libraries used for converting PDFs to plain text are PyPDFLoader and PyMuPDF, which of course do their job well, but PDF is usually not the only document format found in an enterprise's data silo; there are also Microsoft Office documents, images, HTML, AsciiDoc and Markdown.

As far as Office documents go, Microsoft has recently released the MarkItDown utility, which converts various file formats to Markdown:

PDF
PowerPoint
Word
Excel
Images (EXIF metadata and OCR)
Audio (EXIF metadata and speech transcription)
HTML
Text-based formats (CSV, JSON, XML)
ZIP files (iterates over contents)
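
As a rough sketch of how this could slot into an ingestion step, the snippet below converts every supported file in a folder to Markdown. It assumes the markitdown package is installed and relies on the MarkItDown().convert(...) call and the text_content attribute shown in the project's README; the convert_folder helper, its suffixes default and the folder name are hypothetical.

```python
from pathlib import Path


def default_converter():
    """Create a MarkItDown converter (requires `pip install markitdown`)."""
    # Imported lazily so this module still loads where markitdown is absent.
    from markitdown import MarkItDown
    return MarkItDown()


def convert_folder(folder, converter=None,
                   suffixes=(".pdf", ".docx", ".pptx", ".xlsx", ".html")):
    """Return {filename: markdown_text} for supported files directly inside `folder`.

    `converter` is any object with a convert(path) method whose result has a
    `text_content` attribute; it defaults to a real MarkItDown instance.
    """
    converter = converter or default_converter()
    converted = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            converted[path.name] = converter.convert(str(path)).text_content
    return converted


if __name__ == "__main__":
    # "docs" is a placeholder folder name.
    for name, markdown in convert_folder("docs").items():
        print(name, len(markdown), "characters of Markdown")
```

Passing the converter in as a parameter keeps the helper easy to test and makes it simple to swap in a different conversion library later.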

Docling, which is gaining ground fast, is an alternative solution for parsing documents and exporting them to the desired format in preparation for gen AI. Originating from IBM Deep Search, it is open-sourced under the MIT License and can read PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc and Markdown, and export them to HTML, Markdown and JSON with embedded and referenced images.
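
For comparison, a similarly hedged sketch with Docling follows. It assumes the docling package is installed and uses the DocumentConverter class with the export_to_markdown() and export_to_dict() methods as shown in the project's documentation; the docling_export wrapper and the report.pdf file name are hypothetical.

```python
def docling_export(source, converter=None):
    """Convert one document and return its Markdown and dict (JSON-ready) forms.

    `source` is a local path or URL. `converter` defaults to a real
    docling DocumentConverter (requires `pip install docling`).
    """
    if converter is None:
        # Imported lazily so this module still loads where docling is absent.
        from docling.document_converter import DocumentConverter
        converter = DocumentConverter()

    result = converter.convert(source)
    return {
        "markdown": result.document.export_to_markdown(),
        "json": result.document.export_to_dict(),
    }


if __name__ == "__main__":
    # "report.pdf" is a placeholder file name.
    exported = docling_export("report.pdf")
    print(exported["markdown"][:500])
```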


The rest of the sections of the IBM Cookbook follow the same pattern, but I'd like to highlight the one on Chunking. Effective chunking methodologies are crucial for optimizing search performance and relevance, yet there isn't consensus on what the best way is. This section offers the most comprehensive overview I've ever encountered, comparing the various chunking techniques available and explaining when each one is most appropriate.
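
To illustrate why the choice matters, here is a minimal sketch contrasting two common strategies: fixed-size chunks with overlap, and chunks that respect paragraph boundaries. These are generic techniques, not the cookbook's own code, and the function names and default sizes are assumptions.

```python
def fixed_size_chunks(text, size=200, overlap=50):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def paragraph_chunks(text, max_chars=200):
    """Pack whole paragraphs into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", size=4, overlap=1))
    print(paragraph_chunks("one\n\ntwo\n\nthree", max_chars=8))
```

Fixed-size windows are simple and predictable but can cut a sentence in half; paragraph-aware chunks keep related text together at the cost of uneven sizes.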

The rest of the narrative follows the pipeline from Embedding onwards and ends with an overview of providing a UI for your application.

To sum up, this is really good and insightful content, recommended whether you are an IBM customer or just a developer looking to utilize RAG.

More Information

Introducing the IBM RAG Cookbook

Microsoft MarkItDown

Docling

RAG from Scratch


Last Updated ( Monday, 20 January 2025 )
