How to Build Multimodal RAG Using Docling?


Multimodal Retrieval-Augmented Generation (RAG) is a transformative innovation in AI, enabling systems to process and integrate diverse data types such as text, images, audio, and video. This capability is crucial for handling unstructured enterprise data, which predominantly consists of multimodal formats. By leveraging multimodal inputs, RAG enhances contextual understanding, improves accuracy, and expands AI’s applicability across industries like healthcare, customer support, and education. Docling is an open-source toolkit developed by IBM to streamline document processing for generative AI applications, and in this article we will use it to build multimodal RAG capabilities.

It converts diverse formats like PDFs, DOCX, and images into structured outputs such as JSON and Markdown, enabling seamless integration with AI frameworks like LangChain and LlamaIndex. By facilitating the extraction of unstructured data and supporting advanced layout analysis, Docling empowers multimodal Retrieval-Augmented Generation (RAG) by making complex enterprise data machine-readable and accessible for AI-driven insights.

Learning Objectives

  • Exploring Docling – Understanding how it extracts multimodal information from unstructured files.
  • Docling Pipeline & AI Models – Examining its architecture and key AI components.
  • Unique Features – Highlighting what makes Docling stand out.
  • Building a Multimodal RAG System – Implementing a system using Docling for data extraction and retrieval.
  • End-to-End Process – Extracting data from a PDF, generating image descriptions, and querying with a vector DB & Phi 4.

This article was published as a part of the Data Science Blogathon.

Docling For Unstructured Data

Docling is an open-source document processing toolkit developed by IBM, designed to convert unstructured files like PDFs, DOCX, and images into structured formats such as JSON and Markdown. Powered by advanced AI models like DocLayNet for layout analysis and TableFormer for table recognition, it enables accurate extraction of text, tables, and images while preserving document structure. With seamless integration into generative AI frameworks like LangChain and LlamaIndex, Docling supports applications such as Retrieval-Augmented Generation (RAG) and question-answering systems. Its lightweight architecture allows efficient performance on standard hardware, making it a cost-effective alternative to SaaS-based solutions for enterprises seeking control over data privacy.

Docling Pipeline

Docling’s default processing pipeline. The inner part of the model pipeline is easily customizable and extensible. Image Source

Docling implements a linear pipeline of operations, which execute sequentially on each given document (as shown in the above Figure). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading order and eventually assembles a typed document object which can be serialized to JSON or Markdown. 
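The whole pipeline can be driven from a few lines of Python. The snippet below is a minimal sketch (the file name ‘sample.pdf’ is hypothetical): the DocumentConverter runs the PDF backend and the standard model pipeline with default settings, and the resulting typed document can then be serialized to Markdown or to a JSON-style dictionary.

# Minimal sketch of driving Docling's default pipeline (assumes 'sample.pdf' exists locally)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()            # default PDF backend + standard model pipeline
result = converter.convert("sample.pdf")   # parse, run AI models per page, post-process

doc = result.document                      # typed DoclingDocument object
print(doc.export_to_markdown()[:500])      # serialize to Markdown
doc_dict = doc.export_to_dict()            # or to a JSON-serializable dictionary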

Key AI Models Behind Docling

Traditionally, developers have depended on optical character recognition (OCR) for converting documents into digital formats. However, this technology can be slow and prone to errors due to the heavy computational power required. Docling avoids OCR whenever possible, instead using computer vision models that are specifically trained to identify and categorize the visual components of a page.

Docling is based on two models developed by IBM researchers.

Layout Analysis Model

The layout analysis model functions as an object detector, predicting the bounding boxes and categories of various elements within an image of a given page. Its design is based on RT-DETR and has been re-trained using DocLayNet, IBM’s well-known human-annotated dataset for document layout analysis, along with other proprietary datasets. DocLayNet is a human-annotated document layout segmentation dataset containing 80,863 pages from a broad variety of document sources.

This model utilizes object detection techniques to examine the layout of documents, ranging from machine manuals to annual reports. It then identifies and classifies elements such as blocks of text, images, tables, captions, and more. The Docling pipeline processes page images at a resolution of 72 dpi, enabling them to be handled by a single CPU.
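As a rough illustration of the layout model’s output (not its internals), the detected elements are exposed on the converted document, so we can count and inspect them. The snippet below is a sketch and assumes docling_document is a DoclingDocument produced by the converter shown later in Step 2.

# Sketch: inspecting the element types detected by the layout analysis model
# Assumes `docling_document` is a DoclingDocument from DocumentConverter (see Step 2)
print(len(docling_document.texts), "text items")
print(len(docling_document.tables), "tables")
print(len(docling_document.pictures), "pictures")
for item in docling_document.texts[:5]:
    print(item.label, "-", item.text[:60])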

TableFormer Model

The TableFormer model, initially introduced in 2022 and subsequently enhanced with a custom token structure language, is a vision-transformer model designed for recovering the structure of tables. It can predict the logical organization of rows and columns in a table based on an input image, identifying which cells belong to column headers, row headers, or the main body of the table. Unlike previous methods, TableFormer effectively handles various table complexities, including partial or absent borders, empty cells, missing rows or columns, cell spans, hierarchical structures in both column and row headings, as well as inconsistencies in indentation or alignment.
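As a small sketch of what TableFormer’s output looks like through the Docling API (rather than the model itself), each recovered table on a converted document can be exported to Markdown or, assuming the export_to_dataframe() helper on table items, to a pandas DataFrame:

# Sketch: reading the table structure recovered by TableFormer
# Assumes `docling_document` is a DoclingDocument from DocumentConverter (see Step 2)
for table in docling_document.tables:
    print(table.export_to_markdown())        # reconstructed rows/columns as Markdown
    df = table.export_to_dataframe()         # same structure as a pandas DataFrame
    print(df.shape)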

Some Key Features of Docling

Here are the features:

  • Versatile Format Support: Docling can parse a wide range of document formats, including PDFs, DOCX, PPTX, HTML, images, and more. It exports content into structured formats like JSON and Markdown for seamless integration into AI workflows.
  • Advanced PDF Processing: It includes sophisticated capabilities such as layout analysis, reading order detection, table structure recognition, and OCR for scanned documents. This ensures the accurate extraction of complex document elements like tables and figures. Docling extracts tables using advanced AI-driven methods, primarily leveraging its custom TableFormer model.
  • Unified Document Representation: Docling uses a unified and expressive format to represent parsed documents, making it easier to process and analyze them in downstream applications.
  • AI-Ready Integration: The toolkit integrates seamlessly with popular AI frameworks like LangChain and LlamaIndex, making it ideal for applications like Retrieval-Augmented Generation (RAG) and question-answering systems.
  • Local Execution: It supports local execution, enabling secure processing of sensitive data in air-gapped environments.
  • Efficient Performance: Designed to run on commodity hardware with minimal resource requirements, Docling avoids traditional OCR when possible, speeding up processing by up to 30 times while reducing errors.
  • Modular Architecture: Its modular design allows easy customization and extension with new features or models, catering to diverse use cases.
  • Open-Source Accessibility: Unlike proprietary tools like Watson Document Understanding, Docling is open-source under the MIT license, allowing developers to freely use, customize, and integrate it into their workflows without vendor lock-in or additional costs.

Docling provides optional support for OCR, for example, to cover scanned PDFs or content in bitmap images embedded on a page. Docling relies on EasyOCR, a popular third-party OCR library with support for many languages. These features make Docling a comprehensive solution for document parsing and preparation in generative AI workflows.
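If a document does need OCR (for example, a scanned PDF), it can be switched on through the pipeline options. The snippet below is a minimal sketch, assuming Docling’s EasyOcrOptions configuration class; it mirrors the converter setup used later in Step 2, but with OCR enabled.

# Sketch: enabling EasyOCR for scanned PDFs (in this article we keep OCR off)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions

ocr_pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=EasyOcrOptions(lang=["en"]))
ocr_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=ocr_pipeline_options)}
)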

Building a Multimodal RAG System using Docling

In this article, we will first extract all kinds of data – text, images, and tables – from a PDF using Docling. For the extracted images, we will use a vision language model to generate descriptions and save these text descriptions in our VectorDB along with the text from the original document contents and from the extracted tables in the PDF. After this, we will build a RAG system that uses the vector DB for retrieval along with an LLM (Phi 4) served through Ollama for querying the PDF document.

Hands-On Python Implementation on Google Colab using T4 GPU (Free Tier)

You can find the Colab Notebook which has all the steps here.

Step 1. Installing Libraries

We start by installing the necessary libraries.

!pip install docling

#Following code added to avoid an error in installation - can be removed if not needed
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

!pip install langchain-huggingface

Step 2. Loading the Converter Object

This code prepares a document converter to process PDF files without OCR but with image generation. It then applies this conversion to a specified PDF file, storing the results in a dictionary.

We use this PDF (saved in the current working directory as ‘accenture.pdf’), which contains a lot of charts, to test multimodal retrieval using Docling.

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Disable OCR (the PDF already has programmatic text) and keep rendered images of the pictures
pdf_pipeline_options = PdfPipelineOptions(do_ocr=False, generate_picture_images=True)
format_options = {InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options)}
converter = DocumentConverter(format_options=format_options)

sources = [ "/content/accenture.pdf",]
conversions = {source: converter.convert(source=source).document for source in sources}
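As an optional sanity check (my addition, not part of the original notebook), we can peek at the converted document before chunking it:

# Optional check: inspect the converted document before moving on
docling_document = conversions["/content/accenture.pdf"]
print(docling_document.export_to_markdown()[:500])
print(len(docling_document.tables), "tables |", len(docling_document.pictures), "pictures")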

Step 3. Loading the Model For Embedding Text

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_path,)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)
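A quick way to confirm the embedding model loaded correctly (again, a small check of my own rather than part of the original flow) is to embed a test string and look at the vector length:

# Optional check: the embedding model should return a fixed-size vector
test_vector = embeddings_model.embed_query("Accenture revenue by geography")
print(len(test_vector))  # embedding dimension of the granite model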

Step 4. Chunking the Texts in the Document

The code below implements the chunking step of the document processing pipeline. It takes the converted documents from the previous step and breaks them down into smaller chunks, skipping tables (which are processed separately later). Each chunk is then wrapped into a Document object with specific metadata.

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.document import TableItem
from langchain_core.documents import Document


doc_id = 0
texts: list[Document] = []

for source, docling_document in conversions.items():
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
        items = chunk.meta.doc_items
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue  # we will process tables later
        refs = " ".join(map(lambda item: item.get_ref().cref, items))
        text = chunk.text
        document = Document(
            page_content=text,
            metadata={
                "doc_id": (doc_id := doc_id + 1),
                "source": source,
                "ref": refs,
            },
        )
        texts.append(document)

print(f"{len(texts)} text document chunks created")

Step 5. Processing the Tables in the Document

The code below is designed to process tables from converted documents. It extracts tables, converts them into Markdown format, and wraps each table into a Document object with specific metadata.

from docling_core.types.doc.labels import DocItemLabel

doc_id = len(texts)
tables: list[Document] = []

for source, docling_document in conversions.items():
    for table in docling_document.tables:
        if table.label in [DocItemLabel.TABLE]:
            ref = table.get_ref().cref
            text = table.export_to_markdown()
            document = Document(
                page_content=text,
                metadata={
                    "doc_id": (doc_id:=doc_id+1),
                    "source": source,
                    "ref": ref
                },
            )
            tables.append(document)


print(f"{len(tables)} table documents created")

Step 6. Defining a Function for Converting Images from the PDF to Base64

import base64
import io
import PIL.Image
import PIL.ImageOps
from IPython.display import display

def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
    # Correct the orientation using EXIF data and ensure an RGB image
    image = PIL.ImageOps.exif_transpose(image) or image
    image = image.convert("RGB")
    # Serialize the image to an in-memory buffer and base64-encode it
    buffer = io.BytesIO()
    image.save(buffer, format)
    encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return encoding
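For example (a quick check on a small synthetic image, not one from the PDF), the helper can be exercised like this:

# Quick test of encode_image with an in-memory image
test_image = PIL.Image.new("RGB", (64, 64), color="white")
b64 = encode_image(test_image)
print(len(b64), "characters of base64:", b64[:40], "...")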

Step 7. Pulling Model From Ollama For Analysing Images from the PDF

We will use a vision language model from Ollama to analyse the extracted images from the PDF and generate a description for each of the images. To facilitate the use of Ollama models, we install the following libraries and start up the Ollama server before pulling the model as described below in the code.

!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
!pip install ollama==0.4.2
!pip install langchain-community


# Start the Ollama server in a background thread so it does not block the notebook
import threading
import subprocess
import time

def run_ollama_serve():
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to start before pulling models
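Before pulling any model, it is worth confirming that the Ollama server came up (a small check added here); listing the locally available models should succeed without a connection error:

!ollama list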

The code below processes images from the converted documents. It extracts each image, uses a vision model (llama3.2-vision through Ollama) to generate descriptive text for it, and wraps this text into a Document object with specific metadata.

Pulling the “llama3.2-vision” model from Ollama.

!ollama pull llama3.2-vision
import ollama

pictures: list[Document] = []
doc_id = len(texts) + len(tables)

for source, docling_document in conversions.items():
    for picture in docling_document.pictures:
        ref = picture.get_ref().cref
        image = picture.get_image(docling_document)
        if image:
            # Ask the vision model for a textual description of the extracted image
            response = ollama.chat(
                model="llama3.2-vision",
                messages=[{
                    "role": "user",
                    "content": "Describe this image?",
                    "images": [encode_image(image)]
                }],
            )
            text = response['message']['content'].strip()
            document = Document(
                page_content=text,
                metadata={
                    "doc_id": (doc_id := doc_id + 1),
                    "source": source,
                    "ref": ref,
                },
            )
            pictures.append(document)

print(f"{len(pictures)} image descriptions created")
Output

Step 8. Viewing the Extracted Documents

The code below prints every Document created so far (text chunks, tables, and image descriptions) along with its metadata, and displays each extracted image so the results can be verified.
import itertools
from docling_core.types.doc.document import RefItem

# Print all created documents
for document in itertools.chain(texts, tables):
    print(f"Document ID: {document.metadata['doc_id']}")
    print(f"Source: {document.metadata['source']}")
    print(f"Content:\n{document.page_content}")
    print("=" * 80) # Separator for clarity

for document in pictures:
    print(f"Document ID: {document.metadata['doc_id']}")
    source = document.metadata['source']
    print(f"Source: {source}")
    print(f"Content:\n{document.page_content}")
    docling_document = conversions[source]
    ref = document.metadata['ref']
    picture = RefItem(cref=ref).resolve(docling_document)
    image = picture.get_image(docling_document)
    print("Image:")
    display(image)
    print("=" * 80) # Separator for clarity
Output

Step 9. Storing in Milvus Vector DB

Milvus is a high-performance vector database built for scale. It powers AI applications by efficiently organizing and searching vast amounts of unstructured data, such as text, images, and multi-modal information. We install the langchain-milvus library first and then store the texts, tables and pictures in the vector DB. While defining the vector DB, we also pass the embedding model so that the vector DB converts all the text extracted, including the data from tables and image descriptions, into embeddings before storing them.

!pip install langchain_milvus

import tempfile
from langchain_core.vectorstores import VectorStore
from langchain_milvus import Milvus


db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
vector_db: VectorStore = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)


#add all the LangChain documents for the text, tables and image descriptions to the vector database
import itertools
documents = list(itertools.chain(texts, tables, pictures))
ids = vector_db.add_documents(documents)
print(f"{len(ids)} documents added to the vector database")

Step 10. Querying the Model Using Retrieval-Augmented Generation with Phi 4

In the following code, we first pull the “Phi 4” model from Ollama and then use it as the LLM in this RAG system to generate a response after retrieving the relevant context from the vector DB based on a query.

#Pulling the Ollama model for querying
!ollama pull phi4

#Querying
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
retriever = vector_db.as_retriever()


# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Local LLM
ollama_llm = "phi4"
model_local = ChatOllama(model=ollama_llm)

# Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model_local
    | StrOutputParser()
)

chain.invoke("How much worth in dollars is Strategy & Consulting in Services?")

Output

According to the context provided, the 'Technology & Strategy/Consulting'
section of the company's operations generated a value of $15 billion.

As seen from the chart below from the document, the response of our multimodal RAG system is correct. With Docling, the information was correctly extracted from the chart and hence the retrieval system was able to provide us with an accurate response.

The chart in the Original PDF

Analyzing Our RAG System with More Queries

What was the revenue in Germany?

The revenue in Germany, according to the provided context, is $3 billion. This information is listed under the 'Country-Wise Revenue' section of the document: **Germany**: $3 billion. If you need any further details or have additional questions, feel free to ask!

As seen from the chart below from the document, the response of our multimodal RAG system is correct. With Docling, the information was correctly extracted from the chart and hence the retrieval system was able to provide us with an accurate response.

Chart in Original PDF

What was the Cloud FY19 revenue?

The Cloud FY19 revenue, as provided in the document context, was $11 billion. This information is found in the first table under the section titled 'Cloud', where it states: FY19: $11B. This indicates that the revenue from cloud services for fiscal year 2019 was $11 billion.

As seen from the Table below from the document, the response of our multimodal RAG system is correct. With Docling, the information was correctly extracted from the chart and hence the retrieval system was able to provide us with an accurate response.

Chart in Original PDF

What was the Industry X 3 Yr CAGR?

Based on the provided context from the documents in Accenture’s PDF:

- In Document with doc_id 15 and Document with doc_id 3, both mention Industry X.
- The relevant information is found under a section about revenue growth for Industry X:

**Document 15** indicates: "FY19 $10B Industry X FY19 $3B FY22 $6.5B 3 Yr. CAGR 2 30%"

**Document 3** reiterates this with similar wording: "Cloud = FY19 $10B Industry X FY19. , Illustrative = . , Cloud = $3B. , Illustrative = FY22 $6.5B. , Illustrative = 3 Yr. CAGR 2 30%"

From these excerpts, the 3-year compound annual growth rate (CAGR) for Industry X is **30%**.

As seen from the previous Table from the document, the response of our multimodal RAG system is correct. With Docling, the information was correctly extracted from the chart and hence the retrieval system was able to provide us with an accurate response.

Conclusion

In conclusion, Docling stands as a powerful tool for transforming unstructured data into machine-readable formats, making it an essential resource for applications like Multimodal Retrieval-Augmented Generation (RAG). By utilizing advanced AI models and offering seamless integration with popular AI frameworks, Docling enhances the ability to process and query complex documents efficiently. Its open-source nature, combined with versatile format support and modular architecture, makes it an ideal solution for enterprises seeking to leverage generative AI in real-world use cases.

Key Takeaways

  • Docling Toolkit: IBM’s open-source tool for extracting structured data (JSON, Markdown) from PDFs, DOCX, and images, enabling seamless AI integration.
  • Advanced AI Models: Uses Layout Analysis and TableFormer for accurate document processing, reducing reliance on traditional OCR.
  • AI Framework Integration: Works with LangChain and LlamaIndex, ideal for RAG systems, offering cost-effective AI-driven insights.
  • Open-Source & Customizable: MIT-licensed, modular, and adaptable for diverse use cases, free from vendor lock-in.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is Multimodal Retrieval-Augmented Generation (RAG) and how does it work?

Ans. RAG is an AI framework that integrates various data types, such as text, images, audio, and video, to improve contextual understanding and accuracy. By processing multimodal inputs, RAG enables AI systems to generate more accurate insights and extend their applicability across industries like healthcare, education, and customer support.

Q2. What is Docling and how does it support AI-driven workflows?

Ans. Docling is an open-source document processing toolkit developed by IBM. It converts unstructured documents (e.g., PDFs, DOCX, images) into structured formats such as JSON and Markdown. This conversion enables seamless integration with generative AI frameworks like LangChain and LlamaIndex, facilitating applications like RAG and question-answering systems.

Q3. How does Docling handle complex document elements like tables and images?

Ans. Docling utilizes advanced AI models like Layout Analysis for detecting document layout elements and TableFormer for recognizing table structures. These models help extract text, tables, and images while preserving the document’s structure, improving accuracy and making complex data machine-readable for AI systems.

Q4. Can Docling be used with other AI frameworks and models?

Ans. Yes, Docling is designed to integrate seamlessly with popular AI frameworks like LangChain and LlamaIndex. It can be used to power applications like Retrieval-Augmented Generation (RAG) by extracting data from unstructured documents and enabling AI systems to query and retrieve relevant information.

Q5. Is Docling a cost-effective solution for enterprises handling sensitive data?

Ans. Docling is a cost-effective alternative to SaaS-based document processing tools. It allows local execution, making it ideal for enterprises that need to process sensitive data in air-gapped environments, ensuring data privacy while offering efficient performance on standard hardware. Additionally, Docling is open-source under the MIT license, allowing for easy customization without vendor lock-in.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
