Imagine a world where finding information in a document is as easy as asking a question—and getting a response that combines both text and images seamlessly. In this guide, we dive into building a Multimodal Retrieval-Augmented Generation pipeline that can do just that. You’ll learn how to parse text and images from a PDF slide deck using tools like LlamaParse, create contextual summaries for enhanced retrieval, and feed this data into advanced models like GPT-4 for query answering. Along the way, we’ll explore how contextual retrieval improves accuracy, optimize costs with prompt caching, and compare results between baseline and enhanced pipelines. Get ready to unlock the potential of RAG with this step-by-step walkthrough!
Learning Objectives
- Understand how to parse PDF slide decks for text and images using LlamaParse.
- Learn to add contextual summaries to text chunks for improved retrieval accuracy.
- Build a Multimodal RAG pipeline combining text and images with LlamaIndex.
- Explore the integration of multimodal data into models like GPT-4.
- Compare retrieval performance between baseline and contextual indices.
This article was published as a part of the Data Science Blogathon.
Building a Contextual Multimodal RAG Pipeline
Contextual retrieval was initially introduced in this Anthropic blog post. The high-level intuition is that every chunk is given a concise summary of where that chunk fits in with respect to the overall summary of the document. This allows insertion of high-level concepts/keywords that enable this chunk to be better retrieved for different types of queries.
These per-chunk LLM calls are expensive, so contextual retrieval depends on prompt caching to be cost-efficient.
In this notebook, we use Claude 3.5 Sonnet to generate the contextual summaries. The full document is cached as text tokens, and each summary is generated by feeding in the parsed text of the individual chunk.
We feed both the text and image chunks into the final multimodal RAG pipeline to generate the response.
In a Retrieval-Augmented Generation (RAG) pipeline, we typically:
- Parse our source data (e.g. PDF documents, images, slides).
- Embed and index chunks of text for retrieval.
- Retrieve relevant chunks for a given query.
- Synthesize a response by feeding the retrieved chunks (and, optionally, any relevant images or additional metadata) into a Large Language Model (LLM).
Contextual Retrieval is a neat enhancement to standard RAG. Each chunk of text is annotated with a short summary that situates it within the broader document context. This helps the retriever pick the chunk more accurately for queries that might not match the exact words but relate to the overall topic or concept.
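To make this concrete, here is a purely illustrative example (not taken verbatim from the pipeline below) of what a chunk's metadata might look like once a contextual summary has been attached; the context string is what helps the retriever connect a terse table fragment to the right section:
chunk_metadata = {
    "parsed_text_markdown": "AWS: 68% | Azure: 61% | GCP: 40%",  # raw fragment parsed from one slide
    "context": (
        "This chunk comes from the 'Deep Dive on Infrastructure' section of the "
        "ICONIQ State of AI report and lists the cloud providers enterprises use "
        "to host generative AI workloads."
    ),
}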
Overview of the Multimodal RAG Pipeline
We’ll demonstrate how to build a Multimodal RAG pipeline over a PDF slide deck, using:
- Anthropic as our main LLM (Claude 3.5-Sonnet).
- VoyageAI embeddings for chunk embedding.
- LlamaIndex for our retrieval/indexing abstractions.
- LlamaParse for extracting text and images from the PDF slides.
- OpenAI's GPT-4o multimodal model for final query answering (text + image mode).
We will also show how to cache LLM calls to minimize costs, since Contextual Retrieval can generate a lot of prompt calls.
Environment Setup and Dependencies
You’ll need to install or upgrade a few packages:
!pip install -U llama-index llama-parse
!pip install -U llama-index-callbacks-arize-phoenix
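Depending on your environment, you may also need the individual integration packages used later in this walkthrough (package names as of writing; skip this if they are already installed):
!pip install -U llama-index-llms-anthropic llama-index-embeddings-voyageai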
Additionally:
- Anthropic API Key: set os.environ["ANTHROPIC_API_KEY"] = "...".
- VoyageAI API Key: set os.environ["VOYAGE_API_KEY"] = "...".
Setup Observability with LlamaTrace (Arize Integration)
We set up observability with LlamaTrace, LlamaIndex's hosted integration with Arize Phoenix.
If you haven’t already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the PHOENIX_API_KEY variable below.
Voyage AI uses API keys to monitor usage and manage permissions. To obtain your key, sign in to your Voyage AI account and click the "Create new API key" button in the dashboard. You will also need to add payment details, but your first 200 million tokens are still free for the Voyage 3 series models.
The Phoenix API key can be obtained by signing up for LlamaTrace (link above), then navigating to the bottom-left panel and clicking "Keys", where you should find your API key.
import os
import nest_asyncio
nest_asyncio.apply()
# Arize Phoenix
PHOENIX_API_KEY = ""
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
import llama_index.core
llama_index.core.set_global_handler(
"arize_phoenix",
endpoint="https://llamatrace.com/v1/traces"
)
Load and Parse the PDF Slides
In our example, we’ll parse the ICONIQ 2024 State of AI Report. This PDF is publicly available at the URL below. If you prefer, you can replace it with any PDF you have.
!mkdir data
!mkdir data_images_iconiq
!wget "https://cdn.prod.website-files.com/65e1d7fb19a3e64b5c36fb38/66eb856e019e59758ef73759_ICONIQ%20Analytics%20%2B%20Insights%20-%20State%20of%20AI%20Sep24.pdf" -O data/iconiq_report.pdf
Model Setup
Let’s set up the core components required to build and implement our Multimodal RAG pipeline effectively.
import os
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.voyageai import VoyageEmbedding
from llama_index.core import Settings
# Replace with your actual keys
os.environ["ANTHROPIC_API_KEY"] = "sk-..."
os.environ["VOYAGE_API_KEY"] = "..."
llm = Anthropic(model="claude-3-5-sonnet-20240620")
embed_model = VoyageEmbedding(model_name="voyage-3")
Settings.llm = llm
Settings.embed_model = embed_model
Parse Text and Images with LlamaParse
In this example, we use LlamaParse to parse both the text and images from the document, extracting the text with LlamaParse's premium mode.
NOTE: The report has 40 pages, and at ~5c per page, this will cost you $2. Just a heads up!
To obtain a LlamaCloud API key, click "Get started" at https://www.llamaindex.ai/contact and log in. Once redirected to the LlamaCloud dashboard, generate a new API key from the API pane on the left.
from llama_parse import LlamaParse
parser = LlamaParse(
    result_type="markdown",
    premium_mode=True,
    # invalidate_cache=True,  # Uncomment if you want to force a fresh parse
    api_key="LlamaCloud-API-Key",  # replace with your LlamaCloud API key
)
print("Parsing text...")
md_json_objs = parser.get_json_result("data/iconiq_report.pdf")
md_json_list = md_json_objs[0]["pages"]
image_dicts = parser.get_images(md_json_objs, download_path="data_images_iconiq")
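Before building nodes, it is worth a quick sanity check that the parse produced one entry per slide and that the page screenshots were downloaded. A minimal check using the objects created above:
# Quick sanity check on the parse results
print(f"Parsed {len(md_json_list)} pages")
print(md_json_list[0]["md"][:300])  # preview of the first page's markdown
print(len(os.listdir("data_images_iconiq")), "page images downloaded")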
Build Multimodal Nodes
Multimodal nodes are the building blocks that allow us to process and integrate diverse data types like text and images. Here, we’ll construct nodes to parse, embed, and index chunks from a PDF slide deck, setting the foundation for a robust retrieval system.
Each PDF page corresponds to one “node” containing:
- Text (parsed into Markdown)
- Image (screenshot of that page)
Split Pages into Text Nodes
In this step, we’ll split the PDF pages into smaller, manageable text nodes. This ensures efficient embedding and retrieval by breaking down the content into meaningful chunks for precise contextual analysis.
from pathlib import Path
from llama_index.core.schema import TextNode
from typing import Optional
import re
def get_page_number(file_name):
    """Extract the page number from an image filename like '...-page_3.jpg'."""
    match = re.search(r"-page_(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0

def _get_sorted_image_files(image_dir):
    """Return the page image files sorted by page number."""
    raw_files = [
        f for f in list(Path(image_dir).iterdir()) if f.is_file() and "-page" in str(f)
    ]
    return sorted(raw_files, key=get_page_number)

def get_text_nodes(image_dir, json_dicts):
    """Create one TextNode per page, pairing the parsed markdown with the page image."""
    nodes = []
    image_files = _get_sorted_image_files(image_dir)
    md_texts = [d["md"] for d in json_dicts]
    for idx, md_text in enumerate(md_texts):
        chunk_metadata = {
            "page_num": idx + 1,
            "image_path": str(image_files[idx]),
            "parsed_text_markdown": md_text,
        }
        node = TextNode(text="", metadata=chunk_metadata)
        nodes.append(node)
    return nodes

text_nodes = get_text_nodes("data_images_iconiq", md_json_list)
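A quick spot-check confirms that each node carries both the page's markdown and the path to its screenshot:
# Verify that page text and page image line up for the first node
print(text_nodes[0].metadata["page_num"], text_nodes[0].metadata["image_path"])
print(text_nodes[0].metadata["parsed_text_markdown"][:300])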
Add Contextual Summaries
Contextual retrieval attaches a short, high-level summary to each chunk, describing where it fits into the overall document. We’ll use the LLM to generate these short summaries and store them in each node’s metadata[“context”].
from copy import deepcopy
from llama_index.core.llms import ChatMessage
from llama_index.core.prompts import ChatPromptTemplate
import time
whole_doc_text = """\
Here is the entire document.
{WHOLE_DOCUMENT}
"""
chunk_text = """\
Here is the chunk we want to situate within the whole document
{CHUNK_CONTENT}
Please give a short succinct context to situate this chunk within the overall document for \
the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
def create_contextual_nodes(nodes, llm):
    """Create a copy of each node with a contextual summary stored in its metadata."""
    nodes_modified = []
    # Get the overall document text by concatenating all node contents
    doc_text = "\n".join([n.get_content(metadata_mode="all") for n in nodes])
    for idx, node in enumerate(nodes):
        start_time = time.time()
        new_node = deepcopy(node)
        # Combine whole_doc_text and chunk_text into a single user message
        user_content = (
            f"{whole_doc_text.format(WHOLE_DOCUMENT=doc_text)}\n\n"
            f"{chunk_text.format(CHUNK_CONTENT=node.get_content(metadata_mode='all'))}"
        )
        messages = [
            ChatMessage(role="system", content="You are a helpful AI Assistant."),
            ChatMessage(role="user", content=user_content),
        ]
        # Send the messages to the LLM and store the summary in the node's metadata
        new_response = llm.chat(messages)
        new_node.metadata["context"] = str(new_response)
        nodes_modified.append(new_node)
        print(f"Completed node {idx}, {time.time() - start_time}")
    return nodes_modified
Tip: To take advantage of Anthropic prompt caching here, you would pass an extra_headers parameter with the prompt-caching beta header when making the calls; the snippet above omits it for simplicity, and actual usage can vary.
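If you do want to enable prompt caching, a rough sketch using the anthropic SDK directly is shown below. It caches the full document as a system prompt prefix and reuses it across the per-chunk calls. The beta header name and the cache_control field follow Anthropic's prompt-caching documentation at the time of writing, so treat them as assumptions to verify against the current docs:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize_chunk_with_caching(doc_text: str, chunk_str: str) -> str:
    """Situate one chunk within the whole (cached) document."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {
                "type": "text",
                "text": f"Here is the entire document.\n{doc_text}",
                "cache_control": {"type": "ephemeral"},  # cache the large document prefix
            }
        ],
        messages=[
            {
                "role": "user",
                "content": (
                    f"Here is the chunk we want to situate within the whole document\n"
                    f"{chunk_str}\n"
                    "Please give a short succinct context to situate this chunk within the "
                    "overall document for the purposes of improving search retrieval of the "
                    "chunk. Answer only with the succinct context and nothing else."
                ),
            }
        ],
    )
    return response.content[0].text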
Build and Persist the Index
We’ll now embed these summarized chunks and store them in a vector store for retrieval. LlamaIndex can persist indices locally or integrate with 40+ external vector databases.
import os
from llama_index.core import (
StorageContext,
VectorStoreIndex,
load_index_from_storage,
)
# Create the contextualized nodes (this makes one LLM call per page)
new_text_nodes = create_contextual_nodes(text_nodes, llm)

if not os.path.exists("storage_nodes_iconiq"):
    index = VectorStoreIndex(new_text_nodes, embed_model=embed_model)
    index.set_index_id("vector_index")
    index.storage_context.persist("./storage_nodes_iconiq")
else:
    storage_context = StorageContext.from_defaults(persist_dir="storage_nodes_iconiq")
    index = load_index_from_storage(storage_context, index_id="vector_index")
retriever = index.as_retriever()
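Before wiring up the full multimodal engine, you can sanity-check retrieval against the contextual index (the query string here is just an example):
# Retrieve the top chunks for a test query and peek at their contextual summaries
retrieved = retriever.retrieve("Which departments use GenAI the most?")
for nws in retrieved:
    print(nws.node.metadata.get("page_num"), "-", nws.node.metadata.get("context", "")[:120])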
Baseline Index (Without Summaries)
We’ll also build a “baseline” index on the original text nodes (without the contextual summaries) to compare the difference in retrieval quality.
if not os.path.exists("storage_nodes_iconiq_base"):
    base_index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    base_index.set_index_id("vector_index")
    base_index.storage_context.persist("./storage_nodes_iconiq_base")
else:
    storage_context = StorageContext.from_defaults(
        persist_dir="storage_nodes_iconiq_base"
    )
    base_index = load_index_from_storage(storage_context, index_id="vector_index")
Build a Multimodal Query Engine
We want a RAG pipeline that:
- Retrieves relevant chunks of text.
- Also loads the page images.
- Sends both text chunks and images to a multimodal LLM (here, OpenAI's GPT-4o via its vision-capable chat API).
import base64
import openai
import os
from typing import Optional, List
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.base.response.schema import Response
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.prompts import PromptTemplate
from llama_index.core.schema import NodeWithScore, MetadataMode
QA_PROMPT_TMPL = """\
Below we give parsed text from slides, as well as images.
---------------------
{context_str}
---------------------
Given the context information and no prior knowledge, please answer the query:
Query: {query_str}
Answer:
"""
QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)
def encode_image(image_path: str) -> str:
    """If you want to inline a local image in base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
class MultimodalQueryEngine(CustomQueryEngine):
    """
    Custom multimodal query engine that retrieves text nodes,
    then sends them together with the corresponding page images
    to a vision-capable chat model.
    """

    # CustomQueryEngine is a Pydantic model, so we declare fields
    # rather than overriding __init__.
    retriever: BaseRetriever
    model_name: str = "gpt-4o"
    qa_prompt: PromptTemplate = QA_PROMPT

    def custom_query(self, query_str: str) -> Response:
        # 1) Retrieve text nodes
        node_with_scores: List[NodeWithScore] = self.retriever.retrieve(query_str)
        # 2) Build the context string from the retrieved nodes
        context_str = "\n\n".join(
            [nws.node.get_content(metadata_mode=MetadataMode.LLM) for nws in node_with_scores]
        )
        # 3) Format the final prompt
        formatted_prompt_text = self.qa_prompt.format(
            context_str=context_str,
            query_str=query_str,
        )
        # 4) Build the user message with text + images
        user_message_content = [
            {
                "type": "text",
                "text": formatted_prompt_text,
            }
        ]
        for nws in node_with_scores:
            image_path = nws.node.metadata.get("image_path", "")
            if image_path:
                base64_data = encode_image(image_path)
                image_url = f"data:image/jpeg;base64,{base64_data}"
                user_message_content.append(
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                            "detail": "auto",
                        },
                    }
                )
        messages = [
            {
                "role": "user",
                "content": user_message_content,
            }
        ]
        # 5) Call the vision-capable chat model (openai>=1.0 client API;
        #    expects OPENAI_API_KEY to be set in the environment)
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            max_tokens=500,
        )
        # 6) Return a Response object with the retrieved nodes as sources
        return Response(
            response=response.choices[0].message.content,
            source_nodes=node_with_scores,
            metadata={},
        )
# Create query engines over the contextual and baseline indices
query_engine = MultimodalQueryEngine(
retriever=index.as_retriever(similarity_top_k=3),
model_name="gpt-4o", # or "gpt-4o-mini", "gpt-4-turbo", etc.
)
base_query_engine = MultimodalQueryEngine(
retriever=base_index.as_retriever(similarity_top_k=3),
model_name="gpt-4o",
)
Trying Out Queries
Let’s query our new pipeline about AI usage by department.
response = query_engine.query(
"Which departments use GenAI the most and how are they using it?"
)
print(str(response))
A typical response might look like this:
Based on the parsed markdown text provided, the departments/teams that use generative AI the most are:

1. **AI, Machine Learning, and Data Science** with a score of 4.5.
2. **IT** with a score of 4.0.
3. **Engineering / R&D** with a score of 3.9.

These scores are derived from a survey where respondents rated the level of generative AI usage on a scale of 1-5.

In terms of how these departments are using generative AI:

- **AI, Machine Learning, and Data Science**: While specific use cases for this department are not detailed in the provided text, it can be inferred that they are likely using generative AI for advanced data analysis, model development, and enhancing AI capabilities within the organization.
- **IT**: The IT department is using generative AI for several impactful use cases, including:
  - Ticket management
  - Chatbots
  - Customer support and troubleshooting
  - Knowledge management
  - Case summarization

The information about the departments and their use cases comes from the parsed markdown text. There are no discrepancies between the parsed markdown and the context provided, as the markdown text clearly outlines both the departments with the highest usage scores and the specific use cases for the IT department.
Comparatively, if we run the same query on the baseline index:
base_response = base_query_engine.query(
"Which departments use GenAI the most and how are they using it?"
)
print(str(base_response))
You’ll see the baseline might have fewer details or slightly different retrieval results. Contextual retrieval gives more precise context around the IT usage specifically. The response would look like:
Based on the parsed markdown text provided, the departments that use Generative AI (GenAI) the most are:

1. **AI, Machine Learning, and Data Science** - This department has the highest weighted average score of 4.5 for GenAI usage, indicating significant adoption. The specific use cases are not detailed in the parsed text, but given the nature of the department, it is likely involved in developing and refining AI models and algorithms.
2. **IT** - With a score of 4.0, the IT department is also a leading user of GenAI. The use cases for IT include internal productivity enhancements and IT operations, as indicated by the 61% adoption rate for internal productivity and 42% ROI mention in IT use cases.
3. **Engineering / R&D** - This department has a score of 3.9. While specific use cases are not detailed in the parsed text, it is reasonable to infer that GenAI is used for product development and research purposes, as suggested by the 69% adoption rate for core product performance enhancements and 50% for natural language interfaces.

The information is derived from the parsed markdown text, which provides a detailed breakdown of GenAI usage by department and specific use cases. There are no discrepancies between the parsed markdown and the raw text, as the markdown appears to be a structured representation of the same data. The image was not provided, so it was not used in forming the answer.
Observing the Benefits of Contextual Retrieval
Here's another example query. In this case, the two indices retrieve different sources: the contextual index surfaces the slides that actually contain the infrastructure details (figures LlamaParse Premium is able to extract even from charts and graphs), while the baseline index retrieves the section's title slide with little usable content.
query = "what are relevant insights from the 'deep dive on infrastructure' section in terms of model preferences, cost, deployment environments?"
response = query_engine.query(query)
print(str(response))
Output
The "Deep Dive on Infrastructure" section from the ICONIQ Growth report provides
insights into the infrastructure aspects necessary for deploying AI solutions.
However, the parsed markdown text does not explicitly mention model preferences or
costs in this section. Instead, it focuses on infrastructure tooling and deployment
environments.From the parsed markdown text, we can gather the following insights related to
deployment environments:1. **Deployment Environments**: Enterprises are primarily hosting generative AI
workloads on the cloud or using a hybrid approach. The preferred deployment methods
are:
- Cloud: 56%
- Hybrid: 42%
- On-prem: 2%2. **Cloud Service Providers**: The most utilized cloud service providers for
hosting AI workloads are:
- Amazon Web Services (AWS): 68%
- Microsoft Azure: 61%
- Google Cloud (GCP): 40%These insights are derived from the parsed markdown text, specifically from the
sections discussing "Cloud Deployment Method" and "Infrastructure Tooling." There is
no mention of model preferences or cost considerations in the provided text. If
there were any discrepancies or additional details in the image or raw text, they
are not available here, so the answer is based solely on the parsed markdown text
provided.
Now, let's try the same query with the baseline approach:
base_response = base_query_engine.query(query)
print(str(base_response))
Output
The parsed text from the slides does not provide specific insights regarding model preferences, cost, or deployment environments in the 'deep dive on infrastructure' section. The slide titled "Deep Dive on Infrastructure" (page 24) only contains the title, the ICONIQ Growth branding, and confidentiality and copyright notices. There is no detailed information or data presented in the parsed text for this section.
Therefore, based on the parsed markdown text provided, there are no relevant insights available from the ‘deep dive on infrastructure’ section regarding model preferences, cost, or deployment environments. If there were any images associated with this section, they were not provided, and thus no additional insights could be derived from them.
This conclusion is drawn from the parsed markdown text, which lacks any specific information on model preferences, cost, or deployment environments in that section. The image confirms this, as it only shows the title and a graphic without additional details.
If you need insights on these topics, you might want to refer to other sections or slides that specifically address model preferences, costs, or deployment environments.
- Contextual Retrieval might fetch the pages that discuss cloud deployment methods, infrastructure tooling, and cost references, leading to a more thorough response.
- The baseline approach might (in some cases) fail to retrieve the correct chunk or provide less detail.
Comparing both answers helps demonstrate that those short “contextual summaries” in your metadata often lead to more relevant retrieval.
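A simple way to see this difference is to compare which pages each engine actually retrieved for the same query; both response objects were created above, and source_nodes is populated by our custom query engine:
# Compare which pages each pipeline retrieved for the infrastructure query
for label, resp in [("contextual", response), ("baseline", base_response)]:
    pages = [nws.node.metadata.get("page_num") for nws in resp.source_nodes]
    print(f"{label} retrieved pages: {pages}")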
A big thanks to Jerry Liu from LlamaIndex for creating this amazing pipeline.
Conclusion
In this tutorial, we explored the process of parsing a PDF slide deck using LlamaParse to extract both text and images, enriching each text chunk with contextual summaries to enhance retrieval accuracy. We demonstrated how to build a Multimodal RAG pipeline with LlamaIndex, integrating both textual and visual data into a powerful model like GPT-4, showcasing the potential of multimodal LLMs. Finally, we compared results from a baseline index to a contextual index, highlighting the significant improvements in retrieval precision and relevance achieved through the contextual approach. This comprehensive guide equips you with the tools and techniques to build effective multimodal AI solutions.
Key Takeaways
- Contextual retrieval improves chunk matching for queries that might not have a direct keyword overlap.
- Multimodal RAG can incorporate not just text but also images, charts, or diagrams from slides.
- Prompt caching is essential when chunk sizes are large and you’re generating a context summary for each chunk—this can reduce cost significantly.
- If you have web-based content (like store listings, large sets of HTML pages), you can use ScrapeGraphAI to fetch that data, then feed it into the same pipeline.
With these steps, you can adapt the approach to any PDF or external data source—whether it’s a huge enterprise knowledge base, marketing materials, or your company’s internal documentation.
Frequently Asked Questions
Q1. What is Contextual Retrieval?
A. Contextual Retrieval is an approach where each chunk of text in your dataset has a concise summary that situates it within the broader document. This helps your retriever better match relevant chunks—especially for queries that rely on thematic or conceptual overlaps rather than exact keyword matches.
Q2. What makes a RAG pipeline multimodal?
A. In a Multimodal RAG pipeline, you not only retrieve and feed text chunks into the LLM but also related images, audio, or other modalities. This is especially useful when your data sources are slide decks, PDFs with charts, or any materials that mix text with images. It allows the model to reference both textual and visual content for a more comprehensive answer.
Q3. What is LlamaParse and why use it here?
A. LlamaParse is a parsing utility that can extract both text and images from a PDF. Traditional PDF extractors often only get the text or struggle with embedded charts and diagrams. With LlamaParse, you can create "nodes" that include a reference to each PDF page's image file—enabling genuine multimodal retrieval.
Q4. Is building a baseline index mandatory?
A. No, it isn't mandatory, but it's a great way to benchmark the difference. Having a baseline index helps you see how retrieval results change when you add contextual summaries.