
How to Build Agentic QA RAG System Using Haystack Framework


Imagine you are building a customer support AI that needs to answer questions about your product. Sometimes it needs to pull information from your documentation, while other times it needs to search the web for the latest updates. Agentic RAG systems come in handy for such complex AI applications. Think of them as smart research assistants who not only know your internal documentation but also decide when to search the web. In this guide, we will walk through the process of building an agentic QA RAG system using the Haystack framework.

Learning Objectives

  • Know what an agentic LLM is and understand how it differs from a RAG system.
  • Become familiar with the Haystack framework for agentic LLM applications.
  • Understand the process of building prompts from templates and learn how to join different prompts together.
  • Learn how to create embeddings using ChromaDB in Haystack.
  • Learn how to set up a complete local development system, from embedding to generation.

This article was published as a part of the Data Science Blogathon.

What is an Agentic LLM?

An agentic LLM is an AI system that can autonomously make decisions and take actions based on its understanding of the task. Unlike traditional LLMs that mainly generate text responses, an agentic LLM can do a lot more. It can think, plan, and act with minimal human input. It assesses its knowledge, recognizing when it needs more information or external tools. Agentic LLMs don’t rely on static data or indexed knowledge alone; instead, they decide which sources to trust and how to gather the best insights.

This type of system can also pick the right tools for the job. It can decide when it needs to retrieve documents, run calculations, or automate tasks. What sets agentic LLMs apart is their ability to break down complex problems into steps and execute them independently, which makes them valuable for research, analysis, and workflow automation.

RAG vs Agentic RAG

Traditional RAG systems follow a linear process. When a query is received, the system first identifies the key elements within the request. It then searches the knowledge base, scanning for relevant information that can help design an accurate response. Once the relevant information or data is retrieved, the system processes it to generate a meaningful and contextually relevant response.

You can understand the process easily from the diagram below.

How RAG works
Source: Author

Now, an agentic RAG system enhances this process by:

  • Evaluating query requirements
  • Deciding between multiple knowledge sources
  • Potentially combining information from different sources
  • Making autonomous decisions about response strategy
  • Providing source-attributed responses

The key difference lies in the system’s ability to make intelligent decisions about how to handle queries, rather than following a fixed retrieval-generation pattern.
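
To make the distinction concrete, below is a purely illustrative sketch of the extra decision step an agentic RAG system adds on top of the fixed retrieve-then-generate flow. The helper functions (retrieve_from_kb, web_search, generate) are hypothetical placeholders, not Haystack APIs; our actual implementation of this logic comes later with the ConditionalRouter.

# Illustrative pseudologic only -- retrieve_from_kb, web_search, and generate
# are hypothetical placeholders, not part of any framework.
def traditional_rag(query):
    docs = retrieve_from_kb(query)      # always the same fixed path
    return generate(query, docs)

def agentic_rag(query):
    docs = retrieve_from_kb(query)
    answer = generate(query, docs)
    if answer == "no_answer":           # the system judges its own answer...
        docs = web_search(query)        # ...and decides to reach for another tool
        answer = generate(query, docs)
    return answer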

Understanding Haystack Framework Components

Haystack is an open-source framework for building production-ready AI and LLM applications, RAG pipelines, and search systems. It offers a powerful and flexible foundation for building LLM applications. It allows you to integrate models from various platforms such as Hugging Face, OpenAI, Cohere, Mistral, and local Ollama. You can also deploy models on cloud services like AWS SageMaker, Bedrock, Azure, and GCP.

Haystack provides robust document stores for efficient data management. It also comes with a comprehensive set of tools for evaluation, monitoring, and data integration, which ensures smooth performance across all layers of your application. It also has a strong community that regularly contributes integrations for new service providers.

What Can You Build Using Haystack?

  • Simple to advanced RAG on your data, using robust retrieval and generation techniques.
  • Chatbots and agents using up-to-date GenAI models like GPT-4, Llama3.2, and DeepSeek-R1.
  • Generative multimodal question answering over a mixed-type knowledge base (images, text, audio, and tables).
  • Information extraction from documents or building knowledge graphs.

Haystack Building Blocks

Haystack has two primary concepts for building fully functional GenAI LLM systems – components and pipelines. Let’s understand them with a simple example of RAG on Japanese anime characters.

Components

Components are the core building blocks of Haystack. They can perform tasks such as document storing, document retrieval, text generation, and embedding. Haystack has many components you can use directly after installation; it also provides APIs for making your own components by writing a Python class.

There is also a collection of integrations from partner companies and the community.

Install Libraries and set Ollama

$ pip install haystack-ai ollama-haystack

# On your system, download Ollama and pull the LLMs

ollama pull llama3.2:3b

ollama pull nomic-embed-text


# And then start ollama server
ollama serve

Import some components

from haystack import Document, Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.generators.ollama import OllamaGenerator

Create a document and document store

document_store = InMemoryDocumentStore()
documents = [
    Document(
        content="Naruto Uzumaki is a ninja from the Hidden Leaf Village and aspires to become Hokage."
    ),
    Document(
        content="Luffy is the captain of the Straw Hat Pirates and dreams of finding the One Piece."
    ),
    Document(
        content="Goku, a Saiyan warrior, has defended Earth from numerous powerful enemies like Frieza and Cell."
    ),
    Document(
        content="Light Yagami finds a mysterious Death Note, which allows him to eliminate people by writing their names."
    ),
    Document(
        content="Levi Ackerman is humanity’s strongest soldier, fighting against the Titans to protect mankind."
    ),
]
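
Before the retriever can search these documents, they need to be written into the store; InMemoryDocumentStore provides write_documents for this:

# Write the toy documents into the in-memory store so the retriever can search them
document_store.write_documents(documents)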

Pipeline

Pipelines are the backbone of Haystack’s framework. They define the flow of data between different components. A pipeline is essentially a directed acyclic graph (DAG): a component’s outputs are connected to other components’ inputs, and a single component can have multiple outputs and inputs.

You can define a pipeline like this (the template variable used by the PromptBuilder is the prompt template we create a little later):

pipe = Pipeline()

pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component(
    "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434")
)
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

You can visualize the pipeline

image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
pipe.show(params=image_param)

The pipeline provides:

  • Modular workflow management
  • Flexible components arrangement
  • Easy debugging and monitoring
  • Scalable processing architecture

Nodes

Nodes are the basic processing units that can be connected in a pipeline; these nodes are the components that perform specific tasks.

Examples of nodes from the above pipeline

pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component(
    "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434")
)

Connection Graph

The connection graph defines how components interact.

From the above pipeline, you can visualize the connection graph.

image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
pipe.show(params=image_param)

The connection graph of the anime pipeline


This graph structure:

  • Defines data flow between components
  • Manages input/output relationships
  • Enables parallel processing where possible
  • Creates flexible processing pathways.

Now we can query our anime knowledge base using the prompt.

Create a prompt template

template = """
Given only the following information, answer the question.
Ignore your own knowledge.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}?
"""

This prompt will produce an answer using only information from the document store.

Query using prompt and retriever

query = "How Goku eliminate people?"
response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}})
print(response["llm"]["replies"])

Response:

RAG response

This RAG is simple yet conceptually valuable for newcomers. Now that we have understood most of the concepts of the Haystack framework, we can dive deep into our main project. If anything new comes up, I will explain it along the way.

Question-Answer RAG Project for Higher Secondary Physics

We will build an NCERT Physics books-based question-answer RAG for higher secondary students. It will answer queries by taking information from the NCERT books, and if the information is not there, it will search the web to get it.
For this, I will use:

  • Local Llama3.2:3b or Llama3.2:1b
  • ChromaDB for embedding storage
  • Nomic Embed Text model for local embedding
  • DuckDuckGo search for web search or Tavily Search (optional)

The whole setup is free and fully local.

Setting Up the Developer Environment

We will set up a conda environment with Python 3.12.

$conda create --name agenticlm python=3.12

$conda activate agenticlm

Install Necessary Package

$pip install haystack-ai ollama-haystack pypdf

$pip install chroma-haystack duckduckgo-api-haystack

Now create a project directory named qagent.

$ mkdir qagent # create dir

$cd qagent # change to dir

$ code .   # open folder in vscode

You can use plain Python files or a Jupyter Notebook for the project; it does not matter. I will use a plain Python file.

Create a main.py file in the project root.

Importing Necessary Libraries

  • System packages
  • Core haystack components
  • ChromaDB for embedding components
  • Ollama Components for Local Inferences
  • And Duckduckgo for web search
# System packages
import os
from pathlib import Path
# Core haystack components
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.joiners import BranchJoiner
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.converters import PyPDFToDocument
from haystack.components.routers import ConditionalRouter
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
# ChromaDB integration
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import (
    ChromaEmbeddingRetriever,
)
# Ollama integration
from haystack_integrations.components.embedders.ollama.document_embedder import (
    OllamaDocumentEmbedder,
)
from haystack_integrations.components.embedders.ollama.text_embedder import (
    OllamaTextEmbedder,
)
from haystack_integrations.components.generators.ollama import OllamaGenerator
# Duckduckgo search integration
from duckduckgo_api_haystack import DuckduckgoApiWebSearch

Creating a Document Store

The document store is the most important component here; it is where we store our embeddings for retrieval. We use ChromaDB as the embedding store. As you saw in the earlier example, we used InMemoryDocumentStore for fast retrieval because our data was tiny, but for a robust retrieval system we don’t rely on the in-memory store: it hogs memory, and we would have to create the embeddings every time we start the system.

The solution is a vector database such as Pinecone, Weaviate, Postgres vector DB, or ChromaDB. I use ChromaDB because it is free, open-source, easy to use, and robust.

# Chroma DB integration component for document(embedding) store

document_store = ChromaDocumentStore(persist_path="qagent/embeddings")

persist_path is where you want to store your embeddings.

PDF files path

HERE = Path(__file__).resolve().parent
file_path = [HERE / "data" / name for name in os.listdir(HERE / "data")]

It creates a list of file paths from the data folder, which contains our PDF files.

Document Preprocessing Components

We will use Haystack’s built-in document preprocessor such as cleaner, splitter, and file converter, and then use a writer to write the data into the store.

Cleaner: It removes extra whitespace, repeated lines, empty lines, etc. from the documents.

cleaner = DocumentCleaner()

Splitter: It splits the documents in various ways, such as by words, sentences, paragraphs, or pages.

splitter = DocumentSplitter()
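
The defaults work here, but DocumentSplitter also accepts parameters that control the chunking strategy. If you want to set them explicitly, it would look like the line below (the values are illustrative, not tuned for the NCERT PDFs):

# Explicit chunking settings (illustrative values, not tuned)
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=20)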

File Converter: It uses pypdf to convert the PDFs into Haystack documents.

file_converter = PyPDFToDocument()

Writer: It stores the documents in the document store; with the overwrite policy, duplicate documents replace the existing ones.

writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

Now set the embedder for document indexing.

Embedder: Nomic Embed Text

We will use the nomic-embed-text embedder, which is very effective and freely available on Hugging Face and Ollama.

Before you run your indexing pipeline, open your terminal and type the commands below to pull the nomic-embed-text and llama3.2:3b models from the Ollama model store:

$ ollama pull nomic-embed-text

$ ollama pull llama3.2:3b

Then start Ollama by typing the command ollama serve in your terminal.

Now the embedder component:

embedder = OllamaDocumentEmbedder(
    model="nomic-embed-text", url="http://localhost:11434"
)

We use the OllamaDocumentEmbedder component for embedding documents; if you want to embed a plain text string, you have to use OllamaTextEmbedder instead.
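
As a quick sketch of the difference, embedding a raw query string with the text embedder looks like this (the query is just an example):

# Embedding a plain string with the text embedder (this is what the query pipeline does later)
text_embedder = OllamaTextEmbedder(model="nomic-embed-text", url="http://localhost:11434")
result = text_embedder.run(text="What is resistivity?")
# result["embedding"] holds the query vector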

Creating Indexing Pipeline

Like our previous toy RAG example, we will start by initiating the Pipeline class.

indexing_pipeline = Pipeline()

Now we will add the components to our pipeline one by one

indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("converter", file_converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

The order in which you add components to the pipeline does not matter, so you can add them in any order; it is the connections that matter.

Connecting Components to the Pipeline Graph

indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

Here, order matters, because how you connect the components tells the pipeline how the data will flow through it. It is like plumbing: it doesn’t matter in which order or from where you buy your fittings, but how you put them together decides whether you get water or not.

The converter converts the PDFs and sends them to the cleaner for cleaning. The cleaner then sends the cleaned documents to the splitter for chunking. Those chunks pass to the embedder for vectorization, and finally the embedder hands the embeddings over to the writer for storage.

Got it? OK, let me give you a visual graph of the indexing pipeline so you can inspect the data flow.

Draw Indexing Pipeline

image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}

indexing_pipeline.draw("indexing_pipeline.png", params=image_param)  # type: ignore

Yes, you can easily create a nice Mermaid graph from a Haystack pipeline.

Graph of Indexing Pipeline


I assume you have now fully grasped the idea behind the Haystack pipeline. Give your thanks to the plumber.

Implement a Router

Now, we need to create a router to route the data along different paths. In this case, we’ll use a conditional router, which will do our routing job based on certain conditions. The conditional router evaluates conditions based on component output. It directs data flow through different pipeline branches, which enables dynamic decision-making. It also supports robust fallback strategies.

# Conditions for routing
routes = [
    {
        "condition": "{{'no_answer' in replies[0]}}",
        "output": "{{query}}",
        "output_name": "go_to_websearch",
        "output_type": str,
    },
    {
        "condition": "{{'no_answer' not in replies[0]}}",
        "output": "{{replies[0]}}",
        "output_name": "answer",
        "output_type": str,
    },
]


# router component

router = ConditionalRouter(routes=routes)

When the system gets a no_answer reply from the LLM (based on the embedding store context), it goes to the web search tool to collect relevant data from the internet.
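
You can sanity-check the routing logic in isolation before wiring it into the pipeline; a quick test might look like this (the reply strings are made up for illustration):

# A "no_answer" reply should route the query to web search...
print(router.run(replies=["no_answer"], query="What is Photosynthesis?"))
# {'go_to_websearch': 'What is Photosynthesis?'}

# ...while any other reply should come out as the final answer.
print(router.run(replies=["FROM THE KNOWLEDGE BASE: Resistivity is ..."], query="What is resistivity?"))
# {'answer': 'FROM THE KNOWLEDGE BASE: Resistivity is ...'}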

For web search, we can use the DuckDuckGo API or Tavily; here, I have used DuckDuckGo.

websearch = DuckduckgoApiWebSearch(top_k=5)

OK, most of the heavy lifting has been done. Now it is time for prompt engineering.

Create Prompt Templates

We will use the Haystack PromptBuilder component to build prompts from templates.

First, we will create a prompt for QA:

template_qa = """
Given ONLY the following information, answer the question.
If the answer is not contained within the documents, reply with "no_answer".
If the answer is contained within the documents, start the answer with "FROM THE KNOWLEDGE BASE: ".

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}?

"""

It takes the context from the documents and tries to answer the question. If it does not find relevant context in the documents, it replies with no_answer.

Now, for the second prompt: after getting no_answer from the LLM, the system will use the web search tool to gather context from the internet.

DuckDuckGo prompt template

template_websearch = """
Answer the following query given the documents retrieved from the web.
Start the answer with "FROM THE WEB: ".

Documents:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Query: {{query}}

"""

It enables the system to go to web search and try to answer the query.

Creating prompts using PromptBuilder from Haystack

prompt_qa = PromptBuilder(template=template_qa)

prompt_builder_websearch = PromptBuilder(template=template_websearch)
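
If you want to see what a builder actually produces, you can run it standalone with a dummy document; the rendered prompt comes back as a plain string under the prompt key (the document content here is just an example):

# Standalone check of the QA prompt builder with a dummy document
from haystack import Document

preview = prompt_qa.run(
    documents=[Document(content="Resistivity is a property of the material.")],
    query="What is resistivity",
)
print(preview["prompt"])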

We will use Haystack’s BranchJoiner to join the two prompt branches together.

prompt_joiner = BranchJoiner(str)

Implement Query Pipeline

The query pipeline will embed the query, gather contextual resources from the embedding store, and answer our query using the LLM or the web search tool.

It is similar to the indexing pipeline.

Initiating Pipeline

query_pipeline = Pipeline()

Adding components to the query pipeline

query_pipeline.add_component("text_embedder", OllamaTextEmbedder())
query_pipeline.add_component(
    "retriever", ChromaEmbeddingRetriever(document_store=document_store)
)
query_pipeline.add_component("prompt_builder", prompt_qa)
query_pipeline.add_component("prompt_joiner", prompt_joiner)
query_pipeline.add_component(
    "llm",
    OllamaGenerator(model="llama3.2:3b", timeout=500, url="http://localhost:11434"),
)
query_pipeline.add_component("router", router)
query_pipeline.add_component("websearch", websearch)
query_pipeline.add_component("prompt_builder_websearch", prompt_builder_websearch)

Here, for LLM generation we use the OllamaGenerator component to generate answers using llama3.2:3b or 1b, or whatever LLM you like that supports tool calling.

Connecting all the components together for query flow and answer generation

query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "prompt_joiner")
query_pipeline.connect("prompt_joiner", "llm")
query_pipeline.connect("llm.replies", "router.replies")
query_pipeline.connect("router.go_to_websearch", "websearch.query")
query_pipeline.connect("router.go_to_websearch", "prompt_builder_websearch.query")
query_pipeline.connect("websearch.documents", "prompt_builder_websearch.documents")
query_pipeline.connect("prompt_builder_websearch", "prompt_joiner")

In summary of the above connections:

  1. The embedding from the text_embedder is sent to the retriever’s query_embedding.
  2. The retriever sends the retrieved documents to the prompt_builder’s documents.
  3. The prompt builder’s prompt goes to the prompt joiner, where it is joined with the other prompt branch.
  4. The prompt joiner passes the prompt to the llm for generation.
  5. The LLM’s replies go to the router, which checks whether the reply contains no_answer. If it does, the query goes to the web search module.
  6. The router sends the query to the web search as its query input.
  7. The web search’s retrieved documents are sent to the web search prompt builder’s documents.
  8. The web search prompt builder sends its prompt to the prompt joiner.
  9. And the prompt joiner sends the prompt to the LLM for answer generation.

Why not see for yourself?

Draw Query Pipeline Graph

query_pipeline.draw("agentic_qa_pipeline.png", params=image_param)  # type: ignore

Query Graph


I know it is a huge graph, but it shows you exactly what is going on in the belly of the beast.

Now it is time to enjoy the fruit of our hard work.

Create a function for easy querying.

def get_answer(query: str):
    response = query_pipeline.run(
        {
            "text_embedder": {"text": query},
            "prompt_builder": {"query": query},
            "router": {"query": query},
        }
    )
    return response["router"]["answer"]

It is a simple function for answer generation.

Now run your main script to index the NCERT physics book:

indexing_pipeline.run({"converter": {"sources": file_path}})

It is a one-time job; after indexing, you should comment out this line, otherwise it will start re-indexing the books on every run.
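
Alternatively, instead of commenting the line out by hand, you can guard it with a document count check so indexing runs only when the Chroma store is still empty (a small convenience sketch, assuming the store starts out empty on the first run):

# Index only if the Chroma store is empty, to avoid re-embedding on every run
if document_store.count_documents() == 0:
    indexing_pipeline.run({"converter": {"sources": file_path}})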

At the bottom of the file, we write our driver code for the query:

if __name__ == "__main__":
    query = "Give me 5 MCQ on resistivity?"
    print(get_answer(query))

MCQ on resistivity from the book’s knowledge

RAG system response

Another question that is not in the book

if __name__ == "__main__":
    query = "What is Photosynthesis?"
    print(get_answer(query))

Output

Output by RAG model

Let’s try another question.

if __name__ == "__main__":
    query = (
        "Tell me what is DRIFT OF ELECTRONS AND THE ORIGIN OF RESISTIVITY from the book"
    )
    print(get_answer(query))

Output

So, it’s working! We can embed more data, books, or PDFs, which will generate more context-aware answers. Also, LLMs such as GPT-4o, Anthropic’s Claude, or other cloud LLMs will do the job even better.

Conclusion

Our agentic RAG system demonstrates the flexibility and robustness of the Haystack framework and its power of combining components and pipelines. This RAG can be made production-ready by deploying it to a web service platform and by using better paid LLMs such as those from OpenAI or Anthropic. You can build a UI using Streamlit or a React-based web SPA for a better user experience.

You can find all the code used in the article here.

Key Takeaways

  • Agentic RAG systems provide more intelligent and flexible responses than traditional RAG.
  • Haystack’s pipeline architecture enables complex, modular workflows.
  • Routers enable dynamic decision-making in response generation.
  • Connection graphs provide flexible and maintainable component interactions.
  • Integration of multiple knowledge sources enhances response quality.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. How does the system handle unknown queries?

A. The system uses its router component to automatically fall back to web search when local knowledge is insufficient, ensuring comprehensive coverage.

Q2. What advantages does the pipeline architecture offer?

A. The pipeline architecture enables modular development, easy testing, and flexible component arrangement, making the system maintainable and extensible.

Q3. How does the connection graph enhance system functionality?

A. The connection graph enables complex data flows and parallel processing, improving system efficiency and flexibility in handling different types of queries.

Q4. Can I use other LLM APIs?

A. Yes, it is very easy: just install the necessary integration package for the respective LLM API, such as Gemini, Anthropic, or Groq, and use it with your API key.

A self-taught, project-driven learner, I love to work on complex projects in deep learning, computer vision, and NLP. I always try to get a deep understanding of the topic, whether it is deep learning, machine learning, or physics. I love to create content on what I learn and try to share my understanding with the world.
