Agentic RAG: from retrieve-and-generate to autonomous retrieval control loops
“RAG retrieves. Agentic RAG researches.”
TL;DR
Traditional RAG is a lookup. Agentic RAG is a control loop that plans its own retrieval strategy. The SoK paper (arXiv 2603.07379) formalizes this as a POMDP. A-RAG exposes three retrieval tools (keyword, semantic, chunk read) and outperforms baselines with fewer tokens. For the RAG fundamentals this builds on, see retrieval-augmented generation.

What is wrong with single-step RAG?
Standard RAG has one shot. You embed the user’s question, find the top-K most similar chunks in a vector database, stuff them into the LLM’s context, and generate an answer. If the initial query does not retrieve the right documents, the answer is wrong or incomplete. There is no feedback loop.
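A minimal sketch of that pipeline, assuming hypothetical `embed`, `index`, and `llm` stand-ins for an embedding model, a vector database client, and an LLM client (not any specific library's API):

```python
# Minimal single-step RAG sketch. `embed`, `index`, and `llm` are
# hypothetical stand-ins, not a specific library's API.

def single_step_rag(question: str, index, embed, llm, k: int = 5) -> str:
    query_vector = embed(question)                 # embed the raw question once
    chunks = index.search(query_vector, top_k=k)   # one top-K similarity lookup
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                             # one generation, no feedback loop
```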
This works for simple factoid questions against clean, well-indexed corpora. “What is our refund policy?” matches directly against the refund policy document. One retrieval step suffices.
It fails for complex questions that require synthesis, reasoning across multiple documents, or iterative refinement. “How has our pricing strategy evolved compared to competitors?” might require retrieving pricing history, competitor analysis, strategy meeting notes, and customer feedback — none of which contain the exact phrase “pricing strategy evolved.” The initial embedding-similarity retrieval misses most of these, and the model generates a thin answer from whatever it found.
The failure mode is silent. The model generates confident text based on whatever chunks it received. The user sees a plausible answer. Neither knows that the retrieval missed the most relevant documents.
How does agentic RAG change the retrieval process?
Agentic RAG replaces the single retrieval step with a decision loop. The agent decides what to search for, evaluates results, and decides whether to search again.
```mermaid
graph TD
  A[User question] --> B[Agent: plan retrieval strategy]
  B --> C{Choose retrieval tool}
  C -->|Exact terms needed| D[Keyword search BM25]
  C -->|Conceptual match| E[Semantic search embeddings]
  C -->|Known relevant doc| F[Chunk read direct access]
  D --> G[Evaluate results]
  E --> G
  F --> G
  G --> H{Sufficient evidence?}
  H -->|No: refine query| B
  H -->|No: try different tool| C
  H -->|Yes| I[Generate answer from<br/>accumulated evidence]
```
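As code, the same flow might look like the sketch below. `choose_tool`, `is_sufficient`, and `generate_answer` are hypothetical LLM-backed helpers, and `tools` maps tool names to retrieval backends:

```python
# Sketch of the agentic retrieval control loop from the diagram above.
# choose_tool, is_sufficient, and generate_answer are hypothetical
# LLM-backed helpers; `tools` maps names ("keyword", "semantic",
# "chunk_read") to retrieval backends.

def agentic_rag(question: str, tools: dict, llm, max_steps: int = 5) -> str:
    evidence = []
    query = question
    for _ in range(max_steps):                      # finite retrieval horizon
        tool_name, query = choose_tool(question, query, evidence, llm)
        results = tools[tool_name](query)           # keyword / semantic / chunk read
        evidence.extend(results)
        if is_sufficient(question, evidence, llm):  # explicit stopping decision
            break
        # otherwise loop back: refine the query or switch tools
    return generate_answer(question, evidence, llm)
```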
The loop introduces three capabilities that single-step RAG lacks.
Query refinement. If the first search returns irrelevant results, the agent reformulates. “Pricing strategy evolution” becomes “Q4 2025 pricing changes” — a more specific query informed by what the first search revealed about the knowledge base’s content and structure.
Tool selection. Different information needs require different retrieval methods. Entity names and exact phrases match well with BM25 keyword search. Conceptual queries (“what are the risks of X?”) match better with semantic embedding similarity. When the agent has already found a relevant document and needs more context, direct chunk access provides it.
Stopping criteria. The agent decides when it has enough evidence. This prevents both over-retrieval (expensive, dilutes context) and under-retrieval (misses critical information). The decision is explicit — the agent reasons about whether its accumulated evidence is sufficient to answer the question.
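One way to make that decision explicit is an LLM self-judgment over the accumulated evidence. The `is_sufficient` helper from the loop sketch above might look like this (again a hedged sketch, with `llm` as a hypothetical text-in/text-out client):

```python
def is_sufficient(question: str, evidence: list, llm) -> bool:
    """Ask the model whether the accumulated evidence answers the question.
    `llm` is a hypothetical text-in/text-out client, not a specific SDK."""
    joined = "\n\n".join(item.text for item in evidence)
    verdict = llm(
        f"Question: {question}\n\nEvidence:\n{joined}\n\n"
        "Can the question be answered fully and accurately from this "
        "evidence alone? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```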
What does the POMDP formalization provide?
The SoK paper (arXiv 2603.07379, March 2026) provides the first unified framework for reasoning about agentic retrieval systems. It formalizes agentic RAG as a finite-horizon POMDP — a partially observable Markov decision process.
The mapping:
| POMDP concept | Agentic RAG interpretation |
|---|---|
| State | Full knowledge base + user’s information need |
| Observation | Retrieved documents (partial view of the state) |
| Action | Search query + retrieval tool selection |
| Belief state | Agent’s estimate of what information exists and what it needs |
| Reward | Answer quality (correct, complete, well-sourced) |
| Horizon | Maximum number of retrieval steps before answering |
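In standard notation, this is the finite-horizon POMDP tuple (textbook notation, not notation introduced by the paper):

$$
\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R, H \rangle
$$

with states $\mathcal{S}$, actions $\mathcal{A}$, observations $\Omega$, transition model $T(s' \mid s, a)$, observation model $O(o \mid s', a)$, reward $R$, and horizon $H$.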
Partial observability is the key insight. The agent cannot see the full knowledge base — it only observes the documents returned by each search. It maintains a belief state about what information likely exists in the corpus, updated after each retrieval step. This belief informs the next action: should I search again with a different query? Switch to a different retrieval method? Or stop and answer?
The POMDP formalization lets researchers reason formally about optimal retrieval policies. How many search steps should you budget? When is the expected marginal value of another search below the cost? How should you allocate retrieval budget between exploration (broad queries to find new relevant areas) and exploitation (deep reads of known-relevant documents)?
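The standard belief update and a simple marginal-value stopping rule make these questions precise (textbook POMDP machinery, not results from the paper). After issuing query $a$ and observing returned documents $o$, the belief updates as

$$
b'(s') \propto O(o \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s)
$$

and, with a per-step retrieval cost $c$, another search is worth taking only while its expected marginal value exceeds that cost:

$$
\mathbb{E}_{o}\left[ V(b') \right] - V(b) > c
$$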
The paper introduces a multi-dimensional taxonomy spanning planning strategies (reactive, deliberative, meta-cognitive), retrieval orchestration (single-tool, multi-tool, adaptive), memory paradigms (episodic, semantic, working), and tool coordination patterns.
How does A-RAG use granular retrieval tools?
A-RAG (arXiv 2602.03442) makes agentic retrieval practical by exposing three distinct tools rather than a single “search” interface.
Keyword search. BM25-style term matching. Best for entity names, specific phrases, ID numbers, and exact terminology. Fast and precise when the user’s query contains the same terms as the target document.
Semantic search. Embedding similarity over dense vectors. Best for conceptual queries, paraphrases, and cross-terminology matching (“revenue growth” matching a document about “top-line expansion”). Broader recall than keyword search but less precise on exact terms.
Chunk read. Direct access to a specific section of a known document. When the agent has already found a relevant document via search, chunk read provides surrounding context — the paragraphs before and after the matched section.
The agent chooses which tool to use at each step based on its current information need. For a question about “the Q3 2025 board decision on expansion,” the agent might start with keyword search (“Q3 2025 board decision”), switch to semantic search if that returns nothing (“strategic growth decisions late 2025”), then use chunk read to get full context around the best result.
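A sketch of what that three-tool interface can look like; the function names, backends, and parameters here are illustrative assumptions, not A-RAG's actual implementation:

```python
# Illustrative three-tool retrieval interface in the spirit of A-RAG.
# bm25_index, vector_index, embed, and store are hypothetical backends.

def keyword_search(query: str, bm25_index, k: int = 10) -> list:
    """Exact-term matching: entity names, IDs, specific phrases."""
    return bm25_index.search(query, top_k=k)

def semantic_search(query: str, vector_index, embed, k: int = 10) -> list:
    """Embedding similarity: conceptual queries and paraphrases."""
    return vector_index.search(embed(query), top_k=k)

def chunk_read(doc_id: str, chunk_id: int, store, window: int = 2) -> list:
    """Direct access: a known chunk plus `window` neighbors on each side."""
    return store.get_chunks(doc_id, chunk_id - window, chunk_id + window)
```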
A-RAG consistently outperforms existing agentic RAG approaches across multiple open-domain QA benchmarks while using equal or fewer tokens. The efficiency comes from tool selection — using the right retrieval method for each sub-query avoids the wasted tokens of semantic search on exact-term queries and the missed results of keyword search on conceptual queries.
When should you use agentic RAG?
Use agentic RAG when:
- Questions require synthesis across multiple documents
- Initial queries rarely retrieve the best documents on the first try
- The knowledge base is large, heterogeneous, or poorly indexed
- Answer quality justifies the additional latency of multi-step retrieval
- Users expect sourced, evidence-grounded answers
Standard single-step RAG is sufficient when:
- Questions are factoid with clear entity matches
- The corpus is small and well-curated
- Latency budget is tight (agentic adds 2-5x latency from multiple retrieval steps)
- Retrieval quality is already high (>80% recall on your evaluation set)
The latency trade-off is real. Each retrieval step adds 100-500ms depending on your vector database and embedding model. Three retrieval steps add 300-1500ms. For chatbot applications where users expect sub-second responses, this is significant. For research assistants, document analysis, and back-office processing where accuracy matters more than speed, the trade-off favors agentic RAG.
Key takeaways
- RAG is a lookup. Agentic RAG is a research process. Query refinement, tool selection, and stopping criteria replace the single retrieve-generate step.
- POMDP formalization enables formal reasoning. Partial observability, belief states, and optimal retrieval policies — the SoK paper provides the theory.
- Three tools beat one. Keyword, semantic, and chunk read serve different information needs. A-RAG’s tool selection outperforms single-interface baselines with fewer tokens.
- Latency is the cost. 2-5x more retrieval time for better answers. Acceptable for research and analysis, not for real-time chat.
- Start simple, add agency when retrieval fails. If single-step RAG achieves >80% recall on your evaluation set, the complexity of agentic RAG may not be justified.
Further reading
- Retrieval-augmented generation — the single-step RAG fundamentals
- Vector search for agents — the embedding and ANN infrastructure behind semantic retrieval
- Planning and decomposition — the planning patterns that agentic RAG applies to retrieval