Agentic RAG: from retrieve-and-generate to autonomous retrieval control loops
“RAG retrieves. Agentic RAG researches.”
TL;DR
Traditional RAG is a lookup. Agentic RAG is a control loop that plans its own retrieval strategy. The SoK paper (arXiv 2603.07379) formalizes this as a POMDP. A-RAG exposes three retrieval tools (keyword, semantic, chunk read) and outperforms baselines with fewer tokens. For the RAG fundamentals this builds on, see retrieval-augmented generation.

What is wrong with single-step RAG?
Standard RAG has one shot. You embed the user’s question, find the top-K most similar chunks in a vector database, stuff them into the LLM’s context, and generate an answer. If the initial query does not retrieve the right documents, the answer is wrong or incomplete. There is no feedback loop.
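A minimal sketch of that pipeline, assuming hypothetical `embed`, `index`, and `llm` stand-ins for an embedding model, a vector database client, and an LLM client (not any specific library's API):

```python
# Minimal single-step RAG sketch. `embed`, `index`, and `llm` are
# hypothetical stand-ins, not a specific library's API.

def single_step_rag(question: str, index, embed, llm, k: int = 5) -> str:
    query_vector = embed(question)                 # embed the raw question once
    chunks = index.search(query_vector, top_k=k)   # one top-K similarity lookup
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)                             # one generation, no feedback loop
```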
This works for simple factoid questions against clean, well-indexed corpora. “What is our refund policy?” matches directly against the refund policy document. One retrieval step suffices.
It fails for complex questions that require synthesis, reasoning across multiple documents, or iterative refinement. “How has our pricing strategy evolved compared to competitors?” might require retrieving pricing history, competitor analysis, strategy meeting notes, and customer feedback — none of which contain the exact phrase “pricing strategy evolved.” The initial embedding-similarity retrieval misses most of these, and the model generates a thin answer from whatever it found.
The failure mode is silent. The model generates confident text based on whatever chunks it received. The user sees a plausible answer. Neither knows that the retrieval missed the most relevant documents.
How does agentic RAG change the retrieval process?
Agentic RAG replaces the single retrieval step with a decision loop. The agent decides what to search for, evaluates results, and decides whether to search again.
```mermaid
graph TD
  A[User question] --> B[Agent: plan retrieval strategy]
  B --> C{Choose retrieval tool}
  C -->|Exact terms needed| D[Keyword search BM25]
  C -->|Conceptual match| E[Semantic search embeddings]
  C -->|Known relevant doc| F[Chunk read direct access]
  D --> G[Evaluate results]
  E --> G
  F --> G
  G --> H{Sufficient evidence?}
  H -->|No: refine query| B
  H -->|No: try different tool| C
  H -->|Yes| I[Generate answer from<br/>accumulated evidence]
```
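As code, the same flow might look like the sketch below. `choose_tool`, `is_sufficient`, and `generate_answer` are hypothetical LLM-backed helpers, and `tools` maps tool names to retrieval backends:

```python
# Sketch of the agentic retrieval control loop from the diagram above.
# choose_tool, is_sufficient, and generate_answer are hypothetical
# LLM-backed helpers; `tools` maps names ("keyword", "semantic",
# "chunk_read") to retrieval backends.

def agentic_rag(question: str, tools: dict, llm, max_steps: int = 5) -> str:
    evidence = []
    query = question
    for _ in range(max_steps):                      # finite retrieval horizon
        tool_name, query = choose_tool(question, query, evidence, llm)
        results = tools[tool_name](query)           # keyword / semantic / chunk read
        evidence.extend(results)
        if is_sufficient(question, evidence, llm):  # explicit stopping decision
            break
        # otherwise loop back: refine the query or switch tools
    return generate_answer(question, evidence, llm)
```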
The loop introduces three capabilities that single-step RAG lacks.
Query refinement. If the first search returns irrelevant results, the agent reformulates. “Pricing strategy evolution” becomes “Q4 2025 pricing changes” — a more specific query informed by what the first search revealed about the knowledge base’s content and structure.
Tool selection. Different information needs require different retrieval methods. Entity names and exact phrases match well with BM25 keyword search. Conceptual queries (“what are the risks of X?”) match better with semantic embedding similarity. When the agent has already found a relevant document and needs more context, direct chunk access provides it.
Stopping criteria. The agent decides when it has enough evidence. This prevents both over-retrieval (expensive, dilutes context) and under-retrieval (misses critical information). The decision is explicit — the agent reasons about whether its accumulated evidence is sufficient to answer the question.
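One way to make that decision explicit is an LLM self-judgment over the accumulated evidence. The `is_sufficient` helper from the loop sketch above might look like this (again a hedged sketch, with `llm` as a hypothetical text-in/text-out client):

```python
def is_sufficient(question: str, evidence: list, llm) -> bool:
    """Ask the model whether the accumulated evidence answers the question.
    `llm` is a hypothetical text-in/text-out client, not a specific SDK."""
    joined = "\n\n".join(item.text for item in evidence)
    verdict = llm(
        f"Question: {question}\n\nEvidence:\n{joined}\n\n"
        "Can the question be answered fully and accurately from this "
        "evidence alone? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```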
What does the POMDP formalization provide?
The SoK paper (arXiv 2603.07379, March 2026) provides the first unified framework for reasoning about agentic retrieval systems. It formalizes agentic RAG as a finite-horizon POMDP — a partially observable Markov decision process.
The mapping:
| POMDP concept | Agentic RAG interpretation |
|---|---|
| State | Full knowledge base + user’s information need |
| Observation | Retrieved documents (partial view of the state) |
| Action | Search query + retrieval tool selection |
| Belief state | Agent’s estimate of what information exists and what it needs |
| Reward | Answer quality (correct, complete, well-sourced) |
| Horizon | Maximum number of retrieval steps before answering |
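In standard notation, this is the finite-horizon POMDP tuple (textbook notation, not notation introduced by the paper):

$$
\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R, H \rangle
$$

with states $\mathcal{S}$, actions $\mathcal{A}$, observations $\Omega$, transition model $T(s' \mid s, a)$, observation model $O(o \mid s', a)$, reward $R$, and horizon $H$.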
Partial observability is the key insight. The agent cannot see the full knowledge base — it only observes the documents returned by each search. It maintains a belief state about what information likely exists in the corpus, updated after each retrieval step. This belief informs the next action: should I search again with a different query? Switch to a different retrieval method? Or stop and answer?
The POMDP formalization lets researchers reason formally about optimal retrieval policies. How many search steps should you budget? When is the expected marginal value of another search below the cost? How should you allocate retrieval budget between exploration (broad queries to find new relevant areas) and exploitation (deep reads of known-relevant documents)?
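The standard belief update and a simple marginal-value stopping rule make these questions precise (textbook POMDP machinery, not results from the paper). After issuing query $a$ and observing returned documents $o$, the belief updates as

$$
b'(s') \propto O(o \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s)
$$

and, with a per-step retrieval cost $c$, another search is worth taking only while its expected marginal value exceeds that cost:

$$
\mathbb{E}_{o}\left[ V(b') \right] - V(b) > c
$$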
The paper introduces a multi-dimensional taxonomy spanning planning strategies (reactive, deliberative, meta-cognitive), retrieval orchestration (single-tool, multi-tool, adaptive), memory paradigms (episodic, semantic, working), and tool coordination patterns.
How does A-RAG use granular retrieval tools?
A-RAG (arXiv 2602.03442) makes agentic retrieval practical by exposing three distinct tools rather than a single “search” interface.
Keyword search. BM25-style term matching. Best for entity names, specific phrases, ID numbers, and exact terminology. Fast and precise when the user’s query contains the same terms as the target document.
Semantic search. Embedding similarity over dense vectors. Best for conceptual queries, paraphrases, and cross-terminology matching (“revenue growth” matching a document about “top-line expansion”). Broader recall than keyword search but less precise on exact terms.
Chunk read. Direct access to a specific section of a known document. When the agent has already found a relevant document via search, chunk read provides surrounding context — the paragraphs before and after the matched section.
The agent chooses which tool to use at each step based on its current information need. For a question about “the Q3 2025 board decision on expansion,” the agent might start with keyword search (“Q3 2025 board decision”), switch to semantic search if that returns nothing (“strategic growth decisions late 2025”), then use chunk read to get full context around the best result.
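A sketch of what that three-tool interface can look like; the function names, backends, and parameters here are illustrative assumptions, not A-RAG's actual implementation:

```python
# Illustrative three-tool retrieval interface in the spirit of A-RAG.
# bm25_index, vector_index, embed, and store are hypothetical backends.

def keyword_search(query: str, bm25_index, k: int = 10) -> list:
    """Exact-term matching: entity names, IDs, specific phrases."""
    return bm25_index.search(query, top_k=k)

def semantic_search(query: str, vector_index, embed, k: int = 10) -> list:
    """Embedding similarity: conceptual queries and paraphrases."""
    return vector_index.search(embed(query), top_k=k)

def chunk_read(doc_id: str, chunk_id: int, store, window: int = 2) -> list:
    """Direct access: a known chunk plus `window` neighbors on each side."""
    return store.get_chunks(doc_id, chunk_id - window, chunk_id + window)
```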
A-RAG consistently outperforms existing agentic RAG approaches across multiple open-domain QA benchmarks while using equal or fewer tokens. The efficiency comes from tool selection — using the right retrieval method for each sub-query avoids the wasted tokens of semantic search on exact-term queries and the missed results of keyword search on conceptual queries.
When should you use agentic RAG?
Use agentic RAG when:
- Questions require synthesis across multiple documents
- Initial queries rarely retrieve the best documents on the first try
- The knowledge base is large, heterogeneous, or poorly indexed
- Answer quality justifies the additional latency of multi-step retrieval
- Users expect sourced, evidence-grounded answers
Standard single-step RAG is sufficient when:
- Questions are factoid with clear entity matches
- The corpus is small and well-curated
- Latency budget is tight (agentic adds 2-5x latency from multiple retrieval steps)
- Retrieval quality is already high (>80% recall on your evaluation set)
The latency trade-off is real. Each retrieval step adds 100-500ms depending on your vector database and embedding model. Three retrieval steps add 300-1500ms. For chatbot applications where users expect sub-second responses, this is significant. For research assistants, document analysis, and back-office processing where accuracy matters more than speed, the trade-off favors agentic RAG.
Key takeaways
- RAG is a lookup. Agentic RAG is a research process. Query refinement, tool selection, and stopping criteria replace the single retrieve-generate step.
- POMDP formalization enables formal reasoning. Partial observability, belief states, and optimal retrieval policies — the SoK paper provides the theory.
- Three tools beat one. Keyword, semantic, and chunk read serve different information needs. A-RAG’s tool selection outperforms single-interface baselines with fewer tokens.
- Latency is the cost. 2-5x more retrieval time for better answers. Acceptable for research and analysis, not for real-time chat.
- Start simple, add agency when retrieval fails. If single-step RAG achieves >80% recall on your evaluation set, the complexity of agentic RAG may not be justified.
Further reading
- Retrieval-augmented generation — the single-step RAG fundamentals
- Vector search for agents — the embedding and ANN infrastructure behind semantic retrieval
- Planning and decomposition — the planning patterns that agentic RAG applies to retrieval