TL;DR
Production RAG fails for reasons that have nothing to do with the model. Naive fixed-size chunking misses answer spans. Retrieval quality and generation quality are measured as one metric when they need separate evals. Out-of-domain queries return confidently wrong answers. There are no guardrails. This piece covers the full stack: what breaks, how to measure it, and the architecture that makes agentic retrieval trustworthy at scale.
RAG is now the most deployed agentic AI pattern in enterprise — a natural fit for the large volumes of internal documents, policies, and data that enterprises sit on. And it's the pattern that most consistently fails quietly in production.
Demo RAG is a 3-step pipeline: embed the query, retrieve top-K chunks, generate an answer. It works because the demo uses clean queries, well-formatted documents, and manually curated test cases. Production RAG gets none of those guarantees.
What Agentic RAG Actually Means
Before getting into failure modes: the distinction between static RAG and agentic RAG matters.
Static RAG runs a fixed pipeline: query → embed → retrieve → generate. The retrieval step happens once. The query goes in as-is. This works for simple, single-document lookups.
Agentic RAG uses an agent to control retrieval. The agent can:
- Decompose a complex query into sub-queries
- Decide to retrieve from multiple sources in sequence
- Evaluate retrieved context and decide to retrieve more if it's insufficient
- Reformulate queries based on initial retrieval results
- Choose between retrieval tools (vector search vs keyword vs SQL) per sub-query
Agentic RAG handles the complex queries that enterprise users actually ask. But it also adds more surface area for failures — each agent decision is a potential failure point.
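To make the planning step concrete, here is a minimal sketch of query decomposition plus per-sub-query tool choice. Everything in it is an assumption for illustration: `decompose`, `choose_tool`, and the keyword rules are hypothetical stand-ins for decisions a real agent would delegate to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuery:
    text: str
    tool: str  # one of "vector", "keyword", "sql"


def decompose(query: str) -> List[str]:
    """Hypothetical decomposer: split on ' and '. A real agent would ask an LLM."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def choose_tool(sub_query: str) -> str:
    """Hypothetical router: aggregates go to SQL, identifiers to keyword search."""
    lowered = sub_query.lower()
    if any(phrase in lowered for phrase in ("how many", "total", "average")):
        return "sql"
    if any(ch.isdigit() for ch in sub_query):
        return "keyword"
    return "vector"


def plan(query: str) -> List[SubQuery]:
    """Turn one user query into routed sub-queries."""
    return [SubQuery(text=q, tool=choose_tool(q)) for q in decompose(query)]


def run(query: str, tools: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Execute each sub-query against the tool the planner picked."""
    context: List[str] = []
    for sub in plan(query):
        context.extend(tools[sub.tool](sub.text))
    return context
```

Each branch in `choose_tool` is one of the agent decisions described above, which is also why each one is worth logging: a wrong route looks like a retrieval failure downstream.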
Failure Mode 1: Chunking Mismatch
Naive chunking splits documents at fixed token boundaries — every 512 tokens, with no regard for semantic coherence. This is fast to implement and often adequate for demos. It's a consistent source of retrieval failures in production.
The problem: answers are often found at semantic boundaries that naive chunking splits across. A policy document might have a question answered by a sentence at the end of one chunk and its context at the start of the next — neither chunk alone returns the right answer.
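# Naive: fixed-size chunks, no overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # counted in characters unless a token-based length_function is supplied
    chunk_overlap=0,
)
chunks = text_splitter.split_text(document)

Fast. Breaks semantic units. Poor retrieval on complex documents.

# Better: boundary-aware separators plus overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=128,
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs first, then lines, sentences, words
)
chunks = text_splitter.split_text(document)

Respects paragraph/sentence boundaries. Overlap captures cross-boundary answers.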
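The boundary problem is easy to reproduce without any library. The sketch below uses a plain character window rather than a real splitter; `fixed_chunks`, the sample policy text, and the sizes are all made up for illustration.

```python
from typing import List


def fixed_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


document = (
    "Remote work requires manager approval. "
    "Approval must be renewed every 12 months."
)
# The span a retriever needs intact to answer "does approval expire?"
answer_span = "approval. Approval must be renewed"

no_overlap = fixed_chunks(document, size=50, overlap=0)
with_overlap = fixed_chunks(document, size=50, overlap=25)

# Without overlap the span straddles the cut at character 50.
found_without_overlap = any(answer_span in chunk for chunk in no_overlap)
# With overlap, the window starting at character 25 contains it whole.
found_with_overlap = any(answer_span in chunk for chunk in with_overlap)
```

Overlap is not a guarantee: with this windowing, a span is only certain to land inside a single chunk when it is no longer than the overlap plus one character. Longer spans survive by luck of alignment, which is the argument for splitting on semantic boundaries as well.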
For complex enterprise documents such as policy manuals, technical specs, and legal contracts, semantic chunking (splitting on paragraph and sentence boundaries rather than at a fixed token count) tends to reduce retrieval failures, because an answer and its surrounding context stay in the same chunk. For tables and structured data, chunk separately and attach the schema (table name, column names) to each chunk as context.
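For the table case, one possible shape is to chunk by rows and repeat the schema in every chunk, so a row retrieved in isolation still says what its columns mean. This is a sketch under assumptions: `table_to_chunks`, the text layout, and the metadata keys are hypothetical, not a standard format.

```python
from typing import Dict, List


def table_to_chunks(
    table_name: str,
    columns: List[str],
    rows: List[List[str]],
    rows_per_chunk: int = 2,
) -> List[Dict[str, object]]:
    """Chunk a table by rows and repeat the schema header in every chunk."""
    header = f"Table: {table_name}\nColumns: {', '.join(columns)}"
    chunks: List[Dict[str, object]] = []
    for start in range(0, len(rows), rows_per_chunk):
        batch = rows[start:start + rows_per_chunk]
        # Render each row as "col=value" pairs so values keep their column names.
        body = "\n".join(
            "; ".join(f"{col}={val}" for col, val in zip(columns, row))
            for row in batch
        )
        chunks.append({
            "text": f"{header}\n{body}",
            "metadata": {"table": table_name, "columns": columns, "row_start": start},
        })
    return chunks
```

The repeated header costs a few tokens per chunk; the trade is that the embedding of each chunk carries the column names, not just bare values.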
Failure Mode 2: Retrieval ≠ Generation Quality
The most widespread measurement error in RAG systems: teams measure whether the final answer is correct, and attribute failures entirely to the generation step. In practice, retrieval quality and generation quality are independent failure modes that require different fixes.
Measuring these separately requires separate eval datasets. For retrieval: a dataset of queries with known relevant documents. For generation: a dataset of (query, retrieved context, expected answer) triples. RAGAS provides automated metrics for both.
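As a rough illustration of what "separate" means in practice, the sketch below shows the two dataset shapes and a simple rule for attributing a wrong answer to one stage. The class names, fields, and `attribute_failure` are hypothetical; they are not RAGAS types.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class RetrievalExample:
    query: str
    relevant_ids: Set[str]  # documents a correct answer depends on


@dataclass
class GenerationExample:
    query: str
    context: str  # fixed, known-good context, so retrieval is out of the picture
    expected_answer: str


def attribute_failure(
    relevant_ids: Set[str],
    retrieved_ids: List[str],
    answer_correct: bool,
) -> str:
    """Label one end-to-end result as 'ok', 'retrieval', or 'generation'."""
    if answer_correct:
        return "ok"
    # If any required document is missing, the generator never had a fair chance.
    if not relevant_ids <= set(retrieved_ids):
        return "retrieval"
    # Everything needed was in context and the answer is still wrong.
    return "generation"
```

Tallying these labels over a few hundred logged queries is usually enough to show which stage deserves the next week of work.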
Failure Mode 3: No Guardrails for Out-of-Domain Queries
RAG systems answer questions from a knowledge base. Users ask questions the knowledge base cannot answer. Without guardrails, the model either hallucinates an answer or returns an irrelevant one with high confidence. Both are worse outcomes than saying "I don't know."
The two-layer guardrail pattern:
from typing import Literal

# Layer 1: Query classifier (fast, cheap)
def classify_query(query: str) -> Literal["in_domain", "out_of_domain", "ambiguous"]:
    # Lightweight classifier: keyword match or small embedding model
    # Runs before retrieval to avoid wasting vector search on junk
    ...

# Layer 2: Faithfulness check (after retrieval and generation)
def check_faithfulness(answer: str, retrieved_context: str) -> float:
    # LLM-as-judge: does the answer rely on the retrieved context?
    # Score < 0.7 → flag for human review or return "I don't have that information"
    # `llm` is assumed to be a chat model client defined elsewhere
    prompt = f"""
    Context: {retrieved_context}
    Answer: {answer}
    Rate from 0-1: how well is this answer grounded in the context above?
    Return only the number.
    """
    # float() raises ValueError if the judge returns anything but a bare number
    return float(llm.invoke(prompt).content.strip())

Layer 1 is fast and skips the vector search entirely for out-of-domain queries. Layer 2 catches hallucinations that slip through because the model found something tangentially related and extrapolated.
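One way the Layer 1 stub could be filled in, along with a more forgiving way to read the judge's score, is sketched below. The `DOMAIN_TERMS` vocabulary, the hit thresholds, and the helper names `parse_score` and `apply_guardrail` are assumptions for illustration; a small embedding model would likely do better than keywords on real traffic.

```python
import re
from typing import Literal, Set

# Hypothetical vocabulary for an HR/finance knowledge base.
DOMAIN_TERMS: Set[str] = {"policy", "leave", "expense", "refund", "contract"}


def classify_query(query: str) -> Literal["in_domain", "out_of_domain", "ambiguous"]:
    """Keyword-overlap classifier: 2+ domain terms is in-domain, 1 is ambiguous."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    hits = len(words & DOMAIN_TERMS)
    if hits >= 2:
        return "in_domain"
    if hits == 1:
        return "ambiguous"
    return "out_of_domain"


def parse_score(raw: str, default: float = 0.0) -> float:
    """Pull the first number out of a judge reply, clamped to [0, 1]."""
    match = re.search(r"\d*\.?\d+", raw)
    if match is None:
        return default  # treat an unparseable reply as ungrounded
    return min(max(float(match.group()), 0.0), 1.0)


def apply_guardrail(answer: str, score: float, threshold: float = 0.7) -> str:
    """Return the answer only when the faithfulness score clears the threshold."""
    return answer if score >= threshold else "I don't have that information."
```

Defaulting an unparseable judge reply to 0.0 is a deliberate choice here: it fails closed, so a misbehaving judge produces refusals rather than unchecked answers.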
Failure Mode 4: The Hybrid Search Gap
Vector search finds semantically similar documents. Keyword search finds exact term matches. Enterprise documents need both.
Vector search fails on product codes, SKUs, identifiers, and proper nouns — it embeds them to a region of the space but can't do exact matching. Keyword search fails on paraphrased queries and synonyms. A production RAG system needs hybrid search: both methods run in parallel, with a reranker to merge and rank results.
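Before any reranker runs, the two result lists have to be merged into one. A common way to do that is weighted reciprocal rank fusion, sketched here in plain Python. The function name, the document IDs, and the constant `c=60` are illustrative assumptions, and whether a particular library merges this way should be checked against its own documentation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[List[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Merge ranked ID lists: each list contributes weight / (c + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["sku_doc", "returns_faq", "pricing"]
vector_hits = ["returns_faq", "warranty", "sku_doc"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits], weights=[0.5, 0.5])
```

Rank fusion only looks at positions, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities are on unrelated scales. Documents found by both methods rise to the top.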
from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.retrievers import BM25Retriever

# Hybrid retriever: BM25 (keyword) + FAISS (semantic)
# `docs` and `vectorstore` are assumed to be built earlier in the pipeline
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Equal weight: tune based on your query distribution
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

# Reranker on top: cross-encoder scores (query, doc) pairs
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base"),
    top_n=3,
)

# Wrap the ensemble so every query is retrieved first, then reranked
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble,
)

The Production RAG Eval Stack
A production RAG system needs continuous evaluation at three layers:
Retrieval layer (RAGAS + LangSmith)
- Context precision: are the retrieved chunks relevant?
- Context recall: are all relevant chunks retrieved?
- MRR (mean reciprocal rank): how high is the correct chunk ranked?
Generation layer (RAGAS + LLM-as-judge)
- Faithfulness: is the answer grounded in the retrieved context?
- Answer relevance: does the answer address the query?
- Hallucination rate: what fraction of claims are not in context?
System layer (Langfuse + custom dashboards)
- End-to-end latency P50/P95
- Cost per query
- Out-of-domain rate (fraction of queries with no answer)
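For intuition about what these numbers mean, here are simplified, ID-based versions of the retrieval metrics and a nearest-rank percentile for latency. These are assumptions for illustration, not the RAGAS definitions: RAGAS computes its context metrics differently, so treat this as a sanity-check baseline against a labeled set.

```python
import math
from typing import List, Sequence, Set, Tuple


def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of retrieved chunk IDs that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of labeled-relevant chunk IDs that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def mean_reciprocal_rank(runs: Sequence[Tuple[List[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=95 for P95 latency."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

P95 matters more than the mean for agentic pipelines, since a single re-retrieval loop can multiply latency for the unlucky tail of queries.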
Putting It Together: The Agentic RAG Architecture
A production agentic RAG system wires these components into a LangGraph graph with four nodes:
- query_analyzer: Decomposes complex queries into sub-queries. Classifies in-domain vs out-of-domain. Routes to appropriate retrieval tools.
- retriever: Runs hybrid search (BM25 + vector) per sub-query. Applies reranker. Returns top-N chunks with source metadata.
- faithfulness_check: Scores retrieved context relevance. Triggers re-retrieval with reformulated query if score < threshold. Max 2 retries.
- generator: Generates answer with strict grounding prompt. Requires citations. Returns answer + confidence score + source list.
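The control flow between the four nodes can be sketched in plain Python before committing to a graph framework. This is a stand-in rather than LangGraph code: `run_graph` and its callable parameters are hypothetical, the 0.7 threshold is borrowed from the faithfulness check earlier, and returning a refusal once retries are exhausted is an assumption about what should happen at that point.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
MAX_RETRIES = 2
FALLBACK = "I don't have that information."


def run_graph(
    query: str,
    analyze: Callable[[str], str],  # query_analyzer: "in_domain" or "out_of_domain"
    retrieve: Callable[[str], List[str]],  # retriever: hybrid search + rerank
    score_context: Callable[[str, List[str]], float],  # faithfulness_check
    reformulate: Callable[[str], str],  # query rewrite used on retry
    generate: Callable[[str, List[str]], str],  # generator
    threshold: float = 0.7,
) -> State:
    # Out-of-domain queries stop before any retrieval happens.
    if analyze(query) == "out_of_domain":
        return {"answer": FALLBACK, "sources": [], "retries": 0}

    current, retries = query, 0
    chunks = retrieve(current)
    score = score_context(query, chunks)

    # Re-retrieve with a reformulated query, at most MAX_RETRIES times.
    while score < threshold and retries < MAX_RETRIES:
        current = reformulate(current)
        chunks = retrieve(current)
        score = score_context(query, chunks)
        retries += 1

    if score < threshold:
        return {"answer": FALLBACK, "sources": chunks, "retries": retries}
    return {"answer": generate(query, chunks), "sources": chunks, "retries": retries}
```

Passing the nodes in as callables keeps the loop testable with stubs. In a graph framework the `while` loop would become a conditional edge from the check node back to the retriever.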
Frequently Asked
What is agentic RAG?
Agentic RAG is a retrieval-augmented generation system where the retrieval step is controlled by an AI agent rather than a fixed pipeline. The agent can decide what to retrieve, when to retrieve it, how to reformulate queries, and whether the retrieved context is sufficient — making it adaptive to complex, multi-step queries that static RAG pipelines cannot handle.
What is the most common cause of RAG failures in production?
The most common production RAG failure is confusing retrieval quality with generation quality. Teams measure whether the final answer is correct, but not whether the right documents were retrieved. Retrieval failures require fixes to chunking, embedding, and reranking. Generation failures require prompt engineering and grounding constraints.
What chunking strategy should I use for production RAG?
The right chunking strategy depends on document type. For long documents, semantic chunking (splitting on paragraph, sentence, or topic boundaries rather than at fixed token counts) generally retrieves better than naive fixed-size chunks. Overlapping chunks help with answer spans that cross boundaries. Tables and structured data should be chunked separately with schema metadata attached.
How do you evaluate RAG quality in production?
A production RAG eval stack measures three layers: retrieval (did the right documents come back, and how highly were they ranked?), generation (is the answer grounded in the retrieved context, and does it address the query?), and system (latency, cost per query, out-of-domain rate). RAGAS, LangSmith, and Langfuse automate much of this. User feedback signals should also be linked back to traces.