What does Ahmed Sheikh do?

Ahmed Sheikh is a cloud-native agentic AI engineer who helps medium and large businesses move AI agents from stalled pilots to reliable, governed production. He specialises in multi-agent orchestration, agentic RAG, LLMOps, AI evaluation, and cloud-native AI infrastructure — serving clients worldwide.

What does Ahmed Sheikh build, and what does it cost?

Ahmed builds production-ready AI applications as three fixed-price packages, each with an optional monthly retainer for ongoing work after launch: a Full-Stack AI App ($800, or $350/month), an Enterprise Full-Stack Agentic AI build with multi-agent orchestration, agentic RAG, evals and observability ($2,000, or $800/month), and a Custom Product scoped and quoted on a call. Payments are processed worldwide via Paddle.

Why do most enterprise AI agent pilots fail to reach production?

Research shows only 11–14% of enterprise AI agent pilots reach production at scale. The primary reasons are orchestration complexity, inadequate evaluation and output validation, LLMOps debt, poor retrieval quality in RAG pipelines, and lack of governance controls. These are the exact failure modes Ahmed's engagements are designed to address.

Does Ahmed Sheikh work with medium and large businesses globally?

Yes. Ahmed works with medium and large businesses worldwide, remotely. He delivers fixed-price AI application builds with optional monthly retainers — including full-stack AI apps, enterprise agentic AI systems with multi-agent orchestration via LangGraph, agentic RAG implementation, and LLMOps infrastructure.

What technologies does Ahmed Sheikh use for agentic AI systems?

Ahmed builds agentic AI systems using LangGraph for multi-agent orchestration, LangChain for RAG pipelines, Python and FastAPI for backend services, Next.js and TypeScript for interfaces, and cloud-native infrastructure on Vercel, AWS, and Docker/Kubernetes. He also designs evaluation harnesses, output validation layers, and LLMOps monitoring stacks.

Agentic RAG in Production: Evaluation, Guardrails, and the Failure Modes Nobody Warns You About

TL;DR

Production RAG fails for reasons that have nothing to do with the model. Naive fixed-size chunking misses answer spans. Retrieval quality and generation quality are measured as one metric when they need separate evals. Out-of-domain queries return confidently wrong answers. There are no guardrails. This piece covers the full stack: what breaks, how to measure it, and the architecture that makes agentic retrieval trustworthy at scale.

RAG is now the most deployed agentic AI pattern in enterprise — a natural fit for the large volumes of internal documents, policies, and data that enterprises sit on. And it's the pattern that most consistently fails quietly in production.

Demo RAG is a 3-step pipeline: embed the query, retrieve top-K chunks, generate an answer. It works because the demo uses clean queries, well-formatted documents, and manually curated test cases. Production RAG gets none of those guarantees.

What Agentic RAG Actually Means

Before getting into failure modes: the distinction between static RAG and agentic RAG matters.

Static RAG runs a fixed pipeline: query → embed → retrieve → generate. The retrieval step happens once. The query goes in as-is. This works for simple, single-document lookups.

Agentic RAG uses an agent to control retrieval. The agent can:

—Decompose a complex query into sub-queries
—Decide to retrieve from multiple sources in sequence
—Evaluate retrieved context and decide to retrieve more if it's insufficient
—Reformulate queries based on initial retrieval results
—Choose between retrieval tools (vector search vs keyword vs SQL) per sub-query

Agentic RAG handles the complex queries that enterprise users actually ask. But it also adds more surface area for failures — each agent decision is a potential failure point.
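One of those decision points, choosing a retrieval tool per sub-query, can be sketched with plain heuristics. This is a minimal illustration, not how a production agent would necessarily do it (an LLM with tool calling is the more usual choice). The regex patterns, the `route_sub_query` name, and the three tool labels are all assumptions made for the example.

```python
import re
from typing import Literal

Tool = Literal["keyword", "sql", "vector"]

# Hypothetical patterns: aggregate phrasing, and identifiers such as "SKU-48213".
AGGREGATE_PATTERN = re.compile(r"\b(how many|total|average|sum of|count of)\b", re.IGNORECASE)
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-?\d{2,}\b")


def route_sub_query(sub_query: str) -> Tool:
    """Pick a retrieval tool for one sub-query. First matching rule wins."""
    if AGGREGATE_PATTERN.search(sub_query):
        return "sql"      # counts and totals belong in a structured store
    if ID_PATTERN.search(sub_query):
        return "keyword"  # exact identifiers need exact term matching
    return "vector"       # everything else goes to semantic search


def plan(sub_queries: list[str]) -> list[tuple[str, Tool]]:
    """Attach a tool choice to every sub-query produced by decomposition."""
    return [(q, route_sub_query(q)) for q in sub_queries]
```

Every rule in a router like this is one of the failure points mentioned above, so routing decisions are worth logging alongside the retrieval results they produced.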

Failure Mode 1: Chunking Mismatch

Naive chunking splits documents at fixed boundaries, for example every 512 tokens, with no regard for semantic coherence. This is fast to implement and often adequate for demos, but it's a consistent source of retrieval failures in production.

The problem: answers often straddle the boundaries that fixed-size chunking cuts through. A policy document might answer a question with a sentence at the end of one chunk whose context sits at the start of the next, so neither chunk alone supports the right answer.

Naive fixed-size chunking

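from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size is measured in characters by default (length_function=len), not tokens.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=0
)
chunks = text_splitter.split_text(document)  # document: the raw text as a str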

Fast. Breaks semantic units. Poor retrieval on complex documents.

Semantic chunking with overlap

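# Adds a sentence-level separator (". ") and a 128-character overlap.
# The splitter tries separators in order: paragraphs, lines, sentences, words.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=128,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = text_splitter.split_text(document)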

Respects paragraph/sentence boundaries. Overlap captures cross-boundary answers.

For complex enterprise documents such as policy manuals, technical specs, and legal contracts, semantic chunking (splitting on paragraph and sentence boundaries rather than at a fixed token count) keeps answer spans intact and reduces retrieval failures. For tables and structured data, chunk separately with schema metadata attached as context.
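To make the boundary problem concrete, here is a dependency-free sketch that compares a fixed-size cut with a sentence-aware one on a three-sentence policy snippet. The helper names, the sample text, and the chunk sizes are invented for illustration, and the sentence splitter is a simple regex rather than anything a real pipeline should rely on.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly `max_chars`, repeating the
    last `overlap_sentences` sentences at the start of the next chunk.
    A single sentence longer than `max_chars` is kept whole, not cut."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


policy = (
    "Employees may carry over unused leave. "
    "The carry-over limit is five days. "
    "Requests must be filed by March 31."
)
answer = "The carry-over limit is five days."

fixed = fixed_size_chunks(policy, 50)     # cuts through the answer sentence
aware = sentence_chunks(policy, 80)       # keeps the answer sentence whole
```

With these inputs the answer sentence is split across two fixed-size chunks, while the sentence-aware version keeps it intact and repeats it in the following chunk as overlap.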

Failure Mode 2: Retrieval ≠ Generation Quality

The most widespread measurement error in RAG systems: teams measure only whether the final answer is correct, then attribute every failure to the generation step. In practice, retrieval and generation fail independently, and each needs a different fix.

Failure type | Root cause | Fix
Wrong docs retrieved | Embedding model, chunking, index | Improve chunking, add reranker, hybrid search
Right docs, wrong answer | Generation: hallucination, poor grounding | Stricter prompt constraints, citation requirements
No docs retrieved | Query-document mismatch, empty index | Query rewriting, HyDE, better metadata
Confident wrong answer | Out-of-domain query, no guardrails | Confidence scoring, out-of-domain classifier

Measuring these separately requires separate eval datasets. For retrieval: a dataset of queries with known relevant documents. For generation: a dataset of (query, retrieved context, expected answer) triples. RAGAS provides automated metrics for both.
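For the retrieval dataset, the classical ID-based metrics are short enough to write by hand. The sketch below assumes each eval example is a ranked list of retrieved document IDs plus a set of known-relevant IDs. The function names and sample IDs are illustrative, and RAGAS defines its own variants of these metrics rather than these exact formulas.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(doc in relevant for doc in top_k) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


retrieved = ["doc_7", "doc_2", "doc_9", "doc_4"]
relevant = {"doc_2", "doc_4", "doc_5"}
```

Because these scores never look at the generated answer, a drop in them points at chunking, embeddings, or the index rather than at the prompt.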

Failure Mode 3: No Guardrails for Out-of-Domain Queries

RAG systems answer questions from a knowledge base. Users ask questions that aren't in the knowledge base. Without guardrails, the model either hallucinates an answer or returns an irrelevant answer with high confidence — both worse outcomes than saying "I don't know."

The two-layer guardrail pattern:

from typing import Literal

# Layer 1: Query classifier (fast, cheap)
def classify_query(query: str) -> Literal["in_domain", "out_of_domain", "ambiguous"]:
    # Lightweight classifier: keyword match or small embedding model
    # Runs before retrieval to avoid wasting vector search on junk
    ...

# Layer 2: Faithfulness check (after generation)
def check_faithfulness(answer: str, retrieved_context: str) -> float:
    # LLM-as-judge: does the answer rely on the retrieved context?
    # Score < 0.7 → flag for human review or return "I don't have that information"
    # Assumes `llm` is a LangChain chat model whose invoke() returns a message with .content
    prompt = f"""
    Context: {retrieved_context}
    Answer: {answer}

    Rate from 0-1: how well is this answer grounded in the context above?
    Return only the number.
    """
    try:
        score = float(llm.invoke(prompt).content.strip())
    except ValueError:
        return 0.0  # unparseable judge output is treated as ungrounded
    return min(max(score, 0.0), 1.0)

Layer 1 is fast and skips retrieval entirely for queries it classifies as out-of-domain. Layer 2 catches hallucinations that slip through because the model found something tangentially related and extrapolated.
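How the two layers wrap a retrieve-and-generate call can be sketched as a single function. Everything here is an assumption for illustration: the `answer_with_guardrails` name, the callable parameters standing in for the classifier, retriever, generator, and judge, and the choice to let "ambiguous" queries proceed to retrieval. The 0.7 threshold mirrors the comment in the code above.

```python
from typing import Callable, Literal

Domain = Literal["in_domain", "out_of_domain", "ambiguous"]
FALLBACK = "I don't have that information."


def answer_with_guardrails(
    query: str,
    classify: Callable[[str], Domain],
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    faithfulness: Callable[[str, str], float],
    threshold: float = 0.7,
) -> str:
    """Run the two-layer guardrail around a retrieve-then-generate call."""
    # Layer 1: out-of-domain queries never reach retrieval or generation.
    if classify(query) == "out_of_domain":
        return FALLBACK

    context = retrieve(query)
    answer = generate(query, context)

    # Layer 2: refuse when the judge says the answer is not grounded.
    if faithfulness(answer, context) < threshold:
        return FALLBACK
    return answer
```

Passing the components in as callables keeps the guardrail logic testable without a vector store or a model behind it.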

Failure Mode 4: The Hybrid Search Gap

Vector search finds semantically similar documents. Keyword search finds exact term matches. Enterprise documents need both.

Vector search fails on product codes, SKUs, identifiers, and proper nouns: it embeds them to a region of the space but can't do exact matching. Keyword search fails on paraphrased queries and synonyms. A production RAG system needs hybrid search: both methods run in parallel, their ranked lists are fused, and a reranker re-scores the fused candidates.

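from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.retrievers import BM25Retriever

# Hybrid retriever: BM25 (keyword) + vector store (semantic)
# Assumes `docs` is a list of Documents and `vectorstore` is an existing store (e.g. FAISS)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Equal weight: tune based on your query distribution
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)

# Reranker on top: cross-encoder scores (query, doc) pairs
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base"),
    top_n=3
)

# Wrap the ensemble so every query is retrieved, fused, then reranked
hybrid_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble
)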
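The merge step is worth understanding on its own. A common way to fuse two ranked lists is weighted Reciprocal Rank Fusion, which LangChain's documentation names as the algorithm behind EnsembleRetriever. The version below is a standalone sketch, not LangChain's implementation: the function name, the document IDs, and the constant `k=60` (a conventional default) are assumptions.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    weights: Optional[list[float]] = None,
    k: int = 60,
) -> list[str]:
    """Fuse ranked lists of document IDs: each list contributes
    weight / (k + rank) to a document's score, and scores are summed."""
    weights = weights or [1.0] * len(rankings)
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["sku_doc", "faq", "policy"]          # keyword hits
vector_ranking = ["policy", "sku_doc", "handbook"]   # semantic hits
fused = reciprocal_rank_fusion(bm25_ranking, vector_ranking and [bm25_ranking, vector_ranking][1:] and [vector_ranking][0] and bm25_ranking) if False else reciprocal_rank_fusion([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
```

Documents that rank well in both lists rise to the top, which is why the exact-match SKU document and the semantically similar policy document both survive the merge before the reranker sees them.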

The Production RAG Eval Stack

A production RAG system needs continuous evaluation at three layers:

Retrieval layer

RAGAS + LangSmith

—Context precision — are the retrieved chunks relevant?
—Context recall — are all relevant chunks retrieved?
—MRR (mean reciprocal rank) — how high is the correct chunk ranked?

Generation layer

RAGAS + LLM-as-judge

—Faithfulness — is the answer grounded in the retrieved context?
—Answer relevance — does the answer address the query?
—Hallucination rate — what fraction of claims are not in context?

System layer

Langfuse + custom dashboards

—End-to-end latency P50/P95
—Cost per query
—Out-of-domain rate (fraction of queries with no answer)
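The system-layer numbers can be computed from per-query trace records. A minimal sketch follows, assuming each trace carries a latency, a cost, and a flag for whether an answer was returned. The `Trace` fields and the nearest-rank percentile method are illustrative choices, not the export schema of Langfuse or any other tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    latency_ms: float
    cost_usd: float
    answered: bool  # False when the system returned "I don't know"


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: list[Trace]) -> dict[str, float]:
    """Aggregate the three system-layer metrics over a batch of traces."""
    latencies = [t.latency_ms for t in traces]
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "cost_per_query_usd": sum(t.cost_usd for t in traces) / len(traces),
        "out_of_domain_rate": sum(not t.answered for t in traces) / len(traces),
    }


# Ten synthetic traces: latencies 100..1000 ms, the first two unanswered.
traces = [Trace(latency_ms=100.0 * i, cost_usd=0.002, answered=i > 2) for i in range(1, 11)]
summary = summarize(traces)
```

Tracking P95 alongside P50 matters for agentic retrieval in particular, since retries and multi-step sub-queries tend to show up in the tail rather than the median.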

Putting It Together: The Agentic RAG Architecture

A production agentic RAG system wires these components into a LangGraph graph with four nodes:

query_analyzer

Decomposes complex queries into sub-queries. Classifies in-domain vs out-of-domain. Routes to appropriate retrieval tools.

retriever

Runs hybrid search (BM25 + vector) per sub-query. Applies reranker. Returns top-N chunks with source metadata.

faithfulness_check

Scores whether the retrieved context is relevant and sufficient for the query. Triggers re-retrieval with a reformulated query if the score falls below the threshold, up to a maximum of 2 retries.

generator

Generates answer with strict grounding prompt. Requires citations. Returns answer + confidence score + source list.
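The control flow between the four nodes can be sketched without any framework. The version below uses plain functions in place of LangGraph nodes so it runs on its own; in LangGraph, each callable would be a node and the retry loop a conditional edge. The `RagState` fields, the `reformulate` helper, the 0.7 threshold, and the fallback message after exhausted retries are assumptions, since the outline above does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_RETRIES = 2
FALLBACK = "I don't have that information."


@dataclass
class RagState:
    query: str
    sub_queries: list[str] = field(default_factory=list)
    chunks: list[str] = field(default_factory=list)
    retries: int = 0
    answer: str = ""


def run_graph(
    query: str,
    analyze: Callable[[str], list[str]],               # query_analyzer
    retrieve: Callable[[str], list[str]],              # retriever
    score_context: Callable[[str, list[str]], float],  # faithfulness_check
    reformulate: Callable[[str], str],
    generate: Callable[[str, list[str]], str],         # generator
    threshold: float = 0.7,
) -> RagState:
    state = RagState(query=query)
    state.sub_queries = analyze(query)
    if not state.sub_queries:  # analyzer judged the query out-of-domain
        state.answer = FALLBACK
        return state

    while True:
        state.chunks = [c for q in state.sub_queries for c in retrieve(q)]
        if score_context(state.query, state.chunks) >= threshold:
            break
        if state.retries >= MAX_RETRIES:  # assumed behaviour: give up rather than guess
            state.answer = FALLBACK
            return state
        state.retries += 1
        state.sub_queries = [reformulate(q) for q in state.sub_queries]

    state.answer = generate(state.query, state.chunks)
    return state
```

Keeping the retry count in the state makes the "max 2 retries" rule auditable per query, which is useful when tracing why a request fell back to a refusal.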

Frequently Asked Questions

What is agentic RAG?

Agentic RAG is a retrieval-augmented generation system where the retrieval step is controlled by an AI agent rather than a fixed pipeline. The agent can decide what to retrieve, when to retrieve it, how to reformulate queries, and whether the retrieved context is sufficient — making it adaptive to complex, multi-step queries that static RAG pipelines cannot handle.

What is the most common cause of RAG failures in production?

The most common production RAG failure is confusing retrieval quality with generation quality. Teams measure whether the final answer is correct, but not whether the right documents were retrieved. Retrieval failures require fixes to chunking, embedding, and reranking. Generation failures require prompt engineering and grounding constraints.

What chunking strategy should I use for production RAG?

The right chunking strategy depends on document type. For long documents, semantic chunking (splitting on paragraph and sentence boundaries rather than at fixed token counts) generally retrieves better than naive fixed-size chunks. Overlapping chunks help with answer spans that cross boundaries. Tables and structured data should be chunked separately with schema metadata attached.

How do you evaluate RAG quality in production?

A production RAG eval stack measures three layers: retrieval (did the right documents come back, and how highly were they ranked?), generation (is the answer grounded in the retrieved context, and does it address the query?), and system (latency, cost per query, and out-of-domain rate). RAGAS, LangSmith, and Langfuse automate much of this. User feedback signals should also be linked back to traces.