You enable retrieval. You even log the “context” the model got. It still answers confidently with details that don’t exist in the documents.
That’s the fun part of RAG hallucinations: the system can look grounded while still being wrong.
And in practice it’s usually not “LLMs hallucinate.” It’s that your pipeline makes it easy for the model to guess, even when retrieval returns something.
The broken setup (the part that quietly invites guessing)
Here’s a naive RAG flow I’ve seen in multiple codebases. It retrieves a handful of chunks, concatenates them, and asks the model for an answer. No hard grounding requirement. No refusal behavior. No citation check.
    // Broken-ish RAG skeleton
    async function answer(question) {
      // Vector search over chunks
      const hits = await vectorSearch({
        query: question,
        topK: 8
      });

      // Naively jam everything together
      const context = hits.map(h => h.text).join("\n\n");

      const prompt = `
    Context:
    ${context}

    Question: ${question}
    Answer:
    `;

      return llm.complete({ prompt });
    }
This fails in a few predictable ways:
- Poor chunking strategy: chunks slice sentences/sections so the evidence you need gets split across boundaries.
- Noisy context: topK=8 can easily pull irrelevant passages that are “semantically related” but don’t actually support the claim.
- Prompt ambiguity: if you don’t say what to do when evidence is missing, the model fills gaps from its pretraining.
- Over-permissive answer style: the prompt doesn’t require citations, so the model doesn’t have to prove anything.
So you get confident answers that look plausible. The logs show context… but the model never had to use it strictly.
Why hallucinations happen even with retrieval
RAG doesn’t “guarantee truth.” It improves odds by supplying documents. But the final answer is still generated text.
Hallucinations happen when at least one of these goes wrong:
- Weak retrieval recall: the relevant chunk isn’t retrieved (wrong embedding model, bad query formulation, chunk windows that are too small, filters that accidentally exclude the right source). If the evidence isn’t in context, generation can’t be grounded.
- Noisy context / low precision: you retrieve a bag of “related” chunks. The model then faces multiple candidates, some conflicting, some off-topic. Without constraints, it may stitch together a story.
- Prompt ambiguity: “Answer the question using the context” is not a strong requirement unless you make grounding and refusal explicit. The model tends to interpret “using” as “influenced by,” not “derived from.”
- Context grounding edge cases: questions spanning multiple chunks, conflicting source documents, stale content, or partial support where no single chunk fully answers the claim. In these cases, the model needs explicit instructions and/or better retrieval to avoid guessing.
RAG hallucinations are often a pipeline problem, not a model problem.
Fix it in the order that actually reduces failure rate
Don’t start by rewriting prompts and hoping for the best. Start by making retrieval and context better. Then make generation strict.
1) Chunking strategy: make evidence retrievable (and re-combinable)
Chunking strategy sounds like a preprocessing detail. It’s not.
If chunks are too small, you lose necessary surrounding facts (definitions, parameters, version notes). If chunks are too large, the retrieved context becomes noisy and the model can’t find the right paragraph.
What I aim for in production systems:
- Boundaries that match meaning: split on headings/sections or semantic units, not just character counts.
- Keep cross-references together: if a paragraph references “See Table 2” or a config key, try to include the target within the same chunk (or attach metadata so you can reassemble).
- Add metadata that helps you filter: document type, product/module, timestamp/version, language, tenant, permissions scope, etc.
- Plan for multi-chunk answers: don’t assume a question is answered by one chunk. If you see frequent multi-part questions, your retrieval + prompting must support synthesis across chunks (with citations).
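A minimal sketch of heading-based chunking with attached metadata, assuming the source docs use Markdown-style `#` headings. The helper name `chunkByHeadings` and the metadata fields (`docType`, `version`) are made up for illustration and do not come from any particular library.

```javascript
// Hypothetical helper: split a Markdown-like document on headings so each
// chunk is one section, and attach metadata that can be used for filtering.
function chunkByHeadings(docId, text, meta = {}) {
  const chunks = [];
  let current = { title: "(preamble)", lines: [] };

  // Emit the section collected so far, skipping empty sections.
  const flush = () => {
    const body = current.lines.join("\n").trim();
    if (body.length > 0) {
      chunks.push({
        id: `${docId}#${chunks.length}`,
        title: current.title,
        text: body,
        ...meta, // e.g. docType, version, tenant (assumed field names)
      });
    }
  };

  for (const line of text.split("\n")) {
    const heading = line.match(/^#{1,6}\s+(.*)$/);
    if (heading) {
      flush(); // a new heading closes the previous section
      current = { title: heading[1].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return chunks;
}

const sample = "Intro text.\n# Install\nRun the installer.\n# Config\nSet `timeout` to 30.";
const chunks = chunkByHeadings("guide", sample, { docType: "spec", version: "2024-03-01" });
console.log(chunks.map(c => `${c.id} [${c.title}]`));
```

Very long sections would still need a secondary split, and very short ones may be worth merging with a neighbor; this sketch only shows the boundary and metadata idea.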
Quick sanity check: pick 20 real questions and manually verify whether the “gold evidence” exists in any chunk. If it doesn’t, you’re debugging the wrong thing.
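One rough way to automate that sanity check, assuming you have hand-written a short "gold evidence" snippet for each question. The names `goldSet`, `corpusChunks`, and `findMissingEvidence` are illustrative, and normalized substring matching is a deliberately crude stand-in for manual review.

```javascript
// Normalize whitespace and case so trivial formatting differences don't matter.
const normalize = s => s.toLowerCase().replace(/\s+/g, " ").trim();

// Return the questions whose gold evidence does not appear inside any chunk.
function findMissingEvidence(goldSet, corpusChunks) {
  const haystacks = corpusChunks.map(c => normalize(c.text));
  return goldSet
    .filter(g => !haystacks.some(h => h.includes(normalize(g.evidence))))
    .map(g => g.question);
}

// Illustrative data only.
const corpusChunks = [
  { id: "a#0", text: "The default timeout is 30 seconds." },
  { id: "a#1", text: "Retries are capped at 5 attempts." },
];
const goldSet = [
  { question: "What is the default timeout?", evidence: "default timeout is 30 seconds" },
  { question: "What is the max payload size?", evidence: "maximum payload size is 10 MB" },
];

const missing = findMissingEvidence(goldSet, corpusChunks);
console.log(missing); // [ 'What is the max payload size?' ]
```

Evidence that got split across a chunk boundary will also show up as missing here, which is useful: it points at chunking rather than at the retriever or the prompt.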
2) Retrieval quality: improve ranking, filter noise, raise precision
Even perfect chunking can’t save you if your retrieval returns the wrong stuff.
Common retrieval fixes:
- Retrieve with filters, not just topK: constrain vector search using metadata (doc type, version range, tenant scope, domain). This directly reduces “semantically related but irrelevant” hits.
- Use better ranking: if you only use embeddings, consider re-ranking with a cross-encoder or a stronger scoring model. The goal is higher precision: fewer wrong chunks survive into context.
- Limit context to what you can justify: don’t stuff 8–20 passages into the prompt by default. Start with a small candidate set, then expand only when you have evidence of missing coverage.
- Handle stale content explicitly: if your docs change, add timestamps/version metadata and prefer the latest compatible version.
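The filtering and context-limiting ideas above can be sketched as a small post-retrieval step. This assumes each hit carries a relevance `score` (from the vector store or a re-ranker, higher is better) plus a `docType` field; `selectContext` and the default thresholds are placeholders to tune, not recommended values.

```javascript
// hits: [{ id, text, score, docType }] from a hypothetical retriever/re-ranker.
function selectContext(hits, { docType, minScore = 0.5, maxChunks = 4 } = {}) {
  return hits
    .filter(h => docType === undefined || h.docType === docType) // metadata filter
    .filter(h => h.score >= minScore)                            // drop weak matches
    .sort((a, b) => b.score - a.score)                           // best first
    .slice(0, maxChunks);                                        // cap prompt size
}

// Illustrative data only.
const candidateHits = [
  { id: "s1", text: "Timeout spec", score: 0.82, docType: "spec" },
  { id: "b7", text: "Launch blog post", score: 0.91, docType: "blog" },
  { id: "s4", text: "Unrelated spec note", score: 0.31, docType: "spec" },
  { id: "s2", text: "Retry spec", score: 0.67, docType: "spec" },
];

const selected = selectContext(candidateHits, { docType: "spec", maxChunks: 3 });
console.log(selected.map(h => h.id)); // [ 's1', 's2' ]
```

Note that the highest-scoring hit (`b7`) is dropped by the metadata filter, and `s4` by the score floor, so fewer chunks reach the prompt than `maxChunks` allows. That is the intended behavior: the cap is a ceiling, not a quota.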
Edge cases you should test for here:
- Conflicting source documents: if older docs contradict newer ones, your retrieval must pick the right timeframe/version or your prompt must ask the model to choose with evidence.
- Questions spanning multiple chunks: if the answer requires two separate facts, your topK must include both (or your pipeline must iteratively retrieve more).
- Stale content: retrieval that ignores recency will “ground” the model in the wrong truth.
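For the stale and conflicting cases, one possible guard is to keep only the newest version of each logical section before building the prompt. This sketch assumes every hit carries a `docKey` that is stable across versions and an ISO-8601 `version` date string; both field names are hypothetical.

```javascript
// Keep only the newest version of each logical document section.
// ISO-8601 dates (YYYY-MM-DD) sort correctly with plain string comparison.
function preferLatest(hits) {
  const newest = new Map();
  for (const hit of hits) {
    const seen = newest.get(hit.docKey);
    if (!seen || hit.version > seen.version) newest.set(hit.docKey, hit);
  }
  return [...newest.values()];
}

// Illustrative data only: two versions of the same section disagree.
const versionedHits = [
  { id: "old", docKey: "limits/timeout", version: "2023-05-01", text: "Timeout is 10 seconds." },
  { id: "new", docKey: "limits/timeout", version: "2024-02-10", text: "Timeout is 30 seconds." },
  { id: "r1", docKey: "limits/retries", version: "2023-05-01", text: "Retries are capped at 5." },
];

const latest = preferLatest(versionedHits);
console.log(latest.map(h => h.id)); // [ 'new', 'r1' ]
```

"Latest" is not always "right": if users run older product versions, the comparison should be against the version the question is about, which this sketch does not attempt.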
The Pinecone and OpenAI RAG examples both emphasize that retrieval quality and context selection matter. They don’t treat generation as the only lever.
3) Prompt engineering: constrain the model to the provided context
Now you tighten the generator’s instructions. This is where many teams stop, then wonder why answers still drift.
Here’s a stricter prompt pattern that reduces RAG hallucinations:
    System:
    You are a QA assistant. Answer using ONLY the provided Context.
    If the Context does not contain enough information to answer the question,
    say: "I don’t know based on the provided documents."
    You MUST cite evidence. For every non-trivial claim, include the source id.

    User:
    Context:
    ${context}

    Question: ${question}
    Answer (with citations):
Two things matter:
- Refusal behavior when evidence is insufficient.
- Citation requirement tied to source ids in the retrieved context.
Without those, the model can produce a confident narrative that is only loosely related to the context.
Also: your context format should preserve boundaries (source id, title, section). Don’t just concatenate raw text and hope citations work.
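A sketch of one context format that keeps those boundaries visible, assuming each hit has `id`, `title`, `section`, and `text` fields (the field names and the `---` separator are illustrative choices). The `[id:...]` tag matches the citation style the prompt asks the model to produce.

```javascript
// Render each hit as a labeled block so the model can cite it by id.
function formatContext(hits) {
  return hits
    .map(h => `[id:${h.id}] ${h.title} > ${h.section}\n${h.text.trim()}`)
    .join("\n---\n");
}

// Illustrative data only.
const contextHits = [
  { id: 12, title: "Limits", section: "Timeouts", text: " The default timeout is 30 seconds.\n" },
  { id: 15, title: "Limits", section: "Retries", text: "Retries are capped at 5 attempts." },
];

const formatted = formatContext(contextHits);
console.log(formatted);
```

Keeping the tag format identical between the context and the citation instruction makes it possible to check the model's citations mechanically afterward.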
A more realistic pipeline (retrieval + grounded prompting)
This is the same skeleton as before, but with constraints:
    async function answer(question) {
      const hits = await vectorSearch({
        query: question,
        topK: 6,
        // Example: reduce noise (filter fields are illustrative)
        filters: { docType: "spec", version: { $gte: "2024-01-01" } }
      });

      // Keep ids so we can cite; the tag format must match the
      // citation format requested in the prompt below.
      const context = hits.map(h => `[id:${h.id}] ${h.text}`).join("\n\n");

      const prompt = `
    You are a QA assistant. Answer using ONLY the provided Context.
    If the Context does not contain enough information, respond exactly:
    I don’t know based on the provided documents.
    For every non-trivial claim, include citations like [id:123].

    Context:
    ${context}

    Question: ${question}
    Answer:
    `;

      return llm.complete({ prompt });
    }
This doesn’t magically prevent all errors. It forces the failure mode to be more honest: either the evidence supports the answer, or you refuse.
What about “no chunk fully supports the claim”?
This is a common RAG hallucination edge case.
Sometimes the answer is derivable only by combining multiple passages. If you treat “fully supports” too strictly at the chunk level, the model may refuse even when the global context has the answer.
So you have to decide what “enough evidence” means. Two pragmatic approaches:
- Allow synthesis across multiple chunks if citations cover every claim. The prompt should say “use any relevant context snippets,” not “single snippet only.”
- Validate coverage: after generation, run a lightweight check that every sentence-level claim has a citation hit. Databricks’ RAG evaluation framework is aimed at measuring these kinds of properties, not just “answer looks good.”
The key is: you don’t want a model that always refuses. You want one that refuses when the provided documents don’t support the output.
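A lightweight coverage check along those lines might look like the sketch below. It assumes citations use the `[id:123]` format, that they appear before the sentence's final punctuation, and that `REFUSAL` matches the refusal string in your prompt character for character. The sentence splitter is naive, and `checkCitations` is a hypothetical helper, not part of any evaluation framework.

```javascript
// Must match the refusal string used in the prompt exactly (assumption).
const REFUSAL = "I don't know based on the provided documents.";

function checkCitations(answer, retrievedIds) {
  if (answer.trim() === REFUSAL) {
    return { refused: true, uncited: [], unknownIds: [] };
  }
  const allowed = new Set(retrievedIds.map(String));
  // Naive split: break after ., ! or ? when followed by whitespace.
  const sentences = answer.split(/(?<=[.!?])\s+/).map(s => s.trim()).filter(Boolean);

  const uncited = [];
  const unknownIds = new Set();
  for (const sentence of sentences) {
    const ids = [...sentence.matchAll(/\[id:([^\]]+)\]/g)].map(m => m[1].trim());
    if (ids.length === 0) uncited.push(sentence);   // claim with no citation
    for (const id of ids) {
      if (!allowed.has(id)) unknownIds.add(id);     // cites a chunk we never sent
    }
  }
  return { refused: false, uncited, unknownIds: [...unknownIds] };
}

const report = checkCitations(
  "The default timeout is 30 seconds [id:12]. Retries are capped at 5 [id:99]. It also supports gzip.",
  ["12", "15"]
);
console.log(report);
```

This only verifies that citations exist and point at chunks that were actually in the prompt. Whether the cited chunk supports the claim still needs a human or a separate entailment-style check.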
Don’t guess—measure. One practical test to run this week
Take 20–50 real user questions (from logs if you have them). For each question, do this:
- Log retrieved chunk ids and the final answer.
- Check grounding manually for a first pass: for each answer claim, ask “does at least one cited chunk actually support it?”
- Track failure buckets: missing evidence (retrieval recall issue), wrong evidence (noise/precision issue), conflicting docs (versioning issue), and hallucinated specifics (prompt constraint issue).
After that, tune in the right direction. If most failures are “missing evidence,” stop prompt engineering and work on chunking and retrieval. If most failures are “evidence exists but the answer doesn’t cite it,” tighten your prompt constraints and context formatting.
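If the manual review is recorded as one label per question, the tally is a few lines. The bucket names below are my own shorthand for the four categories above plus `ok`, and `tallyFailures` is a hypothetical helper.

```javascript
const BUCKETS = [
  "missing_evidence",       // retrieval recall issue
  "wrong_evidence",         // noise/precision issue
  "conflicting_docs",       // versioning issue
  "hallucinated_specifics", // prompt constraint issue
  "ok",
];

// reviews: [{ question, bucket }] produced by manual review.
function tallyFailures(reviews) {
  const counts = Object.fromEntries(BUCKETS.map(b => [b, 0]));
  for (const r of reviews) {
    if (!BUCKETS.includes(r.bucket)) throw new Error(`Unknown bucket: ${r.bucket}`);
    counts[r.bucket] += 1;
  }
  // Most frequent failure bucket, ignoring "ok"; null if nothing failed.
  const failures = Object.entries(counts)
    .filter(([bucket, n]) => bucket !== "ok" && n > 0)
    .sort((a, b) => b[1] - a[1]);
  return { counts, top: failures.length > 0 ? failures[0][0] : null };
}

// Illustrative data only.
const reviews = [
  { question: "q1", bucket: "missing_evidence" },
  { question: "q2", bucket: "missing_evidence" },
  { question: "q3", bucket: "wrong_evidence" },
  { question: "q4", bucket: "ok" },
  { question: "q5", bucket: "missing_evidence" },
];

const summary = tallyFailures(reviews);
console.log(summary.top, summary.counts);
```

With this sample, `top` is `missing_evidence`, which per the guidance above would mean working on chunking and retrieval before touching the prompt.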
That’s the reframing: RAG hallucinations usually don’t come from the model alone. They come from pipelines that blur the line between “context supplied” and “context required.”
Quick reality check: if you can’t point to citations for key claims, you’re not doing context grounding—you’re doing story generation with a doc-shaped prop.
Use that test, then iterate on retrieval and chunking strategy first, prompt engineering second, and you’ll see hallucination rates drop without relying on hope.