LLMs are great at sounding confident. That confidence is also the problem. If your app answers from parametric memory alone, you’ll eventually get:
- facts that are wrong
- details that used to be true but aren’t anymore
- confabulations that look plausible because the model has seen a lot of text
This is exactly where “grounding LLM responses” matters. You don’t have to fine-tune or retrain a model. You can change the workflow so the model has to look things up before it answers.
The naive approach: chat with the LLM and hope it’s right
Here’s the common starting point:
// Pseudocode: no sources, no verification
const answer = await llm.chat({
  messages: [
    { role: "system", content: "Answer as helpfully as possible." },
    { role: "user", content: "What changed in the API since last quarter?" }
  ]
});
Sometimes it nails it. Other times, it produces a plausible-sounding version of reality that no source backs up. And you won’t always catch it in review, because the output reads clean even when it’s stale.
This is how LLM hallucinations show up in production: users trust the phrasing, not the provenance.
RAG as the fix: retrieve first, then generate
That’s what retrieval-augmented generation (RAG) is for.
At a high level, a RAG pipeline is a simple two-step flow:
- Retrieve the most relevant chunks from your knowledge source.
- Feed those chunks into the prompt so the model can ground its response in external context.
Now the model isn’t guessing from memory alone. It’s steered toward answering from text you actually provide. The model can still make mistakes, but you’ve removed a large part of the “invent plausible details” failure mode.
A basic RAG pipeline: what you actually have to build
You can think of a minimal RAG system as these components:
1) Document ingestion
You start with documents: PDFs, markdown, tickets, wiki pages, changelogs, or whatever else your app should answer about. Ingestion just means turning that material into clean text.
Why it helps factuality: if you ingest the right source of truth, you’re no longer relying purely on parametric memory.
2) Chunking
Large documents need to be split into smaller pieces (“chunks”). A chunk might be a paragraph, or a fixed token window with overlap.
// Pseudocode: chunking with overlap
const chunks = splitByTokens(documentText, {
  chunkSizeTokens: 300,
  overlapTokens: 50
});
This is where things usually break. Bad chunking causes missed context. If you split a definition from its examples, retrieval might return only the definition. Or it might return only the examples. Either way, generation suffers.
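To make the sliding window concrete, here is a minimal sketch of what a helper like `splitByTokens` might do. It approximates tokens with whitespace-separated words; that approximation, the name `splitByWords`, and the option names are assumptions for illustration. A real pipeline would count tokens with the tokenizer that matches its embedding model.

```typescript
interface ChunkOptions {
  chunkSize: number; // words per chunk (a stand-in for tokens)
  overlap: number;   // words shared between consecutive chunks
}

// Hypothetical helper: sliding-window chunking over whitespace-separated words.
function splitByWords(text: string, { chunkSize, overlap }: ChunkOptions): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end, so no chunk is pure overlap.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

const demoChunks = splitByWords("a b c d e f g h i j", { chunkSize: 4, overlap: 1 });
console.log(demoChunks); // [ 'a b c d', 'd e f g', 'g h i j' ]
```

Note that a fixed window like this ignores paragraph and heading boundaries, which is exactly how a definition ends up separated from its examples.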
3) Embeddings
For each chunk, you compute an embedding. Embeddings map text to vectors so you can compare semantic similarity later.
Why it helps: keyword search can miss meaning. A query for “sign-in failures” won’t match a passage that only says “login errors.” Semantic search using embeddings retrieves passages that are “about the same thing” even if the wording differs.
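Embedding comparisons are commonly done with cosine similarity. The sketch below shows the calculation on tiny hand-made vectors; the vectors are made up for illustration and do not come from a real embedding model, which would produce hundreds or thousands of dimensions.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA * normB);
  return denom === 0 ? 0 : dot / denom; // treat a zero vector as "no similarity"
}

// Toy vectors (assumed, not real embeddings).
const queryVec = [1, 0, 1];
const sameTopic = [2, 0, 2];  // same direction as the query
const otherTopic = [0, 1, 0]; // orthogonal to the query

console.log(cosineSimilarity(queryVec, sameTopic));  // 1
console.log(cosineSimilarity(queryVec, otherTopic)); // 0
```

Because the score depends on direction rather than length, a long passage and a short question can still score as close matches.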
4) Vector database
Store the embeddings in a vector database (Pinecone, Postgres with pgvector, and so on) so you can run similarity queries quickly. At this point the vector store becomes part of your architecture.
Why it helps: efficient retrieval is the whole point. Without it, RAG is just “hold more text in RAM and scan it,” which doesn’t scale.
5) Vector search (semantic search)
When a user asks a question, you embed the question too. Then you query the vector database to get the top-k most relevant chunks.
// Pseudocode: semantic search
const questionEmbedding = await embed(userQuestion);
const results = await vectorDb.query({
  vector: questionEmbedding,
  topK: 5
});
// results contain chunk text + metadata (ids, timestamps, etc.)
Edge case: low-quality retrieval returns irrelevant passages. If you don’t notice, your model will happily “ground” on the wrong context. The output can look even more authoritative than naive prompting, because it’s using retrieved text.
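One common guard against that edge case is a minimum similarity score, so weak matches are dropped instead of padding out the top-k. The sketch below is a brute-force, in-memory stand-in for a vector database query. The `IndexedChunk` shape, the `topKChunks` name, the toy vectors, and the 0.5 threshold are all assumptions for illustration; a real vector database would use an approximate index instead of scanning every chunk.

```typescript
interface IndexedChunk {
  id: string;
  text: string;
  vector: number[]; // embedding, assumed to be precomputed
}

interface ScoredChunk {
  chunk: IndexedChunk;
  score: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA * normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk, drop matches below minScore, return the k best first.
function topKChunks(
  queryVector: number[],
  chunks: IndexedChunk[],
  k: number,
  minScore = 0
): ScoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(queryVector, chunk.vector) }))
    .filter((r) => r.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy index with hand-made 2-d vectors (not real embeddings).
const toyIndex: IndexedChunk[] = [
  { id: "billing", text: "Invoices are issued monthly.", vector: [0, 1] },
  { id: "rate-limits", text: "The API allows 100 requests per minute.", vector: [1, 0] },
  { id: "auth", text: "Requests need a bearer token.", vector: [0.8, 0.6] },
];

const hits = topKChunks([1, 0], toyIndex, 5, 0.5);
console.log(hits.map((h) => h.chunk.id)); // [ 'rate-limits', 'auth' ]
```

Even though `topK` is 5 here, only two chunks come back. Returning fewer results is usually better than handing the model an irrelevant passage to ground on.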
6) Context assembly
Now you build the prompt context. You take the retrieved chunks and format them for the model (often with citations/ids, plus instructions like “use only the provided context”).
Why it helps: you control what context is available. The model can’t cite something it never received.
Edge case: context window limits. You can’t stuff everything into the prompt, so you must choose how to select and compress context. If you cut off the most relevant chunk, generation becomes a guess again.
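A minimal sketch of budgeted context assembly might look like this. It uses a character budget as a rough stand-in for a token budget, and the `RetrievedChunk` shape, the two-argument `assembleContext` signature, and the sample chunks are assumptions for illustration rather than any particular library's API.

```typescript
interface RetrievedChunk {
  id: string;
  text: string;
  score: number; // similarity score from retrieval
}

// Hypothetical assembleContext: keeps the highest-scoring chunks that fit
// within maxChars and labels each one with its id so the model can cite it.
function assembleContext(chunks: RetrievedChunk[], maxChars: number): string {
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const parts: string[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const block = `[${chunk.id}] ${chunk.text}`;
    const cost = block.length + (parts.length > 0 ? 2 : 0); // 2 = "\n\n" separator
    if (used + cost > maxChars) continue; // too big; a smaller chunk may still fit
    parts.push(block);
    used += cost;
  }
  return parts.join("\n\n");
}

// Made-up sample chunks.
const retrieved: RetrievedChunk[] = [
  { id: "changelog-12", text: "v2 removed the /legacy endpoint.", score: 0.91 },
  { id: "wiki-3", text: "The team offsite is in May.", score: 0.4 },
  { id: "changelog-13", text: "v2 added cursor pagination.", score: 0.88 },
];

console.log(assembleContext(retrieved, 100));
```

With a 100-character budget, the two changelog chunks fit and the low-scoring wiki chunk is left out. Ranking before trimming is the design choice that matters: if you trim in arrival order instead, the most relevant chunk can be the one that gets cut.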
7) Generation (grounding LLM responses)
Finally, you ask the LLM to answer using the assembled context. A good pattern is to instruct the model to base the answer on the provided chunks and to indicate when the context doesn’t contain enough information.
// Pseudocode: generation with retrieved chunks
const context = assembleContext(retrievedChunks);
const answer = await llm.chat({
  messages: [
    { role: "system", content: "Answer using only the provided context. If the context does not contain the answer, say so." },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuestion}` }
  ]
});
Important reality check: RAG reduces but does not eliminate hallucinations. The model can still:
- misinterpret retrieved text
- combine multiple chunks incorrectly
- fill gaps when the context is incomplete
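One way to blunt the gap-filling failure is to refuse before the model is ever called when retrieval comes back empty. The sketch below builds the chat messages and returns `null` when there is nothing to ground on. The `buildGroundedMessages` name, the message shape, and the fallback wording are assumptions for illustration, and no real LLM client is called here.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Fallback reply for the caller to use when there is no context (assumed wording).
const NO_CONTEXT_REPLY = "I could not find this in the indexed documents.";

// Returns messages when there is context to ground on, or null when
// retrieval produced nothing usable, so the caller can skip the model.
function buildGroundedMessages(
  question: string,
  contextBlocks: string[]
): ChatMessage[] | null {
  const usable = contextBlocks.map((b) => b.trim()).filter((b) => b.length > 0);
  if (usable.length === 0) return null;
  // Number the blocks so the answer can cite them as [1], [2], ...
  const numbered = usable.map((b, i) => `[${i + 1}] ${b}`).join("\n\n");
  return [
    {
      role: "system",
      content:
        "Answer using only the numbered context. Cite the numbers you used. " +
        "If the context does not contain the answer, say so.",
    },
    { role: "user", content: `Context:\n${numbered}\n\nQuestion: ${question}` },
  ];
}

const demoMessages = buildGroundedMessages("What changed in v2?", []);
console.log(demoMessages === null ? NO_CONTEXT_REPLY : demoMessages);
```

This only covers the empty-retrieval case. When the context is present but incomplete, the model can still fill gaps, which is why inspecting what was retrieved matters.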
Operational edge cases you should plan for
- Stale indexed data: if your knowledge source changes but your index doesn’t, retrieval returns outdated chunks. Freshness requires re-indexing (or incremental updates) and metadata like timestamps.
- Chunk overlap isn’t magic: overlap helps, but it can also duplicate content and waste context budget. You still need good chunk boundaries.
- Metadata matters: if you store timestamps, product versions, or document ids, you can filter retrieval (e.g., “only last 90 days”). Without it, semantic search may pull the wrong era.
- Context window limits (again): top-k doesn’t guarantee coverage. Sometimes the answer is spread across several chunks, and your budget only holds one or two.
- Low-quality retrieval: if embeddings and chunking don’t match your question distribution, you’ll get irrelevant passages. Your prompt may be perfect and it still won’t save you.
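The freshness and metadata points above can be sketched as a simple timestamp filter. The `ChunkRecord` shape, the field names, and the sample records are assumptions for illustration. In practice you would usually push this filter into the vector database query itself rather than filtering after retrieval, if your store supports metadata filters.

```typescript
interface ChunkRecord {
  id: string;
  text: string;
  updatedAt: Date;        // when the source document last changed
  productVersion: string; // example metadata field (assumed)
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keep only chunks whose source changed within the last maxAgeDays days.
function filterFresh(chunks: ChunkRecord[], now: Date, maxAgeDays: number): ChunkRecord[] {
  const cutoff = now.getTime() - maxAgeDays * MS_PER_DAY;
  return chunks.filter((c) => c.updatedAt.getTime() >= cutoff);
}

// Made-up records; "now" is fixed so the example is deterministic.
const demoNow = new Date("2024-06-30T00:00:00Z");
const records: ChunkRecord[] = [
  { id: "notes-new", text: "v2 removed the /legacy endpoint.", updatedAt: new Date("2024-06-01T00:00:00Z"), productVersion: "2.0" },
  { id: "notes-old", text: "The /legacy endpoint is recommended.", updatedAt: new Date("2023-11-15T00:00:00Z"), productVersion: "1.4" },
];

const fresh = filterFresh(records, demoNow, 90);
console.log(fresh.map((c) => c.id)); // [ 'notes-new' ]
```

Here the two records contradict each other, and both would likely score well for a question about the `/legacy` endpoint. Without the timestamp, similarity alone has no way to prefer the current one.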
One practical evaluation trick: verify what was retrieved
Don’t evaluate RAG only by looking at the final answer. Inspect the retrieved chunks.
- For a real user question, log the top-k chunks returned by semantic search.
- Ask: are the retrieved passages actually the ones that contain the answer?
- If not, you have a retrieval problem (chunking, embeddings, indexing, or query formulation), not a generation problem.
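If you log retrieved chunk ids for a small set of questions with known answers, the check above turns into a number you can track. The sketch below computes a hit rate at k. The `EvalCase` shape and the sample cases are assumptions for illustration; the `expectedChunkId` values would come from a person marking which chunk contains the answer.

```typescript
interface EvalCase {
  question: string;
  expectedChunkId: string;     // chunk a human marked as containing the answer
  retrievedChunkIds: string[]; // what semantic search returned, in rank order
}

// Fraction of cases where the expected chunk appears in the top k results.
function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hitCount = cases.filter((c) =>
    c.retrievedChunkIds.slice(0, k).includes(c.expectedChunkId)
  ).length;
  return hitCount / cases.length;
}

// Made-up evaluation log.
const evalCases: EvalCase[] = [
  { question: "What was removed in v2?", expectedChunkId: "changelog-12", retrievedChunkIds: ["changelog-12", "wiki-3", "faq-1"] },
  { question: "How do I rotate keys?", expectedChunkId: "faq-7", retrievedChunkIds: ["wiki-3", "faq-7", "faq-1"] },
  { question: "Is pagination supported?", expectedChunkId: "changelog-13", retrievedChunkIds: ["changelog-13", "changelog-12", "wiki-9"] },
  { question: "What is the refund policy?", expectedChunkId: "wiki-2", retrievedChunkIds: ["faq-1", "wiki-3", "faq-4"] },
];

console.log(hitRateAtK(evalCases, 1)); // 0.5
console.log(hitRateAtK(evalCases, 3)); // 0.75
```

A low hit rate at your production `topK` points at chunking, embeddings, or indexing. No prompt change will fix it.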
Also run a small “citation or it didn’t happen” test: create questions that require exact citations or recent data (e.g., “quote the release note line about X” or “what changed on YYYY-MM-DD?”). If the retrieved context can’t support the claim, you’ll catch hallucinations quickly.
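For the quote-style questions, part of that test can be automated: pull the quoted spans out of the answer and check that each one appears verbatim in the retrieved context. The sketch below does an exact, case-sensitive substring match on straight double quotes. The `unsupportedQuotes` name and the sample strings are assumptions for illustration, and a real check would likely need to normalize whitespace and quote styles.

```typescript
// Returns the double-quoted spans in `answer` that do not appear
// verbatim in `context`.
function unsupportedQuotes(answer: string, context: string): string[] {
  const quotePattern = /"([^"]+)"/g;
  const missing: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = quotePattern.exec(answer)) !== null) {
    const quoted = match[1];
    if (!context.includes(quoted)) missing.push(quoted);
  }
  return missing;
}

// Made-up context and answer.
const retrievedContext =
  "Release 2.4 removed the /legacy endpoint and added cursor pagination.";
const modelAnswer =
  'The notes say "removed the /legacy endpoint" and "deprecated API keys".';

console.log(unsupportedQuotes(modelAnswer, retrievedContext)); // [ 'deprecated API keys' ]
```

Any span this returns is a claim the context cannot back up, which makes it a cheap signal to log alongside the retrieved chunks.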
The best sanity check for retrieval-augmented generation is boring: if you can’t find the answer in the retrieved chunks yourself, treat any confident answer from the model as suspect.
That’s the shift RAG gives you. You’re no longer trusting the model’s memory. You’re trusting your pipeline.
And yes, this is still a system. If you treat it like magic, you’ll ship stale context with confidence.
If you build the retrieval step and make it inspectable, failures become much easier to diagnose.