LLMs are great at sounding confident. That confidence is also the problem. If your app answers from parametric memory alone, you’ll eventually get:

facts that are wrong
details that used to be true but aren’t anymore
confabulations that look plausible because the model has seen a lot of text

This is exactly where “grounding LLM responses” matters. You don’t have to fine-tune or retrain the model. You can change the workflow so relevant text is looked up and handed to the model before it answers.

The naive approach: chat with the LLM and hope it’s right

Here’s the common starting point:

// Pseudocode: no sources, no verification
const answer = await llm.chat({
  messages: [
    { role: "system", content: "Answer as helpfully as possible." },
    { role: "user", content: "What changed in the API since last quarter?" }
  ]
});

Sometimes it nails it. Other times, it generates a plausible version of reality that is out of date or simply wrong. And you won’t always catch it in reviews, because the output reads clean even when it’s stale.

This is how LLM hallucinations show up in production: users trust the phrasing, not the provenance.

RAG as the fix: retrieve first, then generate

That’s what retrieval-augmented generation (often called RAG) is for.

At a high level, a RAG pipeline is a simple two-step flow:

Retrieve the most relevant chunks from your knowledge source.
Feed those chunks into the prompt so the model can ground its response in external context.

Now the model isn’t guessing from memory alone. It’s instructed to answer using text you actually provide. The model can still make mistakes, but you’ve removed a large share of the “invent plausible details” failure mode.

A basic RAG pipeline: what you actually have to build

You can think of a minimal RAG system as these components:

1) Document ingestion

You start with documents: PDFs, markdown, tickets, wiki pages, changelogs—whatever your app should answer about. Ingestion just means you turn that into a stream of clean text.

Why it helps factuality: if you ingest the right source of truth, you’re no longer relying purely on parametric memory.
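To make "a stream of clean text" concrete, here is a minimal sketch of an ingestion step. The names `SourceDoc`, `CleanDoc`, `toCleanText`, and `ingest` are hypothetical, and the regex-based tag stripping is a naive stand-in: a real pipeline would likely use a proper parser for each format (PDF, HTML, markdown).

```typescript
// Hypothetical shapes for this sketch; not part of any library.
interface SourceDoc {
  id: string;
  raw: string;
  updatedAt: string; // ISO date, kept as metadata for later filtering
}

interface CleanDoc {
  id: string;
  text: string;
  updatedAt: string;
}

// Naive cleaner for HTML-ish input. Assumption: good enough for a demo only.
function toCleanText(raw: string): string {
  return raw
    .replace(/<script[\s\S]*?<\/script>/gi, " ") // drop script bodies
    .replace(/<style[\s\S]*?<\/style>/gi, " ") // drop style bodies
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&") // decoded last to avoid double-decoding
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/\s*\n\s*/g, "\n") // collapse blank lines, trim around newlines
    .trim();
}

function ingest(docs: SourceDoc[]): CleanDoc[] {
  return docs
    .map((d) => ({ id: d.id, text: toCleanText(d.raw), updatedAt: d.updatedAt }))
    .filter((d) => d.text.length > 0); // skip documents that are empty after cleaning
}
```

Carrying `updatedAt` through ingestion is a design choice: it costs nothing here and makes freshness filtering possible later.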

2) Chunking

Large documents need to be split into smaller pieces (“chunks”). A chunk might be a paragraph, or a fixed token window with overlap.

// Pseudocode: chunking with overlap
const chunks = splitByTokens(documentText, {
  chunkSizeTokens: 300,
  overlapTokens: 50
});

This is where things usually break. Bad chunking causes missed context. If you split a definition from its examples, retrieval might return only the definition. Or it might return only the examples. Either way, generation suffers.
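The pseudocode above leaves `splitByTokens` undefined. One possible implementation is sketched below, assuming whitespace-separated words stand in for model tokens (a real system would count tokens with the tokenizer of its embedding model). It uses a sliding window that advances by `chunkSizeTokens - overlapTokens` each step.

```typescript
interface ChunkOptions {
  chunkSizeTokens: number;
  overlapTokens: number;
}

// Assumption: whitespace-separated words approximate tokens.
function splitByTokens(text: string, opts: ChunkOptions): string[] {
  const { chunkSizeTokens, overlapTokens } = opts;
  if (chunkSizeTokens <= 0 || overlapTokens < 0 || overlapTokens >= chunkSizeTokens) {
    throw new Error("need chunkSizeTokens > overlapTokens >= 0");
  }

  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = chunkSizeTokens - overlapTokens; // how far each window advances
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSizeTokens).join(" "));
    // Stop once a window has reached the end, so the tail is not emitted twice.
    if (start + chunkSizeTokens >= tokens.length) break;
  }
  return chunks;
}

// Example: 10 words, windows of 4 with 1 word of overlap.
const demoChunks = splitByTokens("a b c d e f g h i j", {
  chunkSizeTokens: 4,
  overlapTokens: 1,
});
// demoChunks is ["a b c d", "d e f g", "g h i j"]
```

A fixed window like this ignores document structure, which is exactly how a definition gets separated from its examples. Splitting on headings or paragraphs first, then applying the window inside each section, is a common refinement.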

3) Embeddings

For each chunk, you compute an embedding. Embeddings map text to vectors so you can compare semantic similarity later.

Why it helps: keyword search misses meaning. Semantic search using embeddings retrieves passages that are “about the same thing” even if wording differs.
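"Compare semantic similarity" usually means cosine similarity between two vectors. Here is a small sketch of that comparison. The three-dimensional vectors are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```typescript
// Cosine similarity: 1 means same direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat a zero vector as "no similarity"
}

// Made-up toy vectors, NOT real model output.
const refundPolicy = [0.9, 0.1, 0.0];
const moneyBack = [0.8, 0.2, 0.1];
const gpuDrivers = [0.0, 0.1, 0.9];

const closeScore = cosineSimilarity(refundPolicy, moneyBack);
const farScore = cosineSimilarity(refundPolicy, gpuDrivers);
// closeScore is much higher than farScore, even though the two "texts"
// would share no keywords.
```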

4) Vector database

Store embeddings in a vector database so you can run similarity queries quickly. This is where a vector store (such as Pinecone, or Postgres with the pgvector extension) becomes part of your architecture.

Why it helps: efficient retrieval is the whole point. Without it, RAG is just “hold more text in RAM and scan it,” which doesn’t scale.
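To show the shape of what a vector database does for you, here is a brute-force, in-memory stand-in. `InMemoryVectorStore` and its types are hypothetical names for this sketch. It is precisely the "scan everything in RAM" approach that stops scaling, which is why real systems typically rely on an indexed store instead.

```typescript
interface StoredChunk {
  id: string;
  text: string;
  vector: number[];
  metadata: { docId: string; updatedAt: string };
}

interface ScoredChunk {
  chunk: StoredChunk;
  score: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical stand-in for a vector database: O(n) scan per query.
class InMemoryVectorStore {
  private items = new Map<string, StoredChunk>();

  // Insert a chunk, or replace the existing one with the same id.
  upsert(chunk: StoredChunk): void {
    this.items.set(chunk.id, chunk);
  }

  // Score every stored chunk against the query and keep the best topK.
  query(vector: number[], topK: number): ScoredChunk[] {
    return Array.from(this.items.values())
      .map((chunk) => ({ chunk, score: cosine(vector, chunk.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }
}
```

Upserting by chunk id matters more than it looks: it is what lets you re-index a changed document without leaving stale duplicates behind.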

5) Vector search (semantic search)

When a user asks a question, you embed the question too. Then you query the vector database to get the top-k most relevant chunks.

// Pseudocode: semantic search
const questionEmbedding = await embed(userQuestion);

const results = await vectorDb.query({
  vector: questionEmbedding,
  topK: 5
});

// results contain chunk text + metadata (ids, timestamps, etc.)

Edge case: low-quality retrieval returns irrelevant passages. If you don’t notice, your model will happily “ground” on the wrong context. The output can look even more authoritative than naive prompting, because it’s using retrieved text.
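One cheap guard is a minimum similarity score: if nothing clears the bar, skip generation and say you don't know. The sketch below assumes each hit carries a `score` where higher means more similar (some stores return distances instead, where lower is better). The `0.75` threshold is an arbitrary placeholder that you would tune for your embedding model and data.

```typescript
interface RetrievalHit {
  id: string;
  text: string;
  score: number; // assumption: similarity, higher is better
}

// Placeholder value; tune it against real queries.
const MIN_SCORE = 0.75;

// Keep only hits that clear the threshold, best first.
function selectUsableHits(hits: RetrievalHit[], minScore: number = MIN_SCORE): RetrievalHit[] {
  return hits
    .filter((h) => h.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

// If nothing is usable, the app should decline instead of generating.
function hasUsableContext(hits: RetrievalHit[], minScore: number = MIN_SCORE): boolean {
  return selectUsableHits(hits, minScore).length > 0;
}
```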

6) Context assembly

Now you build the prompt context. You take the retrieved chunks and format them for the model (often with citations/ids, plus instructions like “use only the provided context”).

Why it helps: you control what context is available. The model can’t cite something it never received.

Edge case: context window limits. You can’t stuff everything into the prompt, so you must choose how to select and compress context. If you cut off the most relevant chunk, generation becomes a guess again.
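Here is one way to assemble context under a budget, as a sketch. It assumes the chunks arrive ranked best first, estimates tokens by counting words (a rough proxy, not a real tokenizer), and labels each chunk with its id so the model has something to cite. `assembleContext` and `RankedChunk` are hypothetical names.

```typescript
interface RankedChunk {
  id: string;
  text: string;
}

// Assumption: word count is a rough proxy for token count.
const estimateTokens = (text: string): number =>
  text.split(/\s+/).filter((w) => w.length > 0).length;

// Greedy packing: walk chunks in ranked order, skip any that would
// push the total over budget, and keep going in case a smaller one fits.
function assembleContext(
  ranked: RankedChunk[],
  maxTokens: number
): { context: string; usedIds: string[] } {
  const parts: string[] = [];
  const usedIds: string[] = [];
  let used = 0;

  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxTokens) continue;
    parts.push(`[${chunk.id}] ${chunk.text}`);
    usedIds.push(chunk.id);
    used += cost;
  }
  return { context: parts.join("\n\n"), usedIds };
}
```

Returning `usedIds` alongside the text is deliberate: logging which chunks actually made it into the prompt is what lets you tell "retrieval missed it" apart from "the budget cut it".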

7) Generation (grounding LLM responses)

Finally, you ask the LLM to answer using the assembled context. A good pattern is to instruct the model to base the answer on the provided chunks and to indicate when the context doesn’t contain enough information.

// Pseudocode: generation with retrieved chunks
const context = assembleContext(retrievedChunks);

const answer = await llm.chat({
  messages: [
    { role: "system", content: "Answer using only the provided context. If the context does not contain the answer, say so." },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${userQuestion}` }
  ]
});

Important reality check: RAG reduces but does not eliminate hallucinations. The model can still:

misinterpret retrieved text
combine multiple chunks incorrectly
fill gaps when the context is incomplete
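A small post-generation check can catch some of these. The sketch below assumes your prompt asks the model to cite chunk ids in square brackets, like `[c1]` (a convention assumed here for illustration). It flags any cited id that was never in the context, which is a strong hint the answer drifted away from the retrieved text. `findUnknownCitations` is a hypothetical helper.

```typescript
// Returns cited ids that were NOT among the chunks sent to the model.
// Assumption: citations look like [some-id] with letters, digits, _ or -.
function findUnknownCitations(answer: string, providedIds: string[]): string[] {
  const provided = new Set(providedIds);
  const cited = new Set<string>();
  const pattern = /\[([A-Za-z0-9_-]+)\]/g;

  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    cited.add(match[1]);
  }
  return Array.from(cited).filter((id) => !provided.has(id));
}

const unknown = findUnknownCitations(
  "Rate limits changed [c1]. Auth was removed [c9].",
  ["c1", "c2"]
);
// unknown is ["c9"]: the model cited a chunk it never received.
```

This only verifies that citations point at real chunks. It does not verify that the cited chunk supports the claim, so treat it as a tripwire, not proof.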

Operational edge cases you should plan for

Stale indexed data: if your knowledge source changes but your index doesn’t, retrieval returns outdated chunks. Freshness requires re-indexing (or incremental updates) and metadata like timestamps.
Chunk overlap isn’t magic: overlap helps, but it can also duplicate content and waste context budget. You still need good chunk boundaries.
Metadata matters: if you store timestamps, product versions, or document ids, you can filter retrieval (e.g., “only last 90 days”). Without it, semantic search may pull the wrong era.
Context window limits (again): top-k doesn’t guarantee coverage. Sometimes the answer is spread across several chunks, and your budget only holds one or two.
Low-quality retrieval: if embeddings and chunking don’t match your question distribution, you’ll get irrelevant passages. Your prompt may be perfect and it still won’t save you.
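The "only last 90 days" filter can be sketched as a post-filter on retrieved hits. This assumes each chunk stores an ISO 8601 `updatedAt` string (a hypothetical field name). Many vector databases can apply metadata filters at query time, which is usually preferable because filtering afterwards can leave you with fewer than top-k results.

```typescript
interface DatedHit {
  id: string;
  text: string;
  updatedAt: string; // assumption: ISO 8601 timestamp stored at ingestion
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keep hits updated within the last maxAgeDays, relative to `now`.
// Hits with an unparseable timestamp are dropped rather than trusted.
function keepRecent(hits: DatedHit[], now: Date, maxAgeDays: number): DatedHit[] {
  const cutoff = now.getTime() - maxAgeDays * MS_PER_DAY;
  return hits.filter((h) => {
    const t = Date.parse(h.updatedAt);
    return !Number.isNaN(t) && t >= cutoff;
  });
}
```

Passing `now` in as a parameter, instead of calling `new Date()` inside, keeps the function easy to test with fixed dates.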

One practical evaluation trick: verify what was retrieved

Don’t evaluate RAG only by looking at the final answer. Inspect the retrieved chunks.

For a real user question, log the top-k chunks returned by semantic search.
Ask: are the retrieved passages actually the ones that contain the answer?
If not, you have a retrieval problem (chunking, embeddings, indexing, or query formulation), not a generation problem.
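The inspection steps above can be turned into a small repeatable check. This sketch assumes you have hand-labeled a few questions with the id of the chunk that contains the answer, and that your retriever can be wrapped as a function returning chunk ids. `EvalCase`, `Retriever`, and `retrievalHitRate` are hypothetical names.

```typescript
interface EvalCase {
  question: string;
  expectedChunkId: string; // hand-labeled: the chunk that contains the answer
}

// Assumption: your retrieval step can be wrapped to return ranked chunk ids.
type Retriever = (question: string, topK: number) => string[];

// Fraction of questions whose expected chunk appears in the top-k results.
function retrievalHitRate(cases: EvalCase[], retrieve: Retriever, topK: number): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const ids = retrieve(c.question, topK);
    if (ids.indexOf(c.expectedChunkId) !== -1) {
      hits++;
    } else {
      // Log misses so you can inspect what came back instead.
      console.log(`MISS: "${c.question}" -> [${ids.join(", ")}]`);
    }
  }
  return hits / cases.length;
}
```

If this number is low, no amount of prompt tuning will fix the final answers. Work on chunking, embeddings, or query formulation first.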

Also run a small “citation or it didn’t happen” test: create questions that require exact citations or recent data (e.g., “quote the release note line about X” or “what changed on YYYY-MM-DD?”). If the retrieved context can’t support the claim, you’ll catch hallucinations quickly.
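For the "quote the release note line" style of test, the check can be mechanical: does the quoted text actually appear in what was retrieved? Here is a minimal sketch, assuming case and whitespace differences should be ignored. `quoteIsSupported` is a hypothetical helper, and the sample strings are made up.

```typescript
// Lowercase and collapse whitespace so formatting differences don't matter.
const normalizeText = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

// True only if the quote appears verbatim (after normalization)
// in at least one retrieved chunk.
function quoteIsSupported(quote: string, retrievedTexts: string[]): boolean {
  const needle = normalizeText(quote);
  if (needle.length === 0) return false;
  return retrievedTexts.some((t) => normalizeText(t).indexOf(needle) !== -1);
}

// Made-up sample data for illustration.
const retrievedTexts = [
  "v2.3 release notes:\nThe /export endpoint   now requires an API key.",
];
const supported = quoteIsSupported("the /export endpoint now requires an API key", retrievedTexts);
const invented = quoteIsSupported("the /export endpoint was removed", retrievedTexts);
// supported is true, invented is false
```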

The best sanity check for retrieval-augmented generation is boring: if you can’t find the answer in the retrieved chunks, the model shouldn’t be able to fake it reliably.

That’s the shift RAG gives you. You’re no longer trusting the model’s memory. You’re trusting your pipeline.

And yes—this is still a system. If you treat it like magic, you’ll ship stale context with confidence.

If you build the retrieval and make it inspectable, everything gets easier.

Snehasish Konger

Developed @scientyficworld.org | Technical writer @Nected | Content Developer

How to Use Retrieval-Augmented Generation (RAG) to Reduce Hallucinations in LLM Apps
