RAG · Hybrid Search · Chunking · Evaluation

RAG Production Architecture: Hybrid Search & Hallucination Prevention

2024-12-30 • 10 min read

Our RAG prototype is amazing! We feed it a few PDFs, ask questions, and it nails the answers. But when we tried it on our full 10-year document archive in production... it's a disaster. Wrong answers, missing obvious facts, and sometimes it just makes stuff up. What happened?

Welcome to the 'Demo vs. Production' valley of despair. Your prototype probably uses naive semantic search: embed the query, find the 5 most similar document chunks, feed them to the LLM, done. That works great on clean, small datasets where every chunk is highly relevant. But in production, with millions of messy documents? Semantic search alone is like using Google and hoping the top result has your answer. Spoiler: it often doesn't.

Wait, I thought vector embeddings were supposed to 'understand meaning'! Isn't that the whole point?

They do, but only approximately: an embedding captures the overall gist of a passage, not its exact terms. Here's the problem: semantic search is conceptually fuzzy. If you search for 'Apple stock price,' it might return documents about 'fruit market trends' or 'NASDAQ technology sector' because they're semantically close, while missing the exact match you actually wanted. Conversely, if you search for 'John Smith contract renewal,' pure semantic search might miss a document that says 'J. Smith agreement extension' because the phrasing is different. You need both semantic understanding AND exact keyword matching. That's why the 2025 standard is Hybrid Search.

Hybrid Search. Okay, what is that exactly?

It's combining two retrieval methods in parallel:

1. Dense (Semantic) Retrieval: Your classic vector embedding search. Finds documents that are conceptually similar to your query.

2. Sparse (Keyword) Retrieval: Good old BM25 (think TF-IDF on steroids). Finds documents with exact term matches, weighted by rarity and frequency.

You run BOTH searches, merge the results, and then rerank them with a cross-encoder model that scores each 'query + document' pair. In reported benchmarks this dual approach can improve retrieval accuracy by something like 40-70% over pure semantic search, though the exact gain depends heavily on your corpus and queries. Modern vector DBs like Qdrant have hybrid retrieval built in: a single query can combine dense vectors with sparse ones such as their BM42 embeddings, a BM25-style term weighting informed by transformer attention.
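Here's a minimal sketch of the fuse-and-rerank step in Python, assuming you already have two ranked lists of document IDs (one from your vector index, one from BM25). The IDs, the `texts` lookup, and the cross-encoder model name in the comment are illustrative placeholders; the merge uses Reciprocal Rank Fusion (RRF), a common way to combine the two rankings.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of doc IDs into a single ranking.
    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional RRF constant."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked ID lists from the two retrievers.
dense_hits = ["doc_17", "doc_42", "doc_03", "doc_08"]    # vector search
sparse_hits = ["doc_42", "doc_91", "doc_17", "doc_55"]   # BM25 search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# doc_42 and doc_17 float to the top because both retrievers found them.

# Optional second stage: rerank the fused top-N with a cross-encoder, e.g.
#   from sentence_transformers import CrossEncoder
#   reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   scores = reranker.predict([(query, texts[doc_id]) for doc_id in fused[:20]])
```

RRF is convenient because it only needs ranks, not comparable scores, so you never have to normalize BM25 scores against cosine similarities.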

That sounds... complicated. Is it really worth it?

Let me give you a real example. Legal firm searching for 'breach of contract clause 14.2 in Tesla agreements.' Pure semantic search returns general Tesla contracts because 'Tesla' is semantically strong. But it misses the one document that actually says 'clause 14.2 breach remedy' because it's buried in legalese. Hybrid search finds it because '14.2' is an exact match AND 'breach' + 'clause' are keywords. That's the difference between 'close enough' and 'exactly what I need.' In production, you can't afford 'close enough.'

Okay, I'm sold on Hybrid Search. But you mentioned chunks earlier. How do we even split documents? Just every 500 words?

Oh god, please don't. Fixed-size chunking (every N tokens) is the fastest way to destroy context. Imagine slicing a book mid-sentence, mid-paragraph, or (worse) mid-table. The LLM gets fragments that make no sense. Instead, use Semantic Chunking: split at logical boundaries like paragraphs, sections, or headings. For highly structured content (like legal docs or financial reports), go page-level, or even use LLM-based chunking, where you ask an LLM to segment the document intelligently. Yes, it's slower and costs more, but the retrieval accuracy gain is massive.
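As a rough illustration, here's a boundary-aware chunker sketch in Python: it splits on blank lines and markdown-style headings, then packs whole paragraphs into chunks up to a token budget. The whitespace word count is a crude stand-in for your embedding model's real tokenizer.

```python
import re

def semantic_chunks(text, max_tokens=400):
    """Split on headings and blank lines, then pack whole paragraphs
    into chunks without ever cutting a paragraph in half."""
    # Treat blank lines and markdown-style headings as boundaries.
    blocks = [b.strip() for b in re.split(r"\n\s*\n|\n(?=#{1,6} )", text) if b.strip()]

    chunks, current, current_len = [], [], 0
    for block in blocks:
        n_tokens = len(block.split())   # crude proxy; swap in a real tokenizer
        if current and current_len + n_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)           # an oversized paragraph becomes its own chunk
        current_len += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```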

And how big should each chunk be?

There's no magic number, but 128-512 tokens is common. Too small (under 100 tokens), and you lose context. Too large (over 1000 tokens), and your vector embedding becomes a meaningless mush of averaged concepts. The key is to test empirically: take 100 real queries, try different chunk sizes, measure Recall@10 (did the right chunks show up in the top 10?) and Precision@5 (are the top 5 chunks actually relevant?). Tune from there. Oh, and always add metadata headers to chunks—like 'Source: Q3 Financial Report, Section: Revenue, Page: 42'—so the LLM knows what it's looking at.
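For the metadata headers, something as simple as this sketch works; the field names and example values are just illustrations:

```python
def with_header(chunk_text, source, section, page):
    """Prepend a provenance header so the chunk is self-describing.
    The header gets embedded along with the text, which also helps
    retrieval for queries that mention the source or section."""
    header = f"Source: {source} | Section: {section} | Page: {page}"
    return f"{header}\n{chunk_text}"

chunk = with_header(
    "Revenue grew 12% quarter over quarter, driven by services.",
    source="Q3 Financial Report", section="Revenue", page=42,
)
# -> "Source: Q3 Financial Report | Section: Revenue | Page: 42\nRevenue grew ..."
```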

The "Million Dollar" Question

"But if we do all this—Hybrid Search, perfect chunking, reranking—doesn't that eliminate hallucinations?"

Technical Reality Check

Why RAG Doesn't Eliminate Hallucinations

1. Bad Retrieval = Bad Answers. If your hybrid search returns irrelevant chunks (because embeddings are noisy, or the query is ambiguous), the LLM has no choice but to guess. And LLMs are very confident guessers.

2. Context Window Overload. If you shove 20 chunks into the LLM hoping 'more is better,' you dilute the signal. The LLM can't tell which chunk is authoritative, so it blends them or invents a synthesis. Google's 2024 research calls this the 'Sufficient Context' problem—even if the right chunk is there, if it's buried in noise, the LLM ignores it.

3. Insufficient Context. Sometimes the retrieved chunk is technically related but doesn't actually answer the question. Like asking 'What's our refund policy?' and getting a chunk about 'customer satisfaction initiatives.' The LLM tries to fill the gap, and boom—hallucination.

So how do we actually prevent hallucinations?

Layer 1: Retrieval Quality. Use hybrid search, rerank results, and filter by similarity threshold. If the top chunk scores below 0.7 similarity, don't even send it to the LLM—just return 'No relevant information found.'

Layer 2: Context Window Management. Don't overload the LLM. Send only the top 3-5 MOST relevant chunks, not all 10. Quality over quantity.
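Layers 1 and 2 are really one gating function in front of the LLM. A minimal sketch, assuming each retrieved chunk comes back with a similarity score normalized to [0, 1]; the example scores are made up.

```python
def select_context(ranked_chunks, min_score=0.7, max_chunks=5):
    """ranked_chunks: (chunk_text, similarity) pairs, sorted best-first.
    Returns at most max_chunks chunks above the similarity threshold;
    an empty result means: answer 'No relevant information found'."""
    passing = [text for text, score in ranked_chunks if score >= min_score]
    return passing[:max_chunks]

# Example with made-up scores:
ranked = [("Refund policy: 30 days ...", 0.83),
          ("Customer satisfaction initiatives ...", 0.64),
          ("Warranty terms ...", 0.58)]
context = select_context(ranked)
if not context:
    print("No relevant information found.")
else:
    print(f"Sending {len(context)} chunk(s) to the LLM.")  # here: 1 chunk
```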

Layer 3: Uncertainty Modeling. Train or prompt your LLM to say 'I don't know' when confidence is low. A fine-tuned model can learn to refuse answers instead of guessing. It's humbling but honest.

Layer 4: Groundedness Metrics. Automatically score every answer for Faithfulness (is the answer supported by the retrieved context?) using tools like RAGAS or a judge LLM. If Faithfulness < 0.8, flag it for human review.
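Here's a sketch of the judge-LLM flavor of that check (RAGAS packages the same idea as a library metric). `call_llm` is a placeholder for whatever LLM client you use, and the prompt wording and 0.8 threshold are illustrative.

```python
JUDGE_PROMPT = """You are checking a RAG answer for groundedness.

Context:
{context}

Answer:
{answer}

What fraction of the claims in the answer are directly supported by the context?
Reply with a single number between 0 and 1."""

def faithfulness_score(answer, context_chunks, call_llm):
    """call_llm(prompt: str) -> str is a placeholder for your LLM client."""
    prompt = JUDGE_PROMPT.format(context="\n\n".join(context_chunks), answer=answer)
    try:
        return float(call_llm(prompt).strip())
    except ValueError:
        return 0.0  # judge returned something unparseable: treat as ungrounded

def flag_for_review(answer, context_chunks, call_llm, threshold=0.8):
    return faithfulness_score(answer, context_chunks, call_llm) < threshold
```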

This sounds like a ton of moving parts. How do we even know if it's working?

You measure it obsessively. In production, you need real-time monitoring of:

Retrieval Metrics:

  • Recall@10: Did the right chunks appear in the top 10 results? (Target: >90%)
  • Precision@5: Are the top 5 chunks actually relevant? (Target: >80%)
  • MRR (Mean Reciprocal Rank): How quickly does the first relevant chunk appear? (Target: >0.8)
  • NDCG@10: Are more relevant chunks ranked higher? (Target: >0.85)
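A quick sketch of computing those four retrieval metrics from labeled data, where `retrieved` is the ranked list of chunk IDs your pipeline returned for one query and `relevant` is the set of IDs a human marked as correct (binary relevance):

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k=5):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def reciprocal_rank(retrieved, relevant):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0  # MRR = mean of this value over all queries

def ndcg_at_k(retrieved, relevant, k=10):
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Average each metric over your labeled query set (the 100-query set from the chunking discussion works) and feed the averages to the dashboard.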

Generation Metrics:

  • Faithfulness: Is the answer grounded in retrieved context?
  • Answer Relevance: Does the answer actually address the query?
  • Hallucination Rate: How often does it invent facts? (Target: <5%)

Deploy dashboards (Grafana, Evidently AI) that track these in real time. If Recall drops below 85% or the Hallucination Rate spikes above 10%, you've got drift: maybe your data changed, or your embeddings degraded. Investigate immediately.

This is way more complex than I thought. Are we sure RAG is even worth it?

For knowledge-intensive tasks where accuracy matters—legal research, medical documentation, compliance queries—absolutely. Fine-tuning an LLM on all your docs is insanely expensive and doesn't scale when docs change weekly. RAG lets you update the knowledge base without retraining. But yeah, it's not plug-and-play. You're building a retrieval pipeline (chunking, embedding, indexing), a ranking layer (hybrid search, reranking), and a quality gate (similarity thresholds, groundedness checks). It's engineering, not magic.

Technical Reality Check

What RAG Production Architecture Does NOT Give You

1. Auto-Scaling Intelligence. RAG doesn't magically adapt to new domains. If your docs are all legal and you suddenly add medical content, your embeddings might be garbage for medical queries. You'll need domain-specific embedding models or fine-tuning.

2. Perfect Answers. Even with 95% Recall and 90% Precision, 5-10% of queries will fail. The LLM will hallucinate, the retrieval will miss context, or the question is just unanswerable from your data. Plan for failure modes.

3. Free Lunch. Embedding 10M documents? That's thousands of dollars in API calls. Reranking every query? Another $0.01-$0.05 per query. Monitoring? More infrastructure. Budget for it.
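To put rough numbers on the embedding bill, here is a back-of-envelope sketch; the average document length and per-token prices are assumptions (ballpark hosted-embedding pricing as of late 2024), so substitute your own figures.

```python
docs = 10_000_000
tokens_per_doc = 2_000                      # assumed average document length
total_tokens = docs * tokens_per_doc        # 20 billion tokens

for model, usd_per_million_tokens in [("small embedding model", 0.02),
                                      ("large embedding model", 0.13)]:
    cost = total_tokens / 1_000_000 * usd_per_million_tokens
    print(f"{model}: ${cost:,.0f}")         # ~$400 vs ~$2,600 for one full pass
```

And that's only the initial indexing pass: re-embedding after every chunking change, plus per-query reranking and the monitoring stack, is what turns this into a standing budget line.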

Bottom Line: Hybrid Search + Semantic Chunking + Reranking + Grounding is the 2025 production standard. If you skip any of these, you'll regret it when your CEO asks why the AI said something wildly wrong.