
RAG Retrieval Bench

An open-source benchmark that measures how much RAG configuration choices actually matter. It tests 36 configurations across the same corpus and questions — varying chunking strategy, embedding model, and retrieval method — and quantifies every trade-off. The central finding is not a universal optimal stack; it's that the optimal stack changes depending on what your data looks like — and what your cost, latency, and infrastructure requirements are.


Challenge

There is no universally best RAG configuration. The setup that outperforms everything else on one dataset can be mediocre on another — and the gap between optimal and default is rarely obvious until you measure it. The same risk applies when a system evolves: a second content type gets added, the existing configuration gets carried across, and a pipeline tuned for dense technical documentation ends up quietly underperforming on conversational support tickets — or vice versa. The only reliable answer is measurement: on your data, with your questions, against your quality bar.

This project makes that measurable — and demonstrates why the right answer changes depending on what your data actually looks like.

Every RAG pipeline forces three decisions before a user ever asks a question:

  • How to chunk documents into searchable pieces
  • Which model to use for embedding
  • Which retrieval strategy to apply at query time

The decisions compound. 3 chunkers × 3 embedders × 4 retrieval strategies = 36 configurations. Each selection represents a meaningfully different approach — different philosophical assumptions about where to cut text, different points on the quality/cost/speed spectrum for embeddings, progressively more sophisticated retrieval strategies — rather than an attempt to enumerate every available option.
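The combinatorics are small enough to enumerate directly. A minimal sketch (axis values taken from this article; the variable names are illustrative, not the repo's):

```python
from itertools import product

# The three decision axes tested in the benchmark.
chunkers = ["fixed-size", "sentence", "semantic"]
embedders = ["MiniLM", "BGE-M3", "OpenAI"]
retrievers = ["semantic", "BM25", "hybrid", "reranker"]

# Every combination of the three axes is one benchmark configuration.
configurations = list(product(chunkers, embedders, retrievers))
print(len(configurations))  # 36
```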

documents → chunking (fixed-size · sentence · semantic) → embedding (MiniLM · BGE-M3 · OpenAI) → retrieval (semantic · BM25 · hybrid · reranker) → answer

3 chunkers × 3 embedders × 4 retrieval strategies = 36 configurations

Approach

The design is deliberately controlled: one corpus, one question set, 36 configurations run against both. Everything except the variable being tested is held constant.

The corpus

69 ArXiv papers on RAG, dense retrieval, hybrid search, and embedding models — the original RAG paper (Lewis et al.), DPR, ColBERT, ColBERTv2, SPLADE, BGE-M3, HyDE, Self-RAG, RAPTOR, and dozens more. The corpus is technically dense: acronyms, model names, citation patterns, methodology sections. That density genuinely stresses chunking and retrieval in ways that clean English prose wouldn’t — the kind of content most teams are actually dealing with when they index internal documentation.

The question set

A benchmark is only as good as its questions. Get the question set wrong and the results are meaningless — configurations that perform well on poorly designed questions may fail completely on real user queries, and vice versa. Here, 31 questions span four types, each targeting a different failure mode:

  • Factual questions ask for a specific piece of information that exists near-verbatim in one paper. Tests whether the right chunk was retrieved and wasn’t cut in half.
  • Conceptual questions require synthesising an argument from context, not locating a fact. Tests whether the model can reason over retrieved material.
  • Multi-hop questions span evidence across multiple papers. No single chunk answers this; the retriever has to surface fragments from different documents and the model has to connect them.
  • Unanswerable questions can’t be answered by the corpus. Tests hallucination resistance. A good system says it doesn’t know.

Architecture

1. Chunking

Before any retrieval can happen, documents need to be split into pieces small enough to be useful as context. How you make that cut is a more consequential decision than it first appears — it determines what information the retriever has to work with, and no downstream component can recover what a bad chunking strategy destroys.

  • Fixed-size splits on word count with a 50% sliding overlap window — the overlap ensures context isn’t abruptly lost at every boundary. Simple, fast, ignorant of meaning. The “good enough for most production systems” baseline — 2,825 chunks from 69 papers.
  • Sentence-boundary assembles chunks sentence by sentence until approaching a token limit, then starts a new chunk at a clean boundary. Fully deterministic — no model involved, just grammar. Respects grammatical units; ignores topic shifts — 1,470 chunks from 69 papers.
  • Semantic uses embedding similarity between adjacent sentences to detect topic shifts, cutting there. Conceptually appealing — the text signals its own structure. Computationally expensive at index time: every sentence gets embedded individually to compute pairwise similarity, rather than just the final chunks. And, as the results show, dangerous to misconfigure — 33,179 chunks from the same 69 papers.
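As an illustration, the fixed-size strategy with its 50% sliding overlap fits in a few lines. This is a hypothetical sketch, not the repo's implementation; the window size is a parameter:

```python
def fixed_size_chunks(text: str, chunk_words: int = 200) -> list[str]:
    """Split on word count with a 50% sliding overlap window: each
    chunk starts half a window after the previous one, so text near
    a boundary also appears in the neighbouring chunk."""
    words = text.split()
    step = chunk_words // 2  # 50% overlap between consecutive chunks
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append(" ".join(window))
        if start + chunk_words >= len(words):
            break  # this window reached the end of the document
    return chunks

text = " ".join(f"word{i}" for i in range(400))
chunks = fixed_size_chunks(text, chunk_words=200)
print(len(chunks))  # 3
```

Because each chunk starts half a window later, the second half of one chunk is the first half of the next, which is what prevents context from being abruptly lost at a boundary.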

2. Embedding

Each chunk needs to be represented as a vector so the retriever can compare it against a query mathematically. The embedding model doing this conversion determines how well meaning is preserved — and the three models here represent meaningfully different points on the quality/cost/speed spectrum: a lightweight local model, a large local model, and a cloud API.

  • MiniLM (all-MiniLM-L6-v2) — 80 MB, 384-dimensional vectors, runs on laptop CPU, no API key. The cheap baseline.
  • BGE-M3 (BAAI/bge-m3) — 2.3 GB, 1,024-dimensional vectors, 100+ languages, 8,192-token context window. The heavyweight local option.
  • OpenAI (text-embedding-3-small) — 1,536-dimensional vectors via API. No local infrastructure, no model to manage, costs per token. The cloud option.

3. Retrieval

Given a query, the retrieval strategy determines which chunks get surfaced as context. The four strategies here are deliberately stacked — each adds a layer of sophistication over the previous one, at increasing computational cost. They don’t represent every approach available, but they cover enough ground to make the trade-offs visible and the comparisons meaningful.

  • Semantic search uses cosine similarity between query and chunk vectors. Finds chunks that are close in meaning to the query, even if they use different words.
  • BM25 is keyword-based retrieval that scores chunks by exact term matches, weighted by how rare those terms are across the corpus. It complements semantic search well on technical content where precise terminology matters.
  • Hybrid (RRF) combines semantic and BM25 rankings into a single list. Each chunk is scored by its position in each individual ranking, then the scores are summed. Gets the best of both: meaning-based matching and exact keyword matching in one result.
  • Reranker takes the hybrid candidates and re-scores them with a cross-encoder model. Unlike an embedder, a cross-encoder processes the query and chunk together rather than independently, giving richer relevance signal at the cost of higher latency.

4. Evaluation

RAGAS is an open-source framework for evaluating RAG pipelines. Rather than measuring retrieval in isolation, it scores the full pipeline output — question, generated answer, retrieved context, and reference answer — to give a picture of how well each component is doing its job. Each question is evaluated across four metrics:

  • Faithfulness measures whether the claims in the generated answer trace back to the retrieved context, or if the model is adding things the chunks don’t support.
  • Context Precision measures how many of the retrieved chunks were actually relevant to the question.
  • Context Recall measures whether the retrieved context covers everything needed to answer the question fully.
  • Answer Relevance measures whether the answer is on-topic relative to the question. Measured as cosine similarity between the question and answer embeddings — deterministic, no LLM call required.

For unanswerable questions a fifth check runs separately: a judge evaluates whether the system correctly expressed uncertainty rather than fabricating an answer.

The LLM judge is llama-3.3-70b-versatile via Groq, run at temperature=0 across all 36 configurations — roughly $15 in API calls for a full run. The judge doesn’t need to be perfect; it needs to be consistent so comparisons across configurations are valid.

The full benchmark run — 3 chunk sets, 9 embedding sets, 36 configurations each evaluated across 31 questions — produced 1,116 scored results and completed in approximately 4.5 hours on an M3 Pro.
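The run size is a straightforward product of the numbers above:

```python
configurations = 3 * 3 * 4       # chunkers × embedders × retrieval strategies
questions = 31
print(configurations * questions)  # 1116 scored results
```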

Results

The full benchmark results — every configuration ranked across all four RAGAS metrics — are available in the complete results report.

Retrieval quality vs. latency

[Scatter plot: overall score (0.50–0.85) against per-query retrieval latency (0–700 ms), one dot per configuration, all 36 configurations shown.]

Chunking strategy was the dominant variable

Across these 36 configurations, chunking strategy had approximately 4× more impact on final scores than either the embedding model or the retrieval strategy. The gap is driven by one decisive finding: the semantic chunker was the definitively wrong choice for this corpus, and that single wrong call is what separates the top-performing pipelines from the bottom. Fixed-size and sentence-boundary performed comparably — but pick the wrong chunker and nothing downstream rescues you.

chunking   0.185
embedding  0.054
retrieval  0.044

Score spread (best-to-worst) per variable across 36 configurations.

The failure mode is specific. The semantic chunker produced 33,179 fragments from 69 papers — 12× more than fixed-size — flooding the retriever with noise. Context recall collapsed to 0.240, against 0.725 for fixed-size and 0.713 for sentence-boundary. Faithfulness remained high (0.862), which makes the problem precise: the system was faithful to what it retrieved, but what it retrieved covered almost none of what the question required. You can cut at all the right places and still drown the retriever.

Fixed-size and sentence-boundary chunking are statistically nearly indistinguishable at the aggregate level. Sentence edges ahead on faithfulness (0.865 vs 0.832) — the model generates more accurate answers from it — while fixed-size holds a slight precision advantage. Both are valid starting points.

Better inputs beat post-retrieval reranking

The reranker leads the aggregate — but the top 5 individual configurations are all semantic or hybrid, with the reranker not appearing until rank 6. Both facts are true simultaneously. The reranker benefits weak pipelines more than strong ones. When the chunker and embedder are poor, it does meaningful rescue work. When the inputs are already good, it has less to fix and, in this case, adds 400–600 ms for marginal gain. The configurations that won didn’t need the reranker — because the variable that actually moved the needle (chunking) was already right.

BGE-M3 matches OpenAI — no API needed

                    OpenAI   BGE-M3   MiniLM
context precision   0.675    0.672    0.556
context recall      0.580    0.571    0.526
faithfulness        0.864    0.849    0.845
answer relevance    0.809    0.758    0.783

RAGAS scores per embedding model, averaged across all chunking and retrieval configurations.

On the metrics that directly measure retrieval quality, BGE-M3 and OpenAI are effectively identical: context precision 0.672 vs 0.675, context recall 0.571 vs 0.580. The overall score gap (0.713 vs 0.732) is mostly driven by answer relevance — which partly reflects the embedder used during evaluation, not just retrieval quality. For practical purposes: BGE-M3 delivers OpenAI-equivalent retrieval at zero API cost.

MiniLM is where the real gap appears. Context precision drops 0.117 — from 0.672 to 0.556. Faithfulness barely moves (0.845 vs 0.849). When infrastructure is the constraint, MiniLM is a reasonable trade-off. The precision drop is the price.

Optimal configurations

Every deployment involves trade-offs between retrieval quality, embedding cost, latency, and the infrastructure available. For this corpus, the two top-scoring configurations are effectively tied — and the one that costs nothing to embed is one of them.

  1. Sentence chunker + OpenAI embedder + Semantic retrieval (score: 0.818, latency: 270ms) — best overall score. Strong chunking and a capable embedder made post-retrieval reranking unnecessary.
  2. Fixed-size chunker + BGE-M3 embedder + Semantic retrieval (score: 0.816, latency: 216ms) — 0.002 points behind, zero embedding API cost. If infrastructure is more constrained, Sentence chunker + MiniLM embedder + BM25 retrieval runs on laptop CPU with no GPU, scores 0.788, and retrieves in 8ms.

Why these results don’t transfer directly to your data

You might expect BM25’s exact-match retrieval to dominate on a corpus of academic papers saturated with model names and acronyms — and it does matter (hybrid configurations consistently rank high). But the winning configuration used pure semantic retrieval, because on this corpus, a strong embedder paired with clean chunk boundaries captured enough signal without keyword matching. Consider a query for “ColBERT” — a retrieval model that appears throughout this corpus. That’s a proper noun, not a concept with semantic neighbors. BM25 finds it by matching characters exactly. Semantic search relies on the embedder having learned a useful representation of that term — and with OpenAI or BGE-M3, it had. Change the embedder, change the corpus, and that breaks.

That’s exactly why these results don’t transfer. On customer support tickets — conversational language, fewer proper nouns, users describing problems rather than naming them — semantic search’s intent-matching pulls ahead: the query “my payment won’t go through” should surface chunks about failed transactions even if the exact words don’t appear. On legal documents — rigid clause-level structure, where cutting a clause from its surrounding provisions destroys meaning — the chunking strategy becomes the dominant variable. On internal product documentation — mixed structure, high jargon density — a different chunker may win entirely.

Every corpus has a configuration that fits it best. Good intuition gets you to a reasonable starting point — an understanding that technically dense text probably rewards exact-match retrieval, or that clause-heavy documents demand careful chunking. But intuition alone can’t tell you how much each decision matters, or whether your assumptions held. That’s what measurement is for. Measure on your data, with your questions, against your quality bar — and within your real constraints around cost, latency, and infrastructure. Point this benchmark at your corpus, build a definitive question set that reflects exactly how your users query the system, and let the numbers decide.

Your data deserves its own benchmark

If you’re building a RAG system — or you’ve already shipped one and you’re not sure the configuration is right — I’ll run this benchmark against your corpus and deliver the same structured report: every configuration ranked, every metric broken down, a clear recommendation for your specific data.

The first run is complimentary. No commitment, no invoice. A 30-minute call to understand your documents and use case, a benchmark run, and a report that tells you which configuration is optimal for your corpus, your users, and your constraints.


If you want to know which configuration is optimal for your data — book a call or write to me at [email protected].

Or you can run your own benchmark by cloning the open-source repo at github.com/jonasvanbuel/rag-retrieval-bench.