DEV Community: Dogukan Karademir

My RAG Benchmark is lying to me

Dogukan Karademir — Sun, 28 Jun 2026 21:45:58 +0000

I built a benchmark to find the best local LLM for my RAG system. After some runs, I'm less confident in the results than when I started — and I think that's the more useful story.

Here's the specific problem that broke my assumptions.

The Setup

Kenning is a Spring Boot RAG backend: Spring AI, pgvector, Ollama, Apache Tika for PDF parsing. You upload a document, ask questions, get answers grounded only in that document.

I built a benchmark to test six local models: llama3.1:8b, llama3.2:3b, qwen2.5:7b, gemma2:9b, mistral:7b, phi4:14b.

Four question categories, judged blind by qwen2.5:14b:

IN_CONTEXT — answer is in the document
OUT_OF_CONTEXT — answer isn't; model must refuse
PARTIAL_CONTEXT — partial information; model must say what it found and what's missing
MULTI_CHUNK — answer spans multiple sections

Each question is worth up to 25 points, so the maximum is 875 points per model at 35 questions (500 at the original 20).

First Problem: The Ceiling Effect

First run, 20 questions on Attention Is All You Need (the Transformer paper):

Model	Score
`qwen2.5:7b`	481/500 — 96.2%
`llama3.1:8b`	475/500 — 95.0%
`phi4:14b`	474/500 — 94.8%
`gemma2:9b`	473/500 — 94.6%
`llama3.2:3b`	466/500 — 93.2%
`mistral:7b`	463/500 — 92.6%

IN_CONTEXT category: every single model averaged 25/25. Perfect score.

This is what a useless benchmark looks like. Questions like "How many attention heads does the Transformer use?" are trivially easy if the retrieved chunk contains h = 8. I wasn't measuring model capability — I was measuring whether models can read.

I added 15 harder questions and rewrote the chunking.

The Rewrite That Changed Everything

The original code used TokenTextSplitter with default settings. I changed it to 200-token chunks with 100-token overlap between adjacent chunks:

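import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

// Split into chunks of roughly 200 tokens, keeping separators.
TokenTextSplitter splitter = TokenTextSplitter.builder()
    .withChunkSize(200)
    .withKeepSeparator(true)
    .build();

List<Document> chunks = splitter.apply(documents);
// overlapAppender is a helper not shown in this post; 100 is the overlap in tokens.
List<Document> overlapped = overlapAppender.addOverlap(chunks, 100);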

The idea: information lost at chunk boundaries (a sentence split across two chunks is fully represented in neither) would be preserved by overlapping.
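To make the overlap idea concrete, here is a minimal sketch that works on plain token lists. `ChunkOverlap` and its `addOverlap` method are hypothetical names for illustration; this is not the `overlapAppender` used in Kenning and not part of Spring AI. It assumes the simplest possible scheme: every chunk after the first is prefixed with the last N tokens of the previous original chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the post's overlapAppender.
final class ChunkOverlap {

    // Prefixes every chunk after the first with the last `overlap` tokens
    // of the previous original chunk, so text near a boundary appears in both.
    static List<List<String>> addOverlap(List<List<String>> chunks, int overlap) {
        if (overlap < 0) {
            throw new IllegalArgumentException("overlap must be >= 0");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            List<String> merged = new ArrayList<>();
            if (i > 0) {
                List<String> prev = chunks.get(i - 1);
                int from = Math.max(0, prev.size() - overlap);
                merged.addAll(prev.subList(from, prev.size()));
            }
            merged.addAll(chunks.get(i));
            result.add(merged);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = List.of(
                List.of("the", "model", "uses"),
                List.of("h", "=", "8", "heads"));
        // The second chunk now starts with the tail of the first one.
        System.out.println(addOverlap(chunks, 2));
    }
}
```

In this sketch, prepending makes chunks longer: a 200-token chunk with 100 tokens of overlap grows to as much as 300 tokens, which changes what the embedding model sees as well as what the chat model sees.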

New results on 35 questions, same document:

Model	Score
`phi4:14b`	839/875 — 95.9%
`qwen2.5:7b`	822/875 — 93.9%
`gemma2:9b`	818/875 — 93.5%
`llama3.1:8b`	815/875 — 93.1%
`mistral:7b`	780/875 — 89.1%
`llama3.2:3b`	771/875 — 88.1%

The ranking changed. phi4:14b, which was 3rd before, now leads. The spread grew from 3.6 to 7.8 percentage points.

Here's the Problem

I changed two things at the same time: the chunking strategy and the question difficulty. I can't isolate which change drove the ranking shift.

And I can prove the chunking changed what models actually saw.

Question q01: "How many attention heads does the base Transformer use?" — categorized as IN_CONTEXT because the answer (h = 8) is in the paper.

Original chunking: retrieved a chunk containing h = 8. Model answered correctly.

New chunking: retrieved chunks about multi-head attention applications. The specific h = 8 chunk was no longer in the top 5 by similarity score. phi4:14b correctly said: "The provided context does not specify the number of attention heads."

Judge score: 25/25. The model isn't lying — it answered correctly given what it received.

But the system failed the user. That question is answerable. The document has the answer. The retrieval missed it.

So here's what I was actually measuring: model behavior given what my chunking strategy retrieved — not model capability. The "model benchmark" was really a "chunking configuration benchmark." I just didn't realize it until the results changed.
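One way to keep these two failure modes apart is to label each question's outcome by what retrieval did before looking at the answer. The sketch below is an assumption about how that could look, not something the benchmark currently does: `RagDiagnosis`, `Outcome`, and the idea of annotating each question with a "gold" chunk ID are all hypothetical, and `answerCorrect` means the answer matched the document's actual content, not that the judge was satisfied.

```java
import java.util.List;

// Hypothetical per-question diagnosis; names and gold-chunk labels are assumptions.
final class RagDiagnosis {

    enum Outcome {
        ANSWERED,                  // evidence retrieved, answer correct
        MODEL_MISS,                // evidence retrieved, answer wrong or refused
        RETRIEVAL_MISS,            // evidence never reached the prompt
        ANSWERED_WITHOUT_EVIDENCE  // correct answer without the evidence in context
    }

    static Outcome classify(String goldChunkId, List<String> retrievedIds, boolean answerCorrect) {
        boolean retrieved = retrievedIds.contains(goldChunkId);
        if (retrieved) {
            return answerCorrect ? Outcome.ANSWERED : Outcome.MODEL_MISS;
        }
        // Without the gold chunk in context, a refusal is good model behavior
        // but still a failure of the system as a whole.
        return answerCorrect ? Outcome.ANSWERED_WITHOUT_EVIDENCE : Outcome.RETRIEVAL_MISS;
    }
}
```

Under this labeling, q01 with the new chunking would count as a `RETRIEVAL_MISS` rather than a 25/25, which is closer to what the user experienced.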

The Second Document Made It Worse

I added a second document — NIST SP 800-63B, a US federal authentication standard. ~70 pages of SHALL/SHOULD requirements, distributed across sections and tables. Nothing like an academic paper.

Same question structure, same judge, same chunking.

Model	Transformer paper	NIST	Drop
`phi4:14b`	95.9%	90.9%	−5.0 pp
`mistral:7b`	89.1%	88.6%	−0.5 pp
`qwen2.5:7b`	93.9%	87.8%	−6.1 pp
`gemma2:9b`	93.5%	83.4%	−10.1 pp
`llama3.1:8b`	93.1%	83.2%	−9.9 pp
`llama3.2:3b`	88.1%	79.3%	−8.8 pp

mistral:7b went from 5th to 2nd. gemma2:9b dropped 10 percentage points and posted the worst category score in the entire dataset (17.1/25 average in PARTIAL_CONTEXT on NIST).

Now I have two explanations and no way to distinguish them:

Explanation A: These are real model differences. Some models handle technical regulatory text better than dense academic prose. mistral is more stable across document types; gemma2 is more brittle.

Explanation B: Chunking performance is document-dependent, and published benchmarks suggest there is no single "best" strategy for everything.

Recent research highlights exactly how much the structure of a document dictates the winning pipeline. For instance, a February 2026 benchmark by Vecta evaluating 7 chunking strategies across 50 academic papers found that standard recursive 512-token splitting took 1st place with 69% accuracy. In that specific domain, semantic chunking tanked at 54% because it over-fragmented the text, producing tiny snippets averaging just 43 tokens that stripped away crucial context. For a standard academic paper, fixed-size or recursive chunking is often perfectly fine or even superior.

Conversely, when dealing with complex, non-linear layouts, fixed token limits completely collapse. A separate study evaluating structured/clinical documents found that adaptive, theme-boundary chunking reached 87% accuracy, while fixed-size baselines plummeted to a dismal 13%.

If Explanation B is right, it recontextualizes my results. My naive 200-token split with 100-token overlap may simply have suited the uniform, dense layout of the Transformer paper. Applied to a 70-page regulatory standard like NIST, where a single requirement might be scattered across cross-referenced sections and multi-row tables, it would cut the text at arbitrary points. On that reading, gemma2 is more sensitive to fragmented context, while mistral copes better with poorly sliced chunks.

The takeaway isn't that semantic chunking is a silver bullet—it's that a one-size-fits-all chunking pipeline is fundamentally broken. The experiment that would actually prove this — running the same models with multiple chunking configurations (fixed vs. semantic vs. structure-aware) on the exact same document — is the one I didn't do.

What I'd Actually Need to Know Which Model to Pick

Multiple chunking strategies per document type, held constant while varying models
Retrieval quality metrics separate from answer quality (MRR, Recall@5 — did the right chunk even make it into the top 5?)
Multiple judge models, not just one (my judge could have systematic biases I can't detect)
Real user questions from actual sessions, not questions I wrote after reading the document myself
Multiple runs per model to account for non-determinism

Without these, the ranking I have is a ranking of "this specific pipeline configuration" not "these models."
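The retrieval metrics from the list are cheap to compute once each question is labeled with the chunk that contains its answer. Here is a minimal sketch, assuming exactly one gold chunk per question and chunk IDs as strings; `RetrievalMetrics` and its method names are made up for this example, and MULTI_CHUNK questions would need a set of gold IDs instead.

```java
import java.util.List;

// Illustrative metrics for a single gold chunk per question.
final class RetrievalMetrics {

    // 1.0 if the gold chunk appears among the first k ranked results, else 0.0.
    static double recallAtK(List<String> rankedIds, String goldId, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int limit = Math.min(k, rankedIds.size());
        return rankedIds.subList(0, limit).contains(goldId) ? 1.0 : 0.0;
    }

    // 1 / rank of the gold chunk (rank starts at 1), or 0.0 if it was not retrieved.
    static double reciprocalRank(List<String> rankedIds, String goldId) {
        int index = rankedIds.indexOf(goldId);
        return index < 0 ? 0.0 : 1.0 / (index + 1);
    }

    // Mean reciprocal rank over all questions.
    static double meanReciprocalRank(List<List<String>> rankings, List<String> goldIds) {
        if (rankings.size() != goldIds.size()) {
            throw new IllegalArgumentException("one gold id per ranking required");
        }
        if (rankings.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i < rankings.size(); i++) {
            sum += reciprocalRank(rankings.get(i), goldIds.get(i));
        }
        return sum / rankings.size();
    }
}
```

Averaging `recallAtK(..., 5)` over the IN_CONTEXT questions, before any model is called, would show directly how often the top 5 even contains the answer.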

The Honest Takeaway

I didn't build a production RAG app. I built an understanding of how much is hidden under "just do RAG."

The thing I expected to matter most — model choice — turned out to be inseparable from chunking strategy, retrieval configuration, and document structure. Changing chunk size doesn't change which model is capable of what. It changes what the model sees. And what the model sees determines everything.

If I had to tell someone one thing before they start benchmarking models for RAG: measure your retrieval quality first. If the right chunks aren't being retrieved, you're not benchmarking models — you're benchmarking whether your similarity search surfaces the right context. Those are very different problems.

Four Bugs Stood Between Me and "Sign in with Google"

Dogukan Karademir — Fri, 26 Jun 2026 20:18:47 +0000

Summary: I had a rough time adding Google login to my app, Kenning. It took me a while to figure out four issues that were causing problems. These issues were not related to each other and were not covered in any tutorial I read.

This is my second post about building Kenning, and this phase is about OAuth2 login. I thought it would be easy. It was not. I had to deal with four confusing bugs.

Bug 1: the client ID with a hidden character

Google did not accept my login. It gave me an error message saying "Error 401: invalid_client". I checked my client ID in the .env file. It looked correct. I had copied it from the Cloud Console.

When I looked at the actual request that was being sent, I saw the problem. The client ID had a hidden character at the end. This character was a carriage return, represented by %0D in URL encoding. My .env file had Windows line endings (CRLF), and that extra character was being included in the value.

The fix was switching my editor's line ending setting from CRLF to LF and re-saving the file. (You can also strip it from an existing file with sed -i 's/\r$//' .env, but the actual cause was the editor's line-ending mode, not a one-off corrupted file.)
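A quick way to check the actual value rather than how it looks is to print it URL-encoded, which is roughly how it ends up in the outgoing request. This is a small sketch using only the JDK; `EnvValueCheck` and the `GOOGLE_CLIENT_ID` variable name are assumptions for illustration, not names from Kenning.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative check; the class and method names are made up for this example.
final class EnvValueCheck {

    // True if the value contains any control or whitespace character,
    // such as the trailing \r left behind by a CRLF line ending.
    static boolean hasHiddenCharacters(String value) {
        return value.chars().anyMatch(c -> Character.isISOControl(c) || Character.isWhitespace(c));
    }

    // URL-encodes the value the way it would appear in a query string,
    // which turns an invisible carriage return into a visible %0D.
    static String asSentInUrl(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // GOOGLE_CLIENT_ID is an assumed variable name; use whatever your .env defines.
        String clientId = System.getenv("GOOGLE_CLIENT_ID");
        if (clientId == null) {
            System.out.println("GOOGLE_CLIENT_ID is not set");
            return;
        }
        System.out.println("encoded: " + asSentInUrl(clientId));
        System.out.println("hidden characters: " + hasHiddenCharacters(clientId));
    }
}
```

A value that ends in `%0D` when encoded is carrying the same carriage return that broke the login here.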

What I learned from this is that just because something looks correct does not mean it is correct. I should have checked the actual value instead of just looking at it.

Bug 2: the user service Spring never called

After I fixed that bug I was able to complete the login process. But I noticed that no user was being added to my database. I had written a custom user-loading service, and it was not being called.

I looked into the auth object that Spring had built after login and saw that it had an authority called OIDC_USER. This told me that Spring was routing the login through OidcUserService, the class that handles OpenID Connect logins. My custom service was extending the wrong base class (DefaultOAuth2UserService instead of OidcUserService), so it was simply never invoked, even though it was wired in correctly.

To fix this I changed my custom service to extend OidcUserService instead. This fixed the problem.

Bug 3: the CSRF cookie that needs to be asked for

After fixing that, login worked end to end. When I tried to upload a file, I got a 403 Forbidden error. I had set up CSRF protection on purpose, so this made sense in principle — except the cookie it depends on, XSRF-TOKEN, was never being written in the first place.

It turns out Spring Security 6+ defers writing that cookie until something in the request actually reads the token. A GET request that never touches it never triggers the write.

To fix this I wrote a filter that forces the token to be read on every request, which triggers the cookie write.

Bug 4: two cookies, same name, different values

I spent a lot of time on this one. I kept copying the X-XSRF-TOKEN value into a manual request and it kept getting rejected, even right after confirming in DevTools that the cookie existed.

Looking closer, DevTools was showing two separate XSRF-TOKEN entries with the same name but different values — one with an empty partition key, and one partitioned under resource://devtools. I had been copying the DevTools-partitioned one, which isn't the value the browser actually sends on a real request. Once I copied the other one — the unpartitioned cookie — it worked immediately.

A thing I haven't solved yet

Before any of this, I tried using the spring-dotenv library to load my .env file automatically instead of exporting variables by hand every time. After adding it, login stopped working, and I genuinely don't know why — I never confirmed whether it was even loading the file, or something else entirely. I removed the dependency and went back to exporting variables manually. If anyone's gotten this working alongside Spring Security + OAuth2, I'd like to hear how.

What actually mattered

None of these bugs were caused by Spring or OAuth2 being badly designed. Each one had a clear explanation once I found it. What would have saved me time is checking the actual outgoing request the moment something failed for a reason that didn't make sense, instead of trusting echo output or DevTools at face value.

And the frontend?

Comparatively quiet, which I'm counting as a win. I used Angular and PrimeNG to build the document list and chat screen. Once I had the right cookie and header names configured, the whole CSRF back-and-forth from Bug 3 just worked automatically on every request, because the frontend handles this pattern natively.

Next up: a reader on the last post called the chunk-dilution theory exactly right and suggested keeping chunks to one topic each. So up next is testing that properly — comparing chat models, embedding models, and chunking strategies head to head, local and cloud, on quality, speed, and cost.

Building Kenning in public. Corrections welcome — especially on the spring-dotenv mystery above.

I Built a RAG App, Then Asked It What Car I Like. It Didn't Know.

Dogukan Karademir — Wed, 24 Jun 2026 20:33:23 +0000

TL;DR: Phase 1 of a from-scratch RAG app — Spring AI, pgvector, local Ollama — ends with a working pipeline and two failures that look identical from the outside but have nothing to do with each other. One was a chunking bug. The other was a 3B model running out of brain. Here's how I told them apart.

Why build this

I'm finishing my apprenticeship as an IT specialist in application development and wanted a portfolio project that's more than another CRUD app. Kenning is a document-chat tool: upload a file, ask questions about it, get answers with sources attached. Standard RAG (Retrieval-Augmented Generation), but built end to end by hand instead of stitched together from a tutorial.

Phase 0 was infrastructure: Docker Compose with pgvector/pgvector:pg16 and ollama/ollama, a Spring Boot scaffold, an Angular scaffold. Phase 1's job was narrower and more important: prove the actual RAG loop works — upload one document, ask one question, get a real answer with the source attached. No login, no UI polish, no multi-document handling. Just: does this architecture actually do the thing.

The stack

Java 21, Spring Boot 4.1.0, Spring AI
Angular (frontend, mostly untouched in this phase)
PostgreSQL + pgvector as the vector store
Ollama, running locally: nomic-embed-text for embeddings and llama3.2:3b for chat

The pipeline: upload → Apache Tika extracts text → TokenTextSplitter chunks it → Spring AI's Ollama embedding client turns each chunk into a vector → pgvector stores it → a question gets embedded the same way → similarity search pulls the closest chunks → those get stuffed into a ChatClient call alongside the question → the model answers, and I attach the source chunks it used.

That's the theory. None of it survived contact with reality without a fight.

Mistake #1: naming my own entity `Document`

Spring AI ships its own Document class for representing a chunk of text plus metadata. I also wanted an entity called Document for "a file the user uploaded." Same name, two completely different things, one annoying import ambiguity every time autocomplete guessed wrong. Renamed mine to SourceDocument and moved on.

Mistake #2: two Ollamas, one port

I ran qwen2.5-coder:14b natively on Windows for my coding assistant to test out local models and forgot about it. Docker Compose also wants port 11434 for the Ollama container that's supposed to serve Kenning. Two processes, one port, predictable result. The fix was trivial — stop the native Windows process before starting the container — but the error message gave zero hint that this was the cause. Worth remembering if you run a local AI coding tool and an Ollama-based app on the same machine.
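Since the error message gives no hint, a pre-flight check can at least confirm whether something is already listening on the port before the container starts. Here is a minimal JDK-only sketch; `PortCheck` is a hypothetical name, and it only detects that a listener exists, not which process owns it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Illustrative pre-flight check; the class name is made up for this example.
final class PortCheck {

    // True if a TCP connection to host:port succeeds within the timeout,
    // meaning some process is already listening there.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused or timed out: nothing reachable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // 11434 is the Ollama port mentioned above.
        if (isListening("127.0.0.1", 11434, 500)) {
            System.out.println("Port 11434 is already taken; is a native Ollama still running?");
        } else {
            System.out.println("Port 11434 looks free.");
        }
    }
}
```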

Mistake #3 (not really a mistake): the GPU that does nothing

I have an AMD RX 6700 XT with 12 GB of VRAM sitting in this machine, doing nothing for local inference. Rather than assume, I checked:

docker exec -it ollama ollama ps

NAME                       ID              SIZE      PROCESSOR    CONTEXT    UNTIL
llama3.2:3b                a80c4f17acd5    2.6 GB    100% CPU     4096       4 minutes from now
nomic-embed-text:latest    0a109f422b47    376 MB    100% CPU     2048       4 minutes from now

100% CPU, confirmed. As far as I can tell, the issue is that ROCm needs /dev/kfd exposed to the container, and WSL2 doesn't expose it — so GPU acceleration for Ollama running inside Docker on WSL2 seems to be a dead end with this setup.

One thing I haven't tried yet: running Ollama natively on Windows instead of inside WSL2/Docker. Native Windows Ollama has its own path to the GPU drivers that doesn't go through WSL2's passthrough limitations, so that's probably the more realistic way to actually use the 6700 XT — I just haven't switched Kenning over to test it. For now: CPU-only for local dev, on a machine where RAM is already sitting at 96% usage before the model even loads.

For the eventual public demo, the plan is to be upfront about it — a small local model on a GPU-less VPS will be slow and occasionally wrong, and I'll say so, with an option to plug in your own API key if you want better answers.

The real test: asking it about myself

Once the loop worked mechanically, I needed a document with facts I could verify myself. So I wrote a short bio — what Kenning is, what stack it uses, what I'm into outside of code, including one line: "He enjoys mechanical keyboards and is interested in BMW cars."

Question 1: "What embedding model does this project use?"

The answer is one unambiguous sentence in the document: nomic-embed-text for embeddings, llama3.2:3b for chat. The chunk had actually been retrieved — the full document came back in sources — so this wasn't a missing-context problem. Here's what the model actually answered:

"I don't know which embedding model this project specifically uses. The text mentions that Ollama provides two models: nomic-embed-text for generating embeddings, but it does not specify which one is used by Kenning Project."

Read that twice. The model opens by saying it doesn't know, then in the very next sentence names the correct embedding model — nomic-embed-text for generating embeddings — and then closes by claiming the project doesn't specify which one it uses. It has the right answer sitting in its own explanation and still doesn't commit to it.

My read on it: that original sentence names two models in one clause, each bound to a different job ("X for embeddings, Y for chat"), and a 3B model running on CPU can apparently retrieve the right name but not lock it in as a confident final answer. I haven't tested this side by side against a bigger model yet, but based on what I'm seeing, I'd assume something like an 8B+ model would commit to the answer instead of second-guessing itself — that's the next experiment, not something I've actually confirmed.

Question 2: "What car brand do I like?"

Zero source matches. No chunk even cleared the similarity threshold (0.5) to make it into the prompt. Lowering the threshold to 0.3 surfaced it — at an actual similarity score of 0.46, just under the original cutoff.

This looks like the same kind of failure as Question 1, but I don't think it is. My test document was short enough (2,063 characters) to stay as a single chunk. That one chunk covers Spring AI, Tika, async processing, OAuth2 plans — and, almost as an afterthought in the last sentence, BMW. My working theory: embedding a chunk that mixed produces a vector that's diluted across five unrelated topics, so a focused query like "what car brand" scores lower against it than it would against a clean, topic-specific chunk. I haven't actually re-run it with smaller, topic-coherent chunks to confirm yet — but that's the fix I'd try next, and my guess is it would clear the threshold comfortably.
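The dilution theory can be illustrated with a toy model. The sketch below is not how nomic-embed-text works: it assumes, purely for illustration, an "embedding" with one axis per topic, so a chunk that covers five topics equally points in a direction between all five axes. `ChunkDilution` is a made-up name.

```java
// Toy illustration of chunk dilution; real embeddings are not one axis per topic.
final class ChunkDilution {

    // Standard cosine similarity between two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        double[] query = {1, 0, 0, 0, 0};        // asks about one topic only
        double[] focusedChunk = {1, 0, 0, 0, 0}; // chunk about that same topic
        double[] mixedChunk = {1, 1, 1, 1, 1};   // chunk mixing five topics equally
        double threshold = 0.5;                  // the similarity cutoff from above

        double focused = cosine(query, focusedChunk);
        double mixed = cosine(query, mixedChunk);
        System.out.println("focused chunk passes: " + (focused >= threshold));
        System.out.println("mixed chunk passes:   " + (mixed >= threshold));
    }
}
```

In this toy setup the focused chunk scores 1.0 and the mixed chunk scores 1/√5 ≈ 0.447, so only the focused one clears a 0.5 threshold. The numbers come from the toy's assumptions and say nothing about the real 0.46; the point is only the direction of the effect.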

Two questions, two failures that looked identical from where I was sitting ("the bot doesn't know basic facts about me") but, as far as I can tell, have different causes underneath. One I'd expect a bigger model to fix. The other I'd expect better chunking to fix. Telling them apart — even just as working theories — was the most useful thing to come out of this phase.

Where Phase 1 lands

A document goes in, a question goes in, an answer comes back with its source chunk attached — all on a fully local stack, no API costs, end to end. It's slow, it's occasionally wrong in the specific way small models are wrong, and the chunking is still naive. But the architecture holds up, and proving that was the actual point of this phase.

If you've actually debugged RAG pipelines at this level, I'd genuinely like to know whether the two working theories above hold up — the attribute-binding explanation for the embedding-model question, and the chunk-dilution explanation for the BMW question. I'm reasoning from what I observed here, not from having traced either one to the bottom, so if you've got more experience with this and either guess is wrong, half right, or missing something obvious, I'd love to hear it.

Phase 2 is auth (Google OAuth2), a real upload UI, and multi-document support per user. I'll write that one up once it's running.
