Trust AI, But Verify
Domain knowledge is the only way to catch AI mistakes
I wanted to understand buffer pool replacement policies better. So I asked Claude to run an experiment: implement LRU, Clock, LFU, and ARC, benchmark them under different workloads, tell me which one’s best.
Claude wrote 2,000 lines of C++. Compiled it. Ran the benchmarks. Gave me results.
“Simple LRU performs just as well as sophisticated ARC.”
I almost bought it.
The Results Looked Credible
Zipfian Workload:

- LRU: 80.7% hit rate
- ARC: 80.4% hit rate

Buffer Size Scaling:

- 10% buffer: LRU 66.1%, ARC 66.0%
- 20% buffer: LRU 75.3%, ARC 75.2%
- 30% buffer: LRU 80.4%, ARC 80.1%
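Quick aside on “Zipfian”: it just means a skewed access pattern where a small set of hot pages gets most of the traffic. Roughly something like this, a sketch with made-up parameters rather than Claude’s actual generator:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of a Zipfian access trace: the page ranked r is requested with
// probability proportional to 1 / r^s, so a handful of pages dominate.
std::vector<int> zipfian_trace(std::size_t num_pages, std::size_t num_requests,
                               double s = 1.0, unsigned seed = 42) {
    std::vector<double> weights(num_pages);
    for (std::size_t r = 0; r < num_pages; ++r)
        weights[r] = 1.0 / std::pow(static_cast<double>(r + 1), s);

    std::mt19937 gen(seed);
    std::discrete_distribution<int> pick(weights.begin(), weights.end());

    std::vector<int> trace(num_requests);
    for (int& page_id : trace) page_id = pick(gen);  // hot pages repeat often
    return trace;
}
```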
ARC consistently performed worse than LRU. Not by much, just fractions of a percentage point, but consistently.
The analysis made sense too:
- “ARC’s complexity adds overhead”
- “Simple algorithms often win”
- “Don’t over-engineer”
Classic engineering wisdom. I’ve heard this before. I’ve said this before.
Here’s the thing: something felt off.
ARC is supposed to excel when memory is tight. That’s the whole point of the algorithm: it adapts to workload patterns, using ghost lists to learn from its own eviction mistakes.
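If you haven’t met ARC before, here’s the rough shape of that mechanism. This is a heavily simplified sketch with my own names, and a fixed ±1 adjustment where real ARC scales the step by the relative sizes of the ghost lists; it’s not the code Claude wrote:

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <unordered_set>

// Simplified sketch of ARC's feedback loop.
// T1/T2 hold resident pages (seen once vs. seen more than once); B1/B2 are
// "ghost" lists that remember only the IDs of pages recently evicted from
// T1/T2 -- no data, just identity.
struct ArcSketch {
    std::list<int> t1, t2;            // resident pages, by page ID
    std::unordered_set<int> b1, b2;   // ghost lists: evicted page IDs only
    std::size_t p = 0;                // target size of t1, adapted over time
    std::size_t capacity;

    explicit ArcSketch(std::size_t cap) : capacity(cap) {}

    // Called on a miss for page_id, before deciding what to evict.
    void learn_from_ghosts(int page_id) {
        if (b1.erase(page_id)) {
            // We evicted this page from the recency side too eagerly:
            // grow the recency target. (Real ARC scales the step.)
            p = std::min(capacity, p + 1);
        } else if (b2.erase(page_id)) {
            // We evicted it from the frequency side too eagerly:
            // give the frequency side more room.
            p = (p > 0) ? p - 1 : 0;
        }
        // ...then evict from t1 or t2 depending on p, recording the
        // evicted page's *page ID* in the matching ghost list.
    }
};
```

The ghost lists carry no data, only the identities of recently evicted pages, which is why getting that identity right matters so much.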
But the benchmarks showed it losing at 10% buffer size. At 20%. At every size.
I don’t know. Maybe ARC is overhyped. Maybe the textbooks are wrong. Maybe simple really is better.
Or maybe there’s a bug.
Claude, You Are Wrong
“Are you sure there are no bugs in the implementation?”
Claude looked at the ARC code and found it immediately.
The policy interface tracked frame IDs, but ARC’s ghost lists needed page IDs.
When page 5000 gets evicted from frame 42, the ghost list should remember “page 5000,” because frame 42 will immediately be reused for a different page.
The implementation remembered “frame 42.”
So when the ghost list later saw activity on frame 42, it thought it was page 5000. But it wasn’t. It was page 8123 or whatever got loaded into that frame next.
The entire adaptive mechanism was learning garbage. ARC was just LRU with extra overhead and broken logic.
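In code, the whole thing came down to roughly this one-line difference. The names here are mine, reconstructed to show the shape of the bug rather than the actual interface:

```cpp
#include <unordered_set>

// What the eviction hook effectively did (buggy): it keyed the ghost list
// on the frame, which gets recycled for a different page almost immediately.
void record_eviction_buggy(std::unordered_set<int>& ghost,
                           int frame_id, int /*page_id*/) {
    ghost.insert(frame_id);   // remembers "frame 42", not "page 5000"
}

// The fix: ghost lists have to be keyed by page ID, because the page's
// identity is the only thing that survives once the frame is reused.
void record_eviction_fixed(std::unordered_set<int>& ghost,
                           int /*frame_id*/, int page_id) {
    ghost.insert(page_id);    // remembers "page 5000"
}
```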
The Fixed Results
We fixed the bug. Re-ran everything.
Zipfian Workload:

- LRU: 80.7% hit rate
- ARC: 81.2% hit rate

Buffer Size Scaling:

- 10% buffer: LRU 66.2%, ARC 71.0% (+4.8%)
- 20% buffer: LRU 75.3%, ARC 77.6% (+2.2%)
- 30% buffer: LRU 80.8%, ARC 81.6% (+0.7%)
- 40% buffer: LRU 84.3%, ARC 84.0%
- 50% buffer: LRU 87.0%, ARC 86.4%
Now ARC wins when memory is tight. 4.8 points better at a 10% buffer. 2.2 points better at 20%.
And it loses when memory is plentiful: the overhead costs more than the adaptation helps.
This matches theory. This makes sense.
The original results were just… wrong.
What Scares Me About This
Claude didn’t just write buggy code. It:
- Compiled the code successfully
- Ran comprehensive benchmarks
- Generated plausible results
- Analyzed those results coherently
- Defended a wrong conclusion confidently
The code looked clean. The benchmarks ran. The numbers were internally consistent. The analysis sounded reasonable. And it was all based on a bug.
If I hadn’t known enough about ARC to be suspicious, I would have walked away thinking it’s overhyped.
Here’s what makes this harder: AI doesn’t hedge. Humans express uncertainty—“based on these assumptions” or “this might not account for…” With AI, there’s no signal. Just confident output. Right or wrong, the tone is identical.
How many times have I trusted AI-generated results without questioning them? How many experiments have I run where I didn’t have enough domain knowledge to know the results looked wrong?
I don’t know. That’s what bothers me.
I’m not going to stop using AI. Claude wrote 2,000 lines of C++ in minutes. That’s absurdly productive.
But I’m changing how I think about it: treat AI like a really fast junior engineer who’s confident about everything. The code might be great. The analysis might be wrong. The results might be based on a subtle bug.
What I need to do:
- Actually review the implementation
- Question results that don’t match expectations
- Ask “are you sure?” even when it sounds confident
- Trust my domain knowledge over AI confidence
That last one is hard. When Claude gives me a detailed analysis with benchmark numbers, it’s easy to think “maybe I’m wrong.” But sometimes I’m not wrong. Sometimes the AI made a mistake. And it’ll defend that mistake with the same confidence it would defend correct results.
The Echo Chamber
I keep seeing people on Twitter talking about how good LLMs have gotten at programming tasks. Building entire products in minutes. Writing flawless code. Replacing developers.
The people saying this loudest are people with technical backgrounds who know what they’re doing. They know how to interpret the results. They know when to trust the AI and when to question it.
But here’s what worries me: people without that background are listening.
They hear “AI can build anything” and think that means they don’t need to understand the domain. They can just ask the AI, accept the output, and move on.
And that works. Until it doesn’t.
Until the AI writes code with a subtle bug that produces plausible but wrong results. Until it analyzes data and draws confident conclusions from flawed assumptions. Until it builds something that looks right but fails in ways you don’t have the expertise to notice.
The people celebrating AI’s capabilities have the knowledge to catch these mistakes. They’re using AI as a force multiplier for their existing expertise.
The people who don’t have that expertise? They’re using AI as a replacement for it.
And the AI won’t tell them the difference.
How I Actually Use AI
I don’t want this to sound like “AI is dangerous, don’t use it.” Because I use it constantly.
I use Claude to learn new topics. I’ll feed it a lecture or paper and have it quiz me, Socratic method style. It asks questions, I answer, it pushes back. That’s incredibly valuable for understanding new concepts.
I use it to understand codebases. “Why does this implementation use X instead of Y?” “What are the trade-offs here?” Claude explains things I could figure out myself, but in minutes instead of hours.
I use it to write code and run experiments. Like this buffer pool benchmark. Claude wrote 2,000 lines I would have spent days on.
I use it to review my writing. This very blog post. I write a draft, Claude suggests changes, I decide what to keep.
Here’s what all these have in common: I have domain knowledge in what I’m doing.
When Claude quizzes me on database systems, I know enough to recognize bad questions or wrong corrections. When it explains code, I can tell if the explanation matches what I see. When it writes benchmarks, I can spot when results don’t match theory. When it reviews my writing, I know my voice well enough to reject suggestions that don’t fit.
AI isn’t teaching me new skills. It’s augmenting what I already know how to do. It’s compression of time, not replacement of expertise.
And that’s the gap I keep seeing people miss.
What I’m Still Working Out
I don’t have clean answers here. Just things I’m trying to figure out:
How much domain knowledge is enough to use AI safely? When I’m learning something completely new, am I equipped to catch AI’s mistakes? Or do I need to learn the basics the hard way first?
How do I know when I’m out of my depth? When the AI confidently explains something I don’t understand, how do I tell the difference between “this is new to me but correct” and “this is plausible-sounding garbage”?
How do I balance productivity gains against verification overhead? If reviewing AI output takes as long as writing it myself, what’s the point? But if I don’t review it, I risk shipping bugs like the ARC implementation.
I don’t know yet. I’m figuring it out as I go.