Technical deep-dive

LLM-as-Judge: using Claude to review a Gemini agent

Our Gemini agent was confidently wrong 15% of the time. We used Claude Opus to analyze every trace and find the patterns no benchmark could surface.

In the previous article, I compared 7 models from 4 providers on the same agentic task. Gemini 3 Flash won on the balance of accuracy, cost, and latency. But winning the benchmark doesn't mean the agent is good: 74.5% accuracy means roughly 1 in 4 products gets the wrong answer, and some of those wrong answers come with high confidence.

The benchmark tells you what fails. It doesn't tell you why. For that, I needed something that could look at the agent's reasoning step by step and tell me where the logic broke down.

So I built a judge.

The idea

The production agent runs on Gemini 3 Flash. It's fast and cheap, which is why it's in production. But it makes mistakes. Some of those mistakes share patterns that, if I could identify them, would tell me exactly what to fix in the prompt or the pipeline.

Manually reviewing agent traces is possible but painful. Each trace has 3-6 tool calls, each with a search query, results, page content, and a reasoning step. Reviewing one product takes 10-15 minutes if you're being thorough. Reviewing 50 takes a week.

The fix: use a smarter model (Claude Opus 4.6) to review the agent's work. Claude has more reasoning capacity than Gemini Flash. It can read an entire agent trace, spot logical errors, verify sources, and try alternative approaches the agent missed. A senior engineer reviewing a junior engineer's work, except the senior engineer is also an LLM.

The 3-phase process

The judge follows a strict 3-phase process for every review.

The 3-phase judge pipeline. Phase 1 reads the trace cold. Phase 2 does its own research. Phase 3 compares and scores.

Phase 1: trace analysis (no tools)

Pure analysis — reading and thinking

The judge reads the complete agent trace. Every search query, every result, every page read, every reasoning step. For each iteration it analyzes: what query was constructed and why, were the results relevant, what did the agent decide next, was the reasoning logical, were there obvious angles the agent didn't explore.

Phase 2: informed verification (web search + page reading)

The expensive phase — original research

Now the judge does its own research. It re-reads pages the agent cited to verify that they actually say what the agent claims. It tries alternative queries the agent missed. It searches in different languages if the product isn't French. It focuses on the weak points identified in Phase 1.

Phase 3: comparative verdict

Structured output — scores, tags, verification

The judge compares its findings with the agent's conclusion and produces a structured review. The verdict is one of: correct, incorrect, partially_correct, or uncertain. Each review includes 5 scores, issue tags from a taxonomy of 13 labels, and source-by-source verification.

The key rule: if the agent said "unknown" but the judge found the answer, that's "incorrect." The verdict is about whether the agent delivered the right answer, not whether it tried hard.

What the judge found

Across 75 production scans reviewed (20 in a first batch, 55 in a second), the average score is ~50/100. That sounds terrible, but there's an important caveat: I don't run the judge on easy wins. I specifically select cases that seem interesting: "probable" confidence results, scans where the GS1 prefix contradicts the found country, results that look surprising, or products where a user submitted a correction. The judge is a learning tool, not a representative sample.

The value isn't in the aggregate score. It's in the patterns.

Top patterns found by the judge (batch 2: 55 production scans, selected for difficulty): never reads pages, 51% (28/55); no barcode search, 33% (18/55); wasted tool calls, 25% (14/55); snippet misread, 18% (10/55). Snippet misreads fell from 35% in batch 1 to 18% in batch 2, while wasted calls rose from 15% to 25%. Patterns shift between batches: track them over time, not just once.
Patterns invisible to benchmarks. The #1 issue: the agent burns its tool budget on search and almost never reads the pages it finds.

Three findings stood out, each with a lesson that applies beyond my specific use case.

Agents take shortcuts. The biggest pattern (28/55 scans in the second batch): the agent uses all of its tool calls on web searches and almost never reads the actual pages. It finds a search snippet saying "Made in France," treats it as fact, and moves on. But that snippet might be a navigation link, a category filter, or a statement about a different product. The answer was often on the page, one click away.

If you're building an agent with tools, check whether it's actually using them all. Ours had read_webpage available but preferred to stay in the comfortable search-snippet loop.
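
A cheap way to catch this in your own traces is to compute a page-read ratio per trace and flag the zeros. A sketch, assuming a simple list-of-dicts trace format (the schema and tool names here are illustrative; adapt them to your own logging):

```python
def page_read_ratio(trace: list[dict]) -> float:
    """Fraction of tool calls in a trace that actually read a page."""
    tools = [step["tool"] for step in trace]
    return tools.count("read_webpage") / len(tools) if tools else 0.0

# Example trace: two searches, one page read.
trace = [
    {"tool": "web_search", "query": "EAN 3017620422003 origin"},
    {"tool": "web_search", "query": "chocolate spread made in"},
    {"tool": "read_webpage", "url": "https://example.com/product"},
]
assert page_read_ratio(trace) == 1 / 3
```

Run it over a batch of production traces and the "never reads pages" pattern shows up as a spike at 0.0 without needing a judge at all; the judge tells you it matters, the ratio tells you how often.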

Curated benchmarks have blind spots. The agent never searched by barcode number directly (18/55 scans). It always searched by product name. But some products don't have a clean name in our database, and searching the EAN directly on retailers would have found structured origin fields immediately.

This pattern was invisible in the benchmark. Every benchmark item had a clean name because I'd curated it that way. The benchmark tested "can the agent find origin for a known product." Production tested "can the agent handle whatever random barcode a user scans." Different question, different failure modes.

Patterns evolve, and you need to track them over time. Between the first batch (20 reviews) and the second (55 reviews), snippet misinterpretation dropped from 35% to 18%. But wasted tool calls went up from 15% to 25%. The agent was getting better at some things and worse at others. Without running the analysis twice, I would have missed both trends.

The cost problem (and an ugly but effective solution)

The judge runs Claude Opus 4.6 via the Anthropic API, with web search and page reading tools. Phase 2 alone can involve 4-8 tool calls. Each review costs between $0.40 and $0.70.

For 50 products, that's $20-35. Not catastrophic, but too expensive for regular QA. I wanted to review every interesting production scan, not just a sample.

My solution was pragmatic: I rebuilt the exact same judge as a slash command in Claude Code (Anthropic's CLI tool). Same 3-phase process, same tools, same structured output. The difference is that the CLI version runs on my Claude Max subscription instead of the API. Marginal cost per review: $0.

The API version still exists for automated use. But day-to-day, I run /judge <EAN> from my terminal and get the same structured review without paying per call.
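
For readers who haven't used them: Claude Code custom slash commands are plain Markdown prompt files dropped into `.claude/commands/`, with `$ARGUMENTS` standing in for whatever follows the command. A heavily compressed sketch of what the command file can look like (the file name and wording are illustrative; my real prompt is much longer):

```markdown
<!-- .claude/commands/judge.md : invoked as /judge <EAN> -->
Review the production agent trace for EAN $ARGUMENTS.

Phase 1 (no tools): read the full trace. For each iteration, assess the
query, the relevance of the results, and the logic of the next step.
Phase 2 (web search + page reading): re-read cited pages, try queries the
agent missed, focus on the weak points from Phase 1.
Phase 3: output a structured review with a verdict (correct, incorrect,
partially_correct, uncertain), 5 scores, issue tags, and source checks.
Rule: if the agent said "unknown" but you found the answer, the verdict
is incorrect.
```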

Is this elegant? No. Is it a long-term solution? Probably not. But it let me go from "I can afford to review 20 products a month" to "I can review every product I want." And that volume is what makes pattern analysis useful.

The analysis layer

Individual reviews are useful. Patterns across reviews are transformative.

On top of the judge, I built an analysis command that reads the last N reviews and identifies recurring patterns: which issue tags appear most often, which failures cluster together, which recommendations keep coming up, which types of queries consistently fail.

The feedback loop: judge reviews (75 prod scans) → find patterns (frequency + impact) → prompt change (highest-impact fix) → re-benchmark (regression check) → ship. Then loop: judge new scans, find the next pattern, iterate.
The feedback loop. Judge reviews surface patterns, patterns become prompt changes, changes get benchmarked. Repeat.

The output is a prioritized report. Each pattern gets a frequency (X out of N reviews), an impact rating (does it cause wrong answers or just inefficiency?), and a scope (universal, market-specific, category-specific). The report ends with 3-5 ranked recommendations.
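
The frequency half of that report is trivial to compute once reviews are structured. A sketch, assuming each review carries an `issue_tags` list (tag names illustrative):

```python
from collections import Counter

def pattern_report(reviews: list[dict], top: int = 5) -> list[tuple[str, str]]:
    """Rank issue tags by how many reviews they appear in."""
    n = len(reviews)
    counts = Counter(tag for r in reviews for tag in r["issue_tags"])
    return [(tag, f"{c}/{n} ({c / n:.0%})") for tag, c in counts.most_common(top)]

reviews = [
    {"issue_tags": ["never_reads_pages", "wasted_tool_calls"]},
    {"issue_tags": ["never_reads_pages"]},
    {"issue_tags": ["snippet_misread"]},
]
```

The impact and scope ratings still come from reading the reviews; frequency is just the part a script can do for free.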

This is where the judge system pays for itself. One review tells you "this product got the wrong answer because the agent trusted a misleading snippet." Seventy-five reviews tell you "the agent almost never reads pages, and imposing a minimum page-read ratio would address the root cause." The first is an anecdote. The second is a strategy.

The benchmark and the judge complement each other. The benchmark measures aggregate performance and catches regressions. The judge explains why things fail and surfaces patterns that curated test sets miss. I need both.

The feedback loop

The whole point of the judge is to feed improvements back into the agent. Some judge recommendations translated directly into improvements. The EAN-first pattern became a prompt change. The snippet misinterpretation finding led to the anti-FC rules I described in the prompt engineering article (the ones that failed on Flash Lite but worked on 3 Flash).

Other recommendations didn't work in practice. The language adaptation suggestion (search in Italian for Italian products) added noise without improving accuracy on the benchmark. Sometimes the judge identifies a problem but the fix doesn't exist yet, or the model can't handle the added complexity.

The judge doesn't replace human judgment about what to change. It tells you where to look.

Is this worth building?

Honestly, the judge system took real engineering effort. The 3-phase process, the structured review schema, the trace formatting, the CLI rebuild, the analysis layer.

But looking back, the judge found the EAN-first pattern that no amount of benchmark staring would have revealed. It confirmed benchmark findings with production data. It gave me a structured vocabulary for agent failures (those 13 issue tags) that made it possible to track patterns over time.

If you're building an agent that runs in production, you need some way to understand why it fails, not just how often. Manual review doesn't scale. A judge agent does.


Next up: From 42% to 78%: the full iteration log of a production AI agent. 108 benchmark runs, 7 models, 6 prompt versions, 3 weeks. Every decision we made, and the timeline that connects it all.

This is part of a series on building a production AI agent for Mio. Previous: Benchmarking 7 LLMs from 4 providers on the same agentic task.

Frequently asked questions

  • What is LLM-as-Judge? Using a stronger model (in my case, Claude Opus 4.6) to review the work of a weaker production model (Gemini 3 Flash). The judge reads the agent's full trace, does its own independent research, then compares findings to produce a structured verdict. It's like a senior engineer reviewing a junior engineer's work.
  • What does each review cost? Using Claude Opus via API with web search tools, each review costs $0.40-$0.70. For 50 products, that's $20-35. I reduced the marginal cost to $0 by rebuilding the judge as a CLI command running on my Claude Max subscription.
  • Does the judge replace human review? Not entirely. The judge surfaces patterns and does thorough trace analysis faster than a human, but it can also be wrong. I use it as a first pass: it identifies where to look, and I validate the findings. The real value is in pattern analysis across many reviews, not individual verdicts.
  • What did the judge find? The top findings: the agent almost never reads actual web pages (relying on search snippets instead), it never searches by barcode number directly, and snippet misinterpretation dropped from 35% to 18% between review batches while wasted tool calls increased from 15% to 25%.
