Technical deep-dive

Benchmarking 7 LLMs from 4 providers on the same agentic task

GPT-5.1 scored 26%. Gemini 3 Flash scored 74.5%. Same prompt, same tools, same dataset. Most models weren't eliminated on accuracy.

In the previous article, I explained how I built the evaluation infrastructure for my AI agent: a hand-curated golden dataset, a 3-run minimum per config, and the discovery that 17% of items flip between identical runs. This article puts that infrastructure to use.

I'm building Mio, an app where you scan a product barcode and get the manufacturing country. The AI agent searches the web, reads pages, cross-references sources, and returns a country with a confidence level. I built the same agent pipeline for 5 providers: Gemini, Anthropic, OpenAI, xAI, and Mistral. Same prompt. Same tools. Same scoring.

Here's what happened when I ran them all against the same benchmark.

[Bar chart: country match accuracy (%), same prompt, same tools, same golden dataset, with a 60% minimum line. Gemini 3 Flash 74.5% (winner), Grok 4 Fast 72.4% (latency), Claude Haiku 4.5 67.6% (cost), Flash Lite 3.1 60%, Mistral 50%, Gemini 2.5 Flash 45.6% (10.5 FC), GPT-5.1 26.5%.]
7 models, same prompt, same tools. Gemini 3 Flash wins at 74.5%. GPT-5.1, despite strong public benchmarks, scored 26.5%. Amber bars were eliminated on cost or latency, not accuracy.

The four walls

This isn't a "which model is smartest" comparison. It's an elimination tournament. My agent runs inside a consumer app where people scan products in a store and wait for an answer. That sets hard constraints.

Latency: under 10 seconds ideally, 15 seconds max. At 20-30 seconds, users put their phone back in their pocket.

Cost: under ~$0.01 per scan. At $0.02, the unit economics don't work at scale.

Accuracy: above ~60% country match. Below that, the app feels broken. Users scan 3 products, get 2 wrong answers, and uninstall.

False confidence: as low as possible. The agent saying "verified: Made in France" when the product is made in China is worse than saying "I don't know."

If any single dimension is unacceptable, the model is out. Doesn't matter how good the other numbers are.
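The four walls amount to a simple disqualification gate. A minimal sketch of that logic, with illustrative names and thresholds taken from the constraints above (the false-confidence cutoff is my invention here, since the real target is just "as low as possible"):

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    """Aggregate metrics for one model across benchmark runs (hypothetical shape)."""
    accuracy: float          # country match rate, 0.0-1.0
    latency_s: float         # average seconds per trace
    cost_usd: float          # average dollars per trace
    false_confidence: float  # "verified" answers that were wrong

def eliminate(stats: RunStats) -> list[str]:
    """Return the walls a model hits; an empty list means it survives."""
    walls = []
    if stats.accuracy < 0.60:
        walls.append("accuracy")
    if stats.latency_s > 15.0:
        walls.append("latency")
    if stats.cost_usd > 0.01:
        walls.append("cost")
    if stats.false_confidence > 8:  # illustrative cutoff, not a hard spec
        walls.append("false_confidence")
    return walls

# Grok 4 Fast: competitive accuracy, but the latency wall alone disqualifies it.
grok = RunStats(accuracy=0.724, latency_s=33.6, cost_usd=0.001, false_confidence=3)
print(eliminate(grok))  # → ['latency']
```

One wall hit is enough; the gate never weighs dimensions against each other.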

[Funnel diagram: 7 models → accuracy > 60% eliminates 3 (Mistral, GPT-5.1, 2.5 Flash) → latency < 15s eliminates 1 (Grok) → cost < $0.01 eliminates 1 (Haiku) → minimal false confidence eliminates 1 (GPT-4.1) → 2 left.]
The elimination funnel. 7 models enter, 2 survive. Each wall disqualifies on a single dimension.

The eliminations

Mistral

Eliminated: accuracy (50%)

Tested on the early eval dataset (10 items). Country match: 50%. Cost was the lowest of anything I tested ($0.0006/trace), latency was fine (10.5s). But 50% accuracy means the agent is guessing. Didn't proceed to the gold-curated benchmark.

50% match · $0.0006/trace · 10.5s

GPT-5.1

Eliminated: accuracy (26.5%)

The most surprising result. GPT-5.1 is a strong model on public benchmarks. On my gold-curated dataset (34 items), it scored 26.5%. The model returned null/low confidence on almost everything. 20 out of 34 items were "other failures" where the agent never submitted an answer.

Honest caveat: I'm not 100% sure this is the model's fault. The OpenAI integration uses the Responses API, and tool results might not flow as effectively as Gemini's native function calling. But at 26.5%, I didn't invest more time debugging it. The other providers worked out of the box.

26.5% match · 34 items · 20 failures

GPT-4.1

Eliminated: accuracy (43%) + rate limits

Tested on the early eval dataset (90 items, before gold-curated): 43% country match, $0.014/trace, 17.9s latency. Already below the accuracy threshold. When I tried to run it on the gold-curated dataset at concurrency 20, it immediately hit OpenAI's 30K tokens-per-minute rate limit. Unusable for benchmarking, let alone production.

43% match · $0.014/trace · 17.9s · rate limited
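Hitting a 30K tokens-per-minute limit at concurrency 20 is the kind of failure a concurrency cap sidesteps. A minimal sketch of how a benchmark runner can bound in-flight requests with a semaphore (function names are illustrative, not my actual harness):

```python
import asyncio

async def run_benchmark(items, call_model, max_concurrency: int = 5):
    """Run call_model over all items, capping in-flight requests so a
    provider's tokens-per-minute limit isn't blown through at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with sem:
            return await call_model(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(i) for i in items))

# Demo with a stub "model" that just echoes the item.
async def fake_model(item):
    await asyncio.sleep(0)
    return item

print(asyncio.run(run_benchmark([1, 2, 3], fake_model, max_concurrency=2)))  # → [1, 2, 3]
```

A real runner would also need retry-with-backoff on 429 responses, since a semaphore caps concurrency but not tokens per request.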

xAI Grok 4 Fast

Eliminated: latency (22-35s)

This one was interesting. Across multiple runs on 29-30 items, accuracy ranged from 40% to 72.4%. The best run (72.4%) was genuinely competitive. Cost was among the lowest ($0.001/trace). But latency killed it. Every run came in between 22 and 35 seconds. At 33.6 seconds average on the best-accuracy run, a user would be staring at a loading screen for half a minute.

If xAI gets the latency down, Grok is worth retesting. The accuracy signal was there.

40-72.4% match · $0.001/trace · 22-35s

Claude Haiku 4.5

Eliminated: cost ($0.019/trace)

The hardest elimination. Haiku got 67.6% accuracy on gold-curated (34 items), with 7 false confidence cases. Not far from Gemini 3 Flash (74.5%). On easy items, it hit 100%. Solid model.

But $0.019 per trace. That's 4-5x what Gemini costs. And latency was 17.4 seconds on the gold-curated run, with some eval-dev runs hitting 20-29 seconds. At $0.019/scan, 10,000 daily users doing 3 scans each means $570/day just in LLM costs. Gemini at $0.004/scan brings that to $120/day for better accuracy.

Sometimes a good model just doesn't fit the economics.

67.6% match · $0.019/trace · 17.4s · 7 FC
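The unit economics are worth sanity-checking directly. A trivial calculation using the hypothetical 10,000-user scale from above:

```python
def daily_llm_cost(users: int, scans_per_user: int, cost_per_scan: float) -> float:
    """Daily LLM spend in dollars for a given scale."""
    return users * scans_per_user * cost_per_scan

# 10,000 daily users doing 3 scans each:
print(round(daily_llm_cost(10_000, 3, 0.019), 2))  # Haiku → 570.0
print(round(daily_llm_cost(10_000, 3, 0.004), 2))  # Gemini 3 Flash → 120.0
```

The 4-5x per-trace gap compounds into a $450/day difference at that scale, for a model that was also less accurate.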

Gemini 2.5 Flash

Eliminated: accuracy (45.6%) + false confidence (10.5)

The predecessor to the models I ended up using. 45.6% accuracy with the highest false confidence of any Gemini model (10.5 average across 2 runs). Also 2x more non-deterministic than Flash Lite: 37% of items flipped between identical runs, compared to 17% for Flash Lite. Bad accuracy, bad FC, unstable.

45.6% match · 10.5 FC · 37% flip rate
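Flip rate, as I use the term, is just the share of items whose pass/fail outcome differs between two runs of the same config. A minimal sketch (the item IDs and results here are invented):

```python
def flip_rate(run_a: dict[str, bool], run_b: dict[str, bool]) -> float:
    """Share of common items whose correctness flipped between two
    identical-config runs. 0.0 = fully stable, 1.0 = every item flipped."""
    common = run_a.keys() & run_b.keys()
    flips = sum(1 for item in common if run_a[item] != run_b[item])
    return flips / len(common)

run_1 = {"ean_1": True, "ean_2": False, "ean_3": True, "ean_4": True}
run_2 = {"ean_1": True, "ean_2": True, "ean_3": True, "ean_4": False}
print(flip_rate(run_1, run_2))  # → 0.5
```

A 37% flip rate means a single run of a 34-item dataset is mostly noise, which is why every headline number here is an average of multiple runs.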

The survivors

Two Gemini models made it through all four walls.

Gemini 3.1 Flash Lite (runner-up)

Passed all walls, but hit a ceiling

54-60% accuracy (varies by run), FC around 4-7, latency 8.6s, cost ~$0.006/trace. This was my production model for a while. Low false confidence, decent cost, fast. But as I described in the prompt engineering article, it was stuck at a local optimum. Every prompt change I tried made things worse. The model was too simple to follow nuanced rules.

54-60% match · $0.006/trace · 8.6s · ~5 FC

Gemini 3 Flash (the winner)

Best balance across all dimensions

74.5% accuracy (average of 3 runs: 73.5%, 73.5%, 76.5%), FC 7.7, latency 13.5s, cost ~$0.004/trace. With parallel tool dispatch, the best single run hit 82.6%.

Gemini 3 Flash didn't win by being the best at any single dimension. Not the cheapest (Flash Lite was). Not the lowest FC (Flash Lite had ~5). Not the fastest (Flash Lite at 8.6s beat it). But it had the best balance: highest accuracy by a wide margin, within acceptable bounds on everything else. And unlike Flash Lite, it responded to prompt optimization.

74.5% match · $0.004/trace · 13.5s · 7.7 FC · 82.6% best run

What I learned

My takeaway: public benchmarks don't predict agentic performance. GPT-5.1 ranks high on MMLU, HumanEval, and other standard benchmarks. It scored 26.5% on my task. Gemini 3 Flash ranks lower on most public benchmarks. It scored 74.5%. Agentic tool-use tasks (search, read, reason, decide) test something completely different from "answer this question" benchmarks.

Langfuse benchmark charts part 1: latency, cost, country match, and false confidence across the first 40 runs
Langfuse benchmark charts part 2: same metrics across the next 40 runs, showing accuracy climbing and cost dropping
108 benchmark runs in Langfuse. Latency drops as I move to faster models, cost stabilizes around $0.004, accuracy climbs from ~45% to ~75%, and false confidence stays volatile.

Most eliminations were about economics, not intelligence. Haiku at 67.6% would have been a perfectly good agent. Grok at 72.4% was competitive with Gemini. Both were eliminated on cost or latency, not accuracy. If you're building a backend service with no latency constraint and a generous budget, your winner might be completely different from mine.

[Scatter plot: accuracy (%) vs cost per trace ($). Lines at 60% minimum accuracy and $0.01 maximum cost mark the viable zone. Plotted: Mistral, Grok, Gemini 3 Flash, Flash Lite, GPT-4.1, Haiku, GPT-5.1.]
Cost vs accuracy. The viable zone (top-left) requires both high accuracy and low cost. Haiku and Grok had the accuracy but failed on cost or latency. Only Gemini 3 Flash sits comfortably inside.

Testing depth should match viability. I ran 100+ benchmarks on Gemini models and ~5 on every other provider. Deliberately uneven. It sounds unfair. But once a model hits a disqualifying wall, spending more benchmark budget on it is waste. I invested where it mattered.

Same prompt does not mean same results. All five providers got the exact same system prompt and tool definitions. The accuracy range was 26.5% to 74.5%. The prompt was designed for Gemini (it's where I iterated), which probably gives Gemini an advantage. A prompt optimized for Haiku or GPT might close some of the gap. But the cost/latency constraints would still eliminate them for my use case.

The unified architecture paid for itself. Building the agent for 5 providers with the same interface was serious engineering work. But it meant every comparison was apples-to-apples. Same prompt, same tools, same scoring, same dataset. No "well maybe the OpenAI version just has different tools." If a model underperformed, it was the model (or the API integration), not the setup.
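The unified architecture boils down to one interface that every provider adapter implements, with the prompt, tools, dataset, and scoring living outside it. A sketch using a structural Protocol (names and shapes are illustrative, not Mio's actual code):

```python
from typing import Protocol

class AgentProvider(Protocol):
    """One adapter per provider (Gemini, Anthropic, OpenAI, xAI, Mistral)."""
    name: str

    def run(self, system_prompt: str, tools: list[dict], barcode: str) -> dict:
        """Execute the agent loop; return e.g. {'country': ..., 'confidence': ...}."""
        ...

def benchmark(providers: list, prompt: str, tools: list[dict],
              dataset: list[dict]) -> dict[str, float]:
    """Same prompt, same tools, same dataset for every provider: apples to apples."""
    scores = {}
    for provider in providers:
        correct = sum(
            1 for item in dataset
            if provider.run(prompt, tools, item["barcode"]).get("country")
            == item["expected_country"]
        )
        scores[provider.name] = correct / len(dataset)
    return scores

# Stub adapter that always answers "France", to show the wiring.
class EchoProvider:
    name = "stub"
    def run(self, system_prompt, tools, barcode):
        return {"country": "France", "confidence": "high"}

dataset = [{"barcode": "123", "expected_country": "France"},
           {"barcode": "456", "expected_country": "China"}]
print(benchmark([EchoProvider()], "prompt", [], dataset))  # → {'stub': 0.5}
```

The point of the Protocol is that the benchmark loop never knows which provider it is scoring, so a difference in results can only come from the model or its adapter.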

The honest caveats

I want to be clear about what this benchmark does and doesn't show.

It shows how these models perform on my specific task (manufacturing country lookup via web search), with my specific prompt (optimized for Gemini), at my specific scale (consumer app, cost-sensitive). A different task, a different prompt, or different constraints could produce a completely different ranking.

The GPT-5.1 result in particular might not reflect the model's capability. If I'd spent more time on the OpenAI integration, the results might improve. I made a pragmatic choice: other providers worked immediately, so I invested time there instead.

And the testing depth is uneven. 3 runs on Haiku versus 20+ runs on Gemini 3 Flash means I have much more confidence in the Gemini numbers. The Haiku result (67.6%) could be an unlucky run. Or a lucky one. With so few runs, I don't know.

What I do know: Gemini 3 Flash at $0.004/trace and 13.5s gives me 74.5% accuracy. That's the combination I can build a product on. For now.

Because this benchmark is a snapshot, not a verdict. Prices drop. Models improve. Latency gets optimized. Haiku was eliminated on cost, but Anthropic's pricing changes regularly. Grok was eliminated on latency, but xAI is actively optimizing inference speed. GPT-5.1 might just need a different integration approach.

The elimination results are factual, but they're not permanent. The benchmark infrastructure stays. When the context changes, I'll rerun.


Next up: LLM-as-Judge: using Claude to review a Gemini agent. How I automated QA by having a smarter model review every agent trace, and the patterns it found that I never would have caught manually.

This is part of a series on building a production AI agent for Mio. Previous: Why your LLM agent needs a benchmark before it needs a prompt.

Frequently asked questions


  • Which model is best for agentic tasks? It depends on your constraints. In my tests, Gemini 3 Flash scored 74.5% accuracy at $0.004/trace and 13.5s latency. Claude Haiku 4.5 got 67.6% but at $0.019/trace. GPT-5.1 scored 26.5%. Public benchmarks don't predict agentic performance. You need to benchmark on your specific task.
  • Why did GPT-5.1 score so low? It returned null or low confidence on 20 out of 34 items. This might be partly an integration issue (Responses API vs native function calling), not purely model capability. But at 26.5% on the only run, I chose to invest time in providers that worked out of the box.
  • What does each model cost per trace? In my tests: Gemini 3 Flash at $0.004/trace, Gemini Flash Lite at $0.006, Claude Haiku at $0.019, GPT-4.1 at $0.014, Mistral at $0.0006, and Grok 4 Fast at $0.001. For a consumer app with thousands of daily users, these differences are the difference between viable and not.
  • Does higher accuracy mean more false confidence? Not linearly. Gemini 2.5 Flash had 45.6% accuracy but the highest false confidence (10.5 avg). Flash Lite had lower accuracy (54-60%) but much lower false confidence (~5). Smarter models tend to answer more questions, which means more chances to be confidently wrong.

Scan your first product. Free, unlimited, no ads, no sign-up. Available on iOS and Android.