From 42% to 78%: the full iteration log of a production AI agent
108 benchmark runs, 7 models, 6 prompt versions, 3 weeks. The chronological story of going from 'this kind of works' to 'this is reliable enough to ship.'
This is the last article in the series. The previous five covered the pieces: why this is an AI problem, how prompt optimization works across three dimensions, why you need a benchmark before a prompt, what happened when we tested 7 models, and how we used Claude to review the Gemini agent's work.
This article is the timeline that connects them. The chronological story of going from "this kind of works" to "this is reliable enough to ship." Three weeks, 108 benchmark runs, and a lot of reverted commits.
Week 0: Building on sand
I started benchmarking before I had a proper benchmark. I grabbed random barcodes from our database of 69 million products, ran the agent on them, and compared results against ground truth I'd cobbled together from manual research.
About 33 runs across Gemini Flash, Haiku, GPT-4.1, and Mistral. The numbers were useful for rough model comparison. But I kept hitting the same problem: when a run showed a regression, I couldn't tell if the agent got worse or if my "ground truth" was wrong. Some expected countries had been found by the agent itself in earlier runs. I was grading the student with the student's own answers.
I needed ground truth I could actually trust. That's when I started building the golden dataset.
Day 1: The first real number
The golden dataset: 19 items. Each one with a verified manufacturing country, a confidence level, and a difficulty rating. Small, but the labels were solid.
First benchmark: Gemini 2.5 Flash, prompt v1. Result: 8/19 = 42%.
That was the number that made everything real. Not "it seems to work on a few examples." 42%. Less than half. And 4 of the items were false confidence (FC): the agent said "verified: Made in France" and it was wrong.
I deep-dived the 4 FC cases that evening. Three shared the same pattern: the agent read a search snippet saying "Made in France" that actually referred to a different product on the same page. The fourth was a case of brand confusion (French brand, Italian manufacturing). Misattributed snippets were the #1 problem.
Day 2, morning: The first win
Dataset expanded to 30 items. Re-ran the baseline: 14/30 = 47%, 7 FC. Five of the 7 false confidence cases said "France" when wrong. The agent had a strong France bias.
Then I switched models. Same prompt, same tools, just Gemini 3.1 Flash Lite instead of 2.5 Flash. Result: 18/30 = 60%, 4 FC, and 3.3 seconds faster.
Shipped. First real improvement. +13 points just from changing the model.
Day 2, afternoon: The plateau
This is where I learned humility. Riding the high of the model switch, I spent the afternoon trying to push Flash Lite further. Seven consecutive changes:
EAN-first search strategy + retailer instructions + multilingual queries (three changes at once): 47%.
Confidence calibration rules: 53%.
Temperature 1.0: 52%.
More thinking budget: 50%, then 53%.
Anti-false-confidence rules: 43%, tested three times to make sure.
Every single one worse than the 60% baseline. Seven in a row.
The anti-FC rules were the most frustrating. They fixed the exact items I was targeting (2 out of 4 FC cases resolved) but destroyed accuracy everywhere else. The model was too simple for conditional rules like "trust this type of snippet but not that type." It just distrusted everything.
By evening I accepted: Flash Lite was at a local optimum. The model couldn't absorb more complexity.
Day 2, evening: The variance discovery
Before moving on, I ran one more test. The exact same code, a second time. Same model, same prompt, same config.
Both runs came back at 60%. But 5 out of 30 items had flipped. Products that were correct became wrong. Products that were wrong became correct.
17% of items gave different results on identical runs.
This changed everything. It meant most of my afternoon's "regressions" (53%, 57%, 50%) were probably within noise. Only a delta of ±4 items or more on a 30-item set counted as real signal. I established a new rule: minimum 3 runs per configuration. No exceptions.
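To make the rule concrete, here's a minimal sketch of what "3 runs per config, report the mean and the flip rate" looks like. The real harness runs through Langfuse and isn't shown in this series; `runBenchmark()` and `evaluateConfig()` are hypothetical stand-ins, assuming each run returns one pass/fail per golden-dataset item.

```typescript
// One benchmark run = one boolean per golden-dataset item (correct or not).
type RunResult = boolean[];

async function evaluateConfig(
  runBenchmark: () => Promise<RunResult>,
  runs = 3, // the "minimum 3 runs" rule
): Promise<{ meanAccuracy: number; flipRate: number }> {
  const results: RunResult[] = [];
  for (let i = 0; i < runs; i++) {
    results.push(await runBenchmark()); // full pass over the dataset each time
  }

  const itemCount = results[0].length;

  // Report the mean across runs, never the score of a single lucky run.
  const meanAccuracy =
    results.reduce((sum, r) => sum + r.filter(Boolean).length / itemCount, 0) /
    runs;

  // Flip rate: share of items that were not consistently right or wrong
  // across identical runs (the 17% discovery).
  let flips = 0;
  for (let i = 0; i < itemCount; i++) {
    const outcomes = results.map((r) => r[i]);
    if (outcomes.some((o) => o !== outcomes[0])) flips++;
  }

  return { meanAccuracy, flipRate: flips / itemCount };
}
```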
Day 2, late: The 3 Flash moment
I tested Gemini 3 Flash for the first time. On its first run with the same v2 prompt: 66.7% match. Best accuracy I'd ever seen. But also 7 false confidence cases. Best match AND worst FC in the same run.
Smarter models answer more questions. They also hallucinate more confidently.
I also tested the brand-level pivot idea on 2.5 Flash that evening. "When you can't find where the product is made, search for where the brand manufactures." Result: 13 false confidence. Worst run in the entire project. The idea was permanently abandoned.
Day 4: The big comparative
Dataset expanded to 34 items. I ran the largest benchmark session of the project: 3 models × 4 prompt versions, 3 runs each. Every config tested three times to account for variance.
| Config | Match (avg) | FC (avg) |
|---|---|---|
| 3.1 Flash Lite v2 (production) | 54.4% | 7.0 |
| 2.5 Flash v2 | 45.6% | 10.5 |
| 3 Flash v2 (same prompt) | 57.8% | 8.7 |
| 3 Flash v3 (anti-FC rules) | 68.6% | 5.7 |
| 3 Flash v4 (nudge) | 74.5% | 7.7 |
| 3 Flash v5 (variant guard) | 71.6% | 7.0 |
| 3 Flash v6a (blocklist) | 71.6% | 6.3 |
The line that jumped off the screen: 3 Flash v3 at 68.6% with FC 5.7. Those were the same anti-FC rules that dropped Flash Lite from 60% to 43%. The exact same rules. On a smarter model, they were a massive improvement instead of a disaster.
I also tested Haiku 4.5 (67.6%, but $0.019/trace and 17.4s latency) and GPT-5.1 (26.5%, catastrophic). The provider comparison is its own article.
Shipped: Gemini 3 Flash with prompt v4 (nudge + anti-looping). 74.5% average.
Day 8: The final push
Dataset expanded to 46 items. Three changes tested:
Async classification (moving a secondary LLM call out of the main pipeline): saved 1.4 seconds per scan with no accuracy impact. Shipped.
Parallel tool dispatch: changed sequential tool execution to Promise.all (a minimal sketch follows below) and added one line to the prompt about batching. Expected a small latency improvement. Got +8.2% accuracy and -3.1 FC instead. The best single improvement of the entire project. The model started making two searches with different angles in the same turn, giving it more coverage within the same tool budget. Shipped.
Region EU (adding a field for "Made in EU" products): the model completely ignored the new field across 6 runs despite two rounds of prompt reinforcement. Shelved. Moved the logic to post-processing code.
Best single run ever recorded: 82.6% (parallel dispatch, run 3).
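For readers who want the shape of that change, here is a minimal sketch of the sequential-to-parallel switch. The agent's actual tool layer isn't shown in this series, so `ToolCall` and `executeTool()` are hypothetical stand-ins; the only load-bearing part is the Promise.all.

```typescript
// Hypothetical stand-in for one tool call emitted by the model in a turn.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

async function executeTool(call: ToolCall): Promise<string> {
  // ...call the search / page-reading tool and return its output as text
  return `result of ${call.name}`;
}

// Before: one tool call at a time, awaited one after another.
async function dispatchSequential(calls: ToolCall[]): Promise<string[]> {
  const outputs: string[] = [];
  for (const call of calls) {
    outputs.push(await executeTool(call));
  }
  return outputs;
}

// After: every tool call the model emits in a single turn runs concurrently.
// The one-line prompt addition simply tells the model it may batch
// independent searches (e.g. an EAN query and a brand query) in the same turn.
async function dispatchParallel(calls: ToolCall[]): Promise<string[]> {
  return Promise.all(calls.map((call) => executeTool(call)));
}
```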
What never got fixed
Four items resisted every configuration we tried. They appeared as false confidence in run after run, across models and prompt versions.
A retailer badge that says "Made in Italy" as a site-wide filter, not a product statement. A snippet about one Lotus product getting attributed to a different Lotus product. A French brand that manufactures in Germany. A product from Nicaragua consistently misidentified as made in Peru.
These items taught me something about the limits of prompt engineering. Some failures are structural. The agent can't distinguish a retailer badge from a product specification without reading the page, and even then, the page layout makes it ambiguous. No amount of prompt tweaking will fix data that's genuinely misleading at the source level.
I added these items to the benchmark specifically because they're hard. They keep the score honest.
The full picture
| Phase | Date | Dataset | Best accuracy | Key event |
|---|---|---|---|---|
| 0 | Week 0 | eval-dev (~90 items, noisy) | ~70% (unreliable) | Realized ground truth was shaky |
| 1 | Day 1 | 19 items | 42% | First real benchmark |
| 2 | Day 2 AM | 30 items | 60% | Model switch to Flash Lite |
| 3 | Day 2 PM | 30 items | 60% (plateau) | 7 consecutive failures |
| 4 | Day 2 PM | 30 items | 60% | Variance discovery (17% flip) |
| 5 | Day 2 late | 30 items | 66.7% | 3 Flash first test |
| 6 | Day 4 | 34 items | 74.5% | Big comparative, shipped v4 |
| 7 | Day 8 | 46 items | 82.6% (single run) | Parallel dispatch, 78% avg |
42% to 78% in three weeks. The dataset got harder along the way (19 items to 46, deliberately adding medium and hard cases), so the real improvement is larger than the numbers suggest.
What I'd do the same
Build the benchmark first. Every number in this series exists because I had ground truth to measure against. Without it, I would have shipped the anti-FC rules on Flash Lite (43%) thinking I'd made an improvement, and I would have missed the parallel dispatch win because it "didn't seem like an accuracy change."
Log every iteration. The anti-FC rules I abandoned on Flash Lite became my best improvement on 3 Flash weeks later. If I hadn't logged why they failed (model too simple, not rules too bad), I would have assumed they were a dead end.
Test one change at a time. The one time I tested three changes together, I got a regression I couldn't attribute. Wasted a full run.
What I'd do differently
Start with 30+ items. My first benchmark had 19 items. The variance at that size is so high that almost nothing is distinguishable from noise.
Switch models earlier. I spent an afternoon trying to optimize Flash Lite when the right move was to try a different model. The 7 consecutive failures were informative in hindsight, but I could have reached the same conclusion faster.
Run the judge from day one. The LLM-as-judge system surfaced patterns (EAN search missing, no page reads) that the benchmark couldn't catch. If I'd had it earlier, I would have had better signal about what to fix.
What's next
78% on 46 curated items is good enough to ship. It's not good enough to stop iterating.
The judge reviews point to clear next steps: the agent needs to read more pages instead of trusting snippets, it needs to use its full tool budget instead of giving up early, and it needs to search by EAN when product names are messy.
The dataset is at 57 items. I'm aiming for 120-150 high-quality items. But more items also means slower and more expensive iteration cycles, so there's a new problem to think about: how to keep the dataset lean. Which items are actually testing something unique, and which ones are redundant?
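One cheap heuristic I'm considering for that (a sketch, not something in the pipeline yet): treat items that have passed and failed in exactly the same runs as candidates for pruning, since they appear to be exercising the same failure mode. `history` here is a hypothetical map of item id to per-run outcomes, built from the iteration log.

```typescript
// Group items by their pass/fail signature across all logged runs.
// Items sharing a signature with another item are probably redundant.
function findRedundantItems(history: Map<string, boolean[]>): string[][] {
  const groups = new Map<string, string[]>();
  for (const [itemId, outcomes] of history) {
    const signature = outcomes.map((o) => (o ? "1" : "0")).join("");
    const group = groups.get(signature) ?? [];
    group.push(itemId);
    groups.set(signature, group);
  }
  // Only groups with more than one item are candidates for pruning.
  return [...groups.values()].filter((group) => group.length > 1);
}
```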
The benchmark infrastructure is in place. The iteration log keeps growing. And every week I learn more about what makes an AI agent actually work in production, not on a demo, not on 3 cherry-picked examples, but on whatever random barcode a user scans in a store.
That's the whole point.
This is the final article in a series on building a production AI agent for Mio, an app that surfaces manufacturing origin from product barcodes. The stack: Gemini 3 Flash for the production agent, Claude Opus 4.6 for the judge, Langfuse for tracing and benchmarks, Serper.dev for web search, Jina for page reading. If you've built similar systems, I'd love to compare notes.
The full series: Why it's an AI problem · Prompt engineering that didn't work · Benchmark before prompt · Benchmarking 7 LLMs · LLM-as-Judge · This article
Frequently asked questions
How long does it take to get an AI agent to production quality?
In my case, 3 weeks of focused iteration. But I started with a working prototype and a benchmark. The first useful number (42% accuracy) came on day 1. Shipping quality (78%) took 108 benchmark runs across 7 models and 6 prompt versions.
What matters more: the model or the prompt?
Both matter, but they interact. Switching from Gemini 2.5 Flash to 3.1 Flash Lite gave +13 points immediately. But the same anti-FC prompt rules that destroyed Flash Lite (-17 points) improved 3 Flash by +10.8 points. The model determines which prompt optimizations are possible.
How many benchmark runs do you need?
I ran 108 total, but the key insight is a minimum of 3 runs per configuration, because of non-determinism (17% of items flip between identical runs). One run per config means you can't tell signal from noise.
What was the single biggest improvement?
Parallel tool dispatch: changing sequential tool execution to Promise.all and adding one line to the prompt about batching. Expected a small latency improvement. Got +8.2% accuracy and -3.1 false confidence instead. The model started making two searches with different angles in the same turn.