The prompt engineering that didn't work (and what did)
I wrote 13 explicit rules in my prompt. The model ignored most of them. Over 108 benchmark runs, I learned that prompt optimization is three-dimensional: what fails on one model can succeed on another, and the biggest gains come from places you don't expect.
In the first article of this series, I explained why finding a product's manufacturing country from its barcode is genuinely an AI problem. The data is scattered, misleading, and requires multi-step reasoning to untangle. I'm building Mio, an app where you scan a barcode and get the manufacturing country, powered by an AI agent that searches the web, reads pages, and cross-references sources.
This article is about what happened when I tried to make that agent better.
Over three weeks, I ran 108 benchmarks against a hand-curated golden dataset. Tested 7 models from 4 providers. Iterated through 6 major prompt versions with dozens of sub-variants. And what I learned is this: optimization is three-dimensional, and a change that fails in one context can succeed in another.
The scoreboard
Before I get into specifics, here's the summary. Every line is a real benchmark run, measured against ground-truth labels. "False Confidence" (FC) is our worst failure mode: the agent says it's confident and it's wrong.
On Gemini 3.1 Flash Lite (our first production model)
| Change | Accuracy | FC | Verdict |
|---|---|---|---|
| Baseline (v2) | 60% | 4 | PRODUCTION |
| Anti-false-confidence rules | 43% | 4 | REVERTED |
| Brand-level fallback search | ~33% | 13 | CATASTROPHIC |
| Confidence calibration rules | 53% | 4 | REVERTED |
| Thinking budget to 2048 | 50% | 4 | REVERTED |
| Temperature 1.0 (was 0) | 52% | ~5 | REVERTED |
| Double search results | 53% | 4 | REVERTED |
| Ambiguity guard rule | 57% | 6 | REVERTED |
| 3 changes at once | 47% | 5 | REVERTED |
On Gemini 3 Flash (after the model switch)
| Change | Accuracy | FC | Verdict |
|---|---|---|---|
| Baseline (v2, same prompt) | 57.8% | 8.7 | — |
| Anti-FC rules (v3) | 68.6% | 5.7 | +10.8% |
| Nudge + anti-looping (v4) | 74.5% | 7.7 | SHIPPED |
| Variant guard (v5) | 71.6% | 7.0 | — |
| Blocklist (v6a) | 71.6% | 6.3 | — |
Same anti-FC rules. Flash Lite: 60% → 43%. 3 Flash: 57.8% → 68.6%. The rules weren't wrong. The model was too simple to follow them.
What moved the needle across the whole project:
| Change | Impact |
|---|---|
| Model switch (2.5 Flash → 3.1 Flash Lite) | +13.3% match, -3 FC |
| Model switch (Flash Lite → 3 Flash) + prompt recalibration | +20% match vs original prod |
| Parallel tool dispatch | +8.2% match, -3.1 FC |
The failures (and why they weren't really failures)
1. The brand-level fallback (worst run in project history)
The idea seemed great: when the agent can't find where a product is made, search for where the brand manufactures. Unilever makes stuff in 50+ countries, but a small French brand probably has one factory.
Result: 13 false confidence cases. The worst run I ever recorded. Not even close to second place.
What happened: the model found "Brand X has a factory in Y" and immediately applied that to the specific product with full confidence. Every brand search returned some country, and the model treated it as a verified answer.
The lesson I keep coming back to: never add fallback strategies that give the model an alternative path to low-quality answers. It will use them eagerly to justify confident wrong answers. The model wants to give you an answer. Your job is to make the lazy path (giving up) easier than the wrong path (guessing from brand-level data).
This one I haven't retried on a smarter model. It might work better on a model that can distinguish "brand manufactures in X" from "this specific product is made in X." But the failure was so spectacular that I moved on.
2. Anti-false-confidence rules (the plot twist)
My agent had 4 items it was consistently wrong about, all with the same pattern: it trusted "Made in France" search snippets that actually referred to a different product on the same page. So I added 3 targeted rules. Things like "catalogue page snippets are unreliable, only trust product-specific pages" and "verified confidence requires reading the actual page, not just a search snippet."
On Flash Lite, it was a disaster. Accuracy dropped from 60% to 43%. I tested 3 times: 43%, 53%, 50%. All worse. The model couldn't distinguish "this snippet from a catalogue page is unreliable" from "this snippet from a manufacturer's website is reliable." So it distrusted everything.
I reverted and moved on. The rules were too nuanced for this model.
Weeks later, I switched to Gemini 3 Flash and tested the same anti-FC rules. Accuracy went from 57.8% to 68.6%. False confidence dropped from 8.7 to 5.7. The exact same rules that broke Flash Lite were a massive improvement on a smarter model. 3 Flash could actually tell the difference between a catalogue page and a product page.
This was the moment I understood that prompt optimization isn't one-dimensional. A rule that fails on model A can succeed on model B. You don't discard the idea, you log it and revisit it when the context changes.
3. More thinking, worse results
Gemini has a "thinking budget" parameter that controls how many tokens the model can use for internal reasoning before responding. My baseline was 1024 tokens. I tried 2048. Then I tried the API's "medium" thinking level.
Both were worse. 2048 tokens: -3 match, the model started overthinking simple items. Medium: -2 match, +1 false confidence, and average iterations jumped from 3.3 to 4.2. The model used the extra thinking budget to second-guess itself, not to make better decisions. It would find a clear "Made in Germany" statement, then spend 500 extra tokens wondering if it was really Germany, and end up submitting "unknown, low confidence."
1024 tokens forces concise, decisive reasoning. More thinking budget just means more hesitation.
4. Temperature 1.0 (going against the docs)
Gemini's documentation explicitly says that temperature below 1.0 causes "looping or degraded performance, particularly in complex reasoning tasks." So I tested temperature 1.0 against my baseline of 0.
Result: -7.8% accuracy, +1.9 false confidence cases. Every metric worse. Higher temperature means more randomness, which means more hallucination, which means more confidently wrong answers.
Temperature 0 is the right choice for structured tool calling. The official docs are talking about open-ended generation, not agentic workflows where you need deterministic, focused behavior.
5. More search results = more noise
I doubled the number of search results per query from 5 to 10. More data should help the model find the right answer, right?
Accuracy dropped. The model couldn't separate signal from noise in 10 results. Contradictory snippets from different products on different sites confused it. 5 focused results outperformed 10 noisy ones.
There's a context quality threshold, and it depends on the model. A smaller model saturates faster. This is another change I'd want to retest on a bigger model.
6. Multiple changes at once (the compounding problem)
I once shipped 3 prompt changes together: a new search strategy, retailer-specific instructions, and multilingual query templates. Accuracy dropped from 60% to 47%. Medium-difficulty items went from 29% to 0%. Complete collapse.
I couldn't tell which change caused the regression. Maybe all three. Maybe just one. Doesn't matter. I reverted the whole thing and never made that mistake again.
One change at a time. Always. If you can't measure the individual impact, you can't learn from it.
7. The EU trap (when the model just says no)
I added a region field to the agent's output so products labeled "Made in EU" would return "EU" instead of "unknown." Seemed like a small, clean addition.
The model ignored it completely. Across 6 benchmark runs. I reinforced the instruction in the prompt. Still ignored. I added it to the tool description. Still ignored. Country match and false confidence stayed exactly the same whether the field existed or not.
Prompt engineering cannot force a model to fill a field it doesn't understand the purpose of. After a week of trying, I moved the logic to post-processing code. Deterministic. Works every time.
Sometimes the answer isn't a better prompt. It's code.
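That post-processing step can be a small, deterministic guard on the agent's output. A minimal TypeScript sketch, assuming hypothetical field names (`country`, `region`) rather than Mio's actual schema:

```typescript
// Hypothetical answer shape — an assumption, not Mio's real schema.
type AgentAnswer = { country: string; confidence: string };
type FinalAnswer = AgentAnswer & { region?: string };

// Label variants we treat as "Made in EU". Illustrative, not exhaustive.
const EU_LABELS = ["made in eu", "made in the eu", "european union"];

function applyRegionFallback(answer: AgentAnswer, labelText: string): FinalAnswer {
  const normalized = labelText.trim().toLowerCase();
  // If the model gave up but the label clearly says EU, fill the region in code.
  if (answer.country === "unknown" && EU_LABELS.some((l) => normalized.includes(l))) {
    return { ...answer, region: "EU" };
  }
  return answer;
}
```

Unlike the prompt instruction, this runs on every response and cannot be ignored.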
What actually worked
1. Switching models (the unlock for everything else)
My first production model was Gemini 2.5 Flash. Switching to 3.1 Flash Lite gained +13.3% match and -3 false confidence. Better model, same prompt, instantly better. But Flash Lite turned out to be a ceiling, not a foundation. Every prompt change I tried on it made things worse.
A note on why I stayed within the Gemini family: cost and latency. This is a consumer app where users scan products in a store and expect an answer in seconds. Before the gold-curated benchmark, I tested Haiku (Claude) extensively. It got comparable or slightly better accuracy on some runs, but at $0.01-0.02 per scan and 20-29 seconds latency. Gemini Flash was $0.002-0.004 per scan and 8-13 seconds. At 4-5x the cost and 2-3x the latency, Haiku wasn't viable for production, no matter how good the accuracy. GPT-4.1 had similar cost issues and wildly variable latency.
The real unlock was switching to Gemini 3 Flash. On its first run with the same v2 prompt, it got the best raw accuracy I'd ever seen (66.7%) but also the worst false confidence (7 FC). It answered more questions, but also hallucinated more.
Here's the thing though: once I was on 3 Flash, prompt engineering started working again. The anti-FC rules that failed on Flash Lite? +10.8% on 3 Flash. The nudge and anti-looping rules? Pushed it to 74.5%. I tested 4 prompt variants on 3 Flash, running each 3 times to account for variance, and each showed meaningful differences.
The model switch didn't just improve accuracy. It unlocked an entire dimension of optimization that was previously walled off.
2. Parallel tool dispatch (+8.2% accuracy)
This was the change I expected to matter least.
My agent was executing tool calls sequentially. Search, wait, read page, wait, search again, wait. I changed two things: the code now runs tool calls in parallel (Promise.all instead of sequential await), and I added one line to the prompt telling the model it can batch multiple tool calls in a single turn.
Result: +8.2% match, -3.1 false confidence. The best single improvement across the entire project. And it was remarkably stable: the worst run with this change (76.1%) still beat the average of the previous version (70.1%).
The model started making two searches with different angles in the same turn instead of doing them sequentially. More coverage within the same tool budget. The improvement came from both sides: the prompt change (model batches more) and the infrastructure change (batched calls run in parallel).
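The infrastructure half of the change is essentially a one-line swap. A sketch with hypothetical `ToolCall` and `executeTool` names standing in for the real agent plumbing:

```typescript
// Hypothetical types and executeTool — illustrative, not Mio's real code.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolResult = { name: string; output: string };

async function executeTool(call: ToolCall): Promise<ToolResult> {
  // Placeholder: a real agent would dispatch to search / read-page here.
  return { name: call.name, output: `result for ${JSON.stringify(call.args)}` };
}

// Before: sequential — each tool call waits for the previous one to finish.
async function runSequential(calls: ToolCall[]): Promise<ToolResult[]> {
  const results: ToolResult[] = [];
  for (const call of calls) {
    results.push(await executeTool(call));
  }
  return results;
}

// After: all tool calls from one model turn run concurrently.
async function runParallel(calls: ToolCall[]): Promise<ToolResult[]> {
  return Promise.all(calls.map((call) => executeTool(call)));
}
```

The code change alone does nothing if the model emits one tool call per turn, which is why the matching prompt line ("you may batch multiple tool calls in a single turn") was needed too.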
This is a good example of the three dimensions working together. The prompt told the model it could batch. The tooling made batching fast. And the model (3 Flash) was smart enough to actually do it effectively.
3. The config that held
Temperature 0 and thinking budget 1024. Not glamorous. But these were the baseline settings that every experiment on every model failed to beat. I tested temperature 1.0 (worse), thinking budget 2048 (worse), thinking level "medium" (worse). The benchmarks don't lie.
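For reference, that baseline as a config object. The field names follow the Gemini API's `generationConfig`/`thinkingConfig` shape, but treat the exact names (and the model id string) as assumptions to check against your SDK version:

```typescript
// Baseline settings that no experiment managed to beat.
// Field names assumed from the Gemini API's generationConfig shape.
const agentConfig = {
  model: "gemini-3-flash", // hypothetical model id string
  generationConfig: {
    temperature: 0, // deterministic tool calling; 1.0 benchmarked worse
    thinkingConfig: {
      thinkingBudget: 1024, // 2048 and "medium" both benchmarked worse
    },
  },
};
```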
The real lesson: optimization is three-dimensional
After 108 runs, I don't think about "prompt engineering" as a standalone activity anymore. The optimization space has three axes:
Prompt (what you tell the model), Tooling (what the model can do), and Model (how capable the model is at following your instructions). And all three are constrained by a fourth dimension: cost and latency. A brilliant model that costs $0.02/scan and takes 25 seconds isn't an option for a consumer app.
The anti-FC rules failed on Flash Lite and worked on 3 Flash. That's prompt × model. Parallel dispatch worked because of a prompt change AND an infrastructure change. That's prompt × tooling. And model switches unlocked prompt optimizations that were previously impossible. That's model × prompt. Staying within the Gemini family despite testing other providers? That's the cost/latency constraint eliminating otherwise viable options.
You have to explore all three dimensions. And critically, you have to keep good logs. The anti-FC rules I "abandoned" on Flash Lite became one of my best improvements when I revisited them on 3 Flash weeks later. If I hadn't logged the iteration, I would have assumed "anti-FC rules don't work" and never tried them again.
Don't discard an optimization because it failed in one context. Log it, understand why it failed (model too simple? tooling bottleneck? wrong config?), and revisit it when the context changes.
The seven consecutive failures on Flash Lite weren't wasted work. They were a map of what this model couldn't do, which made it obvious when to switch models, and which ideas to retry on the new one.
You need a benchmark to know any of this. Most of these changes "felt" like improvements when I tested them on 3-4 examples. The anti-FC rules fixed exactly the items I was targeting. The brand-level pivot found correct answers for some products. Without measuring against 30+ items with ground-truth labels, I would have shipped broken changes and never known.
That's actually the subject of the next article in this series.
Next up: Why your LLM agent needs a benchmark before it needs a prompt. How we built the evaluation framework, why we measure false confidence instead of accuracy, and the day we discovered that 17% of items flip between identical runs.
This is part of a series on building a production AI agent for Mio. Previous: Why finding where a product is made is an AI problem.
Frequently asked questions
Do prompt rules that work on one model work on another?
No. The same rules can fail on one model and succeed on another. In our tests, anti-false-confidence rules dropped accuracy by 17% on Gemini 3.1 Flash Lite but improved it by 10.8% on Gemini 3 Flash. The model's capability determines whether it can follow nuanced instructions. Always benchmark prompt changes per model.
What is false confidence in an AI agent?
False confidence is when an AI agent reports high confidence but is wrong. For example, stating "verified: Made in France" when the product is actually made in China. It's the most dangerous failure mode because users trust confident answers. Optimizing against false confidence matters more than optimizing for raw accuracy.
Does a larger thinking budget improve agent accuracy?
Not necessarily. In our tests, doubling the thinking budget from 1024 to 2048 tokens made results worse. The model started overthinking simple items, second-guessing clear evidence, and submitting "unknown" for products it previously answered correctly. Concise reasoning budgets force decisive behavior.
What changes had the biggest impact on agent accuracy?
Model selection and parallel tool dispatch had the largest impacts. Switching from Gemini 3.1 Flash Lite to Gemini 3 Flash unlocked +20% accuracy. Parallel tool dispatch (letting the agent batch multiple searches in one turn) added +8.2% accuracy and reduced false confidence by 3.1 points. Prompt changes alone had smaller, model-dependent effects.