The 3 things prompt engineering does NOT fix
Strong prompt engineering — few-shot examples, structured-output mode, chain-of-thought scaffolding, system-prompt voice constraints, retrieval grounding — closes roughly 70% of the quality gap between a vanilla base model call and a well-tuned model. That's true at gpt-5-mini, Claude Haiku 4.5, Gemini 2.5 Flash, and even at the open-weight tier. Most teams that 'need' fine-tuning haven't actually exhausted prompt iteration; they've done 3-4 revisions and assumed they were at the ceiling.
But there are three places where prompt engineering hits a real wall, and fine-tuning genuinely moves the curve.
**1) Voice consistency at scale (>1M outputs/month).** A 600-token system prompt that perfectly captures your brand voice costs you $0.15 per million inferences in input tokens at gpt-5-mini base pricing — trivial. Until you're shipping 50M outputs a month, at which point the system-prompt overhead is $7.5k/mo just to keep the voice consistent, and you're still seeing 5-8% voice drift on edge cases. Fine-tuning bakes the voice into the model weights. You drop the system prompt to 50 tokens, the per-call overhead drops 92%, and voice compliance hits 99%+. The math only pencils out at very high volume — under 1M outputs/month, the prompt overhead is rounding-error.
**2) Structured-output reliability past 99.5%.** Modern structured-output modes (OpenAI's `response_format`, Anthropic's tool-call schema enforcement, Gemini's controlled generation) get you to ~99% format compliance with no fine-tuning. The remaining 1% — malformed JSON in long-context tasks, schema drift on novel inputs, hallucinated enum values — is where most pipeline-breaking bugs live. Fine-tuning on 5k+ structured examples pushes compliance to 99.95%+. If you're parsing model output into a downstream database with hard schema constraints, that 0.5% reduction in error rate may be worth the $500-2k training run.
**3) Latency floor via smaller-model SFT.** A Sonnet-class model serves at 600-1,200ms p50. A 1B-parameter model fine-tuned for a narrow task can hit 50-100ms p50 on the same hardware. For real-time UI loops (autocomplete, in-keystroke suggestions, voice agents), that 10x latency reduction is the only path. You're not getting Sonnet quality at 80ms with any prompt — you have to bake the specific task into a smaller model. This is the strongest fine-tuning case in 2026, especially with Llama 4 8B and Mistral Small 3 as cheap, ownable SFT targets.