Fine-tuning earns its cost in a few well-defined situations.
**Consistent behavior at high volume.** When you need the same output format, tone, or structure across thousands or millions of calls, fine-tuning bakes that consistency into the model more reliably than re-specifying it in every prompt.
**Shorter prompts at scale.** If your prompt has grown to a wall of instructions and examples that you pay for on every single call, fine-tuning can move that behavior into the weights so each call sends far fewer tokens — a real cost win at high volume.
**Small model matching a big one on a narrow task.** A fine-tuned smaller model can match or beat a large general model on one specific, stable task (e.g. classifying into a fixed set of categories) at a fraction of the per-call cost. This is one of the strongest fine-tuning cases.
**Hard-to-prompt behaviors.** Occasionally a behavior resists prompting no matter how you phrase it, but a few hundred good examples teach it cleanly. This is rarer than people assume — verify it's truly unpromptable first.
Notice the common thread: the task is stable and high-volume, and you already know exactly what 'good' looks like (which is how you build the training set). When those conditions are not met, prompting is usually the better tool, which is why provider docs from OpenAI and Anthropic recommend exhausting prompting first.
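To make the "shorter prompts at scale" case concrete, here is a rough break-even sketch. It estimates how many calls it takes for the per-call prompt savings to pay back the one-time cost of fine-tuning. The function name, parameters, and all prices in the example are hypothetical and not taken from any provider's price list. It also assumes only input tokens change and ignores output tokens, hosting fees, and the cost of building the training set.

```python
import math
from typing import Optional


def breakeven_calls(
    prompt_tokens_before: int,
    prompt_tokens_after: int,
    base_price_per_mtok: float,
    tuned_price_per_mtok: float,
    one_time_cost: float,
) -> Optional[int]:
    """Return the number of calls needed to recoup one_time_cost.

    Prices are in dollars per million input tokens (hypothetical units).
    Returns None when the fine-tuned call is not cheaper, because the
    one-time cost is then never recovered.
    """
    # Input-token cost of a single call before and after fine-tuning.
    cost_before = prompt_tokens_before * base_price_per_mtok / 1_000_000
    cost_after = prompt_tokens_after * tuned_price_per_mtok / 1_000_000

    saving_per_call = cost_before - cost_after
    if saving_per_call <= 0:
        return None

    # Round up: a fractional call cannot recoup the cost.
    return math.ceil(one_time_cost / saving_per_call)


if __name__ == "__main__":
    # Illustrative numbers only: a 3,000-token prompt shrinks to 200 tokens,
    # the tuned model costs twice as much per token, training costs $50.
    print(breakeven_calls(3000, 200, 1.00, 2.00, 50.00))
```

If the result is small compared with your expected call volume, the prompt-length argument holds. If it is `None` or larger than the volume you will realistically reach, the long prompt is the cheaper option.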