What makes a classification prompt accurate?
Classification is the task of assigning each input to one (or more) labels from a fixed set — routing a support ticket to a queue, tagging feedback sentiment, flagging a document type. The failure modes are: the model invents a label outside your set, picks inconsistently on borderline cases, or returns a paragraph of justification when you wanted a single token.
Three design choices fix nearly all of it. First, a **closed label set** — the model may only return values from your list, never freeform. Second, **clear label definitions** so 'urgent' and 'high priority' aren't interpreted interchangeably. Third, **examples that teach the boundaries**, especially the cases where two labels compete. Most people add easy examples; the value is in the hard ones.
The few-shot approach — showing labeled examples in the prompt so the model generalizes the pattern — is the foundational technique from Brown et al. 2020. The provider prompt guides (OpenAI, Anthropic) and the DAIR.ai Prompt Engineering Guide all treat few-shot labeling as the default for classification.