How each method works — the mechanical difference
All three methods pursue the same underlying goal: given a prompt, make the model more likely to produce preferred outputs and less likely to produce rejected ones. They differ in how they get there.
**DPO** uses a closed-form loss derived from the Bradley-Terry preference model. Given a prompt x, a chosen response y_w, and a rejected response y_l, it compares two log-ratios: log(model probability of y_w / reference model probability of y_w) and log(model probability of y_l / reference model probability of y_l). The loss is the negative log-sigmoid of the difference between the first and the second, scaled by a temperature β, so it falls as the model raises the probability of y_w relative to the reference faster than it raises the probability of y_l. The reference model is a frozen copy of the pre-DPO model (typically the SFT checkpoint). The result is a single-stage training job that directly increases the probability of preferred outputs while penalizing rejected ones, with the reference model acting as an anchor that keeps the model from drifting too far from its SFT distribution.
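To make the DPO loss concrete, here is a minimal sketch for a single preference pair in plain Python. It assumes the four sequence log-probabilities (sums of token log-probabilities) have already been computed. The function names, argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default; controls how strongly the anchor binds
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is log p(response | prompt) under the trainable policy
    or the frozen reference model.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # The loss shrinks as the chosen log-ratio pulls ahead of the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy is identical to the reference, both log-ratios are zero and the loss is log 2 (about 0.693). Training pushes it below that value by widening the margin in favor of the chosen response.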
**RLHF** is a three-stage pipeline. Stage 1: standard supervised fine-tuning on instruction-following data. Stage 2: train a separate reward model on preference pairs. The reward model takes a (prompt, response) pair and outputs a scalar reward score, and it is trained to score preferred responses higher than rejected ones. Stage 3: use PPO (Proximal Policy Optimization) or a similar RL algorithm to train the SFT model to maximize the reward model's score on its own generated responses, while a KL penalty against the SFT model limits distribution drift. Each stage has its own failure modes: stage 2 reward models can have systematic biases, and stage 3 PPO is famously unstable and prone to reward hacking (the model exploits flaws in the reward model rather than producing genuinely better outputs).
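The two quantities that define stages 2 and 3 can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not a PPO implementation: the reward model's scores and the sequence log-probabilities are passed in as plain floats, the KL term is the common per-sample estimate (policy log-probability minus reference log-probability), and all names and the default `kl_coef` are hypothetical.

```python
import math


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Stage 2: pairwise Bradley-Terry loss, -log(sigmoid(chosen - rejected)).

    Written as a numerically stable softplus of the negated score gap.
    """
    diff = score_chosen - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def kl_shaped_reward(
    reward_score: float,
    policy_logp: float,
    ref_logp: float,
    kl_coef: float = 0.05,  # assumed value; tuned per run in practice
) -> float:
    """Stage 3: the signal the RL step maximizes for one sampled response.

    The reward model's score is reduced by a penalty that grows as the
    policy assigns the response more probability than the SFT reference does.
    """
    kl_estimate = policy_logp - ref_logp
    return reward_score - kl_coef * kl_estimate
```

The penalty in `kl_shaped_reward` is what discourages reward hacking from running unchecked: a response that scores well with the reward model but is very unlikely under the SFT model has its effective reward cut.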
**ORPO** combines SFT and preference optimization into a single loss function. The loss has two terms: a standard SFT cross-entropy loss on the chosen response, plus a weighted odds-ratio term that penalizes the model whenever the odds it assigns to the rejected response are not well below the odds it assigns to the chosen one. No reference model is needed (unlike DPO), and no separate reward model or RL loop is needed (unlike RLHF). This single-stage workflow is the simplest of the three and has shown competitive results in 2024-2026 published benchmarks, though tooling support is still maturing.
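The two-term ORPO loss can be sketched the same way. This minimal example assumes each response is summarized by its average per-token log-probability under the model being trained, which must be strictly negative so that the odds are finite. The names and the default weight `lam=0.1` are illustrative assumptions.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(
    avg_logp_chosen: float,
    avg_logp_rejected: float,
    lam: float = 0.1,  # assumed weight on the odds-ratio term
) -> float:
    """ORPO loss for one (prompt, chosen, rejected) triple.

    Both inputs come from the same trainable model; no reference model
    or reward model is involved.
    """
    # Term 1: ordinary SFT negative log-likelihood on the chosen response.
    sft_term = -avg_logp_chosen

    # Term 2: -log(sigmoid(log odds ratio)), as a stable softplus.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = max(-log_odds_ratio, 0.0) + math.log1p(
        math.exp(-abs(log_odds_ratio))
    )

    return sft_term + lam * odds_ratio_term
```

Because the SFT term only looks at the chosen response, the odds-ratio term is the only part of the loss that sees the rejected one. Setting `lam` to zero reduces this sketch to plain supervised fine-tuning.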