DPO: replacing the reward model with direct preference optimisation
RLHF (covered in Chapter 2) requires three separate training phases: SFT → Reward Model → PPO. Each phase needs its own infrastructure and hyperparameter tuning, and each introduces new failure modes. In 2023, Rafailov et al. published DPO (Direct Preference Optimisation), an algorithm that pursues the same alignment goal by replacing the reward-model and PPO phases with a single fine-tuning step: no separate reward model, just a standard supervised-style training loop over preference pairs. It has since become one of the most widely used alignment methods in the open-source ecosystem.
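To make the "no reward model" idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained (the policy) and a frozen reference model (typically the SFT checkpoint). The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    All inputs are sequence log-probabilities (floats). `beta` controls how
    strongly the policy is allowed to drift from the reference model; 0.1 is
    an assumed illustrative value.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin: positive when the policy favours the chosen
    # response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, chosen only for illustration.
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy == reference
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy favours chosen
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response. In a real training loop the same expression would be averaged over a batch and minimised with an ordinary gradient-based optimiser, which is why the procedure resembles supervised fine-tuning.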