DPO instead of RLHF: Direct Preference Optimization

DPO: replacing the reward model with Direct Preference Optimization

RLHF (covered in Chapter 2) requires three separate training phases: SFT → Reward Model → PPO. Each phase needs its own infrastructure and hyperparameter tuning, and each introduces new failure modes. In 2023, Rafailov et al. published DPO, an algorithm that pursues the same alignment goal without a separate reward model or PPO stage: starting from the SFT model, the policy is fine-tuned directly on preference pairs in a single step, using a standard supervised-style training loop. It has since become one of the most widely used alignment methods in the open-source ecosystem.
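To make the "no reward model" idea concrete, here is a minimal sketch of the per-example DPO loss from Rafailov et al., written in plain Python. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the policy being trained and a frozen reference model (typically the SFT checkpoint). The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from the course material.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    All inputs are sequence log-probabilities (floats). `beta` controls how
    strongly the policy is tied to the reference model (assumed default).
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss depends only on log-probabilities from two forward passes per response, which is why training can run as an ordinary supervised loop: minimising it raises the policy's relative likelihood of the preferred response, while the reference terms keep it from drifting arbitrarily far from the SFT model.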
