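Rafailov et al. (2023) showed that the KL-constrained RLHF objective can be reparameterised so that the language model's own log-probabilities, measured relative to a frozen reference model, define an implicit reward; the policy is, in effect, its own reward model. Direct Preference Optimization (DPO) then trains directly on preferred-versus-rejected response pairs with a simple classification-style (binary cross-entropy) loss. The authors report alignment quality comparable to PPO-based RLHF, without the complexity and instability of fitting a separate reward model and then running reinforcement learning.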
DPO has become a popular, lighter-weight alternative to full RLHF pipelines.
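
To make the classification-style loss concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the four inputs are log-probabilities of each full response given the prompt, already summed over tokens, under the trained policy and under the frozen reference model. The function names, argument names, and the default `beta=0.1` are illustrative choices, not taken from the text above.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign of x so that exp() never receives a large positive value.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; controls deviation from the reference
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair."""
    # Implicit reward of each response: beta * log(pi_policy / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary cross-entropy on the reward margin: -log sigmoid(r_chosen - r_rejected).
    return -log_sigmoid(chosen_reward - rejected_reward)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability mass toward the chosen response.
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

When the policy matches the reference, the reward margin is zero and the loss equals log 2, the same value a binary classifier gets from a 50/50 guess. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, with both measured against the reference model. In practice this would be computed over batches with an automatic-differentiation framework; the scalar version here only illustrates the arithmetic.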