RLHF fits a reward model to human comparisons of model outputs, then optimises the language model against that reward with reinforcement learning. Christiano et al. (2017) introduced deep reinforcement learning from human preferences; Stiennon et al. (2020) applied it to summarisation; and Ouyang et al. (2022) used it to build InstructGPT, the recipe that many of today's instruction-following assistants build on.
RLHF accounts for much of the difference between a raw next-token predictor and a model that behaves as a helpful, honest and safe assistant.
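To make the first stage concrete, here is a minimal sketch of how a reward model can be scored against a single human comparison. It assumes the Bradley-Terry style pairwise loss, -log(sigmoid(r_chosen - r_rejected)), which is a common choice for this step but is not spelled out in the text above. The function names and the plain-float rewards are illustrative assumptions; in practice the rewards would come from a neural network and the loss would be minimised by gradient descent.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison under an assumed Bradley-Terry model.

    Returns -log(sigmoid(reward_chosen - reward_rejected)): small when the
    reward model scores the human-preferred output higher, large otherwise.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(reward_pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward pairs."""
    if not reward_pairs:
        raise ValueError("need at least one comparison")
    total = sum(pairwise_preference_loss(c, r) for c, r in reward_pairs)
    return total / len(reward_pairs)


if __name__ == "__main__":
    # Hypothetical reward scores for three comparisons: (chosen, rejected).
    comparisons = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(mean_preference_loss(comparisons))
```

When the two rewards are equal the loss is log 2, which corresponds to the model treating the comparison as a coin flip. Agreeing with the human label pushes the loss towards zero, and disagreeing makes it grow roughly linearly with the margin.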