What is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is a training technique that aligns LLM outputs with human preferences by training a reward model on human comparisons, then optimizing the LLM against that reward.
WHY IT MATTERS
RLHF is the technique that made LLMs useful as assistants. Raw pre-trained models are powerful next-token predictors, but they don't reliably follow instructions or stay on task. RLHF aligns the model with human expectations of helpful, harmless, and honest behavior.
The process: human raters compare pairs of model outputs and indicate which they prefer. These preferences train a reward model that scores outputs. The LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize that reward, learning to produce outputs humans prefer.
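The reward-model step above can be sketched with the standard pairwise (Bradley-Terry) preference loss. This is a minimal illustration, not a full implementation: the scalar reward scores here are hypothetical stand-ins for what a real reward model would compute from a prompt and response.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output outranks the rejected one.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected), so minimizing this
    loss pushes the reward model to score human-preferred outputs higher.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model already agrees with the human rater: small loss.
low = preference_loss(2.0, 0.5)
# Reward model disagrees with the rater: large loss, strong training signal.
high = preference_loss(0.5, 2.0)
```

Averaged over many human comparisons, gradients of this loss train the reward model; the RL stage then optimizes the LLM against the trained scorer.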
RLHF is responsible for the dramatic difference between base models (unpredictable) and assistant models (helpful, instruction-following). It's also why different models have different 'personalities.'