22. Distill preference-trained policies
Distill policies trained with reinforcement learning from human feedback (RLHF) and other preference-based methods. You will cover reward models, reference policies, KL penalties, rejection sampling, direct preference methods, and how aligned behavior is compressed into deployable models.
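
As a rough illustration of two of the terms listed above, the KL penalty against a reference policy and a direct preference loss, here is a minimal sketch in plain Python. The function names, the default beta value, and the scalar log-probability inputs are assumptions made for illustration, and the second function follows the DPO formulation as one example of a direct preference method. Real systems compute these quantities per token over batches.

```python
import math


def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL-style penalty toward the reference policy.

    logp_policy and logp_ref are log-probabilities of the same sampled
    response under the trained policy and the frozen reference policy.
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one (chosen, rejected) preference pair.

    The margin compares how much more the policy favors the chosen response
    than the reference does, relative to the rejected response.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

In this sketch the shaped reward drops as the policy assigns more probability to a response than the reference does, and the preference loss shrinks as the policy favors the chosen response more strongly than the reference.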