28. Align models with human preferences
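Training on human preference data helps models follow instructions, refuse unsafe requests, and produce outputs that people rate more highly. You will compare reinforcement learning from human feedback (RLHF) and the learned reward models it relies on, direct preference optimization (DPO), and constitutional methods, and you will examine the risks of over-optimizing against a preference signal.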
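To make the contrast between a learned reward model and DPO concrete, here is a minimal Python sketch of the two pairwise losses. It assumes that scalar rewards and summed per-sequence log-probabilities for a chosen and a rejected response have already been computed elsewhere. The function names `reward_model_loss` and `dpo_loss` and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    # Numerically stable log(sigmoid(z)), avoiding overflow for large |z|.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss used to train a reward model:
    # it decreases as the chosen response's reward exceeds the rejected one's.
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, an assumption
) -> float:
    # Hypothetical helper: assumes log-probs are summed over response tokens.
    # DPO treats beta * log(pi_policy / pi_ref) as an implicit reward,
    # so no separate reward model is trained.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))
```

With all inputs equal, both losses equal log 2 (about 0.693), and they shrink as the chosen response becomes relatively more favored. In DPO, `beta` corresponds to the strength of the KL penalty toward the reference model in the RLHF objective: smaller values let the policy drift further from the reference, which is one of the levers that bears on over-optimization.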