11. Teach with action probabilities
Distill a teacher’s full action distribution instead of only its chosen action. You will use temperature, KL divergence, cross-entropy, softened labels, and confidence filtering so the student learns how the teacher weighs the alternative actions, not just which one it picked.
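To make these terms concrete, here is a minimal sketch in plain Python of how they might fit together for a single state. The function names, the temperature of 2.0, the confidence threshold of 0.5, and the choice to filter on the teacher's unsoftened top probability are all illustrative assumptions, not something this chapter prescribes.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q, eps=1e-12):
    """H(p, q): expected negative log-probability of q under p."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_loss(teacher_logits, student_logits, temperature=2.0, min_confidence=0.5):
    """KL(teacher || student) on softened labels, or None if the sample is filtered out.

    Assumption: the confidence filter looks at the teacher's unsoftened (T = 1)
    top probability and skips states where the teacher itself is unsure.
    """
    if max(softmax(teacher_logits)) < min_confidence:
        return None
    soft_teacher = softmax(teacher_logits, temperature)  # softened labels
    soft_student = softmax(student_logits, temperature)
    return kl_divergence(soft_teacher, soft_student)

# Hypothetical logits over three actions for one state.
teacher = [2.0, 1.0, 0.1]
student = [1.0, 1.5, 0.2]
loss = distill_loss(teacher, student)
```

For a fixed teacher, cross-entropy and KL divergence differ only by the teacher's entropy, which does not depend on the student, so minimizing either one pulls the student toward the same distribution.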