16. Mixture-of-experts models
Study mixture-of-experts (MoE) models, in which a router activates only a few expert subnetworks for each token, so per-token compute stays well below what the total parameter count would suggest. This chapter covers routing, expert capacity, training instability, inference tradeoffs, and why sparse models became important for large-scale LLMs.
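
To make "only part of the model runs for each token" concrete, here is a minimal sketch of top-k routing for a single token, written in plain Python. All names (`route_top_k`, `moe_layer`, the toy `experts`) are hypothetical and chosen for illustration. Real systems use learned router weights, batched tensors, and capacity limits, none of which are modeled here.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns a list of (expert_index, gate_weight) pairs whose
    gate weights are renormalized to sum to 1 (an assumption;
    some designs skip the renormalization).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Combine the outputs of the selected experts only.

    Experts that are not selected are never called, which is
    where the compute saving of a sparse model comes from.
    """
    out = [0.0] * len(token)
    for idx, gate in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

# Four toy "experts", each just scaling its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router scores for one token: experts 1 and 3 tie for the top.
logits = [0.1, 2.0, -1.0, 2.0]
selected = route_top_k(logits, k=2)
output = moe_layer([1.0, -1.0], logits, experts, k=2)
```

With these toy scores, two of the four experts are evaluated and the other two are skipped entirely. Expert capacity, covered later in the chapter, adds a cap on how many tokens each expert may accept per batch.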