43. Run open models with less compute
Open models can run cheaper, closer to private data, or on smaller devices when optimized well. This chapter covers quantization, pruning, distillation, speculative decoding, GPU memory, and on-device deployment tradeoffs.