18. Multimodal language models
Work with models that combine text with images, audio, video, documents, or screen actions. This chapter covers vision-language models, OCR-like use cases, multimodal prompting, and evaluation problems unique to non-text inputs.