33. Connect models across text, images, and sound
Combine text, images, audio, video, and sensor data in one model or workflow. This chapter covers contrastive image-text training, vision-language models, multimodal prompting, and cross-modal evaluation.