Multimodal models learn a shared representation across data types so they can, for example, answer questions about an image or generate a caption. Radford et al. (2021) showed with CLIP that training on image–text pairs yields a joint embedding space enabling strong zero-shot visual recognition from natural-language prompts.
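To make the joint-embedding idea concrete, here is a minimal sketch of zero-shot classification by similarity. It assumes an image encoder and a text encoder have already mapped their inputs into the same vector space; the three-dimensional vectors, the prompt wording, the function names, and the temperature value are all hypothetical placeholders for illustration, not outputs or APIs of CLIP itself.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Pick the text prompt whose embedding is closest to the image embedding.

    Returns the best prompt and a softmax distribution over all prompts.
    The temperature is an assumed value that sharpens the distribution.
    """
    logits = {
        prompt: cosine_similarity(image_embedding, emb) / temperature
        for prompt, emb in prompt_embeddings.items()
    }
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits.values())
    exps = {p: math.exp(l - max_logit) for p, l in logits.items()}
    total = sum(exps.values())
    probs = {p: e / total for p, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy embeddings; a real model would produce hundreds of dimensions.
PROMPT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
IMAGE_EMBEDDING = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(IMAGE_EMBEDDING, PROMPT_EMBEDDINGS)
print(label)
```

The point of the sketch is that no classifier is trained for the three labels: the class set is defined entirely by the natural-language prompts, so changing the labels only means embedding different text.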
Many of today's leading AI assistants are multimodal: you can paste a chart, a screenshot, or a photo into the chat and ask about it directly.