LLM Architecture · Foundations · Updated 2026

Multimodal

A model that works across more than one type of data — such as text, images, audio or video — in the same system.

Multimodal models learn a shared representation across data types so they can, for example, answer questions about an image or generate a caption. Radford et al. (2021) showed with CLIP that training on image–text pairs yields a joint embedding space enabling strong zero-shot visual recognition from natural-language prompts.
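As a rough illustration of how a joint embedding space supports zero-shot recognition, the sketch below scores one image embedding against several text-prompt embeddings by cosine similarity and picks the closest prompt. The three-dimensional vectors, the prompt strings and the function names are all hypothetical and chosen by hand for readability; a real system would obtain the embeddings from trained image and text encoders, and this is not CLIP's actual code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is closest to the image embedding,
    together with the similarity score for every prompt."""
    scores = {
        prompt: cosine(image_embedding, embedding)
        for prompt, embedding in prompt_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Hypothetical hand-made embeddings standing in for text-encoder outputs.
PROMPTS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Hypothetical embedding standing in for an image-encoder output.
IMAGE = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(IMAGE, PROMPTS)
print(label)
```

Because the class names enter only as text prompts, adding a new category in this sketch means adding one more prompt embedding rather than retraining a classifier, which is the sense in which the recognition is zero-shot.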

Many of today's leading chat assistants are multimodal: you can paste a chart, a screenshot or a photo and ask about it directly in the conversation.

References

Primary, peer-reviewed and archival sources for this definition.

Learning Transferable Visual Models From Natural Language Supervision (CLIP)
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Proceedings of the 38th International Conference on Machine Learning (ICML 2021).

Cite this entry

MultipleChat. "Multimodal." MultipleChat AI & LLM Glossary, 2026. https://multiple.chat/ai-glossary/multimodal
