Dosovitskiy et al. (2021) split an image into fixed-size patches, treated each as a token, and fed the sequence to a standard Transformer. With enough pre-training data, this Vision Transformer matched or beat leading convolutional networks while using less compute — demonstrating that self-attention is a general architecture, not a language-only one.
ViTs are now common in multimodal models, where image and text are processed by related Transformer machinery.