Convolutional neural networks apply small learned filters across an input to detect local patterns — edges, textures, shapes — building up to complex features in deeper layers. LeCun et al. (1998) established the modern CNN for document recognition, and Krizhevsky et al. (2012) ignited the deep-learning era when their CNN (AlexNet) won the ImageNet competition by a wide margin.
CNNs remain central to vision and are often combined with Transformers in multimodal systems.