The Transformer is a sequence-modelling architecture introduced by Vaswani et al. (2017). Its central idea, self-attention, lets every position in a sequence attend directly to every other position, so the model can capture long-range relationships in a single step rather than passing information along a chain as recurrent networks do.
Because attention over a sequence is highly parallelisable, Transformers train far more efficiently on modern hardware than the RNNs and LSTMs they replaced. That efficiency is what made today's large language models practical to train.
Why it matters
ChatGPT, Claude, Gemini and Grok are all Transformer-based. Understanding attention, layers and context length — all Transformer concepts — is the foundation for almost every other term in this glossary.