Attention was introduced for machine translation by Bahdanau et al. (2015), letting a decoder look back at the most relevant source words instead of a single fixed summary vector. Vaswani et al. (2017) then made self-attention the whole architecture, removing recurrence entirely.
Each token computes a weighted combination of all others, where the weights say how relevant each is. This is how a model resolves references, tracks long-range structure and decides what to focus on.