Ioffe & Szegedy (2015) introduced batch normalization, which standardizes each layer's inputs using the statistics of the current mini-batch. This reduces the shifting of internal distributions during training, letting practitioners use higher learning rates and train much deeper networks reliably.
Related normalization schemes (such as layer normalization) are a standard component inside Transformers.