Kingma & Ba (2015) introduced Adam, an optimizer that maintains exponentially decaying running averages of each parameter's gradient (the first moment) and squared gradient (the second moment) and uses them to set a per-parameter step size. It combines the benefits of two earlier ideas, momentum and adaptive learning rates, and typically works well with little hyperparameter tuning.
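
The moment-tracking update can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `adam_step` and the list-of-floats representation are assumptions made for the example, and the default hyperparameters are the values suggested in the original paper.

```python
import math

def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to lists of floats (illustrative sketch).

    m and v hold the running first- and second-moment estimates;
    t is the 1-based step count. Returns new (params, m, v).
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction: both averages start at zero, so early
        # estimates are scaled up to compensate.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Per-parameter step: the smoothed gradient divided by its
        # root-mean-square magnitude.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Usage: one step on two parameters whose gradients differ by 500x.
params, m, v = adam_step([0.0, 0.0], [5.0, -0.01], [0.0, 0.0], [0.0, 0.0], t=1)
```

On the first step the bias-corrected estimates reduce to the gradient and its square, so each parameter moves by roughly `lr` in the direction opposite its gradient, regardless of the gradient's scale. That scale invariance is one reason the default settings tend to transfer across problems.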
Adam and its variants, such as AdamW, are the default optimizers for training large language models and many other modern neural networks.