Hochreiter & Schmidhuber (1997) introduced the long short-term memory (LSTM) network to address the vanishing-gradient problem that prevented ordinary recurrent networks from learning long-range dependencies. Its gated memory cell lets information persist or be forgotten in a controlled way across many time steps.
LSTMs went on to become the dominant architecture for speech recognition, machine translation and language modelling, until self-attention and the Transformer largely replaced them in the late 2010s.
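
To make the gated memory cell concrete, here is a minimal sketch of a single scalar LSTM step in plain Python. It assumes the now-standard formulation with forget, input and output gates, which is not necessarily the exact 1997 formulation. The function names (`lstm_step`, `run`), the `params` layout and the hand-picked biases are all hypothetical and chosen only for illustration; nothing here is learned from data.

```python
import math


def sigmoid(x):
    """Logistic function; squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (illustrative, not optimized).

    params maps each gate name ("f", "i", "o", "g") to a hypothetical
    (w_x, w_h, b) triple: input weight, recurrent weight, bias.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("i"))      # input gate: how much of the candidate to write
    o = sigmoid(pre("o"))      # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))    # candidate value to write into the cell

    c = f * c_prev + i * g     # additive cell update
    h = o * math.tanh(c)       # hidden state passed to the next step
    return h, c


def run(params, c0, steps):
    """Feed zeros for `steps` time steps and return the final cell state."""
    h, c = 0.0, c0
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, params)
    return c


# Hand-picked biases (assumptions for illustration, not learned values).
KEEP = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
        "o": (0.0, 0.0, 0.0), "g": (0.0, 0.0, 0.0)}
FORGET = {"f": (0.0, 0.0, -10.0), "i": (0.0, 0.0, -10.0),
          "o": (0.0, 0.0, 0.0), "g": (0.0, 0.0, 0.0)}

kept = run(KEEP, c0=1.0, steps=50)       # forget gate near 1: state persists
dropped = run(FORGET, c0=1.0, steps=50)  # forget gate near 0: state is erased
print(kept, dropped)
```

With the forget gate held near 1 and the input gate near 0, the cell state survives 50 steps almost unchanged; with the forget gate near 0 it is wiped almost immediately. Because the update `c = f * c_prev + i * g` is additive, the derivative of `c` with respect to `c_prev` along this path is just `f`, so a gate close to 1 lets gradients flow back across many steps without shrinking.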