Kaplan et al. (2020) found that language-model loss falls as a smooth power law in model size, dataset size and training compute, which lets researchers forecast the performance of a large run from smaller ones before training it. Hoffmann et al. (2022), the "Chinchilla" paper, refined this picture: they showed that most large models of the time were undertrained for their size, and that for a given compute budget parameters and training tokens should be scaled up in roughly equal proportion.
Scaling laws explain why the field has pursued ever-larger models and ever-larger datasets, and they indicate how to split a fixed compute budget between the two.
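To make the budgeting idea concrete, here is a minimal sketch of a compute-optimal split. It rests on two assumptions that are not stated in the text above: the widely used approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D), and a rule-of-thumb ratio of about 20 training tokens per parameter that is often quoted from the Chinchilla results. The function name `compute_optimal_split` and both constants are illustrative choices, not values taken from either paper's fitted laws.

```python
import math

# Assumption: training compute C is roughly 6 * N * D FLOPs,
# where N is the parameter count and D is the number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6.0

# Assumption: a rule-of-thumb ratio of about 20 tokens per parameter.
# Treat it as a tunable illustration, not a fitted constant.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (n_params, n_tokens) for a given compute budget in FLOPs.

    Solves C = 6 * N * D together with D = r * N, which gives
    N = sqrt(C / (6 * r)) and D = r * N.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Hypothetical budgets, chosen only to show how the split scales.
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_split(budget)
        print(f"C = {budget:.0e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

Under these assumptions, multiplying the compute budget by 100 multiplies both the parameter count and the token count by 10, which is the "scale together" behaviour described above. A real planning exercise would fit the constants to its own training runs rather than rely on the defaults used here.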