Hinton et al. (2015) showed that a compact student model can be trained to match the softened output probabilities of a larger teacher model or ensemble, learning from the teacher's full distribution over classes rather than from hard labels alone. Because that distribution also encodes how the teacher ranks the incorrect classes, the student captures much of the teacher's behaviour at a fraction of the size and cost.
Distillation is a standard way to ship small, fast versions of large models for latency- or cost-sensitive deployment.
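
The soft-target training described above can be sketched as a loss function. The following is a minimal NumPy illustration, not a reference implementation: the function names, the `alpha` weighting between soft and hard terms, and the default values are assumptions made for this example. It uses a KL divergence for the soft term, which differs from a cross-entropy against the teacher's soft targets only by a constant that does not depend on the student, and scales that term by the squared temperature as suggested by Hinton et al.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax along the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (batch mean).

    alpha and the default temperature are illustrative assumptions.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)

    # Soft term: KL(teacher || student) at temperature T, scaled by T^2 so its
    # gradient magnitude stays comparable as T changes.
    kl = np.sum(p_teacher * (log_p_teacher - log_p_student), axis=-1)
    soft_term = (temperature ** 2) * kl.mean()

    # Hard term: ordinary cross-entropy against integer class labels at T = 1.
    labels = np.asarray(labels)
    log_p_hard = log_softmax(student_logits, 1.0)
    hard_term = -log_p_hard[np.arange(len(labels)), labels].mean()

    return alpha * soft_term + (1.0 - alpha) * hard_term


# Usage with made-up logits for a batch of two examples and three classes.
teacher = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
student = np.array([[1.5, 0.8, 0.2], [0.4, 1.9, 0.6]])
loss = distillation_loss(student, teacher, labels=[0, 1])
```

A higher temperature flattens both distributions, so the student is pushed to match the teacher's relative preferences among unlikely classes as well as its top prediction. In practice the same loss would be written in an autodiff framework so that gradients flow to the student's parameters.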