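Leviathan et al. (2023) introduced speculative decoding: a small, cheap "draft" model proposes several tokens, and the large target model scores them all in a single parallel pass. Each proposed token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft probabilities; the first rejected token is resampled from a corrected distribution, and every proposal after it is discarded. Under greedy decoding this reduces to keeping the longest prefix on which the two models agree. Because one target pass can yield several tokens, the authors report 2–3× faster generation in their experiments, with provably the same output distribution as the large model alone. The actual speedup depends on how often drafts are accepted and on how cheap the draft model is.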
It is now a standard way to cut latency for large-model serving without changing the distribution of responses users see.
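The verification step can be sketched in a few lines. In the sampling setting, a drafted token x is kept with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution; on the first rejection, a replacement is drawn from the normalized residual max(p - q, 0). The function below is a minimal illustration under stated assumptions, not a production implementation: the names `speculative_step`, `p_target`, and `q_draft` are hypothetical, and plain NumPy arrays stand in for the probabilities that real draft and target models would produce.

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_tokens, rng):
    """One verification step of speculative sampling (toy sketch).

    p_target:     array of shape (k + 1, V), target probabilities at each of the
                  k drafted positions plus one extra position (assumed precomputed).
    q_draft:      array of shape (k, V), draft probabilities at the k positions.
    draft_tokens: the k token ids that were sampled from q_draft.
    Returns between 1 and k + 1 token ids.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the drafted token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(int(x))
            continue
        # First rejection: resample from the normalized residual max(p - q, 0)
        # and discard every later draft token.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # Every draft token was accepted: the same target pass also gives the
    # distribution for one more position, so sample a bonus token from it.
    k = len(draft_tokens)
    out.append(int(rng.choice(p_target.shape[1], p=p_target[k])))
    return out
```

With `rng = np.random.default_rng(0)`, a call such as `speculative_step(p_target, q_draft, draft_tokens, rng)` returns the tokens to append to the sequence. When the draft and target distributions are identical, every drafted token is accepted and the step yields k + 1 tokens from a single target pass; the more the two models disagree, the earlier the step tends to stop.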