Efficiency · Decoding · Updated 2026

Speculative Decoding

An inference-speedup technique where a small, fast model drafts several tokens and a large model verifies them in parallel, giving faster output with the same output distribution as the large model alone.

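Leviathan et al. (2023) introduced speculative decoding: a small, cheap "draft" model proposes several tokens, and the large target model scores them all in a single parallel pass. Each drafted token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to it; at the first rejection, a replacement token is sampled from the normalized positive part of p − q and the remaining drafts are discarded. This accept/reject rule is what makes the output distribution provably the same as that of the large model alone. Because one target pass can yield several tokens, generation is faster: the paper reports a 2–3× speedup on T5-XXL, and the actual gain depends on how often the draft model's proposals are accepted.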
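The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the models are hypothetical stand-ins (plain functions that map a token prefix to a probability list), the names `speculative_step`, `toy_target`, `toy_draft` and the parameter `gamma` are assumptions made for this example, and the target model is called in a loop where a real system would score all positions in one batched forward pass.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]
# Hypothetical model interface: token prefix -> next-token probabilities.
Model = Callable[[Sequence[int]], Dist]


def sample(weights: Dist, rng: random.Random) -> int:
    """Draw one token index in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(prefix: Sequence[int], draft: Model, target: Model,
                     gamma: int, rng: random.Random) -> List[int]:
    """Run one draft-and-verify round and return 1 to gamma + 1 new tokens."""
    # 1. The draft model proposes gamma tokens autoregressively.
    drafted: List[int] = []
    q_dists: List[Dist] = []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every drafted position, plus one extra.
    #    (A loop here; a single parallel pass in a real serving stack.)
    p_dists = [target(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept each drafted token with probability min(1, p/q).
    out: List[int] = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the positive part of p - q,
        # then discard the remaining drafted tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # All drafts accepted: take one bonus token from the target model.
    out.append(sample(p_dists[gamma], rng))
    return out


# Toy, context-independent models over a 3-token vocabulary (illustrative only).
def toy_target(prefix: Sequence[int]) -> Dist:
    return [0.6, 0.3, 0.1]


def toy_draft(prefix: Sequence[int]) -> Dist:
    return [0.3, 0.3, 0.4]


if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step([0], toy_draft, toy_target, gamma=4, rng=rng))
```

Each call returns at least one token, because a rejection is replaced by a resampled token, and at most `gamma + 1` tokens when every draft is accepted. Sampling the rejected position from the positive part of p − q is what keeps the emitted tokens distributed according to the target model.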

It is now a standard way to cut latency for large-model serving without changing the distribution of responses users see.
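How much latency is saved depends on how often drafts are accepted. As a rough guide, Leviathan et al. analyze the simplified case where each drafted token is accepted independently with the same probability; the symbols α (acceptance rate) and γ (number of drafted tokens per round) follow that analysis. Under that assumption, the expected number of tokens produced per target-model pass is:

```latex
\mathbb{E}[\text{tokens per target pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```

For example, with α = 0.8 and γ = 4 this gives (1 − 0.8⁵) / 0.2 ≈ 3.36 tokens per pass instead of 1. The wall-clock speedup is smaller than this figure, since it must also pay for the draft model's own forward passes.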

References

Primary, peer-reviewed and archival sources for this definition.

Fast Inference from Transformers via Speculative Decoding
Leviathan, Y., Kalman, M., & Matias, Y. (2023). Proceedings of the 40th International Conference on Machine Learning (ICML 2023).

Cite this entry

MultipleChat. "Speculative Decoding." MultipleChat AI & LLM Glossary, 2026. https://multiple.chat/ai-glossary/speculative-decoding
