Infrastructure · Efficiency · Updated 2026

Latency

The delay between sending a prompt and receiving a response. Larger models and longer contexts both increase latency, which makes it the main speed trade-off in AI products.

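Latency in an LLM system is usually split into time-to-first-token (the wait before the first piece of output appears) and time-per-output-token (the average gap between subsequent tokens). Both grow with model size, because every token requires a pass through more parameters, and with sequence length, because attention cost scales with the amount of context; the survey by Tay et al. (2022) catalogues the efficiency techniques developed to mitigate this.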
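The two metrics can be measured from any token stream. The following is a minimal sketch, not a reference implementation: `measure_stream` and `fake_stream` are hypothetical names introduced here, and `fake_stream` only simulates a model with sleeps rather than calling a real API.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT, mean TPOT and total latency (seconds)."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()          # arrival time of this token
        if first is None:
            first = last        # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return {"tokens": count, "ttft": ttft, "tpot": tpot, "total": last - start}


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response (assumption, not a real API)."""
    time.sleep(prefill_s)           # simulated prompt processing before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated decoding time per later token
        yield f"tok{i}"
```

Calling `measure_stream(fake_stream(20, 0.2, 0.01))` returns a dictionary in which `total` equals `ttft + (tokens - 1) * tpot`, which is why long outputs are dominated by time-per-output-token while short ones are dominated by time-to-first-token.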

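Approaches that reduce latency include sparsity (Mixture of Experts), quantization, caching and streaming, each trading some accuracy, memory or complexity for speed. Streaming does not shorten total generation time, but it lets the user see output as soon as the first token arrives.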
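To make the accuracy-for-speed trade concrete, here is a minimal sketch of one of these approaches, quantization, using symmetric 8-bit integers with a single scale for the whole list. The function names are hypothetical, and production systems typically use more elaborate schemes (for example per-channel scales); this only illustrates where the rounding error comes from.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization with one shared scale: returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Each weight is stored as an integer in [-127, 127] instead of a float.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The integer codes need a quarter of the memory of 32-bit floats, which reduces the data moved per token, while `max_error` stays within `scale / 2`; that bounded error is the accuracy given up for speed.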

References

Primary, peer-reviewed and archival sources for this definition.

Efficient Transformers: A Survey
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2022). ACM Computing Surveys, 55(6), 1–28.

Cite this entry

MultipleChat. "Latency." MultipleChat AI & LLM Glossary, 2026. https://multiple.chat/ai-glossary/latency
