Efficiency, Infrastructure · Updated 2026

Quantization

Storing a model's weights at lower numerical precision to cut memory use and speed up inference, usually with little loss in quality.

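Quantization replaces high-precision weights (e.g. 16-bit floating point) with lower-bit integer representations, typically stored alongside a scale factor that maps the integers back to approximate real values. Dettmers et al. (2022) showed with LLM.int8() that 8-bit matrix multiplication can halve the memory needed for transformer inference relative to 16-bit while retaining full-precision accuracy. The method works by detecting the rare high-magnitude (outlier) features and computing them in 16-bit, while everything else runs in 8-bit.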
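To make the idea concrete, here is a minimal sketch of absmax quantization to int8 in plain Python. It is an illustration under simplifying assumptions (one shared scale per list, made-up weight values, hypothetical helper names `quantize_absmax` and `dequantize`), not the LLM.int8() implementation. It also shows why a single high-magnitude value is a problem: the outlier stretches the scale and the small weights lose precision.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in quantized]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Illustrative values only; not taken from any real model.
weights = [0.12, -0.40, 0.05, 0.33]
q, scale = quantize_absmax(weights)
max_error = max_abs_error(weights, dequantize(q, scale))

# Adding one high-magnitude value stretches the scale for every weight.
with_outlier = weights + [6.0]
q_out, scale_out = quantize_absmax(with_outlier)
max_error_out = max_abs_error(with_outlier, dequantize(q_out, scale_out))

print("int8 values:", q)
print("max error without outlier:", max_error)
print("max error with outlier:   ", max_error_out)
```

The rounding error grows once the outlier is included, which is the motivation for treating such values separately at higher precision.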

Quantization is what lets large models run on smaller GPUs and consumer hardware, and it pairs naturally with adapter methods like LoRA.
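The memory saving is simple arithmetic: bytes for the weights equal parameter count times bits per weight, divided by eight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; activations, caches, and the small overhead of stored scale factors are ignored.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical model size, chosen only for illustration.
num_params = 7_000_000_000

mem_16 = weight_memory_gib(num_params, 16)
mem_8 = weight_memory_gib(num_params, 8)

print(f"16-bit weights: {mem_16:.1f} GiB")
print(f" 8-bit weights: {mem_8:.1f} GiB")
```

Under these assumptions, moving from 16-bit to 8-bit weights halves the footprint, which can be the difference between fitting on a given GPU or not.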

References

Primary, peer-reviewed and archival sources for this definition.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). Advances in Neural Information Processing Systems 35 (NeurIPS 2022).

Cite this entry

MultipleChat. "Quantization." MultipleChat AI & LLM Glossary, 2026. https://multiple.chat/ai-glossary/quantization
