Whisper, introduced by Radford et al. (2022), is an automatic speech recognition (ASR) and speech translation model trained with weak supervision on 680,000 hours of multilingual audio. That scale lets it generalize zero-shot across many languages and recording conditions, approaching human-level accuracy and robustness without task-specific fine-tuning.
Whisper is widely used to convert speech to text for transcription, captioning and voice interfaces, including voice input to AI assistants.
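
As an illustration of the captioning use case, the sketch below turns timed transcript segments into SubRip (SRT) caption text. It assumes the open-source `openai-whisper` Python package and `ffmpeg` are installed, and that the transcription result carries a `segments` list whose entries have `start`, `end` and `text` fields. The helper names (`srt_timestamp`, `segments_to_srt`, `transcribe_to_srt`) are hypothetical and chosen for this example only.

```python
from typing import Iterable, Mapping


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Build SRT caption text from segments with 'start', 'end', 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe an audio file with Whisper and return SRT captions.

    Assumption: the openai-whisper package and ffmpeg are available.
    """
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("talk.mp3")` (the file name is a placeholder) would download the chosen model on first use and return caption text ready to save as a `.srt` file. The formatting helpers are kept separate from the model call so they can be checked without audio or model weights.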