Whisper: 680,000 Hours of Weak Supervision for Robust ASR
OpenAI's Whisper trains a single sequence-to-sequence model on 680,000 hours of web audio. Run zero-shot, with no fine-tuning, it is competitive with fully supervised systems, and the same model also handles translation and language identification.
Quick answer
Whisper trains one encoder-decoder Transformer on 680,000 hours of multilingual, multitask audio scraped from the internet, then runs zero-shot on standard benchmarks — no dataset-specific fine-tuning — and stays competitive with prior fully supervised models while approaching human accuracy and robustness. The bet is scale and diversity over clean labels: instead of squeezing the lowest word error rate out of one curated corpus, Whisper trains on messy real-world audio so it degrades gracefully on accents, noise, and recording conditions a benchmark never contained.
680k hours of weakly supervised audio
Scale is the core of the paper. 680,000 hours dwarfs the ~1,000 hours of clean labeled speech that academic ASR usually trains on. “Weak supervision” means the transcripts are whatever accompanies the audio on the web (captions and subtitles of mixed quality), not the output of a controlled annotation pipeline. The data is not used as-is, though: OpenAI filters out transcripts that look machine-generated, because training on another ASR system’s output teaches you its mistakes, and uses heuristics to drop misaligned audio-text pairs. What survives is still far noisier than a benchmark. Of the total, about 117,000 hours cover 96 non-English languages and roughly 125,000 hours is translation data (other languages → English text), which is what lets one model both transcribe and translate.
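To make the filtering step concrete, here is a minimal sketch of one kind of heuristic: treating missing casing or punctuation as a sign that a transcript came from another ASR system. The function names, the punctuation set, and the eight-word threshold are illustrative assumptions, not OpenAI's actual pipeline, and the check only makes sense for scripts that have letter case.

```python
from typing import List, Tuple

MIN_WORDS_FOR_PUNCT_CHECK = 8  # hypothetical threshold, chosen for illustration


def looks_machine_generated(transcript: str) -> bool:
    """Guess whether a transcript lacks the casing and punctuation
    that human-written captions usually carry."""
    letters = [ch for ch in transcript if ch.isalpha()]
    if not letters:
        return True  # nothing to judge, so treat the pair as unusable

    all_upper = all(ch.isupper() for ch in letters)
    all_lower = all(ch.islower() for ch in letters)

    # Only penalize missing punctuation on longer transcripts,
    # since short captions often have none.
    has_punct = any(ch in ".,?!;:" for ch in transcript)
    long_enough = len(transcript.split()) >= MIN_WORDS_FOR_PUNCT_CHECK
    no_punct = long_enough and not has_punct

    return all_upper or all_lower or no_punct


def filter_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (audio_id, transcript) pairs that pass the heuristic."""
    return [(audio, text) for audio, text in pairs
            if not looks_machine_generated(text)]
```

A rule like this will throw away some genuine human captions. At this scale, losing some good data is cheaper than learning another system's errors.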
One model, four tasks via text tokens
Whisper is deliberately plain: a vanilla Transformer encoder-decoder on log-Mel spectrograms, no exotic architecture. The novelty is the output format. Special tokens at the start of the decoder sequence specify the task — transcribe vs. translate, which language, whether to predict timestamps — so transcription, translation, language identification, and timestamp alignment are all just text the same model predicts. This is why Whisper ships as one checkpoint family instead of a stack of separate models, and why developers adopted it as drop-in infrastructure.
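The task-token idea can be sketched in a few lines. The token strings below follow the format described in the paper (`<|startoftranscript|>`, a language tag, `<|transcribe|>` or `<|translate|>`, and `<|notimestamps|>`); the helper `build_decoder_prefix` is a hypothetical illustration that works on plain strings, not the tokenizer from the released code.

```python
from typing import List, Optional


def build_decoder_prefix(language: Optional[str],
                         task: str = "transcribe",
                         timestamps: bool = True) -> List[str]:
    """Return the special tokens that condition the decoder on a task.

    If `language` is None, only the start token is returned, so the
    next token the model predicts is the language tag itself. That is
    how language identification becomes ordinary text prediction.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    prefix = ["<|startoftranscript|>"]
    if language is None:
        return prefix  # the model fills in the language token

    prefix.append(f"<|{language}|>")
    prefix.append(f"<|{task}|>")
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


# French speech, translated to English text, without timestamps.
print(build_decoder_prefix("fr", task="translate", timestamps=False))
```

Switching tasks changes a few tokens in the prompt, not the weights, which is why one checkpoint can cover all four jobs.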
Why robustness beats benchmark SOTA
On a single in-distribution benchmark like LibriSpeech, Whisper is often not the lowest-error model: a model fine-tuned on that corpus will beat it there. The paper’s argument is that this comparison is misleading, because a fine-tuned model fits the quirks of its training corpus and its error rate climbs on audio from anywhere else, while Whisper’s stays comparatively stable across distribution shift. The honest framing is that Whisper trades a point or two of clean-benchmark word error rate for far smaller degradation in the wild. For anyone deploying ASR on real, unseen audio, that trade is the point.
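The metric behind these comparisons is word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Here is a minimal sketch; it only lowercases and splits on whitespace, whereas the paper applies a fuller text normalizer before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Robustness, in these terms, is how little this number rises when the same model moves from its home benchmark to audio it has never seen.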
Key results
- Trained on 680,000 hours of labeled audio, including multilingual and translation data — orders of magnitude more than typical supervised ASR.
- Generalizes zero-shot to standard benchmarks with no fine-tuning, and is often competitive with prior fully supervised results.
- Approaches human accuracy and robustness in OpenAI’s comparisons across noisy and shifted conditions.
- One model performs multilingual transcription, X→English translation, language identification, and timestamp prediction, selected by task tokens.
- OpenAI released the models and inference code, which is what turned a research result into widely deployed infrastructure.
Limits and open questions
Weak supervision is a double-edged sword. The training transcripts contain real errors, and Whisper inherits them: it can hallucinate fluent text during silence or non-speech audio, a known failure mode that matters in medical, legal, and accessibility settings where a confident wrong transcript is worse than no transcript. Decoding long audio still relies on heuristics for timestamps and repetition, which can fail. Language coverage is heavily skewed toward high-resource languages; low-resource language quality drops sharply, so “multilingual” does not mean uniform. The 680k-hour corpus was not released, which limits reproduction, and training on web audio raises consent and privacy questions the paper does not resolve. For high-stakes use you still need confidence checks and human review.
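A confidence check can start by flagging segments for human review instead of trusting every line of output. The sketch below assumes each segment is a dict with `no_speech_prob`, `avg_logprob`, and `compression_ratio` fields, modeled on what the open-source inference code reports per segment. The function name and the default thresholds are assumptions to tune on your own audio, not recommended values.

```python
from typing import Dict, List, Tuple


def flag_suspect_segments(
    segments: List[Dict],
    no_speech_threshold: float = 0.6,          # assumed default
    logprob_threshold: float = -1.0,           # assumed default
    compression_ratio_threshold: float = 2.4,  # assumed default
) -> List[Tuple[Dict, List[str]]]:
    """Return (segment, reasons) pairs for segments worth a human look."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["no_speech_prob"] > no_speech_threshold:
            reasons.append("likely_silence")   # text emitted over non-speech
        if seg["avg_logprob"] < logprob_threshold:
            reasons.append("low_confidence")   # decoder was unsure of tokens
        if seg["compression_ratio"] > compression_ratio_threshold:
            reasons.append("repetitive")       # looping text compresses well
        if reasons:
            flagged.append((seg, reasons))
    return flagged


segments = [
    {"text": "Hello there.", "no_speech_prob": 0.02,
     "avg_logprob": -0.2, "compression_ratio": 1.3},
    {"text": "Thank you. Thank you. Thank you.", "no_speech_prob": 0.9,
     "avg_logprob": -1.4, "compression_ratio": 3.1},
]
for seg, reasons in flag_suspect_segments(segments):
    print(seg["text"], reasons)
```

None of these signals proves a hallucination. They only narrow down where a reviewer should listen first.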
FAQ
What is OpenAI Whisper trained on?
Whisper is trained on 680,000 hours of labeled audio collected from the web, spanning many languages and including translation pairs. The supervision is “weak” because the transcripts are real-world captions and subtitles of varying quality, not a curated annotation set.
Does Whisper need fine-tuning to work on a new dataset?
No. Whisper’s central result is strong zero-shot transfer: it runs on standard benchmarks without any dataset-specific fine-tuning and stays competitive with fully supervised systems trained directly on those benchmarks.
Is Whisper the most accurate speech recognition model?
Not on every benchmark. A model fine-tuned on a specific clean corpus like LibriSpeech can post lower word error rates there. Whisper’s advantage is robustness — it degrades far less under accents, noise, and domain shift, which matters more in real deployment.
Can Whisper translate as well as transcribe?
Yes. The same model handles X→English speech translation, language identification, and timestamped transcription, all selected through task tokens at the start of the decoder sequence.
Whisper’s lesson: 680,000 hours of messy web audio buys robustness that a clean benchmark never could. Read the paper at https://arxiv.org/abs/2212.04356.