Speech-to-Text Benchmark
How accurately and how fast each system turns speech into text: measured, with confidence intervals, no human scoring.
Global Rankings
Showing: English · All accents
Measured 2026-06-19 · run #482
10 METRICS
| RANK | PROVIDER / MODEL | RECOMMENDED FOR |
|---|---|---|
| 1 | OpenAI whisper-v3 | General Transcription, Multilingual, Media & Podcast |
| 2 | Deepgram nova-2 | Voice Agent, Contact Center, Real-time Streaming |
| 2 | AssemblyAI conformer-2 | Meeting Notes, Async Transcription, Compliance Recording |
| 4 | Azure speech-to-text-v3 | Enterprise, Contact Center, Regulated Industries |
| 5 | Google Cloud chirp-v2 | Multilingual, Media & Broadcast, Global Deployment |
| 6 | AWS transcribe-medical | Healthcare, Clinical Documentation, HIPAA-compliant |
* Values shown are illustrative placeholders.
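WER (95% CI) · ↓ lower is better

| PROVIDER / MODEL | WER | 95% CI |
|---|---|---|
| OpenAI whisper-v3 | 4.2% | 3.9–4.6% |
| Deepgram nova-2 | 4.4% | 4.1–4.8% |
| AssemblyAI conformer-2 | 4.5% | 4.2–4.9% |
| Azure speech-to-text-v3 | 5.1% | 4.7–5.5% |
| Google Cloud chirp-v2 | 5.8% | 5.4–6.3% |
| AWS transcribe-medical | 7.2% | 6.7–7.8% |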
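The page does not say how the WER figures or their 95% intervals are produced. As a minimal sketch, assuming corpus-level WER (total word edits divided by total reference words) and a percentile bootstrap over clips, the calculation could look like the following. The function names, the resample count, and the choice of bootstrap are illustrative assumptions, not the benchmark's documented method.

```python
import random

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def corpus_wer(pairs):
    """pairs: list of (reference, hypothesis) strings.
    Returns total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus WER (assumed method):
    resample clips with replacement and take the alpha/2 and
    1 - alpha/2 quantiles of the resampled WER values."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy data, not benchmark data.
clips = [
    ("turn left at the light", "turn left at the light"),
    ("call me at five", "call me at nine"),
    ("the order number is ready", "the order number ready"),
]
wer = corpus_wer(clips)
low, high = bootstrap_ci(clips)
print(f"WER {wer:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Resampling whole clips rather than individual words keeps errors within one clip together, which is the usual reason to bootstrap at the utterance level.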
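Digit Error Rate · ↓ lower is better

| PROVIDER / MODEL | DIGIT ERROR RATE |
|---|---|
| OpenAI whisper-v3 | 1.8% |
| Deepgram nova-2 | 2.1% |
| AssemblyAI conformer-2 | 2.0% |
| Azure speech-to-text-v3 | 3.4% |
| Google Cloud chirp-v2 | 4.0% |
| AWS transcribe-medical | 5.5% |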
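Alphanumeric Exact-Match · ↑ higher is better

| PROVIDER / MODEL | ALPHANUMERIC EXACT-MATCH |
|---|---|
| OpenAI whisper-v3 | 93.4% |
| Deepgram nova-2 | 91.8% |
| AssemblyAI conformer-2 | 91.5% |
| Azure speech-to-text-v3 | 88.9% |
| Google Cloud chirp-v2 | 86.2% |
| AWS transcribe-medical | 81.3% |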
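Name F1 · ↑ higher is better

| PROVIDER / MODEL | NAME F1 |
|---|---|
| OpenAI whisper-v3 | 0.88 |
| Deepgram nova-2 | 0.83 |
| AssemblyAI conformer-2 | 0.85 |
| Azure speech-to-text-v3 | 0.80 |
| Google Cloud chirp-v2 | 0.82 |
| AWS transcribe-medical | 0.76 |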
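Semantic Error Rate · ↓ lower is better

| PROVIDER / MODEL | SEMANTIC ERROR RATE | RATING |
|---|---|---|
| OpenAI whisper-v3 | 1.1% | GOOD |
| Deepgram nova-2 | 1.4% | GOOD |
| AssemblyAI conformer-2 | 1.2% | GOOD |
| Azure speech-to-text-v3 | 2.1% | FAIR |
| Google Cloud chirp-v2 | 2.4% | FAIR |
| AWS transcribe-medical | 3.2% | HIGH |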
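TTFP p50 · ↓ lower is better

| PROVIDER / MODEL | TTFP p50 | RANGE | RATING |
|---|---|---|---|
| OpenAI whisper-v3 | 380ms | 340–425ms | |
| Deepgram nova-2 | 95ms | 82–112ms | FAST |
| AssemblyAI conformer-2 | 210ms | 185–238ms | |
| Azure speech-to-text-v3 | 310ms | 275–350ms | |
| Google Cloud chirp-v2 | 280ms | 248–315ms | |
| AWS transcribe-medical | 420ms | 375–468ms | |

WER @ 0 dB SNR · ↓ lower is better

| PROVIDER / MODEL | WER @ 0 dB SNR | CLEAN WER | DEGRADATION |
|---|---|---|---|
| OpenAI whisper-v3 | 14.2% | 4.2% | 3.4× |
| Deepgram nova-2 | 16.8% | 4.4% | 3.8× |
| AssemblyAI conformer-2 | 15.4% | 4.5% | 3.4× |
| Azure speech-to-text-v3 | 19.2% | 5.1% | 3.8× |
| Google Cloud chirp-v2 | 22.1% | 5.8% | 3.8× |
| AWS transcribe-medical | 28.4% | 7.2% | 3.9× |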
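The degradation factor shown with each noisy-condition figure appears to be the WER at 0 dB SNR divided by the clean WER, rounded to one decimal place. That reading is an inference from the numbers on the page, not a stated definition. A short sketch that recomputes it from the listed values (the helper name is illustrative):

```python
def degradation_factor(noisy_wer, clean_wer):
    """Ratio of noisy-condition WER to clean WER (assumed definition)."""
    return noisy_wer / clean_wer

# (model, WER @ 0 dB SNR in %, clean WER in %), copied from the page.
rows = [
    ("whisper-v3", 14.2, 4.2),
    ("nova-2", 16.8, 4.4),
    ("conformer-2", 15.4, 4.5),
    ("speech-to-text-v3", 19.2, 5.1),
    ("chirp-v2", 22.1, 5.8),
    ("transcribe-medical", 28.4, 7.2),
]

factors = {m: f"{degradation_factor(n, c):.1f}" for m, n, c in rows}
for model, factor in factors.items():
    print(f"{model}: {factor}x degradation")
```

Under this definition the recomputed factors match the published ones for all six models.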
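Codec Inflation · ↓ lower is better

| PROVIDER / MODEL | CODEC INFLATION |
|---|---|
| OpenAI whisper-v3 | 2.8× |
| Deepgram nova-2 | 3.1× |
| AssemblyAI conformer-2 | 2.6× |
| Azure speech-to-text-v3 | 2.4× |
| Google Cloud chirp-v2 | 3.4× |
| AWS transcribe-medical | 4.2× |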
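Hallucination Rate · ↓ lower is better

| PROVIDER / MODEL | HALLUCINATION RATE |
|---|---|
| OpenAI whisper-v3 | 0.6 w/clip |
| Deepgram nova-2 | 0.4 w/clip |
| AssemblyAI conformer-2 | 0.8 w/clip |
| Azure speech-to-text-v3 | 1.2 w/clip |
| Google Cloud chirp-v2 | 1.8 w/clip |
| AWS transcribe-medical | 0.9 w/clip |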
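Accent Gap · ↓ lower is better

| PROVIDER / MODEL | ACCENT GAP | WORST-GROUP WER |
|---|---|---|
| OpenAI whisper-v3 | 5.2 pts | 7.1% |
| Deepgram nova-2 | 6.8 pts | 8.5% |
| AssemblyAI conformer-2 | 4.9 pts | 7.4% |
| Azure speech-to-text-v3 | 7.1 pts | 9.8% |
| Google Cloud chirp-v2 | 5.8 pts | 9.1% |
| AWS transcribe-medical | 9.2 pts | 13.1% |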