Voice AI Benchmark

Speech-to-Text Benchmark

How accurately and how fast each system turns speech into text: measured with confidence intervals, with no human scoring.

Global Rankings
Showing: English · All accents · 10 metrics · measured 2026-06-19 · run #482
| Rank | Provider | Model | Recommended for |
|------|----------|-------|-----------------|
| 1 | OpenAI | whisper-v3 | General Transcription, Multilingual, Media & Podcast |
| 2 | Deepgram | nova-2 | Voice Agent, Contact Center, Real-time Streaming |
| 2 | AssemblyAI | conformer-2 | Meeting Notes, Async Transcription, Compliance Recording |
| 4 | Azure | speech-to-text-v3 | Enterprise, Contact Center, Regulated Industries |
| 5 | Google Cloud | chirp-v2 | Multilingual, Media & Broadcast, Global Deployment |
| 6 | AWS | transcribe-medical | Healthcare, Clinical Documentation, HIPAA-compliant |

* Values shown are illustrative placeholders.

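WER (95% CI)
lower is better

| Provider / Model | WER | 95% CI |
|------------------|-----|--------|
| OpenAI / whisper-v3 | 4.2% | 3.9–4.6% |
| Deepgram / nova-2 | 4.4% | 4.1–4.8% |
| AssemblyAI / conformer-2 | 4.5% | 4.2–4.9% |
| Azure / speech-to-text-v3 | 5.1% | 4.7–5.5% |
| Google Cloud / chirp-v2 | 5.8% | 5.4–6.3% |
| AWS / transcribe-medical | 7.2% | 6.7–7.8% |
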
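A WER figure with a 95% CI could be produced along the following lines. This is a minimal sketch, not the benchmark's actual pipeline: it assumes whitespace tokenization with no text normalization, corpus-level WER (total word errors divided by total reference words), and a percentile bootstrap that resamples clips. The function names `edit_distance`, `corpus_wer`, and `bootstrap_ci` and the sample transcripts are hypothetical.

```python
import random


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """pairs: list of (reference, hypothesis) strings. Returns a fraction."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus WER, resampling whole clips."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical clips, for illustration only.
clips = [
    ("call me at nine", "call me at nine"),
    ("my name is jordan", "my name is jordan"),
    ("the order number is four two", "the order number is for two"),
]
wer = corpus_wer(clips)
low, high = bootstrap_ci(clips)
print(f"WER {wer:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

Resampling whole clips rather than individual words keeps errors from the same recording together, which is the usual reason to bootstrap at the clip level.
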
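Digit Error Rate
lower is better

| Provider / Model | Digit Error Rate |
|------------------|------------------|
| OpenAI / whisper-v3 | 1.8% |
| Deepgram / nova-2 | 2.1% |
| AssemblyAI / conformer-2 | 2.0% |
| Azure / speech-to-text-v3 | 3.4% |
| Google Cloud / chirp-v2 | 4.0% |
| AWS / transcribe-medical | 5.5% |
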
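Alphanumeric Exact-Match
higher is better

| Provider / Model | Alphanumeric Exact-Match |
|------------------|--------------------------|
| OpenAI / whisper-v3 | 93.4% |
| Deepgram / nova-2 | 91.8% |
| AssemblyAI / conformer-2 | 91.5% |
| Azure / speech-to-text-v3 | 88.9% |
| Google Cloud / chirp-v2 | 86.2% |
| AWS / transcribe-medical | 81.3% |
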
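Name F1
higher is better

| Provider / Model | Name F1 |
|------------------|---------|
| OpenAI / whisper-v3 | 0.88 |
| Deepgram / nova-2 | 0.83 |
| AssemblyAI / conformer-2 | 0.85 |
| Azure / speech-to-text-v3 | 0.80 |
| Google Cloud / chirp-v2 | 0.82 |
| AWS / transcribe-medical | 0.76 |
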
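Semantic Error Rate
lower is better

| Provider / Model | Semantic Error Rate | Rating |
|------------------|---------------------|--------|
| OpenAI / whisper-v3 | 1.1% | GOOD |
| Deepgram / nova-2 | 1.4% | GOOD |
| AssemblyAI / conformer-2 | 1.2% | GOOD |
| Azure / speech-to-text-v3 | 2.1% | FAIR |
| Google Cloud / chirp-v2 | 2.4% | FAIR |
| AWS / transcribe-medical | 3.2% | HIGH |
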
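TTFP p50
lower is better

| Provider / Model | TTFP p50 | Range | Rating |
|------------------|----------|-------|--------|
| OpenAI / whisper-v3 | 380 ms | 340–425 ms | |
| Deepgram / nova-2 | 95 ms | 82–112 ms | FAST |
| AssemblyAI / conformer-2 | 210 ms | 185–238 ms | |
| Azure / speech-to-text-v3 | 310 ms | 275–350 ms | |
| Google Cloud / chirp-v2 | 280 ms | 248–315 ms | |
| AWS / transcribe-medical | 420 ms | 375–468 ms | |
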
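WER @ 0 dB SNR
lower is better

| Provider / Model | WER @ 0 dB SNR | Clean WER | Degradation |
|------------------|----------------|-----------|-------------|
| OpenAI / whisper-v3 | 14.2% | 4.2% | 3.4× |
| Deepgram / nova-2 | 16.8% | 4.4% | 3.8× |
| AssemblyAI / conformer-2 | 15.4% | 4.5% | 3.4× |
| Azure / speech-to-text-v3 | 19.2% | 5.1% | 3.8× |
| Google Cloud / chirp-v2 | 22.1% | 5.8% | 3.8× |
| AWS / transcribe-medical | 28.4% | 7.2% | 3.9× |
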
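The degradation factors listed for WER @ 0 dB SNR are consistent with a simple ratio of noisy WER to clean WER, rounded to one decimal place. The sketch below recomputes them from the placeholder values shown on this page. The ratio definition is inferred from the numbers rather than stated by the source, and the helper name `degradation` is hypothetical.

```python
# (provider / model, clean WER %, WER % at 0 dB SNR), copied from the page.
rows = [
    ("OpenAI / whisper-v3", 4.2, 14.2),
    ("Deepgram / nova-2", 4.4, 16.8),
    ("AssemblyAI / conformer-2", 4.5, 15.4),
    ("Azure / speech-to-text-v3", 5.1, 19.2),
    ("Google Cloud / chirp-v2", 5.8, 22.1),
    ("AWS / transcribe-medical", 7.2, 28.4),
]


def degradation(clean_wer, noisy_wer):
    """Assumed definition: how many times higher WER is under noise."""
    return noisy_wer / clean_wer


factors = [round(degradation(clean, noisy), 1) for _, clean, noisy in rows]

for (name, _, _), factor in zip(rows, factors):
    print(f"{name}: {factor}x")
```

A ratio makes systems with different clean baselines comparable, but it hides the absolute cost: the same 3.8× factor means 16.8% WER for one system and 22.1% for another.
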
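Codec Inflation
lower is better

| Provider / Model | Codec Inflation |
|------------------|-----------------|
| OpenAI / whisper-v3 | 2.8× |
| Deepgram / nova-2 | 3.1× |
| AssemblyAI / conformer-2 | 2.6× |
| Azure / speech-to-text-v3 | 2.4× |
| Google Cloud / chirp-v2 | 3.4× |
| AWS / transcribe-medical | 4.2× |
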
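Hallucination Rate
lower is better

| Provider / Model | Hallucination Rate |
|------------------|--------------------|
| OpenAI / whisper-v3 | 0.6 w/clip |
| Deepgram / nova-2 | 0.4 w/clip |
| AssemblyAI / conformer-2 | 0.8 w/clip |
| Azure / speech-to-text-v3 | 1.2 w/clip |
| Google Cloud / chirp-v2 | 1.8 w/clip |
| AWS / transcribe-medical | 0.9 w/clip |
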
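Accent Gap
lower is better

| Provider / Model | Accent Gap | Worst-group WER |
|------------------|------------|-----------------|
| OpenAI / whisper-v3 | 5.2 pts | 7.1% |
| Deepgram / nova-2 | 6.8 pts | 8.5% |
| AssemblyAI / conformer-2 | 4.9 pts | 7.4% |
| Azure / speech-to-text-v3 | 7.1 pts | 9.8% |
| Google Cloud / chirp-v2 | 5.8 pts | 9.1% |
| AWS / transcribe-medical | 9.2 pts | 13.1% |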