How every number is produced — automatically, with no human listening, reading, or rating.
Every number comes from deterministic code or a frozen pretrained model. No human listening, reading, or rating at any stage — including dataset construction.
Evaluations run via standardised scripts, with model inference fully automated.
95% confidence intervals, not raw numerical wins, determine which differences are statistically significant.
Metrics are collected by language, accent, and audio condition to reveal true capabilities.
Complete datasets and evaluation scripts are tracked for every model and snapshot.
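As an illustration of how a 95% confidence interval can decide whether one system really beats another, here is a minimal sketch using a paired percentile bootstrap over per-utterance error scores. The choice of bootstrap, the function names, and the resample count are assumptions for illustration; the source does not say which interval method is used.

```python
import random
from statistics import mean

def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(errors_a) - mean(errors_b).

    Both lists hold per-utterance error scores for the same utterances,
    in the same order (hypothetical input layout).
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(mean(sample))
    resampled.sort()
    lo = resampled[int((alpha / 2) * n_resamples)]
    hi = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def is_significant(ci):
    """A difference counts as significant only if the interval excludes zero."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

Under this sketch, a system with a lower raw score is only reported as better when `is_significant` returns `True` for the interval of the difference.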
Categories
What it measures
The percentage of words incorrectly transcribed compared to a ground-truth transcript. It accounts for insertions, deletions and substitutions.
How it's measured
Automated alignment using the Levenshtein distance algorithm. Transcripts undergo rigorous normalisation (lowercasing, punctuation stripping) prior to comparison to ensure fair grading.
Direction & metric
Example
Ref: 'The dose is fifteen mg' → Hyp: 'The dose is fifty mg'
WER = 1/5 = 20% — looks small
but the transcribed dose (50 mg) is more than 3× the intended 15 mg
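The calculation described above can be sketched as a word-level Levenshtein distance divided by the reference length. This is a minimal illustration, not the benchmark's own script: the function names are hypothetical, and the normalisation here covers only the two steps the text names (lowercasing and punctuation stripping).

```python
import re

def normalise(text):
    """Lowercase, strip punctuation, and split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

def word_edit_distance(ref, hyp):
    """Levenshtein distance over word lists (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate as a fraction of reference words."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)

print(wer("The dose is fifteen mg", "The dose is fifty mg"))  # 0.2
```

The dosage example yields one substitution over five reference words, which is the 20% quoted in the card.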
What it measures
How often the system gets numbers wrong — phone numbers, prices, medication dosages, account numbers. A system can score 4% WER and still misread critical numeric data.
How it's measured
Utterances are built from templates with exact slot values known up-front. Scoring is deterministic string comparison — no human checking needed.
Direction & metric
Example
Reference: 'Call 555-0142'
Hypothesis: 'Call 555-0412'
Only one token is wrong, so WER stays low (about 5% if this sits in a 20-word utterance), but two digits are transposed and the call fails
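A sketch of the deterministic slot comparison, assuming each templated utterance carries a single numeric slot and the hypothesis is written in digits rather than number words. Both assumptions, and the function names, are illustrative only.

```python
import re

def digits_only(text):
    """Keep only the digit characters, dropping separators and words."""
    return re.sub(r"\D", "", text)

def slot_correct(slot_value, hypothesis):
    """True when the digits in the hypothesis equal the known slot value."""
    return digits_only(hypothesis) == digits_only(slot_value)

def numeric_error_rate(pairs):
    """Fraction of (slot_value, hypothesis) pairs with a wrong number."""
    if not pairs:
        raise ValueError("need at least one utterance")
    wrong = sum(not slot_correct(slot, hyp) for slot, hyp in pairs)
    return wrong / len(pairs)

print(slot_correct("555-0142", "Call 555-0412"))  # False
```

Because the slot value is known from the template, no alignment or human check is needed: the digits either match or they do not.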
What it measures
Does it get the whole code right? One wrong character in an order ID (X7Q-4R2), VIN, or PNR means the code is useless. No partial credit — exact match or failure.
How it's measured
Gold labels are exact metadata from templated generation. Scoring compares the full string character-by-character. Rankings on this metric often invert the WER leaderboard.
Direction & metric
Example
Order ID: 'X7Q-4R2'
Transcribed as: 'X7Q-4B2'
1 character wrong → match fails → order lost
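The all-or-nothing scoring can be sketched as follows. The helper names are hypothetical, and the comparison is strict equality with no case folding or whitespace trimming, which is an assumption about how "exact match" is applied.

```python
def first_mismatch(gold, hypothesis):
    """Index of the first differing character, or None if the strings are equal."""
    for i, (g, h) in enumerate(zip(gold, hypothesis)):
        if g != h:
            return i
    if len(gold) != len(hypothesis):
        return min(len(gold), len(hypothesis))
    return None

def exact_match_rate(pairs):
    """Fraction of (gold, hypothesis) pairs that match exactly; no partial credit."""
    if not pairs:
        raise ValueError("need at least one code")
    return sum(gold == hyp for gold, hyp in pairs) / len(pairs)

print(first_mismatch("X7Q-4R2", "X7Q-4B2"))  # 5
```

`first_mismatch` is only a diagnostic aid; the score itself depends solely on whether the whole string matches.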
What it measures
How often does a transcription error change the meaning? Errors that flip the intent of a sentence (a negation removed, an entity swapped) count far more heavily than harmless word deletions.
How it's measured
A frozen LLM-judge ensemble classifies each error as benign, meaning-altering, or critical. The score is re-weighted by severity. Under 1.5% is good; above 3% risks dangerous misunderstandings in medical or legal contexts.
Direction & metric
Example
Reference: 'The patient is NOT allergic to penicillin'
Hypothesis: 'The patient IS allergic to penicillin'
WER = 1/7 ≈ 14.3%, but the meaning is dangerously reversed
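One way the severity re-weighting could work is sketched below. The source gives neither the weights nor how the judge ensemble is combined, so the weight values, the majority vote with ties going to the more severe label, and all names here are assumptions. The judge calls themselves are omitted and their labels are passed in as plain strings.

```python
from collections import Counter

# Hypothetical severity scale and weights, for illustration only.
SEVERITY_ORDER = ["benign", "meaning_altering", "critical"]
WEIGHTS = {"benign": 0.0, "meaning_altering": 1.0, "critical": 2.0}

def ensemble_label(votes):
    """Majority vote across judges; ties resolve to the more severe label."""
    counts = Counter(votes)
    return max(SEVERITY_ORDER,
               key=lambda label: (counts[label], SEVERITY_ORDER.index(label)))

def weighted_error_rate(error_votes, n_reference_words):
    """Sum of severity weights over all errors, per reference word."""
    if n_reference_words <= 0:
        raise ValueError("n_reference_words must be positive")
    total = sum(WEIGHTS[ensemble_label(votes)] for votes in error_votes)
    return total / n_reference_words
```

With these illustrative weights, a dropped filler word contributes nothing, while the removed negation in the penicillin example would count double.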
What it measures
How fast does the first word appear on screen? When streaming audio, this is the delay before any partial transcript is returned. Fast response makes your app feel alive; slow response looks broken.
How it's measured
Audio streamed at real-time pacing over WebSocket (8 kHz and 16 kHz). Clocked from first audio byte sent. Network hop measured and subtracted so scores are comparable across regions.
Direction & metric
Example
Audio starts at t=0, first word 'Hello' appears:
System A → 95ms (streaming, low latency)
System B → 380ms (batch, high latency)
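The timing arithmetic can be sketched without a live connection. The snippet below assumes 16-bit mono PCM sent in 20 ms chunks and assumes the measured network round trip is what gets subtracted; the source only says the network hop is measured and subtracted, so these details and the function names are illustrative.

```python
def chunk_schedule(n_bytes, sample_rate_hz, bytes_per_sample=2, chunk_ms=20):
    """Return (send_offset_s, start_byte, end_byte) tuples for real-time pacing."""
    chunk_bytes = sample_rate_hz * bytes_per_sample * chunk_ms // 1000
    schedule = []
    for i, start in enumerate(range(0, n_bytes, chunk_bytes)):
        end = min(start + chunk_bytes, n_bytes)
        schedule.append((i * chunk_ms / 1000, start, end))
    return schedule

def first_word_latency_ms(t_first_byte_sent_s, t_first_partial_s, network_ms):
    """Delay from first audio byte to first partial transcript, minus network time."""
    return (t_first_partial_s - t_first_byte_sent_s) * 1000 - network_ms

# 16 kHz audio: each 20 ms chunk is 640 bytes; 8 kHz audio: 320 bytes.
print(len(chunk_schedule(1600, 16000)))  # 3
```

A client would sleep until each `send_offset_s` before sending the corresponding byte range, then record the clock when the first partial transcript arrives.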
What it measures
Does it work in a noisy room? 0 dB means background noise is exactly as loud as the speech. Some systems barely degrade; others collapse. This tells you what to expect in real contact-center and in-car conditions.
How it's measured
The same reference clips are mixed additively with babble noise at decreasing SNR (20→0 dB) using pure DSP, so no new recordings are needed. WER is measured at each level.
Direction & metric
Example
Clean WER: 4.4% (quiet room)
WER @ 10 dB SNR: 8.1%
WER @ 0 dB SNR: 16.8% — real-world penalty revealed
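Setting a target SNR comes down to scaling the noise before adding it to the speech. The sketch below assumes additive mixing, full-clip average power, and plain Python lists of samples; the function names are hypothetical and a real pipeline would likely use an array library.

```python
import math

def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10*log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise has the same power as the speech, which matches the "exactly as loud" description in the card.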
What it measures
Does it make up words when no one is speaking? Play hold music or silence — does the system transcribe phantom words? These fabricated words poison AI assistant context and cause chaos downstream.
How it's measured
Deterministic: any word emitted when the reference is known-empty is a hallucination. The non-speech stimulus bank covers hold music, DTMF tones, voicemail beeps, and street noise.
Direction & metric
Example
Input: 30s of hold music (no speech)
Output: 'thanks for calling please stay on the line'
8 hallucinated words — all incorrect
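Since the reference for every non-speech clip is known to be empty, counting hallucinations reduces to counting emitted words. A minimal sketch, with hypothetical function and key names:

```python
def hallucination_stats(outputs):
    """Summarise transcripts produced for clips that contain no speech.

    Every whitespace-separated word in any output counts as hallucinated,
    because the reference for each clip is empty.
    """
    counts = [len(text.split()) for text in outputs]
    return {
        "clips": len(counts),
        "clips_with_hallucination": sum(c > 0 for c in counts),
        "hallucinated_words": sum(counts),
    }

print(hallucination_stats(["thanks for calling please stay on the line"]))
# {'clips': 1, 'clips_with_hallucination': 1, 'hallucinated_words': 8}
```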