Methodology

How every number is produced — automatically, with no human listening, reading, or rating.

HOW WE MEASURE

01. Measured Only

Every number comes from deterministic code or a frozen pretrained model. No human listening, reading, or rating at any stage — including dataset construction.

02. Zero Human Judgment

Evaluations run through standardised scripts, and embedding inference is fully automated.

03. CI & Ties

95% confidence intervals, not raw numerical wins, decide whether a difference is statistically significant. Results whose intervals do not separate are reported as ties.

04. Precise Slicing

Metrics are broken down by language, accent, and audio condition, so an aggregate score cannot hide a weak slice.

05. Provenance

Complete datasets and evaluation scripts are tracked for every model and snapshot.
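
Principle 03 does not say how the intervals are computed. The sketch below assumes a percentile bootstrap over per-utterance scores and treats overlapping intervals as a tie, which is a conservative rule of thumb rather than the benchmark's confirmed procedure. The names `bootstrap_ci` and `is_tie` are hypothetical.

```python
import random

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Assumption: `values` holds one score per utterance (e.g. per-utterance WER).
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def is_tie(ci_a, ci_b):
    """Two systems are treated as tied when their intervals overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```

With this rule, a system is ranked above another only when its whole interval lies on the better side of the other's.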

Accuracy

WER

Word Error Rate

What it measures

The share of words transcribed incorrectly relative to a ground-truth transcript: substitutions, deletions, and insertions, divided by the number of words in the reference.

How it's measured

Automated alignment using the Levenshtein distance algorithm. Transcripts undergo rigorous normalisation (lowercasing, punctuation stripping) prior to comparison to ensure fair grading.
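
A minimal sketch of this scoring, assuming normalisation consists only of the two steps named above (lowercasing and punctuation stripping). The function names are illustrative, not the benchmark's actual code.

```python
import re

def normalise(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split on whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())  # keep letters, digits, spaces
    return text.split()

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate as a fraction of reference words."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For the dose example in this section, `wer("The dose is fifteen mg", "The dose is fifty mg")` is one substitution over five reference words.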

Direction & metric

Percentage (%)↓ lower is better

Example

Ref: 'The dose is fifteen mg' → Hyp: 'The dose is fifty mg'

WER = 1/5 = 20%, which looks small,

but the dosage is more than 3× too high

DER

Digit Error Rate

What it measures

How often the system gets numbers wrong — phone numbers, prices, medication dosages, account numbers. A system can score 4% WER and still misread critical numeric data.

How it's measured

Utterances are built from templates with exact slot values known up-front. Scoring is deterministic string comparison — no human checking needed.
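
One way to sketch this scoring: compare the digits of the known slot value with the digits found in the transcript. The exact formula is not given here, so the sketch assumes a character-level edit distance divided by the number of reference digits, and it assumes the transcript already writes numbers as numerals. `digit_error_rate` is a hypothetical name.

```python
import re

def digits_only(text: str) -> str:
    """Keep only the characters 0-9, dropping spaces, hyphens, and words."""
    return re.sub(r"[^0-9]", "", text)

def char_edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def digit_error_rate(slot_value: str, hypothesis: str) -> float:
    """Digit errors as a fraction of the digits in the templated slot value."""
    ref = digits_only(slot_value)
    if not ref:
        raise ValueError("slot value must contain at least one digit")
    return char_edit_distance(ref, digits_only(hypothesis)) / len(ref)
```

Under these assumptions, a transposition such as 0142 read as 0412 counts as two digit errors, because plain Levenshtein distance has no swap operation.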

Direction & metric

Percentage (%)↓ lower is better

Example

Reference: 'Call 555-0142'

Hypothesis: 'Call 555-0412'

WER can be as low as ≈ 5% in a longer utterance, but two digits are transposed and the call fails

AEM

Alphanumeric Exact-Match

What it measures

Does it get the whole code right? One wrong character in an order ID (X7Q-4R2), VIN, or PNR means the code is useless. No partial credit — exact match or failure.

How it's measured

Gold labels are exact metadata from templated generation. Scoring compares the full string character by character. Rankings on this metric often differ sharply from the WER leaderboard and can invert it.
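
A minimal sketch of exact-match scoring. It assumes that letter case and surrounding whitespace are not significant, which is not stated in this section; every other character, including hyphens, must match. `alphanumeric_exact_match` is a hypothetical name.

```python
def normalise_code(code: str) -> str:
    """Assumption: ignore case and leading/trailing whitespace only."""
    return code.strip().upper()

def alphanumeric_exact_match(gold: list[str], predicted: list[str]) -> float:
    """Percentage of codes transcribed exactly right (no partial credit)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    hits = sum(normalise_code(g) == normalise_code(p)
               for g, p in zip(gold, predicted))
    return 100.0 * hits / len(gold)
```

A single wrong character makes the whole code a miss, so the score can fall quickly even when word-level accuracy is high.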

Direction & metric

Percentage (%)↑ higher is better

Example

Order ID: 'X7Q-4R2'

Transcribed as: 'X7Q-4B2'

1 character wrong → match fails → order lost

SWER

Semantic Error Rate

What it measures

How often does a transcription error change the meaning? Errors that flip the intent of a sentence (a negation removed, an entity swapped) count far more heavily than harmless word deletions.

How it's measured

A frozen LLM-judge ensemble classifies each error as benign, meaning-altering, or critical. The score is re-weighted by severity. Under 1.5% is good; above 3% risks dangerous misunderstandings in medical or legal contexts.
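
The severity weights are not published in this section, so the sketch below uses placeholder weights purely to show the shape of the re-weighting. It takes the judge labels as given input and assumes the score is the weighted error count divided by the number of reference words. All names and weight values are hypothetical.

```python
from enum import Enum

class Severity(Enum):
    BENIGN = "benign"
    MEANING_ALTERING = "meaning-altering"
    CRITICAL = "critical"

# Placeholder weights for illustration only; the real values are not stated.
HYPOTHETICAL_WEIGHTS = {
    Severity.BENIGN: 0.1,
    Severity.MEANING_ALTERING: 0.5,
    Severity.CRITICAL: 1.0,
}

def semantic_error_rate(error_labels: list[Severity],
                        reference_word_count: int) -> float:
    """Severity-weighted errors per reference word, as a percentage."""
    if reference_word_count <= 0:
        raise ValueError("reference_word_count must be positive")
    weighted = sum(HYPOTHETICAL_WEIGHTS[label] for label in error_labels)
    return 100.0 * weighted / reference_word_count
```

With these placeholder weights, one dropped negation labelled critical costs ten times as much as one harmless deletion labelled benign.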

Direction & metric

Percentage (%)↓ lower is better

Example

Reference: 'The patient is NOT allergic to penicillin'

Hypothesis: 'The patient IS allergic to penicillin'

WER = 1/7 ≈ 14.3%, but the meaning is dangerously reversed

Latency

TTFP

Time-to-First-Partial

What it measures

How fast does the first word appear on screen? When streaming audio, this is the delay before any partial transcript is returned. Fast response makes your app feel alive; slow response looks broken.

How it's measured

Audio is streamed at real-time pacing over WebSocket (8 kHz and 16 kHz). The clock starts when the first audio byte is sent and stops when the first partial transcript arrives. The network hop is measured and subtracted so scores are comparable across regions.
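
The timing arithmetic can be sketched as follows. The sketch assumes 16-bit mono PCM for the pacing calculation and assumes that the subtracted network hop is the measured round-trip time; neither detail is confirmed here, and both function names are hypothetical.

```python
def chunk_bytes(sample_rate_hz: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes to send every `chunk_ms` so audio arrives at real-time pace.

    Assumption: mono PCM with `bytes_per_sample` bytes per sample.
    """
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample

def time_to_first_partial_ms(first_audio_sent_s: float,
                             first_partial_received_s: float,
                             network_rtt_s: float) -> float:
    """TTFP in milliseconds with the network round trip removed.

    Timestamps are expected from one monotonic clock, e.g. time.monotonic().
    """
    if first_partial_received_s < first_audio_sent_s:
        raise ValueError("partial cannot arrive before audio is sent")
    raw_s = first_partial_received_s - first_audio_sent_s
    return max(0.0, (raw_s - network_rtt_s) * 1000.0)
```

For example, a client sending 20 ms chunks at 16 kHz would write `chunk_bytes(16000, 20)` bytes per tick and sleep until the next 20 ms boundary.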

Direction & metric

Milliseconds (ms)↓ lower is better

Example

Audio starts at t=0, first word 'Hello' appears:

System A → 95 ms (streaming, low latency)

System B → 380 ms (batch, high latency)

Robustness

0dB

WER @ 0 dB SNR

What it measures

Does it work in a noisy room? 0 dB means background noise is exactly as loud as the speech. Some systems barely degrade; others collapse. This tells you what to expect in real contact-center and in-car conditions.

How it's measured

The same reference clips are mixed with babble noise at decreasing SNR (20→0 dB) using pure DSP, so no new recordings are needed. WER is measured at each level.
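
Reaching a target SNR is normally done by scaling the noise and adding it to the speech. A minimal sketch of that additive mixing, assuming both signals are lists of float samples at the same sample rate; `mix_at_snr` is a hypothetical name.

```python
import math

def mean_power(samples: list[float]) -> float:
    """Average squared amplitude of a signal."""
    if not samples:
        raise ValueError("signal must not be empty")
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Add `noise` to `speech`, scaled so the speech-to-noise ratio is `snr_db`."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR(dB) = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

At `snr_db=0` the scaled noise carries the same average power as the speech, which matches the "exactly as loud" description above.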

Direction & metric

Percentage (%)↓ lower is better

Example

Clean WER: 4.4% (quiet room)

WER @ 10 dB SNR: 8.1%

WER @ 0 dB SNR: 16.8% — real-world penalty revealed

HAL

Hallucination Rate

What it measures

Does it make up words when no one is speaking? Play hold music or silence — does the system transcribe phantom words? These fabricated words poison AI assistant context and cause chaos downstream.

How it's measured

Deterministic: any word emitted when the reference is known-empty is a hallucination. The non-speech stimulus bank covers hold music, DTMF tones, voicemail beeps, and street noise.
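
Because every reference in this stimulus bank is empty, the score reduces to counting emitted words. A minimal sketch, assuming words are whitespace-separated tokens; `hallucination_rate` is a hypothetical name.

```python
def hallucination_rate(transcripts: list[str]) -> float:
    """Average number of words emitted per non-speech clip.

    Each transcript is the system output for a clip whose reference is empty,
    so every word counted here is a hallucination.
    """
    if not transcripts:
        raise ValueError("at least one clip is required")
    total_words = sum(len(t.split()) for t in transcripts)
    return total_words / len(transcripts)
```

A system that stays silent on every clip scores 0.0, which is the best possible result.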

Direction & metric

Words per clip↓ lower is better

Example

Input: 30s of hold music (no speech)

Output: 'thanks for calling please stay on the line'

8 hallucinated words — all incorrect