Clean Transcript

Whisper doesn’t transcribe ‘um’. It deletes it. There’s no token in the output, just a gap in the timestamps. Doug Calobrisi’s erm tool, designed to strip disfluencies from recordings, ran into this when the obvious approach failed: cut at the timestamps Whisper gives you, and the result sounds worse than the original.

The first failure mode explains itself. If Whisper doesn’t mark the ‘um’, there’s nothing to cut. But solving the detection problem exposed something stranger. When erm scanned for voice activity inside segments Whisper had labeled as silence, it found actual sounds — not ambient noise, but voices, in the acoustic space the model had decided was empty. The model had learned from training that these sounds weren’t supposed to be there. So it rendered them as nothing. “It really does just drop them,” the developer writes. “No token at all, just a hole in the transcript where an ‘um’ used to be.”

To get around this, erm passes a special instruction to Whisper before it runs: do not clean up the transcript. Then it runs three additional detection passes directly on the raw audio, looking for what the model has been trained to ignore.
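The idea of looking inside the holes can be sketched in a few lines. This is not erm’s code, and the essay’s source does not say how its three passes work; what follows is a minimal illustration that assumes word timestamps arrive as dictionaries with `start` and `end` in seconds, that the audio is a list of float samples, and that a plain RMS energy threshold is a good enough stand-in for real voice activity detection. The names `find_gaps`, `voiced_gaps`, and the threshold values are all hypothetical.

```python
import math

def find_gaps(words, min_gap=0.25):
    """Return (start, end) spans between consecutive words lasting at least min_gap seconds."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        start, end = prev["end"], nxt["start"]
        if end - start >= min_gap:
            gaps.append((start, end))
    return gaps

def rms(samples):
    """Root-mean-square level of a chunk of samples (0.0 for an empty chunk)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def voiced_gaps(words, samples, sample_rate, min_gap=0.25, energy_threshold=0.02):
    """Flag transcript gaps whose raw audio is louder than the threshold.

    Assumption: energy alone separates sound from silence. A real tool would
    need something sturdier, since breaths and room noise also carry energy.
    """
    flagged = []
    for start, end in find_gaps(words, min_gap):
        chunk = samples[round(start * sample_rate):round(end * sample_rate)]
        level = rms(chunk)
        if level >= energy_threshold:
            flagged.append((start, end, level))
    return flagged

# Synthetic demo: three transcribed words, two gaps. Only the first gap holds sound.
sample_rate = 1000
samples = [0.0] * (3 * sample_rate)
for i in range(400, 1000):  # a 120 Hz tone standing in for an untranscribed "um"
    samples[i] = 0.3 * math.sin(2 * math.pi * 120 * i / sample_rate)

words = [
    {"word": "so", "start": 0.0, "end": 0.4},
    {"word": "the", "start": 1.0, "end": 1.2},
    {"word": "plan", "start": 2.0, "end": 2.5},
]

flagged = voiced_gaps(words, samples, sample_rate)
for start, end, level in flagged:
    print(f"sound in transcript gap {start:.2f}s to {end:.2f}s (rms {level:.3f})")
```

In the demo, both gaps look identical in the transcript. Only the waveform distinguishes the one that held a sound from the one that was actually quiet.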

In 2001, Nelson Repenning and John Sterman at MIT published a paper titled “Nobody Ever Gets Credit for Fixing Problems that Never Happened.” It’s about why process improvement programs fail even when the methods work. Total quality management (TQM) works; statistical process control works; the evidence for this is not in dispute. Programs fail because their success is invisible. A quality program that eliminates defects produces no defects, which, to an organization running on incident reports and performance dashboards, looks identical to a period of natural calm. Managers look at calm and conclude the program is redundant. They cut it. The defects return.

Repenning and Sterman called this the capability trap. Organizations fix a problem, see no visible crisis, and redirect resources to fire-fighting, where the flames are visible and the credit is real. Prevention leaves no evidence of what it prevented. Quiet periods, in most organizational records, justify drawing down whatever produced the quiet.

This isn’t an unusual failure mode. It’s a description of what organizational memory is built from. The records that organizations run on — dashboards, incident reports, performance reviews — capture events. Prevention doesn’t produce events. It produces their absence. Absence doesn’t appear in the record.

Whisper’s training corpus and a factory’s incident database are different things. The shared feature is precise: in both cases, something was removed from the record before anyone learned from it.

Whisper’s training data pairs audio with transcripts written by humans, and those humans had already cleaned the disfluencies out before the transcripts were used for training. OpenAI trained on what they call “weakly supervised” data: a very large volume of audio paired with transcripts that were, in practice, edited for publication. The transcriptionists produced readable versions, which is what transcriptionists do. Whisper learned from those. Its tendency to emit clean output reflects the distribution of its training corpus, not a deliberate architectural decision. The preference was built into the data before the model existed.

Organizational records work differently but produce the same gap. Incident reports record incidents. The operator who noticed pressure creeping and adjusted before anything escalated didn’t file a report; there was no incident. Equipment that didn’t fail generated no logs. Near-misses caught early don’t appear in post-incident analyses. Over time, the records managers use to calibrate expectations (dashboards, incident databases, performance reviews) contain what happened. They don’t contain the precursors that didn’t escalate, or the practices that quietly prevented escalation. A manager reading those records correctly perceives that serious failures are rare. Serious failures are rare in the record.

When something does go wrong, investigations find precursors: signals present but unrecorded, patterns that should have flagged the problem. The question is consistent: why weren’t these noticed? Often because they weren’t in the records used to train institutional judgment.

Twenty-five years separate the paper from the tool, and what changed in that time is scale. Whisper can process an hour of audio in a few minutes and emit a clean transcript. Organizations now have real-time dashboards and incident management software. The records are better, faster, and more complete — and they systematically exclude the same category of thing the 2001 paper described.

Going back to the raw audio is the right move. erm does this: it runs a model trained on clean transcripts, then supplements it with passes on the underlying waveform, specifically designed to recover what the model deleted. The raw audio retained the sounds. The transcript had removed them. The fix requires treating the transcript as a lossy compression of something more complete.

The organizational equivalent is harder to implement but has the same structure: near-miss logging systems, maintenance diaries, informal reporting channels — the things that record what didn’t happen. They exist in industries where regulators have mandated them. They’re hard to aggregate and harder to act on. They don’t appear in executive dashboards. And they keep not getting built, because the organizations that would build them are reading dashboards that show no events, and no events looks like success.

The model keeps learning from the cleaned record. The cleaned record keeps returning the verdict that what was removed was filler. “The Tool Goes Quiet” describes the same gap from the output side: when the medium erases its own trace, provenance becomes testimony instead of observation.
