Why Transcription Isn’t Solved
If you look at modern speech recognition tools, it’s easy to assume transcription is a solved problem. You talk; the computer writes it down. Whisper, for example, was trained on roughly 680,000 hours of audio covering just about every accent and dialect you can imagine. On paper, it should work like a photocopier for speech.
But anyone who’s actually tried to use these tools knows it’s not that simple. Even with clever wrappers like Buzz or MacWhisper—basically Whisper with a nice interface—you still spend a surprising amount of time fixing things. You listen, you correct, you reformat. So what’s going on? Why, after decades of progress, is transcription still a chore?
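To make this concrete, here is a minimal sketch of roughly what wrappers like Buzz and MacWhisper do under the hood, using the open-source `whisper` Python package (the model size and file name here are placeholders):

```python
import whisper

# Load a pretrained checkpoint; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Transcribe a recording (hypothetical file name).
result = model.transcribe("interview.mp3")

# This raw text is exactly what you end up proofreading by hand.
print(result["text"])
```

The three calls are the easy part; everything after that `print` is where the human work begins.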
It’s tempting to blame the quirks of human speech. We mumble, we interrupt, we make up words, we talk over each other. Recordings are noisy. Accents throw off the software. But the latest models are trained precisely to handle this kind of mess, and yet, the results are still unreliable, especially when it matters most.
The real problem is that language isn’t just a string of words. It’s context, subtext, tone, inside jokes, and shared assumptions. A transcript can catch the words, but not the meaning. It can’t reliably tell who’s speaking when people talk over each other. It can’t spot sarcasm or catch a reference that only makes sense if you know the backstory.
Take a technical interview about quantum cryptography. The AI might get “entanglement” right, but stumble on “qubit,” or invent something that sounds plausible but isn’t real. Or consider a heated meeting, full of interruptions and sarcasm. The transcript might catch every word, but miss who said what, or the emotional weight behind it. The more automatic transcription becomes, the more obvious these gaps are.
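One partial mitigation, for what it’s worth: the open-source Whisper’s `transcribe` function accepts an `initial_prompt` that primes the decoder with vocabulary, which can nudge it toward jargon like “qubit” instead of a plausible invention. A sketch, with an illustrative prompt and file name:

```python
import whisper

model = whisper.load_model("base")

# Seed the decoder with domain terms so it is more likely to pick
# "qubit" over something that merely sounds plausible.
result = model.transcribe(
    "quantum_interview.mp3",  # hypothetical recording
    initial_prompt="A technical interview on quantum cryptography: qubits, entanglement, decoherence.",
)
print(result["text"])
```

This nudges the model’s word choices, but it doesn’t give it understanding, which is exactly the gap described above.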
This is why, even in 2025, transcription is still a semi-manual job. You need someone who knows the subject, recognizes the voices, and understands the context. That’s why Whisper-based transcription tools, for all their automation, still require editing, labeling, and reviewing. The machine does the heavy lifting, but the human is still in the loop.
Transcription reveals something deeper about language models in general: they’re great at copying patterns, not at understanding meaning. The hard part isn’t getting the words down; it’s knowing what they mean in context, and that’s something machines still can’t do.
So what’s the future? Probably not full automation, at least not soon. More likely, it’s a partnership: the AI does the grunt work, and the human adds the judgment. Like the first calculators—fast at arithmetic, but useless without someone to decide what to calculate and what the answer means.
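As a sketch of what that partnership can look like in code: the model produces timestamped segments, and a human gets an editable draft to correct names, jargon, and speakers (file names are hypothetical):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("meeting.mp3")

# The machine's half: dump each segment with a rough timestamp.
# The human's half: open draft.txt and fix what the model got wrong.
with open("draft.txt", "w") as f:
    for seg in result["segments"]:
        f.write(f"[{seg['start']:7.1f}s] {seg['text'].strip()}\n")
```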
I am a big fan of Limitless and the Pendant. It makes transcription faster and more accessible and, perhaps most importantly, makes recording more automatic in my day-to-day, but the system’s quirks also show that humans are still needed to make sense of the output. As the tools get better, the human part of the process becomes more important, not less.
Transcription isn’t just a technical challenge. It’s a reminder that language is a shared construction, full of subtlety and context. Machines can copy sounds, but meaning is still a negotiation between people. Maybe that’s why, even with the best tools, transcription remains—at least in part—an art.