With speech-to-text transcription, what are you really saving?
[Patrick Emond contributed to this post]
Last week, IBM trumpeted its latest achievement in automated speech-to-text: a record-low error rate of 5.5 percent. But as always, especially when the goal is saving money on transcription, you have to read the fine print.
“This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like ‘buying a car,’” notes the Principal Research Scientist, George Saon. “This recorded corpus [defined as ‘a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.’], known as the Switchboard corpus, has been used for over two decades to benchmark speech recognition systems.”
It is worth noting, however, that our “corpus” is not a mere database of recorded phone conversations, but the real world. Our team of transcription experts includes musicians, writers, bartenders, astrophysicists, ethnomusicologists, film geeks, hockey nuts, and world travelers, all of whom bring real-life experience and a unique knowledge base to your transcription projects.
Saon prefaces the entire milestone with the following claim: “Depending on who you ask, humans miss one or two out of every 20 words that they hear.” It is worth dwelling on that claim for a moment. Are we to believe that humans, even when straining to listen or transcribe, as this context dictates, miss 5 to 10 percent of everything they hear? Saon then goes on to explain the realities of speech-to-text:
“As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent.
“To determine this number, we worked to reproduce human-level results with the help of our partner , which provides speech and search technology services. And while our breakthrough of 5.5 percent is a big one, this discovery of human parity at 5.1 percent proved to us we have a way to go before we can claim technology is on par with humans.”
IBM tells us that they “worked to reproduce human-level results,” whereas we actually deliver them. An error rate of 5.1 percent, the utterly ludicrous benchmark by which IBM has set its speech-to-text goals, works out to roughly one error every 20 words. This translates to an error every line or two of your transcript, with hundreds, if not thousands, of errors in total across, for example, a 35-page transcript (or a one-hour recording).
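For readers who want to check the arithmetic above, here is a rough back-of-the-envelope sketch. The 5.5 and 5.1 percent figures come from IBM's announcement; the speaking rate of 150 words per minute is our own assumption for illustration (real recordings vary widely), as are the function names. Treat the results as ballpark estimates, not measurements.

```python
def expected_errors(word_count: int, error_rate: float) -> float:
    """Expected number of wrong words, given a word error rate (0.051 = 5.1%)."""
    return word_count * error_rate


def words_per_error(error_rate: float) -> float:
    """Average number of words between one error and the next."""
    return 1.0 / error_rate


# ASSUMPTION (not from IBM or from this post): about 150 spoken words per minute.
ASSUMED_WORDS_PER_MINUTE = 150
words_in_one_hour = ASSUMED_WORDS_PER_MINUTE * 60

scenarios = [
    ("IBM's record (5.5 percent errors)", 0.055),
    ("IBM's 'human parity' (5.1 percent errors)", 0.051),
    ("99 percent accuracy (1 percent errors)", 0.01),
]

for label, rate in scenarios:
    errors = expected_errors(words_in_one_hour, rate)
    gap = words_per_error(rate)
    print(f"{label}: about {errors:.0f} errors per hour, one every {gap:.1f} words")
```

Under that assumed speaking rate, a one-hour recording at a 5.1 percent error rate lands in the hundreds of errors, while a transcript at 99 percent accuracy or better stays at or below a fifth of that.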
We deliver transcripts well in excess of 99 percent accuracy with a 100 percent satisfaction guarantee. We are not looking to set any benchmarks; we want to deliver the best transcripts with the fastest turnaround. You don’t want to spend your time and money making hundreds or thousands of corrections; you want to grow your business. You want accurate transcripts.
And that is why we are here, and have been for 50 years. Computer speech-to-text programs may deliver a number, based on a benchmark, based on a corpus, based on a reproduction of a finite number of phone recordings. But the Audio Transcription Center just delivers: near-perfect transcription, when you need it, with no hidden fees.