Sunday, December 22, 2024

3Play Media Study Finds Artificial Intelligence Innovation Has Led to Significant Improvements in Automatic Speech Recognition (ASR)

Impressive new entrants have raised the bar for industry leaders, with AssemblyAI, Speechmatics, and Whisper leading the pack

ASR technology has never been as accurate as it is today thanks to advances in artificial intelligence (AI), according to a report from 3Play Media, the leading media accessibility provider, released today. The annual State of ASR study analyzes the general state of speech-to-text technology as it applies to the task of captioning and transcription.

According to the study, in which the company tested speech recognition with ten relevant ASR engines, the accuracy of the technology has improved measurably since the company's last evaluation in 2022. As ASR improves, it's important to understand which engine is best for different use cases. Some nuances to consider include performance on different error types, transcription styles, formatting, and industry-specific content.

"The advances in AI we've seen across industries have also had an impact on ASR," Chris Antunes, co-CEO and co-Founder, 3Play Media, said. "Longtime industry leader Speechmatics and newer entrants AssemblyAI and Whisper performed at the top of the pack, with each excelling in different areas. This proves that not all engines are created equal – the training material and models matter – and that there is room at the top for multiple engines to specialize in different use cases."

Accuracy is the key component in captioning for several reasons, most importantly ensuring that individuals who are deaf or hard of hearing, and who rely on captions as an accommodation, receive information that fully reflects the original content. For captions to be accessible and legally compliant, they must meet the industry standard of 99% accuracy. While there was improvement across industry leaders, the study found that even the best engines performed well below 99% accuracy, indicating a continued need for human revision.

The report measures accuracy with two metrics: Word Error Rate (WER) and Formatted Error Rate (FER). WER is the standard measure of transcription accuracy, while FER also accounts for formatting, sound effects, grammar, and punctuation, making it a better representation of the accuracy viewers actually experience. FER accuracy is harder to achieve: even the best-tested engines reached only 82% on FER, versus 93% on WER.
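For readers unfamiliar with how WER is calculated, it is conventionally the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is illustrative only; the function name and sample sentences are not from the 3Play study, and the study's exact scoring pipeline may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four -> WER of 0.25
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

Under this convention, a 93%-accurate engine corresponds to a WER of roughly 0.07; FER applies the same idea after formatting and punctuation are taken into account, which is why its scores run lower.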

Additionally, the study identified a new type of error: hallucinations, in which an engine generates text that has no basis in the audio. The State of ASR report found evidence of hallucinations in the Whisper transcriptions, often occurring when the topic shifted. Some of the hallucinations were significant and could pose issues for the captioning use case in particular. However, hallucinations seemed rare and did not prevent Whisper from performing competitively.
