

Speech recognition remains a challenging problem in artificial intelligence and machine learning. In a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables “robust” transcription in multiple languages, as well as translation from those languages into English.
Countless organizations have developed highly capable speech recognition systems, which sit at the core of software and services from tech giants like Google, Amazon, and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and “multitask” data collected from the web, which improves its recognition of unique accents, background noise, and technical jargon.
“The primary intended users of [the Whisper] models are AI researchers studying the robustness, generalization, capabilities, biases, and constraints of current models. However, Whisper is also potentially quite useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI wrote in the GitHub repo for Whisper, from which several versions of the system can be downloaded. “[The models] show strong ASR results in roughly 10 languages. They may exhibit additional capabilities … if fine-tuned on certain tasks such as voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas.”
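For developers who want to try that English speech recognition path, the repo ships a Python package. Below is a minimal sketch based on the usage documented in the openai/whisper README; the “base” model name is one of several published sizes, and the audio file path is a placeholder.

```python
# Minimal transcription sketch using the open-sourced `openai-whisper`
# package (pip install -U openai-whisper; ffmpeg must also be installed).
import whisper

# Download (on first use) and load one of the published model sizes.
model = whisper.load_model("base")

# "audio.mp3" is a placeholder path; any ffmpeg-readable file works.
result = model.transcribe("audio.mp3")
print(result["text"])
```

The package also exposes a command-line entry point (`whisper audio.mp3 --model base`) that wraps the same call.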
Whisper has its limitations, particularly in the area of text prediction. Because the system was trained on a large amount of “noisy” data, OpenAI warns that Whisper’s transcriptions may include words that were never actually spoken, possibly because it is simultaneously trying to predict the next word in the audio and to transcribe the recording itself. Moreover, Whisper doesn’t perform equally well across languages, showing higher error rates for speakers of languages that are underrepresented in the training data.
Unfortunately, that last point is nothing new in the speech recognition world. Bias has long plagued even the best systems: a 2020 Stanford University study found that systems from Amazon, Apple, Google, IBM, and Microsoft made far fewer errors with white users than with Black users, for whom the average error rate was about 35%.
Nonetheless, OpenAI sees Whisper’s transcription capabilities being used to improve existing accessibility tools.
“While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation,” the company continued on GitHub. “The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications … [W]e hope the technology will be used primarily for beneficial purposes, [but] making automatic speech recognition technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication.”
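To make the “build on top of Whisper” idea concrete, here is one way a near-real-time application could be sketched: record short chunks from a microphone and transcribe each chunk as it arrives. This is purely illustrative and not from OpenAI’s repo; the third-party sounddevice library, the five-second chunk length, and the “base.en” model choice are all assumptions.

```python
# Illustrative near-real-time loop: capture fixed-length microphone
# chunks and run Whisper on each one. Assumes `openai-whisper` and
# the third-party `sounddevice` package are installed.
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 5      # latency vs. accuracy trade-off (an assumption)

model = whisper.load_model("base.en")  # small English-only model

while True:
    # Record one chunk from the default microphone as float32 samples.
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until the chunk has been captured

    # transcribe() accepts a float32 NumPy array directly; fp16=False
    # avoids a warning when running on CPU.
    result = model.transcribe(audio.flatten(), fp16=False)
    print(result["text"].strip())
```

A production system would overlap chunks or use voice activity detection to avoid cutting words in half; this loop only demonstrates the shape of the approach.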
The release of Whisper isn’t necessarily indicative of OpenAI’s future plans. While increasingly focused on commercial efforts like DALL-E 2 and GPT-3, the company continues to pursue several purely theoretical research threads, including AI systems that learn by observing videos.