Computer speech-to-text just got even better!

If you have felt hesitant to learn about or implement speech recognition software because of the complicated names most of the utilities have, fear not: Whisper is here to save the day.

The new machine learning model doesn't just score a win in the names department; its performance is also incredibly good. If your machine learning (ML) terminology has gotten rusty: an ML model is the output of an ML algorithm. The algorithm takes in training data and produces and adjusts the internal parameters of a model so that the model is able to (in this case) recognize words. That's just a very brief, high-level overview, but you can think of the training algorithm as a potter who makes the pot (the model), shaping every way it will work in the future (except that pots don't recognize words, or at least not as well as Whisper).

As mentioned before, Whisper does its job very well, but with a caveat: the new model does not outperform its competitors when the audio is clear and free of background noise. As soon as the recording conditions get unfavourable, though, that's where the new model really shines, with performance up by 25% in such noisy environments. The following graph presents the findings.
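To make the potter analogy concrete, here is a tiny, hypothetical sketch of the algorithm-versus-model distinction. This has nothing to do with Whisper's actual training code; it just shows a training algorithm (gradient descent) adjusting a model's single internal parameter until the model fits the data:

```python
# The "potter": a training algorithm that shapes the "pot" (the model).
# Here the model is just one weight w in the function y = w * x.

def train(data, lr=0.01, steps=1000):
    """Adjust the model's weight w to fit (x, y) pairs via gradient descent."""
    w = 0.0  # the untrained model
    for _ in range(steps):
        for x, y in data:
            pred = w * x               # the model's current guess
            grad = 2 * (pred - y) * x  # how wrong it is, and in which direction
            w -= lr * grad             # nudge the weight to reduce the error
    return w  # the finished pot: a trained model

# Toy data following y = 2x; training recovers w close to 2.
data = [(1, 2), (2, 4), (3, 6)]
w = train(data)
print(round(w, 3))  # w converges to ~2.0
```

The algorithm runs once during training; what you ship afterwards is only the model (here, the number `w`), which is exactly the relationship between Whisper's training procedure and the downloadable Whisper model.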

The horizontal axis shows how bad the audio quality is, and the vertical axis shows the WER measure, i.e. the word error rate: the higher the percentage, the worse a job the model does. Here we see the aforementioned disadvantage compared to other state-of-the-art models: Nvidia's stt_en_conformer_transducer_xlarge pulls out a small but noticeable lead in the 40 to 0 range, but as soon as the audio gets worse, Whisper takes the lead. However, Whisper's got another ace up its sleeve: it is multilingual, working in 99 languages (with varying performance), whereas the aforementioned Nvidia model has never heard of other languages.
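For reference, WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the model's transcript and the reference, divided by the reference length. A minimal sketch of my own, not the evaluation code from the paper:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level edit distance (dynamic programming)."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(r)][len(h)] / len(r)

# One substitution out of four reference words: WER = 25%
print(wer("the whisper model wins", "the whispers model wins"))  # 0.25
```

So a WER of 25% means roughly one word in four is wrong, which is why lower curves on the graph are better.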

So how have the scientists at OpenAI achieved such great results? Whisper is based on the transformer architecture and was trained on 680,000 hours of speech, around the same number of hours that one friend spent convincing you to buy crypto. For everyone who wants to learn more about the former, I highly suggest this video by IBM: https://www.youtube.com/watch?v=ZXiruGOCn9s. For everyone who wants to know more about the other Transformers, I recommend watching some Michael Bay movies. The latter ingredient is the more interesting one, though: it is a huge amount of data aggregated from many sources and put together to create this model.
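If the video is too long, here is a toy sketch of scaled dot-product attention, the core operation inside a transformer. This is a bare-bones illustration in plain Python (real implementations are batched matrix math), not Whisper's code: each query mixes the value vectors, weighted by how well it matches each key.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention over small lists of vectors."""
    d = len(K[0])  # key dimension, used to scale the dot products
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # softmax turns the scores into positive weights that sum to 1
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # output = weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs; it matches the first key
# more strongly, so the first value dominates the output.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```

Stacking many such attention layers is what lets the model relate every part of the audio to every other part, instead of processing it strictly left to right.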

The best part of it all is that Whisper is open source, meaning everyone can use it and implement it wherever they want. Not only that, but the way programmers can use it and adapt it in their apps is super easy. As a result, we might start seeing more speech-to-text functionality implemented in websites and digital applications, which in turn will make our digital world more accessible. This model is a big milestone for the AI world. What a time to be alive!
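To give a sense of how easy "super easy" is, this is roughly what transcription looks like with the open-source Python package (per the project's README at the time of writing; assumes `pip install openai-whisper` and a local recording, here given the hypothetical name `audio.mp3`):

```python
import whisper

# Download and load one of the pretrained checkpoints ("base" is a small one;
# larger checkpoints are slower but more accurate).
model = whisper.load_model("base")

# Transcribe a local audio file; the language is detected automatically.
result = model.transcribe("audio.mp3")
print(result["text"])
```

A handful of lines for multilingual speech recognition is a very low barrier to entry, which is exactly why wider adoption in apps and websites seems plausible.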

 

The paper is easily digestible even for people who are not AI experts, though some web searches will probably need to be made.

A. Radford, J. W. Kim, T. Xu, et al., Robust Speech Recognition via Large-Scale Weak Supervision (2022), https://openai.com/blog/whisper

One Reply to “Computer speech-to-text just got even better!”

  1. Mariusz – that’s crazy how well Whisper performs! I’ve noticed music recognition (like Shazam) has gotten so good – perhaps the algorithms are related? There were so many times before those apps came around that I was desperate to know the name of a particular song but had no hope of finding it (especially if it was in a language I didn’t know)!
    -Edie
