Google upgrades its speech APIs with improved features

Ryan Daws is a senior editor at TechForge Media with over a decade of experience in crafting compelling narratives and making complex topics accessible. His articles and interviews with industry leaders have earned him recognition as a key influencer by organisations like Onalytica. Under his leadership, publications have been praised by analyst firms such as Forrester for their excellence and performance. Connect with him on X (@gadget_ry) or Mastodon (@gadgetry@techhub.social)


Google has updated its Text-to-Speech and Speech-to-Text APIs with a range of feature improvements alongside support for more languages.

For many developers, the addition of 17 new WaveNet-based voices for a variety of languages will be the highlight of today’s update.

WaveNet is the machine learning technology, developed by Google’s DeepMind, that generates natural-sounding speech when performing text-to-speech.

[Image: WaveNet voice language support]

Text-to-Speech now supports a total of 30 standard voices and 26 WaveNet voices across 14 languages. A demo of the new voices, using your own text, can be found here.

Among the new features is the addition of ‘audio profiles’ to customise the output for the speaker being used. For example, output for headphones, sound bars, or the phone’s built-in speaker will all sound best with custom tuning.
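As a rough illustration, an audio profile is selected per-request through the Text-to-Speech REST API's `audioConfig` section. The field names below follow the v1 REST API; the specific voice and profile ID are illustrative examples, and this sketch only builds the JSON body rather than calling the service.

```python
# Sketch: build a Text-to-Speech synthesize request body with an audio profile.
def build_tts_request(text, voice_name, language_code, audio_profile):
    """Return a JSON-serialisable body for a text:synthesize request."""
    return {
        "input": {"text": text},
        "voice": {
            "languageCode": language_code,
            "name": voice_name,  # e.g. a WaveNet voice
        },
        "audioConfig": {
            "audioEncoding": "MP3",
            # effectsProfileId tunes output for a class of speaker,
            # e.g. headphones vs. a phone's built-in speaker.
            "effectsProfileId": [audio_profile],
        },
    }

request = build_tts_request(
    text="Hello from WaveNet",
    voice_name="en-US-Wavenet-A",
    language_code="en-US",
    audio_profile="headphone-class-device",
)
```

The same body could then be POSTed to the API with any HTTP client, or built through Google's client libraries.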

On the flip side, Speech-to-Text has also received significant improvements.

The most impressive feature is the ability to recognise multiple speakers in a voice recording, so automatic transcriptions can attribute words to individual speakers. However, the number of speakers must be provided beforehand.
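In the Speech-to-Text API this surfaces as speaker diarisation fields on the recognition config. The sketch below assembles such a config as a plain dict; the field names follow the REST API, while the encoding and sample rate are illustrative.

```python
# Sketch: a Speech-to-Text recognition config requesting multi-speaker
# recognition. As the article notes, the expected number of speakers
# must be supplied up front.
def build_diarization_config(language_code, speaker_count):
    return {
        "config": {
            "languageCode": language_code,
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "enableSpeakerDiarization": True,
            # The API needs the speaker count ahead of time.
            "diarizationSpeakerCount": speaker_count,
        },
    }

config = build_diarization_config("en-US", speaker_count=2)
```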

Along with the additional Text-to-Speech languages, Google is also expanding language support for Speech-to-Text. Developers can specify up to four languages, and the API will automatically determine which one is being spoken.
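In practice, the four languages are expressed as one primary language code plus up to three alternatives on the recognition config. A minimal sketch, again building only the request dict:

```python
# Sketch: a recognition config letting the API choose among up to four
# languages: one primary languageCode plus up to three alternatives.
def build_multilang_config(primary, alternatives):
    if len(alternatives) > 3:
        raise ValueError("at most three alternative languages are allowed")
    return {
        "config": {
            "languageCode": primary,
            "alternativeLanguageCodes": alternatives,
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
        },
    }

config = build_multilang_config("en-US", ["es-ES", "fr-FR"])
```

The response then indicates which of the candidate languages was detected, so the app can react accordingly.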

Finally, the addition of a ‘word confidence score’ helps to ensure accuracy.

With each query, the Speech-to-Text API will return a confidence score that it’s heard a word correctly before making it actionable. If a low confidence is returned, and it’s important to get it right, the developer may choose to prompt the user to repeat.

“For example, if a user inputs ‘please set up a meeting with John for tomorrow at 2PM’ into your app, you can decide to prompt the user to repeat ‘John’ or ‘2PM,’ if either have low confidence, but not to reprompt for ‘please’ even if it has low confidence since it’s not critical to that particular sentence,” the team explains.
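The logic Google describes can be sketched as a small client-side check. The word/confidence shape below mirrors what Speech-to-Text returns when word-level confidence is enabled; the 0.8 threshold and the notion of which words are “critical” are the application’s choice, not part of the API.

```python
# Sketch: flag critical words whose confidence fell below a threshold,
# so the app can prompt the user to repeat just those words.
def words_to_reprompt(words, critical, threshold=0.8):
    """Return critical words with confidence below the threshold."""
    return [
        w["word"]
        for w in words
        if w["word"] in critical and w["confidence"] < threshold
    ]

transcript = [
    {"word": "please", "confidence": 0.55},   # low, but not critical
    {"word": "meeting", "confidence": 0.92},
    {"word": "John", "confidence": 0.61},     # low AND critical -> reprompt
    {"word": "2PM", "confidence": 0.95},
]

print(words_to_reprompt(transcript, critical={"John", "2PM"}))  # -> ['John']
```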

Considering the difficulty some voice recognition services have with my accent, that last feature could help to reduce awkward errors.

What are your thoughts on Google’s improved speech features? Let us know in the comments.

