In a blog post, Google has announced the availability of its Cloud Text-to-Speech service. The service lets developers use the company's WaveNet model and its neural network infrastructure to incorporate natural-sounding text-to-speech into their own applications. WaveNet is the same technology that powers speech synthesis in popular Google services including the Google Assistant, Maps, and Search. The Mountain View-based tech giant noted in its blog post that the service could power the voice response systems of call centers, enable spoken responses from Internet of Things (IoT) devices, and automatically convert text-based media like articles and books into spoken formats like podcasts and audiobooks. Developers may choose among 32 voices across 12 languages, including English, Portuguese, Japanese, French, Spanish, and Swedish, and the search giant noted that it will add more voices in the near future. Developers may also adjust the volume, speech rate, and pitch of the voices, and they can add Speech Synthesis Markup Language (SSML) tags to insert pauses, give pronunciation instructions, and control how dates are read aloud.
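To make the SSML and voice-tuning options more concrete, here is a minimal sketch in Python of what a synthesis request might look like. The SSML elements (`speak`, `break`, `say-as`) come from the W3C SSML standard; the request field names (`speakingRate`, `pitch`, `volumeGainDb`) and the voice name `en-US-Wavenet-A` are assumptions modeled on the service's public REST API and are not taken from the announcement, so check them against the official reference before relying on them. The sketch only builds and validates the request body; it does not call the service.

```python
import json
import xml.etree.ElementTree as ET

# SSML document with a pause, a date, and a token spelled out letter by letter.
ssml = (
    "<speak>"
    "Your order shipped on "
    '<say-as interpret-as="date" format="mdy">03-27-2018</say-as>.'
    '<break time="500ms"/>'
    "The tracking code is "
    '<say-as interpret-as="characters">AB12</say-as>.'
    "</speak>"
)

# Confirm the markup is well-formed XML before sending it anywhere.
root = ET.fromstring(ssml)

# Request body. Field names and the voice name are assumptions (see lead-in).
request_body = {
    "input": {"ssml": ssml},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-A"},
    "audioConfig": {
        "audioEncoding": "MP3",
        "speakingRate": 0.9,   # slightly slower than the default rate
        "pitch": -2.0,         # lower the pitch a little
        "volumeGainDb": 1.5,   # small volume boost
    },
}

payload = json.dumps(request_body, indent=2)
print(payload)
```

Validating the SSML locally is a cheap way to catch unbalanced tags before any characters are sent for synthesis.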
In its blog post, the search giant also detailed improvements made to the WaveNet model. The technology was originally introduced in 2016 and used a convolutional neural network trained on a variety of speech samples. However, Google did not immediately integrate the model into its products because it was not efficient enough for commercial use. Two years later, updated versions of the WaveNet model generate audio much faster. In 2016, WaveNet could generate only 0.02 seconds of audio per second of processing, while the updated model can generate 20 seconds of audio in one second, a 1,000-fold speedup. In addition, the updated model offers improved fidelity and better resolution than the original, which should result in higher-quality audio and more human-like sound.
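The two throughput figures are easier to appreciate with a little arithmetic. The sketch below uses only the numbers quoted above; the 60-second clip length and the helper name `compute_seconds` are illustrative choices, not anything from Google's post.

```python
# Seconds of audio produced per second of compute, as quoted in the article.
ORIGINAL_RATE = 0.02  # 2016 WaveNet
UPDATED_RATE = 20.0   # updated WaveNet


def compute_seconds(audio_seconds, rate):
    """Return the compute time needed to synthesize `audio_seconds` of audio."""
    return audio_seconds / rate


speedup = UPDATED_RATE / ORIGINAL_RATE
clip = 60.0  # illustrative one-minute clip

print(f"speedup: {speedup:.0f}x")
print(f"original model: {compute_seconds(clip, ORIGINAL_RATE):.0f} s of compute")
print(f"updated model:  {compute_seconds(clip, UPDATED_RATE):.0f} s of compute")
```

At the 2016 rate, a one-minute clip would take about 3,000 seconds (50 minutes) to generate, versus roughly 3 seconds with the updated model, which helps explain why the original was impractical for commercial use.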
Developers who use the Cloud Text-to-Speech service may choose between WaveNet and Basic voices, and they will be billed monthly based on the number of characters they send to the service for audio synthesis. The Basic voices are considerably cheaper, although WaveNet voices offer more natural-sounding speech.
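Because billing is per character, a monthly bill can be estimated from the text a developer plans to synthesize. The sketch below is a rough illustration only: the announcement as summarized here gives no prices, so the two rates are placeholders in arbitrary units, and the function names are hypothetical. It also assumes every character in the input string is counted, which may differ from how the service actually meters usage.

```python
def count_characters(texts):
    """Total number of characters across all strings sent for synthesis."""
    return sum(len(text) for text in texts)


def estimate_monthly_cost(texts, rate_per_million_chars):
    """Estimate a bill from a per-million-character rate (caller-supplied)."""
    return count_characters(texts) * rate_per_million_chars / 1_000_000


# Placeholder rates in arbitrary units; NOT real prices.
PLACEHOLDER_BASIC_RATE = 1.0
PLACEHOLDER_WAVENET_RATE = 4.0

texts = ["Hello, world!", "Your order has shipped."]

print("characters:", count_characters(texts))
print("basic estimate:", estimate_monthly_cost(texts, PLACEHOLDER_BASIC_RATE))
print("wavenet estimate:", estimate_monthly_cost(texts, PLACEHOLDER_WAVENET_RATE))
```

Substituting the published rates for the placeholders gives a quick way to compare the two voice tiers for a given workload.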