DeepMind researchers say their WaveNet technology closes the gap between existing computer speech and a real human voice by 50%.
The neural network models the raw waveform of the audio signal it is attempting to mimic one sample at a time.
Given that there can be as many as 16,000 samples per second of audio and that each prediction is influenced by every previous one, it is by DeepMind’s own admission a pretty “computationally expensive” process.
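To make the cost concrete, here is a minimal sketch of sample-by-sample generation. It is not DeepMind's code: `toy_next_sample_distribution` is a hypothetical stand-in for the neural network, and the 256 amplitude levels are an assumption made for illustration. Only the loop structure matters, since every new sample needs its own model call and that call depends on what has already been generated.

```python
import random

SAMPLE_RATE = 16_000  # samples per second, the figure quoted above
LEVELS = 256          # assumed number of quantised amplitude values


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the network.

    Returns a probability for each amplitude level given the samples so far.
    This toy only looks at the most recent sample; the real model is
    influenced by far more of the history.
    """
    last = history[-1] if history else LEVELS // 2
    weights = [1.0 / (1 + abs(level - last)) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Generate audio one sample at a time, feeding each result back in."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)  # one model call per sample
        samples.append(rng.choices(range(LEVELS), weights=probs, k=1)[0])
    return samples


def model_calls_needed(seconds, sample_rate=SAMPLE_RATE):
    """Number of sequential model calls needed for a clip of this length."""
    return int(seconds * sample_rate)
```

Under these assumptions, `model_calls_needed(1)` comes to 16,000 sequential predictions for a single second of audio, and none of them can start until the previous one has finished.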
In order for WaveNet to utter actual sentences, the researchers also have to feed the program linguistic and phonetic tips.
So if it’s such an intensive process, why has DeepMind chosen it?
Well, researchers believe it’s the best way of really advancing human-sounding machine speech.
Text-to-speech (TTS) has traditionally been limited to two pretty rudimentary approaches.
The first, concatenative TTS, assembles sentences from a huge database of short speech fragments. The major drawback is that the speaker’s voice and tone can only be changed by re-recording the entire database.
The second approach, parametric TTS, generates speech entirely by computer, and the artificial result is obvious as soon as you hear it.
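The concatenative approach can be pictured with a toy sketch like the one below. The fragment labels, the sample values and the `synthesize` helper are all hypothetical and chosen purely for illustration; a real system stores a huge library of recordings and smooths the joins between them.

```python
# Hypothetical fragment database: label -> recorded samples for ONE speaker.
FRAGMENTS = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.4, 0.1],
    "wor": [-0.2, 0.0, 0.2],
    "ld": [0.3, -0.1],
}


def synthesize(units, database):
    """Build an utterance by stitching stored fragments end to end."""
    audio = []
    for unit in units:
        if unit not in database:
            # Nothing can be synthesised that was not recorded beforehand.
            raise KeyError(f"no recording for unit {unit!r}")
        audio.extend(database[unit])
    return audio


hello = synthesize(["he", "llo"], FRAGMENTS)
```

Because every value in `FRAGMENTS` comes from the same recorded voice, changing the speaker or the tone means replacing the whole dictionary, which is the drawback described above.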
You can make up your own mind about how effective WaveNet is by listening to the samples in DeepMind’s blog post, which also features some pretty convincing WaveNet-generated piano music.
Unfortunately, the processing power required to make it work means we probably won’t be seeing WaveNet in smartphones any time soon.