
Tacotron 2: Generating Human-like Speech from Text

by Gadget Reviewed December 22, 2017

Making apps and neural networks smarter and more human-like is what companies and app developers are striving for these days. Creating natural, human-sounding speech is one of the areas neural network developers are working on. Google is again in the lead when it comes to making artificial intelligence sound more human with its latest offering, Tacotron 2. To build it, Google combined ideas from two of its earlier systems, Tacotron and WaveNet.

What Does Tacotron 2 Do?

After what has been a long haul in developing natural-sounding speech, Google has taken the simple route with Tacotron 2. Unlike older text-to-speech pipelines, Tacotron 2 does not rely on hand-written speech rules and algorithms to achieve the desired result, which is more natural-sounding speech.

Tacotron 2 needs only two things to learn natural speech: speech examples and the corresponding text transcripts.

There is no need to hand-label things like prosody, intonation and pronunciation to get Tacotron 2 to work. Google has shown that simple is the way to go when it comes to natural, more human-like speech.

Tacotron 2 uses written text and its spoken form to work out how to create natural-sounding audio. It takes parts from both of Google’s previous systems, Tacotron and WaveNet. Just from the way the speech sounds, Tacotron 2 picks up the intonation, pronunciation and various other subtleties of speech.
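To make the "speech examples plus transcripts" idea concrete, here is a minimal Python sketch of what such paired training data could look like. It is only an illustration, not Google's code: the class name, field names and file paths are all made up for this example. Note what is missing from each record: there are no hand-written prosody, intonation or pronunciation labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    """One (transcript, recording) pair. Field names are illustrative only."""
    transcript: str   # the written text
    audio_path: str   # path to the matching speech recording (hypothetical)


def usable_examples(examples: List[TrainingExample]) -> List[TrainingExample]:
    """Keep only pairs that have both a non-empty transcript and an audio path."""
    return [
        ex for ex in examples
        if ex.transcript.strip() and ex.audio_path.strip()
    ]


# Made-up sample data for illustration.
examples = [
    TrainingExample("Hello there.", "clips/0001.wav"),
    TrainingExample("", "clips/0002.wav"),  # no transcript, so it is dropped
    TrainingExample("Nice to meet you.", "clips/0003.wav"),
]

pairs = usable_examples(examples)
print(len(pairs))  # 2
```

Everything the model learns about how words should sound has to come from pairs like these, which is why a recording without a transcript is of no use here.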

Problems with Tacotron 2:

Google has even released recordings of Tacotron 2’s speech and had them evaluated, with listeners scoring them close to professional recordings. While Tacotron 2 might be the best that is out there when it comes to human-like speech, it still has its problems.

Certain words, such as “merlot” and “decorum”, cannot be pronounced properly by the system. There are also instances when Tacotron 2 makes sounds that are not quite words. Right now, Google cannot yet control the feeling behind the generated speech, such as making it sound happy or sad. Another limitation that Google has to work on is that Tacotron 2 cannot generate speech in real time.

While these areas pose research problems, they also give Google the opportunity to look for different solutions. By creating a system that is quite basic and does not require a whole book of speech rules to be fed in just to get more natural speech, Google paves the way for other, simpler models that give the right outcome.

Tacotron 2 also eliminates the need to update rules in the system as and when deficiencies are discovered. It is a simple system that takes its cues from read speech to identify the various rules of speech. So it is basically a self-taught speech model.