Amazon researchers have trained the largest text-to-speech model ever made, which they say exhibits "emergent" qualities that improve its ability to speak complex sentences naturally. The breakthrough could be what the technology needs to escape the uncanny valley.
These models were always going to grow and improve, but the researchers specifically hoped to see the kind of leap in capability we observed when language models grew past a certain size. For reasons unknown to us, once LLMs pass that point, they become much more robust and versatile, capable of performing tasks they weren't trained for.
That doesn't mean they gain sentience or anything, just that beyond a certain point their performance on certain conversational AI tasks hockey-sticks upward. The Amazon AGI team (the name tells you what they're aiming for) thought the same thing might happen as text-to-speech models grew, and their research suggests that is in fact the case.
The new model is called Big Adaptive Streamable TTS with Emergent abilities, which they converted into the acronym BASE TTS. The largest version of the model was trained on 100,000 hours of public-domain speech, 90% of which is in English, with the remainder in German, Dutch and Spanish.
At 980 million parameters, BASE-large appears to be the largest model in this category. They also trained 400M- and 150M-parameter models on 10,000 and 1,000 hours of audio, respectively, for comparison. The idea is that if one of these models shows emergent behaviors and another doesn't, you have a range for where those behaviors begin to emerge.
As it turns out, the medium-sized model showed the jump in capability the team was looking for, not necessarily in the quality of ordinary speech (it is rated better, but only by a couple of points) but in the set of emergent abilities they observed and measured. Here are examples of tricky texts from the paper:
- Compound nouns: The Beckhams have decided to rent a charming, quaint stone vacation home.
- Emotions: “Oh my God! Are we really going to the Maldives? It’s amazing!” Jennie screamed, bouncing on her tiptoes with uncontrolled glee.
- Foreign words: "Mr. Henry, renowned for his mise en place, orchestrated a seven-course meal, with each course a pièce de résistance."
- Paralinguistics (i.e. readable non-words): "Shh, Lucy, shh, we mustn't wake your little brother," Tom whispered as they tiptoed past the nursery.
- Punctuation: She received a strange text from her brother: "Emergency @ home; call ASAP! Mom & Dad are worried…#familymatters."
- Questions: But the Brexit question remains: after all the trials and tribulations, will ministers find the answers in time?
- Syntactic complexities: The film that De Moya, who recently received the Lifetime Achievement Award, starred in 2022 was a box office success, despite mixed reviews.
"These sentences are designed to contain challenging tasks: parsing garden-path sentences, placing stress on long-winded compound nouns, producing emotional or whispered speech, or producing the correct phonemes for foreign words like 'qi' or punctuation like '@', none of which BASE TTS is explicitly trained to perform," the authors write.
Such features normally trip up text-to-speech engines, causing them to mispronounce, skip words, use strange intonation, or make other errors. BASE TTS still had its problems, but it fared much better than its contemporaries, models like Tortoise and VALL-E.
There are many examples of these difficult texts being spoken quite naturally by the new model on the site the researchers made for it. Of course, these were chosen by the researchers, so they're necessarily cherry-picked, but it's impressive nonetheless.
Given that all three BASE TTS models share an architecture, it seems clear that the model's size and the extent of its training data are what enable it to handle some of the complexities above. Keep in mind that this is still an experimental model and process, not a commercial model or anything. Future research will need to identify the inflection point for emergent capability and how to train and deploy the resulting model efficiently.
Notably, this model is "streamable," as the name suggests, meaning it doesn't need to generate entire sentences at once but instead proceeds moment to moment at a relatively low bitrate. The team also attempted to package vocal metadata such as emotionality, prosody, and so on into a separate low-bandwidth stream that could accompany the vanilla audio.
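To make the idea of streamable generation concrete, here is a minimal, hypothetical sketch of the pattern: playback begins as soon as the first audio chunks arrive instead of waiting for the whole utterance. BASE TTS itself hasn't been released, so `model.stream(text)` and `play_chunk` below are assumptions used purely for illustration, not Amazon's actual API.

```python
import queue
import threading

def stream_tts(model, text, play_chunk):
    """Generate and play speech incrementally instead of all at once.

    `model` is any hypothetical incremental TTS engine whose .stream()
    method yields short audio chunks (e.g. ~100 ms of PCM samples) as it
    decodes; `play_chunk` sends one chunk to an audio output device.
    """
    chunks = queue.Queue(maxsize=8)  # small buffer smooths out decoding jitter

    def producer():
        for chunk in model.stream(text):  # yields audio as it is decoded
            chunks.put(chunk)
        chunks.put(None)  # sentinel: generation finished

    # Decode in the background while the main thread plays audio.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        play_chunk(chunk)  # playback starts long before the sentence ends
```

The point of the design is latency: a low-bitrate, chunk-by-chunk stream lets a voice assistant start speaking almost immediately, while a side channel of metadata (emotion, prosody) could ride along without inflating the audio stream itself.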
It looks like text-to-speech could have its breakout moment in 2024, just in time for the election! But there's no denying this technology's usefulness, particularly for accessibility. The team notes that it declined to release the model's source and other data due to the risk of bad actors taking advantage of it. But the cat will get out of the bag eventually.