Mega-TTS
Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Abstract
Scaling text-to-speech to large, in-the-wild datasets has proven highly effective for generalizing timbre and speech style, particularly in zero-shot TTS. However, previous works usually encode speech into latents with an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each of them should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design a novel, large zero-shot TTS system called Mega-TTS, which is trained on large-scale in-the-wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we keep the spectrogram, since it separates phase from the other attributes very well. Phase can be appropriately constructed by the GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre with global vectors, since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module.
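To make point 1) concrete, the following is a minimal sketch of how a spectrogram separates magnitude from phase. It is not the Mega-TTS front end: it assumes a plain Hann-windowed STFT rather than whatever spectrogram features the system actually uses, and the helper `stft`, the frame length, the hop size, and the 440 Hz test tone are all hypothetical choices made for illustration.

```python
import numpy as np

def stft(signal, frame_len=256, hop=64):
    """Hypothetical helper: Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One row per frame, one column per non-negative frequency bin.
    return np.fft.rfft(frames, axis=1)

# Assumed test signal: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)

spec = stft(wave)

# The complex STFT factors into two separate attributes.
magnitude = np.abs(spec)   # the part a spectrogram-based acoustic model would predict
phase = np.angle(spec)     # the part left for a vocoder to construct

# Together, magnitude and phase recover the complex STFT.
recon = magnitude * np.exp(1j * phase)
print(spec.shape, np.allclose(recon, spec))
```

In this sketch the magnitude array alone carries the spectral envelope and pitch information, while the phase array is kept apart, which mirrors the division of labor described above between the acoustic model and the vocoder.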
[*] This page is for research demonstration purposes only.
Zero-Shot TTS Samples
Text | Speaker Prompt | Ground Truth | YourTTS | VALL-E | Mega-TTS |
---|---|---|---|---|---|
He was in deep converse with the clerk and entered the hall holding him by the arm. | | | | | |
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech. | | | | | |
Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid. | | | | | |
The army found the people in poverty and left them in comparative wealth. | | | | | |
So what is the campaign about? | | | | | |
Nothing is yet confirmed. | | | | | |
Her husband was very concerned that it might be fatal. | | | | | |
We've made a couple of albums. | | | | | |
Speech Editing Samples
Text | Original Speech | EditSpeech | A3T | Mega-TTS |
---|---|---|---|---|
However, there is an issue, isn't there? --> However, there is an obvious issue, isn't there? | | | | |
Others have tried to explain the phenomenon physically. --> Others have tried to explain the rare phenomenon for them physically. | | | | |
Well, he can do it. --> Well, he is able to do it. | | | | |
Throughout the centuries people have explained the rainbow in various ways. --> Throughout the centuries people have constantly explained the rainbow phenomenon in various ways. | | | | |
She has nothing to say to journalists. --> She has nothing to communicate with journalists. | | | | |
I am confident of the outcome this week. --> I am confident of the midsemester outcome at this week. | | | | |
Something happened on that island. --> Something wrong suddenly happened on that island. | | | | |
You probably have never seen them before. --> You probably have worked with them before. | | | | |
Cross-Lingual TTS Samples
English Text | Chinese Speaker Prompt | YourTTS | VALL-E X | Mega-TTS |
---|---|---|---|---|
He honours whatever he recognizes in himself, such morality equals self-glorification. | | | | |
There could be little art in this last and final round of fencing. | | | | |
It's the first time Hilda has been to our house and Tom introduces her around. | | | | |
It was youth and poverty and proximity and everything was young and kindly. | | | | |
Robustness Test Samples
Text | Tacotron | Mega-TTS |
---|---|---|
See owned a saw and Mr Soar owned a seesaw. Now See’s saw sawed Soar’s seesaw before Soar saw See. | | |
forty one to five three hundred and eleven Fail - one - one to zero two Cancelled - zero - zero to zero zero Total. | | |
Thursday, via a joint press release and Microsoft speech Blog, we will announce Microsoft’s continued partnership with Shell leveraging cloud, speech, and collaboration technology to drive industry innovation and transformation. | | |
The great Greek grape growers grow great Greek grapes one one one. | | |
Zero-Shot TTS for Celebrities and Game Characters
We use speech prompts from the following celebrities and game characters to synthesize the sentence: "Good afternoon everyone. Today, we are super excited to introduce you all to Introduction to Deep Learning, the course of Carnegie Mellon University. In the first part of the course, we will talk about the generative deep learning that are used to generate data never existed in reality."
Name | Prompt | YourTTS | Mega-TTS |
---|---|---|---|
Theresa May | | | |
Barack Obama | | | |
Dwarf from Warcraft | | | |