Zero-shot text-to-speech aims to synthesize voices from unseen speech prompts. Previous large-scale multispeaker TTS models have successfully achieved this goal with an enrolled recording of 10 seconds or less. However, most of them are designed to utilize only short speech prompts. The limited information in short speech prompts significantly hinders the performance of fine-grained identity imitation.
In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model that is capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches, and 2) train a prosody language model (PLM) with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple PLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to introduce in-context learning capabilities to duration modeling.
Experiments demonstrate that our method can not only synthesize identity-preserving speech with a short prompt of an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found on our demo page.
Hey! We are the authors of Mega-TTS 2! Here we clone our own voices in a zero-shot manner with Mega-TTS 2.
Name | Prompt | ZH 1 | ZH 2 | ZH 3 | ZH 4 | ZH->EN 1 | ZH->EN 2 |
---|---|---|---|---|---|---|---|
Ziyue Jiang | | | | | | | |
Jinglin Liu | | | | | | | |
Yi Ren | | | | | | | |
Jinzheng He | | | | | | | |
Chen Zhang | | | | | | | |
Chunfeng Wang | | | | | | | |
Here we interpolate the prosody information from the Prompt for PLM-1 to the Prompt for PLM-2 with the temperature γ.
[*] Please note that the timbre of the synthesized audio here comes only from the Prosody Prompt for PLM-1 (target timbre).
Name | Setting | Prosody Prompt for PLM-1 (target timbre) | Prosody Prompt for PLM-2 | γ = 0 | γ = 0.8 |
---|---|---|---|---|---|
Ziyue Jiang | Poem Reading | | | | |
Chen Zhang | Rhythm Enhancing | | | | |
Chen Zhang | Cross-Lingual Rhythm Enhancing | | | | |
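
The prosody interpolation described above can be illustrated with a minimal sketch. It assumes that each PLM emits logits over a shared set of discrete prosody codes at every decoding step, and that the two resulting probability distributions are mixed linearly with weight γ, so that γ = 0 reproduces the PLM-1 distribution. The linear mixing rule and the function names `softmax`, `interpolate_prosody_probs`, and `sample_prosody_code` are assumptions made for illustration and are not taken from the Mega-TTS 2 implementation.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_prosody_probs(logits_plm1, logits_plm2, gamma):
    """Mix the per-step output distributions of two PLMs.

    Assumed rule (hypothetical): p = (1 - gamma) * p1 + gamma * p2,
    so gamma = 0 keeps only the PLM-1 distribution.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(logits_plm1) != len(logits_plm2):
        raise ValueError("both PLMs must share the same prosody codebook size")
    p1 = softmax(logits_plm1)
    p2 = softmax(logits_plm2)
    return [(1.0 - gamma) * a + gamma * b for a, b in zip(p1, p2)]


def sample_prosody_code(probs, rng):
    """Draw one prosody code index from the mixed distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Toy logits over a 4-entry prosody codebook (made-up values).
    step_logits_1 = [2.0, 0.5, 0.1, -1.0]
    step_logits_2 = [-1.0, 0.2, 1.5, 2.5]
    rng = random.Random(0)
    for gamma in (0.0, 0.8):
        mixed = interpolate_prosody_probs(step_logits_1, step_logits_2, gamma)
        code = sample_prosody_code(mixed, rng)
        print(gamma, [round(p, 3) for p in mixed], code)
```

In an auto-regressive setting, a mixing step of this kind would be applied at every decoding step before sampling the next prosody code. The timbre is unaffected under this sketch because only the prosody distributions are mixed.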
We list the speech examples used in our subjective evaluations here.
In this section, we use only a one-sentence speech prompt to compare the speech naturalness of these systems.
Text | Speaker Prompt | Ground Truth | YourTTS | VALL-E | Mega-TTS | Mega-TTS 2 |
---|---|---|---|---|---|---|
He was in deep converse with the clerk and entered the hall holding him by the arm. | | | | | | |
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech. | | | | | | |
Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid. | | | | | | |
The army found the people in poverty and left them in comparative wealth. | | | | | | |
In this experiment, we present speech samples generated by Mega-TTS 2 with prosody prompts of different lengths.
Text | Prompt Example | 5 sentences | 20 sentences |
---|---|---|---|
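洗一洗,搓一搓,手心手背别忘记。 (Wash, scrub, and don't forget your palms and the backs of your hands.) | | | |
好的,信息已经核对成功,请您及时留意系统下发的通知短信和产品配送物流快递单号,感谢您的订购,祝您生活愉快,再见! (OK, your information has been verified. Please watch for the notification text message from the system and the delivery tracking number for your product. Thank you for your order, have a nice day, goodbye!) | | | |
关于直播问题呢,我可以把电话转给我们的销售经理。 (As for the livestreaming question, I can transfer the call to our sales manager.) | | | |
您看,您这个行业的很多客户,都通过线上推广实现销量增长了。 (You see, many customers in your industry have grown their sales through online promotion.) | | | |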