MegaTTS 2

MegaTTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

¹Zhejiang University, ²ByteDance

*Equal Contribution

Abstract

Zero-shot text-to-speech aims to synthesize voices from unseen speech prompts. Previous large-scale multispeaker TTS models have successfully achieved this goal with an enrolled recording of less than 10 seconds. However, most of them are designed to utilize only short speech prompts. The limited information in short speech prompts significantly hinders the performance of fine-grained identity imitation.

In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model that is capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches, and 2) train a prosody language model (PLM) with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple PLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to introduce in-context learning capabilities to duration modeling.

Experiments demonstrate that our method can not only synthesize identity-preserving speech with a short prompt of an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found on our demo page.

Cloning Our Voices

Hey! We are the authors of Mega-TTS 2! Here we clone our voices in a zero-shot manner with MegaTTS 2.

Transcriptions:

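ZH 1: 我们主要提供装修的服务，您有装修方面的需求吗？ (We mainly provide renovation services. Do you have any renovation needs?)

ZH 2: 您看，您这个行业的很多客户，都通过线上推广实现销量增长了。 (You see, many customers in your industry have increased their sales through online promotion.)

ZH 3: 嗯那麻烦问一下，您这边是做什么行业的呢？ (Um, may I ask what industry you are in?)

ZH 4: 非常抱歉，也就是想了解一下您有多大的可能会将我们的广告服务推荐给其他人？ (I'm very sorry, I would just like to know how likely you are to recommend our advertising service to others?)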

ZH -> EN 1: Need a laugh? We found the funniest jokes around to tell all of your friends and family.

ZH -> EN 2: You'll be sure to brighten someone's day when you unleash a hilarious joke when they least expect it.

Columns: Name, Prompt, ZH 1, ZH 2, ZH 3, ZH 4, ZH->EN 1, ZH->EN 2

Ziyue Jiang

Jinglin Liu

Yi Ren

Jinzheng He

Chen Zhang

Chunfeng Wang

Prosody Interpolation

Here we interpolate the prosody from Prosody Prompt for PLM-1 toward Prosody Prompt for PLM-2, controlled by the temperature γ.

[*] Please note that the timbre of the synthesized audio here only comes from Prosody Prompt for PLM-1 (target timbre).
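As a rough illustration of how probabilities from two PLM outputs could be combined, the sketch below mixes two next-token distributions over prosody codes and samples from the result. This page does not state the exact interpolation rule, so the linear mixture `(1 - γ) · p1 + γ · p2` is an assumption, as are the function names `softmax`, `interpolate_prosody`, and `sample_code` and the toy logits. Under this assumption, γ = 0 uses only the distribution conditioned on the PLM-1 prompt.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_prosody(logits_plm1, logits_plm2, gamma):
    """Mix two prosody-code distributions (assumed linear rule, not the paper's)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(logits_plm1) != len(logits_plm2):
        raise ValueError("both PLM outputs must cover the same codebook")
    p1 = softmax(logits_plm1)  # conditioned on the PLM-1 prompt (target timbre)
    p2 = softmax(logits_plm2)  # conditioned on the PLM-2 prompt
    return [(1.0 - gamma) * a + gamma * b for a, b in zip(p1, p2)]


def sample_code(probs, rng):
    """Draw one prosody-code index from the mixed distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy logits over a hypothetical 4-entry prosody codebook.
    logits_1 = [2.0, 0.5, -1.0, 0.0]
    logits_2 = [-1.0, 0.0, 2.5, 0.5]
    for gamma in (0.0, 0.8):
        probs = interpolate_prosody(logits_1, logits_2, gamma)
        print(gamma, [round(p, 3) for p in probs], sample_code(probs, rng))
```

In an auto-regressive setting this mixing would be repeated at every decoding step, with both PLM passes sharing the codes generated so far.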

Columns: Name, Setting, Prosody Prompt for PLM-1 (target timbre), Prosody Prompt for PLM-2, γ = 0, γ = 0.8

Ziyue Jiang: Poem Reading

Chen Zhang: Rhythm Enhancing

Chen Zhang: Cross-Lingual Rhythm Enhancing

LibriSpeech TTS Samples

We list the speech examples used in our subjective evaluations here.

In this section, we use only a one-sentence speech prompt to compare the speech naturalness of these systems.

Columns: Text, Speaker Prompt, Ground Truth, YourTTS, VALL-E, Mega-TTS, Mega-TTS 2

He was in deep converse with the clerk and entered the hall holding him by the arm.

Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.

The army found the people in poverty and left them in comparative wealth.

Expanding the Prompt Length

In this experiment, we present speech samples generated by MegaTTS 2 with prosody prompts of different lengths.

Columns: Text, Prompt Example, 5 sentences, 20 sentences

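洗一洗，搓一搓，手心手背别忘记。 (Wash and scrub, and don't forget your palms and the backs of your hands.)

好的，信息已经核对成功，请您及时留意系统下发的通知短信和产品配送物流快递单号，感谢您的订购，祝您生活愉快，再见！ (OK, your information has been verified. Please watch for the notification text message from the system and the delivery tracking number for your product. Thank you for your order, have a nice day, goodbye!)

关于直播问题呢，我可以把电话转给我们的销售经理。 (Regarding the live-streaming question, I can transfer the call to our sales manager.)

您看，您这个行业的很多客户，都通过线上推广实现销量增长了。 (You see, many customers in your industry have increased their sales through online promotion.)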
