MegaTTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

¹Zhejiang University, ²ByteDance
*Equal Contribution

Abstract

Zero-shot text-to-speech (TTS) aims to synthesize voices from unseen speech prompts. Previous large-scale multispeaker TTS models have achieved this goal with an enrolled recording of 10 seconds or less. However, most of them are designed to use only short speech prompts. The limited information in short speech prompts significantly hinders fine-grained identity imitation.

In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model that is capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches, and 2) train a prosody language model (PLM) with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple PLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to introduce in-context learning capabilities to duration modeling.

Experiments demonstrate that our method can not only synthesize identity-preserving speech with a short prompt from an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found on our demo page.

Model Overview

The overall architecture of Mega-TTS 2. MRTE denotes the multi-reference timbre encoder and GE denotes the global timbre encoder. Subfigure (c) illustrates the training process of the prosody language model (PLM), which generates the prosody latent code extracted from random sentences of the same speaker in an auto-regressive manner. We train the auto-regressive duration model (ADM) in the same way as PLM, but we use mean squared error loss instead.
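
The contrast between the two training objectives in the caption can be illustrated with a toy sketch. It assumes the PLM is trained with next-code cross-entropy over discrete prosody codes (the caption only states that the ADM uses mean squared error instead) and that both models are trained with teacher forcing. The function names and the toy predictors are hypothetical and are not part of Mega-TTS 2.

```python
import math

def plm_loss(predict_probs, codes):
    """Average next-code cross-entropy under teacher forcing (assumed PLM objective).

    predict_probs(prefix) must return a probability list over code indices.
    """
    total = 0.0
    for t in range(1, len(codes)):
        probs = predict_probs(codes[:t])      # condition on the ground-truth prefix
        total += -math.log(probs[codes[t]])   # penalize low probability of the true code
    return total / (len(codes) - 1)

def adm_loss(predict_duration, durations):
    """Average squared error of next-duration predictions under teacher forcing."""
    total = 0.0
    for t in range(1, len(durations)):
        predicted = predict_duration(durations[:t])
        total += (predicted - durations[t]) ** 2
    return total / (len(durations) - 1)

# Hypothetical stand-ins for trained models.
def uniform_plm(prefix, vocab_size=4):
    return [1.0 / vocab_size] * vocab_size

def persistence_adm(prefix):
    return prefix[-1]  # predict that the next phoneme lasts as long as the previous one

if __name__ == "__main__":
    print(plm_loss(uniform_plm, [0, 1, 2, 3]))
    print(adm_loss(persistence_adm, [3.0, 5.0, 4.0]))
```

In this sketch the only difference between the two objectives is the per-step loss: a classification loss over code indices for prosody, and a regression loss over scalar durations.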

Cloning Our Voices

Hey! We are the authors of Mega-TTS 2! Here we clone our voices in a zero-shot manner with MegaTTS 2.

Transcriptions:

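  • ZH 1: 我们主要提供装修的服务,您有装修方面的需求吗? (We mainly provide renovation services. Do you have any renovation needs?)
  • ZH 2: 您看,您这个行业的很多客户,都通过线上推广实现销量增长了。 (You see, many customers in your industry have increased their sales through online promotion.)
  • ZH 3: 嗯那麻烦问一下,您这边是做什么行业的呢? (Um, may I ask what industry you are in?)
  • ZH 4: 非常抱歉,也就是想了解一下您有多大的可能会将我们的广告服务推荐给其他人? (Very sorry, I just wanted to find out how likely you are to recommend our advertising service to others?)
  • ZH -> EN 1: Need a laugh? We found the funniest jokes around to tell all of your friends and family.
  • ZH -> EN 2: You'll be sure to brighten someone's day when you unleash a hilarious joke when they least expect it.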
Name | Prompt | ZH 1 | ZH 2 | ZH 3 | ZH 4 | ZH -> EN 1 | ZH -> EN 2

Ziyue Jiang

Jinglin Liu

Yi Ren

Jinzheng He

Chen Zhang

Chunfeng Wang

Prosody Interpolation

Here we interpolate the prosody information from the Prompt for PLM-1 to the Prompt for PLM-2 with the temperature γ.

[*] Please note that the timbre of the synthesized audio here comes only from the Prosody Prompt for PLM-1 (target timbre).
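
The interpolation described above can be sketched as mixing the next-code distributions produced by two PLM runs. This is a minimal sketch under the assumption that the mix is a convex combination in which γ is the weight on PLM-2, so that γ = 0 reduces to PLM-1 alone. The function names and toy logits are hypothetical and do not come from the Mega-TTS 2 implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate_prosody(logits_plm1, logits_plm2, gamma):
    """Mix two next-code distributions: (1 - gamma) * p1 + gamma * p2 (assumed form)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(logits_plm1) != len(logits_plm2):
        raise ValueError("both PLM outputs must cover the same code vocabulary")
    p1 = softmax(logits_plm1)
    p2 = softmax(logits_plm2)
    return [(1.0 - gamma) * a + gamma * b for a, b in zip(p1, p2)]

if __name__ == "__main__":
    # Toy logits over a 3-code vocabulary for a single decoding step.
    step_plm1 = [2.0, 0.0, -1.0]
    step_plm2 = [-1.0, 0.0, 2.0]
    for gamma in (0.0, 0.8):
        print(gamma, interpolate_prosody(step_plm1, step_plm2, gamma))
```

In an auto-regressive decoder, a mix like this would be applied at every step before sampling the next prosody code, so the generated prosody shifts toward the second prompt as γ grows.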

Name | Setting | Prosody Prompt for PLM-1 (target timbre) | Prosody Prompt for PLM-2 | γ = 0 | γ = 0.8

Ziyue Jiang

Poem Reading

Chen Zhang

Rhythm Enhancing

Chen Zhang

Cross-Lingual

Rhythm Enhancing

LibriSpeech TTS Samples

We list the speech examples used in our subjective evaluations here.

In this section, we use only a one-sentence speech prompt to compare the speech naturalness of these systems.

Text | Speaker Prompt | Ground Truth | YourTTS | VALL-E | Mega-TTS | Mega-TTS 2

He was in deep converse with the clerk and entered the hall holding him by the arm.

Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.

The army found the people in poverty and left them in comparative wealth.

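Expanding the Prompt Length

In this experiment, we present speech samples generated by Mega-TTS 2 with prosody prompts of different lengths.

Text | Prompt Example | 5 sentences | 20 sentences

洗一洗,搓一搓,手心手背别忘记。 (Wash and scrub, and don't forget your palms and the backs of your hands.)

好的,信息已经核对成功,请您及时留意系统下发的通知短信和产品配送物流快递单号,感谢您的订购,祝您生活愉快,再见! (OK, your information has been verified. Please watch for the notification text message sent by the system and the delivery tracking number for your product. Thank you for your order, have a nice day, and goodbye!)

关于直播问题呢,我可以把电话转给我们的销售经理。 (As for the live-streaming question, I can transfer the call to our sales manager.)

您看,您这个行业的很多客户,都通过线上推广实现销量增长了。 (You see, many customers in your industry have increased their sales through online promotion.)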