
Notes on Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

This is a summary of an important research paper, offering roughly a 12:1 time savings over reading the original. It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.

Published
4 min read

Link to paper: https://arxiv.org/abs/2307.07218

Paper published on: 2023-07-14

Paper's authors: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

GPT3 API Cost: $0.02

GPT4 API Cost: $0.11

Total Cost To Write This: $0.13

Time Savings: 12:1

The TLDR:

This paper introduces a model called Mega-TTS 2 that can make a computer talk like a specific person. It learns how a person speaks by listening to recordings of their speech, then generates new speech that sounds just like them. The model has several parts that work together. One part, the Prosody Language Model (PLM), captures the rhythm and melody of a person's speech. Another, the Auto-regressive Duration Model (ADM), predicts how long each sound should last. The model also uses a technique called prosody interpolation to make the speech more expressive. It is first trained on a big dataset and then fine-tuned on a smaller dataset of the target speaker. The result is speech that closely matches the target person, which could be used to make voice assistants sound like a specific individual, create audiobooks in the author's voice, or help people learn different accents and speaking styles.

The Deeper Dive:

Introduction and Summary

In the realm of text-to-speech synthesis, the paper introduces a model named Mega-TTS 2, designed for zero-shot text-to-speech synthesis. The novelty of the model lies in its ability to utilize speech prompts of arbitrary lengths, a feature that sets it apart from existing models. The model comprises three encoders, a prosody language model (PLM), a mel decoder, and a discriminator. The PLM, a transformer-based architecture, is designed to capture the speaker's prosody habits from prompts of any length.
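To make the composition concrete, here is a minimal structural sketch of how the components might be wired together at inference time. The component names come from the paper, but every function body below is a hypothetical stand-in (the real components are neural networks, and the discriminator is omitted because it is only used during training):

```python
# Hypothetical stand-ins for each learned component; these just pass
# simple values through so the wiring is visible.
def encode_content(text):            # content encoder: text -> unit sequence
    return list(text)

def encode_timbre(speech):           # timbre encoder: speech -> voice identity
    return sum(speech) / len(speech)

def encode_prosody(speech):          # VQ prosody encoder: speech -> discrete codes
    return [int(x * 4) % 4 for x in speech]

def plm_generate(prosody_prompt, content):  # PLM: extend codes to cover content
    return (prosody_prompt * len(content))[:len(content)]

def mel_decode(content, timbre, prosody):   # mel decoder: one "frame" per unit
    return [(timbre, p) for p in prosody]

def synthesize(text, prompt_speech):
    """Wire the components together as described in the paper."""
    content = encode_content(text)
    timbre = encode_timbre(prompt_speech)
    prosody_prompt = encode_prosody(prompt_speech)
    prosody = plm_generate(prosody_prompt, content)
    return mel_decode(content, timbre, prosody)
```

The key design point is the separation of concerns: timbre (who is speaking) and prosody (how they speak) are encoded independently, which is what allows prosody to be transferred or interpolated without changing the speaker's voice.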

Let's take an example. Consider a scenario where you want to synthesize a speech that mimics the speaking style of a particular person, say, a famous public speaker. The PLM in Mega-TTS 2 can take in hours of the speaker's speeches, capturing the unique prosody habits, and then generate speech that not only sounds like the speaker but also mimics their prosody style.

Prosody Language Model (PLM)

The PLM is a critical component of the Mega-TTS 2 model. It generates compressed discrete prosody codes auto-regressively. The PLM takes random sentences from the same speaker as input and predicts the prosody latent code. This code is then used to guide the prosody generation during synthesis. Training the PLM with arbitrary-length speech prompts allows the model to effectively capture the prosody information from longer prompts, leading to improved speech quality and naturalness.
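The auto-regressive generation of discrete prosody codes can be sketched as a simple decode loop. This is a toy illustration, not the paper's implementation: the real PLM is a transformer, and `next_code_logits` below is a deterministic stub standing in for it; `CODEBOOK_SIZE` is a hypothetical codebook size:

```python
import numpy as np

CODEBOOK_SIZE = 8  # hypothetical number of discrete prosody codes

def next_code_logits(prompt_codes, generated):
    """Stand-in for the transformer: scores each candidate next code
    given the prompt and everything generated so far."""
    rng = np.random.default_rng(len(generated))  # deterministic for the demo
    return rng.normal(size=CODEBOOK_SIZE)

def generate_prosody_codes(prompt_codes, num_steps):
    """Greedily decode one discrete prosody code at a time."""
    generated = []
    for _ in range(num_steps):
        logits = next_code_logits(prompt_codes, generated)
        generated.append(int(np.argmax(logits)))  # greedy choice
    return generated

codes = generate_prosody_codes(prompt_codes=[3, 1, 4], num_steps=5)
```

Because the prompt enters only through conditioning (not a fixed-size input), the same loop works whether `prompt_codes` came from one sentence or from hours of speech, which is what "arbitrary-length speech prompts" buys.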

Auto-regressive Duration Model (ADM)

Another key component of the Mega-TTS 2 model is the Auto-regressive Duration Model (ADM). The ADM is designed to enhance the duration modeling by incorporating in-context learning capabilities. It takes the phoneme sequence and the prosody latent code as input and predicts the duration of each phoneme. By considering the context information from the phoneme sequence and the prosody latent code, the ADM can better capture the dependencies between phonemes and generate more natural and expressive speech.
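The ADM's behavior can be sketched as a per-phoneme decode loop that conditions on the prosody codes and on its own previous predictions (the in-context, auto-regressive part). The arithmetic inside the loop is a stand-in for a learned network, and the specific conditioning scheme here is an assumption for illustration:

```python
def predict_durations(phonemes, prosody_codes):
    """Auto-regressively predict a duration (in frames) for each phoneme,
    conditioning on the prosody codes and on previously predicted durations."""
    durations = []
    for i, ph in enumerate(phonemes):
        prev = durations[-1] if durations else 4          # auto-regressive context
        code = prosody_codes[i % len(prosody_codes)]       # prosody conditioning
        dur = 3 + (ord(ph[0]) + code + prev) % 4           # stub for the network
        durations.append(dur)
    return durations
```

The important property is that each phoneme's duration depends on what came before it, which is how the model captures dependencies between phonemes rather than predicting each duration in isolation.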

Prosody Interpolation

The model also introduces a prosody interpolation technique to improve the expressiveness of speech from speakers with relatively flat speaking tones. This technique leverages the probabilities derived from multiple PLM outputs to produce expressive and controlled prosody. By interpolating these probabilities, the model can generate speech with different prosody styles while maintaining the target speaker's timbre, enhancing the flexibility and diversity of the synthesized speech.
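The core of the interpolation is a weighted blend of next-code probability distributions from two PLM passes, one conditioned on the flat target speaker and one on a more expressive prompt. The weighting convention below (`alpha=1` keeps the target's prosody, `alpha=0` fully adopts the expressive one) is an assumption for illustration:

```python
import numpy as np

def interpolate_prosody_probs(p_target, p_expressive, alpha):
    """Blend two next-code distributions and renormalize so the
    result is still a valid probability distribution."""
    mixed = alpha * np.asarray(p_target, float) + (1 - alpha) * np.asarray(p_expressive, float)
    return mixed / mixed.sum()
```

Because only the prosody-code distribution is interpolated, and timbre is encoded separately, the blended speech keeps the target speaker's voice while borrowing expressiveness.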

Training

The training of Mega-TTS 2 follows a two-step process: pretraining and fine-tuning. During pretraining, a large-scale dataset is used to train the content encoder, the timbre encoder, and the VQ prosody encoder. The content and timbre encoders are trained to minimize the reconstruction loss between the original and the synthesized mel-spectrograms. The VQ prosody encoder is trained to minimize the quantization loss between the continuous prosody representation and the discrete prosody codes.
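The vector-quantization step can be sketched directly: each continuous prosody vector is snapped to its nearest codebook entry, and the quantization loss is the squared distance between the two. This is a generic VQ sketch under the usual nearest-neighbor formulation, not code from the paper:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous prosody vector to its nearest codebook entry.
    Returns the discrete codes, the quantized vectors, and the mean
    squared quantization loss between continuous and quantized vectors."""
    z = np.asarray(z, float)                  # (num_frames, dim)
    codebook = np.asarray(codebook, float)    # (codebook_size, dim)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)              # nearest entry per frame
    quantized = codebook[codes]
    q_loss = float(((z - quantized) ** 2).mean())
    return codes, quantized, q_loss
```

The discrete `codes` are what the PLM later models auto-regressively; minimizing `q_loss` keeps the codebook entries close to the continuous prosody representations they replace.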

After pretraining, the entire model is fine-tuned using a smaller dataset of the target speaker. Additional losses are introduced to enforce the speaker similarity and improve the naturalness of the synthesized speech. Specifically, a speaker similarity loss is introduced that encourages the synthesized speech to sound similar to the target speaker. An adversarial loss is also introduced that encourages the synthesized speech to be indistinguishable from the real speech.
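A plausible shape for the combined fine-tuning objective is a weighted sum of the three terms described above. The specific forms here (MSE reconstruction, one-minus-cosine speaker similarity on embeddings, and a non-saturating adversarial term) and the weights are assumptions for illustration, not the paper's exact losses:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def finetune_loss(mel_pred, mel_ref, spk_pred, spk_ref, disc_score,
                  w_sim=1.0, w_adv=0.5):
    """Reconstruction + speaker-similarity + adversarial terms.
    disc_score is the discriminator's probability that the synthesized
    speech is real; the generator wants to push it toward 1."""
    recon = float(np.mean((np.asarray(mel_pred, float) - np.asarray(mel_ref, float)) ** 2))
    sim_loss = 1.0 - cosine_similarity(spk_pred, spk_ref)  # 0 when embeddings match
    adv_loss = -np.log(disc_score + 1e-8)                   # small when disc is fooled
    return recon + w_sim * sim_loss + w_adv * adv_loss
```

When the synthesized mel-spectrogram matches the reference, the speaker embeddings align, and the discriminator is fully fooled, all three terms go to (approximately) zero.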

Results and Potential Applications

The paper demonstrates that Mega-TTS 2 outperforms state-of-the-art zero-shot TTS systems in speaker similarity and speech naturalness when using a one-sentence speech prompt, and that performance improves significantly as the prompt is lengthened.

This research could pave the way for creating more realistic and natural-sounding text-to-speech systems. For instance, it could be used to develop voice assistants that can mimic the speech patterns of a specific individual, or for creating audiobooks in the voice of the author. The model could also be used in language learning applications to generate speech in different accents or speaking styles.