https://speechbot.github.io/ — beautiful, textless NLP project
self supervised learning
representation learning
VQ-VAE
Discrete Latent codes
Self Supervised Learning in Speech
Independently, a recent breakthrough in representation learning has yielded models able to discover discrete units from raw audio without the need for any labeled data.
Connecting these two breakthroughs opens up the possibility of applying language models directly to audio inputs, sidestepping the need for textual resources or Automatic Speech Recognition (ASR) and opening up a new era of textless NLP.
Second, even for resource-rich languages, oral language carries a lot of nuances, intonations (irony, anger, uncertainty, etc.) and expressive vocalizations (laughter, yawning, mouth clicks, etc.) that are not captured by text. Modeling language directly from the audio has the potential of making AI applications more natural and expressive.
SSL in speech -
SSL-based pre-trained models learn representations of phonemes when trained on large amounts of speech data.
These are highly abstract, phoneme-like units that can be obtained directly from speech, without needing to align the audio with the spoken text.
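A minimal sketch of how such units can be obtained in practice, assuming frame-level features from a pretrained SSL encoder are already available (the random array below is only a placeholder for those features, and the cluster count is illustrative):

```python
# Sketch: turn continuous SSL features into discrete "pseudo-phoneme" units by
# clustering frame-level representations (HuBERT-style unit discovery).
# `frames` is a random placeholder for real SSL encoder outputs.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 768)      # placeholder: 1000 frames of 768-dim SSL features
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)

units = kmeans.predict(frames)           # one discrete unit id per frame
print(units[:20])                        # these unit ids play the role of phoneme-like tokens
```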
Another way to get these phonemes is to train a VQ-VAE; the quantized codebook acts as the phoneme inventory in this case.
Audio samples can be reconstructed from a VQ-VAE that compresses the audio input by more than 64× into discrete latent codes.
Both the VQ-VAE and latent space are trained end-to-end without relying on phonemes or information other than the waveform itself.
The discrete latent space captures the important aspects of the audio, such as the content of the speech, in a very compressed symbolic representation. Because of this we can now train another WaveNet on top of these latents which can focus on modeling the long-range temporal dependencies without having to spend too much capacity on imperceptible details.
→ discrete latent space - captures the content of the speech (in a compressed format)
→ train another WaveNet on top of the latents, to model the long-range temporal dependencies
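A minimal sketch of the quantization step inside a VQ-VAE: each encoder frame is snapped to its nearest codebook vector, and the code indices become the discrete latent sequence. The codebook size and latent dimension below are illustrative, not the values from the paper.

```python
# VQ-VAE-style quantization: replace each encoder frame with its nearest
# codebook vector; the code *indices* are the discrete latents.
import torch

codebook = torch.randn(64, 128)             # K learned codebook vectors (random here)
encoder_out = torch.randn(50, 128)          # placeholder: 50 latent frames from the encoder

dists = torch.cdist(encoder_out, codebook)  # (50, 64) pairwise distances
codes = dists.argmin(dim=1)                 # discrete latent codes, one per frame
quantized = codebook[codes]                 # vectors that get passed to the decoder
print(codes[:10])
```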
When we condition the decoder in the VQ-VAE on the speaker-id, we can extract latent codes from a speech fragment and reconstruct with a different speaker-id.
The VQ-VAE never saw any aligned data during training and was always optimizing the reconstruction of the original waveform. These experiments suggest that the encoder has factored out speaker-specific information in the encoded representations, as they have the same meaning across different voice characteristics.
This behaviour arises naturally because the decoder gets the speaker-id for free, so the limited bandwidth of the latent codes gets used for other, speaker-independent, phonetic information.
(it never saw any aligned data, and the encoder encodes information independently of the speaker, so it can focus on just the phonetic content of the voice)
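A toy sketch of that voice-conversion trick, with a stand-in decoder rather than the WaveNet models from the paper: the codes come from one speaker's utterance, and the decoder is conditioned on a different speaker id.

```python
# Toy voice conversion: decode speaker-independent codes while conditioning the
# decoder on a *different* speaker id. The decoder below is a stand-in only.
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, num_codes=64, num_speakers=10, dim=32):
        super().__init__()
        self.code_emb = nn.Embedding(num_codes, dim)
        self.spk_emb = nn.Embedding(num_speakers, dim)  # speaker id is given "for free"
        self.out = nn.Linear(dim, 1)                    # toy waveform head

    def forward(self, codes, speaker_id):
        h = self.code_emb(codes) + self.spk_emb(speaker_id)  # condition every frame on the speaker
        return self.out(h).squeeze(-1)

codes = torch.randint(0, 64, (1, 50))           # codes extracted from speaker A's utterance
wav = ToyDecoder()(codes, torch.tensor([3]))    # re-synthesize the same content as speaker 3
print(wav.shape)
```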
In the paper we show that the latent codes discovered by the VQ-VAE are actually very closely related to the human-designed alphabet of phonemes.
{ ssl can learn phonemes, called semantic tokens }
TTS system -
- Text to Semantic Tokens Model (M1):
- Semantic Tokens to Mel-Spectrogram/Acoustic Tokens Model (M2):
- Mel-Spectrogram/Acoustic Tokens to Audio wav (Vocoder):
Inference steps (used by Tortoise, detailed below):
- Generate a large number of semantic tokens from M1 using different sampling techniques.
- Re-rank the generated samples using the CLVP model.
- Select the top-k speech candidates.
- Decode the mel-spectrogram using M2.
- Convert to a waveform using the UnivNet vocoder.
Two-stage M2 (used by Bark, detailed below):
- Takes semantic tokens and produces C1 and C2 of the target codes in an autoregressive manner.
- Takes C1 and C2 and outputs all 8 codes in a non-autoregressive manner. The generated codes are fed to an audio codec (EnCodec in this case) to generate the waveform.
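A minimal data-flow sketch of the three-stage recipe above (M1 → M2 → vocoder). All three functions are hypothetical placeholders that only illustrate how the pieces connect, not real models.

```python
# Hypothetical end-to-end flow: text -> semantic tokens -> mel -> waveform.
import torch

def m1_text_to_semantic(text: str) -> torch.Tensor:
    return torch.randint(0, 8192, (len(text) * 2,))   # placeholder semantic-token ids

def m2_semantic_to_mel(semantic: torch.Tensor) -> torch.Tensor:
    return torch.randn(80, semantic.numel() * 4)      # placeholder 80-bin mel-spectrogram

def vocoder_mel_to_wav(mel: torch.Tensor) -> torch.Tensor:
    return torch.randn(mel.shape[1] * 256)            # placeholder waveform samples

semantic = m1_text_to_semantic("hello world")
mel = m2_semantic_to_mel(semantic)
wav = vocoder_mel_to_wav(mel)
print(semantic.shape, mel.shape, wav.shape)
```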
To map the text to semantic tokens, a GPT-like autoregressive model or an encoder-decoder transformer can be used.
This task can be viewed as translating the textual information into semantic information.
Semantic tokens are speaker dependent, because different speakers will say the same content with different prosody and duration; this is why a speaker encoder is attached to this module.
The M1 model is also responsible for controlling the length of the audio.
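A minimal sketch of how the text-to-semantic model M1 can be framed as ordinary next-token prediction, assuming text tokens and semantic tokens are packed into one flat sequence with an offset and a separator (a common trick; the exact token layout here is an assumption, not any specific model's format).

```python
# Pack text tokens and semantic tokens into one sequence so a GPT-like decoder
# can predict the semantic part autoregressively given the text prefix.
import torch

TEXT_VOCAB = 256                 # e.g. byte-level text tokens (assumption)
SEP = TEXT_VOCAB                 # separator token between text and audio tokens
SEM_OFFSET = TEXT_VOCAB + 1      # semantic ids are shifted past the text vocabulary

text_tokens = torch.tensor([72, 101, 108, 108, 111])       # "Hello" as bytes
semantic_tokens = torch.tensor([4021, 4021, 133, 988])     # from SSL (training) or sampled (inference)

sequence = torch.cat([text_tokens, torch.tensor([SEP]), semantic_tokens + SEM_OFFSET])
print(sequence)                  # a training sequence for a decoder-only M1
```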
Once we have the semantic tokens via SSL from audio files (or from M1 in the case of inference), we can use these semantic tokens to generate a mel-spectrogram or acoustic tokens (refer to EnCodec).
The output of the M2 model is an intermediate representation, and this needs to be converted back to an audio waveform.
The vocoder used here depends upon the intermediate representation used in M2.
The two most-used representations are:
1. Mel-spectrogram: an 80-bin mel filter bank applied on top of an STFT; this represents the audio in the time-frequency domain (see the sketch below).
2. Neural audio codec: a deep learning model trained to represent audio with continuous and discrete representations.
{ we have audio, it gets converted to semantic tokens, we further convert those into a mel-spectrogram (or acoustic tokens), which then gets converted back to an audio format }
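A short, concrete example of the first representation: computing an 80-bin mel-spectrogram from a waveform with torchaudio. The waveform is a random placeholder and the STFT parameters are typical values, not tied to any particular model.

```python
# Mel-spectrogram: an 80-bin mel filter bank applied on top of an STFT.
import torch
import torchaudio

wav = torch.randn(1, 22050)        # placeholder 1-second waveform at 22.05 kHz
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,                    # STFT window size
    hop_length=256,                # frame shift
    n_mels=80,                     # 80 mel bins
)
mel = to_mel(wav)                  # shape (1, 80, n_frames): time-frequency representation
print(mel.shape)
```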
Tortoise TTS -
Uses diffusion model in place of M2
→ architecture - attention layers + CNN layers
The semantic tokens are first extracted by training a VQ-VAE model on the training dataset. This step produces 8192 phoneme-like tokens in an unsupervised manner.
The text-to-semantic-tokens model M1 is a GPT-2-like, decoder-only model.
Lastly, after training the M1 and M2 models, the M2 model is fine-tuned on the M1 latents to increase the overall quality of the produced audio.
Tortoise also has a CLVP (Contrastive Language-Voice Pretrained transformer), which is used to re-rank the outputs of M1 to give more expressive outputs. It takes semantic tokens and text and produces a score.
Hence the inference of the model follows the steps listed above: sample many candidates from M1, re-rank with CLVP, select the top k, decode with M2, and vocode with UnivNet.
(once we have the semantic token candidates, we rank them with the CLVP scores to decide which ones are better, select the top ones, and then produce the final output waveform)
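A toy sketch of the re-ranking step, assuming CLVP behaves like a CLIP-style two-tower model: embed the text, embed each candidate semantic-token sequence, score by similarity, keep the top k. Both "encoders" here are random placeholders for the trained CLVP towers.

```python
# CLVP-style re-ranking of M1 samples: score each candidate against the text.
import torch
import torch.nn.functional as F

def embed_text(text: str) -> torch.Tensor:
    return F.normalize(torch.randn(512), dim=0)          # placeholder text embedding

def embed_semantic(tokens: torch.Tensor) -> torch.Tensor:
    return F.normalize(torch.randn(512), dim=0)          # placeholder speech-token embedding

text = "hello world"
candidates = [torch.randint(0, 8192, (40,)) for _ in range(16)]   # 16 samples from M1

scores = torch.stack([embed_text(text) @ embed_semantic(c) for c in candidates])
top_k = scores.topk(k=3).indices          # keep the 3 candidates that best match the text
print(top_k)
```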
The training recipe of Tortoise is not public.
Other models - use a GAN instead of diffusion
{ the GAN directly produces the final audio from the semantic tokens; no mel-spectrogram is needed }
One such example is XTTS: here M1 produces semantic tokens from the text, and then a HiFi-GAN vocoder is used directly with the speaker encoder output and the semantic tokens to produce the waveform; this is blazing fast.
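A toy sketch of that shortcut: a GAN-style vocoder that consumes semantic tokens plus a speaker embedding and emits waveform samples directly, with no mel-spectrogram stage. The module below is a stand-in, not HiFi-GAN itself, and all sizes are assumptions.

```python
# Direct semantic-tokens + speaker-embedding -> waveform path (no mel stage).
import torch
import torch.nn as nn

class ToyGanVocoder(nn.Module):
    def __init__(self, num_tokens=8192, dim=64, upsample=256):
        super().__init__()
        self.tok_emb = nn.Embedding(num_tokens, dim)
        self.head = nn.Linear(dim, upsample)      # each token frame expands to 256 samples

    def forward(self, semantic_tokens, speaker_emb):
        h = self.tok_emb(semantic_tokens) + speaker_emb   # condition on the speaker
        return self.head(h).flatten(-2)                   # (batch, frames * upsample) waveform

semantic = torch.randint(0, 8192, (1, 40))
speaker = torch.randn(1, 1, 64)                  # placeholder speaker-encoder output
wav = ToyGanVocoder()(semantic, speaker)
print(wav.shape)                                 # torch.Size([1, 10240])
```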
{ How to get streamable high quality TTS? }
→ The same approach extends to multi-modal models, where semantic tokens are extracted from images, audio, text, and even videos.
Bark - autoregressive model :: hence hallucinates a lot
(relation of hallucination with autoregressive generation - the model only knows the data it was trained on, so if it encounters a word not seen before, it might generate something random)
Bark follows a language-modeling approach to audio generation, and is inspired by AudioLM. AudioLM uses w2v-BERT to extract 10,000 semantic tokens.
The audio representation used here is a discrete neural audio codec. The audio codec gives 8 different codes (C1, C2, …, C8, one for each quantizer, each out of a vocabulary of 1024); this configuration is based on the Bark model's training and can be different for other models. The first code captures the most information, and the following codes add progressively finer information in a hierarchical fashion (sketched below).
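A minimal sketch of why the codes form that hierarchy, assuming an EnCodec-style residual vector quantizer: each quantizer encodes what the previous ones missed, so C1 carries the coarsest (and most) information. The codebooks here are random placeholders.

```python
# Residual vector quantization: later codebooks only quantize the leftover error.
import torch

def nearest(codebook, x):
    return torch.cdist(x, codebook).argmin(dim=1)

codebooks = [torch.randn(1024, 128) for _ in range(8)]   # 8 quantizers x 1024 entries (random here)
frames = torch.randn(50, 128)                            # placeholder codec-encoder frames

residual, codes = frames, []
for cb in codebooks:
    idx = nearest(cb, residual)          # C1, C2, ..., C8 in order
    codes.append(idx)
    residual = residual - cb[idx]        # what is left for the next quantizer
print([c.shape for c in codes])          # 8 parallel code streams of length 50
```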
Bark uses 2 decoder-only language models and 1 non-autoregressive model to build the end-to-end TTS pipeline.
The M1 model takes text and outputs the semantic tokens in an autoregressive manner.
M2 here is divided into the two parts listed above: an autoregressive coarse model and a non-autoregressive fine model.
{ here M1 is the same, it produces semantic tokens, but M2 does something else: it produces the 8 different codec codes }
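A data-flow sketch of that split, with hypothetical placeholder functions: an autoregressive coarse model emits C1 and C2 from the semantic tokens, a non-autoregressive fine model fills in the remaining codes, and the codec decoder turns all 8 streams into audio. The shapes and upsampling factor are assumptions.

```python
# Bark-style stack: semantic -> coarse (C1, C2) -> fine (C1..C8) -> codec decode.
import torch

T = 100                                              # number of codec frames (assumption)

def coarse_model(semantic: torch.Tensor) -> torch.Tensor:
    return torch.randint(0, 1024, (2, T))            # C1 and C2 (generated token by token in practice)

def fine_model(coarse: torch.Tensor) -> torch.Tensor:
    rest = torch.randint(0, 1024, (6, T))            # C3..C8 predicted in parallel
    return torch.cat([coarse, rest], dim=0)          # full (8, T) code matrix

def codec_decode(codes: torch.Tensor) -> torch.Tensor:
    return torch.randn(codes.shape[1] * 320)         # placeholder waveform

semantic = torch.randint(0, 10_000, (150,))
codes = fine_model(coarse_model(semantic))
print(codes.shape, codec_decode(codes).shape)
```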