Speech deepdive

 
sheer size of audio → high-quality audio involves at least 16-bit precision samples, which would mean a 65,536-way softmax per time step if raw samples were predicted directly
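WaveNet sidesteps the full 16-bit softmax by first µ-law companding the samples down to 256 levels, so the per-step softmax is only 256-way. A minimal sketch of that companding/quantization step (µ = 255, the standard choice):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Map float samples in [-1, 1] to integer class ids in [0, mu]."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(ids, mu=255):
    """Invert the companding: class ids back to approximate float samples."""
    compressed = 2 * ids.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

samples = np.linspace(-1.0, 1.0, 5)
print(mu_law_encode(samples))   # 256-way classification targets instead of 65,536-way
```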
 
Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.
 
The fact that directly generating audio timestep by timestep with deep neural networks works at all for 16 kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. MOS (mean opinion scores) are a standard measure for subjective sound-quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences).
 
With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency.
Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results.
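A toy sketch of the encoder side of that idea (hyperparameters are illustrative, not the paper's; the mel-generating decoder and the vocoder are omitted): the whole phoneme sequence is encoded in one parallel pass, with no recurrence.

```python
import torch
import torch.nn as nn

# Toy Transformer-TTS-style encoder: phoneme ids -> hidden states, all positions
# processed in parallel via multi-head self-attention (sizes are illustrative).
phoneme_vocab, d_model = 80, 256
embed = nn.Embedding(phoneme_vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=3,
)

phonemes = torch.randint(0, phoneme_vocab, (1, 32))   # batch of 1, 32 phonemes
hidden = encoder(embed(phonemes))                     # built in a single parallel pass
print(hidden.shape)                                   # torch.Size([1, 32, 256])
```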
Three independently trained components:
(1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker;
(2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding;
(3) an autoregressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time-domain waveform samples.
Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
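A hypothetical stub sketch of that three-stage pipeline, just to make the interfaces concrete (none of these names or shapes come from the actual implementation):

```python
import numpy as np

# Hypothetical stubs marking the three interfaces; none of these names come from the real code.

def speaker_encoder(reference_wav):
    """Trained on speaker verification; maps a few seconds of speech to a fixed embedding."""
    return np.random.randn(256)

def synthesizer(text, speaker_embedding):
    """Tacotron-2-style seq2seq network conditioned on the speaker embedding."""
    n_frames = 20 * len(text)                 # placeholder output length
    return np.random.randn(n_frames, 80)      # mel spectrogram: frames x 80 bins

def vocoder(mel):
    """Autoregressive WaveNet-style vocoder: mel frames -> time-domain samples."""
    return np.random.randn(mel.shape[0] * 256)

embedding = speaker_encoder(np.random.randn(3 * 16000))   # ~3 s of reference audio at 16 kHz
wav = vocoder(synthesizer("hello world", embedding))
```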
"global style tokens" (GSTs), a bank of embeddings that are jointly trained wiht Tacotron →The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content.
→They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus
ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. → ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner.
Furthermore, we build a fully parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass; a novel VAE-based approach is used to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.
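The length regulator itself is simple: repeat each phoneme's hidden state according to its predicted duration so the expanded sequence matches the mel length. A minimal sketch (sizes are made up):

```python
import torch

def length_regulator(phoneme_hidden, durations):
    """Expand phoneme-level hidden states to frame level by repeating each state
    durations[i] times, so the output length matches the target mel length.
    phoneme_hidden: (num_phonemes, hidden_dim); durations: (num_phonemes,) ints."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

h = torch.randn(4, 8)              # 4 phonemes, hidden size 8
d = torch.tensor([3, 5, 2, 4])     # durations predicted from the teacher's alignments
frames = length_regulator(h, d)
print(frames.shape)                # torch.Size([14, 8]) -> 14 mel frames in parallel
```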
1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Additionally, incorporating an autoencoding input to help stabilize attention during training results in a model which can consistently synthesize intelligible speech for training speakers in all languages seen during training, in native or foreign accents.
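The adversarial disentanglement term is commonly implemented with a gradient reversal layer feeding a speaker classifier; whether this exact mechanism matches the paper is an assumption here, but a sketch of the idea:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Text-encoder outputs go through the reversal layer into a speaker classifier: the
# classifier learns to predict the speaker, while the reversed gradient pushes the
# encoder to remove speaker identity from its content representation.
encoder_out = torch.randn(8, 128, requires_grad=True)   # (batch, hidden), stand-in values
speaker_ids = torch.randint(0, 10, (8,))
classifier = nn.Linear(128, 10)
adv_loss = nn.functional.cross_entropy(classifier(GradReverse.apply(encoder_out, 1.0)), speaker_ids)
adv_loss.backward()                                     # encoder gradient arrives with flipped sign
```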
Location-relative attention mechanisms do away with content-based query/key comparisons. → We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism in the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances while preserving naturalness for shorter, in-domain utterances.
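A simplified sketch of the GMM-attention idea (not the exact parameterization from the paper): the mixture means can only move forward at each decoder step, and the alignment never compares the query against content-based keys.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttention(nn.Module):
    """Location-relative GMM attention sketch: monotonic means, no content-based keys."""
    def __init__(self, query_dim=128, n_mixtures=5):
        super().__init__()
        self.mlp = nn.Linear(query_dim, 3 * n_mixtures)   # predicts (weight, delta, scale)

    def forward(self, query, prev_means, memory_len):
        w, delta, scale = self.mlp(query).chunk(3, dim=-1)
        means = prev_means + F.softplus(delta)            # means can only move forward
        sigma = F.softplus(scale) + 1e-4
        pos = torch.arange(memory_len, dtype=torch.float32)
        # alignment over encoder positions: weighted sum of Gaussians centred at the means
        gauss = torch.exp(-0.5 * ((pos[None, :] - means[:, None]) / sigma[:, None]) ** 2)
        align = (F.softmax(w, dim=-1)[:, None] * gauss).sum(dim=0)
        return align / align.sum(), means

attn = GMMAttention()
query, means = torch.randn(128), torch.zeros(5)
for _ in range(3):                                        # a few decoder steps
    align, means = attn(query, means, memory_len=50)
print(align.shape, means)                                 # alignment over 50 positions; means keep growing
```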
Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable.
Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent).
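A toy illustration of the underlying flow idea (this is not Flowtron's architecture or conditioning, just a single invertible affine step): training maps data to a Gaussian latent, and synthesis samples or manipulates z and runs the inverse.

```python
import torch
import torch.nn as nn

class AffineFlowStep(nn.Module):
    """One invertible affine transform; a real flow stacks many conditioned steps."""
    def __init__(self, dim=80):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                  # data -> latent (training direction)
        return (x - self.shift) * torch.exp(-self.log_scale)

    def inverse(self, z):                  # latent -> data (synthesis direction)
        return z * torch.exp(self.log_scale) + self.shift

flow = AffineFlowStep()
sigma = 0.5                                # smaller sigma -> less variation in the output
z = sigma * torch.randn(10, 80)            # sample, interpolate, or transplant latents here
frames = flow.inverse(z)                   # decode the manipulated latents back to data space
```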
We take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.
 
 
Short list: XTTS, Tortoise TTS, Bark, and StyleTTS (StyleTTS is the fastest).
 
Audio models are often built very differently because the number of time steps is vastly larger. For instance, the input to a language model might be a few hundred words/tokens, but an audio file might have 20 thousand samples per second, so the architecture and the challenges are very different.
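A back-of-the-envelope comparison (the tokens-per-second figure is a rough assumption):

```python
seconds = 10
sample_rate = 16_000                 # 16 kHz raw audio, as in the WaveNet note above
tokens_per_second = 3                # rough guess for spoken text

audio_steps = seconds * sample_rate              # 160,000 time steps for the waveform
text_tokens = seconds * tokens_per_second        # ~30 tokens for the same utterance
print(audio_steps, text_tokens, audio_steps // text_tokens)   # 160000 30 5333
```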
 
Tortoise combines two of the most magical models from recent years (in my opinion): a diffusion model and a transformer. The transformer autoregressively generates audio tokens from the text; the diffusion model then generates a mel spectrogram from those audio tokens, and a vocoder finally generates the audio. There is also a VQ-VAE used during training and a CLIP model adapted to audio that acts as a discriminator.
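A hypothetical stub sketch of that Tortoise-style flow, just to show how the pieces hand data to each other (names, shapes, and the candidate re-ranking step are illustrative, not the real API):

```python
import numpy as np

def ar_transformer(text, n_candidates=4):
    """Autoregressive transformer: text -> candidate sequences of discrete audio tokens."""
    return [np.random.randint(0, 8192, size=25 * len(text)) for _ in range(n_candidates)]

def clip_style_score(text, tokens):
    """CLIP-like text/audio model used to pick the best-matching candidate."""
    return float(np.random.rand())

def diffusion_decoder(tokens):
    """Diffusion model: audio tokens -> mel spectrogram (frames x mel bins)."""
    return np.random.randn(len(tokens), 100)

def vocoder(mel):
    """Neural vocoder: mel spectrogram -> waveform samples."""
    return np.random.randn(mel.shape[0] * 256)

text = "hello there"
candidates = ar_transformer(text)
best = max(candidates, key=lambda t: clip_style_score(text, t))
wav = vocoder(diffusion_decoder(best))
```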