Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance
Authors: Heeseung Kim, Sungwon Kim, Sungroh Yoon
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require any transcript of the target speaker, using classifier guidance. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for classifier guidance. Our unconditional diffusion model learns to generate speech without any context from untranscribed speech data. For TTS synthesis, we guide the generative process of the diffusion model with a phoneme classifier trained on a large-scale speech recognition dataset. We present a norm-based scaling method that reduces the pronunciation errors of classifier guidance in Guided-TTS. We show that Guided-TTS achieves a performance comparable to that of the state-of-the-art TTS model, Grad-TTS, without any transcript for LJSpeech. We further demonstrate that Guided-TTS performs well on diverse datasets, including a long-form untranscribed dataset. |
| Researcher Affiliation | Academia | Heeseung Kim*1, Sungwon Kim*1, Sungroh Yoon1,2. 1: Data Science and AI Lab., Seoul National University; 2: Department of ECE and Interdisciplinary Program in AI, Seoul National University. Correspondence to: Sungroh Yoon <sryoon@snu.ac.kr>. |
| Pseudocode | Yes | Algorithm 1: Norm-based Guidance; Algorithm 2: Inpainting Mel-spectrogram. (A sketch of the norm-based guidance step is given after the table.) |
| Open Source Code | No | The paper provides a demo page for audio samples ('Demo : https://bit.ly/3r8vho7') and references official implementations of baseline models, but it does not state that its own source code is open-source or provide a link to it. |
| Open Datasets | Yes | Datasets: In Guided-TTS, the speaker-dependent phoneme classifier and duration predictor are trained on LibriSpeech (Panayotov et al., 2015), which is a large-scale automatic speech recognition (ASR) dataset with approximately 982 hours of speech uttered by 2,484 speakers with corresponding texts. To extract the speaker embedding e_S from each utterance, we train a speaker encoder on VoxCeleb2 (Chung et al., 2018), which is a speaker verification dataset that contains more than 1M utterances of 6,112 speakers. For the comparison case with baselines which make use of the target speaker's transcript data, we use LJSpeech (Ito, 2017), a 24-hour female single-speaker dataset consisting of 13,100 audio clips. For the other case, which makes use of only the untranscribed target speaker speech, we use LJSpeech, Hi-Fi TTS (Bakhturina et al., 2021), and Blizzard 2013 (King & Karaiskos, 2013). |
| Dataset Splits | Yes | For the phoneme classifier and the duration predictor, we use the checkpoint of the epoch that scores best on its respective validation metric (validation accuracy for the phoneme classifier and validation loss for the duration predictor). |
| Hardware Specification | Yes | We conduct all experiments and evaluations using NVIDIA's RTX A40 with 48GB memory. |
| Software Dependencies | No | The paper mentions software tools like 'Adam optimizer', 'U-Net architecture', 'WaveNet-like architecture', 'Montreal Forced Aligner (MFA)', 'CTC-based conformer-large ASR model ... from NeMo toolkit', and 'HiFi-GAN', but it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | A.1. Training Details and Hyperparameters: In this section, we cover the training details and detailed hyperparameters of Guided-TTS. ... The unconditional DDPMs are trained with batch size 16 for all datasets. The phoneme classifier of Guided-TTS uses a WaveNet-like structure with 256 residual channels and 6 residual blocks, each a stack of 3 dilated convolution layers, and is trained for 200 epochs with batch size 64. The duration predictor is trained for 20 epochs with batch size 64. The speaker encoder is a two-layer LSTM with 768 channels followed by a linear projection layer to extract a 256-dimensional speaker embedding e_S, and is trained for 300K iterations. (Illustrative sketches of the speaker encoder and the reported training configuration follow the table.) |
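The norm-based guidance referenced in the Pseudocode row (Algorithm 1) combines the unconditional diffusion score with the gradient of a phoneme classifier, rescaled so that its norm tracks the norm of the unconditional score. The snippet below is a minimal sketch of that idea, assuming PyTorch-style `score_model` and `phoneme_classifier` callables; all names, signatures, and the whole-batch norm scaling are illustrative assumptions, not the authors' code.

```python
import torch

def norm_based_guided_score(score_model, phoneme_classifier, x_t, t,
                            phoneme_targets, speaker_emb, grad_scale=1.0):
    """Sketch of norm-based classifier guidance for a diffusion TTS model.

    The classifier gradient is rescaled so that its norm is proportional
    to the norm of the unconditional score before being added (assumption:
    norms are taken over the whole batch for simplicity).
    """
    # Unconditional score from the diffusion model (no transcript required).
    score = score_model(x_t, t, speaker_emb)

    # Gradient of the phoneme log-likelihood w.r.t. the noisy mel-spectrogram.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = phoneme_classifier(x_in, t, speaker_emb).log_softmax(dim=-1)
        ll = log_probs.gather(-1, phoneme_targets.unsqueeze(-1)).sum()
        grad = torch.autograd.grad(ll, x_in)[0]

    # Norm-based scaling: match the classifier gradient's norm to the
    # unconditional score's norm, then apply the gradient scale.
    scale = score.norm() / (grad.norm() + 1e-8)
    return score + grad_scale * scale * grad
```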
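The Experiment Setup row describes the speaker encoder as a two-layer LSTM with 768 channels followed by a linear projection to a 256-dimensional embedding e_S. A minimal PyTorch module consistent with that description might look as follows; the mel input dimensionality, the pooling over time, and the unit-norm output are assumptions not stated in the excerpt.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Two-layer LSTM (768 channels) + linear projection to a 256-dim
    speaker embedding, as described in Appendix A.1. Input layout and
    pooling strategy are illustrative assumptions."""

    def __init__(self, n_mels=80, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):               # mels: (batch, frames, n_mels)
        out, _ = self.lstm(mels)
        emb = self.proj(out[:, -1])        # last-frame hidden state (assumption)
        return emb / emb.norm(dim=-1, keepdim=True)  # unit-norm embedding (assumption)
```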
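The training hyperparameters quoted in the Experiment Setup row can be collected into a single configuration; the grouping below only restates the reported values, and settings not mentioned in the excerpt (e.g., learning rates) are deliberately omitted.

```python
# Training configuration as reported in Appendix A.1 of the paper.
TRAIN_CONFIG = {
    "unconditional_ddpm": {"batch_size": 16},            # same for all datasets
    "phoneme_classifier": {"epochs": 200, "batch_size": 64,
                           "residual_channels": 256, "residual_blocks": 6,
                           "dilated_conv_layers_per_block": 3},
    "duration_predictor": {"epochs": 20, "batch_size": 64},
    "speaker_encoder":    {"iterations": 300_000, "lstm_layers": 2,
                           "lstm_channels": 768, "embedding_dim": 256},
}
```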