A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Authors: He Bai, Renjie Zheng, Junkun Chen, Mingbo Ma, Xintong Li, Liang Huang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show A3T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
Researcher Affiliation | Collaboration | ¹University of Waterloo, Waterloo, ON, Canada (work done at Baidu Research USA); ²Baidu Research, Sunnyvale, CA, USA; ³Oregon State University, Corvallis, OR, USA.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/richardbaihe/a3t
Open Datasets | Yes | Following Tan et al. (2021), we conduct our speech-editing experiments with a single-speaker TTS dataset, LJSpeech (Ito & Johnson, 2017), and a multi-speaker TTS dataset, VCTK (Yamagishi et al., 2019).
Dataset Splits | Yes | We test on 50 test cases with 15 human annotators for each case. In this setting, we find that FastSpeech 2 fails to generate high-quality audio for these new speakers, even when equipped with X-Vector (Snyder et al., 2018) to generate speaker embeddings for new speakers. After initializing the FastSpeech 2 model with our LibriTTS-pretrained A3T, the generated audio can be improved significantly. Results are shown in Tab. 8. We also plot the validation loss and training loss during the training of TTS models with and without A3T in Fig. 11.
Hardware Specification | No | The paper does not provide specific details about the hardware used (e.g., GPU models, CPU types, memory) for running experiments.
Software Dependencies | No | The paper mentions tools like HTK, ESPnet, and Parallel WaveGAN but does not provide specific version numbers for these or other core software dependencies.
Experiment Setup | Yes | All A3T models pretrained in our experiments share the same architecture: a 4-layer Conformer encoder, a 4-layer Conformer decoder, and a 5-layer Conv1d Post-Net, with 2-head multi-head attention in 384 dimensions. The convolution kernel sizes of the encoder and decoder are 7 and 31, respectively. During training, we use the Adam optimizer with a 1.0 initial learning rate, 4,000 warmup steps, and the Noam learning rate scheduler. Instead of setting a fixed batch size, we adjust the batch size according to the length of the input example and set a maximum batch-bin (the total number of input elements) for each model. Following MAM (Chen et al., 2020), 15% of frames are masked for speech-only input. For speech-text input, we randomly select several phoneme spans (about 80% of the phonemes) and mask their corresponding frames. For the speech-editing experiments, we use a 2.4M batch-bin and 1M steps for LJSpeech, and a 3M batch-bin and 1.2M steps for VCTK.
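
As a point of reference for the optimizer settings quoted above, the Noam scheduler is the standard Transformer learning-rate schedule. The sketch below is generic (not taken from the A3T release); it simply plugs in the 1.0 scale factor, 4,000 warmup steps, and 384-dim model size stated in the setup.

```python
def noam_lr(step, d_model=384, warmup=4000, scale=1.0):
    """Standard Noam schedule (Vaswani et al., 2017): linear warmup,
    then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The peak learning rate is reached at the end of warmup:
print(noam_lr(4000))  # ~8.1e-4 with d_model=384, scale=1.0
```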
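To make the masking recipe concrete, here is a minimal Python sketch of the two cases described in the setup: random 15% frame masking for speech-only input, and phoneme-span masking for speech-text input. The function name, the span-selection heuristic, and the alignment format are assumptions for illustration, not the authors' implementation.

```python
import random

import numpy as np


def mask_frames(features, phone_durations=None, frame_mask_ratio=0.15,
                phone_mask_ratio=0.80, seed=None):
    """Boolean mask over acoustic frames, following the recipe quoted above.

    features:        (T, D) NumPy array of acoustic frames (e.g. a mel-spectrogram).
    phone_durations: per-phoneme frame counts from a forced alignment,
                     or None for speech-only input.
    """
    rng = random.Random(seed)
    num_frames = features.shape[0]
    mask = np.zeros(num_frames, dtype=bool)

    if phone_durations is None:
        # Speech-only input: mask 15% of the frames at random (as in MAM).
        num_masked = int(round(frame_mask_ratio * num_frames))
        mask[rng.sample(range(num_frames), num_masked)] = True
        return mask

    # Speech-text input: choose phoneme spans covering roughly 80% of the
    # phonemes and mask every frame aligned to a chosen phoneme.
    num_phones = len(phone_durations)
    target = int(round(phone_mask_ratio * num_phones))
    starts = np.concatenate(([0], np.cumsum(phone_durations)[:-1]))

    masked_phones = set()
    while len(masked_phones) < target:
        # Hypothetical span-selection heuristic: random start, short random span.
        span_start = rng.randrange(num_phones)
        span_len = rng.randint(1, max(1, target // 4))
        masked_phones.update(range(span_start,
                                   min(num_phones, span_start + span_len)))

    for p in masked_phones:
        start = starts[p]
        mask[start:start + phone_durations[p]] = True
    return mask
```

In the paper's alignment-aware setting, the per-phoneme durations would come from a forced alignment (e.g., the HTK toolchain noted under software dependencies).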