A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Authors: He Bai, Renjie Zheng, Junkun Chen, Mingbo Ma, Xintong Li, Liang Huang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show A3T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
Researcher Affiliation | Collaboration | ¹University of Waterloo, Waterloo, ON, Canada (work done at Baidu Research USA); ²Baidu Research, Sunnyvale, CA, USA; ³Oregon State University, Corvallis, OR, USA.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/richardbaihe/a3t
Open Datasets | Yes | Following Tan et al. (2021), we conduct our speech-editing experiments with a single-speaker TTS dataset, LJSpeech (Ito & Johnson, 2017), and a multi-speaker TTS dataset, VCTK (Yamagishi et al., 2019).
Dataset Splits | Yes | We test on 50 test cases with 15 human annotators for each case. In this setting, we find that FastSpeech 2 fails to generate high-quality audio for these new speakers, even when equipped with X-Vector (Snyder et al., 2018) to generate speaker embeddings for new speakers. After initializing the FastSpeech 2 model with our LibriTTS-pretrained A3T, the generated audio can be improved significantly. Results are shown in Tab. 8. We also plot the validation loss and training loss during the training of TTS models with and without A3T in Fig. 11.
Hardware Specification | No | The paper does not provide specific details about the hardware used (e.g., GPU models, CPU types, memory) for running experiments.
Software Dependencies | No | The paper mentions tools like HTK, ESPnet, and Parallel WaveGAN but does not provide specific version numbers for these or other core software dependencies.
Experiment Setup | Yes | All A3T models pretrained in our experiments share the same architecture: a 4-layer Conformer encoder, a 4-layer Conformer decoder, and a 5-layer Conv1d Post-Net, with 2-head multi-head attention in 384 dimensions. The convolution kernel sizes of the encoder and decoder are 7 and 31, respectively. During training, we use the Adam optimizer with a 1.0 initial learning rate, 4,000 warmup steps, and the Noam learning rate scheduler. Instead of setting a fixed batch size, we adjust the batch size according to the length of the input example and set a maximum batch-bin (the total number of input elements) for each model. Following MAM (Chen et al., 2020), 15% of frames are masked for speech-only input. For speech-text input, we randomly select several phoneme spans (about 80% of the phonemes) and mask their corresponding frames. For the speech-editing experiments, we use a 2.4M batch-bin and 1M steps for LJSpeech, and a 3M batch-bin and 1.2M steps for VCTK.
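
As a point of reference for the optimizer settings quoted above, the Noam scheduler is the standard Transformer learning-rate schedule. The sketch below is generic (not taken from the A3T release); it simply plugs in the 1.0 scale factor, 4,000 warmup steps, and 384-dim model size stated in the setup.

```python
def noam_lr(step, d_model=384, warmup=4000, scale=1.0):
    """Standard Noam schedule (Vaswani et al., 2017): linear warmup,
    then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The peak learning rate is reached at the end of warmup:
print(noam_lr(4000))  # ~8.1e-4 with d_model=384, scale=1.0
```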
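To make the masking recipe concrete, here is a minimal Python sketch of the two cases described in the setup: random 15% frame masking for speech-only input, and phoneme-span masking for speech-text input. The function name, the span-selection heuristic, and the alignment format are assumptions for illustration, not the authors' implementation.

```python
import random

import numpy as np


def mask_frames(features, phone_durations=None, frame_mask_ratio=0.15,
                phone_mask_ratio=0.80, seed=None):
    """Boolean mask over acoustic frames, following the recipe quoted above.

    features:        (T, D) NumPy array of acoustic frames (e.g. a mel-spectrogram).
    phone_durations: per-phoneme frame counts from a forced alignment,
                     or None for speech-only input.
    """
    rng = random.Random(seed)
    num_frames = features.shape[0]
    mask = np.zeros(num_frames, dtype=bool)

    if phone_durations is None:
        # Speech-only input: mask 15% of the frames at random (as in MAM).
        num_masked = int(round(frame_mask_ratio * num_frames))
        mask[rng.sample(range(num_frames), num_masked)] = True
        return mask

    # Speech-text input: choose phoneme spans covering roughly 80% of the
    # phonemes and mask every frame aligned to a chosen phoneme.
    num_phones = len(phone_durations)
    target = int(round(phone_mask_ratio * num_phones))
    starts = np.concatenate(([0], np.cumsum(phone_durations)[:-1]))

    masked_phones = set()
    while len(masked_phones) < target:
        # Hypothetical span-selection heuristic: random start, short random span.
        span_start = rng.randrange(num_phones)
        span_len = rng.randint(1, max(1, target // 4))
        masked_phones.update(range(span_start,
                                   min(num_phones, span_start + span_len)))

    for p in masked_phones:
        start = starts[p]
        mask[start:start + phone_durations[p]] = True
    return mask
```

In the paper's alignment-aware setting, the per-phoneme durations would come from a forced alignment (e.g., the HTK toolchain noted under software dependencies).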