Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
Authors: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper includes a section titled "3 Experimental Evaluations". |
| Researcher Affiliation | Academia | Takaaki Saeki¹, Soumi Maiti², Xinjian Li², Shinji Watanabe², Shinnosuke Takamichi¹ and Hiroshi Saruwatari¹; ¹The University of Tokyo, Japan; ²Carnegie Mellon University, USA |
| Pseudocode | No | The paper describes its methods in prose and with diagrams (e.g., Fig. 2), but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The experiments were conducted on public datasets and the implementation is available at https://github.com/Takaaki-Saeki/zm-text-tts |
| Open Datasets | Yes | Dataset: We carried out all the evaluations with publicly available datasets. For the unsupervised text pretraining described in Section 2.1, we used transcripts from VoxPopuli [Wang et al., 2021], M-AILABS [Munich Artificial Intelligence Laboratories GmbH, 2017], and CSS10 [Park and Mulc, 2019]. |
| Dataset Splits | Yes | We used 5 and 100 utterances as dev and test sets, respectively, with the remaining data used for training. |
| Hardware Specification | Yes | This work used the Bridges system [Nystrom et al., 2015], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center. |
| Software Dependencies | No | The paper mentions using 'ESPnet2-TTS', 'HiFi-GAN', and 'SpeechBrain' with citations, but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | The sampling rate was set to 16 kHz. An 80-dimensional mel filter bank, an FFT length of 1024 samples, and a frame shift of 256 samples were used for speech analysis. For the pretraining described in Section 2.1, we trained the model for 1.2M iterations using the Noam optimizer [Vaswani et al., 2017] with the learning rate and warm-up step set to 1.0 and 10000, respectively. For the TTS model described in Section 2.4, we used a 6-block Transformer encoder [Vaswani et al., 2017] and a 6-block Transformer decoder, with a postnet consisting of five convolutional layers with a kernel size of five. The attention dimension and the number of attention heads were set to 512 and 8, respectively. For the bottleneck layer described in Section 2.4, we set the hidden dimension after the down projection to 256. The prediction net in Eq. (4) consisted of a linear layer, a GELU activation function [Hendrycks and Gimpel, 2016], Layer Normalization, and a linear layer with a hidden dimension of 512. For the supervised learning described in Section 2.2, we trained the models for 2.47M iterations (200 epochs). The Noam optimizer was used with a warm-up step of 50000. |
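
The speech-analysis settings quoted in the experiment-setup row (16 kHz sampling, 80-dimensional mel filter bank, 1024-sample FFT, 256-sample frame shift) can be reproduced with standard tooling. The sketch below uses torchaudio as an assumption; the authors' ESPnet2-based pipeline computes equivalent features internally.

```python
# Minimal sketch of the quoted speech-analysis settings.
# torchaudio is an assumption; the paper's ESPnet2 recipe uses the same
# parameter values in its own feature extractor.
import torch
import torchaudio

mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,  # 16 kHz sampling rate
    n_fft=1024,         # FFT length of 1024 samples
    hop_length=256,     # frame shift of 256 samples
    n_mels=80,          # 80-dimensional mel filter bank
)

waveform = torch.randn(1, 16000)  # 1 s of dummy audio at 16 kHz
mel = mel_extractor(waveform)     # shape: (1, 80, n_frames)
print(mel.shape)
```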
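The bottleneck layer and prediction net described in the same row (down projection to 256 dimensions; linear → GELU → LayerNorm → linear with a hidden dimension of 512) can be sketched in PyTorch as follows. Module names and the input/output dimensions are illustrative assumptions, not the authors' ESPnet2 implementation.

```python
# Rough sketch of the bottleneck and prediction-net layout quoted above.
# The 512-dim input matches the stated attention dimension; mapping the
# output back to 512 is an assumption for illustration.
import torch
import torch.nn as nn

ATT_DIM = 512      # Transformer attention dimension
BOTTLENECK = 256   # hidden dimension after the down projection
HIDDEN = 512       # prediction-net hidden dimension


class Bottleneck(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(ATT_DIM, BOTTLENECK)  # down projection to 256
        self.up = nn.Linear(BOTTLENECK, ATT_DIM)    # back to attention dim

    def forward(self, x):
        return self.up(self.down(x))


class PredictionNet(nn.Module):
    """Linear -> GELU -> LayerNorm -> Linear, as described for Eq. (4)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ATT_DIM, HIDDEN),
            nn.GELU(),
            nn.LayerNorm(HIDDEN),
            nn.Linear(HIDDEN, ATT_DIM),
        )

    def forward(self, x):
        return self.net(x)


# Example: a batch of 2 sequences, 50 frames, 512-dim hidden states.
h = torch.randn(2, 50, ATT_DIM)
print(Bottleneck()(h).shape, PredictionNet()(h).shape)
```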
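The Noam optimizer settings quoted for pretraining (learning-rate factor 1.0, 10,000 warm-up steps) follow the schedule of Vaswani et al. [2017]. Below is a minimal sketch of that schedule, assuming a model dimension of 512 to match the attention dimension; the exact ESPnet2 scheduler may differ in constant factors.

```python
# Noam learning-rate schedule [Vaswani et al., 2017] with the quoted
# pretraining settings (factor 1.0, warmup 10000). model_dim=512 is an
# assumption matching the stated attention dimension.
def noam_lr(step, model_dim=512, warmup=10000, factor=1.0):
    """Learning rate at a given optimizer step (1-indexed)."""
    step = max(step, 1)
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


for s in (1, 1000, 10000, 100000):
    print(s, noam_lr(s))
```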