VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
Authors: Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on several datasets demonstrate convincing capabilities, making TTS accessible to a wider range of applications. Experimentally, we evaluate our method in two ways. For TTS quality, we follow the standard Mean Opinion Score (MOS) experiment done by Arik et al. (2017a). For speaker identification, we train a multi-class network which achieves near-perfect performance on a real validation set, and test it against generated ones. |
| Researcher Affiliation | Industry | Yaniv Taigman, Lior Wolf, Adam Polyak and Eliya Nachmani Facebook AI Research {yaniv, wolf, adampolyak, eliyan}@fb.com |
| Pseudocode | No | The paper includes a table (Table 1) describing the components and their computations, but it does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | In order to promote reproducibility, we release our source code and models. PyTorch code and sample audio files are available here: https://github.com/facebookresearch/loop |
| Open Datasets | Yes | We make use of multiple datasets. First, for comparing with existing single speaker techniques, we employ single speaker literature datasets. Second, we employ various subsets of the VCTK dataset (Veaux et al., 2017) for various multi-speaker training and/or fitting experiments. Third, we create a dataset that is composed from four to five public speeches of four public figures. The data was downloaded from youtube. The single speaker experiments took place on the LJ (Ito, 2017a), the Nancy corpus from the 2011 Blizzard Challenge (King & Karaiskos, 2011), and the English audiobook data for the 2013 Blizzard Challenge (King & Karaiskos, 2013). |
| Dataset Splits | Yes | The 109 speakers were divided into four different nested subsets: 22 North American speakers, both male and females; and 65, 85 and 101 random selection of speakers, where the remaining eight speakers were left out for validation. Each subset was shuffled into train and test sets. The training of the Char2Wav model, in each experiment, was optimized by measuring the loss on the validation set, over the following hyperparameters: initial learning rate of [1e-2, 1e-3, 1e-4], source noise standard deviation ([1, 2, 4]), batch-size ([16, 32, 64]) and the length of each training sample ([10e2, 10e4]). |
| Hardware Specification | Yes | The full model contains 9.3 million parameters and runs near real-time on an Intel Xeon E5 single-core CPU and 5 times faster when on M40 NVIDIA GPU, including vocoder CPU decoding. |
| Software Dependencies | No | The paper mentions 'PyTorch code' for its implementation and references the 'Merlin toolkit' (Wu et al., 2016) for feature extraction and waveform synthesis, but it does not specify version numbers for PyTorch or any other ancillary software dependencies. |
| Experiment Setup | Yes | The training of the Char2Wav model, in each experiment, was optimized by measuring the loss on the validation set, over the following hyperparameters: initial learning rate of [1e-2, 1e-3, 1e-4], source noise standard deviation ([1, 2, 4]), batch-size ([16, 32, 64]) and the length of each training sample ([10e2, 10e4]). The 109 speakers were divided into four different nested subsets: 22 North American speakers, both male and females; and 65, 85 and 101 random selection of speakers, where the remaining eight speakers were left out for validation. Each subset was shuffled into train and test sets. (A minimal sketch of this grid search appears below the table.) |
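
The Experiment Setup row describes a grid search over four hyperparameters, with the best configuration selected by validation-set loss. The following minimal Python sketch illustrates that search; the `train_and_validate` stub and all variable names are placeholders for illustration only and are not part of the released code.

```python
from itertools import product
import random


def train_and_validate(lr, noise_std, batch_size, sample_len):
    # Stand-in: real code would train the baseline model with this
    # configuration and return the loss measured on the validation set.
    return random.random()


# Grid taken from the Experiment Setup row above.
learning_rates = [1e-2, 1e-3, 1e-4]   # initial learning rate
noise_stds     = [1, 2, 4]            # source noise standard deviation
batch_sizes    = [16, 32, 64]         # batch size
sample_lengths = [10e2, 10e4]         # length of each training sample

best_loss, best_config = float("inf"), None
for lr, noise_std, batch_size, sample_len in product(
        learning_rates, noise_stds, batch_sizes, sample_lengths):
    loss = train_and_validate(lr, noise_std, batch_size, sample_len)
    if loss < best_loss:
        best_loss, best_config = loss, (lr, noise_std, batch_size, sample_len)

print("best configuration:", best_config, "validation loss:", best_loss)
```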