VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

Authors: Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on several datasets demonstrate convincing capabilities, making TTS accessible to a wider range of applications. Experimentally, we evaluate our method in two ways. For TTS quality, we follow the standard Mean Opinion Score (MOS) experiment done by Arik et al. (2017a). For speaker identification, we train a multi-class network which achieves near-perfect performance on a real validation set, and test it against generated ones. (A minimal MOS aggregation sketch appears after the table.)
Researcher Affiliation | Industry | Yaniv Taigman, Lior Wolf, Adam Polyak and Eliya Nachmani, Facebook AI Research, {yaniv, wolf, adampolyak, eliyan}@fb.com
Pseudocode | No | The paper includes a table (Table 1) describing the components and their computations, but it does not provide structured pseudocode or an algorithm block.
Open Source Code | Yes | In order to promote reproducibility, we release our source code and models. PyTorch code and sample audio files are available here: https://github.com/facebookresearch/loop
Open Datasets | Yes | We make use of multiple datasets. First, for comparing with existing single-speaker techniques, we employ single-speaker literature datasets. Second, we employ various subsets of the VCTK dataset (Veaux et al., 2017) for various multi-speaker training and/or fitting experiments. Third, we create a dataset composed of four to five public speeches from each of four public figures. The data was downloaded from YouTube. The single-speaker experiments took place on the LJ dataset (Ito, 2017a), the Nancy corpus from the 2011 Blizzard Challenge (King & Karaiskos, 2011), and the English audiobook data for the 2013 Blizzard Challenge (King & Karaiskos, 2013).
Dataset Splits | Yes | The 109 speakers were divided into four different nested subsets: 22 North American speakers, both male and female; and random selections of 65, 85 and 101 speakers, where the remaining eight speakers were left out for validation. Each subset was shuffled into train and test sets. The training of the Char2Wav model, in each experiment, was optimized by measuring the loss on the validation set, over the following hyperparameters: initial learning rate ([1e-2, 1e-3, 1e-4]), source noise standard deviation ([1, 2, 4]), batch size ([16, 32, 64]) and the length of each training sample ([10e2, 10e4]). (A sketch of the split and grid search appears after the table.)
Hardware Specification | Yes | The full model contains 9.3 million parameters and runs near real-time on a single core of an Intel Xeon E5 CPU, and 5 times faster on an NVIDIA M40 GPU, including vocoder CPU decoding.
Software Dependencies | No | The paper mentions 'PyTorch code' for its implementation and references the 'Merlin toolkit' (Wu et al., 2016) for feature extraction and waveform synthesis, but it does not specify version numbers for PyTorch or any other ancillary software dependencies.
Experiment Setup | Yes | The training of the Char2Wav model, in each experiment, was optimized by measuring the loss on the validation set, over the following hyperparameters: initial learning rate ([1e-2, 1e-3, 1e-4]), source noise standard deviation ([1, 2, 4]), batch size ([16, 32, 64]) and the length of each training sample ([10e2, 10e4]). The 109 speakers were divided into four different nested subsets: 22 North American speakers, both male and female; and random selections of 65, 85 and 101 speakers, where the remaining eight speakers were left out for validation. Each subset was shuffled into train and test sets.
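
The Dataset Splits and Experiment Setup rows describe the same protocol: nested speaker subsets with eight held-out validation speakers, plus a grid search over four hyperparameters selected by validation loss. The following is a minimal sketch of that protocol, not the released code: the speaker IDs, the `train_and_eval` callback, and the use of `random`/`itertools` are illustrative assumptions; the grid values are the ones quoted above.

```python
# Hypothetical sketch of the speaker-subset construction and hyperparameter
# grid search described in the table. Names and helpers are assumptions.
import itertools
import random

random.seed(0)

ALL_SPEAKERS = [f"p{225 + i}" for i in range(109)]  # placeholder VCTK-style speaker IDs
NA_SPEAKERS = ALL_SPEAKERS[:22]                     # 22 North American speakers (membership assumed known)


def nested_subsets(all_speakers, sizes=(65, 85, 101), held_out=8):
    """Build nested random subsets; the remaining `held_out` speakers are the validation pool."""
    shuffled = random.sample(all_speakers, len(all_speakers))
    validation = shuffled[-held_out:]
    pool = shuffled[:-held_out]
    return {n: pool[:n] for n in sizes}, validation


def train_test_split(speakers, test_ratio=0.1):
    """Shuffle one subset into train and test portions (ratio is an assumption)."""
    shuffled = random.sample(speakers, len(speakers))
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hyperparameter grid as quoted: learning rate, source noise std,
# batch size, and training-sample length.
GRID = {
    "lr": [1e-2, 1e-3, 1e-4],
    "noise_std": [1, 2, 4],
    "batch_size": [16, 32, 64],
    "sample_len": [10e2, 10e4],
}


def grid_search(train_and_eval):
    """Return the configuration with the lowest validation loss.

    `train_and_eval(config) -> float` is a hypothetical callback that trains
    a model with `config` and returns its validation loss.
    """
    best_cfg, best_loss = None, float("inf")
    for values in itertools.product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        loss = train_and_eval(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss
```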
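
The Research Type row cites the standard Mean Opinion Score experiment. As a point of reference, MOS is simply the mean of 1-5 listener ratings, usually reported with a 95% confidence interval; the snippet below illustrates that aggregation only, and is not the authors' crowdsourcing pipeline.

```python
# Minimal illustration of MOS aggregation: mean of 1-5 ratings with a
# normal-approximation 95% confidence interval.
import math


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / n)
    return mos, ci95


# Example: mean_opinion_score([4, 5, 3, 4, 4]) -> (4.00, ~0.62)
```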