VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

Authors: Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on several datasets demonstrate convincing capabilities, making TTS accessible to a wider range of applications. Experimentally, we evaluate our method in two ways. For TTS quality, we follow the standard Mean Opinion Score (MOS) experiment done by Arik et al. (2017a). For speaker identification, we train a multi-class network which achieves near-perfect performance on a real validation set, and test it against generated ones. (A minimal MOS aggregation sketch appears after the table.)
Researcher Affiliation | Industry | Yaniv Taigman, Lior Wolf, Adam Polyak and Eliya Nachmani, Facebook AI Research, {yaniv, wolf, adampolyak, eliyan}@fb.com
Pseudocode | No | The paper includes a table (Table 1) describing the components and their computations, but it does not provide structured pseudocode or an algorithm block.
Open Source Code | Yes | In order to promote reproducibility, we release our source code and models. PyTorch code and sample audio files are available here: https://github.com/facebookresearch/loop
Open Datasets | Yes | We make use of multiple datasets. First, for comparing with existing single-speaker techniques, we employ single-speaker literature datasets. Second, we employ various subsets of the VCTK dataset (Veaux et al., 2017) for various multi-speaker training and/or fitting experiments. Third, we create a dataset composed of four to five public speeches from each of four public figures. The data was downloaded from YouTube. The single-speaker experiments took place on the LJ dataset (Ito, 2017a), the Nancy corpus from the 2011 Blizzard Challenge (King & Karaiskos, 2011), and the English audiobook data for the 2013 Blizzard Challenge (King & Karaiskos, 2013).
Dataset Splits | Yes | The 109 speakers were divided into four different nested subsets: 22 North American speakers, both male and female; and random selections of 65, 85 and 101 speakers, where the remaining eight speakers were left out for validation. Each subset was shuffled into train and test sets. The training of the Char2Wav model, in each experiment, was optimized by measuring the loss on the validation set, over the following hyperparameters: initial learning rate ([1e-2, 1e-3, 1e-4]), source noise standard deviation ([1, 2, 4]), batch size ([16, 32, 64]) and the length of each training sample ([10e2, 10e4]). (A sketch of the split and grid search appears after the table.)
Hardware Specification | Yes | The full model contains 9.3 million parameters and runs near real-time on a single core of an Intel Xeon E5 CPU, and 5 times faster on an NVIDIA M40 GPU, including vocoder CPU decoding.
Software Dependencies | No | The paper mentions 'PyTorch code' for its implementation and references the 'Merlin toolkit' (Wu et al., 2016) for feature extraction and waveform synthesis, but it does not specify version numbers for PyTorch or any other ancillary software dependencies.
Experiment Setup | Yes | The training of the Char2Wav model, in each experiment, was optimized by measuring the loss on the validation set, over the following hyperparameters: initial learning rate ([1e-2, 1e-3, 1e-4]), source noise standard deviation ([1, 2, 4]), batch size ([16, 32, 64]) and the length of each training sample ([10e2, 10e4]). The 109 speakers were divided into four different nested subsets: 22 North American speakers, both male and female; and random selections of 65, 85 and 101 speakers, where the remaining eight speakers were left out for validation. Each subset was shuffled into train and test sets.
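
The Dataset Splits and Experiment Setup rows describe the same protocol: nested speaker subsets with eight held-out validation speakers, plus a grid search over four hyperparameters selected by validation loss. The following is a minimal sketch of that protocol, not the released code: the speaker IDs, the `train_and_eval` callback, and the use of `random`/`itertools` are illustrative assumptions; the grid values are the ones quoted above.

```python
# Hypothetical sketch of the speaker-subset construction and hyperparameter
# grid search described in the table. Names and helpers are assumptions.
import itertools
import random

random.seed(0)

ALL_SPEAKERS = [f"p{225 + i}" for i in range(109)]  # placeholder VCTK-style speaker IDs
NA_SPEAKERS = ALL_SPEAKERS[:22]                     # 22 North American speakers (membership assumed known)


def nested_subsets(all_speakers, sizes=(65, 85, 101), held_out=8):
    """Build nested random subsets; the remaining `held_out` speakers are the validation pool."""
    shuffled = random.sample(all_speakers, len(all_speakers))
    validation = shuffled[-held_out:]
    pool = shuffled[:-held_out]
    return {n: pool[:n] for n in sizes}, validation


def train_test_split(speakers, test_ratio=0.1):
    """Shuffle one subset into train and test portions (ratio is an assumption)."""
    shuffled = random.sample(speakers, len(speakers))
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hyperparameter grid as quoted: learning rate, source noise std,
# batch size, and training-sample length.
GRID = {
    "lr": [1e-2, 1e-3, 1e-4],
    "noise_std": [1, 2, 4],
    "batch_size": [16, 32, 64],
    "sample_len": [10e2, 10e4],
}


def grid_search(train_and_eval):
    """Return the configuration with the lowest validation loss.

    `train_and_eval(config) -> float` is a hypothetical callback that trains
    a model with `config` and returns its validation loss.
    """
    best_cfg, best_loss = None, float("inf")
    for values in itertools.product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        loss = train_and_eval(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss
```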
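
The Research Type row cites the standard Mean Opinion Score experiment. As a point of reference, MOS is simply the mean of 1-5 listener ratings, usually reported with a 95% confidence interval; the snippet below illustrates that aggregation only, and is not the authors' crowdsourcing pipeline.

```python
# Minimal illustration of MOS aggregation: mean of 1-5 ratings with a
# normal-approximation 95% confidence interval.
import math


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / n)
    return mos, ci95


# Example: mean_opinion_score([4, 5, 3, 4, 4]) -> (4.00, ~0.62)
```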