Fitting New Speakers Based on a Short Untranscribed Sample
Authors: Eliya Nachmani, Adam Polyak, Yaniv Taigman, Lior Wolf
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate a greatly improved performance on both the dataset speakers, and, more importantly, when fitting new voices, even from very short samples. ... 5. Experiments |
| Researcher Affiliation | Collaboration | ¹Facebook AI Research, ²Tel Aviv University. Correspondence to: Eliya Nachmani <eliyan@fb.com>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Various samples can be found on the project's webpage https://ytaigman.github.io/fitspk/index.html. |
| Open Datasets | Yes | The VCTK dataset (Veaux et al., 2017) contains 109 speakers. ... The LibriSpeech dataset (Panayotov et al., 2015) is a corpus of 360 hours of voice... The VoxCeleb dataset (Nagrani et al., 2017) is a compilation of YouTube URLs and timestamps... |
| Dataset Splits | No | The remaining eight speakers, which were left out for validation, are not used in our experiments. This indicates that the validation split, although defined, was not actually used in their experimental setup. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for replication. It mentions using 'the crowdMOS toolkit by (P. Ribeiro et al., 2011)' but gives no version details. |
| Experiment Setup | Yes | The network has five convolutional layers of 3×3 filters, each with 32 channels. ... in all of our experiments, we set α = β = 10. ... during the first phase, a noise SD equal to 4.0 is added ... these sequences are cropped to a length of 100. A batch size equal to 256 is used for exactly 90 epochs. Phase 2 of the training process employs noise SD of 2.0, and sequence lengths that are trimmed at 1000 vocoder features. The batch size is reduced to 30... |
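
The quoted setup is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the described five-layer 3×3 convolutional encoder and the two-phase training schedule. Only the layer count, channel width, noise SDs, crop lengths, batch sizes, epoch count, and α = β = 10 come from the paper; the class name `SpeakerEncoder`, input layout, activations, pooling, and embedding size are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Hypothetical sketch: five 3x3 conv layers, 32 channels each (per the paper).
    Activations, pooling, and the embedding head are assumptions."""

    def __init__(self, embedding_dim: int = 64):  # embedding size is an assumption
        super().__init__()
        layers, in_ch = [], 1  # treat input features as a single-channel "image"
        for _ in range(5):  # five conv layers of 3x3 filters, 32 channels (from the paper)
            layers += [nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = 32
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(32, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time, feature_dim) vocoder-feature "spectrogram"
        h = self.conv(x)
        h = h.mean(dim=(2, 3))  # global average pool over time and feature axes
        return self.proj(h)

# Two-phase training schedule quoted in the table (values from the paper;
# plain dicts for illustration, not an API from the authors' code):
phase1 = dict(noise_sd=4.0, crop_len=100, batch_size=256, epochs=90)
phase2 = dict(noise_sd=2.0, crop_len=1000, batch_size=30)
alpha = beta = 10  # loss-term weights, as stated in the paper
```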