ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis

Authors: Jungil Kong, Junmo Lee, Jeongmin Kim, Beomjeong Kim, Jihoon Park, Dohee Kong, Changheon Lee, Sangjin Kim

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method obtained a significantly higher similarity mean opinion score (SMOS) in subjective similarity evaluation than seen speakers of a high-performance multi-speaker model, even with unseen speakers. The proposed method also outperforms a zero-shot method by significant margins.
Researcher Affiliation | Industry | SK Telecom, Jung-gu, Seoul, Republic of Korea. Correspondence to: Jungil Kong <jik876@sktelecom.com>.
Pseudocode | No | The paper includes figures describing the model architectures but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a demo link (https://speechelf.github.io/elf-demo/) showcasing audio samples, but it does not state that the source code for the described methodology is released, nor does it link to a code repository.
Open Datasets | Yes | Two public datasets were used to train SFEN, the TTS model, and the speech feature-to-speech model. We used the LibriTTS (Panayotov et al., 2015) dataset... We used the VCTK (Veaux et al., 2017) dataset... (See the dataset-loading sketch below the table.)
Dataset Splits | No | The paper specifies training and test splits for the datasets but does not explicitly mention a separate validation split.
Hardware Specification | Yes | 8 NVIDIA V100 GPUs were used to train the model.
Software Dependencies | No | The paper mentions the AdamW optimizer and internal modules such as the WN module, but it does not specify version numbers for key software dependencies such as the programming language, deep learning framework, or libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The networks were trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.8, β2 = 0.99, and weight decay λ = 0.01. The learning rate decay was scheduled by a 0.999 factor in every epoch with an initial learning rate of 2 × 10⁻⁴. ... The batch size was set to 32 per GPU, and the model was trained up to 800k steps. (See the optimizer sketch below the table.)
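The paper does not release data-preparation code, so the following is only a minimal sketch of how the two public corpora it names could be fetched, assuming a PyTorch/torchaudio environment (the paper does not specify its framework). The LibriTTS subset and VCTK microphone choices below are assumptions, not settings stated in the paper.

```python
# Minimal sketch (not from the paper): fetching LibriTTS and VCTK with
# torchaudio's built-in dataset wrappers. The "train-clean-100" subset and
# "mic2" microphone are assumptions for illustration only.
import torchaudio

libritts = torchaudio.datasets.LIBRITTS(
    root="./data", url="train-clean-100", download=True
)
vctk = torchaudio.datasets.VCTK_092(root="./data", mic_id="mic2", download=True)

# Each LibriTTS item is a tuple:
# (waveform, sample_rate, original_text, normalized_text,
#  speaker_id, chapter_id, utterance_id)
waveform, sample_rate, _, text, speaker_id, _, _ = libritts[0]
print(sample_rate, speaker_id, text)
```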
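The hyperparameters quoted in the Experiment Setup row map directly onto standard PyTorch primitives. Below is a minimal sketch under the assumption that PyTorch is used (the paper does not name its framework); the `model` is a hypothetical stand-in for the actual SFEN/TTS networks.

```python
# Minimal sketch of the quoted optimizer and learning-rate schedule, assuming
# PyTorch. The Linear layer is a placeholder, not the paper's architecture.
import torch

model = torch.nn.Linear(80, 80)  # hypothetical stand-in for the real networks

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,             # initial learning rate 2 x 10^-4
    betas=(0.8, 0.99),   # beta1 = 0.8, beta2 = 0.99
    weight_decay=0.01,   # weight decay lambda = 0.01
)

# Decay the learning rate by a factor of 0.999 once per epoch;
# scheduler.step() would be called at the end of each training epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
```

The batch size of 32 per GPU (across 8 V100 GPUs) and the 800k-step budget quoted above would sit in the surrounding training loop, which the paper does not provide.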