ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis

Authors: Jungil Kong, Junmo Lee, Jeongmin Kim, Beomjeong Kim, Jihoon Park, Dohee Kong, Changheon Lee, Sangjin Kim

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method obtained a significantly higher similarity mean opinion score (SMOS) in subjective similarity evaluation than seen speakers of a high-performance multi-speaker model, even with unseen speakers. The proposed method also outperforms a zero-shot method by significant margins.
Researcher Affiliation | Industry | SK Telecom, Jung-gu, Seoul, Republic of Korea. Correspondence to: Jungil Kong <jik876@sktelecom.com>.
Pseudocode | No | The paper includes figures describing the model architectures but does not present any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a demo link (https://speechelf.github.io/elf-demo/) showcasing audio samples, but it does not state that the source code for the described methodology is released, nor does it link to a code repository.
Open Datasets | Yes | Two public datasets were used to train SFEN, the TTS model, and the speech feature-to-speech model. We used the LibriTTS (Panayotov et al., 2015) dataset... We used the VCTK (Veaux et al., 2017) dataset... (See the dataset-loading sketch below the table.)
Dataset Splits | No | The paper specifies training and test splits for the datasets but does not explicitly mention a separate validation split.
Hardware Specification | Yes | 8 NVIDIA V100 GPUs were used to train the model.
Software Dependencies | No | The paper mentions the AdamW optimizer and internal modules such as the WN module, but it does not specify version numbers for key software dependencies such as the programming language, deep learning framework, or libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The networks were trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.8, β2 = 0.99, and weight decay λ = 0.01. The learning rate decay was scheduled by a 0.999 factor in every epoch with an initial learning rate of 2 × 10⁻⁴. ... The batch size was set to 32 per GPU, and the model was trained up to 800k steps. (See the optimizer sketch below the table.)
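The paper does not release data-preparation code, so the following is only a minimal sketch of how the two public corpora it names could be fetched, assuming a PyTorch/torchaudio environment (the paper does not specify its framework). The LibriTTS subset and VCTK microphone choices below are assumptions, not settings stated in the paper.

```python
# Minimal sketch (not from the paper): fetching LibriTTS and VCTK with
# torchaudio's built-in dataset wrappers. The "train-clean-100" subset and
# "mic2" microphone are assumptions for illustration only.
import torchaudio

libritts = torchaudio.datasets.LIBRITTS(
    root="./data", url="train-clean-100", download=True
)
vctk = torchaudio.datasets.VCTK_092(root="./data", mic_id="mic2", download=True)

# Each LibriTTS item is a tuple:
# (waveform, sample_rate, original_text, normalized_text,
#  speaker_id, chapter_id, utterance_id)
waveform, sample_rate, _, text, speaker_id, _, _ = libritts[0]
print(sample_rate, speaker_id, text)
```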
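The hyperparameters quoted in the Experiment Setup row map directly onto standard PyTorch primitives. Below is a minimal sketch under the assumption that PyTorch is used (the paper does not name its framework); the `model` is a hypothetical stand-in for the actual SFEN/TTS networks.

```python
# Minimal sketch of the quoted optimizer and learning-rate schedule, assuming
# PyTorch. The Linear layer is a placeholder, not the paper's architecture.
import torch

model = torch.nn.Linear(80, 80)  # hypothetical stand-in for the real networks

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,             # initial learning rate 2 x 10^-4
    betas=(0.8, 0.99),   # beta1 = 0.8, beta2 = 0.99
    weight_decay=0.01,   # weight decay lambda = 0.01
)

# Decay the learning rate by a factor of 0.999 once per epoch;
# scheduler.step() would be called at the end of each training epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
```

The batch size of 32 per GPU (across 8 V100 GPUs) and the 800k-step budget quoted above would sit in the surrounding training loop, which the paper does not provide.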