ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis
Authors: Jungil Kong, Junmo Lee, Jeongmin Kim, Beomjeong Kim, Jihoon Park, Dohee Kong, Changheon Lee, Sangjin Kim
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method obtained a significantly higher similarity mean opinion score (SMOS) in the subjective similarity evaluation than the seen speakers of a high-performance multi-speaker model, even with unseen speakers. The proposed method also outperforms a zero-shot method by significant margins. |
| Researcher Affiliation | Industry | 1SK Telecom, Jung-gu, Seoul, Republic of Korea. Correspondence to: Jungil Kong <jik876@sktelecom.com>. |
| Pseudocode | No | The paper includes figures describing model architectures but does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a demo link (https://speechelf.github.io/elf-demo/) which showcases audio samples, but it does not explicitly state that the source code for the described methodology is released or provide a link to a code repository. |
| Open Datasets | Yes | Two public datasets were used to train SFEN, the TTS model, and the speech feature-to-speech model. We used the LibriTTS (Panayotov et al., 2015) dataset... We used the VCTK (Veaux et al., 2017) dataset... |
| Dataset Splits | No | The paper specifies training and test splits for the datasets but does not explicitly mention a separate validation dataset split. |
| Hardware Specification | Yes | 8 NVIDIA V100 GPUs were used to train the model. |
| Software Dependencies | No | The paper mentions the 'AdamW' optimizer and internal modules such as the 'WN module' but does not specify version numbers for key software dependencies such as the programming language, deep learning framework, or libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes (see the configuration sketch below the table) | The networks were trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with β₁ = 0.8, β₂ = 0.99, and weight decay λ = 0.01. The learning rate decay was scheduled by a 0.999 factor in every epoch with an initial learning rate of 2 × 10⁻⁴. ... The batch size was set to 32 per GPU, and the model was trained up to 800k steps. |
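
The reported optimization settings can be expressed as a short training sketch. This is a minimal reconstruction assuming PyTorch; the model, dataset, and loss below are placeholders, since the authors do not release their implementation, and only the hyperparameters quoted in the table (AdamW with β₁ = 0.8, β₂ = 0.99, weight decay 0.01, initial learning rate 2 × 10⁻⁴, per-epoch decay factor 0.999, batch size 32 per GPU, up to 800k steps) come from the paper. Multi-GPU data parallelism (the paper reports 8 NVIDIA V100 GPUs) would typically wrap the model in `torch.nn.parallel.DistributedDataParallel`, which is omitted here for brevity.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholders: the authors' architecture and data pipeline are not released,
# so only the hyperparameters below are taken from the paper.
model = torch.nn.Linear(80, 80)                            # stand-in for the actual network
dataset = TensorDataset(torch.randn(256, 80), torch.randn(256, 80))
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # batch size 32 per GPU

# AdamW with beta1 = 0.8, beta2 = 0.99, weight decay 0.01, initial LR 2e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.8, 0.99), weight_decay=0.01)
# Learning rate multiplied by 0.999 once per epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

max_steps, step = 800_000, 0                               # trained up to 800k steps
while step < max_steps:
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)   # placeholder loss
        loss.backward()
        optimizer.step()
        step += 1
        if step >= max_steps:
            break
    scheduler.step()                                       # decay applied per epoch
```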