NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis

Authors: Hyeong-Seok Choi, Jinhyeok Yang, Juheon Lee, Hyeongju Kim

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high quality synthesis.
Researcher Affiliation | Collaboration | Hyeong-Seok Choi (1,2), *Jinhyeok Yang (2), *Juheon Lee (1,2), *Hyeongju Kim (2); 1: Seoul National University, 2: Supertone, Inc.; {kekepa15,yangyangii,juheon2,hyeongju}@supertone.ai
Pseudocode | No | The paper includes architectural diagrams (e.g., Figures 6, 7, and 8) but no explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that source code for the methodology will be released nor links to a code repository.
Open Datasets | Yes | We sampled 30 speech and noise recordings from the VCTK and DEMAND datasets (Veaux et al., 2017; Thiemann et al., 2013), respectively, and mixed them with 5 dB signal-to-noise ratio (SNR). (A mixing sketch follows the table.)
Dataset Splits | Yes | We randomly selected 4800 and 600 speakers, and constructed training and validation sets respectively by merging their utterances and speech data of the NANSY++ backbone dataset.
Hardware Specification | Yes | The batch size was set to 60 using 10 RTX 3090 GPUs.
Software Dependencies | No | The paper mentions optimizers and tools such as the Adam optimizer (Kingma & Ba, 2014) and Silero (2021) but does not specify version numbers for general software dependencies or libraries.
Experiment Setup | Yes | We trained the backbone model for 1M iterations with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10^-4. The learning rate for MPD was 2 × 10^-4. The batch size was set to 60 using 10 RTX 3090 GPUs. (See the training-configuration sketch below.)
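
The Open Datasets row quotes the paper's procedure of mixing speech and noise at a 5 dB SNR. Since no code is released, the sketch below shows one common way such mixing is implemented; the function name mix_at_snr and the NumPy-based approach are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 5.0) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio in dB."""
    # Tile or trim the noise so it covers the whole speech signal.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose a gain so that 10 * log10(speech_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: mix random noise into a dummy 1-second, 16 kHz signal at 5 dB SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)
noise = rng.standard_normal(16000).astype(np.float32)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```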
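The Experiment Setup row quotes the optimizer choices and learning rates. A minimal PyTorch sketch of that configuration follows; the placeholder modules and variable names (backbone, mpd) are hypothetical, since the NANSY++ architecture and training code are not public, and only the quoted hyperparameters come from the paper.

```python
import torch

# Stand-in modules for the NANSY++ backbone and the multi-period
# discriminator (MPD); the real architectures are not released, so
# these single linear layers are placeholders only.
backbone = torch.nn.Sequential(torch.nn.Linear(80, 80))
mpd = torch.nn.Sequential(torch.nn.Linear(80, 1))

# Hyperparameters as quoted in the paper: Adam with lr 1e-4 for the
# backbone, 2e-4 for the MPD, batch size 60, 1M training iterations.
opt_backbone = torch.optim.Adam(backbone.parameters(), lr=1e-4)
opt_mpd = torch.optim.Adam(mpd.parameters(), lr=2e-4)

batch_size = 60            # spread across 10 RTX 3090 GPUs in the paper
num_iterations = 1_000_000
```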