Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Authors: Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Hwan Lee, Hoon Heo, Kyogu Lee

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that NANSY achieves significant performance improvements in several applications, such as zero-shot voice conversion, pitch shift, and time-scale modification.
Researcher Affiliation | Collaboration | Hyeong-Seok Choi (1,4), Juheon Lee (1,4), Wansoo Kim (1,4), Jie Hwan Lee (4), Hoon Heo (4), Kyogu Lee (1,2,3,4); 1 MARG, Department of Intelligence and Information, Seoul National University; 2 GSAI; 3 AIIS; 4 Supertone Inc. {kekepa15, juheon2, wansookim, kglee}@snu.ac.kr, {wiswisbus, hoon}@supertone.ai
Pseudocode | No | The paper describes several algorithms and procedures, such as the Yin pitch-estimation algorithm and the information perturbation approach (an illustrative sketch of the Yin core is given after this table), but it does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The code is proprietary.
Open Datasets | Yes | To train NANSY on English, we used two datasets, i.e., 1. VCTK [47], 2. the train-clean-360 subset of LibriTTS [54]. ... To train NANSY on multi-language, we used the CSS10 dataset [32]. ... The licenses of the used datasets are as follows: 1. VCTK: Open Data Commons Attribution License 1.0, 2. LibriTTS: Creative Commons Attribution 4.0, and 3. CSS10: Apache License 2.0.
Dataset Splits | No | The paper states 'We trained the model using 90% of samples for each speaker' and 'For the seen speaker test we used 10% unseen utterances of VCTK.' It specifies training and testing splits but does not mention a distinct validation split or percentage (a per-speaker split sketch follows the table).
Hardware Specification | Yes | We trained every model using one RTX 3090 with batch size 32.
Software Dependencies | No | The paper mentions using the 'Adam optimizer' and the 'pre-trained HiFi-GAN vocoder' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We used a 22,050 Hz sampling rate for every analysis feature except the wav2vec input, which takes waveforms at a sampling rate of 16,000 Hz. We used 80 mel spectrogram bands, where the FFT, window, and hop sizes were set to 1024, 1024, and 256, respectively. The samples were randomly cropped to approximately 1.47 seconds, which results in 128 mel spectrogram frames. The networks were trained using the Adam optimizer with β1 = 0.5 and β2 = 0.9. The learning rate was fixed to 10^-4. We trained every model using one RTX 3090 with batch size 32. Training was completed after 50 epochs. (A hedged preprocessing/training sketch based on these values follows the table.)
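The Yin algorithm named in the Pseudocode row is a standard F0 (pitch) estimator. Below is a minimal NumPy sketch of its core steps (difference function, cumulative mean normalized difference, absolute threshold). The function name, threshold value, and search range are illustrative assumptions and are not taken from the paper, which only references Yin by name.

```python
import numpy as np

def yin_f0(frame, sr, fmin=65.0, fmax=1000.0, threshold=0.15):
    """Illustrative Yin pitch estimate for a single frame (parameter values are assumptions)."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    n = len(frame)

    # Step 1: difference function d(tau) = sum_j (x[j] - x[j + tau])^2
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = frame[: n - tau] - frame[tau:n]
        d[tau] = np.dot(diff, diff)

    # Step 2: cumulative mean normalized difference d'(tau) = d(tau) * tau / sum_{j<=tau} d(j)
    cmnd = np.ones(tau_max + 1)
    running_sum = 0.0
    for tau in range(1, tau_max + 1):
        running_sum += d[tau]
        cmnd[tau] = d[tau] * tau / running_sum if running_sum > 0 else 1.0

    # Step 3: absolute threshold -- first lag whose CMND dips below the threshold
    for tau in range(tau_min, tau_max + 1):
        if cmnd[tau] < threshold:
            return sr / tau  # rough F0 in Hz (no parabolic interpolation in this sketch)
    return 0.0               # treated as unvoiced
```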
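The Dataset Splits row reports a per-speaker 90/10 train/test split with no explicit validation set. The sketch below shows one way such a split could be produced, assuming utterances are grouped by speaker in a dictionary; the 90/10 ratio comes from the paper, while the data structure, function name, and seeding are illustrative assumptions.

```python
import random

def split_per_speaker(utterances_by_speaker, train_ratio=0.9, seed=0):
    """Split each speaker's utterances 90/10 into train/test (no validation set,
    mirroring what the paper reports). Input format and seed are assumptions."""
    rng = random.Random(seed)
    train, test = [], []
    for speaker, utts in utterances_by_speaker.items():
        utts = list(utts)
        rng.shuffle(utts)
        cut = int(len(utts) * train_ratio)
        train.extend((speaker, u) for u in utts[:cut])
        test.extend((speaker, u) for u in utts[cut:])
    return train, test
```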
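The Experiment Setup row fixes most of the preprocessing and optimizer hyperparameters. The sketch below collects them into a torchaudio/PyTorch configuration; only the numeric values are from the paper, while the use of torchaudio's MelSpectrogram, the placeholder model, and the commented training loop are assumptions for illustration.

```python
import torch
import torchaudio

SR_MEL = 22050       # sampling rate for analysis features (paper)
SR_WAV2VEC = 16000   # sampling rate expected by the wav2vec input (paper)
N_FFT = 1024         # FFT size (paper)
WIN_LENGTH = 1024    # window size (paper)
HOP_LENGTH = 256     # hop size (paper)
N_MELS = 80          # mel bands (paper)
CROP_FRAMES = 128    # random crop of ~1.47 s at the hop size above (paper)

# 80-band mel spectrogram with the reported FFT/window/hop sizes.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR_MEL,
    n_fft=N_FFT,
    win_length=WIN_LENGTH,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
)

# Placeholder network standing in for the NANSY analysis/synthesis modules (assumption).
model = torch.nn.Linear(N_MELS, N_MELS)

# Adam with beta1 = 0.5, beta2 = 0.9 and a fixed learning rate of 1e-4 (paper).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.9))

# Batch size 32 and 50 epochs are reported; the dataset object here is assumed.
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
# for epoch in range(50):
#     for batch in loader:
#         ...
```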