FINALLY: fast and universal speech enhancement with studio-like quality

Authors: Nicholas Babaev, Kirill Tamogashev, Azat Saginbaev, Ivan Shchekotov, Hanbin Bae, Hosang Sung, WonJun Lee, Hoon-Young Cho, Pavel Andreev

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement. |
| Researcher Affiliation | Industry | All authors are affiliated with Samsung Research: Nicholas Babaev, Kirill Tamogashev, Azat Saginbaev, Ivan Shchekotov, Hanbin Bae, Hosang Sung, WonJun Lee, Hoon-Young Cho, Pavel Andreev. |
| Pseudocode | No | The paper describes algorithms (e.g., training stages and metric-calculation procedures) but does not present them in a formal pseudocode or algorithm block. |
| Open Source Code | No | We do not open the code for training of the model due to our organization policy, however, we use only publicly available data and most of the source codes are available as stated in Appendix D. |
| Open Datasets | Yes | We use LibriTTS-R (Koizumi et al., 2023b) and DAPS-clean (Mysore, 2014) as the sources of clean speech data. LibriTTS-R is used at 16 kHz, while DAPS is used at 48 kHz. Noise samples were taken from the DNS dataset (Dubey et al., 2022). |
| Dataset Splits | No | The paper uses LibriTTS-R and DAPS for training, and the UNIVERSE validation data and the VCTK-DEMAND test set for evaluation, but it does not provide explicit train/validation/test splits (percentages or counts) for the datasets used in training (LibriTTS-R, DAPS). |
| Hardware Specification | Yes | The main model is trained using 8 Nvidia P40 GPUs with effective batch size 32 and AdamW (Loshchilov & Hutter, 2017) as our main optimizer. |
| Software Dependencies | No | The paper mentions the torchaudio library (Hwang et al., 2023), the AdamW optimizer (Loshchilov & Hutter, 2017), the WavLM-large model from Hugging Face, and MS-STFT discriminators, each with a paper citation, but it does not provide version numbers for these software dependencies (e.g., "torchaudio 2.1" or "PyTorch 1.9"). |
| Experiment Setup | Yes | The main model is trained using 8 Nvidia P40 GPUs with effective batch size 32 and AdamW (Loshchilov & Hutter, 2017) as our main optimizer. For pretraining we use learning rate 0.0002, betas (0.8, 0.99), and learning-rate exponential decay of 0.996 every 200 iterations; the pretraining lasts 100,000 iterations. |
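
Since the authors do not release training code, the quoted pretraining setup can be made concrete with a minimal PyTorch sketch of the described optimizer and learning-rate schedule: AdamW with learning rate 0.0002 and betas (0.8, 0.99), decayed exponentially by 0.996 every 200 iterations over 100,000 iterations. The model, loss, and per-GPU batching below are placeholders, not the authors' implementation.

```python
import torch

# Placeholder model and data; the real system is the FINALLY generator,
# which is not publicly released.
model = torch.nn.Linear(1, 1)

# Optimizer hyperparameters as quoted in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.996)

for step in range(1, 100_000 + 1):  # pretraining lasts 100,000 iterations
    optimizer.zero_grad()
    # Dummy loss on a batch of 32 (the paper's effective batch size,
    # spread across 8 GPUs in the original setup).
    loss = model(torch.randn(32, 1)).pow(2).mean()
    loss.backward()
    optimizer.step()
    if step % 200 == 0:  # decay the learning rate by 0.996 every 200 iterations
        scheduler.step()
```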