FINALLY: fast and universal speech enhancement with studio-like quality

Authors: Nicholas Babaev, Kirill Tamogashev, Azat Saginbaev, Ivan Shchekotov, Hanbin Bae, Hosang Sung, WonJun Lee, Hoon-Young Cho, Pavel Andreev

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement. |
| Researcher Affiliation | Industry | All authors are affiliated with Samsung Research: Nicholas Babaev, Kirill Tamogashev, Azat Saginbaev, Ivan Shchekotov, Hanbin Bae, Hosang Sung, WonJun Lee, Hoon-Young Cho, Pavel Andreev. |
| Pseudocode | No | The paper describes algorithms (e.g., training stages and metric-calculation procedures) but does not present them in a formal pseudocode or algorithm block. |
| Open Source Code | No | We do not open the code for training of the model due to our organization policy, however, we use only publicly available data and most of the source codes are available as stated in Appendix D. |
| Open Datasets | Yes | We use LibriTTS-R (Koizumi et al., 2023b) and DAPS-clean (Mysore, 2014) as the sources of clean speech data. LibriTTS-R is used at 16 kHz, while DAPS is used at 48 kHz. Noise samples were taken from the DNS dataset (Dubey et al., 2022). |
| Dataset Splits | No | The paper uses LibriTTS-R and DAPS for training, and the UNIVERSE validation data and the VCTK-DEMAND test set for evaluation, but it does not provide explicit train/validation/test splits (percentages or counts) for the datasets used in training (LibriTTS-R, DAPS). |
| Hardware Specification | Yes | The main model is trained using 8 Nvidia P40 GPUs with effective batch size 32 and AdamW (Loshchilov & Hutter, 2017) as our main optimizer. |
| Software Dependencies | No | The paper mentions the torchaudio library (Hwang et al., 2023), the AdamW optimizer (Loshchilov & Hutter, 2017), the WavLM-large model from Hugging Face, and MS-STFT discriminators, each with a paper citation, but it does not provide version numbers for these software dependencies (e.g., "torchaudio 2.1" or "PyTorch 1.9"). |
| Experiment Setup | Yes | The main model is trained using 8 Nvidia P40 GPUs with effective batch size 32 and AdamW (Loshchilov & Hutter, 2017) as our main optimizer. For pretraining we use learning rate 0.0002, betas (0.8, 0.99), and learning-rate exponential decay of 0.996 every 200 iterations; the pretraining lasts 100,000 iterations. |
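
Since the authors do not release training code, the quoted pretraining setup can be made concrete with a minimal PyTorch sketch of the described optimizer and learning-rate schedule: AdamW with learning rate 0.0002 and betas (0.8, 0.99), decayed exponentially by 0.996 every 200 iterations over 100,000 iterations. The model, loss, and per-GPU batching below are placeholders, not the authors' implementation.

```python
import torch

# Placeholder model and data; the real system is the FINALLY generator,
# which is not publicly released.
model = torch.nn.Linear(1, 1)

# Optimizer hyperparameters as quoted in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99))
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.996)

for step in range(1, 100_000 + 1):  # pretraining lasts 100,000 iterations
    optimizer.zero_grad()
    # Dummy loss on a batch of 32 (the paper's effective batch size,
    # spread across 8 GPUs in the original setup).
    loss = model(torch.randn(32, 1)).pow(2).mean()
    loss.backward()
    optimizer.step()
    if step % 200 == 0:  # decay the learning rate by 0.996 every 200 iterations
        scheduler.step()
```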