FINALLY: fast and universal speech enhancement with studio-like quality
Authors: Nicholas Babaev, Kirill Tamogashev, Azat Saginbaev, Ivan Shchekotov, Hanbin Bae, Hosang Sung, WonJun Lee, Hoon-Young Cho, Pavel Andreev
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement. |
| Researcher Affiliation | Industry | Nicholas Babaev (Samsung Research), Kirill Tamogashev (Samsung Research), Azat Saginbaev (Samsung Research), Ivan Shchekotov (Samsung Research), Hanbin Bae (Samsung Research), Hosang Sung (Samsung Research), WonJun Lee (Samsung Research), Hoon-Young Cho (Samsung Research), Pavel Andreev (Samsung Research) |
| Pseudocode | No | The paper describes algorithms (e.g., training stages, metrics calculation procedures) but does not present them in a formal 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | We do not open the code for training of the model due to our organization policy; however, we use only publicly available data, and most of the source code is available as stated in Appendix D. |
| Open Datasets | Yes | We use LibriTTS-R (Koizumi et al., 2023b) and DAPS-clean (Mysore, 2014) as the sources of clean speech data. LibriTTS-R is used at 16 kHz, while DAPS is used at 48 kHz. Noise samples were taken from the DNS dataset (Dubey et al., 2022). (A resampling sketch for these sample rates follows the table.) |
| Dataset Splits | No | The paper states it uses 'LibriTTS-R' and 'DAPS' for training, and 'UNIVERSE validation data' and the 'VCTK-DEMAND test set' for evaluation, but it does not provide explicit details (percentages or counts) of its own train/validation/test splits for the datasets it uses for training (LibriTTS-R, DAPS). |
| Hardware Specification | Yes | The main model is trained using 8 Nvidia P40 GPUs with effective batch size 32 and AdamW (Loshchilov & Hutter, 2017) as our main optimizer. |
| Software Dependencies | No | The paper mentions using the 'torchaudio library (Hwang et al., 2023)' and specific optimizers like 'AdamW (Loshchilov & Hutter, 2017)', and refers to models like the 'WavLM large model from Hugging Face' and 'MS-STFT discriminators' with associated paper citations, but it does not provide specific version numbers for these software dependencies (e.g., 'torchaudio 2.1', 'PyTorch 1.9'). |
| Experiment Setup | Yes | The main model is trained using 8 Nvidia P40 GPUs with effective batch size 32 and AdamW (Loshchilov & Hutter, 2017) as our main optimizer. For pretraining we use learning rate 0.0002, betas (0.8, 0.99), and an exponential learning-rate decay of 0.996 every 200 iterations; the pretraining lasts 100,000 iterations. (A minimal PyTorch sketch of this setup follows the table.) |
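Because LibriTTS-R is used at 16 kHz and DAPS at 48 kHz, reproducing the data pipeline requires resampling to the target rate. The sketch below uses torchaudio, which the paper cites (version unspecified); the file path is a hypothetical placeholder, not a path from the paper.

```python
import torchaudio

# Hypothetical file path; LibriTTS-R and DAPS must be downloaded separately.
wav, sr = torchaudio.load("libritts_r/sample.wav")

# The paper uses LibriTTS-R at 16 kHz and DAPS at 48 kHz; resample if needed.
target_sr = 16_000
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
```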
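The pretraining configuration quoted in the Experiment Setup row maps directly onto standard PyTorch primitives. The sketch below is a minimal, hypothetical reconstruction, not the authors' code (which is not released): the model and loss are placeholders, and only the optimizer, its hyperparameters, the decay schedule, the batch size, and the iteration count come from the paper.

```python
import torch

# Placeholder model; the FINALLY generator itself is not publicly available.
model = torch.nn.Linear(16, 16)

# Reported setup: AdamW, lr 0.0002, betas (0.8, 0.99), with the learning
# rate multiplied by 0.996 every 200 iterations (StepLR reproduces this).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.996)

for step in range(100_000):                # pretraining lasts 100,000 iterations
    x = torch.randn(32, 16)                # effective batch size 32
    loss = model(x).pow(2).mean()          # placeholder loss, not the paper's objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                       # lr *= 0.996 once every 200 steps
```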