RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
Authors: Samuel Pegg, Kai Li, Xiaolin Hu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive experimental evaluations on three widely used datasets: LRS2 (Afouras et al., 2018a), LRS3 (Afouras et al., 2018b) and VoxCeleb2 (Chung et al., 2018), to demonstrate the value of each contribution. |
| Researcher Affiliation | Academia | 1. Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China 2. Tsinghua Laboratory of Brain and Intelligence (THBI), IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China 3. Chinese Institute for Brain Research (CIBR), Beijing 100010, China |
| Pseudocode | No | The paper describes its methods using mathematical equations and descriptive text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | In order to accommodate full reproducibility, we will open-source the code for RTFS-Net under the MIT licence on GitHub once this paper has been accepted into the conference. |
| Open Datasets | Yes | We utilized the same AVSS datasets as previous works (Gao & Grauman, 2021; Li et al., 2022) in the field in order to create a fair comparison of performance: LRS2-2Mix (Afouras et al., 2018a), LRS3-2Mix (Afouras et al., 2018b) and VoxCeleb2-2Mix (Chung et al., 2018). |
| Dataset Splits | Yes | The LRS2-2Mix (Afouras et al., 2018a) dataset is derived from BBC television broadcasts... The dataset contains 11 hours of training, 3 hours of validation, and 1.5 hours of testing data. |
| Hardware Specification | Yes | In our main results table, we also include inference time: the time taken to process 2 seconds of audio on an NVIDIA 2080 GPU. ... Experimentation and training was accomplished using a single server with 8 NVIDIA 3080 GPUs ... NVIDIA 4090s were necessary in order to obtain these results. |
| Software Dependencies | Yes | The code for RTFS-Net was written in Python 3.10 using standard Python deep learning libraries, specifically PyTorch and PyTorch Lightning. |
| Experiment Setup | Yes | For all model versions, R ∈ {4, 6, 12}, we used the same hyperparameter settings. Encoder. The STFT used a Hanning analysis window with a window size of 256 and a hop length of 128. The encoded feature dimension was Ca = 256. ... For training we used a batch size of 4 and AdamW (Loshchilov & Hutter, 2018) optimization with a weight decay of 1×10⁻¹. The initial learning rate used was 1×10⁻³, but this value was halved whenever the validation loss did not decrease for 5 epochs in a row. |
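The reported setup (Hanning-window STFT with window 256 and hop 128; AdamW with weight decay 1×10⁻¹, initial learning rate 1×10⁻³ halved after 5 stagnant validation epochs) maps directly onto standard PyTorch components. A minimal sketch, assuming a 16 kHz sample rate and a placeholder module in place of RTFS-Net itself (both are illustrative assumptions, not stated in the paper):

```python
import torch

# Encoder front-end: STFT with a Hann analysis window,
# window size 256 and hop length 128 (as reported).
audio = torch.randn(32000)  # 2 s of audio at an assumed 16 kHz
window = torch.hann_window(256)
spec = torch.stft(audio, n_fft=256, hop_length=128,
                  window=window, return_complex=True)
# spec has n_fft // 2 + 1 = 129 frequency bins per time frame.

# Optimisation: AdamW with lr 1e-3 and weight decay 1e-1; halve the
# learning rate when validation loss stalls for 5 epochs in a row.
model = torch.nn.Linear(129, 129)  # hypothetical stand-in for RTFS-Net
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3, weight_decay=1e-1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

# In the training loop one would call scheduler.step(val_loss)
# once per epoch after computing the validation loss.
```

`ReduceLROnPlateau` with `factor=0.5, patience=5` reproduces the "halved whenever the validation loss did not decrease for 5 epochs" rule without manual bookkeeping.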