RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

Authors: Samuel Pegg, Kai Li, Xiaolin Hu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted comprehensive experimental evaluations on three widely used datasets: LRS2 (Afouras et al., 2018a), LRS3 (Afouras et al., 2018b) and VoxCeleb2 (Chung et al., 2018), to demonstrate the value of each contribution.
Researcher Affiliation | Academia | 1. Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University, Beijing 100084, China 2. Tsinghua Laboratory of Brain and Intelligence (THBI), IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China 3. Chinese Institute for Brain Research (CIBR), Beijing 100010, China
Pseudocode | No | The paper describes its methods using mathematical equations and descriptive text but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | In order to accommodate full reproducibility, we will open-source the code for RTFS-Net under the MIT licence on GitHub once this paper has been accepted into the conference.
Open Datasets | Yes | We utilized the same AVSS datasets as previous works (Gao & Grauman, 2021; Li et al., 2022) in the field in order to create a fair comparison of performance: LRS2-2Mix (Afouras et al., 2018a), LRS3-2Mix (Afouras et al., 2018b) and VoxCeleb2-2Mix (Chung et al., 2018).
Dataset Splits | Yes | The LRS2-2Mix (Afouras et al., 2018a) dataset is derived from BBC television broadcasts... The dataset contains 11 hours of training, 3 hours of validation, and 1.5 hours of testing data.
Hardware Specification | Yes | In our main results table, we also include inference time: the time taken to process 2 seconds of audio on an NVIDIA 2080 GPU. ... Experimentation and training were accomplished using a single server with 8 NVIDIA 3080 GPUs ... NVIDIA 4090s were necessary in order to obtain these results. (A timing sketch is given below the table.)
Software Dependencies | Yes | The code for RTFS-Net was written in Python 3.10 using standard Python deep learning libraries, specifically PyTorch and PyTorch Lightning.
Experiment Setup | Yes | For all model versions, R ∈ {4, 6, 12}, we used the same hyperparameter settings. Encoder. The STFT used a Hanning analysis window with a window size of 256 and a hop length of 128. The encoded feature dimension was Ca = 256. ... For training we used a batch size of 4 and AdamW (Loshchilov & Hutter, 2018) optimization with a weight decay of 1 × 10⁻¹. The initial learning rate used was 1 × 10⁻³, but this value was halved whenever the validation loss did not decrease for 5 epochs in a row. (A configuration sketch is given below the table.)
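
The inference time quoted in the Hardware Specification row (time to process 2 seconds of audio on an NVIDIA 2080 GPU) can be measured with a benchmark along the following lines. This is a minimal sketch, not the authors' script: the model placeholder, its `(audio, video)` call signature, the 16 kHz sample rate, and the 25 fps lip-crop shape are all assumptions made for illustration.

```python
import time

import torch


def time_inference(model: torch.nn.Module, sample_rate: int = 16000, n_runs: int = 100) -> float:
    """Average wall-clock time to separate a 2-second mixture on the GPU."""
    device = torch.device("cuda")
    model = model.to(device).eval()

    # Dummy inputs: a 2 s audio mixture and matching lip-region video.
    # The shapes are assumptions; RTFS-Net's real interface may differ.
    audio_mix = torch.randn(1, 2 * sample_rate, device=device)
    video = torch.randn(1, 50, 88, 88, device=device)  # 25 fps x 2 s, 88x88 crops

    with torch.no_grad():
        for _ in range(10):  # warm-up so CUDA kernels are compiled and cached
            model(audio_mix, video)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(audio_mix, video)
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / n_runs
```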
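
The Software Dependencies and Experiment Setup rows together pin down the training stack and several hyperparameters. The sketch below wires the quoted values (256-point Hanning window with hop length 128, batch size 4, AdamW with weight decay 1 × 10⁻¹, initial learning rate 1 × 10⁻³ halved after 5 validation epochs without improvement) into a minimal PyTorch Lightning module. The module layout, the placeholder network and loss, and the use of `ReduceLROnPlateau` to implement the halving rule are assumptions; only the numeric values come from the paper.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F


def stft_encode(audio: torch.Tensor, n_fft: int = 256, hop_length: int = 128) -> torch.Tensor:
    """Quoted encoder settings: 256-point Hanning analysis window, hop length 128."""
    window = torch.hann_window(n_fft, device=audio.device)
    return torch.stft(audio, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)


class AVSSModule(pl.LightningModule):
    """Minimal training scaffold; `separator` stands in for RTFS-Net itself."""

    def __init__(self, separator: torch.nn.Module, lr: float = 1e-3):
        super().__init__()
        self.separator = separator
        self.lr = lr

    def training_step(self, batch, batch_idx):
        mix, video, target = batch
        est = self.separator(mix, video)
        loss = F.l1_loss(est, target)  # placeholder; the paper's loss is not quoted here
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        mix, video, target = batch
        est = self.separator(mix, video)
        self.log("val_loss", F.l1_loss(est, target))

    def configure_optimizers(self):
        # Quoted values: AdamW, weight decay 1e-1, initial learning rate 1e-3.
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr, weight_decay=1e-1)
        # One way to halve the LR after 5 epochs without validation improvement.
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", factor=0.5, patience=5)
        return {"optimizer": optimizer,
                "lr_scheduler": {"scheduler": scheduler, "monitor": "val_loss"}}


# The quoted batch size of 4 would be set on the DataLoader, e.g.
# DataLoader(train_set, batch_size=4, shuffle=True), and training launched with
# pl.Trainer(...).fit(AVSSModule(separator), train_loader, val_loader).
```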