Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods.
Researcher Affiliation | Collaboration | 1. University of Illinois, Urbana-Champaign, Urbana, IL; 2. Mitsubishi Electric Research Labs, Cambridge, MA
Pseudocode | No | The paper does not include pseudocode or an algorithm block explicitly labeled as such. Figure 1 provides a model illustration, but it is a diagram, not pseudocode.
Open Source Code | No | "We will be making our PyTorch implementation public." The code will be made available on the web.
Open Datasets | Yes | Audio Separation in the Wild (ASIW) dataset [8] ... Audio Visual Event (AVE) Dataset [47]
Dataset Splits | Yes | Audio Separation in the Wild (ASIW) Dataset: ... 147 validation, 322 test, and 10,540 training videos ... Audio Visual Event (AVE) Dataset: ... 2211 training, 257 validation, and 261 test set videos.
Hardware Specification | No | The paper mentions running experiments and refers to the "type of GPUs" in the ethics statement, but it does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) in the provided text.
Software Dependencies | No | The paper mentions various algorithms and models (e.g., MiDAS, RAFT, Faster-RCNN, U-Net, ResNet, ADAM optimizer) but does not provide specific software names with version numbers (e.g., PyTorch 1.9, Python 3.8).
Experiment Setup | Yes | The audio streams are sub-sampled at 11 kHz and STFT spectrograms are extracted using a Hann window of size 1022, with a hop length of 256 ... We set Ω = 256 and T = 256. The embeddings, ŷ_i, and the GRU hidden state are 512-dimensional. The IoU threshold, γ, is set to 0.1 for both datasets. Each window in a video has l = 8 frames. The weights on the different losses are as follows: λ1 = 0.05, λ2 = 1.0, λ3 = 1.0, λ4 = 0.05. Our model is trained using the ADAM optimizer [23] with a weight decay of 1e-4, β1 = 0.9, β2 = 0.999. The learning rate is set to 1e-4 and is decreased by a factor of 0.1 every 15K iterations.
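As a rough illustration of the reported setup, here is a minimal PyTorch/torchaudio sketch of the audio pre-processing and optimizer configuration. The input file name, the mono mix-down, the placeholder network, and the per-iteration scheduler stepping are our assumptions; the paper's actual implementation has not been released.

```python
import torch
import torchaudio

# --- Audio pre-processing sketch (file name "clip.wav" is a placeholder) ---
waveform, sr = torchaudio.load("clip.wav")
waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=11000)  # sub-sample to 11 kHz

# STFT spectrogram with a Hann window of size 1022 and hop length 256,
# which yields n_fft // 2 + 1 = 512 frequency bins; the paper works with
# Omega = 256, T = 256 spectrogram slices (any further resizing is assumed).
spec = torch.stft(
    waveform.mean(dim=0),            # mix down to mono (assumption)
    n_fft=1022,
    hop_length=256,
    window=torch.hann_window(1022),
    return_complex=True,
)
magnitude = spec.abs()

# --- Optimizer / schedule sketch (stand-in network; ASMP itself is not public) ---
model = torch.nn.Linear(512, 512)    # placeholder for the actual model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
)
# Decrease the learning rate by a factor of 0.1 every 15K iterations
# (assumes scheduler.step() is called once per training iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15000, gamma=0.1)
```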