Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation
Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods. |
| Researcher Affiliation | Collaboration | (1) University of Illinois, Urbana-Champaign, Urbana, IL; (2) Mitsubishi Electric Research Labs, Cambridge, MA |
| Pseudocode | No | The paper does not include pseudocode or an algorithm block explicitly labeled as such. Figure 1 provides a model illustration, but it is a diagram, not pseudocode. |
| Open Source Code | No | We will be making our PyTorch implementation public. The code will be made available on the web. |
| Open Datasets | Yes | Audio Separation in the Wild (ASIW) dataset [8]...Audio Visual Event (AVE) Dataset [47] |
| Dataset Splits | Yes | Audio Separation in the Wild (ASIW) Dataset: ... 147 validation, 322 test, and 10,540 training videos...Audio Visual Event (AVE) Dataset: ... 2211 training, 257 validation, and 261 test set videos. |
| Hardware Specification | No | The paper mentions running experiments and refers to "type of GPUs" in the ethics statement, but it does not provide specific hardware details (e.g., specific GPU/CPU models, memory amounts) in the provided text. |
| Software Dependencies | No | The paper mentions various algorithms and models (e.g., MiDAS, RAFT, Faster-RCNN, U-Net, ResNet, ADAM optimizer) but does not provide specific software names with version numbers (e.g., PyTorch 1.9, Python 3.8). |
| Experiment Setup | Yes | The audio streams are sub-sampled at 11 kHz and STFT spectrograms are extracted using a Hann window of size 1022, with hop length of 256...We set Ω = 256 and T = 256. The embeddings, ŷᵢ, and the GRU hidden state are 512-dimensional. The IoU threshold γ is set to 0.1 for both datasets. Each window in a video has l = 8 frames. The weights on the different losses are as follows: λ1 = 0.05, λ2 = 1.0, λ3 = 1.0, λ4 = 0.05. Our model is trained using the ADAM optimizer [23] with a weight decay of 1e-4, β1 = 0.9, β2 = 0.999. The learning rate is set to 1e-4 and is decreased by a factor of 0.1 every 15K iterations. (See the configuration sketch below the table.) |
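
The hyperparameters quoted in the Experiment Setup row map onto standard PyTorch calls. Below is a minimal, hedged sketch of the audio preprocessing and optimizer configuration they imply; the function and variable names (e.g., `waveform_to_spectrogram`, the placeholder model) are illustrative assumptions, not the authors' released code.

```python
import torch

# Hedged sketch of the reported experiment setup. All identifiers here are
# illustrative assumptions; this is NOT the authors' implementation.

SAMPLE_RATE = 11000   # audio sub-sampled at 11 kHz
N_FFT = 1022          # Hann window of size 1022
HOP_LENGTH = 256      # hop length of 256
# The paper reports spectrograms with Omega = 256 frequency bins and T = 256
# time frames; a 1022-point STFT yields 512 bins, so some resampling or
# cropping step (not detailed in the quoted text) is presumably applied.

def waveform_to_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """STFT magnitude spectrogram using the reported window/hop settings."""
    window = torch.hann_window(N_FFT)
    stft = torch.stft(
        waveform,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        window=window,
        return_complex=True,
    )
    return stft.abs()

# Optimizer and learning-rate schedule following the reported hyperparameters.
model = torch.nn.Linear(512, 512)  # placeholder standing in for the ASMP model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
)
# Learning rate decreased by a factor of 0.1 every 15K iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15_000, gamma=0.1)
```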