Binaural Audio-Visual Localization

Authors: Xinyi Wu, Zhenyao Wu, Lili Ju, Song Wang (pp. 2961-2968)

AAAI 2021

Reproducibility Variable / Result / LLM Response

Research Type: Experimental
"Experimental results on Fair-Play and YT-Music datasets demonstrate the effectiveness of the proposed method and show that binaural audio can greatly improve the performance of localizing the sound sources, especially when the quality of the visual information is limited."

Researcher Affiliation: Academia
"(1) Department of Computer Science and Engineering, University of South Carolina, USA; (2) Department of Mathematics, University of South Carolina, USA. {xinyiw, zhenyao}@email.sc.edu, ju@math.sc.edu, songwang@cec.sc.edu"

Pseudocode: No
The paper describes the network architecture and training process but does not include any pseudocode or algorithm blocks.

Open Source Code: No
The paper does not provide any statement about releasing source code, nor a link to a code repository for the described methodology.

Open Datasets: Yes
"FAIR-Play (Gao and Grauman 2019a): FAIR-Play is the first audio-visual dataset recorded with both videos and professional binaural audios in a music room... YT-MUSIC (Morgado et al. 2018): The YT-MUSIC dataset is collected from Youtube for spatial audio generation by Morgado et al. (2018)..."

Dataset Splits: No
The paper mentions using "train/test splits" and gives training and testing video counts for YT-MUSIC (250 for training and 67 for testing), but it does not define a separate validation split or its size.

Hardware Specification: Yes
"BAVNet is implemented using Pytorch and trained with one Nvidia 2080Ti GPU."

Software Dependencies: No
The paper states that BAVNet is implemented using PyTorch but does not give version numbers for PyTorch or any other software dependencies, which are required for a reproducible description.

Experiment Setup: Yes
"We take Adam as the optimizer by setting weight decay to be 0.0001. The starting learning rate is set to 0.0001, then it decayed by multiplying it with the decay factor 0.8 for every 10 epochs. We train the network for 200 epochs in total with the batch size being 1."
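The quoted setup can be sketched as a plain-Python learning-rate schedule; in PyTorch the same configuration would correspond to `torch.optim.Adam(params, lr=1e-4, weight_decay=1e-4)` with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)`. The helper name below is ours, not the paper's:

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-4,
                gamma: float = 0.8, step: int = 10) -> float:
    """Learning rate in effect at a given epoch under a step-decay schedule.

    Mirrors the reported setup: start at 1e-4 and multiply by 0.8
    every 10 epochs. Equivalent to PyTorch's StepLR(step_size=10, gamma=0.8).
    """
    return base_lr * gamma ** (epoch // step)

# Over the reported 200 epochs (batch size 1):
first = lr_at_epoch(0)     # starting rate, 1e-4
decayed = lr_at_epoch(10)  # after the first decay, roughly 8e-5
```

Because decay is applied every 10 epochs over 200 epochs, the final learning rate is `1e-4 * 0.8**19`, i.e. about 1.4e-6.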