Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou (pp. 1801-1809)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real world scenarios."
Researcher Affiliation | Academia | Xian Liu (1,2)*, Rui Qian (1,3)*, Hang Zhou (1)*, Di Hu (4), Weiyao Lin (3), Ziwei Liu (5), Bolei Zhou (1), Xiaowei Zhou (2); affiliations: 1 The Chinese University of Hong Kong, 2 Zhejiang University, 3 Shanghai Jiao Tong University, 4 Gaoling School of Artificial Intelligence, Renmin University of China, 5 S-Lab, Nanyang Technological University
Pseudocode | No | The paper describes the modules and their functions in text and with diagrams (Fig. 2, Fig. 3) but does not provide pseudocode or algorithm blocks.
Open Source Code | No | There is no explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | "MUSIC (Synthetic): MUSIC dataset (Zhao et al. 2018) covers 11 types of instruments. VGGSound: VGGSound dataset (Vedaldi et al. 2020) consists of more than 210k single-sound videos, covering 310 categories."
Dataset Splits | Yes | "Following (Hu et al. 2020), we employ half of the solo data to form X^s, with the other half to synthesize X^u. We finally obtain 28,756 videos for training and 2,787 for evaluation." (a data-split sketch follows the table)
Hardware Specification | No | No GPU/CPU or memory details are reported; the training description only states: "The model is trained by Adam optimizer with learning rate 10^-4."
Software Dependencies | No | Only architecture-level details are given: "we employ the variants of ResNet-18 (He et al. 2016) as visual and audio backbones. The model is trained by Adam optimizer." No library or framework versions are listed. (a training-setup sketch follows the table)
Experiment Setup | Yes | "The model is trained by Adam optimizer with learning rate 10^-4. For evaluation, we use Faster RCNN (Ren et al. 2015) to detect bounding box as reference. We split videos in each dataset into non-overlapped 1-second clips to form audio-visual pairs. Concretely, we sample audio at 16 kHz with window length of 160 ms, hop length of 80 ms and transform it into log-mel spectrograms with 64 frequency bins. For visual stream, we randomly sample a frame from each clip and resize it into 256×256, then randomly/center crop into 224×224 for training/evaluation." (a preprocessing sketch follows the table)
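
Data-split sketch. For readers who want to mirror the Dataset Splits protocol quoted above, here is a minimal sketch, assuming the solo clips are shuffled, half kept as single-source data (X^s), and the remaining half paired and mixed to synthesize multi-source data (X^u). The `solo_clips` structure, the pairing strategy, and the summation-based mixing are illustrative assumptions, not details taken from the paper.

```python
import random

def split_and_synthesize(solo_clips, seed=0):
    """Split solo clips into a single-source set (X^s) and synthetic mixtures (X^u).

    Each element of `solo_clips` is assumed to be a dict with "frame" (an image)
    and "audio" (a 1-D waveform of fixed length); the pairing and the
    summation-based mixing are assumptions for illustration only.
    """
    rng = random.Random(seed)
    clips = list(solo_clips)
    rng.shuffle(clips)
    half = len(clips) // 2
    x_s = clips[:half]              # kept as-is: one sounding source per clip
    leftover = clips[half:]         # used to build synthetic multi-source clips
    x_u = []
    for a, b in zip(leftover[0::2], leftover[1::2]):
        x_u.append({
            "frames": (a["frame"], b["frame"]),   # both visual frames
            "audio": a["audio"] + b["audio"],     # hypothetical mixture by summation
        })
    return x_s, x_u
```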
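Training-setup sketch. The backbone and optimizer details quoted under Software Dependencies can only be approximated; below is a minimal PyTorch sketch, assuming torchvision's stock ResNet-18 for the visual stream and a single-channel first convolution for the audio stream. The exact "variants of ResNet-18" are not specified in the paper, so these modifications are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Visual backbone: standard ResNet-18 on RGB frames (randomly initialized).
visual_net = resnet18(weights=None)

# Audio backbone: ResNet-18 adapted to 1-channel log-mel spectrograms.
# The paper's exact variant is unspecified; this single-channel stem is an assumption.
audio_net = resnet18(weights=None)
audio_net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Adam with learning rate 10^-4, as reported in the paper.
params = list(visual_net.parameters()) + list(audio_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
```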
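Preprocessing sketch. The Experiment Setup row pins down the input pipeline (16 kHz audio, 160 ms window, 80 ms hop, 64 mel bins; 256×256 resize with 224×224 random/center crop), so it can be sketched directly. The use of torchaudio/torchvision, the log epsilon, and resizing to exactly 256×256 (rather than short-side 256) are assumptions about the implementation.

```python
import torch
import torchaudio
from torchvision import transforms

SAMPLE_RATE = 16000
WIN_LENGTH = int(0.160 * SAMPLE_RATE)   # 160 ms window -> 2560 samples
HOP_LENGTH = int(0.080 * SAMPLE_RATE)   # 80 ms hop     -> 1280 samples

# Mel spectrogram with 64 frequency bins, as described in the paper.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=WIN_LENGTH,
    win_length=WIN_LENGTH,
    hop_length=HOP_LENGTH,
    n_mels=64,
)

def audio_to_logmel(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (1, num_samples) mono audio of a 1-second clip at 16 kHz
    return torch.log(mel(waveform) + 1e-7)  # epsilon for numerical stability (assumption)

# Visual stream: resize to 256x256, then random crop (training) / center crop (evaluation).
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```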