Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
Authors: Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou
AAAI 2022, pp. 1801-1809 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real-world scenarios. From the Experiments section: MUSIC dataset (Zhao et al. 2018) covers 11 types of instruments. |
| Researcher Affiliation | Academia | Xian Liu (1,2)*, Rui Qian (1,3)*, Hang Zhou (1)*, Di Hu (4), Weiyao Lin (3), Ziwei Liu (5), Bolei Zhou (1), Xiaowei Zhou (2); 1 The Chinese University of Hong Kong, 2 Zhejiang University, 3 Shanghai Jiao Tong University, 4 Gaoling School of Artificial Intelligence, Renmin University of China, 5 S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper describes the modules and their functions in text and with diagrams (Fig. 2, Fig. 3) but does not provide pseudocode or algorithm blocks. |
| Open Source Code | No | There is no explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | MUSIC (Synthetic): MUSIC dataset (Zhao et al. 2018) covers 11 types of instruments. VGGSound: VGGSound dataset (Vedaldi et al. 2020) consists of more than 210k single-sound videos, covering 310 categories. |
| Dataset Splits | Yes | Following (Hu et al. 2020), we employ half of the solo data to form X^s, with the other half to synthesize X^u. We finally obtain 28,756 videos for training and 2,787 for evaluation. |
| Hardware Specification | No | No GPU/CPU or memory details are stated; the closest implementation detail is that the model is trained by the Adam optimizer with learning rate 10^-4. |
| Software Dependencies | No | No library or framework versions are listed; the paper only notes that variants of ResNet-18 (He et al. 2016) serve as visual and audio backbones and that the model is trained by the Adam optimizer (a minimal backbone/optimizer sketch follows the table). |
| Experiment Setup | Yes | The model is trained by the Adam optimizer with learning rate 10^-4. For evaluation, we use Faster R-CNN (Ren et al. 2015) to detect bounding boxes as reference. We split videos in each dataset into non-overlapping 1-second clips to form audio-visual pairs. Concretely, we sample audio at 16 kHz with a window length of 160 ms and a hop length of 80 ms, and transform it into log-mel spectrograms with 64 frequency bins. For the visual stream, we randomly sample a frame from each clip, resize it to 256×256, then randomly/center crop to 224×224 for training/evaluation (see the preprocessing sketch after the table). |
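
The backbone and optimizer details quoted above (ResNet-18 variants for both streams, Adam with learning rate 10^-4) can be read as the following minimal PyTorch sketch. The single-channel audio stem and the random initialization are assumptions: the quoted text does not say how the ResNet-18 variants are modified or whether they are pretrained.

```python
# Minimal sketch of the backbone/optimizer setup quoted in the table.
# Assumption: the audio branch takes 1-channel log-mel spectrograms,
# so its first conv layer is replaced accordingly.
import torch
import torch.nn as nn
from torchvision.models import resnet18

visual_net = resnet18()   # visual backbone on RGB frames, randomly initialized

audio_net = resnet18()    # audio backbone on log-mel spectrograms
audio_net.conv1 = nn.Conv2d(   # assumed: accept 1-channel spectrogram input
    1, 64, kernel_size=7, stride=2, padding=3, bias=False
)

# Adam optimizer with the quoted learning rate of 1e-4.
params = list(visual_net.parameters()) + list(audio_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
```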
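Similarly, the preprocessing parameters in the Experiment Setup row (16 kHz audio, 160 ms window, 80 ms hop, 64 mel bins; 256×256 resize with 224×224 random/center crop) map onto a short torchaudio/torchvision sketch. The function and variable names here are illustrative, not taken from the authors' code.

```python
# Sketch of the audio/visual preprocessing described in the Experiment Setup row.
import torch
import torchaudio
import torchvision.transforms as T

SAMPLE_RATE = 16_000                    # audio sampled at 16 kHz
WIN_LENGTH = int(0.160 * SAMPLE_RATE)   # 160 ms window -> 2560 samples
HOP_LENGTH = int(0.080 * SAMPLE_RATE)   # 80 ms hop     -> 1280 samples
N_MELS = 64                             # 64 mel frequency bins

# Log-mel spectrogram transform for 1-second audio clips.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=WIN_LENGTH,
    win_length=WIN_LENGTH,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
)

def audio_to_logmel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, 16000) mono clip -> log-mel spectrogram of shape (1, 64, T)."""
    return torch.log(mel(waveform) + 1e-7)

# Visual stream: resize to 256x256, then random crop (training)
# or center crop (evaluation) to 224x224.
train_frame_tf = T.Compose([
    T.Resize((256, 256)),
    T.RandomCrop(224),
    T.ToTensor(),
])
eval_frame_tf = T.Compose([
    T.Resize((256, 256)),
    T.CenterCrop(224),
    T.ToTensor(),
])
```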