Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Authors: Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results in both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization." |
| Researcher Affiliation | Collaboration | Di Hu (1,2), Rui Qian (3), Minyue Jiang (2), Xiao Tan (2), Shilei Wen (2), Errui Ding (2), Weiyao Lin (3), Dejing Dou (2); 1: Renmin University of China, 2: Baidu Inc., 3: Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes its algorithms in prose and mathematical equations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization." |
| Open Datasets | Yes | MUSIC dataset [31]; AudioSet-instrument, a subset of AudioSet [12] consisting of 63,989 10-second video clips covering 15 categories of instruments. "Annotations are publicly available in the released code, for reproducibility." |
| Dataset Splits | No | The paper describes training and testing splits, but does not explicitly define a separate validation split with specified percentages or counts for its experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments, such as exact GPU/CPU models or processor types. |
| Software Dependencies | No | The paper mentions variants of ResNet-18 [13] as audio and visual feature extractors and the Adam optimizer, but it does not specify version numbers for any software components or libraries. |
| Experiment Setup | Yes | "Our model is trained with the Adam optimizer with a learning rate of 10⁻⁴. In the training phase, we use a threshold of 0.05 to binarize the localization maps to obtain object masks, with which we can extract object representations over feature maps. Each center representation in the object dictionary is accordingly assigned to one object category, which is then used for class-aware localization evaluation. Note that the proposed model is trained and evaluated on the identical dataset." |
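The mask-extraction step quoted in the Experiment Setup row (binarize the localization map at 0.05, then pool features inside the mask to get an object representation) can be sketched as follows. This is a minimal illustration assuming NumPy arrays; the function name, array shapes, and the masked-average-pooling choice are assumptions for the sketch, not the authors' exact implementation.

```python
import numpy as np

def extract_object_representation(feature_map, localization_map, threshold=0.05):
    """Binarize a localization map and pool features over the object mask.

    feature_map:      (C, H, W) array of visual features.
    localization_map: (H, W) array of audiovisual localization scores.
    Returns a (C,) object representation, or None if no location exceeds
    the threshold (i.e., the frame is treated as containing no sounding object).
    """
    # Threshold of 0.05, as reported in the paper's experiment setup.
    mask = localization_map > threshold
    if not mask.any():
        return None
    # Masked average pooling: mean of feature vectors at masked positions.
    return feature_map[:, mask].mean(axis=1)
```

A representation produced this way would then be compared against the per-category center representations in the object dictionary for class-aware localization evaluation.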