Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Authors: Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, Yan Yan

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset collected in real life.
Researcher Affiliation | Academia | PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China; Jiangsu Key Lab of Image and Video Understanding for Social Security. {xuanhanyu, zhangjesse, shuochen, csjyang, yyan}@njust.edu.cn
Pseudocode | No | The paper describes mathematical definitions and processes, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information (e.g., repository links, explicit release statements) for the source code of the methodology described.
Open Datasets | Yes | The AVE dataset (Tian et al. 2018), which is a subset of Audio Set (Gemmeke et al. 2017), contains 4143 samples covering 28 event categories, e.g., dog barking, man speaking, chainsaw logging and airplane flying.
Dataset Splits | Yes | We divide the AVE dataset into three parts, i.e., 80% for training, 10% for validation and 10% for testing. (A minimal split sketch is given after this table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | For visual and audio representation, we respectively adopt a ResNet-151 network pre-trained on ImageNet and a VGG-like network pre-trained on Audio Set. (The paper mentions software components but does not provide specific version numbers for reproducibility.)
Experiment Setup | No | For visual and audio representation, we respectively adopt a ResNet-151 network pre-trained on ImageNet and a VGG-like network pre-trained on Audio Set. Specifically, we extract pool5 feature maps from 16 sampled RGB frames for each 1s video segment, yielding a 512×7×7-D visual representation per 1s visual segment and a 128-D audio representation per 1s audio segment. To ensure comparability of the experimental results, all models share the same setting, e.g., the same number of fully connected layers. (While some setup is mentioned, concrete hyperparameters like learning rate, batch size, or specific optimizer settings are not provided. A feature-extraction sketch is given after this table.)
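
The Dataset Splits row above reports an 80% / 10% / 10% division of the 4143 AVE samples. The snippet below is a minimal sketch of such a split; the placeholder sample IDs, the random seed, and the shuffling strategy are all assumptions, since the paper does not release official split files in this report's scope.

```python
import random

# Hypothetical sample IDs standing in for the 4143 AVE clips; the real
# identifiers and any official split files would replace this list.
sample_ids = [f"video_{i:04d}" for i in range(4143)]

random.seed(0)          # assumed seed, for illustration only
random.shuffle(sample_ids)

n = len(sample_ids)
n_train, n_val = int(0.8 * n), int(0.1 * n)

train_ids = sample_ids[:n_train]
val_ids   = sample_ids[n_train:n_train + n_val]
test_ids  = sample_ids[n_train + n_val:]

print(len(train_ids), len(val_ids), len(test_ids))  # 3314 414 415
```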
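
The Software Dependencies and Experiment Setup rows describe the feature extraction: pool5 feature maps from 16 RGB frames sampled per 1s visual segment, plus 128-D audio features from a VGG-like network pre-trained on Audio Set. Below is a hedged sketch of the visual branch only, using torchvision's ImageNet-pretrained ResNet-152 as a stand-in for the paper's "ResNet-151"; note that a ResNet backbone produces 2048×7×7 maps rather than the 512×7×7 dimensionality quoted above, and the frame preprocessing and temporal pooling are assumptions. The audio branch (e.g., a VGGish-style model) is not sketched.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Stand-in backbone: torchvision ResNet-152 with ImageNet weights
# (the paper reports "ResNet-151"; the exact variant is not specified).
backbone = models.resnet152(weights="IMAGENET1K_V1")
backbone.eval()

# Drop the global average pool and classifier so the output is a spatial
# ("pool5"-style) feature map per frame: (C, 7, 7) with C = 2048 for ResNet.
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def segment_visual_feature(frames):
    """frames: 16 PIL images sampled from one 1-second video segment (assumed)."""
    batch = torch.stack([preprocess(f) for f in frames])  # (16, 3, 224, 224)
    fmaps = feature_extractor(batch)                       # (16, 2048, 7, 7)
    return fmaps.mean(dim=0)                               # average over frames -> (2048, 7, 7)
```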