Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Authors: Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, Yan Yan
AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves state-of-the-art performance on the AVE (Audio-Visual Event) dataset collected in real life. |
| Researcher Affiliation | Academia | ¹PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, China; ²Jiangsu Key Lab of Image and Video Understanding for Social Security. {xuanhanyu, zhangjesse, shuochen, csjyang, yyan}@njust.edu.cn |
| Pseudocode | No | The paper describes mathematical definitions and processes, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., repository links, explicit release statements) for the source code of the methodology described. |
| Open Datasets | Yes | The AVE Dataset (Tian et al. 2018), which is a subset of AudioSet (Gemmeke et al. 2017), contains 4143 samples covering 28 event categories, e.g., dog barking, man speaking, chainsaw logging and airplane flying. |
| Dataset Splits | Yes | We divide the AVE dataset into three parts, i.e., 80% for training, 10% for validation and 10% for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | For visual and audio representation, we respectively adopt a ResNet-151 network pre-trained on ImageNet and a VGG-like network pre-trained on AudioSet. (The paper mentions software components but does not provide specific version numbers for reproducibility.) |
| Experiment Setup | No | For visual and audio representation, we respectively adopt a ResNet-151 network pre-trained on ImageNet and a VGG-like network pre-trained on AudioSet. Specifically, we extract pool5 feature maps from 16 sampled RGB frames for each 1s video segment. We extract a 512×7×7-D visual representation for each 1s visual segment and a 128-D audio representation for each 1s audio segment. In order to ensure the comparability of the experimental results, all models have the same setting, e.g., the same number of fully connected layers. (While some setup is mentioned, concrete hyperparameters like learning rate, batch size, or specific optimizer settings are not provided; illustrative sketches of the reported split and feature extraction follow the table.) |
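
The "Dataset Splits" row above reports an 80% / 10% / 10% partition of the 4143 AVE samples. As a minimal sketch of such a split, the snippet below partitions a list of sample IDs; the function name, random seed, and shuffling scheme are assumptions for illustration, not details taken from the paper.

```python
import random

def split_ave(sample_ids, seed=0):
    """Partition AVE sample IDs into 80% train / 10% val / 10% test.
    The exact ordering/seed used by the authors is not reported."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# Example: the AVE dataset contains 4143 samples in total.
splits = split_ave(range(4143))
print({k: len(v) for k, v in splits.items()})  # ≈ 3314 / 414 / 415
```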
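
The "Experiment Setup" row describes per-second visual features (pool5 maps from 16 sampled RGB frames per 1s segment) alongside a 128-D audio embedding from a VGG-like network. The sketch below shows one hedged way to obtain the visual side with torchvision. It is not the authors' pipeline: torchvision has no "ResNet-151", so `resnet152` is used as a stand-in, which yields 2048×7×7 pool5 maps rather than the 512×7×7 quoted in the paper, and averaging over the 16 frames is an assumption. The audio side (VGGish-style AudioSet embedding) is not sketched here.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Stand-in backbone: torchvision provides resnet152, not the "ResNet-151"
# named in the paper.
backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
pool5 = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
pool5.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ConvertImageDtype(torch.float),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def segment_visual_feature(frames_uint8):
    """frames_uint8: (16, 3, H, W) uint8 RGB frames sampled from one
    1-second video segment. Returns a single pool5 map averaged over the
    16 frames (the averaging choice is an assumption, not from the paper)."""
    x = preprocess(frames_uint8)   # (16, 3, 224, 224)
    maps = pool5(x)                # (16, 2048, 7, 7) for resnet152
    return maps.mean(dim=0)        # (2048, 7, 7)

# Example with dummy frames for a single 1 s segment:
dummy = torch.randint(0, 256, (16, 3, 360, 640), dtype=torch.uint8)
feat = segment_visual_feature(dummy)
print(feat.shape)  # torch.Size([2048, 7, 7])
```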