Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

Authors: Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance." "Extensive experimental results show that our method is effective and achieves new state-of-the-art performance on the MUSIC-AVQA dataset." Experimental Setup: "Experiments are conducted on the widely-used MUSIC-AVQA dataset (Li et al. 2022)."
Researcher Affiliation | Collaboration | Zhangbin Li (1), Dan Guo (1,2,3)*, Jinxing Zhou (1), Jing Zhang (1), Meng Wang (1,2). (1) School of Computer Science and Information Engineering, Hefei University of Technology; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; (3) Anhui Zhonghuitong Technology Co., Ltd.
Pseudocode | No | No structured pseudocode or algorithm block was found in the paper.
Open Source Code | Yes | The code is available at https://github.com/zhangbin-ai/APL.
Open Datasets | Yes | Experiments are conducted on the widely-used MUSIC-AVQA dataset (Li et al. 2022).
Dataset Splits | Yes | The dataset is split into training, validation, and test sets, which comprise 32K, 4K, and 8K QA pairs, respectively.
Hardware Specification | No | The paper mentions object detectors such as Faster R-CNN and DETR but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for training or inference.
Software Dependencies | No | The paper mentions software components such as VGGish, a Transformer encoder (TFM), Faster R-CNN, DETR, and the Adam optimizer, but it does not specify version numbers for these or other software libraries (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | "We down-sample each audible video into T = 10 non-overlapping segments, resulting in T audio segments and T frames per video. ... The detected object number N per frame is 36 for Faster R-CNN and 100 for DETR. Accordingly, we set φ in Eq. 5 to 0.028 and 0.011, respectively. During training, the parameter τ in Eq. 6 is set to 0.4 and λ in Eq. 7 is set to 0.3. The initial learning rate is set to 1.75e-4 when using Faster R-CNN and 1e-4 for DETR. The learning rate decreases by multiplying 0.1 every 8 epochs with the Adam optimizer. The batch size is set to 64 and we train the model for 20 epochs."
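
For orientation, the hyperparameters quoted in the Experiment Setup row can be collected into a short training-loop sketch. This is a minimal sketch assuming PyTorch; the stand-in linear model, the dummy batch, and the 42-way answer head are hypothetical placeholders, not the authors' implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn

# Training configuration reported in the paper (Faster R-CNN variant;
# the DETR variant uses N = 100 objects, phi = 0.011, and lr = 1e-4).
T = 10          # non-overlapping audio segments / sampled frames per video
N = 36          # detected objects per frame
PHI = 0.028     # object-selection threshold in Eq. 5
TAU = 0.4       # temperature in Eq. 6
LAMBDA = 0.3    # weight of the adaptive-positivity term in Eq. 7
LR = 1.75e-4
BATCH_SIZE = 64
EPOCHS = 20

# Stand-in network; the actual APL model is defined in the authors' repository.
model = nn.Linear(512, 42)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# Decay the learning rate by a factor of 0.1 every 8 epochs, as stated.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.1)

for epoch in range(EPOCHS):
    # Placeholder batch; a real run would iterate the MUSIC-AVQA training
    # split (32K QA pairs) with batch size 64.
    features = torch.randn(BATCH_SIZE, 512)
    targets = torch.randint(0, 42, (BATCH_SIZE,))

    qa_loss = nn.functional.cross_entropy(model(features), targets)
    positivity_loss = torch.tensor(0.0)  # computed from selected objects in the real model
    loss = qa_loss + LAMBDA * positivity_loss  # total objective, weighted per Eq. 7

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```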