Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

Authors: Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance." "Extensive experimental results show that our method is effective and achieves new state-of-the-art performance on the MUSIC-AVQA dataset." Experimental Setup: "Experiments are conducted on the widely-used MUSIC-AVQA dataset (Li et al. 2022)."
Researcher Affiliation | Collaboration | Zhangbin Li (1), Dan Guo (1,2,3)*, Jinxing Zhou (1), Jing Zhang (1), Meng Wang (1,2). (1) School of Computer Science and Information Engineering, Hefei University of Technology; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; (3) Anhui Zhonghuitong Technology Co., Ltd.
Pseudocode | No | No structured pseudocode or algorithm block was found in the paper.
Open Source Code | Yes | The code is available at https://github.com/zhangbin-ai/APL.
Open Datasets | Yes | Experiments are conducted on the widely-used MUSIC-AVQA dataset (Li et al. 2022).
Dataset Splits | Yes | The dataset is split into training, validation, and test sets, which comprise 32K, 4K, and 8K QA pairs, respectively.
Hardware Specification | No | The paper mentions object detectors such as Faster R-CNN and DETR but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for training or inference.
Software Dependencies | No | The paper mentions software components such as VGGish, a Transformer encoder (TFM), Faster R-CNN, DETR, and the Adam optimizer, but it does not specify version numbers for these or other software libraries (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | "We down-sample each audible video into T = 10 non-overlapping segments, resulting in T audio segments and T frames per video. ... The detected object number N per frame is 36 for Faster R-CNN and 100 for DETR. Accordingly, we set φ in Eq. 5 to 0.028 and 0.011, respectively. During training, the parameter τ in Eq. 6 is set to 0.4 and λ in Eq. 7 is set to 0.3. The initial learning rate is set to 1.75e-4 when using Faster R-CNN and 1e-4 for DETR. The learning rate decreases by multiplying 0.1 every 8 epochs with the Adam optimizer. The batch size is set to 64 and we train the model for 20 epochs."
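
For orientation, the hyperparameters quoted in the Experiment Setup row can be collected into a short training-loop sketch. This is a minimal sketch assuming PyTorch; the stand-in linear model, the dummy batch, and the 42-way answer head are hypothetical placeholders, not the authors' implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn

# Training configuration reported in the paper (Faster R-CNN variant;
# the DETR variant uses N = 100 objects, phi = 0.011, and lr = 1e-4).
T = 10          # non-overlapping audio segments / sampled frames per video
N = 36          # detected objects per frame
PHI = 0.028     # object-selection threshold in Eq. 5
TAU = 0.4       # temperature in Eq. 6
LAMBDA = 0.3    # weight of the adaptive-positivity term in Eq. 7
LR = 1.75e-4
BATCH_SIZE = 64
EPOCHS = 20

# Stand-in network; the actual APL model is defined in the authors' repository.
model = nn.Linear(512, 42)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# Decay the learning rate by a factor of 0.1 every 8 epochs, as stated.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.1)

for epoch in range(EPOCHS):
    # Placeholder batch; a real run would iterate the MUSIC-AVQA training
    # split (32K QA pairs) with batch size 64.
    features = torch.randn(BATCH_SIZE, 512)
    targets = torch.randint(0, 42, (BATCH_SIZE,))

    qa_loss = nn.functional.cross_entropy(model(features), targets)
    positivity_loss = torch.tensor(0.0)  # computed from selected objects in the real model
    loss = qa_loss + LAMBDA * positivity_loss  # total objective, weighted per Eq. 7

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```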