Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Authors: Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. Experiments are conducted on the widely-used MUSIC-AVQA dataset (Li et al. 2022). |
| Researcher Affiliation | Collaboration | Zhangbin Li¹, Dan Guo¹,²,³*, Jinxing Zhou¹, Jing Zhang¹, Meng Wang¹,². ¹School of Computer Science and Information Engineering, Hefei University of Technology; ²Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; ³Anhui Zhonghuitong Technology Co., Ltd. |
| Pseudocode | No | No structured pseudocode or algorithm block was found in the paper. |
| Open Source Code | Yes | The code is available at https://github.com/zhangbin-ai/APL. |
| Open Datasets | Yes | Experiments are conducted on the widely-used MUSIC-AVQA dataset (Li et al. 2022) |
| Dataset Splits | Yes | The dataset is split into the training, validation, and test sets, which comprise 32K, 4K, and 8K QA pairs, respectively. |
| Hardware Specification | No | The paper mentions using object detectors like 'Faster R-CNN' and 'DETR' but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for training or inference. |
| Software Dependencies | No | The paper mentions software components like 'VGGish', 'Transformer encoder (TFM)', 'Faster R-CNN', 'DETR', and 'Adam optimizer', but it does not specify any version numbers for these or other software libraries (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We down-sample each audible video into T = 10 non-overlapping segments, resulting in T audio segments and T frames per video. ... The detected object number N per frame is 36 for Faster R-CNN and 100 for DETR. Accordingly, we set φ in Eq. 5 to 0.028 and 0.011, respectively. During training, the parameter τ in Eq. 6 is set to 0.4 and λ in Eq. 7 is set to 0.3. The initial learning rate is set to 1.75e-4 when using Faster R-CNN and 1e-4 for DETR. The learning rate decreases by multiplying 0.1 every 8 epochs with the Adam optimizer. The batch size is set to 64 and we train the model for 20 epochs. |
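
The training configuration quoted in the last row maps onto a standard PyTorch optimizer-plus-scheduler setup. The sketch below is a minimal, hypothetical illustration that only collects the hyperparameters reported above (T, N, φ, τ, λ, learning rate, step decay, batch size, epochs); the APL network itself is not reproduced, so a placeholder module stands in for it, and names such as `CONFIG` and `build_optimizer` are assumptions rather than the authors' code (see the released repository for the actual implementation).

```python
# Minimal sketch of the reported training setup; hyperparameter values are
# quoted from the paper, everything else (names, placeholder model) is assumed.
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

# Per-detector settings reported in the paper (Faster R-CNN vs. DETR features).
CONFIG = {
    "faster_rcnn": dict(T=10, N=36,  phi=0.028, lr=1.75e-4),
    "detr":        dict(T=10, N=100, phi=0.011, lr=1e-4),
}
TAU = 0.4         # temperature tau in Eq. 6
LAMBDA = 0.3      # loss weight lambda in Eq. 7
BATCH_SIZE = 64
NUM_EPOCHS = 20


def build_optimizer(model: nn.Module, detector: str = "faster_rcnn"):
    """Adam with the paper's schedule: LR multiplied by 0.1 every 8 epochs."""
    cfg = CONFIG[detector]
    optimizer = optim.Adam(model.parameters(), lr=cfg["lr"])
    scheduler = StepLR(optimizer, step_size=8, gamma=0.1)
    return optimizer, scheduler


if __name__ == "__main__":
    # A dummy module stands in for the APL model, which is not reproduced here.
    model = nn.Linear(512, 42)
    optimizer, scheduler = build_optimizer(model, "faster_rcnn")
    for epoch in range(NUM_EPOCHS):
        # ... one pass over the training set in batches of BATCH_SIZE ...
        scheduler.step()
        print(f"epoch {epoch + 1:2d}: lr = {scheduler.get_last_lr()[0]:.2e}")
```

Choosing the "faster_rcnn" entry reproduces the reported 1.75e-4 starting rate; switching to "detr" swaps in the 1e-4 rate together with the 100-object, φ = 0.011 configuration.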