Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Authors: Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, Chuang Gan

AAAI 2019, pp. 8658-8665 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results of four tasks on the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains comparable results on the Count task. Our model requires less computation time and achieves better performance compared with the RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
Researcher Affiliation | Collaboration | (1) Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China; (2) Beihang University; (3) Tencent AI Lab; (4) National University of Singapore; (5) MIT-IBM Watson AI Lab
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets | Yes | Following (Gao et al. 2018a), we evaluate our method on TGIF-QA dataset (Jang et al. 2017). It consists of 103,919 question-answer pairs collected from 56,720 animated GIFs. In addition, all QA-pairs are split into four tasks: Action, Transition (Trans.), Count and Frame QA. ... For each task, the total number of QA-pairs and GIFs as well as the numbers of training/testing are displayed in Tab. 1.
Dataset Splits | Yes | For each task, the total number of QA-pairs and GIFs as well as the numbers of training/testing are displayed in Tab. 1. ... Train 20475 3543 ... Test 2274 614
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a pre-trained ResNet-152, GloVe, and the Adamax optimizer, but it does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | Given a video, we equally select 35 frames. For each frame, we take the output of the pool5 layer of ResNet-152 (He et al. 2016) as visual features. The dimension of each frame feature is 2048. ... each word is transferred to a 300-D feature vector by a pre-trained GloVe (Pennington, Socher, and Manning 2014) and each character is finally embedded into a 64-D vector. In order to train the model, we employ the Adamax optimizer. For both Count and Frame QA, we set the size of minibatch as 128. For Action and Trans., the size of minibatch is set as 16. For all CNN layers, the dropout rate is 0.2. Following (Vaswani et al. 2017), the number of scaled parallel attention l in both VPSA and QPSA is set as 8.
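
To make the Dataset Splits row above concrete, here is a minimal loading sketch in Python. It assumes the TGIF-QA question-answer annotations have been exported as per-task CSV files named like action_train.csv with question and answer columns; these file and column names are placeholder assumptions for illustration, not the dataset's documented layout.

```python
import csv
from pathlib import Path

# Hypothetical layout: one CSV per task and split, e.g. data/tgif-qa/action_train.csv.
# The actual TGIF-QA release uses its own file naming; adjust paths accordingly.
TASKS = ("action", "transition", "count", "frameqa")

def load_split(data_dir, task, split):
    """Read the question-answer pairs for one task/split into a list of dicts."""
    path = Path(data_dir) / f"{task}_{split}.csv"
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def split_sizes(data_dir):
    """Count QA pairs per task and split, mirroring the paper's Tab. 1."""
    return {task: {split: len(load_split(data_dir, task, split))
                   for split in ("train", "test")}
            for task in TASKS}

if __name__ == "__main__":
    for task, counts in split_sizes("data/tgif-qa").items():
        print(f"{task}: train={counts['train']}, test={counts['test']}")
```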
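
The Experiment Setup quote above likewise maps onto a short configuration sketch, shown below in PyTorch. It uses torchvision's pre-trained ResNet-152 to obtain the 2048-D pool5 frame features and a standard 8-head scaled dot-product attention module as a stand-in for the paper's VPSA/QPSA blocks (which additionally encode position); the tensor shapes, variable names, and random placeholder frames are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_FRAMES = 35    # 35 equally spaced frames are sampled per GIF
FRAME_DIM = 2048   # ResNet-152 pool5 feature size
WORD_DIM = 300     # pre-trained GloVe word vectors
CHAR_DIM = 64      # character embedding size
NUM_HEADS = 8      # number of scaled parallel attentions l in VPSA / QPSA
DROPOUT = 0.2      # rate the paper reports for its CNN layers (reused here illustratively)

# Visual features: everything up to (and including) the global average pool
# of a pre-trained ResNet-152, i.e. the 2048-D pool5 output per frame.
resnet = models.resnet152(pretrained=True)  # newer torchvision: pass weights=... instead
pool5 = nn.Sequential(*list(resnet.children())[:-1]).eval()

with torch.no_grad():
    frames = torch.randn(NUM_FRAMES, 3, 224, 224)    # placeholder frames
    frame_feats = pool5(frames).flatten(1)            # (35, 2048)

# Stand-in for the positional self-attention: standard multi-head
# scaled dot-product attention with 8 heads over the frame sequence.
video_attn = nn.MultiheadAttention(embed_dim=FRAME_DIM, num_heads=NUM_HEADS,
                                   dropout=DROPOUT, batch_first=True)
v = frame_feats.unsqueeze(0)                          # (1, 35, 2048)
v_attended, _ = video_attn(v, v, v)

# Optimizer and per-task minibatch sizes quoted in the setup.
optimizer = torch.optim.Adamax(video_attn.parameters())
BATCH_SIZE = {"count": 128, "frameqa": 128, "action": 16, "transition": 16}
```

In the full model, the 300-D GloVe word vectors and 64-D character embeddings feed a question encoder whose output co-attends with the attended video features; that co-attention stage is omitted from this sketch.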