Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering
Authors: Xiangpeng Li, Jingkuan Song, Lianli Gao, Xianglong Liu, Wenbing Huang, Xiangnan He, Chuang Gan
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results of four tasks on the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains comparable result on the Count task. Our model requires less computation time and achieves better performance compared with the RNNs-based methods. Additional ablation study demonstrates the effect of each component of our proposed model. |
| Researcher Affiliation | Collaboration | 1Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, 2Beihang University, 3Tencent AI Lab, 4National University of Singapore, 5MIT-IBM Watson AI Lab |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | Following (Gao et al. 2018a), we evaluate our method on TGIF-QA dataset (Jang et al. 2017). It consists of 103,919 question-answer pairs collected from 56,720 animated GIFs. In addition, all QA-pairs are split into four tasks: Action, Transition (Trans.), Count and Frame QA. ... For each task, the total number of QA-pairs and GIFs as well as the numbers of training/testing are displayed in Tab. 1. |
| Dataset Splits | Yes | For each task, the total number of QA-pairs and GIFs as well as the numbers of training/testing are displayed in Tab. 1. ... Train 20475 3543 ... Test 2274 614 |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a pre-trained ResNet-152, GloVe, and the Adamax optimizer, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Given a video, we equally select 35 frames. For each frame, we take the output of the pool5 layer of ResNet-152 (He et al. 2016) as visual features. The dimension of each frame feature is 2048. ... each word is transferred to a 300-D feature vector by a pre-trained GloVe (Pennington, Socher, and Manning 2014) and each character is finally embedded into a 64-D vector. In order to train the model, we employ the Adamax optimizer. For both Count and Frame QA, we set the size of minibatch as 128. For Action and Trans., the size of minibatch is set as 16. For all CNN layers, the dropout rate is 0.2. Following (Vaswani et al. 2017), the number of scaled parallel attentions l in both VPSA and QPSA is set as 8. |
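
Since no code is released, the experiment-setup row above is the main basis for re-implementation. The sketch below (PyTorch/torchvision) shows one way the stated settings could be wired together: 35 uniformly sampled frames encoded by ResNet-152 pool5 into 2048-D features, the reported embedding sizes, dropout, attention-head count, per-task batch sizes, and the Adamax optimizer. This is a minimal sketch under assumptions, not the authors' implementation; the function names, the 224x224 input resolution, and the learning rate are illustrative, since the paper does not report them.

```python
# Hedged reconstruction of the reported experiment setup.
# Values marked "(paper)" are stated in the paper; everything else is an assumption.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_FRAMES = 35          # frames uniformly sampled per video (paper)
FRAME_FEAT_DIM = 2048    # ResNet-152 pool5 feature dimension (paper)
WORD_EMB_DIM = 300       # pre-trained GloVe word vectors (paper)
CHAR_EMB_DIM = 64        # character embedding size (paper)
NUM_HEADS = 8            # scaled parallel attentions l in VPSA / QPSA (paper)
DROPOUT = 0.2            # dropout rate for all CNN layers (paper)

# Mini-batch sizes per task (paper): 128 for Count and FrameQA, 16 for Action and Trans.
BATCH_SIZE = {"Count": 128, "FrameQA": 128, "Action": 16, "Trans": 16}


def build_frame_encoder() -> nn.Module:
    """ResNet-152 truncated after global average pooling ("pool5"),
    so each frame maps to a 2048-D feature vector."""
    resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    return nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer


def extract_video_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (NUM_FRAMES, 3, 224, 224) preprocessed RGB frames (resolution assumed).
    Returns (NUM_FRAMES, 2048) pool5 features."""
    encoder = build_frame_encoder().eval()
    with torch.no_grad():
        feats = encoder(frames)      # (35, 2048, 1, 1)
    return feats.flatten(1)          # (35, 2048)


def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    """Adamax optimizer as stated in the paper; the learning rate is not
    reported, so lr=1e-3 below is only a placeholder assumption."""
    return torch.optim.Adamax(model.parameters(), lr=1e-3)
```

A usage example would pass 35 preprocessed frames through `extract_video_features`, embed the question with GloVe (300-D) and character embeddings (64-D), and train with `build_optimizer` using the task-specific batch size from `BATCH_SIZE`.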