Location-Aware Graph Convolutional Networks for Video Question Answering

Authors: Deng Huang, Peihao Chen, Runhao Zeng, Qing Du, Mingkui Tan, Chuang Gan

AAAI 2020, pp. 11021-11028

Reproducibility assessment. Each entry below gives the reproducibility variable, the extracted result, and the LLM response supporting it.

Research Type: Experimental
LLM Response: Extensive experiments demonstrate the effectiveness of the proposed methods. Specifically, our method significantly outperforms state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.

Researcher Affiliation: Collaboration
LLM Response: (1) South China University of Technology; (2) Peng Cheng Laboratory, Shenzhen; (3) MIT-IBM Watson AI Lab

Pseudocode: Yes
LLM Response:

Algorithm 1: Overall training process
Input: video frame features; object set R; question Q
 1: Construct the location-aware graph G as in Section 3.4
 2: while not converged do
 3:   Extract question features F_Q via Eq. (1)
 4:   Encode object locations via Eqs. (2), (3), and (4)
 5:   Compute the node features via Eq. (5)
 6:   Update the adjacency matrix via Eq. (8)
 7:   Perform reasoning on the graph via Eq. (6)
 8:   Obtain visual features F_V via Eq. (10)
 9:   Obtain F_C from F_V and F_Q via Eq. (12)
10:   Predict answers from F_C with the answer predictor
11: end while
Output: trained model for video QA

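No official implementation is linked (see Open Source Code below), so the following is a minimal PyTorch sketch of how the steps of Algorithm 1 could be wired together. Every name here (LocationAwareGCN, the GRU question encoder, the 64-dimensional location encoding, the 1000-way answer vocabulary) is an assumption for illustration; the similarity-based adjacency and mean pooling merely stand in for the paper's Eqs. (5)-(12), they do not reproduce them.

```python
# A minimal sketch of Algorithm 1 (assumed names; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareGCN(nn.Module):
    """Toy location-aware graph reasoning: nodes are object features
    concatenated with an encoding of their spatio-temporal location, and
    the adjacency matrix is recomputed from node similarity on each
    forward pass (standing in for Eqs. (5), (8), and (6))."""

    def __init__(self, feat_dim, loc_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.node_proj = nn.Linear(feat_dim + loc_dim, hidden_dim)
        self.gcn_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )

    def forward(self, obj_feats, obj_locs):
        # obj_feats: (B, N, feat_dim); obj_locs: (B, N, loc_dim)
        x = self.node_proj(torch.cat([obj_feats, obj_locs], dim=-1))
        # Scaled dot-product similarity as a stand-in adjacency matrix.
        adj = torch.softmax(x @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        for layer in self.gcn_layers:
            x = F.relu(layer(adj @ x))
        return x.mean(dim=1)  # pooled visual features F_V

# Assumed surrounding pieces: question encoder and answer predictor.
q_encoder = nn.GRU(input_size=300, hidden_size=256, batch_first=True)
gcn = LocationAwareGCN(feat_dim=2048, loc_dim=64, hidden_dim=256, num_layers=2)
predictor = nn.Linear(256 + 256, 1000)  # 1000-way answer vocab is assumed

params = [*q_encoder.parameters(), *gcn.parameters(), *predictor.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)  # lr matches the reported setup

# One dummy step of the "while not converged" loop, on random tensors.
q_tokens = torch.randn(4, 12, 300)    # GloVe-embedded question words (Eq. (1))
obj_feats = torch.randn(4, 20, 2048)  # detected object features
obj_locs = torch.randn(4, 20, 64)     # encoded locations (Eqs. (2)-(4))
answers = torch.randint(0, 1000, (4,))

_, h = q_encoder(q_tokens)             # question features F_Q
f_v = gcn(obj_feats, obj_locs)         # visual features F_V (Eq. (10))
f_c = torch.cat([f_v, h[-1]], dim=-1)  # fused features F_C (Eq. (12))
loss = F.cross_entropy(predictor(f_c), answers)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Real training would wrap the last block in epoch and data-loader loops and replace the random tensors with features from the paper's actual extractors.
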
Open Source Code: No
LLM Response: The paper does not include an unambiguous statement that the authors are releasing their source code, nor does it provide a direct link to a code repository for the methodology described.

Open Datasets: Yes
LLM Response: TGIF-QA (Jang et al. 2017) consists of 165K QA pairs from 72K animated GIFs... Youtube2Text-QA (Ye et al. 2017) includes the videos from the MSVD video set (Chen and Dolan 2011) and the question-answer pairs collected from the Youtube2Text (Guadarrama et al. 2013) video description corpus. MSVD-QA (Xu et al. 2017) is also based on the MSVD video set.

Dataset Splits: No
LLM Response: The paper does not explicitly provide specific train/validation/test dataset splits, percentages, or sample counts, nor does it reference predefined splits with citations for reproducibility.

Hardware Specification: No
LLM Response: The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments.

Software Dependencies: No
LLM Response: The paper mentions several software components, such as GloVe, Mask R-CNN, and the Adam optimizer, but it does not provide version numbers for these or any other key software dependencies.

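Because versions are unspecified, anyone reproducing the pipeline has to pick their own builds of these components. As one hedged example, the object regions the method needs could come from torchvision's off-the-shelf Mask R-CNN; the paper does not say which implementation or weights the authors used, and newer torchvision releases replace pretrained=True with a weights= argument.

```python
# Hypothetical stand-in for the paper's unversioned Mask R-CNN dependency:
# detect objects in one frame and keep the top K=5 boxes.
import torch
import torchvision

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
detector.eval()

frame = torch.rand(3, 224, 224)  # one RGB video frame with values in [0, 1]
with torch.no_grad():
    (det,) = detector([frame])   # dict with 'boxes', 'labels', 'scores', 'masks'

K = 5                            # matches the paper's default K
top_boxes = det["boxes"][:K]     # detections come sorted by score
print(top_boxes.shape)           # up to (5, 4): (x1, y1, x2, y2) per box
```
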
Experiment Setup: Yes
LLM Response: By default, K is set to 5. The number of GCN layers is set to 2. We employ an Adam optimizer (Kingma and Ba 2015) to train the network with an initial learning rate of 1e-4. We set the batch size to 64 and 128 for the multiple-choice and open-ended tasks, respectively.

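For convenience, the reported hyperparameters can be collected into a single configuration object. This dataclass is purely illustrative; the field and method names are assumptions, not the authors' code.

```python
# The paper's reported hyperparameters gathered in one place.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    K: int = 5                # objects kept per frame (paper default)
    gcn_layers: int = 2       # number of GCN layers
    lr: float = 1e-4          # initial Adam learning rate
    batch_size_mc: int = 64   # multiple-choice tasks
    batch_size_oe: int = 128  # open-ended tasks

    def batch_size(self, task: str) -> int:
        # Batch size depends on the task type, per the reported setup.
        return self.batch_size_mc if task == "multiple-choice" else self.batch_size_oe

cfg = TrainConfig()
print(cfg.batch_size("open-ended"))  # 128
```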