Video Question Answering via Hierarchical Spatio-Temporal Attention Networks

Authors: Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, Yueting Zhuang

IJCAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The extensive experiments show the effectiveness of our method."
Researcher Affiliation | Academia | Zhou Zhao (1), Qifan Yang (1), Deng Cai (2), Xiaofei He (2) and Yueting Zhuang (1); (1) College of Computer Science, Zhejiang University; (2) State Key Lab of CAD&CG, Zhejiang University
Pseudocode | No | The paper describes its methods mathematically and in text but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the described methodology.
Open Datasets | No | "We construct the video question-answering dataset from the annotated video clip data [Li et al., 2016] with natural language descriptions, which consists of 201,068 GIFs and 287,933 descriptions." The paper cites the source data ([Li et al., 2016]) but does not provide concrete access information for the constructed dataset used in the experiments.
Dataset Splits | Yes | "We split the generated dataset into three parts: the training, the validation and the testing sets." The four types of video question-answering pairs used in the experiments are summarized in Table 1.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions pretrained models such as VGGNet and word2vec, but does not specify software dependencies with version numbers (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | "The input words of our method are initialized by pre-trained word embeddings [Mikolov et al., 2013] with size of 256, and weights of GRUs are randomly initialized by a Gaussian distribution with zero mean. ... Our method achieves the best performance when the dimension of hidden state of bi-GRU networks is set to 512, the dimension of hidden state in bi-a GRU networks is set to 512, and the number of hidden units in the fully connected layer is set to 500."
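Since the paper releases no code, the quoted hyperparameters can only be sketched, not reproduced exactly. The following stdlib-only Python sketch collects the reported dimensions into a configuration and illustrates the stated zero-mean Gaussian weight initialization; the standard deviation, the configuration names, and the `init_gru_weights` helper are all assumptions, not taken from the paper.

```python
import random

# Hyperparameters as quoted in the paper's experiment setup.
CONFIG = {
    "word_embedding_dim": 256,      # pre-trained word2vec embeddings [Mikolov et al., 2013]
    "bi_gru_hidden_dim": 512,       # hidden state of the bi-GRU networks
    "attention_gru_hidden_dim": 512,  # hidden state of the attention GRU networks (assumed reading of "bi-a GRU")
    "fc_hidden_units": 500,         # fully connected layer
}

def init_gru_weights(rows, cols, std=0.01):
    """Draw a rows x cols weight matrix from a zero-mean Gaussian.

    The paper states only "zero mean"; the standard deviation here
    is an illustrative assumption.
    """
    return [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

# Example: input-to-hidden weights mapping 256-d embeddings to a 512-d hidden state.
W = init_gru_weights(CONFIG["bi_gru_hidden_dim"], CONFIG["word_embedding_dim"])
```

This is only a scaffold for re-implementation; the paper's actual initializer, framework, and any per-gate weight shapes are unspecified.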