Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
Authors: Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, Yueting Zhuang
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The extensive experiments show the effectiveness of our method. |
| Researcher Affiliation | Academia | Zhou Zhao¹, Qifan Yang¹, Deng Cai², Xiaofei He² and Yueting Zhuang¹; ¹College of Computer Science, Zhejiang University; ²State Key Lab of CAD&CG, Zhejiang University |
| Pseudocode | No | The paper describes its methods mathematically and in text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code for the described methodology. |
| Open Datasets | No | We construct the video question-answering dataset from the annotated video clip data [Li et al., 2016] with natural language descriptions, which consists of 201,068 GIFs and 287,933 descriptions. The paper references the source data used for construction ([Li et al., 2016]) but does not provide concrete access information for the constructed dataset used in its experiments. |
| Dataset Splits | Yes | We split the generated dataset into three parts: the training, the validation and the testing sets. The four types of video question-answering pairs used for the experiments are summarized in Table 1. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using pretrained models like VGGNet and word2vec, but does not specify any software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | The input words of our method are initialized by pre-trained word embeddings [Mikolov et al., 2013] with size of 256, and weights of GRUs are randomly initialized by a Gaussian distribution with zero mean. ... Our method achieves the best performance when the dimension of hidden state of bi-GRU networks is set to 512, the dimension of hidden state in bi-aGRU networks is set to 512 and the number of hidden units in fully connected layer is set to 500. |
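
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `ExperimentSetup`, its field names, and the helper `gaussian_init` are hypothetical. The paper states only that the Gaussian has zero mean, so the standard deviation (`std=0.01`), the seed, and the example weight shape are assumptions made purely for illustration.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    word_embedding_dim: int = 256  # size of the pre-trained word embeddings
    bi_gru_hidden_dim: int = 512   # hidden state of the bi-GRU networks
    bi_agru_hidden_dim: int = 512  # the row's second bi-directional GRU setting
    fc_hidden_units: int = 500     # hidden units in the fully connected layer


def gaussian_init(rows, cols, std=0.01, seed=0):
    """Draw a rows x cols matrix from a zero-mean Gaussian.

    ASSUMPTION: std and seed are not reported in the paper; they are
    placeholder values chosen only so the sketch runs deterministically.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


setup = ExperimentSetup()

# Illustrative shape only: one input-to-hidden matrix mapping a word
# embedding (256) to a bi-GRU hidden state (512).
weights = gaussian_init(setup.bi_gru_hidden_dim, setup.word_embedding_dim)
```

Keeping the reported values in one frozen object separates what the paper states (the four dimensions) from what a reproduction has to guess (the variance of the initializer, the optimizer, and the remaining settings elided in the quoted text).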