Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network
Authors: Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He, Shiliang Pu
IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We construct two large-scale multi-turn video question answering datasets. The extensive experiments show the effectiveness of our method. In this section, we first introduce two conversational video question answering datasets, and then conduct several experiments on them, to show the effectiveness of our approach MHACN for multi-turn video question answering. |
| Researcher Affiliation | Collaboration | Zhou Zhao1, Xinghua Jiang1, Deng Cai2, Jun Xiao1, Xiaofei He2 and Shiliang Pu3 1College of Computer Science, Zhejiang University 2State Key Lab of CAD&CG, Zhejiang University 3Hikvision Research Institute |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams, but it does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states: "The constructed conversational video question answering datasets will be provided later." This refers to data, not code, and indicates future availability, not current release. There is no explicit statement or link for the source code of their method. |
| Open Datasets | No | We construct two large-scale multi-turn video question answering datasets. The constructed conversational video question answering datasets will be provided later. The datasets are constructed by the authors, and while they cite the sources they build upon (YouTubeClips [Chen and Dolan, 2011] and TACoS-MultiLevel [Rohrbach et al.]), the *constructed* datasets are not immediately accessible as they are stated to be "provided later". |
| Dataset Splits | Yes | We take 90% of constructed conversational video dialogs as the training data, 5% as the validation data and 5% as the testing ones. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to conduct the experiments. It only describes the software frameworks and data processing. |
| Software Dependencies | No | We resize each frame to 224 × 224 and extract the visual representation of each frame by the pretrained VGGNet [Simonyan and Zisserman, 2014], and take the 4,096-dimensional feature vector for each frame. We employ the pre-trained word2vec model [Mikolov et al., 2013] to extract the semantic representation of questions and answers. The paper mentions VGGNet and word2vec but does not provide specific version numbers for these or other software dependencies. (A hedged sketch of this frame-feature pipeline appears below the table.) |
| Experiment Setup | Yes | We resize each frame to 224 × 224 and extract the visual representation of each frame by the pretrained VGGNet. We employ the pre-trained word2vec model to extract the semantic representation of questions and answers. Specifically, the size of the vocabulary set is 6,500 and the dimension of the word vectors is set to 256. The input words of our method are initialized by pre-trained word embeddings [Mikolov et al., 2013] with size of 256, and the weights of the LSTM networks are randomly initialized by a Gaussian distribution with zero mean. We take 90% of the constructed conversational video dialogs as the training data, 5% as the validation data and 5% as the testing ones. (A sketch of this setup appears below the table.) |
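
The frame-feature pipeline quoted under Software Dependencies can be illustrated with a short, hedged sketch. The paper does not name its deep learning framework, the exact VGG variant, or the layer the features are taken from, so the use of PyTorch/torchvision, VGG-16, and the fc7 layer below are assumptions; only the 224 × 224 resize and the 4,096-dimensional per-frame feature come from the paper.

```python
# Hedged sketch (not the authors' code): extract a 4,096-d feature per frame
# with a pretrained VGG network, as described in the paper's setup.
# torchvision, VGG-16 and the fc7 layer are assumptions; the paper only says
# "pretrained VGGNet" and "4,096-dimensional feature vector for each frame".
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True).eval()
fc7_extractor = torch.nn.Sequential(
    vgg.features,
    vgg.avgpool,
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:4],  # fc6 -> ReLU -> Dropout -> fc7 (4,096-d)
)

preprocess = T.Compose([
    T.Resize((224, 224)),  # "resize each frame to 224 x 224"
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_feature(frame: Image.Image) -> torch.Tensor:
    """Return the 4,096-d fc7 feature for a single video frame."""
    with torch.no_grad():
        return fc7_extractor(preprocess(frame).unsqueeze(0)).squeeze(0)
```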
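
Likewise, the quoted experiment setup (6,500-word vocabulary, 256-d word embeddings initialized from pretrained word2vec vectors, zero-mean Gaussian initialization of the LSTM weights, and the 90/5/5 split of dialogs) can be sketched as follows. The LSTM hidden size, the Gaussian standard deviation, and the use of PyTorch are assumptions not given in the paper.

```python
# Hedged sketch of the reported setup; hidden size and std are assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 6500, 256, 512  # 512 is an assumed hidden size

def build_text_encoder(pretrained_vectors: torch.Tensor):
    """Embedding initialized from pretrained word2vec vectors (6500 x 256)
    and an LSTM whose weights are drawn from a zero-mean Gaussian."""
    embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
    embedding.weight.data.copy_(pretrained_vectors)     # word2vec initialization
    lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
    for name, param in lstm.named_parameters():
        if "weight" in name:
            nn.init.normal_(param, mean=0.0, std=0.01)  # zero mean; std is assumed
    return embedding, lstm

def split_dialogs(dialogs):
    """90% train / 5% validation / 5% test, as reported in the paper."""
    n = len(dialogs)
    n_train, n_val = int(0.90 * n), int(0.05 * n)
    return dialogs[:n_train], dialogs[n_train:n_train + n_val], dialogs[n_train + n_val:]
```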