Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
Authors: Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The extensive experiments show the effectiveness and efficiency of our method. We conduct experiments on an open-ended long-form video question answering dataset [Zhao et al., 2018] (Section 4.1) |
| Researcher Affiliation | Academia | Zhu Zhang¹, Zhou Zhao¹, Zhijie Lin¹, Jingkuan Song² and Xiaofei He³. ¹College of Computer Science, Zhejiang University, China; ²University of Electronic Science and Technology of China, China; ³State Key Lab of CAD&CG, Zhejiang University, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We conduct experiments on an open-ended long-form video question answering dataset [Zhao et al., 2018], which is constructed from the Activity Caption dataset [Krishna et al., 2017] with natural-language descriptions. |
| Dataset Splits | Yes | The details of this dataset are summarized in Table 1. (Table 1 reports per-type splits, e.g. Object: Train 10,338, Valid 1,327, Test 1,296, with similar splits for the other question types, and overall totals of Train 12,961, Valid 3,249, Test 1,296.) |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper mentions software tools such as a "pre-trained 3D-ConvNet" and "pre-trained word2vec", but it does not specify version numbers for these or for any other software dependencies (see the loading sketch after the table). |
| Experiment Setup | Yes | In our HCSA, we set the layer number L of the hierarchical convolutional self-attention encoder to 3, and the segmentation factor H in the attentive segmentation unit is set to 4. To avoid heavy computational cost, we only consider the top-2 layers (K = 2) of the hierarchical encoder for the multi-scale attentive decoder. Moreover, we set the convolution kernel width k to 5, the convolution dimension to 256 and the dimension of the hidden state of the GRU networks to 256 (512 for the Bi-GRU during question encoding). The dimensions of the linear matrices in all kinds of attention are set to 256. During training, we adopt the Adam optimizer to minimize the loss and the learning rate is set to 0.001. |
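
The hyperparameter values quoted in the Experiment Setup row can be collected into a single configuration for quick reference. The sketch below is only illustrative and assumes a PyTorch implementation; the numeric values come from the paper, while the dictionary keys, the placeholder GRU module and the optimizer call are assumptions.

```python
import torch

# Hyperparameters reported in the paper, gathered for reference.
# (Keys are illustrative; only the values are taken from the paper.)
HCSA_CONFIG = {
    "encoder_layers": 3,       # layer number L of the hierarchical encoder
    "segmentation_factor": 4,  # H in the attentive segmentation unit
    "decoder_top_k": 2,        # top-K encoder layers used by the multi-scale decoder
    "conv_kernel_width": 5,    # convolution kernel width k
    "conv_dim": 256,           # convolution dimension
    "gru_hidden_dim": 256,     # GRU hidden size (512 for the Bi-GRU question encoder)
    "attention_dim": 256,      # linear matrices in all attention modules
    "learning_rate": 1e-3,     # Adam learning rate
}

# Placeholder module standing in for the full HCSA model, used only to show
# the optimizer setup described in the paper (Adam, lr = 0.001).
model = torch.nn.GRU(
    input_size=HCSA_CONFIG["conv_dim"],
    hidden_size=HCSA_CONFIG["gru_hidden_dim"],
    batch_first=True,
)
optimizer = torch.optim.Adam(model.parameters(), lr=HCSA_CONFIG["learning_rate"])
```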
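
Because the pre-trained word2vec embeddings and 3D-ConvNet features are named without versions, the loading sketch below is only an assumption about a typical setup: the gensim library, the GoogleNews vectors, the `.npy` feature file and the `embed_question` helper are hypothetical and not taken from the paper.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to a pre-trained word2vec binary (e.g. GoogleNews, 300-d).
WORD2VEC_PATH = "GoogleNews-vectors-negative300.bin"
word_vectors = KeyedVectors.load_word2vec_format(WORD2VEC_PATH, binary=True)

def embed_question(tokens, dim=300):
    """Map question tokens to pre-trained word2vec vectors (zeros for OOV words)."""
    return np.stack([
        word_vectors[t] if t in word_vectors else np.zeros(dim, dtype=np.float32)
        for t in tokens
    ])

# Hypothetical per-video features pre-extracted offline with a 3D-ConvNet.
video_features = np.load("features/video_0001_c3d.npy")  # shape: (num_clips, feat_dim)
question_features = embed_question("what is the man doing".split())
```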