Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
Authors: Min Peng, Chongyang Wang, Yu Shi, Xiang-Dong Zhou
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five Video QA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, as well as the effectiveness of the pyramid. |
| Researcher Affiliation | Academia | Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tsinghua University |
| Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code available at: https://github.com/Trunpm/PMT-AAAI23. |
| Open Datasets | Yes | We use the state-of-the-art benchmarks for Video QA in our experiment: 1) TGIF-QA (Jang et al. 2017)... 2) MSVD-QA (Xu et al. 2017)... 3) MSRVTT-QA (Xu et al. 2017, 2016)... 4) ActivityNet-QA (Yu et al. 2019)... and 5) YouTube2Text-QA (Ye et al. 2017)... |
| Dataset Splits | Yes | We use the official split of training, validation, and testing sets provided by the datasets, and report results acquired on the testing set. |
| Hardware Specification | Yes | During our experiment, we use the PyTorch deep learning library and merely four NVIDIA GTX 1080 Ti GPUs. |
| Software Dependencies | No | The paper mentions the 'PyTorch deep learning library' and the 'GloVe embedding method' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For video processing, the number of frames is T = 16, and the size of each frame is H = W = 224. The feature dimension is D = 512. The number of heads is set to H = 8 in the multimodal transformer block. The penalty factor λ is set to 0.1. The Adam optimizer is used with an initial learning rate of 10^-4, which is halved when the loss has not decreased for 10 epochs. The maximum number of epochs is 50, and the batch size is 32 for Video QA and 8 for text-to-video retrieval. |
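
To make the reported optimization settings concrete, the sketch below wires them into standard PyTorch primitives: Adam at a 10^-4 learning rate, halved via `ReduceLROnPlateau` with patience 10, up to 50 epochs, batch size 32. This is a minimal illustration under stated assumptions, not the authors' training script: the model, loss, and data here are placeholder stand-ins, and the actual PMT implementation lives in the linked repository (https://github.com/Trunpm/PMT-AAAI23).

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Reported hyperparameters from the paper's experiment setup.
T, H, W = 16, 224, 224      # frames per clip, frame height/width
D = 512                     # feature dimension
NUM_HEADS = 8               # heads in the multimodal transformer block
LAMBDA = 0.1                # penalty factor (applied in the paper's full loss)
BATCH_SIZE = 32             # 8 for text-to-video retrieval
MAX_EPOCHS = 50

# Placeholder model: a single transformer encoder layer stands in for the
# PMT network so that this loop is runnable as-is.
model = nn.TransformerEncoderLayer(d_model=D, nhead=NUM_HEADS, batch_first=True)

optimizer = Adam(model.parameters(), lr=1e-4)
# Halve the learning rate once the loss has not decreased for 10 epochs.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=10)

criterion = nn.MSELoss()  # placeholder; the paper's loss adds a lambda-weighted penalty

for epoch in range(MAX_EPOCHS):
    # Dummy batch of T frame-level features per clip; real code would
    # iterate over the Video QA dataloader instead.
    x = torch.randn(BATCH_SIZE, T, D)
    target = torch.randn(BATCH_SIZE, T, D)

    optimizer.zero_grad()
    loss = criterion(model(x), target)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())
```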