Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Authors: Min Peng, Chongyang Wang, Yu Shi, Xiang-Dong Zhou

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate better or on-par performances with high computational efficiency against state-of-the-art methods on five Video QA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and the effectiveness of the pyramid."
Researcher Affiliation | Academia | (1) Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Tsinghua University
Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at https://github.com/Trunpm/PMT-AAAI23.
Open Datasets | Yes | "We use the state-of-the-art benchmarks for Video QA in our experiment: 1) TGIF-QA (Jang et al. 2017)... 2) MSVD-QA (Xu et al. 2017)... 3) MSRVTT-QA (Xu et al. 2017, 2016)... 4) ActivityNet-QA (Yu et al. 2019)... and 5) YouTube2Text-QA (Ye et al. 2017)..."
Dataset Splits | Yes | "We use the official split of training, validation, and testing sets provided by the datasets, and report results acquired on the testing set."
Hardware Specification | Yes | "During our experiment, we use the PyTorch deep learning library and merely four NVIDIA GTX 1080 Ti GPUs."
Software Dependencies | No | The paper mentions the 'PyTorch deep learning library' and the 'GloVe embedding method' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For video processing, the number of frames is T = 16 and the size of each frame is H = W = 224. The feature dimension is D = 512. The number of heads is set to H = 8 in the multimodal transformer block. The penalty factor λ is set to 0.1. The Adam optimizer is used with an initial learning rate of 10^-4, which is halved when the loss has not decreased for 10 epochs. The maximum number of epochs is 50, and the batch size is 32 for Video QA and 8 for text-to-video retrieval.
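
The quoted optimizer and schedule map directly onto standard PyTorch components. The sketch below wires up that configuration; the single transformer encoder layer is a hypothetical stand-in for the paper's pyramidal multimodal transformer, and the dummy features and objective exist only to make the loop runnable (none of this is taken from the released code):

    import torch
    from torch import nn

    # Hyperparameters quoted from the paper's experiment setup.
    T = 16           # frames per clip
    D = 512          # feature dimension
    N_HEADS = 8      # attention heads in the multimodal transformer block
    LAMBDA = 0.1     # penalty factor λ (unused in this dummy objective)
    MAX_EPOCHS = 50
    BATCH_SIZE = 32  # 32 for Video QA; 8 for text-to-video retrieval

    # Hypothetical stand-in for the pyramidal multimodal transformer:
    # a single encoder layer over T frame features of dimension D.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=D, nhead=N_HEADS, batch_first=True),
        num_layers=1,
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # "cut by half when the loss has not decreased for 10 epochs"
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=10
    )

    for epoch in range(MAX_EPOCHS):
        x = torch.randn(BATCH_SIZE, T, D)  # dummy frame features
        loss = model(x).pow(2).mean()      # dummy objective for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())        # plateau detection drives the halving

ReduceLROnPlateau with factor=0.5 and patience=10 reproduces the described schedule: the learning rate is cut in half once the monitored loss stops improving for 10 consecutive epochs.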