Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Authors: Min Peng, Chongyang Wang, Yu Shi, Xiang-Dong Zhou

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate better or on-par performances with high computational efficiency against state-of-the-art methods on five Video QA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and the effectiveness of the pyramid."
Researcher Affiliation | Academia | (1) Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) Tsinghua University
Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code is available at https://github.com/Trunpm/PMT-AAAI23.
Open Datasets | Yes | "We use the state-of-the-art benchmarks for Video QA in our experiment: 1) TGIF-QA (Jang et al. 2017)... 2) MSVD-QA (Xu et al. 2017)... 3) MSRVTT-QA (Xu et al. 2017, 2016)... 4) ActivityNet-QA (Yu et al. 2019)... and 5) YouTube2Text-QA (Ye et al. 2017)..."
Dataset Splits | Yes | "We use the official split of training, validation, and testing sets provided by the datasets, and report results acquired on the testing set."
Hardware Specification | Yes | "During our experiment, we use the PyTorch deep learning library and merely four NVIDIA GTX 1080 Ti GPUs."
Software Dependencies | No | The paper mentions the 'PyTorch deep learning library' and the 'GloVe embedding method' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For video processing, the number of frames is T = 16 and the size of each frame is H = W = 224. The feature dimension is D = 512. The number of heads is set to H = 8 in the multimodal transformer block. The penalty factor λ is set to 0.1. The Adam optimizer is used with an initial learning rate of 10^-4, which is halved when the loss has not decreased for 10 epochs. The maximum number of epochs is 50, and the batch size is 32 for Video QA and 8 for text-to-video retrieval.
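
The quoted optimizer and schedule map directly onto standard PyTorch components. The sketch below wires up that configuration; the single transformer encoder layer is a hypothetical stand-in for the paper's pyramidal multimodal transformer, and the dummy features and objective exist only to make the loop runnable (none of this is taken from the released code):

    import torch
    from torch import nn

    # Hyperparameters quoted from the paper's experiment setup.
    T = 16           # frames per clip
    D = 512          # feature dimension
    N_HEADS = 8      # attention heads in the multimodal transformer block
    LAMBDA = 0.1     # penalty factor λ (unused in this dummy objective)
    MAX_EPOCHS = 50
    BATCH_SIZE = 32  # 32 for Video QA; 8 for text-to-video retrieval

    # Hypothetical stand-in for the pyramidal multimodal transformer:
    # a single encoder layer over T frame features of dimension D.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=D, nhead=N_HEADS, batch_first=True),
        num_layers=1,
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # "cut by half when the loss has not decreased for 10 epochs"
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=10
    )

    for epoch in range(MAX_EPOCHS):
        x = torch.randn(BATCH_SIZE, T, D)  # dummy frame features
        loss = model(x).pow(2).mean()      # dummy objective for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())        # plateau detection drives the halving

ReduceLROnPlateau with factor=0.5 and patience=10 reproduces the described schedule: the learning rate is cut in half once the monitored loss stops improving for 10 consecutive epochs.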