Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings.
Researcher Affiliation | Collaboration | Antoine Yang (1,2), Antoine Miech (3), Josef Sivic (4), Ivan Laptev (1,2), Cordelia Schmid (1,2); (1) Inria Paris, (2) Département d'informatique de l'ENS, CNRS, PSL Research University, (3) DeepMind, (4) CIIRC, CTU Prague
Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our code and models are publicly available at [1].
Open Datasets | Yes | For training we use the publicly available WebVid10M dataset [6], which consists of 10 million video-text pairs scraped from the Shutterstock website, where video captions are obtained from readily-available alt-text descriptions.
Dataset Splits | Yes | Unless stated otherwise, we report top-1 test accuracy using the original splits for training, validation and test.
Hardware Specification | Yes | The training for 2 epochs on WebVid10M lasts 20 hours on 8 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions SentencePiece [41] and DeBERTa-V2-XLarge [25] but does not provide specific version numbers for these or for other software dependencies, such as deep learning frameworks (e.g., PyTorch, TensorFlow), that would be needed for replication (see the sketch below).
Experiment Setup | Yes | The training for 2 epochs on WebVid10M lasts 20 hours on 8 Tesla V100 GPUs. In detail, we follow [17] and corrupt 15% of text tokens, replacing them 80% of the time with a mask token, 10% of the time with the same token and 10% of the time with a randomly sampled token.
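
Because the Software Dependencies row notes that no versions or frameworks are specified, the following is only a minimal sketch of one plausible environment. It assumes the HuggingFace transformers library with PyTorch and a SentencePiece backend; the checkpoint identifier microsoft/deberta-v2-xlarge and the framework choice are assumptions on our part, not details confirmed by the paper.

# Hypothetical environment sketch; the paper names SentencePiece and
# DeBERTa-V2-XLarge but gives no versions, so this assumes a HuggingFace
# transformers + PyTorch stack.
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "microsoft/deberta-v2-xlarge"  # assumed checkpoint identifier

# The DeBERTa-V2 tokenizer uses SentencePiece under the hood.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# In FrozenBiLM the bidirectional language model weights are kept frozen;
# only the lightweight modules added around it are trained.
for param in model.parameters():
    param.requires_grad = False

Pinning exact transformers and torch versions would still be required for a faithful reproduction; the sketch above only shows which components the paper names.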
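
The Experiment Setup row describes the standard BERT-style corruption scheme of [17]. Below is a minimal sketch of that 15% / 80-10-10 rule; the function name and tensor conventions are illustrative assumptions rather than the authors' implementation.

import torch

def corrupt_text_tokens(input_ids: torch.Tensor, mask_token_id: int,
                        vocab_size: int, mlm_prob: float = 0.15):
    """BERT-style corruption: select 15% of tokens, then replace 80% of the
    selected tokens with the mask token, keep 10% unchanged, and swap 10%
    for a randomly sampled token. Returns corrupted inputs and MLM labels."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose 15% of positions to corrupt; the loss is computed only on these.
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by the cross-entropy loss

    # 80% of the selected positions become the mask token.
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) become random tokens;
    # the other 10% keep their original token id.
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    random_ids = torch.randint(vocab_size, input_ids.shape)
    input_ids[randomized] = random_ids[randomized]

    return input_ids, labels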