Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings.
Researcher Affiliation | Collaboration | Antoine Yang (1,2), Antoine Miech (3), Josef Sivic (4), Ivan Laptev (1,2), Cordelia Schmid (1,2); (1) Inria Paris, (2) Département d'informatique de l'ENS, CNRS, PSL Research University, (3) DeepMind, (4) CIIRC, CTU Prague
Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our code and models are publicly available at [1].
Open Datasets | Yes | For training we use the publicly available WebVid10M dataset [6], which consists of 10 million video-text pairs scraped from the Shutterstock website, where video captions are obtained from readily-available alt-text descriptions.
Dataset Splits | Yes | Unless stated otherwise, we report top-1 test accuracy using the original splits for training, validation and test.
Hardware Specification | Yes | The training for 2 epochs on WebVid10M lasts 20 hours on 8 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions SentencePiece [41] and DeBERTa-V2-XLarge [25] but does not provide specific version numbers for these or for other software dependencies, such as deep learning frameworks (e.g., PyTorch, TensorFlow), that would be needed for replication (see the sketch below).
Experiment Setup | Yes | The training for 2 epochs on WebVid10M lasts 20 hours on 8 Tesla V100 GPUs. In detail, we follow [17] and corrupt 15% of text tokens, replacing them 80% of the time with a mask token, 10% of the time with the same token and 10% of the time with a randomly sampled token.
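
Because the Software Dependencies row notes that no versions or frameworks are specified, the following is only a minimal sketch of one plausible environment. It assumes the HuggingFace transformers library with PyTorch and a SentencePiece backend; the checkpoint identifier microsoft/deberta-v2-xlarge and the framework choice are assumptions on our part, not details confirmed by the paper.

# Hypothetical environment sketch; the paper names SentencePiece and
# DeBERTa-V2-XLarge but gives no versions, so this assumes a HuggingFace
# transformers + PyTorch stack.
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "microsoft/deberta-v2-xlarge"  # assumed checkpoint identifier

# The DeBERTa-V2 tokenizer uses SentencePiece under the hood.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# In FrozenBiLM the bidirectional language model weights are kept frozen;
# only the lightweight modules added around it are trained.
for param in model.parameters():
    param.requires_grad = False

Pinning exact transformers and torch versions would still be required for a faithful reproduction; the sketch above only shows which components the paper names.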
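
The Experiment Setup row describes the standard BERT-style corruption scheme of [17]. Below is a minimal sketch of that 15% / 80-10-10 rule; the function name and tensor conventions are illustrative assumptions rather than the authors' implementation.

import torch

def corrupt_text_tokens(input_ids: torch.Tensor, mask_token_id: int,
                        vocab_size: int, mlm_prob: float = 0.15):
    """BERT-style corruption: select 15% of tokens, then replace 80% of the
    selected tokens with the mask token, keep 10% unchanged, and swap 10%
    for a randomly sampled token. Returns corrupted inputs and MLM labels."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose 15% of positions to corrupt; the loss is computed only on these.
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by the cross-entropy loss

    # 80% of the selected positions become the mask token.
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) become random tokens;
    # the other 10% keep their original token id.
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    random_ids = torch.randint(vocab_size, input_ids.shape)
    input_ids[randomized] = random_ids[randomized]

    return input_ids, labels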