Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. |
| Researcher Affiliation | Collaboration | Antoine Yang1,2, Antoine Miech3, Josef Sivic4, Ivan Laptev1,2, Cordelia Schmid1,2 1Inria Paris 2Département d'informatique de l'ENS, CNRS, PSL Research University 3DeepMind 4CIIRC CTU Prague |
| Pseudocode | No | The paper does not contain structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code and models are publicly available at [1]. |
| Open Datasets | Yes | For training we use the publicly available WebVid10M dataset [6], which consists of 10 million video-text pairs scraped from the Shutterstock website, where video captions are obtained from readily-available alt-text descriptions. |
| Dataset Splits | Yes | Unless stated otherwise, we report top-1 test accuracy using the original splits for training, validation and test. |
| Hardware Specification | Yes | The training for 2 epochs on WebVid10M lasts 20 hours on 8 Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions SentencePiece [41] and DeBERTa-V2-XLarge [25] but does not provide specific version numbers for these or other software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) that would be needed for replication. |
| Experiment Setup | Yes | The training for 2 epochs on WebVid10M lasts 20 hours on 8 Tesla V100 GPUs. In detail, we follow [17] and corrupt 15% of text tokens, replacing them 80% of the time with a mask token, 10% of the time with the same token and 10% of the time with a randomly sampled token. |
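
The Experiment Setup row describes the standard BERT-style masked language modeling corruption [17]: 15% of tokens are selected as prediction targets, and of those, 80% are replaced by the mask token, 10% are kept as-is, and 10% are replaced by a random token. Below is a minimal PyTorch sketch of that scheme; the function name `corrupt_tokens` and its parameter names are illustrative assumptions, not taken from the authors' released code.

```python
import torch

def corrupt_tokens(input_ids, mask_token_id, vocab_size,
                   corrupt_prob=0.15, mask_frac=0.8):
    """BERT-style corruption: select `corrupt_prob` of tokens as MLM
    targets; of those, 80% -> mask token, 10% -> random, 10% -> unchanged."""
    labels = input_ids.clone()
    # Sample which positions become prediction targets (15% by default).
    target = torch.bernoulli(torch.full(input_ids.shape, corrupt_prob)).bool()
    labels[~target] = -100  # positions ignored by the cross-entropy loss

    corrupted = input_ids.clone()
    # 80% of target positions are replaced with the mask token.
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_frac)).bool() & target
    corrupted[masked] = mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) get a random token;
    # the other 10% keep their original token.
    random_pos = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & target & ~masked)
    corrupted[random_pos] = torch.randint(
        vocab_size, input_ids.shape, dtype=corrupted.dtype)[random_pos]
    return corrupted, labels
```

The masked language modeling loss is then computed only at the corrupted positions (those whose label is not -100), which is the standard recipe of [17] that the paper follows.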