VideoPrism: A Foundational Visual Encoder for Video Understanding

Authors: Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
Researcher Affiliation | Industry | Correspondence to: Long Zhao <longzh@google.com>, Mikhail Sirotenko <msirotenko@google.com>, Ting Liu <liuti@google.com>, Boqing Gong <bgong@google.com>.
Pseudocode | Yes | Algorithm 1 presents a pseudocode implementation of the proposed token shuffling for masked video modeling.
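Algorithm 1 itself is not reproduced in this report. As an illustration only, a minimal sketch of random token shuffling for masked modeling (in the spirit of MAE-style masking) might look like the following; the function name, the NumPy implementation, and the return convention are assumptions, not the paper's code:

```python
import numpy as np

def shuffle_and_mask(tokens: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Randomly shuffle tokens and keep the first (1 - mask_ratio) fraction.

    tokens: (num_tokens, dim) array of patch/token embeddings.
    Returns the kept (visible) tokens, the inverse permutation that restores
    the original order after decoding, and a boolean mask where True marks
    positions that were dropped.
    """
    n = tokens.shape[0]
    n_keep = int(n * (1.0 - mask_ratio))
    perm = rng.permutation(n)        # random shuffle order
    ids_restore = np.argsort(perm)   # inverse permutation to unshuffle later
    kept = tokens[perm[:n_keep]]     # visible tokens fed to the encoder
    mask = np.ones(n, dtype=bool)
    mask[perm[:n_keep]] = False      # False = kept, True = masked out
    return kept, ids_restore, mask
```

With `mask_ratio=0.65` (the Stage 2 value reported in Table 10), 65% of tokens are hidden from the encoder and `ids_restore` lets a decoder place predictions back in their original positions.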
Open Source Code | No | The paper does not provide an explicit statement of, or a link to, open-source code for the described methodology.
Open Datasets | Yes | Our pretraining data consists of 36M clips... as summarized in Table 1. The 36M high-quality video-caption pairs in Anonymous-Corpus #1 are the largest of their kind for ViFMs, to our knowledge, but they are still an order of magnitude smaller than the image-language data used to fuel image FMs (Radford et al., 2021; Yu et al., 2022). Hence, we also collect large-scale video-text data whose noisy text is generated through ASR, metadata, and large multimodal models (Wang et al., 2023e; Zhao et al., 2024), etc. This subset of videos corresponds to the rows from WTS-70M to Anonymous-Corpus #3 in Table 1...
Dataset Splits | Yes | For both MSRVTT-QA and MSVD-QA, we find K = 250 to work best. Any example where the groundtruth answer is not one of the candidate answers is automatically marked as incorrect. This method additionally steers the model towards answers that fit the exact style of the particular dataset and boosts performance further.
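The candidate-answer protocol quoted above can be sketched as follows. This is a toy illustration assuming the model exposes a per-answer scoring function; `build_candidates`, `evaluate`, and `predict_scores` are hypothetical names, not from the paper:

```python
from collections import Counter

def build_candidates(train_answers, k=250):
    """Take the K most frequent training-set answers as the candidate pool."""
    return [ans for ans, _ in Counter(train_answers).most_common(k)]

def evaluate(predict_scores, questions, groundtruths, candidates):
    """predict_scores(question, answer) -> float (higher = more likely).

    An example is automatically incorrect when its groundtruth answer is
    not in the candidate pool; otherwise it is correct iff the top-scoring
    candidate equals the groundtruth.
    """
    pool = set(candidates)
    correct = 0
    for q, gt in zip(questions, groundtruths):
        if gt not in pool:
            continue  # marked incorrect by construction
        pred = max(candidates, key=lambda a: predict_scores(q, a))
        correct += int(pred == gt)
    return correct / len(questions)
```

Restricting predictions to the K most frequent dataset answers is what steers the model toward answers in the exact style of each benchmark.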
Hardware Specification | No | The paper does not specify the exact hardware components (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions the 'open-source Tensorflow object detection API' but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Table 10. Summary of our pretraining configurations:

Configuration           Stage 1           Stage 2
Optimizer               AdaFactor         AdaFactor
Base learning rate      5 × 10^-4         5 × 10^-4
Learning rate schedule  linear decay      cosine decay
Warmup iterations       2 × 10^4          2.5 × 10^4
Training iterations     2 × 10^5          3 × 10^5
Weight decay            1 × 10^-4         0.05
Batch size              4096              4096
Drop token or Mask      0.5 (tube mask)   0.65 (BEVT mask)
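For concreteness, the warmup-plus-decay schedules named in Table 10 could be sketched as below. This is a minimal illustration using the table's base learning rate, warmup, and total-iteration values; the exact schedule shapes and end values (decay to zero) are assumptions, not taken from the paper's code:

```python
import math

def stage1_lr(step, base_lr=5e-4, warmup=2e4, total=2e5):
    """Linear warmup to base_lr, then linear decay to zero (Stage 1)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (total - step) / (total - warmup)

def stage2_lr(step, base_lr=5e-4, warmup=2.5e4, total=3e5):
    """Linear warmup to base_lr, then cosine decay to zero (Stage 2)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Both schedules peak at the base learning rate of 5 × 10^-4 at the end of warmup and reach zero at the final training iteration.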