Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VideoPrism: A Foundational Visual Encoder for Video Understanding

Authors: Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong

ICML 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
Researcher Affiliation | Industry | Correspondence to: Long Zhao <EMAIL>, Mikhail Sirotenko <EMAIL>, Ting Liu <EMAIL>, Boqing Gong <EMAIL>.
Pseudocode | Yes | Algorithm 1 presents a pseudocode implementation of the proposed token shuffling for masked video modeling.
Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the methodology described.
Open Datasets | Yes | Our pretraining data consists of 36M clips... as summarized in Table 1. The 36M high-quality video-caption pairs in Anonymous-Corpus #1 are the largest of its kind for ViFMs, to our knowledge, but they are still an order of magnitude smaller than the image-language data used to fuel image FMs (Radford et al., 2021; Yu et al., 2022). Hence, we also collect large-scale video-text data whose noisy text is generated through ASR, metadata, and large multimodal models (Wang et al., 2023e; Zhao et al., 2024), etc. This subset of videos corresponds to the rows from WTS-70M to Anonymous-Corpus #3 in Table 1...
Dataset Splits | Yes | For both MSRVTT-QA and MSVD-QA, we find K = 250 to work best. Any example where the ground-truth answer is not one of the candidate answers is automatically marked as incorrect. This method additionally steers the model towards answers that fit the exact style of the particular dataset and boosts performance further.
Hardware Specification | No | The paper does not specify the exact hardware components (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions the 'open-source TensorFlow object detection API' but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Table 10 summarizes the pretraining configurations, Stage 1 vs. Stage 2: optimizer Adafactor / Adafactor; base learning rate 5 × 10⁻⁴ / 5 × 10⁻⁴; learning rate schedule linear decay / cosine decay; warmup iterations 2 × 10⁴ / 2.5 × 10⁴; training iterations 2 × 10⁵ / 3 × 10⁵; weight decay 1 × 10⁻⁴ / 0.05; batch size 4096 / 4096; drop token or mask 0.5 (tube mask) / 0.65 (BEVT mask).
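The token shuffling cited under Pseudocode is defined by the paper's Algorithm 1, which is not reproduced here. As a point of reference only, the generic mechanic of permuting tokens along the sequence axis and later restoring their order in masked video modeling can be sketched as follows; the function names and shapes are assumptions, not the paper's code.

```python
import numpy as np

def shuffle_tokens(tokens, rng):
    """Randomly permute tokens along the sequence axis, returning the
    permutation so the original order can be restored. `tokens` has shape
    (seq_len, dim). Generic sketch, not the paper's Algorithm 1."""
    perm = rng.permutation(tokens.shape[0])
    return tokens[perm], perm

def unshuffle_tokens(shuffled, perm):
    """Invert the permutation to restore the original token order."""
    inverse = np.argsort(perm)
    return shuffled[inverse]

# Round-trip check on 6 tokens of dimension 2.
rng = np.random.default_rng(0)
x = np.arange(12, dtype=np.float32).reshape(6, 2)
shuffled, perm = shuffle_tokens(x, rng)
assert np.array_equal(unshuffle_tokens(shuffled, perm), x)
```

The round-trip assertion is the key property any such shuffling scheme must satisfy: the inverse permutation recovers every token's original position.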
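The closed-vocabulary scoring quoted under Dataset Splits (top-K answer candidates, with K = 250 for MSRVTT-QA and MSVD-QA) can be sketched as below; `build_candidates`, `score_example`, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter

def build_candidates(train_answers, k=250):
    """Take the K most frequent training-set answers as the candidate pool.
    K = 250 is the value the quote reports for MSRVTT-QA and MSVD-QA."""
    return [ans for ans, _ in Counter(train_answers).most_common(k)]

def score_example(predicted_answer, ground_truth, candidates):
    """Per the quote: an example whose ground-truth answer is outside the
    candidate pool is automatically counted as incorrect."""
    if ground_truth not in candidates:
        return False
    return predicted_answer == ground_truth

# Toy usage: with k=2 the pool keeps only the two most frequent answers.
train = ["yes", "no", "yes", "dog", "cat", "yes", "no"]
pool = build_candidates(train, k=2)            # -> ["yes", "no"]
assert score_example("yes", "yes", pool)       # correct, in pool
assert not score_example("dog", "dog", pool)   # ground truth outside pool
```

The second assertion shows the detail the quote emphasizes: even a prediction that matches the ground truth scores zero when that answer falls outside the top-K pool.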
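The two-stage pretraining setup from Table 10 can be transcribed into a config structure like the one below; the dictionary layout and key names are illustrative, not taken from the authors' code.

```python
# Transcription of Table 10 (pretraining configurations).
# Structure is an assumption; values are as reported in the paper.
PRETRAIN_CONFIG = {
    "stage1": {
        "optimizer": "Adafactor",
        "base_learning_rate": 5e-4,
        "lr_schedule": "linear_decay",
        "warmup_iterations": 20_000,      # 2 x 10^4
        "training_iterations": 200_000,   # 2 x 10^5
        "weight_decay": 1e-4,
        "batch_size": 4096,
        "mask": {"ratio": 0.5, "type": "tube"},
    },
    "stage2": {
        "optimizer": "Adafactor",
        "base_learning_rate": 5e-4,
        "lr_schedule": "cosine_decay",
        "warmup_iterations": 25_000,      # 2.5 x 10^4
        "training_iterations": 300_000,   # 3 x 10^5
        "weight_decay": 0.05,
        "batch_size": 4096,
        "mask": {"ratio": 0.65, "type": "BEVT"},
    },
}
```

Note that only the schedule, warmup/training lengths, weight decay, and masking differ between stages; optimizer, base learning rate, and batch size are shared.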