Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VideoPrism: A Foundational Visual Encoder for Video Understanding

Authors: Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong

ICML 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
Researcher Affiliation | Industry | Correspondence to: Long Zhao <EMAIL>, Mikhail Sirotenko <EMAIL>, Ting Liu <EMAIL>, Boqing Gong <EMAIL>.
Pseudocode | Yes | Algorithm 1 presents a pseudocode implementation of the proposed token shuffling for masked video modeling.
Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the methodology described.
Open Datasets | Yes | Our pretraining data consists of 36M clips... as summarized in Table 1. The 36M high-quality video-caption pairs in Anonymous-Corpus #1 are the largest of its kind for ViFMs, to our knowledge, but they are still an order of magnitude smaller than the image-language data used to fuel image FMs (Radford et al., 2021; Yu et al., 2022). Hence, we also collect large-scale video-text data whose noisy text is generated through ASR, metadata, and large multimodal models (Wang et al., 2023e; Zhao et al., 2024), etc. This subset of videos corresponds to the rows from WTS-70M to Anonymous-Corpus #3 in Table 1...
Dataset Splits | Yes | For both MSRVTT-QA and MSVD-QA, we find K = 250 to work best. Any example where the ground-truth answer is not one of the candidate answers is automatically marked as incorrect. This method additionally steers the model towards answers that fit the exact style of the particular dataset and boosts performance further.
Hardware Specification | No | The paper does not specify the exact hardware components (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions the 'open-source TensorFlow object detection API' but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | Table 10 summarizes the pretraining configurations, Stage 1 vs. Stage 2: optimizer Adafactor / Adafactor; base learning rate 5 × 10⁻⁴ / 5 × 10⁻⁴; learning rate schedule linear decay / cosine decay; warmup iterations 2 × 10⁴ / 2.5 × 10⁴; training iterations 2 × 10⁵ / 3 × 10⁵; weight decay 1 × 10⁻⁴ / 0.05; batch size 4096 / 4096; drop token or mask 0.5 (tube mask) / 0.65 (BEVT mask).
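The token shuffling cited under Pseudocode is defined by the paper's Algorithm 1, which is not reproduced here. As a point of reference only, the generic mechanic of permuting tokens along the sequence axis and later restoring their order in masked video modeling can be sketched as follows; the function names and shapes are assumptions, not the paper's code.

```python
import numpy as np

def shuffle_tokens(tokens, rng):
    """Randomly permute tokens along the sequence axis, returning the
    permutation so the original order can be restored. `tokens` has shape
    (seq_len, dim). Generic sketch, not the paper's Algorithm 1."""
    perm = rng.permutation(tokens.shape[0])
    return tokens[perm], perm

def unshuffle_tokens(shuffled, perm):
    """Invert the permutation to restore the original token order."""
    inverse = np.argsort(perm)
    return shuffled[inverse]

# Round-trip check on 6 tokens of dimension 2.
rng = np.random.default_rng(0)
x = np.arange(12, dtype=np.float32).reshape(6, 2)
shuffled, perm = shuffle_tokens(x, rng)
assert np.array_equal(unshuffle_tokens(shuffled, perm), x)
```

The round-trip assertion is the key property any such shuffling scheme must satisfy: the inverse permutation recovers every token's original position.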
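The closed-vocabulary scoring quoted under Dataset Splits (top-K answer candidates, with K = 250 for MSRVTT-QA and MSVD-QA) can be sketched as below; `build_candidates`, `score_example`, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter

def build_candidates(train_answers, k=250):
    """Take the K most frequent training-set answers as the candidate pool.
    K = 250 is the value the quote reports for MSRVTT-QA and MSVD-QA."""
    return [ans for ans, _ in Counter(train_answers).most_common(k)]

def score_example(predicted_answer, ground_truth, candidates):
    """Per the quote: an example whose ground-truth answer is outside the
    candidate pool is automatically counted as incorrect."""
    if ground_truth not in candidates:
        return False
    return predicted_answer == ground_truth

# Toy usage: with k=2 the pool keeps only the two most frequent answers.
train = ["yes", "no", "yes", "dog", "cat", "yes", "no"]
pool = build_candidates(train, k=2)            # -> ["yes", "no"]
assert score_example("yes", "yes", pool)       # correct, in pool
assert not score_example("dog", "dog", pool)   # ground truth outside pool
```

The second assertion shows the detail the quote emphasizes: even a prediction that matches the ground truth scores zero when that answer falls outside the top-K pool.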
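The two-stage pretraining setup from Table 10 can be transcribed into a config structure like the one below; the dictionary layout and key names are illustrative, not taken from the authors' code.

```python
# Transcription of Table 10 (pretraining configurations).
# Structure is an assumption; values are as reported in the paper.
PRETRAIN_CONFIG = {
    "stage1": {
        "optimizer": "Adafactor",
        "base_learning_rate": 5e-4,
        "lr_schedule": "linear_decay",
        "warmup_iterations": 20_000,      # 2 x 10^4
        "training_iterations": 200_000,   # 2 x 10^5
        "weight_decay": 1e-4,
        "batch_size": 4096,
        "mask": {"ratio": 0.5, "type": "tube"},
    },
    "stage2": {
        "optimizer": "Adafactor",
        "base_learning_rate": 5e-4,
        "lr_schedule": "cosine_decay",
        "warmup_iterations": 25_000,      # 2.5 x 10^4
        "training_iterations": 300_000,   # 3 x 10^5
        "weight_decay": 0.05,
        "batch_size": 4096,
        "mask": {"ratio": 0.65, "type": "BEVT"},
    },
}
```

Note that only the schedule, warmup/training lengths, weight decay, and masking differ between stages; optimizer, base learning rate, and batch size are shared.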