Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VideoGLUE: Video General Understanding Evaluation of Foundation Models
Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs' efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. |
| Researcher Affiliation | Industry | Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong. Google DeepMind. Reviewed on OpenReview: https://openreview.net/forum?id=wnI4sJtjqL. Equal technical contributions. Corresponding to EMAIL and EMAIL. Work done at Google. YC is now at NVIDIA; LJ is now at ByteDance; MJ was an intern at Google and is now at Meta. |
| Pseudocode | No | The paper includes figures illustrating model architectures and adaptation methods (e.g., Figure 2, Figure 4, Figure 5) and describes procedures in text, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is released under https://github.com/tensorflow/models/tree/master/official/projects/videoglue. |
| Open Datasets | Yes | We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community... Kinetics-400 (K400) (Kay et al., 2017), Moments in Time (MiT) (Monfort et al., 2019), and Charades (Sigurdsson et al., 2016)... Something-something v2 (SSv2) (Goyal et al., 2017) and Diving48 (D48) (Li et al., 2018)... ActivityNet v1.3 (ANet) (Fabian Caba Heilbron & Niebles, 2015), Atomic Visual Actions (AVA) (Gu et al., 2018), and AVA-Kinetics (AVAK) (Li et al., 2020). |
| Dataset Splits | Yes | Table 2: Summary of statistics, video properties, and data sources of each dataset. Tasks involved are video classification (VC), spatiotemporal action localization (STAL), and temporal action localization (TAL). Columns: Dataset, # of videos (train/validation), Avg. length, Source, Notes. VC: Kinetics-400, 235,693 / 19,165, 10 secs, Web, Holistic, appearance; Moments in Time, 791,246 / 33,898, 3 secs, Web, Holistic, appearance; Something-Something v2, 168,913 / 24,777, 2-6 secs, Crowdsource, Holistic, motion; Diving48, 15,027 / 1,970, 5 secs, Web, Holistic, motion; Charades, 7,811 / 1,814, 30 secs, Crowdsource, Multi-label, long-clip. TAL: ActivityNet v1.3, 10,002 / 4,926, 5-10 mins, Web, Temporal. STAL: AVA v2.2, 210,634 / 57,371, 15 mins, Movie, Spatiotemporal, instance; AVA-Kinetics, 354,201 / 91,919, 10 secs, Web, Spatiotemporal, instance. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU type, memory) used for running the experiments. It mentions that "all FMs in our evaluation are ViT-B", which refers to a model size, not hardware. |
| Software Dependencies | No | The paper mentions using well-known models and benchmarks but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | Yes | D Task-specific hyperparameters. In the following, we provide experiment settings and hyperparameters we used in this study. In Table 10, we list the hyperparameters we applied in the video classification task. In Table 11, we present the hyperparameters we used on spatiotemporal action localization. In Table 12, we present the hyperparameters we used on the temporal action localization task. We performed a greedy search on the learning rate and weight decay in all our experiments while keeping most other hyperparameters (e.g., data augmentation magnitude, dropout rate, drop path rate, etc.) consistent across different models and datasets. Specifically, we start with learning rate 1e-4 and weight decay 1e-5 and uniformly sample learning rates and weight decay factors with a rate of 5 and 10, respectively, centered around the starting points. After the first round, we pick the best-identified learning rate and weight decay factor as the new starting point and conduct another round of sampling with a rate of 2. We repeat another two to three rounds of hyperparameter search (with a rate of 2) until the model's performance converges. This process is a trade-off between computation costs and thoroughly examining an FM's performance under each experiment setup. The search ranges for the learning rate and weight decay are [4e-5, 2.5e-3] and [1e-6, 1e-4], respectively. We found that the learning rate is the most crucial factor when adapting an FM to downstream video understanding tasks. |
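The greedy search described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: `train_and_eval` is a hypothetical callback that trains the adapted FM and returns a validation score, and the "uniform sampling with a rate r" is simplified here to a deterministic grid of {center/r, center, center*r} around the current best point.

```python
def greedy_hparam_search(train_and_eval, rounds=4):
    """Greedy coordinate search over learning rate and weight decay.

    Starts at (lr=1e-4, wd=1e-5); round 1 samples with rates (5, 10),
    later rounds with rate 2, always re-centering on the best point found.
    Candidates outside the paper's search ranges are skipped.
    """
    LR_RANGE = (4e-5, 2.5e-3)
    WD_RANGE = (1e-6, 1e-4)
    lr0, wd0 = 1e-4, 1e-5                      # starting points from the paper
    rates = [(5.0, 10.0)] + [(2.0, 2.0)] * (rounds - 1)
    best = (lr0, wd0, train_and_eval(lr0, wd0))
    for lr_rate, wd_rate in rates:
        lr_c, wd_c = best[0], best[1]          # re-center on current best
        lrs = [lr_c / lr_rate, lr_c, lr_c * lr_rate]
        wds = [wd_c / wd_rate, wd_c, wd_c * wd_rate]
        for lr in lrs:
            for wd in wds:
                if not (LR_RANGE[0] <= lr <= LR_RANGE[1]):
                    continue
                if not (WD_RANGE[0] <= wd <= WD_RANGE[1]):
                    continue
                score = train_and_eval(lr, wd)
                if score > best[2]:
                    best = (lr, wd, score)
    return best  # (best_lr, best_wd, best_score)
```

With a cheap synthetic objective in place of real training, the search converges to the neighborhood of the optimum within the stated ranges after a few rounds, matching the paper's two-to-three-round convergence behavior.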