Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
VideoGLUE: Video General Understanding Evaluation of Foundation Models
Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs' efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. |
| Researcher Affiliation | Industry | Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong. Google DeepMind. Reviewed on OpenReview: https://openreview.net/forum?id=wnI4sJtjqL. Equal technical contributions. Corresponding to EMAIL and EMAIL. Work done at Google. YC is now at NVIDIA; LJ is now at ByteDance; MJ was an intern at Google and is now at Meta. |
| Pseudocode | No | The paper includes figures illustrating model architectures and adaptation methods (e.g., Figure 2, Figure 4, Figure 5) and describes procedures in text, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is released under https://github.com/tensorflow/models/tree/master/official/projects/videoglue. |
| Open Datasets | Yes | We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community... Kinetics-400 (K400) (Kay et al., 2017), Moments in Time (MiT) (Monfort et al., 2019), and Charades (Sigurdsson et al., 2016)... Something-something v2 (SSv2) (Goyal et al., 2017) and Diving48 (D48) (Li et al., 2018)... ActivityNet v1.3 (ANet) (Fabian Caba Heilbron & Niebles, 2015), Atomic Visual Actions (AVA) (Gu et al., 2018), and AVA-Kinetics (AVAK) (Li et al., 2020). |
| Dataset Splits | Yes | Table 2: Summary of statistics, video properties, and data sources of each dataset. Tasks involved are video classification (VC), spatiotemporal action localization (STAL), and temporal action localization (TAL). Columns: Dataset, # of videos (train/validation), Avg. length, Source, Notes. VC: Kinetics-400, 235,693 / 19,165, 10 secs, Web, Holistic, appearance; Moments in Time, 791,246 / 33,898, 3 secs, Web, Holistic, appearance; Something-Something v2, 168,913 / 24,777, 2-6 secs, Crowdsource, Holistic, motion; Diving48, 15,027 / 1,970, 5 secs, Web, Holistic, motion; Charades, 7,811 / 1,814, 30 secs, Crowdsource, Multi-label, long-clip. TAL: ActivityNet v1.3, 10,002 / 4,926, 5-10 mins, Web, Temporal. STAL: AVA v2.2, 210,634 / 57,371, 15 mins, Movie, Spatiotemporal, instance; AVA-Kinetics, 354,201 / 91,919, 10 secs, Web, Spatiotemporal, instance. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU model, CPU type, memory) used for running the experiments. It mentions that "all FMs in our evaluation are ViT-B", which refers to a model size, not hardware. |
| Software Dependencies | No | The paper mentions using well-known models and benchmarks but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | Yes | D Task-specific hyperparameters. In the following, we provide experiment settings and hyperparameters we used in this study. In Table 10, we list the hyperparameters we applied in the video classification task. In Table 11, we present the hyperparameters we used on spatiotemporal action localization. In Table 12, we present the hyperparameters we used on the temporal action localization task. We performed a greedy search on the learning rate and weight decay in all our experiments while keeping most other hyperparameters (e.g., data augmentation magnitude, dropout rate, drop path rate, etc.) consistent across different models and datasets. Specifically, we start with learning rate 1e-4 and weight decay 1e-5 and uniformly sample learning rates and weight decay factors with a rate of 5 and 10, respectively, centered around the starting points. After the first round, we pick the best-identified learning rate and weight decay factor as the new starting point and conduct another round of sampling with a rate of 2. We repeat another two to three rounds of hyperparameter search (with a rate of 2) until the model's performance converges. This process is a trade-off between computation costs and thoroughly examining an FM's performance under each experiment setup. The search ranges for the learning rate and weight decay are [4e-5, 2.5e-3] and [1e-6, 1e-4], respectively. We found that the learning rate is the most crucial factor when adapting an FM to downstream video understanding tasks. |
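The greedy search described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: `train_and_eval` is a hypothetical callback that trains the adapted FM and returns a validation score, and the "uniform sampling with a rate r" is simplified here to a deterministic grid of {center/r, center, center*r} around the current best point.

```python
def greedy_hparam_search(train_and_eval, rounds=4):
    """Greedy coordinate search over learning rate and weight decay.

    Starts at (lr=1e-4, wd=1e-5); round 1 samples with rates (5, 10),
    later rounds with rate 2, always re-centering on the best point found.
    Candidates outside the paper's search ranges are skipped.
    """
    LR_RANGE = (4e-5, 2.5e-3)
    WD_RANGE = (1e-6, 1e-4)
    lr0, wd0 = 1e-4, 1e-5                      # starting points from the paper
    rates = [(5.0, 10.0)] + [(2.0, 2.0)] * (rounds - 1)
    best = (lr0, wd0, train_and_eval(lr0, wd0))
    for lr_rate, wd_rate in rates:
        lr_c, wd_c = best[0], best[1]          # re-center on current best
        lrs = [lr_c / lr_rate, lr_c, lr_c * lr_rate]
        wds = [wd_c / wd_rate, wd_c, wd_c * wd_rate]
        for lr in lrs:
            for wd in wds:
                if not (LR_RANGE[0] <= lr <= LR_RANGE[1]):
                    continue
                if not (WD_RANGE[0] <= wd <= WD_RANGE[1]):
                    continue
                score = train_and_eval(lr, wd)
                if score > best[2]:
                    best = (lr, wd, score)
    return best  # (best_lr, best_wd, best_score)
```

With a cheap synthetic objective in place of real training, the search converges to the neighborhood of the optimum within the stated ranges after a few rounds, matching the paper's two-to-three-round convergence behavior.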