Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
FROSTER: Frozen CLIP is A Strong Teacher for Open-Vocabulary Action Recognition
Authors: Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. |
| Researcher Affiliation | Collaboration | 1 Visual AI Lab, The University of Hong Kong 2 Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper describes its methods using mathematical equations and descriptive text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Project page: https://visual-ai.github.io/froster. |
| Open Datasets | Yes | We evaluate our method using the common UCF-101 dataset (Soomro et al., 2012), HMDB-51 dataset (Kuehne et al., 2011), Kinetics-400 (K400) dataset (Carreira & Zisserman, 2017), Kinetics-600 (K-600) dataset (Carreira et al., 2018), and Something-to-Something V2 (SSv2) dataset (Goyal et al., 2017). |
| Dataset Splits | Yes | K-400 includes 240k training and 20k validation samples, while the K-600 dataset is an extension of the K-400, which has 410k training and 29k validation samples. [...] HMDB-51 and UCF-101 datasets have three validation splits in the raw data. |
| Hardware Specification | Yes | We use 8 A100 GPUs to conduct all the experiments. |
| Software Dependencies | No | The paper mentions using CLIP, ViT, and GPT3.5, but it does not specify software dependencies with version numbers (e.g., Python version, specific deep learning framework versions like PyTorch or TensorFlow). |
| Experiment Setup | Yes | The initial learning rate is set to 3.33 × 10⁻⁶ and is decayed using the cosine scheduler. For base-to-novel evaluation, we train each model for 12 epochs and set the first 2 epochs for warming up. Differently, for cross-dataset evaluation, since we have larger training data, we train the models for 22 epochs with the first 2 epochs as a warm-up. The hyper-parameters α and β are set as 0.1 and 2. During training, each video is uniformly sampled with 8 frames. During testing, we sample 3 video clips (8 frames per clip) with 1 crop (3 × 1 views) of each video and ensemble the outputs with an average summation. |
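
The Experiment Setup row can be read as a small training and evaluation recipe. The following is a minimal sketch of that recipe in plain Python, not the authors' implementation. The learning rate, warm-up length, epoch counts, frame count, and view count come from the row above; the linear warm-up shape, the decay to zero, the segment-centre sampling rule, and all function names (`lr_at_epoch`, `uniform_frame_indices`, `ensemble_views`) are assumptions made for illustration.

```python
import math
from typing import List

BASE_LR = 3.33e-6    # initial learning rate quoted in the table
WARMUP_EPOCHS = 2    # warm-up length quoted in the table


def lr_at_epoch(epoch: float, total_epochs: int,
                base_lr: float = BASE_LR,
                warmup_epochs: int = WARMUP_EPOCHS) -> float:
    """Learning rate at a (possibly fractional) epoch.

    Assumption: warm-up is linear from 0 and the cosine decay ends at 0;
    the source only states "cosine scheduler" and "warm-up".
    """
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def uniform_frame_indices(num_video_frames: int, num_samples: int = 8) -> List[int]:
    """Pick the centre frame of each of `num_samples` equal-length segments.

    Assumption: one common reading of "uniformly sampled with 8 frames".
    """
    segment = num_video_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def ensemble_views(view_scores: List[List[float]]) -> List[float]:
    """Average per-class scores over views (3 clips x 1 crop = 3 views)."""
    num_views = len(view_scores)
    return [sum(class_scores) / num_views for class_scores in zip(*view_scores)]


if __name__ == "__main__":
    # Base-to-novel setting: 12 epochs; cross-dataset setting would use 22.
    for epoch in range(0, 13, 2):
        print(epoch, lr_at_epoch(epoch, total_epochs=12))
    print(uniform_frame_indices(16))
    print(ensemble_views([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
```

For the cross-dataset setting, the same schedule would be called with `total_epochs=22`. An actual reproduction would more likely rely on the scheduler and data-loading utilities of whichever deep learning framework the released code uses, which the table notes is not specified with version numbers.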