FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

Authors: Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets.
Researcher Affiliation | Collaboration | (1) Visual AI Lab, The University of Hong Kong; (2) Department of Computer Vision Technology (VIS), Baidu Inc.
Pseudocode | No | The paper describes its methods using mathematical equations and descriptive text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Project page: https://visual-ai.github.io/froster.
Open Datasets | Yes | We evaluate our method using the common UCF-101 dataset (Soomro et al., 2012), HMDB-51 dataset (Kuehne et al., 2011), Kinetics-400 (K-400) dataset (Carreira & Zisserman, 2017), Kinetics-600 (K-600) dataset (Carreira et al., 2018), and Something-Something V2 (SSv2) dataset (Goyal et al., 2017).
Dataset Splits | Yes | K-400 includes 240k training and 20k validation samples, while the K-600 dataset is an extension of K-400 with 410k training and 29k validation samples. [...] The HMDB-51 and UCF-101 datasets have three validation splits in the raw data.
Hardware Specification | Yes | We use 8 A100 GPUs to conduct all the experiments.
Software Dependencies | No | The paper mentions using CLIP, ViT, and GPT-3.5, but it does not specify software dependencies with version numbers (e.g., the Python version or specific deep learning framework versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | The initial learning rate is set to 3.33 × 10^-6 and is decayed using a cosine scheduler. For base-to-novel evaluation, each model is trained for 12 epochs with the first 2 epochs used for warm-up. For cross-dataset evaluation, since the training data is larger, the models are trained for 22 epochs, again with the first 2 epochs as warm-up. The hyper-parameters α and β are set to 0.1 and 2. During training, each video is uniformly sampled with 8 frames. During testing, we sample 3 video clips (8 frames per clip) with 1 crop (3 × 1 views) per video and ensemble the outputs by averaging.
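
The experiment-setup row above describes an optimization schedule (cosine decay with warm-up) and a test-time view ensemble (3 × 1 views averaged per video). The sketch below translates that description into code for illustration only; it is not FROSTER's released implementation. Function and variable names (lr_at_epoch, predict_video, model, views) are placeholders, and only the hyper-parameter values quoted above are taken from the paper.

```python
import math
import torch

# Hyper-parameters quoted in the Experiment Setup row above.
BASE_LR = 3.33e-6                 # initial learning rate
WARMUP_EPOCHS = 2                 # warm-up epochs in both settings
EPOCHS_BASE_TO_NOVEL = 12         # base-to-novel evaluation
EPOCHS_CROSS_DATASET = 22         # cross-dataset evaluation
ALPHA, BETA = 0.1, 2.0            # alpha and beta from the quoted setup (their exact
                                  # role in the loss is not detailed in this row)
FRAMES_PER_CLIP = 8               # frames uniformly sampled per video during training
TEST_CLIPS, TEST_CROPS = 3, 1     # 3 x 1 views per video at test time


def lr_at_epoch(epoch: int, total_epochs: int) -> float:
    """Linear warm-up for the first WARMUP_EPOCHS epochs, then cosine decay."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, total_epochs - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


@torch.no_grad()
def predict_video(model: torch.nn.Module, views: torch.Tensor) -> torch.Tensor:
    """Ensemble the 3 x 1 test-time views of one video by averaging the logits.

    `views` is assumed to have shape
    (TEST_CLIPS * TEST_CROPS, FRAMES_PER_CLIP, C, H, W).
    """
    logits = model(views)          # (num_views, num_classes)
    return logits.mean(dim=0)      # average the per-view predictions
```

A complete training loop would additionally need the losses weighted by ALPHA and BETA and the frozen-CLIP distillation described in the paper, which this sketch deliberately omits.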