FROSTER: Frozen CLIP is A Strong Teacher for Open-Vocabulary Action Recognition
Authors: Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance across all datasets. |
| Researcher Affiliation | Collaboration | 1 Visual AI Lab, The University of Hong Kong; 2 Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper describes its methods using mathematical equations and descriptive text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Project page: https://visual-ai.github.io/froster. |
| Open Datasets | Yes | We evaluate our method using the common UCF-101 dataset (Soomro et al., 2012), HMDB-51 dataset (Kuehne et al., 2011), Kinetics-400 (K-400) dataset (Carreira & Zisserman, 2017), Kinetics-600 (K-600) dataset (Carreira et al., 2018), and Something-Something V2 (SSv2) dataset (Goyal et al., 2017). |
| Dataset Splits | Yes | K-400 includes 240k training and 20k validation samples, while the K-600 dataset, an extension of K-400, has 410k training and 29k validation samples. [...] The HMDB-51 and UCF-101 datasets have three validation splits in the raw data. |
| Hardware Specification | Yes | We use 8 A100 GPUs to conduct all the experiments. |
| Software Dependencies | No | The paper mentions using CLIP, ViT, and GPT-3.5, but it does not specify software dependencies with version numbers (e.g., Python version, specific deep learning framework versions like PyTorch or TensorFlow). |
| Experiment Setup | Yes | The initial learning rate is set to 3.33 × 10^-6 and is decayed using the cosine scheduler. For base-to-novel evaluation, we train each model for 12 epochs and set the first 2 epochs for warming up. In contrast, for cross-dataset evaluation, since we have larger training data, we train the models for 22 epochs with the first 2 epochs as a warm-up. The hyper-parameters α and β are set to 0.1 and 2, respectively. During training, 8 frames are uniformly sampled from each video. During testing, we sample 3 video clips (8 frames per clip) with 1 crop (3 × 1 views) of each video and ensemble the outputs by averaging. |
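
For concreteness, the reported schedule (base learning rate 3.33 × 10^-6, cosine decay with a 2-epoch warm-up) and the 3-clip × 1-crop test-time ensembling could be wired up roughly as below. This is a minimal PyTorch-style sketch under stated assumptions, not the authors' released code: only the hyper-parameters quoted in the table are taken from the paper, and the model and clip tensors are hypothetical placeholders.

```python
# Hedged sketch of the reported training schedule and test-time ensembling.
# All names other than torch/math primitives are hypothetical placeholders.
import math
import torch

BASE_LR = 3.33e-6          # reported initial learning rate
WARMUP_EPOCHS = 2          # reported warm-up length
TOTAL_EPOCHS = 12          # 12 for base-to-novel, 22 for cross-dataset
NUM_FRAMES = 8             # frames uniformly sampled per clip
TEST_CLIPS = 3             # 3 clips x 1 crop at test time

def lr_at_epoch(epoch: int) -> float:
    """Linear warm-up followed by cosine decay, matching the reported setup."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

@torch.no_grad()
def predict_video(model, clips: torch.Tensor) -> torch.Tensor:
    """Average per-clip outputs over the 3 x 1 test views of one video."""
    # clips: (TEST_CLIPS, NUM_FRAMES, C, H, W); model is any video classifier.
    logits = torch.stack([model(clip.unsqueeze(0)) for clip in clips])
    return logits.mean(dim=0)
```

A full reproduction would additionally plug the quoted α = 0.1 and β = 2 into FROSTER's distillation objective, which this sketch does not attempt to cover.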