Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Authors: Wenhao Wu, Zhun Sun, Wanli Ouyang
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section headings: "4 Experiments: Video Recognition", "4.2 Main Results", "4.3 Ablations on Kinetics". Quotes: "In Table 1, on the challenging Kinetics-400 dataset, we compare to state-of-the-arts that are pre-trained on large-scale datasets..."; "Table 3 reports the Top-1 accuracy for the four datasets." |
| Researcher Affiliation | Collaboration | Wenhao Wu1, Zhun Sun2, Wanli Ouyang3* 1The University of Sydney, NSW, Australia 2Baidu Inc., Beijing, China 3Shanghai Artificial Intelligence Laboratory, Shanghai, China whwu.ucas@gmail.com, sunzhun@baidu.com, wanli.ouyang@sydney.edu.au |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/whwu95/Text4Vis. |
| Open Datasets | Yes | To evaluate our method for video recognition, we conduct experiments on five popular datasets, i.e., Kinetics-400 (Kay et al. 2017), Kinetics-600 (Carreira et al. 2018), UCF101 (Soomro, Zamir, and Shah 2012), HMDB-51 (Kuehne et al. 2011) and ActivityNet-v1.3 (Caba Heilbron et al. 2015). |
| Dataset Splits | No | The paper mentions using train and test sets but does not provide specific percentages, sample counts, or explicit references to standard predefined splits for all datasets used, making full reproduction of data partitioning difficult without supplementary material. |
| Hardware Specification | Yes | Ours vs. contrastive-based paradigm with ViT-B/16 on Kinetics-400. The number of V100 days is the number of V100 GPUs used for training multiplied by the training time in days. A table marker indicates the official result (Wang, Xing, and Liu 2021) via data-parallel training on 3090 GPUs. ... We use a batch size of 16 to measure the throughput. Our models achieve 29× faster throughput and 44× fewer FLOPs compared with the previous transformer-based method ViViT (Arnab et al. 2021) under the same accuracy. Analysis on Efficiency. In Table 10, we present the computational cost and efficiency of our models. We follow the common inference settings by using a single NVIDIA A100 GPU to measure the throughput. (A generic throughput-measurement sketch is given below the table.) |
| Software Dependencies | No | The paper does not list specific software components with their version numbers required for replication (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The classifier in our paradigm is initialized from the textual embedding of the class names and then frozen (fixed), leaving only the parameters in the video encoder to be learned. To trade off accuracy and speed, we consider two inference strategies: (1) Single View: We use only 1 clip per video and the center crop for efficient evaluation (e.g., as in Section 4.3). (2) Multiple Views: This is a widely used setting in previous works (Feichtenhofer et al. 2019; Carreira and Zisserman 2017) to sample multiple clips per video with several spatial crops in order to get higher accuracy. For comparison with SOTAs, we use four clips with three crops (4 × 3 views) in Table 1. See Supplementary for training hyperparameters. ... Unless specified otherwise, we use ViT-B/16 with 8 frames as the video backbone and a single view for testing. (A minimal sketch of the frozen text-embedding classifier is given below the table.) |
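
To make the frozen-classifier setup quoted above concrete, here is a minimal PyTorch sketch. It is not the authors' Text4Vis implementation (see the linked repository for that); the prompt template, the toy class list, and the `classify` helper are illustrative assumptions. Only the core idea — initializing a fixed linear classifier from CLIP text embeddings of the class names and training only the video encoder — is taken from the paper.

```python
# Sketch only: frozen classifier built from CLIP text embeddings of class names.
# Assumptions (not from the paper): prompt template, toy class list, `classify` helper.
import torch
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/16", device=device)

# Toy class names standing in for the Kinetics-400 label set.
class_names = ["archery", "bowling", "juggling balls"]
prompts = clip.tokenize([f"a video of a person {c}." for c in class_names]).to(device)

with torch.no_grad():
    text_emb = clip_model.encode_text(prompts).float()         # (C, 512) for ViT-B/16
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # L2-normalize

# Frozen linear classifier whose rows are the class-name embeddings.
classifier = torch.nn.Linear(text_emb.shape[1], len(class_names), bias=False).to(device)
classifier.weight.data.copy_(text_emb)
classifier.weight.requires_grad_(False)  # only the video encoder remains trainable

def classify(video_features: torch.Tensor) -> torch.Tensor:
    # video_features: (B, 512) clip-level features from the (trainable) video encoder.
    video_features = video_features / video_features.norm(dim=-1, keepdim=True)
    return classifier(video_features)  # similarity logits over class names
```

For multi-view testing as described in the setup row, the usual practice is to average the logits (or softmax scores) of the sampled clips and spatial crops before taking the arg-max.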
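
The throughput figures quoted in the hardware row come from the authors' own benchmarking. The following is only a generic single-GPU timing loop under the stated settings (batch size 16, one accelerator), with a `torchvision` ViT-B/16 as a hypothetical stand-in for the video encoder; it is not the authors' script.

```python
# Generic throughput measurement sketch (not the authors' benchmarking code).
import time
import torch
import torchvision

device = "cuda"
# Hypothetical stand-in for the video encoder; input shape conventions are assumed.
model = torchvision.models.vit_b_16().to(device).eval()
batch = torch.randn(16, 3, 224, 224, device=device)  # batch size 16, center-cropped views

with torch.no_grad():
    for _ in range(10):      # warm-up so kernel launches/caches do not skew timing
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 50
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"throughput: {16 * iters / elapsed:.1f} samples/s")
```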