Rethinking Resolution in the Context of Efficient Video Recognition
Authors: Chuofan Ma, Qiushan Guo, Yi Jiang, Ping Luo, Zehuan Yuan, Xiaojuan Qi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. |
| Researcher Affiliation | Collaboration | Chuofan Ma, The University of Hong Kong (b20mcf@connect.hku.hk); Yi Jiang, ByteDance Inc. (jiangyi0425@gmail.com) |
| Pseudocode | No | The paper describes methods in text and figures but does not provide pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be available at https://github.com/CVMI-Lab/ResKD. |
| Open Datasets | Yes | We benchmark ResKD on five commonly used action recognition datasets. In particular, ActivityNet-v1.3 [3] and FCVID [18] are used for evaluation on untrimmed videos: (1) ActivityNet-v1.3 contains 10,024 training videos and 4,926 validation videos from 200 action classes, with an average duration of 117 seconds. (2) FCVID includes 45,611 training videos and 45,612 validation videos labeled into 239 classes, with an average length of 167 seconds. As for trimmed video evaluation, we use: (3) Kinetics-400 [19] is a large-scale scene-related dataset covering 400 human action categories, with at least 400 video clips for each class. (4) Mini-Kinetics is a subset of Kinetics-400 introduced by [32, 33]. It includes 121,215 videos for training and 9,867 videos for testing, coming from 200 action classes. (5) Something-Something V2 [14] is a temporal-related dataset which contains 168,913 training videos and 24,777 validation videos over 174 classes. |
| Dataset Splits | Yes | ActivityNet-v1.3 contains 10,024 training videos and 4,926 validation videos from 200 action classes |
| Hardware Specification | Yes | Throughput (number of videos processed per second) is measured on a single Tesla V100 SXM2 GPU with the batch size of 64. |
| Software Dependencies | No | The paper mentions using a codebase: "We use the codebase provided by [5] for implementation." but does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, CUDA 11.x). |
| Experiment Setup | Yes | Unless otherwise specified, we uniformly sample 8 frames from each trimmed video and 16 frames from each untrimmed video, respectively. For data pre-processing, following [27, 45, 28], we first apply random scaling to all sampled frames, then augment them with 224×224 random cropping and random flipping in the training stage. For the input to student, we further down-sample the resolution of video frames to 112×112. During inference, we resize the short side of all frames to 128 while keeping the aspect ratio, then center-crop them to 112×112. By default, we adopt ResNet-152 [15] as the teacher network, ResNet-50 [15] as the student network, and 112×112 as the student input resolution. |
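
The pre-processing quoted in the Experiment Setup row maps fairly directly onto standard torchvision transforms. Below is a minimal sketch under that reading; the helper names (`train_frame_transform`, `eval_frame_transform`, `make_student_input`) are illustrative rather than the authors' code, and the bilinear interpolation mode for the 224→112 down-sampling is an assumption, since the paper does not specify it.

```python
# Sketch of the reported train/eval pre-processing, assuming torchvision.
# In a real video pipeline the same random crop/flip would be applied
# consistently to all frames of a clip, not independently per frame.
import torch
import torch.nn.functional as F
from torchvision import transforms

train_frame_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random scaling + 224x224 random crop
    transforms.RandomHorizontalFlip(),   # random flipping
    transforms.ToTensor(),
])

eval_frame_transform = transforms.Compose([
    transforms.Resize(128),              # short side to 128, aspect ratio kept
    transforms.CenterCrop(112),          # center-crop to 112x112
    transforms.ToTensor(),
])

def make_student_input(frames_224: torch.Tensor) -> torch.Tensor:
    """Down-sample teacher-resolution frames (T, C, 224, 224) to the
    112x112 student resolution (interpolation mode is an assumption)."""
    return F.interpolate(frames_224, size=(112, 112),
                         mode="bilinear", align_corners=False)
```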
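
The throughput figure in the Hardware Specification row (videos per second on a single Tesla V100, batch size 64) corresponds to a standard GPU timing loop. A minimal sketch follows, assuming a PyTorch model that consumes a `(batch, frames, channels, H, W)` tensor; the function name `measure_throughput` and the warm-up/iteration counts are assumptions, not details from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch, n_warmup=10, n_iters=50):
    # batch: e.g. 64 videos of 8 frames at 112x112 -> shape (64, 8, 3, 112, 112)
    model.eval().cuda()
    batch = batch.cuda()
    for _ in range(n_warmup):        # warm-up runs exclude CUDA init / autotuning
        model(batch)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(batch)
    torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    elapsed = time.time() - start
    return n_iters * batch.shape[0] / elapsed   # videos processed per second
```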