Look More but Care Less in Video Recognition
Authors: Yitian Zhang, Yue Bai, Huan Wang, Yi Xu, Yun Fu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five datasets demonstrate the effectiveness and efficiency of our method. Our code is available at https://github.com/BeSpontaneous/AFNet-pytorch. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, Northeastern University 2Khoury College of Computer Science, Northeastern University |
| Pseudocode | No | The paper describes its methodology using text, equations, and architectural diagrams, but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/BeSpontaneous/AFNet-pytorch. |
| Open Datasets | Yes | Our method is evaluated on five video recognition datasets: (1) Mini-Kinetics [23, 24] is a subset of Kinetics [15] which selects 200 classes from Kinetics, containing 121k training videos and 10k validation videos; (2) ActivityNet-v1.3 [2] is an untrimmed dataset with 200 action categories and an average duration of 117 seconds. It contains 10,024 video samples for training and 4,926 for validation; (3) Jester is a hand gesture recognition dataset introduced by [22]. The dataset consists of 27 classes, with 119k training videos and 15k validation videos; (4) Something-Something V1&V2 [10] are two human action datasets with strong temporal information, including 98k and 194k videos for training and validation respectively. |
| Dataset Splits | Yes | Mini-Kinetics ... containing 121k training videos and 10k validation videos; ActivityNet-v1.3 ... It contains 10,024 video samples for training and 4,926 for validation; Jester ... with 119k training videos and 15k validation videos; Something-Something V1&V2 ... including 98k and 194k videos for training and validation respectively. |
| Hardware Specification | No | While the paper's checklist indicates that resource types are specified in Section 4, the section itself quantifies computation using GFLOPs but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud providers) used for the experiments. |
| Software Dependencies | No | The paper mentions the use of ResNet50 and TSM, but does not provide specific software dependencies or library version numbers (e.g., PyTorch version, Python version, CUDA version) required for reproduction. |
| Experiment Setup | Yes | Data pre-processing. We sample 8 frames uniformly to represent every video on Jester, Mini-Kinetics, and 12 frames on ActivityNet and Something-Something to compare with existing works unless specified. During training, the training data is randomly cropped to 224x224 following [35], and we perform random flipping except for Something-Something. At the inference stage, all frames are center-cropped to 224x224 and we use one-crop one-clip per video for efficiency. Implementation details. Our method is built on ResNet50 [12] by default and we replace the first three stages of the network with our proposed AF module. We first train our two-branch network from scratch on ImageNet for fair comparisons with other methods. Then we add the proposed navigation module and train it along with the backbone network on video recognition datasets. In our implementations, RT denotes the ratio of selected frames while RS represents the ratio of selected regions, which decreases from 1 to the value we set before training by steps. We let the temperature τ in the navigation module decay from 1 to 0.01 during training. |
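For context on the data pre-processing quoted in the Experiment Setup row, the sketch below illustrates uniform temporal sampling and the train/test crops in PyTorch. It is a minimal sketch, not the authors' code: the helper names (`uniform_frame_indices`, `train_transform`, `test_transform`) and the specific transform choices (`RandomResizedCrop`, `Resize(256)` before the center crop) are assumptions; the paper only states uniform sampling of 8 or 12 frames, random 224x224 crops with flipping (no flip on Something-Something), and a single center-cropped clip at inference.

```python
import numpy as np
import torchvision.transforms as T

def uniform_frame_indices(num_total_frames: int, num_segments: int) -> np.ndarray:
    """Pick one frame from each of `num_segments` equal temporal chunks
    (uniform sampling: e.g. 8 frames on Jester/Mini-Kinetics, 12 on
    ActivityNet and Something-Something)."""
    chunk = num_total_frames / num_segments
    # Take the middle frame of every chunk; the clamp guards very short clips.
    centers = (np.arange(num_segments) * chunk + chunk / 2).astype(int)
    return np.minimum(centers, num_total_frames - 1)

# Training-time augmentation: random 224x224 crop plus random horizontal flip
# (the flip is skipped for Something-Something, whose labels are direction-sensitive).
train_transform = T.Compose([
    T.RandomResizedCrop(224),   # assumption: the paper follows [35] for random cropping
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# Inference: one clip and one 224x224 center crop per video for efficiency.
test_transform = T.Compose([
    T.Resize(256),              # assumption: short-side resize before the center crop
    T.CenterCrop(224),
    T.ToTensor(),
])
```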
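The implementation details also mention two training-time schedules: the spatial selection ratio RS stepping down from 1 to a preset target, and the navigation-module temperature τ decaying from 1 to 0.01. Below is a hedged sketch of plausible schedules; the decay shapes (exponential for τ, linear steps for RS), the epoch-based stepping, and the function names are assumptions, since the quoted text only gives the endpoints.

```python
def temperature_at(epoch: int, total_epochs: int,
                   tau_start: float = 1.0, tau_end: float = 0.01) -> float:
    """Decay the navigation-module temperature τ from tau_start to tau_end
    over training (exponential interpolation is an assumption; the paper
    states only the endpoints 1 -> 0.01)."""
    ratio = epoch / max(total_epochs - 1, 1)
    return tau_start * (tau_end / tau_start) ** ratio

def spatial_ratio_at(epoch: int, warmup_epochs: int, target_rs: float) -> float:
    """Step the ratio of selected regions (RS) down from 1.0 to `target_rs`
    over the first `warmup_epochs` epochs; the linear step schedule is an
    assumption consistent with 'decrease from 1 ... by steps'."""
    if epoch >= warmup_epochs:
        return target_rs
    return 1.0 - (1.0 - target_rs) * epoch / warmup_epochs

# Example: over 50 epochs, τ shrinks from 1.0 toward 0.01 while RS reaches
# a target of 0.5 after a 10-epoch ramp (both hyperparameters hypothetical).
for epoch in range(50):
    tau = temperature_at(epoch, total_epochs=50)
    rs = spatial_ratio_at(epoch, warmup_epochs=10, target_rs=0.5)
```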