VA-RED$^2$: Video Adaptive Redundancy Reduction
Authors: Bowen Pan, Rameswar Panda, Camilo Luciano Fosco, Chung-Ching Lin, Alex J Andonian, Yue Meng, Kate Saenko, Aude Oliva, Rogerio Feris
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves 20%-40% reduction in computation (FLOPs) when compared to state-of-the-art methods without any performance loss. |
| Researcher Affiliation | Collaboration | MIT CSAIL, MIT-IBM Watson AI Lab, Microsoft, Boston University |
| Pseudocode | No | The paper describes the methods using equations and descriptive text, but does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Project page: http://people.csail.mit.edu/bpan/va-red/. |
| Open Datasets | Yes | We conduct our video action recognition experiments on three standard benchmarks: Mini-Kinetics-200, Kinetics-400, and Moments-In-Time. Mini-Kinetics-200 (assembled by (Meng et al., 2020)) is a subset of the full Kinetics dataset (Carreira & Zisserman, 2017) containing 121k videos for training and 10k videos for testing across 200 action classes. The Moments-In-Time dataset has 802,244 videos in training and 33,900 videos in validation across 339 categories. To show the generalization ability to a different task, we also conduct video spatio-temporal action localization on J-HMDB-21 (Jhuang et al., 2013). The original Kinetics dataset is publicly available to download at https://deepmind.com/research/open-source/kinetics. Moments-In-Time is publicly available to download at http://moments.csail.mit.edu/. J-HMDB-21 is available to download at http://jhmdb.is.tue.mpg.de/. |
| Dataset Splits | Yes | Mini-Kinetics-200 (assembled by (Meng et al., 2020)) is a subset of full Kinetics dataset (Carreira & Zisserman, 2017) containing 121k videos for training and 10k videos for testing across 200 action classes. Moments-In-Time dataset has 802,244 videos in training and 33,900 videos in validation across 339 categories. We use the official training/validation/testing splits of Kinetics-400 and the splits released by authors in (Meng et al., 2020) for Mini-Kinetics-200 in our experiments. |
| Hardware Specification | Yes | We create the environment with PyTorch 1.6, CUDA 11.0, and a single NVIDIA TITAN RTX (24GB) GPU as our testbed to measure speed of different models. We train most of our models on 96 NVIDIA Tesla V100-32GB GPUs and perform synchronized BN (Ioffe & Szegedy, 2015) across all the GPUs. |
| Software Dependencies | Yes | We create the environment with PyTorch 1.6, CUDA 11.0, and a single NVIDIA TITAN RTX (24GB) GPU as our testbed to measure speed of different models. |
| Experiment Setup | Yes | We train all our base and dynamic models for 120 epochs on mini-Kinetics-200, Kinetics-400, and 60 epochs on the Moments-In-Time dataset. We use a mini-batch size of 12 clips per GPU and adopt synchronized SGD with a cosine learning rate decaying strategy (Loshchilov & Hutter, 2016) to train all our models. Dynamic models are finetuned with the efficiency loss for 40/20 epochs to reduce the density of the inference graph while maintaining accuracy. During finetuning, we set λ_c to 0.8 and the learning rate to 0.01 for R(2+1)D and 0.1 for I3D and X3D. For R(2+1)D (Tran et al., 2018), the learning rate is initialized as 0.18 and the weight decay is set to 5 × 10^-4. For I3D (Carreira & Zisserman, 2017; Xie et al., 2018) and X3D (Feichtenhofer, 2020), the learning rates both start from 1.8 and the weight decay factors are 1 × 10^-4 and 5 × 10^-5 respectively. A cosine learning rate decaying strategy is applied to decrease the total learning rate. All of the models are trained from scratch and warmed up for 15 epochs on mini-Kinetics/Kinetics, 8 epochs on Moments-In-Time. We adopt the Nesterov momentum optimizer with an initial weight of 0.01 and a momentum of 0.9. During training, we follow the data augmentation (location jittering, horizontal flipping, corner cropping, and scale jittering) used in TSN (Wang et al., 2016) to augment the video with different sizes spatially and flip the video horizontally with 50% probability. (See the training-configuration sketch after this table.) |
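
The training recipe quoted in the Experiment Setup row (synchronized SGD with Nesterov momentum, a 15-epoch warm-up, and cosine learning rate decay over 120 epochs) can be summarized in a short PyTorch sketch. This is a minimal illustration using the R(2+1)D hyperparameters quoted above (base learning rate 0.18, weight decay 5 × 10^-4, momentum 0.9); `build_r2plus1d` and `train_one_epoch` are hypothetical placeholders, and the exact way warm-up and cosine decay are combined is an assumption rather than the authors' released schedule.

```python
import math

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted for R(2+1)D on mini-Kinetics/Kinetics.
# The warm-up/cosine combination below is an assumption, not the released code.
BASE_LR = 0.18
WEIGHT_DECAY = 5e-4
MOMENTUM = 0.9
WARMUP_EPOCHS = 15
TOTAL_EPOCHS = 120


def build_optimizer(model: torch.nn.Module) -> SGD:
    """SGD with Nesterov momentum, as described in the training setup."""
    return SGD(
        model.parameters(),
        lr=BASE_LR,
        momentum=MOMENTUM,
        weight_decay=WEIGHT_DECAY,
        nesterov=True,
    )


def warmup_cosine(epoch: int) -> float:
    """LR multiplier: linear warm-up for 15 epochs, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative usage (model constructor and training loop are hypothetical):
# model = build_r2plus1d()
# optimizer = build_optimizer(model)
# scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)
# for epoch in range(TOTAL_EPOCHS):
#     train_one_epoch(model, optimizer)  # mini-batch of 12 clips per GPU
#     scheduler.step()
```

For I3D and X3D the same structure would apply with a base learning rate of 1.8 and weight decays of 1 × 10^-4 and 5 × 10^-5 respectively, as quoted in the table.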