VA-RED$^2$: Video Adaptive Redundancy Reduction
Authors: Bowen Pan, Rameswar Panda, Camilo Luciano Fosco, Chung-Ching Lin, Alex J Andonian, Yue Meng, Kate Saenko, Aude Oliva, Rogerio Feris
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple video datasets and different visual tasks show that our framework achieves 20%-40% reduction in computation (FLOPs) when compared to state-of-the-art methods without any performance loss. |
| Researcher Affiliation | Collaboration | MIT CSAIL, MIT-IBM Watson AI Lab, Microsoft, Boston University |
| Pseudocode | No | The paper describes the methods using equations and descriptive text, but does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Project page: http://people.csail.mit.edu/bpan/va-red/. |
| Open Datasets | Yes | We conduct our video action recognition experiments on three standard benchmarks: Mini-Kinetics-200, Kinetics-400, and Moments-In-Time. Mini-Kinetics-200 (assembled by (Meng et al., 2020)) is a subset of the full Kinetics dataset (Carreira & Zisserman, 2017) containing 121k videos for training and 10k videos for testing across 200 action classes. The Moments-In-Time dataset has 802,244 videos in training and 33,900 videos in validation across 339 categories. To show the generalization ability to a different task, we also conduct video spatio-temporal action localization on J-HMDB-21 (Jhuang et al., 2013). The original Kinetics dataset is publicly available to download at https://deepmind.com/research/open-source/kinetics. Moments-In-Time is publicly available to download at http://moments.csail.mit.edu/. J-HMDB-21 is available to download at http://jhmdb.is.tue.mpg.de/. |
| Dataset Splits | Yes | Mini-Kinetics-200 (assembled by (Meng et al., 2020)) is a subset of full Kinetics dataset (Carreira & Zisserman, 2017) containing 121k videos for training and 10k videos for testing across 200 action classes. Moments-In-Time dataset has 802,244 videos in training and 33,900 videos in validation across 339 categories. We use the official training/validation/testing splits of Kinetics-400 and the splits released by authors in (Meng et al., 2020) for Mini-Kinetics-200 in our experiments. |
| Hardware Specification | Yes | We create the environment with PyTorch 1.6, CUDA 11.0, and a single NVIDIA TITAN RTX (24GB) GPU as our testbed to measure speed of different models. We train most of our models on 96 NVIDIA Tesla V100-32GB GPUs and perform synchronized BN (Ioffe & Szegedy, 2015) across all the GPUs. |
| Software Dependencies | Yes | We create the environment with PyTorch 1.6, CUDA 11.0, and a single NVIDIA TITAN RTX (24GB) GPU as our testbed to measure speed of different models. |
| Experiment Setup | Yes | We train all our base and dynamic models for 120 epochs on mini-Kinetics-200, Kinetics-400, and 60 epochs on the Moments-In-Time dataset. We use a mini-batch size of 12 clips per GPU and adopt synchronized SGD with a cosine learning rate decaying strategy (Loshchilov & Hutter, 2016) to train all our models. Dynamic models are finetuned with the efficiency loss for 40/20 epochs to reduce the density of the inference graph while maintaining accuracy. During finetuning, we set λ_c to 0.8 and the learning rate to 0.01 for R(2+1)D and 0.1 for I3D and X3D. For R(2+1)D (Tran et al., 2018), the learning rate is initialized as 0.18 and the weight decay is set to 5 × 10^-4. For I3D (Carreira & Zisserman, 2017; Xie et al., 2018) and X3D (Feichtenhofer, 2020), the learning rates both start from 1.8 and the weight decay factors are 1 × 10^-4 and 5 × 10^-5 respectively. A cosine learning rate decaying strategy is applied to decrease the total learning rate. All of the models are trained from scratch and warmed up for 15 epochs on mini-Kinetics/Kinetics, 8 epochs on Moments-In-Time. We adopt the Nesterov momentum optimizer with an initial weight of 0.01 and a momentum of 0.9. During training, we follow the data augmentation (location jittering, horizontal flipping, corner cropping, and scale jittering) used in TSN (Wang et al., 2016) to augment the video with different sizes spatially and flip the video horizontally with 50% probability. (See the training-configuration sketch after this table.) |
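
The training recipe quoted in the Experiment Setup row (synchronized SGD with Nesterov momentum, a 15-epoch warm-up, and cosine learning rate decay over 120 epochs) can be summarized in a short PyTorch sketch. This is a minimal illustration using the R(2+1)D hyperparameters quoted above (base learning rate 0.18, weight decay 5 × 10^-4, momentum 0.9); `build_r2plus1d` and `train_one_epoch` are hypothetical placeholders, and the exact way warm-up and cosine decay are combined is an assumption rather than the authors' released schedule.

```python
import math

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted for R(2+1)D on mini-Kinetics/Kinetics.
# The warm-up/cosine combination below is an assumption, not the released code.
BASE_LR = 0.18
WEIGHT_DECAY = 5e-4
MOMENTUM = 0.9
WARMUP_EPOCHS = 15
TOTAL_EPOCHS = 120


def build_optimizer(model: torch.nn.Module) -> SGD:
    """SGD with Nesterov momentum, as described in the training setup."""
    return SGD(
        model.parameters(),
        lr=BASE_LR,
        momentum=MOMENTUM,
        weight_decay=WEIGHT_DECAY,
        nesterov=True,
    )


def warmup_cosine(epoch: int) -> float:
    """LR multiplier: linear warm-up for 15 epochs, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative usage (model constructor and training loop are hypothetical):
# model = build_r2plus1d()
# optimizer = build_optimizer(model)
# scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)
# for epoch in range(TOTAL_EPOCHS):
#     train_one_epoch(model, optimizer)  # mini-batch of 12 clips per GPU
#     scheduler.step()
```

For I3D and X3D the same structure would apply with a base learning rate of 1.8 and weight decays of 1 × 10^-4 and 5 × 10^-5 respectively, as quoted in the table.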