AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition

Authors: Yue Meng, Rameswar Panda, Chung-Ching Lin, Prasanna Sattigeri, Leonid Karlinsky, Kate Saenko, Aude Oliva, Rogerio Feris

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods."
Researcher Affiliation | Collaboration | ¹Massachusetts Institute of Technology, ²MIT-IBM Watson AI Lab, ³IBM Research, ⁴Microsoft, ⁵Boston University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. (A hedged sketch of the fusion mechanism is given below the table.)
Open Source Code | Yes | "The project page can be found at https://mengyuest.github.io/AdaFuse/"
Open Datasets | Yes | "We evaluate AdaFuse on Something-Something V1 (Goyal et al., 2017) & V2 (Mahdisoltani et al., 2018), Jester (Materzynska et al., 2019) and a subset of Kinetics (Kay et al., 2017)."
Dataset Splits | Yes | "Jester (Materzynska et al., 2019) has 27 annotated classes for hand gestures, with 119k / 15k videos in training / validation set. Mini-Kinetics (assembled by Meng et al. (2020)) is a subset of the full Kinetics dataset (Kay et al., 2017) containing 121k videos for training and 10k videos for testing across 200 action classes."
Hardware Specification | Yes | "…where each experiment takes 12–24 hours on 4 Tesla V100 GPUs."
Software Dependencies | No | The paper mentions general methods such as back-propagation and the Gumbel-Softmax estimator but does not specify any software dependencies with version numbers. (A straight-through Gumbel-Softmax sketch follows below the table.)
Experiment Setup | Yes | "We uniformly sample T = 8 frames from each video. The input dimension for the network is 224 × 224. Random scaling and cropping are used as data augmentation during training (and we further adopt random flipping for Mini-Kinetics). Center cropping is used during inference. All our networks use ImageNet pretrained weights. We follow a step-wise learning rate scheduler with the initial learning rate as 0.002 and decay by 0.1 at epochs 20 & 40. To train our adaptive temporal fusion approach, we set the efficiency term λ = 0.1. We train all the models for 50 epochs with a batch-size of 64." (The quoted values are wired into a training-loop sketch below the table.)
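
Although the paper itself provides no pseudocode (see the Pseudocode row), the adaptive temporal fusion it proposes is easy to summarize. Below is a minimal, hypothetical PyTorch sketch (module, layer, and variable names are ours, not from the official AdaFuse release): a lightweight policy head pools the current and previous frame features and emits a three-way, per-channel decision (keep the freshly computed feature, reuse the previous frame's feature, or skip the channel), sampled with the straight-through Gumbel-Softmax so the discrete choice remains trainable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalFusion(nn.Module):
    """Hypothetical sketch of AdaFuse-style per-channel fusion.

    For each channel the policy picks one of three actions:
      0: keep  - use the freshly computed feature of the current frame,
      1: reuse - copy the feature from the previous frame (saves compute),
      2: skip  - zero out the channel entirely.
    """

    def __init__(self, channels: int, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        # Policy head: pooled current + previous features -> 3 logits/channel.
        self.policy = nn.Linear(2 * channels, 3 * channels)

    def forward(self, curr: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # curr, prev: (batch, channels, H, W) features of adjacent frames.
        b, c = curr.shape[:2]
        ctx = torch.cat([curr.mean(dim=(2, 3)), prev.mean(dim=(2, 3))], dim=1)
        logits = self.policy(ctx).view(b, c, 3)
        # One-hot decisions in the forward pass, soft gradients in the
        # backward pass (see the Gumbel-Softmax sketch below).
        mask = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # (b, c, 3)
        keep = mask[..., 0].view(b, c, 1, 1)
        reuse = mask[..., 1].view(b, c, 1, 1)
        # mask[..., 2] is "skip": it simply contributes zeros.
        return keep * curr + reuse * prev
```

The fraction of "keep" decisions is what an efficiency term in the loss (weighted by λ = 0.1 in the quoted setup) would penalize, pushing the policy toward the cheaper reuse and skip actions.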
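
The Gumbel-Softmax estimator mentioned in the Software Dependencies row is what makes those discrete decisions compatible with back-propagation. PyTorch already ships it as `torch.nn.functional.gumbel_softmax`; the version below is written out by hand purely for clarity and is the standard straight-through formulation, not code from the paper.

```python
import torch

def gumbel_softmax_st(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-Softmax over the last dimension."""
    # Sample Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    u = torch.rand_like(logits)
    gumbels = -torch.log(-torch.log(u + 1e-10) + 1e-10)
    # Relaxed (soft) sample; tau controls how close it is to one-hot.
    soft = torch.softmax((logits + gumbels) / tau, dim=-1)
    # Hard one-hot sample taken at the argmax of the relaxed sample.
    hard = torch.zeros_like(soft).scatter_(-1, soft.argmax(-1, keepdim=True), 1.0)
    # Straight-through trick: hard values forward, soft gradients backward.
    return hard + soft - soft.detach()
```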
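
Finally, the Experiment Setup row pins down enough hyperparameters to wire up the training loop. The sketch below uses the quoted values (T = 8, 224 × 224 inputs, initial learning rate 0.002 decayed by 0.1 at epochs 20 and 40, λ = 0.1, 50 epochs, batch size 64); the SGD optimizer, its momentum value, and the tiny stand-in model are assumptions for illustration, since the quote does not name them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Values quoted in the Experiment Setup row.
T, SIZE = 8, 224                  # frames per clip, input resolution
LR, MILESTONES, GAMMA = 0.002, [20, 40], 0.1
LAMBDA_EFF = 0.1                  # weight of the efficiency loss term
EPOCHS, BATCH = 50, 64

# Stand-in model so the sketch runs end to end; the real model is an
# ImageNet-pretrained backbone with adaptive fusion modules, fed clips
# of shape (BATCH, T, 3, SIZE, SIZE). 174 = Something-Something classes.
FEAT = 256
model = nn.Linear(FEAT, 174)

optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=MILESTONES, gamma=GAMMA)  # step-wise decay by 0.1

for epoch in range(EPOCHS):
    # One synthetic batch per epoch stands in for the real data loader.
    feats = torch.randn(BATCH, FEAT)
    labels = torch.randint(0, 174, (BATCH,))
    usage = torch.rand(())  # placeholder: fraction of recomputed channels
    loss = F.cross_entropy(model(feats), labels) + LAMBDA_EFF * usage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```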