Can an Image Classifier Suffice For Action Recognition?

Authors: Quanfu Fan, Chun-Fu Chen, Rameswar Panda

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach rearranges input video frames into super images, which allow for training an image classifier directly to fulfill the task of action recognition, in exactly the same way as image classification. With such a simple idea, we show that transformer-based image classifiers alone can suffice for action recognition. In particular, our approach demonstrates strong and promising performance against SOTA methods on several public datasets including Kinetics400, Moments in Time, Something-Something V2 (SSV2), Jester and Diving48. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. (A sketch of the super-image construction appears after this table.)
Researcher Affiliation | Collaboration | Quanfu Fan, Chun-Fu (Richard) Chen, Rameswar Panda; MIT-IBM Watson AI Lab
Pseudocode | No | The paper provides a single-line PyTorch code snippet to illustrate an implementation detail, but this does not constitute a structured pseudocode or algorithm block.
Open Source Code | Yes | Our source codes and models are available at https://github.com/IBM/sifar-pytorch.
Open Datasets | Yes | We use Kinetics400 (K400) (Kay et al., 2017), Something-Something V2 (SSV2) (Goyal et al., 2017), Moments in Time (MiT) (Monfort et al., 2019), Jester (Materzynska et al., 2019), and Diving48 (Li et al., 2018) datasets in our evaluation.
Dataset Splits | Yes | Kinetics400... includes 240k training videos and 20k validation videos in 400 classes. MiT... around 800k training videos and 33,900 validation videos... Jester... 118,562 and 14,787 training and validation videos... Diving48... 15,943 training videos and 2,096 validation videos over 48 action classes.
Hardware Specification | Yes | All our models were trained using V100 GPUs with 16GB or 32GB memory.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or any other software dependencies with their respective versions.
Experiment Setup | Yes | We apply multi-scale jitter to augment the input... We then use Mixup (Zhang et al., 2018) and CutMix (Yun et al., 2019) to augment the data further, with their values set to 0.8 and 1.0, respectively. ... we apply drop path (Tan & Le, 2019) with a rate of 0.1, and enable label smoothing (Szegedy et al., 2016) at a rate of 0.1. ... we use a batch size of 96, 144 or 192 to train the model for 15 epochs on MiT or 30 epochs on other datasets, including 5 warm-up epochs. The optimizer used in our training is AdamW (Loshchilov & Hutter, 2019) with a weight decay of 0.05, and the scheduler is Cosine (Loshchilov & Hutter, 2017) with a base linear learning rate of 0.0001. (See the training-configuration sketch after this table.)
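
The super-image idea quoted in the Research Type row is straightforward to sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the function name, the zero-padding of unused grid cells, and the 3x3 grid for 8 frames are assumptions based on the abstract's description.

```python
import torch

def to_super_image(video: torch.Tensor, grid: int = 3) -> torch.Tensor:
    """Tile T sampled frames into a (grid x grid) super image.

    video: (B, T, C, H, W) with T <= grid * grid. Unused grid cells are
    zero-padded, so the output is a regular (B, C, grid*H, grid*W) image
    that any off-the-shelf image classifier can consume.
    """
    b, t, c, h, w = video.shape
    cells = grid * grid
    if t < cells:  # pad with blank frames to fill the grid (an assumption)
        pad = video.new_zeros(b, cells - t, c, h, w)
        video = torch.cat([video, pad], dim=1)
    # (B, grid, grid, C, H, W) -> (B, C, grid*H, grid*W)
    video = video.view(b, grid, grid, c, h, w)
    video = video.permute(0, 3, 1, 4, 2, 5).contiguous()
    return video.view(b, c, grid * h, grid * w)

# Usage: 8 frames of a 224x224 clip become one 672x672 super image.
frames = torch.randn(2, 8, 3, 224, 224)
assert to_super_image(frames).shape == (2, 3, 672, 672)
```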
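
Similarly, the hyperparameters quoted in the Experiment Setup row map onto standard PyTorch/timm components. The sketch below is a hedged reconstruction, not the authors' training script; the placeholder model, the 400-class output (Kinetics400), and the warm-up floor learning rate are illustrative assumptions.

```python
# Sketch of the quoted configuration: Mixup 0.8, CutMix 1.0, label
# smoothing 0.1, AdamW with weight decay 0.05, cosine schedule with
# 5 warm-up epochs. Exact argument names follow current timm releases.
import torch
import torch.nn as nn
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from timm.scheduler import CosineLRScheduler

model = nn.Linear(512, 400)  # placeholder for the super-image classifier

# Mixup and CutMix with the reported strengths; label smoothing 0.1.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=400)
criterion = SoftTargetCrossEntropy()  # Mixup produces soft targets

# AdamW, weight decay 0.05, base learning rate 1e-4 (the paper scales the
# base rate linearly with batch size; the scaling rule is not quoted here).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

# Cosine decay over 30 epochs (15 on MiT) with 5 warm-up epochs; the
# warm-up starting rate of 1e-6 is an assumption.
scheduler = CosineLRScheduler(optimizer, t_initial=30,
                              warmup_t=5, warmup_lr_init=1e-6)

for epoch in range(30):
    scheduler.step(epoch)  # timm schedulers step on the epoch index
    # ... training loop: x, y = mixup_fn(x, y); loss = criterion(model(x), y)
```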