Don't Judge by the Look: Towards Motion Coherent Video Representation

Authors: Yitian Zhang, Yue Bai, Huan Wang, Yizhou Wang, Yun Fu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods.
Researcher Affiliation | Academia | Yitian Zhang (1), Yue Bai (1), Huan Wang (1), Yizhou Wang (1), Yun Fu (1,2); (1) Department of Electrical and Computer Engineering, Northeastern University; (2) Khoury College of Computer Science, Northeastern University
Pseudocode | No | The paper does not include a section or figure explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Code is available at https://github.com/BeSpontaneous/MCA-pytorch.
Open Datasets | Yes | Datasets. We validate our method on five video benchmarks: (1) Something-Something V1 & V2 Goyal et al. (2017)... (2) UCF101 Soomro et al. (2012)... (3) HMDB51 Kuehne et al. (2011)... (4) Kinetics400 Kay et al. (2017)...
Dataset Splits | Yes | TSM Lin et al. (2019) demonstrates a substantial disparity between its training (78.46%) and validation accuracy (45.63%) on the Something-Something V1 Goyal et al. (2017) dataset.
Hardware Specification | Yes | All models are trained on NVIDIA Tesla V100 GPUs with the same training hyperparameters as the official implementations.
Software Dependencies | No | Despite its effectiveness in video understanding, the current implementation of Hue Jittering Paszke et al. (2019) still suffers from inefficiency because of the transformation between RGB and HSV space. (See the hue-jittering sketch below.)
Experiment Setup | Yes | Implementation details. We sample 8 frames uniformly for all methods except for SlowFast Feichtenhofer et al. (2019), which samples 32 frames for the fast pathway. During training, we crop the training data randomly to 224x224, and we abstain from applying random flipping to the Something-Something datasets. In the inference phase, frames are center-cropped to 224x224, except for SlowFast, which is cropped to 256x256. We adopt one-crop one-clip per video during evaluation for efficiency unless specified. More implementation details can be found in the appendix. (See the preprocessing sketch below.)
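
The Software Dependencies row quotes the paper's observation that standard hue jittering is slow because it round-trips every frame through HSV space. The sketch below is not the authors' MCA code; it only illustrates, under the assumption of a torchvision-based pipeline, how the conventional hue jittering the paper criticizes is typically applied to a clip tensor. The clip shape and jitter strength are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of conventional hue jittering (the baseline the paper calls
# inefficient), NOT the paper's MCA. torchvision's ColorJitter converts
# RGB -> HSV, shifts the hue channel, then converts back to RGB per call.
import torch
from torchvision.transforms import ColorJitter

hue_jitter = ColorJitter(hue=0.5)      # hue shift sampled uniformly from [-0.5, 0.5]

clip = torch.rand(8, 3, 224, 224)      # hypothetical clip: 8 RGB frames in [0, 1]
jittered = hue_jitter(clip)            # same sampled hue shift applied to all 8 frames
```

Because the transform accepts a batched (T, C, H, W) tensor, a single sampled hue factor is applied consistently across the frames of the clip, which is the usual choice for video augmentation.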
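The Experiment Setup row describes the preprocessing configuration in prose. The following sketch is an assumption-based reconstruction of that configuration (uniform 8-frame sampling, random 224x224 training crop with flipping disabled on Something-Something, center crop at inference), not the official implementation; the helper names, the 64-frame source video, and the decoded frame size are hypothetical.

```python
# Sketch of the described preprocessing, assuming a torchvision pipeline.
# Not the authors' code: helper names and input sizes are illustrative.
import torch
from torchvision import transforms

def uniform_sample_indices(num_frames, num_sample=8):
    # Evenly spaced frame indices: one index per equal-length segment.
    stride = num_frames / num_sample
    return [int(stride * i + stride / 2) for i in range(num_sample)]

def build_transform(is_train, allow_flip=True, crop_size=224):
    # Training: random crop (random flip only when the dataset allows it,
    # i.e. not Something-Something). Inference: center crop
    # (the paper uses 256x256 here for SlowFast).
    ops = [transforms.RandomCrop(crop_size)] if is_train else [transforms.CenterCrop(crop_size)]
    if is_train and allow_flip:
        ops.append(transforms.RandomHorizontalFlip())
    return transforms.Compose(ops)

# Usage: 8 uniformly sampled frames from a hypothetical 64-frame video,
# trained on Something-Something (flipping disabled).
indices = uniform_sample_indices(64, num_sample=8)
frames = torch.rand(len(indices), 3, 256, 340)   # hypothetical decoded frames (T, C, H, W)
train_tf = build_transform(is_train=True, allow_flip=False)
clip = train_tf(frames)                          # same random crop for all frames -> (8, 3, 224, 224)
```

Passing the whole (T, C, H, W) tensor through the transform applies one sampled crop window to every frame, keeping the clip spatially consistent, which matches standard practice for the single-crop, single-clip evaluation the paper reports.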