Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

Authors: Jinpeng Wang, Yuting Gao, Ke Li, Jianguo Hu, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, Xing Sun

AAAI 2021, pp. 10129-10137 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with a remarkable 8.1% and 8.8% improvement on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone.
Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, Guangzhou, China; 2 Tencent Youtu Lab, Shanghai, China; 3 Xiamen University, Xiamen, China
Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block. The method is described in narrative text and illustrated with a diagram (Figure 2).
Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link or an explicit statement about code release for the methodology described.
Open Datasets | Yes | All the experiments are conducted on three video classification benchmarks, UCF101, HMDB51 and Kinetics (Kay et al. 2017). UCF101 consists of 13,320 manually labeled videos in 101 action categories and HMDB51 comprises 6,766 manually labeled clips in 51 categories, both of which are divided into three train/test splits.
Dataset Splits | Yes | HMDB51 comprises 6,766 manually labeled clips in 51 categories, both of which are divided into three train/test splits. Kinetics is a large-scale action recognition dataset that contains 246k/20k train/val video clips of 400 classes.
Hardware Specification | Yes | All the experiments are conducted on 16 Tesla V100 GPUs with a batch size of 128.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), needed to replicate the experiment.
Experiment Setup | Yes | All the experiments are conducted on 16 Tesla V100 GPUs with a batch size of 128. For each video clip, we uniformly sample 16 frames with a temporal stride of 4 and then resize the sampled clip to 16 × 3 × 224 × 224. The margin of the triplet loss is set to 0.5 and the smoothing coefficient m of the momentum encoder in contrastive representation learning is set to 0.99, following MoCo (He et al. 2020).
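
The reported hyper-parameters (16-frame clips with stride 4 resized to 16 × 3 × 224 × 224, triplet margin 0.5, MoCo-style momentum 0.99) can be sketched as follows. This is a minimal illustration assuming a PyTorch-style setup; the names `sample_clip`, `momentum_update`, `encoder_q`, and `encoder_k` are hypothetical and not taken from the paper.

```python
# Sketch of the reported pre-training hyper-parameters; not the authors' code.
import torch
import torch.nn as nn

MARGIN = 0.5                    # triplet-loss margin reported in the paper
MOMENTUM = 0.99                 # smoothing coefficient m of the momentum encoder (as in MoCo)
CLIP_SHAPE = (16, 3, 224, 224)  # 16 frames x 3 channels x 224 x 224 after resizing

# Triplet loss with the reported margin.
triplet_loss = nn.TripletMarginLoss(margin=MARGIN)

def sample_clip(video: torch.Tensor, num_frames: int = 16, stride: int = 4) -> torch.Tensor:
    """Sample `num_frames` frames with a fixed temporal stride from a (T, C, H, W) video.
    The modulo wrap-around for short videos is an assumption for illustration."""
    idx = torch.arange(0, num_frames * stride, stride) % video.shape[0]
    return video[idx]

@torch.no_grad()
def momentum_update(encoder_q: nn.Module, encoder_k: nn.Module, m: float = MOMENTUM) -> None:
    """MoCo-style exponential moving average: the key encoder tracks the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```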