Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

Authors: Jinpeng Wang, Yuting Gao, Ke Li, Jianguo Hu, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, Xing Sun

AAAI 2021, pp. 10129-10137 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with a remarkable 8.1% and 8.8% improvement on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone.
Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, Guangzhou, China; 2 Tencent Youtu Lab, Shanghai, China; 3 Xiamen University, Xiamen, China
Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block. The method is described in narrative text and illustrated with a diagram (Figure 2).
Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link or an explicit statement about code release for the methodology described.
Open Datasets | Yes | All the experiments are conducted on three video classification benchmarks, UCF101, HMDB51 and Kinetics (Kay et al. 2017). UCF101 consists of 13,320 manually labeled videos in 101 action categories and HMDB51 comprises 6,766 manually labeled clips in 51 categories, both of which are divided into three train/test splits.
Dataset Splits | Yes | HMDB51 comprises 6,766 manually labeled clips in 51 categories, both of which are divided into three train/test splits. Kinetics is a large-scale action recognition dataset that contains 246k/20k train/val video clips of 400 classes.
Hardware Specification | Yes | All the experiments are conducted on 16 Tesla V100 GPUs with a batch size of 128.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), needed to replicate the experiment.
Experiment Setup | Yes | All the experiments are conducted on 16 Tesla V100 GPUs with a batch size of 128. For each video clip, we uniformly sample 16 frames with a temporal stride of 4 and then resize the sampled clip to 16 × 3 × 224 × 224. The margin of the triplet loss is set to 0.5 and the smoothing coefficient m of the momentum encoder in contrastive representation learning is set to 0.99, following MoCo (He et al. 2020).
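
The reported hyper-parameters (16-frame clips with stride 4 resized to 16 × 3 × 224 × 224, triplet margin 0.5, MoCo-style momentum 0.99) can be sketched as follows. This is a minimal illustration assuming a PyTorch-style setup; the names `sample_clip`, `momentum_update`, `encoder_q`, and `encoder_k` are hypothetical and not taken from the paper.

```python
# Sketch of the reported pre-training hyper-parameters; not the authors' code.
import torch
import torch.nn as nn

MARGIN = 0.5                    # triplet-loss margin reported in the paper
MOMENTUM = 0.99                 # smoothing coefficient m of the momentum encoder (as in MoCo)
CLIP_SHAPE = (16, 3, 224, 224)  # 16 frames x 3 channels x 224 x 224 after resizing

# Triplet loss with the reported margin.
triplet_loss = nn.TripletMarginLoss(margin=MARGIN)

def sample_clip(video: torch.Tensor, num_frames: int = 16, stride: int = 4) -> torch.Tensor:
    """Sample `num_frames` frames with a fixed temporal stride from a (T, C, H, W) video.
    The modulo wrap-around for short videos is an assumption for illustration."""
    idx = torch.arange(0, num_frames * stride, stride) % video.shape[0]
    return video[idx]

@torch.no_grad()
def momentum_update(encoder_q: nn.Module, encoder_k: nn.Module, m: float = MOMENTUM) -> None:
    """MoCo-style exponential moving average: the key encoder tracks the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```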