Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw
Authors: Yuqi Huo, Mingyu Ding, Haoyu Lu, Ziyuan Huang, Mingqian Tang, Zhiwu Lu, Tao Xiang
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our CSJ achieves state-of-the-art on various benchmarks. |
| Researcher Affiliation | Collaboration | 1School of Information, Renmin University of China, Beijing, China 2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China 3The University of Hong Kong, Pokfulam, Hong Kong, China 4National University of Singapore, Singapore 5Alibaba Group, Hangzhou, China 6University of Surrey, Surrey, UK |
| Pseudocode | No | No pseudocode or algorithm blocks are present. |
| Open Source Code | No | No mention of open-source code for this paper's method or a link to a repository. |
| Open Datasets | Yes | We select three benchmark datasets for performance evaluation: UCF101 [Soomro et al., 2012], HMDB51 [Kuehne et al., 2011], and Kinetics-400 (K400) [Kay et al., 2017] |
| Dataset Splits | Yes | In the self-supervised pre-training stage, we utilize the first training split of UCF101/HMDB51 and the training split of K400 without using their labels. As in [Han et al., 2020a], we adopt R2D3D as the backbone network, which is modified from R3D [Hara et al., 2018] with fewer parameters. By fine-tuning the pre-trained model, we can evaluate the SSL performance on a downstream task (i.e., action classification). Following [Han et al., 2019; He et al., 2020], two evaluation protocols are used: comparisons against the state of the art follow the more popular fully fine-tuning evaluation protocol, while the ablation analysis uses both the linear evaluation and fully fine-tuning protocols. For the experiments on supervised learning, we report top-1 accuracy on the first test split of UCF101/HMDB51, as is standard [Han et al., 2020a]. (An illustrative sketch of the two evaluation protocols follows the table.) |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory) are mentioned. |
| Software Dependencies | No | No specific software names with version numbers are mentioned (e.g., PyTorch version, TensorFlow version, specific library versions). |
| Experiment Setup | Yes | Raw videos in these datasets are decoded at a frame rate of 24-30 fps. From each raw video, we start from a randomly selected frame index and sample a consecutive 16-frame video clip with a temporal stride of 4. For data augmentation, we first resize the video frames to 128 × 171 pixels, from which we extract random crops of size 112 × 112 pixels. We also apply random horizontal flipping to the video frames during training. Random color jittering is utilized to avoid learning shortcuts. We exploit only the raw RGB video frames as input, and do not leverage optical flow or other auxiliary signals for self-supervised pre-training. (...) where sim(·, ·) is defined by the dot product f(x)·f(x̃ᵢ), and τ is the temperature hyper-parameter. (...) σ_g is the hyper-parameter, which is set to 1 empirically. In the training stage, FPN [Lin et al., 2017] is used for multi-level feature fusion. (...) We deploy the adaptive weighting mechanism [Kendall et al., 2018] to weight these tasks, resulting in no free hyper-parameters to tune. We also adopt curriculum learning [Bengio et al., 2009] to train our network by shuffling clips from easy to hard. (Illustrative sketches of the sampling/augmentation pipeline and the temperature-scaled similarity follow the table.) |
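
The Experiment Setup row quotes the paper's clip sampling and augmentation recipe (16 frames at temporal stride 4, resize to 128 × 171, random 112 × 112 crops, horizontal flip, color jitter). The sketch below is one possible PyTorch reading of that recipe, not the authors' released code: the color-jitter strengths and the helper names `sample_clip` and `augment` are assumptions, and the stride is interpreted as one sampled frame every 4 decoded frames.

```python
import random

import torch
from torchvision import transforms

CLIP_LEN = 16   # consecutive frames per clip (from the quoted setup)
STRIDE = 4      # temporal stride between sampled frames (from the quoted setup)

def sample_clip(frames: torch.Tensor) -> torch.Tensor:
    """Sample a 16-frame clip with temporal stride 4 from a random start index.

    `frames` is assumed to be a decoded video tensor of shape (T, C, H, W).
    """
    span = (CLIP_LEN - 1) * STRIDE + 1
    start = random.randint(0, max(frames.shape[0] - span, 0))
    idx = torch.arange(start, start + span, STRIDE)
    idx = idx.clamp(max=frames.shape[0] - 1)  # guard against very short videos
    return frames[idx]

# Resize to 128x171, random-crop 112x112, random horizontal flip and color
# jitter, mirroring the augmentations quoted above. The jitter strengths are
# assumed values, not taken from the paper.
augment = transforms.Compose([
    transforms.Resize((128, 171)),
    transforms.RandomCrop(112),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
])

# Usage: clip = augment(sample_clip(decoded_frames))  # -> (16, C, 112, 112)
```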
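The same row quotes a temperature-scaled dot-product similarity, sim(·, ·) = f(x)·f(x̃ᵢ) with temperature τ. Below is a minimal, generic InfoNCE-style sketch of such a term, assuming f already maps each clip to an embedding vector; the default τ = 0.07 and the function name `contrastive_loss` are assumptions rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, candidates: torch.Tensor,
                     positive_index: int, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style term with dot-product similarity scaled by temperature tau.

    anchor:     (D,)   embedding f(x)
    candidates: (N, D) embeddings f(x_i), one of which is the positive
    """
    logits = candidates @ anchor / tau       # sim(x, x_i) = f(x) . f(x_i), scaled by tau
    target = torch.tensor([positive_index])  # index of the positive sample
    return F.cross_entropy(logits.unsqueeze(0), target)
```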
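The Dataset Splits row mentions two downstream protocols: linear evaluation and fully fine-tuning. The hypothetical sketch below shows the usual distinction between them (freezing the pre-trained backbone versus leaving all parameters trainable); `backbone`, `feat_dim`, and `num_classes` are placeholders, not names from the paper.

```python
import torch.nn as nn

def build_classifier(backbone: nn.Module, feat_dim: int, num_classes: int,
                     linear_eval: bool) -> nn.Module:
    """Attach a linear head to a pre-trained backbone.

    linear_eval=True freezes the backbone (linear evaluation protocol);
    linear_eval=False leaves everything trainable (fully fine-tuning protocol).
    The backbone is assumed to return a (batch, feat_dim) feature tensor.
    """
    if linear_eval:
        for p in backbone.parameters():
            p.requires_grad = False   # only the new linear classifier is trained
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))
```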