Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw
Authors: Yuqi Huo, Mingyu Ding, Haoyu Lu, Ziyuan Huang, Mingqian Tang, Zhiwu Lu, Tao Xiang
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our CSJ achieves state-of-the-art on various benchmarks. |
| Researcher Affiliation | Collaboration | 1School of Information, Renmin University of China, Beijing, China 2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China 3The University of Hong Kong, Pokfulam, Hong Kong, China 4National University of Singapore, Singapore 5Alibaba Group, Hangzhou, China 6University of Surrey, Surrey, UK |
| Pseudocode | No | No pseudocode or algorithm blocks are present. |
| Open Source Code | No | No mention of open-source code for this paper's method or a link to a repository. |
| Open Datasets | Yes | We select three benchmark datasets for performance evaluation: UCF101 [Soomro et al., 2012], HMDB51 [Kuehne et al., 2011], and Kinetics-400 (K400) [Kay et al., 2017] |
| Dataset Splits | Yes | In the self-supervised pre-training stage, we utilize the first training split of UCF101/HMDB51 and the training split of K400 without using their labels. As in [Han et al., 2020a], we adopt R2D3D as the backbone network, which is modified from R3D [Hara et al., 2018] with fewer parameters. By fine-tuning the pre-trained model, we can evaluate the SSL performance on a downstream task (i.e., action classification). Following [Han et al., 2019; He et al., 2020], two evaluation protocols are used: comparisons against the state of the art follow the more popular fully fine-tuning evaluation protocol, while the ablation analysis uses both the linear evaluation and fully fine-tuning protocols. For the experiments on supervised learning, we report top-1 accuracy on the first test split of UCF101/HMDB51, as is standard [Han et al., 2020a]. (An illustrative sketch of the two evaluation protocols follows the table.) |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory) are mentioned. |
| Software Dependencies | No | No specific software names with version numbers are mentioned (e.g., PyTorch version, TensorFlow version, specific library versions). |
| Experiment Setup | Yes | Raw videos in these datasets are decoded at a frame rate of 24-30 fps. From each raw video, we start from a randomly selected frame index and sample a consecutive 16-frame video clip with a temporal stride of 4. For data augmentation, we first resize the video frames to 128 × 171 pixels, from which we extract random crops of size 112 × 112 pixels. We also apply random horizontal flipping to the video frames during training. Random color jittering is utilized to avoid learning shortcuts. We exploit only the raw RGB video frames as input, and do not leverage optical flow or other auxiliary signals for self-supervised pre-training. (...) where sim(·, ·) is defined by the dot product f(x)·f(x̃ᵢ), and τ is the temperature hyper-parameter. (...) σ_g is the hyper-parameter, which is set to 1 empirically. In the training stage, FPN [Lin et al., 2017] is used for multi-level feature fusion. (...) We deploy the adaptive weighting mechanism [Kendall et al., 2018] to weight these tasks, resulting in no free hyper-parameters to tune. We also adopt curriculum learning [Bengio et al., 2009] to train our network by shuffling clips from easy to hard. (Illustrative sketches of the sampling/augmentation pipeline and the temperature-scaled similarity follow the table.) |
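
The Experiment Setup row quotes the paper's clip sampling and augmentation recipe (16 frames at temporal stride 4, resize to 128 × 171, random 112 × 112 crops, horizontal flip, color jitter). The sketch below is one possible PyTorch reading of that recipe, not the authors' released code: the color-jitter strengths and the helper names `sample_clip` and `augment` are assumptions, and the stride is interpreted as one sampled frame every 4 decoded frames.

```python
import random

import torch
from torchvision import transforms

CLIP_LEN = 16   # consecutive frames per clip (from the quoted setup)
STRIDE = 4      # temporal stride between sampled frames (from the quoted setup)

def sample_clip(frames: torch.Tensor) -> torch.Tensor:
    """Sample a 16-frame clip with temporal stride 4 from a random start index.

    `frames` is assumed to be a decoded video tensor of shape (T, C, H, W).
    """
    span = (CLIP_LEN - 1) * STRIDE + 1
    start = random.randint(0, max(frames.shape[0] - span, 0))
    idx = torch.arange(start, start + span, STRIDE)
    idx = idx.clamp(max=frames.shape[0] - 1)  # guard against very short videos
    return frames[idx]

# Resize to 128x171, random-crop 112x112, random horizontal flip and color
# jitter, mirroring the augmentations quoted above. The jitter strengths are
# assumed values, not taken from the paper.
augment = transforms.Compose([
    transforms.Resize((128, 171)),
    transforms.RandomCrop(112),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
])

# Usage: clip = augment(sample_clip(decoded_frames))  # -> (16, C, 112, 112)
```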
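The same row quotes a temperature-scaled dot-product similarity, sim(·, ·) = f(x)·f(x̃ᵢ) with temperature τ. Below is a minimal, generic InfoNCE-style sketch of such a term, assuming f already maps each clip to an embedding vector; the default τ = 0.07 and the function name `contrastive_loss` are assumptions rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, candidates: torch.Tensor,
                     positive_index: int, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style term with dot-product similarity scaled by temperature tau.

    anchor:     (D,)   embedding f(x)
    candidates: (N, D) embeddings f(x_i), one of which is the positive
    """
    logits = candidates @ anchor / tau       # sim(x, x_i) = f(x) . f(x_i), scaled by tau
    target = torch.tensor([positive_index])  # index of the positive sample
    return F.cross_entropy(logits.unsqueeze(0), target)
```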
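The Dataset Splits row mentions two downstream protocols: linear evaluation and fully fine-tuning. The hypothetical sketch below shows the usual distinction between them (freezing the pre-trained backbone versus leaving all parameters trainable); `backbone`, `feat_dim`, and `num_classes` are placeholders, not names from the paper.

```python
import torch.nn as nn

def build_classifier(backbone: nn.Module, feat_dim: int, num_classes: int,
                     linear_eval: bool) -> nn.Module:
    """Attach a linear head to a pre-trained backbone.

    linear_eval=True freezes the backbone (linear evaluation protocol);
    linear_eval=False leaves everything trainable (fully fine-tuning protocol).
    The backbone is assumed to return a (batch, feat_dim) feature tensor.
    """
    if linear_eval:
        for p in backbone.parameters():
            p.requires_grad = False   # only the new linear classifier is trained
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))
```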