Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion
Authors: Jinpeng Wang, Yuting Gao, Ke Li, Jianguo Hu, Xinyang Jiang, Xiaowei Guo, Rongrong Ji, Xing Sun
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone. |
| Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, Guangzhou, China 2 Tencent Youtu Lab, Shanghai, China 3 Xiamen University, Xiamen, China |
| Pseudocode | No | The paper does not contain a clearly labeled pseudocode or algorithm block. The method is described in narrative text and illustrated with a diagram (Figure 2). |
| Open Source Code | No | The paper does not provide concrete access to source code, such as a specific repository link or an explicit statement about code release for the methodology described. |
| Open Datasets | Yes | All the experiments are conducted on three video classification benchmarks, UCF101, HMDB51 and Kinetics (Kay et al. 2017). UCF101 consists of 13,320 manually labeled videos in 101 action categories and HMDB51 comprises 6,766 manually labeled clips in 51 categories, both of which are divided into three train/test splits. |
| Dataset Splits | Yes | HMDB51 comprises 6,766 manually labeled clips in 51 categories, both of which are divided into three train/test splits. Kinetics is a large-scale action recognition dataset that contains 246k/20k train/val video clips of 400 classes. |
| Hardware Specification | Yes | All the experiments are conducted on 16 Tesla V100 GPUs with a batch size of 128. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), needed to replicate the experiment. |
| Experiment Setup | Yes | All the experiments are conducted on 16 Tesla V100 GPUs with a batch size of 128. For each video clip, we uniformly sample 16 frames with a temporal stride of 4 and then resize the sampled clip to 16 × 3 × 224 × 224. The margin of triplet loss is set to 0.5 and the smoothing coefficient m of momentum encoder in contrastive representation learning is set to 0.99 following MoCo (He et al. 2020). (A minimal configuration sketch follows the table.) |
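The Experiment Setup row pins down several concrete hyperparameters, but since the paper releases no code, the PyTorch sketch below is only an illustration of how the reported values (16-frame clips with a temporal stride of 4 at 224 × 224 resolution, a triplet-loss margin of 0.5, and a MoCo-style momentum coefficient m = 0.99) might be wired up. The function and module names (`sample_clip_indices`, `encoder_q`, `encoder_k`) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: hyperparameters come from the paper's setup
# description; all names and structure here are hypothetical.

# Clip sampling as reported: 16 frames, temporal stride 4, resized to
# 224x224, giving clips of shape (16, 3, 224, 224) before batching.
NUM_FRAMES, STRIDE, CROP = 16, 4, 224

def sample_clip_indices(num_video_frames: int, start: int = 0):
    """Sample NUM_FRAMES frame indices with a temporal stride of STRIDE.

    Wrapping around short videos via modulo is an assumption; the paper
    does not specify how clips shorter than 16 * 4 frames are handled.
    """
    return [(start + i * STRIDE) % num_video_frames for i in range(NUM_FRAMES)]

# Triplet loss with the reported margin of 0.5.
triplet_loss = nn.TripletMarginLoss(margin=0.5)

@torch.no_grad()
def momentum_update(encoder_q: nn.Module, encoder_k: nn.Module, m: float = 0.99):
    """MoCo-style momentum update of the key encoder with m = 0.99."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        # theta_k <- m * theta_k + (1 - m) * theta_q
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```

Following MoCo's formulation, the key encoder trails the query encoder via an exponential moving average, so m = 0.99 here means the key encoder updates slowly relative to the query encoder, matching the smoothing coefficient the paper states.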