SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning

Authors: Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei (pp. 10656-10664)

AAAI 2021

Reproducibility Variable Result LLM Response
Research Type Experimental SeCo shows superior results under the linear protocol on action recognition (Kinetics), untrimmed activity recognition (ActivityNet) and object tracking (OTB100). More remarkably, SeCo demonstrates considerable improvements over recent unsupervised pre-training techniques, and leads the accuracy by 2.96% and 6.47% against fully-supervised ImageNet pre-training in the action recognition task on UCF101 and HMDB51, respectively. We also empirically verify the merit of SeCo for unsupervised representation learning in three downstream tasks: action recognition, untrimmed activity recognition and object tracking.
Researcher Affiliation Industry Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei JD AI Research, Beijing, China
Pseudocode No No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code Yes Source code is available at https://github.com/YihengZhang-CV/SeCo-Sequence-Contrastive-Learning.
Open Datasets Yes Datasets Kinetics400 dataset (Kay et al. 2017) is one of the large-scale action recognition benchmarks... UCF101 (Soomro, Zamir, and Shah 2012) is one of the most popular action recognition benchmarks... HMDB51 (Kuehne et al. 2011) is another widely used action recognition dataset... ActivityNet dataset (Heilbron et al. 2015) is a large-scale human activity understanding benchmark... GOT-10K (Huang, Zhao, and Huang 2019)... OTB-100 (Wu, Lim, and Yang 2015).
Dataset Splits Yes All the videos are grouped into three subsets for training (240K), validation (20K), and testing (40K), respectively. Because the labels of testing set are not publicly available, the performances on the Kinetics400 dataset are reported on the validation set. This dataset consists of 13,320 videos from 101 action classes, which are split into about 9.5K and 3.7K videos in training and testing set, respectively. The dataset is split into training (3.5K) and testing (1.5K) sets. All the videos in the dataset are divided into 10,024, 4,926, and 5,044 for training, validation, and testing sets, respectively.
Hardware Specification No The paper mentions 'shuffling BN is utilized for multi-GPU training' but does not provide any specific details about the GPU models (e.g., NVIDIA A100, Tesla V100) or other hardware components used for the experiments.
Software Dependencies No The paper does not provide specific version numbers for software dependencies or libraries used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes In our implementations, the size of the mini-batch is set to 512 and the size of memory is 131,072. The momentum coefficient α for the momentum update of the encoder is set to 0.999 and the temperature τ in the InfoNCE loss is 0.1. Following (He et al. 2019), shuffling BN is utilized for multi-GPU training. To optimize the parameters in the encoder, we use momentum SGD with an initial learning rate of 0.2, which is annealed down to zero following a cosine decay. The network is trained for 400 epochs starting from weights initialized with MoCo (He et al. 2019) on ImageNet. For data augmentation, we employ random cropping with random scales, color-jitter, random grayscale, blur, and mirror.
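The hyper-parameters above map onto three standard MoCo-style mechanics: the exponential-moving-average update of the key encoder (α = 0.999), the temperature-scaled InfoNCE loss (τ = 0.1), and the cosine-annealed learning rate (0.2 → 0 over 400 epochs). A minimal pure-Python sketch of these mechanics is below; the function names are illustrative and not taken from the paper's code, and the weights are flattened to plain lists of floats for clarity.

```python
import math

ALPHA = 0.999        # momentum coefficient for the key encoder (from the paper)
TAU = 0.1            # InfoNCE temperature (from the paper)
BASE_LR = 0.2        # initial learning rate for momentum SGD (from the paper)
TOTAL_EPOCHS = 400   # pre-training length (from the paper)

def momentum_update(key_weights, query_weights, alpha=ALPHA):
    """EMA update of the key encoder: w_k <- alpha * w_k + (1 - alpha) * w_q."""
    return [alpha * wk + (1.0 - alpha) * wq
            for wk, wq in zip(key_weights, query_weights)]

def info_nce(pos_sim, neg_sims, tau=TAU):
    """InfoNCE loss for one query: -log( exp(s+/tau) / sum_i exp(s_i/tau) ),
    where the sum runs over the positive and all negative similarities."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(pos_sim / tau - log_sum)

def cosine_lr(epoch, base_lr=BASE_LR, total_epochs=TOTAL_EPOCHS):
    """Cosine decay: base_lr at epoch 0, annealed down to zero at the end."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

For example, `cosine_lr(0)` returns 0.2 and `cosine_lr(200)` returns 0.1, halfway through the schedule; in the paper's setting the memory bank would supply 131,072 negative similarities to `info_nce` per query.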