Self-supervised Co-Training for Video Representation Learning

Authors: Tengda Han, Weidi Xie, Andrew Zisserman

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we target self-supervised video representation learning, and ask the question: is instance discrimination making the best use of data? We show that the answer is no, in two respects: First, we show that hard positives are being neglected in the self-supervised training, and that if these hard positives are included then the quality of learnt representation improves significantly. To investigate this, we conduct an oracle experiment where positive samples are incorporated into the instance-based training process based on the semantic class label. A clear performance gap is observed between the pure instance-based learning (termed InfoNCE [59]) and the oracle version (termed UberNCE). Second, we propose a self-supervised co-training method, called CoCLR, standing for Co-training Contrastive Learning of visual Representation, with the goal of mining positive samples by using other complementary views of the data, i.e. replacing the role of the oracle. We pick RGB video frames and optical flow as the two views from hereon. As illustrated in Figure 1, positives obtained from flow can be used to bridge the gap between the RGB video clip instances. In turn, positives obtained from RGB video clips can link optical flow clips of the same action. The outcome of training with the CoCLR algorithm is a representation that significantly surpasses the performance obtained by the instance-based training with InfoNCE, and approaches the performance of the oracle training with UberNCE. In this section, we first describe the datasets (Section 4.1) and implementation details (Section 4.2) for CoCLR training. In Section 4.3, we describe the downstream tasks for evaluating the representation obtained from self-supervised learning. All proof-of-concept and ablation studies are conducted on UCF101 (Section 4.4), with larger scale training on Kinetics-400 (Section 4.5) to compare with other state-of-the-art approaches.
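For illustration, the following is a minimal PyTorch-style sketch of the positive-mining idea described above: the top-k nearest neighbours of a clip in the flow embedding space are treated as extra positives in a multi-instance InfoNCE loss for the RGB branch. All function and variable names, the top-k value, and the temperature are illustrative assumptions, not the authors' released code.

```python
import torch

def coclr_rgb_loss(q_rgb, k_rgb, queue_rgb, q_flow, queue_flow, topk=5, tau=0.07):
    """Multi-instance InfoNCE sketch for the RGB branch.

    q_rgb      (B, D): L2-normalised RGB query features.
    k_rgb      (B, D): momentum-encoder RGB keys of the same clips.
    queue_rgb  (N, D): RGB memory bank (momentum features).
    q_flow     (B, D): flow features of the same clips (frozen flow network).
    queue_flow (N, D): flow memory bank, aligned index-wise with queue_rgb.
    """
    # Similarity of each query to its own key (column 0) and to all queued keys.
    logits = torch.cat([
        (q_rgb * k_rgb).sum(dim=1, keepdim=True),  # (B, 1) the instance itself
        q_rgb @ queue_rgb.t(),                     # (B, N) all queued instances
    ], dim=1) / tau

    # Mine extra positives in the flow space: queue entries whose flow features
    # are most similar to this clip's flow feature are also marked as positives.
    flow_sim = q_flow @ queue_flow.t()             # (B, N)
    topk_idx = flow_sim.topk(topk, dim=1).indices  # (B, topk)
    pos_mask = torch.zeros_like(logits, dtype=torch.bool)
    pos_mask[:, 0] = True                          # the clip's own key
    pos_mask.scatter_(1, topk_idx + 1, True)       # +1 skips the self column

    # Multi-instance InfoNCE: -log( sum over positives / sum over all ).
    loss = -(torch.logsumexp(logits.masked_fill(~pos_mask, float('-inf')), dim=1)
             - torch.logsumexp(logits, dim=1)).mean()
    return loss
```

The symmetric loss for the flow branch would swap the roles of the two views, with the RGB network mining positives for flow.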
Researcher Affiliation | Academia | Tengda Han, Weidi Xie, Andrew Zisserman, VGG, Department of Engineering Science, University of Oxford, {htd, weidi, az}@robots.ox.ac.uk
Pseudocode | No | The paper describes the steps of the CoCLR algorithm but does not present them in a formal pseudocode block or clearly labeled algorithm section.
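Since no formal pseudocode is given, a condensed reconstruction of the two-stage training described in the paper's prose might look like the sketch below. The helper functions (train_infonce, train_coclr, freeze) are hypothetical placeholders, not part of the paper or its released code.

```python
def coclr_train(rgb_net, flow_net, dataset, init_epochs=300, cycles=2, epochs_per_half=100):
    # Stage 1 (initialisation): train both networks independently with plain InfoNCE.
    train_infonce(rgb_net, dataset.rgb, epochs=init_epochs)
    train_infonce(flow_net, dataset.flow, epochs=init_epochs)

    # Stage 2 (alternation): each half-cycle freezes one network and uses it
    # to mine hard positives for the other (multi-instance InfoNCE).
    for _ in range(cycles):
        train_coclr(target=rgb_net, miner=freeze(flow_net), data=dataset, epochs=epochs_per_half)
        train_coclr(target=flow_net, miner=freeze(rgb_net), data=dataset, epochs=epochs_per_half)
    return rgb_net, flow_net
```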
Open Source Code | Yes | To facilitate future research, we release our code and pretrained representations.
Open Datasets | Yes | We use two video action recognition datasets for self-supervised CoCLR training: UCF101 [53], containing 13k videos spanning 101 human actions (we only use the videos from the training set); and Kinetics-400 (K400) [35] with 240k video clips only from its training set. For downstream evaluation tasks, we benchmark on the UCF101 split1, K400 validation set, as well as on the split1 of HMDB51 [39], which contains 7k videos spanning 51 human actions.
Dataset Splits | Yes | We use two video action recognition datasets for self-supervised CoCLR training: UCF101 [53], containing 13k videos spanning 101 human actions (we only use the videos from the training set); and Kinetics-400 (K400) [35] with 240k video clips only from its training set. For downstream evaluation tasks, we benchmark on the UCF101 split1, K400 validation set, as well as on the split1 of HMDB51 [39], which contains 7k videos spanning 51 human actions. The learning rate is decayed down by 1/10 twice when the validation loss plateaus.
Hardware Specification | No | The paper states: "Each experiment is trained on 4 GPUs", but does not specify the model or type of GPUs used.
Software Dependencies | No | The paper mentions the use of "the un-supervised TV-L1 algorithm [67]" for optical flow computation, but it does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x).
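The paper does not name a flow implementation, but the TV-L1 algorithm it cites can be computed with, for example, opencv-contrib-python; the snippet below is an assumption about tooling, not a dependency stated by the authors.

```python
import cv2
import numpy as np

def tvl1_flow(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Return an (H, W, 2) float32 TV-L1 flow field between two grayscale frames."""
    # DualTVL1OpticalFlow is provided by the opencv-contrib "optflow" module.
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    return tvl1.calc(prev_gray, next_gray, None)
```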
Experiment Setup | Yes | We choose the S3D [65] architecture as the feature extractor for all experiments. During CoCLR training, we attach a non-linear projection head, and remove it for downstream task evaluations, as done in SimCLR [12]. We use a 32-frame RGB (or flow) clip as input, at 30 fps, this roughly covers 1 second. The video clip has a spatial resolution of 128 × 128 pixels. For data augmentation, we apply random crops, horizontal flips, Gaussian blur and color jittering, all are clip-wise consistent. We also apply random temporal cropping to utilize the natural variation of the temporal dimension, i.e. the input video clips are cropped at random time stamps from the source video. At the initialization stage, we train both RGB and Flow networks with InfoNCE for 300 epochs, where an epoch means to have sampled one clip from each video in the training set, i.e. the total number of seen instances is equivalent to the number of videos in the training set. We adopt a momentum-updated history queue to cache a large number of features as in MoCo [13, 27]. At the alternation stage, on UCF101 the model is trained for two cycles, where each cycle includes 200 epochs, i.e. RGB and Flow networks are each trained for 100 epochs with hard positive mining from the other; on K400 the model is only trained for one cycle for 100 epochs, that is 50 epochs each for RGB and Flow networks, however, we expect more training cycles to be beneficial. For optimization, we use Adam with 10^-3 learning rate and 10^-5 weight decay. The learning rate is decayed down by 1/10 twice when the validation loss plateaus. Each experiment is trained on 4 GPUs, with a batch size of 32 samples per GPU.
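A minimal sketch of the quoted optimisation settings (Adam, learning rate 10^-3, weight decay 10^-5, learning rate divided by 10 when the validation loss plateaus) is given below; the scheduler class and its patience value are assumptions, since the paper only describes the decay rule.

```python
import torch
import torch.nn as nn

# Stand-in module; the paper uses an S3D backbone with a non-linear projection head.
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 128))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# Divide the learning rate by 10 when the validation loss plateaus; the
# patience value is a guess, not stated in the paper.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5)

# After each training epoch, call scheduler.step(val_loss) with that epoch's
# validation loss so the decay is triggered on plateau.
```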