Can Semantic Labels Assist Self-Supervised Visual Representation Learning?

Authors: Longhui Wei, Lingxi Xie, Jianzhong He, Xiaopeng Zhang, Qi Tian | pp. 2642-2650

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate SCAN on a wide range of downstream tasks for detection and segmentation. As shown in Tab. 1, SCAN consistently outperforms the fully-supervised and self-supervised baselines, sometimes significantly. Moreover, although the state-of-the-art self-supervised learning methods (Chen et al. 2020a; He et al. 2020) claimed benefits from extensive (around 1 billion) unlabeled images, SCAN surpasses their performance by training on labeled ImageNet-1K with around 1 million images." (from Experiments: Datasets and Implementation Details)
Researcher Affiliation | Collaboration | Longhui Wei,1,2 Lingxi Xie,2 Jianzhong He,2 Xiaopeng Zhang,2 Qi Tian2; 1University of Science and Technology of China, 2Huawei Inc. weilh2568@gmail.com, 198808xc@gmail.com, q-tian@hotmail.com
Pseudocode | No | "Therefore, suppose a query image is q_i, its two corresponding augmented views are denoted as q_i' and q_i'', the encoders for the query and memory bank are f and g, the memory bank size is L, and the feature of each sample embedded in this memory bank is represented as z_l. The loss function of MoCo for this query image can be formulated as:

$\mathcal{L}^{\mathrm{MoCo}}_i = -\log \frac{\exp\{f(q_i') \cdot g(q_i'')/\tau\}}{S^{\mathrm{all}}_i}, \quad S^{\mathrm{all}}_i = \exp\{f(q_i') \cdot g(q_i'')/\tau\} + \sum_{l=1}^{L} \exp\{f(q_i') \cdot z_l/\tau\} \quad (1)$"...
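The MoCo loss quoted in the row above is an InfoNCE-style objective. As a sanity check, here is a minimal NumPy sketch for a single query; the function name `moco_loss_single` and the choice to include the positive pair in the normalizing sum are assumptions on my part, since the expression as extracted is partially garbled.

```python
import numpy as np

def moco_loss_single(fq, gq, memory_bank, tau=0.07):
    """InfoNCE-style MoCo loss for one query image.

    fq          -- f(q_i'),  encoded query view,       shape (d,)
    gq          -- g(q_i''), encoded positive view,    shape (d,)
    memory_bank -- features z_l of the L bank samples, shape (L, d)
    tau         -- temperature (0.07 in the paper's setup)
    """
    pos = np.exp(fq @ gq / tau)                          # similarity to the positive
    s_all = pos + np.exp(memory_bank @ fq / tau).sum()   # normalizer over positive + bank
    return -np.log(pos / s_all)
```

Because the positive term appears in the normalizer, the ratio is strictly below 1 and the loss is always positive; it shrinks as the query aligns with its positive view and away from the bank features.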
Open Source Code | No | "We directly utilize the officially released MoCo-v2 model trained on ImageNet-1K with 800 epochs to generate the appearance feature..."
Open Datasets | Yes | "We pre-train SCAN on ImageNet-1K (Deng et al. 2009), the most widely used large-scale classification dataset. As for the downstream evaluation stage, we mainly adopt three commonly used detection and segmentation datasets, i.e., PASCAL VOC (Everingham et al. 2010), COCO (Lin et al. 2014) and Cityscapes (Cordts et al. 2016), respectively."
Dataset Splits | No | "All of the experiments are trained on COCO train2017 and tested on COCO val2017."
Hardware Specification | Yes | "Finally, we use 32 Tesla-V100 GPUs to train our model for 400 epochs on ImageNet-1K."
Software Dependencies | No | "For the detection task on VOC, we use Detectron2 to finetune Faster-RCNN (Ren et al. 2015) with an R50-C4 backbone."
Experiment Setup | Yes | "Empirically, we assign each sample and its top-2 positive neighbors to the same new label. Finally, we select ResNet-50 (He et al. 2016) as the backbone and train it under the guidance of our generated labels. For one mini-batch on each GPU, we randomly choose 128 samples and their corresponding positive neighbors, so there are always positive samples for each anchor image in the mini-batch. Similar to MoCo, we adopt the SGD optimizer and set momentum to 0.9. Moreover, a cosine learning rate schedule is utilized, and the initial learning rate is set to 1.6. The temperature τ in Eq. (6) is empirically set to 0.07. For the data augmentation, we simply follow the augmentation scheme of MoCo-v2."
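The setup row specifies a cosine learning rate schedule starting at 1.6 over a 400-epoch run. A minimal sketch of such a schedule is below; the helper name `cosine_lr` and the decay-to-zero endpoint are assumptions, since the row does not spell out the exact schedule formula.

```python
import math

def cosine_lr(step, total_steps, base_lr=1.6):
    """Cosine learning-rate schedule: starts at base_lr and
    decays smoothly to 0 by the end of training."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Over a 400-epoch run: full lr (1.6) at epoch 0, half (0.8) at epoch 200, 0 at epoch 400.
```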