Can Semantic Labels Assist Self-Supervised Visual Representation Learning?
Authors: Longhui Wei, Lingxi Xie, Jianzhong He, Xiaopeng Zhang, Qi Tian (pp. 2642–2650)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SCAN on a wide range of downstream tasks for detection and segmentation. As shown in Tab. 1, SCAN outperforms the fully-supervised and self-supervised baselines consistently, sometimes significantly. Moreover, though the state-of-the-art self-supervised learning methods (Chen et al. 2020a; He et al. 2020) claimed the benefits from extensive (around 1 billion) unlabeled images, SCAN is able to surpass their performance by training on labeled ImageNet-1K with around 1 million images. Experiments Datasets and Implementation Details. |
| Researcher Affiliation | Collaboration | Longhui Wei,1,2 Lingxi Xie,2 Jianzhong He,2 Xiaopeng Zhang,2 Qi Tian2. 1University of Science and Technology of China 2Huawei Inc. weilh2568@gmail.com, 198808xc@gmail.com, q-tian@hotmail.com |
| Pseudocode | No | Therefore, suppose a query image is $q_i$, its two corresponding augmented images are denoted as $q'_i$ and $q''_i$, the encoders for the query and memory bank are $f$ and $g$, the memory bank size is $L$, and the feature of each sample embedded in this memory bank is represented as $z_l$; the loss function of MoCo for this query image can be formulated as: $\mathcal{L}^{\mathrm{MoCo}}_i = -\log \frac{\exp\{f(q'_i)^\top g(q''_i)/\tau\}}{S^{\mathrm{all}}_i}$, (1)... |
| Open Source Code | No | We directly utilize the official released MoCo-v2 model trained on ImageNet-1K with 800 epochs to generate the appearance feature... |
| Open Datasets | Yes | We pre-train SCAN on ImageNet-1K (Deng et al. 2009), the most widely used large-scale classification dataset. As for the downstream evaluation stage, we mainly adopt three commonly used detection and segmentation datasets, i.e., PASCAL VOC (Everingham et al. 2010), COCO (Lin et al. 2014) and Cityscapes (Cordts et al. 2016), respectively. |
| Dataset Splits | No | All of the experiments are trained on COCO train2017 and tested on COCO val2017. |
| Hardware Specification | Yes | Finally, we use 32 Tesla-V100 GPUs to train our model lasting for 400 epochs on ImageNet-1K. |
| Software Dependencies | No | For the detection task on VOC, we use Detectron2 to finetune the Faster-RCNN (Ren et al. 2015) with R50-C4 backbone. |
| Experiment Setup | Yes | Empirically, we assign each sample and its top-2 positive neighbors to the same new label. Finally, we select ResNet-50 (He et al. 2016) as the used backbone and train it with the guidance of our generated labels. For one mini-batch of each GPU, we randomly choose 128 samples and their corresponding positive neighbors. Therefore, there are always positive samples for each anchor image in the mini-batch. Similar to MoCo, we adopt the SGD optimizer and set momentum as 0.9. Moreover, the cosine learning rate schedule is utilized, and the initial learning rate is set as 1.6. The temperature τ in Eq. (6) is empirically set as 0.07. For the data augmentation, we simply follow the augmentation scheme in MoCo-v2. |
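The MoCo-style contrastive loss quoted in the Pseudocode row can be sketched for a single query as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the explicit positive term in the denominator, and the assumption that all features are L2-normalized are ours.

```python
import numpy as np

def moco_loss(f_q, g_q, memory_bank, tau=0.07):
    """InfoNCE-style loss for one query, following Eq. (1) of the paper.

    f_q: (D,) embedding of one augmented view, from query encoder f.
    g_q: (D,) embedding of the other view, from key encoder g.
    memory_bank: (L, D) negative features z_l.
    Assumes all features are L2-normalized, so dot products are cosine similarities.
    """
    pos = np.exp(np.dot(f_q, g_q) / tau)          # similarity to the positive pair
    neg = np.exp(memory_bank @ f_q / tau).sum()   # similarities to the L negatives
    return -np.log(pos / (pos + neg))

# Toy usage with random unit vectors (τ = 0.07 as in the Experiment Setup row).
rng = np.random.default_rng(0)
q1 = rng.normal(size=128); q1 /= np.linalg.norm(q1)
q2 = rng.normal(size=128); q2 /= np.linalg.norm(q2)
bank = rng.normal(size=(4096, 128))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
loss = moco_loss(q1, q2, bank)
```

Because the denominator includes the positive term, the ratio is always below 1, so the loss is strictly positive; it shrinks as the query aligns with its other view and separates from the memory-bank negatives.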