Unsupervised Object-Level Representation Learning from Scene Images

Authors: Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. We evaluate the quality of learned representations by transferring them to multiple downstream tasks. Following common protocol [18, 37], we use two evaluation setups: (i) the pre-trained network is frozen as a feature extractor, and (ii) the network parameters are fully fine-tuned as weight initialization. We provide more experimental details in the supplementary material.
Researcher Affiliation | Academia | Jiahao Xie (1), Xiaohang Zhan (2), Ziwei Liu (1), Yew Soon Ong (1,3), Chen Change Loy (1); (1) Nanyang Technological University, (2) The Chinese University of Hong Kong, (3) A*STAR, Singapore
Pseudocode | No | The paper describes its pipeline and steps in text but does not include any formally labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://www.mmlab-ntu.com/project/orl/.
Open Datasets | Yes | We pre-train our models on the COCO train2017 set that contains 118k images without using labels. Compared with the heavily curated object-centric ImageNet dataset, COCO contains more natural and diverse scenes in the wild, which is closer to real-world scenarios. We also perform self-supervised learning on a larger COCO+ dataset (COCO train2017 set plus COCO unlabeled2017 set) to verify whether our method can benefit from more unlabeled scene data. [33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
Dataset Splits | Yes | For VOC07, we train linear SVMs using the LIBLINEAR package [13] following the setup in [18, 37]. We train on the trainval split of VOC07 and evaluate mAP on the test split. For ImageNet, Places205 and iNaturalist18, we follow [71, 18, 37] and train a 1000-way, 205-way and 8142-way linear classifier, respectively. We train on the train split of each dataset, and report top-1 center-crop accuracy on the respective val split. Specifically, we first randomly select 1% and 10% labeled data from the ImageNet train split. We then fine-tune our models on these two training subsets and report both top-1 and top-5 accuracy on the official val split of ImageNet in Table 3. We fine-tune all layers end-to-end on the COCO train2017 split with the standard 1x schedule and evaluate on the COCO val2017 split. (A minimal sketch of the frozen-feature SVM evaluation appears after this table.)
Hardware Specification | No | The paper mentions that "the batch size is set to 512 by default, which is friendly to typical 8-GPU implementations," but does not specify any particular GPU models, CPU types, or other detailed hardware specifications for its experiments.
Software Dependencies | No | The paper mentions software components such as the SGD optimizer, BYOL, ResNet-50, the LIBLINEAR package [13], and Detectron2 [59], but does not provide specific version numbers for these libraries or the underlying frameworks (e.g., PyTorch, Python version) that would be needed for reproducibility.
Experiment Setup | Yes | For pre-training in Stage 1 and Stage 3, we use the same training hyper-parameters. Specifically, we use the SGD optimizer with a weight decay of 0.0001 and a momentum of 0.9. We adopt the cosine learning rate decay schedule [36] with a base learning rate of 0.2, linearly scaled [17] with the batch size (lr = 0.2 x BatchSize/256). The batch size is set to 512 by default, which is friendly to typical 8-GPU implementations. To keep the training iterations comparable with the ImageNet supervised pre-training, we train our models for 800 epochs with a warm-up period of 4 epochs. The exponential moving average parameter τ starts from 0.99 and is increased to 1 during training, following [19]. For correspondence generation in Stage 2, we retrieve the top K = 10 nearest neighbors for each image and select the top-ranked N = 10% RoI pairs for each image-level nearest-neighbor pair. (Sketches of the training schedules and the nearest-neighbor retrieval step appear after this table.)
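The "Experiment Setup" row describes the learning-rate and momentum schedules only in prose. Below is a minimal sketch of how those schedules could be computed, assuming a linear warm-up over the 4 warm-up epochs and a BYOL-style cosine ramp for the EMA momentum as in [19], applied per epoch for simplicity; the function names are illustrative and not taken from the released code.

```python
# Sketch of the pre-training schedules quoted above (Stages 1 and 3).
# Assumptions: linear warm-up, per-epoch (not per-step) updates.
import math

BASE_LR, BATCH_SIZE = 0.2, 512
EPOCHS, WARMUP_EPOCHS = 800, 4
TAU_BASE = 0.99  # EMA momentum, increased to 1 over training, following [19]

def learning_rate(epoch):
    """Linear warm-up for 4 epochs, then cosine decay; lr = 0.2 * BatchSize / 256."""
    peak_lr = BASE_LR * BATCH_SIZE / 256
    if epoch < WARMUP_EPOCHS:
        return peak_lr * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

def ema_momentum(epoch):
    """Cosine increase of the target-network momentum from 0.99 toward 1."""
    return 1 - (1 - TAU_BASE) * 0.5 * (1 + math.cos(math.pi * epoch / EPOCHS))
```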
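Stage 2 first retrieves the top K = 10 image-level nearest neighbors before ranking RoI pairs. The following is a minimal sketch of that retrieval step, assuming one L2-normalized global embedding per image produced by the Stage 1 network; the RoI-pair selection (top N = 10%) is omitted, and the tensor shapes are illustrative rather than taken from the paper.

```python
# Sketch of Stage 2 image-level nearest-neighbor retrieval (K = 10).
import torch

def retrieve_knn(embeddings: torch.Tensor, k: int = 10) -> torch.Tensor:
    """embeddings: (num_images, dim), L2-normalized. Returns (num_images, k) neighbor indices."""
    sim = embeddings @ embeddings.t()         # cosine similarity matrix
    sim.fill_diagonal_(-float("inf"))         # exclude the query image itself
    return sim.topk(k, dim=1).indices         # top K neighbors per image
```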
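For the VOC07 protocol quoted in the "Dataset Splits" row, the paper trains per-class linear SVMs with the LIBLINEAR package [13] following [18, 37]. Below is a minimal sketch of that frozen-feature evaluation using scikit-learn's liblinear-backed LinearSVC; the cost value and array names are assumptions, and sklearn's average precision differs slightly from the VOC07 11-point metric.

```python
# Sketch of frozen-feature linear SVM evaluation on VOC07 (20 classes).
# Assumes train_feats/test_feats are features from the frozen backbone and
# train_labels/test_labels are binary {0, 1} matrices of shape (num_images, 20).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def voc07_linear_svm_map(train_feats, train_labels, test_feats, test_labels, cost=1.0):
    aps = []
    for cls in range(train_labels.shape[1]):
        clf = LinearSVC(C=cost, max_iter=2000)          # liblinear-backed linear SVM
        clf.fit(train_feats, train_labels[:, cls])
        scores = clf.decision_function(test_feats)       # per-image confidence for this class
        aps.append(average_precision_score(test_labels[:, cls], scores))
    return float(np.mean(aps))                           # mAP over the 20 classes
```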