Unsupervised Object-Level Representation Learning from Scene Images
Authors: Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. We evaluate the quality of learned representations by transferring them to multiple downstream tasks. Following common protocol [18, 37], we use two evaluation setups: (i) the pre-trained network is frozen as a feature extractor, and (ii) the network parameters are fully fine-tuned as weight initialization. We provide more experimental details in the supplementary material. |
| Researcher Affiliation | Academia | Jiahao Xie (1), Xiaohang Zhan (2), Ziwei Liu (1), Yew Soon Ong (1,3), Chen Change Loy (1); (1) Nanyang Technological University, (2) The Chinese University of Hong Kong, (3) A*STAR, Singapore |
| Pseudocode | No | The paper describes its pipeline and steps in text but does not include any formally labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://www.mmlab-ntu.com/project/orl/. |
| Open Datasets | Yes | We pre-train our models on the COCO train2017 set that contains 118k images without using labels. Compared with the heavily curated object-centric ImageNet dataset, COCO contains more natural and diverse scenes in the wild, which is closer to real-world scenarios. We also perform self-supervised learning on a larger COCO+ dataset (COCO train2017 set plus COCO unlabeled2017 set) to verify whether our method can benefit from more unlabeled scene data. [33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. |
| Dataset Splits | Yes | For VOC07, we train linear SVMs using the LIBLINEAR package [13] following the setup in [18, 37]. We train on the trainval split of VOC07 and evaluate mAP on the test split. For ImageNet, Places205 and iNaturalist18, we follow [71, 18, 37] and train a 1000-way, 205-way and 8142-way linear classifier, respectively. We train on the train split of each dataset, and report top-1 center-crop accuracy on the respective val split. Specifically, we first randomly select 1% and 10% labeled data from the ImageNet train split. We then fine-tune our models on these two training subsets and report both top-1 and top-5 accuracy on the official val split of ImageNet in Table 3. We fine-tune all layers end-to-end on the COCO train2017 split with the standard 1× schedule and evaluate on the COCO val2017 split. (A sketch of the VOC07 linear SVM protocol follows the table.) |
| Hardware Specification | No | The paper mentions 'The batch size is set to 512 by default, which is friendly to typical 8-GPU implementations.' but does not specify any particular GPU models, CPU types, or other detailed hardware specifications for their experiments. |
| Software Dependencies | No | The paper mentions software components like 'SGD optimizer', 'BYOL', 'ResNet-50', 'LIBLINEAR package [13]', and 'Detectron2 [59]', but does not provide specific version numbers for these software libraries or underlying frameworks (e.g., PyTorch, Python version) that would be needed for reproducibility. |
| Experiment Setup | Yes | For pre-training in Stage 1 and Stage 3, we use the same training hyper-parameters. Specifically, we use the SGD optimizer with a weight decay of 0.0001 and a momentum of 0.9. We adopt the cosine learning rate decay schedule [36] with a base learning rate of 0.2, linearly scaled [17] with the batch size (lr = 0.2 × BatchSize/256). The batch size is set to 512 by default, which is friendly to typical 8-GPU implementations. To keep the training iterations comparable with the ImageNet supervised pre-training, we train our models for 800 epochs with a warm-up period of 4 epochs. The exponential moving average parameter τ starts from 0.99 and is increased to 1 during training, following [19]. For correspondence generation in Stage 2, we retrieve the top K = 10 nearest neighbors for each image and select the top-ranked N = 10% RoI pairs for each image-level nearest-neighbor pair. (A configuration sketch of these hyper-parameters follows the table.) |
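
The Experiment Setup row fully specifies the optimizer, the warmed-up cosine learning-rate schedule with linear scaling, the EMA momentum ramp, and the Stage 2 nearest-neighbor retrieval. Below is a minimal sketch of that configuration, assuming a PyTorch implementation. The backbone choice, variable names, and the cosine form of the τ ramp (the BYOL convention of [19]) are assumptions here; the official code at https://www.mmlab-ntu.com/project/orl/ is authoritative.

```python
import math

import torch
import torchvision

# Hyper-parameters quoted in the Experiment Setup row.
batch_size = 512
base_lr = 0.2 * batch_size / 256          # linear scaling rule: lr = 0.2 * BatchSize / 256
total_epochs = 800
warmup_epochs = 4

backbone = torchvision.models.resnet50()  # placeholder for the online network
optimizer = torch.optim.SGD(
    backbone.parameters(),
    lr=base_lr,
    momentum=0.9,
    weight_decay=1e-4,
)

def lr_at_epoch(epoch: int) -> float:
    """Linear warm-up for the first 4 epochs, then cosine decay [36]."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def ema_tau(step: int, total_steps: int, tau_base: float = 0.99) -> float:
    """EMA momentum ramped from 0.99 to 1; cosine ramp assumed per BYOL [19]."""
    return 1.0 - (1.0 - tau_base) * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def topk_neighbors(query: torch.Tensor, bank: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Stage 2: retrieve the top-K image-level nearest neighbors by cosine similarity."""
    q = torch.nn.functional.normalize(query, dim=1)
    b = torch.nn.functional.normalize(bank, dim=1)
    return (q @ b.t()).topk(k, dim=1).indices
```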
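Similarly, the VOC07 protocol in the Dataset Splits row (per-class linear SVMs on frozen features, mAP on the test split) can be sketched with scikit-learn's LIBLINEAR-backed `LinearSVC`, standing in for the LIBLINEAR package [13]. The cost parameter `C` and the multi-label one-vs-rest loop are assumptions; the cited protocols [18, 37] define the exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def voc07_svm_map(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Train one binary SVM per class on frozen features; report test mAP.

    train_labels / test_labels are (num_images, num_classes) binary arrays,
    since VOC07 classification is multi-label.
    """
    aps = []
    for c in range(train_labels.shape[1]):
        clf = LinearSVC(C=C)                      # LIBLINEAR under the hood
        clf.fit(train_feats, train_labels[:, c])
        scores = clf.decision_function(test_feats)
        aps.append(average_precision_score(test_labels[:, c], scores))
    return float(np.mean(aps))
```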