Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
Authors: Senthil Purushwalkam, Abhinav Gupta
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that approaches like MoCo [1] and PIRL [2] learn occlusion-invariant representations. However, they fail to capture viewpoint and category instance invariance which are crucial components for object recognition. Second, we demonstrate that these approaches obtain further gains from access to a clean object-centric training dataset like ImageNet. Finally, we propose an approach to leverage unstructured videos to learn representations that possess higher viewpoint invariance. Our results show that the learned representations outperform MoCo-v2 trained on the same data in terms of invariances encoded and the performance on downstream image classification and semantic segmentation tasks. |
| Researcher Affiliation | Collaboration | Senthil Purushwalkam Carnegie Mellon University spurushw@cs.cmu.edu Abhinav Gupta Carnegie Mellon University & Facebook AI Research abhinavg@cs.cmu.edu |
| Pseudocode | No | The paper describes methods and equations (e.g., Equation 1), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'will publicly release the code to reproduce the invariance evaluation metrics on these datasets,' which is a future promise. The project webpage linked in the paper (http://www.cs.cmu.edu/~spurushw/publication/demystifyssl/) also states 'Code will be released soon.' |
| Open Datasets | Yes | We use the training set of the GOT-10K tracking dataset [35]... We use the PASCAL3D+ dataset [36]... The ALOI dataset [37] contains images of 1000 objects... Contrastive self-supervised approaches are most commonly trained on the ImageNet dataset... We pretrain self-supervised models on the MSCOCO dataset [40]... We evaluate this baseline by training MoCo-v2 on frames extracted from TrackingNet [41] videos... We also evaluate on the task of semantic segmentation on ADE20K [44]... |
| Dataset Splits | No | The paper mentions using various standard datasets like ImageNet, MSCOCO, and Pascal, and discusses training and evaluation. It mentions training on '118K MSCOCO images' or 'a randomly sampled 10% subset of ImageNet,' but it does not explicitly provide exact percentages, absolute sample counts, or specific citations for train/validation/test splits for any of the datasets used. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components and frameworks like 'ResNet,' 'Linear SVMs,' and 'ROIPooling,' but it does not specify their version numbers or other crucial software dependencies required for replication. |
| Experiment Setup | No | The paper mentions 'τ is a hyperparameter called temperature' and refers to supplementary material for 'additional implementation details' and 'more concrete implementation details,' but it does not provide specific hyperparameter values, training configurations, or system-level settings within the main text. |
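For context on the temperature hyperparameter τ mentioned above: contrastive methods of the MoCo/PIRL family score a query embedding against one positive and many negative embeddings, and τ scales the similarity logits before the softmax. The sketch below is a minimal NumPy illustration of this loss form, not the authors' implementation; the function name, argument layout, and the default τ = 0.07 are assumptions for illustration only, since the paper defers concrete values to its supplementary material.

```python
import numpy as np

def info_nce_loss(query, pos_key, neg_keys, temperature=0.07):
    """Illustrative InfoNCE-style contrastive loss (hypothetical helper).

    query:    (d,) embedding of the anchor view
    pos_key:  (d,) embedding of the positive (augmented) view
    neg_keys: (n, d) embeddings of negative samples
    Embeddings are L2-normalized so dot products are cosine similarities.
    """
    q = query / np.linalg.norm(query)
    k_pos = pos_key / np.linalg.norm(pos_key)
    k_neg = neg_keys / np.linalg.norm(neg_keys, axis=1, keepdims=True)

    # Similarity logits: positive first, then the negatives, scaled by tau.
    logits = np.concatenate([[q @ k_pos], k_neg @ q]) / temperature
    logits -= logits.max()  # subtract max for numerical stability
    # Softmax cross-entropy with the positive at index 0.
    return -logits[0] + np.log(np.exp(logits).sum())
```

A smaller τ sharpens the softmax, penalizing hard negatives more heavily, which is why its exact value matters for reproduction and why the report flags its absence from the main text.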