What Makes for Good Views for Contrastive Learning?

Authors: Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use theoretical and empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improves downstream classification accuracy. As a byproduct, we achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification (73% top-1 linear readout with a ResNet-50).
Researcher Affiliation | Collaboration | Yonglong Tian (MIT CSAIL); Chen Sun (Google, Brown University); Ben Poole (Google Research); Dilip Krishnan (Google Research); Cordelia Schmid (Google Research); Phillip Isola (MIT CSAIL)
Pseudocode | No | The paper does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper mentions a 'Project page: http://hobbitlong.github.io/InfoMin' but does not explicitly state that source code is provided or link directly to a source-code repository within the text.
Open Datasets | Yes | As a byproduct, we achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification (73% top-1 linear readout with a ResNet-50). We experiment with RGB and YDbDr. Experiments are conducted on STL-10, which includes 100k unlabeled and 5k labeled images. After the contrastive training stage, we evaluate on STL-10 and CIFAR-10 by freezing the encoder and training a linear classifier. Segmentation performance is reported on NYU-V2 [40] images. We build our toy dataset by combining Moving MNIST [51] (consisting of videos where digits move inside a black canvas with constant speed and bounce off of image boundaries) with a fixed background image sampled from the STL-10 dataset [10]. Motivated by the InfoMin principle, we propose a new set of data augmentations, called InfoMin Aug. In combination with the JigSaw strategy proposed in PIRL [38], our InfoMin Aug achieves 73.0% top-1 accuracy on the ImageNet linear readout benchmark with ResNet-50. Transferring our unsupervisedly pre-trained models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised ImageNet pre-training.
Dataset Splits | No | The paper mentions using well-known datasets such as ImageNet, STL-10, and CIFAR-10, and states counts for STL-10 (100k unlabeled and 5k labeled images), but it does not explicitly provide training/validation/test split percentages or absolute counts for every split, nor does it cite predefined splits within the provided text.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) required to replicate the experiments.
Experiment Setup | Yes | Table 1: Single-crop ImageNet accuracies (%) of linear classifiers [63] trained on representations learned with different contrastive methods using ResNet-50 [24]. InfoMin Aug. refers to data augmentation using RandomResizedCrop, Color Jittering, Gaussian Blur, RandAugment, Color Dropping, and a JigSaw branch as in PIRL [38]. InfoMin Aug. (Ours), with a ResNet-50 (24M parameters) and an MLP head, reaches 70.1% top-1 / 89.4% top-5 after 200 epochs and 73.0% top-1 / 91.1% top-5 after 800 epochs. We create views by randomly cropping two patches of size 64x64 from the same image with various offsets. Practically, the flow-based model g is restricted to pixel-wise 1x1 convolutions and ReLU activations, operating independently on each pixel. We try both volume-preserving (VP) and non-volume-preserving (NVP) flows.
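
The InfoMin argument summarized under Research Type rests on the fact that a contrastive objective maximizes a lower bound on the mutual information between two views. As a point of reference, the sketch below shows a minimal InfoNCE-style loss between two batches of view embeddings; the function name `info_nce` and the temperature value are illustrative choices, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Minimal InfoNCE loss for a batch of paired view embeddings.

    z1 and z2 have shape (batch, dim); z1[i] and z2[i] are embeddings of two
    views of the same image. Minimizing this loss maximizes a lower bound on
    the mutual information between the two views' representations.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```

In use, the two views produced by augmentation (or by a learned view generator) would be encoded and passed to this loss, e.g. `loss = info_nce(encoder(view1), encoder(view2))`.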
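The Experiment Setup row lists the InfoMin Aug. recipe (RandomResizedCrop, color jittering, Gaussian blur, RandAugment, color dropping, plus a PIRL-style JigSaw branch). Below is a rough torchvision approximation of that pipeline, assuming torchvision >= 0.11 for RandAugment; the probabilities, magnitudes, and transform order are placeholder values rather than the paper's exact settings, and the JigSaw branch and the learned flow-based views are omitted.

```python
from torchvision import transforms

# Approximate InfoMin Aug. pipeline (placeholder parameters, order illustrative).
infomin_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # color jittering
    transforms.RandomGrayscale(p=0.2),                                            # "color dropping"
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandAugment(),                                                      # torchvision >= 0.11
    transforms.ToTensor(),
])
```

Each training image would be passed through such a pipeline twice to produce the two views fed to the contrastive loss.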