The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning

Authors: Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, Somesh Jha

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.
Researcher Affiliation | Collaboration | 1 University of Wisconsin-Madison; 2 Google LLC; 3 XaiPient. Equal contribution. {zhmeishi,jiefeng,kli253,jayaramr,yliang,jha}@cs.wisc.edu, wuxi@google.com
Pseudocode | No | The paper describes methods and equations, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Please refer to our released code for more details. https://github.com/zhmeishi/trade-off_contrastive_learning
Open Datasets | Yes | The CIFAR-10 (Krizhevsky et al., 2009) dataset consists of 60,000 32×32 color images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. Each class has 6,000 images. There are 50,000 training images and 10,000 test images.
Dataset Splits | Yes | There are 50,000 training images and 10,000 test images. ... Then we fix the pre-trained feature extractor and train a linear classifier (Linear Probing, LP) on 1%, 5%, 10%, 20%, and 100% of the labeled data from the downstream task.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions optimizers (SGD, AdamW) and learning-rate schedulers, but does not provide specific software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | We pre-train a ResNet18 network (He et al., 2016) as a feature extractor under different contrastive learning methods using SGD for 800 epochs with a cosine learning-rate scheduler, a base learning rate of 0.06, weight decay 5e-4, momentum 0.9, and batch size 512. Then we fix the pre-trained feature extractor and train a linear classifier (Linear Probing, LP) on 1%, 5%, 10%, 20%, and 100% of the labeled data from the downstream task. For LP we use SGD for 200 epochs with a cosine learning-rate scheduler, a base learning rate of 5.0, no weight decay, momentum 0.9, and batch size 256. (Illustrative sketches of this setup follow the table.)
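
The Experiment Setup row reports the pre-training hyperparameters but not a full training loop. The sketch below is a minimal PyTorch skeleton that wires those reported settings (SGD, 800 epochs, cosine schedule, base learning rate 0.06, weight decay 5e-4, momentum 0.9) into a generic contrastive pre-training loop. The SimCLR-style NT-Xent loss, the 128-dimensional projection output, and the tiny placeholder data loader are assumptions made only for illustration; the paper evaluates several contrastive methods, and the authors' released code should be treated as the reference implementation.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: [N, d] embeddings of two augmented views of the same batch.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # [2N, d]
    sim = z @ z.t() / temperature                                # pairwise cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                   # exclude self-similarity
    # For row i < N the positive is i + N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Standard torchvision ResNet-18; the final fc layer is reused here as a 128-d projection.
# CIFAR-specific stem modifications (3x3 first conv, no max-pool) are omitted for brevity.
encoder = resnet18(num_classes=128)

# Optimizer and scheduler follow the reported pre-training configuration.
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.06,
                            momentum=0.9, weight_decay=5e-4)
epochs = 800
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# Placeholder loader yielding (two augmented views, ignored labels); the real pipeline
# applies contrastive augmentations to CIFAR-10 with batch size 512.
views = (torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32))
loader = [(views, None)]

for epoch in range(epochs):
    for (view1, view2), _ in loader:
        z1, z2 = encoder(view1), encoder(view2)
        loss = nt_xent_loss(z1, z2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()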
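
The Dataset Splits and Experiment Setup rows describe linear probing on fractions of the labeled downstream data with a frozen feature extractor. The following sketch shows that protocol for a 1% label budget on CIFAR-10 using the reported LP hyperparameters (SGD, 200 epochs, cosine schedule, base learning rate 5.0, no weight decay, momentum 0.9, batch size 256). The untrained torchvision ResNet-18 stands in for the contrastively pre-trained encoder, and the uniform random 1% subsample is an assumption; the exact subsampling and checkpoints are in the linked repository.

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Frozen feature extractor (stand-in for the pre-trained ResNet-18 from the previous sketch).
encoder = resnet18(num_classes=10)
encoder.fc = nn.Identity()            # expose the 512-d penultimate features
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Take 1% of the CIFAR-10 training labels (downloads the dataset on first run).
train_set = datasets.CIFAR10(root="data", train=True, download=True,
                             transform=transforms.ToTensor())
rng = np.random.default_rng(0)
frac = 0.01
idx = rng.choice(len(train_set), size=int(frac * len(train_set)), replace=False)
loader = DataLoader(Subset(train_set, idx.tolist()), batch_size=256, shuffle=True)

# Linear probe trained with the reported LP hyperparameters.
probe = nn.Linear(512, 10)
optimizer = torch.optim.SGD(probe.parameters(), lr=5.0, momentum=0.9, weight_decay=0.0)
epochs = 200
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for images, labels in loader:
        with torch.no_grad():
            feats = encoder(images)   # features are fixed; only the probe is updated
        loss = criterion(probe(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()

The same loop applies to the 5%, 10%, 20%, and 100% budgets by changing `frac`.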