Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere

Authors: Tongzhou Wang, Phillip Isola

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on standard vision and language datasets confirm the strong agreement between both metrics and downstream task performance. Directly optimizing for these two metrics leads to representations with comparable or better performance at downstream tasks than contrastive learning. In this section, we empirically verify the hypothesis that alignment and uniformity are desired properties for representations. We conduct extensive experiments with convolutional neural network (CNN) and recurrent neural network (RNN) based encoders on four popular representation learning benchmarks with distinct types of downstream tasks: STL-10, NYU-DEPTH-V2, IMAGENET-100, BOOKCORPUS.
Researcher Affiliation | Academia | Tongzhou Wang (1), Phillip Isola (1). (1) MIT Computer Science & Artificial Intelligence Lab (CSAIL). Correspondence to: Tongzhou Wang <tongzhou@mit.edu>.
Pseudocode | Yes | Due to their simple forms, these two losses can be implemented in PyTorch (Paszke et al., 2019) with less than 10 lines of code, as shown in Figure 5. Figure 5: PyTorch implementation of L_align and L_uniform. (A minimal sketch consistent with these definitions appears after this table.)
Open Source Code | Yes | Code: github.com/SsnL/align_uniform.
Open Datasets | Yes | STL-10 (Coates et al., 2011) classification..., NYU-DEPTH-V2 (Nathan Silberman & Fergus, 2012) depth prediction..., IMAGENET-100 (100 randomly selected classes from IMAGENET) classification..., BOOKCORPUS (Zhu et al., 2015) RNN sentence encoder outputs...
Dataset Splits | Yes | Figure 3 summarizes the resulting distributions of validation set features. For each encoder, we measure the downstream task performance and the L_align, L_uniform metrics on the validation set. STL-10: the best result is picked by linear classifier accuracy on encoder outputs from a 5-fold cross validation on the training set. (A linear-evaluation sketch appears after this table.)
Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments (e.g., GPU models, CPU types, memory, or cloud instance types).
Software Dependencies | No | The paper mentions PyTorch (Paszke et al., 2019) as the implementation framework but does not list specific software versions or other dependencies.
Experiment Setup | Yes | All three encoders share the same AlexNet-based architecture (Krizhevsky et al., 2012), modified to map input images to 2-dimensional vectors in S^1. Both predictive and contrastive learning use standard data augmentations to augment the dataset and sample positive pairs. ... We optimize a total of 306 STL-10 encoders, 64 NYU-DEPTH-V2 encoders, 45 IMAGENET-100 encoders, and 108 BOOKCORPUS encoders without supervision. The encoders are optimized w.r.t. weighted combinations of L_contrastive, L_align, and/or L_uniform, with varying (possibly zero) weights on the three losses; loss hyperparameters (τ for L_contrastive, α for L_align, and t for L_uniform); batch size (affecting the number of (negative) pairs for L_contrastive and L_uniform); embedding dimension; number of training epochs and learning rate; and initialization (from scratch vs. a pretrained encoder). (A combined-objective sketch appears after this table.)
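
For reference alongside the Pseudocode row, here is a minimal PyTorch sketch consistent with the paper's loss definitions: L_align is the mean alpha-powered distance between positive-pair embeddings, and L_uniform is the log of the average pairwise Gaussian potential within a batch. Function and variable names here are illustrative, not copied from the paper's Figure 5.

```python
import torch

def align_loss(x, y, alpha=2):
    # x, y: L2-normalized embeddings of positive pairs, shape [batch, dim].
    # Mean of ||f(x) - f(y)||_2^alpha over the positive pairs.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # x: L2-normalized embeddings, shape [batch, dim].
    # Log of the mean Gaussian potential exp(-t * ||u - v||^2) over all pairs.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```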
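The Dataset Splits row mentions model selection by linear-classifier accuracy under 5-fold cross validation on the training set. A sketch of that evaluation protocol, assuming scikit-learn and a logistic-regression probe (the paper does not specify which linear classifier or library is used), could look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_eval_5fold(features: np.ndarray, labels: np.ndarray) -> float:
    # features: frozen encoder outputs for the training set, shape [N, dim]
    # labels: class labels, shape [N]
    # Returns mean 5-fold cross-validation accuracy of a linear probe.
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, features, labels, cv=5, scoring="accuracy").mean()
```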
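The Experiment Setup row describes optimizing weighted combinations of L_contrastive, L_align, and L_uniform. The sketch below illustrates such a combined objective, reusing align_loss and uniform_loss from the first sketch; the in-batch negative sampling for the contrastive term and the averaging of the uniformity term over both branches are assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, tau=0.5):
    # x, y: L2-normalized positive-pair embeddings, shape [batch, dim].
    # The other samples' y embeddings in the batch serve as negatives.
    logits = x @ y.t() / tau
    labels = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, labels)

def combined_loss(x, y, w_contrastive=0.0, w_align=1.0, w_uniform=1.0,
                  tau=0.5, alpha=2, t=2):
    # Weighted (possibly zero-weighted) combination of the three losses.
    # align_loss and uniform_loss are as defined in the first sketch above.
    loss = x.new_zeros(())
    if w_contrastive:
        loss = loss + w_contrastive * contrastive_loss(x, y, tau)
    if w_align:
        loss = loss + w_align * align_loss(x, y, alpha)
    if w_uniform:
        # Uniformity computed on each side of the pair and averaged.
        loss = loss + w_uniform * (uniform_loss(x, t) + uniform_loss(y, t)) / 2
    return loss
```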