Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
Authors: Tongzhou Wang, Phillip Isola
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on standard vision and language datasets confirm the strong agreement between both metrics and downstream task performance. Directly optimizing for these two metrics leads to representations with comparable or better performance at downstream tasks than contrastive learning. In this section, we empirically verify the hypothesis that alignment and uniformity are desired properties for representations. We conduct extensive experiments with convolutional neural network (CNN) and recurrent neural network (RNN) based encoders on four popular representation learning benchmarks with distinct types of downstream tasks: STL-10, NYU-DEPTH-V2, IMAGENET-100, BOOKCORPUS. |
| Researcher Affiliation | Academia | Tongzhou Wang¹, Phillip Isola¹. ¹MIT Computer Science & Artificial Intelligence Lab (CSAIL). Correspondence to: Tongzhou Wang <tongzhou@mit.edu>. |
| Pseudocode | Yes | Due to their simple forms, these two losses can be implemented in PyTorch (Paszke et al., 2019) with less than 10 lines of code, as shown in Figure 5. Figure 5: PyTorch implementation of Lalign and Luniform. (A sketch of such an implementation appears after this table.) |
| Open Source Code | Yes | Code: github.com/SsnL/align_uniform. |
| Open Datasets | Yes | STL-10 (Coates et al., 2011) classification..., NYU-DEPTH-V2 (Nathan Silberman & Fergus, 2012) depth prediction..., IMAGENET-100 (100 randomly selected classes from IMAGENET) classification..., BOOKCORPUS (Zhu et al., 2015) RNN sentence encoder outputs... |
| Dataset Splits | Yes | Figure 3 summarizes the resulting distributions of validation set features. For each encoder, we measure the downstream task performance, and the Lalign and Luniform metrics on the validation set. For STL-10, the best result is selected by the accuracy of a linear classifier on encoder outputs, using 5-fold cross-validation on the training set. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments (e.g., GPU models, CPU types, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions PyTorch (Paszke et al., 2019) for the loss implementation, but does not list software versions or other dependencies. |
| Experiment Setup | Yes | All three encoders share the same AlexNet-based architecture (Krizhevsky et al., 2012), modified to map input images to 2-dimensional vectors in S^1. Both predictive and contrastive learning use standard data augmentations to augment the dataset and sample positive pairs. ... We optimize a total of 306 STL-10 encoders, 64 NYU-DEPTH-V2 encoders, 45 IMAGENET-100 encoders, and 108 BOOKCORPUS encoders without supervision. The encoders are optimized w.r.t. weighted combinations of Lcontrastive, Lalign, and/or Luniform, with varying (possibly zero) weights on the three losses; loss hyperparameters: τ for Lcontrastive, α for Lalign, and t for Luniform; batch size (affecting the number of (negative) pairs for Lcontrastive and Luniform); embedding dimension; number of training epochs and learning rate; and initialization (from scratch vs. a pretrained encoder). (A sketch of such a weighted objective appears after this table.) |
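
Figure 5 itself is not reproduced in this report. A minimal PyTorch sketch of the two losses, following the paper's definitions (Lalign averages the α-th power of the distance between positive-pair features; Luniform is the log of the mean pairwise Gaussian potential with temperature t), might look roughly like this; function and variable names are illustrative:

```python
import torch

def align_loss(x, y, alpha=2):
    # x, y: batches of L2-normalized features for positive pairs, shape (N, D).
    # Mean of the alpha-th power of the Euclidean distance between each pair.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # x: batch of L2-normalized features, shape (N, D).
    # Log of the mean pairwise Gaussian potential exp(-t * ||u - v||^2).
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```

Each loss reduces to a one-line expression over normalized feature batches, consistent with the paper's "less than 10 lines of code" claim.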
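The setup row describes optimizing encoders against weighted combinations of Lcontrastive, Lalign, and Luniform with possibly-zero weights. A hedged sketch of how such a combined objective could be assembled, reusing the two functions above; the InfoNCE-style contrastive formulation and all weight/hyperparameter defaults here are placeholders, not the paper's exact variant or grid:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x, y, tau=0.07):
    # Generic InfoNCE-style loss over positive pairs (x[i], y[i]);
    # other in-batch samples serve as negatives.
    logits = x @ y.t() / tau
    labels = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, labels)

def combined_loss(x, y, w_contrastive=0.0, w_align=1.0, w_uniform=1.0,
                  tau=0.07, alpha=2, t=2):
    # Weighted combination with possibly-zero weights, mirroring the sweep
    # described in the experiment setup.
    loss = 0.0
    if w_contrastive:
        loss = loss + w_contrastive * contrastive_loss(x, y, tau)
    if w_align:
        loss = loss + w_align * align_loss(x, y, alpha)
    if w_uniform:
        # One reasonable choice: average the uniformity term over both branches.
        loss = loss + w_uniform * (uniform_loss(x, t) + uniform_loss(y, t)) / 2
    return loss
```

Both sketches assume features have already been projected onto the unit hypersphere, e.g. via `F.normalize(features, dim=1)`, matching the paper's hyperspherical embedding setting.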