From $t$-SNE to UMAP with contrastive learning

Authors: Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, Dmitry Kobak

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize t-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to t-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) t-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete / local and continuous / global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.
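The quoted abstract contrasts noise-contrastive estimation (NCE) with UMAP-style negative sampling. The sketch below illustrates, under a Cauchy similarity kernel, how the two per-batch losses could look; the function names, the uniform-noise assumption, and the learnable log-normalization `log_Z` are illustrative choices, not the authors' exact implementation.

```python
import torch

def neg_sampling_loss(d_pos, d_neg):
    # UMAP-style negative sampling with Cauchy similarity phi(d) = 1/(1+d^2):
    # attract positive pairs, repel uniformly sampled negatives.
    phi_pos = 1.0 / (1.0 + d_pos**2)
    phi_neg = 1.0 / (1.0 + d_neg**2)
    return -(torch.log(phi_pos).mean() + torch.log(1.0 - phi_neg).mean())

def nce_loss(d_pos, d_neg, log_Z, m=5):
    # NCE with m noise samples per positive pair and a learnable
    # log-normalization log_Z; the uniform noise density is absorbed into log_Z.
    q_pos = (1.0 / (1.0 + d_pos**2)) * torch.exp(-log_Z)
    q_neg = (1.0 / (1.0 + d_neg**2)) * torch.exp(-log_Z)
    return (-torch.log(q_pos / (q_pos + m)).mean()
            - m * torch.log(m / (q_neg + m)).mean())
```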
Researcher Affiliation | Academia | Sebastian Damrich, IWR at Heidelberg University, sebastian.damrich@uni-tuebingen.de; Jan Niklas Böhm, University of Tübingen, jan-niklas.boehm@uni-tuebingen.de; Fred A. Hamprecht, IWR at Heidelberg University, fred.hamprecht@iwr.uni-heidelberg.de; Dmitry Kobak, University of Tübingen, dmitry.kobak@uni-tuebingen.de
Pseudocode | Yes | An extensive discussion of our implementation is included in Supp. K and in Alg. S1. Algorithm S1: Batched contrastive neighbor embedding algorithm
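Alg. S1 itself is in the supplement; as a rough illustration of what a batched contrastive neighbor-embedding loop with uniform negative sampling could look like (not the authors' code; the function name, variable names, and the inlined negative-sampling loss are assumptions):

```python
import torch

def fit_embedding(embedding, knn_edges, n_epochs=750, batch_size=1024, m=5, lr=1.0):
    # embedding: (n, 2) tensor with requires_grad=True; knn_edges: (E, 2) long tensor
    # of symmetrized kNN pairs. Positives come from the kNN graph, negatives are
    # drawn uniformly; the learning rate is annealed linearly to 0.
    opt = torch.optim.SGD([embedding], lr=lr)
    n = embedding.shape[0]
    for epoch in range(n_epochs):
        opt.param_groups[0]["lr"] = lr * (1.0 - epoch / n_epochs)
        for batch in torch.randperm(len(knn_edges)).split(batch_size):
            i, j = knn_edges[batch].T                       # positive pairs
            neg = torch.randint(0, n, (len(batch), m))      # m negatives per pair
            d_pos = (embedding[i] - embedding[j]).norm(dim=1)
            d_neg = (embedding[i, None] - embedding[neg]).norm(dim=2)
            phi_pos = 1.0 / (1.0 + d_pos**2)                # Cauchy similarities
            phi_neg = 1.0 / (1.0 + d_neg**2)
            loss = -(torch.log(phi_pos).mean() + torch.log(1.0 - phi_neg).mean())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return embedding
```

In practice one would initialize `embedding` (e.g. from PCA) and pass the symmetrized kNN edge list; the non-parametric defaults quoted further down (batch size 1024, 750 epochs, m = 5, SGD without momentum) are what this sketch mirrors.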
Open Source Code | Yes | Our code is available at https://github.com/berenslab/contrastive-ne and https://github.com/hci-unihd/cl-tsne-umap.
Open Datasets | Yes | We used the well-known MNIST (LeCun et al., 1998) dataset for most of our experiments... The Kuzushiji-49 dataset (Clanuwat et al., 2018) was downloaded from https://github.com/rois-codh/kmnist... The SimCLR experiments were performed on the CIFAR-10 (Krizhevsky, 2009) dataset...
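For context, the quoted datasets can be loaded roughly as follows (a minimal sketch using torchvision; the root path and the flattening/scaling are assumptions, and Kuzushiji-49 is distributed as .npz files in the linked rois-codh/kmnist repository rather than through torchvision):

```python
import numpy as np
from torchvision import datasets

mnist = datasets.MNIST(root="data", train=True, download=True)
X_mnist = mnist.data.numpy().reshape(len(mnist), -1) / 255.0      # (60000, 784)

cifar = datasets.CIFAR10(root="data", train=True, download=True)
X_cifar = np.asarray(cifar.data).reshape(len(cifar), -1) / 255.0  # (50000, 3072)
```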
Dataset Splits | Yes | The ResNet was trained on the combined CIFAR-10 train and test sets. When evaluating the accuracy, we froze the backbone, trained the classifier on the train set, and evaluated its accuracy on the test set.
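A minimal sketch of the frozen-backbone feature extraction implied by this split protocol (function and variable names are mine, not the paper's):

```python
import torch

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    # Freeze the trained SimCLR backbone and collect features and labels;
    # the downstream classifier is then fit on the train split only and
    # scored on the test split.
    backbone.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()
```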
Hardware Specification | Yes | We ran our neighbor embedding experiments on a machine with 56 Intel(R) Xeon(R) Gold 6132 CPUs @ 2.60GHz, 502 GB RAM and 10 NVIDIA TITAN Xp GPUs. The SimCLR experiments were conducted on a SLURM cluster node with 8 cores of an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and an NVIDIA V100 GPU with a RAM limit of 54 GB.
Software Dependencies | Yes | All contrastive embeddings were computed with our PyTorch (Paszke et al., 2019) implementation... The t-SNE plots were created with the openTSNE (Poličar et al., 2019) package (version 0.6.1)... We used PyKeOps (Charlier et al., 2021) to compute the exact skNN graph... All PCAs were computed with sklearn (Pedregosa et al., 2011)... We used sklearn's KNeighborsClassifier... and sklearn's LogisticRegression classifier with the SAGA solver (Defazio et al., 2014)... The TriMap plots in Supp. F were computed with version 1.1.4 of the TriMap package by Amid & Warmuth (2019).
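As an illustration of how the quoted sklearn classifiers are typically applied to 2D embeddings (the placeholder data and the value of k are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Placeholder embeddings and labels standing in for real train/test coordinates.
rng = np.random.default_rng(0)
Z_train, y_train = rng.normal(size=(1000, 2)), rng.integers(0, 10, size=1000)
Z_test, y_test = rng.normal(size=(200, 2)), rng.integers(0, 10, size=200)

knn_acc = KNeighborsClassifier(n_neighbors=10).fit(Z_train, y_train).score(Z_test, y_test)
lin_acc = LogisticRegression(solver="saga", max_iter=200).fit(Z_train, y_train).score(Z_test, y_test)
```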
Experiment Setup | Yes | Our defaults were a batch size of 1024, linear learning rate annealing from 1 (non-parametric) or 0.001 (parametric) to 0... 750 epochs... and m = 5 noise samples... Non-parametric runs were optimized with SGD without momentum and parametric runs with the Adam optimizer (Kingma & Ba, 2015). Parametric runs used the same feed-forward neural net architecture as the reference parametric UMAP implementation. That is, four layers with dimensions input dimension → 100 → 100 → 100 → 2 with ReLU activations in all but the last one. For the SimCLR experiments, we trained the model for 1000 epochs, of which we used 5 epochs for warmup. The learning rate during warmup was linearly interpolated from 0 to the initial learning rate. After the warmup epochs, we annealed the learning rate with a cosine schedule (without restarts) to 0 (Loshchilov & Hutter, 2017). We optimized the model parameters with SGD and momentum 0.9. We used the same data augmentations as in Chen et al. (2020) and their recommended batch size of 1024. We used a ResNet18 (He et al., 2016) as the backbone and a projection head consisting of two linear layers (512 → 1024 → 128) with a ReLU activation in-between.
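Two pieces of this setup lend themselves to short sketches: the parametric embedding network and the warmup-plus-cosine learning-rate schedule used for the SimCLR runs. The helper names below are mine; only the layer sizes, warmup length, epoch counts, and optimizer settings come from the quote.

```python
import math
import torch
import torch.nn as nn

def make_embedder(in_dim):
    # Feed-forward net with the quoted layer sizes:
    # in_dim -> 100 -> 100 -> 100 -> 2, ReLU on all but the last layer.
    return nn.Sequential(
        nn.Linear(in_dim, 100), nn.ReLU(),
        nn.Linear(100, 100), nn.ReLU(),
        nn.Linear(100, 100), nn.ReLU(),
        nn.Linear(100, 2),
    )

def warmup_cosine_schedule(optimizer, total_epochs=1000, warmup_epochs=5):
    # Linear warmup from 0 to the initial learning rate over the first 5 epochs,
    # then cosine annealing (without restarts) down to 0.
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return epoch / warmup_epochs
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Example wiring for a parametric run (Adam with lr 0.001, as quoted above);
# the input dimension 784 (flattened MNIST) is an assumption.
model = make_embedder(in_dim=784)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```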