Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Attraction-Repulsion Spectrum in Neighbor Embeddings

Authors: Jan Niklas Böhm, Philipp Berens, Dmitry Kobak

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Here we empirically show that changing the balance between the attractive and the repulsive forces in t-SNE using the exaggeration parameter yields a spectrum of embeddings, which is characterized by a simple trade-off: stronger attraction can better represent continuous manifold structures, while stronger repulsion can better represent discrete cluster structures and yields higher kNN recall. We find that UMAP embeddings correspond to t-SNE with increased attraction; mathematical analysis shows that this is because the negative sampling optimization strategy employed by UMAP strongly lowers the effective repulsion. Likewise, ForceAtlas2, commonly used for visualizing developmental single-cell transcriptomic data, yields embeddings corresponding to t-SNE with the attraction increased even more. At the extreme of this spectrum lie Laplacian eigenmaps. Our results demonstrate that many prominent neighbor embedding algorithms can be placed onto the attraction-repulsion spectrum, and highlight the inherent trade-offs between them.
Researcher Affiliation Academia Jan Niklas Böhm EMAIL Philipp Berens EMAIL Dmitry Kobak EMAIL Institute for Ophthalmic Research, University of Tübingen Otfried-Müller-Str. 25; 72076 Tübingen, Germany
Pseudocode No The paper describes algorithms using mathematical equations and outlines their processes in natural language, but does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code Yes All our code is available at https://github.com/berenslab/ne-spectrum.
Open Datasets Yes The brain organoid data sets (Kanton et al., 2019) were downloaded from https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7552/ in form of UMI counts and metadata tables. The hydra data set (Figure A7; Siebert et al., 2019) was downloaded in form of UMI counts from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121617. The zebrafish data set (Figure A8; Wagner et al., 2018b) was downloaded in form of UMI counts from https://kleintools.hms.harvard.edu/paper_websites/wagner_zebrafish_timecourse2018/WagnerScience2018.h5ad. The adult mouse cortex data set (Figure A9; Tasic et al., 2018) was downloaded in form of read counts from http://celltypes.brain-map.org/api/v2/well_known_file_download/694413985 and http://celltypes.brain-map.org/api/v2/well_known_file_download/694413179 for the VISp and ALM cortical areas, respectively. Fashion and Kuzushiji MNIST were downloaded via OpenML with the keys Fashion-MNIST (https://www.openml.org/d/40996) and Kuzushiji-MNIST (https://www.openml.org/d/41982), respectively. Kannada MNIST was downloaded from https://github.com/vinayprabhu/Kannada_MNIST.
Dataset Splits Yes To quantify this observation, we computed distance correlations (Székely et al., 2007) between UMAP & FA2 embeddings and t-SNE embeddings with various values of ρ ∈ [1, 100] (Figure 8). We found that for most data sets the highest correlation between UMAP and t-SNE layouts was achieved at 4 ≤ ρ < 15 (Figure 8a). For FA2, the highest correlation was typically achieved at 20 < ρ < 80 (Figure 8b). In both cases, the maximum correlations were above 0.94, indicating very similar layouts. Whereas the exact value of ρ yielding the maximum correlation varied between data sets, the correlation values at ρ = 4 for UMAP and at ρ = 30 for FA2 were always high and close to the maximum correlations. We also observed that ρ = 4 yielded a good match to UMAP independent of the sample size (Figure A10). Note that for all three algorithms we used all default parameters (apart from always using the same PCA initialization and fixing a = b = 1 in UMAP), confirming that the differences between t-SNE and UMAP in affinities and in the value of ϵ in the loss function do not play a large role, at least for our data sets. We used the shared PCA initialization to roughly align the embeddings and make distance correlations more meaningful, but all conclusions stayed the same if we used default UMAP/FA2 initializations (see Section 3.6). To quantify this effect, we computed the fraction of k = 15 nearest neighbors in high dimensions that remain among the nearest neighbors in the embedding (kNN recall). To compute it for a given data point, we found 15 points with the largest affinities in the symmetrized affinity matrix, and determined what fraction of them is among the 15 exact nearest neighbors in the embedding. This was averaged over 10 000 randomly selected points. We found that as ρ increased, the local neighborhood became more and more distorted (Figure 7).
For the MNIST data set, the kNN recall of default t-SNE (ρ = 1) was 0.34; with ρ = 4 it went down to 0.12; with ρ = 30 it further dropped to 0.06. Distance correlation (Székely et al., 2007) was computed using the dcor package (Carreño, 2017) on a random subset (n = 5 000) of the data.
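The kNN-recall procedure quoted above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: the function name is made up, scikit-learn's NearestNeighbors stands in for whatever exact-neighbor search they used, and P is assumed to be a dense symmetrized affinity matrix.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_recall(P, embedding, k=15, n_samples=10_000, seed=0):
    """Fraction of the k highest-affinity neighbors (per row of the
    symmetrized affinity matrix P) that appear among the k exact
    nearest neighbors in the embedding, averaged over a random
    subset of points."""
    rng = np.random.default_rng(seed)
    n = embedding.shape[0]
    idx = rng.choice(n, size=min(n_samples, n), replace=False)

    # Exact k nearest neighbors in the embedding (first hit is the
    # query point itself, so ask for k + 1 and drop it).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
    _, emb_nbrs = nn.kneighbors(embedding[idx])
    emb_nbrs = emb_nbrs[:, 1:]

    recalls = []
    for row, i in zip(emb_nbrs, idx):
        # k points with the largest affinities to point i.
        top_aff = np.argsort(P[i])[::-1][:k]
        recalls.append(len(set(top_aff) & set(row)) / k)
    return float(np.mean(recalls))
```

If the diagonal of P is set to -inf so that a point is never its own neighbor, a perfect embedding (one that preserves all k-neighborhoods) yields a recall of exactly 1.0.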
Hardware Specification No The paper does not provide specific details about the hardware used, such as GPU or CPU models, memory, or specific cloud instance types.
Software Dependencies Yes All experiments were performed in Python. We ran all packages with default parameters, unless specified otherwise. We used openTSNE 0.6.0 (Poličar et al., 2019), a Python reimplementation of FIt-SNE (Linderman et al., 2019). We used UMAP 0.5.1 with Cauchy similarity kernel (by setting a = b = 1). For FA2 we used the fa2 package (Chippada, 2017), which employs a Barnes–Hut approximation to speed up computation of the repulsive forces. LE was computed using the scikit-learn (Pedregosa et al., 2011) implementation, using the SpectralEmbedding class and the PyAMG solver 4.0 (Olson and Schroder, 2018).
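For orientation, the settings named above map onto these constructors in the public APIs of the listed versions (openTSNE 0.6, umap-learn 0.5, fa2, scikit-learn). This is a configuration sketch, not the authors' driver script; the exaggeration value shown is arbitrary.

```python
# Configuration fragment matching the quoted dependency list (illustrative).
from openTSNE import TSNE                       # openTSNE 0.6.0
from umap import UMAP                           # umap-learn 0.5.1
from fa2 import ForceAtlas2                     # fa2 (Chippada, 2017)
from sklearn.manifold import SpectralEmbedding  # scikit-learn LE

rho = 4  # example exaggeration value, not prescribed by the quote

tsne = TSNE(exaggeration=rho)                   # exaggeration controls attraction
umap = UMAP(a=1.0, b=1.0)                       # Cauchy similarity kernel
fa2_layout = ForceAtlas2(barnesHutOptimize=True)  # Barnes–Hut repulsion
le = SpectralEmbedding(n_components=2, eigen_solver="amg")  # uses PyAMG
```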
Experiment Setup Yes The learning rate was set to η = n / max(ρ, ρ_early) as suggested by Belkina et al. (2019). By default, all algorithms were optimized for 750 iterations. Unless stated otherwise, we used principal component analysis (PCA) initialization to remove any differences between algorithms due to initialization strategies (Kobak and Linderman, 2021). For t-SNE, the initialization was always scaled to have a standard deviation of 0.0001, as suggested by Kobak and Berens (2019) and is default in openTSNE (Poličar et al., 2019). For UMAP, the initialization was scaled to have the range [−10, 10], as is default in the original implementation. For ForceAtlas2, we scaled the initialization to have a standard deviation of 10 000 to approximately match the scale of final ForceAtlas2 embeddings (we experimented with different values and found this setting to work well and avoid convergence problems). Note that Figure 5 is an exception and uses random initialization. By default, UMAP uses LE (and not PCA) initialization, whereas the fa2 implementation uses random initialization. On the data sets analyzed here, we did not observe any qualitative difference as a result of our non-default (PCA) initialization (cf. Figure 3a,b). UMAP (panels b–f) was run for 3000 epochs to ensure convergence, as large m introduces noise in the optimization.
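The learning-rate rule and the t-SNE initialization scaling described above are simple enough to sketch directly. Both function names are invented for illustration, and the default early-exaggeration value ρ_early = 12 is an assumption taken from openTSNE's defaults, not stated in the quote.

```python
import numpy as np

def tsne_learning_rate(n, rho, rho_early=12):
    # η = n / max(ρ, ρ_early), following Belkina et al. (2019).
    # rho_early = 12 is assumed here (openTSNE's default early exaggeration).
    return n / max(rho, rho_early)

def scaled_pca_init(X, std=1e-4, n_components=2):
    # PCA initialization rescaled to a target standard deviation
    # (0.0001 for t-SNE in the quoted setup).
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    init = X @ Vt[:n_components].T          # project onto top PCs
    return init / init.std() * std          # rescale overall std
```

Note how the rule couples the learning rate to the exaggeration: for n = 70 000 points, ρ = 1 gives η = 70 000/12 while ρ = 30 gives the much smaller η = 70 000/30.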