Bringing UMAP Closer to the Speed of Light with GPU Acceleration

Authors: Corey J. Nolet, Victor Lafargue, Edward Raff, Thejaswi Nanditale, Tim Oates, John Zedlewski, Joshua Patterson

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare the execution time and correctness of GPUMAP and our implementation against the multi-core implementation of UMAP-learn on CPU.
Researcher Affiliation | Collaboration | Corey J. Nolet (1,2), Victor Lafargue (1), Edward Raff (2,3), Thejaswi Nanditale (1), Tim Oates (2), John Zedlewski (1), Joshua Patterson (1); affiliations: 1 Nvidia, 2 University of Maryland, Baltimore County, 3 Booz Allen Hamilton
Pseudocode | No | The paper describes algorithmic steps in narrative form and includes diagrams (e.g., Figure 1), but it does not contain any structured pseudocode blocks or sections explicitly labeled as "Algorithm".
Open Source Code | Yes | Our implementation has been made publicly available as part of the open source RAPIDS cuML library (https://github.com/rapidsai/cuml).
Open Datasets | Yes | Table 3: Datasets used in experiments: Digits (Garris et al. 1994) ... Fashion MNIST (Xiao et al. 2017) ... MNIST (Deng 2012) ... CIFAR-100 (Krizhevsky 2009) ... COIL-20 (Nene et al. 1996) ... scRNA (Travaglini et al. 2019) ... Google News Word2vec (Mikolov et al. 2013a)
Dataset Splits | No | The paper describes evaluating unsupervised and supervised training modes and distributed inference on various datasets, and mentions "embedding the remaining 97% of the dataset over 16 separate workers" for distributed inference. However, it does not provide explicit training, validation, and test split percentages or sample counts needed to reproduce the experiments.
Hardware Specification | Yes | All experiments were conducted on a single DGX1 containing 8 Nvidia GV100 GPUs with Dual Intel Xeon 20-core CPUs.
Software Dependencies | No | The paper mentions several software components and libraries used, such as "Numpy or Pandas", "CuPy", "Numba", "RAPIDS cuDF", "Thrust", "cuSparse", "cuGraph", "cuML", "FAISS library", "Scikit-learn", "Dask library", and the "Unified-Communications-X library (UCX)". However, it does not provide specific version numbers for these software dependencies, which are necessary for a reproducible setup.
Experiment Setup | Yes | UMAP-learn was configured to take advantage of all the available threads on the machine. The trustworthiness score was computed with UMAP's default of n_neighbors = 15. When n_components is small enough, such as a few hundred, we use shared memory to create a small local cache per compute thread, accumulating the updates for each source vertex from multiple negative samples before writing the results atomically to global memory. We have measured performance gains of 10% for this stage when n_components = 2 to 56% when n_components = 16 and expect the performance benefits to continue increasing in proportion to n_components.
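
The sketches below illustrate a few of the table entries in runnable form; each is a hedged approximation rather than the paper's own code. For the Research Type comparison, a minimal timing sketch, assuming both umap-learn and RAPIDS cuML are installed and using placeholder data rather than the paper's benchmark datasets:

```python
# Sketch: timing a CPU (umap-learn) run against a GPU (cuML) run on the same data.
# The random data and parameter choices are placeholders, not the paper's benchmark setup.
import time

import numpy as np
import umap                                   # CPU reference implementation (umap-learn)
from cuml.manifold import UMAP as cuUMAP      # RAPIDS cuML GPU implementation

X = np.random.rand(20_000, 128).astype(np.float32)   # placeholder dataset

t0 = time.perf_counter()
cpu_emb = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(X)
print(f"umap-learn (CPU): {time.perf_counter() - t0:.1f} s")

t0 = time.perf_counter()
gpu_emb = cuUMAP(n_neighbors=15, n_components=2).fit_transform(X)
print(f"cuML (GPU):       {time.perf_counter() - t0:.1f} s")
```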
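
For the Open Datasets entry, several of the listed datasets are available through standard loaders; the sketch below pulls Digits, MNIST, and Fashion-MNIST via scikit-learn (the OpenML dataset names are assumptions about public mirrors, not something the paper specifies):

```python
# Sketch: fetching three of the listed open datasets through scikit-learn.
# The OpenML names ("mnist_784", "Fashion-MNIST") are assumed public mirrors, not cited by the paper.
from sklearn.datasets import fetch_openml, load_digits

digits = load_digits()                                          # 1,797 x 64 handwritten digits
mnist = fetch_openml("mnist_784", version=1, as_frame=False)    # 70,000 x 784 pixel vectors
fashion = fetch_openml("Fashion-MNIST", version=1, as_frame=False)

print(digits.data.shape, mnist.data.shape, fashion.data.shape)
```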
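
For the Dataset Splits entry, the quoted 97% figure suggests a fit-on-a-sample, transform-the-rest workflow; the sketch below shows one way such a partition could be expressed, with the 3%/97% ratio inferred from that single sentence rather than documented by the paper:

```python
# Sketch: fit UMAP on a small sample, then embed the remaining rows with transform().
# The 3%/97% partition is inferred from the quoted sentence, not a documented split.
import numpy as np
from cuml.manifold import UMAP as cuUMAP

X = np.random.rand(100_000, 50).astype(np.float32)   # placeholder data
idx = np.random.default_rng(0).permutation(len(X))
n_fit = int(0.03 * len(X))                            # roughly 3% used to fit the model

reducer = cuUMAP(n_neighbors=15, n_components=2)
reducer.fit(X[idx[:n_fit]])
rest_emb = reducer.transform(X[idx[n_fit:]])          # embed the remaining ~97%
print(rest_emb.shape)
```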
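
For the Software Dependencies entry, since no versions are pinned, a reproducer would typically record the versions present in their own environment; the sketch below logs a few of the packages named above (the import names, e.g. ucp for UCX-Py, are assumptions about how those components are packaged):

```python
# Sketch: recording versions of the dependencies named above for a reproducible setup.
# Import names (e.g. "ucp" for UCX-Py, "faiss" for the FAISS library) are assumptions.
import importlib

for pkg in ["numpy", "pandas", "cupy", "numba", "cudf", "cuml",
            "cugraph", "faiss", "sklearn", "dask", "ucp"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}=={getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```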
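
For the Experiment Setup entry, scikit-learn exposes a trustworthiness function that can be applied with the same n_neighbors = 15 default; the sketch below is a minimal illustration on placeholder data, not the paper's evaluation harness:

```python
# Sketch: scoring an embedding with trustworthiness at the default n_neighbors = 15.
# The toy data is a placeholder; the paper's own scoring code is not reproduced here.
import numpy as np
from sklearn.manifold import trustworthiness
from cuml.manifold import UMAP as cuUMAP

X = np.random.rand(5_000, 30).astype(np.float32)      # placeholder data
emb = cuUMAP(n_neighbors=15, n_components=2).fit_transform(X)

score = trustworthiness(X, emb, n_neighbors=15)        # 1.0 = local neighborhoods fully preserved
print(f"trustworthiness: {score:.3f}")
```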