Bringing UMAP Closer to the Speed of Light with GPU Acceleration
Authors: Corey J. Nolet, Victor Lafargue, Edward Raff, Thejaswi Nanditale, Tim Oates, John Zedlewski, Joshua Patterson
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the execution time and correctness of GPUMAP and our implementation against the multi-core implementation of UMAP-learn on CPU. |
| Researcher Affiliation | Collaboration | Corey J. Nolet (1,2), Victor Lafargue (1), Edward Raff (2,3), Thejaswi Nanditale (1), Tim Oates (2), John Zedlewski (1), Joshua Patterson (1); 1: Nvidia, 2: University of Maryland, Baltimore County, 3: Booz Allen Hamilton |
| Pseudocode | No | The paper describes algorithmic steps in narrative form and includes diagrams (e.g., Figure 1), but it does not contain any structured pseudocode blocks or sections explicitly labeled as "Algorithm". |
| Open Source Code | Yes | Our implementation has been made publicly available as part of the open source RAPIDS cuML library (https://github.com/rapidsai/cuml). (A minimal usage sketch follows the table.) |
| Open Datasets | Yes | Table 3: Datasets used in experiments Digits (Garris et al. 1994) ... Fashion MNIST (Xiao et al. 2017) ... MNIST (Deng 2012) ... CIFAR-100 (Krizhevsky 2009) ... COIL-20 (Nene et al. 1996) ... scRNA (Travaglini et al. 2019) ... Google News Word2vec (Mikolov et al. 2013a) |
| Dataset Splits | No | The paper describes evaluating unsupervised and supervised training modes and distributed inference on various datasets, and mentions "embedding the remaining 97% of the dataset over 16 separate workers" for distributed inference. However, it does not provide explicit training, validation, and test dataset split percentages or sample counts for general reproducibility of experiments. |
| Hardware Specification | Yes | All experiments were conducted on a single DGX1 containing 8 Nvidia GV100 GPUs with Dual Intel Xeon 20-core CPUs. |
| Software Dependencies | No | The paper mentions several software components and libraries used, such as "Numpy or Pandas", "CuPy", "Numba", "RAPIDS cuDF", "Thrust", "cuSparse", "cuGraph", "cuML", "FAISS library", "Scikit-learn", "Dask library", and "Unified-Communications-X library (UCX)". However, it does not provide specific version numbers for these software dependencies, which are necessary for a reproducible setup. (A version-capture sketch follows the table.) |
| Experiment Setup | Yes | UMAP-learn was configured to take advantage of all the available threads on the machine. The trustworthiness score was computed with UMAP's default of n_neighbors = 15. When n_components is small enough, such as a few hundred, we use shared memory to create a small local cache per compute thread, accumulating the updates for each source vertex from multiple negative samples before writing the results atomically to global memory. We have measured performance gains of 10% for this stage when n_components = 2 to 56% when n_components = 16 and expect the performance benefits to continue increasing in proportion to n_components. (A trustworthiness sketch follows the table.) |
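
For the Open Source Code row, the sketch below shows one way the released cuML UMAP might be exercised. It is a minimal illustration, not the authors' benchmark code: the module path `cuml.manifold.UMAP`, the chosen parameters, and the use of scikit-learn's Digits loader are assumptions, and the API can differ between cuML releases.

```python
# Minimal sketch (not the authors' benchmark code): run the GPU UMAP from
# RAPIDS cuML on a small dataset. Assumes cuML is installed and a CUDA GPU
# is available; the exact API can vary between cuML releases.
import numpy as np
from sklearn.datasets import load_digits   # small stand-in for the Digits dataset
from cuml.manifold import UMAP             # GPU UMAP implementation from RAPIDS cuML

X, _ = load_digits(return_X_y=True)
X = np.ascontiguousarray(X, dtype=np.float32)  # cuML prefers contiguous float32 input

reducer = UMAP(n_neighbors=15, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)        # 2-D embedding, one row per sample
print(embedding.shape)                      # (1797, 2)
```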
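
Because the Software Dependencies row notes that no versions are pinned, the snippet below is one hedged way to record the installed versions of the libraries the paper names when re-running the experiments. The distribution names are assumptions and may differ by install channel.

```python
# Sketch: record the versions of the libraries named in the paper, since the
# text does not pin them. Distribution names are assumptions and can differ
# by install channel (e.g. faiss-gpu vs. faiss, cupy-cuda12x vs. cupy).
from importlib.metadata import version, PackageNotFoundError

packages = ["cudf", "cuml", "cugraph", "cupy", "numba",
            "faiss", "scikit-learn", "dask", "ucx-py", "umap-learn"]

for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed (or published under a different name)")
```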
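
For the Experiment Setup row, the paper evaluates trustworthiness with UMAP's default of n_neighbors = 15 but does not say which implementation computed the score. The sketch below uses scikit-learn's `trustworthiness` together with the CPU umap-learn package as one plausible reconstruction, not the authors' evaluation code.

```python
# Sketch of a trustworthiness check with n_neighbors = 15, using umap-learn
# (CPU) and scikit-learn's trustworthiness metric. One plausible way to
# reproduce the reported score, not necessarily the authors' exact pipeline.
import umap                                   # CPU reference implementation (umap-learn)
from sklearn.datasets import load_digits
from sklearn.manifold import trustworthiness

X, _ = load_digits(return_X_y=True)
embedding = umap.UMAP(n_neighbors=15, n_components=2, random_state=42).fit_transform(X)

score = trustworthiness(X, embedding, n_neighbors=15)
print(f"trustworthiness @ 15 neighbors: {score:.4f}")
```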