BanditPAM++: Faster $k$-medoids Clustering

Authors: Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin J. Zhang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From Section 6, "Empirical Results": "First, we demonstrate that all algorithms achieve the same clustering solution and loss as BanditPAM across a variety of datasets and dataset sizes. In particular, this implies that BanditPAM++ matches prior state-of-the-art in clustering quality. Next, we investigate the scaling of all four algorithms in both n and k across a variety of datasets and metrics. We present our results in both sample complexity and wall-clock runtime. BanditPAM++ outperforms BanditPAM by up to 10×. Furthermore, our results demonstrate that each of the VA and PIC techniques improves the runtime of the original BanditPAM algorithm."
Researcher Affiliation | Academia | Mo Tiwari, Stanford University (motiwari@stanford.edu); Ryan Kang*, Stanford University (txryank@stanford.edu); Donghyun Lee*, University College London (donghyun.lee.21@ucl.ac.uk); Sebastian Thrun, Stanford University (thrun@stanford.edu); Chris Piech, Stanford University (piech@cs.stanford.edu); Ilan Shomorony, University of Illinois at Urbana-Champaign (ilans@illinois.edu); Martin Jinye Zhang, Carnegie Mellon University (martinzh@andrew.cmu.edu)
Pseudocode | Yes | Algorithm 1: BanditPAM++ SWAP Step ($f_j(D_j, \{a_1, \ldots, a_t\})$, $\delta$, $\sigma_x$, permutation $\pi$ of $[n]$)
Open Source Code | Yes | "Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners at https://github.com/motiwari/BanditPAM." (A hedged usage sketch of the Python bindings appears after the table.)
Open Datasets | Yes | "We conduct experiments on several public, real-world datasets to evaluate BanditPAM++'s performance: the MNIST dataset [13], the CIFAR10 dataset [12], and the 20 Newsgroups dataset [19]."
Dataset Splits | No | The paper describes using subsamples of datasets and mentions "10,000 training posts" for 20 Newsgroups, but does not provide explicit train/validation/test splits, percentages, or sample counts for its own experimental setup.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU or CPU models, processor types, or memory amounts.
Software Dependencies | No | The paper mentions a "high-performance C++ implementation, callable from Python and R" and the use of "a sentence transformer from Hugging Face", but it does not specify version numbers for these components or for any other libraries.
Experiment Setup | Yes | "In all experiments using the PIC technique, we allowed the algorithm to store up to 1,000 distance computations per point." (An illustrative sketch of such a per-point distance cache appears after the table.)
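
As orientation for practitioners, below is a minimal usage sketch of the Python bindings from the repository linked in the Open Source Code row. It assumes the `banditpam` package and the `KMedoids` interface shown in the repository README; exact parameter and attribute names may differ between versions.

```python
# Minimal usage sketch of the BanditPAM Python bindings
# (https://github.com/motiwari/BanditPAM). Parameter and attribute
# names follow the repository README and may vary across versions.
import numpy as np
from banditpam import KMedoids

# Random data standing in for a real dataset, e.g. a subsample of MNIST.
X = np.random.randn(1000, 10)

# algorithm="BanditPAM" selects the bandit-based k-medoids solver.
kmed = KMedoids(n_medoids=5, algorithm="BanditPAM")
kmed.fit(X, "L2")  # "L2" denotes Euclidean distance

print(kmed.medoids)  # the k selected medoids
```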
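The 1,000-distances-per-point budget quoted in the Experiment Setup row refers to the paper's permutation-invariant caching (PIC): because reference points are sampled in the order of a fixed permutation $\pi$, distances computed in one iteration can be reused in later ones. The following is a minimal illustrative sketch of that idea under stated assumptions, not the authors' C++ implementation; all names here are hypothetical.

```python
# Illustrative sketch of permutation-invariant caching (PIC); hypothetical
# names, not the authors' implementation. Each point caches distances to
# the first points of a fixed permutation pi, up to a per-point budget,
# so samples drawn along pi are reusable across SWAP iterations.
import numpy as np

class PermutationCache:
    def __init__(self, X, cache_width=1000, seed=0):
        self.X = X
        rng = np.random.default_rng(seed)
        self.pi = rng.permutation(len(X))       # fixed permutation of [n]
        self.width = min(cache_width, len(X))   # budget, e.g. 1,000 per point
        self.cache = {}                         # point index -> cached distances

    def distances(self, j, num_samples):
        """Distances from point j to the first num_samples points along pi."""
        num_samples = min(num_samples, self.width)
        cached = self.cache.get(j, np.empty(0))
        if len(cached) < num_samples:
            # Compute only the missing entries, then extend the cache.
            idx = self.pi[len(cached):num_samples]
            new = np.linalg.norm(self.X[idx] - self.X[j], axis=1)
            cached = np.concatenate([cached, new])
            self.cache[j] = cached
        return cached[:num_samples]
```

Because every caller samples along the same permutation, a prefix of cached distances is valid for any later request, which is what makes the cache reusable across iterations.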