BanditPAM++: Faster $k$-medoids Clustering
Authors: Mo Tiwari, Ryan Kang, Donghyun Lee, Sebastian Thrun, Chris Piech, Ilan Shomorony, Martin J. Zhang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Empirical Results): "First, we demonstrate that all algorithms achieve the same clustering solution and loss as BanditPAM across a variety of datasets and dataset sizes. In particular, this implies that BanditPAM++ matches prior state-of-the-art in clustering quality. Next, we investigate the scaling of all four algorithms in both $n$ and $k$ across a variety of datasets and metrics. We present our results in both sample complexity and wall-clock runtime. BanditPAM++ outperforms BanditPAM by up to 10×. Furthermore, our results demonstrate that each of the VA and PIC techniques improves the runtime of the original BanditPAM algorithm." |
| Researcher Affiliation | Academia | Mo Tiwari, Stanford University (motiwari@stanford.edu); Ryan Kang*, Stanford University (txryank@stanford.edu); Donghyun Lee*, University College London (donghyun.lee.21@ucl.ac.uk); Sebastian Thrun, Stanford University (thrun@stanford.edu); Chris Piech, Stanford University (piech@cs.stanford.edu); Ilan Shomorony, University of Illinois at Urbana-Champaign (ilans@illinois.edu); Martin Jinye Zhang, Carnegie Mellon University (martinzh@andrew.cmu.edu) |
| Pseudocode | Yes | Algorithm 1: BanditPAM++ SWAP Step ($f_j(\mathcal{D}_j, \{a_1, \ldots, a_t\})$, $\delta$, $\sigma_x$, permutation $\pi$ of $[n]$) |
| Open Source Code | Yes | Finally, we provide a high-performance C++ implementation of BanditPAM++, callable from Python and R, that may be of interest to practitioners at https://github.com/motiwari/BanditPAM. (A usage sketch appears below the table.) |
| Open Datasets | Yes | We conduct experiments on several public, real-world datasets to evaluate BanditPAM++'s performance: the MNIST dataset [13], the CIFAR10 dataset [12], and the 20 Newsgroups dataset [19]. |
| Dataset Splits | No | The paper describes using subsamples of datasets and mentions '10,000 training posts' for 20 Newsgroups, but does not provide explicit details about train/validation/test dataset splits, exact percentages, or sample counts for its own experimental setup. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions a 'high-performance C++ implementation, callable from Python and R' and the use of 'a sentence transformer from Hugging Face', but it does not specify any version numbers for these software components or other libraries. |
| Experiment Setup | Yes | In all experiments using the PIC technique, we allowed the algorithm to store up to 1,000 distance computations per point. (A caching sketch appears below the table.) |
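The Open Source Code row points to the authors' repository, which exposes the C++ core through a `banditpam` Python package. The sketch below follows the usage pattern in that project's README; the constructor arguments `n_medoids` and `algorithm` and the loss string `"L2"` come from the README, while the `algorithm="BanditPAM++"` value is an assumption that may differ in your installed version.

```python
# Hedged usage sketch for the banditpam Python package (pip install banditpam).
# Argument names follow the project README; verify against your installed version.
import numpy as np
from banditpam import KMedoids

# Toy data standing in for a real dataset such as an MNIST subsample.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype(np.float64)

# k-medoids with k = 5 under the L2 metric. The "BanditPAM++" algorithm string
# is an assumption here; earlier releases of the package accept "BanditPAM".
kmed = KMedoids(n_medoids=5, algorithm="BanditPAM++")
kmed.fit(X, "L2")

print("Medoid indices:", kmed.medoids)
```

Because the heavy lifting happens in C++, the same `KMedoids` object can be fit on subsamples of MNIST, CIFAR10, or embedded 20 Newsgroups posts, as in the paper's experiments.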
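The Experiment Setup row refers to the paper's permutation-invariant caching (PIC), under which distance computations are stored in a fixed permutation order so that the prefix computed for one sample size can be reused for any later, larger sample. The class below is a minimal, hypothetical illustration of that idea with the paper's budget of 1,000 cached distances per point; `PermutationCache` and its method are invented for this sketch and are not the authors' implementation.

```python
# Minimal, hypothetical sketch of per-point distance caching in the spirit of PIC.
# Illustration only; not the authors' C++ implementation.
import numpy as np

class PermutationCache:
    """Per-point distance cache keyed on a single shared permutation."""

    def __init__(self, X, budget=1000, seed=0):
        self.X = np.asarray(X, dtype=float)
        self.budget = budget  # up to 1,000 cached distances per point, as in the paper
        # One shared permutation: every point samples reference points in the same
        # order, so a cached prefix remains valid for any later, larger sample.
        self.perm = np.random.default_rng(seed).permutation(len(self.X))
        self.cache = {}  # point index -> distances to self.perm[:len(entry)]

    def distances(self, i, m):
        """Distances from point i to the first m points of the shared permutation."""
        cached = self.cache.get(i, np.empty(0))
        if m > len(cached):
            # Compute only the missing suffix, then store up to the budget.
            new_idx = self.perm[len(cached):m]
            new = np.linalg.norm(self.X[new_idx] - self.X[i], axis=1)
            cached = np.concatenate([cached, new])
            self.cache[i] = cached[: self.budget]
        return cached[:m]

# Example: the second call reuses the 100 cached distances and computes only 150 more.
cache = PermutationCache(np.random.default_rng(1).normal(size=(500, 8)))
d100 = cache.distances(3, 100)
d250 = cache.distances(3, 250)
assert np.allclose(d100, d250[:100])
```

Because every point walks the same permutation, the distances computed while evaluating one candidate swap form a prefix of those needed later, which is what makes the cache reusable across BUILD and SWAP iterations.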