Subquadratic High-Dimensional Hierarchical Clustering
Authors: Amir Abboud, Vincent Cohen-Addad, Hussein Houdrouge
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then provide experiments showing that these algorithms perform as well as the non-approximate version for classic classification tasks while achieving a significant speed-up. |
| Researcher Affiliation | Collaboration | Amir Abboud, IBM Research, amir.abboud@gmail.com; Vincent Cohen-Addad, CNRS & Sorbonne Université, vcohenad@gmail.com; Hussein Houdrouge, École Polytechnique, hussein.houdrouge@polytechnique.edu |
| Pseudocode | Yes | Pseudocode for our algorithm |
| Open Source Code | No | The paper mentions using and comparing against the FLANN library and scikit-learn, and states that their implementation is in C++, but it does not provide a link or any statement about releasing their specific code as open source. |
| Open Datasets | Yes | The main data used in these experiments are classic real-world datasets from the UCI repository and the scikit-learn library. Iris contains 150 points in 4 dimensions, Digits 1797 points in 64 dimensions, Boston 506 points in 13 dimensions, Cancer 569 points in 3 dimensions, and Newsgroup 11314 points in 2241 dimensions. |
| Dataset Splits | No | The paper mentions using classic real-world datasets for classification tasks but does not describe any train/validation/test splits. |
| Hardware Specification | Yes | We implemented our algorithm using C++11 on a 2.5 GHz 8-core CPU with 7.5 GiB of RAM under the Linux operating system. |
| Software Dependencies | No | The paper mentions using C++11, the FLANN library, and scikit-learn, but it does not specify version numbers for the libraries. |
| Experiment Setup | Yes | The main parameter that we have is ϵ, which determines the number of data structures to be used (recall that we have one approximate nearest-neighbor data structure for each (1 + ϵ)^i, representing the potential cluster sizes) and the sequence of merge values. Moreover, we make use of the FLANN library procedure for finding approximate nearest neighbors using KD-trees. This procedure takes two parameters: the number of trees t and the number of leaves visited f. The algorithm builds t randomized KD-trees over the dataset. The number-of-leaves parameter controls how many leaves of the KD-trees are visited before stopping the search and returning a solution. These parameters are precisely ϵ = 8, number of trees t = 2, and number of visited leaves f = 10. |
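
The datasets named in the Open Datasets row are all standard benchmarks. As a rough sketch only, and not the paper's (unspecified) preprocessing pipeline, they can be obtained with scikit-learn roughly as follows; the mapping of "Newsgroup" to the 20 Newsgroups training split is an assumption based on the matching point count.

```python
# A rough sketch (assumed, not the authors' pipeline) of loading the named
# datasets with scikit-learn. Preprocessing used in the paper is unspecified.
from sklearn.datasets import (
    load_iris, load_digits, load_breast_cancer, fetch_20newsgroups_vectorized,
)

datasets = {
    "Iris": load_iris().data,             # 150 points, 4 dimensions
    "Digits": load_digits().data,         # 1797 points, 64 dimensions
    "Cancer": load_breast_cancer().data,  # 569 points
}

# load_boston (506 points, 13 dimensions) was removed in scikit-learn 1.2,
# so it is only importable on older versions.
try:
    from sklearn.datasets import load_boston
    datasets["Boston"] = load_boston().data
except ImportError:
    pass

# "Newsgroup" (11314 points) plausibly corresponds to the 20 Newsgroups
# training split; the 2241-dimensional representation used in the paper is
# not described, so this vectorized loader is only an approximation.
datasets["Newsgroup"] = fetch_20newsgroups_vectorized(subset="train").data

for name, X in datasets.items():
    print(name, X.shape)
```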
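For the FLANN side of the Experiment Setup row, a minimal sketch of the reported configuration (t = 2 randomized KD-trees, f = 10 leaves visited per query) might look as follows. The use of the pyflann Python bindings rather than the authors' C++11 code, and the placeholder data shapes, are assumptions for illustration.

```python
# Minimal sketch (assumed; the authors used C++11, not these Python
# bindings) of the reported FLANN configuration: randomized KD-trees
# with trees = 2 and checks = 10 leaves visited per query.
import numpy as np
from pyflann import FLANN

rng = np.random.default_rng(0)
dataset = rng.random((1000, 64))  # placeholder data points
queries = rng.random((5, 64))     # placeholder query points

flann = FLANN()
# 'trees' sets the number of randomized KD-trees built over the dataset;
# 'checks' bounds how many leaves are visited before the search stops and
# an (approximate) nearest neighbor is returned.
indices, dists = flann.nn(
    dataset, queries, num_neighbors=1,
    algorithm="kdtree", trees=2, checks=10,
)
print(indices, dists)
```

The paper's ϵ parameter then sits on top of such queries: one approximate nearest-neighbor structure is maintained per scale (1 + ϵ)^i, with ϵ = 8 in the reported experiments.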