Subquadratic High-Dimensional Hierarchical Clustering
Authors: Amir Abboud, Vincent Cohen-Addad, Hussein Houdrouge
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then provide experiments showing that these algorithms perform as well as the non-approximate version for classic classification tasks while achieving a significant speed-up. |
| Researcher Affiliation | Collaboration | Amir Abboud, IBM Research, amir.abboud@gmail.com; Vincent Cohen-Addad, CNRS & Sorbonne Université, vcohenad@gmail.com; Hussein Houdrouge, École Polytechnique, hussein.houdrouge@polytechnique.edu |
| Pseudocode | Yes | Pseudocode for our algorithm |
| Open Source Code | No | The paper mentions using and comparing against the FLANN library and scikit-learn, and states that their implementation is in C++, but it does not provide a link or any statement about releasing their specific code as open source. |
| Open Datasets | Yes | The main data used in these experiments are classic real-world datasets from the UCI repository and the scikit-learn library. Iris contains 150 points in 4 dimensions, Digits 1797 points in 64 dimensions, Boston 506 points in 13 dimensions, Cancer 569 points in 3 dimensions, and Newsgroup 11314 points in 2241 dimensions. |
| Dataset Splits | No | The paper mentions using classic real-world datasets for classification tasks but does not describe any train/validation/test splits. |
| Hardware Specification | Yes | We implemented our algorithm using C++11 on a 2.5 GHz 8-core CPU with 7.5 GiB of RAM under the Linux operating system. |
| Software Dependencies | No | The paper mentions using C++11, the FLANN library, and scikit-learn, but it does not specify version numbers for the libraries. |
| Experiment Setup | Yes | The main parameter that we have is ϵ, which determines the number of data structures to be used (recall that we have one approximate nearest-neighbor data structure for each (1 + ϵ)^i, representing the potential cluster sizes) and the sequence of merge values. Moreover, we make use of the FLANN library procedure for finding approximate nearest neighbors using KD-trees. This procedure takes two parameters: the number of trees t and the number of leaves visited f. The algorithm builds t randomized KD-trees over the dataset. The number-of-leaves parameter controls how many leaves of the KD-trees are visited before stopping the search and returning a solution. These parameters are precisely ϵ = 8, number of trees t = 2, and number of visited leaves f = 10. |
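
The datasets named in the Open Datasets row are all standard benchmarks. As a rough sketch only, and not the paper's (unspecified) preprocessing pipeline, they can be obtained with scikit-learn roughly as follows; the mapping of "Newsgroup" to the 20 Newsgroups training split is an assumption based on the matching point count.

```python
# A rough sketch (assumed, not the authors' pipeline) of loading the named
# datasets with scikit-learn. Preprocessing used in the paper is unspecified.
from sklearn.datasets import (
    load_iris, load_digits, load_breast_cancer, fetch_20newsgroups_vectorized,
)

datasets = {
    "Iris": load_iris().data,             # 150 points, 4 dimensions
    "Digits": load_digits().data,         # 1797 points, 64 dimensions
    "Cancer": load_breast_cancer().data,  # 569 points
}

# load_boston (506 points, 13 dimensions) was removed in scikit-learn 1.2,
# so it is only importable on older versions.
try:
    from sklearn.datasets import load_boston
    datasets["Boston"] = load_boston().data
except ImportError:
    pass

# "Newsgroup" (11314 points) plausibly corresponds to the 20 Newsgroups
# training split; the 2241-dimensional representation used in the paper is
# not described, so this vectorized loader is only an approximation.
datasets["Newsgroup"] = fetch_20newsgroups_vectorized(subset="train").data

for name, X in datasets.items():
    print(name, X.shape)
```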
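For the FLANN side of the Experiment Setup row, a minimal sketch of the reported configuration (t = 2 randomized KD-trees, f = 10 leaves visited per query) might look as follows. The use of the pyflann Python bindings rather than the authors' C++11 code, and the placeholder data shapes, are assumptions for illustration.

```python
# Minimal sketch (assumed; the authors used C++11, not these Python
# bindings) of the reported FLANN configuration: randomized KD-trees
# with trees = 2 and checks = 10 leaves visited per query.
import numpy as np
from pyflann import FLANN

rng = np.random.default_rng(0)
dataset = rng.random((1000, 64))  # placeholder data points
queries = rng.random((5, 64))     # placeholder query points

flann = FLANN()
# 'trees' sets the number of randomized KD-trees built over the dataset;
# 'checks' bounds how many leaves are visited before the search stops and
# an (approximate) nearest neighbor is returned.
indices, dists = flann.nn(
    dataset, queries, num_neighbors=1,
    algorithm="kdtree", trees=2, checks=10,
)
print(indices, dists)
```

The paper's ϵ parameter then sits on top of such queries: one approximate nearest-neighbor structure is maintained per scale (1 + ϵ)^i, with ϵ = 8 in the reported experiments.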