Flattening a Hierarchical Clustering through Active Learning

Authors: Fabio Vitale, Anand Rajagopalan, Claudio Gentile

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known importance-sampling procedures to obtain regret and query complexity bounds. Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to linear-time implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines. (A minimal sketch of this pairwise-query setting is given after the table.)
Researcher Affiliation | Collaboration | Fabio Vitale, Department of Computer Science, INRIA Lille, France & Sapienza University of Rome, Italy, fabio.vitale@inria.fr; Anand Rajagopalan, Google Research NY, New York, USA, anandbr@google.com; Claudio Gentile, Google Research NY, New York, USA, cgentile@google.com
Pseudocode | Yes | Due to space limitations, WDP's pseudocode is given in Appendix B.3, but we have included an example of its execution in Figure 4.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for their methodology is open-source or publicly available.
Open Datasets | Yes | The comparison is carried out on the hierarchies produced by standard HC methods operating on the first n = 10000 datapoints in the well-known MNIST dataset from http://yann.lecun.com/exdb/mnist/, yielding a sample space of 10^8 pairs. (A hedged sketch of rebuilding these hierarchies is given after the table.)
Dataset Splits | Yes | We let such parameters vary across suitable ranges and, for each algorithm, picked the best performing value on a validation set of 500 labeled pairs.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., CPU or GPU models, memory, or cloud instances).
Software Dependencies | No | The paper mentions methodological components such as 'Euclidean distance combined with the single linkage (SING), median linkage (MED), and complete linkage (COMP) functions' and the MNIST dataset, but it does not specify version numbers for these or for any other software dependencies.
Experiment Setup | No | The paper describes how parameters were tuned ('picked the best performing value on a validation set of 500 labeled pairs') and how experiments were repeated ('repeated each experiment 10 times'). However, it does not provide specific hyperparameter values or detailed training configurations (e.g., learning rates, batch sizes, optimizer settings) in the main text.
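
The abstract's query model (pairwise similarity questions over the leaves of a hierarchical-clustering tree, with the target flat clustering given by a cut of the tree) can be illustrated with a short sketch. The following is a hypothetical toy example, not the authors' algorithm: it builds a single-linkage tree over synthetic data, defines a same-cluster oracle from planted labels, and scores candidate tree cuts against a small sample of queried pairs.

```python
# Hypothetical sketch of the query model only; NOT the paper's algorithm.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy data with a planted 3-cluster structure standing in for real features.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (3, 0), (0, 3))])
true_labels = np.repeat([0, 1, 2], 50)

# Hierarchy over the points (single linkage, one of the HC methods in the paper).
Z = linkage(X, method="single", metric="euclidean")

def same_cluster_oracle(i, j):
    """Pairwise similarity oracle: 1 if items i and j share a ground-truth label."""
    return int(true_labels[i] == true_labels[j])

def pairwise_disagreement(flat, pairs):
    """Fraction of queried pairs on which a flat clustering disagrees with the oracle."""
    return np.mean([int(flat[i] == flat[j]) != same_cluster_oracle(i, j)
                    for i, j in pairs])

# Flattening = cutting the tree; here we sweep cuts and score each candidate
# flat clustering on a small uniform sample of labeled pairs (a passive proxy).
sampled_pairs = [tuple(rng.choice(len(X), size=2, replace=False)) for _ in range(200)]
for k in (2, 3, 4, 5):
    flat = fcluster(Z, t=k, criterion="maxclust")
    print(f"k={k}: estimated pairwise error = {pairwise_disagreement(flat, sampled_pairs):.3f}")
```

An active learner in the paper's setting would choose which pairs to query adaptively rather than sampling them uniformly as in this passive proxy.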
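
For the experimental rows above (Open Datasets, Dataset Splits, Software Dependencies), a hedged reconstruction of the described setup might look as follows. The use of scikit-learn's fetch_openml to obtain MNIST, the uniform sampling of the 500 validation pairs, and the random seed are assumptions of convenience: the paper only names the dataset URL, the three linkage functions, and the size of the validation set.

```python
# Hedged reconstruction sketch (not the authors' code) of the described setup.
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import fetch_openml

n = 10_000  # first 10000 MNIST datapoints, as stated in the paper

# Assumption: MNIST obtained via OpenML; the paper points to yann.lecun.com/exdb/mnist/.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:n].astype(np.float64), y[:n].astype(int)

# One hierarchy per linkage function named in the experiments (SING, MED, COMP).
# NOTE: pairwise Euclidean distances over 10k points need a few hundred MB of RAM.
hierarchies = {
    name: linkage(X, method=method, metric="euclidean")
    for name, method in [("SING", "single"), ("MED", "median"), ("COMP", "complete")]
}

# 500 labeled pairs used as a validation set for picking algorithm parameters.
# The paper does not say how the pairs were drawn; uniform sampling is an assumption.
rng = np.random.default_rng(42)
val_pairs = rng.choice(n, size=(500, 2), replace=True)
val_labels = (y[val_pairs[:, 0]] == y[val_pairs[:, 1]]).astype(int)

print({name: Z.shape for name, Z in hierarchies.items()})
print("fraction of same-class validation pairs:", val_labels.mean())
```

Since the paper reports repeating each experiment 10 times, any such reconstruction would additionally average results over repeated draws of the queried pairs.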