Flattening a Hierarchical Clustering through Active Learning

Authors: Fabio Vitale, Anand Rajagopalan, Claudio Gentile

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate active learning by pairwise similarity over the leaves of trees originating from hierarchical clustering procedures. In the realizable setting, we provide a full characterization of the number of queries needed to achieve perfect reconstruction of the tree cut. In the non-realizable setting, we rely on known importance-sampling procedures to obtain regret and query complexity bounds. Our algorithms come with theoretical guarantees on the statistical error and, more importantly, lend themselves to linear-time implementations in the relevant parameters of the problem. We discuss such implementations, prove running time guarantees for them, and present preliminary experiments on real-world datasets showing the compelling practical performance of our algorithms as compared to both passive learning and simple active learning baselines. (A minimal sketch of this pairwise-query setting is given after the table.)
Researcher Affiliation | Collaboration | Fabio Vitale, Department of Computer Science, INRIA Lille, France & Sapienza University of Rome, Italy, fabio.vitale@inria.fr; Anand Rajagopalan, Google Research NY, New York, USA, anandbr@google.com; Claudio Gentile, Google Research NY, New York, USA, cgentile@google.com
Pseudocode | Yes | Due to space limitations, WDP's pseudocode is given in Appendix B.3, but we have included an example of its execution in Figure 4.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for their methodology is open-source or publicly available.
Open Datasets | Yes | The comparison is carried out on the hierarchies produced by standard HC methods operating on the first n = 10000 datapoints in the well-known MNIST dataset from http://yann.lecun.com/exdb/mnist/, yielding a sample space of 10^8 pairs. (A hedged sketch of rebuilding these hierarchies is given after the table.)
Dataset Splits | Yes | We let such parameters vary across suitable ranges and, for each algorithm, picked the best performing value on a validation set of 500 labeled pairs.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., CPU or GPU models, memory, or cloud instances).
Software Dependencies | No | The paper mentions methodological components such as 'Euclidean distance combined with the single linkage (SING), median linkage (MED), and complete linkage (COMP) functions' and the MNIST dataset, but it does not specify version numbers for these or for any other software dependencies.
Experiment Setup | No | The paper describes how parameters were tuned ('picked the best performing value on a validation set of 500 labeled pairs') and how experiments were repeated ('repeated each experiment 10 times'). However, it does not provide specific hyperparameter values or detailed training configurations (e.g., learning rates, batch sizes, optimizer settings) in the main text.
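
The abstract's query model (pairwise similarity questions over the leaves of a hierarchical-clustering tree, with the target flat clustering given by a cut of the tree) can be illustrated with a short sketch. The following is a hypothetical toy example, not the authors' algorithm: it builds a single-linkage tree over synthetic data, defines a same-cluster oracle from planted labels, and scores candidate tree cuts against a small sample of queried pairs.

```python
# Hypothetical sketch of the query model only; NOT the paper's algorithm.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy data with a planted 3-cluster structure standing in for real features.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (3, 0), (0, 3))])
true_labels = np.repeat([0, 1, 2], 50)

# Hierarchy over the points (single linkage, one of the HC methods in the paper).
Z = linkage(X, method="single", metric="euclidean")

def same_cluster_oracle(i, j):
    """Pairwise similarity oracle: 1 if items i and j share a ground-truth label."""
    return int(true_labels[i] == true_labels[j])

def pairwise_disagreement(flat, pairs):
    """Fraction of queried pairs on which a flat clustering disagrees with the oracle."""
    return np.mean([int(flat[i] == flat[j]) != same_cluster_oracle(i, j)
                    for i, j in pairs])

# Flattening = cutting the tree; here we sweep cuts and score each candidate
# flat clustering on a small uniform sample of labeled pairs (a passive proxy).
sampled_pairs = [tuple(rng.choice(len(X), size=2, replace=False)) for _ in range(200)]
for k in (2, 3, 4, 5):
    flat = fcluster(Z, t=k, criterion="maxclust")
    print(f"k={k}: estimated pairwise error = {pairwise_disagreement(flat, sampled_pairs):.3f}")
```

An active learner in the paper's setting would choose which pairs to query adaptively rather than sampling them uniformly as in this passive proxy.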
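
For the experimental rows above (Open Datasets, Dataset Splits, Software Dependencies), a hedged reconstruction of the described setup might look as follows. The use of scikit-learn's fetch_openml to obtain MNIST, the uniform sampling of the 500 validation pairs, and the random seed are assumptions of convenience: the paper only names the dataset URL, the three linkage functions, and the size of the validation set.

```python
# Hedged reconstruction sketch (not the authors' code) of the described setup.
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import fetch_openml

n = 10_000  # first 10000 MNIST datapoints, as stated in the paper

# Assumption: MNIST obtained via OpenML; the paper points to yann.lecun.com/exdb/mnist/.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:n].astype(np.float64), y[:n].astype(int)

# One hierarchy per linkage function named in the experiments (SING, MED, COMP).
# NOTE: pairwise Euclidean distances over 10k points need a few hundred MB of RAM.
hierarchies = {
    name: linkage(X, method=method, metric="euclidean")
    for name, method in [("SING", "single"), ("MED", "median"), ("COMP", "complete")]
}

# 500 labeled pairs used as a validation set for picking algorithm parameters.
# The paper does not say how the pairs were drawn; uniform sampling is an assumption.
rng = np.random.default_rng(42)
val_pairs = rng.choice(n, size=(500, 2), replace=True)
val_labels = (y[val_pairs[:, 0]] == y[val_pairs[:, 1]]).astype(int)

print({name: Z.shape for name, Z in hierarchies.items()})
print("fraction of same-class validation pairs:", val_labels.mean())
```

Since the paper reports repeating each experiment 10 times, any such reconstruction would additionally average results over repeated draws of the queried pairs.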