Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Clustering using Approximate Nearest Neighbour Oracles

Authors: Enayat Ullah, Harry Lang, Raman Arora, Vladimir Braverman

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we run experiments on the Census1990 dataset, wherein the results empirically support our theory. We run two experiments on the USCensus1990 dataset (available at Frank and Asuncion (2010)) on 200,000 points in 68 dimensions. In the first experiment (depicted in Figure 1a), we set the target number of clusters k = 1000 and evaluate the proposed method (Algorithm 1) with an exact nearest neighbour search and three approximate NN search methods, namely cover trees, kd-trees and LSH. We also compare against standard and highly optimized implementations of three other clustering methods, Lloyd's, Gaussian mixture model and Elkan's, available in the sklearn Python library (Pedregosa et al., 2011). Figure 1a shows the average cost of clustering against time taken for these methods. We see that cover trees achieve the same cost as exact NN search but take much less time, even less than the (highly optimized) sklearn methods, which produce lower-cost clusters. On the other hand, the ANN methods kd-tree and LSH do poorly on time taken and/or the cost of clustering.
Researcher Affiliation | Academia | Enayat Ullah EMAIL, Department of Computer Science, Johns Hopkins University; Harry Lang EMAIL, Unaffiliated; Raman Arora EMAIL, Department of Computer Science, Johns Hopkins University; Vladimir Braverman EMAIL, Department of Computer Science, Rice University
Pseudocode | Yes | Algorithm 1 Streaming k-median clustering
1: Input: integer k ≥ 1, stream P of n points, a c(δ)-approximate ANN oracle
2: L_1 ← 1, i ← 1, w(p) ← 1 for all points p ∈ P
3: while solution not found do
4:   C ← ∅, COST ← 0, f ← L_i / (k(1 + log(n/k)))
5:   while stream not ended do
6:     x ← next point from the stream
7:     (y, ρ(x, y)) ← ANN-Oracle(x, C)
8:     with probability min{w(x)·ρ(x, y)/f, 1} do
9:       C ← C ∪ {x}
10:     else
11:       COST ← COST + w(x)·ρ(x, y)
12:       w(y) ← w(y) + w(x)
13:     if COST > γ·L_i or |C| > (γ − 1)·k(1 + log(n/k)) then
14:       Break and raise flag
15:   if flag raised then
16:     Push facilities in C to the start of the stream
17:     L_{i+1} ← β·L_i, i ← i + 1
18:   else declare solution found
19: Output: C, COST
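The pseudocode above can be sketched in plain Python. This is an illustrative reading of Algorithm 1, not the authors' code: the constants `gamma` and `beta`, the L1 metric, the convention that the first arriving point opens a centre, and the brute-force `exact_nn` stand-in for the ANN oracle are all assumptions made for the sketch.

```python
import math
import random

def l1_dist(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def exact_nn(x, points, C):
    """Stand-in for the ANN oracle: exact nearest neighbour of x among centres C."""
    y = min(C, key=lambda c: l1_dist(x, points[c]))
    return y, l1_dist(x, points[y])

def stream_kmedian(points, k, oracle=exact_nn, gamma=4.0, beta=2.0):
    """Sketch of Algorithm 1; gamma/beta are illustrative, not the paper's values."""
    n = len(points)
    L = 1.0                                  # current lower-bound guess L_i
    w = [1.0] * n                            # unit weight on arrival
    stream = list(range(n))                  # point indices fed as a stream
    while True:                              # one phase per guess L_i
        C, cost, flag = [], 0.0, False
        f = L / (k * (1 + math.log(n / k)))  # facility-opening cost
        pending = list(stream)
        while pending:
            x = pending.pop(0)
            if not C:                        # convention: first point opens a centre
                C.append(x)
                continue
            y, d = oracle(points[x], points, C)
            if random.random() < min(w[x] * d / f, 1.0):
                C.append(x)                  # open a new facility at x
            else:
                cost += w[x] * d             # assign x (and its weight) to y
                w[y] += w[x]
            if cost > gamma * L or len(C) > (gamma - 1) * k * (1 + math.log(n / k)):
                flag = True                  # budget exceeded: abort this phase
                break
        if flag:
            stream = C + pending             # re-feed weighted facilities first
            L *= beta                        # raise the guess: L_{i+1} = beta * L_i
        else:
            return C, cost
```

Swapping `exact_nn` for a cover-tree, kd-tree, or LSH query is what varies between the curves the paper compares in Figure 1a.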
Open Source Code | No | The paper references third-party tools, such as "PyCoverTree" (https://github.com/emanuele/PyCoverTree/) and the "scikit-learn python library" (https://scikit-learn.org/stable/modules/neighbors.html), but does not provide a link or explicit statement for the open-sourcing of the code specific to the methodology described in this paper.
Open Datasets | Yes | We run two experiments on the USCensus1990 dataset (available at Frank and Asuncion (2010))... Frank and Asuncion. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 213:2–2, 2010.
Dataset Splits | No | The paper mentions using the USCensus1990 dataset but does not provide any specific information regarding how this dataset was split into training, validation, or test sets for the experiments.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to run the experiments. It only refers to "clock time" in the experimental results.
Software Dependencies | No | The paper mentions using the "sklearn python library" and "cover trees" (referencing PyCoverTree on GitHub), but does not specify the version numbers for any of these software components.
Experiment Setup | No | In the first experiment (depicted in Figure 1a), we set the target number of clusters k = 1000 and evaluate the proposed method (Algorithm 1) with an exact nearest neighbour search and three approximate NN search methods, namely cover trees, kd-trees and LSH. We also compare against standard and highly optimized implementations of three other clustering methods, Lloyd's, Gaussian mixture model and Elkan's, available in the sklearn Python library (Pedregosa et al., 2011). Figure 1a shows the average cost of clustering against time taken for these methods. To investigate further, we evaluate how these methods fare for increasing values of k.
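The setup described measures average clustering cost against wall-clock time while sweeping k. A toy, self-contained harness in the same spirit is sketched below; a minimal Lloyd's iteration stands in for the optimized sklearn baselines, and the data, iteration count, and k values are illustrative, not the paper's.

```python
import random
import time

def lloyd_kmeans(points, k, iters=10):
    """Minimal Lloyd's iteration (squared-Euclidean), a toy baseline stand-in."""
    centres = random.sample(points, k)
    for _ in range(iters):
        # assign each point to its closest current centre
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            buckets[j].append(p)
        # recompute each centre as the coordinate-wise mean of its bucket
        for j, b in enumerate(buckets):
            if b:
                centres[j] = tuple(sum(col) / len(b) for col in zip(*b))
    return centres

def avg_cost(points, centres):
    """Average distance from each point to its nearest centre."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres)
               for p in points) / len(points)

if __name__ == "__main__":
    random.seed(1)
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
    for k in (2, 4, 8):                       # sweep k, as in the second experiment
        t0 = time.perf_counter()
        centres = lloyd_kmeans(pts, k)
        elapsed = time.perf_counter() - t0
        print(f"k={k}  avg_cost={avg_cost(pts, centres):.3f}  time={elapsed:.4f}s")
```

Plotting `avg_cost` against `elapsed` for each method and each k reproduces the shape of comparison shown in Figure 1a, though without the dataset, hardware, and library versions the paper's exact numbers cannot be recovered.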