Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Clustering using Approximate Nearest Neighbour Oracles
Authors: Enayat Ullah, Harry Lang, Raman Arora, Vladimir Braverman
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we run experiments on Census1990 dataset wherein the results empirically support our theory. We run two experiments on USCensus1990 dataset (available at Frank and Asuncion (2010)) on 200,000 points in 68 dimensions. In the first experiment (depicted in Figure 1a), we set the target number of clusters k = 1000 and evaluate the proposed method (Algorithm 1) with an exact nearest neighbour search and three approximate NN search methods, namely cover trees, kd-trees and LSH. We also compare against standard and highly-optimized implementations of three other clustering methods, Lloyd's, Gaussian mixture model and Elkan's, available in the sklearn python library (Pedregosa et al., 2011). Figure 1a shows the average cost of clustering against time taken for these methods. We see that cover trees achieve the same cost as exact NN search, but take much less time, even less than (highly optimized) sklearn methods, which produce lower cost clusters. On the other hand, ANN methods kd-tree and LSH do poorly on time taken and/or the cost of clustering. |
| Researcher Affiliation | Academia | Enayat Ullah EMAIL Department of Computer Science Johns Hopkins University; Harry Lang EMAIL Unaffiliated; Raman Arora EMAIL Department of Computer Science Johns Hopkins University; Vladimir Braverman EMAIL Department of Computer Science Rice University |
| Pseudocode | Yes | Algorithm 1 Streaming k-median clustering. 1: Input: integer k ≥ 1, stream P of n points, a c(δ)-approximate ANN oracle. 2: L_1 ← 1, i ← 1, w(p) ← 1 for all points p ∈ P. 3: while solution not found do 4: C ← ∅, COST ← 0, f ← L_i/(k(1 + log(n/k))). 5: while stream not ended do 6: x ← next point from the stream. 7: (y, ρ(x, y)) ← ANN-Oracle(x, C). 8: with probability min(w(x)ρ(x, y)/f, 1): 9: C ← C ∪ {x}. 10: else 11: COST ← COST + w(x)·ρ(x, y). 12: w(y) ← w(y) + w(x). 13: if COST > γL_i or \|C\| > (γ − 1)k(1 + log(n/k)) then 14: break and raise flag. 15: if flag raised then 16: push facilities in C to the start of the stream. 17: L_{i+1} ← βL_i, i ← i + 1. 18: else declare solution found. 19: Output: C, COST |
| Open Source Code | No | The paper references third-party tools, such as "PyCoverTree" (https://github.com/emanuele/PyCoverTree/) and "scikit-learn python library" (https://scikit-learn.org/stable/modules/neighbors.html), but does not provide a link or explicit statement for the open-sourcing of the code specific to the methodology described in this paper. |
| Open Datasets | Yes | We run two experiments on USCensus1990 dataset (available at Frank and Asuncion (2010))... A. Frank and A. Asuncion. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2010. |
| Dataset Splits | No | The paper mentions using the USCensus1990 dataset but does not provide any specific information regarding how this dataset was split into training, validation, or test sets for the experiments. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to run the experiments. It only refers to "clock time" in the experimental results. |
| Software Dependencies | No | The paper mentions using the "sklearn python library" and "cover trees" (referencing PyCoverTree on GitHub), but does not specify the version numbers for any of these software components. |
| Experiment Setup | No | In the first experiment (depicted in Figure 1a), we set the target number of clusters k = 1000 and evaluate the proposed method (Algorithm 1) with an exact nearest neighbour search and three approximate NN search methods, namely cover trees, kd-trees and LSH. We also compare against standard and highly-optimized implementations of three other clustering methods, Lloyd's, Gaussian mixture model and Elkan's, available in the sklearn python library (Pedregosa et al., 2011). Figure 1a shows the average cost of clustering against time taken for these methods. To investigate further, we evaluate how these methods fare for increasing values of k. |
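The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the c(δ)-approximate ANN oracle is replaced by an exact linear-scan nearest-neighbour search (the paper's experiments use cover trees, kd-trees, and LSH), the constants `beta = 2` and `gamma = 4` are illustrative assumptions, and the handling of the very first facility (opened unconditionally, since the oracle is undefined on an empty centre set) is a choice made here to keep the sketch self-contained.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def ann_oracle(x, centers):
    """Stand-in "oracle": exact nearest neighbour by linear scan.
    The paper assumes a c(delta)-approximate ANN structure here."""
    y = min(centers, key=lambda c: dist(x, c))
    return y, dist(x, y)

def streaming_k_median(points, k, beta=2.0, gamma=4.0, seed=0):
    """Sketch of Algorithm 1 (streaming k-median via an ANN oracle).
    beta (guess growth) and gamma (cost/size thresholds) are assumed
    values for illustration; points must be hashable (e.g. tuples)."""
    rng = random.Random(seed)
    n = len(points)
    L = 1.0                        # L_1 <- 1: current guess of the optimum
    w = {p: 1.0 for p in points}   # w(p) <- 1 for all p in P
    stream = list(points)
    while True:                    # "while solution not found"
        centers, cost, flag = [], 0.0, False
        f = L / (k * (1 + math.log(n / k)))   # facility-opening cost
        for idx, x in enumerate(stream):
            if not centers:
                centers.append(x)  # bootstrap: open the first facility
                continue
            y, rho = ann_oracle(x, centers)
            if rng.random() < min(w[x] * rho / f, 1.0):
                centers.append(x)              # open x as a new facility
            else:
                cost += w[x] * rho             # charge x's service cost
                w[y] += w[x]                   # merge x's weight into y
            if (cost > gamma * L
                    or len(centers) > (gamma - 1) * k * (1 + math.log(n / k))):
                flag = True                    # guess L was too small
                break
        if flag:
            # Push current facilities (with accumulated weights) to the
            # start of the remaining stream, then raise the guess.
            stream = centers + stream[idx + 1:]
            L *= beta                          # L_{i+1} <- beta * L_i
        else:
            return centers, cost
```

Because opening decisions are randomized, the output depends on the seed; the key invariant is that the returned centre set stays within the size bound while the accumulated cost stays below γL for the final guess L.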