A Multilabel Classification Framework for Approximate Nearest Neighbor Search
Authors: Ville Hyvönen, Elias Jääsaari, Teemu Roos
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results validating the utility of our framework. In particular, we compare the natural classifier to the earlier candidate set selection methods discussed in Sec. 6 for different types of unsupervised trees that have been widely used for ANN search. Specifically, we use ensembles of randomized k-d trees (Friedman et al., 1976; Silpa-Anan and Hartley, 2008), random projection (RP) trees (Dasgupta and Freund, 2008; Hyvönen et al., 2016), and principal component (PCA) trees (Sproull, 1991; Jääsaari et al., 2019) (see Appendix C for detailed descriptions of these data structures). Another consequence of the multilabel formulation of Sec. 3 is that it enables using any established multilabel classifier for ANN search. To demonstrate this concretely, we train a random forest consisting of standard multilabel classification trees (trained under the PAL reduction (Reddi et al., 2019) by using multinomial log-likelihood as a split criterion) and use it as an index structure for ANN search; it turns out that the fully supervised classification trees have an improved performance compared to the earlier unsupervised trees on some but, curiously, not on all data sets. |
| Researcher Affiliation | Academia | Ville Hyvönen Department of Computer Science University of Helsinki Helsinki Institute for Information Technology (HIIT) ville.o.hyvonen@helsinki.fi Elias Jääsaari Machine Learning Department Carnegie Mellon University ejaeaesa@cs.cmu.edu Teemu Roos Department of Computer Science University of Helsinki Helsinki Institute for Information Technology (HIIT) teemu.roos@cs.helsinki.fi |
| Pseudocode | No | The paper includes mathematical formulations and proofs, but no clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code used to produce the experimental results is attached as supplementary material and can also be found at https://github.com/vioshyvo/a-multilabel-classification-framework. |
| Open Datasets | Yes | We use four benchmark data sets: Fashion (m = 60000, d = 784), GIST (m = 1000000, d = 960), Trevi (m = 101120, d = 4096), and STL-10 (m = 100000, d = 9216). |
| Dataset Splits | No | The paper states "using the corpus as the training set" and specifies the size of the test set, but does not explicitly define or specify the size/percentage of a validation set or provide clear percentages for the training split. |
| Hardware Specification | No | The paper's own checklist states: "(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]". While Appendix B mentions "Intel Xeon E3-1505M v5 CPU running at 2.80GHz with 64GB of RAM", this information is not directly stated in the main body for the experiments and the paper's self-assessment says 'No'. |
| Software Dependencies | No | The paper mentions that "All the algorithms are implemented in C++" and compiled with "Intel C++ Compiler version 17.0.4", but it does not list specific C++ libraries or other software dependencies with version numbers. |
| Experiment Setup | Yes | Further details of the experimental setup are found in Appendix B. The hyperparameters were chosen to maximize the average recall over the test set, and then the runtime was measured with these optimal hyperparameters. For ensembles of trees, we tuned the number of trees T ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096} and the leaf size ℓ ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}. For RF, we additionally tuned the number of random features at each split: max_features ∈ {0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} and the split criterion ∈ {entropy, gini, log-likelihood}. |
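The multilabel formulation quoted in the Research Type row can be sketched in a few lines: each corpus point becomes a label, a training query is tagged with the labels of its k nearest neighbors, and any off-the-shelf multilabel classifier then serves as the index whose highest-scoring labels form the candidate set. The sketch below uses scikit-learn's random forest with its default gini split rather than the paper's PAL reduction with a multinomial log-likelihood criterion, and all names are ours, not the paper's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
corpus = rng.normal(size=(500, 16))  # toy corpus: m = 500 points, d = 16
queries = corpus                     # "using the corpus as the training set"
k = 10

# Multilabel targets: Y[i, j] = 1 iff point j is among the k nearest
# neighbors of training query i (each point is its own nearest neighbor).
nn = NearestNeighbors(n_neighbors=k).fit(corpus)
neighbor_ids = nn.kneighbors(queries, return_distance=False)
Y = np.zeros((len(queries), len(corpus)), dtype=int)
for i, ids in enumerate(neighbor_ids):
    Y[i, ids] = 1

# A multi-output random forest trained on these targets acts as the index.
index = RandomForestClassifier(n_estimators=8, max_features=0.5, random_state=0)
index.fit(queries, Y)

# Candidate set selection: score every label for a new query and keep
# the top-scoring points as a shortlist for exact re-ranking.
q = rng.normal(size=(1, 16))
probs = index.predict_proba(q)                # one (1, 2) array per label
scores = np.array([p[0, 1] for p in probs])   # P(label j is a neighbor)
candidates = np.argsort(scores)[-50:]
```

The shortlist would then be re-ranked by exact distances to return the final k neighbors, mirroring the standard candidate-set-selection pipeline the paper compares against.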
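The tuning protocol in the Experiment Setup row (sweep ensemble size T and leaf size ℓ over powers of two, keep the configuration maximizing average recall over the test set) can be sketched as follows; `build_index` and `query_index` are hypothetical stand-ins for the paper's actual tree implementations.

```python
import itertools

T_grid = [2 ** i for i in range(13)]     # 1, 2, 4, ..., 4096
leaf_grid = [2 ** i for i in range(13)]  # 1, 2, 4, ..., 4096

def tune(build_index, query_index, test_queries, true_neighbors, k=10):
    """Grid-search (T, leaf_size), returning the best average recall
    over the test set and the hyperparameters that achieved it."""
    best = (-1.0, None)
    for T, leaf in itertools.product(T_grid, leaf_grid):
        index = build_index(T=T, leaf_size=leaf)
        recalls = []
        for q, truth in zip(test_queries, true_neighbors):
            candidates = query_index(index, q, k)
            recalls.append(len(set(candidates) & set(truth)) / k)
        recall = sum(recalls) / len(recalls)
        if recall > best[0]:
            best = (recall, (T, leaf))
    return best
```

In the paper's protocol the runtime is then measured at these recall-optimal settings; for the RF variant the same sweep would additionally cover max_features and the split criterion.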