Statistical Guarantees of Distributed Nearest Neighbor Classification

Authors: Jiexin Duan, Xingye Qiao, Guang Cheng

NeurIPS 2020

Reproducibility Variable / Result / LLM Response
Research Type: Experimental
LLM Response: "In this section, we empirically check the accuracy of the M-DiNN(k) and W-DiNN(k) methods compared with four benchmark methods: the oracle KNN and oracle OWNN methods as well as two Fast Approximate Nearest Neighbor search methods (FANN), which are based on k-d tree and cover tree. [...] In Table 1, we compare the empirical risk (test error) and the speedup factors of M-DiNN(k), W-DiNN(k), and FANN relative to oracle KNN."
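As context for this comparison, the empirical-risk and speedup bookkeeping can be sketched with scikit-learn, which offers both a brute-force backend (standing in for oracle KNN) and a k-d tree backend. Note that scikit-learn's k-d tree search is exact rather than approximate, so this only illustrates how test error and speedup factors are measured, not the FANN methods themselves; the synthetic data, k = 15, and sample sizes are assumptions, not the paper's configuration.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the paper's benchmark data (illustrative only).
X, y = make_classification(n_samples=20000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1000, random_state=0)

def risk_and_time(algorithm):
    """Fit a kNN classifier with the given backend; return test error and query time."""
    clf = KNeighborsClassifier(n_neighbors=15, algorithm=algorithm).fit(X_tr, y_tr)
    start = time.perf_counter()
    err = 1.0 - clf.score(X_te, y_te)        # empirical risk = test error
    return err, time.perf_counter() - start

err_brute, t_brute = risk_and_time("brute")  # exact brute-force search (oracle KNN)
err_kd, t_kd = risk_and_time("kd_tree")      # tree-based search

print(f"brute-force KNN: error={err_brute:.4f}, time={t_brute:.3f}s")
print(f"k-d tree KNN:    error={err_kd:.4f}, speedup={t_brute / t_kd:.1f}x")
```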
Researcher Affiliation: Academia
LLM Response: "Jiexin Duan, Department of Statistics, Purdue University, West Lafayette, Indiana, USA (duan32@purdue.edu); Xingye Qiao, Department of Mathematical Sciences, Binghamton University, New York, USA (qiao@math.binghamton.edu); Guang Cheng, Department of Statistics, Purdue University, West Lafayette, Indiana, USA (chengg@purdue.edu)"
Pseudocode: Yes
LLM Response: "Algorithm 1: DiNN with majority voting (M-DiNN).
Input: data set D, number of partitions s, local weight vector w_n, and query point x.
Output: M-DiNN prediction.
1: Randomly split D into s subsamples of equal size n.
2: for j = 1 to s do
3:   Obtain the WNN classifier \hat{\phi}^{(j)}_{n,w_n}(x) based on the j-th subsample.
4: end for
5: Take the majority vote of all classification outcomes \hat{\phi}^{(j)}_{n,w_n}(x):
     \hat{\phi}^{M}_{n,s,w_n}(x) = \mathbf{1}\{ (1/s) \sum_{j=1}^{s} \hat{\phi}^{(j)}_{n,w_n}(x) \geq 1/2 \}.   (2)
6: return \hat{\phi}^{M}_{n,s,w_n}(x)."
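The quoted pseudocode translates almost line for line into Python. The sketch below substitutes a plain unweighted kNN classifier for the general WNN with weight vector w_n (an assumption made for brevity), and assumes binary 0/1 labels; the function name m_dinn_predict and the defaults s=10, k=5 are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def m_dinn_predict(X_train, y_train, X_query, s=10, k=5, rng=None):
    """M-DiNN sketch: randomly split the data into s equal subsamples, fit a
    local kNN classifier on each, and majority-vote the s predictions (eq. (2)).
    Assumes y_train contains 0/1 labels."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X_train))
    votes = np.zeros(len(X_query))
    for part in np.array_split(idx, s):            # step 1: random equal-size split
        clf = KNeighborsClassifier(n_neighbors=k)  # local WNN stand-in (uniform weights)
        clf.fit(X_train[part], y_train[part])      # step 3: fit on the j-th subsample
        votes += clf.predict(X_query)              # accumulate the 0/1 outcomes
    return (votes / s >= 0.5).astype(int)          # step 5: indicator of majority vote
```

Because each local classifier emits a hard 0/1 label, averaging the s votes and thresholding at 1/2 is exactly the indicator in eq. (2).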
Open Source Code: No
LLM Response: "The paper does not include any statements about releasing code for the described methodology, nor does it provide a link to a code repository."
Open Datasets: Yes
LLM Response: "We have retained benchmark data sets HTRU2 [40], Gisette [29], Musk1 [20], Musk2 [21], Occupancy [9], Credit [54], and SUSY [2], from the UCI machine learning repository [38]."
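All seven data sets are distributed as flat files by the UCI repository. As an illustration, HTRU2 unpacks to a headerless CSV with 8 features and a binary class label; the local file name and the column names below are assumptions about how the archive was unpacked, not something the paper specifies.

```python
import pandas as pd

# HTRU2 from the UCI repository: 8 pulsar-candidate features + binary class label.
# Assumes the archive was downloaded and unzipped locally (file name is assumed).
cols = [f"feature_{i}" for i in range(8)] + ["label"]
htru2 = pd.read_csv("HTRU_2.csv", header=None, names=cols)
X, y = htru2[cols[:-1]].to_numpy(), htru2["label"].to_numpy()
print(X.shape, y.mean())  # number of rows/features, and the positive-class fraction
```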
Dataset Splits: No
LLM Response: "The paper mentions 'Parameters in the oracle KNN, OWNN and FANN are tuned using cross-validation' and 'The test sample sizes are set as min(1000, total sample size/5)', but it does not specify explicit train/validation splits (e.g., 80/10/10) or sample counts for these splits."
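The quoted rule fixes only the test size. A minimal sketch of that split, with the random seed and the use of the remaining data left as assumptions (the paper does not specify them):

```python
from sklearn.model_selection import train_test_split

def split_per_paper(X, y, seed=0):
    """Hold out min(1000, n/5) points for testing, as quoted above.
    How the remaining points are split for validation is not specified."""
    n_test = min(1000, len(X) // 5)
    return train_test_split(X, y, test_size=n_test, random_state=seed)
```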
Hardware Specification: Yes
LLM Response: "All numerical studies are conducted on HPC clusters with two 12-core Intel Xeon Gold Skylake processors and four 10-core Xeon-E5 processors, with memory between 64 and 128 GB."
Software Dependencies: No
LLM Response: "The paper mentions 'Apache Hadoop' and the 'MapReduce paradigm' in a general context, but it does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn 0.x)."
Experiment Setup: No
LLM Response: "The paper states that 'Parameters in the oracle KNN, OWNN and FANN are tuned using cross-validation, and the parameter k in M-DiNN(k), W-DiNN(k) and parameter l in M-DiNN, W-DiNN for each subsample are set using bridging formulas stated in our theorems.' However, it does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, specific optimizer settings) or other detailed training configurations."
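For the oracle baselines, the only stated procedure is cross-validation; the bridging formulas for the subsample-level parameters are given in the paper's theorems and are not reconstructed here. A generic sketch of cross-validated selection of k for the oracle KNN, with the search grid and fold count as assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def tune_oracle_knn(X_train, y_train):
    """Cross-validate the neighbor count k for the oracle KNN.
    The odd-k grid and cv=5 are illustrative, not the paper's settings."""
    grid = {"n_neighbors": np.arange(1, 52, 2)}
    search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_params_["n_neighbors"]
```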