Statistical Guarantees of Distributed Nearest Neighbor Classification
Authors: Jiexin Duan, Xingye Qiao, Guang Cheng
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically check the accuracy of the M-DiNN(k) and W-DiNN(k) methods compared with four benchmark methods: the oracle KNN and oracle OWNN methods as well as two Fast Approximate Nearest Neighbor search methods (FANN), which are based on k-d tree and cover tree. [...] In Table 1, we compare the empirical risk (test error) and the speedup factors of M-DiNN(k), W-DiNN(k), and FANN relative to oracle KNN. |
| Researcher Affiliation | Academia | Jiexin Duan, Department of Statistics, Purdue University, West Lafayette, Indiana, USA, duan32@purdue.edu; Xingye Qiao, Department of Mathematical Sciences, Binghamton University, New York, USA, qiao@math.binghamton.edu; Guang Cheng, Department of Statistics, Purdue University, West Lafayette, Indiana, USA, chengg@purdue.edu |
| Pseudocode | Yes | Algorithm 1 DiNN with majority voting (M-DiNN). Input: Data set D, number of partitions s, local weight vector $w_n$ and query point x. Output: M-DiNN. 1: Randomly split D into s subsamples with equal size n. 2: for j = 1 to s do 3: Obtain the WNN classifier $\hat{\phi}^{(j)}_{n,w_n}(x)$ based on the j-th subsample. 4: end for 5: Majority voting of all classification outcomes $\hat{\phi}^{(j)}_{n,w_n}(x)$: $\hat{\phi}^{M}_{n,s,w_n}(x) = \mathbb{1}\left\{ \frac{1}{s} \sum_{j=1}^{s} \hat{\phi}^{(j)}_{n,w_n}(x) \ge 1/2 \right\}$ (2). 6: return $\hat{\phi}^{M}_{n,s,w_n}(x)$. |
| Open Source Code | No | The paper does not include any statements about releasing code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We have retained benchmark data sets HTRU2 [40], Gisette [29], Musk1 [20], Musk2 [21], Occupancy [9], Credit [54], and SUSY [2], from the UCI machine learning repository [38]. |
| Dataset Splits | No | The paper mentions 'Parameters in the oracle KNN, OWNN and FANN are tuned using cross-validation' and 'The test sample sizes are set as min(1000, total sample size/5)', but it does not specify explicit train/validation splits (e.g., 80/10/10) or sample counts for these splits. |
| Hardware Specification | Yes | All numerical studies are conducted on HPC clusters with two 12-core Intel Xeon Gold Skylake processors and four 10-core Xeon-E5 processors, with memory between 64 and 128 GB. |
| Software Dependencies | No | The paper mentions 'Apache Hadoop' and 'Map Reduce paradigm' in a general context, but it does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn 0.x). |
| Experiment Setup | No | The paper states that 'Parameters in the oracle KNN, OWNN and FANN are tuned using cross-validation, and the parameter k in M-Di NN(k), W-Di NN(k) and parameter l in M-Di NN, W-Di NN for each subsample are set using bridging formulas stated in our theorems.' However, it does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs, specific optimizer settings) or other detailed training configurations. |
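The M-DiNN procedure quoted in the Pseudocode row can be illustrated with a short sketch. This is not the authors' code: the function name `m_dinn_predict`, the use of scikit-learn's `KNeighborsClassifier` as the local WNN-style base learner, and all parameter defaults are assumptions made for illustration. It splits the training data into s equal-size subsamples, fits a k-NN classifier on each, and classifies a query point by majority vote over the s local predictions, as in equation (2).

```python
# Illustrative sketch of M-DiNN (assumed helper, not the authors' code):
# split the data into s subsamples, fit one k-NN classifier per subsample,
# then take a majority vote of the local 0/1 predictions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def m_dinn_predict(X, y, X_query, s=5, k=3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # step 1: random split of D
    votes = np.zeros(len(X_query))
    for part in np.array_split(idx, s):    # s subsamples of (near-)equal size
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[part], y[part])
        votes += clf.predict(X_query)      # each local classifier outputs 0/1
    # Majority voting as in eq. (2): predict 1 iff the average vote >= 1/2
    return (votes / s >= 0.5).astype(int)

# Toy usage on a cleanly separated 1-D problem
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = (X.ravel() > 0.5).astype(int)
print(m_dinn_predict(X, y, np.array([[0.1], [0.9]])))  # -> [0 1]
```

A weighted variant (W-DiNN) would average the local classifiers' estimated class probabilities instead of their hard 0/1 votes; the paper's bridging formulas then relate the per-subsample k to the oracle choice on the full sample.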