Near-optimal sample compression for nearest neighbors

Authors: Lee-Ad Gottlieb, Aryeh Kontorovich, Pinhas Nisnevitch

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we discuss experimental results. First, we describe a simple heuristic built upon our algorithm. The theoretical guarantees in Theorem 1 feature a dependence on the scaled margin gamma, and our heuristic aims to give an improved solution in the problematic case where gamma is small. Consider the following procedure for obtaining a smaller consistent set. We first extract a net S_gamma satisfying the guarantees of Theorem 1. We then remove points from S_gamma using the following rule: for all i in {0, . . . , ceil(log(1/gamma))} and for each p in S_gamma, if the distance from p to all opposite-labeled points in S_gamma is at least 2 * 2^i * gamma, then remove from S_gamma all points strictly within distance 2^i * gamma of p (see Algorithm 2). We can show that the resulting set is consistent: Lemma 5. The above heuristic produces a consistent solution. (...) As a proof of concept, we tested our sample compression algorithms on several data sets from the UCI Machine Learning Repository. These included the Skin Segmentation, Statlog Shuttle, and Covertype sets. The final dataset features 7 different label types, which we treated as 21 separate binary classification problems; we report results for labels 1 vs. 4, 4 vs. 6, and 4 vs. 7, which typify the remaining pairs. We stress that the focus of our experiments is to demonstrate that (i) a significant amount of consistent sample compression is often possible and (ii) the compression does not adversely affect the generalization error. For each data set and experiment, we sampled equal-sized learning and test sets, with equal representation of each label type. The L1 metric was used for all data sets. We report (i) the initial sample set size, (ii) the percentage of points retained after the net extraction procedure of Algorithm 1, (iii) the percentage retained after the pruning heuristic of Algorithm 2, and (iv) the change in prediction accuracy on test data when comparing the heuristic to the uncompressed sample. The results, averaged over 500 trials, are summarized in Figure 2.
Researcher Affiliation | Academia | Lee-Ad Gottlieb, Department of Computer Science and Mathematics, Ariel University, Ariel, Israel (leead@ariel.ac.il); Aryeh Kontorovich, Computer Science Department, Ben Gurion University, Beer Sheva, Israel (karyeh@cs.bgu.ac.il); Pinhas Nisnevitch, Department of Computer Science and Mathematics, Ariel University, Ariel, Israel (pinhasn@gmail.com)
Pseudocode | Yes | Algorithm 1: Brute-force net construction (...) Algorithm 2: Consistent pruning heuristic
Open Source Code | No | The paper does not provide a repository link or an explicit statement about the release of source code for the described methodology.
Open Datasets | Yes | As a proof of concept, we tested our sample compression algorithms on several data sets from the UCI Machine Learning Repository. These included the Skin Segmentation, Statlog Shuttle, and Covertype sets.
Dataset Splits | No | The paper states, "For each data set and experiment, we sampled equal sized learning and test sets, with equal representation of each label type," but it does not specify explicit training, validation, and test splits with percentages, sample counts, or citations to predefined splits that would allow exact reproduction.
Hardware Specification | No | The paper does not provide details of the hardware (e.g., CPU/GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not list software dependencies with version numbers (programming languages, libraries, or frameworks) that would be necessary to replicate the experiments.
Experiment Setup | No | The paper describes general experimental settings, such as sampling equal-sized learning and test sets and using the L1 metric, but does not provide concrete configuration details such as parameter values or the steps needed to reproduce each run.
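The two-stage procedure quoted in the Research Type row (net extraction per Algorithm 1, then the consistent pruning heuristic of Algorithm 2) can be sketched in Python. This is a hypothetical reconstruction, not the authors' implementation: the greedy net construction, the function names, and the thresholds 2 * 2^i * gamma and 2^i * gamma are assumptions inferred from the quoted description, and points are assumed to be pre-scaled so that gamma is the scaled margin.

```python
import math

def l1(a, b):
    # L1 (Manhattan) distance, the metric the paper reports using.
    return sum(abs(x - y) for x, y in zip(a, b))

def greedy_net(points, radius, dist=l1):
    """Assumed brute-force net construction (Algorithm 1 is elided in the
    report): keep a point only if it is farther than `radius` from every
    net point chosen so far, so net points are mutually > radius apart
    and every input point lies within radius of some net point."""
    net = []
    for p in points:
        if all(dist(p, q) > radius for q in net):
            net.append(p)
    return net

def prune(net, labels, gamma, dist=l1):
    """Pruning heuristic as quoted: for i = 0 .. ceil(log2(1/gamma)),
    if p is at distance >= 2 * 2^i * gamma from every opposite-labeled
    surviving point, drop all other points strictly within 2^i * gamma
    of p. `labels` maps each point (a tuple) to its binary label."""
    kept = list(net)
    for i in range(int(math.ceil(math.log2(1.0 / gamma))) + 1):
        r = (2 ** i) * gamma
        for p in list(kept):
            if p not in kept:      # p was pruned earlier in this pass
                continue
            opposite = [q for q in kept if labels[q] != labels[p]]
            if all(dist(p, q) >= 2 * r for q in opposite):
                # remove points strictly within r of p, keeping p itself
                kept = [q for q in kept if q is p or dist(p, q) >= r]
    return kept
```

A toy usage, with hypothetical 1-D points: `greedy_net([(0.0,), (0.3,), (1.0,)], 0.1)` keeps all three points, and pruning with gamma = 0.2 then removes (0.3,), since (0.0,) is far from the opposite label and covers it; the survivors still classify every original sample correctly, matching the consistency claim of Lemma 5.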