Coresets for Classification – Simplified and Strengthened
Authors: Tung Mai, Cameron Musco, Anup Rao
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sec 5, we compare our Lewis weight-based method to the square root of leverage score method of [MSSW18], uniform sampling as studied in [CIM+19], and an oblivious sketching algorithm of [MOW21]. We study performance in minimizing both the log and hinge losses, with and without regularization. We observe that our method typically far outperforms uniform sampling along with other importance sampling methods. |
| Researcher Affiliation | Collaboration | Tung Mai, Adobe Research (tumai@adobe.com); Cameron Musco, University of Massachusetts Amherst (cmusco@cs.umass.edu); Anup Rao, Adobe Research (anuprao@adobe.com) |
| Pseudocode | No | The paper contains mathematical derivations, theorems, and proofs but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'Our evaluation uses the codebase of [MSSW18], which was generously shared with us.' This indicates the use of a third-party codebase, but there is no statement about the authors releasing their own source code for the methodology described. |
| Open Datasets | Yes | The WEBB SPAM data (footnote 1) consists of 350,000 unigrams with 127 features from web pages, with 61% positive labels; the task is to classify pages as spam or not. The other two datasets are loaded from the scikit-learn library (footnote 2). COVERTYPE consists of 581,012 cartographic observations of different forests with 54 features and 49% positive labels; the task is to predict the type of tree. KDD CUP 99 has 494,021 points with 41 features and 20% positive labels; the task is to detect network intrusions. (Footnote 1: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/; Footnote 2: https://scikit-learn.org/) A hedged loading sketch for the two scikit-learn datasets appears after the table. |
| Dataset Splits | No | The paper does not specify any training, validation, or test dataset splits (e.g., percentages or sample counts) for the experiments. It describes evaluation of coresets on the 'sum of loss over all data points'. |
| Hardware Specification | No | The paper mentions software routines like 'numpy qr factorization routine' and 'pinv routine in numpy' but does not specify any hardware details (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'numpy' and 'scikit learn library' but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Lewis weights are computed via an iterative algorithm given in [CP15], which involves computing leverage scores of a reweighted input matrix in each iteration. We typically don’t need many iterations to reach convergence; for all datasets we used 20 iterations and observed a relative difference between successive iterations of around 10⁻⁶. ... We also evaluate the above two losses with the regularization term 0.5‖β‖₂². (Hedged sketches of the Lewis weight iteration and the evaluated losses appear after the table.) |
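The two scikit-learn datasets named above can be fetched with the library's standard loaders. The following is a minimal sketch, assuming `fetch_covtype` and `fetch_kddcup99` as the loaders and assuming a particular label binarization (COVERTYPE class 2 vs. the rest, roughly 49% positive; KDD CUP 99 "normal" traffic as the roughly 20% positive class). The paper does not spell out its preprocessing, so these choices are illustrative only; WEBB SPAM is downloaded separately from the LIBSVM page.

```python
# Hedged sketch: load the two scikit-learn datasets named in the paper.
# Label binarization and preprocessing are assumptions; the paper only
# states dataset sizes and positive-label rates.
from sklearn.datasets import fetch_covtype, fetch_kddcup99

# COVERTYPE: 581,012 observations, 54 features; original labels are 7 tree
# types. Assumed binarization: class 2 vs. rest (~49% positives).
X_cov, y_cov = fetch_covtype(return_X_y=True)
y_cov_binary = (y_cov == 2).astype(int)

# KDD CUP 99 (10% subset): 494,021 points, 41 features; labels are byte
# strings. Assumed positive class: normal traffic (~20% of points); the
# paper does not state the polarity.
X_kdd, y_kdd = fetch_kddcup99(return_X_y=True)
y_kdd_binary = (y_kdd == b"normal.").astype(int)

print(X_cov.shape, y_cov_binary.mean())
print(X_kdd.shape, y_kdd_binary.mean())
```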
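The experiment-setup row describes computing Lewis weights with the iterative algorithm of [CP15], where each iteration computes leverage scores of a reweighted input matrix. Below is a minimal sketch of that kind of fixed-point iteration for ℓ1 Lewis weights, assuming the update w_i ← sqrt(w_i · τ_i(W^(-1/2) A)) with leverage scores obtained from numpy's QR factorization; the initialization, stopping rule, and the `l1_lewis_weights` / `leverage_scores` helpers are assumptions, not the authors' (non-public) code.

```python
# Hedged sketch of a [CP15]-style iterative l1 Lewis weight computation:
# each iteration computes leverage scores of a reweighted input matrix.
import numpy as np

def leverage_scores(M):
    """Row leverage scores of M (n x d, full column rank) via thin QR."""
    Q, _ = np.linalg.qr(M)          # the paper mentions numpy's QR routine
    return np.sum(Q * Q, axis=1)

def l1_lewis_weights(A, num_iters=20, tol=1e-6):
    """Approximate l1 Lewis weights of the rows of A by fixed-point iteration."""
    n = A.shape[0]
    w = np.ones(n)                                   # assumed initialization
    for _ in range(num_iters):
        # Leverage scores of the reweighted matrix W^{-1/2} A.
        tau = leverage_scores(A / np.sqrt(w)[:, None])
        w_new = np.sqrt(w * tau)                     # l1 Lewis weight update
        rel_diff = np.max(np.abs(w_new - w) / np.maximum(w, 1e-12))
        w = w_new
        if rel_diff < tol:   # paper reports relative differences around 1e-6
            break
    return w

# Toy usage: at the exact fixed point the weights sum to the rank (here 10).
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
print(l1_lewis_weights(A).sum())
```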
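The evaluation described above measures the sum of the log and hinge losses over all data points, optionally adding the regularization term 0.5‖β‖₂². The sketch below shows one plausible reading of those objectives, assuming labels in {-1, +1} and margins y·⟨x, β⟩, and assuming the standard SVM hinge max(0, 1 - m); the paper may use a different hinge variant or label convention.

```python
# Hedged sketch of the evaluated objectives: sum of log / hinge loss over
# all points, with an optional 0.5 * ||beta||_2^2 regularization term.
# Label convention (y in {-1, +1}) and hinge variant are assumptions.
import numpy as np

def log_loss(X, y, beta, regularize=False):
    margins = y * (X @ beta)                       # y assumed in {-1, +1}
    loss = np.sum(np.logaddexp(0.0, -margins))     # sum_i log(1 + exp(-m_i))
    return loss + (0.5 * beta @ beta if regularize else 0.0)

def hinge_loss(X, y, beta, regularize=False):
    margins = y * (X @ beta)
    loss = np.sum(np.maximum(0.0, 1.0 - margins))  # sum_i max(0, 1 - m_i)
    return loss + (0.5 * beta @ beta if regularize else 0.0)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta = rng.standard_normal(5)
y = np.sign(X @ beta + 0.1 * rng.standard_normal(100))
print(log_loss(X, y, beta, regularize=True), hinge_loss(X, y, beta))
```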