Selecting Near-Optimal Learners via Incremental Data Allocation

Authors: Ashish Sabharwal, Horst Samulowitz, Gerald Tesauro

AAAI 2016

Reproducibility checklist (variable, result, and the supporting excerpt or justification):
Research Type: Experimental
"We further develop substantial theoretical support for DAUB in an idealized setting where the expected accuracy of a classifier trained on n samples can be known exactly. Under these conditions we establish a rigorous sub-linear bound on the regret of the approach (in terms of misallocated data), as well as a rigorous bound on the suboptimality of the selected classifier. Our accuracy estimates using real-world datasets only entail mild violations of the theoretical scenario, suggesting that the practical behavior of DAUB is likely to approach the idealized behavior. ... Our empirical findings thus show that in practice DAUB can consistently select near-optimal classifiers at a substantially reduced computational cost when compared to full training of all classifiers."
Researcher Affiliation: Industry
Ashish Sabharwal, Allen Institute for AI, Seattle, WA, USA (AshishS@allenai.org); Horst Samulowitz and Gerald Tesauro, IBM Watson Research Center, Yorktown Heights, NY, USA ({samulowitz, gtesauro}@us.ibm.com)
Pseudocode: Yes
"Algorithm 1 describes our Data Allocation using Upper Bounds strategy. ... Algorithm 1: Data Allocation using Upper Bounds"
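For readers without access to the paper's pseudocode, the following is a minimal Python sketch of the allocation loop that Algorithm 1 describes, assuming scikit-learn-style learners with `fit`/`score`. The bootstrap on a few geometrically growing subsets and the slope-based (first-order) upper bound follow the paper's description, but the function name `daub`, the stopping rule, and other details here are illustrative, not a verified reimplementation.

```python
import math

def daub(learners, X_train, y_train, X_val, y_val, b=500, r=1.5):
    """Sketch of Data Allocation using Upper Bounds (DAUB).

    Trains every learner on a few small subsets, then repeatedly gives
    more data (grown by factor r) to the learner whose extrapolated
    full-data accuracy bound is highest. Returns the index of the
    learner that first gets trained on the full training set.
    """
    N = len(X_train)
    # Bootstrap: three geometrically growing subsets per learner.
    sizes = {i: [int(b), int(b * r), int(b * r * r)] for i in range(len(learners))}
    scores = {i: [] for i in range(len(learners))}
    for i, clf in enumerate(learners):
        for n in sizes[i]:
            clf.fit(X_train[:n], y_train[:n])
            scores[i].append(clf.score(X_val, y_val))

    while True:
        # Upper bound on full-data accuracy: latest accuracy plus the
        # most recent learning-curve slope extrapolated out to N samples
        # (one plausible reading of the paper's estimator).
        bounds = {}
        for i in scores:
            n1, n2 = sizes[i][-2], sizes[i][-1]
            f1, f2 = scores[i][-2], scores[i][-1]
            slope = max(0.0, (f2 - f1) / (n2 - n1))
            bounds[i] = f2 + slope * (N - n2)
        best = max(bounds, key=bounds.get)
        n_next = min(N, int(math.ceil(sizes[best][-1] * r)))
        learners[best].fit(X_train[:n_next], y_train[:n_next])
        sizes[best].append(n_next)
        scores[best].append(learners[best].score(X_val, y_val))
        if n_next == N:  # winner has now seen the full training set
            return best
```

In the paper's setup, `learners` would be the WEKA classifier suite, with b = 500 and r = 1.5 as reported in the Experiment Setup row below.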
Open Source Code: No
"Code and data, including full parameterization for each classifier, are available from the authors."
Open Datasets: Yes
"We first evaluate DAUB on one real-world binary classification dataset, Higgs boson (Baldi, Sadowski, and Whiteson 2014), and one artificial dataset, Parity with distractors. ... Finally, in Table 2 we report results of DAUB on Higgs plus five other real-world benchmarks as indicated: Buzz (Kawala et al. 2013); Covertype (Blackard and Dean 2000); Million Song Dataset (Bertin-Mahieux et al. 2011); SUSY (Baldi, Sadowski, and Whiteson 2014); and Vehicle SensIT (Duarte and Hu 2004)."
Dataset Splits: Yes
"For the Higgs and other real-world datasets, we first randomly split the data with a 70/30 ratio, selected 38,500 samples for Tr from the 70% split, and used the 30% as Tv."
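A minimal sketch of the reported split, assuming a shuffled 70/30 partition via scikit-learn; the synthetic data, the fixed `random_state`, and the variable names are illustrative (28 columns stand in for Higgs-like features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 28))      # synthetic stand-in features
y = rng.integers(0, 2, size=100_000)    # binary labels

# 70/30 split as reported; Tr is capped at 38,500 samples drawn from
# the 70% portion, and the 30% portion serves as the validation set Tv.
X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.30, random_state=0)  # seed is an assumption
X_tr, y_tr = X_pool[:38_500], y_pool[:38_500]
```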
Hardware Specification: Yes
"All experiments were conducted on AMD Opteron 6134 machines with 32 cores and 64 GB memory, running Scientific Linux 6.1."
Software Dependencies: No
The paper states that the experiments use "classifiers ... as implemented in WEKA (Hall et al. 2009)". While WEKA is named, no version number is provided for WEKA or any other software dependency, which is necessary for reproducibility.
Experiment Setup: Yes
"We coarsely optimized the DAUB parameters at b = 500 and r = 1.5 based on the Higgs data, and kept those values fixed for all datasets. This yielded 11 possible allocation sizes: 500, 1000, 1500, 2500, 4000, 5000, 7500, 11500, 17500, 25500, 38500."
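Since the paper does not spell out the rounding rule that maps b = 500 and r = 1.5 onto these exact values, a reproduction is safest keeping the published sizes as an explicit schedule rather than regenerating them; the sketch below reproduces the list verbatim and merely checks that consecutive sizes grow roughly geometrically.

```python
B, R = 500, 1.5  # DAUB parameters as reported
ALLOCATION_SIZES = [500, 1000, 1500, 2500, 4000, 5000,
                    7500, 11500, 17500, 25500, 38500]

# Consecutive sizes grow at a ratio near R, but the paper's exact
# rounding scheme is unspecified, so the published list is kept as-is.
for prev, nxt in zip(ALLOCATION_SIZES, ALLOCATION_SIZES[1:]):
    print(f"{prev:>6} -> {nxt:>6}  (ratio {nxt / prev:.2f})")
```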