Selecting Near-Optimal Learners via Incremental Data Allocation

Authors: Ashish Sabharwal, Horst Samulowitz, Gerald Tesauro

AAAI 2016

Reproducibility checklist (variable, result, and the supporting excerpt or justification):
Research Type: Experimental
"We further develop substantial theoretical support for DAUB in an idealized setting where the expected accuracy of a classifier trained on n samples can be known exactly. Under these conditions we establish a rigorous sub-linear bound on the regret of the approach (in terms of misallocated data), as well as a rigorous bound on the suboptimality of the selected classifier. Our accuracy estimates using real-world datasets only entail mild violations of the theoretical scenario, suggesting that the practical behavior of DAUB is likely to approach the idealized behavior. ... Our empirical findings thus show that in practice DAUB can consistently select near-optimal classifiers at a substantially reduced computational cost when compared to full training of all classifiers."
Researcher Affiliation: Industry
Ashish Sabharwal, Allen Institute for AI, Seattle, WA, USA (AshishS@allenai.org); Horst Samulowitz and Gerald Tesauro, IBM Watson Research Center, Yorktown Heights, NY, USA ({samulowitz, gtesauro}@us.ibm.com)
Pseudocode: Yes
"Algorithm 1 describes our Data Allocation using Upper Bounds strategy. ... Algorithm 1: Data Allocation using Upper Bounds"
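For readers without access to the paper's pseudocode, the following is a minimal Python sketch of the allocation loop that Algorithm 1 describes, assuming scikit-learn-style learners with `fit`/`score`. The bootstrap on a few geometrically growing subsets and the slope-based (first-order) upper bound follow the paper's description, but the function name `daub`, the stopping rule, and other details here are illustrative, not a verified reimplementation.

```python
import math

def daub(learners, X_train, y_train, X_val, y_val, b=500, r=1.5):
    """Sketch of Data Allocation using Upper Bounds (DAUB).

    Trains every learner on a few small subsets, then repeatedly gives
    more data (grown by factor r) to the learner whose extrapolated
    full-data accuracy bound is highest. Returns the index of the
    learner that first gets trained on the full training set.
    """
    N = len(X_train)
    # Bootstrap: three geometrically growing subsets per learner.
    sizes = {i: [int(b), int(b * r), int(b * r * r)] for i in range(len(learners))}
    scores = {i: [] for i in range(len(learners))}
    for i, clf in enumerate(learners):
        for n in sizes[i]:
            clf.fit(X_train[:n], y_train[:n])
            scores[i].append(clf.score(X_val, y_val))

    while True:
        # Upper bound on full-data accuracy: latest accuracy plus the
        # most recent learning-curve slope extrapolated out to N samples
        # (one plausible reading of the paper's estimator).
        bounds = {}
        for i in scores:
            n1, n2 = sizes[i][-2], sizes[i][-1]
            f1, f2 = scores[i][-2], scores[i][-1]
            slope = max(0.0, (f2 - f1) / (n2 - n1))
            bounds[i] = f2 + slope * (N - n2)
        best = max(bounds, key=bounds.get)
        n_next = min(N, int(math.ceil(sizes[best][-1] * r)))
        learners[best].fit(X_train[:n_next], y_train[:n_next])
        sizes[best].append(n_next)
        scores[best].append(learners[best].score(X_val, y_val))
        if n_next == N:  # winner has now seen the full training set
            return best
```

In the paper's setup, `learners` would be the WEKA classifier suite, with b = 500 and r = 1.5 as reported in the Experiment Setup row below.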
Open Source Code: No
"Code and data, including full parameterization for each classifier, are available from the authors."
Open Datasets: Yes
"We first evaluate DAUB on one real-world binary classification dataset, Higgs boson (Baldi, Sadowski, and Whiteson 2014), and one artificial dataset, Parity with distractors. ... Finally, in Table 2 we report results of DAUB on Higgs plus five other real-world benchmarks as indicated: Buzz (Kawala et al. 2013); Covertype (Blackard and Dean 2000); Million Song Dataset (Bertin-Mahieux et al. 2011); SUSY (Baldi, Sadowski, and Whiteson 2014); and Vehicle SensIT (Duarte and Hu 2004)."
Dataset Splits: Yes
"For the Higgs and other real-world datasets, we first randomly split the data with a 70/30 ratio, selected 38,500 samples for Tr from the 70% split, and used the 30% as Tv."
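A minimal sketch of the reported split, assuming a shuffled 70/30 partition via scikit-learn; the synthetic data, the fixed `random_state`, and the variable names are illustrative (28 columns stand in for Higgs-like features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 28))      # synthetic stand-in features
y = rng.integers(0, 2, size=100_000)    # binary labels

# 70/30 split as reported; Tr is capped at 38,500 samples drawn from
# the 70% portion, and the 30% portion serves as the validation set Tv.
X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.30, random_state=0)  # seed is an assumption
X_tr, y_tr = X_pool[:38_500], y_pool[:38_500]
```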
Hardware Specification: Yes
"All experiments were conducted on AMD Opteron 6134 machines with 32 cores and 64 GB memory, running Scientific Linux 6.1."
Software Dependencies: No
The paper states that the experiments use "classifiers ... as implemented in WEKA (Hall et al. 2009)". While WEKA is named, no version number is provided for WEKA or any other software dependency, which is necessary for reproducibility.
Experiment Setup: Yes
"We coarsely optimized the DAUB parameters at b = 500 and r = 1.5 based on the Higgs data, and kept those values fixed for all datasets. This yielded 11 possible allocation sizes: 500, 1000, 1500, 2500, 4000, 5000, 7500, 11500, 17500, 25500, 38500."
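Since the paper does not spell out the rounding rule that maps b = 500 and r = 1.5 onto these exact values, a reproduction is safest keeping the published sizes as an explicit schedule rather than regenerating them; the sketch below reproduces the list verbatim and merely checks that consecutive sizes grow roughly geometrically.

```python
B, R = 500, 1.5  # DAUB parameters as reported
ALLOCATION_SIZES = [500, 1000, 1500, 2500, 4000, 5000,
                    7500, 11500, 17500, 25500, 38500]

# Consecutive sizes grow at a ratio near R, but the paper's exact
# rounding scheme is unspecified, so the published list is kept as-is.
for prev, nxt in zip(ALLOCATION_SIZES, ALLOCATION_SIZES[1:]):
    print(f"{prev:>6} -> {nxt:>6}  (ratio {nxt / prev:.2f})")
```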