Bayesian and Empirical Bayesian Forests

Authors: Matt Taddy, Chun-Sheng Chen, Jun Yu, Mitch Wyle

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a number of experiments, we compare EBFs to the common distributed-computing strategy of fitting forests to data subsamples and find that the EBFs lead to a large improvement in predictive performance. This type of strategy is key to efficient machine learning with Big Data: focus the Big on the pieces of models that are most difficult to learn. Throughout, we use publicly available data on home prices in California to illustrate our ideas. We also provide a variety of other data analyses to benchmark performance, and close with description of how EBF algorithms are being built and perform in large-scale machine learning at eBay.com.
Researcher Affiliation | Collaboration | Matt Taddy (TADDY@CHICAGOBOOTH.EDU), University of Chicago Booth School of Business; Chun-Sheng Chen (CHUNSCHEN@EBAY.COM), eBay; Jun Yu (JUNYU@EBAY.COM), eBay; Mitch Wyle (MWYLE@EBAY.COM), eBay
Pseudocode | Yes | Algorithm 1 (Bayesian Forest): for b = 1 to B do: draw θᵇ ∼ iid Exp(1); run weighted-sample CART to get Tᵇ = T(θᵇ); end for. (A runnable sketch of this algorithm appears after the table.)
Open Source Code | No | The paper mentions modifying scikit-learn but provides no link or explicit statement about releasing the modified code, or any other source code for its methodology.
Open Datasets | Yes | Throughout, we use publicly available data on home prices in California to illustrate our ideas. We also provide a variety of other data analyses to benchmark performance... The Friedman (1991) function... California housing data of Pace & Barry (1997)... motorcycle data (from the MASS R package, Venables & Ripley, 2002)... wine data from (Cortez et al., 1998)... Nielsen Consumer Panel data, available for academic research through the Kilts Center at Chicago Booth. (A loading sketch for the California data appears after the table.)
Dataset Splits | Yes | Figure 3 shows results from a 10-fold cross-validation (CV) experiment, with details in Table 1. (A CV sketch appears after the table.)
Hardware Specification | No | The paper reports computation times but does not specify CPU or GPU models or any other hardware details used for the experiments.
Software Dependencies | No | The paper mentions using Python's scikit-learn, the MASS R package, the bayestree and tgp R package defaults, and the MLlib library for Apache Spark, but does not provide version numbers for any of these software components.
Experiment Setup | Yes | CART-based algorithms had minimum-leaf-samples set at 3 and the ensembles contain 100 trees... BART and BCART run at their bayestree and tgp R package defaults, except that BART draws only 200 trees after a burn-in of 100 MCMC iterations... EBFs use five node trunks in this Section. The SSFs are fit on data split into five equally sized subsets. (A configuration sketch appears after the table.)
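
A runnable reading of Algorithm 1, in Python. This is a minimal sketch, assuming scikit-learn's DecisionTreeRegressor as the weighted-sample CART learner; the helper names (bayesian_forest, forest_predict) are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bayesian_forest(X, y, B=100, min_samples_leaf=3, seed=0):
    """Algorithm 1: fit B trees, each on the full sample reweighted by iid Exp(1) draws."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        theta = rng.exponential(scale=1.0, size=len(y))  # theta^b ~ iid Exp(1)
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
        tree.fit(X, y, sample_weight=theta)              # T^b = T(theta^b)
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    """Posterior-mean prediction: average over the sampled trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```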
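
The Pace & Barry (1997) California housing data cited under Open Datasets ships with scikit-learn, which gives a convenient starting point; whether the authors used this exact copy of the data is an assumption.

```python
from sklearn.datasets import fetch_california_housing

# 20,640 census tracts, 8 features; the target is median house value.
housing = fetch_california_housing()
X, y = housing.data, housing.target
```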
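
The 10-fold CV protocol behind Figure 3 could be replayed along the following lines. The cv_rmse helper is hypothetical: the paper does not state its exact error metric or fold assignment, so RMSE and a seeded shuffle are assumptions here.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(X, y, make_model, n_splits=10, seed=0):
    """Average held-out RMSE over K folds (metric and seeding are assumptions)."""
    errs = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = make_model().fit(X[train], y[train])
        errs.append(np.sqrt(np.mean((model.predict(X[test]) - y[test]) ** 2)))
    return float(np.mean(errs))
```

For example, `cv_rmse(X, y, lambda: RandomForestRegressor(n_estimators=100, min_samples_leaf=3))` would score a full-sample forest under the stated settings.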
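
The stated configuration maps directly onto scikit-learn for the plain forests and the SSF baseline: the sketch below fits an independent 100-tree forest (minimum leaf size 3) on each of five equal data splits and averages the predictions. It does not reproduce the EBF trunk construction, which required the authors' modified tree code; the helper names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_ssf(X, y, n_subsets=5, seed=0):
    """Subsample forest (SSF): one 100-tree forest per equal-sized data split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    forests = []
    for chunk in np.array_split(idx, n_subsets):
        f = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
        forests.append(f.fit(X[chunk], y[chunk]))
    return forests

def ssf_predict(forests, X):
    """Average the per-subset forest predictions."""
    return np.mean([f.predict(X) for f in forests], axis=0)
```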