Bayesian and Empirical Bayesian Forests

Authors: Matt Taddy, Chun-Sheng Chen, Jun Yu, Mitch Wyle

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a number of experiments, we compare EBFs to the common distributed-computing strategy of fitting forests to data subsamples and find that the EBFs lead to a large improvement in predictive performance. This type of strategy is key to efficient machine learning with Big Data: focus the Big on the pieces of models that are most difficult to learn. Throughout, we use publicly available data on home prices in California to illustrate our ideas. We also provide a variety of other data analyses to benchmark performance, and close with description of how EBF algorithms are being built and perform in large-scale machine learning at eBay.com.
Researcher Affiliation | Collaboration | Matt Taddy (TADDY@CHICAGOBOOTH.EDU), University of Chicago Booth School of Business; Chun-Sheng Chen (CHUNSCHEN@EBAY.COM), eBay; Jun Yu (JUNYU@EBAY.COM), eBay; Mitch Wyle (MWYLE@EBAY.COM), eBay
Pseudocode | Yes | Algorithm 1 (Bayesian Forest): for b = 1 to B do: draw θᵇ ∼ iid Exp(1); run weighted-sample CART to get Tᵇ = T(θᵇ); end for. (A runnable sketch of this algorithm appears after the table.)
Open Source Code | No | The paper mentions modifying scikit-learn but provides no link or explicit statement about releasing the modified code, or any other source code for its methodology.
Open Datasets | Yes | Throughout, we use publicly available data on home prices in California to illustrate our ideas. We also provide a variety of other data analyses to benchmark performance... The Friedman (1991) function... California housing data of Pace & Barry (1997)... motorcycle data (from the MASS R package, Venables & Ripley, 2002)... wine data from (Cortez et al., 1998)... Nielsen Consumer Panel data, available for academic research through the Kilts Center at Chicago Booth. (A loading sketch for the California data appears after the table.)
Dataset Splits | Yes | Figure 3 shows results from a 10-fold cross-validation (CV) experiment, with details in Table 1. (A CV sketch appears after the table.)
Hardware Specification | No | The paper reports computation times but does not specify CPU or GPU models or any other hardware details used for the experiments.
Software Dependencies | No | The paper mentions using Python's scikit-learn, the MASS R package, the bayestree and tgp R package defaults, and the MLlib library for Apache Spark, but does not provide version numbers for any of these software components.
Experiment Setup | Yes | CART-based algorithms had minimum-leaf-samples set at 3 and the ensembles contain 100 trees... BART and BCART run at their bayestree and tgp R package defaults, except that BART draws only 200 trees after a burn-in of 100 MCMC iterations... EBFs use five node trunks in this Section. The SSFs are fit on data split into five equally sized subsets. (A configuration sketch appears after the table.)
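
A runnable reading of Algorithm 1, in Python. This is a minimal sketch, assuming scikit-learn's DecisionTreeRegressor as the weighted-sample CART learner; the helper names (bayesian_forest, forest_predict) are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bayesian_forest(X, y, B=100, min_samples_leaf=3, seed=0):
    """Algorithm 1: fit B trees, each on the full sample reweighted by iid Exp(1) draws."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        theta = rng.exponential(scale=1.0, size=len(y))  # theta^b ~ iid Exp(1)
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
        tree.fit(X, y, sample_weight=theta)              # T^b = T(theta^b)
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    """Posterior-mean prediction: average over the sampled trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```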
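
The Pace & Barry (1997) California housing data cited under Open Datasets ships with scikit-learn, which gives a convenient starting point; whether the authors used this exact copy of the data is an assumption.

```python
from sklearn.datasets import fetch_california_housing

# 20,640 census tracts, 8 features; the target is median house value.
housing = fetch_california_housing()
X, y = housing.data, housing.target
```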
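
The 10-fold CV protocol behind Figure 3 could be replayed along the following lines. The cv_rmse helper is hypothetical: the paper does not state its exact error metric or fold assignment, so RMSE and a seeded shuffle are assumptions here.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_rmse(X, y, make_model, n_splits=10, seed=0):
    """Average held-out RMSE over K folds (metric and seeding are assumptions)."""
    errs = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = make_model().fit(X[train], y[train])
        errs.append(np.sqrt(np.mean((model.predict(X[test]) - y[test]) ** 2)))
    return float(np.mean(errs))
```

For example, `cv_rmse(X, y, lambda: RandomForestRegressor(n_estimators=100, min_samples_leaf=3))` would score a full-sample forest under the stated settings.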
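
The stated configuration maps directly onto scikit-learn for the plain forests and the SSF baseline: the sketch below fits an independent 100-tree forest (minimum leaf size 3) on each of five equal data splits and averages the predictions. It does not reproduce the EBF trunk construction, which required the authors' modified tree code; the helper names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_ssf(X, y, n_subsets=5, seed=0):
    """Subsample forest (SSF): one 100-tree forest per equal-sized data split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    forests = []
    for chunk in np.array_split(idx, n_subsets):
        f = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
        forests.append(f.fit(X[chunk], y[chunk]))
    return forests

def ssf_predict(forests, X):
    """Average the per-subset forest predictions."""
    return np.mean([f.predict(X) for f in forests], axis=0)
```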