Bayesian and Empirical Bayesian Forests
Authors: Matt Taddy, Chun-Sheng Chen, Jun Yu, Mitch Wyle
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a number of experiments, we compare EBFs to the common distributed-computing strategy of fitting forests to data subsamples and find that the EBFs lead to a large improvement in predictive performance. This type of strategy is key to efficient machine learning with Big Data: focus the Big on the pieces of models that are most difficult to learn. Throughout, we use publicly available data on home prices in California to illustrate our ideas. We also provide a variety of other data analyses to benchmark performance, and close with description of how EBF algorithms are being built and perform in large-scale machine learning at eBay.com. |
| Researcher Affiliation | Collaboration | Matt Taddy (TADDY@CHICAGOBOOTH.EDU), University of Chicago Booth School of Business; Chun-Sheng Chen (CHUNSCHEN@EBAY.COM), eBay; Jun Yu (JUNYU@EBAY.COM), eBay; Mitch Wyle (MWYLE@EBAY.COM), eBay |
| Pseudocode | Yes | Algorithm 1 Bayesian Forest: for b = 1 to B do: draw θᵇ ~ iid Exp(1); run weighted-sample CART to get T_b = T(θᵇ); end for. (A runnable sketch of this loop is given below the table.) |
| Open Source Code | No | The paper mentions modifying 'scikit-learn' but does not provide a link or explicit statement about releasing their modified code or any other source code for their methodology. |
| Open Datasets | Yes | Throughout, we use publicly available data on home prices in California to illustrate our ideas. We also provide a variety of other data analyses to benchmark performance...The Friedman (1991) function...California housing data of Pace & Barry (1997)...motorcycle data (from the MASS R package, Venables & Ripley, 2002)...wine data from (Cortez et al., 1998)...Nielsen Consumer Panel data, available for academic research through the Kilts Center at Chicago Booth |
| Dataset Splits | Yes | Figure 3 shows results from a 10-fold cross-validation (CV) experiment, with details in Table 1. |
| Hardware Specification | No | The paper mentions computation times but does not specify any particular CPU or GPU models, or other hardware details used for the experiments. |
| Software Dependencies | No | The paper mentions using 'Python's scikit-learn', 'MASS R package', 'bayestree and tgp R package defaults', and the 'MLlib library for Apache Spark' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | CART-based algorithms had minimum-leaf-samples set at 3 and the ensembles contain 100 trees... BART and BCART run at their bayestree and tgp R package defaults, except that BART draws only 200 trees after a burn-in of 100 MCMC iterations... EBFs use five-node trunks in this section. The SSFs are fit on data split into five equally sized subsets. |
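
Based on the Algorithm 1 pseudocode quoted above, the following is a minimal sketch of the Bayesian Forest loop. It assumes Python with scikit-learn's `DecisionTreeRegressor` standing in for the weighted-sample CART learner; the paper mentions modifying scikit-learn, but this is not the authors' code, and the function names, `B=100`, `min_samples_leaf=3`, and the commented demo are illustrative choices taken from the settings reported in the table.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bayesian_forest(X, y, B=100, min_samples_leaf=3, seed=0):
    """Algorithm 1 sketch: for each tree, draw iid Exp(1) observation
    weights and fit a weighted-sample CART tree to the full data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        theta = rng.exponential(scale=1.0, size=n)    # theta_i ~ iid Exp(1)
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
        tree.fit(X, y, sample_weight=theta)           # weighted-sample CART
        trees.append(tree)
    return trees

def bf_predict(trees, X):
    """Average predictions across the sampled trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Illustrative usage on the Pace & Barry (1997) California housing data,
# which scikit-learn can download (an assumption, not the authors' pipeline):
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
# forest = bayesian_forest(housing.data, housing.target, B=100)
# yhat = bf_predict(forest, housing.data)
```

The exponential weights replace the multinomial bootstrap counts of an ordinary random forest, so each tree can be read as a posterior draw for the CART fit; averaging over trees gives the forest prediction.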