MABSplit: Faster Forest Training Using Multi-Armed Bandits
Authors: Mo Tiwari, Ryan Kang, Jaeyong Lee, Chris Piech, Ilan Shomorony, Sebastian Thrun, Martin J. Zhang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the advantages of MABSplit in two settings. In the first setting, the baseline models with and without MABSplit are trained to completion and we report wall-clock training time and generalization performance. In the second setting, we consider training each forest with a fixed computational budget and study the effect of MABSplit on generalization performance. We provide a description of each dataset in Appendix 6. |
| Researcher Affiliation | Academia | 1: Department of Computer Science, Stanford University 2: Oxford University 3: Electrical and Computer Engineering, University of Illinois at Urbana-Champaign 4: Department of Epidemiology, Harvard T.H. Chan School of Public Health |
| Pseudocode | Yes | Algorithm 1: MABSplit(X, F, Tf, I(·), B, δ) (a hedged Python sketch of the bandit routine follows the table) |
| Open Source Code | Yes | All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest. |
| Open Datasets | Yes | MNIST: A dataset of 60,000 28x28 grayscale images of handwritten digits. It is publicly available and widely used [38]. APS Failure at Scania Trucks: A classification dataset from the UCI Machine Learning Repository [20] and a Kaggle challenge [25] related to predictive maintenance. Forest Covertype: A classification dataset from the UCI Machine Learning Repository [20] used to predict forest cover type from cartographic variables [13]. Beijing Multi-Site Air-Quality: A regression dataset that provides hourly air quality data from 12 air quality stations in Beijing [64]. It is publicly available via UCI Machine Learning Repository [20]. SGEMM GPU Kernel Performance: A regression dataset from the UCI Machine Learning Repository [20] that contains performance measurements for GPU kernels [44]. |
| Dataset Splits | Yes | All datasets were randomly split 80% for training and 20% for testing. For some models, an additional 10% was used for validation. |
| Hardware Specification | Yes | All experiments were run on a single machine with a 2.3 GHz 8-Core Intel Core i9 processor, 16 GB 2400 MHz DDR4 memory, and an AMD Radeon Pro 5500M 8 GB GPU. |
| Software Dependencies | No | All code was written in Python 3.9 and uses the following libraries: NumPy [55], SciPy [55], scikit-learn [47], PyTorch [11, 41], and pandas [47]. Where possible, code was optimized using Numba [36] and compiled via LLVM [36, 37]. While Python 3.9 is specified, version numbers for the libraries (NumPy, SciPy, scikit-learn, PyTorch, pandas, Numba, LLVM) are not explicitly stated. |
| Experiment Setup | Yes | All experiments use 100 trees in the forest, unless otherwise noted. Batch size for MABSplit: 16. Confidence parameter (δ) for MABSplit: 1e-6. For regression problems, we used 50 bins; for classification problems, we used 10 bins. For all experiments, we used a random seed of 0. Full settings for all experiments are given in Appendix 6. (An illustrative setup snippet also follows the table.) |
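The Pseudocode row above quotes only the signature of Algorithm 1. As a rough illustration of the idea, not the authors' implementation, here is a minimal Python sketch of bandit-based split selection via successive elimination. The names `mabsplit_sketch` and `weighted_child_gini` are our own, the Hoeffding-style confidence interval is a stand-in for the paper's bounds, and the real algorithm bins features into histograms and updates per-arm statistics incrementally rather than recomputing them on each round as this sketch does.

```python
import numpy as np

def weighted_child_gini(x_col, y, threshold, n_classes):
    """Weighted Gini impurity of the two children induced by x <= threshold,
    computed on whatever subsample is passed in (the per-arm estimate)."""
    mask = x_col <= threshold
    left, right = y[mask], y[~mask]

    def gini(labels):
        if labels.size == 0:
            return 0.0
        p = np.bincount(labels, minlength=n_classes) / labels.size
        return 1.0 - np.sum(p ** 2)

    n = y.size
    return (left.size / n) * gini(left) + (right.size / n) * gini(right)

def mabsplit_sketch(X, y, candidates, batch_size=16, delta=1e-6, seed=0):
    """Successive-elimination sketch of bandit-based split selection.

    Each (feature, threshold) candidate is an 'arm'. We estimate its weighted
    child impurity on a growing random subsample and discard arms whose
    confidence interval can no longer contain the best arm's value.
    y must hold integer class labels 0..k-1.
    """
    rng = np.random.default_rng(seed)
    n, n_classes = y.size, int(y.max()) + 1
    order = rng.permutation(n)          # sample without replacement, in batches
    alive = list(range(len(candidates)))
    used = 0
    while len(alive) > 1 and used < n:
        used = min(used + batch_size, n)
        idx = order[:used]
        # Hoeffding-style half-width; Gini impurity lies in [0, 1].
        half = np.sqrt(np.log(2.0 / delta) / (2.0 * used))
        mu = np.array([weighted_child_gini(X[idx, f], y[idx], t, n_classes)
                       for f, t in (candidates[a] for a in alive)])
        best_ucb = (mu + half).min()
        # Keep only arms whose lower bound could still beat the best upper bound.
        alive = [a for a, m in zip(alive, mu) if m - half <= best_ucb]
    # Break any remaining ties exactly, on the full node data.
    scores = [weighted_child_gini(X[:, f], y, t, n_classes)
              for f, t in (candidates[a] for a in alive)]
    return candidates[alive[int(np.argmin(scores))]]

# Toy usage: the informative feature is column 2, split near 0.3.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 2] > 0.3).astype(int)
cands = [(f, t) for f in range(5) for t in np.linspace(-1.0, 1.0, 9)]
print(mabsplit_sketch(X, y, cands))   # expect feature 2, threshold near 0.3
```

The elimination step only saves work when the node is large relative to the confidence width, which is consistent with the paper's framing of MABSplit as an acceleration for node-splitting over many samples.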
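For the Dataset Splits and Experiment Setup rows, the following snippet shows how the quoted settings (80/20 split, seed 0, 100 trees) translate to code. It uses scikit-learn's `RandomForestClassifier` and `fetch_covtype` purely as stand-ins; the actual experiments run the authors' FastForest code, and the MABSplit-specific knobs (batch size 16, δ = 1e-6, 10 classification bins) are arguments to that code, not scikit-learn parameters.

```python
from sklearn.datasets import fetch_covtype            # Forest Covertype, as in the table
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = fetch_covtype(return_X_y=True)

# 80/20 train/test split with the fixed seed quoted in the table.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 100-tree baseline forest; MABSplit's own hyperparameters (batch size 16,
# delta = 1e-6, 10 bins for classification) live in the FastForest repo.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```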