MABSplit: Faster Forest Training Using Multi-Armed Bandits

Authors: Mo Tiwari, Ryan Kang, Jaeyong Lee, Chris Piech, Ilan Shomorony, Sebastian Thrun, Martin J. Zhang

NeurIPS 2022

Each reproducibility variable is listed below with its result and the LLM response that supports it, quoted from the paper.
Research Type: Experimental
LLM Response: "We demonstrate the advantages of MABSplit in two settings. In the first setting, the baseline models with and without MABSplit are trained to completion and we report wall-clock training time and generalization performance. In the second setting, we consider training each forest with a fixed computational budget and study the effect of MABSplit on generalization performance. We provide a description of each dataset in Appendix 6."
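For the first setting, a measurement harness amounts to timing fit() and scoring on the held-out split. The sketch below is an illustration only: scikit-learn's RandomForestClassifier and synthetic data stand in for the paper's baselines and datasets, and the MABSplit-enabled models live in the authors' repository.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; the paper uses the five datasets listed below.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)  # baseline stand-in
start = time.perf_counter()
model.fit(X_tr, y_tr)                                             # train to completion
elapsed = time.perf_counter() - start
print(f"train time: {elapsed:.2f}s  test accuracy: {model.score(X_te, y_te):.3f}")
```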
Researcher Affiliation: Academia
LLM Response: "1: Department of Computer Science, Stanford University; 2: Oxford University; 3: Electrical and Computer Engineering, University of Illinois at Urbana-Champaign; 4: Department of Epidemiology, Harvard T.H. Chan School of Public Health"
Pseudocode: Yes
LLM Response: "Algorithm 1 MABSplit(X, F, Tf, I(·), B, δ)"
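To convey the shape of Algorithm 1, here is a minimal sketch of bandit-based split selection: each candidate (feature, threshold) pair is an arm, impurity reductions are estimated from sampled batches, and dominated arms are eliminated via confidence bounds. The gini_reduction helper, the Hoeffding-style confidence radius, and all names below are our illustrative assumptions, not the authors' implementation (see their repository for that).

```python
import numpy as np

def gini_reduction(x_col, y, threshold):
    """Estimated Gini impurity reduction of a binary split at `threshold`,
    computed on the given (sub)sample. Assumed helper, not from the paper."""
    def gini(labels):
        if labels.size == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / labels.size
        return 1.0 - float(np.sum(p ** 2))
    left, right = y[x_col <= threshold], y[x_col > threshold]
    n = y.size
    return gini(y) - (left.size / n) * gini(left) - (right.size / n) * gini(right)

def mab_split(X, y, thresholds, B=16, delta=1e-6, rng=None):
    """Successive elimination over candidate (feature, threshold) arms.
    `thresholds` maps feature index -> list of candidate split values."""
    rng = rng if rng is not None else np.random.default_rng(0)
    arms = [(f, t) for f, ts in thresholds.items() for t in ts]
    sums = np.zeros(len(arms))       # running sums of batch impurity-reduction estimates
    batches = 0                      # every active arm sees every batch
    active = np.arange(len(arms))
    n = X.shape[0]
    while active.size > 1 and batches * B < n:
        idx = rng.choice(n, size=min(B, n), replace=False)  # fresh batch of points
        batches += 1
        for a in active:
            f, t = arms[a]
            sums[a] += gini_reduction(X[idx, f], y[idx], t)
        means = sums[active] / batches
        # Hoeffding-style radius; the exact form in the paper may differ.
        radius = np.sqrt(np.log(2.0 / delta) / (2.0 * batches * B))
        best_lcb = np.max(means - radius)
        active = active[means + radius >= best_lcb]  # drop arms whose UCB falls below
    # Break ties among survivors exactly on the full node.
    best = max(active, key=lambda a: gini_reduction(X[:, arms[a][0]], y, arms[a][1]))
    return arms[best]

# Toy usage: 200 points, 3 features, candidate thresholds at feature quartiles.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.5).astype(int)
thresholds = {f: list(np.quantile(X[:, f], [0.25, 0.5, 0.75])) for f in range(3)}
print(mab_split(X, y, thresholds))  # expect a split on feature 1 near 0.5
```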
Open Source Code: Yes
LLM Response: "All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest."
Open Datasets: Yes
LLM Response:
- MNIST: A dataset of 60,000 28x28 grayscale images of handwritten digits. It is publicly available and widely used [38].
- APS Failure at Scania Trucks: A classification dataset from the UCI Machine Learning Repository [20] and a Kaggle challenge [25] related to predictive maintenance.
- Forest Covertype: A classification dataset from the UCI Machine Learning Repository [20] used to predict forest cover type from cartographic variables [13].
- Beijing Multi-Site Air-Quality: A regression dataset that provides hourly air quality data from 12 air quality stations in Beijing [64]. It is publicly available via the UCI Machine Learning Repository [20].
- SGEMM GPU Kernel Performance: A regression dataset from the UCI Machine Learning Repository [20] that contains performance measurements for GPU kernels [44].
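Two of these datasets can be fetched directly through scikit-learn; the sketch below is a convenience illustration, not the authors' data pipeline. The OpenML identifier "mnist_784" is the standard handle for MNIST on OpenML.

```python
from sklearn.datasets import fetch_covtype, fetch_openml

# MNIST via OpenML: 70,000 28x28 digit images (60,000 train + 10,000 test).
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X_mnist, y_mnist = mnist.data, mnist.target

# Forest Covertype: 581,012 samples, 54 cartographic features, 7 cover types.
cov = fetch_covtype()
X_cov, y_cov = cov.data, cov.target
```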
Dataset Splits: Yes
LLM Response: "All datasets were randomly split 80% for training and 20% for testing. For some models, an additional 10% was used for validation."
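A minimal sketch of the stated protocol using scikit-learn's train_test_split on placeholder data; the seed and the choice to carve the validation set out of the training portion are our assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(1000, 10))  # placeholder features
y = (X[:, 0] > 0).astype(int)                         # placeholder labels

# 80% train / 20% test, as stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Optional validation split: taking 12.5% of the 80% train portion yields
# 10% of the full dataset (0.125 * 0.8 = 0.10). Whether the paper's 10% is
# drawn from the train portion or the full dataset is our assumption.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.125, random_state=0)
```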
Hardware Specification: Yes
LLM Response: "All experiments were run on a single machine with a 2.3 GHz 8-Core Intel Core i9 processor, 16 GB 2400 MHz DDR4 memory, and an AMD Radeon Pro 5500M 8 GB GPU."
Software Dependencies: No
LLM Response: "All code was written in Python 3.9 and uses the following libraries: NumPy [55], SciPy [55], scikit-learn [47], PyTorch [11, 41], and pandas [47]. Where possible, code was optimized using Numba [36] and compiled via LLVM [36, 37]." While Python 3.9 is specified, version numbers for the libraries themselves (NumPy, SciPy, scikit-learn, PyTorch, pandas, Numba, LLVM) are not stated.
Experiment Setup: Yes
LLM Response:
- Number of trees per forest: 100, unless otherwise noted.
- MABSplit batch size: 16.
- MABSplit confidence parameter δ: 1e-6.
- Histogram bins: 50 for regression problems, 10 for classification problems.
- Random seed: 0 for all experiments.
Full settings for all experiments are given in Appendix 6.
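The reported settings, gathered into one place for reference; the dictionary keys are our own labels, not the authors' configuration format.

```python
# Hyperparameters as reported above; key names are illustrative assumptions.
EXPERIMENT_CONFIG = {
    "n_trees": 100,             # forest size, unless otherwise noted
    "mabsplit_batch_size": 16,  # B in Algorithm 1
    "mabsplit_delta": 1e-6,     # confidence parameter δ
    "n_bins_regression": 50,
    "n_bins_classification": 10,
    "random_seed": 0,
}
```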