Mondrian Forests: Efficient Online Random Forests
Authors: Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the excellent empirical performance of MF in Section 7, and conclude in Section 8 with a discussion about future work. The purpose of these experiments is to evaluate the predictive performance (test accuracy) of MF as a function of (i) fraction of training data and (ii) training time. We divide the training data into 100 mini-batches and we compare the performance of online random forests (MF, ORF-Saffari [20]) to batch random forests (Breiman-RF, ERT-k, ERT-1) which are trained on the same fraction of the training data. (A code sketch of this mini-batch protocol is given below the table.) |
| Researcher Affiliation | Academia | Balaji Lakshminarayanan Gatsby Unit University College London Daniel M. Roy Department of Engineering University of Cambridge Yee Whye Teh Department of Statistics University of Oxford |
| Pseudocode | Yes | Algorithm 1 (Sample Mondrian Tree), Algorithm 2 (Sample Mondrian Block), Algorithm 3 (Extend Mondrian Tree(T, λ, D)), and Algorithm 4 (Extend Mondrian Block(T, λ, j, D)). (A minimal sketch of the tree-sampling recursion is given below the table.) |
| Open Source Code | Yes | Our scripts are implemented in Python. We implemented the ORF-Saffari algorithm as well as ERT in Python for timing comparisons. The scripts can be downloaded from the authors' webpages. |
| Open Datasets | Yes | We evaluate on four of the five datasets used in [20] (we excluded the mushroom dataset, as even very simple logical rules achieve > 99% accuracy on it; footnote 4: https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names) and the dna dataset (footnote 5: https://www.sgi.com/tech/mlc/db/DNA.names). |
| Dataset Splits | Yes | We used the pre-defined train/test split. For the usps dataset D = 256, K = 10, Ntrain = 7291, Ntest = 2007; for the satimages dataset D = 36, K = 6, Ntrain = 3104, Ntest = 2000; for the letter dataset D = 16, K = 26, Ntrain = 15000, Ntest = 5000; for the dna dataset D = 180, K = 3, Ntrain = 1400, Ntest = 1186. (These splits are collected in a lookup table below.) |
| Hardware Specification | No | The scikit-learn implementation uses highly optimized C code, hence we do not compare our runtimes with the scikit-learn implementation. (This discusses the scikit-learn implementation's optimization, not the specific hardware used for the paper's experiments). |
| Software Dependencies | No | Our scripts are implemented in Python. We used the Breiman-RF implementation in scikit-learn [16]. (Specific version numbers for Python or scikit-learn are not provided.) |
| Experiment Setup | Yes | As is common in the random forest literature [2], we set the number of trees M = 100. For Mondrian forests, we set the lifetime λ = 1 and the HNSP discount parameter γ = 10D. For ORF-Saffari, we set num epochs = 20 (number of passes through the training data) and set the other hyperparameters to the values used in [20]. For Breiman-RF and ERT, the hyperparameters are set to default values. (These settings are gathered into a config sketch below the table.) |
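
The Research Type row quotes the paper's mini-batch evaluation protocol. The sketch below shows only the batch-baseline side of that protocol: retraining a forest on a growing fraction of the training data and recording test accuracy. The function name, the scikit-learn baseline, and the bookkeeping are our own assumptions, not the authors' script; the online learners (MF, ORF-Saffari) would instead be updated incrementally with each mini-batch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def minibatch_accuracy_curve(X_train, y_train, X_test, y_test,
                             n_batches=100, seed=0):
    """Test accuracy of a batch forest retrained on a growing data fraction."""
    batches = np.array_split(np.arange(len(X_train)), n_batches)
    seen, accuracies = [], []
    for batch in batches:
        seen.extend(batch)
        # Batch baselines are retrained from scratch on all data seen so far.
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X_train[seen], y_train[seen])
        accuracies.append(clf.score(X_test, y_test))
    return accuracies
```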
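
The Pseudocode row points to Algorithms 1 and 2 for sampling a Mondrian tree. Below is a minimal Python sketch of that recursion, assuming the standard Mondrian process construction: the split time of a block is exponential in the sum of its bounding-box side lengths, the split dimension is chosen proportional to side length, and the split location is uniform within the box. The dict-based tree representation and helper names are illustrative, not the paper's implementation.

```python
import numpy as np

def sample_mondrian_block(X, parent_time, lam, rng):
    """Recursively sample a Mondrian block over the bounding box of X."""
    lower, upper = X.min(axis=0), X.max(axis=0)
    extent = upper - lower          # per-dimension side lengths of the box
    rate = extent.sum()
    # Waiting time until this block splits (infinite if the box has no extent).
    split_time = parent_time + (rng.exponential(1.0 / rate) if rate > 0 else np.inf)
    if split_time < lam:
        # Split dimension chosen proportional to side length; location uniform.
        d = rng.choice(len(extent), p=extent / rate)
        loc = rng.uniform(lower[d], upper[d])
        left, right = X[:, d] <= loc, X[:, d] > loc
        return {
            "time": split_time, "dim": int(d), "loc": loc,
            "left": sample_mondrian_block(X[left], split_time, lam, rng),
            "right": sample_mondrian_block(X[right], split_time, lam, rng),
        }
    # Block survives past the lifetime: it becomes a leaf.
    return {"time": lam, "leaf": True, "n": len(X)}

rng = np.random.default_rng(0)
tree = sample_mondrian_block(rng.uniform(size=(100, 5)),
                             parent_time=0.0, lam=1.0, rng=rng)
```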
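
For convenience, the dataset statistics quoted in the Dataset Splits row can be kept in a small lookup table; the dict layout below is our own.

```python
# (D features, K classes, N_train, N_test), per the pre-defined splits above.
DATASETS = {
    "usps":      (256, 10,  7291, 2007),
    "satimages": ( 36,  6,  3104, 2000),
    "letter":    ( 16, 26, 15000, 5000),
    "dna":       (180,  3,  1400, 1186),
}
```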
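
Finally, the hyperparameters quoted in the Experiment Setup row fit in a single configuration function. The key names are hypothetical; only the values come from the paper.

```python
def experiment_config(D):
    """Hyperparameters reported in the paper; D is the input dimension."""
    return {
        "n_trees": 100,           # M = 100 trees, common in the RF literature
        "lifetime": 1.0,          # Mondrian lifetime lambda = 1
        "hnsp_discount": 10 * D,  # HNSP discount gamma = 10 D
        "orf_num_epochs": 20,     # ORF-Saffari passes through the training data
        # Breiman-RF and ERT use their implementations' default settings.
    }
```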