Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale
Authors: Firas Abuzaid, Joseph K. Bradley, Feynman T. Liang, Andrew Feng, Lee Yang, Matei Zaharia, Ameet S. Talwalkar
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Yggdrasil against the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, Yggdrasil is faster by up to an order of magnitude. |
| Researcher Affiliation | Collaboration | MIT CSAIL, Databricks, University of Cambridge, Yahoo, UCLA |
| Pseudocode | No | The paper describes algorithms in text but does not present structured pseudocode or algorithm blocks (e.g., clearly labeled "Algorithm 1" or a pseudocode section). |
| Open Source Code | Yes | Our implementation is open-source and publicly available. Yggdrasil has been published as a Spark package at the following URL: https://spark-packages.org/package/fabuzaid21/yggdrasil |
| Open Datasets | Yes | To examine the performance of Yggdrasil and PLANET, we trained a decision tree on two large-scale datasets: the MNIST 8 million dataset, and another modeled after a private Yahoo dataset that is used for search ranking. |
| Dataset Splits | No | The paper mentions training on datasets but does not explicitly provide details about training/validation/test splits, sample counts, or cross-validation methodology needed for reproduction. |
| Hardware Specification | Yes | We ran all experiments on 16 Amazon EC2 r3.2xlarge machines. Each machine has an Intel Xeon E5-2670 v2 CPU, 61 GB of memory, and 1 Gigabit Ethernet connectivity. |
| Software Dependencies | Yes | We developed Yggdrasil on top of Spark 1.6.0 with an API compatible with MLlib. We benchmarked Yggdrasil against two implementations of PLANET: Spark MLlib v1.6.0, and XGBoost4J-Spark v0.47. A minimal usage sketch follows this table. |
| Experiment Setup | No | The paper mentions tuning Spark's memory configuration and XGBoost parameters for optimal performance, but it does not provide the specific values or concrete training configurations (e.g., maximum tree depth, number of bins, or impurity measure) needed to rerun the experiments. |
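
Because Yggdrasil advertises an API compatible with Spark MLlib 1.6, a reproduction attempt would most naturally start from the standard `DecisionTree.trainClassifier` entry point of the RDD-based MLlib API. The sketch below is a minimal, hypothetical example against that public API only; the dataset path, `maxDepth`, and `maxBins` values are illustrative assumptions, not settings reported by the paper. The Yggdrasil package itself would typically be pulled into a Spark application with the `--packages` flag (e.g., `spark-shell --packages fabuzaid21:yggdrasil:<version>`), with the exact coordinates taken from the spark-packages listing above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Minimal sketch: train a single deep decision tree through the Spark
// MLlib 1.6 RDD-based API that Yggdrasil claims compatibility with.
// All paths and hyperparameter values are illustrative assumptions.
object DeepTreeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DeepTreeSketch"))

    // Hypothetical LIBSVM-format copy of MNIST 8M; not a path from the paper.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/mnist8m.libsvm")

    val model = DecisionTree.trainClassifier(
      data,
      numClasses = 10,                           // MNIST digit classes
      categoricalFeaturesInfo = Map[Int, Int](), // treat all features as continuous
      impurity = "gini",
      maxDepth = 20,                             // "deep tree" regime the paper targets; value assumed
      maxBins = 32)

    println(s"Learned a tree with ${model.numNodes} nodes, depth ${model.depth}")
    sc.stop()
  }
}
```

One caveat when reproducing the "deep tree" setting: MLlib's RDD-based tree implementation historically capped `maxDepth` at 30, so depth choices for any head-to-head comparison would need to respect that limit on the MLlib baseline.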