Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Authors: Firas Abuzaid, Joseph K. Bradley, Feynman T. Liang, Andrew Feng, Lee Yang, Matei Zaharia, Ameet S. Talwalkar

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate YGGDRASIL against the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, YGGDRASIL is faster by up to an order of magnitude.
Researcher Affiliation | Collaboration | MIT CSAIL, Databricks, University of Cambridge, Yahoo, UCLA
Pseudocode | No | The paper describes algorithms in text but does not present structured pseudocode or algorithm blocks (e.g., clearly labeled "Algorithm 1" or a pseudocode section).
Open Source Code | Yes | Our implementation is open-source and publicly available. Yggdrasil has been published as a Spark package at the following URL: https://spark-packages.org/package/fabuzaid21/yggdrasil
Open Datasets | Yes | To examine the performance of YGGDRASIL and PLANET, we trained a decision tree on two large-scale datasets: the MNIST 8 million dataset, and another modeled after a private Yahoo dataset that is used for search ranking.
Dataset Splits | No | The paper mentions training on datasets but does not explicitly provide details about training/validation/test splits, sample counts, or cross-validation methodology needed for reproduction.
Hardware Specification | Yes | We ran all experiments on 16 Amazon EC2 r3.2xlarge machines. Each machine has an Intel Xeon E5-2670 v2 CPU, 61 GB of memory, and 1 Gigabit Ethernet connectivity.
Software Dependencies | Yes | We developed YGGDRASIL on top of Spark 1.6.0 with an API compatible with MLLIB. We benchmarked YGGDRASIL against two implementations of PLANET: Spark MLLIB v1.6.0, and XGBOOST4J-SPARK v0.47. (An illustrative sketch of the MLlib 1.6 API follows the table.)
Experiment Setup | No | The paper mentions tuning Spark's memory configuration and XGBoost parameters for optimal performance, but it does not provide specific hyperparameter values or concrete training configurations (e.g., maximum tree depth, number of bins, or Spark memory settings) for the experiments. (An illustrative configuration sketch follows the table.)
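
Regarding the Software Dependencies row: the paper states that Yggdrasil exposes an API compatible with Spark MLlib 1.6, so a reproduction would start from the standard MLlib RDD-based decision-tree entry point shown below. This is a minimal sketch using the public MLlib 1.6 API only; the input path, number of classes, depth, and bin count are placeholder assumptions, not values reported in the paper, and no Yggdrasil-specific classes are used because the paper does not document them here.

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.{SparkConf, SparkContext}

    object TrainDeepTreeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("deep-tree-sketch"))

        // Illustrative input path; the paper's MNIST 8M and Yahoo datasets are not bundled here.
        val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/mnist8m.libsvm")

        // Standard MLlib 1.6 training call that Yggdrasil advertises compatibility with.
        // Depth and bin count below are placeholder values, not the paper's settings.
        val model = DecisionTree.trainClassifier(
          input = data,
          numClasses = 10,
          categoricalFeaturesInfo = Map[Int, Int](),
          impurity = "gini",
          maxDepth = 15,
          maxBins = 32)

        // Print the first part of the learned tree structure for a quick sanity check.
        println(model.toDebugString.take(500))
        sc.stop()
      }
    }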
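
Regarding the Experiment Setup row: the paper says Spark's memory configuration was tuned but does not report the tuned values. Purely to illustrate the kind of settings involved, the sketch below sets standard Spark 1.6 configuration properties for executors sized to an r3.2xlarge worker (8 vCPUs, 61 GB RAM); every numeric value is an assumption, not the authors' configuration.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative Spark 1.6 memory/serialization settings; not the paper's tuned values.
    val conf = new SparkConf()
      .setAppName("yggdrasil-benchmark-sketch")
      .set("spark.executor.memory", "48g")        // leave headroom below the 61 GB on r3.2xlarge
      .set("spark.executor.cores", "8")           // r3.2xlarge exposes 8 vCPUs
      .set("spark.memory.fraction", "0.6")        // fraction of heap for execution and storage (illustrative)
      .set("spark.memory.storageFraction", "0.5") // split between cached data and execution memory (illustrative)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)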