Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale

Authors: Firas Abuzaid, Joseph K. Bradley, Feynman T. Liang, Andrew Feng, Lee Yang, Matei Zaharia, Ameet S. Talwalkar

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate YGGDRASIL against the MNIST 8M dataset and a high-dimensional dataset at Yahoo; for both, YGGDRASIL is faster by up to an order of magnitude.
Researcher Affiliation | Collaboration | MIT CSAIL, Databricks, University of Cambridge, Yahoo, UCLA
Pseudocode | No | The paper describes algorithms in text but does not present structured pseudocode or algorithm blocks (e.g., clearly labeled "Algorithm 1" or a pseudocode section).
Open Source Code | Yes | Our implementation is open-source and publicly available. Yggdrasil has been published as a Spark package at the following URL: https://spark-packages.org/package/fabuzaid21/yggdrasil
Open Datasets | Yes | To examine the performance of YGGDRASIL and PLANET, we trained a decision tree on two large-scale datasets: the MNIST 8 million dataset, and another modeled after a private Yahoo dataset that is used for search ranking.
Dataset Splits | No | The paper mentions training on datasets but does not explicitly provide details about training/validation/test splits, sample counts, or cross-validation methodology needed for reproduction.
Hardware Specification | Yes | We ran all experiments on 16 Amazon EC2 r3.2xlarge machines. Each machine has an Intel Xeon E5-2670 v2 CPU, 61 GB of memory, and 1 Gigabit Ethernet connectivity.
Software Dependencies | Yes | We developed YGGDRASIL on top of Spark 1.6.0 with an API compatible with MLLIB. We benchmarked YGGDRASIL against two implementations of PLANET: Spark MLLIB v1.6.0, and XGBOOST4J-SPARK v0.47. (An illustrative sketch of the MLlib 1.6 API follows the table.)
Experiment Setup | No | The paper mentions tuning Spark's memory configuration and XGBoost parameters for optimal performance, but it does not provide specific hyperparameter values or concrete training configurations (e.g., maximum tree depth, number of bins, or Spark memory settings) for the experiments. (An illustrative configuration sketch follows the table.)
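
Regarding the Software Dependencies row: the paper states that Yggdrasil exposes an API compatible with Spark MLlib 1.6, so a reproduction would start from the standard MLlib RDD-based decision-tree entry point shown below. This is a minimal sketch using the public MLlib 1.6 API only; the input path, number of classes, depth, and bin count are placeholder assumptions, not values reported in the paper, and no Yggdrasil-specific classes are used because the paper does not document them here.

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.{SparkConf, SparkContext}

    object TrainDeepTreeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("deep-tree-sketch"))

        // Illustrative input path; the paper's MNIST 8M and Yahoo datasets are not bundled here.
        val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/mnist8m.libsvm")

        // Standard MLlib 1.6 training call that Yggdrasil advertises compatibility with.
        // Depth and bin count below are placeholder values, not the paper's settings.
        val model = DecisionTree.trainClassifier(
          input = data,
          numClasses = 10,
          categoricalFeaturesInfo = Map[Int, Int](),
          impurity = "gini",
          maxDepth = 15,
          maxBins = 32)

        // Print the first part of the learned tree structure for a quick sanity check.
        println(model.toDebugString.take(500))
        sc.stop()
      }
    }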
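
Regarding the Experiment Setup row: the paper says Spark's memory configuration was tuned but does not report the tuned values. Purely to illustrate the kind of settings involved, the sketch below sets standard Spark 1.6 configuration properties for executors sized to an r3.2xlarge worker (8 vCPUs, 61 GB RAM); every numeric value is an assumption, not the authors' configuration.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative Spark 1.6 memory/serialization settings; not the paper's tuned values.
    val conf = new SparkConf()
      .setAppName("yggdrasil-benchmark-sketch")
      .set("spark.executor.memory", "48g")        // leave headroom below the 61 GB on r3.2xlarge
      .set("spark.executor.cores", "8")           // r3.2xlarge exposes 8 vCPUs
      .set("spark.memory.fraction", "0.6")        // fraction of heap for execution and storage (illustrative)
      .set("spark.memory.storageFraction", "0.5") // split between cached data and execution memory (illustrative)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)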