Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

NGBoost: Natural Gradient Boosting for Probabilistic Prediction

Authors: Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Ng, Alejandro Schuler

ICML 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments use datasets from the UCI Machine Learning Repository, and follow the protocol first proposed in Hernández-Lobato and Adams (2015). For all datasets, we hold out a random 10% of the examples as a test set. From the other 90% we initially hold out 20% as a validation set to select M (the number of boosting stages) that gives the best log-likelihood, and then retrain on the entire 90% using the chosen M. The retrained model is then made to predict on the held-out 10% test set. This entire process is repeated 20 times for all datasets except Protein and Year MSD, for which it is repeated 5 times and 1 time respectively.
Researcher Affiliation | Collaboration | (1) Stanford University, Stanford, California, United States; (2) Unlearn.ai, San Francisco, California, United States; (3) Harvard Medical School, Cambridge, Massachusetts, United States.
Pseudocode | Yes | Algorithm 1 NGBoost for probabilistic prediction
Open Source Code | Yes | An open-source implementation is available at github.com/stanfordmlgroup/ngboost.
Open Datasets | Yes | Our experiments use datasets from the UCI Machine Learning Repository, and follow the protocol first proposed in Hernández-Lobato and Adams (2015).
Dataset Splits | Yes | For all datasets, we hold out a random 10% of the examples as a test set. From the other 90% we initially hold out 20% as a validation set to select M (the number of boosting stages) that gives the best log-likelihood, and then retrain on the entire 90% using the chosen M.
Hardware Specification | No | The paper discusses computational aspects like mini-batching and scalability to large datasets, but it does not provide specific details on the hardware used, such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions using 'Scikit-Learn implementation' for comparison methods but does not provide specific version numbers for Scikit-Learn or any other software dependencies.
Experiment Setup | Yes | For all experiments, NGBoost was configured with the Normal distribution, decision tree base learner with a maximum depth of three levels, and log scoring rule. The Year MSD dataset, being extremely large relative to the rest, was fit using a learning rate η of 0.1 while the rest of the datasets were fit with a learning rate of 0.01. In general we recommend small learning rates, subject to computational feasibility. For the Year MSD dataset we use a mini-batch size of 10%, for all other datasets we use 100%.
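
The Dataset Splits and Experiment Setup rows above describe a protocol that can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' code: the function names, the dataset keys ("protein", "year_msd"), the dictionary fields, and the two callbacks `select_m` and `fit_and_score` are all hypothetical placeholders for whatever model-fitting routine is used. Only the percentages, repetition counts, learning rates, and mini-batch fractions come from the quoted text.

```python
import random

# Values quoted in the rows above; the key and field names are illustrative.
DEFAULT = {"repeats": 20, "learning_rate": 0.01, "minibatch_frac": 1.0}
OVERRIDES = {
    "protein": {"repeats": 5},
    "year_msd": {"repeats": 1, "learning_rate": 0.1, "minibatch_frac": 0.1},
}


def settings_for(dataset):
    """Return the reported settings for a dataset (defaults unless overridden)."""
    return {**DEFAULT, **OVERRIDES.get(dataset, {})}


def split_indices(n, rng, test_frac=0.10, val_frac=0.20):
    """Hold out 10% for testing, then 20% of the remaining 90% for validation."""
    perm = list(range(n))
    rng.shuffle(perm)
    n_test = round(test_frac * n)
    test, train_val = perm[:n_test], perm[n_test:]
    n_val = round(val_frac * len(train_val))
    val, train = train_val[:n_val], train_val[n_val:]
    return train, val, test, train_val


def run_protocol(n, dataset, select_m, fit_and_score, seed=0):
    """Repeat: split, choose M on validation, retrain on the full 90%, test.

    select_m(train_idx, val_idx, cfg) -> M            (hypothetical callback)
    fit_and_score(fit_idx, test_idx, M, cfg) -> float (hypothetical callback)
    """
    cfg = settings_for(dataset)
    rng = random.Random(seed)
    scores = []
    for _ in range(cfg["repeats"]):
        train, val, test, train_val = split_indices(n, rng)
        m = select_m(train, val, cfg)  # number of boosting stages
        scores.append(fit_and_score(train_val, test, m, cfg))
    return scores
```

With the open-source ngboost package, the `learning_rate` and `minibatch_frac` fields would presumably map onto the `NGBRegressor` arguments of the same names, with the selected M passed as `n_estimators`. That mapping is an assumption of this sketch and is not stated in the table.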