Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Discretization-free Multicalibration through Loss Minimization over Tree Ensembles

Authors: Hongyi Henry Jin, Zijun Ding, Dung Daniel Ngo, Steven Z. Wu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across multiple datasets, our empirical evaluation shows that this condition is always met in practice. Our discretization-free algorithm consistently matches or outperforms existing multicalibration approaches even when evaluated using a discretization-based multicalibration metric that shares its discretization granularity with the baselines. 5 Experiments
Researcher Affiliation	Collaboration	Hongyi Henry Jin, University of California, Los Angeles, Los Angeles, CA, USA; Zijun Ding, Carnegie Mellon University, Pittsburgh, PA, USA; Dung Daniel Ngo, J.P. Morgan Chase AI Research, New York, NY, USA; Zhiwei Steven Wu, Carnegie Mellon University, Pittsburgh, PA, USA
Pseudocode	Yes	Algorithm 1: Discretization-free Multicalibration (DFMC). Require: calibration set D = (X, Y), group indicator function g : X → {0, 1}^{|G|}, uncalibrated base model f0 : X → R, ensemble solver SOLVEENSEMBLE. 1: Set the (|G| + 1)-dimensional input features X' = (f0(X), g(X)). 2: Set the output features Y' = Y − f0(X). 3: return f0(X) + SOLVEENSEMBLE(X', Y').
Open Source Code	Yes	Code to replicate the results in this work is available at https://github.com/hjenryin/Discretization-free-MC.
Open Datasets	Yes	A common dataset to consider in fairness and multicalibration literature is the ACS dataset, obtained through the Folktables (Ding et al., 2021) package. Zhang et al. (2017) introduced the UTKFace dataset, with a person's face image associated with their age, gender and race. ISIC challenge dataset is an image dataset related to skin lesion, and we considered the task introduced in the ISIC Challenge 2019 (Tschandl et al., 2018; Codella et al., 2017; Combalia et al., 2019). Finally, we consider the Comment Toxicity Classification dataset (Borkan et al., 2019) from the WILDS dataset (Koh et al., 2021).
Dataset Splits	Yes	For all experiments, we fix the training set for the uncalibrated baseline and the test set for evaluation. The calibration and validation set are partitioned 10 times, with the standard deviation indicated in the plots and the tables. (Section 5.2) we set aside 12,000 data entries for the test set and 18,000 entries for the calibration and validation set. (Appendix D.1) Koh et al. (2021) partitioned the dataset into train set (60%), validation set (10%) and test set (30%). (Appendix D.2) We used 40% of the data for training, 40% for calibration and validation, and 20% for testing. (Appendix D.3) We use the 12413 BCN images as the training set for the base predictor, three-fourths of the HAM images (7511 images) as the calibration and validation set, and the remaining 5407 images as the test set. (Appendix D.4)
Hardware Specification	Yes	The uncalibrated predictors for the image or text task were trained on an A100 GPU or an RTX 4090 GPU. Other predictors, as well as the calibrators, were trained on an AMD 64-core CPU.
Software Dependencies	No	Our ERM approach can be implemented using off-the-shelf tree ensemble learning methods such as LightGBM. Finally, we implement our algorithm described in Algorithm 1 with the LightGBM (Ke et al., 2017) solver. While software names like LightGBM, XGBoost, DistilBERT, and ResNet are mentioned, specific version numbers for these software packages are not provided in the text.
Experiment Setup	Yes	We use the binary groups as the feature and sweep over the regularization strength λ in {0, 10^{-6}, 10^{-5}, ..., 10^{-2}}. (Appendix E) To avoid the complexity of determining the optimal number of trees, we use early stopping with patience of 50 iterations and set a high maximum limit of 5000 trees. 30% of the calibration set is set aside during calibration to monitor if the loss continues to decrease... The depth of each tree is two... We vary the learning rate across five exponentially spaced points from 0.01 to 1 and adjust the feature subsampling ratio linearly across 10 points from 0.1 to 1. (Appendix E) We experimented with trees of depth both 1 and 2 during hyperparameter sweep. Other hyperparameters include the learning rate, which can be 0.1, 0.3, or 1, and the subsampling ratio for the weak learner, which is adjusted linearly across 10 points from 0.1 to 1. (Appendix E) We consider this proportion as a hyperparameter, and swept over the values {0.1, 0.2, 0.3, 0.4, 0.5}. (Appendix E)
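
The three steps of Algorithm 1 quoted in the Pseudocode row can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper reports using LightGBM as the ensemble solver with depth-two trees and early stopping, whereas this sketch swaps in a small hand-written squared-loss boosting loop over depth-one stumps so that it runs with the standard library alone. The names fit_stump, solve_ensemble, dfmc, and group_bias, and the toy data, are hypothetical and do not come from the paper or its repository.

```python
def fit_stump(rows, residuals):
    """Return (feature, threshold, left_mean, right_mean) of the best
    least-squares split, or (None, None, mean, mean) if no split helps."""
    n = len(rows)
    mean_all = sum(residuals) / n
    best_sse = sum((r - mean_all) ** 2 for r in residuals)
    best = (None, None, mean_all, mean_all)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        step = max(1, len(values) // 16)  # cap the number of candidate thresholds
        for t in values[:-1][::step]:
            left = [r for row, r in zip(rows, residuals) if row[j] <= t]
            right = [r for row, r in zip(rows, residuals) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - ml) ** 2 for r in left)
                   + sum((r - mr) ** 2 for r in right))
            if sse < best_sse:
                best_sse, best = sse, (j, t, ml, mr)
    return best


def solve_ensemble(rows, targets, n_rounds=30, lr=0.3):
    """Stand-in for SOLVEENSEMBLE: squared-loss boosting over stumps."""
    stumps, resid = [], list(targets)
    for _ in range(n_rounds):
        j, t, ml, mr = fit_stump(rows, resid)
        if j is None:  # no split reduces the loss any further
            break
        stumps.append((j, t, lr * ml, lr * mr))
        resid = [r - (lr * ml if row[j] <= t else lr * mr)
                 for row, r in zip(rows, resid)]

    def predict(row):
        return sum(a if row[j] <= t else b for j, t, a, b in stumps)

    return predict


def dfmc(xs, ys, f0, g, solver=solve_ensemble):
    """Sketch of Algorithm 1 with a pluggable ensemble solver."""
    feats = [[f0(x)] + list(g(x)) for x in xs]   # step 1: X' = (f0(X), g(X))
    resid = [y - f0(x) for x, y in zip(xs, ys)]  # step 2: Y' = Y - f0(X)
    h = solver(feats, resid)
    return lambda x: f0(x) + h([f0(x)] + list(g(x)))  # step 3


def group_bias(f, xs, ys, a):
    """Mean residual y - f(x) over the points whose group flag equals a."""
    errs = [y - f(x) for x, y in zip(xs, ys) if x[1] == a]
    return sum(errs) / len(errs)


# Toy data: x = (score, group flag); the base model ignores the group,
# so it under-predicts group 1 by 0.2 and over-predicts group 0 by 0.1.
xs = [(i / 10, a) for i in range(1, 10) for a in (0, 1)]
ys = [s + (0.2 if a == 1 else -0.1) for s, a in xs]
f0 = lambda x: x[0]     # hypothetical uncalibrated base model
g = lambda x: (x[1],)   # a single binary group indicator

calibrated = dfmc(xs, ys, f0, g)
for a in (0, 1):
    print(a, group_bias(f0, xs, ys, a), group_bias(calibrated, xs, ys, a))
```

Because the solver is passed in as an argument, a LightGBM-based regressor could be substituted without changing dfmc, provided it accepts the feature rows and residual targets and returns a per-row prediction function.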