When is Multicalibration Post-Processing Necessary?
Authors: Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90 million parameter fine-tuned LLMs. |
| Researcher Affiliation | Collaboration | Dutch Hansen University of Southern California jmhansen@usc.edu Siddartha Devic University of Southern California devic@usc.edu Preetum Nakkiran Apple preetum@nakkiran.org Vatsal Sharan University of Southern California vsharan@usc.edu |
| Pseudocode | No | The paper describes algorithms but does not present them in pseudocode or an algorithm block format. |
| Open Source Code | Yes | We also release a python package implementing multicalibration algorithms, available via pip install multicalibration. Experiment code is available at https://github.com/dutchhansen/empirical-multicalibration, while code for the python package is available at https://github.com/sid-devic/multicalibration. |
| Open Datasets | Yes | We experiment across a variety of classification tasks: five tabular datasets (ACS Income, UCI Bank Marketing, UCI Credit Default, HMDA, MEPS), two language datasets (Civil Comments, Amazon Polarity), and two image datasets (CelebA, Camelyon17). For each dataset, we also define between 10 and 20 overlapping subgroups depending on available features or metadata. We detail and provide citations for each of our datasets and exact subgroup descriptions in Appendix E. |
| Dataset Splits | Yes | For consistency, we partition all datasets into three subsets: training, validation, and test. Test sets remain fixed across all experiments. We report accuracy and multicalibration metrics on the test set averaged over five random splits of train and validation sets for tabular data, and three splits for more complex data. |
| Hardware Specification | Yes | All experiments were performed on a collection of four AWS G5 instances, each equipped with an NVIDIA 24GB A10 GPU. |
| Software Dependencies | No | The paper mentions several software tools and libraries, such as Scikit-learn, DistilBERT, GloVe, PyTorch, and torchtext, but it does not specify exact version numbers for these dependencies, which a reproducible software-dependency description requires. |
| Experiment Setup | Yes | We detail the algorithm’s hyperparameters and the values we choose for them in Appendix F.1... We also sweep over learning rate decay rates of η ∈ {0.8, 0.85, 0.9, 0.95} for the learner and, when applicable, rates of η ∈ {0.9, 0.95, 0.98, 0.99} for the adversary... In all experiments with MLPs on tabular datasets, we use the Adam optimizer. On ACS Income, we train for 50 epochs. We search over hidden-layer widths: (128, BN, 128), (128, 256, 128), and (128, BN, 256, BN, 128). We vary batch size over {32, 64, 128} and learning rate over {0.01, 0.001, 0.0001, 0.00001}. |
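For readers unfamiliar with what a multicalibration post-processing step does, the following is a minimal sketch of the iterative "patching" idea behind algorithms of this family (in the style of Hébert-Johnson et al.'s HKRR algorithm): repeatedly find a (subgroup, prediction-bin) cell whose mean prediction deviates from its mean label by more than a tolerance α, and shift the predictions in that cell to correct it. This is an illustrative simplification, not the API of the released `multicalibration` package; the function name, bin discretization, and stopping rule here are assumptions made for the sketch.

```python
import numpy as np

def multicalibrate(preds, labels, groups, alpha=0.05, n_bins=10, max_iters=50):
    """Iteratively patch predictions so that, within every (group, bin) cell,
    the mean prediction matches the mean label to within alpha.

    preds  : array of scores in [0, 1]
    labels : binary labels in {0, 1}
    groups : list of boolean masks, one per (possibly overlapping) subgroup
    """
    p = preds.astype(float).copy()
    for _ in range(max_iters):
        updated = False
        for g in groups:
            # Discretize current predictions into n_bins level sets.
            bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
            for b in range(n_bins):
                cell = g & (bins == b)
                if not cell.any():
                    continue
                # Miscalibration of this (group, bin) cell.
                residual = labels[cell].mean() - p[cell].mean()
                if abs(residual) > alpha:
                    # Patch: shift the cell's predictions toward the label mean.
                    p[cell] = np.clip(p[cell] + residual, 0.0, 1.0)
                    updated = True
        if not updated:  # every cell is alpha-calibrated
            break
    return p

# Toy usage: one subgroup starts miscalibrated; after patching, each
# group's mean prediction is within alpha of its mean label.
rng = np.random.default_rng(0)
n = 2000
g0 = np.zeros(n, dtype=bool)
g0[: n // 2] = True
labels = rng.binomial(1, 0.7, n).astype(float)   # true positive rate ~0.7
preds = np.full(n, 0.5)                          # uniformly underconfident
calibrated = multicalibrate(preds, labels, [g0, ~g0], alpha=0.02)
```

The real algorithms differ in detail (e.g., HKRR audits with a tolerance scaled by cell mass, and LSBoost-style variants fit a regressor to the residuals), but the fixed-point structure — audit cells, patch the worst violation, repeat — is the common core that the paper's experiments evaluate.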