When is Multicalibration Post-Processing Necessary?
Authors: Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the first comprehensive study evaluating the usefulness of multicalibration post-processing across a broad set of tabular, image, and language datasets for models spanning from simple decision trees to 90 million parameter fine-tuned LLMs. |
| Researcher Affiliation | Collaboration | Dutch Hansen University of Southern California jmhansen@usc.edu Siddartha Devic University of Southern California devic@usc.edu Preetum Nakkiran Apple preetum@nakkiran.org Vatsal Sharan University of Southern California vsharan@usc.edu |
| Pseudocode | No | The paper describes algorithms but does not present them in pseudocode or an algorithm block format. |
| Open Source Code | Yes | We also release a python package implementing multicalibration algorithms, available via pip install multicalibration. Experiment code is available at https://github.com/dutchhansen/empirical-multicalibration, while code for the python package is available at https://github.com/sid-devic/multicalibration. |
| Open Datasets | Yes | We experiment across a variety of classification tasks: five tabular datasets (ACS Income, UCI Bank Marketing, UCI Credit Default, HMDA, MEPS), two language datasets (Civil Comments, Amazon Polarity), and two image datasets (CelebA, Camelyon17). For each dataset, we also define between 10 and 20 overlapping subgroups depending on available features or metadata. We detail and provide citations for each of our datasets and exact subgroup descriptions in Appendix E. |
| Dataset Splits | Yes | For consistency, we partition all datasets into three subsets: training, validation, and test. Test sets remain fixed across all experiments. We report accuracy and multicalibration metrics on the test set averaged over five random splits of train and validation sets for tabular data, and three splits for more complex data. |
| Hardware Specification | Yes | All experiments were performed on a collection of four AWS G5 instances, each equipped with an NVIDIA 24GB A10 GPU. |
| Software Dependencies | No | The paper mentions several software tools and libraries, such as Scikit-learn, DistilBERT, GloVe, PyTorch, and torchtext, but it does not specify exact version numbers for these dependencies, which a reproducible software-dependency description requires. |
| Experiment Setup | Yes | We detail the algorithm’s hyperparameters and the values we choose for them in Appendix F.1... We also sweep over learning rate decay rates of η ∈ {0.8, 0.85, 0.9, 0.95} for the learner and, when applicable, rates of η ∈ {0.9, 0.95, 0.98, 0.99} for the adversary... In all experiments with MLPs on tabular datasets, we use the Adam optimizer. On ACS Income, we train for 50 epochs. We search over hidden-layer widths: (128, BN, 128), (128, 256, 128), and (128, BN, 256, BN, 128). We vary batch size over {32, 64, 128} and learning rate over {0.01, 0.001, 0.0001, 0.00001}. |
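For readers unfamiliar with what a multicalibration post-processing step does, the following is a minimal sketch of the iterative "patching" idea behind algorithms of this family (in the style of Hébert-Johnson et al.'s HKRR algorithm): repeatedly find a (subgroup, prediction-bin) cell whose mean prediction deviates from its mean label by more than a tolerance α, and shift the predictions in that cell to correct it. This is an illustrative simplification, not the API of the released `multicalibration` package; the function name, bin discretization, and stopping rule here are assumptions made for the sketch.

```python
import numpy as np

def multicalibrate(preds, labels, groups, alpha=0.05, n_bins=10, max_iters=50):
    """Iteratively patch predictions so that, within every (group, bin) cell,
    the mean prediction matches the mean label to within alpha.

    preds  : array of scores in [0, 1]
    labels : binary labels in {0, 1}
    groups : list of boolean masks, one per (possibly overlapping) subgroup
    """
    p = preds.astype(float).copy()
    for _ in range(max_iters):
        updated = False
        for g in groups:
            # Discretize current predictions into n_bins level sets.
            bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
            for b in range(n_bins):
                cell = g & (bins == b)
                if not cell.any():
                    continue
                # Miscalibration of this (group, bin) cell.
                residual = labels[cell].mean() - p[cell].mean()
                if abs(residual) > alpha:
                    # Patch: shift the cell's predictions toward the label mean.
                    p[cell] = np.clip(p[cell] + residual, 0.0, 1.0)
                    updated = True
        if not updated:  # every cell is alpha-calibrated
            break
    return p

# Toy usage: one subgroup starts miscalibrated; after patching, each
# group's mean prediction is within alpha of its mean label.
rng = np.random.default_rng(0)
n = 2000
g0 = np.zeros(n, dtype=bool)
g0[: n // 2] = True
labels = rng.binomial(1, 0.7, n).astype(float)   # true positive rate ~0.7
preds = np.full(n, 0.5)                          # uniformly underconfident
calibrated = multicalibrate(preds, labels, [g0, ~g0], alpha=0.02)
```

The real algorithms differ in detail (e.g., HKRR audits with a tolerance scaled by cell mass, and LSBoost-style variants fit a regressor to the residuals), but the fixed-point structure — audit cells, patch the worst violation, repeat — is the common core that the paper's experiments evaluate.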