Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
On Calibration and Out-of-Domain Generalization
Authors: Yoav Wald, Amir Feder, Daniel Greenfeld, Uri Shalit
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using four datasets from the recently proposed WILDS OOD benchmark [23], as well as the Colored MNIST dataset [21], we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. |
| Researcher Affiliation | Collaboration | Yoav Wald Johns Hopkins University EMAIL Technion EMAIL Daniel Greenfeld Jether Energy Research EMAIL Technion EMAIL |
| Pseudocode | No | The paper describes methods and concepts but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We are preparing the code for publication and will do our best to have it ready by the end of the review period. |
| Open Datasets | Yes | Using four datasets from the recently proposed WILDS OOD benchmark [23], as well as the Colored MNIST dataset [21], we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. |
| Dataset Splits | Yes | In order to perform multi-domain calibration we modify the splits to include a multi-domain validation set whenever possible. See supplemental Section B for details and for additional results on Amazon Reviews. ... We specify hyperparameters and training details in the supplementary material (for both the WILDS benchmark and Colored MNIST). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as CPU or GPU models, or cloud computing resources. |
| Software Dependencies | No | The paper mentions using PyTorch in the ethics checklist, but it does not specify version numbers for PyTorch or any other software dependencies needed to reproduce the experiments. |
| Experiment Setup | Yes | We specify hyperparameters and training details in the supplementary material (for both the WILDS benchmark and Colored MNIST). When using a training setup from other works (e.g. in Colored MNIST), we give a reference to the work and specify changes we made upon their setup. |