Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Calibration and Out-of-Domain Generalization

Authors: Yoav Wald, Amir Feder, Daniel Greenfeld, Uri Shalit

NeurIPS 2021 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using four datasets from the recently proposed WILDS OOD benchmark [23], as well as the Colored MNIST dataset [21], we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. |
| Researcher Affiliation | Collaboration | Yoav Wald (Johns Hopkins University, EMAIL); Amir Feder (Technion, EMAIL); Daniel Greenfeld (Jether Energy Research, EMAIL); Uri Shalit (Technion, EMAIL) |
| Pseudocode | No | The paper describes methods and concepts but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We are preparing the code for publication and will do our best to have it ready by the end of the review period. |
| Open Datasets | Yes | Using four datasets from the recently proposed WILDS OOD benchmark [23], as well as the Colored MNIST dataset [21], we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. |
| Dataset Splits | Yes | In order to perform multi-domain calibration we modify the splits to include a multi-domain validation set whenever possible. See supplemental Section B for details and for additional results on Amazon Reviews. ... We specify hyperparameters and training details in the supplementary material (for both the WILDS benchmark and Colored MNIST). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as CPU or GPU models, or cloud computing resources. |
| Software Dependencies | No | The paper mentions using PyTorch in the ethics checklist, but it does not specify version numbers for PyTorch or any other software dependencies needed to reproduce the experiments. |
| Experiment Setup | Yes | We specify hyperparameters and training details in the supplementary material (for both the WILDS benchmark and Colored MNIST). When using a training setup from other works (e.g. in Colored MNIST), we give a reference to the work and specify changes we made upon their setup. |
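For context on the "calibrated across multiple domains" criterion the paper targets, the sketch below shows one standard way to measure calibration separately per training domain: binned expected calibration error (ECE) computed on each domain's predictions. This is an illustrative sketch of the general metric, not code from the paper, and the function and variable names are our own.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary classifiers: group predictions by
    confidence and average the |accuracy - confidence| gap per bin,
    weighted by the fraction of samples in that bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    conf = np.maximum(probs, 1.0 - probs)          # confidence of predicted class
    pred = (probs >= 0.5).astype(int)
    acc = (pred == labels).astype(float)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        mask = (conf > lo) & (conf <= hi)          # half-open bins (lo, hi]
        if mask.any():
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece

def per_domain_ece(probs, labels, domains, n_bins=10):
    """ECE computed independently on each training domain; a model
    calibrated across domains should score low on all of them."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    domains = np.asarray(domains)
    return {d: expected_calibration_error(probs[domains == d],
                                          labels[domains == d], n_bins)
            for d in np.unique(domains)}
```

For example, a model that always predicts 0.9 on examples whose true label is always 1 is overconfident-free but underconfident by 0.1, so its ECE is 0.1 in every domain.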