Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bridging Multicalibration and Out-of-distribution Generalization Beyond Covariate Shift
Authors: Jiayun Wu, Jiashuo Liu, Peng Cui, Steven Z. Wu
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose MC-Pseudolabel, a post-processing algorithm to achieve both extended multicalibration and out-of-distribution generalization. The algorithm, with lightweight hyperparameters and optimization through a series of supervised regression steps, achieves superior performance on real-world datasets with distribution shift. |
| Researcher Affiliation | Academia | Jiayun Wu, Depart. of Computer Science & Tech., Tsinghua University, Beijing, China 100084; Jiashuo Liu, Depart. of Computer Science & Tech., Tsinghua University, Beijing, China 100084; Peng Cui, Key Laboratory of Pervasive Computing, Ministry of Education, Depart. of Computer Science & Tech., Tsinghua University, Beijing, China 100084; Zhiwei Steven Wu, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 |
| Pseudocode | Yes | Algorithm 1 MC-Pseudolabel. Require: A dataset D = (Dx, Dy), a grouping function class H, a predictive function class F. 1: t ← 0; 2: f0 ← Initialization; {For example, models trained with ERM.} 3: m ← \|Range(Discretize(f0))\|; 4: repeat... |
| Open Source Code | Yes | Code available at: https://github.com/IC-hub/MC-Pseudolabel |
| Open Datasets | Yes | We experiment on Poverty Map [44] and ACSIncome [7] for the multi-environment setting, and Vessel Power [33] for the single-environment setting. |
| Dataset Splits | Yes | We select the best model across hyperparameters based on three model selection criteria, including in-distribution validation on the average of training data, worst-environment validation with the worst performance across training environments, and oracle validation on target data. |
| Hardware Specification | Yes | Each experiment with a single set of hyperparameters is run on one NVIDIA GeForce RTX 3090 with 24GB of memory, taking at most 15 minutes. |
| Software Dependencies | No | Our experiments are based on the architecture of PyTorch [35]. |
| Experiment Setup | Yes | Table 4: Hyperparameters for model architecture. |
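
The three model selection criteria quoted in the Dataset Splits row can be illustrated with a minimal sketch. It is not taken from the paper's code. The function name `select_best`, the dictionary layout, the configuration names, and all error values are hypothetical. It also assumes that lower error is better and that the average over training data can be approximated by an unweighted mean over training environments.

```python
from statistics import mean


def select_best(results):
    """Pick one hyperparameter configuration per selection criterion.

    `results` maps a configuration name to a dict with:
      - "val_error_by_env": validation error per training environment
      - "target_error": error on the target distribution
    This layout is an assumption made for illustration only.
    """
    criteria = {
        # In-distribution validation: average error over training environments
        # (assumes equally weighted environments).
        "in_distribution": lambda r: mean(r["val_error_by_env"].values()),
        # Worst-environment validation: error of the hardest training environment.
        "worst_environment": lambda r: max(r["val_error_by_env"].values()),
        # Oracle validation: error measured directly on target data.
        "oracle": lambda r: r["target_error"],
    }
    # For each criterion, keep the configuration with the lowest score.
    return {
        name: min(results, key=lambda cfg: score(results[cfg]))
        for name, score in criteria.items()
    }


# Made-up numbers, chosen so that the three criteria disagree.
results = {
    "config_a": {"val_error_by_env": {"env_1": 0.10, "env_2": 0.60}, "target_error": 0.50},
    "config_b": {"val_error_by_env": {"env_1": 0.35, "env_2": 0.40}, "target_error": 0.55},
    "config_c": {"val_error_by_env": {"env_1": 0.45, "env_2": 0.50}, "target_error": 0.45},
}

chosen = select_best(results)
print(chosen)
```

With these invented values each criterion selects a different configuration, which is why results are typically reported separately per criterion. Oracle validation uses target data and so serves as an optimistic reference rather than a deployable selection rule.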