Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Multi-Expert Distributionally Robust Optimization for Out-of-Distribution Generalization
Authors: Jinyong Jeong, Hyungu Kahng, Seoung Bum Kim
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MEDRO across a variety of distribution shift settings. On subpopulation shift benchmarks [6, 12], MEDRO significantly improves worst-group accuracy. It also generalizes effectively to domain and mixed-shift benchmarks [5], consistently outperforming single-head DRO baselines. Empirical evaluations on a range of standard distribution shift benchmarks demonstrate that MEDRO often achieves robust predictive performance compared to existing methods. Section 4 reports experimental findings |
| Researcher Affiliation | Academia | (1) Department of Industrial and Management Engineering, Korea University, Seoul; (2) Department of Convergence Business, Korea University, Sejong |
| Pseudocode | Yes | Pseudocode is provided in Algorithm 1 of Appendix A.3. |
| Open Source Code | Yes | Our codes are available at https://github.com/jyjeongku/MEDRO. |
| Open Datasets | Yes | Our experimental evaluation uses datasets that span diverse data modalities and distribution shift scenarios to thoroughly evaluate the effectiveness of MEDRO. These datasets are summarized in Table 5 of Appendix E. We selected these datasets for the following key reasons: 1. Comprehensive coverage of shift types: Our selection includes both subpopulation shift datasets (Waterbirds [37], CelebA [38], CivilComments [39], MultiNLI [40], MetaShift [41], NICO++ [42], CheXpert [43], Living17 [44]) and domain generalization datasets (Camelyon17 [45], iWildCam [46]), as well as hybrid settings (PovertyMap [47]) that exhibit both types of shifts simultaneously. All datasets used in our experiments are publicly available. |
| Dataset Splits | Yes | A 10% validation split was retained for consistency with Group DRO, but it was not used for early stopping or model selection. Following the official WILDS protocol, we used the designated out-of-distribution (OOD) validation sets for model selection and reported performance on the OOD test sets using dataset-specific metrics... Experiments use the original 5-fold dataset splits provided in WILDS. |
| Hardware Specification | No | The paper does not include detailed information on compute resources such as GPU type or runtime. However, all experiments were executed using standard resources compatible with the official benchmark protocols (e.g., SubpopBench and WILDS). |
| Software Dependencies | No | For image datasets, we used ResNet-50 pretrained on ImageNet-1K [51]; for text datasets, we used BERT-base (bert-base-uncased) [52]. Image models were trained with SGD (momentum 0.9), and text models with AdamW [53]. |
| Experiment Setup | Yes | All methods used identical ResNet-50 architectures and optimizer configurations. Additional training and hyperparameter details are in Appendix E.1. ... Models were trained for 50 epochs using a pretrained ResNet-50 backbone [51]. We used SGD with momentum 0.9, batch size 128, learning rate 10^-5, and strong ℓ2 regularization (weight decay = 0.1). No data augmentation was used. ... We followed the official protocol, tuning across 16 randomized hyperparameter configurations. Each configuration was drawn from a predefined search space in Table 6. ... In addition to tuning the model learning rate η_θ (as part of the general search space), we tuned the risk weight step size η_Λ, which controls the group weight update during training. The range 10^Uniform[-3, 1] was used, consistent with the Group DRO setting. ... For Camelyon17, we enabled ImageNet pretraining, following prior work [18] that demonstrated improved performance with this setting. ... Table 8 summarizes the training configurations used for MEDRO on WILDS datasets. |
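
The Experiment Setup row describes tuning across 16 randomized configurations, with the risk weight step size drawn from a range that reads as 10 raised to a uniform draw from [-3, 1]. A minimal sketch of that sampling step follows, under that reading. The function names, the `eta_lambda` key, and the seed are hypothetical and are not taken from the MEDRO repository. The rest of the search space (Table 6 of the paper) is not reproduced here, so only the step size is sampled.

```python
import random


def sample_log_uniform(rng, low_exp, high_exp):
    """Return 10**u, where u is drawn uniformly from [low_exp, high_exp]."""
    return 10.0 ** rng.uniform(low_exp, high_exp)


def sample_configs(n_configs=16, seed=0):
    """Draw n_configs random settings for the risk weight step size.

    Assumption: the range is 10**Uniform[-3, 1]. The remaining
    hyperparameters from the paper's search space are omitted here.
    """
    rng = random.Random(seed)  # seeded so the draw is repeatable
    return [
        {"eta_lambda": sample_log_uniform(rng, -3.0, 1.0)}
        for _ in range(n_configs)
    ]


configs = sample_configs()
for i, cfg in enumerate(configs):
    print(f"config {i:2d}: eta_lambda = {cfg['eta_lambda']:.5f}")
```

Sampling the exponent rather than the value spreads the draws evenly across orders of magnitude, so step sizes near 0.001 are as likely as those near 10.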