Evaluating model performance under worst-case subpopulations

Authors: Mike Li, Hongseok Namkoong, Shangzhou Xia

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 4, we demonstrate the effectiveness of our procedure on real data. By evaluating model robustness under subpopulation shifts, our methods allow the selection of robust models before deployment, as we illustrate using the recently proposed CLIP model [62].
Researcher Affiliation | Academia | Mike Li, Decision, Risk, and Operations Division, Columbia Business School, New York, NY 10027, MLi24@gsb.columbia.edu; Hongseok Namkoong, Decision, Risk, and Operations Division, Columbia Business School, New York, NY 10027, namkoong@gsb.columbia.edu; Shangzhou Xia, Decision, Risk, and Operations Division, Columbia Business School, New York, NY 10027, SXia24@gsb.columbia.edu
Pseudocode | Yes | Algorithm 1: Two-stage procedure for estimating worst-case subpopulation performance (2)
1: INPUT: subpopulation size $\alpha$, model class $\mathcal{H}$, samples $S_1$ and $S_2$
2: On $S_1$, solve $\hat{h}_1 \in \operatorname{argmin}_{h \in \mathcal{H}} \frac{1}{|S_1|} \sum_{i \in S_1} \big( \ell(\theta(X_i); Y_i) - h(Z_i) \big)^2$
3: On $S_2$, compute the plug-in estimator $\widehat{W}_\alpha(\hat{h}_1) = \inf_{\eta \in \mathbb{R}} \big\{ \frac{1}{\alpha\,|S_2|} \sum_{i \in S_2} \big[ \hat{h}_1(Z_i) - \eta \big]_+ + \eta \big\}$
(A runnable sketch of this procedure follows the table.)
Open Source Code | No | The paper does not provide an explicit statement about the release of its own source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We study a Pharmacogenetics and Pharmacogenomics Knowledge Base dataset constructed from optimal dosages found through trial and error by clinicians. The dataset comprises 4,788 patients (after excluding missing data)... Consortium [26] found that a linear model outperforms... We study this problem on the Functional Map of the World (FMoW) dataset [25]
Dataset Splits | Yes | On validation data collected during 2002-2013, we first evaluate model performance on subpopulations... FMoW-WILDS training set (collected in 2002-2013, n = 76,863)... ID val, collected in 2002-2013, n = 11,483
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using 'gradient boosted decision trees (package XGBoost [22])' but does not provide specific version numbers for XGBoost or any other software dependencies.
Experiment Setup | Yes | We select λ = 0.4 so that the ensembled model (CLIP WiSE-FT) achieves similar performance to ImageNet pre-trained counterparts on the in-distribution validation data. To further make models comparable with respect to the cross-entropy loss, we calibrate the CLIP WiSE-FT model by tuning the temperature parameter so that its average loss on the in-distribution validation set matches the worst average loss of ImageNet pre-trained models (DenseNet ERM). See Appendix B for detailed experimental settings and training specifications. (A sketch of this ensembling-and-calibration step also follows the table.)
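
The estimator in Algorithm 1 reduces to a least-squares regression of per-example losses on the subpopulation features Z, followed by a closed-form CVaR computation. Below is a minimal Python sketch, assuming NumPy arrays of losses and features; the regressor (scikit-learn's GradientBoostingRegressor, standing in for the XGBoost trees the paper mentions) and all function and variable names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of Algorithm 1: two-stage worst-case subpopulation estimator.
# Regressor choice and all names are assumptions; only the two-stage structure
# and the CVaR dual follow the paper's description.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def worst_case_subpop_loss(losses_s1, Z_s1, Z_s2, alpha):
    """Estimate W_alpha(theta): average loss on the worst-off alpha-subpopulation.

    Stage 1 (on S1): regress per-example losses l(theta(X_i); Y_i) on Z_i,
    giving h_hat(Z) ~= E[loss | Z].
    Stage 2 (on S2): plug h_hat into the CVaR dual
        W_alpha = inf_eta { (1 / (alpha * |S2|)) * sum_i [h_hat(Z_i) - eta]_+ + eta }.
    """
    # Stage 1: least-squares fit of the conditional loss given Z.
    h_hat = GradientBoostingRegressor().fit(Z_s1, losses_s1)

    # Stage 2: under the empirical distribution, the infimum over eta is
    # attained at the (1 - alpha)-quantile of h_hat(Z), so solve in closed form.
    h_vals = h_hat.predict(Z_s2)
    eta = np.quantile(h_vals, 1.0 - alpha)
    return np.mean(np.maximum(h_vals - eta, 0.0)) / alpha + eta
```

One could equally minimize over η numerically; the quantile shortcut is just the standard closed-form minimizer of the CVaR dual under the empirical distribution.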
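
The experiment-setup row describes two steps: weight-space ensembling of CLIP's zero-shot and fine-tuned weights (WiSE-FT) with mixing coefficient λ = 0.4, and temperature scaling so that the average validation loss matches a target. The sketch below is written under stated assumptions: whether λ weights the fine-tuned or the zero-shot endpoint is not specified in the excerpt, and the root-finding bracket is an arbitrary illustrative choice.

```python
# Hedged sketch of WiSE-FT-style weight interpolation plus temperature
# calibration; the endpoint convention for lam and the bracket are assumptions.
import numpy as np
from scipy.optimize import brentq

def wise_ft(zero_shot_state, finetuned_state, lam=0.4):
    """Weight-space ensemble: theta = (1 - lam) * theta_zs + lam * theta_ft."""
    return {k: (1.0 - lam) * zero_shot_state[k] + lam * finetuned_state[k]
            for k in zero_shot_state}

def avg_cross_entropy(logits, labels, T):
    """Average cross-entropy of temperature-scaled logits (labels are int ids)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def calibrate_temperature(logits, labels, target_loss, lo=0.05, hi=20.0):
    """Find T whose average validation loss equals target_loss.

    Assumes target_loss is bracketed on [lo, hi] (sign change for brentq).
    """
    return brentq(lambda T: avg_cross_entropy(logits, labels, T) - target_loss,
                  lo, hi)
```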