Evaluating model performance under worst-case subpopulations
Authors: Mike Li, Hongseok Namkoong, Shangzhou Xia
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4, we demonstrate the effectiveness of our procedure on real data. By evaluating model robustness under subpopulation shifts, our methods allow the selection of robust models before deployment as we illustrate using the recently proposed CLIP model [62]. |
| Researcher Affiliation | Academia | Mike Li, Decision, Risk, and Operations Division, Columbia Business School, New York, NY 10027, MLi24@gsb.columbia.edu; Hongseok Namkoong, Decision, Risk, and Operations Division, Columbia Business School, New York, NY 10027, namkoong@gsb.columbia.edu; Shangzhou Xia, Decision, Risk, and Operations Division, Columbia Business School, New York, NY 10027, SXia24@gsb.columbia.edu |
| Pseudocode | Yes | Algorithm 1: Two-stage procedure for estimating worst-case subpopulation performance (2). 1: INPUT: subpopulation size α, model class H, samples S1 and S2. 2: On S1, solve ĥ1 ∈ argmin_{h∈H} (1/\|S1\|) Σ_{i∈S1} (ℓ(θ(X_i); Y_i) − h(Z_i))². 3: On S2, compute the plug-in estimator Ŵ_α(ĥ1) = inf_η { (1/(α\|S2\|)) Σ_{i∈S2} [ĥ1(Z_i) − η]_+ + η }. (A runnable sketch of this procedure appears below the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about the release of its own source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We study a Pharmacogenetics and Pharmacogenomics Knowledge Base dataset constructed from optimal dosages found through trial and error by clinicians. The dataset comprises 4,788 patients (after excluding missing data)... Consortium [26] found that a linear model outperforms... We study this problem on the Functional Map of the World (FMoW) dataset [25] |
| Dataset Splits | Yes | On validation data collected during 2002-2013, we first evaluate model performance on subpopulations... FMoW-WILDS training set (collected in 2002-2013, n = 76,863)... ID val, collected in 2002-2013, n = 11,483 |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'gradient boosted decision trees (package XGBoost [22])' but does not provide specific version numbers for XGBoost or any other software dependencies. |
| Experiment Setup | Yes | We select λ = 0.4 so that the ensembled model (CLIP WiSE-FT) achieves similar performance as ImageNet pre-trained counterparts on the in-distribution validation data. To further make models comparable with respect to the cross-entropy loss, we calibrate the CLIP WiSE-FT model by tuning the temperature parameter so that its average loss on the in-distribution validation set matches the worst average loss of ImageNet pre-trained models (DenseNet ERM). See Appendix B for detailed experimental settings and training specifications. (A temperature-calibration sketch appears below the table.) |
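
The Algorithm 1 excerpt quoted above is a two-stage plug-in procedure: fit a regression ĥ1(z) for the conditional expected loss E[ℓ(θ(X); Y) | Z = z] on one split, then plug its predictions on the second split into the CVaR dual to obtain the worst-case performance over subpopulations of size at least α. The sketch below illustrates the idea under stated assumptions: per-example losses are precomputed, scikit-learn's GradientBoostingRegressor stands in for the paper's XGBoost fit, and all function and variable names are illustrative rather than taken from the authors' code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def worst_case_subpop_estimate(Z1, losses1, Z2, alpha):
    """Two-stage plug-in estimate of worst-case subpopulation performance.

    Hypothetical sketch of Algorithm 1 (not the authors' implementation).
    Z1, losses1: features Z_i and per-example losses l(theta(X_i); Y_i) on split S1.
    Z2:          features Z_i on split S2.
    alpha:       minimum subpopulation proportion, 0 < alpha <= 1.
    """
    # Stage 1: regress the observed losses on Z to estimate h*(z) = E[loss | Z = z].
    h1 = GradientBoostingRegressor().fit(Z1, losses1)

    # Stage 2: plug h1 into the CVaR dual
    #   W_alpha(h1) = inf_eta { (1 / (alpha * |S2|)) * sum_i [h1(Z_i) - eta]_+ + eta }.
    h_vals = h1.predict(Z2)

    def dual_objective(eta):
        return np.mean(np.maximum(h_vals - eta, 0.0)) / alpha + eta

    # The objective is piecewise-linear and convex, so the infimum is attained
    # at one of the predicted values (equivalently, near the (1 - alpha)-quantile).
    eta_star = min(h_vals, key=dual_objective)
    return dual_objective(eta_star)
```

With alpha = 1 this reduces to the average of ĥ1's predictions on S2; smaller alpha focuses the evaluation on the hardest subpopulations of that size.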
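
The temperature calibration mentioned in the experiment-setup row can be read as a one-parameter root-finding problem: rescale the CLIP WiSE-FT logits by 1/T and choose T so that the average cross-entropy on the in-distribution validation set matches a target value (per the quote, the worst average loss among the ImageNet pre-trained models). A minimal sketch under that reading follows; the helper names, the bisection bracket, and the use of scipy are assumptions, not details from the paper.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import log_softmax


def average_cross_entropy(logits, labels, temperature):
    """Average cross-entropy of temperature-scaled logits (N x C array, integer labels)."""
    log_probs = log_softmax(logits / temperature, axis=1)
    return -np.mean(log_probs[np.arange(len(labels)), labels])


def calibrate_temperature(logits, labels, target_loss, lo=0.05, hi=20.0):
    """Find a temperature whose validation cross-entropy matches target_loss.

    Hypothetical helper: assumes target_loss is bracketed by the losses at
    temperatures lo and hi, so the root-finder sees a sign change.
    """
    gap = lambda T: average_cross_entropy(logits, labels, T) - target_loss
    return brentq(gap, lo, hi)
```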