Mandoline: Model Evaluation under Distribution Shift
Authors: Mayee Chen, Karan Goel, Nimit S. Sohoni, Fait Poms, Kayvon Fatahalian, Christopher Ré
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical validation on NLP and vision tasks shows that MANDOLINE can estimate performance on the target distribution up to 3× more accurately compared to standard baselines. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, Stanford, USA; (2) Institute for Computational and Mathematical Engineering, Stanford University, Stanford, USA. |
| Pseudocode | Yes | Algorithm 1 MANDOLINE |
| Open Source Code | Yes | Code for MANDOLINE can be found on GitHub. |
| Open Datasets | Yes | CelebA (images) and CivilComments (text)... MNLI (Williams et al., 2018) validation set (target) using the SNLI (Bowman et al., 2015) validation set (source)... HANS (McCoy et al., 2019a)... sentiment classification on IMDB (Maas et al., 2011)... Sentiment140 (Go et al., 2009)... Yelp Polarity (Zhang et al., 2015)... Amazon Polarity (Zhang et al., 2015). |
| Dataset Splits | Yes | We are given a fixed model f_θ : X → Y, a labeled validation source dataset D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s}, and an unlabeled reference target dataset D_t = {x_i^t}_{i=1}^{n_t}. We partition D_s into D_{s1} and D_{s2} of sizes n_{s1}, n_{s2} such that the former is used to learn ŵ(x) and the latter is used for evaluation (see the evaluation sketch after the table). |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific cloud instance types) were mentioned for running experiments. |
| Software Dependencies | No | No specific software versions (e.g., PyTorch 1.9, Python 3.8) were provided for the dependencies used in the experiments. |
| Experiment Setup | Yes | We evaluate ResNet18 and ResNet50 models pretrained on ImageNet and fine-tuned for 5 epochs on CelebA. We use a standard bert-base-uncased model, fine-tuned on CivilComments for 5 epochs. (A fine-tuning sketch follows the table.) |
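
The Dataset Splits row describes an importance-weighting evaluation: weights ŵ(x) are learned on D_{s1} and then used to reweight per-example correctness on the held-out split D_{s2}. Below is a minimal Python sketch of that pattern. It is not the authors' Algorithm 1 (MANDOLINE fits density ratios over user-specified slices with its own weight estimator); here a plain source-vs-target logistic regression over hypothetical slice-indicator arrays `g_s1`, `g_s2`, `g_t` stands in for the weight-estimation step.

```python
# Minimal sketch of importance-weighted evaluation, assuming per-example
# slice indicators are available for both source and target data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_weights(g_s1, g_t):
    """Fit a source-vs-target classifier on slice features and convert its
    probabilities into density-ratio weights w(x) ~ p_t(x) / p_s(x)."""
    X = np.vstack([g_s1, g_t])
    y = np.concatenate([np.zeros(len(g_s1)), np.ones(len(g_t))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    def w(g):
        p = clf.predict_proba(g)[:, 1]
        # p/(1-p) = (n_t * p_t) / (n_s * p_s); correct for the sample sizes.
        return (p / (1 - p)) * (len(g_s1) / len(g_t))
    return w

def weighted_accuracy(w, g_s2, correct_s2):
    """Evaluate on the held-out source split D_s2, reweighting each example."""
    return np.average(correct_s2, weights=w(g_s2))

# Usage with toy slice indicators (hypothetical data):
rng = np.random.default_rng(0)
g_s1 = rng.integers(0, 2, size=(500, 3)).astype(float)  # D_s1 slices
g_s2 = rng.integers(0, 2, size=(500, 3)).astype(float)  # D_s2 slices
g_t  = rng.integers(0, 2, size=(400, 3)).astype(float)  # D_t slices
correct_s2 = rng.integers(0, 2, size=500)  # per-example 0/1 correctness on D_s2
w = estimate_weights(g_s1, g_t)
print(weighted_accuracy(w, g_s2, correct_s2))
```

The classifier-based density-ratio trick here is a generic substitute; the paper's point is precisely that fitting weights over curated slices, rather than raw features, makes this estimate reliable under shift.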
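
The Experiment Setup row quotes the vision configuration: an ImageNet-pretrained ResNet fine-tuned for 5 epochs on CelebA. The sketch below shows what such a loop could look like; the optimizer, learning rate, batch size, two-class head, and stand-in tensors are all assumptions, as the quoted text does not specify them.

```python
# Sketch of fine-tuning an ImageNet-pretrained ResNet18 for 5 epochs.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # assumption: binary task head

# Stand-in data; in the paper this would be a CelebA DataLoader.
train_loader = DataLoader(
    TensorDataset(torch.randn(32, 3, 224, 224), torch.randint(0, 2, (32,))),
    batch_size=8,
)

opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer/LR are assumptions
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # "finetuned for 5 epochs"
    for images, labels in train_loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```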