Mandoline: Model Evaluation under Distribution Shift

Authors: Mayee Chen, Karan Goel, Nimit S. Sohoni, Fait Poms, Kayvon Fatahalian, Christopher Ré

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical validation on NLP and vision tasks shows that MANDOLINE can estimate performance on the target distribution up to 3× more accurately than standard baselines.
Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, Stanford, USA; (2) Institute for Computational and Mathematical Engineering, Stanford, USA.
Pseudocode | Yes | Algorithm 1: MANDOLINE
Open Source Code | Yes | Code for MANDOLINE can be found on GitHub.
Open Datasets | Yes | CELEBA (images) and CIVILCOMMENTS (text)... MNLI (Williams et al., 2018) validation set (target) using the SNLI (Bowman et al., 2015) validation set (source)... HANS (McCoy et al., 2019a)... sentiment classification on IMDB (Maas et al., 2011)... SENTIMENT140 (Go et al., 2009)... YELP POLARITY (Zhang et al., 2015)... AMAZON POLARITY (Zhang et al., 2015).
Dataset Splits | Yes | We are given a fixed model $f_\theta : \mathcal{X} \to \mathcal{Y}$, a labeled validation source dataset $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, and an unlabeled reference target dataset $D_t = \{x_i^t\}_{i=1}^{n_t}$. We partition $D_s$ into $D_{s_1}$ and $D_{s_2}$ of sizes $n_{s_1}, n_{s_2}$ such that the former is used to learn $\hat{w}(x)$ and the latter is used for evaluation. (A sketch of this split-and-reweight evaluation follows the table.)
Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific cloud instance types) were mentioned for running experiments.
Software Dependencies | No | No specific software versions (e.g., PyTorch 1.9, Python 3.8) were provided for the dependencies used in the experiments.
Experiment Setup | Yes | We evaluate ResNet18 and ResNet50 models pretrained on ImageNet and finetuned for 5 epochs on CelebA. We use a standard bert-base-uncased model, fine-tuned on CIVILCOMMENTS for 5 epochs. (A hedged fine-tuning sketch follows the table.)
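
The Dataset Splits row above captures MANDOLINE's recipe: learn importance weights $\hat{w}(x)$ on one source split ($D_{s_1}$), then reweight per-example correctness on the other ($D_{s_2}$) to estimate target performance. Below is a minimal sketch of that recipe, assuming numpy and scikit-learn. Note the paper fits a log-linear density-ratio model over user-defined slices; this stand-in instead estimates the ratio with a logistic source-vs-target classifier on the slice indicators (a standard density-ratio trick), so it illustrates the pipeline rather than the authors' exact Algorithm 1. All names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_slice_weights(g_s1, g_t):
    """Estimate importance weights w(x) ~ p_t(g(x)) / p_s(g(x)).

    g_s1: (n_s1, k) binary slice matrix for the weight-learning source split
    g_t:  (n_t, k)  binary slice matrix for the unlabeled target set
    """
    X = np.vstack([g_s1, g_t])
    d = np.concatenate([np.zeros(len(g_s1)), np.ones(len(g_t))])  # 0=source, 1=target
    clf = LogisticRegression(max_iter=1000).fit(X, d)

    def weights(g):
        p = clf.predict_proba(g)[:, 1]
        # odds ratio, corrected for the source/target sample-size imbalance
        return (p / (1.0 - p)) * (len(g_s1) / len(g_t))
    return weights

def weighted_accuracy(correct_s2, w_s2):
    """Self-normalized importance-weighted estimate of target accuracy."""
    return float(np.sum(w_s2 * correct_s2) / np.sum(w_s2))

# Tiny synthetic demo: one slice that is rarer in the source than the target.
rng = np.random.default_rng(0)
g_s1 = rng.binomial(1, 0.2, size=(500, 1))  # slice fires 20% of the time in source
g_s2 = rng.binomial(1, 0.2, size=(500, 1))
g_t = rng.binomial(1, 0.6, size=(800, 1))   # ...but 60% of the time in target
correct_s2 = rng.binomial(1, 0.9 - 0.4 * g_s2[:, 0])  # model is worse on the slice
w_fn = estimate_slice_weights(g_s1, g_t)
print(weighted_accuracy(correct_s2, w_fn(g_s2)))  # well below the unweighted ~0.82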
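
The Experiment Setup row reports ImageNet-pretrained ResNets fine-tuned for 5 epochs on CelebA. The sketch below reconstructs that setup with torchvision; the transforms, the 'Smiling' target attribute, batch size, and learning rate are assumptions for illustration, not values from the paper.

```python
# A hedged sketch of the reported vision setup: an ImageNet-pretrained
# ResNet-18 fine-tuned for 5 epochs on CelebA. Transforms, the 'Smiling'
# target attribute, batch size, and learning rate are assumptions.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# torchvision ships a CelebA loader (hosted on Google Drive, so the download
# may need a manual step). target_type="attr" yields 40 binary attributes.
train_set = datasets.CelebA(root="data", split="train", target_type="attr",
                            transform=transform, download=True)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # replace the head: binary task
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

SMILING = 31  # index of 'Smiling' in CelebA's 40 attributes (illustrative task)
model.train()
for epoch in range(5):  # "finetuned for 5 epochs"
    for x, attrs in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), attrs[:, SMILING])
        loss.backward()
        opt.step()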