Mandoline: Model Evaluation under Distribution Shift

Authors: Mayee Chen, Karan Goel, Nimit S. Sohoni, Fait Poms, Kayvon Fatahalian, Christopher Ré

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical validation on NLP and vision tasks shows that MANDOLINE can estimate performance on the target distribution up to 3× more accurately than standard baselines.
Researcher Affiliation | Academia | (1) Department of Computer Science, Stanford University, Stanford, USA; (2) Institute for Computational and Mathematical Engineering, Stanford, USA.
Pseudocode | Yes | Algorithm 1: MANDOLINE
Open Source Code | Yes | Code for MANDOLINE can be found on GitHub.
Open Datasets | Yes | CELEBA (images) and CIVILCOMMENTS (text)... MNLI (Williams et al., 2018) validation set (target) using the SNLI (Bowman et al., 2015) validation set (source)... HANS (McCoy et al., 2019a)... sentiment classification on IMDB (Maas et al., 2011)... SENTIMENT140 (Go et al., 2009)... YELP POLARITY (Zhang et al., 2015)... AMAZON POLARITY (Zhang et al., 2015).
Dataset Splits | Yes | We are given a fixed model $f_\theta : \mathcal{X} \to \mathcal{Y}$, a labeled validation source dataset $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, and an unlabeled reference target dataset $D_t = \{x_i^t\}_{i=1}^{n_t}$. We partition $D_s$ into $D_{s_1}$ and $D_{s_2}$ of sizes $n_{s_1}, n_{s_2}$ such that the former is used to learn $\hat{w}(x)$ and the latter is used for evaluation. (A sketch of this split-and-reweight evaluation follows the table.)
Hardware Specification | No | No specific hardware details (like GPU/CPU models or specific cloud instance types) were mentioned for running experiments.
Software Dependencies | No | No specific software versions (e.g., PyTorch 1.9, Python 3.8) were provided for the dependencies used in the experiments.
Experiment Setup | Yes | We evaluate ResNet18 and ResNet50 models pretrained on ImageNet and finetuned for 5 epochs on CelebA. We use a standard bert-base-uncased model, fine-tuned on CIVILCOMMENTS for 5 epochs. (A hedged fine-tuning sketch follows the table.)
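
The Dataset Splits row above captures MANDOLINE's recipe: learn importance weights $\hat{w}(x)$ on one source split ($D_{s_1}$), then reweight per-example correctness on the other ($D_{s_2}$) to estimate target performance. Below is a minimal sketch of that recipe, assuming numpy and scikit-learn. Note the paper fits a log-linear density-ratio model over user-defined slices; this stand-in instead estimates the ratio with a logistic source-vs-target classifier on the slice indicators (a standard density-ratio trick), so it illustrates the pipeline rather than the authors' exact Algorithm 1. All names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_slice_weights(g_s1, g_t):
    """Estimate importance weights w(x) ~ p_t(g(x)) / p_s(g(x)).

    g_s1: (n_s1, k) binary slice matrix for the weight-learning source split
    g_t:  (n_t, k)  binary slice matrix for the unlabeled target set
    """
    X = np.vstack([g_s1, g_t])
    d = np.concatenate([np.zeros(len(g_s1)), np.ones(len(g_t))])  # 0=source, 1=target
    clf = LogisticRegression(max_iter=1000).fit(X, d)

    def weights(g):
        p = clf.predict_proba(g)[:, 1]
        # odds ratio, corrected for the source/target sample-size imbalance
        return (p / (1.0 - p)) * (len(g_s1) / len(g_t))
    return weights

def weighted_accuracy(correct_s2, w_s2):
    """Self-normalized importance-weighted estimate of target accuracy."""
    return float(np.sum(w_s2 * correct_s2) / np.sum(w_s2))

# Tiny synthetic demo: one slice that is rarer in the source than the target.
rng = np.random.default_rng(0)
g_s1 = rng.binomial(1, 0.2, size=(500, 1))  # slice fires 20% of the time in source
g_s2 = rng.binomial(1, 0.2, size=(500, 1))
g_t = rng.binomial(1, 0.6, size=(800, 1))   # ...but 60% of the time in target
correct_s2 = rng.binomial(1, 0.9 - 0.4 * g_s2[:, 0])  # model is worse on the slice
w_fn = estimate_slice_weights(g_s1, g_t)
print(weighted_accuracy(correct_s2, w_fn(g_s2)))  # well below the unweighted ~0.82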
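
The Experiment Setup row reports ImageNet-pretrained ResNets fine-tuned for 5 epochs on CelebA. The sketch below reconstructs that setup with torchvision; the transforms, the 'Smiling' target attribute, batch size, and learning rate are assumptions for illustration, not values from the paper.

```python
# A hedged sketch of the reported vision setup: an ImageNet-pretrained
# ResNet-18 fine-tuned for 5 epochs on CelebA. Transforms, the 'Smiling'
# target attribute, batch size, and learning rate are assumptions.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# torchvision ships a CelebA loader (hosted on Google Drive, so the download
# may need a manual step). target_type="attr" yields 40 binary attributes.
train_set = datasets.CelebA(root="data", split="train", target_type="attr",
                            transform=transform, download=True)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # replace the head: binary task
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

SMILING = 31  # index of 'Smiling' in CelebA's 40 attributes (illustrative task)
model.train()
for epoch in range(5):  # "finetuned for 5 epochs"
    for x, attrs in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), attrs[:, SMILING])
        loss.backward()
        opt.step()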