Distributionally Robust Models with Parametric Likelihood Ratios
Authors: Paul Michel, Tatsunori Hashimoto, Graham Neubig
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a series of experiments on both image and text classification benchmarks, we find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches, and that the method performs reliably well with little hyper-parameter tuning. |
| Researcher Affiliation | Academia | Paul Michel, Centre Sciences des Données, École normale supérieure PSL, Paris, 75005, France, pmichel31415@gmail.com; Tatsunori Hashimoto, Computer Science Department, Stanford University, Stanford, CA 94305, USA, thashim@stanford.edu; Graham Neubig, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA, gneubig@cs.cmu.edu |
| Pseudocode | No | The paper describes its methods through textual explanations and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to reproduce our experiments can be found at https://github.com/pmichel31415/P-DRO |
| Open Datasets | Yes | To facilitate reproducibility of our work, our experiments are conducted on datasets that are openly available: Biased SST, FDCL18 (Founta et al., 2018), Waterbirds (Sagawa et al., 2020) and CelebA (Liu et al., 2015). |
| Dataset Splits | Yes | Specific details for each dataset follow these previous works, as described below: Biased SST is based on the SST-2 sentiment classification dataset (Radford et al., 2018)... FDCL18... Waterbirds... CelebA... We perform optimal stopping using the Minmax criterion proposed in Michel et al. (2021): every epoch T, we determine the best model by explicitly solving a greedy approximation of the min-max game between all T previously checkpointed adversaries and models on the validation dataset D_valid. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions software components like BERT-base-uncased, the Adam optimizer, and DistilBERT, but it does not provide specific version numbers for these or other software libraries and dependencies used. |
| Experiment Setup | Yes | We use the same learning rate and optimizer for both model and adversary and only vary the KL penalty weight τ ∈ {0.001, 0.01, 0.1, 1.0}. We use a batch size of 64 for Biased SST and FDCL18 and 32 for Waterbirds and CelebA. We train both classifier and adversary with Adam (Kingma & Ba, 2014) using a learning rate of 2 × 10⁻⁵, with the learning rate linearly decayed to 0 at each step. We train with batches of size 64 (or containing up to 2500 tokens, whichever is lower) for 50 and 20 epochs on Biased SST and FDCL18 respectively, evaluating the model on the validation data every epoch. On both image datasets, images are rescaled to 224 × 224 pixels and pixel values are normalized to have mean 0 and variance 1 across all 3 color channels on the training data. At training time, we augment the data by randomly cropping the images or flipping them horizontally. We train using regular stochastic gradient descent with a constant learning rate of 10⁻³ and a batch size of 32. We train for 75 and 13 epochs on Waterbirds and CelebA respectively (these numbers were chosen to match the number of training steps of Sagawa et al. (2020) despite the smaller batch size), and validate every 100 (for Waterbirds) and 1000 (for CelebA) training steps. A hedged configuration sketch follows this table. |
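
For concreteness, below is a minimal, hypothetical PyTorch sketch of the text-classification optimizer and scheduler configuration quoted in the Experiment Setup row. Only the hyper-parameter values come from the paper's description; the placeholder modules, the assumed steps-per-epoch count, and the omitted loss computation are illustrative assumptions, not the authors' implementation (which is available at the linked repository).

```python
# Hypothetical sketch of the training configuration described above
# (text-classification setting). Hyper-parameter values follow the quoted text;
# everything else is a placeholder.
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

LEARNING_RATE = 2e-5      # shared by model and adversary (quoted)
NUM_EPOCHS = 50           # Biased SST (20 for FDCL18)
KL_PENALTY_TAU = 0.01     # swept over {0.001, 0.01, 0.1, 1.0}
STEPS_PER_EPOCH = 100     # assumed value; depends on dataset size and batch size

# Placeholder networks standing in for the BERT classifier and the parametric adversary.
model = nn.Linear(768, 2)
adversary = nn.Linear(768, 1)

# Same optimizer and learning rate for both model and adversary, as stated in the paper.
optimizer = Adam(list(model.parameters()) + list(adversary.parameters()), lr=LEARNING_RATE)

# Linear decay of the learning rate to 0 over the total number of training steps.
total_steps = NUM_EPOCHS * STEPS_PER_EPOCH
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))

for step in range(total_steps):
    # ... compute the DRO objective with KL penalty weight KL_PENALTY_TAU,
    # backpropagate, then update both players ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```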