Modeling the Second Player in Distributionally Robust Optimization

Authors: Paul Michel, Tatsunori Hashimoto, Graham Neubig

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines. We do an in-depth set of experiments analyzing the effect of our proposed changes on both a toy task as well as a more realistic, yet still synthetic sentiment classification task (§4). Finally, we show that in the more realistic setting of toxicity detection, P-DRO yields models that are more robust to changes in demographic groups, even though these groups are unknown at training time, opening up applications in combatting dataset bias (§5). Table 1: Average and robust accuracies on Biased SST.
Researcher Affiliation | Academia | Paul Michel, School of Computer Science, Carnegie Mellon University, pmichel1@cs.cmu.edu; Tatsunori Hashimoto, Computer Science Department, Stanford University, thashim@stanford.edu; Graham Neubig, School of Computer Science, Carnegie Mellon University, gneubig@cs.cmu.edu
Pseudocode | No | The paper contains mathematical formulations and descriptions of procedures, but no explicitly labeled 'Pseudocode' or 'Algorithm' block is provided.
Open Source Code | Yes | Code to reproduce our experiments can be found at https://github.com/pmichel31415/P-DRO
Open Datasets | Yes | We base our task off of the binary version of the Stanford Sentiment Treebank dataset (SST-2; Socher et al. (2013)). We perform experiments on two datasets: DWMW17 (Davidson et al., 2017), a corpus of 25K tweets classified in three categories... and FDCL18 (Founta et al., 2018), a 100k-sized dataset, also collected from Twitter and annotated with an additional spam label...
Dataset Splits | Yes | In ERM it is customary to stop training after the empirical risk, periodically evaluated on a held-out validation dataset, stops decreasing. This is particularly important to prevent over-fitting to the training data. However, it is not an appropriate criterion for P-DRO, since the model is not trained to minimize empirical risk in the first place. A more pertinent choice is to compare the robust validation losses $\mathcal{L}_{\text{robust,valid}}(\theta) = \max_{q_\psi \in \mathcal{Q}} \underbrace{\frac{1}{|\mathcal{D}_{\text{valid}}|} \sum_{(x,y) \in \mathcal{D}_{\text{valid}}} \frac{q_\psi(x,y)}{q_{\psi_0}(x,y)} \, \ell(x,y;\theta)}_{:= \mathcal{L}_{\text{valid}}(\theta,\psi)}$. We validate the models every epoch. Table 2: Effect of different optimal stopping and hyper-parameter selection strategies on robust validation accuracy. (A code sketch of this criterion follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions software like Adam, Gensim, BERT, GPT-2, and LSTM, but does not specify their version numbers or the versions of underlying frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | For the classifier, we train a simple one-layer BiLSTM model with embedding/hidden dimension 300. For the adversary, we adopt an auto-regressive transformer model based on the successful GPT-2 language model architecture, but with 6 layers, a dimension of 512 and 8 attention heads. We train the model with Adam (Kingma & Ba, 2014) and the adversary with vanilla stochastic gradient descent. We train for 50 epochs and select the best model. We train the adversary with a temperature of τ = 0.01 and a normalizing window k = 10. To demonstrate the efficacy of automatic hyper-parameter selection in the P-DRO setting, we delegate the choice of the adversary's learning rate λ to grid-search, training 3 models with λ ∈ {10⁻⁵, 10⁻⁴, 10⁻³}. We train all models with Adam (Kingma & Ba, 2014) with an initial learning rate of 2×10⁻⁵, which we decay linearly at each step until the end of training. During training, we sample minibatches that contain at most 64 sentences or 2500 tokens, whichever is greater. (Hedged code sketches of this setup follow the table.)
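
To make the robust validation criterion quoted under Dataset Splits concrete, here is a minimal Python sketch. It assumes PyTorch, that the max over Q is taken over a finite set of candidate adversary checkpoints, and hypothetical `model.loss` / `log_prob` interfaces (per-example losses and log-densities for the adversary q_ψ and the MLE generative model q_ψ0); the released code linked above is the authoritative reference.

```python
import torch

# Assumed interfaces (not from the paper): `model.loss(x, y)` returns a
# tensor of per-example losses; `adv.log_prob(x, y)` and
# `mle_lm.log_prob(x, y)` return per-example log-densities under the
# adversary q_psi and the MLE generative model q_psi0, respectively.
@torch.no_grad()
def robust_valid_loss(model, adversaries, mle_lm, valid_loader):
    """max over candidate adversaries of
    (1 / |D_valid|) * sum_i [q_psi(x_i, y_i) / q_psi0(x_i, y_i)] * loss_i."""
    worst = float("-inf")
    for adv in adversaries:
        total, count = 0.0, 0
        for x, y in valid_loader:
            losses = model.loss(x, y)                          # shape: [batch]
            # Importance ratio q_psi / q_psi0, computed in log space.
            ratios = torch.exp(adv.log_prob(x, y) - mle_lm.log_prob(x, y))
            total += (ratios * losses).sum().item()
            count += len(y)
        worst = max(worst, total / count)
    return worst
```

Stopping once this quantity stops decreasing replaces the usual empirical-risk early-stopping rule, which the quote notes is inappropriate for P-DRO.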
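The temperature τ = 0.01 and normalizing window k = 10 quoted in the setup govern the adversary's update: the adversary is fit by weighted maximum likelihood, with per-example weights proportional to exp(ℓ/τ) and the normalizer estimated over the last k minibatches. The sketch below is a reconstruction under that reading, not the authors' implementation; the buffering details and the `log_prob` interface are assumptions.

```python
import collections
import torch

class AdversaryUpdater:
    """Weighted-MLE update pushing q_psi toward high-loss examples, with
    weights exp(loss / tau) self-normalized over the last k minibatches."""

    def __init__(self, adversary, optimizer, tau=0.01, k=10):
        self.adversary, self.opt, self.tau = adversary, optimizer, tau
        self.log_w_window = collections.deque(maxlen=k)  # last k log-weights

    def step(self, x, y, losses):
        log_w = losses.detach() / self.tau      # log of unnormalized weights
        self.log_w_window.append(log_w)
        # Normalize in log space over the window for numerical stability.
        log_z = torch.logsumexp(torch.cat(list(self.log_w_window)), dim=0)
        weights = torch.exp(log_w - log_z)
        # Weighted negative log-likelihood: raise q_psi on hard examples.
        nll = -(weights * self.adversary.log_prob(x, y)).sum()
        self.opt.zero_grad()
        nll.backward()
        self.opt.step()
```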
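Finally, the quoted optimizer settings translate into roughly the following setup. PyTorch is an assumption (the paper does not name its framework, per the Software Dependencies row), as are the placeholder names `model`, `adversary`, and `total_steps`.

```python
from torch.optim import Adam, SGD
from torch.optim.lr_scheduler import LambdaLR

def make_optimizers(model, adversary, adv_lr, total_steps):
    # Classifier: Adam, initial LR 2e-5, decayed linearly at every step
    # until the end of training.
    model_opt = Adam(model.parameters(), lr=2e-5)
    model_sched = LambdaLR(model_opt, lambda s: max(0.0, 1.0 - s / total_steps))
    # Adversary: vanilla SGD; its learning rate comes from the grid search.
    adv_opt = SGD(adversary.parameters(), lr=adv_lr)
    return model_opt, model_sched, adv_opt

# Grid searched per the quote: three models, one per adversary LR.
ADV_LEARNING_RATES = (1e-5, 1e-4, 1e-3)

# Minibatch cap from the quote: 64 sentences or 2500 tokens, whichever
# is greater (batching logic itself is not shown here).
MAX_SENTENCES, MAX_TOKENS = 64, 2500
```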