Modeling the Second Player in Distributionally Robust Optimization
Authors: Paul Michel, Tatsunori Hashimoto, Graham Neubig
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines. We do an in-depth set of experiments analyzing the effect of our proposed changes on both a toy task as well as a more realistic, yet still synthetic sentiment classification task (§4). Finally, we show that in the more realistic setting of toxicity detection, P-DRO yields models that are more robust to changes in demographic groups, even though these groups are unknown at training time, opening up applications in combatting dataset bias (§5). Table 1: Average and robust accuracies on Biased SST. |
| Researcher Affiliation | Academia | Paul Michel, School of Computer Science, Carnegie Mellon University, pmichel1@cs.cmu.edu; Tatsunori Hashimoto, Computer Science Department, Stanford University, thashim@stanford.edu; Graham Neubig, School of Computer Science, Carnegie Mellon University, gneubig@cs.cmu.edu |
| Pseudocode | No | The paper contains mathematical formulations and descriptions of procedures, but no explicitly labeled 'Pseudocode' or 'Algorithm' block is provided. |
| Open Source Code | Yes | Code to reproduce our experiments can be found at https://github.com/pmichel31415/P-DRO |
| Open Datasets | Yes | We base our task off of the binary version of the Stanford Sentiment Treebank dataset (SST-2; Socher et al. (2013)). We perform experiments on two datasets: DWMW17 (Davidson et al., 2017), a corpus of 25K tweets classified in three categories... and FDCL18 (Founta et al., 2018), a 100k-sized dataset, also collected from Twitter and annotated with an additional spam label... |
| Dataset Splits | Yes | In ERM it is customary to stop training after the empirical risk, periodically evaluated on a held-out validation dataset, stops decreasing. This is particularly important to prevent over-fitting to the training data. However, it is not an appropriate criterion for P-DRO, since the model is not trained to minimize empirical risk in the first place. A more pertinent choice is to compare the robust validation losses L_robust,valid(θ) = max_{q_ψ ∈ Q} (1/|D_valid|) Σ_{(x,y) ∈ D_valid} [q_ψ(x, y) / q_ψ0(x, y)] ℓ(x, y; θ), where the importance-weighted average is denoted L_valid(θ, ψ). We validate the models every epoch. Table 2: Effect of different optimal stopping and hyper-parameter selection strategies on robust validation accuracy. (A sketch of this criterion appears below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions software and methods such as Adam, Gensim, BERT, GPT-2, and LSTM, but does not specify their version numbers or the versions of underlying frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | For the classifier, we train a simple one-layer BiLSTM model with embedding/hidden dimension 300. For the adversary, we adopt an auto-regressive transformer model based on the successful GPT-2 language model architecture, but with 6 layers, a dimension of 512 and 8 attention heads. We train the model with Adam (Kingma & Ba, 2014) and the adversary with vanilla stochastic gradient descent. We train for 50 epochs and select the best model. We train the adversary with a temperature of τ = 0.01 and a normalizing window k = 10. To demonstrate the efficacy of automatic hyper-parameter selection in the P-DRO setting, we delegate the choice of the adversary's learning rate λ to grid-search, training 3 models with λ ∈ {10⁻⁵, 10⁻⁴, 10⁻³}. We train all models with Adam (Kingma & Ba, 2014) with an initial learning rate of 2×10⁻⁵, which we decay linearly at each step until the end of training. During training, we sample minibatches that contain at most 64 sentences or 2500 tokens, whichever is greater. (A sketch of this configuration appears below the table.) |
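
To make the stopping criterion quoted in the Dataset Splits row concrete, here is a minimal PyTorch sketch of the robust validation loss. It assumes per-example validation losses and log-probabilities under the current and initial adversaries have already been computed; the function names are illustrative, not taken from the paper's released code.

```python
import torch

def valid_loss(per_example_losses, adv_logps, ref_logps):
    """L_valid(theta, psi): importance-weighted validation loss.

    per_example_losses: l(x, y; theta) for each example in D_valid
    adv_logps:          log q_psi(x, y) under the current adversary
    ref_logps:          log q_psi0(x, y) under the initial adversary
    """
    weights = torch.exp(adv_logps - ref_logps)  # q_psi / q_psi0
    return (weights * per_example_losses).mean()

def robust_valid_loss(per_example_losses, candidate_adv_logps, ref_logps):
    """L_robust,valid(theta): max of L_valid over candidate adversaries."""
    return max(
        valid_loss(per_example_losses, logps, ref_logps)
        for logps in candidate_adv_logps
    )
```

Training stops when this robust loss, evaluated every epoch, stops decreasing, mirroring the role that plain validation loss plays in ERM.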
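
Similarly, the Experiment Setup row translates into a concrete optimizer configuration. The sketch below uses placeholder modules for the BiLSTM classifier and transformer adversary and a hypothetical TOTAL_STEPS constant; only the learning rates, schedule, and grid values come from the quoted setup.

```python
import torch
import torch.nn as nn

# Placeholders: the paper's classifier is a one-layer BiLSTM (dim 300) and its
# adversary a 6-layer GPT-2-style transformer (dim 512, 8 attention heads).
classifier = nn.LSTM(300, 300, num_layers=1, bidirectional=True)
adversary = nn.Linear(512, 512)

TOTAL_STEPS = 10_000  # hypothetical; the paper specifies 50 epochs, not steps

# Classifier: Adam at 2e-5 with per-step linear decay to zero.
model_opt = torch.optim.Adam(classifier.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    model_opt, lr_lambda=lambda step: max(0.0, 1.0 - step / TOTAL_STEPS))

# Adversary: vanilla SGD, with its learning rate lambda grid-searched over
# three values; one model is trained per setting and the best is kept via the
# robust validation criterion sketched above.
for adv_lr in (1e-5, 1e-4, 1e-3):
    adv_opt = torch.optim.SGD(adversary.parameters(), lr=adv_lr)
    # ... run the 50-epoch P-DRO training loop here, with temperature 0.01
    #     and normalizing window k = 10 ...
```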