Pushing the Accuracy-Group Robustness Frontier with Introspective Self-play

Authors: Jeremiah Zhe Liu, Krishnamurthy Dj Dvijotham, Jihyeon Lee, Quan Yuan, Balaji Lakshminarayanan, Deepak Ramachandran

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Under two challenging real-world tasks (census income prediction and toxic comment detection), we empirically validate the effectiveness of ISP in improving the performance of AL with a DNN model under dataset bias (Section 4). For both classic and state-of-the-art uncertainty-based AL methods, ISP improves tail-group sampling rate, meaningfully pushing the accuracy-group robustness frontier of the final model.
Researcher Affiliation | Industry | Jeremiah Zhe Liu, Krishnamurthy Dj Dvijotham, Jihyeon Lee, Quan Yuan, Balaji Lakshminarayanan, Deepak Ramachandran. Google Research. {jereliu,dvij,jihyeonlee,yquan,balajiln,ramachandrand}@google.com
Pseudocode | Yes | Algorithm 1 Introspective Self-play (ISP). Inputs: training data D_train = {y_i, x_i}_{i=1}^n; (optional) group annotation G_train = {g_i}_{i=1}^n; unlabelled data D_pool = {x_j}_{j=1}^n. Output: predicted probability {p(y|x_j)}_{j=1}^n; bias probability {p(b|x_j)}_{j=1}^n; predictive variance {v(x_j)}_{j=1}^n. Stage I: Label Generation. If G_train ≠ ∅, set B_train = {b_i = I(g_i ∈ B)} (make underrepresentation labels using the group annotations g_i); else set B̂_train = SelfPlayBiasEstimation(D_train) (estimate underrepresentation labels using Algorithm 2). Stage II: Introspective Training. Train f̂ on D_train with the multi-task introspective objective L((y_i, b_i), x_i) (Equation (4)). Evaluate f̂ on x_j ∈ D_pool to generate the sampling signals {p(y|x_j), p(b|x_j), v(x_j)}_{j=1}^n (Equation (3)). (See the Python sketch after this table.)
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a link to a repository for its own implementation.
Open Datasets | Yes | For tabular data, we use the U.S. Census Income data (adult) from the official UCI repository [Footnote 4]. For the language task, we use the CivilComments Identity from the TensorFlow Datasets repository [Footnote 5]. [Footnote 4: https://archive.ics.uci.edu/ml/datasets/adult] [Footnote 5: https://www.tensorflow.org/datasets/catalog/civil_comments] (See the loading sketch after this table.)
Dataset Splits | Yes | In each stage, we first (optionally) train a cross-validated ensemble to estimate the under-representation labels: we split the data into 10 cross-validation splits, and train each ensemble member on 1 split and predict the remaining 9 splits. (See the split sketch after this table.)
Hardware Specification | No | The paper describes the model architectures used (e.g., '2-layer Dense ResNet', 'BERT-small model') and training parameters, but it does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions optimizers such as the 'Adam optimizer' and the 'AdamW optimizer' but does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | In each active learning round, we train the Dense ResNet model with the Adam optimizer with learning rate 0.1, batch size 256 and maximum epoch 200; and train the BERT model with the AdamW optimizer (learning rate 1e-5) for 6 epochs with batch size 16. For the final model training, we use the standard re-weighting objective... We vary the threshold t and the up-weight coefficient λ over a 2D grid (t ∈ {0.05, 0.1, 0.15, ..., 1.0} and log(λ) ∈ {0, 0.5, 1, 1.5, ..., 10}). (See the grid-sweep sketch after this table.)
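
Sketch for the Pseudocode row: a minimal illustration of the Stage II introspective training step, assuming a simple two-head Keras model and an equally weighted multi-task cross-entropy objective. The exact form of the paper's Equation (4), the predictive-variance signal v(x), and the layer sizes and optimizer settings shown here are assumptions, not the authors' implementation.

import tensorflow as tf

def build_introspective_model(num_features, num_classes):
    """Two-head model: a main head for the task label y and an introspection
    head for the underrepresentation (bias) label b, trained jointly."""
    inputs = tf.keras.Input(shape=(num_features,))
    h = tf.keras.layers.Dense(128, activation="relu")(inputs)
    h = tf.keras.layers.Dense(128, activation="relu")(h)
    y_head = tf.keras.layers.Dense(num_classes, activation="softmax", name="y")(h)
    b_head = tf.keras.layers.Dense(1, activation="sigmoid", name="b")(h)
    model = tf.keras.Model(inputs, {"y": y_head, "b": b_head})
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss={"y": "sparse_categorical_crossentropy", "b": "binary_crossentropy"},
    )
    return model

# Usage: model.fit(x_train, {"y": y_train, "b": b_train}, epochs=..., batch_size=...)
# then model.predict(x_pool) returns the sampling signals p(y|x) and p(b|x);
# the variance signal v(x) from Algorithm 1 (e.g., from an ensemble) is omitted here.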
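
Sketch for the Open Datasets row: one way to pull the two public datasets, assuming pandas and tensorflow-datasets are installed. The UCI file URL and column names follow the standard "adult" documentation, and the default civil_comments TFDS config is used; the identity-annotated variant quoted in the paper may correspond to a different config name.

import pandas as pd
import tensorflow_datasets as tfds

ADULT_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income",
]

# UCI Census Income ("adult") table.
adult_df = pd.read_csv(ADULT_URL, names=ADULT_COLUMNS, skipinitialspace=True)

# CivilComments from the TensorFlow Datasets catalog.
civil_comments_train = tfds.load("civil_comments", split="train")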
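
Sketch for the Dataset Splits row: the cross-validated-ensemble split described in the quote, assuming scikit-learn's KFold and a hypothetical train_member(x, y) helper that returns a fitted model with a predict method. Each member is trained on a single fold and scores the other nine, so every example receives out-of-fold predictions that can be averaged into a bias estimate.

import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_scores(x, y, train_member, n_splits=10, seed=0):
    """Train each ensemble member on ONE fold and predict the remaining nine."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = np.full((len(x), n_splits), np.nan)
    for k, (rest_idx, fold_idx) in enumerate(kf.split(x)):
        # KFold yields (rest, fold); here the member is fit on the single fold
        # and used to score the other nine folds.
        member = train_member(x[fold_idx], y[fold_idx])
        scores[rest_idx, k] = np.ravel(member.predict(x[rest_idx]))
    # Each example is held out for 9 of the 10 members; average those predictions.
    return np.nanmean(scores, axis=1)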
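
Sketch for the Experiment Setup row: the (t, λ) grid sweep quoted above, assuming a simple re-weighting rule in which examples whose estimated bias probability p(b|x) exceeds the threshold t are up-weighted by λ = exp(log λ). The paper's exact re-weighting objective and log base are not reproduced here, and train_and_eval is a hypothetical helper.

import itertools
import numpy as np

T_GRID = np.arange(0.05, 1.0 + 1e-9, 0.05)           # t in {0.05, 0.10, ..., 1.0}
LOG_LAMBDA_GRID = np.arange(0.0, 10.0 + 1e-9, 0.5)   # log(lambda) in {0, 0.5, ..., 10}

def sweep_reweighting(x, y, p_bias, train_and_eval):
    """Evaluate every (t, log-lambda) cell of the 2D grid described in the paper."""
    results = []
    for t, log_lam in itertools.product(T_GRID, LOG_LAMBDA_GRID):
        lam = np.exp(log_lam)
        # Up-weight examples flagged as likely tail-group members.
        weights = np.where(p_bias > t, lam, 1.0)
        results.append((t, log_lam, train_and_eval(x, y, sample_weight=weights)))
    return results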