Adversarial training for high-stakes reliability
Authors: Daniel Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Benjamin Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we used a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques, including a tool that assists human adversaries, to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on, doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes), without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models. |
| Researcher Affiliation | Industry | Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas. Redwood Research; UC Berkeley (work done at Redwood Research). |
| Pseudocode | No | The paper describes procedures and methods in paragraph text and diagrams (e.g., Figure 1) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our tool is available at https://www.ttft.io/talk-to-filtered-transformer. We include a download link to our data and model weights in the Appendix. |
| Open Datasets | Yes | Our initial, baseline classifier training set consisted of snippets derived from a dataset of fan fiction stories. We sourced our prompts from an archive of approximately 300 GB of stories from fanfiction.net... We include a download link to our data and model weights in the Appendix. |
| Dataset Splits | Yes | Table 1: The number of labeled snippets (prompt + completion pairs) from each source of data. The percentage that were labeled injurious are in parentheses. (Table 1 lists specific counts for Train, Validation, and Test datasets, e.g., 166,210 for Train, 102,297 for Validation, etc.) |
| Hardware Specification | Yes | All models were trained for 200k steps on 8 A100s, total 400 A100-days. |
| Software Dependencies | No | The paper mentions models such as deberta-v3-large (from Hugging Face) and GPT-Neo-2.7B but does not provide specific version numbers for the libraries or frameworks used. |
| Experiment Setup | Yes | During training, we treated all snippets labeled Unsure as injurious for the purposes of training because we wanted to train our classifier to be conservative. Because our datasets had more non-injurious examples than injurious examples, we upsampled snippets labeled injurious by a factor of up to 5 so that they were closer in number to the non-injurious examples. (From Appendix A.3) Optimizer: AdamW; Learning rate: 1e-5; Weight decay: 0.01; Warmup steps: 1000; Training steps: 200k; Gradient accumulation steps: 4. |
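
The deployment setup quoted in the Research Type row, a classifier that filters completions proposed by a generator at a very conservative threshold, can be pictured as rejection sampling. The sketch below is a minimal illustration, assuming the Hugging Face checkpoints named in the paper as starting points; the `injury_probability` helper, the sampling loop, and the threshold value are illustrative assumptions, not the authors' released code (the fine-tuned classifier weights are linked from the paper's appendix).

```python
# Minimal sketch (not the authors' code): filtering generator completions with a
# conservative classifier threshold. Checkpoint names come from the paper; the
# threshold and helper functions are hypothetical.
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

GENERATOR = "EleutherAI/gpt-neo-2.7B"       # generator mentioned in the paper
CLASSIFIER = "microsoft/deberta-v3-large"   # classifier backbone mentioned in the paper
THRESHOLD = 0.001                           # hypothetical "very conservative" threshold

gen_tok = AutoTokenizer.from_pretrained(GENERATOR)
gen = AutoModelForCausalLM.from_pretrained(GENERATOR)
clf_tok = AutoTokenizer.from_pretrained(CLASSIFIER)
clf = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER, num_labels=2)

def injury_probability(prompt: str, completion: str) -> float:
    """Classifier's estimated probability that the completion is injurious."""
    inputs = clf_tok(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def filtered_completion(prompt: str, n_samples: int = 16, max_new_tokens: int = 40):
    """Sample candidate completions; return the first one the classifier accepts."""
    input_ids = gen_tok(prompt, return_tensors="pt").input_ids
    for _ in range(n_samples):
        out = gen.generate(input_ids, do_sample=True, max_new_tokens=max_new_tokens,
                           pad_token_id=gen_tok.eos_token_id)
        completion = gen_tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
        if injury_probability(prompt, completion) < THRESHOLD:
            return completion
    return None  # reject if no sample clears the conservative threshold
```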
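
The hyperparameters quoted in the Experiment Setup row map onto a standard Hugging Face `Trainer` fine-tuning run. Below is a minimal sketch under that assumption; the toy dataset, column names, batch size, and upsampling helper are placeholders and may differ from the released training code.

```python
# Minimal sketch, not the released training code: fine-tuning a DeBERTa-v3-large
# classifier with the hyperparameters quoted above. Dataset contents, column names,
# and batch size are illustrative assumptions.
from datasets import Dataset, concatenate_datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2)

# Stand-in for the labeled snippet dataset described in Table 1.
raw_train_ds = Dataset.from_dict({
    "text": ["He tripped but caught himself.", "The blade sliced deep into his arm."],
    "label": [0, 1],
})

def upsample_injurious(ds: Dataset, factor: int = 5) -> Dataset:
    """Repeat injurious (label == 1) snippets up to `factor` times total (Appendix A.3)."""
    injurious = ds.filter(lambda ex: ex["label"] == 1)
    return concatenate_datasets([ds] + [injurious] * (factor - 1)).shuffle(seed=0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = upsample_injurious(raw_train_ds).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="classifier",
    max_steps=200_000,               # "Training steps: 200k"
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_steps=1_000,
    gradient_accumulation_steps=4,
    per_device_train_batch_size=8,   # assumption; not specified in the quoted text
)

# Trainer uses AdamW by default, matching the quoted optimizer.
Trainer(model=model, args=args, train_dataset=train_ds, tokenizer=tokenizer).train()
```

Note that only the learning rate, weight decay, warmup, step count, and gradient accumulation from the quoted setup need to be set explicitly, since `Trainer` defaults to AdamW.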