QUARK: Controllable Text Generation with Reinforced Unlearning
Authors: Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO [66], while relying only on standard language modeling primitives. |
| Researcher Affiliation | Collaboration | Allen Institute for Artificial Intelligence Paul G. Allen School of Computer Science, University of Washington {ximinglu, jackh, raja}@allenai.org {wellecks, lwjiang, lianhuiq, pawest, yejin}@cs.washington.edu |
| Pseudocode | Yes | Algorithm 1 Quantized Reward Konditioning (Quark) |
| Open Source Code | No | We will release the code for Quark at https://github.com/GXimingLu/Quark prior to NeurIPS 2022. |
| Open Datasets | Yes | REALTOXICITYPROMPTS benchmark, WRITINGPROMPTS dataset [15], OpenWebText Corpus (OWT) [19], SST-2 dataset [70], WIKITEXT-103 [44] |
| Dataset Splits | No | The paper specifies train and test sets (e.g., "85K prompts from the train set; for evaluation, we use the same 10K non-toxic test prompts"), but it does not provide explicit details (e.g., size or percentages) for a dedicated validation split used for model tuning or early stopping, although 'val set' is referenced in figures. |
| Hardware Specification | No | The paper mentions 'Google Cloud Compute' and 'computational resource constraints' but does not specify any particular CPU models, GPU models, or detailed cloud instance types used for experiments. |
| Software Dependencies | No | The paper mentions software like Adam [31], PyTorch [53], Hugging Face [81], and DistilBERT [62] but does not provide specific version numbers for these software dependencies required for reproducibility. |
| Experiment Setup | Yes | During training, we use 85K prompts from the train set... We use K = 5 quantiles. During the exploration phase... we mix greedy decoding and nucleus sampling in a 50%-50% proportion... We use K = 8 quantiles. |
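
The reward quantization described in the setup row (Quark's "Quantized Reward Konditioning") partitions sampled generations into K equal-sized reward quantiles and conditions each sample on its quantile token. A minimal sketch of that binning step is below; the helper name `assign_quantile_tokens` is hypothetical, not from the paper's released code.

```python
import numpy as np

def assign_quantile_tokens(rewards, k=5):
    """Assign each sample an integer reward-quantile token (0 = lowest
    quantile, k-1 = highest), as in Quark's quantization step. Rank-based
    binning: equal-sized bins over the reward-sorted samples."""
    rewards = np.asarray(rewards, dtype=float)
    order = rewards.argsort()              # indices sorted by ascending reward
    bins = np.array_split(order, k)        # k roughly equal-sized bins
    tokens = np.empty(len(rewards), dtype=int)
    for q, idx in enumerate(bins):
        tokens[idx] = q                    # token q marks the q-th quantile
    return tokens
```

During training, Quark prepends the token for the *highest* quantile at generation time, steering the model toward high-reward behavior.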