Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
QUARK: Controllable Text Generation with Reinforced Unlearning
Authors: Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO [66], while relying only on standard language modeling primitives. |
| Researcher Affiliation | Collaboration | Allen Institute for Artificial Intelligence; Paul G. Allen School of Computer Science, University of Washington |
| Pseudocode | Yes | Algorithm 1 Quantized Reward Konditioning (Quark) |
| Open Source Code | No | We will release the code for Quark at https://github.com/GXimingLu/Quark prior to NeurIPS 2022. |
| Open Datasets | Yes | REALTOXICITYPROMPTS benchmark, WRITINGPROMPTS dataset [15], OpenWebText Corpus (OWT) [19], SST-2 dataset [70], WIKITEXT-103 [44] |
| Dataset Splits | No | The paper specifies train and test sets (e.g., "85K prompts from the train set; for evaluation, we use the same 10K non-toxic test prompts"), but it does not provide explicit details (e.g., size or percentages) for a dedicated validation split used for model tuning or early stopping, although 'val set' is referenced in figures. |
| Hardware Specification | No | The paper mentions 'Google Cloud Compute' and 'computational resource constraints' but does not specify any particular CPU models, GPU models, or detailed cloud instance types used for experiments. |
| Software Dependencies | No | The paper mentions software like Adam [31], PyTorch [53], Hugging Face [81], and DistilBERT [62] but does not provide specific version numbers for these software dependencies required for reproducibility. |
| Experiment Setup | Yes | During training, we use 85K prompts from the train set... We use K = 5 quantiles. During the exploration phase... we mix greedy decoding and nucleus sampling in a 50%-50% proportion... We use K = 8 quantiles. |
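
The Experiment Setup row mentions sorting samples into K reward quantiles and mixing greedy decoding with nucleus sampling in equal proportion during exploration. The following is a minimal sketch of those two ideas only, not the authors' implementation: the function names `assign_quantiles` and `choose_decoding`, the rank-based binning rule, the convention that bin `k - 1` holds the highest rewards, and the example reward values are all assumptions made for illustration.

```python
import random


def assign_quantiles(rewards, k):
    """Assign each reward a quantile index in [0, k - 1] by rank.

    Hypothetical helper: bin k - 1 holds the highest-reward samples.
    The paper's own numbering convention may differ.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(rewards)
    # Indices of the samples, ordered from lowest to highest reward.
    order = sorted(range(n), key=lambda i: rewards[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        # Equal-sized bins by rank; integer division maps rank to a bin.
        bins[idx] = rank * k // n
    return bins


def choose_decoding(rng, greedy_fraction=0.5):
    """Pick a decoding strategy label for one exploration sample.

    Hypothetical helper: returns "greedy" with probability
    greedy_fraction, otherwise "nucleus".
    """
    return "greedy" if rng.random() < greedy_fraction else "nucleus"


# Made-up reward values, for illustration only.
rewards = [0.1, 0.9, 0.4, 0.7, 0.2, 0.95, 0.5, 0.3, 0.8, 0.6]
bins = assign_quantiles(rewards, k=5)

rng = random.Random(0)
strategies = [choose_decoding(rng) for _ in rewards]
```

With ten samples and K = 5, each bin receives two samples. In a full training loop, the bin index would typically select a reward token to condition generation on; that step is omitted here because the table does not describe it.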