Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics
Authors: Lianhui Qin, Sean Welleck, Daniel Khashabi, Yejin Choi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on these constrained generation tasks point to the effectiveness of our approach, both in terms of automatic and human evaluation. |
| Researcher Affiliation | Collaboration | ¹Paul G. Allen School of Computer Science & Engineering, University of Washington; ²Allen Institute for Artificial Intelligence; ³Department of Computer Science, Johns Hopkins University |
| Pseudocode | Yes | Algorithm 1 Constrained Decoding w/ Langevin Dynamics. |
| Open Source Code | Yes | Code is available at https://github.com/qkaren/COLD_decoding |
| Open Datasets | Yes | We use the benchmark dataset TIMETRAVEL [44]. We use the set of constraint words from the COMMONGEN corpus [29] |
| Dataset Splits | No | We select the constraint weights on the dev set. However, specific percentages or counts for training/validation/test splits are not provided in the main text. |
| Hardware Specification | Yes | The table below shows the results (on an NVIDIA Quadro GV100 GPU, batch size=32). |
| Software Dependencies | No | The paper mentions using "GPT2-XL [46]" as the base language model, but does not specify version numbers for any software libraries or dependencies (e.g., PyTorch version, specific deep learning framework versions). |
| Experiment Setup | Yes | Throughout the experiments, we set the number of Langevin dynamics steps to N = 2000, with a step size η = 0.1 (Eq. 2). We gradually decrease σ^(n) across iterations, which intuitively transitions the decoding procedure from exploration to optimization. In our experiments, we typically used the schedule which sets/reduces σ to {1, 0.5, 0.1, 0.05, 0.01} at iterations {0, 50, 500, 1000, 1500}, respectively. |
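
The noise schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the schedule is piecewise constant (each σ value holds until the next listed iteration), and it replaces the paper's language-model energy over soft token logits with a hypothetical one-dimensional quadratic energy, purely to show how a noisy gradient update with decreasing σ moves from exploration toward optimization. The names `sigma_at`, `toy_energy_grad`, and `langevin_sample` are illustrative and do not come from the paper or its repository.

```python
import bisect
import random

# Values quoted in the Experiment Setup row.
NUM_STEPS = 2000
STEP_SIZE = 0.1
SCHEDULE_ITERS = [0, 50, 500, 1000, 1500]
SCHEDULE_SIGMAS = [1.0, 0.5, 0.1, 0.05, 0.01]


def sigma_at(n):
    """Noise scale at iteration n, assuming a piecewise-constant schedule."""
    idx = bisect.bisect_right(SCHEDULE_ITERS, n) - 1
    return SCHEDULE_SIGMAS[idx]


def toy_energy_grad(y, target=3.0):
    """Gradient of a hypothetical toy energy E(y) = 0.5 * (y - target)**2."""
    return y - target


def langevin_sample(y0=0.0, seed=0):
    """Run NUM_STEPS noisy gradient steps on the toy energy and return the final y."""
    rng = random.Random(seed)
    y = y0
    for n in range(NUM_STEPS):
        # Gaussian noise whose standard deviation follows the schedule.
        noise = rng.gauss(0.0, sigma_at(n))
        # Gradient step on the energy plus injected noise.
        y = y - STEP_SIZE * toy_energy_grad(y) + noise
    return y
```

Calling `langevin_sample()` returns a value close to the toy energy's minimum at 3.0, because the late iterations inject very little noise. In the actual method the variable being updated is a sequence of token logits and the energy combines fluency and constraint terms, so this sketch only conveys the shape of the schedule.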