Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Text Generation by Learning from Demonstrations
Authors: Richard Yuanzhe Pang, He He
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results on news summarization, question generation, and machine translation show that GOLD leads to better model performance than MLE and RL fine-tuning by both task metrics and human-rated quality. |
| Researcher Affiliation | Academia | Richard Yuanzhe Pang (1), He He (1, 2); (1) Courant Institute of Mathematical Sciences, New York University, New York, NY 10011, USA; (2) Center for Data Science, New York University, New York, NY 10011, USA |
| Pseudocode | Yes | Algorithm 1: GOLD |
| Open Source Code | Yes | The code is available. Code: https://github.com/yzpang/gold-off-policy-text-gen-iclr21 |
| Open Datasets | Yes | We chose four text generation tasks: (1) question generation (NQG; Zhou et al., 2017):...; (2) summarization (CNN/DM; Hermann et al., 2015); (3) extreme summarization (XSum; Narayan et al., 2018):...; (4) machine translation (IWSLT14 De-En; Cettolo et al., 2014). |
| Dataset Splits | Yes | The train/dev/test split for NQG is 86229/8913/8919; the split for CNN/DM is 287227/13368/11490; the split for XSum is 204045/11332/11334; the split for IWSLT14 De-En is 160239/7283/6750. |
| Hardware Specification | Yes | We train using a single Nvidia GTX 1080 Ti (memory: 12 GB) GPU. For transformer models, we use Nvidia P40 GPUs (memory: 24 GB each). |
| Software Dependencies | No | The paper mentions 'fairseq' and refers to an implementation based on 'Cho et al. (2019)' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We use a learning rate of 5e-4. For NQG, we use a batch size of 32; for CNN/DM we use a batch size of 16. For transformer models, we use a learning rate of 2e-5 for NQG, CNN/DM, and XSum; 3e-4 for IWSLT14 De-En. For NQG, we use 512 tokens as batch size (for each of the four GPUs); for CNN/DM and XSum, we use 1024 tokens as batch size (for each of the four GPUs); for IWSLT14 De-En, we use 4096 tokens as batch size. |
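
The split sizes and transformer settings quoted in the table can be collected into one small structure for quick sanity checks. The sketch below is only an illustration: the dictionary layout and the helper names `split_fractions` and `tokens_per_step` are hypothetical and do not come from the paper or its released code. The quoted text does not say how many GPUs the IWSLT14 De-En batch size applies to, so that entry is left as unknown rather than guessed.

```python
# Values copied from the "Dataset Splits" and "Experiment Setup" rows.
# The layout and function names are illustrative assumptions only.

# (train, dev, test) example counts per task
SPLITS = {
    "NQG": (86229, 8913, 8919),
    "CNN/DM": (287227, 13368, 11490),
    "XSum": (204045, 11332, 11334),
    "IWSLT14 De-En": (160239, 7283, 6750),
}

# Transformer settings: learning rate, batch size in tokens, and the number
# of GPUs that batch size is quoted "for each of". None means not stated.
TRANSFORMER_SETUP = {
    "NQG": {"lr": 2e-5, "batch_tokens": 512, "gpus": 4},
    "CNN/DM": {"lr": 2e-5, "batch_tokens": 1024, "gpus": 4},
    "XSum": {"lr": 2e-5, "batch_tokens": 1024, "gpus": 4},
    "IWSLT14 De-En": {"lr": 3e-4, "batch_tokens": 4096, "gpus": None},
}


def split_fractions(task):
    """Return the train/dev/test shares of a task as fractions of its total."""
    counts = SPLITS[task]
    total = sum(counts)
    return tuple(n / total for n in counts)


def tokens_per_step(task):
    """Return tokens processed per update across all GPUs.

    If the GPU count is not stated, return the quoted batch size unchanged.
    """
    cfg = TRANSFORMER_SETUP[task]
    if cfg["gpus"] is None:
        return cfg["batch_tokens"]
    return cfg["batch_tokens"] * cfg["gpus"]


if __name__ == "__main__":
    for task in SPLITS:
        train, dev, test = split_fractions(task)
        print(f"{task}: train={train:.3f} dev={dev:.3f} test={test:.3f}, "
              f"tokens/step={tokens_per_step(task)}")
```

Under these assumptions the per-update token count for NQG is 512 tokens on each of four GPUs, and for CNN/DM and XSum it is 1024 tokens on each of four GPUs. Gradient accumulation, if any, is not covered by the quoted text and is ignored here.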