reproducibilityindex.ai

Text Generation by Learning from Demonstrations

Authors: Richard Yuanzhe Pang, He He

ICLR 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Results on news summarization, question generation, and machine translation show that GOLD leads to better model performance than MLE and RL fine-tuning by both task metrics and human-rated quality.
Researcher Affiliation	Academia	Richard Yuanzhe Pang (1), He He (1, 2); yzpang@nyu.edu, hehe@cs.nyu.edu; (1) Courant Institute of Mathematical Sciences, New York University, New York, NY 10011, USA; (2) Center for Data Science, New York University, New York, NY 10011, USA
Pseudocode	Yes	Algorithm 1: GOLD
Open Source Code	Yes	The code is available. Code: https://github.com/yzpang/gold-off-policy-text-gen-iclr21
Open Datasets	Yes	We chose four text generation tasks: (1) question generation (NQG; Zhou et al., 2017):...; (2) summarization (CNN/DM; Hermann et al., 2015); (3) extreme summarization (XSum; Narayan et al., 2018):...; (4) machine translation (IWSLT14 De-En; Cettolo et al., 2014).
Dataset Splits	Yes	The train/dev/test split for NQG is 86229/8913/8919; the split for CNN/DM is 287227/13368/11490; the split for XSum is 204045/11332/11334; the split for IWSLT14 De-En is 160239/7283/6750.
Hardware Specification	Yes	We train using a single Nvidia GTX 1080 Ti (memory: 12 GB) GPU. For transformer models, we use Nvidia P40 GPUs (memory: 24 GB each).
Software Dependencies	No	The paper mentions 'fairseq' and refers to an implementation based on 'Cho et al. (2019)' but does not provide specific version numbers for any software dependencies.
Experiment Setup	Yes	We use a learning rate of 5e-4. For NQG, we use a batch size of 32; for CNN/DM we use a batch size of 16. For transformer models, we use a learning rate of 2e-5 for NQG, CNN/DM, and XSum; 3e-4 for IWSLT14 De-En. For NQG, we use 512 tokens as batch size (for each of the four GPUs); for CNN/DM and XSum, we use 1024 tokens as batch size (for each of the four GPUs); for IWSLT14 De-En, we use 4096 tokens as batch size.
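
The split sizes and transformer settings reported in the table can be collected into one structure, which makes it easier to check totals and per-step token counts. The sketch below is illustrative only: the names `TaskSetup`, `SETUPS`, `total_examples`, and `tokens_per_step` are hypothetical and do not come from the paper or its repository. It also assumes that "for each of the four GPUs" means the per-GPU token budget is multiplied by four for one update, and that the IWSLT14 De-En figure is already the full batch, since the table gives no GPU count for that task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSetup:
    """Values copied from the table; field names are hypothetical."""
    train: int
    dev: int
    test: int
    lr: float                  # transformer learning rate
    tokens_per_batch: int      # batch size in tokens (per GPU when num_gpus is set)
    num_gpus: Optional[int]    # None where the table does not state a GPU count


SETUPS = {
    "NQG": TaskSetup(86229, 8913, 8919, 2e-5, 512, 4),
    "CNN/DM": TaskSetup(287227, 13368, 11490, 2e-5, 1024, 4),
    "XSum": TaskSetup(204045, 11332, 11334, 2e-5, 1024, 4),
    "IWSLT14 De-En": TaskSetup(160239, 7283, 6750, 3e-4, 4096, None),
}


def total_examples(setup: TaskSetup) -> int:
    """Sum of the train, dev, and test split sizes."""
    return setup.train + setup.dev + setup.test


def tokens_per_step(setup: TaskSetup) -> int:
    """Tokens seen per update, assuming per-GPU batches are summed.

    Assumption: when no GPU count is reported, the stated batch size
    is treated as the whole batch.
    """
    if setup.num_gpus is None:
        return setup.tokens_per_batch
    return setup.tokens_per_batch * setup.num_gpus


if __name__ == "__main__":
    for name, setup in SETUPS.items():
        train_share = setup.train / total_examples(setup)
        print(f"{name}: total={total_examples(setup)}, "
              f"train share={train_share:.3f}, "
              f"lr={setup.lr}, tokens/step={tokens_per_step(setup)}")
```

The settings for the non-transformer models (learning rate 5e-4, batch sizes 32 and 16) are left out because the table reports them in examples rather than tokens.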