Calibrating Sequence Likelihood Improves Conditional Language Generation
Authors: Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, Peter J Liu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models. |
| Researcher Affiliation | Industry | Yao Zhao yaozhaoyz@google.com Misha Khalman khalman@google.com Rishabh Joshi rishabhjoshi@google.com Shashi Narayan shashinarayan@google.com Mohammad Saleh msaleh@google.com Peter J. Liu peterjliu@google.com Google Research, Brain Team |
| Pseudocode | Yes | Algorithm 1 Calibrating Sequence Likelihood (a hedged sketch of the calibration objective follows the table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | For abstractive summarization tasks, we choose CNN/Daily Mail (Hermann et al., 2015; See et al., 2017), XSUM (Narayan et al., 2018), Reddit TIFU-long (Kim et al., 2019) and SAMSum (Gliwa et al., 2019) due to their diversity in domain, style, abstractiveness, and summary lengths. For question answering related tasks, we choose generative question answering given context MSMARCO NLG (Bajaj et al., 2016) and its reverse problem of question generation SQuAD QG (Zhou et al., 2017; Du et al., 2017). For data-to-text tasks, we choose text generation given structured data WebNLG-en (Gardent et al., 2017) and common concepts reasoning CommonGen (Lin et al., 2020). |
| Dataset Splits | Yes | SQuAD QG (Zhou et al., 2017; Du et al., 2017) is the task of generating a question from a passage-answer pair extracted from the SQuAD dataset (Rajpurkar et al., 2016). In particular, we use the split of Du et al. (2017), consisting of 75,722, 10,570, and 11,877 examples for training, validation, and testing, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running its experiments. It mentions model sizes and compute (FLOPs) but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using a 'sentencepiece 96k vocabulary with byte-fallback (Kudo, 2018)' and refers to 'modern Transformer libraries (Wolf et al., 2020; Lewis et al., 2019; Raffel et al., 2020; Zhang et al., 2019a)' but does not provide specific version numbers for these software components or libraries, which would be necessary for reproducibility. |
| Experiment Setup | Yes | In all experiments, we use learning rate lr = 10^-4, and batch sizes of 512 to finetune and 64 to calibrate models. We use beam search to generate calibration candidates and evaluate the calibrated models, unless specified otherwise. (These values are summarized in the hypothetical configuration sketch after the table.) |
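
The Algorithm 1 cited in the pseudocode row alternates decoding candidates from a fine-tuned model with a calibration stage that raises the sequence likelihood of candidates that are more similar to the reference. Below is a minimal sketch of one such calibration objective, a pairwise rank-margin loss with a KL regularizer toward the fine-tuned model; the function names, tensor shapes, margin `beta`, and weight `reg_weight` are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch of a SLiC-style rank-margin calibration loss, assuming the
# candidate pair has already been split into a positive (more similar to the
# reference) and a negative candidate. Hyperparameter values are placeholders.
import torch
import torch.nn.functional as F


def sequence_log_prob(token_logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of a decoded candidate sequence."""
    log_probs = F.log_softmax(token_logits, dim=-1)         # [seq_len, vocab]
    picked = log_probs.gather(-1, token_ids.unsqueeze(-1))  # [seq_len, 1]
    return picked.sum()


def calibration_loss(pos_logits, pos_ids, neg_logits, neg_ids,
                     finetuned_pos_logits, beta: float = 1.0, reg_weight: float = 0.5):
    """Rank-margin calibration term plus a regularizer toward the fine-tuned model."""
    pos_lp = sequence_log_prob(pos_logits, pos_ids)
    neg_lp = sequence_log_prob(neg_logits, neg_ids)
    # Encourage the positive candidate's sequence likelihood to exceed the
    # negative candidate's by at least the margin beta.
    rank_loss = torch.clamp(beta - pos_lp + neg_lp, min=0.0)

    # Keep the calibrated model close to the fine-tuned model's token distribution.
    reg = F.kl_div(F.log_softmax(pos_logits, dim=-1),
                   F.softmax(finetuned_pos_logits, dim=-1),
                   reduction="batchmean")
    return rank_loss + reg_weight * reg
```

The paper also studies other calibration losses and regularizers; this sketch only illustrates the pairwise-ranking variant, and it assumes candidates have already been decoded (with beam search) and ranked by similarity to the reference.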
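
For the experiment-setup row, the quoted hyperparameters can be captured as a small configuration; the key names below are hypothetical, and only the numeric values and the use of beam search come from the quoted text.

```python
# Hypothetical configuration mirroring the quoted experiment setup; key names
# are illustrative assumptions, the values come from the paper's description.
SLIC_EXPERIMENT_SETUP = {
    "learning_rate": 1e-4,                # lr = 10^-4 in all experiments
    "finetune_batch_size": 512,           # batch size used to fine-tune
    "calibrate_batch_size": 64,           # batch size used to calibrate
    "candidate_decoding": "beam_search",  # calibration candidates decoded with beam search
    "eval_decoding": "beam_search",       # calibrated models evaluated with beam search
}
```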