Calibrating Sequence Likelihood Improves Conditional Language Generation
Authors: Yao Zhao, Mikhail Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, Peter J Liu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With SLiC, we exceed or match SOTA results on a wide range of generation tasks spanning abstractive summarization, question generation, abstractive question answering and data-to-text generation, even with modest-sized models. |
| Researcher Affiliation | Industry | Yao Zhao yaozhaoyz@google.com Misha Khalman khalman@google.com Rishabh Joshi rishabhjoshi@google.com Shashi Narayan shashinarayan@google.com Mohammad Saleh msaleh@google.com Peter J. Liu peterjliu@google.com Google Research, Brain Team |
| Pseudocode | Yes | Algorithm 1 Calibrating Sequence Likelihood (a hedged sketch of the calibration objective follows the table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | For abstractive summarization tasks, we choose CNN/Daily Mail (Hermann et al., 2015; See et al., 2017), XSUM (Narayan et al., 2018), Reddit TIFU-long (Kim et al., 2019) and SAMSum (Gliwa et al., 2019) due to their diversity in domain, style, abstractiveness, and summary lengths. For question answering related tasks, we choose generative question answering given context MSMARCO NLG (Bajaj et al., 2016) and its reverse problem of question generation SQuAD QG (Zhou et al., 2017; Du et al., 2017). For data-to-text tasks, we choose text generation given structured data WebNLG-en (Gardent et al., 2017) and common concepts reasoning CommonGen (Lin et al., 2020). |
| Dataset Splits | Yes | SQuAD QG (Zhou et al., 2017; Du et al., 2017) is the task of generating a question from a passage-answer pair extracted from the SQuAD dataset (Rajpurkar et al., 2016). In particular, we use the split of Du et al. (2017), consisting of 75,722, 10,570, and 11,877 examples for training, validation, and testing, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running its experiments. It mentions model sizes and compute (FLOPs) but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using a 'sentencepiece 96k vocabulary with byte-fallback (Kudo, 2018)' and refers to 'modern Transformer libraries (Wolf et al., 2020; Lewis et al., 2019; Raffel et al., 2020; Zhang et al., 2019a)' but does not provide specific version numbers for these software components or libraries, which would be necessary for reproducibility. |
| Experiment Setup | Yes | In all experiments, we use learning rate lr = 10^-4, and batch sizes of 512 to finetune and 64 to calibrate models. We use beam search to generate calibration candidates and evaluate the calibrated models, unless specified otherwise. (These values are summarized in the hypothetical configuration sketch after the table.) |
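
The Algorithm 1 cited in the pseudocode row alternates decoding candidates from a fine-tuned model with a calibration stage that raises the sequence likelihood of candidates that are more similar to the reference. Below is a minimal sketch of one such calibration objective, a pairwise rank-margin loss with a KL regularizer toward the fine-tuned model; the function names, tensor shapes, margin `beta`, and weight `reg_weight` are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch of a SLiC-style rank-margin calibration loss, assuming the
# candidate pair has already been split into a positive (more similar to the
# reference) and a negative candidate. Hyperparameter values are placeholders.
import torch
import torch.nn.functional as F


def sequence_log_prob(token_logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of a decoded candidate sequence."""
    log_probs = F.log_softmax(token_logits, dim=-1)         # [seq_len, vocab]
    picked = log_probs.gather(-1, token_ids.unsqueeze(-1))  # [seq_len, 1]
    return picked.sum()


def calibration_loss(pos_logits, pos_ids, neg_logits, neg_ids,
                     finetuned_pos_logits, beta: float = 1.0, reg_weight: float = 0.5):
    """Rank-margin calibration term plus a regularizer toward the fine-tuned model."""
    pos_lp = sequence_log_prob(pos_logits, pos_ids)
    neg_lp = sequence_log_prob(neg_logits, neg_ids)
    # Encourage the positive candidate's sequence likelihood to exceed the
    # negative candidate's by at least the margin beta.
    rank_loss = torch.clamp(beta - pos_lp + neg_lp, min=0.0)

    # Keep the calibrated model close to the fine-tuned model's token distribution.
    reg = F.kl_div(F.log_softmax(pos_logits, dim=-1),
                   F.softmax(finetuned_pos_logits, dim=-1),
                   reduction="batchmean")
    return rank_loss + reg_weight * reg
```

The paper also studies other calibration losses and regularizers; this sketch only illustrates the pairwise-ranking variant, and it assumes candidates have already been decoded (with beam search) and ranked by similarity to the reference.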
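
For the experiment-setup row, the quoted hyperparameters can be captured as a small configuration; the key names below are hypothetical, and only the numeric values and the use of beam search come from the quoted text.

```python
# Hypothetical configuration mirroring the quoted experiment setup; key names
# are illustrative assumptions, the values come from the paper's description.
SLIC_EXPERIMENT_SETUP = {
    "learning_rate": 1e-4,                # lr = 10^-4 in all experiments
    "finetune_batch_size": 512,           # batch size used to fine-tune
    "calibrate_batch_size": 64,           # batch size used to calibrate
    "candidate_decoding": "beam_search",  # calibration candidates decoded with beam search
    "eval_decoding": "beam_search",       # calibrated models evaluated with beam search
}
```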