Knowledge Infused Decoding
Authors: Ruibo Liu, Guoqing Zheng, Shashank Gupta, Radhika Gaonkar, Chongyang Gao, Soroush Vosoughi, Milad Shokouhi, Ahmed Hassan Awadallah
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On six diverse knowledge-intensive NLG tasks, task-agnostic LMs (e.g., GPT-2 and BART) armed with KID outperform many task-optimized state-of-the-art models, and show particularly strong performance in few-shot scenarios over seven related knowledge-infusion techniques. Human evaluation confirms KID's ability to generate more relevant and factual language for the input context when compared with multiple baselines. |
| Researcher Affiliation | Collaboration | Dartmouth College, Microsoft, Northwestern University {ruibo.liu.gr, soroush.vosoughi}@dartmouth.edu {zheng, shagup, ragaonka, milads, hassanam}@microsoft.com cygao@u.northwestern.edu |
| Pseudocode | Yes | Algorithm 1: Trie-Constrained Policy Gradient (Section 3.3) and Algorithm 2: The Generation Loop of KID (Appendix A.2.2); see the trie sketch after this table. |
| Open Source Code | Yes | Code for KID is available at https://github.com/microsoft/KID. |
| Open Datasets | Yes | We study Abstractive QA, which requires the model to generate free-form answers to the questions. We choose long-form QA task ELI5 (Fan et al., 2019b) and MSMARCO NLG task v2.1 (Nguyen et al., 2016)... We also use two extra QA tasks PIQA (Bisk et al., 2020) and PubMedQA (Jin et al., 2019)... We study ROC story ending generation (Mostafazadeh et al., 2016)... and αNLG (Bhagavatula et al., 2020)... We study two dialogue datasets that require knowledge grounding: Wizard of Wikipedia (WoW) (Dinan et al., 2019)... and MuTual (Cui et al., 2020)... |
| Dataset Splits | Yes | We tune the hyperparameters based on the model's performance on an in-house split dev set, and report the results that were best on the official dev set. (Section 4.1) and Table A1: the dataset statistics of the eight knowledge-intensive NLG tasks we evaluate for KID, e.g., ELI5: Train 272,764 / Dev 1,507 / Test 600 (Appendix A.1). |
| Hardware Specification | No | The paper mentions 'GPU' (footnote 2) and discusses GPT-Neo models of different sizes (1.3B and 2.7B parameters), but does not specify exact GPU models (e.g., NVIDIA A100), CPU models, or cloud instance types; it mentions only general terms like 'common hardware settings'. |
| Software Dependencies | No | The paper mentions software such as BERT, GPT-2, BART, DPR, OpenIE, faiss, and the Adam optimizer, but gives no version numbers for any of these components. |
| Experiment Setup | Yes | For sampling decoding, we run experiments with all combinations of top-p (p ∈ {0, 0.1, ..., 1}) and top-k (k ∈ {0, 10, ..., 100}), while for beam search, we sweep the number of beams from 1 to 10. (Section 4.1) and We set σ to 0.02 across all tasks. We empirically choose K = 3 for good performance in most cases. (Section 3.3). Also, The number of retrieved documents k is a task-specific hyper-parameter; we discuss its impact on performance in §4.3. (Section 3.1) and We first sample 200 ELI5 test set questions and generate answers of various lengths {80, 100, ..., 260} (260 is the average sequence length in the training set) with beam search, sampling, reflective (West et al., 2021), and KID. We then ask humans to rate these generations with 7-point Likert scoring (Joshi et al., 2015) on how likely the generated text is a natural sentence. Each generation receives at least 15 ratings. (Section 4.3). We run a paired-sample t-test comparing human references (Gold) with beam search (BM) with beam size 5, sampling (SP) with top-p = 0.9 and top-k = 20, reflective (RFLC) decoding, and our KID generation. (Section 4.4). A sketch of this decoding sweep follows the table. |
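The paper's Algorithm 1 (Trie-Constrained Policy Gradient) is given only as pseudocode. As a rough illustration of the token-level trie such constrained decoding relies on, here is a minimal sketch; all class and method names are our own, not from the KID codebase (see https://github.com/microsoft/KID for the authors' actual implementation):

```python
# Minimal sketch of a token trie for constraining next-token choices,
# in the spirit of KID's trie-constrained decoding (Algorithm 1).
# All names here are illustrative, not the paper's.

class TokenTrie:
    def __init__(self):
        self.children = {}  # token id -> TokenTrie

    def add(self, token_ids):
        """Insert one tokenized knowledge span into the trie."""
        node = self
        for tok in token_ids:
            node = node.children.setdefault(tok, TokenTrie())

    def allowed_next(self, prefix):
        """Return the token ids the trie permits after `prefix`,
        or an empty set if the prefix leaves the trie."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return set()
            node = node.children[tok]
        return set(node.children)

# Usage: build the trie from retrieved knowledge, then intersect
# `allowed_next` with the LM's top candidates at each decoding step.
trie = TokenTrie()
trie.add([464, 2068, 7586])            # token ids of a retrieved fact (made up)
print(trie.allowed_next([464, 2068]))  # {7586}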
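The decoding sweep quoted in the Experiment Setup row is straightforward to reproduce with the Hugging Face transformers API. A minimal sketch, assuming GPT-2 and standard `generate` arguments; the model choice, prompt, and generation length are placeholders, while the grid values follow the quote from Section 4.1:

```python
# Sketch of the top-p / top-k grid sweep and beam sweep from Section 4.1.
# Assumes the Hugging Face `transformers` library; gpt2 and the prompt
# are illustrative stand-ins, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Why is the sky blue?", return_tensors="pt")

# Sampling: all combinations of p in {0, 0.1, ..., 1} and k in {0, 10, ..., 100}.
for p in [i / 10 for i in range(11)]:
    for k in range(0, 101, 10):       # top_k=0 disables top-k filtering in HF
        out = model.generate(
            **inputs,
            do_sample=True,
            top_p=p,
            top_k=k,
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id,
        )
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        # ...score `text` on the in-house dev split here...

# Beam search: sweep the number of beams from 1 to 10.
for num_beams in range(1, 11):
    out = model.generate(**inputs, num_beams=num_beams, max_new_tokens=50,
                         pad_token_id=tokenizer.eos_token_id)
```

Per the quoted setup, the winning configuration would be selected on the in-house dev split and reported on the official dev set.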