Hindsight: Posterior-guided training of retrievers for improved open-ended generation

Authors: Ashwin Paranjape, Omar Khattab, Christopher Potts, Matei Zaharia, Christopher D. Manning

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate on two open-ended knowledge-intensive tasks: informative conversations and free-form question answering. We ask the following three research questions: RQ1 Relevance: Are the retrieved passages more relevant? (Section 4.4) RQ2 Groundedness: Does the generator make better use of the retrieved passages? (Section 4.5) RQ3 Generation Quality: Does this lead to better end-to-end performance? (Section 4.6)
Researcher Affiliation | Academia | Ashwin Paranjape, Omar Khattab, Christopher Potts, Matei Zaharia & Christopher D. Manning, Stanford University, {ashwinp,okhattab}@cs.stanford.edu
Pseudocode | No | The paper includes diagrams and descriptions of the training process, but no structured pseudocode or algorithm blocks.
Open Source Code | No | The code for recreating these experiments along with hyperparameters will be released at https://github.com/AshwinParanjape/hindsight.
Open Datasets | Yes | We evaluate with the Wizard of Wikipedia (WoW) dataset (Dinan et al., 2019), where an apprentice chats (via text) with a wizard, being curious about different topics, and the wizard grounds their response in a sentence from Wikipedia.
Dataset Splits | Yes | We use the version of this dataset provided in the KILT benchmark (Petroni et al., 2021) and report leaderboard performance on the held-out test set. We use the dev set to answer the granular research questions. (See the dataset-loading sketch after the table.)
Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions models like ColBERT and BART but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | At the beginning of each round, in the outer loop, we encode the passages and the queries with various retrievers and find the r highest-scoring passages, which we dub the closed-set. In the inner loop, which runs for many epochs, we sample k (= 8) passages from the closed-set (r = 100). This is fast because we are no longer retrieving from the entire corpus in the inner loop, and also sufficient because the closed-set has high recall. (A sketch of this loop follows the table.)
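
To make the quoted setup concrete, here is a minimal Python sketch of the round structure: an outer loop that builds a closed-set of the r highest-scoring passages per query, and an inner loop that samples k passages from it without touching the full corpus. The helper names, dense dot-product scoring, and uniform sampling are illustrative assumptions, not the authors' released code.

```python
import torch

def build_closed_set(passage_embs: torch.Tensor,
                     query_embs: torch.Tensor,
                     r: int = 100) -> torch.Tensor:
    """Outer loop: score all passages against all queries and keep the
    r highest-scoring passages per query (the 'closed-set')."""
    scores = query_embs @ passage_embs.T       # (num_queries, num_passages)
    return scores.topk(k=r, dim=1).indices     # (num_queries, r)

def sample_from_closed_set(closed_set: torch.Tensor,
                           k: int = 8) -> torch.Tensor:
    """Inner loop: draw k passages per query from its closed-set, so no
    full-corpus retrieval is needed during the many inner epochs.
    Uniform sampling here is a simplification."""
    num_queries, r = closed_set.shape
    picks = torch.multinomial(torch.ones(num_queries, r), num_samples=k)
    return torch.gather(closed_set, 1, picks)  # (num_queries, k)

# Example round: 10k passages, 32 queries, 128-dim embeddings.
passages = torch.randn(10_000, 128)
queries = torch.randn(32, 128)
closed = build_closed_set(passages, queries, r=100)
batch = sample_from_closed_set(closed, k=8)    # passage indices per query
```

Separating the expensive full-corpus scoring (once per round) from the cheap in-closed-set sampling (every step) is what makes the inner loop fast while keeping recall high.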
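
Since the evaluation uses the KILT-provided Wizard of Wikipedia splits, here is a short hedged sketch of loading them with the Hugging Face datasets library. The "kilt_tasks"/"wow" identifiers reflect that public distribution and are an assumption on our part, not code from the paper.

```python
from datasets import load_dataset  # Hugging Face datasets

# Load the KILT-formatted Wizard of Wikipedia task; dataset and config
# names are assumptions based on the Hub's KILT distribution.
wow = load_dataset("kilt_tasks", name="wow")

train, dev = wow["train"], wow["validation"]  # dev drives RQ1-RQ3 analysis
print(len(train), len(dev))
print(train[0]["input"])                      # dialogue history / query text
```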