Achieving Human Parity in Content-Grounded Datasets Generation

Authors: Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

ICLR 2024

Reproducibility assessment. Each entry below gives the reproducibility variable, its result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data (ELI5 and ASQA for LFQA, and CNN-DailyMail for Summarization). We show that our models are on par with or outperform models trained on human-generated data, and consistently outperform them in faithfulness."

Researcher Affiliation: Collaboration
LLM Response: Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen (IBM Israel Research Lab; Hebrew University of Jerusalem; MIT). Contact: {Asaf.Yehudai, leshem.choshen}@ibm.com

Pseudocode: Yes
LLM Response: "Figure 2: An illustration of our data generation prompt. In black is the few-shot prompt we give the model. In pink is a new QA pair that the model generated based on the provided content. Instruction: Given the next [document], create a [question] and [answer] pair..."

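To make the quoted prompt format concrete, here is a minimal sketch of how such a few-shot, content-grounded generation prompt could be assembled. The instruction wording, field tags, and the `build_prompt` helper are illustrative assumptions, not the paper's exact prompt from Figure 2.

```python
# Minimal sketch of a few-shot, content-grounded generation prompt.
# The instruction text, field tags, and example structure are illustrative
# placeholders, not the exact prompt shown in the paper.

INSTRUCTION = (
    "Instruction: Given the next [document], create a [question] and "
    "[answer] pair that are grounded in it."
)

def build_prompt(few_shot_examples, new_document):
    """Concatenate the instruction, worked examples, and the new document."""
    parts = [INSTRUCTION]
    for ex in few_shot_examples:
        parts.append(
            f"[document]: {ex['document']}\n"
            f"[question]: {ex['question']}\n"
            f"[answer]: {ex['answer']}"
        )
    # The model is expected to continue with a new [question]/[answer] pair.
    parts.append(f"[document]: {new_document}\n[question]:")
    return "\n\n".join(parts)
```
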
Open Source Code: No
LLM Response: The paper states, "We publicly release all three wishes datasets," which refers to the generated datasets, not to open-source code for the Genie methodology itself. No explicit statement about, or link to, the methodology's code was found.

Open Datasets: Yes
LLM Response: "ELI5 (Explain Like I'm Five) (Fan et al., 2019) comprises open-ended questions and extensive responses authored by users within the Reddit forum of the same name. ... ASQA (Answer Summaries for Questions which are Ambiguous) (Stelmakh et al., 2022) is a dataset ... NQ (Natural Questions) (Kwiatkowski et al., 2019) is a dataset of real user questions sourced from the Google search engine. ... CNN-Daily Mail is a dataset commonly used for text summarization."

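For readers who want to fetch these corpora, a minimal sketch using the Hugging Face `datasets` library follows. The hub identifiers, configs, and split names are assumptions about commonly used hub versions, not the paper's exact data preparation.

```python
# Sketch: loading the public datasets referenced above from the Hugging Face hub.
# Identifiers, configs, and splits are assumptions; the paper does not specify them.
from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_eli5")                  # LFQA (Fan et al., 2019)
asqa = load_dataset("din0s/asqa", split="train")                 # ASQA (Stelmakh et al., 2022)
nq = load_dataset("natural_questions", split="train")            # NQ (Kwiatkowski et al., 2019)
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")   # summarization
```
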
Dataset Splits: No
LLM Response: The paper mentions 'train' and 'test' sets, and sometimes 'development and test sets' (e.g., for ASQA), but it does not provide specific percentages, sample counts, or clearly predefined validation splits for all of the datasets used, which limits reproducibility.

Hardware Specification: No
LLM Response: The paper mentions the use of large language models such as Falcon-40B, Llama-2-70B, Flan-xl, and Llama-2-13B-Chat, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for their training or inference.

Software Dependencies: Yes
LLM Response: "Here, we use the reward model for both purposes and rely on the Open-Assistant model (Köpf et al., 2023), using the DeBERTa-v3 architecture (He et al., 2021). We filter generated examples whose score is below 0.5 by the reward model reward-model-deberta-v3-large-v2. We chose 0.5 as a threshold based on experimentation. Similarly, we use the t5_xxl_true_nli_mixture model to filter examples deemed unfaithful by it."

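A minimal sketch of how these two filtering stages could be wired together with the named checkpoints is shown below. Apart from the quoted 0.5 reward threshold and the checkpoint names, the input formatting and helper functions are assumptions, not the authors' released code.

```python
# Sketch of the two filtering stages: (1) reward-model quality filter,
# (2) NLI-based faithfulness filter. Helper names and I/O formatting are assumed.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

# 1) Reward-model filter: drop generated examples scoring below 0.5.
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tok = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(RM_NAME)

def reward_score(question: str, answer: str) -> float:
    inputs = rm_tok(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

# 2) Faithfulness filter: the TRUE NLI model outputs "1" when the hypothesis
#    (the generated answer) is entailed by the premise (the grounding document).
NLI_NAME = "google/t5_xxl_true_nli_mixture"
nli_tok = T5Tokenizer.from_pretrained(NLI_NAME)
nli = T5ForConditionalGeneration.from_pretrained(NLI_NAME)

def is_faithful(document: str, answer: str) -> bool:
    prompt = f"premise: {document} hypothesis: {answer}"
    inputs = nli_tok(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = nli.generate(**inputs, max_new_tokens=2)
    return nli_tok.decode(out[0], skip_special_tokens=True).strip() == "1"

def keep(example: dict) -> bool:
    """Keep an example only if it passes both the quality and faithfulness filters."""
    return (
        reward_score(example["question"], example["answer"]) >= 0.5
        and is_faithful(example["document"], example["answer"])
    )
```
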
Experiment Setup: No
LLM Response: The paper states: "To ensure a fair comparison, we maintain an equal number of examples from each dataset (10,000) and employ identical models for training, using the same set of hyperparameters." However, it does not provide the specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text.