Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Achieving Human Parity in Content-Grounded Datasets Generation
Authors: Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data ELI5 and ASQA for LFQA and CNN-Daily Mail for Summarization. We show that our models are on par with or outperforming models trained on human-generated data and consistently outperforming them in faithfulness. |
| Researcher Affiliation | Collaboration | Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen — IBM Israel Research Lab, Hebrew University of Jerusalem, MIT |
| Pseudocode | Yes | Figure 2: An illustration of our data generation prompt. In black is the few-shot prompt we give the model. In pink a new QA that the model generated based on the provided content. Instruction: Given the next [document], create a [question] and [answer] pair... |
| Open Source Code | No | The paper states, "We publicly release all three wishes datasets," which refers to the generated datasets, not the open-source code for the Genie methodology itself. No explicit statement or link for the methodology's code was found. |
| Open Datasets | Yes | ELI5 (Explain Like I'm Five) (Fan et al., 2019) comprises open-ended questions and extensive responses authored by users within the Reddit forum of the same name. ... ASQA (Answer Summaries for Questions which are Ambiguous) (Stelmakh et al., 2022) is a dataset ... NQ (Natural Questions) (Kwiatkowski et al., 2019) is a dataset of real user questions sourced from the Google search engine. ... CNN-Daily Mail is a dataset commonly used for text summarization. |
| Dataset Splits | No | The paper mentions 'train' and 'test' sets, and for some datasets (e.g., ASQA) 'development and test sets', but it does not provide explicit split percentages, sample counts, or predefined validation splits for all datasets, which would be needed for reproducibility. |
| Hardware Specification | No | The paper mentions the use of large language models like Falcon-40B, Llama-2-70B, Flan-xl, and llama-2-13b-Chat, but it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for their training or inference. |
| Software Dependencies | Yes | Here, we use the reward model for both purposes and rely on the Open-Assistant model (Köpf et al., 2023), using the DeBERTa-v3 architecture (He et al., 2021). We filter generated examples whose score is below 0.5 by the reward model reward-model-deberta-v3-large-v2. We chose 0.5 as a threshold based on experimentation. Similarly, we use the t5_xxl_true_nli_mixture model to filter examples deemed unfaithful by it. |
| Experiment Setup | No | The paper states: "To ensure a fair comparison, we maintain an equal number of examples from each dataset (10,000) and employ identical models for training, using the same set of hyperparameters." However, it does not provide the specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text. |
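
The two-stage filtering described in the Software Dependencies row (discard generations the reward model scores below 0.5, then drop examples the NLI model deems unfaithful) can be sketched as follows. This is a minimal illustration of the filtering logic only: `reward_score` and `nli_faithful` are hypothetical stand-ins for actual inference calls to reward-model-deberta-v3-large-v2 and t5_xxl_true_nli_mixture, which the paper does not specify in code form.

```python
REWARD_THRESHOLD = 0.5  # threshold the authors report choosing via experimentation


def filter_generated_examples(examples, reward_score, nli_faithful):
    """Two-stage filter over generated (question, answer) examples.

    reward_score(ex) -> float and nli_faithful(ex) -> bool are hypothetical
    stand-ins for inference with the reward model and the NLI model,
    respectively; any real pipeline would plug model calls in here.
    """
    # Stage 1: keep only examples the reward model scores at or above threshold.
    kept = [ex for ex in examples if reward_score(ex) >= REWARD_THRESHOLD]
    # Stage 2: keep only examples the NLI model judges faithful to the source.
    return [ex for ex in kept if nli_faithful(ex)]


# Toy usage with stubbed scorers standing in for model inference.
examples = [
    {"answer": "grounded", "score": 0.9, "faithful": True},
    {"answer": "low quality", "score": 0.3, "faithful": True},
    {"answer": "hallucinated", "score": 0.8, "faithful": False},
]
kept = filter_generated_examples(
    examples,
    reward_score=lambda ex: ex["score"],
    nli_faithful=lambda ex: ex["faithful"],
)
```

With the stubbed scorers above, only the first example survives both stages; the second is removed by the reward threshold and the third by the faithfulness check.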