Understanding Finetuning for Factual Knowledge Extraction
Authors: Gaurav Rohit Ghosal, Tatsunori Hashimoto, Aditi Raghunathan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. |
| Researcher Affiliation | Academia | 1Department of Machine Learning, Carnegie Mellon University, Pittsburgh, USA 2Department of Computer Science, Stanford University, Stanford, USA. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. It describes processes textually and mathematically. |
| Open Source Code | No | The paper does not provide an explicit statement or link releasing source code for the described methodology. It mentions using open-source models (Llama-2-7B, Mistral-7B) but not a specific fine-tuning implementation. |
| Open Datasets | Yes | We use a subset of the PopQA dataset (Mallen et al., 2023) consisting of the country, sport and occupation relations, which we refer to as PopQA-Controlled. ... We also examine a subset of the Entity Questions (Sciavolino et al., 2022) dataset, which includes a diverse range of popular and less popular facts. ... Finally, we examine a subset of the MMLU dataset (Hendrycks et al., 2021) consisting of history questions. |
| Dataset Splits | Yes | We report the performance after tuning on a held-out validation set in all experiments. Tuning is performed individually for each fine-tuning dataset. |
| Hardware Specification | No | The paper mentions 'Center for AI Safety Compute Cluster' in the acknowledgements, but this is too vague and does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions LoRA (Hu et al., 2021) but does not provide specific version numbers for it or for any other software dependency. It also mentions Llama-2-7B and Mistral-7B, which are models rather than versioned software libraries. |
| Experiment Setup | Yes | Table 4 (hyperparameter ranges): Learning Rate: 1e-5, 1e-4, 1e-3; Weight Decay: 1e-6, 1e-5, 1e-4, 1e-3, 1e-2; (LoRA rank, LoRA α): (8, 16), (16, 32), (32, 64), (64, 128); LoRA: True, False |
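The hyperparameter ranges reported in Table 4 imply a modest search grid. A minimal sketch of enumerating that grid is shown below; the dictionary keys and the convention of collapsing the (rank, α) pair to `None` when LoRA is disabled are illustrative assumptions, not details taken from the paper.

```python
from itertools import product

# Hyperparameter ranges as reported in Table 4 of the paper.
LEARNING_RATES = [1e-5, 1e-4, 1e-3]
WEIGHT_DECAYS = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
LORA_RANK_ALPHA = [(8, 16), (16, 32), (32, 64), (64, 128)]


def build_grid():
    """Enumerate all hyperparameter configurations.

    When LoRA is disabled, the (rank, alpha) pair is irrelevant,
    so it is collapsed to a single None entry (an assumption here).
    """
    configs = []
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        for rank, alpha in LORA_RANK_ALPHA:
            configs.append({"lr": lr, "weight_decay": wd,
                            "lora": True, "lora_rank": rank, "lora_alpha": alpha})
        configs.append({"lr": lr, "weight_decay": wd,
                        "lora": False, "lora_rank": None, "lora_alpha": None})
    return configs


grid = build_grid()
# 3 learning rates * 5 weight decays * (4 LoRA settings + 1 non-LoRA) = 75
print(len(grid))
```

Under these assumptions the search covers 75 configurations per fine-tuning dataset, consistent with the paper's statement that tuning is performed individually for each dataset on a held-out validation set.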