Understanding Finetuning for Factual Knowledge Extraction

Authors: Gaurav Rohit Ghosal, Tatsunori Hashimoto, Aditi Raghunathan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On three question answering benchmarks (PopQA, EntityQuestions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data degrades downstream factuality by 5-10% and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. (A reproduction sketch of this comparison follows the table.)
Researcher Affiliation | Academia | (1) Department of Machine Learning, Carnegie Mellon University, Pittsburgh, USA; (2) Department of Computer Science, Stanford University, Stanford, USA.
Pseudocode | No | The paper contains no pseudocode or clearly labeled algorithm blocks; its procedures are described textually and mathematically.
Open Source Code | No | The paper provides no statement or link releasing source code for the described methodology. It mentions using open-source models (Llama-2-7B, Mistral-7B) but not a specific finetuning implementation.
Open Datasets | Yes | We use a subset of the PopQA dataset (Mallen et al., 2023) consisting of the country, sport, and occupation relations, which we refer to as PopQA-Controlled. ... We also examine a subset of the EntityQuestions (Sciavolino et al., 2021) dataset, which includes a diverse range of popular and less popular facts. ... Finally, we examine a subset of the MMLU dataset (Hendrycks et al., 2021) consisting of history questions. (See the dataset-assembly sketch after the table.)
Dataset Splits | Yes | We report the performance after tuning on a held-out validation set in all experiments. Tuning is performed individually for each finetuning dataset.
Hardware Specification | No | The paper mentions the 'Center for AI Safety Compute Cluster' in the acknowledgements, but this is too vague: no GPU models, CPU types, or memory specifications for the experiments are given.
Software Dependencies | No | The paper mentions 'LoRA (Hu et al., 2021)' but provides no version numbers for it or for any other software dependency. Llama-2-7B and Mistral-7B are models, not versioned software libraries.
Experiment Setup | Yes | Table 4 hyperparameter search ranges: Learning Rate: 1e-5, 1e-4, 1e-3; Weight Decay: 1e-6, 1e-5, 1e-4, 1e-3, 1e-2; (LoRA rank, LoRA α): (8, 16), (16, 32), (32, 64), (64, 128); LoRA: True, False. (A grid-search sketch follows the table.)
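
Reproduction sketches

The Research Type row summarizes the paper's central comparison: finetune on lesser-known versus better-known facts and measure downstream factual accuracy. Below is a minimal sketch of that protocol, assuming PopQA-style records with an s_pop subject-popularity field and a possible_answers list; the finetune and model.generate helpers are hypothetical placeholders for illustration, not the authors' code.

```python
# Sketch of the lesser-known vs. better-known finetuning comparison.
# Assumptions: examples carry "s_pop" (subject popularity) and
# "possible_answers" fields, as in the public PopQA release; finetune()
# and model.generate() are hypothetical helpers.
import statistics

def split_by_popularity(examples):
    """Partition QA examples into lesser- and better-known halves by subject popularity."""
    cutoff = statistics.median(ex["s_pop"] for ex in examples)
    lesser = [ex for ex in examples if ex["s_pop"] < cutoff]
    better = [ex for ex in examples if ex["s_pop"] >= cutoff]
    return lesser, better

def exact_match_accuracy(model, test_set):
    """Fraction of test questions whose greedy completion contains a gold answer."""
    hits = 0
    for ex in test_set:
        prediction = model.generate(ex["question"])  # hypothetical generation helper
        hits += any(ans.lower() in prediction.lower() for ans in ex["possible_answers"])
    return hits / len(test_set)

# lesser, better = split_by_popularity(train_examples)
# for name, subset in [("lesser-known", lesser), ("better-known", better), ("all", train_examples)]:
#     model = finetune(base_model, subset)  # hypothetical finetuning helper
#     print(name, exact_match_accuracy(model, test_examples))
```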
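The Open Datasets row describes PopQA-Controlled as the country, sport, and occupation relations of PopQA. The sketch below assembles such a subset with the HuggingFace datasets library; the dataset id "akariasai/PopQA" and the "prop" relation column are assumptions based on the public PopQA release, not details stated in the paper.

```python
# Sketch of assembling a PopQA-Controlled-style subset.
# Assumptions: PopQA is available as "akariasai/PopQA" on the HuggingFace Hub
# with a "test" split and a "prop" column naming each question's relation.
from datasets import load_dataset

RELATIONS = {"country", "sport", "occupation"}  # relations kept in PopQA-Controlled

popqa = load_dataset("akariasai/PopQA", split="test")
popqa_controlled = popqa.filter(lambda row: row["prop"] in RELATIONS)

print(f"PopQA-Controlled: {len(popqa_controlled)} of {len(popqa)} questions")
```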
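The Dataset Splits and Experiment Setup rows together describe per-dataset hyperparameter tuning over the Table 4 grid, with the winning configuration chosen on a held-out validation set. A minimal sketch of enumerating that grid with peft's LoraConfig follows; only the swept values come from Table 4, while the task_type setting and the finetune/evaluate helpers are assumptions.

```python
# Sketch of the Table 4 grid search with selection on a held-out validation set.
# Only the swept values are from the paper; everything else is illustrative.
from itertools import product

from peft import LoraConfig

LEARNING_RATES = [1e-5, 1e-4, 1e-3]
WEIGHT_DECAYS = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
LORA_RANK_ALPHA = [(8, 16), (16, 32), (32, 64), (64, 128)]
USE_LORA = [True, False]

def candidate_configs():
    """Enumerate the hyperparameter grid of Table 4."""
    for lr, wd, use_lora in product(LEARNING_RATES, WEIGHT_DECAYS, USE_LORA):
        if use_lora:
            # Rank and alpha are swept jointly, as paired in Table 4.
            for rank, alpha in LORA_RANK_ALPHA:
                lora = LoraConfig(r=rank, lora_alpha=alpha, task_type="CAUSAL_LM")
                yield {"lr": lr, "weight_decay": wd, "lora": lora}
        else:
            yield {"lr": lr, "weight_decay": wd, "lora": None}  # full finetuning

# Tuning is performed individually for each finetuning dataset:
# best = max(candidate_configs(),
#            key=lambda cfg: evaluate(finetune(base_model, train_set, cfg),
#                                     val_set))  # hypothetical helpers
```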