Understanding Finetuning for Factual Knowledge Extraction
Authors: Gaurav Rohit Ghosal, Tatsunori Hashimoto, Aditi Raghunathan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. |
| Researcher Affiliation | Academia | 1Department of Machine Learning, Carnegie Mellon University, Pittsburgh, USA 2Department of Computer Science, Stanford University, Stanford, USA. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. It describes processes textually and mathematically. |
| Open Source Code | No | The paper does not provide an explicit statement or link releasing source code for the described methodology. It mentions using open-source models (Llama-2-7B, Mistral-7B) but not a specific fine-tuning implementation. |
| Open Datasets | Yes | We use a subset of the PopQA dataset (Mallen et al., 2023) consisting of the country, sport and occupation relations, which we refer to as PopQA-Controlled. ... We also examine a subset of the Entity Questions (Sciavolino et al., 2022) dataset, which includes a diverse range of popular and less popular facts. ... Finally, we examine a subset of the MMLU dataset (Hendrycks et al., 2021) consisting of history questions. |
| Dataset Splits | Yes | We report the performance after tuning on a held-out validation set in all experiments. Tuning is performed individually for each fine-tuning dataset. |
| Hardware Specification | No | The paper mentions 'Center for AI Safety Compute Cluster' in the acknowledgements, but this is too vague and does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions LoRA (Hu et al., 2021) but does not provide specific version numbers for it or for any other software dependency. It also mentions Llama-2-7B and Mistral-7B, which are models rather than versioned software libraries. |
| Experiment Setup | Yes | Table 4 (hyperparameter ranges): Learning Rate: 1e-5, 1e-4, 1e-3; Weight Decay: 1e-6, 1e-5, 1e-4, 1e-3, 1e-2; (LoRA rank, LoRA α): (8, 16), (16, 32), (32, 64), (64, 128); LoRA: True, False |
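The hyperparameter ranges reported in Table 4 imply a modest search grid. A minimal sketch of enumerating that grid is shown below; the dictionary keys and the convention of collapsing the (rank, α) pair to `None` when LoRA is disabled are illustrative assumptions, not details taken from the paper.

```python
from itertools import product

# Hyperparameter ranges as reported in Table 4 of the paper.
LEARNING_RATES = [1e-5, 1e-4, 1e-3]
WEIGHT_DECAYS = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
LORA_RANK_ALPHA = [(8, 16), (16, 32), (32, 64), (64, 128)]


def build_grid():
    """Enumerate all hyperparameter configurations.

    When LoRA is disabled, the (rank, alpha) pair is irrelevant,
    so it is collapsed to a single None entry (an assumption here).
    """
    configs = []
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        for rank, alpha in LORA_RANK_ALPHA:
            configs.append({"lr": lr, "weight_decay": wd,
                            "lora": True, "lora_rank": rank, "lora_alpha": alpha})
        configs.append({"lr": lr, "weight_decay": wd,
                        "lora": False, "lora_rank": None, "lora_alpha": None})
    return configs


grid = build_grid()
# 3 learning rates * 5 weight decays * (4 LoRA settings + 1 non-LoRA) = 75
print(len(grid))
```

Under these assumptions the search covers 75 configurations per fine-tuning dataset, consistent with the paper's statement that tuning is performed individually for each dataset on a held-out validation set.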