Dissecting learning and forgetting in language model finetuning
Authors: Xiao Zhang, Ji Wu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we analyze the effects of finetuning on language models by dissecting its impacts on the modeling of topic, style, and factual knowledge in text. Our method uses instruction-following LLMs such as ChatGPT to auto-generate controlled-variable text examples which we use to probe the model. Our findings reveal that finetuning results in significant shifts in the language model's topic and style priors, while actual knowledge learning only contributes to a small fraction of the total probability change. |
| Researcher Affiliation | Academia | Xiao Zhang & Ji Wu, Department of Electronics Engineering, Tsinghua University. xzhang19@mails.tsinghua.edu.cn, wuji ee@mail.tsinghua.edu.cn |
| Pseudocode | No | The paper describes its methods in narrative text and with mathematical equations, but it does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our data [1] and code [2] are made publicly available. [2] https://github.com/xiaozeroone/lm_finetune_dissect |
| Open Datasets | Yes | Data. We utilize two corpora in our analysis: PubMed [1], a collection of biomedical paper abstracts, and C4 (Raffel et al., 2020), a large corpus of web text. PubMed is commonly used in finetuning language models for the biomedical domain (Yasunaga et al., 2022; Luo et al., 2022; Wu et al., 2023). [1] https://pubmed.ncbi.nlm.nih.gov. We use the annual baseline data of 2023. |
| Dataset Splits | Yes | Learning rates are selected for each model using a grid search on a validation set. We randomly sampled 1000 documents from PubMed and C4 (the validation split) respectively (each document has at least 500 characters), and generated derived datasets from the samples. (A sampling sketch follows the table.) |
| Hardware Specification | Yes | Finetuning is performed with Huggingface's transformers library (Wolf et al., 2020), with bfloat16 mixed precision on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Huggingface's transformers library (Wolf et al., 2020)' and the 'AdamW optimizer (Loshchilov & Hutter, 2019)' but does not provide specific version numbers for these or other software dependencies. It therefore does not meet the requirement of including specific version numbers for key software components. |
| Experiment Setup | Yes | We finetune models on subsets of different sizes, up to 1 million abstracts. We use both full-finetuning and low-rank finetuning (Hu et al., 2022). We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 3e-6 for full-finetuning LLaMA and 1e-4 for full-finetuning GPT-2 XL and low-rank finetuning of LLaMA, all with 10% warm-up and linear learning rate decay. Learning rates are selected for each model using a grid search on a validation set. The batch size is set to 64. (A configuration sketch follows the table.) |
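
The validation sampling described in the Dataset Splits row could be approximated as below. This is a minimal sketch, assuming C4 is read from the Hugging Face Hub as `allenai/c4`; the PubMed 2023 annual-baseline abstracts would need to be downloaded separately from the URL above, and the authors' exact sampling procedure is not specified.

```python
# Hedged sketch: draw ~1000 C4 validation documents of at least 500 characters.
# The buffer-based shuffle of a streaming dataset is only approximately random;
# this is an assumption, not the paper's procedure.
from datasets import load_dataset

def sample_validation_docs(n_docs=1000, min_chars=500, seed=0):
    """Sample n_docs documents with >= min_chars characters from the C4 validation split."""
    stream = load_dataset("allenai/c4", "en", split="validation", streaming=True)
    stream = stream.shuffle(seed=seed, buffer_size=10_000)
    docs = []
    for example in stream:
        if len(example["text"]) >= min_chars:
            docs.append(example["text"])
        if len(docs) == n_docs:
            break
    return docs

if __name__ == "__main__":
    validation_docs = sample_validation_docs()
    print(f"Sampled {len(validation_docs)} documents")
```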
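
The finetuning configuration quoted in the Experiment Setup and Hardware Specification rows could look roughly like the following sketch. Only the learning rates, 10% warm-up, linear decay, batch size of 64, bfloat16, AdamW, and the full vs. low-rank distinction come from the paper; the `Trainer` usage, LoRA rank and target modules, gradient-accumulation split, and epoch count are assumptions.

```python
# Hedged sketch of the reported finetuning setup using Hugging Face transformers + peft.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

def build_trainer(model_name, train_dataset, full_finetune=True):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    if not full_finetune:
        # Low-rank finetuning (Hu et al., 2022); rank, alpha, and target modules are assumed values.
        lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
        model = get_peft_model(model, lora)

    # Paper: 3e-6 for full-finetuning LLaMA; 1e-4 for full-finetuning GPT-2 XL
    # and low-rank finetuning of LLaMA. This toggle assumes the LLaMA case.
    lr = 3e-6 if full_finetune else 1e-4

    args = TrainingArguments(
        output_dir="finetune_out",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,   # effective batch size 64; the split is an assumption
        learning_rate=lr,
        warmup_ratio=0.1,                # 10% warm-up
        lr_scheduler_type="linear",      # linear learning rate decay
        bf16=True,                       # bfloat16 mixed precision
        num_train_epochs=1,              # epoch count is an assumption
        optim="adamw_torch",             # AdamW (Loshchilov & Hutter, 2019)
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
```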