Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Dissecting learning and forgetting in language model finetuning
Authors: Xiao Zhang, Ji Wu
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we analyze the effects of finetuning on language models by dissecting its impacts on the modeling of topic, style, and factual knowledge in text. Our method uses instruction-following LLMs such as ChatGPT to autogenerate controlled-variable text examples which we use to probe the model. Our findings reveal that finetuning results in significant shifts in the language model's topic and style priors, while actual knowledge learning only contributes to a small fraction of the total probability change. |
| Researcher Affiliation | Academia | Xiao Zhang & Ji Wu Department of Electronics Engineering Tsinghua University EMAIL, wuji EMAIL |
| Pseudocode | No | The paper describes its methods in narrative text and with mathematical equations, but it does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our data¹ and code² are made publicly available. ² https://github.com/xiaozeroone/lm_finetune_dissect |
| Open Datasets | Yes | Data. We utilize two corpus in our analysis: PubMed¹, a collection of biomedical papers abstracts, and C4 (Raffel et al., 2020), a large corpus of web text. PubMed is commonly used in finetuning language models for the biomedical domain (Yasunaga et al., 2022; Luo et al., 2022; Wu et al., 2023). ¹ https://pubmed.ncbi.nlm.nih.gov. We use the annual baseline data of 2023. |
| Dataset Splits | Yes | Learning rates are selected for each model using a grid search on a validation set. We randomly sampled 1000 documents from PubMed and C4 (the validation split) respectively (each document have at least 500 characters), and generated derived datasets from the samples. |
| Hardware Specification | Yes | Finetuning is performed with Huggingface's transformer library (Wolf et al., 2020), with bfloat16 mix-precision on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Huggingface's transformer library (Wolf et al., 2020)' and 'AdamW optimizer (Loshchilov & Hutter, 2019)' but does not provide version numbers for these or any other software dependencies, so it does not meet the requirement of specifying versions for key software components. |
| Experiment Setup | Yes | We finetune models on subsets of different sizes, up to 1 million abstracts. We use both full-finetuning and low-rank finetuning (Hu et al., 2022). We use AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 3e-6 for full-finetuning LLaMA and 1e-4 for full-finetuning GPT-2 XL and low-rank finetuning of LLaMA, all with 10% warm-up and linear learning rate decay. Learning rates are selected for each model using a grid search on a validation set. The batch size is set to 64. |
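
The learning-rate schedule quoted in the Experiment Setup row (10% warm-up followed by linear decay) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lr_at_step`, the total step count of 1000, and the choice to decay all the way to zero are assumptions made for illustration. Only the peak learning rates (3e-6 and 1e-4) and the 10% warm-up fraction come from the quoted text.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.10) -> float:
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to peak_lr over the first warmup_frac of training,
    then falls linearly to 0 at total_steps (the decay endpoint is an
    assumption; the paper only states "linear learning rate decay").
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warm-up phase: scale the peak rate by the fraction of warm-up done.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the peak rate by the fraction of decay steps left.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Peak learning rates as reported in the paper excerpt.
PEAK_LR = {
    "llama_full": 3e-6,
    "gpt2_xl_full": 1e-4,
    "llama_low_rank": 1e-4,
}

# Hypothetical run length, chosen only to make the example concrete.
TOTAL_STEPS = 1000

if __name__ == "__main__":
    for name, peak in PEAK_LR.items():
        samples = [lr_at_step(s, TOTAL_STEPS, peak) for s in (0, 50, 100, 550, 1000)]
        print(name, samples)
```

With these assumed settings the rate peaks at step 100 (10% of 1000 steps) and is half the peak value at step 550, the midpoint of the decay phase.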