reproducibilityindex.ai

Dissecting learning and forgetting in language model finetuning

Authors: Xiao Zhang, Ji Wu

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this study, we analyze the effects of finetuning on language models by dissecting its impacts on the modeling of topic, style, and factual knowledge in text. Our method uses instruction-following LLMs such as ChatGPT to autogenerate controlled-variable text examples which we use to probe the model. Our findings reveal that finetuning results in significant shifts in the language model's topic and style priors, while actual knowledge learning only contributes to a small fraction of the total probability change.
Researcher Affiliation	Academia	Xiao Zhang & Ji Wu, Department of Electronics Engineering, Tsinghua University. xzhang19@mails.tsinghua.edu.cn, wuji_ee@mail.tsinghua.edu.cn
Pseudocode	No	The paper describes its methods in narrative text and with mathematical equations, but it does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code	Yes	Our data [1] and code [2] are made publicly available. [2] https://github.com/xiaozeroone/lm_finetune_dissect
Open Datasets	Yes	Data. We utilize two corpus in our analysis: PubMed [1], a collection of biomedical papers abstracts, and C4 (Raffel et al., 2020), a large corpus of web text. PubMed is commonly used in finetuning language models for the biomedical domain (Yasunaga et al., 2022; Luo et al., 2022; Wu et al., 2023). [1] https://pubmed.ncbi.nlm.nih.gov. We use the annual baseline data of 2023.
Dataset Splits	Yes	Learning rates are selected for each model using a grid search on a validation set. We randomly sampled 1000 documents from PubMed and C4 (the validation split) respectively (each document have at least 500 characters), and generated derived datasets from the samples.
Hardware Specification	Yes	Finetuning is performed with Huggingface's transformer library (Wolf et al., 2020), with bfloat16 mix-precision on NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions 'Huggingface's transformer library (Wolf et al., 2020)' and 'AdamW optimizer (Loshchilov & Hutter, 2019)' but gives no version numbers for these or any other software dependencies.
Experiment Setup	Yes	We finetune models on subsets of different sizes, up to 1 million abstracts. We use both full-finetuning and low-rank finetuning (Hu et al., 2022). We use AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 3e-6 for full-finetuning LLaMA and 1e-4 for full-finetuning GPT-2 XL and low-rank finetuning of LLaMA, all with 10% warm-up and linear learning rate decay. Learning rates are selected for each model using a grid search on a validation set. The batch size is set to 64.
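
The learning-rate schedule described in the Experiment Setup row (10% warm-up, then linear decay) can be illustrated with a short, dependency-free sketch. This is not the authors' code. The function name `linear_warmup_decay_lr`, the dictionary keys, the decay-to-zero endpoint, and the choice of a single pass over the largest subset (1 million abstracts) are assumptions made for illustration. Only the peak learning rates, the warm-up fraction, and the batch size come from the quoted setup.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr, warmup_frac=0.10):
    """Learning rate at a 0-indexed step: linear warm-up, then linear decay.

    Assumption: the rate decays to zero at `total_steps`; the quoted setup
    does not state the final value.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr over the warm-up steps.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Peak learning rates quoted in the setup (key names are hypothetical).
PEAK_LR = {
    "llama_full_finetune": 3e-6,
    "gpt2_xl_full_finetune": 1e-4,
    "llama_low_rank_finetune": 1e-4,
}
BATCH_SIZE = 64  # quoted in the setup

# Assumption: one pass over the largest subset of 1 million abstracts.
num_examples = 1_000_000
total_steps = num_examples // BATCH_SIZE
warmup_steps = int(total_steps * 0.10)

for name, peak in PEAK_LR.items():
    checkpoints = [0, warmup_steps, total_steps // 2, total_steps]
    rates = [linear_warmup_decay_lr(s, total_steps, peak) for s in checkpoints]
    print(name, [f"{r:.2e}" for r in rates])
```

Under these assumptions the run takes 15,625 optimizer steps, of which the first 1,562 are warm-up. A different number of epochs or a smaller subset changes both counts but not the shape of the schedule.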