Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Dynamics of LLM Finetuning
Authors: Yi Ren, Danica J. Sutherland
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. ... We now verify our analysis in practical settings. We first create the training set Dtrain by randomly selecting 5000 examples from the training split of the dataset. We consider two common datasets, Anthropic-HH (Y. Bai et al. 2022) and UltraFeedback (G. Cui et al. 2023), in all experiments. |
| Researcher Affiliation | Academia | Yi Ren, University of British Columbia; Danica J. Sutherland, University of British Columbia & Amii |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes methodologies in narrative text and mathematical derivations. |
| Open Source Code | Yes | Code for experiments is available at https://github.com/Joshua-Ren/Learning_dynamics_LLM. |
| Open Datasets | Yes | We consider two common datasets, Anthropic-HH (Y. Bai et al. 2022) and UltraFeedback (G. Cui et al. 2023), in all experiments. |
| Dataset Splits | Yes | We first create the training set Dtrain by randomly selecting 5000 examples from the training split of the dataset. ... To get a more detailed observation of the learning dynamics, we further create a probing dataset Dprob by randomly selecting 500 examples from Dtrain, ... (We also study another probing dataset where all x come from the test set in an ablation study in the appendix.) ... Specifically, we first randomly select 1000 test questions from the test split of Anthropic-HH and generate 1000 responses by feeding the prompts to each of these models (we use the default sampling setting provided in (Rafailov et al. 2023)). |
| Hardware Specification | No | ACKNOWLEDGEMENTS: This research was enabled in part by support provided by the Canada CIFAR AI Chairs program, WestGrid, and Compute Canada. The paper does not specify particular GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | To verify this claim, we finetune the model for several epochs and evaluate the model's prediction on all responses in Dprob every 25 updates (with a training batch size of 4, the probing occurs every 100 examples). ... The learning rates of both SFT and DPO are controlled to be the same (i.e., 5 × 10^-7, the default value in (Tajwar et al. 2024)). |
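The probing cadence quoted in the Experiment Setup row (evaluation every 25 optimizer updates with a batch size of 4, i.e., every 100 training examples) can be sketched as a minimal helper. All function and variable names below are illustrative assumptions, not from the paper's code; only the numbers come from the quoted setup.

```python
# Sketch of the probing schedule quoted above. Names are illustrative
# assumptions; only the constants (batch size 4, probe every 25 updates,
# hence every 100 examples) come from the quoted experiment setup.
BATCH_SIZE = 4
PROBE_EVERY_UPDATES = 25

def examples_between_probes(batch_size: int, probe_every_updates: int) -> int:
    """Training examples consumed between consecutive probe evaluations."""
    return batch_size * probe_every_updates

def should_probe(update_step: int, probe_every_updates: int = PROBE_EVERY_UPDATES) -> bool:
    """True on the optimizer steps where the model is evaluated on Dprob."""
    return update_step > 0 and update_step % probe_every_updates == 0

print(examples_between_probes(BATCH_SIZE, PROBE_EVERY_UPDATES))  # 100
```

With these constants, `should_probe` fires at steps 25, 50, 75, ..., matching the paper's statement that probing occurs every 100 examples.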