Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Dynamics of LLM Finetuning
Authors: Yi Ren, Danica J. Sutherland
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. ... We now verify our analysis in practical settings. We first create the training set Dtrain by randomly selecting 5000 examples from the training split of the dataset. We consider two common datasets, Anthropic-HH (Y. Bai et al. 2022) and UltraFeedback (G. Cui et al. 2023), in all experiments. |
| Researcher Affiliation | Academia | Yi Ren, University of British Columbia; Danica J. Sutherland, University of British Columbia & Amii |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It describes methodologies in narrative text and mathematical derivations. |
| Open Source Code | Yes | Code for experiments is available at https://github.com/Joshua-Ren/Learning_dynamics_LLM. |
| Open Datasets | Yes | We consider two common datasets, Anthropic-HH (Y. Bai et al. 2022) and UltraFeedback (G. Cui et al. 2023), in all experiments. |
| Dataset Splits | Yes | We first create the training set Dtrain by randomly selecting 5000 examples from the training split of the dataset. ... To get a more detailed observation of the learning dynamics, we further create a probing dataset Dprob by randomly selecting 500 examples from Dtrain, ... (We also study another probing dataset where all x come from the test set in an ablation study in the appendix.) ... Specifically, we first randomly select 1000 test questions from the test split of Anthropic-HH and generate 1000 responses by feeding the prompts to each of these models (we use the default sampling setting provided in (Rafailov et al. 2023)). |
| Hardware Specification | No | ACKNOWLEDGEMENTS: This research was enabled in part by support provided by the Canada CIFAR AI Chairs program, WestGrid, and Compute Canada. The paper does not specify particular GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | To verify this claim, we finetune the model for several epochs and evaluate the model's prediction on all responses in Dprob every 25 updates (with a training batch size of 4, the probing occurs every 100 examples). ... The learning rates of both SFT and DPO are controlled to be the same (i.e., 5 × 10^-7, the default value in (Tajwar et al. 2024)). |
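The probing cadence quoted in the Experiment Setup row (evaluation every 25 optimizer updates with a batch size of 4, i.e., every 100 training examples) can be sketched as a minimal helper. All function and variable names below are illustrative assumptions, not from the paper's code; only the numbers come from the quoted setup.

```python
# Sketch of the probing schedule quoted above. Names are illustrative
# assumptions; only the constants (batch size 4, probe every 25 updates,
# hence every 100 examples) come from the quoted experiment setup.
BATCH_SIZE = 4
PROBE_EVERY_UPDATES = 25

def examples_between_probes(batch_size: int, probe_every_updates: int) -> int:
    """Training examples consumed between consecutive probe evaluations."""
    return batch_size * probe_every_updates

def should_probe(update_step: int, probe_every_updates: int = PROBE_EVERY_UPDATES) -> bool:
    """True on the optimizer steps where the model is evaluated on Dprob."""
    return update_step > 0 and update_step % probe_every_updates == 0

print(examples_between_probes(BATCH_SIZE, PROBE_EVERY_UPDATES))  # 100
```

With these constants, `should_probe` fires at steps 25, 50, 75, ..., matching the paper's statement that probing occurs every 100 examples.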