Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Linearization Explains Fine-Tuning in Large Language Models

Authors: Zahra Rahimi Afzal, Tara Esmaeilbeig, Mojtaba Soltanalian, Mesrob I Ohannessian

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate our theory on Low Rank Adaptation (LoRA) on LLMs. These insights not only characterize fine-tuning but also have the potential to enhance PEFT techniques, paving the way to better informed and more nimble adaptation in LLMs. Through extensive experiments, we validate our theoretical results. We evaluate the condition number of NTK as an at-initialization metric to anticipate the performance of LoRA before training. In our experiments, we implement LoRA on RoBERTa base and evaluate its performance on the General Language Understanding Evaluation (GLUE) benchmark [20], IMDb [21], and Yelp [22] datasets.
Researcher Affiliation	Collaboration	Zahra Rahimi Afzal (a), Tara Esmaeilbeig (a,b), Mojtaba Soltanalian (a), Mesrob I. Ohannessian (a). (a) University of Illinois Chicago; (b) Nokia Bell Labs. EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Computation of $L_{\mathrm{avg}}$ and $L_{\mathrm{upper}}$ vs. $r$. Input: Set $S$ of data samples $(x, x')$. 1: Initialize list $L_{\max}$ (indexed by $\theta$). 2: Initialize lists $L_{\mathrm{avg}}$, $L_{\mathrm{upper}}$ (indexed by $r$). 3: for $r = 0$ to $R_{\max}$ do 4: Generate a new set $T(r)$ of $n_T$ models using $\theta = \theta_0 + \mathrm{distortion}(r)$, where $\mathrm{distortion}(r) = r\,v/\|v\|$, $v \sim \mathcal{N}(0, 1)$. 5: for all $\theta \in T(r)$ do 6: Initialize empty list $L_{\mathrm{list}}$. 7: Set model params to $\theta$. 8: for all $(x, x') \in S$ do 9: Compute $\|f_\theta(x') - f_\theta(x)\| / \|x' - x\|$ and append to $L_{\mathrm{list}}$. 10: end for 11: Append $\max(L_{\mathrm{list}})$ to $L_{\max}$. 12: end for 13: Append $\mathrm{mean}(L_{\max})$ to $L_{\mathrm{avg}}$. 14: Append $\max(L_{\max})$ to $L_{\mathrm{upper}}$. 15: end for Output: $L_{\mathrm{avg}}$ and $L_{\mathrm{upper}}$ vs. $r$. Algorithm 2: Trainable Parameter Selection via Spectral Perturbation. Input: Pretrained parameters $\theta$; scalar $\sigma > 0$; training samples $X = [x_1, \ldots, x_n]$; candidate parameter subsets $\{\hat{\theta}^{(1)}, \ldots, \hat{\theta}^{(L)}\}$; $\mathcal{C} = \{1, \ldots, L\}$. 1: Compute base NTK matrix $K \in \mathbb{R}^{n \times n}$. 2: for $l = 1, \ldots, L$ do 3: Compute kernel contribution $S_l$ for candidate $\hat{\theta}^{(l)}$. 4: end for 5: for each subset $C \subseteq \mathcal{C}$ do 6: Compute combined kernel $S_C \leftarrow \sum_{l \in C} S_l$. 7: Compute spectral ratio $r_C \leftarrow \lambda_{\max}(K + S_C + \sigma I) / \lambda_{\max}(K + \sigma I)$. 8: end for 9: Select $C^* \leftarrow \arg\min_C r_C$. Output: Selected parameters $\{\hat{\theta}^{(l)} : l \in C^*\}$.
Open Source Code	Yes	Our code is publicly available at https://github.com/zahrahimi/linearization. The codes will be uploaded as supplemental material and references to the datasets, which are open access, is provided in the main text.
Open Datasets	Yes	In our experiments, we implement LoRA on RoBERTa base and evaluate its performance on the General Language Understanding Evaluation (GLUE) benchmark [20], IMDb [21], and Yelp [22] datasets. The GLUE benchmark is a collection of diverse tasks... The IMDb dataset is a large dataset for binary sentiment classification... The Yelp dataset contains customer reviews from Yelp...
Dataset Splits	No	We collected $X = [x_1, x_2, \ldots, x_n]$ using $n = 32$ samples, randomly selected from the training datasets and computed $k(x_i, x_j)$ with respect to trainable parameters, $A$ and $B$ of LoRA. The final empirical NTK matrix is $K(X, X) \in \mathbb{R}^{32 \times 32}$. Our work is not specifically studying optimal sketching of the kernel matrix; however, in Appendix I, we empirically illustrate that our numerical results are robust to the choice of NTK samples. Note that the number of samples used for calculation of the empirical NTK is orders of magnitude smaller than the training dataset for sampling.
Hardware Specification	Yes	We implemented our code on NVIDIA Tesla V100 GPUs.
Software Dependencies	No	For all experiments, we use LoRA on the RoBERTa-base model from the Hugging Face transformers library [26], and report its performance on different tasks.
Experiment Setup	Yes	Table 9 in Appendix N shows specific hyperparameters for RoBERTa base across various benchmarks, including GLUE tasks (CoLA, SST-2), Yelp, and IMDb. For all experiments, we use LoRA on the RoBERTa-base model from the Hugging Face transformers library [26], and report its performance on different tasks. We implemented our code on NVIDIA Tesla V100 GPUs. Following Hu et al. (2022), we mostly use the weights of the query and value layers, $W_q \in \mathbb{R}^{m \times p}$ and $W_v \in \mathbb{R}^{m \times p}$, for fine-tuning. In our experiments, we apply LoRA with $r = 8$, which has $(m + p) \times r = 2 \times 768 \times 8$ trainable parameters per selected layer for each of the query, key, and value projection matrices in the self-attention mechanism in the RoBERTa base model.
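
The Pseudocode row quotes Algorithm 1, which estimates a local Lipschitz quantity at increasing perturbation radii r. The following is a minimal sketch of that loop in plain Python, not the authors' implementation. Several pieces are assumptions made for illustration: `toy_model` is a hypothetical single-unit stand-in for f_θ, the distortion is read as r·v/‖v‖ with v drawn from a standard normal, the per-model list of maxima is reset at each radius, and the names `lipschitz_curves`, `theta0`, and `pairs` are invented here.

```python
import math
import random


def toy_model(theta, x):
    """Hypothetical stand-in for f_theta: a single tanh unit (not from the paper)."""
    return math.tanh(sum(t * xi for t, xi in zip(theta, x)))


def lipschitz_curves(theta0, pairs, r_max, n_models, model=toy_model, seed=0):
    """Return (l_avg, l_upper), each a list indexed by radius r = 0..r_max.

    pairs is a list of (x, x_prime) input pairs with x != x_prime.
    """
    rng = random.Random(seed)
    l_avg, l_upper = [], []
    for r in range(r_max + 1):
        # Assumption: the list of per-model maxima is reset for each radius.
        l_max = []
        for _ in range(n_models):
            # Perturb theta0 by a random direction scaled to length r.
            v = [rng.gauss(0.0, 1.0) for _ in theta0]
            norm_v = math.sqrt(sum(vi * vi for vi in v))
            theta = [t + r * vi / norm_v for t, vi in zip(theta0, v)]
            # Largest output change per unit input change over all pairs.
            ratios = []
            for x, x_prime in pairs:
                num = abs(model(theta, x_prime) - model(theta, x))
                den = math.dist(x, x_prime)
                ratios.append(num / den)
            l_max.append(max(ratios))
        l_avg.append(sum(l_max) / len(l_max))
        l_upper.append(max(l_max))
    return l_avg, l_upper


# Toy usage with made-up parameters and sample pairs.
theta0 = [0.5, -0.3]
pairs = [
    ([0.0, 0.0], [1.0, 0.0]),
    ([1.0, 1.0], [0.0, 2.0]),
    ([0.2, -0.1], [0.4, 0.3]),
]
l_avg, l_upper = lipschitz_curves(theta0, pairs, r_max=3, n_models=5)
for r, (a, u) in enumerate(zip(l_avg, l_upper)):
    print(f"r={r}: L_avg={a:.4f}, L_upper={u:.4f}")
```

At r = 0 every sampled model equals θ0, so the average and the upper curve coincide; for larger r the upper curve can only sit at or above the average. For a real network, `model` would wrap a forward pass and the scalar absolute difference would become a vector norm.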