Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Parameter Efficient Fine-tuning via Explained Variance Adaptation

Authors: Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Deisenroth, Sepp Hochreiter

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply EVA to a variety of fine-tuning tasks such as language generation and understanding, image classification, and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution.
Researcher Affiliation	Collaboration	1 ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria; 2 University College London; 3 EMMI AI, Linz; 4 NXAI GmbH, Linz, Austria
Pseudocode	Yes	In Algorithm 1 we provide pseudocode for EVA. ... We show pseudocode for the incremental SVD algorithm in Algorithm 2.
Open Source Code	Yes	A Reproducibility Statement: The source code to reproduce the results collected in our work can be found at https://github.com/ml-jku/EVA.
Open Datasets	Yes	We fine-tune five different LLMs, namely Llama-2-7B (Touvron et al., 2023b), Llama-3.1-8B (Dubey et al., 2024), Llama-3.1-70B, Gemma-2-9B (Rivière et al., 2024), and Gemma-2-27B on common sense reasoning benchmarks. We follow Liu et al. (2024a) and amalgamate a training set consisting of BoolQ (Christopher et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2020), ARC-e and ARC-c (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018). ... We train RoBERTa Large (Liu et al., 2019) and DeBERTav3 Base (He et al., 2023) on the GLUE benchmark (Wang et al., 2019). ... We evaluate EVA on the VTAB-1K (Zhai et al., 2019) benchmark... We follow the single task fine-tuning experiments in Schmied et al. (2024) and fine-tune a Decision Transformer (Chen et al., 2021a, DT) on the Meta-World benchmark suite (Yu et al., 2020).
Dataset Splits	Yes	C.1 Dataset Statistics: The dataset statistics for each task in the GLUE benchmark (Wang et al., 2019) are shown in Table 15. ... D.1 Dataset statistics: The VTAB-1K benchmark consists of 19 datasets, each containing a subset of 1000 examples of their respective samples. ... We first fine-tune on the 800 train samples of the VTAB-1K datasets to find the best learning rate for the task. We sweep over learning_rate ∈ {2.5e-3, 1e-3, 7.5e-4, 5e-4, 2.5e-4} and rank ∈ {2, 4, 8, 16} and average the accuracy on the 200 validation samples over 3 different seeds to choose the best learning rate and rank for each dataset. For evaluation, we train on the union of train and validation set using five different seeds and report the average accuracy on the test set. E.1 Dataset statistics: ... We follow Wołczyk et al. (2021) and Schmied et al. (2024), and split the 50 tasks into 40 pre-training tasks (MT40) and 10 fine-tuning tasks (CW10).
Hardware Specification	Yes	For CorDA we use a sample size of 2560 as recommended by Yang et al. (2024). We observe that EVA with batch size of 16 requires only 0.7% of the training time for initialization, which is the fastest for data-driven initializations. ... All training and evaluation runs for Llama-2-7B were performed on 4 A100 GPUs. The runs for Llama-3.1-8B and Gemma-2-9B utilized two different nodes, one with 4 A100 GPUs and one with 4 H200 GPUs. ... We run all our experiments on a public research cluster with 4x A100-40GB GPU nodes.
Software Dependencies	No	The paper mentions using Python and PyTorch (Paszke et al., 2019) and refers to the PEFT library (Mangrulkar et al., 2022), but no specific version numbers are provided for these or other software components.
Experiment Setup	Yes	We follow the standard LoRA training procedure from Hu et al. (2022). Similarly to Kalajdzievski (2023), we found that LoRA training is very sensitive to the scaling parameter α. Therefore, we set α = 1 for all our experiments as we found this to be the most stable setting. For EVA with ρ > 1 we set α = r_new / r_old to preserve the scaling factor for different ranks. ... We train all methods with rank r = 16 and a learning rate of 5e-4 for three random seeds. ... All hyperparameters are summarized in Table 4.
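
Illustrative sketch (not taken from the paper's repository): the VTAB-1K selection procedure quoted under Dataset Splits and the α rule quoted under Experiment Setup can be written down in a few lines of Python. The learning-rate and rank grids and the three-seed averaging come from the quoted text. The function names, the seed values, and the `fake_evaluate` stand-in are hypothetical and exist only so that the example runs; in practice `evaluate` would fine-tune on the 800 training samples and return accuracy on the 200 validation samples. The check on α assumes the usual LoRA convention that the update is scaled by α / r.

```python
from itertools import product
from statistics import mean
from typing import Callable

# Grids quoted in the Dataset Splits row.
LEARNING_RATES = [2.5e-3, 1e-3, 7.5e-4, 5e-4, 2.5e-4]
RANKS = [2, 4, 8, 16]
# Assumption: the quoted text says "3 different seeds" but does not list them.
SEEDS = [0, 1, 2]


def select_config(evaluate: Callable[[float, int, int], float]) -> tuple[float, int]:
    """Return the (learning_rate, rank) pair with the highest mean validation accuracy.

    `evaluate(lr, rank, seed)` is a hypothetical callback that returns the
    validation accuracy of one fine-tuning run.
    """
    def mean_accuracy(config: tuple[float, int]) -> float:
        lr, rank = config
        return mean(evaluate(lr, rank, seed) for seed in SEEDS)

    return max(product(LEARNING_RATES, RANKS), key=mean_accuracy)


def rescaled_alpha(r_new: int, r_old: int) -> float:
    """alpha = r_new / r_old, as quoted in the Experiment Setup row."""
    return r_new / r_old


def fake_evaluate(lr: float, rank: int, seed: int) -> float:
    """Hypothetical stand-in for a real training run; peaks at lr=1e-3, rank=8."""
    return 1.0 - abs(lr - 1e-3) * 100 - abs(rank - 8) * 0.01 + seed * 1e-4


if __name__ == "__main__":
    print(select_config(fake_evaluate))
    # Under the alpha / r convention, alpha = 1 at rank 16 and the rescaled
    # alpha at rank 32 give the same effective scaling factor.
    print(1 / 16, rescaled_alpha(32, 16) / 32)
```

Under that convention, α = 1 at rank r_old gives a factor of 1 / r_old, and α = r_new / r_old at rank r_new gives (r_new / r_old) / r_new = 1 / r_old, which is the sense in which the scaling factor is preserved across ranks.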