Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Activated LoRA: Fine-tuned LLMs for Intrinsics

Authors: Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while significantly improving inference efficiency. In our experiments, we demonstrate significant speedups for aLoRA vs. LoRA on the state-of-the-art inference engine vLLM. We first train LoRA and aLoRA adapters on 4 LLMs on a set of benchmark SFT tasks, and then consider a set of more challenging intrinsics-type tasks for which well-engineered, very recent LoRA adapters already exist against which we can compare.
Researcher Affiliation	Industry	Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox IBM Research
Pseudocode	No	The paper describes methods and provides mathematical equations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	We contributed our Activated LoRA implementation to the Huggingface PEFT library (footnote 1: https://github.com/huggingface/peft).
Open Datasets	Yes	In [2], a collection of 1000 instruction SFT tasks were curated and LoRA adapters were trained for each. These tasks were drawn from the Super-Natural Instructions [42] benchmark collection of datasets, which in turn drew from sources such as MMLU [11] etc. We tested on binary answerability classification on the single-turn SQUADRun Benchmark [30] with the user query and the supporting documents, and the multi-turn MT-RAG Benchmark [15] using full multi-turn conversation history along with the supporting documents. Table 2 provides the Huggingface path for each dataset. The URL can be recovered as https://huggingface.co/datasets/Lots-of-LoRAs/PATH where PATH is the name indicated in Table 2.
Dataset Splits	Yes	Table 1 gives information for each dataset on the size of the train/validation/test splits, as well as the number of multiple-choice responses for the multiple-choice tasks. For the intrinsics tasks, the learning rate and number of epochs were tuned to achieve the best validation performance (as was the case for the LoRA adapters of [5]). Intrinsics Training Details: rank = 32, learning rate = 5e-6, number of epochs = 25, with early stopping based on validation set, and 90/10 split between training and validation.
Hardware Specification	Yes	All training runs were done on single H100 GPUs. Intrinsics training tasks each used an 8 GPU H100 node.
Software Dependencies	No	We contributed our Activated LoRA implementation to the Huggingface PEFT library. We modified vLLM [45] to be able to perform inference on aLoRAs. Training in our experiments is done using a standard Huggingface TRL [38] trainer (SFTTrainer). Specific version numbers for the software dependencies (PEFT, vLLM, TRL) are not provided.
Experiment Setup	Yes	for both LoRA and aLoRA we used 4 training epochs, alpha of 32, dropout of 0.05, adapted the K, Q, and V modules in all layers, and searched over ranks [6, 8, 16, 32] and learning rates [3×10⁻⁶, 10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴]. Batch size of 8 was used, with 16-bit arithmetic precision. For the intrinsics tasks, all attention weights (keys, queries, values) were adapted in all layers, using rank 32 adapters. Training Details: The aLoRA and LoRA adapters were fine-tuned under the following regime: rank = 32, learning rate = 5e-6, number of epochs = 25, with early stopping based on validation set, and 90/10 split between training and validation.
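
The search space quoted in the Experiment Setup row, and the dataset URL pattern from the Open Datasets row, can be written out as a minimal sketch. This is an illustration only, not the authors' training code: `AdapterRunConfig`, `search_grid`, and `dataset_url` are hypothetical names introduced here, the learning rates are read as 3×10⁻⁶ through 3×10⁻⁴ (the exponents' signs are inferred from the garbled extract), and the module labels are placeholders rather than the layer names used by any particular model.

```python
from dataclasses import dataclass
from itertools import product

# Values quoted in the Experiment Setup row (SFT benchmark tasks).
RANKS = [6, 8, 16, 32]
# Assumption: the extract's "3 10 6" etc. are read as negative powers of ten.
LEARNING_RATES = [3e-6, 1e-5, 3e-5, 1e-4, 3e-4]


@dataclass(frozen=True)
class AdapterRunConfig:
    """Hypothetical container for one LoRA/aLoRA training run."""
    rank: int
    learning_rate: float
    epochs: int = 4            # "4 training epochs"
    alpha: int = 32            # "alpha of 32"
    dropout: float = 0.05      # "dropout of 0.05"
    batch_size: int = 8        # "Batch size of 8"
    # Placeholder labels for the K, Q, and V modules; real module
    # names depend on the base model architecture.
    target_modules: tuple = ("k", "q", "v")


def search_grid():
    """Enumerate every (rank, learning rate) pair in the quoted grid."""
    return [
        AdapterRunConfig(rank=r, learning_rate=lr)
        for r, lr in product(RANKS, LEARNING_RATES)
    ]


def dataset_url(path):
    """Build the dataset URL from a Table 2 path, per the quoted pattern."""
    return f"https://huggingface.co/datasets/Lots-of-LoRAs/{path}"


if __name__ == "__main__":
    grid = search_grid()
    print(f"{len(grid)} configurations per adapter type")
    print(dataset_url("PATH"))  # PATH stands for a name from Table 2
```

Under this reading, the grid amounts to 4 ranks times 5 learning rates per adapter type and base model; the intrinsics tasks instead use the single fixed regime quoted above (rank 32, learning rate 5e-6, up to 25 epochs with early stopping).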