Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning

Authors: Arian Raje, Baris Askin, Divyansh Jhunjhunwala, Gauri Joshi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on vision and language benchmarks show that RAVAN improves test accuracy by 2-8% over prior parameter-efficient baselines, making it a robust and scalable solution for federated fine-tuning of LLMs. ... 4 Experiments
Researcher Affiliation	Academia	Arian Raje Baris Askin Divyansh Jhunjhunwala Gauri Joshi Department of Electrical and Computer Engineering Carnegie Mellon University Corresponding Author. EMAIL
Pseudocode	Yes	The pseudocode of the proposed method RAVAN is given in Algorithm 1, and the following sections highlight key components of our framework. ... Algorithm 1 RAVAN
Open Source Code	Yes	Our code base is provided in the supplementary material zip file with a README that includes instructions on run commands for our method and all baselines.
Open Datasets	Yes	For image classification, we adopt ViT-B/16 [11] (85 M parameters) and fine-tune on two benchmarks: (i) CIFAR-100 (50,000 train / 10,000 test images, 100 classes) and (ii) SVHN (73,250 train / 26,032 test digits, 10 classes). For natural-language tasks, we fine-tune T5-Base [31] (224 M parameters) on (i) 20 Newsgroups [28] (11,300 train / 7,532 test articles, 20 topics) and (ii) MRQA [14] (516,800 train / 58,221 test examples). ... Scaling to Larger Model Architectures. We demonstrate the scalability of RAVAN for larger model architectures by benchmarking the method against prior baselines on the GLUE benchmark [37] using LLaMA3.2-1B [13]
Dataset Splits	Yes	For image classification, we adopt ViT-B/16 [11] (85 M parameters) and fine-tune on two benchmarks: (i) CIFAR-100 (50,000 train / 10,000 test images, 100 classes) and (ii) SVHN (73,250 train / 26,032 test digits, 10 classes). For natural-language tasks, we fine-tune T5-Base [31] (224 M parameters) on (i) 20 Newsgroups [28] (11,300 train / 7,532 test articles, 20 topics) and (ii) MRQA [14] (516,800 train / 58,221 test examples). ... For I.I.D. partitions, clients receive an equal-sized random subsample of the global training set. For non-I.I.D. partitions, we draw client-specific class proportions from a Dirichlet distribution with α=0.3. For MRQA, which lacks class labels, the Dirichlet split is performed over the six constituent sub-datasets.
Hardware Specification	Yes	All experiments were executed on a GPU cluster managed by SLURM. Each training job used a single NVIDIA V100 32GB GPU with 256 GB RAM.
Software Dependencies	Yes	Our environment used Pytorch 2.5.1 and Huggingface 4.47.1 for all experiments.
Experiment Setup	Yes	Every selected client performs 50 local training iterations before uploading its update. Note, we intentionally train for 50 mini-batches and not 50 entire traversals of the client's training dataset so that each client performs exactly the same number of forward-backward passes. ... Table 7: FL hyperparameter settings used for each model-dataset pair. ... For RAVAN and each baseline, we run a learning rate hyperparameter sweep across the values {5e-5, 1e-5, 5e-4, 1e-4, 5e-3, 1e-3, 5e-2, 1e-2, 5e-2} and choose the most performant learning to represent in our results. Table 8 represents the optimal choices for each baseline in all settings. The following results each use the ADAM optimizer with momentum set to 0.9.
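
Illustration: The Dataset Splits entry above mentions non-I.I.D. client partitions drawn from a Dirichlet distribution with α=0.3. The sketch below shows one common way to build such a split; it is not taken from the paper's code and may differ from the authors' procedure. It uses the widely used variant that draws, for each class, a Dirichlet vector over clients. The function name `dirichlet_partition` and its parameters are hypothetical. For a dataset without class labels such as MRQA, the `labels` array could instead hold sub-dataset identifiers.

```python
import numpy as np


def dirichlet_partition(labels, num_clients, alpha=0.3, seed=0):
    """Assign sample indices to clients with label skew controlled by alpha.

    Hypothetical helper for illustration only. Smaller alpha gives more
    skewed (more non-I.I.D.) client datasets.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]

    for cls in np.unique(labels):
        # Indices of all samples belonging to this class, in random order.
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)

        # Fraction of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))

        # Turn the fractions into cut points and split the class indices.
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, chunk in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(chunk.tolist())

    return client_indices


if __name__ == "__main__":
    # Toy example: 10 classes with 100 samples each, split over 5 clients.
    toy_labels = np.repeat(np.arange(10), 100)
    parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.3)
    for client_id, indices in enumerate(parts):
        print(client_id, len(indices))
```

Every sample is assigned to exactly one client, but client dataset sizes generally differ under this variant.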