Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Federated Residual Low-Rank Adaption of Large Language Models
Authors: Yunlu Yan, Chun-Mei Feng, Wangmeng Zuo, Lei Zhu, Rick Siow Mong Goh, Yong Liu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that FRLoRA consistently outperforms various state-of-the-art FL methods across nine different benchmarks in natural language understanding and generation under different FL scenarios. Codes are available at https://github.com/IAMJackYan/FRLoRA. |
| Researcher Affiliation | Academia | 1 The Hong Kong University of Science and Technology (Guangzhou), China 2 Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 3 Harbin Institute of Technology, China 4 Pengcheng Laboratory, China 5 The Hong Kong University of Science and Technology, China |
| Pseudocode | Yes | Algorithm 1: FRLoRA. Input: number of clients K, communication rounds T, learning rate η, pre-trained weight W_0, datasets D_1, D_2, ..., D_K, rank r. Output: fine-tuned weight W̃_T. 1 Server-side Execution: |
| Open Source Code | Yes | Codes are available at https://github.com/IAMJackYan/FRLoRA. |
| Open Datasets | Yes | The experiments involved 4 NLU benchmarks: RTE (Wang et al., 2019), COLA (Wang et al., 2019), 20NG (Lang, 1995) and QNLI (Wang et al., 2019), as well as 5 NLG benchmarks: MetaMathQA (Yu et al., 2023), Alpaca-GPT4 (Peng et al., 2023), Fed-Aya (Ye et al., 2024a), Fed-ChatbotIT (Ye et al., 2024a), and Fed-WildChat (Ye et al., 2024a). |
| Dataset Splits | Yes | Following Kuang et al. (2024), we randomly partition the training set of each benchmark using Dirichlet distribution sampling (D_k ∼ Dir(β)), which is a commonly employed strategy for simulating realistic data heterogeneity (Ye et al., 2023). The level of data heterogeneity is controlled by β, where a smaller β means higher heterogeneity. In our experiments, β is set to 0.5. We simulate a scenario with 5 clients, all of which participate in training during each round. To evaluate the performance of all methods, we use the validation sets of RTE, COLA, and QNLI, and the test set of 20NG. [...] For MetaMathQA and Alpaca-GPT4, we partition the datasets in an IID manner, with 10 clients for MetaMathQA and 20 clients for Alpaca-GPT4. In each round, we randomly select 2 clients to participate in training. Fed-Aya, Fed-ChatbotIT and Fed-WildChat are realistic benchmarks with data heterogeneity, consisting of 38, 237, and 100 clients, respectively. Correspondingly, we randomly select 4, 10, and 5 clients to participate in training each round. |
| Hardware Specification | Yes | All experiments were implemented using PyTorch and conducted on an NVIDIA A100 GPU with 40 GB of memory. |
| Software Dependencies | No | All experiments were implemented using PyTorch and conducted on an NVIDIA A100 GPU with 40 GB of memory. |
| Experiment Setup | Yes | For LoRA, we set the parameter r to 16 and α to 32. The AdamW optimizer is used with a batch size of 64, a learning rate of 2e-4 and cosine annealing schedules. All methods are trained for 200 rounds. The local update step is set to 10 for RTE and 30 for QNLI, 20NG, and COLA based on the quantity of data in each dataset. [...] For MetaMathQA and Alpaca-GPT4, we set r to 32 and α to 64. For the remaining three benchmarks, r and α are set to 16 and 32, respectively. The AdamW optimizer is used with a learning rate of 5e-4 for MetaMathQA and Alpaca-GPT4, and 2e-4 for other benchmarks, following a cosine annealing schedule. We conduct training with rounds of either 100 or 200. [...] The training configuration is summarized in Table 8. |
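The Dirichlet partitioning quoted under Dataset Splits (D_k ∼ Dir(β), β = 0.5, 5 clients) can be sketched as follows. This is a generic illustration of the technique, not code from the paper's repository; the function name `dirichlet_partition` and the toy label list are assumptions for the example.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=5, beta=0.5, seed=0):
    """Split sample indices across clients by drawing, for each class,
    per-client proportions from Dir(beta); smaller beta -> more skew."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle the indices of class c, then slice them according
        # to Dirichlet-sampled proportions.
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(np.full(num_clients, beta))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, cuts)):
            client.extend(part.tolist())
    return client_indices

# Toy 4-class dataset: every sample ends up on exactly one client,
# but per-client class mixes are skewed when beta is small.
parts = dirichlet_partition([i % 4 for i in range(200)])
```

With β = 0.5 each client's class distribution deviates noticeably from uniform, matching the report's note that smaller β means higher heterogeneity.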