Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA

Authors: Seanie Lee, Sangwoo Park, Dong Bok Lee, Dominik Wagner, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically evaluate FedSVD on several benchmark datasets, including SNLI [5], MNLI [35], SST2 [29], QQP [26], QNLI [33], and HellaSwag [39], both in private and non-private settings. In both regimes, FedSVD consistently outperforms the relevant baselines during most communication rounds and achieves the highest final accuracy.
Researcher Affiliation	Collaboration	1 KAIST, 2 Technische Hochschule Nürnberg Georg Simon Ohm, 3 DeepAuto.ai
Pseudocode	Yes	Algorithm 1: FedSVD
Open Source Code	Yes	Our code is publicly available at https://github.com/seanie12/fed-svd.
Open Datasets	Yes	Following FFA-LoRA [31], we use five datasets, including four from the GLUE benchmark [33]: Stanford Natural Language Inference [SNLI; 5], a sentence-pair classification task for textual entailment with three labels (entailment, neutral, contradiction), i.e., NLI task (or recognizing textual entailment); Multi-Genre Natural Language Inference [MNLI; 35], the same NLI task, evaluated on both matched (in-domain) and mismatched (cross-domain) test sets; Stanford Sentiment Treebank v2 [SST-2; 29], a single-sentence sentiment classification task with two labels (positive, negative); Quora Question Pairs [QQP; 26], a paraphrase detection task with two labels (duplicate, not duplicate); and Question Natural Language Inference [QNLI; 33], a binary classification task with two labels (entailment, not entailment) that determines whether a context sentence answers a given question.
Dataset Splits	Yes	We use the validation split for evaluation, as test splits are unavailable for all datasets except SNLI, which is evaluated on its test split. See Table 7 in Appendix C for the dataset statistics. Following Hsu et al. [15], we sample client data proportions from a Dirichlet distribution, with concentration parameter α = 0.5 (except in Fig. 4a) for non-i.i.d data. Unless stated otherwise (Fig. 4b), we use six clients in total (K = 6). To better emulate realistic federated settings, only half of the clients are randomly sampled for participation in each communication round (K1 = 3). See Table 8 in Appendix C for per-label distribution across six clients with α = 0.5.
Hardware Specification	Yes	We use 3 NVIDIA RTX A6000 GPUs for all experiments.
Software Dependencies	No	We use the Opacus library [37] to compute the noise multiplier σ for a total T = R × τ training steps.
Experiment Setup	Yes	We run R = 100 communication rounds, with participating clients in each round updating their weights using vanilla SGD for τ = 10 local steps. Due to the absence of separate validation splits (except for SNLI), we refrain from extensive hyperparameter tuning. Instead, we adopt values that work reasonably well for FedAvg: learning rate η = 0.5, clipping norm C = 2, and δ = 10^-5. The same hyperparameters are applied to all methods for a fair comparison. We consider two privacy budgets, ϵ ∈ {3, 6}, where we use the Opacus library [37] to compute the noise multiplier σ for a total T = R × τ training steps.
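
Illustrative sketch (not taken from the paper or its repository): the Dataset Splits and Experiment Setup rows describe a non-i.i.d. partition drawn from a Dirichlet distribution with α = 0.5 over K = 6 clients, with 3 clients sampled per round for R = 100 rounds of τ = 10 local steps. The Python below shows one common way such a partition is built, under the assumption that a separate Dirichlet draw is made per label to split that label's examples across clients. The function names `dirichlet`, `partition_by_label`, and `sample_participants` are hypothetical, and the authors' actual procedure may differ.

```python
import random
from collections import defaultdict


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_label(labels, num_clients=6, alpha=0.5, seed=0):
    """Split example indices across clients, one Dirichlet draw per label.

    Assumption: label-wise splitting; this is a common convention,
    not necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_label):
        idxs = by_label[y]
        rng.shuffle(idxs)
        props = dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += props[c]
            # The last client takes the remainder so no index is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            end = min(max(end, start), len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


def sample_participants(num_clients=6, per_round=3, rng=None):
    """Pick the clients that take part in one communication round."""
    rng = rng or random.Random(0)
    return sorted(rng.sample(range(num_clients), per_round))


# Values quoted in the table above.
ROUNDS, LOCAL_STEPS = 100, 10
TOTAL_STEPS = ROUNDS * LOCAL_STEPS  # T = R x tau

# Toy labels standing in for a three-class dataset such as SNLI.
toy_labels = [i % 3 for i in range(600)]
toy_clients = partition_by_label(toy_labels)
toy_round = sample_participants()
```

Smaller values of α concentrate each label on fewer clients, which is why α controls how far the split departs from i.i.d. The total step count T computed here is the quantity the table says is passed to the privacy accountant when deriving the noise multiplier σ.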