Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix

Authors: Ming Wen, Jiaqi Zhu, Yuedong Xu, Yipeng Zhou, Dingding Han

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To rigorously evaluate our proposed method, we conduct a comprehensive set of experiments across diverse tasks and models. For natural language processing and mathematical reasoning, we utilize two large-scale Llama-2 models: the 7B and 13B versions [35]. The Llama-2-7B model is fine-tuned on the dolly-15K dataset [44] and assessed on general language understanding benchmarks, including MMLU [17], DROP [7], and HumanEval [5]. Concurrently, the Llama-2-13B model undergoes further fine-tuning with Chain-of-Thought (CoT) prompting [39] on the MetaMathQA dataset [42], with its mathematical reasoning capabilities evaluated using the GSM8K [6], GSM8K-hard, and MATH [18] benchmarks. To demonstrate the versatility of our approach, FedASK, beyond NLP and standard Supervised Fine-Tuning (SFT), we also conduct experiments with Vision Language Models and Reinforcement Learning from Human Feedback (RLHF). In this setting, we perform Direct Preference Optimization (DPO) on Llava-1.5-7b using the SPA-VL safety preference alignment dataset [47]. The evaluation for this task involves MM-SafetyBench [27] to measure resilience to jailbreak attacks via an Attack Success Rate; SIUO [36] to assess safety in cross-modal reasoning; and BeaverTails-V [21] to provide separate win-rates for harmlessness and helpfulness. Across all experiments, conditions are systematically varied to encompass different privacy budget levels (Section 5.1), degrees of data heterogeneity (Section 5.2), and system robustness (Section 5.4).
Researcher Affiliation	Academia	Ming Wen Fudan University Shanghai Innovation Institute EMAIL Jiaqi Zhu Fudan University EMAIL Yuedong Xu Fudan University Shenzhen Loop Area Institute EMAIL Yipeng Zhou Macquarie University EMAIL Dingding Han Fudan University EMAIL
Pseudocode	Yes	Algorithm 1 FedASK
Open Source Code	Yes	Codes are available at https://github.com/FLEECERmw/PrivacyFedLLM.
Open Datasets	Yes	The Llama-2-7B model is fine-tuned on the dolly-15K dataset [44] and assessed on general language understanding benchmarks, including MMLU [17], DROP [7], and HumanEval [5]. Concurrently, the Llama-2-13B model undergoes further fine-tuning with Chain-of-Thought (CoT) prompting [39] on the MetaMathQA dataset [42], with its mathematical reasoning capabilities evaluated using the GSM8K [6], GSM8K-hard, and MATH [18] benchmarks. ... In this setting, we perform Direct Preference Optimization (DPO) on Llava-1.5-7b using the SPA-VL safety preference alignment dataset [47]. The evaluation for this task involves MM-SafetyBench [27] to measure resilience to jailbreak attacks via an Attack Success Rate; SIUO [36] to assess safety in cross-modal reasoning; and BeaverTails-V [21] to provide separate win-rates for harmlessness and helpfulness.
Dataset Splits	Yes	To evaluate data heterogeneity's impact, we experiment with IID and three non-IID scenarios, using 10 clients with 2 selected per round. In IID settings, data is randomly partitioned. For non-IID settings, we use Dirichlet distribution Dir(α) with α ∈ {0.1, 0.5, 1.0}, following prior work [13]. ... The FL setup involves 10 clients (IID), 2 selected per round for 600 rounds, and 10 local steps. ... The federated learning environment consists of 10 clients with an IID data distribution, where 2 clients are selected per round for a total of 600 communication rounds, with each client performing 10 local update steps.
Hardware Specification	Yes	All evaluations are performed on NVIDIA Tesla A100 GPUs, utilizing half-precision to maximize computational efficiency. ... The experiment fine-tunes a Llama-2-7B model, with client operations on an NVIDIA H100 GPU and server aggregation on a CPU, simulating a 1Gbps network across a varying number of clients (Kt).
Software Dependencies	No	The analysis uses the Llama 2-7B and Llama 2-13B models, with LoRA applied to the q_proj, v_proj, and k_proj modules. Our measurements do not incorporate system-level optimizations like distributed parameter servers or communication-computation overlap. ... Peak GPU memory consumption (in MB) was meticulously monitored on the client side using the torch.cuda.max_memory_allocated(device=torch.device('cuda')) PyTorch function, capturing the maximum memory footprint during local training operations with differential privacy mechanisms enabled.
Experiment Setup	Yes	Experiments with Llama-2-7B on homogeneous data use standardized settings (e.g., B = 8, 10 local steps, 400 rounds) and common LoRA configurations (r = 64, α = 128), detailed in the appendix. We perform a grid search for learning rates and explore DP budgets ϵ ∈ {1, 3, 6}, following [34]. Results are in Table 2. For Llama-2-13B, we adjust settings to B = 6, 800 total rounds and r = 128, with other hyperparameters unchanged. Results are in Table 3. ... The FL setup involves 10 clients (IID), 2 selected per round for 600 rounds, and 10 local steps. We apply LoRA to q_proj, v_proj, and mm_projector modules, freezing the vision_tower. ... Key hyperparameters include a LoRA rank of 256 applied to the q_proj, v_proj, and mm_projector modules while freezing the vision_tower, a batch size of 2, and a learning rate of 1e-5. We evaluate performance under Non-Private and two differential privacy budgets, ϵ ∈ {, 6}.
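
The Dataset Splits row describes Dir(α) partitioning across 10 clients with 2 clients sampled per round. The following is a minimal sketch of one common way to do this, not the authors' implementation. It assumes each example carries a categorical label (for instance a task category); the source does not say which attribute is used for the split. The function names partition_dirichlet, dirichlet_sample, and select_clients are hypothetical, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    # Independent Gamma(alpha, 1) draws, once normalized, are Dirichlet distributed.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow for very small alpha:
        # fall back to assigning all mass to one random category.
        one_hot = [0.0] * k
        one_hot[rng.randrange(k)] = 1.0
        return one_hot
    return [d / total for d in draws]


def partition_dirichlet(labels, num_clients=10, alpha=0.5, seed=0):
    """Split sample indices across clients; smaller alpha gives stronger label skew."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        shares = dirichlet_sample(alpha, num_clients, rng)
        # Convert the per-client shares into cut points over this label's indices.
        start = 0
        cumulative = 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            if client_id == num_clients - 1:
                end = len(indices)  # the last client takes whatever remains
            else:
                end = round(cumulative * len(indices))
            end = max(start, min(end, len(indices)))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


def select_clients(num_clients=10, per_round=2, round_seed=0):
    """Sample the participating clients for one round, without replacement."""
    return random.Random(round_seed).sample(range(num_clients), per_round)
```

With α = 0.1 most of a label's examples tend to land on a few clients, while α = 1.0 spreads them more evenly; the IID setting in the table corresponds to a plain random shuffle instead of this label-wise split.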