Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FedRAM: Federated Reweighting and Aggregation for Multi-Task Learning

Authors: Fan Wu, Xinyu Yan, Jiabei Liu, Wei Yang Bryan Lim

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on six real-world FL-MTL benchmarks show that FedRAM improves performance by at least 3% over the most baseline on both in-domain and out-of-domain tasks, while reducing computational cost by 15 . These results make FedRAM a robust and practical solution for large-scale FL-MTL applications. The code is available at https://github.com/wwffvv/FedRAM. [...] 5 Experiment [...] 5.2 Results and Discussion
Researcher Affiliation	Academia	1Nanyang Technological University {fan009, xinyu020, EMAIL} EMAIL
Pseudocode	Yes	Algorithm 1: FedRAM
Open Source Code	Yes	The code is available at https://github.com/wwffvv/FedRAM.
Open Datasets	Yes	Datasets. Following the work in [17] and [7], we conduct experiments based on four diverse categories of NLP datasets, each corresponding to specific task types, including: (1) Question Answering: QASC [20], WikiQA [21], and QuaRTz [22]; (2) Paraphrase Identification: PAWS [23]; (3) Coreference Resolution: Winogrande [24] and WSC [25]; (4) Sentence Completion: Story Cloze [26].
Dataset Splits	No	The paper discusses evaluation on 'global and local held-out validation data' and 'In-Domain (ID) and Out-of-Domain (OOD) evaluation strategies' based on tasks distributed among clients. However, it does not provide specific percentages or counts for standard train/test/validation splits within the datasets used, nor does it refer to predefined splits with citations for these specific datasets.
Hardware Specification	Yes	Our simulations are conducted on a cloud instance, equipped with 8 NVIDIA A10 GPUs (24 GiB of memory per GPU), 128 vCPUs (Intel Xeon Platinum 8369B), and 512 GB RAM.
Software Dependencies	No	The paper mentions using a "global LoRA configuration", "T5-small model", "T5-base model", "cross-entropy loss", and an "Adam optimizer", but it does not specify version numbers for any of these software components or libraries.
Experiment Setup	Yes	In the experiments, we employ a global LoRA configuration to fine-tune the parameters. We adopt T5-small model as θ and T5-base model as θ+. We assume equal values for exponential scaling factors (ητ and ηk) and smoothing parameter s. Our simulations are conducted on a cloud instance, equipped with 8 NVIDIA A10 GPUs (24 GiB of memory per GPU), 128 vCPUs (Intel Xeon Platinum 8369B), and 512 GB RAM. For the three-stage training, we employ cross-entropy loss with an Adam optimizer, setting the learning rate β to 1 × 10⁻³. We set the maximum global rounds to 50. For simplicity, we assume that all clients can participate in every communication round. More details about our experiment implementation and baselines can be found in Table 4 in Appendix B.
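
For readers who want the quoted setup in machine-readable form, the values in the Experiment Setup row can be collected into a small record. This is only a sketch and is not taken from the FedRAM repository: the class name `ReportedSetup` and every field name are hypothetical, and the learning rate is entered as 1e-3 on the assumption that the extracted value denotes 1 × 10⁻³. Values the table marks as unreported (library versions, dataset split sizes) are deliberately left out rather than guessed.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Settings quoted in the Experiment Setup row (field names are illustrative)."""

    base_model: str = "T5-small"        # quoted as theta
    reference_model: str = "T5-base"    # quoted as theta+
    adapter: str = "LoRA"               # "global LoRA configuration"
    loss: str = "cross-entropy"
    optimizer: str = "Adam"
    learning_rate: float = 1e-3         # beta; assumed reading of the extracted value
    max_global_rounds: int = 50
    client_participation: float = 1.0   # all clients join every round
    num_gpus: int = 8                   # NVIDIA A10, 24 GiB each
    num_vcpus: int = 128                # Intel Xeon Platinum 8369B
    ram_gb: int = 512


# Items the table reports as missing; listed so they are not silently defaulted.
NOT_REPORTED = ("library versions", "dataset split sizes")

setup = ReportedSetup()

if __name__ == "__main__":
    for name, value in asdict(setup).items():
        print(f"{name}: {value}")
    print("not reported:", ", ".join(NOT_REPORTED))
```

Keeping the unreported items in a separate tuple makes the gaps explicit to anyone attempting a rerun; the remaining hyperparameters are, per the quoted text, in Table 4 of Appendix B of the paper.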