Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FedRAM: Federated Reweighting and Aggregation for Multi-Task Learning

Authors: Fan Wu, Xinyu Yan, Jiabei Liu, Wei Yang Bryan Lim

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on six real-world FL-MTL benchmarks show that FedRAM improves performance by at least 3% over the most baseline on both in-domain and out-of-domain tasks, while reducing computational cost by 15 . These results make FedRAM a robust and practical solution for large-scale FL-MTL applications. The code is available at https://github.com/wwffvv/FedRAM. [...] 5 Experiment [...] 5.2 Results and Discussion
Researcher Affiliation | Academia | 1Nanyang Technological University {fan009, xinyu020, EMAIL} EMAIL
Pseudocode | Yes | Algorithm 1: FedRAM
Open Source Code | Yes | The code is available at https://github.com/wwffvv/FedRAM.
Open Datasets | Yes | Datasets. Following the work in [17] and [7], we conduct experiments based on four diverse categories of NLP datasets, each corresponding to specific task types, including: (1) Question Answering: QASC [20], WikiQA [21], and QuaRTz [22]; (2) Paraphrase Identification: PAWS [23]; (3) Coreference Resolution: Winogrande [24] and WSC [25]; (4) Sentence Completion: Story Cloze [26].
Dataset Splits | No | The paper discusses evaluation on 'global and local held-out validation data' and 'In-Domain (ID) and Out-of-Domain (OOD) evaluation strategies' based on tasks distributed among clients. However, it does not provide specific percentages or counts for standard train/test/validation splits within the datasets used, nor does it refer to predefined splits with citations for these specific datasets.
Hardware Specification | Yes | Our simulations are conducted on a cloud instance, equipped with 8 NVIDIA A10 GPUs (24 GiB of memory per GPU), 128 vCPUs (Intel Xeon Platinum 8369B), and 512 GB RAM.
Software Dependencies | No | The paper mentions using a "global LoRA configuration", "T5-small model", "T5-base model", "cross-entropy loss", and an "Adam optimizer", but it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | In the experiments, we employ a global LoRA configuration to fine-tune the parameters. We adopt T5-small model as θ and T5-base model as θ+. We assume equal values for exponential scaling factors (ητ and ηk) and smoothing parameter s. Our simulations are conducted on a cloud instance, equipped with 8 NVIDIA A10 GPUs (24 GiB of memory per GPU), 128 vCPUs (Intel Xeon Platinum 8369B), and 512 GB RAM. For the three-stage training, we employ cross-entropy loss with an Adam optimizer, setting the learning rate β to 1 × 10^-3. We set the maximum global rounds to 50. For simplicity, we assume that all clients can participate in every communication round. More details about our experiment implementation and baselines can be found in Table 4 in Appendix B.
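
For readers who want to record the quoted Experiment Setup values in a script, the following is a minimal sketch and not the authors' code. The class name, field names, the `participating_clients` helper, and the client count are all hypothetical; only the model choices, optimizer, loss, learning rate, round limit, and full-participation assumption come from the quoted text. LoRA settings and the scaling and smoothing parameters are left out because the excerpt does not give their values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    theta_model: str = "t5-small"        # model used as θ
    theta_plus_model: str = "t5-base"    # model used as θ+
    optimizer: str = "adam"
    loss: str = "cross_entropy"
    learning_rate: float = 1e-3          # β in the quoted text
    max_global_rounds: int = 50
    client_participation: float = 1.0    # all clients join every round


def participating_clients(client_ids, config):
    """Return the clients active in one round (hypothetical helper).

    With client_participation == 1.0 this is simply every client.
    """
    count = round(len(client_ids) * config.client_participation)
    return list(client_ids)[:count]


config = ExperimentConfig()
# The number of clients is an assumption for illustration only.
clients = [f"client_{i}" for i in range(4)]
active = participating_clients(clients, config)
```

Keeping the configuration frozen makes it harder to change a reported hyperparameter by accident partway through a run.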