Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

FedSum: Data-Efficient Federated Learning Under Data Scarcity Scenario for Text Summarization

Authors: Zhiyong Ma, Zhengping Li, Yuanjie Shi, Jian Chen

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four benchmark datasets verify the promising improvement of FedSum compared to baselines, and show its generalizability, scalability, and robustness. Baselines and Measurement: We investigate the milestone model, BERTSUM, in FL experiments. We summarize the main experimental results in Tables 1 to 2.
Researcher Affiliation | Academia | Zhiyong Ma1*, Zhengping Li1*, Yuanjie Shi2, Jian Chen1; 1South China University of Technology, 2Washington State University; EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | As concluded in Alg. 1 and Alg. 2. Algorithm 1: FedSum Server. Algorithm 2: FedSum Client (i-th client).
Open Source Code | Yes | Code: https://github.com/Li-Evan/FedSum
Open Datasets | Yes | Extensive experiments on four benchmark datasets verify the promising improvement of FedSum compared to baselines, and show its generalizability, scalability, and robustness. Datasets and Distributions: We built different test beds on common benchmark datasets, such as CNN/Daily Mail (Nallapati, Zhai, and Zhou 2017), WikiHow (Koupaee and Wang 2018), Reddit (Kim, Kim, and Kim 2019), and PubMed (Cohan et al. 2018).
Dataset Splits | No | To simulate the data scarcity scenarios, only 2K training samples can be accessed by the FL system. To simulate the non-IID setting, we construct the quantity skew following (Li et al. 2022; Cai et al. 2023a) and control the heterogeneity levels through the Dirichlet concentration parameter (Dir.).
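The quantity-skew construction quoted above (a Dirichlet-controlled split of the 2K training samples across clients) can be sketched as follows. This is an illustrative, stdlib-only sketch; the function name and parameters are placeholders, and the exact procedure of Li et al. 2022 / Cai et al. 2023a may differ in detail.

```python
import random

def quantity_skew_partition(num_samples, num_clients, alpha, seed=0):
    """Split sample indices across clients with a quantity skew drawn from
    a Dirichlet(alpha) distribution.  Smaller alpha -> more heterogeneous
    client sizes (stronger non-IID); larger alpha -> near-uniform sizes."""
    rng = random.Random(seed)
    # Sample Dirichlet(alpha) via normalized Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    # Convert proportions to integer counts that sum exactly to num_samples.
    counts = [int(p * num_samples) for p in proportions]
    counts[0] += num_samples - sum(counts)  # assign the rounding remainder
    # Shuffle indices, then slice them into per-client shards.
    indices = list(range(num_samples))
    rng.shuffle(indices)
    splits, start = [], 0
    for c in counts:
        splits.append(indices[start:start + c])
        start += c
    return splits

# e.g. the paper's scarcity setting: 2K training samples over 10 clients
clients = quantity_skew_partition(2000, 10, alpha=0.5)
```

With very small `alpha` some clients can receive few or zero samples, which is exactly the heterogeneity the "Dir." parameter is meant to control.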
Hardware Specification | No | The paper mentions that "each FL client trains a common and basic summarization model with over 10M parameters with less than 500 samples" and discusses "edge computing", but does not specify any particular hardware (GPU/CPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions models like BERTSUM and BERT but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Effect of λ. The hyperparameter λ controls how tolerant the Data Partition method is to leading bias in the data. A smaller λ leads to a smaller Q(e,t)(i,j) and thus looser restrictions. We adjust λ in the heterogeneous and uniform scenarios on CNNDM, as shown in Fig. 5. As λ increases from 0.3 to 0.6, the restrictions on leading bias become stronger and ROUGE scores improve. When λ exceeds 0.6, the restrictions on leading bias in the dataset become too strict and cause degradation. The above results demonstrate λ's efficacy in mitigating the negative effect of leading bias.
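The Pseudocode row above cites the paper's Algorithm 1 (FedSum Server) and Algorithm 2 (FedSum Client), which are not reproduced in this report. For readers unfamiliar with the server/client split such algorithms follow, here is a generic FedAvg-style skeleton on a toy one-parameter linear model. All names and the toy squared-error objective are stand-ins: FedSum's actual client objective (a BERTSUM summarization loss) and its aggregation rule are defined in the paper, not here.

```python
import random

def client_update(weights, data, lr=0.1, epochs=1):
    """Local training on one client (toy 1-D model y ~ w*x).
    Plays the role of a generic Alg.-2-style client step."""
    w = weights
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def server_round(global_w, client_datasets):
    """One FedAvg-style round: broadcast the global model, run local
    training on each client, then average updates weighted by data size.
    Plays the role of a generic Alg.-1-style server step."""
    updates, sizes = [], []
    for data in client_datasets:
        updates.append(client_update(global_w, data))
        sizes.append(len(data))
    total = sum(sizes)
    return sum(u * n for u, n in zip(updates, sizes)) / total

# Toy run: 4 clients, each holding noisy samples of y = 3x.
random.seed(0)
datasets = [[(x, 3 * x + random.gauss(0, 0.01)) for x in (0.1, 0.5, 1.0)]
            for _ in range(4)]
w = 0.0
for _ in range(50):
    w = server_round(w, datasets)
# w now sits close to the true slope 3
```

The weighted average over client data sizes is the standard FedAvg aggregation; it is only a baseline reference point for the server/client structure, not a claim about FedSum's specific mechanism.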