Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

FedSum: Data-Efficient Federated Learning Under Data Scarcity Scenario for Text Summarization

Authors: Zhiyong Ma, Zhengping Li, Yuanjie Shi, Jian Chen

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on four benchmark datasets verify the promising improvement of FedSum compared to baselines, and show its generalizability, scalability, and robustness. Baselines and Measurement: We investigate the milestone model, BERTSUM, in FL experiments. We summarize the main experimental results in Tables 1 to 2.
Researcher Affiliation | Academia | Zhiyong Ma1*, Zhengping Li1*, Yuanjie Shi2, Jian Chen1; 1South China University of Technology, 2Washington State University; EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | As concluded in Alg. 1 and Alg. 2. Algorithm 1: FedSum Server. Algorithm 2: FedSum Client (i-th client).
Open Source Code | Yes | Code: https://github.com/Li-Evan/FedSum
Open Datasets | Yes | Extensive experiments on four benchmark datasets verify the promising improvement of FedSum compared to baselines, and show its generalizability, scalability, and robustness. Datasets and Distributions: We built different test beds on common benchmark datasets, such as CNN/Daily Mail (Nallapati, Zhai, and Zhou 2017), WikiHow (Koupaee and Wang 2018), Reddit (Kim, Kim, and Kim 2019), and PubMed (Cohan et al. 2018).
Dataset Splits | No | To simulate the data scarcity scenarios, only 2K training samples can be accessed by the FL system. To simulate the non-IID setting, we construct the quantity skew following (Li et al. 2022; Cai et al. 2023a) and control the heterogeneity levels through the Dirichlet concentration parameter (Dir.).
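The quantity-skew construction quoted above (a Dirichlet-controlled split of the 2K training samples across clients) can be sketched as follows. This is an illustrative, stdlib-only sketch; the function name and parameters are placeholders, and the exact procedure of Li et al. 2022 / Cai et al. 2023a may differ in detail.

```python
import random

def quantity_skew_partition(num_samples, num_clients, alpha, seed=0):
    """Split sample indices across clients with a quantity skew drawn from
    a Dirichlet(alpha) distribution.  Smaller alpha -> more heterogeneous
    client sizes (stronger non-IID); larger alpha -> near-uniform sizes."""
    rng = random.Random(seed)
    # Sample Dirichlet(alpha) via normalized Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    # Convert proportions to integer counts that sum exactly to num_samples.
    counts = [int(p * num_samples) for p in proportions]
    counts[0] += num_samples - sum(counts)  # assign the rounding remainder
    # Shuffle indices, then slice them into per-client shards.
    indices = list(range(num_samples))
    rng.shuffle(indices)
    splits, start = [], 0
    for c in counts:
        splits.append(indices[start:start + c])
        start += c
    return splits

# e.g. the paper's scarcity setting: 2K training samples over 10 clients
clients = quantity_skew_partition(2000, 10, alpha=0.5)
```

With very small `alpha` some clients can receive few or zero samples, which is exactly the heterogeneity the "Dir." parameter is meant to control.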
Hardware Specification | No | The paper mentions that "each FL client trains a common and basic summarization model with over 10M parameters with less than 500 samples" and discusses "edge computing", but does not specify any particular hardware (GPU/CPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions models like BERTSUM and BERT but does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Effect of λ. The hyperparameter λ controls how tolerant the Data Partition method is to leading bias in the data. A smaller λ leads to a smaller Q(e,t)(i,j) and thus looser restrictions. We adjust λ in the heterogeneous and uniform scenarios on CNNDM, as shown in Fig. 5. As λ increases from 0.3 to 0.6, the restrictions on leading bias become stronger and ROUGE scores improve. When λ exceeds 0.6, the restrictions on leading bias in the dataset become too strict and cause degradation. The above results demonstrate λ's efficacy in mitigating the negative effect of leading bias.
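The Pseudocode row above cites the paper's Algorithm 1 (FedSum Server) and Algorithm 2 (FedSum Client), which are not reproduced in this report. For readers unfamiliar with the server/client split such algorithms follow, here is a generic FedAvg-style skeleton on a toy one-parameter linear model. All names and the toy squared-error objective are stand-ins: FedSum's actual client objective (a BERTSUM summarization loss) and its aggregation rule are defined in the paper, not here.

```python
import random

def client_update(weights, data, lr=0.1, epochs=1):
    """Local training on one client (toy 1-D model y ~ w*x).
    Plays the role of a generic Alg.-2-style client step."""
    w = weights
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def server_round(global_w, client_datasets):
    """One FedAvg-style round: broadcast the global model, run local
    training on each client, then average updates weighted by data size.
    Plays the role of a generic Alg.-1-style server step."""
    updates, sizes = [], []
    for data in client_datasets:
        updates.append(client_update(global_w, data))
        sizes.append(len(data))
    total = sum(sizes)
    return sum(u * n for u, n in zip(updates, sizes)) / total

# Toy run: 4 clients, each holding noisy samples of y = 3x.
random.seed(0)
datasets = [[(x, 3 * x + random.gauss(0, 0.01)) for x in (0.1, 0.5, 1.0)]
            for _ in range(4)]
w = 0.0
for _ in range(50):
    w = server_round(w, datasets)
# w now sits close to the true slope 3
```

The weighted average over client data sizes is the standard FedAvg aggregation; it is only a baseline reference point for the server/client structure, not a claim about FedSum's specific mechanism.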