Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Federated In-Context Learning: Iterative Refinement for Improved Answer Quality

Authors: Ruhan Wang, Zhiyong Wang, Chengkai Huang, Rui Wang, Tong Yu, Lina Yao, John C.S. Lui, Dongruo Zhou

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs. ... Comprehensive Experimental Evaluation. We conduct extensive experiments across a diverse set of QA tasks to evaluate the effectiveness of Fed-ICL and Fed-ICL-Free.
Researcher Affiliation | Collaboration | 1 Indiana University, 2 The Chinese University of Hong Kong, 3 The University of New South Wales, 4 Adobe Research, 5 CSIRO's Data61.
Pseudocode | Yes | Algorithm 1 In-Context Federated Learning (Fed-ICL) ... Algorithm 2 Local Dataset Filtering ... Algorithm 3 Server Answer Aggregation
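The full algorithms appear only in the paper; the round structure suggested by the algorithm names and the quoted setup (clients answer with in-context examples, the server aggregates answers, the loop repeats for several interactive rounds) might be sketched as below. Every function name here is a hypothetical placeholder, not the paper's API, and majority voting is only one stand-in for whatever aggregation Algorithm 3 actually performs.

```python
from collections import Counter

def fed_icl_round_sketch(question, clients, server_aggregate, num_rounds=6):
    """Hypothetical round structure for federated in-context learning.

    Each round, every client answers the question conditioned on the shared
    context so far; the server aggregates the client answers and broadcasts
    the result, which becomes extra context for the next round.
    """
    shared_context = []
    answer = None
    for _ in range(num_rounds):
        # Each client is a callable: (question, shared_context) -> answer.
        local_answers = [client(question, shared_context) for client in clients]
        answer = server_aggregate(local_answers)
        shared_context.append(answer)
    return answer

def majority_vote(answers):
    """A simple stand-in for server-side answer aggregation."""
    return Counter(answers).most_common(1)[0][0]
```

In this sketch, only answers (not model weights or raw data) cross the client-server boundary, which is consistent with the paper's emphasis on low communication costs.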
Open Source Code | No | The paper does not provide a direct link to a source-code repository or an explicit statement about releasing the code for the methodology described.
Open Datasets | Yes | Benchmarks: We evaluated the performance of Fed-ICL using two widely recognized benchmarks, focusing on the tasks of Answer Generation and Question Answering. For the Answer Generation task, we followed the prevalent approach in prior studies by selecting the TruthfulQA benchmark (Deng et al., 2023; Lin et al., 2021). ... For the Question Answering task, we adopted the MMLU benchmark, a rigorous evaluation framework designed to measure knowledge retained during pretraining (Zheng et al., 2023; Hendrycks et al., 2020).
Dataset Splits | Yes | To construct the client dataset and systematically study the impact of data heterogeneity on Fed-ICL, we use a Dirichlet distribution to partition the data among clients. By varying the concentration parameter α of the Dirichlet distribution across three levels, we simulate different degrees of heterogeneity in the client data (Hsu et al., 2019). For the TruthfulQA benchmark, α values are set to [0.01, 0.5, 100], while for the MMLU benchmark, they are set to [0.001, 1.0, 100].
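The Dirichlet partitioning described in that quote is a standard construction (Hsu et al., 2019) and can be sketched as follows; the function name and the NumPy implementation are assumptions made here, not the paper's code. Small α concentrates each category's examples on few clients (high heterogeneity), while large α (e.g. 100) approaches a uniform split.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Partition example indices among clients via a Dirichlet prior.

    For each label category, draw client proportions from Dir(alpha, ..., alpha)
    and assign that category's shuffled examples accordingly.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for category in np.unique(labels):
        idx = rng.permutation(np.where(labels == category)[0])
        # Client shares for this category, drawn from the Dirichlet prior.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert shares into split points within this category's examples.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# Example: 120 examples over 4 categories, split across 3 clients.
labels = np.repeat(np.arange(4), 30)
parts = dirichlet_partition(labels, num_clients=3, alpha=0.5)
```

Every example lands on exactly one client; with α = 0.5 the per-client shares within each category are typically far from uniform, which is the heterogeneity the paper varies.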
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions models like 'paraphrase-MiniLM-L6-v2', 'Llama-2-7B-chat-hf', 'Llama-3.1-8B-Instruct', and 'GPT-4o-mini' but does not specify software dependencies with version numbers for libraries or frameworks like PyTorch or TensorFlow.
Experiment Setup | Yes | During answer generation, we set the temperature to 0.1, use five context examples, and conduct six interactive rounds between the client and the server. ... For LLM-Debate, to ensure a fair comparison with Fed-ICL, we use the same client model and the same number of clients as in the Fed-ICL setup. Additionally, the summarization model in LLM-Debate is identical to the client model. During answer generation, the generation temperature is set to 1.0. ... The training process consists of 50 communication rounds involving three clients, with data partitioned according to a Dirichlet distribution. Each client fine-tunes the model locally using a batch size of 16, a sequence length of 512, and one gradient accumulation step, with a learning rate of 2e-5.
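For reference, the hyperparameters reported in that quote can be gathered into a single configuration sketch. The dictionary layout and key names are illustrative choices made here; only the values come from the quoted setup.

```python
# Values are from the reported experiment setup; key names are illustrative.
FED_ICL_GENERATION = {
    "temperature": 0.1,       # answer-generation temperature
    "context_examples": 5,    # in-context examples per prompt
    "interactive_rounds": 6,  # client-server interaction rounds
}

LLM_DEBATE_BASELINE = {
    "temperature": 1.0,       # generation temperature for LLM-Debate
    # Same client model, client count, and summarization model as Fed-ICL.
}

FINETUNING_BASELINE = {
    "communication_rounds": 50,
    "num_clients": 3,
    "partition": "dirichlet",
    "batch_size": 16,
    "sequence_length": 512,
    "grad_accumulation_steps": 1,
    "learning_rate": 2e-5,
}
```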