Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Federated RLHF with Aggregated Client Preference for LLMs

Authors: Feijie Wu, Xiaoze Liu, Haoyu Wang, Xingchen Wang, Lu Su, Jing Gao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content. [...] We conduct extensive experiments to evaluate the performance of the proposed FedBis and FedBiscuit.
Researcher Affiliation Academia 1Purdue University 2State University of New York at Albany
Pseudocode Yes Algorithm 1 FedBiscuit Input: local learning rate ηl, global learning rate ηs, local updates K, warm-up rounds T for each binary selector, total communication rounds R, client regrouping interval τ, pretrained LLM ϕ. Require: OPTIM(m, ϕ, K) fine-tunes model ϕ with the data of a client m ∈ [M] for K iterations and returns an optimized model. Require: CG(ϕ[U]) assigns each client m ∈ [M] to train one of the models ϕ[U] and returns a list {U_m}_{m∈[M]} indicating that a client m should train the model ϕ_{U_m}.
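The quoted pseudocode can be illustrated with a minimal toy sketch of the selector-training loop: per-round client sampling, local fine-tuning via OPTIM, periodic client regrouping via CG, and per-selector server averaging. All function names and the float-vector "models" standing in for LLM adapters are illustrative assumptions, not the paper's implementation.

```python
import random

def optim(client_data, model, k, lr=0.1):
    """Toy stand-in for OPTIM(m, phi, K): K local steps pulling the
    model toward the client's data (here, simple targets in [0, 1])."""
    model = list(model)
    for _ in range(k):
        for i, target in enumerate(client_data):
            model[i] += lr * (target - model[i])
    return model

def client_grouping(num_clients, num_selectors, rng):
    """Toy stand-in for CG(phi[U]): assign each client m to one of
    the U binary selectors, returning the list {U_m}."""
    return [rng.randrange(num_selectors) for _ in range(num_clients)]

def fed_biscuit(client_data, num_selectors=3, rounds=20, k=5,
                sample_size=5, regroup_every=10, seed=0):
    rng = random.Random(seed)
    dim = len(client_data[0])
    selectors = [[0.0] * dim for _ in range(num_selectors)]
    assignment = client_grouping(len(client_data), num_selectors, rng)
    for r in range(rounds):
        # Regroup clients every tau rounds (regrouping interval).
        if r > 0 and r % regroup_every == 0:
            assignment = client_grouping(len(client_data), num_selectors, rng)
        sampled = rng.sample(range(len(client_data)), sample_size)
        updates = {u: [] for u in range(num_selectors)}
        for m in sampled:
            u = assignment[m]
            updates[u].append(optim(client_data[m], selectors[u], k))
        # Server-side coordinate-wise averaging, per selector.
        for u, local_models in updates.items():
            if local_models:
                selectors[u] = [sum(vals) / len(vals)
                                for vals in zip(*local_models)]
    return selectors
```

The multi-selector structure mirrors the algorithm's distinction between FedBis (a single binary selector) and FedBiscuit (an ensemble of U selectors with periodic regrouping); setting `num_selectors=1` recovers the single-selector case.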
Open Source Code Yes Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content. https://github.com/HarliWu/FedBiscuit
Open Datasets Yes In this section, we describe the preparation of federated human preference datasets, while the next section presents the experimental setup and quantitative analysis. We explore two open-ended text generation tasks, i.e., summarization and question-answering, based on publicly available datasets. Summarization. Stiennon et al. (2020) introduces a summarization dataset that consists of Reddit posts with human-written TL;DR (Völske et al., 2017). Question-Answering (QA). We reconstruct the public dataset SHP, which comprises numerous questions from Reddit posts and their corresponding user answers.
Dataset Splits Yes Summarization. ... 60% of data are reserved for supervised fine-tuning (SFT). The remaining 40% are used for the RLHF process to improve LLM performance and generate human-preferred content. ... We use a test dataset consisting of 6,553 samples, all sourced from the TL;DR dataset and excluded from the training data. Question-Answering (QA). ... we partition the dataset using a Dirichlet distribution with a parameter of 0.3, ensuring that no questions overlap between clients. In our experiment, we consider training the binary selector with 200 clients, which is a common setting when evaluating the performance of an FL algorithm (Jhunjhunwala et al., 2023). ... For the RLHF process, we incorporate 2.6K Reddit questions and 44.6K Safe RLHF prompts.
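The quoted split procedure (a Dirichlet partition with parameter 0.3 over 200 clients, with no question overlap between clients) can be sketched as follows. The function name and the contiguous-slice construction are assumptions; only the Dirichlet parameters and the disjointness requirement come from the quoted text.

```python
import numpy as np

def dirichlet_partition(item_ids, num_clients=200, alpha=0.3, seed=0):
    """Split item_ids into disjoint per-client subsets whose sizes
    follow Dir(alpha) proportions, so no item is shared by two clients."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(item_ids)
    proportions = rng.dirichlet([alpha] * num_clients)
    counts = np.floor(proportions * len(ids)).astype(int)
    counts[-1] = len(ids) - counts[:-1].sum()  # absorb rounding remainder
    splits = np.split(ids, np.cumsum(counts)[:-1])
    return [s.tolist() for s in splits]
```

With a small alpha such as 0.3 the Dirichlet draw is highly skewed, so client dataset sizes are strongly heterogeneous, which is the usual way to simulate non-IID federated data.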
Hardware Specification Yes The experiments are conducted on machines with one Nvidia A100 GPU card, Intel Xeon Platinum 8369B CPUs, and 256GB RAM.
Software Dependencies No The paper mentions 'Our implementation is built upon FederatedScope (Xie et al., 2023; Kuang et al., 2023).', 'AdamW (Loshchilov & Hutter, 2017)', and 'fine-tune all models using LoRA'. However, specific version numbers for FederatedScope, any other libraries, or programming languages are not provided.
Experiment Setup Yes In our experiments, we train the binary selector for 500 communication rounds. In each round, we sample 5 clients for the summarization task and 10 for the QA task, and the selected clients fine-tune the binary selector locally for 30 iterations. As for FedBiscuit, the warm-up phase takes 50 communication rounds for each adapter, which is counted as part of the 500 communication rounds. After the training of binary selectors, we fine-tune the LLM for three epochs, and we store the checkpoint when finishing one epoch of training. ... Table 3: Hyperparameter Settings for the Summarization Task ... Table 4: Hyperparameter Settings for the QA Task
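The quoted schedule can be made concrete with a small sketch: 500 selector rounds in which a 50-round warm-up per adapter is counted, followed by three LLM fine-tuning epochs with a checkpoint after each. The round and epoch counts come from the quoted setup; the number of adapters and the phase layout are illustrative assumptions.

```python
def build_schedule(total_rounds=500, num_adapters=3, warmup_per_adapter=50,
                   clients_per_round=5, rlhf_epochs=3):
    """Lay out (phase, adapter, num_rounds, clients_per_round) tuples plus
    the checkpoint names saved after each LLM fine-tuning epoch."""
    warmup = num_adapters * warmup_per_adapter
    # The warm-up is counted as part of the total communication rounds.
    assert warmup <= total_rounds, "warm-up must fit within total rounds"
    phases = [("warmup", a, warmup_per_adapter, clients_per_round)
              for a in range(num_adapters)]
    phases.append(("joint", None, total_rounds - warmup, clients_per_round))
    checkpoints = [f"epoch-{e + 1}" for e in range(rlhf_epochs)]
    return phases, checkpoints
```

For the QA task the same schedule applies with `clients_per_round=10`, per the quoted setup.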