Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Federated RLHF with Aggregated Client Preference for LLMs
Authors: Feijie Wu, Xiaoze Liu, Haoyu Wang, Xingchen Wang, Lu Su, Jing Gao
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content. [...] We conduct extensive experiments to evaluate the performance of the proposed FedBis and FedBiscuit. |
| Researcher Affiliation | Academia | 1Purdue University 2State University of New York at Albany EMAIL EMAIL |
| Pseudocode | Yes | Algorithm 1 FedBiscuit. Input: local learning rate ηl, global learning rate ηs, local updates K, warm-up rounds T for each binary selector, total communication rounds R, client regrouping interval τ, pretrained LLM ϕ. Require: OPTIM(m, ϕ, K) fine-tunes model ϕ with the data of a client m ∈ [M] for K iterations and returns an optimized model. Require: CG(ϕ^[U]) assigns each client m ∈ [M] to train one of the models ϕ^[U] and returns a list {U_m}_{m ∈ [M]} indicating that a client m should train the model ϕ^{U_m}. |
| Open Source Code | Yes | Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content. https://github.com/HarliWu/FedBiscuit |
| Open Datasets | Yes | In this section, we describe the preparation of federated human preference datasets, while the next section presents the experimental setup and quantitative analysis. We explore two open-ended text generation tasks, i.e., summarization and question-answering, based on publicly available datasets. Summarization. Stiennon et al. (2020) introduces a summarization dataset that consists of Reddit posts with human-written TL;DR (Völske et al., 2017). Question-Answering (QA). We reconstruct the public dataset SHP, which comprises numerous questions from Reddit posts and their corresponding user answers. |
| Dataset Splits | Yes | Summarization. ... 60% of data are reserved for supervised fine-tuning (SFT). The remaining 40% are used for the RLHF process to improve LLM performance and generate human-preferred content. ... We use a test dataset consisting of 6,553 samples, all sourced from the TL;DR dataset and excluded from the training data. Question-Answering (QA). ... we partition the dataset using a Dirichlet distribution with a parameter of 0.3, ensuring that no questions overlap between clients. In our experiment, we consider training the binary selector with 200 clients, which is a common setting when evaluating the performance of an FL algorithm (Jhunjhunwala et al., 2023). ... For the RLHF process, we incorporate 2.6K Reddit questions and 44.6K Safe RLHF prompts. |
| Hardware Specification | Yes | The experiments are conducted on machines with one Nvidia A100 GPU card, Intel Xeon Platinum 8369B CPUs, and 256GB RAM. |
| Software Dependencies | No | The paper mentions 'Our implementation is built upon FederatedScope (Xie et al., 2023; Kuang et al., 2023).', 'AdamW (Loshchilov & Hutter, 2017)', and 'fine-tune all models using LoRA'. However, specific version numbers for FederatedScope, any other libraries, or programming languages are not provided. |
| Experiment Setup | Yes | In our experiments, we train the binary selector for 500 communication rounds. In each round, we sample 5 clients for the summarization task and 10 for the QA task, and the selected clients fine-tune the binary selector locally for 30 iterations. As for FedBiscuit, the warm-up phase takes 50 communication rounds for each adapter, which is counted as part of the 500 communication rounds. After the training of binary selectors, we fine-tune the LLM for three epochs, and we store a checkpoint at the end of each epoch. ... Table 3: Hyperparameter Settings for the Summarization Task ... Table 4: Hyperparameter Settings for the QA Task |
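The Algorithm 1 excerpt above describes FedBiscuit's round structure: each of U binary-selector adapters is trained by a subset of clients (assigned by CG), updated locally via OPTIM, and aggregated server-side. A minimal sketch of that loop, with toy stand-ins for the models and with `optim` and `cluster_grouping` as simplified placeholders for the paper's OPTIM and CG routines (all names here are illustrative, not from the authors' code):

```python
import random

def optim(client, model, k):
    """Placeholder for OPTIM(m, phi, K): local fine-tuning for K iterations."""
    return [w + 0.001 * k for w in model]  # toy local update

def cluster_grouping(models, clients, rng):
    """Placeholder for CG(phi^[U]): assign each client one selector to train."""
    return {m: rng.randrange(len(models)) for m in clients}

def fedbiscuit_round(models, clients, sample_size, k, groups, rng):
    """One communication round: sample clients, train assigned selectors,
    then average each selector's received updates on the server."""
    sampled = rng.sample(clients, sample_size)
    updates = {u: [] for u in range(len(models))}
    for m in sampled:
        updates[groups[m]].append(optim(m, models[groups[m]], k))
    for u, ups in updates.items():
        if ups:  # coordinate-wise average of this selector's updates
            models[u] = [sum(ws) / len(ws) for ws in zip(*ups)]
    return models

rng = random.Random(0)
models = [[0.0, 0.0] for _ in range(3)]   # U = 3 binary selectors (toy weights)
clients = list(range(200))                # 200 clients, as in the QA setting
groups = cluster_grouping(models, clients, rng)
for r in range(5):                        # a few toy rounds (paper: 500)
    models = fedbiscuit_round(models, clients, 10, 30, groups, rng)
```

The warm-up phase and the regrouping interval τ (re-running `cluster_grouping` every τ rounds) are omitted here for brevity.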
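The Dataset Splits row states that the QA data is partitioned across 200 clients using a Dirichlet distribution with parameter 0.3, with no question shared between clients. A hedged sketch of one common way to implement such a split (the function name and toy item count are illustrative; the paper's exact partitioning code may differ):

```python
import numpy as np

def dirichlet_partition(num_items, num_clients=200, alpha=0.3, seed=0):
    """Assign each item (e.g. a Reddit question) to exactly one client,
    so no question overlaps between clients."""
    rng = np.random.default_rng(seed)
    # One Dirichlet draw gives the share of items each client receives.
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    assignments = rng.choice(num_clients, size=num_items, p=proportions)
    return [np.where(assignments == m)[0] for m in range(num_clients)]

parts = dirichlet_partition(num_items=10_000)
sizes = sorted(len(p) for p in parts)
# With alpha = 0.3 the split is highly non-IID: a few clients hold many
# items while many hold few or none.
```

A small α concentrates mass on few clients, which is why α = 0.3 is a standard choice for simulating heterogeneous (non-IID) federated clients.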