Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Federated RLHF with Aggregated Client Preference for LLMs

Authors: Feijie Wu, Xiaoze Liu, Haoyu Wang, Xingchen Wang, Lu Su, Jing Gao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content. [...] We conduct extensive experiments to evaluate the performance of the proposed FedBis and FedBiscuit.
Researcher Affiliation Academia 1Purdue University 2State University of New York at Albany
Pseudocode Yes Algorithm 1 FedBiscuit Input: local learning rate ηl, global learning rate ηs, local updates K, warm-up rounds T for each binary selector, total communication rounds R, client regrouping interval τ, pretrained LLM ϕ. Require: OPTIM(m, ϕ, K) fine-tunes model ϕ with the data of a client m ∈ [M] for K iterations and returns an optimized model. Require: CG(ϕ[U]) assigns each client m ∈ [M] to train one of the models ϕ[U] and returns a list {U_m}_{m∈[M]} indicating that a client m should train the model ϕ_{U_m}.
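The quoted pseudocode can be illustrated with a minimal toy sketch of the selector-training loop: per-round client sampling, local fine-tuning via OPTIM, periodic client regrouping via CG, and per-selector server averaging. All function names and the float-vector "models" standing in for LLM adapters are illustrative assumptions, not the paper's implementation.

```python
import random

def optim(client_data, model, k, lr=0.1):
    """Toy stand-in for OPTIM(m, phi, K): K local steps pulling the
    model toward the client's data (here, simple targets in [0, 1])."""
    model = list(model)
    for _ in range(k):
        for i, target in enumerate(client_data):
            model[i] += lr * (target - model[i])
    return model

def client_grouping(num_clients, num_selectors, rng):
    """Toy stand-in for CG(phi[U]): assign each client m to one of
    the U binary selectors, returning the list {U_m}."""
    return [rng.randrange(num_selectors) for _ in range(num_clients)]

def fed_biscuit(client_data, num_selectors=3, rounds=20, k=5,
                sample_size=5, regroup_every=10, seed=0):
    rng = random.Random(seed)
    dim = len(client_data[0])
    selectors = [[0.0] * dim for _ in range(num_selectors)]
    assignment = client_grouping(len(client_data), num_selectors, rng)
    for r in range(rounds):
        # Regroup clients every tau rounds (regrouping interval).
        if r > 0 and r % regroup_every == 0:
            assignment = client_grouping(len(client_data), num_selectors, rng)
        sampled = rng.sample(range(len(client_data)), sample_size)
        updates = {u: [] for u in range(num_selectors)}
        for m in sampled:
            u = assignment[m]
            updates[u].append(optim(client_data[m], selectors[u], k))
        # Server-side coordinate-wise averaging, per selector.
        for u, local_models in updates.items():
            if local_models:
                selectors[u] = [sum(vals) / len(vals)
                                for vals in zip(*local_models)]
    return selectors
```

The multi-selector structure mirrors the algorithm's distinction between FedBis (a single binary selector) and FedBiscuit (an ensemble of U selectors with periodic regrouping); setting `num_selectors=1` recovers the single-selector case.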
Open Source Code Yes Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content. https://github.com/HarliWu/FedBiscuit
Open Datasets Yes In this section, we describe the preparation of federated human preference datasets, while the next section presents the experimental setup and quantitative analysis. We explore two open-ended text generation tasks, i.e., summarization and question-answering, based on publicly available datasets. Summarization. Stiennon et al. (2020) introduces a summarization dataset that consists of Reddit posts with human-written TL;DR (Völske et al., 2017). Question-Answering (QA). We reconstruct the public dataset SHP, which comprises numerous questions from Reddit posts and their corresponding user answers.
Dataset Splits Yes Summarization. ... 60% of data are reserved for supervised fine-tuning (SFT). The remaining 40% are used for the RLHF process to improve LLM performance and generate human-preferred content. ... We use a test dataset consisting of 6,553 samples, all sourced from the TL;DR dataset and excluded from the training data. Question-Answering (QA). ... we partition the dataset using a Dirichlet distribution with a parameter of 0.3, ensuring that no questions overlap between clients. In our experiment, we consider training the binary selector with 200 clients, which is a common setting when evaluating the performance of an FL algorithm (Jhunjhunwala et al., 2023). ... For the RLHF process, we incorporate 2.6K Reddit questions and 44.6K Safe RLHF prompts.
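The quoted split procedure (a Dirichlet partition with parameter 0.3 over 200 clients, with no question overlap between clients) can be sketched as follows. The function name and the contiguous-slice construction are assumptions; only the Dirichlet parameters and the disjointness requirement come from the quoted text.

```python
import numpy as np

def dirichlet_partition(item_ids, num_clients=200, alpha=0.3, seed=0):
    """Split item_ids into disjoint per-client subsets whose sizes
    follow Dir(alpha) proportions, so no item is shared by two clients."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(item_ids)
    proportions = rng.dirichlet([alpha] * num_clients)
    counts = np.floor(proportions * len(ids)).astype(int)
    counts[-1] = len(ids) - counts[:-1].sum()  # absorb rounding remainder
    splits = np.split(ids, np.cumsum(counts)[:-1])
    return [s.tolist() for s in splits]
```

With a small alpha such as 0.3 the Dirichlet draw is highly skewed, so client dataset sizes are strongly heterogeneous, which is the usual way to simulate non-IID federated data.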
Hardware Specification Yes The experiments are conducted on machines with one Nvidia A100 GPU card, Intel Xeon Platinum 8369B CPUs, and 256GB RAM.
Software Dependencies No The paper mentions 'Our implementation is built upon FederatedScope (Xie et al., 2023; Kuang et al., 2023).', 'AdamW (Loshchilov & Hutter, 2017)', and 'fine-tune all models using LoRA'. However, specific version numbers for FederatedScope, any other libraries, or programming languages are not provided.
Experiment Setup Yes In our experiments, we train the binary selector for 500 communication rounds. In each round, we sample 5 clients for the summarization task and 10 for the QA task, and the selected clients fine-tune the binary selector locally for 30 iterations. As for FedBiscuit, the warm-up phase takes 50 communication rounds for each adapter, which is counted as part of the 500 communication rounds. After the training of binary selectors, we fine-tune the LLM for three epochs, and we store the checkpoint when finishing one epoch of training. ... Table 3: Hyperparameter Settings for the Summarization Task ... Table 4: Hyperparameter Settings for the QA Task
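The quoted schedule can be made concrete with a small sketch: 500 selector rounds in which a 50-round warm-up per adapter is counted, followed by three LLM fine-tuning epochs with a checkpoint after each. The round and epoch counts come from the quoted setup; the number of adapters and the phase layout are illustrative assumptions.

```python
def build_schedule(total_rounds=500, num_adapters=3, warmup_per_adapter=50,
                   clients_per_round=5, rlhf_epochs=3):
    """Lay out (phase, adapter, num_rounds, clients_per_round) tuples plus
    the checkpoint names saved after each LLM fine-tuning epoch."""
    warmup = num_adapters * warmup_per_adapter
    # The warm-up is counted as part of the total communication rounds.
    assert warmup <= total_rounds, "warm-up must fit within total rounds"
    phases = [("warmup", a, warmup_per_adapter, clients_per_round)
              for a in range(num_adapters)]
    phases.append(("joint", None, total_rounds - warmup, clients_per_round))
    checkpoints = [f"epoch-{e + 1}" for e in range(rlhf_epochs)]
    return phases, checkpoints
```

For the QA task the same schedule applies with `clients_per_round=10`, per the quoted setup.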