Improving LoRA in Privacy-preserving Federated Learning

Authors: Youbang Sun, Zitao Li, Yaliang Li, Bolin Ding

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments demonstrate that FFA-LoRA provides more consistent performance with better computational efficiency over vanilla LoRA in various FL tasks. In this section, we evaluate and compare the performance of FFA-LoRA with LoRA on two LMs, RoBERTa (Liu et al., 2019) and LLaMA (Touvron et al., 2023). (See the FFA-LoRA sketch after this table.)
Researcher Affiliation Collaboration Youbang Sun, Dept. of Mechanical & Industrial Engineering, Northeastern University, {sun.youb}@northeastern.edu; Zitao Li, Yaliang Li & Bolin Ding, Alibaba Group, {zitao.l, yaliang.li, bolin.ding}@alibaba-inc.com
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository.
Open Datasets Yes We first evaluate the language understanding tasks from the GLUE benchmark (Wang et al., 2018), including MNLI, SST2, QNLI and QQP, using the RoBERTa model. For language generation tasks, we use the LLaMA model with experiment settings provided by (Kuang et al., 2023) as benchmark and use the GSM-8K dataset for evaluation. We use the pre-trained vision transformer (Dosovitskiy et al., 2020) and consider the task of fine-tuning on the Food-101 dataset (Bossard et al., 2014). (See the data-loading sketch after this table.)
Dataset Splits Yes Data are randomly split among all sampled clients to fit certain proportions, ensuring strong data heterogeneity. For the heterogeneous setting, we split data based on their labels: we use [0.1, 0.9], [0.9, 0.1], [0.5, 0.5] splits for binary classification tasks and [0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9] for three-class classification tasks. (See the splitting sketch after this table.)
Hardware Specification Yes All experiments were run using NVIDIA Tesla A100 GPUs with half-precision enabled for efficiency.
Software Dependencies No We use the privacy accountant from Opacus (Yousefpour et al., 2021) to calculate the noise scale σ for all our experiments. (See the accountant sketch after this table.)
Experiment Setup Yes In order to make a fair comparison, we keep the batch size B = 200, the total number of communication rounds at 1000, and the local update steps at 10, the same across all experiments. All experiments use the same SGD (DP-SGD for the experiments with privacy guarantees) optimizer, and all transformer-related hyperparameters, such as the sequence length l_seq = 128, are kept consistent with previous studies (Hu et al., 2021). The classification head of the LM is frozen after initialization, and we add adapters to both the attention layers and the feed-forward layers and choose a scaling factor α = 8 for LoRA. The same scaling factor α is applied to FFA-LoRA for the sake of consistency, although it is not needed as stated in Section 4. We report the best result from a set of experiments run with learning rate η ∈ {0.01, 0.02, 0.05, 0.1} for LoRA and η ∈ {0.1, 0.2, 0.5, 1} for FFA-LoRA. The batch size and total number of update steps are kept the same across different tasks. We fix the rank r = 8 for both algorithms. In terms of privacy parameters, we use δ = 1e-5 and three different choices of privacy budget ϵ ∈ {6, 3, 1}. The optimal clipping threshold is determined from a grid search over C ∈ {2, 5, 10}. (See the hyperparameter-grid sketch after this table.)
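The sketches below expand on the rows above; none of them are the authors' code. First, for the Research Type row: a minimal sketch, using the Hugging Face peft library, of the FFA-LoRA setup compared against vanilla LoRA. It assumes FFA-LoRA's freeze-A behaviour (only the B matrices are trained); the base checkpoint, target module names, and the "lora_A" parameter filter are peft-specific assumptions.

```python
# Minimal FFA-LoRA sketch (not the authors' implementation): attach standard LoRA
# adapters to a RoBERTa classifier, then freeze every A matrix so that only the
# B matrices remain trainable (and would be communicated in federated rounds).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

lora_cfg = LoraConfig(
    r=8,                                                # rank r = 8, as in the setup row
    lora_alpha=8,                                       # scaling factor alpha = 8
    target_modules=["query", "key", "value", "dense"],  # attention + feed-forward layers
    lora_dropout=0.0,
)
model = get_peft_model(base, lora_cfg)

# Keep the classification head frozen (as stated in the setup) and, for FFA-LoRA,
# freeze the LoRA A matrices as well; "lora_A" is the parameter name used by peft.
for name, param in model.named_parameters():
    if "classifier" in name or "lora_A" in name:
        param.requires_grad = False
```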
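For the Open Datasets row, a short sketch of pulling the named public datasets through the Hugging Face datasets library, assuming the standard hub identifiers (the excerpt does not say how the data was obtained).

```python
# Loading the public datasets named above with Hugging Face `datasets`; the hub
# identifiers are standard names and an assumption about how the data was accessed.
from datasets import load_dataset

glue = {task: load_dataset("glue", task) for task in ["mnli", "sst2", "qnli", "qqp"]}
gsm8k = load_dataset("gsm8k", "main")   # generation / math-reasoning evaluation with LLaMA
food101 = load_dataset("food101")       # image classification for the ViT experiments
```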
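For the Dataset Splits row, a minimal sketch of a label-based heterogeneous split: each client draws a fixed fraction of its examples from each class, matching proportions such as [0.1, 0.9], [0.9, 0.1], [0.5, 0.5]. The function name and the per-client sample count are illustrative assumptions.

```python
# Illustrative label-based heterogeneous split (not the authors' code): each client
# receives a fixed fraction of its samples from each class, e.g. [0.1, 0.9] means
# 10% of that client's data comes from class 0 and 90% from class 1.
import numpy as np

def split_by_label(labels, proportions, samples_per_client, seed=0):
    """Return one index array per client, drawn according to `proportions`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Shuffled pool of example indices for every class.
    pools = {int(c): rng.permutation(np.flatnonzero(labels == c)).tolist()
             for c in np.unique(labels)}
    clients = []
    for client_props in proportions:
        idx = []
        for cls, frac in enumerate(client_props):
            take = int(round(frac * samples_per_client))
            idx.extend(pools[cls][:take])
            pools[cls] = pools[cls][take:]
        clients.append(np.array(idx))
    return clients

# Example: three clients on a binary task with the proportions quoted in the row above;
# samples_per_client = 200 is an illustrative value, not taken from the paper.
labels = np.random.randint(0, 2, size=2000)
splits = split_by_label(labels, [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]], samples_per_client=200)
print([len(s) for s in splits])
```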
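For the Software Dependencies row, a minimal sketch of using the Opacus accountant to compute the noise scale σ from a target (ϵ, δ), assuming Opacus ≥ 1.0; the sample rate and step count are illustrative assumptions.

```python
# Using the Opacus accountant to back out the DP-SGD noise scale sigma for a target
# (epsilon, delta); the sample rate, dataset size, and step count below are
# illustrative assumptions, not values reported in the excerpt.
from opacus.accountants.utils import get_noise_multiplier

sigma = get_noise_multiplier(
    target_epsilon=6.0,         # one of the budgets epsilon in {6, 3, 1}
    target_delta=1e-5,          # delta = 1e-5, as in the setup row
    sample_rate=200 / 50_000,   # batch size B = 200 over an assumed dataset size
    steps=1000 * 10,            # 1000 rounds x 10 local steps (assumed step count)
    accountant="rdp",
)
print(f"noise multiplier sigma = {sigma:.3f}")
```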
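Finally, for the Experiment Setup row, a sketch that simply enumerates the reported hyperparameter search space; it is a bookkeeping aid, not the authors' training code.

```python
# Enumerating the search space implied by the setup row (values copied from the text);
# the actual federated training entry point is left out here.
from itertools import product

common = dict(batch_size=200, rounds=1000, local_steps=10, seq_len=128,
              lora_rank=8, lora_alpha=8, delta=1e-5)
lr_grid = {"LoRA": [0.01, 0.02, 0.05, 0.1], "FFA-LoRA": [0.1, 0.2, 0.5, 1.0]}
epsilons = [6, 3, 1]
clip_thresholds = [2, 5, 10]

configs = [
    {**common, "method": method, "lr": lr, "epsilon": eps, "clip": clip}
    for method, lrs in lr_grid.items()
    for lr, eps, clip in product(lrs, epsilons, clip_thresholds)
]
print(f"{len(configs)} candidate runs per task")  # 2 x 4 x 3 x 3 = 72
```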